Python method to return a textual similarity score for two hadith units

As we continue enriching our data, we need to be able to reliably match the text of hadith units from different sources. At its core this is a string matching task, and something as simple as an edit distance (for example, Levenshtein distance) will work. However, because the strings of interest are digitized ahadith, there are a few complications that the method needs to account for. We are NOT looking for advanced document similarity methods that measure overlap of semantic content using embeddings in vector spaces; this is a purely textual match. Here are some requirements for the method; we may add more as the use cases become clearer:

- The ability to specify whether to include the tashkil/diacritics in the similarity computation
- For words that don't match exactly, compare their roots; a root match should contribute a slightly lower similarity score than an exact match
- Ignore spacing and punctuation differences
- Strip out HTML tags
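
One possible shape for such a method is sketched below, using only the standard library. It is a starting point under several assumptions, not a settled design: the names `normalize`, `similarity`, `keep_tashkil`, `root_of` and `root_credit` are hypothetical, the tatweel character is always removed, "ignore spacing" is interpreted as collapsing whitespace before splitting into words, and root extraction is left to a caller-supplied function because it needs a morphological analyzer that is not chosen here. The score is a word-level Levenshtein distance in which substituting a word with one of the same root costs less than a full mismatch.

```python
import html
import re
import unicodedata

# Arabic tashkil: fathatan..sukun (U+064B-U+0652) and superscript alef (U+0670).
TASHKIL = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = re.compile(r"\u0640")   # elongation character, purely presentational
HTML_TAG = re.compile(r"<[^>]+>")


def normalize(text, keep_tashkil=False):
    """Return the list of words left after stripping tags, punctuation and extra spacing."""
    text = HTML_TAG.sub(" ", text)
    text = html.unescape(text)            # e.g. &nbsp; becomes a real space character
    text = TATWEEL.sub("", text)
    if not keep_tashkil:
        text = TASHKIL.sub("", text)
    # Unicode categories starting with P (punctuation), S (symbol) or Z (separator)
    # are turned into plain spaces; this covers Arabic punctuation such as U+060C.
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in "PSZ" else ch for ch in text
    )
    return cleaned.split()


def similarity(a, b, keep_tashkil=False, root_of=None, root_credit=0.8):
    """Score in [0, 1]: 1.0 means the normalized word sequences are identical.

    root_of is an optional callable mapping a word to its root (or None if unknown).
    root_credit is the partial credit given when two different words share a root.
    """
    x = normalize(a, keep_tashkil)
    y = normalize(b, keep_tashkil)
    if not x and not y:
        return 1.0

    def sub_cost(u, v):
        if u == v:
            return 0.0
        if root_of is not None:
            root_u = root_of(u)
            if root_u is not None and root_u == root_of(v):
                return 1.0 - root_credit
        return 1.0

    # Standard two-row Levenshtein over words, with a weighted substitution cost.
    prev = [float(j) for j in range(len(y) + 1)]
    for i, u in enumerate(x, start=1):
        cur = [float(i)]
        for j, v in enumerate(y, start=1):
            cur.append(min(prev[j] + 1.0, cur[j - 1] + 1.0, prev[j - 1] + sub_cost(u, v)))
        prev = cur
    return 1.0 - prev[-1] / max(len(x), len(y))
```

Working at the word level keeps the root comparison simple, at the cost of treating a one-letter typo as a full word mismatch; a character-level distance inside `sub_cost` would be one way to soften that if it turns out to matter.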

These methods will then also need to be extended for different data sources that have their own annotations and hooks in the text.
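
One way to leave room for that extension is a small registry of per-source cleaning functions that run before any shared normalization. The sketch below is only illustrative: the source name `example_source`, the bracketed-number annotation format, and the names `SOURCE_CLEANERS`, `register_cleaner` and `clean_for_source` are all hypothetical, since the actual sources and their annotation styles are not specified here.

```python
import re
from typing import Callable, Dict

# Maps a source name to a function that strips that source's own annotations.
SOURCE_CLEANERS: Dict[str, Callable[[str], str]] = {}


def register_cleaner(source: str):
    """Decorator that registers a cleaning function under a source name."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        SOURCE_CLEANERS[source] = fn
        return fn
    return decorator


@register_cleaner("example_source")
def strip_bracketed_numbers(text: str) -> str:
    # Hypothetical annotation style: inline markers such as [12] embedded in the text.
    return re.sub(r"\[\d+\]", "", text)


def clean_for_source(text: str, source: str) -> str:
    """Apply the registered cleaner for this source; unknown sources pass through unchanged."""
    cleaner = SOURCE_CLEANERS.get(source)
    return cleaner(text) if cleaner is not None else text
```

Keeping the source-specific logic in separate registered functions means the core similarity method does not need to change each time a new data source is added.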
