Description
As we continue enriching our data, we need to be able to reliably match the text of hadith units from different sources. At its core, this is a simple string matching task, and something as simple as an edit distance (e.g. Levenshtein distance) will work. However, because the strings of interest are digitized ahadith, a few complications arise, and the method needs to account for them. We are NOT looking for advanced document similarity methods that measure overlap of semantic content using embeddings in vector spaces; this is a purely textual match. Here are some requirements for the method; we may add more as the use cases become clearer:
- The ability to specify whether we want to include the tashkil/diacritics in the similarity computation or not
- For words that don't match exactly, compare their roots; a root-only match should count as a partial match with a slightly lower similarity score
- Ignore spacing and punctuation differences
- Strip out HTML tags
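
As a starting point for discussion, here is a minimal sketch of how the requirements above could fit together, using only the Python standard library. All function and parameter names (`normalize`, `char_similarity`, `word_similarity`, `keep_tashkil`, `root_of`, `root_penalty`) are hypothetical, not an existing API in this repo. It assumes that tashkil can be approximated by Unicode nonspacing marks (category `Mn`), that a simple regex is good enough for stripping HTML tags, and that root extraction is supplied by the caller as a callable, since no particular stemmer is chosen here.

```python
import re
import unicodedata
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")  # naive tag matcher; assumes reasonably clean HTML


def normalize(text, keep_tashkil=False):
    """Return a list of word tokens with HTML, punctuation and spacing removed."""
    text = unescape(TAG_RE.sub(" ", text))      # strip tags, decode entities
    text = unicodedata.normalize("NFC", text)   # consistent code point forms
    chars = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("P"):                 # any punctuation, incl. Arabic comma
            chars.append(" ")
        elif cat == "Mn" and not keep_tashkil:  # assumption: Mn approximates tashkil
            continue
        else:
            chars.append(ch)
    return "".join(chars).split()


def edit_distance(a, b, sub_cost=None):
    """Levenshtein distance over two sequences with a pluggable substitution cost."""
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [float(i)]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                     # deletion
                           cur[j - 1] + 1,                  # insertion
                           prev[j - 1] + sub_cost(x, y)))   # substitution
        prev = cur
    return prev[-1]


def _to_similarity(dist, len_a, len_b):
    longest = max(len_a, len_b)
    return 1.0 if longest == 0 else 1.0 - dist / longest


def char_similarity(a, b, keep_tashkil=False):
    """Character-level similarity that ignores spacing entirely."""
    sa = "".join(normalize(a, keep_tashkil))
    sb = "".join(normalize(b, keep_tashkil))
    return _to_similarity(edit_distance(sa, sb), len(sa), len(sb))


def word_similarity(a, b, keep_tashkil=False, root_of=None, root_penalty=0.5):
    """Word-level similarity; words sharing a root cost root_penalty, not 1."""
    ta, tb = normalize(a, keep_tashkil), normalize(b, keep_tashkil)

    def cost(x, y):
        if x == y:
            return 0.0
        if root_of is not None:
            rx, ry = root_of(x), root_of(y)
            if rx is not None and rx == ry:
                return root_penalty
        return 1.0

    return _to_similarity(edit_distance(ta, tb, cost), len(ta), len(tb))
```

A possible usage, with a toy dictionary standing in for a real root extractor:

```python
a = "إِنَّمَا الأَعْمَالُ بِالنِّيَّاتِ"
b = "<p>إنما الأعمال، بالنيات</p>"
char_similarity(a, b)                     # diacritics, HTML and punctuation ignored
char_similarity(a, b, keep_tashkil=True)  # diacritics now count as differences

toy_roots = {"كتب": "كتب", "كاتب": "كتب"}
word_similarity("قال كتب", "قال كاتب", root_of=toy_roots.get)
```

One design note: the word-level variant is sensitive to a word being split or merged by stray spaces, while the character-level variant is not, so the two may need to be combined depending on the use case.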
These methods will then also need to be extended to handle different data sources that have their own annotations and hooks in the text.