Python method to return a textual similarity score for two hadith units #1

@ahadith

Description

As we continue enriching our data, we need to reliably match the text of hadith units from different sources. At its core this is a string matching task, and something very simple like an edit distance (for example, Levenshtein distance) will work. However, because the strings of interest are digitized ahadith, a few complications arise, and the method needs to handle them. We are NOT looking for advanced document similarity methods that measure overlap of semantic content using embeddings in vector spaces; this is a purely textual match. Here are some requirements for the method; we may add more as the use cases become clearer:

  • Let the caller specify whether tashkil (diacritics) are included in the similarity computation
  • For words that don't match exactly, compare their roots; a root-only match should contribute a slightly lower similarity score than an exact match
  • Ignore spacing and punctuation differences
  • Strip out HTML tags
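
A minimal sketch of how the requirements above could fit together, using only the standard library. All names here (`tokens`, `similarity`, `root_of`, `root_credit`) are illustrative and not taken from any existing codebase. `difflib.SequenceMatcher` stands in for a true Levenshtein distance, the tashkil range is an assumption (U+064B to U+0652 plus the dagger alef U+0670), and root extraction is left as a caller-supplied function because it needs an Arabic morphological tool that this sketch does not assume.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from html import unescape

_TAG_RE = re.compile(r"<[^>]+>")
# Assumed tashkil set: fathatan..sukun (U+064B-U+0652) and dagger alef (U+0670).
_TASHKIL_RE = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # elongation character, always dropped in this sketch


def tokens(text, keep_tashkil=False):
    """Strip HTML tags and punctuation, optionally strip tashkil, split into words."""
    text = unescape(_TAG_RE.sub(" ", text))
    if not keep_tashkil:
        text = _TASHKIL_RE.sub("", text)
    text = text.replace(_TATWEEL, "")
    # Keep letters, combining marks and digits; everything else becomes a space.
    kept = (c if unicodedata.category(c)[0] in "LMN" else " " for c in text)
    return "".join(kept).split()


def similarity(a, b, keep_tashkil=False, root_of=None, root_credit=0.8):
    """Return a score in [0, 1]; 1.0 means identical after normalization.

    root_of is a hypothetical callable mapping a word to its root. When it is
    given, words are aligned and a root-only match earns root_credit.
    """
    wa, wb = tokens(a, keep_tashkil), tokens(b, keep_tashkil)
    if not wa and not wb:
        return 1.0
    if root_of is None:
        # Character-level comparison with all spacing removed.
        return SequenceMatcher(None, "".join(wa), "".join(wb), autojunk=False).ratio()

    # Word-level alignment: exact matches count 1, root-only matches count less.
    matcher = SequenceMatcher(None, wa, wb, autojunk=False)
    credit = 0.0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            credit += i2 - i1
        elif tag == "replace":
            for x, y in zip(wa[i1:i2], wb[j1:j2]):
                if root_of(x) == root_of(y):
                    credit += root_credit
    return 2 * credit / (len(wa) + len(wb))
```

With `root_of` left unset, the score is character-based and insensitive to where the spaces fall. Supplying `root_of` switches to word alignment, which respects word boundaries, so the two modes are not directly comparable and a real implementation would likely need to pick one or blend them.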

These methods will then also need to be extended to handle different data sources that embed their own annotations and hooks in the text.
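
One possible way to support source-specific cleanup is a small registry of per-source cleaning functions that run before the generic comparison. This is only a sketch: the source names (`source_a`, `source_b`) and the annotation styles they strip (numeric footnote markers, double-curly-brace hooks) are hypothetical placeholders, since the actual sources and their markup are not specified here.

```python
import re
from typing import Callable, Dict

Cleaner = Callable[[str], str]
_CLEANERS: Dict[str, Cleaner] = {}


def register_cleaner(source: str) -> Callable[[Cleaner], Cleaner]:
    """Decorator that registers a cleaning function under a source name."""
    def decorator(fn: Cleaner) -> Cleaner:
        _CLEANERS[source] = fn
        return fn
    return decorator


@register_cleaner("source_a")  # hypothetical source name
def _strip_footnote_markers(text: str) -> str:
    # Assumed annotation style: numeric footnote markers such as [12].
    return re.sub(r"\[\d+\]", " ", text)


@register_cleaner("source_b")  # hypothetical source name
def _strip_curly_hooks(text: str) -> str:
    # Assumed annotation style: hooks wrapped in double curly braces.
    return re.sub(r"\{\{.*?\}\}", " ", text)


def clean_for_source(text: str, source: str) -> str:
    """Apply the cleaner registered for `source`; unknown sources pass through."""
    cleaner = _CLEANERS.get(source)
    return cleaner(text) if cleaner else text
```

Each text would be passed through `clean_for_source` before scoring, so adding a new source means adding one decorated function rather than changing the comparison itself.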
