MetaMap API breaks when special characters (e.g. 'ß') occur in a word #8

@KimBenjaminTang

Hello, I am trying to have MetaMap process some translated German texts that include words containing the letter 'ß'.

After analyzing why the JSON output breaks, I found that the character 'ß' seems to cause the error when it occurs inside a word (but not as a standalone character).

Example request:

from skr_web_api import Submission, METAMAP_INTERACTIVE_URL

# Placeholders: substitute your own account email and API key.
email = "user@example.org"
apikey = "YOUR_API_KEY"

args = "-AI -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase -Z 2022AA"
inst = Submission(email, apikey)
inst.init_mm_interactive('This is a test with Straße', args=args)
response = inst.submit()

When I decode the response content via response.content.decode(), I get a broken JSON string (it is never closed and appears to be cut off):

/dmzfiler/II_Group/MetaMap2020/public_mm/bin/SKRrun.20 /dmzfiler/II_Group/MetaMap2020/public_mm/bin/metamap20.BINARY.Linux --lexicon db -Z 2022AA --silent -AI -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase
{"AllDocuments":[
{
   "Document": {
     "CmdLine": {
       "Command": "metamap --lexicon db -Z 2022AA --silent -AI -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase",
       "Options": [
         {
           "OptName": "lexicon",
           "OptValue": "db"
         },
         {
           "OptName": "mm_data_year",
           "OptValue": "2022AA"
         },
         {
           "OptName": "silent"
         },
         {
           "OptName": "strict_model"
         },
         {
           "OptName": "show_cuis"
         },
         {
           "OptName": "restrict_to_sources",
           "OptValue": ["SNOMEDCT_US_2022_03_01"]
         },
         {
           "OptName": "JSONf",
           "OptValue": "2"
         },
         {
           "OptName": "mm_data_version",
           "OptValue": "USAbase"
         },
         {
           "OptName": "infile",
           "OptValue": "user_input"
         },
         {
           "OptName": "outfile",
           "OptValue": "user_output"
         }]
     },
     "AAs": [],
     "Negations": [],
     "Utterances": [
       {
         "PMID": "USER",
         "UttSection": "tx",
         "UttNum": "1",
         "UttText": [
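
A minimal sketch of how the truncation can be confirmed programmatically. It assumes that everything before the first '{' in the decoded body (such as the echoed command line) is not part of the JSON document; the helper name parse_metamap_json and the shortened sample string are hypothetical and only for illustration:

```python
import json


def parse_metamap_json(body: str):
    """Return (parsed, error) for a decoded response body.

    Assumption: any text before the first '{' (e.g. the echoed
    command line) is not part of the JSON document.
    """
    start = body.find("{")
    if start == -1:
        return None, "no JSON object found"
    try:
        return json.loads(body[start:]), None
    except json.JSONDecodeError as exc:
        # exc.pos is the offset (within the sliced string) where parsing failed
        return None, f"invalid JSON at char {exc.pos}: {exc.msg}"


# Shortened stand-in for the cut-off output shown above (not real server output)
truncated = (
    "metamap20.BINARY.Linux --JSONf 2\n"
    '{"AllDocuments":[\n{"Document": {"Utterances": [{"UttText": ['
)
parsed, error = parse_metamap_json(truncated)
print(parsed is None, error)
```

With the real response, this would be called as parse_metamap_json(response.content.decode()); a non-None error indicates that the body is not a complete JSON document.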

A partial workaround would be to replace 'ß' with 'ss', but I am not sure whether the results would match those of the online version of MetaMap, which has no problem with words containing 'ß':

Request:

User Information: fu-sung.kim-benjamin.tang@rwth-aachen.de
Run Time: 12/06/2022 06:12:29

MetaMap Version Used: metamap20
MetaMap Options: -A+ -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase
Knowledge Source Used: 2022AA

Input Text:

This is a test with Straße


Output:

{
   "Document": {
     "CmdLine": {
       "Command": "metamap --lexicon db -Z 2022AA -A+ -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase /usr/local/apache/htdocs/II/Scheduler/foo/inter_12062022_06:12:29_95743_fu-sung.kim-benjamin.tang@rwth-aachen.de_124752701.tmp /usr/local/apache/htdocs/II/Scheduler/foo/inter_12062022_06:12:29_95743_fu-sung.kim-benjamin.tang@rwth-aachen.de_124752701.out",
       "Options": [
         {
           "OptName": "lexicon",
           "OptValue": "db"
         },
         {
           "OptName": "mm_data_year",
           "OptValue": "2022AA"
         },
         {
           "OptName": "strict_model"
         },
         {
           "OptName": "bracketed_output"
         },
         {
           "OptName": "restrict_to_sources",
           "OptValue": ["SNOMEDCT_US_2022_03_01"]
         },
         {
           "OptName": "JSONf",
           "OptValue": "2"
         },
         {
           "OptName": "mm_data_version",
           "OptValue": "USAbase"
         },
         {
           "OptName": "infile",
           "OptValue": "/usr/local/apache/htdocs/II/Scheduler/foo/inter_12062022_06:12:29_95743_fu-sung.kim-benjamin.tang@rwth-aachen.de_124752701.tmp"
         },
         {
           "OptName": "outfile",
           "OptValue": "/usr/local/apache/htdocs/II/Scheduler/foo/inter_12062022_06:12:29_95743_fu-sung.kim-benjamin.tang@rwth-aachen.de_124752701.out"
         }]
     },
     "AAs": [],
     "Negations": [],
     "Utterances": [
       {
         "PMID": "inter_12062022_06:12:29_95743_fu-sung.kim-benjamin.tang@rwth-aachen.de_124752701.tmp",
         "UttSection": "tx",
         "UttNum": "1",
         "UttText": "This is a test with Straße",
         "UttStartPos": "0",
         "UttLength": "26",
         "Phrases": [
           {
             "PhraseText": "This",
             "SyntaxUnits": [
               {
                 "SyntaxType": "pron",
                 "LexMatch": "this",
                 "InputMatch": "This",
                 "LexCat": "pron",
                 "Tokens": ["this"]
               }],
             "PhraseStartPos": "0",
             "PhraseLength": "4",
             "Candidates": [],
             "Mappings": []
           },
           {
             "PhraseText": "is",
             "SyntaxUnits": [
               {
                 "SyntaxType": "aux",
                 "LexMatch": "is",
                 "InputMatch": "is",
                 "LexCat": "aux",
                 "Tokens": ["is"]
               }],
             "PhraseStartPos": "5",
             "PhraseLength": "2",
             "Candidates": [],
             "Mappings": []
           },
           {
             "PhraseText": "a test with Straße",
             "SyntaxUnits": [
               {
                 "SyntaxType": "det",
                 "LexMatch": "a",
                 "InputMatch": "a",
                 "LexCat": "det",
                 "Tokens": ["a"]
               },
               {
                 "SyntaxType": "head",
                 "LexMatch": "test",
                 "InputMatch": "test",
                 "LexCat": "noun",
                 "Tokens": ["test"]
               },
               {
                 "SyntaxType": "prep",
                 "LexMatch": "with",
                 "InputMatch": "with",
                 "LexCat": "prep",
                 "Tokens": ["with"]
               },
               {
                 "SyntaxType": "mod",
                 "InputMatch": "Straße",
                 "LexCat": "noun",
                 "Tokens": ["straße"]
               }],
             "PhraseStartPos": "8",
             "PhraseLength": "18",
             "Candidates": [],
             "Mappings": [
               {
                 "MappingScore": "-770",
                 "MappingCandidates": [
                   {
                     "CandidateScore": "-770",
                     "CandidateCUI": "C0022885",
                     "CandidateMatched": "Laboratory procedures",
                     "CandidatePreferred": "Laboratory Procedures",
                     "MatchedWords": ["test"],
                     "SemTypes": ["lbpr"],
                     "MatchMaps": [
                       {
                         "TextMatchStart": "2",
                         "TextMatchEnd": "2",
                         "ConcMatchStart": "1",
                         "ConcMatchEnd": "1",
                         "LexVariation": "0"
                       }],
                     "IsHead": "yes",
                     "IsOverMatch": "no",
                     "Sources": ["SNOMEDCT_US"],
                     "ConceptPIs": [
                       {
                         "StartPos": "10",
                         "Length": "4"
                       }],
                     "Status": "0",
                     "Negated": "0"
                   }]
               },
               {
                 "MappingScore": "-770",
                 "MappingCandidates": [
                   {
                     "CandidateScore": "-770",
                     "CandidateCUI": "C0392366",
                     "CandidateMatched": "Tests (qualifier value)",
                     "CandidatePreferred": "Tests (qualifier value)",
                     "MatchedWords": ["test"],
                     "SemTypes": ["inpr"],
                     "MatchMaps": [
                       {
                         "TextMatchStart": "2",
                         "TextMatchEnd": "2",
                         "ConcMatchStart": "1",
                         "ConcMatchEnd": "1",
                         "LexVariation": "0"
                       }],
                     "IsHead": "yes",
                     "IsOverMatch": "no",
                     "Sources": ["SNOMEDCT_US"],
                     "ConceptPIs": [
                       {
                         "StartPos": "10",
                         "Length": "4"
                       }],
                     "Status": "0",
                     "Negated": "0"
                   }]
               },
               {
                 "MappingScore": "-770",
                 "MappingCandidates": [
                   {
                     "CandidateScore": "-770",
                     "CandidateCUI": "C0456984",
                     "CandidateMatched": "Test finding",
                     "CandidatePreferred": "Test Result",
                     "MatchedWords": ["test"],
                     "SemTypes": ["lbtr"],
                     "MatchMaps": [
                       {
                         "TextMatchStart": "2",
                         "TextMatchEnd": "2",
                         "ConcMatchStart": "1",
                         "ConcMatchEnd": "1",
                         "LexVariation": "0"
                       }],
                     "IsHead": "yes",
                     "IsOverMatch": "no",
                     "Sources": ["SNOMEDCT_US"],
                     "ConceptPIs": [
                       {
                         "StartPos": "10",
                         "Length": "4"
                       }],
                     "Status": "0",
                     "Negated": "0"
                   }]
               }]
           }]
       }]
   }
 }
]}

Can this be fixed by adjusting the MetaMap API to match the behavior of the MetaMap online version?
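
For reference, a minimal sketch of the 'ß' to 'ss' workaround mentioned above. The helper name replace_eszett is hypothetical, and whether the API then returns the same mappings as the online version is untested:

```python
def replace_eszett(text: str) -> str:
    """Replace 'ß' with 'ss' and the capital form 'ẞ' with 'SS'."""
    return text.replace("ß", "ss").replace("ẞ", "SS")


original = "This is a test with Straße"
cleaned = replace_eszett(original)
print(cleaned)                      # This is a test with Strasse
print(len(original), len(cleaned))  # 26 27
```

The cleaned string would then be passed to inst.init_mm_interactive(cleaned, args=args). Note that each replacement lengthens the text by one character, so position fields such as UttLength, PhraseStartPos, or StartPos in the output would refer to the cleaned text rather than the original input.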
