
📑 Molecular dynamics named entity recognition 🧬

A Named Entity Recognition model for molecular dynamics data.

MDNER is a NER model developed specifically to extract information from the textual metadata (titles and descriptions) of MD simulation datasets. This project is part of the MDverse project [1].

⚙️ Prerequisites

Hardware

For the GPU code, you need a reasonably recent NVIDIA GPU with at least 8.0 GiB of memory and a CUDA driver installed on your system: the code relies on CUDA, NVIDIA's parallel computing platform, to interact with the GPU. The CPU code has no specific hardware requirements. To set up spaCy for GPU use, see the spaCy documentation.
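Before training, you can check that spaCy and CUDA can actually see the GPU. A minimal check using standard spaCy and PyTorch calls (not part of this repository):

import spacy
import torch

# True if spaCy can allocate the GPU (requires cupy and a working CUDA driver).
print("spaCy GPU available:", spacy.prefer_gpu())

# PyTorch reports the device name and memory, useful to verify the 8 GiB requirement.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB")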

📥 Download sources

Clone the repository and move to the newly created mdner directory:

git clone https://github.com/pierrepo/mdner.git
cd mdner

📦 Setup your environment

Install conda and mamba.

Create the mdner and mdner_app conda environments:

mamba env create -f binder/environment.yml
mamba env create -f binder/app.yml

Note: you can also update conda environments with:

mamba env update -f binder/environment.yml
mamba env update -f binder/app.yml

To deactivate an active conda environment, use:

conda deactivate

✍ Annotate texts

This step consists of selecting the texts (titles and descriptions) that will be used to build our NER model. It generates the JSON files and text files containing the titles and descriptions of the available MD datasets.

Load the mdner conda environment and launch the generation of text files and json files:

conda activate mdner
python3 scripts/generate_annotation.py

➤ Outputs:

[2023-07-07 17:43:56,748] [INFO] 974 texts selected according the threshold length
[2023-07-07 17:43:59,457] [INFO] 284 texts selected according the corpus similarity
[2023-07-07 17:43:59,545] [INFO] Generation completed

This script does not require a GPU.

Parameters

usage: generate_annotation.py [-h] [-c] [-p {mbart,bart-paraphrase,pegasus}] [-s SEED] [threshold]
[cutoff]

Generate text and json files in the annotation folder to be used as training sets.

positional arguments:
  threshold             The threshold for the length of the descriptive texts. By default
                        the value is 594.
  cutoff                Select the descriptive texts where the cosine similarity is below
                        the threshold. By default the value is 0.2.

options:
  -h, --help            show this help message and exit
  -c, --clear           Clear the annotation folder and generate files.
  -p {mbart,bart-paraphrase,pegasus}, --paraphrase {mbart,bart-paraphrase,pegasus}
                        Paraphrase the annotation according to three paraphrasing models.
  -s SEED, --seed SEED  Set the seed for reproducibility in paraphrase. By default the
                        value is 42.
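
To illustrate what the cutoff argument does, here is a minimal sketch of similarity-based filtering, assuming TF-IDF vectors and scikit-learn; the actual selection logic lives in scripts/generate_annotation.py and may differ:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_dissimilar(texts, cutoff=0.2):
    # Vectorize all candidate descriptions with TF-IDF.
    matrix = TfidfVectorizer().fit_transform(texts)
    selected = []
    for i, text in enumerate(texts):
        if not selected:
            selected.append((i, text))
            continue
        # Compare the candidate to every text already kept.
        kept = [j for j, _ in selected]
        similarities = cosine_similarity(matrix[i], matrix[kept])
        # Keep the text only if it is dissimilar enough from the selected corpus.
        if similarities.max() < cutoff:
            selected.append((i, text))
    return [text for _, text in selected]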

The JSON files must be annotated manually and placed in the annotations folder. Load the mdner_app conda environment and use the entity annotator to annotate and edit the JSON files with the following command:

conda activate mdner_app
streamlit run scripts/entity_annotator.py

Other annotation tools exist, such as Prodigy or the web-based NER Annotator: https://tecoholic.github.io/ner-annotator/.

✍ Add paraphrase

If you do not have enough data, you can paraphrase the annotated texts. Paraphrasing keeps the meaning of the original text while reformulating it in another way.

python3 scripts/generate_annotation.py -p mbart

Here you will use the mBART model for paraphrasing.
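The log below downloads the facebook/mbart-large-50-many-to-many-mmt checkpoint, a multilingual translation model, so the paraphrasing can be thought of as round-trip translation. A minimal sketch with Hugging Face transformers, assuming an English -> French -> English round trip; the project's actual pipeline in scripts/generate_annotation.py may differ:

from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

def translate(text, src_lang, tgt_lang):
    # mBART-50 needs the source language set before tokenizing.
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang]
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# Round trip English -> French -> English to obtain a paraphrase of the description.
original = "All-atom simulation of a POPC lipid bilayer at 310 K."
paraphrase = translate(translate(original, "en_XX", "fr_XX"), "fr_XX", "en_XX")
print(paraphrase)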

➤ Output:

[2023-07-21 10:36:56,305] [DEBUG] Starting new HTTPS connection (1): huggingface.co:443
[2023-07-21 10:36:57,272] [DEBUG] https://huggingface.co:443 "HEAD /facebook/mbart-large-50-many-to-many-mmt/resolve/main/config.json HTTP/1.1" 200 0
[2023-07-21 10:37:12,036] [DEBUG] https://huggingface.co:443 "HEAD /facebook/mbart-large-50-many-to-many-mmt/resolve/main/generation_config.json HTTP/1.1" 200 0
[2023-07-21 10:37:12,243] [DEBUG] https://huggingface.co:443 "HEAD /facebook/mbart-large-50-many-to-many-mmt/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
[2023-07-21 10:37:18,072] [INFO] Seed: 42
[2023-07-21 10:37:18,074] [INFO] Paraphrase processing with mbart model: 100%| Files found: 380

The execution time for paraphrasing depends on the model used. With mBART, it takes about 2.5 hours on a GPU. Paraphrasing can also run on CPU only, but it will take considerably longer.

A presentation of the annotation structure can be found in ANNOTATIONS.
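
For orientation, spaCy-style NER annotations pair a text with character-offset entity spans. A purely illustrative record using the entity labels of this project; the exact JSON layout expected by mdner is the one described in ANNOTATIONS:

# Illustrative only: character offsets are [start, end) into the text.
example = {
    "text": "1 microsecond CHARMM36 simulation of POPC run with GROMACS at 310 K.",
    "entities": [
        [0, 13, "STIME"],   # "1 microsecond"
        [14, 22, "FFM"],    # "CHARMM36"
        [37, 41, "MOL"],    # "POPC"
        [51, 58, "SOFT"],   # "GROMACS"
        [62, 67, "TEMP"],   # "310 K"
    ],
}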

📑 Train and evaluate the NER model 📈

Create MDNER

The mdner.py script is used to create the model according to the defined parameters. Using this script requires a GPU. In our test case, we used an NVIDIA GeForce RTX 3060 with 12 GB of VRAM.

Parameters

usage: mdner.py [-h] [-c] [-t d f p r] [-n NAME] [-g] [-p] [-m] [-s SEED]

Create a model for the molecular dynamics data.

options:
  -h, --help            show this help message and exit
  -c, --create          Create a dedicated Named Entity Recognition model for our molecular
                        dynamics data.
  -t d f p r, --train d f p r
                        Hyperparameters for the learning process where d is the percentage
                        of dropout. The f, p and r scores define what SpaCy believes to be
                        the best model after the learning process. Each value must be between
                        0 and 1, and the sum of these three values must be equal to 1.
  -n NAME, --name NAME  Name of the model.
  -g, --gpu             Use GPU for learning. Using this parameter, the model that will be
                        created will be based on the Transformers "BioMed-RoBERTa-base".
  -p, --paraphrase      Add paraphrase in the learning dataset.
  -m, --mol             Use only MOL entities.
  -s SEED, --seed SEED  Seed used to sample data sets for reproducibility. By default the
                        value is 42.

To create the mdner model, the -c, -t and -n options must be used. The -c option tells the script to create a model. The -t option sets the hyperparameters used to train the model. The -n option sets the name of the model that will be created.

The -p option introduces paraphrases only in the learning set (training + test). Creating a model with a GPU (transformer-based) takes more than 1 hour on average, whereas with a CPU (basic spaCy pipeline) it takes around 30 minutes.

Example

conda activate mdner
python3 scripts/mdner.py -c -t 0.1 0.0 1.0 0.0 -n my_model -g -p -s 7522

Here, we define a model with a dropout value of 0.1 (10% of the nodes will be deactivated). The next three values (0.0 1.0 0.0) correspond to the weights given to the f-score, precision and recall metrics: they define what spaCy considers to be the best model. In this example, we prefer the precision score over the recall and f-score. The sum of these three values must be equal to 1.0.

We have also chosen to create a Transformer-based model by using the -g option. If the -g option is not used, the model is trained on the CPU with a basic spaCy pipeline.

The -p option is used to add paraphrases to the learning dataset.

The -s option specifies the seed used to sample the data sets. You should be able to obtain similar results with the same seed.

At the end of the code execution, the best NER model will be evaluated on the validation set. The model will be located in the results/models directory. In this example, the model will be in results/models/my_model.

➤ Output:

[2023-07-21 19:43:34,641] [INFO] Seed: 7522                                                                                                              
[2023-07-21 19:43:34,645] [INFO] Add paraphrase in the learning dataset
[2023-07-21 19:43:34,682] [WARNING] 42 files ignored because there are not many entities
[2023-07-21 19:43:35,058] [INFO] train_data: 100%| Size: 488                                                                                             
[2023-07-21 19:43:36,442] [INFO] test_data: 100%| Size: 122
[2023-07-21 19:43:36,980] [INFO] eval_data: 100%| Size: 34    
[2023-07-21 19:43:37,147] [INFO] Checking GPU availability          
[2023-07-21 19:43:38,710] [INFO] GPU is available     
[...]
================================== Results ==================================

TOK     100.00
NER P   90.60 
NER R   72.29 
NER F   80.42 
SPEED   3078  


=============================== NER (per type) ===============================

            P       R       F
MOL     93.49   70.11   80.13
FFM     88.89   68.09   77.11
SOFT    90.48   80.85   85.39
STIME   72.97   93.10   81.82
TEMP    85.71   66.67   75.00

Here, the training phase took about 2 hours on a GPU.

Results

From the original texts and the paraphrases obtained with the mBART model, we trained two NER models based on the Transformer "BioMed-RoBERTa-base" and evaluated them on the validation set, as shown in Table 1. The models were built with the seed 7522, and 10 replicates were generated for each of the 2 models. These replicates took more than 20 hours to run. The results are available in the file results/outputs/results.csv, and the bash script scripts/build.sh is used to create the different models:

bash scripts/build.sh

Table 1 - Mean precision scores with standard deviation for each entity of the Transformer model based on "BioMed-RoBERTa-base", without and with paraphrase. Each model was generated over 10 replicates. The best precision scores per entity are shown in bold.

Entity                        Precision score (%)
                              Transformers    Transformers + Paraphrase
MOL (molecule)                80 ± 1.9        91 ± 1.3
FFM (force field & model)     75 ± 12.8       94 ± 2.8
TEMP (temperature)            46 ± 17.6       90 ± 0.0
STIME (simulation time)       62 ± 9.7        82 ± 3.6
SOFT (software)               89 ± 7.5        92 ± 3.3
Total                         78 ± 1.9        91 ± 1.0

We note an increase in the precision score, particularly for our key entity, MOL, which rises from 80% to 91%. Performance for the other entities is also improved. The NER models were able to identify molecule names not present in the learning dataset, underlining the ability of the model to generalize and identify the desired entities, and demonstrating the relevance of fine-tuning Transformer models [2].

🚀 Use the NER model

To run an example, you can launch a Streamlit web app that applies the MDNER model to a text and annotates it automatically. Load the mdner_app conda environment and run the Streamlit app, passing the name of the model as an argument, as in the following command:

conda activate mdner_app
streamlit run scripts/mdner_app.py -- --model my_model

Using MDNER does not require a GPU. A GPU can speed up the predictions of the NER model, but it is not mandatory.
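
The trained pipeline can also be loaded directly with spaCy, outside of the Streamlit app. A minimal sketch, assuming the model directory contains a loadable spaCy pipeline (model-best is spaCy's usual training output; the exact path under results/models/my_model may differ):

import spacy

# Load the trained pipeline; adjust the path to where your model was written.
nlp = spacy.load("results/models/my_model/model-best")

doc = nlp("We simulated a POPC bilayer with GROMACS for 500 ns at 310 K.")

# Print every entity found, with its label (MOL, FFM, SOFT, STIME, TEMP).
for ent in doc.ents:
    print(ent.text, ent.label_)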

References

[1] Tiemann JKS, Szczuka M, Bouarroudj L, Oussaren M, Garcia S, Howard RJ, Delemotte L, Lindahl E, Baaden M, Lindorff-Larsen K, Chavent M, Poulain P. MDverse: Shedding Light on the Dark Matter of Molecular Dynamics Simulations. bioRxiv [Preprint]. 2023 May 2:2023.05.02.538537. doi: 10.1101/2023.05.02.538537. PMID: 37205542; PMCID: PMC10187166.

[2] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.
