Repo for the GreenNLP/OpenEuroLLM document descriptor research project. The project aims to use LLMs to create a dynamic taxonomy of descriptive labels ("descriptors") for web documents.

Made to run on the LUMI supercomputer (https://lumi-supercomputer.eu/). Runs vLLM 0.6.6.
The basic steps for generating a new descriptor schema from scratch are as follows (a command-level sketch follows the list):

- Generate a raw set of descriptors with `descriptor_generation/generate_descriptor.py`.
- Extract descriptor groups from the raw results with `disambiguation/extract_descriptor_groups.py`. Split the data into smaller batches for faster parallel processing by setting `--num-splits`.
- Use the extracted descriptor groups as input for `disambiguation/disambiguate_descriptors.py`.
- If you ran multiple jobs in parallel (which you should with any larger dataset), concatenate all results with `disambiguation/concat_disambig_results.sh`.
- You will probably have to repeat this a few times: extract groups from the disambiguation output, disambiguate again, and so on. Concatenate the results after each disambiguation run if running in parallel.
- Merge the results of your final disambiguation run with `merging/find_synonyms.py`.
- If duplicates still remain, force-merge them with `merging/force_merge.py`.

Your schema is done!
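For orientation, one round of the pipeline could look like the sketch below. Only the script paths and `--num-splits` come from this README; all other arguments are omitted or hypothetical, so check each script's `--help` for the actual interface.

```bash
# Sketch of one schema-generation round; input/output arguments omitted.
python descriptor_generation/generate_descriptor.py

# Extract descriptor groups, splitting into 8 batches for parallel processing
python disambiguation/extract_descriptor_groups.py --num-splits 8

# Disambiguate each batch (in practice, submit one job per split)
python disambiguation/disambiguate_descriptors.py

# Concatenate the results of the parallel disambiguation jobs
bash disambiguation/concat_disambig_results.sh

# Repeat extract -> disambiguate -> concatenate until the groups stabilize,
# then merge the final results:
python merging/find_synonyms.py
python merging/force_merge.py   # only if duplicates remain
```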
Once you have a schema, you can generate descriptors for any dataset and then align those descriptors with the schema (a short command sketch follows):

- Generate a raw set of descriptors with `descriptor_generation/generate_descriptor.py`.
- Harmonize the descriptors with your schema using `harmonize/harmonize_with_schema.py`.

Harmonization done!
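The same pattern in commands (arguments again omitted; see each script's `--help`):

```bash
# Generate raw descriptors for the new dataset
python descriptor_generation/generate_descriptor.py

# Align the generated descriptors with the finished schema
python harmonize/harmonize_with_schema.py
```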
Below are more detailed instructions for running the pipeline on LUMI. If you run this on another machine or cluster, these instructions might not be fully accurate.
To run the descriptor generation pipeline on LUMI:
- Clone this repo into your project's `scratch/` directory.
- `cd doc_descriptors`
- Create a virtual environment (read more: https://docs.csc.fi/support/tutorials/python-usage-guide/#installing-python-packages-to-existing-modules):

  ```bash
  module purge
  module use /appl/local/csc/modulefiles
  module load pytorch/2.5
  python3 -m venv --system-site-packages venv
  source venv/bin/activate
  ```
- Install requirements: `pip install -r requirements.txt`
- By default, the model will be downloaded into your home directory, which will fill up very quickly. I recommend creating a cache folder in your `scratch` and adding this line to your `.bashrc` so you don't have to worry about setting the cache folder manually: `export HF_HOME="/scratch/<your-project>/my_cache"`. You can also set the caching directory in `run_vllm.sh` with the flag `--cache-dir`, e.g. `--cache-dir="/scratch/<your-project>/my_cache"`. A one-time setup sketch is shown below.
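  For example, a one-time cache setup could look like this, using the placeholder path from above:

  ```bash
  # Create the cache directory on scratch (replace <your-project>)
  mkdir -p /scratch/<your-project>/my_cache

  # Persist the Hugging Face cache location across shell sessions
  echo 'export HF_HOME="/scratch/<your-project>/my_cache"' >> ~/.bashrc
  source ~/.bashrc
  ```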
- In `run_generate_descriptors.sh`, change `--account` to your project. It is recommended to reserve a full node, i.e., 8 GPUs, because reserving fewer tends to cause NCCL errors. You have to give a `--run-id`, e.g. `run1`. All other parameters are set to reasonable defaults that you can change if you want to. The sketch below shows roughly what the relevant lines might look like.
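  For orientation only: the partition name and the way `--run-id` is passed through are assumptions here, not copied from the actual script.

  ```bash
  #!/bin/bash
  #SBATCH --account=<your-project>   # change to your own project
  #SBATCH --partition=standard-g     # assumed LUMI GPU partition
  #SBATCH --nodes=1                  # a full node, i.e. 8 GPUs,
  #SBATCH --gpus-per-node=8          # to avoid NCCL errors

  # --run-id is required; other parameters default to reasonable values
  python descriptor_generation/generate_descriptor.py --run-id run1
  ```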
- Run the descriptor generation pipeline: `sbatch run_vllm.sh`
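After submitting, you can check on the job with standard Slurm commands (assuming the default Slurm output file naming; `<jobid>` is the ID that `sbatch` prints):

```bash
sbatch run_vllm.sh          # prints "Submitted batch job <jobid>"
squeue --me                 # check the queue status of your jobs
tail -f slurm-<jobid>.out   # follow the log once the job starts
```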