Repo for the GreenNLP/OpenEuroLLM document descriptor research project. The project aims to use LLMs to create a dynamic taxonomy of descriptive labels ("descriptors") for web documents.

Made to run on the LUMI supercomputer (https://lumi-supercomputer.eu/). Runs vLLM 0.6.6.
The basic steps for generating a new descriptor schema from scratch are as follows (a command-level sketch follows the list):

- Generate a raw set of descriptors with `descriptor_generation/generate_descriptor.py`.
- Extract descriptor groups from the raw results with `disambiguation/extract_descriptor_groups.py`. Split the data into smaller batches for faster parallel processing by setting `--num-splits`.
- Use the extracted descriptor groups as input for `disambiguation/disambiguate_descriptors.py`.
- If you ran multiple jobs in parallel (which you should with any larger dataset), concatenate all results with `disambiguation/concat_disambig_results.sh`.
- You will probably have to repeat this a few times: extract groups from the disambiguation output, disambiguate again, and so on. Concatenate the results after each disambiguation run if running in parallel.
- Merge the results of your final disambiguation run with `merging/find_synonyms.py`.
- If duplicates still remain, force-merge them with `merging/force_merge.py`.

Your schema is done!
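For orientation, one round of the pipeline could look like the sketch below. Only the script paths and `--num-splits` come from this README; all other arguments are omitted or hypothetical, so check each script's `--help` for the actual interface.

```bash
# Sketch of one schema-generation round; input/output arguments omitted.
python descriptor_generation/generate_descriptor.py

# Extract descriptor groups, splitting into 8 batches for parallel processing
python disambiguation/extract_descriptor_groups.py --num-splits 8

# Disambiguate each batch (in practice, submit one job per split)
python disambiguation/disambiguate_descriptors.py

# Concatenate the results of the parallel disambiguation jobs
bash disambiguation/concat_disambig_results.sh

# Repeat extract -> disambiguate -> concatenate until the groups stabilize,
# then merge the final results:
python merging/find_synonyms.py
python merging/force_merge.py   # only if duplicates remain
```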
Once you have a schema, you can generate descriptors for any dataset and then align those descriptors with the schema (a short command sketch follows):

- Generate a raw set of descriptors with `descriptor_generation/generate_descriptor.py`.
- Harmonize the descriptors with your schema using `harmonize/harmonize_with_schema.py`.

Harmonization done!
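The same pattern in commands (arguments again omitted; see each script's `--help`):

```bash
# Generate raw descriptors for the new dataset
python descriptor_generation/generate_descriptor.py

# Align the generated descriptors with the finished schema
python harmonize/harmonize_with_schema.py
```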
Below are more detailed instructions for running the pipeline on LUMI. If you run this on another machine or cluster, these instructions might not be fully accurate.
To run the descriptor generation pipeline on LUMI:
- Clone this repo into your project's `scratch/` directory.
- `cd doc_descriptors`
- Create a virtual environment (read more: https://docs.csc.fi/support/tutorials/python-usage-guide/#installing-python-packages-to-existing-modules):

  ```bash
  module purge
  module use /appl/local/csc/modulefiles
  module load pytorch/2.5
  python3 -m venv --system-site-packages venv
  source venv/bin/activate
  ```
- Install requirements: `pip install -r requirements.txt`
- By default, the model will be downloaded into your home directory, which will fill up very quickly. I recommend creating a cache folder in your `scratch` and adding this line to your `.bashrc` so you don't have to worry about setting the cache folder manually: `export HF_HOME="/scratch/<your-project>/my_cache"`. You can also set the caching directory in `run_vllm.sh` with the flag `--cache-dir`, e.g. `--cache-dir="/scratch/<your-project>/my_cache"`. A one-time setup sketch is shown below.
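  For example, a one-time cache setup could look like this, using the placeholder path from above:

  ```bash
  # Create the cache directory on scratch (replace <your-project>)
  mkdir -p /scratch/<your-project>/my_cache

  # Persist the Hugging Face cache location across shell sessions
  echo 'export HF_HOME="/scratch/<your-project>/my_cache"' >> ~/.bashrc
  source ~/.bashrc
  ```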
- In `run_generate_descriptors.sh`, change `--account` to your project. It is recommended to reserve a full node, i.e., 8 GPUs, because reserving fewer tends to cause NCCL errors. You have to give a `--run-id`, e.g. `run1`. All other parameters are set to reasonable defaults that you can change if you want to. The sketch below shows roughly what the relevant lines might look like.
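  For orientation only: the partition name and the way `--run-id` is passed through are assumptions here, not copied from the actual script.

  ```bash
  #!/bin/bash
  #SBATCH --account=<your-project>   # change to your own project
  #SBATCH --partition=standard-g     # assumed LUMI GPU partition
  #SBATCH --nodes=1                  # a full node, i.e. 8 GPUs,
  #SBATCH --gpus-per-node=8          # to avoid NCCL errors

  # --run-id is required; other parameters default to reasonable values
  python descriptor_generation/generate_descriptor.py --run-id run1
  ```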
- Run the descriptor generation pipeline: `sbatch run_vllm.sh`
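After submitting, you can check on the job with standard Slurm commands (assuming the default Slurm output file naming; `<jobid>` is the ID that `sbatch` prints):

```bash
sbatch run_vllm.sh          # prints "Submitted batch job <jobid>"
squeue --me                 # check the queue status of your jobs
tail -f slurm-<jobid>.out   # follow the log once the job starts
```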