Repo for CDM input data loading and wrangling
The data loader utils package uses uv for Python environment and package management. See the uv installation instructions to set up uv on your system.
The data loader utils require Python 3.13 or above.
To install dependencies (including Python), run
> uv sync
To activate a virtual environment with these dependencies installed, run
> uv venv
# you will now be prompted to activate the virtual environment
> source .venv/bin/activate
If you are using an IDE like VSCode, it should pick up the newly created environment and offer it for executing Python code.
Some parts of this codebase rely on having a Spark instance available. Spark dependencies are pulled in by the berdl-notebook-utils package from BERDataLakehouse/spark_notebook, and the Docker container generated by the same repo should be used for development and testing to mimic the container where code will be run.
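For quick local experiments outside the container, a plain local-mode Spark session is usually enough. The following is a minimal sketch assuming a local pyspark installation; inside the BERDL container, the session setup provided by berdl-notebook-utils should be used instead (its exact API is not shown here), and the app name below is only illustrative.

from pyspark.sql import SparkSession

# Local-mode Spark session for development; the BERDL container provides its
# own preconfigured session helpers via berdl-notebook-utils.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("cdm-data-loader-dev")
    .getOrCreate()
)

# Tiny smoke test that the session works.
df = spark.createDataFrame([("scaffold_1", 1500)], ["contig_id", "length"])
df.show()
spark.stop()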
Pull the docker image:
> docker pull ghcr.io/berdatalakehouse/spark_notebook:main
Mount the current directory at /tmp/cdm and run the tests:
> docker run --rm -e NB_USER=runner -v .:/tmp/cdm ghcr.io/berdatalakehouse/spark_notebook:main /bin/bash /tmp/cdm/scripts/run_tests.sh
Run the container interactively as the user runner; current directory is mounted at /tmp/cdm:
> docker run --rm -e NB_USER=runner -it -v .:/tmp/cdm ghcr.io/berdatalakehouse/spark_notebook:main
This will launch a bash shell; the contents of the cdm-data-loader-utils directory are mounted at /tmp/cdm.
Run the container and sleep:
> docker run --rm -e NB_USER=runner -it -v .:/tmp/cdm ghcr.io/berdatalakehouse/spark_notebook:main sleep 100000000
The sleep command keeps the container running long enough for you to connect to it via Docker Desktop or the VSCode Containers extension.
See the BERDataLakehouse/spark_notebook repo for more information on the container and for a full docker-compose set up to mimic the BER Data Lakehouse container infrastructure.
To run the tests, execute the command:
> uv run pytest
To generate coverage for the tests, run
> uv run pytest --cov=src --cov-report=xml tests/
The standard Python coverage package is used; coverage can be generated as HTML or in other formats by changing the --cov-report parameter (e.g. --cov-report=html).
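For contributors unfamiliar with pytest: tests are plain functions in files named test_*.py under tests/. A minimal hypothetical example (the file and function names below are illustrative and not existing repo contents):

# tests/test_example.py (hypothetical; pytest discovers any test_* function)
def test_genome_paths_entry_has_expected_keys():
    # A minimal stand-in for one entry of the genome paths file.
    entry = {"fna": "a.fna", "gff": "a.gff", "protein": "a.faa"}
    assert set(entry) == {"fna", "gff", "protein"}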
The genome loader can be used to load and integrate data from related GFF and FASTA files. Currently, the loader requires a GFF file and two FASTA files (one for amino acid sequences, one for nucleic acid sequences) for each genome. The list of files to be processed should be specified in the genome paths file, which has the following format:
{
"FW305-3-2-15-C-TSA1.1": {
"fna": "tests/data/FW305-3-2-15-C-TSA1/FW305-3-2-15-C-TSA1_scaffolds.fna",
"gff": "tests/data/FW305-3-2-15-C-TSA1/FW305-3-2-15-C-TSA1_genes.gff",
"protein": "tests/data/FW305-3-2-15-C-TSA1/FW305-3-2-15-C-TSA1_genes.faa"
},
"FW305-C-112.1": {
"fna": "tests/data/FW305-C-112.1/FW305-C-112.1_scaffolds.fna",
"gff": "tests/data/FW305-C-112.1/FW305-C-112.1_genes.gff",
"protein": "tests/data/FW305-C-112.1/FW305-C-112.1_genes.faa"
}
}
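For reference, a minimal Python sketch (hypothetical, not part of this repo) showing how a genome paths file in this format can be read and checked; the input path is a placeholder:

import json
from pathlib import Path

# Load the genome paths file (format shown above).
with open("path/to/genome_paths_file.json") as fh:
    genome_paths = json.load(fh)

# Each genome ID maps to a nucleotide FASTA ("fna"), a GFF ("gff"),
# and a protein FASTA ("protein").
for genome_id, files in genome_paths.items():
    for key in ("fna", "gff", "protein"):
        path = Path(files[key])
        if not path.is_file():
            print(f"{genome_id}: missing {key} file at {path}")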
run_tools.sh runs the stats script from bbmap and checkm2 on files with the suffix "fna". These tools can be installed using conda:
conda env create -f env.yml
conda activate genome_loader_env
# download the checkm2 database
checkm2 database --download
Run the stats and checkm2 tools with the following command:
bash scripts/run_tools.sh path/to/genome_paths_file.json output_dir
where path/to/genome_paths_file.json specifies the path to the genome paths file (format specified above) and output_dir is the directory for the results.