Command-line utilities for building the InterPro 7 data warehouse.
This repository contains the code used to build the data warehouse for InterPro's new website (InterPro 7, in our vernacular). The data warehouse is based on MySQL and Elasticsearch. Data is exported from InterPro's Oracle production database to binary, compressed, indexed files so that out-of-core operations can run on the EBI cluster instead of relying exclusively on Oracle.
Requirements:
- Python 3.11+
- Python packages oracledb, elasticsearch, and mysqlclient
- Python package mundone
git clone https://github.com/ProteinsWebTeam/interpro7-dw.git
cd interpro7-dw
pip install .
Copy or edit config.toml to set the options described below.
| Option | Description |
|---|---|
| version | InterPro release version, e.g. 80.0 |
| date | InterPro release date (expected format: YYYY-MM-DD) |
| update | Should the Oracle database be updated with version and date |
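The `date` option has to follow the YYYY-MM-DD format. As a minimal sketch (not the check the build itself performs), the two values could be validated before starting a run. The helper name `check_release` is hypothetical, and the `major.minor` shape of `version` is an assumption based only on the `80.0` example above.

```python
import re
from datetime import date, datetime
from typing import Tuple


def check_release(version: str, release_date: str) -> Tuple[str, date]:
    """Validate the release version and date (hypothetical helper)."""
    # Assumption: versions look like "80.0" (major.minor).
    if not re.fullmatch(r"\d+\.\d+", version):
        raise ValueError(f"unexpected version: {version!r}")

    # Require zero-padded YYYY-MM-DD first, because strptime alone
    # would also accept values such as "2020-6-1".
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", release_date):
        raise ValueError(f"unexpected date format: {release_date!r}")

    # strptime rejects impossible dates (e.g. month 13).
    parsed = datetime.strptime(release_date, "%Y-%m-%d").date()
    return version, parsed
```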
| Option | Description |
|---|---|
| alphafold | Mapping between UniProtKB accessions and AlphaFold IDs |
| bfvd | Tar archive of BFVD predictions |
| intact | IntAct PSI-MI TAB file of interactions |
| interpro | Directory of member database directories/files |
| metacyc | MetaCyc data file (expects a .tar.gz archive) |
| work | Directory used to store data files |
| temp | Directory used for temporary files created in each task |
Expected format: user/password@host:port/schema.
| Option | Description |
|---|---|
| interpro.production | Oracle production database (interpro user) |
| iprscan.production | Oracle production database (iprscan user) |
| interpro.staging | InterPro release/staging MySQL database |
| interpro.release | InterPro release/offsite MySQL database |
| goa | GOA Oracle database |
| pdbe | PDBe Oracle database |
| uniprot | UniProtKB/Swiss-Prot Oracle database |
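To make the `user/password@host:port/schema` format concrete, here is a minimal sketch that splits such a string into its parts. The names `Connection` and `parse_connection` are hypothetical and the regular expression is one possible reading of the format, not necessarily how the build code parses it.

```python
import re
from typing import NamedTuple


class Connection(NamedTuple):
    user: str
    password: str
    host: str
    port: int
    schema: str


# user/password@host:port/schema
# The password group is greedy, so a password containing "@" still works:
# the host is taken from after the last "@".
_PATTERN = re.compile(r"([^/]+)/(.+)@([^@:/]+):(\d+)/([^/]+)")


def parse_connection(value: str) -> Connection:
    """Split a connection string into its components (hypothetical helper)."""
    match = _PATTERN.fullmatch(value)
    if match is None:
        # Do not echo the value in the error: it contains a password.
        raise ValueError("expected user/password@host:port/schema")
    user, password, host, port, schema = match.groups()
    return Connection(user, password, host, int(port), schema)
```

Returning a named tuple keeps the port as an integer and lets callers pass the individual fields to whichever database driver they use.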
For each Elasticsearch cluster, the following options need to be provided:
- nodes: list of Elasticsearch nodes (format: host:port)
- user: Elasticsearch user
- password: Password for the user
- fingerprint: SSL certificate fingerprint
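Since every cluster needs all four options, a small check can catch an incomplete configuration early. The sketch below works on plain dictionaries keyed by cluster identifier, which is the shape Python's standard `tomllib` module (3.11+) produces for a TOML table written with dotted keys. The function names are hypothetical, and this is not necessarily how the build code validates its configuration.

```python
from typing import Dict, List

# The four options listed above.
REQUIRED = ("nodes", "user", "password", "fingerprint")


def missing_options(cluster: dict) -> List[str]:
    """Return the required options absent from one cluster's settings."""
    return [key for key in REQUIRED if key not in cluster]


def check_clusters(clusters: Dict[str, dict]) -> Dict[str, List[str]]:
    """Map each incomplete cluster identifier to its missing options."""
    problems = {}
    for name, options in clusters.items():
        missing = missing_options(options)
        if missing:
            problems[name] = missing
    return problems
```

An empty result means every cluster defines all four options.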
Assuming we have two clusters (test and prod), the configuration would look like:
[elasticsearch]
test.nodes = [ "es-test-node1:9200", "es-test-node2:9200" ]
test.user = "elastic"
test.password = "..."
test.fingerprint = "..."
prod.nodes = [ "es-prod-node1:9200", "es-prod-node2:9200" ]
prod.user = "elastic"
prod.password = "..."
prod.fingerprint = "..."
| Option | Description |
|---|---|
| ebisearch | Directory monitored by EBI Search to index cross-references |
| goa | Directory for mappings required by the GOA team |
| interpro | Directory for archived FTP files (should not finish with the release number, as release.version is appended at run time) |
| Option | Description |
|---|---|
| path | Directory for job input/output files |
| scheduler | Scheduler and queue (format: scheduler[:queue], e.g. lsf:production or slurm) |
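The `scheduler[:queue]` format can be read as a scheduler name with an optional queue after a colon. A minimal sketch follows, assuming that `lsf` and `slurm` (the two values shown in the table) are the only accepted schedulers. The helper name `parse_scheduler` is hypothetical.

```python
from typing import Optional, Tuple

# Assumption: only the two schedulers mentioned in the table are valid.
KNOWN_SCHEDULERS = {"lsf", "slurm"}


def parse_scheduler(value: str) -> Tuple[str, Optional[str]]:
    """Split 'scheduler[:queue]' into (scheduler, queue or None)."""
    # partition() returns an empty queue when there is no colon.
    name, _, queue = value.partition(":")
    name = name.strip()
    if name not in KNOWN_SCHEDULERS:
        raise ValueError(f"unknown scheduler: {name!r}")
    return name, (queue.strip() or None)
```

With this reading, `lsf:production` yields the scheduler `lsf` with the queue `production`, and `slurm` alone yields `slurm` with no queue, leaving the choice to the scheduler's default.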
Exporting data from Oracle
| Task name | Description |
|---|---|
| export-clans | Export clan information, including profile-profile alignments |
| export-databases | Export database information |
| export-entries | Export InterPro entries and member database signatures |
| export-isoforms | Export Varsplic matches |
| export-features | Export sequence feature matches (e.g. Coils, MobiDB-Lite) |
| export-matches | Export protein matches from member databases |
| export-proteins | Export protein information such as taxon ID, length, UniProt identifier, etc. |
| export-residues | Export residue annotations (site matches) |
| export-uniparc | Export all member database matches against UniParc |
| export-pdbe-matches | Export matches against sequences in PDBe |
| export-taxa | Export taxonomic data |
| export-structures | Export PDBe structures |
| export-structure-chains | Export the UniProt-PDBe mapping |
| export-pfam-alignments | Export sequence alignments of Pfam families |
| export-alphafold | Export the UniProt-AlphaFold mapping |
| export-hmms | Export models for HMMER3-based member databases |
| export-sequences | Export protein sequences |
| export-reference-proteomes | Export reference proteomes |
| export-evidences | Export UniProt evidences/genes |
| export-functions | Export Swiss-Prot function comments |
| export-names | Export UniProt descriptions/names |
| export-proteomes | Export UniProt-proteome mapping |
Tracking cross-references between clans, entries, proteomes, proteins, structures, and taxa
| Task name | Description |
|---|---|
| export-dom-orgs | Track the domain architecture/organisation for each UniProt protein |
| export-sim-entries | Track the similar entries based on how much they overlap in all the proteins they match |
| export-clan2xrefs | Clans × (domain organisations, proteomes, proteins, structures, taxa) |
| export-entry2xrefs | Entries × (AlphaFold models, domain organisations, EC numbers, pathways, proteins, structures, taxa) |
| export-proteome2xrefs | Proteomes × (domain organisations, clans, entries, proteins, structures, taxa) |
| export-structure2xrefs | Structures × (domain organisations, clans, entries, proteomes, proteins, taxa) |
| export-taxon2xrefs | Taxa × (entries, proteomes, proteins, structures, taxa) |
Creating/populating MySQL tables
| Task name | Description |
|---|---|
| insert-annotations | Insert Pfam sequence alignments, profile HMMs, and logos from profile HMMs |
| insert-clans | Insert clans |
| insert-databases | Insert database information |
| insert-entries | Insert InterPro entries and member database signatures |
| insert-entries-taxa | Insert the taxonomic distribution of InterPro entries and member database signatures |
| insert-features | Insert sequence feature matches |
| insert-isoforms | Insert alternatively spliced isoforms |
| insert-residues | Insert site annotations |
| insert-proteins | Insert UniProt proteins with enriched information (e.g. structural features, sequences, etc.) |
| insert-proteomes | Insert UniProt reference proteomes |
| insert-structures | Insert PDBe structures with enriched information (e.g. secondary structures, literature references) |
| insert-struct-models | Insert structural models |
| insert-taxa | Insert taxonomic data |
| insert-release-notes | Insert (or update) release notes (number of entries, proteins, recent integrations, etc.) |
Indexing (some) MySQL tables
For some tables, index creation is a separate task, so that the table does not have to be re-populated from scratch should an error occur.
| Task name |
|---|
| index-annotations |
| index-entries |
| index-features |
| index-residues |
| index-proteins |
| index-taxa |
Creating Elasticsearch clusters
In the following tasks, id represents the cluster identifier, as defined in the config file.
| Task name | Description |
|---|---|
| es-export | Export documents for Elasticsearch (IDA and relationship documents) |
| es-init-id | Create the staging indexes on cluster id |
| es-index-id | Index documents on cluster id |
| es-publish-id | Make staging indexes live on cluster id |
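Because `id` is replaced by each cluster identifier, the concrete task names depend on the clusters in the config file. The sketch below expands the names from the table for a list of identifiers. The helper `es_task_names` is hypothetical, and the init, index, publish order is an assumption drawn from the table's row order.

```python
from typing import Iterable, List


def es_task_names(cluster_ids: Iterable[str]) -> List[str]:
    """Expand the per-cluster Elasticsearch task names (hypothetical helper)."""
    # es-export is listed once in the table, without a cluster suffix.
    tasks = ["es-export"]
    for cluster_id in cluster_ids:
        # Assumed order: create staging indexes, index documents, go live.
        for step in ("init", "index", "publish"):
            tasks.append(f"es-{step}-{cluster_id}")
    return tasks
```

For the two clusters `test` and `prod`, this gives `es-export` followed by `es-init-test`, `es-index-test`, `es-publish-test`, and the same three tasks for `prod`.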
Exporting files for the public FTP
| Task name | Description |
|---|---|
| export-features-xml | Export an XML file of sequence feature matches (MobiDB-Lite, TMHMM, Phobius, Coils) |
| export-flat-files | Export flat files (list of entries, InterPro-GO mapping, protein matches, etc.) |
| export-interpro-xml | Export an XML file of InterPro entries and their annotations (e.g. abstract, member database signatures, etc.) |
| export-matches-xml | Export an XML file of member database protein matches |
| export-release-notes | Export a text file containing the release notes |
| export-structures-xml | Export an XML file of structural matches (PDBe, CATH, SCOP) |
| export-site-annotations | Export an XML file of all site annotations |
| export-uniparc-xml | Export a tar.gz archive of all UniParc matches |
Exporting files for internal use (other EMBL-EBI groups)
| Task name | Description |
|---|---|
| export-ebisearch | Export JSON files of InterPro entries and member database signatures, and their cross-references |
| publish-ebisearch | Move JSON files created in ebisearch to a directory monitored by EBI Search |
| export-goa | Export mappings between PDBe, InterPro, GO, and UniProt |
| publish-goa | Move files to the directory monitored by the GOA team |
Building Oracle tables for the match look-up service
| Task name | Description |
|---|---|
| build-upi-md5 | Build table of protein MD5 checksums |
| build-lookup-tab | Build table of protein matches |
| build-lookup-tab-idx | Index table of protein matches |
| build-site-lookup-tab | Build table of site annotations |
| build-site-lookup-tab_idx | Index table of site annotations |
interprodw-build
usage: interprodw-build [-h] [-t [TASK [TASK ...]]] [--dry-run] [--detach] [-v] config.ini
Build InterPro7 data warehouse
positional arguments:
config.ini configuration file
optional arguments:
-h, --help show this help message and exit
-t [TASK [TASK ...]], --tasks [TASK [TASK ...]]
tasks to run
--dry-run list tasks to run and exit
--detach enqueue tasks to run and exit
-v, --version show the version and exit
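When the build is driven from a script, the command line can be assembled from the options in the usage text above. The following is a minimal sketch with a hypothetical helper, `build_command`, that uses only the flags shown there (`-t`, `--dry-run`, `--detach`). The config file is placed before `-t` because `-t` accepts a variable number of task names, so a path placed directly after them would likely be read as another task.

```python
from typing import List, Sequence


def build_command(config: str, tasks: Sequence[str] = (),
                  dry_run: bool = False, detach: bool = False) -> List[str]:
    """Assemble an interprodw-build command line (hypothetical helper)."""
    # Positional config file first, so it cannot be mistaken for a task.
    command = ["interprodw-build", config]
    if tasks:
        command.append("-t")
        command.extend(tasks)
    if dry_run:
        command.append("--dry-run")
    if detach:
        command.append("--detach")
    return command


# Example (not executed here): list what would run without running it.
#   import subprocess
#   subprocess.run(build_command("config.toml", ["export-entries"],
#                                dry_run=True), check=True)
```

Running with `--dry-run` first is a cheap way to confirm the task selection before enqueuing anything with `--detach`.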