Example Box Plot Output for Bootstrap Predictions of Opsin λmax by OPTICS
- OPTICS is an open-source tool that predicts the Opsin Phenotype (λmax) from unaligned opsin amino-acid sequences.
- OPTICS leverages machine learning models trained on the Visual Physiology Opsin Database (VPOD).
- OPTICS can be downloaded and used as a command-line or GUI tool.
- OPTICS is also available as an online tool here, hosted on our Galaxy Project server.
- Check out our pre-print Accessible and Robust Machine Learning Approaches to Improve the Opsin Genotype-Phenotype Map to read more about it!
- λmax Prediction: Predicts the peak light absorption wavelength (λmax) for opsin proteins.
- Model Selection: Choose from different pre-trained models for prediction.
- Encoding Methods: Select between one-hot encoding or amino-acid property encoding for model training and prediction.
- BLAST Analysis: Optionally perform BLASTp analysis to compare query sequences against reference datasets.
- Bootstrap Predictions: Optionally enable bootstrap predictions for enhanced accuracy assessment (we suggest limiting bootstrap visualizations to 10 sequences).
- Prediction Explanation: Utilizes SHAP to explain the key features driving the λmax difference between any two sequences.
- Clone the repository:
git clone https://github.com/VisualPhysiologyDB/optics.git
- Install dependencies: [Make sure you are working in the repository directory from here on]
A. Create a Conda environment for OPTICS (make sure you have Conda installed)
conda create --name optics_env python=3.11
conda activate optics_env
B. Use the 'requirements.txt' file to install the base package dependencies for OPTICS
pip install -r requirements.txt
C. Download MAFFT and BLAST
IF working on a MAC or LINUX device:
- Install BLAST and MAFFT directly from the bioconda channel
conda install bioconda::blast bioconda::mafft
IF working on a WINDOWS device:
- Manually install the Windows-compatible BLAST executable and add it to your system PATH; the download list is here
- We suggest downloading 'ncbi-blast-2.16.0+-win64.exe'
- You DO NOT need to download MAFFT; OPTICS can run MAFFT from the files included in this GitHub repository.
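(Optional) Before running OPTICS, you can sanity-check that the external tools resolve on your shell PATH; exact version strings will vary by installation, and on Windows only the BLAST check applies since MAFFT is run from the bundled files:
blastp -version
mafft --version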
MAKE SURE YOU HAVE ALL DEPENDENCIES DOWNLOADED AND THAT YOU ARE IN THE FOLDER DIRECTORY FOR OPTICS (or have loaded it as a module) BEFORE RUNNING ANY SCRIPTS!
Required Args:
-i, --input: Either a single sequence or a path to a FASTA file.
General Optional Args:
-o, --output_dir: Desired directory to save output folder/files (optional). Default: './prediction_outputs'
-p, --prediction_prefix: Base filename for prediction outputs. Default: 'unnamed'
-v, --model_version: Version of models to use (optional). Based on the version of VPOD used to train the models. Options/Default: vpod_1.3 (more versions coming later)
-m, --model: Prediction model to use. Options: whole-dataset, wildtype, vertebrate, invertebrate, wildtype-vert, type-one, whole-dataset-mnm, wildtype-mnm, vertebrate-mnm, invertebrate-mnm, wildtype-vert-mnm. **Default: whole-dataset**
-e, --encoding: Encoding method to use (optional). Options: one_hot, aa_prop. Default: aa_prop
--tolerate_non_standard_aa: Allows OPTICS to run predictions on sequences with 'non-standard' amino-acids (e.g., 'X', 'O', 'B') (optional). Default: False
--n_jobs: Number of parallel processes to run (optional). Default: -1, which utilizes all available processors.
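As a quick illustration, a minimal run that relies entirely on the defaults above (whole-dataset model, aa_prop encoding, outputs under './prediction_outputs') only needs an input file; the path below reuses the example FASTA shipped with this repository:
python optics_predictions.py -i ./examples/optics_ex_short.txt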
BLASTp Analysis Args (optional):
--blastp: Enable BLASTp analysis.
--blastp_report: Filename for BLASTp report. Default: blastp_report.txt
--refseq: Reference sequence used for blastp analysis. Options: bovine, squid, microbe, custom. Default: bovine
--custom_ref_file: Path to a custom reference sequence file for BLASTp. Required if --refseq custom is selected.
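For example, a sketch of running BLASTp against your own reference sequence (the custom reference path here is a placeholder for any FASTA file you supply):
python optics_predictions.py -i ./examples/optics_ex_short.txt --blastp --refseq custom --custom_ref_file ./my_reference.fasta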
Bootstrap Analysis Args (optional):
--bootstrap: Enable bootstrap predictions.
--visualize_bootstrap: Enable visualization of bootstrap predictions.
--bootstrap_num: Number of bootstrap models to load for prediction replicates. Default (and maximum): 100
--bootstrap_viz_file: Filename prefix for bootstrap visualization. Default: bootstrap_viz
--save_viz_as: File type for bootstrap visualizations. Options: SVG, PNG, or PDF. Default: SVG
--full_spectrum_xaxis: Enables visualization of predictions on a full spectrum x-axis (300-650nm). Otherwise, x-axis is scaled with predictions.
python optics_predictions.py -i ./examples/optics_ex_short.txt -o ex_test_of_optics -p ex_predictions -m wildtype -e aa_prop --blastp --blastp_report blastp_report.txt --refseq squid --bootstrap --visualize_bootstrap --bootstrap_viz_file bootstrap_viz --save_viz_as SVG
- Unaligned FASTA file containing opsin amino-acid sequences.
- Example FASTA Entry:
>NP_001014890.1_rhodopsin_Bos_taurus
MNGTEGPNFYVPFSNKTGVVRSPFEAPQYYLAEPWQFSMLAAYMFLLIMLGFPINFLTLYVTVQHKKLRT
PLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVC
KPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVV
HFIIPLIVIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQG
SDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQFRNCMVTTLCCGKNPLGDDEASTTVSKTETSQVAPA
- Predictions (TSV): λmax values, model used, and encoding method.
- BLAST Results (TXT, optional): Comparison of query sequences to reference datasets.
- Bootstrap Graphs (SVG/PNG/PDF, optional): Visualization of bootstrap prediction results.
- Job Log (TXT): Log file containing the input command to OPTICS, including encoding method and model used.
Note - All outputs are written into subfolders generated based on your 'prediction_prefix' under your specified output directory, and are marked by time and date.
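As a quick sketch (assuming the example command above was run), you could list the output directory to locate these files; the subfolder name combines your prediction prefix with a date/time stamp, so exact names will differ:
ls -R ./ex_test_of_optics/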
That's right! No need for the command line; OPTICS can also be used as a GUI! Usage is quite simple: just run the command below (with your OPTICS conda environment activated) and get to predicting. ;)
python run_optics_gui.py
Example of the OPTICS GUI interface
The --model flag allows you to select a specific pre-trained model for wavelength prediction. Each available model is named after the data-subset it was trained on, allowing you to choose the one best suited for your research question. This was originally done to test how factors like taxonomic group or gene family inclusivity impact prediction performance.
The primary models include:
- whole-dataset: Trained on the entire VPOD dataset, including all taxonomic groups and both wild-type and mutant sequences. In most cases, this is the recommended model as it leverages the most data.
- Generally, more data = better models (assuming that data is good data)
- wildtype: Trained exclusively on wild-type opsin sequences, with all mutant sequences removed.
- vertebrate: Trained only on sequences from the phylum Chordata.
- invertebrate: Trained only on sequences from species not in the phylum Chordata.
- wildtype-vert: A more specific subset containing only wild-type sequences from vertebrates.
The key difference between models with and without the -mnm suffix lies in the source of the phenotype data (the λmax values).
- Standard models (e.g., wildtype): These are trained exclusively on data where the sequence-to-phenotype relationship was validated experimentally through heterologous expression. This represents a controlled, in-vitro dataset.
- -mnm models (e.g., wildtype-mnm): These are trained on an augmented dataset. It includes the standard heterologous expression data plus additional data from our "Mine-n-Match" (MNM) procedure. This process systematically infers connections between sequences and in-vivo measurements, providing a broader and more biologically contextualized training set.
- Note: the methodology behind MNM and the implementation of that data into VPOD/OPTICS is elaborated upon in our publication introducing OPTICS (Frazer et al. 2025).
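For example, re-running a prediction with the MNM-augmented wild-type model instead of the standard one only requires changing the --model flag (the output prefix here is illustrative):
python optics_predictions.py -i ./examples/optics_ex_short.txt -m wildtype-mnm -e aa_prop -p wildtype_mnm_test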
For users interested in the "nitty-gritty" of why sequences have different predicted λmax values, we provide a specialized script that uses SHAP (SHapley Additive exPlanations). This tool generates a plot and detailed data files that attribute the difference in prediction to specific features (i.e., amino acid sites and their properties).
Example SHAP plot for explaining individual predictions of opsin λmax by OPTICS
Example SHAP comparison plot for explaining pair-wise differences in predictions of opsin λmax by OPTICS
This script requires a FASTA file
- File must contain at least two or more sequences if you are running a SHAP comparison.
- Only a single sequence is needed for an individual SHAP explanation.
Most parameters are identical to the main prediction script. Below are the key arguments:
Required Args:
-i, --input: Path to a FASTA file containing the sequences to analyze (at least two for comparison mode).
Optional Args:
-o, --output_dir: Directory to save the SHAP analysis output folder.
-p, --prediction_prefix: Base filename for the SHAP plot and data files.
--mode: Analysis mode: select 'comparison' for pairwise SHAP comparison of all sequence predictions, 'single' for individual SHAP explanations of all sequences, or 'both' for both outputs.
-m, --model: Prediction model to use for the comparison.
-e, --encoding: Encoding method to use.
--save_viz_as: File type for the SHAP visualization (svg, png, or pdf).
--use_reference_sites: Enable to use reference site numbering (i.e., Bovine or Squid Rhodopsin) instead of feature names.
python optics_shap.py -i ./examples/optics_ex_short.fasta -o ./examples -p short_ex_test_aa_prop --mode both --use_reference_sites
- Unaligned FASTA file containing any number of opsin amino-acid sequences for SHAP comparison.
- Please note: if you are running comparison mode (or both), the analysis is combinatorial, so all sequences are compared in a pairwise fashion (n sequences yield n(n-1)/2 comparisons; e.g., 10 sequences produce 45), which can become computationally expensive.
- SHAP Plot (SVG/PNG/PDF): Visual explanation for the top 10 sites contributing to prediction differences.
- SHAP Data (CSV): Detailed feature attribution values.
- Run Log (TXT): A record of the command used and other information pertaining to the SHAP analysis.
***Note - Once again, all outputs are written into subfolders generated based on your 'prediction_prefix' under your specified output directory, and are marked by time and date.
All data and code are covered under the GNU General Public License (GPL), Version 3, in accordance with Open Source Initiative (OSI) policies.
- IF citing this GitHub and its contents, use the following DOI provided by Zenodo...
10.5281/zenodo.10667840
- IF you use OPTICS in your research, please cite the following paper(s):
- Our more recent publication directly on the making/utility of OPTICS.
Seth A. Frazer, Todd H. Oakley. Accessible and Robust Machine Learning Approaches to Improve the Opsin Genotype-Phenotype Map. bioRxiv, 2025.08.22.671864. https://doi.org/10.1101/2025.08.22.671864
- Our original paper on the development of VPOD, the opsin genotype-phenotype database that serves as the backbone for training the ML models used in OPTICS.
Seth A. Frazer, Mahdi Baghbanzadeh, Ali Rahnavard, Keith A. Crandall, & Todd H. Oakley. Discovering genotype-phenotype relationships with machine learning and the Visual Physiology Opsin Database (VPOD). GigaScience, 2024.09.01. https://doi.org/10.1093/gigascience/giae073
- Contact information for questions or feedback to the authors:
Todd H. Oakley - ORCID ID
oakley@ucsb.edu
Seth A. Frazer - ORCID ID
sethfrazer@ucsb.edu
- Want to use OPTICS without the hassle of setup? -> CLICK HERE to visit our Galaxy Project server and use our tool!
- OPTICS v1.3 uses VPOD_v1.3 for training.
- Here is a link to a bibliography of the publications used to build VPOD_v1.2 (the VPOD_v1.3 version is not yet released).
- If you know of publications suitable for training opsin ML models that are not yet included in the VPOD_v1.2 database, please send them to us through this form.
- Check out the VPOD GitHub repository to learn more about our database and ML models!
