lamlpro is a maximum likelihood cell lineage tree inference algorithm
based on the under the Probabilistic Mixed-type Missing Observation (PMMO)
model. lamlpro has two modes and can i) quickly infer branch lengths for a given cell
lineage tree topology and ii) find the most likely cell lineage tree using
topology search.
If you find lamlpro useful in your research, please cite the following paper:
If you are interested in reproducing the analyses in this paper, the reproducibility repository can be found at LAML-Pro-experiments.
To build lamlpro manually requires only a modern C++20 compiler and CMake. We recommend using python 3.10 for the helper scripts.
To install, simply clone the repository and compile the code using CMake, making
sure to initialize all git submodules.
git clone git@github.com:raphael-group/LAML-Pro.git --recursiveWe refer the user to the ipopt installation instructions here.
Generally, however, if you are on MacOS you can install with brew install ipopt and on Linux you can run conda install -c conda-forge ipopt.
Please change directories into LAML-Pro (cd LAML-Pro) and run the following commands:
mkdir build
cd build
cmake .. -DCMAKE_POLICY_VERSION_MINIMUM=3.5
make
mv src/lamlpro ../lamlproThe output files consist of the binary lamlpro which can be executed
from the command line. For ease of use, we suggest adding the lamlpro
binary to a directory listed in the PATH environmental variable.
At the end of these steps, lamlpro should sit in the root directory.
lamlpro is a command-line tool to infer a cell lineage tree
- The observed data
$X$ is an$n\text{-by-}m$ character matrix specifying the character-states of each of$n$ cells at$m$ characters. - The observed data
$X$ is an$n\text{-by-}m+2$ matrix, where$n$ specifies the number of observations,$m$ is the number of edit states (including$0$ ), and$2$ corresponds to additional metadata columns. The columns should becell_name,target_site, followed by columns corresponding to each statestate0_prob,state1_prob,state2_prob,state3_prob. The state probability values should correspond to the generative probability (e.g. probability of hidden state$0$ having generated this observation), reported be in log space. They should be named with this convention, since we identify these columns asstate*_prob.
lamlpro requires an initial cell lineage tree
- mode:
optimizefinds the optimal branch lengths$\delta_e$ and model parameter$\Theta$ for$\mathcal{T}_0$ under the PMMO model. - mode:
searchfinds the most likely tree$\mathcal{T}^*$ , branch lengths$\delta_e$ and model parameters$\Theta$ under the PMMO model.
If you are interested in inferring a tree from scratch, we recommend using the search mode of
lamlpro. If you are interested fitting branch lengths to a tree we recommend using the optimize mode
of lamlpro. In either case, one is required to specify both the observations --tree and --matrix.
The tool has the following usage format:
Usage: lamlpro [--help] [--version] [--mutation-priors VAR] --matrix VAR [--data-type VAR] --tree VAR --output VAR [--verbose] [--ultrametric] [--threads VAR] [--mode VAR] [--seed VAR] [--max-iterations VAR] [--temp VAR] [--min-branch-length VAR]
Optional arguments:
-h, --help shows help message and exits
--version prints version information and exits
-m, --mutation-priors path to the mutation priors file (CSV) [nargs=0..1] [default: ""]
-c, --matrix path to the observed data file (CSV) [required]
-d, --data-type options are 'character-matrix' or 'observation-matrix'. [nargs=0..1] [default: "character-matrix"]
-t, --tree path to the rooted binary tree (newick) [required]
-o, --output prefix for output files [required]
-v, --verbose save all console logs to a file automatically.
-u, --ultrametric enforce ultrametric constraint during optimization.
--threads number of threads to use [nargs=0..1] [default: 10]
--mode 'optimize' for parameter optimization or 'search' for tree search [nargs=0..1] [default: "optimize"]
--seed random seed for reproducibility [nargs=0..1] [default: 73]
--max-iterations maximum number of iterations for hill climbing [nargs=0..1] [default: 20000]
--temp starting temperature for topology search [nargs=0..1] [default: 0.1]
--min-branch-length minimum branch length relative to scaled tree with unit height [nargs=0..1] [default: 0.01]
There are two main output files of lamlpro: the *.newick output tree in newick format, and *results.json which contains important parameter estimates as well as metadata.
Tip
Use the flag --ultrametric to ensure the cell lineage tree has equal
length root-to-leaf paths.
We provide simulated cell lineage trees with lamlpro.
To apply lamlpro to character matrix data, we first infer a cell lineage tree
python scripts/neighbor_joining.py examples/n250_m30_character_matrix/character_matrix.csv examples/n250_m30_character_matrix/initial
Tip
This script uses the neighbor joining implementation from biopython and pandas. You can install Python dependencies for the other scripts with pip3 install biopython pandas dendropy scipy.
This results in two files examples/n250_m30_character_matrix/initial_hamming_tree.nwk and
examples/n250_m30_character_matrix/initial_weighted_hamming_tree.nwk. If we compute the
distance from the inferred and true trees, we see that both are quite far away from
the ground truth:
python scripts/metrics.py --reference examples/n250_m30_character_matrix/tree.nwk --trees examples/n250_m30_character_matrix/initial_hamming_tree.nwk examples/n250_m30_character_matrix/initial_weighted_hamming_tree.nwk
Tree RF Distance Normalized RF
initial_hamming_tree.nwk 404 0.817814
initial_weighted_hamming_tree.nwk 356 0.720648
With either initial tree, one can run lamlpro with the following command:
./lamlpro --matrix examples/n250_m30_character_matrix/character_matrix.csv --tree examples/n250_m30_character_matrix/initial_weighted_hamming_tree.nwk -o examples/n250_m30_character_matrix/lamlpro --ultrametric --mode search --max-iterations 2500
The preceding command enforces that the tree is ultrametric and runs topology search for 2500
iterations. For practical applications, we recommend setting this value higher and making
sure that the algorithm converges by checking that the log-likelihood improvements have plateaued (by looking at the .json output file). The preceding command results in two output files:
examples/n250_m30_character_matrix/lamlpro_tree.newick and
examples/n250_m30_character_matrix/lamlpro_results.json.
The first file is the inferred cell lineage tree with branch lengths (in mutation units) and the second contains
parameter estimates and important metadata, such as the per-iteration log-likelihood.
On this example, lamlpro should improve the tree topology and this can be verified by
running:
python scripts/metrics.py --reference examples/n250_m30_character_matrix/tree.nwk --trees examples/n250_m30_character_matrix/initial_weighted_hamming_tree.nwk examples/n250_m30_character_matrix/initial_weighted_hamming_tree.nwk examples/n250_m30_character_matrix/lamlpro_tree.newick
Example output:
Tree RF Distance Normalized RF
initial_hamming_tree.nwk 404 0.817814
lamlpro_tree.hamming.newick 316 0.753036
initial_weighted_hamming_tree.nwk 356 0.720648
lamlpro_tree.weighted_hamming.newick 316 0.639676
lamlpro_tree.tree.newick 136 0.275304
To apply lamlpro to observation matrix data, we first infer a cell lineage tree
We provide an initial starting tree argmax_nj.nwk if you would like to skip directly to testing lamlpro. This example corresponds to simulated data condition k400_s0_sub100_r01_dim3_r0.0439_p0.0.
With an initial tree, one can run lamlpro with the following command:
./lamlpro --matrix examples/n100_m400_observation_matrix/observation_matrix.csv --tree examples/n100_m400_observation_matrix/argmax_nj.nwk -o examples/n100_m400_observation_matrix/lamlpro --ultrametric --mode search --max-iterations 2500 --data-type "observation-matrix"
The preceding command enforces that the tree is ultrametric and runs topology search for 2500
iterations. You can change this number of iterations to be smaller for an initial test. For practical applications, we recommend setting this value higher and making
sure that the algorithm converges by checking that the log-likelihood improvements have plateaued (by looking at the .json output file). The preceding command results in four output files:
examples/n100_m400_observation_matrix/lamlpro_tree.newick and
examples/n100_m400_observation_matrix/lamlpro_results.json and
examples/n100_m400_observation_matrix/lamlpro_posterior_probs.csv and
examples/n100_m400_observation_matrix/lamlpro_posterior_argmax.csv.
The first file is the inferred cell lineage tree with branch lengths (in mutation units) and the second contains parameter estimates and important metadata, such as the per-iteration log-likelihood. The third file contains the probabilities over all states at each cell and site on the fixed maximum likelihood tree after convergence, and the fourth file contains the argmax maximum a posteriori (MAP) genotypes.
On this example, lamlpro should improve the tree topology and this can be verified by
running:
python scripts/metrics.py --reference examples/n100_m400_observation_matrix/tree.nwk --trees examples/n100_m400_observation_matrix/argmax_nj.nwk examples/n100_m400_observation_matrix/example_output/lamlpro_test_tree.newick
Tree RF Distance Normalized RF
examples/n100_m400_observation_matrix/argmax_nj.nwk 24 0.123711
examples/n100_m400_observation_matrix/example_output/lamlpro_test_tree.newick 0 0.000000