The original intent of assembling a data set of publicly available tumor-infiltrating T cells (TILs) with paired TCR sequencing was to expand and improve the scRepertoire R package. After some discussion, however, we decided to release the data set for everyone. A complete summary of the sequencing runs and the sample information can be found in the meta data of the Seurat object.
Assembling the data involves several steps: 1) loading the respective gene expression (GE) data, 2) harmonizing the data by sample and cohort information, 3) iterating through automatic annotation, and 4) adding the TCR information. This information is stored in the meta data of the Seurat objects; an explanation of each variable is available here.
```
├── config.yaml - parameter control for processing and integrating
├── data
│   ├── sequencingRuns - 10x Outputs
│   └── processedData - processed .rds and larger combined cohorts
├── environment.yml - python environment
├── figs - image outputs for processing and integration
├── LICENSE.txt
├── NEWS.txt - update information
├── py - python scripts
├── R - R scripts
├── README.md
├── results - intermediate and final outputs
├── run_pipeline.R - main pipeline to run
└── summary - tables summarizing the data
```
Last Updated: 2025-12-03
| Metric | Count |
|---|---|
| Total Cells | 2,606,129 |
| Sequencing Runs | 722 |
| Unique Tissues | 13 |
| Unique Patients | 420 |
| Cells with TCR | 1,841,128 |
Here is the current list of data sources and the number of cells that passed filtering, broken down by tissue type. Please cite the data if you are using uTILity.
| Data Set | Tumor | Normal | Blood | Juxta | LN | Met | Cancer Type | Citations |
|---|---|---|---|---|---|---|---|---|
| CCR-20-4394 | 26760 | 0 | 0 | 0 | 0 | 0 | Ovarian | cite |
| EGAS00001004809 | 181667 | 0 | 0 | 0 | 0 | 0 | Breast | cite |
| GSE114724 | 27651 | 0 | 0 | 0 | 0 | 0 | Breast | cite |
| GSE121636 | 11436 | 0 | 12319 | 0 | 0 | 0 | Renal | cite |
| GSE123814 | 78034 | 0 | 0 | 0 | 0 | 0 | Multiple | cite |
| GSE139555 | 93160 | 78625 | 25363 | 0 | 0 | 0 | Multiple | cite |
| GSE145370 | 66592 | 40916 | 0 | 0 | 0 | 0 | Esophageal | cite |
| GSE148190 | 2263 | 0 | 6201 | 0 | 15644 | 0 | Melanoma | cite |
| GSE154826 | 14491 | 13414 | 0 | 0 | 0 | 0 | Lung | cite |
| GSE159251 | 8356 | 0 | 47721 | 0 | 5705 | 0 | Melanoma | cite |
| GSE162500 | 14644 | 0 | 23401 | 3761 | 0 | 0 | Lung | cite |
| GSE164522 | 36990 | 86811 | 46027 | 0 | 46376 | 36648 | Colorectal | cite |
| GSE168844 | 0 | 0 | 55302 | 0 | 0 | 0 | Lung | cite |
| GSE176021 | 436609 | 128411 | 132673 | 0 | 71063 | 32011 | Lung | cite |
| GSE179994 | 78574 | 0 | 0 | 0 | 0 | 62341 | Lung | cite |
| GSE180268 | 23215 | 0 | 0 | 0 | 29699 | 0 | HNSCC | cite |
| GSE181061 | 40429 | 27622 | 37426 | 0 | 0 | 0 | Renal | cite |
| GSE185206 | 163294 | 17231 | 0 | 0 | 9820 | 0 | Lung | cite |
| GSE195486 | 122512 | 0 | 0 | 0 | 0 | 0 | Ovarian | cite |
| GSE200218 | 0 | 0 | 0 | 0 | 0 | 18495 | Melanoma | cite |
| GSE200996 | 86235 | 0 | 152722 | 0 | 0 | 0 | HNSCC | cite |
| GSE201425 | 22888 | 0 | 27781 | 0 | 11350 | 12253 | Biliary | cite |
| GSE211504 | 0 | 0 | 33685 | 0 | 0 | 0 | Melanoma | cite |
| GSE212217 | 0 | 0 | 229505 | 0 | 0 | 0 | Endometrial | cite |
| GSE213243 | 2835 | 0 | 18363 | 0 | 0 | 2693 | Ovarian | cite |
| GSE215219 | 26303 | 0 | 66000 | 0 | 0 | 0 | Lung | cite |
| GSE227708 | 53087 | 0 | 0 | 0 | 0 | 0 | Merkel Cell | cite |
| GSE242477 | 41595 | 0 | 21595 | 0 | 0 | 0 | Melanoma | cite |
| PRJNA705464 | 98892 | 15113 | 30340 | 0 | 3505 | 0 | Renal | cite |
The filtered gene matrices output by the Cell Ranger align function for individual sequencing runs (10x Genomics, Pleasanton, CA) were loaded into the R global environment. For each sequencing run, a unique prefix was added to the cell barcodes to prevent issues with duplicate barcodes. The results were then ported into individual Seurat objects (citation), where cells with > 10% mitochondrial genes and/or a feature count more than 2.5 standard deviations from the mean were excluded for quality control purposes. At the individual sequencing run level, doublets were estimated using the scDblFinder (v1.4.0) R package.
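The quality control rules above can be illustrated with a minimal, standard-library Python sketch. This is not the pipeline's code (the pipeline uses Seurat in R); the function names, the dictionary keys (`barcode`, `n_features`, `pct_mito`), and the choice to apply the standard deviation cutoff on both sides of the mean are assumptions made for illustration.

```python
from statistics import mean, stdev


def prefix_barcodes(barcodes, run_id):
    """Prepend a run identifier so barcodes stay unique across runs."""
    return [f"{run_id}_{bc}" for bc in barcodes]


def qc_filter(cells, max_pct_mito=10.0, n_sd=2.5):
    """Keep cells that pass both quality control rules for one run.

    `cells` is a list of dicts with the hypothetical keys
    'barcode', 'n_features', and 'pct_mito'.
    """
    counts = [c["n_features"] for c in cells]
    mu = mean(counts)
    sd = stdev(counts) if len(counts) > 1 else 0.0
    # Assumption: the cutoff is two-sided around the mean feature count.
    low, high = mu - n_sd * sd, mu + n_sd * sd
    return [
        c for c in cells
        if c["pct_mito"] <= max_pct_mito and low <= c["n_features"] <= high
    ]
```

Because the mean and standard deviation are computed per call, the filter would be applied to one sequencing run at a time, matching the per-run processing described above.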
Automatic annotation was performed using the SingleR (v2.2.0) R package (citation) with the HPCA (citation) and DICE (citation) data sets as references and the fine label discriminators. Individual sequencing runs were subsetted to run through the SingleR algorithm in order to reduce memory demands. The outputs of all the SingleR analyses were collated and appended to the meta data of the Seurat object. Likewise, the Azimuth (v0.4.6.9004) R package (citation) was used for automatic annotation as a partially orthogonal approach.
The filtered contig annotation T cell receptor (TCR) data for available sequencing runs were loaded into the R global environment. Individual contigs were combined using the combineTCR() function of the scRepertoire (v2.0.0) R package (citation). Clonotypes were assigned to barcodes, and where multiple duplicate chains were present for an individual cell, they were filtered to select the top-expressing contig by read count. The clonotype data were then added to the Seurat object, with the proportion across individual patients being used to calculate frequency.
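As a rough illustration of the two ideas in this paragraph (keeping the top contig by read count and computing clonotype frequency as a per-patient proportion), here is a small standard-library Python sketch. It does not reproduce the internals of combineTCR(); the function names and the dictionary keys (`barcode`, `chain`, `cdr3`, `reads`) are hypothetical, and treating frequency as the share of a patient's cells is one reading of the description above.

```python
from collections import Counter, defaultdict


def top_contig_per_chain(contigs):
    """For each (barcode, chain) pair, keep the contig with the most reads.

    `contigs` is a list of dicts with the hypothetical keys
    'barcode', 'chain', 'cdr3', and 'reads'.
    """
    best = {}
    for contig in contigs:
        key = (contig["barcode"], contig["chain"])
        if key not in best or contig["reads"] > best[key]["reads"]:
            best[key] = contig
    return list(best.values())


def clonotype_frequency(cell_clonotypes):
    """Return each clonotype's proportion of cells within its patient.

    `cell_clonotypes` is a list of (patient, clonotype) tuples, one per cell.
    """
    per_patient = defaultdict(Counter)
    for patient, clonotype in cell_clonotypes:
        per_patient[patient][clonotype] += 1
    freq = {}
    for patient, counts in per_patient.items():
        total = sum(counts.values())
        for clonotype, n in counts.items():
            freq[(patient, clonotype)] = n / total
    return freq
```

Normalizing within each patient keeps frequencies comparable across patients that contributed very different numbers of cells.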
Session Info for the initial data processing and analysis can be found here.
As of right now, there is no citation associated with the assembled data set. However, if using the data, please find and cite the corresponding manuscript for each data set, as summarized above or in the summary table. In addition, if using the processed data, feel free to modify the language in the methods section (above) and please cite the appropriate manuscripts of the software or references that were used.
- Seurat v5.3.1 - citation
- SingleR v2.10.0 - citation
- Azimuth v0.4.6.9004 - citation
- scRepertoire v2.0.0 - citation
- scanpy v1.9 - citation
- scVI - citation
- scANVI - citation
- scirpy - citation
- Human Primary Cell Atlas (HPCA) - citation
- Monaco Data Set (Monaco) - citation
- PBMC reference - citation
If you are interested in setting up and running the evaluation of the uTILity
pipeline, please download the processedData file from the Zenodo
archive, unzip it, and place it in the ./data directory.
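If you prefer to script the unzip step, a minimal Python sketch using only the standard library could look like the following. The function name and the default `data` target are assumptions for illustration; the archive file name depends on what you downloaded from Zenodo and is not specified here.

```python
import zipfile
from pathlib import Path


def extract_processed_data(archive_path, data_dir="data"):
    """Unzip a downloaded archive into the repository's data directory.

    Returns the sorted names of the top-level entries now in `data_dir`.
    """
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(Path(archive_path)) as zf:
        zf.extractall(target)
    return sorted(p.name for p in target.iterdir())
```

Run it from the repository root with the path to the downloaded archive as the first argument, so that the contents land under ./data.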
```bash
git clone https://github.com/ncborcherding/utility
cd utility
```
The pipeline uses a single conda environment for all Python dependencies.
```bash
# Create the environment (this may take 5-10 minutes)
conda env create -f environment.yml
# Verify installation
conda activate sc-integration-benchmark
python -c "import scvi; import scanpy; print('✓ Python packages OK')"
conda deactivate
```
For NVIDIA GPU acceleration, edit environment.yml before creating the environment:
- Comment out `- cpuonly`
- Uncomment the appropriate CUDA version:

```yaml
# Option B: CUDA 11.8
- pytorch>=2.2
- pytorch-cuda=11.8
# OR Option C: CUDA 12.1
- pytorch>=2.2
- pytorch-cuda=12.1
```
Verify GPU detection after installation:
```bash
conda activate sc-integration-benchmark
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```
The CPU-only configuration works on Apple Silicon. MPS (Metal Performance Shaders) acceleration is automatically detected but has limitations:
- Mixed precision is not supported on MPS; the scripts automatically force precision: "32-true"
- Batch sizes are capped at 128 for stability
- Performance is good but not as fast as NVIDIA GPUs
No changes to environment.yml are needed for Apple Silicon.
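The device-dependent behavior described above can be sketched as follows. This is an illustrative approximation, not the code in the repository's scripts: the function names, the default batch size of 256, and the default `"16-mixed"` precision are assumptions, while the `"32-true"` override and the cap of 128 on MPS come from the notes above.

```python
def training_settings(device, batch_size=256, precision="16-mixed"):
    """Adjust requested training settings for a device string.

    On 'mps', force full 32-bit precision and cap the batch size at 128.
    Other devices keep the requested values unchanged.
    """
    if device == "mps":
        return {
            "device": device,
            "precision": "32-true",
            "batch_size": min(batch_size, 128),
        }
    return {"device": device, "precision": precision, "batch_size": batch_size}


def detect_device():
    """Pick 'cuda', 'mps', or 'cpu'; fall back to 'cpu' if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```

Calling `training_settings(detect_device())` would then give one set of values to pass on to the training code, whichever hardware is present.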
```r
# Install BiocManager if needed
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
# Core packages (remotes is needed for install_github() below)
install.packages(c("Seurat", "yaml", "dplyr", "data.table", "reticulate", "remotes"))
# Bioconductor packages
BiocManager::install(c("batchelor", "rhdf5"))
# SeuratDisk (for h5ad conversion)
remotes::install_github("mojaveazure/seurat-disk")
```
Configure per-session in R:
```r
library(reticulate)
# Use the conda environment
use_condaenv("sc-integration-benchmark", required = TRUE)
# Verify
py_config()
```
The data and analysis of uTILity are provided under a CC BY-NC 4.0 license. Please feel free to remix, transform, and build upon the material; however, the intent of this resource is noncommercial.
For questions, comments, or suggestions, please feel free to contact Nick Borcherding via this repository.

