uTILity

Comprehensive collection of Single-Cell Tumor-Infiltrating Lymphocyte Data

Introduction

The original intent of assembling a data set of publicly available tumor-infiltrating T cells (TILs) with paired TCR sequencing was to expand and improve the scRepertoire R package. However, after some discussion, we decided to release the data set for everyone. A complete summary of the sequencing runs and the sample information can be found in the meta data of the Seurat object.

Assembling the data involves four steps: 1) loading the respective gene expression (GE) data, 2) harmonizing the data by sample and cohort information, 3) iterating through automatic annotation, and 4) adding the TCR information. The resulting information is stored in the meta data of the Seurat objects; an explanation of each variable is available here.

Folder Structure

├── config.yaml         - parameter controls for processing and integration
├── data
│   ├── sequencingRuns  - 10x Outputs
│   └── processedData   - processed .rds and larger combined cohorts
├── environment.yml     - python environment
├── figs                - image outputs for processing and integration
├── LICENSE.txt
├── NEWS.txt            - update information
├── py                  - python scripts
├── R                   - R scripts
├── README.md
├── results             - intermediate and final outputs
├── run_pipeline.R      - main pipeline to run
└── summary             - tables summarizing the data

Sample ID:

Cohort Information

Cohort Summary

Last Updated: 2025-12-03

Metric	Count
Total Cells	2,606,129
Sequencing Runs	722
Unique Tissues	13
Unique Patients	420
Cells with TCR	1,841,128

The table below lists the current data sources and the number of cells that passed filtering, by tissue type. Please cite the data if you are using uTILity.

Cohort	Tumor	Normal	Blood	Juxta	LN	Met	Cancer Type	Citations
CCR-20-4394	26760	0	0	0	0	0	Ovarian	cite
EGAS00001004809	181667	0	0	0	0	0	Breast	cite
GSE114724	27651	0	0	0	0	0	Breast	cite
GSE121636	11436	0	12319	0	0	0	Renal	cite
GSE123814	78034	0	0	0	0	0	Multiple	cite
GSE139555	93160	78625	25363	0	0	0	Multiple	cite
GSE145370	66592	40916	0	0	0	0	Esophageal	cite
GSE148190	2263	0	6201	0	15644	0	Melanoma	cite
GSE154826	14491	13414	0	0	0	0	Lung	cite
GSE159251	8356	0	47721	0	5705	0	Melanoma	cite
GSE162500	14644	0	23401	3761	0	0	Lung	cite
GSE164522	36990	86811	46027	0	46376	36648	Colorectal	cite
GSE168844	0	0	55302	0	0	0	Lung	cite
GSE176021	436609	128411	132673	0	71063	32011	Lung	cite
GSE179994	78574	0	0	0	0	62341	Lung	cite
GSE180268	23215	0	0	0	29699	0	HNSCC	cite
GSE181061	40429	27622	37426	0	0	0	Renal	cite
GSE185206	163294	17231	0	0	9820	0	Lung	cite
GSE195486	122512	0	0	0	0	0	Ovarian	cite
GSE200218	0	0	0	0	0	18495	Melanoma	cite
GSE200996	86235	0	152722	0	0	0	HNSCC	cite
GSE201425	22888	0	27781	0	11350	12253	Biliary	cite
GSE211504	0	0	33685	0	0	0	Melanoma	cite
GSE212217	0	0	229505	0	0	0	Endometrial	cite
GSE213243	2835	0	18363	0	0	2693	Ovarian	cite
GSE215219	26303	0	66000	0	0	0	Lung	cite
GSE227708	53087	0	0	0	0	0	Merkel Cell	cite
GSE242477	41595	0	21595	0	0	0	Melanoma	cite
PRJNA705464	98892	15113	30340	0	3505	0	Renal	cite

Methods

Single-Cell Data Processing

The filtered gene matrices output by the Cell Ranger align function for individual sequencing runs (10x Genomics, Pleasanton, CA) were loaded into the R global environment. For each sequencing run, a unique prefix was added to the cell barcodes to prevent issues with duplicate barcodes. The results were then ported into individual Seurat objects (citation), and cells with > 10% mitochondrial genes and/or a feature count more than 2.5 standard deviations from the mean were excluded for quality control purposes. At the individual sequencing run level, doublets were estimated using the scDblFinder (v1.4.0) R package.
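The barcode prefixing and quality-control thresholds described above can be sketched as follows. This is a minimal Python illustration, not the pipeline's own R code. The function names (`prefix_barcodes`, `qc_filter`), the dictionary keys, and the choice of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev

def prefix_barcodes(barcodes, run_id):
    """Prepend the sequencing-run ID so barcodes stay unique across runs."""
    return [f"{run_id}_{bc}" for bc in barcodes]

def qc_filter(cells, max_mito_pct=10.0, n_sd=2.5):
    """Keep cells with <= max_mito_pct mitochondrial content and a feature
    count within n_sd standard deviations of the run-level mean.

    cells: list of dicts with hypothetical keys "n_features" and "pct_mito".
    """
    n_features = [c["n_features"] for c in cells]
    mu = mean(n_features)
    sd = pstdev(n_features)
    kept = []
    for c in cells:
        if c["pct_mito"] > max_mito_pct:
            continue  # too much mitochondrial signal
        if sd > 0 and abs(c["n_features"] - mu) > n_sd * sd:
            continue  # feature count is an outlier for this run
        kept.append(c)
    return kept
```

The thresholds are applied per sequencing run, so the mean and standard deviation are computed from the cells of one run only.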

Annotation of Cells

Automatic annotation was performed using the SingleR (v2.2.0) R package (citation) with the HPCA (citation) and DICE (citation) data sets as references and the fine label discriminators. To reduce memory demands, the data were subset by individual sequencing run before running the SingleR algorithm. The outputs of all the SingleR analyses were collated and appended to the meta data of the Seurat object. Likewise, the Azimuth (v0.4.6.9004) R package (citation) was used for automatic annotation as a partially orthogonal approach.
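The run-by-run strategy described above, annotating one sequencing run at a time and then collating the labels, might look like the sketch below. It is written in Python for illustration only. `annotate_fn` is a hypothetical stand-in for the real annotation call and is not part of any package named in this section.

```python
from collections import defaultdict

def annotate_by_run(cells, annotate_fn):
    """Annotate one sequencing run at a time, then collate the labels.

    cells: iterable of (barcode, run_id) pairs.
    annotate_fn: hypothetical stand-in for the annotation step; it takes a
        list of barcodes and returns a {barcode: label} dict.
    """
    by_run = defaultdict(list)
    for barcode, run_id in cells:
        by_run[run_id].append(barcode)

    labels = {}
    for run_id, barcodes in by_run.items():
        # Only one run's worth of cells is handed to the annotator at a time.
        labels.update(annotate_fn(barcodes))
    return labels
```

Because barcodes carry a run-specific prefix, the collated dictionary has no key collisions between runs.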

Addition of TCR data

The filtered contig annotation T cell receptor (TCR) data for available sequencing runs were loaded into the R global environment. Individual contigs were combined using the combineTCR() function of the scRepertoire (v2.0.0) R package (citation). Clonotypes were assigned to barcodes; where multiple duplicate chains were present for an individual cell, they were filtered to select the top-expressing contig by read count. The clonotype data were then added to the Seurat object, with the proportion across individual patients being used to calculate frequency.
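The two rules in this paragraph, keeping the contig with the most reads for each chain of a cell and expressing clonotype frequency as a within-patient proportion, can be illustrated with a short Python sketch. This is not how scRepertoire implements these steps. The function names and the dictionary keys (`barcode`, `chain`, `reads`) are assumptions chosen for the example.

```python
from collections import Counter

def top_contig_per_chain(contigs):
    """For each (barcode, chain) pair, keep the contig with the most reads."""
    best = {}
    for c in contigs:
        key = (c["barcode"], c["chain"])
        if key not in best or c["reads"] > best[key]["reads"]:
            best[key] = c
    return best

def clonotype_frequency(cell_clonotypes):
    """Proportion of each clonotype among one patient's cells.

    cell_clonotypes: {barcode: clonotype} for a single patient.
    """
    counts = Counter(cell_clonotypes.values())
    total = sum(counts.values())
    return {clone: n / total for clone, n in counts.items()}
```

For example, a patient with four TCR-bearing cells, two of which share a clonotype, gives that clonotype a frequency of 0.5 and each of the other two a frequency of 0.25.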

Session Info

Session Info for the initial data processing and analysis can be found here.

Citations

As of right now, there is no citation associated with the assembled data set. However, if using the data, please cite the corresponding manuscript for each data set; these are listed in the cohort table above and in the summary table. In addition, if using the processed data, feel free to modify the language in the methods section (above), and please cite the appropriate manuscripts for the software and references that were used.

Itemized List of the Software Used

Seurat v5.3.1 - citation
SingleR v2.10.0 - citation
Azimuth v0.4.6.9004 - citation
scRepertoire v2.0.0 - citation
scanpy v1.9 - citation
scVI - citation
scANVI - citation
scirpy - citation

Itemized List of Reference Data Used

Human Primary Cell Atlas (HPCA) - citation
Monaco Data Set (Monaco) - citation
PBMC reference - citation

Installation

If you are interested in setting up and running the evaluation of the uTILity pipeline, please download the processedData file from the Zenodo archive, unzip it, and place it in the ./data directory.

Step 1: Clone the Repository

git clone https://github.com/ncborcherding/utility
cd utility

Step 2: Create the Conda Environment

The pipeline uses a single conda environment for all Python dependencies.

# Create the environment (this may take 5-10 minutes)
conda env create -f environment.yml

# Verify installation
conda activate sc-integration-benchmark
python -c "import scvi; import scanpy; print('✓ Python packages OK')"
conda deactivate
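If the one-line import check fails, it stops at the first missing package. A slightly longer check can list everything that is missing at once. The sketch below uses only the Python standard library. The helper name `missing_packages` and the list of packages checked are assumptions for the example, not part of the repository.

```python
import importlib.util

def missing_packages(names):
    """Return the package names that cannot be found in the active environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    # Run inside the activated conda environment.
    missing = missing_packages(["scvi", "scanpy", "torch"])
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

`find_spec` only locates a package without importing it, so this check is quick but does not confirm that a package loads without errors.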

GPU Setup

For NVIDIA GPU acceleration, edit environment.yml before creating the environment:

Comment out the - cpuonly line, then uncomment the appropriate CUDA version:

# Option B: CUDA 11.8
- pytorch>=2.2
- pytorch-cuda=11.8

# OR Option C: CUDA 12.1
- pytorch>=2.2
- pytorch-cuda=12.1

Verify GPU detection after installation:

conda activate sc-integration-benchmark
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

Apple Silicon (M1/M2/M3)

The CPU-only configuration works on Apple Silicon. MPS (Metal Performance Shaders) acceleration is automatically detected but has limitations:

Mixed precision is not supported on MPS; the scripts automatically force precision: "32-true"
Batch sizes are capped at 128 for stability
Performance is good but not as fast as NVIDIA GPUs

No changes to environment.yml are needed for Apple Silicon.
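The MPS adjustments listed above amount to a small override rule. The Python sketch below shows one way to express it. The function name, the accelerator labels, and the default precision and batch size are assumptions for illustration and may differ from what the repository's scripts actually do.

```python
def training_settings(accelerator, precision="16-mixed", batch_size=256):
    """Adjust the requested settings for the detected accelerator.

    accelerator: "cuda", "mps", or "cpu" (hypothetical labels for this sketch).
    The default precision and batch size are illustrative assumptions.
    """
    if accelerator == "mps":
        # Mixed precision is not supported on MPS, and the batch size is
        # capped at 128 for stability.
        precision = "32-true"
        batch_size = min(batch_size, 128)
    return {
        "accelerator": accelerator,
        "precision": precision,
        "batch_size": batch_size,
    }
```

In practice the accelerator label could be derived from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. A batch size already below the cap is left unchanged.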

Step 3: Install R Packages

# Install BiocManager if needed
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Core packages
install.packages(c("Seurat", "yaml", "dplyr", "data.table", "reticulate", "remotes"))

# Bioconductor packages
BiocManager::install(c("batchelor", "rhdf5"))

# SeuratDisk (for h5ad conversion)
remotes::install_github("mojaveazure/seurat-disk")

Step 4: Configure reticulate

Configure per-session in R:

library(reticulate)

# Use the conda environment
use_condaenv("sc-integration-benchmark", required = TRUE)

# Verify
py_config()

License

The data and analysis of uTILity are provided under a CC BY-NC 4.0 license. Please feel free to remix, transform, and build upon the material; however, the intent of this resource is noncommercial.

Contact

For questions, comments, or suggestions, please feel free to contact Nick Borcherding via this repository.
