scGPT-Tangram
Introduction
The integration of multi-omics data, particularly single-cell and spatial transcriptomics, has become crucial for understanding complex biological systems at a higher resolution. This project leverages the scGPT model, a foundation model for single-cell multi-omics data, in combination with Tangram, a deep learning-based framework for spatial transcriptomics alignment, to improve the accuracy and reliability of cell annotations and spatial mapping.
In this project, we use four datasets: MOp 10Xv3 Mouse Primary Motor Cortex, Human Breast, Human Lymph Nodes, and PDAC, to evaluate the effectiveness of our approach. These datasets include both single-cell and spatial transcriptomic data, along with a set of marker genes that is used consistently across all four datasets to ensure comparability.
The project also includes detailed tutorials for using Tangram and the scGPT model, providing users with step-by-step guidance on fine-tuning the scGPT model and generating single-cell references for spatial transcriptomics. The results of our method are systematically compared across different hyperparameters, including the ratio of masked values, the number of spots and cell references, and the effects of fine-tuning, offering insights into the performance improvements achieved through our approach.
Create Environment
First, create the conda environment:
conda create -n scgpt_tangram python==3.10.4
conda activate scgpt_tangram
pip install -r ./requirements.txt
pip install -e ./tangram
Note: It is highly recommended to install the development version of Tangram from ./tangram. The method is the same, but this version returns additional outputs and adds a threshold filter.
Datasets
Note: We used the same marker genes across all four datasets.
- MOp 10Xv3 Mouse Primary Motor Cortex (from Tangram)
  - Single Cell Data
    - https://storage.googleapis.com/tommaso-brain-data/tangram_demo/mop_sn_tutorial.h5ad.gz
    - 26431 cells × 27742 genes
  - Spatial Data
    - https://storage.googleapis.com/tommaso-brain-data/tangram_demo/slideseq_MOp_1217.h5ad.gz
    - 9852 spots × 24518 genes
  - Marker Genes
- Human Breast (from Redeconve)
  - Single Cell Data
    - 100064 cells × 29733 genes
  - Spatial Data
    - 2426 spots × 19237 genes
- Human Lymph Nodes (from Redeconve)
  - Single Cell Data
    - 73260 cells × 18079 genes
  - Spatial Data
    - 4039 spots × 33538 genes
- PDAC (from Redeconve)
  - Single Cell Data
    - 1926 cells × 19736 genes
  - Spatial Data
    - 428 spots × 19736 genes
Tangram Tutorials
New users who are not familiar with Tangram are encouraged to go through ./tangram/tutorial_tangram_with_squidpy.ipynb and ./tangram/tutorial_tangram_without_squidpy.ipynb.
Additional tutorials can be found on Tangram's GitHub page.
Fine-tuning scGPT Model
Fine-tuning the scGPT model requires single-cell expression data in AnnData format, with gene_names and str_batch labeled correctly. Note that the assignment of str_batch does not influence the final generation results. Although the base scGPT "whole-human" model already performs well, a properly fine-tuned model may still improve the results slightly. For more details, see ./scGPT/scGPT_Model_Finetuning.ipynb.
Single Cell Reference Generation
To generate the single-cell reference, run ./scGPT/scGPT_Cell_Reference_Generation.ipynb. The file scGPT_Cell_Reference_Generation.py contains the same content.
The file scGPT_Cell_Reference_Generation_Version2.py uses a slightly different approach: it generates each spot's cell references separately instead of combining them into one batch. This approach uses less GPU memory but takes longer to finish. Both approaches produce the same results.
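The memory trade-off between the two scripts can be illustrated with a minimal sketch. The function fake_generate is a hypothetical, deterministic stand-in for the scGPT generation call (it is not part of this repository or of scGPT), and the toy array stands in for the spatial data. Assuming generation treats each row independently, processing one spot at a time keeps only m rows in memory per call while producing the same stacked output:

```python
import numpy as np

def fake_generate(batch):
    """Hypothetical row-wise stand-in for the scGPT generation step."""
    return np.log1p(batch)

def generate_combined(spots, m):
    """Repeat every spot m times and generate all references in one call."""
    return fake_generate(np.repeat(spots, m, axis=0))

def generate_per_spot(spots, m):
    """Generate m references for one spot at a time, then stack them."""
    chunks = []
    for spot in spots:
        batch = np.repeat(spot[None, :], m, axis=0)  # only m rows per call
        chunks.append(fake_generate(batch))
    return np.concatenate(chunks, axis=0)

# Toy data: 4 spots x 6 genes, m = 3 references per spot.
spots = np.arange(24, dtype=float).reshape(4, 6)
combined = generate_combined(spots, m=3)
per_spot = generate_per_spot(spots, m=3)
print(combined.shape, np.allclose(combined, per_spot))
```

The equality in this sketch relies on the row-independence assumption stated above.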
Generation happens within the function process_spots. It takes the spatial data and the hyperparameters k, m, and ratio as input.
- k: The number of spots to randomly sample from the entire spatial data.
- m: The number of cell references to generate for each selected spot.
- ratio: The fraction of values to mask.
The relationship between k and m remains unclear. In general, the method performs better when $\frac{k}{m}$ is near 10. Future work could use explicit mathematical functions to characterize the relationship between k and m. In addition, the higher the ratio, the lower the accuracy of the generated single-cell sequences.
The output single-cell reference $S$ has dimension $(k \cdot m) \times n_{genes}$.
Note: The input spatial data requires preprocessing similar to that used when fine-tuning the scGPT model.
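How k, m, and ratio determine the shape of the output can be sketched as follows. This is an illustration only: make_masked_inputs is a hypothetical helper, not the repository's process_spots, and the mask value of -1 and the toy Poisson counts are assumptions. The sketch prepares the masked inputs; in the actual pipeline scGPT would then fill in the masked entries.

```python
import numpy as np

def make_masked_inputs(spatial, k, m, ratio, mask_value=-1.0, seed=0):
    """Hypothetical sketch: sample k spots, make m masked copies of each."""
    rng = np.random.default_rng(seed)
    n_spots, n_genes = spatial.shape
    # Randomly sample k distinct spots from the spatial matrix.
    spot_idx = rng.choice(n_spots, size=k, replace=False)
    # Each sampled spot is repeated m times: shape (k * m, n_genes).
    out = np.repeat(spatial[spot_idx], m, axis=0).astype(float)
    # Mask a fraction `ratio` of the genes independently in every copy.
    n_masked = int(round(ratio * n_genes))
    for row in out:
        masked_cols = rng.choice(n_genes, size=n_masked, replace=False)
        row[masked_cols] = mask_value
    return out, spot_idx

# Toy spatial matrix: 50 spots x 20 genes of non-negative counts.
spatial = np.random.default_rng(1).poisson(2.0, size=(50, 20)).astype(float)
masked, spot_idx = make_masked_inputs(spatial, k=10, m=3, ratio=0.5)
print(masked.shape)  # (k * m, n_genes)
```

A higher ratio leaves fewer observed values in each copy, which is in line with the lower accuracy noted above.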
Metrics Evaluation
For metrics evaluation, we measure the correlation between each predicted spot and the actual spot, averaged over all spots:
$$\frac{1}{n_{spot}}\sum_{i=1}^{n_{spot}} corr(A_i, G_i)$$
All comparisons and metrics evaluations are stored in the file ./scGPT/Tangram_scGPT_Comparsion.ipynb.
Note:
- $A$ is the predicted spatial expression matrix with shape $n_{spot} \times n_{genes}$. It is defined as $M^T \times S$.
- $M$ is the mapping matrix from Tangram, with dimension $n_{cell} \times n_{spot}$.
- $S$ is the generated single-cell reference from scGPT, with dimension $n_{cell} \times n_{genes}$. There are two options for selecting genes: the first uses the overlap with the marker genes defined in Tangram; the second uses all genes shared by the spatial data and the generated single-cell data. The first option generally yields a higher correlation, so in the evaluation we calculated the correlation using the overlapping marker genes.
- $G$ is the actual spatial expression matrix with dimension $n_{spot} \times n_{genes}$. The value of $n_{genes}$ is determined by the dimension of $S$.
- $corr$ is the Pearson correlation, implemented through the function numpy.corrcoef.
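The metric can be sketched in a few lines of NumPy. This is a minimal illustration on random toy matrices, not the code from the comparison notebook; the function name spot_correlations and the matrix sizes are assumptions. It forms the predicted matrix from the mapping matrix and the single-cell reference, then computes one Pearson correlation per spot with numpy.corrcoef.

```python
import numpy as np

def spot_correlations(M, S, G):
    """Per-spot Pearson correlation between A = M^T S and G."""
    A = M.T @ S  # predicted expression, shape (n_spot, n_genes)
    return np.array([np.corrcoef(A[i], G[i])[0, 1] for i in range(A.shape[0])])

rng = np.random.default_rng(0)
n_cell, n_spot, n_genes = 30, 8, 12
M = rng.random((n_cell, n_spot))    # toy mapping matrix
S = rng.random((n_cell, n_genes))   # toy single-cell reference
# Toy "actual" spatial data: the prediction plus Gaussian noise.
G = M.T @ S + rng.normal(0.0, 0.5, size=(n_spot, n_genes))

corrs = spot_correlations(M, S, G)
print("mean:", corrs.mean(), "median:", np.median(corrs))
```

The mean of these per-spot values is the quantity in the equation; the median is reported alongside it in the results tables.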
Method Overview
Results
The generation results are all stored in the folder ./scGPT/Generation_Result. The generated single-cell references are stored in files named adata_result_XXX.h5ad, and the Tangram mapping results are stored in files named scgpt_cell_reference_XXX.h5ad. To find out how our method performs under different circumstances, we created three tables.
Note: After scGPT generation, there are 10727 intersected genes between the spatial and single cell expression data with 188 marker genes.
Ratio Comparison
Note: We used the MOp 10Xv3 dataset. The default hyperparameters are $k=500$ and $m=52$; $n_{genes}$ is the overlap with marker genes.
Ratio | Median | Mean |
---|---|---|
0.5 | 0.689921 | 0.546518 |
0.6 | 0.687070 | 0.540321 |
0.8 | 0.599128 | 0.471109 |
Tangram | 0.215512 | 0.232485 |
K-M Comparison
Note: We used the MOp 10Xv3 dataset. The default hyperparameter is $\text{ratio}=0.8$; $n_{genes}$ is the overlap with marker genes.
Hyperparameters | Median | Mean |
---|---|---|
K_500_M_52 | 0.598805 | 0.474123 |
K_400_M_65 | 0.591193 | 0.470011 |
K_650_M_40 | 0.589522 | 0.467705 |
Tangram | 0.215512 | 0.232485 |
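For reference, the three settings in the table can be compared by their k/m ratio and by the number of generated references, k multiplied by m. The snippet below is plain arithmetic on the (k, m) values encoded in the row names; the dictionary layout is only illustrative.

```python
# (k, m) settings taken from the row names of the K-M comparison table.
configs = {
    "K_500_M_52": (500, 52),
    "K_400_M_65": (400, 65),
    "K_650_M_40": (650, 40),
}

summary = {}
for name, (k, m) in configs.items():
    summary[name] = {"k_over_m": k / m, "n_references": k * m}
    print(f"{name}: k/m = {k / m:.2f}, k*m = {k * m}")
# K_500_M_52: k/m = 9.62, k*m = 26000
# K_400_M_65: k/m = 6.15, k*m = 26000
# K_650_M_40: k/m = 16.25, k*m = 26000
```

All three settings generate 26000 references, and the setting with k/m closest to 10 has the highest median and mean in the table.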
Fine-tuning Comparison
Note: We used the MOp 10Xv3 dataset. The default hyperparameters are $\text{ratio}=0.8$, $k=500$, and $m=52$; $n_{genes}$ is the overlap with marker genes. The row "Single Cell" is where we fine-tuned the scGPT "whole-human" model on the MOp 10Xv3 dataset's true single-cell expression. The results are roughly the same as with the original (whole-human) model. This may be because the MOp 10Xv3 dataset does not originate from human samples, so fine-tuning on other datasets could still yield an improvement.
Fine-tuned | Median | Mean |
---|---|---|
Original | 0.599128 | 0.471109 |
Single Cell | 0.598805 | 0.474123 |
Tangram | 0.215512 | 0.232485 |
Method Comparison
To reproduce the Redeconve results, go to the folder ./scGPT/Redeconve and run formatting.py before running Redeconve.R.
For PDAC dataset plotting, see ./scGPT/Redeconve/PDAC.R.
To preprocess the PDAC dataset for scGPT generation, see ./scGPT/Redeconve/PDAC_Data_Formatting.R.
To see where the overlapping common genes and marker genes are stored, see ./scGPT/Tangram_scGPT_Comparsion.ipynb. Note that the overlapping genes must be filtered in cases such as selecting 1000 cell references from the full set of cell references: some genes may then have zero expression across all selected cells and need to be filtered out.
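The zero-expression filtering described above can be sketched as follows. This is a hedged illustration, not the notebook's code: the function overlapping_genes, the toy expression values, and the example gene lists are all assumptions made for the example.

```python
import numpy as np

def overlapping_genes(ref_expr, ref_genes, spatial_genes, marker_genes):
    """Drop genes that are zero in every reference cell, then intersect."""
    expressed = (ref_expr != 0).any(axis=0)  # True if any cell expresses the gene
    kept = [g for g, keep in zip(ref_genes, expressed) if keep]
    spatial_set, marker_set = set(spatial_genes), set(marker_genes)
    common = [g for g in kept if g in spatial_set]     # shared with spatial data
    markers = [g for g in common if g in marker_set]   # shared marker genes
    return common, markers

# Toy reference: 2 selected cells x 5 genes; two genes are zero in both cells.
ref_genes = ["Gad1", "Slc17a7", "Pvalb", "Sst", "Vip"]
ref_expr = np.array([[1, 0, 0, 2, 0],
                     [3, 0, 1, 0, 0]])
spatial_genes = ["Gad1", "Slc17a7", "Pvalb", "Vip", "Cux2"]
marker_genes = ["Gad1", "Slc17a7", "Sst"]

common, markers = overlapping_genes(ref_expr, ref_genes, spatial_genes, marker_genes)
print(common, markers)
```

In this toy case a marker gene that is zero across the selected cells drops out of the overlap, which is why the gene counts reported for the subsampled references are smaller than for the full data.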
Note: It is highly recommended to use R version >=4.2.0.
Seurat* -- Needs to be implemented
Method Comparison with Redeconve & Seurat (Human Breast)
For this dataset, we compared 5 different results. Because Redeconve's original result uses only 1000 cells as reference, we used our method to generate 1000 cell references for a like-for-like comparison. We also generated about as many cell references as there are cells in the original single-cell expression data, which is 100000 cell references. The same procedure is applied to dataset 3 (Human Lymph Node).
Note: We used the Human Breast dataset. The default hyperparameters are $\text{ratio}=0.8$, $k=1000$, and $m=100$; $n_{genes}$ is the overlap with marker genes. Some genes are zero across all 1000 selected cells (in the Redeconve method), so Tangram requires those zero-expression genes to be filtered out, leaving 165 overlapping marker genes and 14655 overlapping genes in total.
Method | Median | Mean |
---|---|---|
Ours (1000) | 0.738423 | 0.690996 |
Tangram (1000) | 0.497457 | 0.473931 |
Redeconve | 0.458208 | 0.453385 |
Seurat | \ | \ |
Ours (Full) | 0.929573 | 0.869638 |
Tangram (Full) | 0.540444 | 0.528668 |
Method Comparison with Redeconve & Seurat (Human Lymph Node)
Note: We used the Human Lymph Node dataset. The default hyperparameters are $\text{ratio}=0.8$, $k=730$, and $m=100$; $n_{genes}$ is the overlap with marker genes. After scGPT generation and filtering, there are 13515 genes shared between the spatial and single-cell expression data, with 114 overlapping marker genes.
Method | Median | Mean |
---|---|---|
Ours (1000) | 0.845477 | 0.768616 |
Tangram (1000) | 0.533852 | 0.541026 |
Redeconve | 0.549875 | 0.546411 |
Seurat | \ | \ |
Ours (Full) | 0.929626 | 0.904360 |
Tangram (Full) | 0.590144 | 0.601816 |
Method Comparison with Redeconve & Seurat (PDAC)
Note: We used the PDAC dataset. The default hyperparameters are $\text{ratio}=0.8$, $k=107$, and $m=18$; $n_{genes}$ is the overlap with marker genes. After scGPT generation and filtering, there are 11960 genes shared between the spatial and single-cell expression data, with 127 overlapping marker genes. Because Redeconve used all 1926 reference cells for dataset 4 (PDAC), we did not need to generate a 1000-cell reference using scGPT.
Note: For the plotting portion, see ./scGPT/Redeconve/PDAC.R.
Method | Median | Mean |
---|---|---|
Tangram | 0.234127 | 0.253742 |
Redeconve | 0.240583 | 0.232150 |
Seurat | \ | \ |
Ours (Full) | 0.678516 | 0.608690 |
Acknowledgement
- Tingyi Wanyan
- Peifeng Ruan
- Yang Xie
- Guanghua Xiao
- Qin Zhou
Citations
- Biancalani, T., Scalia, G., Buffoni, L. et al. Deep learning and alignment of spatially resolved single-cell transcriptomes with Tangram. Nat Methods 18, 1352–1362 (2021). https://doi.org/10.1038/s41592-021-01264-7
- Cui, H., Wang, C., Maan, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods 21, 1470–1480 (2024). https://doi.org/10.1038/s41592-024-02201-0
- Zhou, Z., Zhong, Y., Zhang, Z. et al. Spatial transcriptomics deconvolution at single-cell resolution using Redeconve. Nat Commun 14, 7930 (2023). https://doi.org/10.1038/s41467-023-43600-9