Jefferson Chen authored 8 months ago

scGPT-Tangram

Introduction

The integration of multi-omics data, particularly single-cell and spatial transcriptomics, has become crucial for understanding complex biological systems at a higher resolution. This project leverages the scGPT model, a foundation model for single-cell multi-omics data, in combination with Tangram, a deep learning-based framework for spatial transcriptomics alignment, to improve the accuracy and reliability of cell annotations and spatial mapping.

In this project, we use four datasets (MOp 10Xv3 Mouse Primary Motor Cortex, Human Breast, Human Lymph Nodes, and PDAC) to evaluate the effectiveness of our approach. These datasets include both single-cell and spatial transcriptomic data, along with a set of marker genes that is used consistently across all four datasets to ensure comparability.

The project also includes detailed tutorials for using Tangram and the scGPT model, providing users with step-by-step guidance on fine-tuning the scGPT model and generating single-cell references for spatial transcriptomics. The results of our method are systematically compared across different hyperparameters, including the ratio of masked values, the number of spots and cell references, and the effects of fine-tuning, offering insights into the performance improvements achieved through our approach.

Create Environment

First, create the conda environment:

conda create -n scgpt_tangram python==3.10.4
conda activate scgpt_tangram
pip install -r ./requirements.txt 
pip install -e ./tangram

Note: It is highly recommended to install the modified Tangram version included in this repository (./tangram). The mapping method is the same, but this version returns additional outputs and adds a threshold filter.

Datasets

Note: We used the same marker genes across all four datasets.

MOp 10Xv3 Mouse Primary Motor Cortex (from Tangram)
- Single Cell Data
  - https://storage.googleapis.com/tommaso-brain-data/tangram_demo/mop_sn_tutorial.h5ad.gz
  - 26431 cells × 27742 genes
- Spatial Data
  - https://storage.googleapis.com/tommaso-brain-data/tangram_demo/slideseq_MOp_1217.h5ad.gz
  - 9852 spots × 24518 genes
- Marker Genes
  - https://www.biorxiv.org/content/10.1101/2020.06.04.105700v1
  - 253 genes
Human Breast (from Redeconve)
- Single Cell Data
  - 100064 cells × 29733 genes
- Spatial Data
  - 2426 spots × 19237 genes
Human Lymph Nodes (from Redeconve)
- Single Cell Data
  - 73260 cells × 18079 genes
- Spatial Data
  - 4039 spots × 33538 genes
PDAC (from Redeconve)
- Single Cell Data
  - 1926 cells × 19736 genes
- Spatial Data
  - 428 spots × 19736 genes

Tangram Tutorials

New users who are not familiar with Tangram are encouraged to go through the files

./tangram/tutorial_tangram_with_squidpy.ipynb and

./tangram/tutorial_tangram_without_squidpy.ipynb.

Additional tutorials can be found on Tangram's GitHub page.

Fine-tuning scGPT Model

Fine-tuning the scGPT model requires single-cell expression data in AnnData format, with gene_names and str_batch labeled correctly. Note that the assignment of str_batch does not influence the final generation results. Although the base scGPT "whole-human" model already performs well, a properly fine-tuned model may still slightly improve the results. For more details, please run the file ./scGPT/scGPT_Model_Finetuning.ipynb.

Single Cell Reference Generation

To generate the single-cell reference, run ./scGPT/scGPT_Cell_Reference_Generation.ipynb, which walks through the details. The file scGPT_Cell_Reference_Generation.py contains the same content.

The file scGPT_Cell_Reference_Generation_Version2.py uses a slightly different approach: it generates each spot's cell references separately instead of processing all spots together. This approach uses less GPU memory but takes longer to finish. The results of both approaches are the same.
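
The trade-off between the two scripts can be illustrated with a small sketch. This is not the repository's code: `generate` is a hypothetical stand-in for the model call, and the equality shown here only holds under the assumption that the computation for one spot does not depend on the other spots in the batch.

```python
import numpy as np

def generate(batch):
    """Hypothetical stand-in for the model call: any row-wise function."""
    return np.log1p(batch)

def generate_all_at_once(spots):
    # One large batch: fewer calls, higher peak memory.
    return generate(spots)

def generate_per_spot(spots):
    # One spot at a time: lower peak memory, more calls.
    return np.vstack([generate(spot[None, :]) for spot in spots])

# Toy spatial data: 8 spots x 5 genes.
spots = np.random.default_rng(0).poisson(3.0, size=(8, 5)).astype(float)
same = np.allclose(generate_all_at_once(spots), generate_per_spot(spots))
print(same)
```

For a row-wise computation the two strategies differ only in memory use and run time, not in the output.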

The generation happens within the function process_spots. It takes the spatial data and the hyperparameters k, m, and ratio as input.

k: The number of spots to randomly sample from the entire spatial data.
m: The number of cell references to generate for each selected spot.
ratio: The fraction of values to mask.

The relationship between k and m remains unclear. Generally, the method performs better when $\frac{k}{m}$ is near 10. A potential future improvement is to describe the relationship between k and m with an explicit mathematical function. In addition, the higher the mask ratio, the lower the accuracy of the generated single-cell sequences.
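
As a quick arithmetic check of this guideline, the sketch below computes k/m and the total number of generated cell references k * m for the (k, m) settings listed in the result tables of this README. The helper `summarize` and the labels are names introduced here for illustration only.

```python
# (k, m) pairs taken from the result tables in this README.
configs = {
    "MOp K_500_M_52": (500, 52),
    "MOp K_400_M_65": (400, 65),
    "MOp K_650_M_40": (650, 40),
    "Human Breast": (1000, 100),
    "Human Lymph Node": (730, 100),
    "PDAC": (107, 18),
}

def summarize(k: int, m: int) -> tuple[float, int]:
    """Return (k / m, total number of generated cell references k * m)."""
    return k / m, k * m

for name, (k, m) in configs.items():
    km_ratio, n_refs = summarize(k, m)
    print(f"{name:18s} k/m = {km_ratio:5.2f}   k*m = {n_refs}")
```

In each setting, k * m is close to the number of cells in the corresponding single-cell dataset (for example, 107 * 18 = 1926 for PDAC).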

The output single-cell reference $S$ has dimension $(k \cdot m) \times n_{genes}$.
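
The roles of k, m, and ratio, and the resulting output shape, can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's process_spots: the helper name `sample_and_mask`, sampling without replacement, independent per-entry masking, and the placeholder value -1 are all assumptions, and the scGPT generation step itself is not performed.

```python
import numpy as np

def sample_and_mask(spatial, k, m, ratio, seed=0):
    """Hypothetical helper: pick k spots, make m masked copies of each.

    spatial: (n_spots, n_genes) expression array.
    Returns (masked, mask), both of shape (k * m, n_genes); mask is True
    where a value was hidden. No model generation happens here.
    """
    rng = np.random.default_rng(seed)
    n_spots, _ = spatial.shape
    spot_idx = rng.choice(n_spots, size=k, replace=False)  # k random spots (assumed without replacement)
    repeated = np.repeat(spatial[spot_idx], m, axis=0)     # m copies of each selected spot
    mask = rng.random(repeated.shape) < ratio              # hide roughly `ratio` of the entries
    masked = np.where(mask, -1.0, repeated)                # -1 is an assumed "masked" placeholder
    return masked, mask

# Toy spatial data: 50 spots x 20 genes.
spatial = np.random.default_rng(1).poisson(2.0, size=(50, 20)).astype(float)
masked, mask = sample_and_mask(spatial, k=10, m=3, ratio=0.8)
print(masked.shape)  # (30, 20), i.e. (k * m, n_genes)
```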

Note: The input spatial data also requires preprocessing similar to that used when fine-tuning the scGPT model.

Metrics Evaluation

For metrics evaluation, we measure the correlation between each predicted spot and the actual spot, which can be represented by the following equation:

$$\frac{1}{n_{spot}}\sum_{i=1}^{n_{spot}} \mathrm{corr}(A_i, G_i)$$

All comparisons and metric evaluations are stored in the file ./scGPT/Tangram_scGPT_Comparsion.ipynb.

Note:

- $A$ is the predicted spatial expression matrix with shape $n_{spot} \times n_{genes}$. It is defined as $M^T S$.
- $M$ is the mapping matrix produced by Tangram, with dimension $n_{cell} \times n_{spot}$.
- $S$ is the generated single-cell reference from scGPT, with dimension $n_{cell} \times n_{genes}$. There are two options for selecting genes. The first uses the overlap with the marker genes defined in Tangram. The second uses all genes shared between the spatial data and the generated single-cell data. Generally, the first option results in a higher correlation, so in the evaluation we calculated the correlation using the overlapping marker genes.
- $G$ is the actual spatial expression matrix with dimension $n_{spot} \times n_{genes}$. Here $n_{genes}$ is determined by the dimension of $S$.
- $\mathrm{corr}$ is the Pearson correlation, implemented through the function numpy.corrcoef.
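
The metric can be sketched in a few lines of NumPy. This is an illustrative sketch on random toy matrices, not the notebook's code; the function name `spot_correlations` is introduced here, and it assumes that M, S, and G are dense arrays restricted to the same gene set.

```python
import numpy as np

def spot_correlations(M, S, G):
    """Per-spot Pearson correlation between A = M^T S and G.

    M: (n_cell, n_spot) mapping matrix.
    S: (n_cell, n_genes) single-cell reference.
    G: (n_spot, n_genes) measured spatial expression.
    """
    A = M.T @ S  # predicted spatial expression, shape (n_spot, n_genes)
    return np.array([np.corrcoef(a, g)[0, 1] for a, g in zip(A, G)])

rng = np.random.default_rng(0)
n_cell, n_spot, n_genes = 40, 6, 25
M = rng.random((n_cell, n_spot))
S = rng.random((n_cell, n_genes))
# Toy "measured" data: the prediction plus a little noise.
G = M.T @ S + 0.1 * rng.standard_normal((n_spot, n_genes))

corrs = spot_correlations(M, S, G)
print("mean:", corrs.mean(), "median:", np.median(corrs))
```

The mean of `corrs` corresponds to the equation above; the result tables in this README report both the median and the mean of the per-spot correlations.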

Method Overview

Results

The generation results are all stored in the folder ./scGPT/Generation_Result. The generated single-cell references are stored in files named adata_result_XXX.h5ad, and the Tangram mapping results are stored in files named scgpt_cell_reference_XXX.h5ad. To find out how our method performs under different settings, we created three tables.

Note: After scGPT generation, there are 10727 intersected genes between the spatial and single-cell expression data, including 188 marker genes.

Ratio Comparison

Note: We used the MOp 10Xv3 dataset. The default hyperparameters are $k=500$ and $m=52$; $n_{genes}$ is the overlap with marker genes.

Ratio	Median	Mean
0.5	0.689921	0.546518
0.6	0.687070	0.540321
0.8	0.599128	0.471109
Tangram	0.215512	0.232485

K-M Comparison

Note: We used the MOp 10Xv3 dataset. The default hyperparameter is $\text{ratio}=0.8$; $n_{genes}$ is the overlap with marker genes.

Hyperparameters	Median	Mean
K_500_M_52	0.598805	0.474123
K_400_M_65	0.591193	0.470011
K_650_M_40	0.589522	0.467705
Tangram	0.215512	0.232485

Fine-tuning Comparison

Note: We used the MOp 10Xv3 dataset. The default hyperparameters are $\text{ratio}=0.8$, $k=500$, and $m=52$; $n_{genes}$ is the overlap with marker genes. The row "Single Cell" shows the results after fine-tuning the scGPT "whole-human" model on the MOp 10Xv3 dataset's true single-cell expression. The results are roughly the same as with the original (whole-human) model. This may be because the MOp 10Xv3 dataset comes from mouse rather than human, so fine-tuning on other datasets may still yield an improvement.

Fine-tuned	Median	Mean
Original	0.599128	0.471109
Single Cell	0.598805	0.474123
Tangram	0.215512	0.232485

Method Comparison

To produce the Redeconve results, go to the folder ./scGPT/Redeconve and run formatting.py before running Redeconve.R.

For plotting the PDAC dataset, see the file ./scGPT/Redeconve/PDAC.R

To preprocess the PDAC dataset for scGPT generation, see the file ./scGPT/Redeconve/PDAC_Data_Formatting.R

To see where the overlapping common genes and marker genes are stored, see the file ./scGPT/Tangram_scGPT_Comparsion.ipynb. Note that the overlapping genes need to be filtered in cases such as selecting 1000 cell references from the full set of cell references: some genes may have zero expression across the selected cells and need to be filtered out.
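
The zero-expression filtering described above can be sketched as follows. This is a minimal illustration, not the notebook's code: the helper `usable_genes` is hypothetical, and the gene names and tiny matrix are toy values chosen for the example.

```python
import numpy as np

def usable_genes(expr, gene_names, spatial_genes, marker_genes):
    """Hypothetical helper: drop all-zero genes, then intersect gene lists.

    expr: (n_cells, n_genes) expression of the selected cell references.
    Returns (overlap_genes, overlap_marker_genes) as sorted lists.
    """
    expressed_mask = (np.asarray(expr) != 0).any(axis=0)  # gene is non-zero in at least one cell
    expressed = {g for g, keep in zip(gene_names, expressed_mask) if keep}
    overlap = expressed & set(spatial_genes)              # genes shared with the spatial data
    return sorted(overlap), sorted(overlap & set(marker_genes))

# Toy example: 2 selected cells x 4 genes; "Sst" and "Vip" are zero everywhere.
expr = np.array([[1, 0, 3, 0],
                 [2, 0, 0, 0]])
genes = ["Gad1", "Sst", "Pvalb", "Vip"]
overlap, markers = usable_genes(
    expr, genes,
    spatial_genes=["Gad1", "Sst", "Pvalb"],
    marker_genes=["Sst", "Pvalb"],
)
print(overlap, markers)  # ['Gad1', 'Pvalb'] ['Pvalb']
```

Here "Sst" is a marker gene present in the spatial data, but it is dropped because it is zero in every selected cell, which is why the overlapping marker-gene count can shrink after subsampling.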

Note: It is highly recommended to use R version >=4.2.0.

Seurat* -- Needs to be implemented

Method Comparison with Redeconve & Seurat (Human Breast)

In this dataset, we compared 5 different results. Because Redeconve's original result only selected 1000 cells as reference, we used our method to generate 1000 cell references for a like-for-like comparison. We also generated approximately the same number of cell references as there are cells in the original single-cell expression data, which is 100000 cell references. The same procedure is applied to dataset 3 (Human Lymph Node).

Note: We used the Human Breast dataset. The default hyperparameters are $\text{ratio}=0.8$, $k=1000$, and $m=100$; $n_{genes}$ is the overlap with marker genes. Note that some genes are zero across all 1000 selected cells (in the Redeconve method). Tangram therefore requires those zero-expression genes to be filtered out, leaving 165 overlapping marker genes and 14655 overlapping genes in total.

Method	Median	Mean
Ours (1000)	0.738423	0.690996
Tangram (1000)	0.497457	0.473931
Redeconve	0.458208	0.453385
Seurat	\	\
Ours (Full)	0.929573	0.869638
Tangram (Full)	0.540444	0.528668

Method Comparison with Redeconve & Seurat (Human Lymph Node)

Note: We used the Human Lymph Node dataset. The default hyperparameters are $\text{ratio}=0.8$, $k=730$, and $m=100$; $n_{genes}$ is the overlap with marker genes. After scGPT generation and filtering, there are 13515 intersected genes between the spatial and single-cell expression data, with 114 overlapping marker genes.

Method	Median	Mean
Ours (1000)	0.845477	0.768616
Tangram (1000)	0.533852	0.541026
Redeconve	0.549875	0.546411
Seurat	\	\
Ours (Full)	0.929626	0.904360
Tangram (Full)	0.590144	0.601816

Method Comparison with Redeconve & Seurat (PDAC)

Note: We used the PDAC dataset. The default hyperparameters are $\text{ratio}=0.8$, $k=107$, and $m=18$; $n_{genes}$ is the overlap with marker genes. After scGPT generation and filtering, there are 11960 intersected genes between the spatial and single-cell expression data, with 127 overlapping marker genes. Because Redeconve used the entire reference of 1926 cells for dataset 4 (PDAC), we did not have to generate a 1000-cell reference using scGPT.

Note: For the plotting portion, see the file ./scGPT/Redeconve/PDAC.R

Method	Median	Mean
Tangram	0.234127	0.253742
Redeconve	0.240583	0.232150
Seurat	\	\
Ours (Full)	0.678516	0.608690

Acknowledgement

Tingyi Wanyan
Peifeng Ruan
Yang Xie
Guanghua Xiao
Qin Zhou

Citations

Biancalani, T., Scalia, G., Buffoni, L. et al. Deep learning and alignment of spatially resolved single-cell transcriptomes with Tangram. Nat Methods 18, 1352–1362 (2021). https://doi.org/10.1038/s41592-021-01264-7
Cui, H., Wang, C., Maan, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods 21, 1470–1480 (2024). https://doi.org/10.1038/s41592-024-02201-0
Zhou, Z., Zhong, Y., Zhang, Z. et al. Spatial transcriptomics deconvolution at single-cell resolution using Redeconve. Nat Commun 14, 7930 (2023). https://doi.org/10.1038/s41467-023-43600-9