Merge branch 'main' of git.biohpc.swmed.edu:s233530/tangram-scgpt into main

dfd81224 · Jefferson Chen · bbc5f72a · 09272c4e · dfd81224
Commit dfd81224 authored 9 months ago by Jefferson Chen
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@

 The integration of multi-omics data, particularly single-cell and spatial transcriptomics, has become crucial for understanding complex biological systems at a higher resolution. This project leverages the scGPT model, a foundation model for single-cell multi-omics data, in combination with Tangram, a deep learning-based framework for spatial transcriptomics alignment, to improve the accuracy and reliability of cell annotations and spatial mapping.

-In this project, we utilize three datasets: MOp 10Xv3 Mouse Primary Motor Cortex, Human Breast, and Human Lymph Nodes, to evaluate the effectiveness of our approach. These datasets include both single-cell and spatial transcriptomic data, along with a set of marker genes that are consistently used across all three datasets to ensure comparability.
+In this project, we utilize four datasets: MOp 10Xv3 Mouse Primary Motor Cortex, Human Breast, Human Lymph Nodes, and PDAC, to evaluate the effectiveness of our approach. These datasets include both single-cell and spatial transcriptomic data, along with a set of marker genes that are consistently used across all four datasets to ensure comparability.

 The project also includes detailed tutorials for using Tangram and the scGPT model, providing users with step-by-step guidance on fine-tuning the scGPT model and generating single-cell references for spatial transcriptomics. The results of our method are systematically compared across different hyperparameters, including the ratio of masked values, the number of spots and cell references, and the effects of fine-tuning, offering insights into the performance improvements achieved through our approach.

@@ -25,7 +25,7 @@ pip install -e ./tangram

 ## Datasets

-**Note**: We used the same Marker Genes accross all three datasets.
+**Note**: We used the same Marker Genes across all four datasets.

 - **MOp 10Xv3 Mouse Primary Motor Cortex** (from [Tangram](https://www.nature.com/articles/s41592-021-01264-7#data-availability)) 
    - Single Cell Data
@@ -47,7 +47,11 @@ pip install -e ./tangram
        - 73260 cells × 18079 genes
    - Spatial Data
        - 4039 spots × 33538 genes
-
+- **PDAC** (from [Redeconve](https://www.nature.com/articles/s41592-021-01264-7#data-availability))
+    - Single Cell Data
+        - 1926 cells × 19736 genes
+    - Spatial Data
+        - 428 spots × 19736 genes

 ## Tangram Tutorials

@@ -67,6 +71,8 @@ The input of scGPT model fine-tuning would requires a single-cell expression dxa

 To generate the single cell reference, please run **./scGPT/scGPT_Cell_Reference_Generation.ipynb** for more details. The file **scGPT_Cell_Reference_Generation.py** also contains the same content. 

+The file **scGPT_Cell_Reference_Generation_Version2.py** uses a slightly different approach: it generates each spot's cell reference separately instead of combining them into one. This approach uses less GPU memory but takes longer to finish. Both approaches produce the same results.
+
 The generation happened within the function **process_spots**. It requires an input of spatial data, hyperparameters **k**, **m**, and **ratio**. 

 - **k**: The number of spots to randomly sample from the entire spatial data.
@@ -93,7 +99,7 @@ $$\frac{1}{n_{spot}}\sum_{i=1}^{n_{spot}} corr(A_i, G_i)$$

 - $M$ is the mapping matrix using tangram, with dimension $n_{cell} \times n_{spot}$

-- $S$ is the gerated single cell reference from scGPT with dimension $n_{cell} \times n_{genes}$. The selection of genes have two options, the first case chosing its overlap with the marker genes, defined in [tangram](https://www.nature.com/articles/s41592-021-01264-7#Sec12). The second case is using the entire overlap genes between spatial data and generated single cell. Generally, case one would result in a higher correlation.   
+- $S$ is the generated single-cell reference from scGPT, with dimension $n_{cell} \times n_{genes}$. There are two options for selecting genes: the first uses the overlap with the marker genes defined in [tangram](https://www.nature.com/articles/s41592-021-01264-7#Sec12); the second uses all genes shared between the spatial data and the generated single-cell reference. The first option generally yields a higher correlation, so in the evaluation we calculate the correlation using the overlapping marker genes.

 - $G$ is the actual spatial expression matrix with dimension $n_{spot} \times n_{genes}$. The $n_{genes}$ is determined by the dimension of $S$. 

@@ -107,6 +113,8 @@ $$\frac{1}{n_{spot}}\sum_{i=1}^{n_{spot}} corr(A_i, G_i)$$

 The generation results are all stored within the folder **./scGPT/Generation_Result**. The generated single cell reference is stored in files with name **adata_result_XXX.h5ad**, the tangram mapping results is stored in files with name **scgpt_cell_reference_XXX.h5ad**. To findout the performance of our new method under different circumstances, we created three tables. 

+**Note**: After scGPT generation, the spatial and single-cell expression data share **10727** genes, including **188** marker genes.
+
 ### Ratio Comparsion

 **Note**: We used MOp 10Xv3 dataset. The default hyperparameters is $k=500$, $m=52$, $n_{genes}$ is the overlap with marker genes. 
@@ -145,32 +153,53 @@ The generation results are all stored within the folder **./scGPT/Generation_Res

 To run the result for Redeconve, go to folder **./scGPT/Redeconve**. Run **formatting.py** before running **Redeconve.R**. 

+To plot the PDAC dataset, see the file **./scGPT/Redeconve/PDAC.R**.
+
 **Note**: It is highly recommended to use R version >=4.2.0. 

 Seurat* -- Needs to be implemented

 ### Method Comparsion with Redeconve & Seurat (Human Breast)

-**Note**: We used Human Breast dataset. The default hyperparameters is $ratio=0.8$, $k=1000$, $m=100$, $n_{genes}$ is the overlap with marker genes. 
+**Note**: We used the Human Breast dataset. The default hyperparameters are $ratio=0.8$, $k=1000$, $m=100$, and $n_{genes}$ is the overlap with marker genes. Note that some genes have zero expression across all 1000 selected cells (in the Redeconve method). Tangram requires these zero-expression genes to be filtered out, which leaves **165** overlapping marker genes and **14655** overlapping genes in total.
+

-| Methods          | Median  | Mean    |
-|------------|---------|---------|
-| Ours       | 0.928724 | 0.869383 |
-| Redeconve | \ | \ |
-| Seurat | \ | \ |
-| Tangram    | 0.430505 | 0.437279 |
+
+| Method            | Median   | Mean     |
+|-------------------|----------|----------|
+| Ours (1000)       | 0.738423 | 0.690996 |
+| Tangram (1000)    | 0.497457 | 0.473931 |
+| Redeconve         | 0.458208 | 0.453385 |
+| Seurat            | \        | \        |
+| Ours (Full)       | 0.929573 | 0.869638 |
+| Tangram (Full)    | 0.540444 | 0.528668 |


 ### Method Comparsion with Redeconve & Seurat (Human Lymph Node)

-**Note**: We used Human Breast dataset. The default hyperparameters is $ratio=0.8$, $k=730$, $m=100$, $n_{genes}$ is the overlap with marker genes. 
+**Note**: We used the Human Lymph Node dataset. The default hyperparameters are $ratio=0.8$, $k=730$, $m=100$, and $n_{genes}$ is the overlap with marker genes. After scGPT generation and filtering, the spatial and single-cell expression data share **13515** genes, including **114** overlapping marker genes.

-| Methods          | Median  | Mean    |
-|------------|---------|---------|
-| Ours       | 0.931290 | 0.906363 |
-| Redeconve | \ | \ |
-| Seurat | \ | \ |
-| Tangram    | 0.557360 | 0.569141 |
+| Method            | Median   | Mean     |
+|-------------------|----------|----------|
+| Ours (1000)       | 0.845477 | 0.768616 |
+| Tangram (1000)    | 0.533852 | 0.541026 |
+| Redeconve         | 0.549875 | 0.546411 |
+| Seurat            | \        | \        |
+| Ours (Full)       | 0.929626 | 0.904360 |
+| Tangram (Full)    | 0.590144 | 0.601816 |
+
+
+### Method Comparsion with Redeconve & Seurat (PDAC)
+
+**Note**: We used the PDAC dataset. The default hyperparameters are $ratio=0.8$, $k=107$, $m=18$, and $n_{genes}$ is the overlap with marker genes. After scGPT generation and filtering, the spatial and single-cell expression data share **11960** genes, including **127** overlapping marker genes.
+
+
+| Method      | Median   | Mean     |
+|-------------|----------|----------|
+| Tangram     | 0.242775 | 0.254046 |
+| Redeconve   | 0.239750 | 0.231362 |
+| Seurat      | \        | \        |
+| Ours (Full) | 0.692099 | 0.614245 |


 ## Acknowledgement