Skip to content
Snippets Groups Projects
John Lafin's avatar
cc5511b8

scRNA-QC Astrocyte workflow

A general purpose Astrocyte workflow to run quality control on single cell RNA sequencing data.

Introduction

This Astrocyte workflow accepts at least one h5 file containing human or mouse single cell transcriptomic data, a metadata CSV-formatted file outlining sample names and input files, and an optional cutoffs CSV-formatted file for user-supplied quality control cutoffs. The input h5 file is assumed to have already been filtered to remove empty droplets. Tools such as Cell Ranger or CellBender can detect and remove empty droplets.

This workflow will perform the following steps:

  • Doublet detection and marking with scDblFinder
  • Quality control metric calculation
    • Reads per cell
    • Genes per cell
    • Percent mitochondrial genes
    • Stress score
  • Quality control filtering
  • Pre-processing: log-normalization, feature selection, scaling
  • Dimensional reduction: PCA, UMAP
  • Clustering
  • Differential expression analysis per cluster

If multiple samples are present, merging with optional batch correction will also be applied.

The workflow will output several diagnostic plots to allow users to rerun the workflow with adjusted parameters if desired. It will also output filtered and pre-processed Seurat objects for downstream use, including the CZ CELLxGENE workflow. If multiple samples were submitted, the workflow will output an RDS containing a list of objects and a second RDS file containing the merged data.

Parameters

  • h5_files: A list of filtered h5 files containing scRNA data. These can be generated using the Cell Ranger Count or CellBender Astrocyte workflows.

  • species: The species of the reference genome used for count matrix generation. Options are "human", "mouse", or "barnyard" (human and mouse combined). This influences stress score calculation and mitochondrial filtering parameters.

  • batch: If batch correction is desired, the name of the batch variable in the metadata file. If only one sample is present or batch correction is not desired, leave this field blank.

  • metadata: A CSV formatted file containing the samples names in the first column (one per row), the associated h5 file in the second column, followed by any optional number of columns. Any column after these will be added as sample-level metadata to the final object. You can download an example file here.

  • cutoffs: An optional CSV formatted file. If present, the workflow will use any user-provided cutoffs in place of automated ones. Blank entries correspond to default cutoffs. You can download an example here. This file must contain the following fields:

    • Sample: the name of the sample
    • nCount_threshold_lo: lower limit for reads per cell
    • nCount_threshold_hi: upper limit for reads per cell
    • nFeature_threshold_lo: lower limit for genes per cell
    • nFeature_threshold_hi: upper limit for genes per cell
    • percent_mito.threshold: upper limit for percent mitochondrial gene expression

Output

This workflow outputs the following:

  • plots directory: A directory with two sub-directories:
    • pre-filter: quality control plots prior to filtering with cutoffs displayed
    • post-filter: quality control plots after filtering has been applied
    • preprocessing: plots summarizing the preprocesing steps.
      • umaps: a directory containing UMAP plots labeled by 10 different clustering resolutions, sample names, and any other sample-level metadata supplied.
      • elbow.png: an elbow plot with a line marking the dimensionality selected.
      • pca.png: a PCA plot colored by Sample. Examine this to look for evidence of batch effects.
      • harmony.png: if multiple samples were submitted, this is a batch corrected dimensional reduction plot. Compare this with pca.png to evaluate whether batch correction was effective.
      • clustree.png: a clustree plot to examine cluster stability over several resolutions. Use this to help decide which clustering resolution is appropriate for your data.
      • doublet_score.png, stress_score.png: embedding plots showing doublet score across UMAP space. Use these to look for regional enrichment of these scores, which might signal a need to remove those clusters.
  • data directory: A directory with the following files:
    • pre-filtered_list.rds: an RDS file containing a list of Seurat objects (one per sample) prior to filtering
    • post-filtered_list.rds: As above, but after filtering
    • processed_object.rds: an RDS file containing a single Seurat object that has undergone processing. This file is directly compatible with the CZ CELLxGENE Astrocyte workflow.
    • barcode_metrics_pre-filter.csv: a CSV formatted file with quality control metrics (one row per cell/barcoded) prior to filtering
    • barcode_metrics_post-filter.csv: as above, but contains only barcodes that passed filtering
    • cutoffs_used.csv: a CSV formatted file listing the cutoffs used per metric per sample.
    • doublets_table.csv: a CSV formatted file listing summary statistics of the doublet detection procedure.
    • sessionInfo.txt: a file describing the analysis environment.
  • markers directory: A directory containing CSV files for cluster markers at each resolution.

Using the output

This workflow outputs several plots and types of data. Here are some suggestions on how to interpret and use them.

Assessing filtering

The default filtering applied here is intentionally conservative- we prefer to keep more than we remove. To assess how effective the filtering step was, compare the pre- and post-filter plots. Assess the distributions of nCount_RNA (total counts per cell) and nFeature_RNA (unique genes per cell) for an enrichment at the high and low ends. The filtering step should remove these. If the cutoff lines are beyond the bounds of the data, that's OK.

The trend plot is a good summary of several critical QC metrics. Pre-filtering, an scRNA sample might have a cluster of cells at the lower left end of the plot with high percent mitochondrial gene expression. Because snRNA data should lack mitochondrial gene expression, for these samples you may see a similar cluster without high percent mitocondrial gene expression. After filtering, this population should mostly disappear, with most of the cells lining up with the linear regression line.

Assessing pre-processing

If you have multiple samples, you can assess the effect of batch correction by comparing the PCA plot to the Harmony plot. Look for evidence of a batch effect in the PCA plot (visible clusters enriched for particular samples or batch variables). This effect should be reduced in the Harmony plot (eg, clusters from different samples should shift together). If you don't see a batch effect in the PCA plot, you may consider running the workflow again without batch correction. You can also look at the UMAP plot grouped by sample or batch variable to assess how well batch correction worked.

Next, look at the elbow plot. The vertical line indicates the dimensionality selected. This selection should be early in the 'plateau' to attempt to retain as much variation, without keeping noise. A value between 10-50 is usually sufficient.

Clustering is run at 10 different resolutions by default (higher resolution = more clusters). Selecting the proper resolution for your data can be challenging. The clustree plot will show you cluster definitions at each resolution, and how they change over the range. Use this plot to search for a resolution where you see some stability in cluster assignment. From this starting point, you can look at UMAP plots and marker lists to assess whether this resolution appears to capture the biological populations you expect to find.

Finally, examine the doublet score plot. Any cluster or subcluster that shows an enrichment of doublet scores may be made up of doublets. Data from these clusters should be viewed with skepticism, or the clusters may be removed from analysis entirely. For scRNA data, the same can be done with the stress score plot- clusters enriched with cells with high stress score may be a sign of transcriptional changes associated with enzymatic digestion.

Next steps

The data generated by this workflow is directly compatible with the CZ CELLxGENE Astrocyte workflow. Provide the processed_object.rds file as input to this workflow for an interactive exploration of your data. Please see that workflow's documentation for more information.

Notes

Note that although this workflow calculates stress and doublet scores, it does not filter the data based on these scores. Users are recommended to examine the distribution of these scores in clusters as well as cluster markers to aid in determination whether a given cluster may be the result of a processing artifact.