Astrocyte

Workflows

Strand Lab

scRNA-QC

Repository



scRNA-QC Astrocyte workflow
A general purpose Astrocyte workflow to run quality control on single cell RNA sequencing data.

Introduction
This Astrocyte workflow accepts at least one h5 file containing human or mouse
single cell transcriptomic data, a metadata CSV-formatted file outlining
sample names and input files, and an optional cutoffs CSV-formatted file for
user-supplied quality control cutoffs. The input h5 file is assumed to have
already been filtered to remove empty droplets. Tools such as Cell Ranger or
CellBender can detect and remove empty droplets.
This workflow will perform the following steps:

Doublet detection and marking with scDblFinder
Quality control metric calculation

Reads per cell
Genes per cell
Percent mitochondrial genes
Stress score


Quality control filtering
Pre-processing: log-normalization, feature selection, scaling
Dimensional reduction: PCA, UMAP
Clustering
Differential expression analysis per cluster

If multiple samples are present, merging with optional batch correction will
also be applied.
The workflow will output several diagnostic plots to allow users to rerun
the workflow with adjusted parameters if desired. It will also output
filtered and pre-processed Seurat objects for downstream use, including the
CZ CELLxGENE workflow. If multiple samples were submitted, the workflow will
output an RDS containing a list of objects and a second RDS file containing the
merged data.

Parameters


h5_files: A list of filtered h5 files containing scRNA data. These can be
generated using the Cell Ranger Count or CellBender Astrocyte workflows.


species: The species of the reference genome used for count matrix generation.
Options are "human", "mouse", or "barnyard" (human and mouse combined). This
influences stress score calculation and mitochondrial filtering parameters.


batch: If batch correction is desired, the name of the batch variable in
the metadata file. If only one sample is present or batch correction is not
desired, leave this field blank.


metadata: A CSV formatted file containing the samples names in the first
column (one per row), the associated h5 file in the second column,
followed by any optional number of columns. Any column after these will be
added as sample-level metadata to the final object. You can download an example
file here.


cutoffs: An optional CSV formatted file. If present, the workflow will use
any user-provided cutoffs in place of automated ones. Blank entries correspond to default cutoffs.
You can download an example here.
This file must contain the following fields:

Sample: the name of the sample
nCount_threshold_lo: lower limit for reads per cell
nCount_threshold_hi: upper limit for reads per cell
nFeature_threshold_lo: lower limit for genes per cell
nFeature_threshold_hi: upper limit for genes per cell
percent_mito.threshold: upper limit for percent mitochondrial gene expression


Output
This workflow outputs the following:

plots directory: A directory with two sub-directories:

pre-filter: quality control plots prior to filtering with cutoffs displayed
post-filter: quality control plots after filtering has been applied
preprocessing: plots summarizing the preprocesing steps.

umaps: a directory containing UMAP plots labeled by 10 different
clustering resolutions, sample names, and any other sample-level metadata
supplied.
elbow.png: an elbow plot with a line marking the dimensionality selected.
pca.png: a PCA plot colored by Sample. Examine this to look for
evidence of batch effects.
harmony.png: if multiple samples were submitted, this is a batch
corrected dimensional reduction plot. Compare this with pca.png to
evaluate whether batch correction was effective.
clustree.png: a clustree plot to examine cluster stability over
several resolutions. Use this to help decide which clustering
resolution is appropriate for your data.
doublet_score.png, stress_score.png: embedding plots showing doublet
score across UMAP space. Use these to look for regional enrichment of
these scores, which might signal a need to remove those clusters.


data directory: A directory with the following files:

pre-filtered_list.rds: an RDS file containing a list of Seurat objects
(one per sample) prior to filtering
post-filtered_list.rds: As above, but after filtering
processed_object.rds: an RDS file containing a single Seurat object that
has undergone processing. This file is directly compatible with the
CZ CELLxGENE Astrocyte workflow.
barcode_metrics_pre-filter.csv: a CSV formatted file with quality control
metrics (one row per cell/barcoded) prior to filtering
barcode_metrics_post-filter.csv: as above, but contains only barcodes
that passed filtering
cutoffs_used.csv: a CSV formatted file listing the cutoffs used per
metric per sample.
doublets_table.csv: a CSV formatted file listing summary statistics of
the doublet detection procedure.
sessionInfo.txt: a file describing the analysis environment.


markers directory: A directory containing CSV files for cluster markers at
each resolution.


Using the output
This workflow outputs several plots and types of data. Here are some suggestions
on how to interpret and use them.

Assessing filtering
The default filtering applied here is intentionally conservative- we prefer to keep
more than we remove. To assess how effective the filtering step was, compare the
pre- and post-filter plots. Assess the distributions of nCount_RNA (total counts
per cell) and nFeature_RNA (unique genes per cell) for an enrichment at the high
and low ends. The filtering step should remove these. If the cutoff lines are
beyond the bounds of the data, that's OK.
The trend plot is a good summary of several critical QC metrics. Pre-filtering,
an scRNA sample might have a cluster of cells at the lower left end of the
plot with high percent mitochondrial gene expression. Because snRNA data should
lack mitochondrial gene expression, for these samples you may see a similar
cluster without high percent mitocondrial gene expression. After filtering,
this population should mostly disappear, with most of the cells lining up with
the linear regression line.

Assessing pre-processing
If you have multiple samples, you can assess the effect of batch correction by
comparing the PCA plot to the Harmony plot. Look for evidence of a batch effect
in the PCA plot (visible clusters enriched for particular samples or batch
variables). This effect should be reduced in the Harmony plot (eg, clusters from
different samples should shift together). If you don't see a batch effect in the
PCA plot, you may consider running the workflow again without batch correction.
You can also look at the UMAP plot grouped by sample or batch variable to assess
how well batch correction worked.
Next, look at the elbow plot. The vertical line indicates the dimensionality
selected. This selection should be early in the 'plateau' to attempt to retain
as much variation, without keeping noise. A value between 10-50 is usually
sufficient.
Clustering is run at 10 different resolutions by default (higher resolution =
more clusters). Selecting the proper resolution for your data can be challenging.
The clustree plot will show you cluster definitions at each resolution, and
how they change over the range. Use this plot to search for a resolution where
you see some stability in cluster assignment. From this starting point, you
can look at UMAP plots and marker lists to assess whether this resolution
appears to capture the biological populations you expect to find.
Finally, examine the doublet score plot. Any cluster or subcluster that shows
an enrichment of doublet scores may be made up of doublets. Data from these
clusters should be viewed with skepticism, or the clusters may be removed
from analysis entirely. For scRNA data, the same can be done with the stress
score plot- clusters enriched with cells with high stress score may be a sign
of transcriptional changes associated with enzymatic digestion.

Next steps
The data generated by this workflow is directly compatible with the CZ CELLxGENE
Astrocyte workflow. Provide the processed_object.rds file as input to this
workflow for an interactive exploration of your data. Please see that workflow's
documentation for more information.

Notes
Note that although this workflow calculates stress and doublet scores, it does
not filter the data based on these scores. Users are recommended to examine the
distribution of these scores in clusters as well as cluster markers to aid in
determination whether a given cluster may be the result of a processing
artifact.