Add interpretation recommendations

Commit cc5511b8 authored 6 months ago by John Lafin
--- a/README.md
+++ b/README.md
@@ -107,6 +107,66 @@ This workflow outputs the following:
 - markers directory: A directory containing CSV files for cluster markers at
 each resolution.
+## Using the output
+This workflow outputs several plots and types of data. Here are some suggestions
+on how to interpret and use them.
+### Assessing filtering
+The default filtering applied here is intentionally conservative: we prefer to
+keep more than we remove. To assess how effective the filtering step was, compare
+the pre- and post-filter plots. Check the distributions of nCount_RNA (total
+counts per cell) and nFeature_RNA (unique genes per cell) for enrichment of cells
+at the high and low ends. The filtering step should remove these cells. If the
+cutoff lines fall beyond the bounds of the data, that's OK; it simply means that
+cutoff removed no cells.
+The trend plot is a good summary of several critical QC metrics. Pre-filtering,
+an scRNA sample might have a cluster of cells at the lower left end of the
+plot with high percent mitochondrial gene expression. Because snRNA data should
+lack mitochondrial gene expression, for these samples you may see a similar
+cluster without high percent mitochondrial gene expression. After filtering,
+this population should mostly disappear, with most of the cells lining up with
+the linear regression line.
+### Assessing pre-processing
+If you have multiple samples, you can assess the effect of batch correction by
+comparing the PCA plot to the Harmony plot. Look for evidence of a batch effect
+in the PCA plot (visible clusters enriched for particular samples or batch
+variables). This effect should be reduced in the Harmony plot (e.g., clusters from
+different samples should shift together). If you don't see a batch effect in the
+PCA plot, you may consider running the workflow again without batch correction.
+You can also look at the UMAP plot grouped by sample or batch variable to assess
+how well batch correction worked.
+Next, look at the elbow plot. The vertical line indicates the dimensionality
+selected. This selection should fall early in the 'plateau' so that as much
+variation as possible is retained without keeping noise. A value between 10
+and 50 is usually sufficient.
+Clustering is run at 10 different resolutions by default (higher resolution = 
+more clusters). Selecting the proper resolution for your data can be challenging.
+The clustree plot will show you cluster definitions at each resolution, and
+how they change over the range. Use this plot to search for a resolution where
+you see some stability in cluster assignment. From this starting point, you
+can look at UMAP plots and marker lists to assess whether this resolution
+appears to capture the biological populations you expect to find.
+Finally, examine the doublet score plot. Any cluster or subcluster that shows
+an enrichment of doublet scores may be made up of doublets. Data from these
+clusters should be viewed with skepticism, or the clusters may be removed
+from analysis entirely. For scRNA data, the same can be done with the stress
+score plot: clusters enriched for cells with high stress scores may be a sign
+of transcriptional changes associated with enzymatic digestion.
+## Next steps
+The data generated by this workflow is directly compatible with the CZ CELLxGENE
+Astrocyte workflow. Provide the `processed_object.rds` file as input to that
+workflow for an interactive exploration of your data. Please see its
+documentation for more information.
 ## Notes
 Note that although this workflow calculates stress and doublet scores, it does

--- a/docs/index.md
+++ b/docs/index.md
@@ -107,6 +107,66 @@ This workflow outputs the following:
 - markers directory: A directory containing CSV files for cluster markers at
 each resolution.
+## Using the output
+This workflow outputs several plots and types of data. Here are some suggestions
+on how to interpret and use them.
+### Assessing filtering
+The default filtering applied here is intentionally conservative: we prefer to
+keep more than we remove. To assess how effective the filtering step was, compare
+the pre- and post-filter plots. Check the distributions of nCount_RNA (total
+counts per cell) and nFeature_RNA (unique genes per cell) for enrichment of cells
+at the high and low ends. The filtering step should remove these cells. If the
+cutoff lines fall beyond the bounds of the data, that's OK; it simply means that
+cutoff removed no cells.
+The trend plot is a good summary of several critical QC metrics. Pre-filtering,
+an scRNA sample might have a cluster of cells at the lower left end of the
+plot with high percent mitochondrial gene expression. Because snRNA data should
+lack mitochondrial gene expression, for these samples you may see a similar
+cluster without high percent mitochondrial gene expression. After filtering,
+this population should mostly disappear, with most of the cells lining up with
+the linear regression line.
+### Assessing pre-processing
+If you have multiple samples, you can assess the effect of batch correction by
+comparing the PCA plot to the Harmony plot. Look for evidence of a batch effect
+in the PCA plot (visible clusters enriched for particular samples or batch
+variables). This effect should be reduced in the Harmony plot (e.g., clusters from
+different samples should shift together). If you don't see a batch effect in the
+PCA plot, you may consider running the workflow again without batch correction.
+You can also look at the UMAP plot grouped by sample or batch variable to assess
+how well batch correction worked.
+Next, look at the elbow plot. The vertical line indicates the dimensionality
+selected. This selection should fall early in the 'plateau' so that as much
+variation as possible is retained without keeping noise. A value between 10
+and 50 is usually sufficient.
+Clustering is run at 10 different resolutions by default (higher resolution = 
+more clusters). Selecting the proper resolution for your data can be challenging.
+The clustree plot will show you cluster definitions at each resolution, and
+how they change over the range. Use this plot to search for a resolution where
+you see some stability in cluster assignment. From this starting point, you
+can look at UMAP plots and marker lists to assess whether this resolution
+appears to capture the biological populations you expect to find.
+Finally, examine the doublet score plot. Any cluster or subcluster that shows
+an enrichment of doublet scores may be made up of doublets. Data from these
+clusters should be viewed with skepticism, or the clusters may be removed
+from analysis entirely. For scRNA data, the same can be done with the stress
+score plot: clusters enriched for cells with high stress scores may be a sign
+of transcriptional changes associated with enzymatic digestion.
+## Next steps
+The data generated by this workflow is directly compatible with the CZ CELLxGENE
+Astrocyte workflow. Provide the `processed_object.rds` file as input to that
+workflow for an interactive exploration of your data. Please see its
+documentation for more information.
 ## Notes
 Note that although this workflow calculates stress and doublet scores, it does