Commit e9b5c309 authored by yy1533

📖 🍺 performance.md; 💪 more friendly layout of docs

parent ac6d3657
# Performance
We believe data digestion should be automated, and it should be done in a
very, very computationally efficient manner :rocket:.
---
## Introduction
As the first generation, `CEL-Seq-pipeline` provided a stable and efficient
service to process CEL-Seq and CEL-Seq2 data. Improving computational
efficiency even beyond that was a challenge we set ourselves, but we spared
no effort to challenge the status quo.

Generally speaking, there are two different ways to improve the runtime of a
pipeline:

1. :fa-laptop: Provide more computational resources (e.g., CPU cores or nodes in
a cluster), thus allowing more parallelized tasks to run at the same time.
2. :fa-gear: Optimize the code of each pipeline step to make it more
computationally efficient.

`celseq2` was developed using both strategies in combination. Here, I describe
the results of a runtime comparison between the previous CEL-Seq2 pipeline
(`CEL-Seq-pipeline`) and the new pipeline (`celseq2`).
`celseq2` features a new workflow design with optimized code written in
Python 3, which boosts computational efficiency and, furthermore, brings
flexibility in requesting computational resources from either a local computer
or servers.
## How to compare
There were two comparisons.

1. To demonstrate the efficiency boost contributed by the optimized code alone,
`celseq2` and `CEL-Seq-pipeline` were run on the same *one* set of CEL-Seq2 data
in *serial* mode.
2. To quantify the performance difference in a realistic case where
parallelization is often allowed, `celseq2` and `CEL-Seq-pipeline` were run on
the same *two* sets of CEL-Seq2 data in *parallel* mode.
Each example set of CEL-Seq2 data consisted of ~40 million read pairs (read-1
and read-2 have lengths of 13 and 61 nucleotides, respectively).
## Comparison of runtimes in serial mode
![single](https://i.imgur.com/68DzbN0.jpg)
Each run in *serial* mode was performed three times independently to account for
the fluctuation of run-time.
<!-- <img src='../img/efficiency_NonParralel_SingleLib.png' width=600> -->
![](../img/efficiency_NonParralel_SingleLib.png)
:white_check_mark: Every step of the workflow was accelerated by `celseq2`
compared to `CEL-Seq-pipeline`, solely due to the new design and code
optimization.

1. Demultiplexing was sped up **3-fold**.
2. Counting UMIs was sped up almost **4-fold**.
3. Overall, the pipeline was sped up **2-fold**.
:bangbang: In this comparison, `CEL-Seq-pipeline` had an unfair advantage: it
internally used 15 threads to parallelize alignment with `bowtie2`, while the
new pipeline stringently used only one thread. As a result, `celseq2` spent
more time on alignment than `CEL-Seq-pipeline` in this example.

:white_check_mark: Nevertheless, `celseq2` achieved an overall speed-up of more
than **2-fold** compared to `CEL-Seq-pipeline`.
## Comparison of runtimes in parallel mode
![multi](https://i.imgur.com/Gxho5Qk.jpg)
In practice, users usually have multiple CPU cores available, and pipelines can
take advantage of this by splitting data into smaller chunks and processing
them in parallel, and/or by running individual processes (e.g., alignment)
using multiple threads.

The second comparison therefore aimed to quantify the performance of the two
pipelines in a more realistic scenario by running them on a 32-core server in
*parallel* mode [^how-parallel], testing them on a dataset that consisted of
two sets of CEL-Seq2 data [^dup-data] totaling ~80 million read pairs.

Note that in this comparison, `CEL-Seq-pipeline` used more computational
resources than `celseq2` for UMI counting (50 parallel jobs instead of 30, see
details [here](https://gitlab.com/Puriney/celseq2/wikis/Efficiency#how-the-parallelization-in-previous-generation-pipeline-is-performed)). Nevertheless, `celseq2` achieved a more than **3.5-fold** overall speed-up
compared to `CEL-Seq-pipeline`.
<!-- <img src='../img/efficiency_Parallel_2Libs.png' width=600> -->
![](../img/efficiency_Parallel_2Libs.png)
:white_check_mark: Every step of the workflow was accelerated by `celseq2`
compared to `CEL-Seq-pipeline`.

1. Demultiplexing was sped up **9-fold**.
2. Counting UMIs was sped up **4-fold** [^umi].
3. Overall, the pipeline was sped up **3.5-fold**.
:bangbang: In this comparison, `CEL-Seq-pipeline` again had an unfair
advantage, because it used twice the computational resources of `celseq2` for
UMI counting: internally, `CEL-Seq-pipeline` launched a fixed number of
parallel jobs, 50 instead of 30 [^umi].

:white_check_mark: Nevertheless, `celseq2` achieved an overall speed-up of more
than **3.5-fold** compared to `CEL-Seq-pipeline`. In particular, the new
Snakemake-based workflow of `celseq2` parallelized the demultiplexing step,
allowing the reads from both lanes to be demultiplexed at the same time.
## Summary
1. Compared to `CEL-Seq-pipeline`, `celseq2` achieved the following speed-ups:
    - Demultiplexing: ~4N-fold, where N is the number of sets of CEL-Seq2 data.
    - Alignment: roughly the same, as expected, since no optimization was involved.
    - Counting UMIs: ~4-fold.
2. In a real case, the absolute runtime for `celseq2` to finish ~80 million
read pairs was about 40 minutes [^time-consumed] in this example.
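The ~4N-fold demultiplexing scaling can be illustrated with a back-of-the-envelope model (purely illustrative; the function name and the ~4x constant are rough approximations of the measurements above, not part of `celseq2`):

```python
def demux_speedup(n_sets, per_set_speedup=4):
    """Rough model: code optimization gives ~4x per set of CEL-Seq2 data,
    and the Snakemake-based workflow demultiplexes the N sets in parallel,
    so the overall speed-up scales roughly as 4 * N."""
    return per_set_speedup * n_sets

# One set in serial mode: model says ~4x (a 3-fold speed-up was observed).
print(demux_speedup(1))  # → 4
# Two sets in parallel mode: model says ~8x (a 9-fold speed-up was observed).
print(demux_speedup(2))  # → 8
```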
[^how-parallel]: The syntax to run the pipelines in parallel mode differed:
`CEL-Seq-pipeline` set `proc=30` in its configuration file, while `celseq2`
requested at most 30 cores simply by setting `-j 30`.

[^dup-data]: For the record, the two sets of CEL-Seq2 data were indeed two
copies of the same one set of data.

[^umi]: It turned out that the `proc` parameter of `CEL-Seq-pipeline` was not
used when counting UMIs; instead, `CEL-Seq-pipeline` internally launched a
fixed number of parallel jobs, always 50. See
[line-60](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/htseq_wrapper.py#L60) and
[line-81](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/htseq_wrapper.py#L81) in
the source code. In this example, both pipelines were supposed to request the
same 30 cores, so `CEL-Seq-pipeline` had roughly twice the number of jobs
running for counting UMIs compared to `celseq2`.

[^time-consumed]: The absolute time varies among datasets. For example, a run
takes less time if a smaller proportion of reads is qualified, even with the
same total number of read pairs (e.g., 40 million).
......
---
## Introduction
`celseq2` is not the first tool developed by the Yanai lab to process CEL-Seq2
data; that was `CEL-Seq-pipeline` (see its [code](https://github.com/yanailab/CEL-Seq-pipeline)).
......
This post aims to demonstrate the **self-consistency** of `celseq2`, the new pipeline:
- The UMI count matrix of a single pair of read files is self-consistent across different runs.
- The UMI count matrix of multiple pairs of read files is self-consistent across different runs.
## Data and experiment design
Suppose two plates were available, named E1 and E2, and linked to different Illumina sequencing barcodes, shown in blue and orange.
<img src='http://i.imgur.com/qUDbAI7.png' width=450 alt='self-consistency-multi'>
- item-1: Cells indexed 1-6 of E1 were sequenced in Lane-1
- item-2: Cells indexed 7-9 of E1 were sequenced in Lane-2
- item-3: The cell indexed 10 of E1 was sequenced in Lane-3
- item-4: Cells indexed 1-96 of E2 were sequenced in Lane-1
- item-5: Only cells indexed 1-13 of E2 were claimed present, though cells indexed 1-96 were all sequenced in Lane-2.
Further suppose the sequencer reported one pair of read files per lane per Illumina sequencing barcode. Accordingly, the experiment table for the `celseq2` pipeline was defined as:
| SAMPLE_NAME | CELL_BARCODES_INDEX | R1 | R2 |
|------------- |--------------------- |------------------------- |------------------------- |
| E1 | 1-3,6,4-5 | S1_L001_R1_001.fastq.gz | S1_L001_R2_001.fastq.gz |
| E1 | 7,8,9 | S1_L002_R1_001.fastq.gz | S1_L002_R2_001.fastq.gz |
| E1 | 10 | S1_L003_R1_001.fastq.gz | S1_L003_R2_001.fastq.gz |
| E2 | 1-96 | S2_L001_R1_001.fastq.gz | S2_L001_R2_001.fastq.gz |
| E2 | 1-13 | S2_L002_R1_001.fastq.gz | S2_L002_R2_001.fastq.gz |
To create such an arbitrarily complex set of experiments, the actual raw data was just one set of CEL-Seq2 sequencing read FASTQ files (40 million read pairs), duplicated into 5 pairs. In other words, all the read files listed in the `R1` column were the same, as were those in the `R2` column.
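The `CELL_BARCODES_INDEX` column accepts comma-separated values and ranges, in any order (e.g., `1-3,6,4-5`). A minimal sketch of how such a specification can be expanded into a list of cell indices (the helper name is hypothetical, not part of `celseq2`):

```python
def expand_barcode_index(spec):
    """Expand a spec like '1-3,6,4-5' into sorted, unique cell indices."""
    indices = set()
    for token in spec.split(","):
        if "-" in token:
            # A range such as '1-3' covers both endpoints.
            start, end = map(int, token.split("-"))
            indices.update(range(start, end + 1))
        else:
            indices.add(int(token))
    return sorted(indices)

print(expand_barcode_index("1-3,6,4-5"))  # → [1, 2, 3, 4, 5, 6]
```

Note that `1-3,6,4-5` and `1-6` claim the same cells; the deliberately scrambled form in the table above simply exercises that flexibility.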
## How to validate self-consistency
UMI count matrices were generated in <kbd>expr</kbd>.
```
expr/
├── E1
│   ├── expr.csv
│   ├── expr.h5
│   ├── item-1
│   │   ├── expr.csv
│   │   └── expr.h5
│   ├── item-2
│   │   ├── expr.csv
│   │   └── expr.h5
│   └── item-3
│       ├── expr.csv
│       └── expr.h5
└── E2
    ├── expr.csv
    ├── expr.h5
    ├── item-4
    │   ├── expr.csv
    │   └── expr.h5
    └── item-5
        ├── expr.csv
        └── expr.h5
```
Self-consistency can be validated by the following comparisons:
- E2-item5 vs. E2-item4 on cells 1-13: self-consistency with **one** pair of read files
- E1 vs. E2 on cells 1-10: self-consistency with **multiple** pairs of read files
## Self-consistency was validated
:white_check_mark: The test script [manual_test_expr_consistency.R](https://gitlab.com/Puriney/celseq2/blob/master/test/consistent2legacy/manual_test_expr_consistency.R) quantified the difference among the intact matrices; the difference was exactly zero, validating self-consistency.
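The check in the R script boils down to verifying that the matrices are element-wise identical on the cells being compared. An equivalent sketch in Python (the helper names and the gene-by-cell CSV layout are assumptions for illustration, not the actual test code):

```python
import csv

def load_matrix(path):
    """Load a UMI count matrix (genes as rows, cells as columns) from expr.csv."""
    with open(path) as f:
        rows = list(csv.reader(f))
    header = rows[0][1:]  # cell names
    matrix = {row[0]: dict(zip(header, map(float, row[1:]))) for row in rows[1:]}
    return header, matrix

def matrices_identical(csv_a, csv_b, cells=None):
    """True when the two matrices agree exactly on the chosen cells."""
    cells_a, a = load_matrix(csv_a)
    cells_b, b = load_matrix(csv_b)
    cells = cells or [c for c in cells_a if c in cells_b]
    if set(a) != set(b):  # the gene sets must match
        return False
    # Self-consistency holds when every compared entry is exactly equal.
    return all(a[g][c] == b[g][c] for g in a for c in cells)
```

Restricting `cells` to, e.g., cells 1-13 mirrors the E2-item5 vs. E2-item4 comparison.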
Furthermore, heatmaps of the UMI count matrices, with 200 randomly selected genes as rows and cells as columns, greatly help visualize the consistency.
### Self-consistency: E2-item5 vs. E2-item4
<img src='http://i.imgur.com/hoaTQjn.png' width=450>
*Comparison of E2-item5 vs. E2-item4, focusing on cells 1-13, demonstrated self-consistency when one pair of read files was the input. Rows are 200 randomly selected genes, and columns are all available cells. The left panel shows cells 1-13 in E2-item5, the middle panel cells 1-13 in E2-item4, and the right panel the rest of the cells in E2-item4.*
### Self-consistency: E1 vs. E2
<img src="http://i.imgur.com/WcFserh.png" width=450>
*Comparison of E1 vs. E2, focusing on cells 1-10, demonstrated self-consistency when multiple pairs of read files were the input. Rows are 200 randomly selected genes, and columns are all available cells. The left panel shows cells 1-10 in E1, the middle panel cells 1-10 in E2, and the right panel the rest of the cells in E2.*
# What is new
---
## Consistent to previous pipeline
This post aims to demonstrate **cross-consistency**: consistency between the new ([version](https://gitlab.com/Puriney/celseq2/tree/2985b93b42f7863dfc8170552c9f4d58e324f614)) and the previous generation ([version](https://github.com/yanailab/CEL-Seq-pipeline/tree/133912cd4ceb20af0c67627ab883dfce8b9668df)) of the CEL-Seq2 pipeline.
### Data
The experiment design was borrowed from the post [Consistency: self-consistency](https://gitlab.com/Puriney/celseq2/wikis/Consistency:-self-consistency). Experiment E2 alone was sufficient to run the new and the previous generation of the pipeline and draw conclusions.
<img src='http://i.imgur.com/rzckeCH.png' width=225>
| SAMPLE_NAME | CELL_BARCODES_INDEX | R1 | R2 |
|------------- |--------------------- |------------------------- |------------------------- |
| E2 | 1-96 | S2_L001_R1_001.fastq.gz | S2_L001_R2_001.fastq.gz |
| E2 | 1-13 | S2_L002_R1_001.fastq.gz | S2_L002_R2_001.fastq.gz |
:loudspeaker: Note that the first and second rows retained their item names, item-4 and item-5 respectively, following the convention of the original experiment design.
:loudspeaker: Note that the `R1` read files of item-4 and item-5 were the same, as were the `R2` files; this was a pseudo experiment.
### How to validate cross-consistency
In this example, the UMI count matrix of E2 was expected to be exactly the same as that of E2-item4, regardless of the bioinformatics tool used.
Cross-consistency can be validated by the following comparisons:
1. New vs. previous generation pipeline, both run on E2-item4, to test cross-consistency on single-lane inputs.
2. New generation pipeline on E2 vs. previous generation on E2-item4 alone, to test cross-consistency on multiple-lane inputs.
### Cross-consistency was validated
As shown in the self-consistency post, the test script [manual_test_expr_consistency.R](https://gitlab.com/Puriney/celseq2/blob/master/test/consistent2legacy/manual_test_expr_consistency.R) quantified the difference among the UMI count matrices; the difference was exactly zero, validating cross-consistency.
Furthermore, heatmaps of subsets of the UMI count matrices, with 200 randomly selected genes as rows and cells as columns, help visualize the cross-consistency.
#### Cross-Consistency on single-lane inputs
<img src='http://i.imgur.com/Wmqutbw.png' width=450>
*The new vs. the previous generation of the pipeline, executed on the same E2-item4 data covering 96 cells. The left and right panels show the UMI count matrices generated by `celseq2` and by the previous generation pipeline, respectively. 200 randomly selected genes are shown as rows, and all cells as columns.*
:white_check_mark: The two intact matrices had zero difference, which validated cross-consistency for single-lane inputs.
#### Cross-Consistency on multiple-lane inputs
<img src='http://i.imgur.com/3mGvLSB.png' width=450>
*`celseq2` executed on the entire E2 vs. the previous generation of the pipeline on E2-item4 alone. The left and right panels show the UMI count matrices generated by `celseq2` and by the previous generation pipeline, respectively. 200 randomly selected genes are shown as rows, and all cells as columns.*
:white_check_mark: The two intact matrices had zero difference. In addition, the results for E2 and E2-item4 should be the same regardless of pipeline version; together, these results validated cross-consistency for multi-lane inputs.
### Method
The workflow is Demultiplex => Align with `Bowtie2` => Count UMIs => Create the UMI count matrix.
Executing the two generations of the pipeline with the same inputs was supposed to generate exactly the same results, but that was **not** the case. `celseq2` uses a new naming scheme for each individual read during demultiplexing in order to reduce unnecessary storage cost. This change affects neither the correctness nor the self-consistency of `celseq2`, but it does affect consistency across generations: when `Bowtie2` chooses one out of several equally good alignments for a read, it initializes a pseudo-random number generator that is partly seeded by the read name (see `Bowtie2`'s [manual](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#randomness-in-bowtie-2)). It is therefore not surprising that `celseq2` and the previous generation pipeline produced different outputs from the same inputs.
However, it was still possible to validate cross-consistency. Suppose a large CEL-Seq2 dataset is demultiplexed into 96 small FASTQ files for 96 single cells, which are then aligned with `Bowtie2` to create 96 SAM/BAM alignments. Starting from those 96 alignments, `celseq2` and the previous generation pipeline are expected to produce the same UMI count matrix. This dataset is exactly E2-item4, which led to the validation of cross-consistency shown in the two sections above.
Since alignment names are inherited from the FASTQ reads, a tiny modification to the previous generation pipeline was required so that it could recognize the alignments created by `celseq2`, without altering its logic.
Step-1: Fetch the stable version of previous generation pipeline.
``` bash
git clone https://github.com/yanailab/CEL-Seq-pipeline.git --branch stable
```
Step-2: Make the modification. Open the Python file <kbd>htseq_count_umified.py</kbd> and jump to line-51 (highlighted in the original context [here](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/htseq_count_umified.py#L51)). Change `:UMI:(\w+):` to `_UMI-(\w+)`, which results in code like this:
<img src='http://i.imgur.com/SAELxwe.png' width=650>
With this change, the previous generation pipeline can recognize the alignments produced by `celseq2` and runs just as it did in the old days.
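The regex change reflects the two read-name schemes: the previous pipeline embedded the UMI as `:UMI:<seq>:`, while `celseq2` writes `_UMI-<seq>`. A small sketch of extracting the UMI under either scheme (the read names and the helper function are illustrative, not taken from either codebase):

```python
import re

OLD_UMI = re.compile(r":UMI:(\w+):")  # CEL-Seq-pipeline naming scheme
NEW_UMI = re.compile(r"_UMI-(\w+)")   # celseq2 naming scheme

def extract_umi(read_name):
    """Return the UMI sequence embedded in a read name, or None."""
    for pattern in (NEW_UMI, OLD_UMI):
        match = pattern.search(read_name)
        if match:
            return match.group(1)
    return None

print(extract_umi("READ1_UMI-ACGTAC"))        # → ACGTAC
print(extract_umi("READ1:UMI:ACGTAC:CELL1"))  # → ACGTAC
```

Swapping the single pattern on line-51 is exactly this kind of one-regex change, which is why the rest of the counting logic stays untouched.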
## Boosted computational efficiency
### Conclusions
#### Demultiplexing
`celseq2` makes it possible to parallelize demultiplexing across libraries. This new feature made `celseq2` run at **>=10X** speed in the above [experiment](https://gitlab.com/Puriney/celseq2/wikis/Efficiency#parallel-execution), where only 2 libraries were present. If more libraries are provided, the performance boost becomes even more dramatic. `celseq2` is therefore particularly helpful when libraries come from multiple lanes or from more than one technical replicate.
`celseq2` also accelerated demultiplexing *per se*. This made it run at **>=3X** speed when handling one library, as shown in the above [experiment](https://gitlab.com/Puriney/celseq2/wikis/Efficiency#serial-execution).
#### Alignment
Alignment depends on `Bowtie2`. `celseq2` makes no changes at all in terms of strategy, except that `--seed` is fixed at 42 to maintain reproducibility.
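For reference, fixing the seed amounts to an alignment call along these lines (a sketch only: the index and file names are illustrative, and other options `celseq2` passes are omitted):

```shell
# A fixed --seed keeps Bowtie2's tie-breaking among equally good
# alignments reproducible across runs.
bowtie2 --seed 42 -x genome_index -U demultiplexed_cell.fastq -S cell.sam
```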
#### Counting UMI
`celseq2` accelerated counting UMIs *per se*. This made it run at **>=4X** speed, as shown in the above [experiment](https://gitlab.com/Puriney/celseq2/wikis/Efficiency#serial-execution).
#### Refs
##### How the parallelization in the previous generation pipeline is performed:
1. Demultiplexing: No. It simply iterates over each CEL-Seq2 read file (see [source](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/bc_demultiplex.py#L42-L52)).
2. Aligning with `Bowtie2`: Yes. In the config file there are two parameters for parallelization: 1) `procs`, by default 10 (see [source](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/bowtie_wrapper.py#L47)), which means 10 alignment operations run in parallel; 2) `number_of_threads`, which sets the thread count for running bowtie2 itself (see [source](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/bowtie_wrapper.py#L22)).
3. Counting UMIs with customized `HTSeq`: Yes, but `procs` is **fixed** at 50; there are always 50 cells having their UMIs counted in parallel (see [source](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/htseq_wrapper.py#L81)).
To make the previous generation pipeline run in serial mode, a minor modification of the source code is required, i.e., setting the variable `procs` to 1 in the functions identified above.
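The difference can be sketched as a fixed-size worker pool versus one sized by the user's request (a schematic illustration, not the actual source of either pipeline; the per-cell work is a placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

def count_umis(cell):
    # Placeholder for the per-cell UMI counting work.
    return f"cell-{cell} counted"

cells = range(96)

# CEL-Seq-pipeline: the pool size is hard-coded to 50 jobs,
# regardless of what `procs` is set to in the configuration.
with ThreadPoolExecutor(max_workers=50) as pool:
    legacy = list(pool.map(count_umis, cells))

# celseq2: the pool size follows the user's request, e.g. `-j 30`.
requested_jobs = 30
with ThreadPoolExecutor(max_workers=requested_jobs) as pool:
    new = list(pool.map(count_umis, cells))
```

Both produce identical results; only the degree of parallelism (and thus the resources consumed) differs, which is why the benchmark above had to note the 50-vs-30 imbalance.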
## Superb user experience
......
spatial transcriptome?
:star: See [List of supported techniques](about/support_others.md#).
## Get involved
:question: How to contribute to `celseq2`?
:star: Thanks for your interest. We suggest following the "Guidelines to submit
a Pull Request" of `mxnet`
(see [here](https://mxnet.incubator.apache.org/community/contribute.html#guidelines-to-submit-a-pull-request)).
:question: How to contribute to user manuals?
:star: In the top-right corner of every post there is a :fa-pencil: icon. Click
it to edit the source content, which is in Markdown format.
......
from CEL-Seq2 [^Hashimshony2016] sequencing data. We believe data digestion
should be automated, and it should be done in a manner that is not just
computationally efficient, but also user-friendly and developer-friendly.
## Installation
``` bash
git clone git@github.com:yanailab/celseq2.git
```
......
```
expr/
   └── expr.h5
```
Results of <kbd>item-X</kbd> are useful to assess variation when FASTQ
files from multiple lanes, or technical/biological replicates are present.
## About
......
......
Experiment table has 3 purposes:
2. Translate the overall design of experiments into a format that `celseq2` can
understand, so that `celseq2` can generate the expected layout of the UMI count
matrix.
3. Provide a user-friendly way to accomplish the two goals above.

The experiment table is particularly convenient for complex experimental
designs where more than one experimental condition, with or without
biological/technical replicates, is present.
## Create an experiment table from template
......
......
license: 'MIT'
theme: material
extra:
  palette:
    primary: 'deep purple'
    accent: 'deep purple'
  logo:
    icon: 'home'
  social:
......
    link: https://github.com/Puriney
  - type: 'link'
    link: https://yanailab.org
feature:
  tabs: true
# font:
#   text: 'Open Sans'
#   code: 'Roboto Mono'
# extra_css:
#   - css/extra.css
......