:white_check_mark: Each step of the workflow was accelerated by `celseq2` compared to
`CEL-Seq-pipeline`, solely due to the new design and code optimization.
1. Demultiplexing ran **3-fold** faster.
2. Counting UMIs ran **4-fold** faster.
3. Overall, the pipeline ran **2-fold** faster.
:bangbang: In this comparison `CEL-Seq-pipeline` had an unfair advantage: it internally
used 15 threads to parallelize alignment with `bowtie2`, while the new pipeline strictly
used only one thread. As a result, `celseq2` spent more time than `CEL-Seq-pipeline` on
the alignment step in this example.
:white_check_mark: Nevertheless, `celseq2` achieved an overall speed-up of more than
**2-fold** over `CEL-Seq-pipeline`.
## Comparison of runtimes in parallel mode

In practice, pipeline users usually have multiple CPU cores available, and
pipelines can take advantage of this fact by splitting up the data into smaller
chunks and processing them in parallel, and/or by running individual processes
(e.g., alignment) using multiple threads. I therefore aimed to perform a more
realistic performance comparison of the two pipelines by running them on a
32-core server in *parallel* mode [^how-parallel], testing them on a dataset that
consisted of two lanes (pairs of FASTQ files) of CEL-Seq2 data [^dup-data], each
consisting of 40M read pairs (~80 million read pairs in total).
The results showed a **9-fold** improvement in the runtime of the
demultiplexing step for the `celseq2` pipeline, which was the combined result of
code optimization (see above) and the use of the snakemake framework, which
allowed the reads from both lanes to be demultiplexed in parallel.
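To illustrate how this parallelization comes for free from the Snakemake framework, below is a minimal Snakemake-style sketch (not celseq2's actual Snakefile; all file names and the rule body are placeholders): each lane becomes an independent job, so a run given multiple cores can demultiplex both lanes at the same time.
```python
# Minimal Snakemake-style sketch, NOT celseq2's actual Snakefile, illustrating
# why the two lanes can be demultiplexed in parallel: each lane is an
# independent job, so a run with multiple cores executes them concurrently.
LANES = ["lane1", "lane2"]

rule all:
    input:
        expand("demultiplexed/{lane}.done", lane=LANES)

# One demultiplexing job per lane; the echo stands in for the real step.
rule demultiplex:
    input:
        r1="fastq/{lane}.R1.fastq.gz",
        r2="fastq/{lane}.R2.fastq.gz",
    output:
        touch("demultiplexed/{lane}.done")
    shell:
        "echo demultiplexing {wildcards.lane}: {input.r1} {input.r2}"
```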
Note that in this comparison, `CEL-Seq-pipeline` used more computational
resources than `celseq2` for UMI counting (50 parallel jobs instead of 30, see
details [here](https://gitlab.com/Puriney/celseq2/wikis/Efficiency#how-the-parallelization-in-previous-generation-pipeline-is-performed)). Nevertheless, `celseq2` achieved a more than **3.5-fold** overall speed-up.
## Design of experiments
Suppose there were two experiments, E1 and E2, whose cells were sequenced across multiple lanes:
- item-1: Cells indexed 1-6 of E1 were sequenced in Lane-1
- item-2: Cells indexed 7-9 of E1 were sequenced in Lane-2
- item-3: Cells indexed 10 of E1 were sequenced in Lane-3
- item-4: Cells indexed 1-96 of E2 were sequenced in Lane-1
- item-5: Cells indexed 1-13 of E2 were claimed present, though cells indexed 1-96 were all sequenced in Lane-2.
Further suppose the sequencer reported one pair of read files per lane per Illumina sequencing barcode; the experiment table for the `celseq2` pipeline was defined accordingly.
To create such an arbitrary set of complex experiments, the actual raw data was just one set of CEL-Seq2 sequencing read FASTQ files (40 million read pairs), duplicated into 5 pairs. In other words, all the files listed in the `R1` column had identical content, and likewise for the `R2` column.
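For illustration only, such an experiment table might look roughly like the sketch below, with one row per item; the column names assume `celseq2`'s experiment-table convention, and the FASTQ paths are placeholders rather than the actual files used:
```
SAMPLE_NAME   CELL_BARCODES_INDEX   R1                          R2
E1            1-6                   path/to/E1-lane1.R1.fastq   path/to/E1-lane1.R2.fastq
E1            7-9                   path/to/E1-lane2.R1.fastq   path/to/E1-lane2.R2.fastq
E1            10                    path/to/E1-lane3.R1.fastq   path/to/E1-lane3.R2.fastq
E2            1-96                  path/to/E2-lane1.R1.fastq   path/to/E2-lane1.R2.fastq
E2            1-13                  path/to/E2-lane2.R1.fastq   path/to/E2-lane2.R2.fastq
```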
## How to validate self-consistency
UMI count matrices were generated in <kbd>expr</kbd>.
```
expr/
├── E1
│   ├── expr.csv
│   ├── expr.h5
│   ├── item-1
│   │   ├── expr.csv
│   │   └── expr.h5
│   ├── item-2
│   │   ├── expr.csv
│   │   └── expr.h5
│   └── item-3
│       ├── expr.csv
│       └── expr.h5
└── E2
    ├── expr.csv
    ├── expr.h5
    ├── item-4
    │   ├── expr.csv
    │   └── expr.h5
    └── item-5
        ├── expr.csv
        └── expr.h5
```
By examining the following comparisons, self-consistency can be demonstrated:
- E2-item5 v.s. E2-item4 on cells 1-13: self-consistency with **one** pair of read files
- E1 v.s. E2 on cells 1-10: self-consistency with **multiple** pairs of read files
## Self-consistency was validated
:white_check_mark: The test script [manual_test_expr_consistency.R](https://gitlab.com/Puriney/celseq2/blob/master/test/consistent2legacy/manual_test_expr_consistency.R) quantified the difference between the full matrices. The difference was zero, which validated self-consistency.
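The actual check is implemented in the R script linked above; the snippet below is only a minimal sketch of the same idea in Python, and it assumes the `expr.csv` files have genes as rows and cells as columns:
```python
# Minimal sketch of the consistency check; this is NOT the actual
# manual_test_expr_consistency.R script, and the expr.csv layout
# (genes as rows, cells as columns) is an assumption.
import pandas as pd

a = pd.read_csv("expr/E2/item-5/expr.csv", index_col=0)
b = pd.read_csv("expr/E2/item-4/expr.csv", index_col=0)

# Compare item-5 against the matching cells of item-4; a total absolute
# difference of zero means the two matrices are identical on those cells.
cells = a.columns.intersection(b.columns)
genes = a.index.intersection(b.index)
diff = (a.loc[genes, cells] - b.loc[genes, cells]).abs().to_numpy().sum()
print(diff)  # expected: 0.0
```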
Furthermore, heatmaps of the UMI count matrices, with 200 randomly selected genes as rows and cells as columns, help visualize the consistency.
*Comparison of E2-item5 v.s. E2-item4, focusing on cells 1-13, demonstrated self-consistency when one pair of read files was the input. Rows are 200 randomly selected genes, and columns are all available cells. The left panel shows cells 1-13 in E2-item5, the middle panel shows cells 1-13 in E2-item4, and the right panel shows the rest of the cells in E2-item4.*
*Comparison of E1 v.s. E2, focusing on cells 1-10, demonstrated self-consistency when multiple pairs of read files were the input. Rows are 200 randomly selected genes, and columns are all available cells. The left panel shows cells 1-10 in E1, the middle panel shows cells 1-10 in E2, and the right panel shows the rest of the cells in E2.*
This post aims to demonstrate **cross-consistency**: consistency between the new ([version](https://gitlab.com/Puriney/celseq2/tree/2985b93b42f7863dfc8170552c9f4d58e324f614)) and the previous generation ([version](https://github.com/yanailab/CEL-Seq-pipeline/tree/133912cd4ceb20af0c67627ab883dfce8b9668df)) of the CEL-Seq2 pipeline.
### Data
The experiment design was borrowed from the post [Consistency: self-consistency](https://gitlab.com/Puriney/celseq2/wikis/Consistency:-self-consistency). Experiment E2 alone was sufficient for running both the new and the previous generation of the pipeline and drawing a conclusion.
:loudspeaker: Remember that the first and second rows kept their item names, item-4 and item-5, respectively, following the convention of the original experiment design.
:loudspeaker: Remember that the `R1` read files of item-4 and item-5 were identical, and likewise for `R2`; this was a pseudo experiment.
### How to validate cross-consistency
In this example, the UMI count matrix of E2 was expected to be exactly the same as that of E2-item4, regardless of the bioinformatics tools used.
By making the following comparisons, cross-consistency would be validated:
1. New v.s. previous generation pipeline, both run on E2-item4, to test cross-consistency on single-lane inputs.
2. New generation pipeline on E2 v.s. previous generation on E2-item4 alone, to test cross-consistency on multiple-lane inputs.
### Cross-consistency was validated
As shown in the self-consistency post, the test script [manual_test_expr_consistency.R](https://gitlab.com/Puriney/celseq2/blob/master/test/consistent2legacy/manual_test_expr_consistency.R) quantified the difference between the UMI count matrices. The difference was zero, which validated cross-consistency.
Furthermore, heatmaps of a subset of the UMI count matrices, with 200 randomly selected genes as rows and cells as columns, help visualize the cross-consistency.
*The new and the previous generation of the pipeline were executed on the same E2-item4 data, which covered 96 cells. The left and right panels show the UMI count matrices generated by `celseq2` and the previous generation pipeline, respectively. 200 genes were randomly selected as rows for visualization, and all cells are placed in columns.*
:white_check_mark: The two full matrices had zero difference, which validated cross-consistency for single-lane inputs.
*`celseq2` was executed on the entire E2, while the previous generation pipeline was run on E2-item4 alone. The left and right panels show the UMI count matrices generated by `celseq2` and the previous generation pipeline, respectively. 200 genes were randomly selected as rows for visualization, and all cells are placed in columns.*
:white_check_mark: The two full matrices had zero difference. In addition, the results for E2 and E2-item4 should be the same regardless of the pipeline version; together, these results validated cross-consistency for multi-lane inputs.
### Method
The workflow is Demultiplex => Align by `Bowtie2` => Count UMI => Create UMI count matrix.
Running the two generations of the pipeline with the same inputs was expected to generate exactly the same results, but this was **not** the case. `celseq2` uses a new naming scheme for each individual read during demultiplexing in order to reduce unnecessary storage cost. This change affects neither the correctness nor the self-consistency of `celseq2`, but it does impact the consistency across generations: when `Bowtie2` picks one out of several equally good alignments for a read, it uses a pseudo-random number generator that is seeded in part by the read name (see `Bowtie2`'s [manual](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#randomness-in-bowtie-2)). Therefore, it is not surprising that `celseq2` and the previous generation pipeline produced different outputs from the same inputs.
However, it was still possible to validate cross-consistency. Suppose a large CEL-Seq2 read dataset is demultiplexed into 96 small CEL-Seq2 FASTQ files, one per single cell, and these are aligned with `Bowtie2` to produce 96 SAM/BAM alignment files. Starting from those 96 alignments, `celseq2` and the previous generation pipeline were expected to produce the same UMI count matrix. Such a dataset is exactly E2-item4, which led to the validation of cross-consistency shown in the two sections above.
Since alignment names are inherited from the FASTQ read names, a tiny modification to the previous generation pipeline was required so that it could recognize the alignments created by `celseq2`, without altering its logic.
Step-1: Fetch the stable version of the previous generation pipeline (the [version](https://github.com/yanailab/CEL-Seq-pipeline/tree/133912cd4ceb20af0c67627ab883dfce8b9668df) linked above).
Step-2: Modify it. Open the Python file <kbd>htseq_count_umified.py</kbd> and jump to line 51 (highlighted in its original context [here](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/htseq_count_umified.py#L51)). Change the regular expression `:UMI:(\w+):` to `_UMI-(\w+)` so that the script recognizes read names produced by `celseq2`.
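Below is a hypothetical illustration of the effect of this change; the variable names are made up and this is not the actual <kbd>htseq_count_umified.py</kbd> source, only the two regular expressions come from Step-2:
```python
# Hypothetical illustration; only the two regex literals come from Step-2 above.
import re

read_name = "lane1.read001_UMI-ACGTAC"  # a made-up celseq2-style read name

# Old pattern expected by CEL-Seq-pipeline read names: r':UMI:(\w+):'
# New pattern recognizing celseq2 read names:          r'_UMI-(\w+)'
match = re.search(r'_UMI-(\w+)', read_name)
umi = match.group(1) if match else None
print(umi)  # -> ACGTAC
```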
With this change, the previous generation pipeline recognizes the alignments produced by `celseq2` and runs as it did before.
## Boosted computational efficiency
### Conclusions
#### Demultiplexing
`celseq2` makes it possible to parallelize demultiplexing across libraries. This new feature made `celseq2` run at **>=10X** speed in the above [experiment](https://gitlab.com/Puriney/celseq2/wikis/Efficiency#parallel-execution), where only 2 libraries were present. If more libraries are provided, the performance boost becomes even more dramatic. Therefore, `celseq2` is particularly helpful when libraries are generated across multiple lanes or with more than one technical replicate.
`celseq2` also accelerated demultiplexing *per se*. This made it run at **>=3X** speed when handling a single library, as shown in the above [experiment](https://gitlab.com/Puriney/celseq2/wikis/Efficiency#serial-execution).
#### Alignment
Alignment depends on `Bowtie2`. `celseq2` makes no changes to the alignment strategy, except that `--seed` is fixed at 42 to maintain reproducibility.
#### Counting UMI
`celseq2` accelerated counting UMIs *per se*. This made it run at **>=4X** speed, as shown in the above [experiment](https://gitlab.com/Puriney/celseq2/wikis/Efficiency#serial-execution).
#### Refs
##### How the parallelization in previous generation pipeline is performed:
1. Demultiplexing: No. It only iterates over each CEL-Seq2 read file (see [source](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/bc_demultiplex.py#L42-L52)).
2. Aligning by `Bowtie2`: Yes. In the config file there are two parameters for parallelization: 1) `procs`, 10 by default (see [source](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/bowtie_wrapper.py#L47)), which means 10 alignment operations run in parallel; 2) `number_of_threads`, which sets the number of threads for a single `bowtie2` run (see [source](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/bowtie_wrapper.py#L22)).
3. Counting UMI by customized `HTSeq`: Yes. `procs` is **fixed** at 50, so there are always 50 cells having their UMIs counted in parallel (see [source](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/htseq_wrapper.py#L81)).
To make the previous generation pipeline run in serial mode, a minor modification of the source code is required, i.e. setting the variable `procs` to 1 in the functions highlighted above.
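As a minimal illustration of the pattern described above (this is not the actual CEL-Seq-pipeline source; the function and file names are made up), a fixed `procs` value sets the size of the worker pool, so setting `procs = 1` forces the step to run serially:
```python
# Minimal illustration (NOT the actual CEL-Seq-pipeline source) of a fixed
# `procs` value controlling the size of the worker pool; procs = 1 makes the
# step effectively serial.
from multiprocessing import Pool

def count_umis_for_cell(cell_sam):
    # Placeholder for the per-cell UMI counting step.
    return cell_sam

procs = 50  # hard-coded to 50 in the previous pipeline's UMI-counting wrapper
cell_sams = [f"cell_{i:02d}.sam" for i in range(1, 97)]  # hypothetical inputs

if __name__ == "__main__":
    with Pool(processes=procs) as pool:
        results = pool.map(count_umis_for_cell, cell_sams)
```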