Commit 02da4dce authored by yy1533

📖 🐾

parent 24bf998c
## Overview
This is `celseq2`, a Python framework for generating the UMI count matrix
from [**CEL-Seq2**]( sequencing data.
## What is `celseq2`, actually?
1. A bioinformatics pipeline that is seamless, parallelized, and easily
extended to run on a computing cluster. It reduces the burden of creating a
workflow and speeds up data digestion.
2. A list of independent tools that are robust and computationally efficient.
A pipeline is not always what is needed. The `celseq2` package provides a list
of versatile, stand-alone shell commands for independent tasks; for example,
`count-umi` quantifies the UMIs present in the SAM/BAM file of single
from CEL-Seq2 [^Hashimshony2016] sequencing data. We believe data digestion
should be automated, and done in a manner that is not just computationally
efficient but also user-friendly and developer-friendly.
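
To make the term "UMI count matrix" concrete, here is a minimal Python sketch of the general idea: for each (cell, gene) pair, count the number of distinct UMIs observed. This is not the `celseq2` implementation. The function name `count_umis`, the record layout, and the toy data are assumptions made purely for illustration.

```python
from collections import defaultdict


def count_umis(records):
    """Count distinct UMIs per (cell, gene) pair.

    `records` is an iterable of (cell, gene, umi) tuples. This layout is a
    hypothetical stand-in for aligned reads; it is not the celseq2 data model.
    """
    seen = defaultdict(set)
    for cell, gene, umi in records:
        # Reads sharing the same UMI collapse into a single set entry.
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Toy input: the second record repeats the first and is counted only once.
records = [
    ("cell_1", "GeneA", "AACGTT"),
    ("cell_1", "GeneA", "AACGTT"),
    ("cell_1", "GeneA", "GGCATA"),
    ("cell_2", "GeneA", "AACGTT"),
    ("cell_2", "GeneB", "TTTGCA"),
]
counts = count_umis(records)
for (cell, gene), n in sorted(counts.items()):
    print(cell, gene, n)
```

Using a set per (cell, gene) pair means duplicated reads carrying the same UMI contribute one count, which is the point of UMI-based quantification.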
## Install `celseq2`
``` bash
git clone
cd celseq2
pip install ./
```
## Quick Start
Running the `celseq2` pipeline is as easy as 1-2-3, and here is a quick-start
tutorial based on an arbitrary example. Suppose the user performed CEL-Seq2 and
got samples designed as shown in the figure below.
Running the `celseq2` pipeline is as easy as 1-2-3. Below is a visualization of
the experiment design behind the
[sample sheet](
used as an example in the previous pipeline.
The user had two biological samples which could come from two different
experiments, two time-points, two types of tissues, or even two labs. They were
denoted as squares and circles, respectively. Each sample had 9 cells.
In principle, the user would expect one UMI count matrix per sample as the
final output, which means two UMI count matrices in total in this example.
The user had two biological samples with 8 and 96 cells respectively, which could come
from two time-points, two tissues, or even two labs. Samples were marked with
two different Illumina sequencing barcodes (blue and orange dots), mixed
together, and subsequently sequenced in the same lane, which finally resulted
in two FASTQ files.
During the CEL-Seq2 experiment, all cells were placed in one 96-well cell plate.
They were labeled with the same sequencing barcode (shown as the orange plate),
but each cell was labeled with its own CEL-Seq2 cell barcode, so that all of them
could be sequenced together without losing their identities. In detail, the
nine cells from Experiment-1 were labeled with CEL-Seq2 cell barcodes indexed
from 1 to 9, while the other nine cells from Experiment-2 were
labeled with cell barcodes 10 to 18.
By running the `celseq2` pipeline on them, the user would get a
UMI count matrix for each of the two plates.
Finally, the library was distributed across two lanes (purple and dark gray bars)
of a sequencer and sequenced, which resulted in two sets of CEL-Seq2 data (per
lane per sequencing barcode).
### Step-1: Specify Configuration of Workflow
What the `celseq2` pipeline does for the user is generate a UMI count
matrix per experiment, with the two sets of CEL-Seq2 data as input.
### Step-1: Specify Global Configuration of Workflow
Run the `new-configuration-file` command to initiate a configuration file (YAML
format), which specifies the details of the CEL-Seq2 technique the user performed,
......@@ -47,62 +52,44 @@ e.g. the cell barcodes sequence dictionary, and transcriptome annotation
information for quantifying UMIs, etc.
This configuration can be shared and reused as long as the user is
running the pipeline on the same species.
``` bash
new-configuration-file -o /path/to/wonderful_CEL-Seq2_config.yaml
```
Example of configuration is [here](
Example of CEL-Seq2 cell barcodes sequence dictionary is [here](
Example of cell barcodes sequence dictionary is [here](
Read ["Setup Configuration"](
for full instructions.
### Step-2: Define Experiment Table
Run the `new-experiment-table` command to initiate a table (space/tab separated
file format) specifying the experiment layout.
``` bash
new-experiment-table -o /path/to/wonderful_experiment_table.txt
```
Fill information into the generated experiment table file.
:warning: Note: column names are NOT allowed to be modified.
:warning: Note: each slot cannot contain any space.
Fill information into the generated experiment table file row by row.
The content of the experiment table in this example could be:
|----------------------- |--------------------- |------------------------- |------------------------- |
| wonderful_experiment1 | 1,8,2-7 | path/to/sampleX-lane2-R1.fastq.gz | path/to/sampleX-lane2-R2.fastq.gz |
| wonderful_experiment2 | 1-96 | path/to/sampleY-lane2-R1.fastq.gz | path/to/sampleY-lane2-R2.fastq.gz |
Each row records one pair of FASTQ reads.
To ease the pain of manually specifying `CELL_BARCODES_INDEX`, `celseq2`
accepts several input styles. The following ways of specifying the barcodes
indexed from 1 to 8 that are present in experiment-1 are all allowed.
1. `1-8`: the most straightforward way.
2. `1,8,2-7` or `1,8,7-2`: a combination of individual and range assignments.
3. `8,1,7-2,6`: redundancy is tolerated.
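
As an illustration of how such specifications can be normalized, here is a minimal Python sketch that expands each style listed above into the same sorted list of indices. It is not the parser used inside `celseq2`; the function name `parse_barcode_index` is hypothetical.

```python
def parse_barcode_index(spec):
    """Expand a spec such as '1,8,2-7' into a sorted list of unique indices.

    Hypothetical helper for illustration only; not part of the celseq2 API.
    """
    indices = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            # A range may be written in either direction, e.g. '2-7' or '7-2'.
            start, end = (int(part) for part in token.split("-", 1))
            low, high = min(start, end), max(start, end)
            indices.update(range(low, high + 1))
        else:
            indices.add(int(token))
    # The set removes redundant entries such as the extra 6 in '8,1,7-2,6'.
    return sorted(indices)


for spec in ("1-8", "1,8,2-7", "1,8,7-2", "8,1,7-2,6"):
    print(spec, "->", parse_barcode_index(spec))
```

Each of the four specifications expands to the indices 1 through 8, which is why they can be used interchangeably in the experiment table.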
| wonderful_experiment1 | 1-9 | path/to/lane1-R1.fastq.gz | path/to/lane1-R2.fastq.gz |
| wonderful_experiment2 | 10-18 | path/to/lane1-R1.fastq.gz | path/to/lane1-R2.fastq.gz |
| wonderful_experiment1 | 1-9 | path/to/lane2-R1.fastq.gz | path/to/lane2-R2.fastq.gz |
| wonderful_experiment2 | 10-18 | path/to/lane2-R1.fastq.gz | path/to/lane2-R2.fastq.gz |
Read [Experiment Table Specification]( for further details when more complex
experiment designs arise.
Read ["Experiment Table Specification"](
for full instructions on handling more complex experiment designs.
### Step-3: Run Pipeline of `celseq2`
Preview how many tasks will be performed before actually executing the pipeline:
``` bash
celseq2 --config-file /path/to/wonderful_CEL-Seq2_config.yaml \
    --experiment-table /path/to/wonderful_experiment_table.txt \
    --output-dir /path/to/result_dir \
```
Launch the pipeline on a computing node, performing 10 tasks in parallel.
``` bash
......@@ -112,23 +99,13 @@ celseq2 --config-file /path/to/wonderful_CEL-Seq2_config.yaml \
    -j 10
```
Alternatively, it is straightforward to run the `celseq2` pipeline by
submitting jobs to a cluster, as `celseq2` is built on top of `snakemake`, a
powerful workflow management framework. For example, on the login node of a
server, the user could run the following command to submit jobs to computing
nodes. Here it submits 10 jobs in parallel, each requesting 50G of memory.
``` bash
celseq2 --config-file /path/to/wonderful_CEL-Seq2_config.yaml \
    --experiment-table /path/to/wonderful_experiment_table.txt \
    --output-dir /path/to/result_dir \
    -j 10 \
    --cluster "qsub -cwd -j y -l h_vmem=50G" &
```
Read ["Launch Pipeline"](
for full instructions on how to submit jobs to a cluster or preview how many
tasks are going to be scheduled.
## Result
All results are saved under the <kbd>/path/to/result_dir</kbd> that the user
specified, which has the following folder structure:
├── annotation
......@@ -187,3 +164,11 @@ Alternatively, user can gzip FASTQ and perform SAM2BAM:
celseq2-slim --project-dir /path/to/result_dir -n
celseq2-slim --project-dir /path/to/result_dir
[^Hashimshony2016]: Hashimshony, T. et al. CEL-Seq2: sensitive highly-multiplexed
single-cell RNA-Seq. Genome Biol. 17, 77 (2016).
......@@ -28,6 +28,7 @@ markdown_extensions:
- toc(permalink=true)
- admonition
- codehilite
- footnotes
- Home: ''