Commit 5216ac07 authored by yy1533

1) 📖 🐾 index.md==readme.md; 2) 🐶 setup.py

parent 02da4dce
# Overview
## Overview
This is `celseq2`, a Python package for generating the UMI count matrix
from [**CEL-Seq2**](https://doi.org/10.1186/s13059-016-0938-8) sequencing data.
This is `celseq2`, a Python framework for generating the UMI count matrix
from CEL-Seq2 [^Hashimshony2016] sequencing data. We believe data digestion
should be automated, and it should be done in a manner that is not just
computationally efficient, but also user-friendly and developer-friendly.
# What is `celseq2`, actually?
1. A bioinformatics pipeline which is seamless, parallelized, and easily
extended to run on a computing cluster. It reduces the burden of creating a
workflow and speeds up data digestion.
2. A list of independent tools which are robust and computationally efficient.
A pipeline is not always what is needed. Package `celseq2` provides a list of
versatile and stand-alone bash commands to address independent tasks, for example,
`count-umi` is able to quantify the UMIs present in the SAM/BAM file of a single
cell.
# Dependencies
**Python 3**
Visit <https://conda.io/miniconda.html> to find suitable scripts
for your platform to install Python 3.
**`snakemake`**
Visit <http://snakemake.readthedocs.io/en/stable/getting_started/installation.html>
to install `snakemake`.
```
conda install -c bioconda snakemake
```
# Install `celseq2`
## Install `celseq2`
``` bash
git clone git@gitlab.com:Puriney/celseq2.git
git clone git@github.com:yanailab/celseq2.git
cd celseq2
pip install ./
```
# Quick Start
## Quick Start
Running `celseq2` pipeline is easy as 1-2-3, and here is a quick start tutorial
based on an arbitrary example. Suppose user performed CEL-Seq2 and got samples
designed in the way shown as figure below.
Running the `celseq2` pipeline is as easy as 1-2-3. Below is a visualization of
the experiment design, which is the same as the one described by the
[sample sheet](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/sample_sheet_example.txt)
used as an example in the previous generation of the pipeline ([CEL-Seq-pipeline](https://github.com/yanailab/CEL-Seq-pipeline)).
![experiment-old-pipeline-visualize](https://i.imgur.com/9ZEOnUj.png)
![experiment-old-pipeline-visualize](https://i.imgur.com/ntJVTYM.gif)
This was the visualization of the experiment design as shown as in the [sample sheet](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/sample_sheet_example.txt)
in previous pipeline.
The user had two biological samples which could come from two different
experiments, two time-points, two types of tissues, or even two labs. They were
denoted as squares and circles, respectively. Each sample had 9 cells.
The user had two biological samples which could come
from two time-points, two tissues, or even two labs.
They were denoted as squares and circles, respectively.
Each sample had 9 cells for example.
In principle, what the user would expect as final output was one UMI count matrix
for each sample, which meant two UMI matrices in total in this example.
Though they were marked with the same Illumina sequencing barcodes and sequenced together,
user was able to distinguish the source of cells because each cell had its own cell barcode.
Cells with barcode 1-9 came from sample-1 and cells with barcode 10-18 came from sample-2.
During the CEL-Seq2 experiment, all cells were placed in one 96-well cell plate.
They were labeled with same sequencing barcodes (shown as orange plate)
but each cell was labeled with its own CEL-Seq2 cell barcode, so that all of them
could be sequenced together without losing identities. In details, the
nine cells from Experiment-1 were labeled with CEL-Seq2 cell barcodes indexed
from 1 to 9, respectively, while the other nine cells from Experiment-2 were
labeled with cell barcodes 10 to 18.
By running the pipeline of `celseq2` with them, the users would get
a UMI count matrix for each of the two plates.
Finally the library was distributed in two lanes (purple and dark gray bar) of a
sequencer, and got sequenced, which resulted in two sets of CEL-Seq2 data (per
lane per sequencing barcode).
## Step-1: Specify Configuration of Workflow
What the pipeline of `celseq2` would do for the user is generate a UMI count
matrix per experiment, with the two sets of CEL-Seq2 data as input.
### Step-1: Specify Global Configuration of Workflow
Run `new-configuration-file` command to initiate configuration file (YAML
format), which specifies the details of CEL-Seq2 techniques the users perform,
......@@ -69,62 +52,43 @@ e.g. the cell barcodes sequence dictionary, and transcriptome annotation
information for quantifying UMIs, etc.
This configuration can be shared and used more than once as long as user is
running pipeline on same specie.
running pipeline on same species.
``` bash
new-configuration-file -o /path/to/wonderful_CEL-Seq2_config.yaml
```
Example of configuration is [here](https://gitlab.com/Puriney/celseq2/blob/master/example/config.yaml).
Example of configuration is [here](https://github.com/yanailab/celseq2/blob/master/example/config.yaml).
Example of CEL-Seq2 cell barcodes sequence dictionary is [here](https://gitlab.com/yanailab/celseq2/blob/master/example/barcodes_cel-seq_umis96.tab).
Example of cell barcodes sequence dictionary is [here](https://gitlab.com/Puriney/celseq2/blob/master/example/barcodes_cel-seq_umis96.tab)
Read ["Setup Configuration"](https://puriney.github.io/celseq2/user_guide/setup_config/)
for full instructions.
## Step-2: Define Experiment Table
### Step-2: Define Experiment Table
Run `new-experiment-table` command to initiate table (space/tab separated
Run `new-experiment-table` command to initiate a table (space/tab separated
file format) specifying the experiment layout.
``` bash
new-experiment-table -o /path/to/wonderful_experiment_table.txt
```
Fill information into the generated experiment table file.
:warning: Note: column names are NOT allowed to be modified.
Fill information into the generated experiment table file row by row.
:warning: Note: each slot cannot contain any space.
The content of experiment table in this example is:
The content of experiment table in this example could be:
| SAMPLE_NAME | CELL_BARCODES_INDEX | R1 | R2 |
|----------------------- |--------------------- |------------------------- |------------------------- |
| wonderful_experiment1 | 1-9 | path/to/lane1-R1.fastq.gz | path/to/lane1-R2.fastq.gz |
| wonderful_experiment2 | 10-18 | path/to/lane1-R1.fastq.gz | path/to/lane1-R2.fastq.gz |
| wonderful_experiment1 | 1,9,2-8 | path/to/lane2-R1.fastq.gz | path/to/lane2-R2.fastq.gz |
| wonderful_experiment2 | 18-10 | path/to/lane2-R1.fastq.gz | path/to/lane2-R2.fastq.gz |
| wonderful_experiment1 | 1-9 | path/to/lane2-R1.fastq.gz | path/to/lane2-R2.fastq.gz |
| wonderful_experiment2 | 10-18 | path/to/lane2-R1.fastq.gz | path/to/lane2-R2.fastq.gz |
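The constraints on the experiment table (fixed column names, whitespace-separated slots, no spaces inside a slot) can be checked before launching the pipeline. The following is a minimal sketch, not part of `celseq2`: the function name `read_experiment_table` is hypothetical, and it assumes the header row lists the four columns in the order shown in the table above.

```python
import io

# Column names as shown in the example table; assumed to appear in this order.
EXPECTED_COLUMNS = ["SAMPLE_NAME", "CELL_BARCODES_INDEX", "R1", "R2"]


def read_experiment_table(handle):
    """Parse a space/tab separated experiment table into a list of dicts.

    Hypothetical helper for illustration only; it is not the parser used
    by celseq2 itself.
    """
    lines = [line.strip() for line in handle if line.strip()]
    if not lines:
        raise ValueError("experiment table is empty")
    header = lines[0].split()
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected column names: {header}")
    rows = []
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split()
        # A slot containing a space would split into extra fields here.
        if len(fields) != len(EXPECTED_COLUMNS):
            raise ValueError(
                f"line {number}: expected {len(EXPECTED_COLUMNS)} slots, "
                f"got {len(fields)}"
            )
        rows.append(dict(zip(EXPECTED_COLUMNS, fields)))
    return rows


if __name__ == "__main__":
    example = io.StringIO(
        "SAMPLE_NAME CELL_BARCODES_INDEX R1 R2\n"
        "wonderful_experiment1 1-9 path/to/lane1-R1.fastq.gz path/to/lane1-R2.fastq.gz\n"
        "wonderful_experiment2 10-18 path/to/lane1-R1.fastq.gz path/to/lane1-R2.fastq.gz\n"
    )
    for row in read_experiment_table(example):
        print(row["SAMPLE_NAME"], row["CELL_BARCODES_INDEX"])
```

Because the slots are split on any whitespace, a path or sample name containing a space shows up as a row with too many slots, which is why the check on the number of fields is enough to catch it.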
To ease the pain of manually specifying `CELL_BARCODES_INDEX`, `celseq2`
recognizes human inputs in various ways. Examples of specifications of barcodes
indexed from 1 to 9 that are present in experiment-1 are listed and are all allowed.
Read ["Experiment Table Specification"](https://puriney.github.io/celseq2/user_guide/experiment_table/)
for full instructions when more complex experiment designs are involved.
1. `1-9`: the most straightforward way.
2. `1,9,2-8` or `1,9,8-2`: combination of individual and range assignment.
3. `9,1,8-2,6`: redundancy is tolerated.
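How such a specification expands into a set of barcode indices can be sketched in a few lines. This is an illustration of the accepted notation (single values, ranges in either direction, repeated indices), not the parser shipped with `celseq2`; the function name `parse_barcode_index` is hypothetical.

```python
def parse_barcode_index(spec):
    """Expand a spec such as '1,9,2-8' into a sorted list of unique indices.

    Hypothetical helper for illustration; the real celseq2 parser may differ.
    """
    indices = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            start, end = (int(part) for part in token.split("-", 1))
            # Accept ranges written in either direction, e.g. 18-10.
            low, high = min(start, end), max(start, end)
            indices.update(range(low, high + 1))
        else:
            indices.add(int(token))
    # Using a set means repeated indices collapse silently.
    return sorted(indices)


if __name__ == "__main__":
    for spec in ("1-9", "1,9,2-8", "18-10"):
        print(spec, "->", parse_barcode_index(spec))
```

Collecting the indices in a set is what makes redundant input harmless: an index listed twice, or covered by both a range and a single value, is counted once.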
Read [Experiment Table Specification](https://gitlab.com/Puriney/celseq2/wikis/Examples) for further details when more complexed
experiment design happens.
## Step-3: Run Pipeline of `celseq2`
Examine how many tasks to be performed before actually executing the pipeline:
``` bash
celseq2 --config-file /path/to/wonderful_CEL-Seq2_config.yaml \
--experiment-table /path/to/wonderful_experiment_table.txt \
--output-dir /path/to/result_dir \
--dryrun
```
### Step-3: Run Pipeline of `celseq2`
Launch pipeline in the computing node which performs 10 tasks in parallel.
......@@ -135,27 +99,18 @@ celseq2 --config-file /path/to/wonderful_CEL-Seq2_config.yaml \
-j 10
```
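When the pipeline is launched from a Python script rather than typed by hand, the command line can be assembled programmatically. The sketch below only builds the argument list from the flags that appear in this document (`--config-file`, `--experiment-table`, `--output-dir`, `-j`, `--dryrun`); the helper name `build_celseq2_command` is hypothetical, and nothing is executed.

```python
import shlex


def build_celseq2_command(config_file, experiment_table, output_dir,
                          jobs=None, dryrun=False):
    """Return the celseq2 invocation as a list of arguments.

    Hypothetical helper: it only assembles flags shown in this document.
    """
    command = [
        "celseq2",
        "--config-file", str(config_file),
        "--experiment-table", str(experiment_table),
        "--output-dir", str(output_dir),
    ]
    if jobs is not None:
        command += ["-j", str(jobs)]
    if dryrun:
        command.append("--dryrun")
    return command


if __name__ == "__main__":
    command = build_celseq2_command(
        "/path/to/wonderful_CEL-Seq2_config.yaml",
        "/path/to/wonderful_experiment_table.txt",
        "/path/to/result_dir",
        jobs=10,
    )
    # Print a shell-safe rendering of the command for inspection.
    print(" ".join(shlex.quote(argument) for argument in command))
```

The returned list can be passed to `subprocess.run(command, check=True)` once `celseq2` is installed and the paths are real.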
Alternatively, it is straightforward to run the pipeline of `celseq2` by
submitting jobs to cluster, as `celseq2` is built on top of `snakemake` which is
a powerful workflow management framework. For example, in login node on server,
user could run the following command to submit jobs to computing nodes. Here it
submits 10 jobs in parallel with 50G of memory requested by each.
Read ["Launch Pipeline"](https://puriney.github.io/celseq2/user_guide/launch_pipeline/)
for full instructions to see how to submit jobs to cluster, or preview how many
tasks are going to be scheduled.
``` bash
celseq2 --config-file /path/to/wonderful_CEL-Seq2_config.yaml \
--experiment-table /path/to/wonderful_experiment_table.txt \
--output-dir /path/to/result_dir \
-j 10 \
--cluster "qsub -cwd -j y -l h_vmem=50G" &
```
### Results
# Result
All the results are saved under <kbd>/path/to/result_dir</kbd> user specified,
which has folder structure:
All the results are saved under <kbd>/path/to/result_dir</kbd> that user
specified, which has folder structure:
```
├── annotation
├── expr # <== Here is the UMI count matrix
├── expr # <== Here saves all the UMI count matrices
├── input
├── small_diagnose
├── small_fq
......@@ -171,40 +126,36 @@ saved in both CSV and HDF5 format and exported to <kbd>expr/</kbd> folder.
```
expr/
├── wonderful_experiment1
│   ├── expr.csv # <== UMI count matrix (CSV format) for blue plate
│   ├── expr.csv # <== UMI count matrix for cells denoted as squares
│   ├── expr.h5
│   └── item-1
│   ├── item-1
│   │   ├── expr.csv
│   │   └── expr.h5
│   └── item-3
│      ├── expr.csv
│      └── expr.h5
└── wonderful_experiment2
├── expr.csv # <== UMI count matrix (CSV format) for orange plate
├── expr.csv # <== UMI count matrix for cells denoted as circles
├── expr.h5
└── item-2
   ├── item-2
   │   ├── expr.csv
   │   └── expr.h5
└── item-4
   ├── expr.csv
   └── expr.h5
```
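Given the folder layout above, the per-sample and per-item matrices can be located with the standard library. This is a minimal sketch that relies only on the directory and file names shown in the tree (`expr/`, `expr.csv`, `item-X`); the function names are hypothetical and the content of the CSV files is not interpreted.

```python
from pathlib import Path


def find_count_matrices(result_dir):
    """Map each sample name to its top-level expr.csv under <result_dir>/expr."""
    expr_dir = Path(result_dir) / "expr"
    matrices = {}
    for sample_dir in sorted(p for p in expr_dir.iterdir() if p.is_dir()):
        csv_path = sample_dir / "expr.csv"
        if csv_path.is_file():
            matrices[sample_dir.name] = csv_path
    return matrices


def find_item_matrices(result_dir, sample_name):
    """List the expr.csv files of the item-X sub-folders of one sample."""
    sample_dir = Path(result_dir) / "expr" / sample_name
    return sorted(sample_dir.glob("item-*/expr.csv"))
```

For example, `find_count_matrices("/path/to/result_dir")` would return one entry per sample folder, and `find_item_matrices("/path/to/result_dir", "wonderful_experiment1")` would return the matrices of its `item-X` sub-folders.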
Results of <kbd>item-X</kbd> are useful when user has FASTQ files from multiple
lanes, or technical/biological replicates. Read [Real Example](https://gitlab.com/Puriney/celseq2/wikis/Examples) for further
details about how to specify experiment table and fetch results when more
complexed (or real) experiment design happens.
Results of <kbd>item-X</kbd> are useful to assess technical variation when FASTQ
files from multiple lanes, or technical/biological replicates, are present.
## About
# Storage management
Authors: See <https://github.com/yanailab/celseq2/blob/master/AUTHORS>
To reduce the storage of project, it is suggested to get rid of intermediate
files, in particular FASTQ and SAM files.
License: See <https://github.com/yanailab/celseq2/blob/master/LICENSE>
Remove generated FASTQ and SAM files:
```
celseq2 --config-file /path/to/wonderful_CEL-Seq2_config.yaml \
--experiment-table /path/to/wonderful_experiment_table.txt \
--output-dir /path/to/result_dir \
-j 10 clean_FQ_SAM
```
Alternatively, user can gzip FASTQ and perform SAM2BAM:
```
celseq2-slim --project-dir /path/to/result_dir -n
celseq2-slim --project-dir /path/to/result_dir
```
[^Hashimshony2016]: Hashimshony, T. et al. CEL-Seq2: sensitive
highly-multiplexed single-cell RNA-Seq. Genome Biol. 17, 77 (2016).
<https://doi.org/10.1186/s13059-016-0938-8>
......@@ -16,9 +16,9 @@ pip install ./
## Quick Start
Running `celseq2` pipeline is as easy as 1-2-3. Below is the visualization of
the experiment design behind the
the experiment design as same as the
[sample sheet](https://github.com/yanailab/CEL-Seq-pipeline/blob/133912cd4ceb20af0c67627ab883dfce8b9668df/sample_sheet_example.txt)
used in previous pipeline as example.
used in last generation of the pipeline ([CEL-Seq-pipeline](https://github.com/yanailab/CEL-Seq-pipeline)) as example.
![experiment-old-pipeline-visualize](https://i.imgur.com/ntJVTYM.gif)
......@@ -101,15 +101,16 @@ celseq2 --config-file /path/to/wonderful_CEL-Seq2_config.yaml \
Read ["Launch Pipeline"](https://puriney.github.io/celseq2/user_guide/launch_pipeline/)
for full instructions to see how to submit jobs to cluster, or preview how many
tasks are going to scheduled.
tasks are going to be scheduled.
### Results
## Result
All the results are saved under <kbd>/path/to/result_dir</kbd> that user
specified, which has folder structure:
```
├── annotation
├── expr # <== Here is the UMI count matrix
├── expr # <== Here saves all the UMI count matrices
├── input
├── small_diagnose
├── small_fq
......@@ -125,48 +126,34 @@ saved in both CSV and HDF5 format and exported to <kbd>expr/</kbd> folder.
```
expr/
├── wonderful_experiment1
│   ├── expr.csv # <== UMI count matrix (CSV format) for blue plate
│   ├── expr.csv # <== UMI count matrix for cells denoted as squares
│   ├── expr.h5
│   └── item-1
│   ├── item-1
│   │   ├── expr.csv
│   │   └── expr.h5
│   └── item-3
│      ├── expr.csv
│      └── expr.h5
└── wonderful_experiment2
├── expr.csv # <== UMI count matrix (CSV format) for orange plate
├── expr.csv # <== UMI count matrix for cells denoted as circles
├── expr.h5
└── item-2
   ├── item-2
   │   ├── expr.csv
   │   └── expr.h5
└── item-4
   ├── expr.csv
   └── expr.h5
```
Results of <kbd>item-X</kbd> are useful when user has FASTQ files from multiple
lanes, or technical/biological replicates. Read [Real Example](https://gitlab.com/Puriney/celseq2/wikis/Examples) for further
details about how to specify experiment table and fetch results when more
complexed (or real) experiment design happens.
## Storage management
To reduce the storage of project, it is suggested to get rid of intermediate
files, in particular FASTQ and SAM files.
Results of <kbd>item-X</kbd> are useful to assess technical variation when FASTQ
files from multiple lanes, or technical/biological replicates, are present.
Remove generated FASTQ and SAM files:
## About
```
celseq2 --config-file /path/to/wonderful_CEL-Seq2_config.yaml \
--experiment-table /path/to/wonderful_experiment_table.txt \
--output-dir /path/to/result_dir \
-j 10 clean_FQ_SAM
```
Alternatively, user can gzip FASTQ and perform SAM2BAM:
```
celseq2-slim --project-dir /path/to/result_dir -n
celseq2-slim --project-dir /path/to/result_dir
```
Authors: See <https://github.com/yanailab/celseq2/blob/master/AUTHORS>
License: See <https://github.com/yanailab/celseq2/blob/master/LICENSE>
---
[^Hashimshony2016]: Hashimshony, T. et al. CEL-Seq2: sensitive
highly-multiplexed single-cell RNA-Seq. Genome Biol. 17, 77 (2016).
......
......@@ -82,18 +82,17 @@ setup(
# See https://pypi.python.org/pypi?%3Aaction=list_classifiers
classifiers=[
'Development Status :: 3 - Alpha',
'Development Status :: 4 - Beta',
'Intended Audience :: Developers',
'Intended Audience :: Science/Research',
'Topic :: Scientific/Engineering :: Bio-Informatics',
'License :: Other/Proprietary License',
'Programming Language :: Python :: 3.5',
'License :: OSI Approved :: GNU General Public License v3 (GPLv3)',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3 :: Only',
],
keywords='single-cell gene expression pipeline processing',
keywords='CEL-Seq2 single-cell RNA-seq expression pipeline processing',
# packages=find_packages(exclude=['contrib', 'docs', 'tests*']),
packages=find_packages(exclude=['docs', 'tests*']),
......