Commit d588c5b8 authored by yy1533

💪 cell-barcode-dictionary file is allowed to be space separated, which improves user-friendliness, and update readme 📖

📖 🐾 update readme and add example of cell barcodes dictionary

💪 cell-barcode-dictionary file is allowed to be space separated, which improves user-friendliness
parent 4140a140
......@@ -16,19 +16,21 @@ cell.
# Dependencies
## Install Python 3 by conda
**Python 3**
Visit <https://conda.io/miniconda.html> to find suitable scripts for your platform to install Python 3.
Visit <https://conda.io/miniconda.html> to find suitable scripts
for your platform to install Python 3.
## Install `snakemake`
**`snakemake`**
Visit <http://snakemake.readthedocs.io/en/stable/getting_started/installation.html> to install `snakemake`.
Visit <http://snakemake.readthedocs.io/en/stable/getting_started/installation.html>
to install `snakemake`.
```
conda install -c bioconda snakemake
```
# Install
# Install `celseq2`
``` bash
git clone git@gitlab.com:Puriney/celseq2.git
......@@ -57,7 +59,8 @@ UMI count matrix for each of the two plates.
Run `new-configuration-file` command to initiate configuration file (YAML
format), which specifies the details of CEL-Seq2 techniques the users perform,
and genome annotation information, etc.
e.g. the cell barcodes sequence dictionary, and transcriptome annotation
information for quantifying UMIs, etc.
This configuration can be shared and reused as long as the user is
running the pipeline on the same species.
......@@ -68,10 +71,12 @@ new-configuration-file -o /path/to/wonderful_CEL-Seq2_config.yaml
Example of configuration is [here](https://gitlab.com/Puriney/celseq2/blob/master/example/config.yaml).
Example of cell barcodes sequence dictionary is [here](https://gitlab.com/Puriney/celseq2/blob/master/example/barcodes_cel-seq_umis96.tab)
## Step-2: Define Experiment Table
Run `new-experiment-table` command to initiate table file (space/tab separated
file) to specify the experiment layout.
Run `new-experiment-table` command to initiate table (space/tab separated
file format) specifying the experiment layout.
``` bash
new-experiment-table -o /path/to/wonderful_experiment_table.txt
......@@ -79,16 +84,16 @@ new-experiment-table -o /path/to/wonderful_experiment_table.txt
Fill information into the generated experiment table file.
:warning: Note: Column names cannot be changed at all.
:warning: Note: column names are NOT allowed to be modified.
:warning: Note: Each slot cannot contain any space.
:warning: Note: each slot cannot contain any space.
The content of experiment table in this example could be:
| SAMPLE_NAME | CELL_BARCODES_INDEX | R1 | R2 |
|----------------------- |--------------------- |------------------------- |------------------------- |
| wonderful_experiment1 | 1,8,2-7 | path/to/x-1-r1.fastq.gz | path/to/x-1-r2.fastq.gz |
| wonderful_experiment2 | 1-96 | path/to/y-1-r1.fastq.gz | path/to/y-2-r2.fastq.gz |
| wonderful_experiment1 | 1,8,2-7 | path/to/sampleX-lane2-R1.fastq.gz | path/to/sampleX-lane2-R2.fastq.gz |
| wonderful_experiment2 | 1-96 | path/to/sampleY-lane2-R1.fastq.gz | path/to/sampleY-lane2-R2.fastq.gz |
Each row records one pair of FASTQ reads.
......@@ -100,25 +105,34 @@ indexed from 1 to 8 that present in experiment-1 are listed and are all allowed.
2. `1,8,2-7` or `1,8,7-2`: combination of individual and range assignment.
3. `8,1,7-2,6`: redundancy is tolerated.
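
As a rough illustration of the index syntax listed above, the following sketch expands a `CELL_BARCODES_INDEX` string into a sorted list of barcode IDs. The function name `expand_barcode_index` is hypothetical and is not taken from the `celseq2` code base; it only mirrors the documented behavior (individual IDs, ranges written in either direction, and tolerated redundancy).

```python
def expand_barcode_index(spec):
    """Expand a spec such as '1,8,2-7' into a sorted list of unique IDs.

    Hypothetical helper for illustration; not part of celseq2 itself.
    """
    ids = set()
    for token in spec.split(','):
        token = token.strip()
        if '-' in token:
            # A range may be written low-high ('2-7') or high-low ('7-2').
            start, end = (int(x) for x in token.split('-'))
            low, high = sorted((start, end))
            ids.update(range(low, high + 1))
        else:
            ids.add(int(token))
    # Using a set means repeated IDs (e.g. '7-2,6') collapse silently.
    return sorted(ids)


print(expand_barcode_index('1,8,2-7'))
print(expand_barcode_index('8,1,7-2,6'))
```

Under these assumptions, all of the forms shown in the list resolve to the same eight barcode IDs, and `1-96` resolves to the full plate.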
Read [Experiment Table Specification]() for further details when a more complex
experiment design is needed.
## Step-3: Run Pipeline of `celseq2`
Examine how many tasks will be performed before actually executing the pipeline:
``` bash
celseq2 --configfile /path/to/wonderful_CEL-Seq2_config.yaml --dryrun
celseq2 --config-file /path/to/wonderful_CEL-Seq2_config.yaml \
--experiment-table /path/to/wonderful_experiment_table.txt \
--output-dir /path/to/result_dir \
--dryrun
```
Launch pipeline:
Launch the pipeline on a computing node, running 10 tasks in parallel:
``` bash
celseq2 --configfile /path/to/wonderful_CEL-Seq2_config.yaml
celseq2 --config-file /path/to/wonderful_CEL-Seq2_config.yaml \
--experiment-table /path/to/wonderful_experiment_table.txt \
--output-dir /path/to/result_dir \
-j 10
```
Alternatively, it is straightforward to run the `celseq2` pipeline by
submitting jobs to a cluster, as `celseq2` is built on top of `snakemake`, which is
a powerful workflow management framework. For example, on the login node of a server,
the user could run the following command to submit jobs to computing nodes. Here it
submits 10 jobs in parallel with total maximum memory 50G requested.
submits 10 jobs in parallel with 50G of memory requested by each.
``` bash
celseq2 --config-file /path/to/wonderful_CEL-Seq2_config.yaml \
......@@ -129,19 +143,40 @@ celseq2 --config-file /path/to/wonderful_CEL-Seq2_config.yaml \
```
# Result
All results are saved under the user-specified <kbd>/path/to/result_dir</kbd>,
which has the following folder structure:
```
├── annotation
├── expr # <== Here is the UMI count matrix
├── input
├── small_diagnose
├── small_fq
├── small_log
├── small_sam
├── small_umi_count
└── small_umi_set
```
In particular, the **UMI count matrix** for each of the experiments is
saved in both CSV and HDF5 formats and exported to the <kbd>expr/</kbd> folder.
```
expr/
├── E1
│   ├── expr.csv
├── wonderful_experiment1
│   ├── expr.csv # <== UMI count matrix (CSV format) for blue plate
│   ├── expr.h5
│   └── item-1
│      ├── expr.csv
│      └── expr.h5
└── E2
├── expr.csv
└── wonderful_experiment2
├── expr.csv # <== UMI count matrix (CSV format) for orange plate
├── expr.h5
└── item-2
   ├── expr.csv
   └── expr.h5
```
Results of <kbd>item-X</kbd> are useful when the user has FASTQ files from multiple
lanes. Read [Real Example]() for further details about how to specify the experiment
table and fetch results for a more complex (or real) experiment design.
......@@ -2,7 +2,6 @@
# coding: utf-8
from collections import Counter
import csv
import argparse
from celseq2.helper import filehandle_fastq_gz, print_logger
......@@ -32,9 +31,9 @@ def bc_dict_seq2id(bc_index_fpath):
    """ dict[barcode_seq] = barcode_id """
    out = dict()
    with open(bc_index_fpath, 'rt') as fin:
        freader = csv.reader(fin, delimiter='\t')
        next(freader)
        out = {row[1]: int(row[0]) for row in freader}
        next(fin)
        out = map(lambda row: row.strip().split(), fin)
        out = {row[1]: int(row[0]) for row in out}
    return(out)
......@@ -42,9 +41,9 @@ def bc_dict_id2seq(bc_index_fpath):
    """ dict[barcode_id] = barcode_seq"""
    out = dict()
    with open(bc_index_fpath, 'rt') as fin:
        freader = csv.reader(fin, delimiter='\t')
        next(freader)
        out = {int(row[0]): row[1] for row in freader}
        next(fin)
        out = map(lambda row: row.strip().split(), fin)
        out = {int(row[0]): row[1] for row in out}
    return(out)
......
#barcode_id sequence
1 AGACTC
2 AGCTAG
3 AGCTCA
4 AGCTTC
5 CATGAG
6 CATGCA
7 CATGTC
8 CACTAG
9 CAGATC
10 TCACAG
11 AGGATC
12 AGTGCA
13 AGTGTC
14 TCCTAG
15 TCTGAG
16 TCTGCA
17 TCGAAG
18 TCGACA
19 TCGATC
20 GTACAG
21 GTACCA
22 GTACTC
23 GTCTAG
24 GTCTCA
25 GTTGCA
26 GTGACA
27 GTGATC
28 ACAGTG
29 ACCATG
30 ACTCTG
31 ACTCGA
32 ACGTAC
33 ACGTTG
34 ACGTGA
35 CTAGAC
36 CTAGTG
37 CTAGGA
38 CTCATG
39 CTCAGA
40 CTTCGA
41 CTGTAC
42 CTGTGA
43 TGAGAC
44 TGCAAC
45 TGCATG
46 TGCAGA
47 TGTCAC
48 TGTCGA
49 TGGTAC
50 GACATG
51 GATCAC
52 GATCTG
53 GATCGA
54 GAGTAC
55 AGACAG
56 AGACCA
57 AGTGAG
58 AGGAAG
59 AGGACA
60 CAACAG
61 CAACCA
62 CAACTC
63 CACTCA
64 CACTTC
65 CAGAAG
66 CAGACA
67 TCACCA
68 TCACTC
69 TCCTCA
70 TCCTTC
71 TCTGTC
72 GTCTTC
73 GTTGAG
74 GTTGTC
75 GTGAAG
76 ACAGAC
77 ACAGGA
78 ACCAAC
79 ACCAGA
80 ACTCAC
81 CTCAAC
82 CTTCAC
83 CTTCTG
84 CTGTTG
85 TGAGTG
86 TGAGGA
87 TGTCTG
88 TGGTTG
89 TGGTGA
90 GAAGAC
91 GAAGTG
92 GAAGGA
93 GACAAC
94 GACAGA
95 GAGTTG
96 GAGTGA
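
The whitespace-tolerant parsing that this commit introduces in `bc_dict_seq2id` can be sketched as follows against a dictionary file like the one above. This is a minimal, self-contained illustration rather than the project's code: the function name `parse_barcode_dict` is hypothetical, it reads from any iterable of lines instead of a file path, and skipping blank lines is an added assumption that the committed code does not make.

```python
import io


def parse_barcode_dict(lines):
    """Return {barcode_sequence: barcode_id} from an iterable of text lines.

    Hypothetical helper for illustration; the first line is treated as a
    header and skipped, as in the committed code.
    """
    rows = iter(lines)
    next(rows)  # skip the header line, e.g. '#barcode_id sequence'
    seq2id = {}
    for line in rows:
        # str.split() with no argument splits on any run of whitespace,
        # so tab-separated and space-separated files parse identically.
        fields = line.split()
        if not fields:
            continue  # assumption: silently ignore blank lines
        seq2id[fields[1]] = int(fields[0])
    return seq2id


tab_text = "#barcode_id\tsequence\n1\tAGACTC\n2\tAGCTAG\n"
space_text = "#barcode_id sequence\n1 AGACTC\n2   AGCTAG\n"

# Both layouts should yield the same mapping.
print(parse_barcode_dict(io.StringIO(tab_text)))
print(parse_barcode_dict(io.StringIO(space_text)))
```

The key design point is replacing `csv.reader(fin, delimiter='\t')` with a plain `split()`, which trades strict tab-delimited parsing for tolerance of mixed spaces and tabs.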