Commit aba9f9c8 authored by Felix Perez

Add ENCODE team's atac-seq pipeline (v2.2.2)

parent 727ffff7
Showing with 4613 additions and 59 deletions
LICENSE 0 → 100644
MIT License
Copyright (c) 2017 ENCODE DCC
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# ENCODE ATAC-seq pipeline
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.156534.svg)](https://doi.org/10.5281/zenodo.156534)[![CircleCI](https://circleci.com/gh/ENCODE-DCC/atac-seq-pipeline/tree/master.svg?style=svg)](https://circleci.com/gh/ENCODE-DCC/atac-seq-pipeline/tree/master)
## Introduction
This pipeline is designed for automated end-to-end quality control and processing of ATAC-seq and DNase-seq data. The pipeline can be run on compute clusters with job submission engines as well as on standalone machines. It inherently makes use of parallelized/distributed computing. Pipeline installation is also easy, as most dependencies are automatically installed. The pipeline can be run end-to-end, starting from raw FASTQ files all the way to peak calling and signal track generation, using a single `caper submit` command. One can also start the pipeline from intermediate stages (for example, using alignment files as input). The pipeline supports both single-end and paired-end data as well as replicated or non-replicated datasets. The outputs produced by the pipeline include 1) formatted HTML reports that include quality control measures specifically designed for ATAC-seq and DNase-seq data, 2) analysis of reproducibility, 3) stringent and relaxed thresholding of peaks, and 4) fold-enrichment and p-value signal tracks. The pipeline also supports detailed error reporting and allows for easy resumption of interrupted runs. It has been tested on some human, mouse and yeast ATAC-seq datasets as well as on human and mouse DNase-seq datasets.
The ATAC-seq pipeline protocol specification is [here](https://docs.google.com/document/d/1f0Cm4vRyDQDu0bMehHD7P7KOMxTOP-HiNoIvL1VcBt8/edit?usp=sharing). Some parts of the ATAC-seq pipeline were developed in collaboration with Jason Buenrostro, Alicia Schep and Will Greenleaf at Stanford.
### Features
* **Portability**: The pipeline run can be performed across different cloud platforms such as Google, AWS and DNAnexus, as well as on cluster engines such as SLURM, SGE and PBS.
* **User-friendly HTML report**: In addition to the standard outputs, the pipeline generates an HTML report that consists of a tabular representation of quality metrics including alignment/peak statistics and FRiP, along with many useful plots (IDR/TSS enrichment). See an example [HTML report](https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR889WQX/example_output/qc.html) and the [JSON file](https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR889WQX/example_output/qc.json) used to generate it.
* **Supported genomes**: The pipeline needs genome-specific data such as aligner indices, a chromosome sizes file and a blacklist. We provide a genome database downloader/builder for hg38, hg19, mm10 and mm9. You can also use this [builder](docs/build_genome_database.md) to build a genome database from FASTA for your custom genome; a rough sketch follows below.
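As a rough sketch only (the script names and arguments below are assumptions; see the [builder documentation](docs/build_genome_database.md) for the authoritative commands), downloading or building a genome database might look like:
```bash
# hypothetical invocations; consult docs/build_genome_database.md for the exact scripts/arguments
# download pre-built genome data (e.g. hg38) into a destination directory
$ bash scripts/download_genome_data.sh hg38 /path/to/genome_data/hg38

# build genome data for a custom genome from its FASTA; the resulting TSV is used as "atac.genome_tsv"
$ bash scripts/build_genome_data.sh my_genome /path/to/genome_data/my_genome
```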
## Installation
1) Install Caper (Python Wrapper/CLI for [Cromwell](https://github.com/broadinstitute/cromwell)).
```bash
$ pip install caper
```
2) **IMPORTANT**: Read Caper's [README](https://github.com/ENCODE-DCC/caper/blob/master/README.md) carefully to choose a backend for your system. Follow the instructions in the configuration file.
```bash
# backend: local or your HPC type (e.g. slurm, sge, pbs, lsf). read Caper's README carefully.
$ caper init [YOUR_BACKEND]
# IMPORTANT: edit the conf file and follow commented instructions in there
$ vi ~/.caper/default.conf
```
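For reference, a SLURM-flavored `~/.caper/default.conf` looks roughly like the sketch below; the exact keys and required fields depend on your Caper version and backend, so treat everything here as placeholders and follow the commented instructions in the generated file.
```
backend=slurm

# SLURM partition/account; your cluster's policy determines which of these you need
slurm-partition=YOUR_PARTITION
slurm-account=YOUR_ACCOUNT
```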
3) Git clone this pipeline.
```bash
$ cd
$ git clone https://github.com/ENCODE-DCC/atac-seq-pipeline
$ cd atac-seq-pipeline
```
4) Define test input JSON.
```bash
INPUT_JSON="https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ_subsampled.json"
```
5) If you have Docker and want to run the pipeline locally on your laptop, use the following command. `--max-concurrent-tasks 1` limits the number of concurrent tasks so that a test run fits on a laptop; remove it if you run on a workstation/HPC.
```bash
# check if Docker works on your machine
$ docker run ubuntu:latest echo hello
# --max-concurrent-tasks 1 is for computers with limited resources
$ caper run atac.wdl -i "${INPUT_JSON}" --docker --max-concurrent-tasks 1
```
6) Otherwise, install Singularity on your system. Please follow [these instructions](https://neuro.debian.net/install_pkg.html?p=singularity-container) to install Singularity on a Debian-based OS, or ask your system administrator to install Singularity on your HPC.
```bash
# check if Singularity works on your machine
$ singularity exec docker://ubuntu:latest echo hello
# on your local machine (--max-concurrent-tasks 1 is for computers with limited resources)
$ caper run atac.wdl -i "${INPUT_JSON}" --singularity --max-concurrent-tasks 1
# on HPC, make sure that Caper's conf ~/.caper/default.conf is correctly configured to work with your HPC
# the following command will submit Caper as a leader job to SLURM with Singularity
$ caper hpc submit atac.wdl -i "${INPUT_JSON}" --singularity --leader-job-name ANY_GOOD_LEADER_JOB_NAME
# check job ID and status of your leader jobs
$ caper hpc list
# cancel the leader job to close all of its child jobs
# if you directly use a cluster command like scancel or qdel,
# child jobs will not be terminated
$ caper hpc abort [JOB_ID]
```
7) (Optional Conda method) **WE DO NOT HELP USERS FIX CONDA DEPENDENCY ISSUES. IF CONDA METHOD FAILS THEN PLEASE USE SINGULARITY METHOD INSTEAD**. **DO NOT USE A SHARED CONDA. INSTALL YOUR OWN [MINICONDA3](https://docs.conda.io/en/latest/miniconda.html) AND USE IT.**
```bash
# check if you are not using a shared conda, if so then delete it or remove it from your PATH
$ which conda
# uninstall pipeline's old environments
$ bash scripts/uninstall_conda_env.sh
# install new envs; you need to run this for every pipeline version update.
# it may be killed if you run this command on an HPC login node;
# it's recommended to request an interactive node with enough resources and run it there.
$ bash scripts/install_conda_env.sh
# if installation fails please use Singularity method instead.
# on your local machine (--max-concurrent-tasks 1 is for computers with limited resources)
$ caper run atac.wdl -i "${INPUT_JSON}" --conda --max-concurrent-tasks 1
# on HPC, make sure that Caper's conf ~/.caper/default.conf is correctly configured to work with your HPC
# the following command will submit Caper as a leader job to SLURM with Conda
$ caper hpc submit atac.wdl -i "${INPUT_JSON}" --conda --leader-job-name ANY_GOOD_LEADER_JOB_NAME
# check job ID and status of your leader jobs
$ caper hpc list
# cancel the leader job to close all of its child jobs
# if you directly use a cluster command like scancel or qdel,
# child jobs will not be terminated
$ caper hpc abort [JOB_ID]
```
## Input JSON file specification
> **IMPORTANT**: DO NOT BLINDLY USE A TEMPLATE/EXAMPLE INPUT JSON. READ THROUGH THE FOLLOWING GUIDE TO MAKE A CORRECT INPUT JSON FILE. ESPECIALLY FOR AUTODETECTING/DEFINING ADAPTERS.
An input JSON file specifies all the input parameters and files that are necessary for successfully running this pipeline. This includes the paths to the genome reference files and the raw FASTQ files. Please make sure to specify absolute paths rather than relative paths in your input JSON files.
1) [Input JSON file specification (short)](docs/input_short.md)
2) [Input JSON file specification (long)](docs/input.md)
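As a starting point, a minimal input JSON for a paired-end, two-replicate experiment looks roughly like the example below (adapted from the subsampled ENCSR356KRQ test JSONs bundled in this commit); the genome TSV and FASTQ paths are placeholders to replace with absolute paths or URIs valid on your platform.
```json
{
    "atac.pipeline_type" : "atac",
    "atac.genome_tsv" : "/absolute/path/to/genome_tsv/hg38.tsv",
    "atac.fastqs_rep1_R1" : ["/absolute/path/to/rep1_R1.fastq.gz"],
    "atac.fastqs_rep1_R2" : ["/absolute/path/to/rep1_R2.fastq.gz"],
    "atac.fastqs_rep2_R1" : ["/absolute/path/to/rep2_R1.fastq.gz"],
    "atac.fastqs_rep2_R2" : ["/absolute/path/to/rep2_R2.fastq.gz"],
    "atac.paired_end" : true,
    "atac.auto_detect_adapter" : true,
    "atac.title" : "MY_EXPERIMENT",
    "atac.description" : "Short description of the experiment"
}
```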
## Running and sharing on Truwl
You can run this pipeline on [truwl.com](https://truwl.com/). This provides a web interface that allows you to define inputs and parameters, run the job on GCP, and monitor progress. To run it you will need to create an account on the platform then request early access by emailing [info@truwl.com](mailto:info@truwl.com) to get the right permissions. You can see the example case from this repo at [https://truwl.com/workflows/instance/WF_e85df4.f10.8880/command](https://truwl.com/workflows/instance/WF_e85df4.f10.8880/command). The example job (or other jobs) can be forked to pre-populate the inputs for your own job.
If you do not run the pipeline on Truwl, you can still share your use-case/job on the platform by getting in touch at [info@truwl.com](mailto:info@truwl.com) and providing your inputs.json file.
## Running on Terra/Anvil (using Dockstore)
Visit our pipeline repo on [Dockstore](https://dockstore.org/workflows/github.com/ENCODE-DCC/atac-seq-pipeline). Click on `Terra` or `Anvil`. Follow Terra's instructions to create a workspace on Terra and add Terra's billing bot to your Google Cloud account.
Download this [test input JSON for Terra](https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ_subsampled.terra.json), upload it to Terra's UI, and then run the analysis.
If you want to use your own input JSON file, then make sure that all files in the input JSON are on a Google Cloud Storage bucket (`gs://`). URLs will not work.
## Running on DNAnexus (using Dockstore)
Sign up for a new account on [DNAnexus](https://platform.dnanexus.com/) and create a new project on either AWS or Azure. Visit our pipeline repo on [Dockstore](https://dockstore.org/workflows/github.com/ENCODE-DCC/atac-seq-pipeline). Click on `DNAnexus`. Choose a destination directory on your DNAnexus project. Click on `Submit` and visit DNAnexus. This submits a conversion job; you can check its status under `Monitor` in the DNAnexus UI.
Once conversion is done, download one of the following input JSON files, according to your chosen platform (AWS or Azure), for your DNAnexus project:
- AWS: https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ_subsampled_dx.json
- Azure: https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ_subsampled_dx_azure.json
You cannot use these input JSON files directly. Go to the destination directory on DNAnexus and click on the converted workflow `atac`. You will see input file boxes on the left-hand side of the task graph. Expand them and define FASTQs (`fastq_repX_R1`, and `fastq_repX_R2` if the data are paired-end) and `genome_tsv` as in the downloaded input JSON file. Click on the `common` task box and define other non-file pipeline parameters, e.g. `auto_detect_adapter` and `paired_end`.
We have a separate project on DNAnexus that provides example FASTQs and `genome_tsv` for `hg38` and `mm10`. We recommend making copies of these directories in your own project.
`genome_tsv`
- AWS: https://platform.dnanexus.com/projects/BKpvFg00VBPV975PgJ6Q03v6/data/pipeline-genome-data/genome_tsv/v4
- Azure: https://platform.dnanexus.com/projects/F6K911Q9xyfgJ36JFzv03Z5J/data/pipeline-genome-data/genome_tsv/v4
Example FASTQs
- AWS: https://platform.dnanexus.com/projects/BKpvFg00VBPV975PgJ6Q03v6/data/pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled
- Azure: https://platform.dnanexus.com/projects/F6K911Q9xyfgJ36JFzv03Z5J/data/pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled
## Running on DNAnexus (using our pre-built workflows)
See [this](docs/tutorial_dx_web.md) for details.
## How to organize outputs
Install [Croo](https://github.com/ENCODE-DCC/croo#installation). **You can skip this installation if you have installed the pipeline's Conda environment and activated it.** Make sure that you have Python 3 (>3.4.1) installed on your system. Find the `metadata.json` in Caper's output directory.
```bash
$ pip install croo
$ croo [METADATA_JSON_FILE]
```
## How to make a spreadsheet of QC metrics
Install [qc2tsv](https://github.com/ENCODE-DCC/qc2tsv#installation). Make sure that you have Python 3 (>3.4.1) installed on your system.
Once you have [organized the output with Croo](#how-to-organize-outputs), you will be able to find the pipeline's final output file `qc/qc.json`, which contains all the QC metrics. Simply feed `qc2tsv` with multiple `qc.json` files. It accepts various URIs, such as local paths, `gs://` and `s3://`.
```bash
$ pip install qc2tsv
$ qc2tsv /sample1/qc.json gs://sample2/qc.json s3://sample3/qc.json ... > spreadsheet.tsv
```
QC metrics for each experiment (`qc.json`) will be split into multiple rows (one for the overall experiment plus one for each biological replicate) in the spreadsheet.
atac.wdl 0 → 100644
This diff is collapsed.
#!/bin/bash
set -e
WDL=atac.wdl
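# parse the pipeline version (e.g. v2.2.2) from the WDL's "String pipeline_ver = '...'" line; awk strips the quotes and prints the 4th field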
VER=$(cat ${WDL} | grep "String pipeline_ver = " | awk '{gsub("'"'"'",""); print $4}')
DXWDL=~/dxWDL-v1.50.jar
# general
java -jar ${DXWDL} compile ${WDL} -project "ENCODE Uniform Processing Pipelines" -f -folder \
/ATAC-seq/workflows/$VER/general -defaults example_input_json/dx/template_general.json
# hg38
java -jar ${DXWDL} compile ${WDL} -project "ENCODE Uniform Processing Pipelines" -f -folder \
/ATAC-seq/workflows/$VER/hg38 -defaults example_input_json/dx/template_hg38.json
# hg19
java -jar ${DXWDL} compile ${WDL} -project "ENCODE Uniform Processing Pipelines" -f -folder \
/ATAC-seq/workflows/$VER/hg19 -defaults example_input_json/dx/template_hg19.json
# mm10
java -jar ${DXWDL} compile ${WDL} -project "ENCODE Uniform Processing Pipelines" -f -folder \
/ATAC-seq/workflows/$VER/mm10 -defaults example_input_json/dx/template_mm10.json
# mm9
java -jar ${DXWDL} compile ${WDL} -project "ENCODE Uniform Processing Pipelines" -f -folder \
/ATAC-seq/workflows/$VER/mm9 -defaults example_input_json/dx/template_mm9.json
# test sample
java -jar ${DXWDL} compile ${WDL} -project "ENCODE Uniform Processing Pipelines" -f -folder \
/ATAC-seq/workflows/$VER/test_ENCSR356KRQ_subsampled -defaults example_input_json/dx/ENCSR356KRQ_subsampled_dx.json
# test sample (single rep)
java -jar ${DXWDL} compile ${WDL} -project "ENCODE Uniform Processing Pipelines" -f -folder \
/ATAC-seq/workflows/$VER/test_ENCSR356KRQ_subsampled_rep1 -defaults example_input_json/dx/ENCSR356KRQ_subsampled_rep1_dx.json
## DX Azure
# general
java -jar ${DXWDL} compile ${WDL} -project "ENCODE Uniform Processing Pipelines Azure" -f -folder \
/ATAC-seq/workflows/$VER/general -defaults example_input_json/dx_azure/template_general.json
# hg38
java -jar ${DXWDL} compile ${WDL} -project "ENCODE Uniform Processing Pipelines Azure" -f -folder \
/ATAC-seq/workflows/$VER/hg38 -defaults example_input_json/dx_azure/template_hg38.json
# hg19
java -jar ${DXWDL} compile ${WDL} -project "ENCODE Uniform Processing Pipelines Azure" -f -folder \
/ATAC-seq/workflows/$VER/hg19 -defaults example_input_json/dx_azure/template_hg19.json
# mm10
java -jar ${DXWDL} compile ${WDL} -project "ENCODE Uniform Processing Pipelines Azure" -f -folder \
/ATAC-seq/workflows/$VER/mm10 -defaults example_input_json/dx_azure/template_mm10.json
# mm9
java -jar ${DXWDL} compile ${WDL} -project "ENCODE Uniform Processing Pipelines Azure" -f -folder \
/ATAC-seq/workflows/$VER/mm9 -defaults example_input_json/dx_azure/template_mm9.json
# test sample
java -jar ${DXWDL} compile ${WDL} -project "ENCODE Uniform Processing Pipelines Azure" -f -folder \
/ATAC-seq/workflows/$VER/test_ENCSR356KRQ_subsampled -defaults example_input_json/dx_azure/ENCSR356KRQ_subsampled_dx_azure.json
############################################################
# Dockerfile for ENCODE DCC atac-seq-pipeline
# Based on Ubuntu 18.04.3
############################################################
# Set the base image to Ubuntu 18.04.3
#FROM ubuntu:18.04
FROM ubuntu@sha256:d1d454df0f579c6be4d8161d227462d69e163a8ff9d20a847533989cf0c94d90
MAINTAINER Jin Lee
# To prevent time zone prompt
ENV DEBIAN_FRONTEND=noninteractive
# Install software from the apt repo
RUN apt-get update && apt-get install -y \
libncurses5-dev libcurl4-openssl-dev libfreetype6-dev zlib1g-dev \
python python-setuptools python-pip python3 python3-setuptools python3-pip \
git wget unzip ghostscript pkg-config libboost-dev \
openjdk-8-jre apt-transport-https tabix \
r-base \
&& rm -rf /var/lib/apt/lists/*
# Make a directory for all software
RUN mkdir /software
WORKDIR /software
ENV PATH="/software:${PATH}"
# Downgrade openssl to 1.0.2t (to get the same random number for SPR)
RUN wget https://github.com/openssl/openssl/archive/OpenSSL_1_0_2t.tar.gz && tar zxvf OpenSSL_1_0_2t.tar.gz && cd openssl-OpenSSL_1_0_2t/ && ./config && make && make install && cd ../ && rm -rf openssl-OpenSSL_1_0_2t* && rm /usr/bin/openssl && ln -s /usr/local/ssl/bin/openssl /usr/bin/openssl
# Install system/math python packages (python3)
RUN pip3 install --no-cache-dir jsondiff==1.1.1 common python-dateutil cython pandas==0.25.1 jinja2==2.10.1 matplotlib==3.1.1
# Install genomic python package (python3)
RUN pip3 install --no-cache-dir pyBigwig==0.3.13 cutadapt==2.5 pyfaidx==0.5.5.2 pybedtools==0.8.0 pysam==0.15.3 deeptools==3.3.1
# Install R packages including spp
RUN echo "r <- getOption('repos'); r['CRAN'] <- 'http://cran.r-project.org'; options(repos = r);" > ~/.Rprofile && \
Rscript -e "install.packages('snow')" && \
Rscript -e "install.packages('snowfall')" && \
Rscript -e "install.packages('bitops')" && \
Rscript -e "install.packages('caTools')" && \
Rscript -e "install.packages('Rcpp')"
# Install bioconductor and Rsamtools which is required by spp package
RUN Rscript -e "source('http://bioconductor.org/biocLite.R'); biocLite('Rsamtools')"
# Install r-spp 1.15 (1.13 in Conda env)
RUN wget https://github.com/hms-dbmi/spp/archive/1.15.tar.gz && Rscript -e "install.packages('./1.15.tar.gz')" && rm -f 1.15.tar.gz
# Install samtools 1.9
RUN git clone --branch 1.9 --single-branch https://github.com/samtools/samtools.git && \
git clone --branch 1.9 --single-branch https://github.com/samtools/htslib.git && \
cd samtools && make && make install && cd ../ && rm -rf samtools* htslib*
# Install bedtools 2.29.0
RUN git clone --branch v2.29.0 --single-branch https://github.com/arq5x/bedtools2.git && \
cd bedtools2 && make && make install && cd ../ && rm -rf bedtools2*
# Install Picard 2.20.7
RUN wget https://github.com/broadinstitute/picard/releases/download/2.20.7/picard.jar && chmod +x picard.jar
# Install sambamba 0.6.6
RUN wget https://github.com/lomereiter/sambamba/releases/download/v0.6.6/sambamba_v0.6.6_linux.tar.bz2 && tar -xvjf sambamba_v0.6.6_linux.tar.bz2 && mv sambamba_v0.6.6 sambamba && rm -rf sambamba_*
# Install gsl 1.16
RUN wget http://gnu.mirror.vexxhost.com/gsl/gsl-1.16.tar.gz && tar -zxvf gsl-1.16.tar.gz && cd gsl-1.16 && ./configure && make && make install && cd .. && rm -rf gsl-1.16 gsl-1.16.tar.gz
ENV LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib
# Install preseq 2.0.3
RUN git clone --branch v2.0.3 --single-branch --recursive https://github.com/smithlabcode/preseq preseq_2.0.3 && cd preseq_2.0.3 && make && cd ../ && mv preseq_2.0.3/preseq . && rm -rf preseq_2.0.3
# Install Bowtie2 2.3.4.3
RUN wget https://github.com/BenLangmead/bowtie2/releases/download/v2.3.4.3/bowtie2-2.3.4.3-linux-x86_64.zip && \
unzip bowtie2-2.3.4.3-linux-x86_64.zip && mv bowtie2*/bowtie2* . && rm -rf bowtie2-2.3.4.3*
# Install Bwa 0.7.17
RUN git clone --branch v0.7.17 --single-branch https://github.com/lh3/bwa.git && \
cd bwa && make && cp bwa /usr/local/bin/ && cd ../ && rm -rf bwa*
# Install phantompeakqualtools 1.2.1
RUN wget https://github.com/kundajelab/phantompeakqualtools/archive/1.2.1.tar.gz && tar -xvf 1.2.1.tar.gz && rm -f 1.2.1.tar.gz
ENV PATH="/software/phantompeakqualtools-1.2.1:${PATH}"
# Install SAMstats
RUN pip3 install --no-cache-dir SAMstats==0.2.1
# Install IDR 2.0.4.2
RUN git clone --branch 2.0.4.2 --single-branch https://github.com/kundajelab/idr && \
cd idr && python3 setup.py install && cd ../ && rm -rf idr*
# Install system/math python packages and biopython
RUN pip2 install --no-cache-dir numpy scipy matplotlib==2.2.4 bx-python==0.8.2 biopython==1.76
RUN pip3 install --no-cache-dir biopython==1.76
# Install genomic python packages (python2)
RUN pip2 install --no-cache-dir metaseq==0.5.6
# Install MACS2 (python3)
RUN pip3 install --no-cache-dir Cython
RUN pip3 install --no-cache-dir macs2==2.2.4
# Install UCSC tools (v377)
RUN git clone https://github.com/ENCODE-DCC/kentUtils_bin_v377
ENV PATH=${PATH}:/software/kentUtils_bin_v377/bin
ENV LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/software/kentUtils_bin_v377/lib
# Install Trimmomatic JAR
RUN wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip && unzip Trimmomatic-0.39.zip && mv Trimmomatic-0.39/trimmomatic-0.39.jar trimmomatic.jar && chmod +rx trimmomatic.jar && rm -rf Trimmomatic-0.39.zip Trimmomatic-0.39/
# Install pytest for testing environment
RUN pip3 install --no-cache-dir pytest
# Install bedops 2.4.39
RUN mkdir bedops_2.4.39 && cd bedops_2.4.39 && wget https://github.com/bedops/bedops/releases/download/v2.4.39/bedops_linux_x86_64-v2.4.39.tar.bz2 && tar -xvjf bedops_linux_x86_64-v2.4.39.tar.bz2 && rm -f bedops_linux_x86_64-v2.4.39.tar.bz2
ENV PATH="/software/bedops_2.4.39/bin:${PATH}"
# Install ptools_bin 0.0.7 (https://github.com/ENCODE-DCC/ptools_bin)
RUN pip3 install --no-cache-dir ptools_bin==0.0.7
# Prevent conflict with locally installed python outside of singularity container
ENV PYTHONNOUSERSITE=True
# Disable multithreading for BLAS
ENV OPENBLAS_NUM_THREADS=1
ENV MKL_NUM_THREADS=1
# make some temporary directories
RUN mkdir -p /mnt/ext_{0..9}
# make pipeline src directory to store py's
RUN mkdir -p atac-seq-pipeline/src
ENV PATH="/software/atac-seq-pipeline:/software/atac-seq-pipeline/src:${PATH}"
RUN mkdir -p atac-seq-pipeline/dev/test/test_py
ENV PYTHONPATH="/software/atac-seq-pipeline/src"
# Get ENCODE atac-seq-pipeline container repository
# This COPY assumes the build context is the root of the atac-seq-pipeline repo
# and it gets whatever is checked out plus local modifications
# so the building command should be:
# cd [GIT_REPO_DIR] && docker build -f dev/docker_image/Dockerfile .
COPY src atac-seq-pipeline/src/
COPY atac.wdl atac-seq-pipeline/
COPY dev/test/test_py atac-seq-pipeline/dev/test/test_py/
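-- Cromwell metadata DB initialization (mounted into the MySQL container's /docker-entrypoint-initdb.d; see the Cromwell server setup below):
-- create a 'cromwell' user for local and remote connections with full privileges on cromwell_db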
CREATE USER 'cromwell'@'localhost' IDENTIFIED BY 'cromwell';
GRANT ALL PRIVILEGES ON cromwell_db.* TO 'cromwell'@'localhost' WITH GRANT OPTION;
CREATE USER 'cromwell'@'%' IDENTIFIED BY 'cromwell';
GRANT ALL PRIVILEGES ON cromwell_db.* TO 'cromwell'@'%' WITH GRANT OPTION;
{
"atac.pipeline_type" : "atac",
"atac.genome_tsv" : "s3://encode-pipeline-genome-data/genome_tsv/v1/hg38_aws.tsv",
"atac.fastqs_rep1_R1" : [
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair1/ENCFF341MYG.subsampled.400.fastq.gz",
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair1/ENCFF106QGY.subsampled.400.fastq.gz"
],
"atac.fastqs_rep1_R2" : [
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair2/ENCFF248EJF.subsampled.400.fastq.gz",
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair2/ENCFF368TYI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R1" : [
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF641SFZ.subsampled.400.fastq.gz",
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF751XTV.subsampled.400.fastq.gz",
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF927LSG.subsampled.400.fastq.gz",
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF859BDM.subsampled.400.fastq.gz",
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF193RRC.subsampled.400.fastq.gz",
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF366DFI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R2" : [
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF031ARQ.subsampled.400.fastq.gz",
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF590SYZ.subsampled.400.fastq.gz",
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF734PEQ.subsampled.400.fastq.gz",
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF007USV.subsampled.400.fastq.gz",
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF886FSC.subsampled.400.fastq.gz",
"s3://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF573UXK.subsampled.400.fastq.gz"
],
"atac.paired_end" : true,
"atac.auto_detect_adapter" : true,
"atac.enable_xcor" : true,
"atac.title" : "ENCSR356KRQ (subsampled 1/400)",
"atac.description" : "ATAC-seq on primary keratinocytes in day 0.0 of differentiation"
}
{
"atac.pipeline_type" : "atac",
"atac.genome_tsv" : "https://storage.googleapis.com/encode-pipeline-genome-data/genome_tsv/v1/hg38_caper.tsv",
"atac.fastqs_rep1_R1" : [
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair1/ENCFF341MYG.subsampled.400.fastq.gz",
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair1/ENCFF106QGY.subsampled.400.fastq.gz"
],
"atac.fastqs_rep1_R2" : [
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair2/ENCFF248EJF.subsampled.400.fastq.gz",
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair2/ENCFF368TYI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R1" : [
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF641SFZ.subsampled.400.fastq.gz",
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF751XTV.subsampled.400.fastq.gz",
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF927LSG.subsampled.400.fastq.gz",
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF859BDM.subsampled.400.fastq.gz",
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF193RRC.subsampled.400.fastq.gz",
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF366DFI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R2" : [
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF031ARQ.subsampled.400.fastq.gz",
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF590SYZ.subsampled.400.fastq.gz",
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF734PEQ.subsampled.400.fastq.gz",
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF007USV.subsampled.400.fastq.gz",
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF886FSC.subsampled.400.fastq.gz",
"https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF573UXK.subsampled.400.fastq.gz"
],
"atac.paired_end" : true,
"atac.auto_detect_adapter" : true,
"atac.enable_xcor" : true,
"atac.title" : "ENCSR356KRQ (subsampled 1/400)",
"atac.description" : "ATAC-seq on primary keratinocytes in day 0.0 of differentiation"
}
{
"atac.pipeline_type" : "atac",
"atac.genome_tsv" : "gs://encode-pipeline-genome-data/genome_tsv/v1/hg38_gcp.tsv",
"atac.fastqs_rep1_R1" : [
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair1/ENCFF341MYG.subsampled.400.fastq.gz",
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair1/ENCFF106QGY.subsampled.400.fastq.gz"
],
"atac.fastqs_rep1_R2" : [
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair2/ENCFF248EJF.subsampled.400.fastq.gz",
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair2/ENCFF368TYI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R1" : [
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF641SFZ.subsampled.400.fastq.gz",
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF751XTV.subsampled.400.fastq.gz",
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF927LSG.subsampled.400.fastq.gz",
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF859BDM.subsampled.400.fastq.gz",
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF193RRC.subsampled.400.fastq.gz",
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF366DFI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R2" : [
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF031ARQ.subsampled.400.fastq.gz",
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF590SYZ.subsampled.400.fastq.gz",
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF734PEQ.subsampled.400.fastq.gz",
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF007USV.subsampled.400.fastq.gz",
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF886FSC.subsampled.400.fastq.gz",
"gs://encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF573UXK.subsampled.400.fastq.gz"
],
"atac.paired_end" : true,
"atac.auto_detect_adapter" : true,
"atac.enable_xcor" : true,
"atac.title" : "ENCSR356KRQ (subsampled 1/400)",
"atac.description" : "ATAC-seq on primary keratinocytes in day 0.0 of differentiation"
}
{
"atac.pipeline_type" : "atac",
"atac.genome_tsv" : "/mnt/data/pipeline_genome_data/genome_tsv/v1/hg38_klab.tsv",
"atac.fastqs_rep1_R1" : [
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep1/pair1/ENCFF341MYG.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep1/pair1/ENCFF106QGY.fastq.gz"
],
"atac.fastqs_rep1_R2" : [
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep1/pair2/ENCFF248EJF.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep1/pair2/ENCFF368TYI.fastq.gz"
],
"atac.fastqs_rep2_R1" : [
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF641SFZ.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF751XTV.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF927LSG.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF859BDM.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF193RRC.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF366DFI.fastq.gz"
],
"atac.fastqs_rep2_R2" : [
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF031ARQ.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF590SYZ.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF734PEQ.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF007USV.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF886FSC.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF573UXK.fastq.gz"
],
"atac.paired_end" : true,
"atac.auto_detect_adapter" : true,
"atac.enable_xcor" : true,
"atac.title" : "ENCSR356KRQ",
"atac.description" : "ATAC-seq on primary keratinocytes in day 0.0 of differentiation"
}
{
"atac.pipeline_type" : "atac",
"atac.genome_tsv" : "/mnt/data/pipeline_genome_data/genome_tsv/v1/hg38_klab.tsv",
"atac.fastqs_rep1_R1" : [
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair1/ENCFF341MYG.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair1/ENCFF106QGY.subsampled.400.fastq.gz"
],
"atac.fastqs_rep1_R2" : [
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair2/ENCFF248EJF.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep1/pair2/ENCFF368TYI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R1" : [
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF641SFZ.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF751XTV.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF927LSG.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF859BDM.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF193RRC.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF366DFI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R2" : [
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF031ARQ.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF590SYZ.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF734PEQ.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF007USV.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF886FSC.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF573UXK.subsampled.400.fastq.gz"
],
"atac.paired_end" : true,
"atac.auto_detect_adapter" : true,
"atac.enable_xcor" : true,
"atac.title" : "ENCSR356KRQ (subsampled 1/400)",
"atac.description" : "ATAC-seq on primary keratinocytes in day 0.0 of differentiation"
}
{
"atac.pipeline_type" : "atac",
"atac.genome_tsv" : "/mnt/data/pipeline_genome_data/genome_tsv/v1/hg38_klab.tsv",
"atac.nodup_bams" : [
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/bam_subsampled/rep1/ENCFF341MYG.subsampled.400.trim.merged.nodup.no_chrM_MT.bam",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/bam_subsampled/rep2/ENCFF641SFZ.subsampled.400.trim.merged.nodup.no_chrM_MT.bam"
],
"atac.read_len" : [76, 76],
"atac.paired_end" : true,
"atac.auto_detect_adapter" : true,
"atac.enable_xcor" : true,
"atac.title" : "ENCSR356KRQ (subsampled 1/400, staring from NODUP_BAMs with specified read_len)",
"atac.description" : "ATAC-seq on primary keratinocytes in day 0.0 of differentiation"
}
{
"atac.pipeline_type" : "atac",
"atac.genome_tsv" : "/mnt/data/pipeline_genome_data/genome_tsv/v1/mm10_klab.tsv",
"atac.fastqs_rep1_R1" : [
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR889WQX/fastq_subsampled/rep1/ENCFF439VSY.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR889WQX/fastq_subsampled/rep1/ENCFF325FCQ.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR889WQX/fastq_subsampled/rep1/ENCFF683IQS.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR889WQX/fastq_subsampled/rep1/ENCFF744CHW.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R1" : [
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR889WQX/fastq_subsampled/rep2/ENCFF463QCX.subsampled.400.fastq.gz",
"/mnt/data/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR889WQX/fastq_subsampled/rep2/ENCFF992TSA.subsampled.400.fastq.gz"
],
"atac.paired_end" : false,
"atac.auto_detect_adapter" : true,
"atac.enable_xcor" : true,
"atac.enable_tss_enrich" : false,
"atac.title" : "ENCSR889WQX (subsampled 1/400 reads)",
"atac.description" : "ATAC-seq on Mus musculus C57BL/6 frontal cortex adult"
}
{
"atac.pipeline_type" : "atac",
"atac.genome_tsv" : "/reference/ENCODE/pipeline_genome_data/genome_tsv/v1/hg38_scg.tsv",
"atac.fastqs_rep1_R1" : [
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep1/pair1/ENCFF341MYG.subsampled.400.fastq.gz",
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep1/pair1/ENCFF106QGY.subsampled.400.fastq.gz"
],
"atac.fastqs_rep1_R2" : [
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep1/pair2/ENCFF248EJF.subsampled.400.fastq.gz",
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep1/pair2/ENCFF368TYI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R1" : [
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF641SFZ.subsampled.400.fastq.gz",
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF751XTV.subsampled.400.fastq.gz",
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF927LSG.subsampled.400.fastq.gz",
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF859BDM.subsampled.400.fastq.gz",
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF193RRC.subsampled.400.fastq.gz",
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF366DFI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R2" : [
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF031ARQ.subsampled.400.fastq.gz",
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF590SYZ.subsampled.400.fastq.gz",
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF734PEQ.subsampled.400.fastq.gz",
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF007USV.subsampled.400.fastq.gz",
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF886FSC.subsampled.400.fastq.gz",
"/reference/ENCODE/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF573UXK.subsampled.400.fastq.gz"
],
"atac.paired_end" : true,
"atac.auto_detect_adapter" : true,
"atac.enable_xcor" : true,
"atac.title" : "ENCSR356KRQ (subsampled 1/400)",
"atac.description" : "ATAC-seq on primary keratinocytes in day 0.0 of differentiation"
}
{
"atac.pipeline_type" : "atac",
"atac.genome_tsv" : "/home/groups/cherry/encode/pipeline_genome_data/genome_tsv/v1/hg38_sherlock.tsv",
"atac.fastqs_rep1_R1" : [
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep1/pair1/ENCFF341MYG.subsampled.400.fastq.gz",
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep1/pair1/ENCFF106QGY.subsampled.400.fastq.gz"
],
"atac.fastqs_rep1_R2" : [
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep1/pair2/ENCFF248EJF.subsampled.400.fastq.gz",
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep1/pair2/ENCFF368TYI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R1" : [
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF641SFZ.subsampled.400.fastq.gz",
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF751XTV.subsampled.400.fastq.gz",
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF927LSG.subsampled.400.fastq.gz",
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF859BDM.subsampled.400.fastq.gz",
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF193RRC.subsampled.400.fastq.gz",
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair1/ENCFF366DFI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R2" : [
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF031ARQ.subsampled.400.fastq.gz",
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF590SYZ.subsampled.400.fastq.gz",
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF734PEQ.subsampled.400.fastq.gz",
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF007USV.subsampled.400.fastq.gz",
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF886FSC.subsampled.400.fastq.gz",
"/home/groups/cherry/encode/pipeline_test_samples/encode-atac-seq-pipeline/ENCSR356KRQ/fastq/rep2/pair2/ENCFF573UXK.subsampled.400.fastq.gz"
],
"atac.paired_end" : true,
"atac.auto_detect_adapter" : true,
"atac.enable_xcor" : true,
"atac.title" : "ENCSR356KRQ (subsampled 1/400)",
"atac.description" : "ATAC-seq on primary keratinocytes in day 0.0 of differentiation"
}
ENCODE ATAC-seq pipeline test
===================================================
# Task level test (local)
This test requires the `atac-seq-pipeline-test-data` directory in `test_task/`. Git clone [the data repo](https://github.com/leepc12/atac-seq-pipeline-test-data) into `test_task/`. This repo has 1/400-subsampled test samples, chr19/chrM-only bowtie2 indices and other genome data for hg38 and mm10. Make sure that you have `cromwell-31.jar` in your `$PATH` as an executable (`chmod +x`) and Docker installed on your system.
```
$ cd test_task/
$ git clone https://github.com/encode-dcc/atac-seq-pipeline-test-data
```
Each task in `../atac.wdl` has a corresponding pair of tester WDL/JSON files (`[TASK_NAME].wdl` and `[TASK_NAME].json`). You can also specify your own Docker image to test each task.
```
$ cd test_task/
$ ./test.sh [WDL] [INPUT_JSON] [DOCKER_IMAGE](optional)
```
# Workflow level test (on GC)
Make sure that you have a Cromwell server running on GC. This shell script will submit `../atac.wdl` to the server and wait for a response (`result.json`). There are two input JSON files (original and subsampled) for each endedness (SE and PE). You can also check all outputs on GC bucket `gs://encode-pipeline-test-runs`.
```
$ cd test_workflow/
$ ./test_atac.sh [INPUT_JSON] [QC_JSON_TO_COMPARE] [DOCKER_IMAGE](optional)
```
Jenkins must do the following:
```
$ cd test_workflow/
# For master branch (full test sample, ~24hr)
$ ./test_atac.sh ENCSR356KRQ.json ref_output/ENCSR356KRQ_qc.json [NEW_DOCKER_IMAGE]
$ ./test_atac.sh ENCSR889WQX.json ref_output/ENCSR889WQX_qc.json [NEW_DOCKER_IMAGE]
# For develop branch (1/400 subsampled and chr19 only test sample ~30mins)
$ ./test_atac.sh ENCSR356KRQ_subsampled.json ref_output/ENCSR356KRQ_subsampled_chr19_only_qc.json [NEW_DOCKER_IMAGE]
$ ./test_atac.sh ENCSR889WQX_subsampled.json ref_output/ENCSR889WQX_subsampled_chr19_only_qc.json [NEW_DOCKER_IMAGE]
```
`test_atac.sh` will generate the following files to validate pipeline outputs. Jenkins must check if `PREFIX.qc_json_diff.txt` is empty or not.
* `PREFIX.result.json`: all outputs of `atac.wdl`.
* `PREFIX.result.qc.json`: qc summary JSON file `qc.json` of `atac.wdl`.
* `PREFIX.qc_json_diff.txt`: diff between `PREFIX.result.qc.json` and reference in `ref_output/`.
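A minimal sketch of that check on the Jenkins side (assuming a POSIX shell; `test -s` succeeds only when the file exists and is non-empty):
```
# fail the build if the QC diff is non-empty, i.e. outputs deviate from the reference
# PREFIX is the prefix of the run being validated, e.g. ENCSR356KRQ_subsampled
if [ -s "${PREFIX}.qc_json_diff.txt" ]; then
  echo "QC mismatch against reference:"
  cat "${PREFIX}.qc_json_diff.txt"
  exit 1
fi
```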
# How to run a Cromwell server on GC
1) Create/restart an instance with the following settings.
* name : `encode-cromwell-test-server`.
* resource: 1vCPU and 4GB memory
* zone: `us-west1-a`.
* image: `Ubuntu 16.04 (xenial)`
* disk: `Standard persistent disk 20GB`
* Network tags: add a tag `cromwell-server`.
* Cloud API access scopes: `Allow full access to all Cloud APIs`.
* External IP (optional): any static IP address.
2) SSH to the instance and run the following to install Docker and Java 8:
```
$ sudo apt-get update
$ sudo apt-get install docker.io default-jre
$ sudo usermod -aG docker $USER
```
3) Log out and log back in.
4) Install Cromwell.
```
$ cd
$ wget https://github.com/broadinstitute/cromwell/releases/download/31/cromwell-31.jar
$ chmod +x cromwell*.jar
$ echo "export PATH=\$PATH:\$HOME">> ~/.bashrc
$ source ~/.bashrc
```
5) Clone the pipeline, make a DB directory (where metadata for all pipelines is stored) and run a `MySQL` container.
```
$ cd
$ git clone https://github.com/ENCODE-DCC/atac-seq-pipeline
$ mkdir cromwell_db
$ docker run -d --name mysql-cromwell -v $HOME/cromwell_db:/var/lib/mysql -v $HOME/atac-seq-pipeline/docker_image/mysql:/docker-entrypoint-initdb.d -e MYSQL_ROOT_PASSWORD=cromwell -e MYSQL_DATABASE=cromwell_db --publish 3306:3306 mysql
$ docker ps
```
6) Run the Cromwell server.
```
$ cd $HOME/atac-seq-pipeline
$ git checkout develop_test_jenkins
$ cd test
$ screen -RD cromwell # make screen for cromwell server
$ bash run_cromwell_server_on_gc.sh
```
7) Firewall settings to open port 8000 (an equivalent `gcloud` command is sketched after this list).
* Go to Google Cloud Console
* Choose your Project.
* Choose Networking > VPC network
* Choose "Firewalls rules"
* Choose Create Firewall Rule `encode-cromwell-test-server-open-port-8000`.
* Targets: `Specified target tags`.
* Target tags: cromwell-server
* Source IP ranges: `0.0.0.0/0` (CIDR notation for allowed IP range)
* Protocols and Ports: `Specified protocols and ports` with `tcp:8000`.
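An equivalent firewall rule can be created from the command line with `gcloud` (a sketch; add `--project`/`--network` flags as needed for your setup):
```
$ gcloud compute firewall-rules create encode-cromwell-test-server-open-port-8000 \
    --allow=tcp:8000 --target-tags=cromwell-server --source-ranges=0.0.0.0/0
```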
#!/bin/bash
if [ -f "cromwell-32.jar" ]; then
echo "Skip downloading cromwell."
else
wget -N -c https://github.com/broadinstitute/cromwell/releases/download/32/cromwell-32.jar
fi
CROMWELL_JAR=cromwell-32.jar
BACKEND_CONF=../backends/backend_with_db.conf
BACKEND=google
GC_PROJ=encode-dcc-1016
GC_ROOT=gs://encode-pipeline-test-runs
java -Dconfig.file=${BACKEND_CONF} -Dbackend.default=${BACKEND} -Dbackend.providers.google.config.project=${GC_PROJ} \
-Dbackend.providers.google.config.root=${GC_ROOT} -jar ${CROMWELL_JAR} server
atac-seq-pipeline-test-data
*.result.json
*.metadata.json
*wf_opt.json
cromwell*.jar
*.fasta
*.fa
*.gz
*.docker.json