Commit 3d891107 authored by Holly Ruess

Updated check design issue 13

parent 361d60df
1 merge request: !3 Resolve "update check design file"
Pipeline #5239 failed with stages
in 1 minute and 5 seconds
Showing
with 119 additions and 48 deletions
before_script:
- module add python/3.6.1-2-anaconda
- pip install --user pytest-pythonpath pytest-cov
- module load nextflow/0.27.6
- pip install --user pytest-pythonpath==0.7.1 pytest-cov==2.5.1
- module load nextflow/0.31.0
- ln -s /project/shared/bicf_workflow_ref/workflow_testdata/atacseq/*fastq.gz test_data/
stages:
......@@ -15,14 +15,22 @@ user_configuration:
single_end_human:
stage: integration
only:
- branches
- master
script:
- nextflow run workflow/main.nf
- pytest -m singleend_human
artifacts:
expire_in: 3 days
paired_end_mouse:
stage: integration
only:
- branches
- master
script:
- nextflow run workflow/main.nf --designFile "$CI_PROJECT_DIR/test_data/design_ENCSR451NAE_PE.txt" --genome 'GRCm38' --pairedEnd true
- pytest -m pairedend_mouse
artifacts:
expire_in: 3 days
......@@ -10,6 +10,9 @@ All notable changes to this project will be documented in this file.
### Added
- Changelog
- Merge request template
- Added new CI/CD tests for better coverage
- Added punctuation check in design file
- Added sequence (fastq1) length into design file for better mapping
## [publish_1.0.0 ] - 2019-12-03
Initial release of pipeline
......
......@@ -7,7 +7,41 @@
[![Astrocyte](https://img.shields.io/badge/astrocyte-%E2%89%A50.1.0-blue.svg)](https://astrocyte-test.biohpc.swmed.edu/static/docs/index.html)
## Introduction
BICF ATAC-seq is a bioinformatics best-practice analysis pipeline used for ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) data analysis at [BICF](http://www.utsouthwestern.edu/labs/bioinformatics/) at [UT Southwestern Department of Bioinformatics](http://www.utsouthwestern.edu/departments/bioinformatics/).
The pipeline uses [Nextflow](https://www.nextflow.io), a bioinformatics workflow tool. It pre-processes raw data from FastQ inputs, aligns the reads and performs extensive quality-control on the results.
This pipeline is primarily used with a SLURM cluster on the [BioHPC Cluster](https://portal.biohpc.swmed.edu/content/). However, the pipeline should be able to run on any system that supports Nextflow.
Additionally, the pipeline is designed to work with [Astrocyte Workflow System](https://astrocyte.biohpc.swmed.edu/static/docs/index.html) using a simple web interface.
Current version of the software and issue reports are at
https://git.biohpc.swmed.edu/BICF/Astrocyte/atacseq_analysis
To download the current (working not tagged) version of the software
```bash
$ git clone git@git.biohpc.swmed.edu:BICF/Astrocyte/atacseq_analysis.git
```
## Input files
##### 1) Fastq Files
+ You will need the full path to the files for the Bash script
## Design file
+ The Design file is a tab-delimited file with 4 columns for Single-End and 5 columns for Paired-End. Letters, numbers, and underscores can be used in the names. However, the names must begin with a letter. Columns must be as follows:
1. sample_id - The id of the sample. This will be the header in output files, please make sure it is concise
2. experiment_id - Same name given for all replicates of treatment. Will be used for the consensus header.
3. replicate - Replicate number
4. fastq_read1 - Name of fastq file 1 for SE or PE data
5. fastq_read2 - Name of fastq file 2 for PE data
+ See [HERE](/docs/design_ENCSR451NAE_PE.txt) for an example design file, paired-end
+ See [HERE](/docs/design_ENCSR265ZXX_SE.txt) for an example design file, single-end
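The layout rules above can be sketched as a small stand-alone check. This is a minimal illustration, not the pipeline's `check_design.py`: the function name `validate_design` is hypothetical, the columns are assumed to appear in exactly the order listed, and the naming rule is assumed to apply to `sample_id` and `experiment_id` only.

```python
import csv
import io
import re

# Names may contain letters, digits and underscores, and must start with a letter.
NAME_PATTERN = re.compile(r'^[A-Za-z][A-Za-z0-9_]*$')

SE_COLUMNS = ['sample_id', 'experiment_id', 'replicate', 'fastq_read1']
PE_COLUMNS = SE_COLUMNS + ['fastq_read2']


def validate_design(text, paired=False):
    """Return a list of problems found in tab-delimited design text.

    Hypothetical helper: assumes the header row must match the expected
    columns exactly and in order.
    """
    expected = PE_COLUMNS if paired else SE_COLUMNS
    reader = csv.DictReader(io.StringIO(text), delimiter='\t')
    problems = []
    if reader.fieldnames != expected:
        problems.append('expected columns %s, found %s' % (expected, reader.fieldnames))
        return problems
    for row in reader:
        # Assumption: the naming rule covers the two identifier columns.
        for column in ('sample_id', 'experiment_id'):
            if not NAME_PATTERN.match(row[column]):
                problems.append('bad %s: %r' % (column, row[column]))
    return problems
```

Calling `validate_design(open(path).read(), paired=True)` on a paired-end design file would return an empty list when the header and names follow the rules.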
## SOP
This SOP describes the analysis pipeline of downstream analysis of ChIP-seq sequencing data. This pipeline includes (1) Quality control using Deeptools, (2) Peak annotation, (3) Differential peak analysis, and (4) motif analysis. BAM files and SORTED peak BED files selected as input. For each sample this workflow:
1) Annotate all peaks using ChipSeeker
......@@ -16,16 +50,11 @@ This SOP describes the analysis pipeline of downstream analysis of ChIP-seq sequ
4) Annotate all differentially expressed peaks
5) Using MEME-ChIP in motif finding for both original peaks and differently expressed peaks
## Annotations used in the pipeline
ChipSeeker - Known gene from Bioconductor [TxDb annotation](https://bioconductor.org/packages/release/BiocViews.html#___TxDb)
Deeptools - RefGene downloaded from UCSC Table browser
## Workflow Parameters
bam - Choose all ChIP-seq alignment files for analysis.
......@@ -34,20 +63,6 @@ This SOP describes the analysis pipeline of downstream analysis of ChIP-seq sequ
design - Choose the file with the experiment design information. CSV format
toppeak - The number of top peaks used for motif analysis. Default is all
## Design file
+ The Design file is a tab-delimited file with 4 columns for Single-End and 5 columns for Paired-End. Letters, numbers, and underscores can be used in the names. However, the names must begin with a letter. Columns must be as follows:
1. sample_id - The id of the sample. This will be the header in output files, please make sure it is concise
2. experiment_id - Same name given for all replicates of treatment. Will be used for the consensus header.
3. replicate - Replicate number
4. fastq_read1 - Name of fastq file 1 for SE or PE data
5. fastq_read2 - Name of fastq file 2 for PE data
+ See [HERE](/docs/design_ENCSR451NAE_PE.txt) for an example design file, paired-end
+ See [HERE](/docs/design_ENCSR265ZXX_SE.txt) for an example design file, single-end
## Common Errors
If you find an error, please let the [BICF](mailto:BICF@UTSouthwestern.edu) know and we will add it here.
......
......@@ -15,9 +15,10 @@ params.bwaIndex = params.genome ? params.genomes[ params.genome ].bwa ?: false :
params.genomeSize = params.genome ? params.genomes[ params.genome ].genomesize ?: false : false
params.chromSizes = params.genome ? params.genomes[ params.genome ].chromsizes ?: false : false
params.tssFile = params.genome ? params.genomes[ params.genome ].tssfile ?: false : false
params.outDir= "${baseDir}/output"
// Check inputs
if( params.bwaIndex ){
if(params.bwaIndex) {
bwaIndex = Channel
.fromPath(params.bwaIndex)
.ifEmpty { exit 1, "BWA index not found: ${params.bwaIndex}" }
......@@ -27,10 +28,10 @@ if( params.bwaIndex ){
// Define List of Files
readsList = Channel
.fromPath( params.reads )
.fromPath(params.reads)
.flatten()
.map { file -> [ file.getFileName().toString(), file.toString() ].join("\t")}
.collectFile( name: 'fileList.tsv', newLine: true )
.map { file -> [file.getFileName().toString(), file.toString()].join("\t") }
.collectFile(name: 'fileList.tsv', newLine: true)
// Define regular variables
pairedEnd = params.pairedEnd
......@@ -39,33 +40,30 @@ tn5 = params.tn5
designFile = params.designFile
genomeSize = params.genomeSize
chromSizes = params.chromSizes
tssFile = params.tssFile
//tssFile = params.tssFile
process checkDesignFile {
publishDir "$baseDir/output/design", mode: 'copy'
publishDir "${outDir}/design", mode: 'copy'
input:
designFile
file readsList
designFile
file readsList
output:
file("design.tsv") into designFilePaths
file("design.tsv") into designFilePaths
script:
if (pairedEnd) {
"""
python3 $baseDir/scripts/check_design.py -d $designFile -f $readsList -p -a
"""
}
else {
"""
python3 $baseDir/scripts/check_design.py -d $designFile -f $readsList -a
"""
}
if (pairedEnd) {
"""
python3 ${baseDir}/scripts/check_design.py -d ${designFile} -f ${readsList} -p -a
"""
}
else {
"""
python3 ${baseDir}/scripts/check_design.py -d ${designFile} -f ${readsList} -a
"""
}
}
......
File added
File added
File added
File added
File added
File added
......@@ -5,6 +5,7 @@
import argparse
import logging
import pandas as pd
import os
EPILOG = '''
For more details:
......@@ -49,14 +50,15 @@ def get_args():
def check_design_headers(design, paired, atac):
'''Check if design file conforms to sequencing type.'''
'''Check if design file has proper headers.'''
# Default headers
design_template = [
'sample_id',
'experiment_id',
'replicate',
'fastq_read1']
'fastq_read1',
]
design_headers = list(design.columns.values)
......@@ -76,6 +78,46 @@ def check_design_headers(design, paired, atac):
raise Exception("Missing column headers: %s" % list(missing_headers))
def check_samples(design):
    '''Check if design file has the correct sample name mapping.'''
    logger.info("Running sample check.")
    samples = design.groupby('sample_id') \
                    .apply(list)
    malformated_samples = []
    chars = set('-.')
    for sample in samples.index.values:
        if any(char.isspace() for char in sample) or any(char in chars for char in sample):
            malformated_samples.append(sample)
    if len(malformated_samples) > 0:
        logger.error('Malformed samples from design file: %s', list(malformated_samples))
        raise Exception("Malformed samples from design file: %s" %
                        list(malformated_samples))
def check_experiments(design):
    '''Check if design file has the correct experiment name mapping.'''
    logger.info("Running experiment check.")
    experiments = design.groupby('experiment_id') \
                        .apply(list)
    malformated_experiments = []
    chars = set('-.')
    for experiment in experiments.index.values:
        if any(char.isspace() for char in experiment) or any(char in chars for char in experiment):
            malformated_experiments.append(experiment)
    if len(malformated_experiments) > 0:
        logger.error('Malformed experiment from design file: %s', list(malformated_experiments))
        raise Exception("Malformed experiment from design file: %s" %
                        list(malformated_experiments))
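Review note: `check_samples` and `check_experiments` apply the same rule (reject names containing whitespace, `-`, or `.`) to two different columns. One possible way to share that logic is sketched below; `find_malformed_names` is a hypothetical helper that is not part of this commit, and it assumes every name is a string.

```python
def find_malformed_names(names, forbidden='-.'):
    """Return the names that contain whitespace or a forbidden character.

    Hypothetical helper, not in the commit; assumes each name is a str.
    """
    forbidden_chars = set(forbidden)
    return [
        name for name in names
        if any(char.isspace() or char in forbidden_chars for char in name)
    ]
```

Each check could then pass its own column's unique values to the helper and raise only when the returned list is non-empty.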
def check_controls(design):
'''Check if design file has the correct control mapping.'''
......@@ -90,7 +132,7 @@ def check_controls(design):
def check_replicates(design):
'''Check if design file has unique replicate numbersfor an experiment.'''
'''Check if design file has unique replicate numbers for an experiment.'''
logger.info("Running replicate check.")
......@@ -111,7 +153,7 @@ def check_replicates(design):
def check_files(design, fastq, paired):
'''Check if design file has the files found.'''
'''Check if design file fastq lists are actually present.'''
logger.info("Running file check.")
......@@ -161,7 +203,12 @@ def main():
check_controls(design_df)
check_replicates(design_df)
new_design_df = check_files(design_df, fastq_df, paired)
check_samples(design_df)
check_experiments(design_df)
checked_design_df = check_files(design_df, fastq_df, paired)
# Add length of each read to design file
new_design_df = get_length(checked_design_df)
# Write out new design file
new_design_df.to_csv('design.tsv', header=True, sep='\t', index=False)
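Review note: the body of `get_length` is not shown in this diff. As a rough illustration of the step the changelog describes (adding the fastq1 sequence length to the design file), the sketch below reads the length of the first read from a gzipped FASTQ file. The function name `first_read_length` is hypothetical, and using only the first read is an assumption, not a statement of how the commit computes the value.

```python
import gzip


def first_read_length(fastq_path):
    """Return the length of the first sequence in a gzipped FASTQ file.

    Hypothetical sketch: assumes the standard four-line FASTQ record layout.
    """
    with gzip.open(fastq_path, 'rt') as handle:
        handle.readline()  # line 1: read header, starts with '@'
        sequence = handle.readline().strip()  # line 2: the bases
    return len(sequence)
```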
......