Updated check design issue 13

Commit 3d891107 authored 5 years ago by Holly Ruess
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml
 before_script:
  - module add  python/3.6.1-2-anaconda
-  - pip install --user pytest-pythonpath pytest-cov
-  - module load nextflow/0.27.6
+  - pip install --user pytest-pythonpath==0.7.1 pytest-cov==2.5.1
+  - module load nextflow/0.31.0
  - ln -s /project/shared/bicf_workflow_ref/workflow_testdata/atacseq/*fastq.gz test_data/

 stages:
@@ -15,14 +15,22 @@ user_configuration:

 single_end_human:
  stage: integration
+  only:
+    - branches
+    - master
  script:
  - nextflow run workflow/main.nf
+  - pytest -m singleend_human
  artifacts:
    expire_in: 3 days

 paired_end_mouse:
  stage: integration
+  only:
+    - branches
+    - master
  script:
  - nextflow run workflow/main.nf --designFile "$CI_PROJECT_DIR/test_data/design_ENCSR451NAE_PE.txt" --genome 'GRCm38' --pairedEnd true
+  - pytest -m pairedend_mouse
  artifacts:
    expire_in: 3 days
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,9 @@ All notable changes to this project will be documented in this file.
 ### Added
 - Changelog
 - Merge request template
+- Added new CI/CD tests for better coverage
+- Added punctuation check in design file
+- Added sequence (fastq1) length into design file for better mapping

 ## [publish_1.0.0 ] - 2019-12-03
 Initial release of pipeline

--- a/README.md
+++ b/README.md
@@ -7,7 +7,41 @@
 [![Astrocyte](https://img.shields.io/badge/astrocyte-%E2%89%A50.1.0-blue.svg)](https://astrocyte-test.biohpc.swmed.edu/static/docs/index.html)


+## Introduction
+BICF ATAC-seq is a bioinformatics best-practice analysis pipeline used for ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) data analysis at [BICF](http://www.utsouthwestern.edu/labs/bioinformatics/) at [UT Southwestern Department of Bioinformatics](http://www.utsouthwestern.edu/departments/bioinformatics/).

+The pipeline uses [Nextflow](https://www.nextflow.io), a bioinformatics workflow tool. It pre-processes raw data from FastQ inputs, aligns the reads, and performs extensive quality control on the results.
+
+This pipeline is primarily used with a SLURM cluster on the [BioHPC Cluster](https://portal.biohpc.swmed.edu/content/). However, the pipeline should be able to run on any system that supports Nextflow.
+
+Additionally, the pipeline is designed to work with [Astrocyte Workflow System](https://astrocyte.biohpc.swmed.edu/static/docs/index.html) using a simple web interface.
+
+The current version of the software and issue reports are at
+https://git.biohpc.swmed.edu/BICF/Astrocyte/atacseq_analysis
+
+To download the current (working, untagged) version of the software:
+```bash
+$ git clone git@git.biohpc.swmed.edu:BICF/Astrocyte/atacseq_analysis.git
+```
+
+## Input files
+##### 1) Fastq Files
+  + You will need the full path to the files for the Bash Script
+
+## Design file
+  + The Design file is a tab-delimited file with 4 columns for Single-End and 5 columns for Paired-End. Letters, numbers, and underscores can be used in the names. However, the names must begin with a letter. Columns must be as follows:
+
+    1. sample_id - The id of the sample. This will be the header in output files, so please make sure it is concise.
+    2. experiment_id - Same name given for all replicates of treatment. Will be used for the consensus header.
+    3. replicate - Replicate number
+    4. fastq_read1 - Name of fastq file 1 for SE or PE data
+    5. fastq_read2 - Name of fastq file 2 for PE data
+    
+  + See [HERE](/docs/design_ENCSR451NAE_PE.txt) for an example design file, paired-end
+  + See [HERE](/docs/design_ENCSR265ZXX_SE.txt) for an example design file, single-end
+
+
+## SOP
 This SOP describes the analysis pipeline of downstream analysis of ChIP-seq sequencing data. This pipeline includes (1) Quality control using Deeptools, (2) Peak annotation, (3) Differential peak analysis, and (4) motif analysis. BAM files and SORTED peak BED files selected as input. For each sample this workflow:

    1) Annotate all peaks using ChipSeeker
@@ -16,16 +50,11 @@ This SOP describes the analysis pipeline of downstream analysis of ChIP-seq sequ
    4) Annotate all differentially expressed peaks
    5) Using MEME-ChIP in motif finding for both original peaks and differently expressed peaks

-
-
 ## Annotations used in the pipeline

    ChipSeeker - Known gene from Bioconductor [TxDb annotation](https://bioconductor.org/packages/release/BiocViews.html#___TxDb)
    Deeptools - RefGene downloaded from UCSC Table browser

-
-
-
 ## Workflow Parameters

    bam - Choose all ChIP-seq alignment files for analysis.
@@ -34,20 +63,6 @@ This SOP describes the analysis pipeline of downstream analysis of ChIP-seq sequ
    design - Choose the file with the experiment design information. CSV format
    toppeak - The number of top peaks used for motif analysis. Default is all

-
-
-## Design file
-  + The Design file is a tab-delimited file with 4 columns for Single-End and 5 columns for Paired-End.  Letter, numbers, and underlines can be used in the names. However, the names must begin with a letter. Columns must be as follows:
-
-    1. sample_id - The id of the sample. This will be the header in output files, please make sure it is concise
-    2. experiment_id - Same name given for all replicates of treatment. Will be used for the consensus header.
-    3. replicate - Replicate number
-    4. fastq_read1 - Name of fastq file 1 for SE or PE data
-    5. fastq_read2 - Name of fastq file 2 for PE data
-    
-  + See [HERE](/docs/design_ENCSR451NAE_PE.txt) for an example design file, paired-end
-  + See [HERE](/docs/design_ENCSR265ZXX_SE.txt) for an example design file, single-end
-
 ## Common Errors
 If you find an error, please let the [BICF](mailto:BICF@UTSouthwestern.edu) know and we will add it here.
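
The design-file layout described in the README hunk above (tab-delimited, 4 columns for single-end, 5 for paired-end, names beginning with a letter) can be illustrated with a small standalone sketch. This is not the pipeline's `check_design.py`; the helper `read_design`, the regular expression, and the sample rows are hypothetical, and applying the naming rule to `sample_id` only is an assumption.

```python
import csv
import io
import re

# Column layout taken from the README; everything else here is hypothetical.
SE_COLUMNS = ["sample_id", "experiment_id", "replicate", "fastq_read1"]
PE_COLUMNS = SE_COLUMNS + ["fastq_read2"]

# README rule: letters, numbers, and underscores; must begin with a letter.
NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def read_design(text, paired):
    """Parse tab-delimited design text and return a list of row dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    expected = PE_COLUMNS if paired else SE_COLUMNS
    missing = [c for c in expected if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("Missing column headers: %s" % missing)
    rows = list(reader)
    # Assumption: the naming rule is applied to sample_id only.
    bad = [r["sample_id"] for r in rows if not NAME_RE.match(r["sample_id"])]
    if bad:
        raise ValueError("Malformed sample names: %s" % bad)
    return rows


# Made-up paired-end design with two replicates of one experiment.
example = (
    "sample_id\texperiment_id\treplicate\tfastq_read1\tfastq_read2\n"
    "sampleA_rep1\texpA\t1\tA_1_R1.fastq.gz\tA_1_R2.fastq.gz\n"
    "sampleA_rep2\texpA\t2\tA_2_R1.fastq.gz\tA_2_R2.fastq.gz\n"
)
rows = read_design(example, paired=True)
print(len(rows), rows[0]["sample_id"])  # prints: 2 sampleA_rep1
```

Passing the same text with `paired=False` also succeeds, since the extra `fastq_read2` column is simply ignored; a single-end file passed with `paired=True` raises `ValueError` for the missing header.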


--- a/workflow/main.nf
+++ b/workflow/main.nf
@@ -15,9 +15,10 @@ params.bwaIndex = params.genome ? params.genomes[ params.genome ].bwa ?: false :
 params.genomeSize = params.genome ? params.genomes[ params.genome ].genomesize ?: false : false
 params.chromSizes = params.genome ? params.genomes[ params.genome ].chromsizes ?: false : false
 params.tssFile = params.genome ? params.genomes[ params.genome ].tssfile ?: false : false
+params.outDir = "${baseDir}/output"

 // Check inputs
-if( params.bwaIndex ){
+if(params.bwaIndex) {
  bwaIndex = Channel
    .fromPath(params.bwaIndex)
    .ifEmpty { exit 1, "BWA index not found: ${params.bwaIndex}" }
@@ -27,10 +28,10 @@ if( params.bwaIndex ){

 // Define List of Files
 readsList = Channel
-  .fromPath( params.reads )
+  .fromPath(params.reads)
  .flatten()
-  .map { file -> [ file.getFileName().toString(), file.toString() ].join("\t")}
-  .collectFile( name: 'fileList.tsv', newLine: true )
+  .map { file -> [file.getFileName().toString(), file.toString()].join("\t") }
+  .collectFile(name: 'fileList.tsv', newLine: true)

 // Define regular variables
 pairedEnd = params.pairedEnd
@@ -39,33 +40,30 @@ tn5 = params.tn5
 designFile = params.designFile
 genomeSize = params.genomeSize
 chromSizes = params.chromSizes
-tssFile = params.tssFile
+//tssFile = params.tssFile

 process checkDesignFile {

-  publishDir "$baseDir/output/design", mode: 'copy'
+  publishDir "${outDir}/design", mode: 'copy'

  input:
-
-  designFile
-  file readsList
+    designFile
+    file readsList

  output:
-
-  file("design.tsv") into designFilePaths
+    file("design.tsv") into designFilePaths

  script:
-
-  if (pairedEnd) {
-    """
-    python3 $baseDir/scripts/check_design.py -d $designFile -f $readsList -p -a
-    """
-  }
-  else {
-    """
-    python3 $baseDir/scripts/check_design.py -d $designFile -f $readsList -a
-    """
-  }
+    if (pairedEnd) {
+      """
+      python3 ${baseDir}/scripts/check_design.py -d ${designFile} -f ${readsList} -p -a
+      """
+    }
+    else {
+      """
+      python3 ${baseDir}/scripts/check_design.py -d ${designFile} -f ${readsList} -a
+      """
+    }

 }


--- a/workflow/scripts/__pycache__/check_design.cpython-36.pyc
+++ b/workflow/scripts/__pycache__/check_design.cpython-36.pyc
--- a/workflow/scripts/__pycache__/experiment_design.cpython-36.pyc
+++ b/workflow/scripts/__pycache__/experiment_design.cpython-36.pyc
--- a/workflow/scripts/__pycache__/experiment_qc.cpython-36.pyc
+++ b/workflow/scripts/__pycache__/experiment_qc.cpython-36.pyc
--- a/workflow/scripts/__pycache__/overlap_peaks.cpython-36.pyc
+++ b/workflow/scripts/__pycache__/overlap_peaks.cpython-36.pyc
--- a/workflow/scripts/__pycache__/pool_and_psuedoreplicate.cpython-36.pyc
+++ b/workflow/scripts/__pycache__/pool_and_psuedoreplicate.cpython-36.pyc
--- a/workflow/scripts/__pycache__/utils.cpython-36.pyc
+++ b/workflow/scripts/__pycache__/utils.cpython-36.pyc
--- a/workflow/scripts/check_design.py
+++ b/workflow/scripts/check_design.py
@@ -5,6 +5,7 @@
 import argparse
 import logging
 import pandas as pd
+import os

 EPILOG = '''
 For more details:
@@ -49,14 +50,15 @@ def get_args():


 def check_design_headers(design, paired, atac):
-    '''Check if design file conforms to sequencing type.'''
+    '''Check if design file has proper headers.'''

    # Default headers
    design_template = [
        'sample_id',
        'experiment_id',
        'replicate',
-        'fastq_read1']
+        'fastq_read1',
+        ]

    design_headers = list(design.columns.values)

@@ -76,6 +78,46 @@ def check_design_headers(design, paired, atac):
        raise Exception("Missing column headers: %s" % list(missing_headers))


+def check_samples(design):
+    '''Check if design file has the correct sample name mapping.'''
+
+    logger.info("Running sample check.")
+
+    samples = design.groupby('sample_id') \
+                            .apply(list)
+
+    malformated_samples = []
+    chars = set('-.')
+    for sample in samples.index.values:
+        if any(char.isspace() or char in chars for char in sample):
+            malformated_samples.append(sample)
+
+    if len(malformated_samples) > 0:
+        logger.error('Malformed samples from design file: %s', list(malformated_samples))
+        raise Exception("Malformed samples from design file: %s" %
+                        list(malformated_samples))
+
+
+def check_experiments(design):
+    '''Check if design file has the correct experiment name mapping.'''
+
+    logger.info("Running experiment check.")
+
+    experiments = design.groupby('experiment_id') \
+                            .apply(list)
+
+    malformated_experiments = []
+    chars = set('-.')
+    for experiment in experiments.index.values:
+        if any(char.isspace() or char in chars for char in experiment):
+            malformated_experiments.append(experiment)
+
+    if len(malformated_experiments) > 0:
+        logger.error('Malformed experiment from design file: %s', list(malformated_experiments))
+        raise Exception("Malformed experiment from design file: %s" %
+                        list(malformated_experiments))
+
+
 def check_controls(design):
    '''Check if design file has the correct control mapping.'''

@@ -90,7 +132,7 @@ def check_controls(design):


 def check_replicates(design):
-    '''Check if design file has unique replicate numbersfor an experiment.'''
+    '''Check if design file has unique replicate numbers for an experiment.'''

    logger.info("Running replicate check.")

@@ -111,7 +153,7 @@ def check_replicates(design):


 def check_files(design, fastq, paired):
-    '''Check if design file has the files found.'''
+    '''Check if the fastq files listed in the design file are actually present.'''

    logger.info("Running file check.")

@@ -161,7 +203,12 @@ def main():
        check_controls(design_df)

    check_replicates(design_df)
-    new_design_df = check_files(design_df, fastq_df, paired)
+    check_samples(design_df)
+    check_experiments(design_df)
+    checked_design_df = check_files(design_df, fastq_df, paired)
+
+    # Add length of each read to design file
+    new_design_df = get_length(checked_design_df)

    # Write out new design file
    new_design_df.to_csv('design.tsv', header=True, sep='\t', index=False)
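
The new `check_samples` and `check_experiments` functions in this diff apply the same rule: a name is rejected if it contains whitespace, `-`, or `.`. A minimal sketch of that rule in isolation, without pandas, might look like the following. The helper name `find_malformed` and the example names are hypothetical and are not part of the commit.

```python
# Characters rejected by the checks in the diff, in addition to whitespace.
FORBIDDEN = set("-.")


def find_malformed(names):
    """Return the unique names that contain whitespace, '-' or '.'."""
    malformed = []
    # Sorting the unique names mirrors the ordered group index in the diff.
    for name in sorted(set(names)):
        if any(ch.isspace() or ch in FORBIDDEN for ch in name):
            malformed.append(name)
    return malformed


# Made-up identifiers: three break the rule, two pass.
sample_ids = ["ENCLB622FZX", "bad-name", "also bad", "v1.2", "good_name"]
print(find_malformed(sample_ids))  # prints: ['also bad', 'bad-name', 'v1.2']
```

Because the rule is identical for both columns, a shared helper like this could in principle be called once for `sample_id` and once for `experiment_id`; whether that refactor suits the project is a judgment for the authors.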

--- a/workflow/tests/__pycache__/test_call_peaks_macs.cpython-36-PYTEST.pyc
+++ b/workflow/tests/__pycache__/test_call_peaks_macs.cpython-36-PYTEST.pyc
--- a/workflow/tests/__pycache__/test_check_design.cpython-36-PYTEST.pyc
+++ b/workflow/tests/__pycache__/test_check_design.cpython-36-PYTEST.pyc
--- a/workflow/tests/__pycache__/test_convert_reads.cpython-36-PYTEST.pyc
+++ b/workflow/tests/__pycache__/test_convert_reads.cpython-36-PYTEST.pyc
--- a/workflow/tests/__pycache__/test_experiment_design.cpython-36-PYTEST.pyc
+++ b/workflow/tests/__pycache__/test_experiment_design.cpython-36-PYTEST.pyc
--- a/workflow/tests/__pycache__/test_experiment_qc.cpython-36-PYTEST.pyc
+++ b/workflow/tests/__pycache__/test_experiment_qc.cpython-36-PYTEST.pyc
--- a/workflow/tests/__pycache__/test_map_qc.cpython-36-PYTEST.pyc
+++ b/workflow/tests/__pycache__/test_map_qc.cpython-36-PYTEST.pyc
--- a/workflow/tests/__pycache__/test_map_reads.cpython-36-PYTEST.pyc
+++ b/workflow/tests/__pycache__/test_map_reads.cpython-36-PYTEST.pyc
--- a/workflow/tests/__pycache__/test_overlap_peaks.cpython-36-PYTEST.pyc
+++ b/workflow/tests/__pycache__/test_overlap_peaks.cpython-36-PYTEST.pyc
--- a/workflow/tests/__pycache__/test_pool_and_psuedoreplicate.cpython-36-PYTEST.pyc
+++ b/workflow/tests/__pycache__/test_pool_and_psuedoreplicate.cpython-36-PYTEST.pyc