Commit efb6c391 authored by Brandi Cantarel's avatar Brandi Cantarel

initial add

parents
Copyright © 2016. The University of Texas Southwestern Medical Center
5323 Harry Hines Boulevard Dallas, Texas, 75390 Telephone 214-648-3111
\ No newline at end of file
# Astrocyte Example Workflow Package
This is an example workflow package for the BioHPC Astrocyte workflow engine. Astrocyte is a system allowing workflows to be run easily from the web, in a push-button manner, taking advantage of the BioHPC compute cluster. Astrocyte allows users to access this workflow package using a simple web interface, created automatically from the definitions in this package.
This workflow package provides:
1) A sample ChIP-Seq data analysis workflow, which uses BWA and MACS to call peaks from one or more ChIP-Seq FASTQ input files. The workflow is implemented in the *Nextflow* workflow language.
2) A sample *R Shiny* visualization app, which provides a web-based tool for visualizing results.
3) Meta-data describing the workflow, it's inputs, output etc.
4) User-focused documentation, in markdown format, that will be displayed to users in the Astrocyte web interface.
5) Developer-focused documentation, in this file.
## Workflow Package Layout
Workflow packages for Astrocyte are Git repositories, and have a common layout which must be followed so that Astrocyte understands how to present them to users.
### Meta-Data
* `astrocyte_pkg.yml` - A file which contains the metadata describing the workflow in human & machine readable text format called *YAML*. This includes information about the workflow package such as it's name, synopsis, input parameters, outputs etc.
### The Workflow
* `workflow/main.nf` - A *Nextflow* workflow file, which will be run by Astrocyte using parameters provided by the user.
* `workflow/scripts` - A directory for any scripts (e.g. bash, python, ruby scripts) that the `main.nf` workflow will call. This might be empty if the workflow is implemented entirely in nextflow. You should *not* include large pieces of software here. Workflows should be designed to use *modules* available on the BioHPC cluster. The modules a workflow needs will be defined in the `astrocyte_pkg.yml` metadata file.
* `workflow/lib` - A directory for any netflow/groovy libraries that might be included by workflows using advanced features. Usualy empty.
### The Visualization App *(Optional)*
* `vizapp/` - A directory that will contain an *R Shiny* visualization app, if required. The vizualization app will be made available to the user via the Astrocyte web interface. At minimum the directory requires the standard Shiny `ui.R` and `server.R` files. The exact Shiny app structure is not prescribed. Any R packages required by the Shiny app will be listed in the `astrocyte_pkg.yml` metadata.
The visualization app will have access to any final output that was published to the `$baseDir` location in the
nextflow workflow. This path will be accessible as `Sys.getenv('baseDir')`.
### User Documentation
* `docs/index.md` - The first page of user documentation, in *markdown* format. Astrocyte will display this documentation to users of the workflow package.
* `docs/...` - Any other documentation files. *Markdown* `.md` files will be rendered for display on the web. Any images used in the documentation should also be placed here.
### Developer Documentation
* `README.md` - Documentation for developers of the workflow giving a brief overview and any important notes that are not for workflow users.
* `LICENSE.md` *(Optional)* - The license applied to the workflow package.
* `CHANGES.md` - A brief summary of changes made through time to the workflow.
### Testing
* `test_data/` - Every workflow package should include a minimal set of test data that allows the workflow to be run, testing its features. The `test_data/` directory is a location for test data files. Test data should be kept as small as possible. If large datasets (over 20MB total) are unavoidable provide a `fetch_test_data.sh` bash script which obtains the data from an external source.
* `test_data/fetch_test_data.sh` *Optional* - A bash script that fetches large test data from an external source, placing it into the `test_data/` directory.
## Testing/Running the Workflow with the Astrocyte CLI
Workflows will usually be run from the Astrocyte web interface, by importing the workflow package repository and making it available to users. During development you can use the Astrocyte CLI scripts to check, test, and run your workflow against non-test data.
To check the structure and syntax of the workflow package in the directory `astrocyte_example`:
```bash
$ astrocyte_cli check astrocyte_example
```
To launch the workflows defined tests, against included test data:
```bash
$ astrocyte_cli test astrocyte_example
```
To run the workflow using specific data and parameters. A working directory will be created.
```bash
$ astrocyte_cli run astrocyte_example --parameter1 "value1" --parameter2 "value2"...
```
To run the Shiny vizualization app against test_data
```bash
$ astrocyte_cli shinytest astrocyte_example
```
To run the Shiny vizualization app against output from `astrocyte_cli run`, which will be in the work directory created by `run`:
```bash
$ astrocyte_cli shiny astrocyte_example
```
To generate the user-facing documentation for the workflow and display it in a web browser"
```bash
$ astrocyte_cli docs astrocyte_example
```
## The Example ChIP-Seq Workflow ##
## The Example Shiny Viz App ##
## Provided Test Data ##
\ No newline at end of file
#
# metadata for the example astrocyte ChipSeq workflow package
#
# -----------------------------------------------------------------------------
# BASIC INFORMATION
# -----------------------------------------------------------------------------
# A unique identifier for the workflow package, text/underscores only
name: 'germline_bicf'
# Who wrote this?
author: 'Brandi Cantarel'
# A contact email address for questions
email: 'biohpc-help@utsouthwestern.edu'
# A more informative title for the workflow package
title: 'BICF RNASeq Variant Analysis Workflow'
# A summary of the workflow package in plain text
description: |
This is a workflow package for the BICF RNASeq Germline Variant workflow system.
It implements a simple germline variant analysis workflow using TrimGalore, HiSAT,
Speedseq, GATK, Samtools and FeatureCount. SNPs and Indels are integrated using BAYSIC;
then annotated using SNPEFF and SnpSift.
# -----------------------------------------------------------------------------
# DOCUMENTATION
# -----------------------------------------------------------------------------
# A list of documentation file in .md format that should be viewable from the
# web interface. These files are in the 'docs' subdirectory. The first file
# listed will be used as a documentation index and is index.md by convention
documentation_files:
- 'index.md'
# -----------------------------------------------------------------------------
# NEXTFLOW WORKFLOW CONFIGURATION
# -----------------------------------------------------------------------------
# Remember - The workflow file is always named 'workflow/main.f'
# The workflow must publish all final output into $baseDir
# A list of clueter environment modules that this workflow requires to run.
# Specify versioned module names to ensure reproducability.
workflow_modules:
- 'bcftools/intel/1.3'
- 'bedtools/2.25.0'
- 'bwa/intel/0.7.12'
- 'gatk/3.5'
- 'picard/1.127'
- 'integrate/0.2.6'
- 'samtools/intel/1.3'
- 'snpeff/4.2'
- 'speedseq/20160506'
- 'vcftools/0.1.11'
- 'jags/4.2.0'
# A list of parameters used by the workflow, defining how to present them,
# options etc in the web interface. For each parameter:
#
# REQUIRED INFORMATION
# id: The name of the parameter in the NEXTFLOW workflow
# type: The type of the parameter, one of:
# string - A free-format string
# integer - An integer
# real - A real number
# file - A single file from user data
# files - One or more files from user data
# select - A selection from a list of values
# required: true/false, must the parameter be entered/chosen?
# description: A user friendly description of the meaning of the parameter
#
# OPTIONAL INFORMATION
# default: A default value for the parameter (optional)
# min: Minium value/characters/files for number/string/files types
# max: Maxumum value/characters/files for number/string/files types
# regex: A regular expression that describes valid entries / filenames
#
# SELECT TYPE
# choices: A set of choices presented to the user for the parameter.
# Each choice is a pair of value and description, e.g.
#
# choices:
# - [ 'myval', 'The first option']
# - [ 'myval', 'The second option']
#
# NOTE - All parameters are passed to NEXTFLOW as strings... but they
# are validated by astrocyte using the information provided above
workflow_parameters:
- id: rnabam
type: files
required: true
description: |
Alignment files from a RNASeq experiment
regex: ".bam"
min: 1
- id: dnabam
type: files
required: false
description: |
BAM Files from Genomic Sequence from the same samples used for Gene Fusion Analysis
regex: ".bam"
- id: incdna
type: select
required: true
choices:
- [ '0', 'RNA Only']
- [ '1', 'DNA BAM Included in Design File']
description: |
DNA BAM files can be included for Gene Fusion Calling
- id: design
type: file
required: true
description: |
A design file listing pairs of sample name and sample group
- id: genome
type: select
choices:
- [ '/project/shared/bicf_workflow_ref/GRCh38', 'Human GRCh38']
- [ '/project/shared/bicf_workflow_ref/GRCh37', 'Human GRCh37']
- [ '/project/shared/bicf_workflow_ref/GRCm38', 'Mouse GRCh38']
required: true
description: |
Reference genome for alignment
# -----------------------------------------------------------------------------
# SHINY APP CONFIGURATION
# -----------------------------------------------------------------------------
# Remember - The vizapp is always 'vizapp/server.R' 'vizapp/ui.R'
# The workflow must publish all final output into $baseDir
# Name of the R module that the vizapp will run against
vizapp_r_module: 'R/3.2.1-Intel'
# List of any CRAN packages, not provided by the modules, that must be made
# available to the vizapp
vizapp_cran_packages: []
# # List of any Bioconductor packages, not provided by the modules, that must be made
# available to the vizapp
vizapp_bioc_packages: []
# Astrocyte Germline Variant Calling Workflow Package
This workflow carries out a Germline Exome Analysis pipeline, including the integration of variants from various callers and basic annotation.
1) RNA Alignments are then recalibrated and realigned using GATK3 (DePristo et al 2011;McKenna et al 2010)
2) To detect genome germline variants, GATK3 (DePristo et al 2011, McKenna et al 2010), Platypus (Rimmer et al 2014), Samtools version 1.3 and FreeBayes version 0.9.7 (Garrison and Marth 2012) are used.
3) Integration of predicted SNPs and INDELs from these algorithms is performed using BAYSIC (Cantarel et al 2014).
4) Effect of SNPs and INDELs on genes is predicted using snpEff (Cingolani et al 2012) using the gencode gene annotations. For GRCH38 Only: allele frequency in the general population is determined by comparison to ExAC (The ExAC Consortium 2015). Additionally for this build, discovered variants are annotated using SnpSift (Cingolani et al 2012) using the dbSNP, COSMIC (Forbes et al. 2009), CLINVAR (Landrum et al 2014), GWAS Catalog (Welter et al 2014) and dbNSFP (Liu et al 2011) databases.
5) Features (genes, transcripts and exons) are counted using featureCounts (Liao et al 2014) using the Gencode feature table(Harrow et al. 2012)
##Workflow Parameters
rnabam - Choose the alignments of your RNASeq data (generated by RNASEq Differential Expression Pipeline).
dnabam - Choose the bamfiles from genomic data that should be used for gene fusion
genome - Choose a genomic reference (genome).
pairs - Choose if pair-ended or single-end sequences
incdna - Choose whether GeneFusion analysis should include evidence from genomic data from the same sample
design - This file matches the fastq files to data about the sample
The following columns are necessary, must be named as in template and can be in any order:
SampleID
This ID should match the name in the fastq file ie S0001.R1.fastq.gz the sample ID is S0001
SampleName
This ID can be the identifier of the researcher or clinician
SubjectID
Used in order to link samples from the same patient
Phenotype
2= Case or Diseaes Phenotype, 1= Healthy Control
Gender
1=male, 2=female
FullPathToFqR1
Name of the fastq file R1
FullPathToFqR2
Name of the fastq file R2
There are some optional columns that might help with the analysis:
SequenceRun
Organism
FamilyID
CellPopulation
Treatment
GeneticFeature (WT or KO)
Race
Ethnicity
Age
### Test Data
### Credits
This example worklow is derived from original scripts kindly contributed by the Bioinformatic Core Facility (BICF), Department of Bioinformatics and Clinical Sciences.
### References
Andy Rimmer, Hang Phan, Iain Mathieson, Zamin Iqbal, Stephen R. F. Twigg, WGS500 Consortium, Andrew O. M. Wilkie, Gil McVean, Gerton Lunter. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nature Genetics (2014) doi:10.1038/ng.3036
Bernstein, B. E., Birney, E., Dunham, I., Green, E. D., Gunter, C., & Snyder, M. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57–74. doi:10.1038/nature11247
Cantarel, B. L., Weaver, D., McNeill, N., Zhang, J., Mackey, A. J., & Reese, J. (2014). BAYSIC: a Bayesian method for combining sets of genome variants with improved specificity and sensitivity. BMC Bioinformatics, 15, 104. doi:10.1186/1471-2105-15-104
Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., ? Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. doi:10.4161/fly.19695
Cingolani, P., Patel, V. M., Coon, M., Nguyen, T., Land, S. J., Ruden, D. M., & Lu, X. (2012). Using Drosophila melanogaster as a Model for Genotoxic Chemical Mutational Studies with a New Program, SnpSift. Frontiers in Genetics. doi:10.3389/fgene.2012.00035
Challis, D., Yu, J., Evani, U. S., Jackson, A. R., Paithankar, S., Coarfa, C., ? Yu, F. (2012). An integrative variant analysis suite for whole exome next-generation sequencing data. BMC Bioinformatics. doi:10.1186/1471-2105-13-8
DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V, Maguire, J. R., Hartl, C., ? Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43, 491–498. doi:10.1038/ng.806
Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA (URL: http://evs.gs.washington.edu/EVS/) [01 (01, 2016) accessed].
Forbes, S. A., Tang, G., Bindal, N., Bamford, S., Dawson, E., Cole, C., ? Futreal, P. A. (2009). COSMIC (the Catalogue of Somatic Mutations In Cancer): A resource to investigate acquired mutations in human cancer. Nucleic Acids Research, 38. doi:10.1093/nar/gkp995
Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] 2012
Hansen NF, Gartner JJ, Mei L, Samuels Y, Mullikin JC. Shimmer: detection of genetic alterations in tumors using next-generation sequence data. Bioinformatics. 2013 Jun 15;29(12):1498-503. doi: 10.1093/bioinformatics/btt183. Epub 2013 Apr 24. PubMed PMID: 23620360; PubMed Central PMCID: PMC3673219.
Kim S, Jeong K, Bhutani K, Lee J, Patel A, Scott E, Nam H, Lee H, Gleeson JG, Bafna V. Virmid: accurate detection of somatic mutations with sample impurity inference. Genome Biol. 2013 Aug 29;14(8):R90. doi: 10.1186/gb-2013-14-8-r90. PubMed PMID: 23987214; PubMed Central PMCID: PMC4054681.
Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012 Mar;22(3):568-76. doi: 10.1101/gr.129684.111. Epub 2012 Feb 2. PubMed PMID: 22300766; PubMed Central PMCID: PMC3290792.
Landrum, M. J., Lee, J. M., Riley, G. R., Jang, W., Rubinstein, W. S., Church, D. M., & Maglott, D. R. (2014). ClinVar: Public archive of relationships among sequence variation and human phenotype. Nucleic Acids Research, 42. doi:10.1093/nar/gkt1113
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv Preprint arXiv, 00, 3. doi:arXiv:1303.3997 [q-bio.GN]
Liu X, Jian X, Boerwinkle E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum Mutat. 2011 Aug;32(8):894-9. doi: 10.1002/humu.21517. PubMed PMID: 21520341
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., ? DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20, 1297–1303. doi:10.1101/gr.107524.110
The 1000 Genome Consortium. An integrated map of genetic variation from 1,092 human genomes. (2012). Nature, 491(7422), 56–65. Retrieved from http://dx.doi.org/10.1038/nature11632.
Saunders CT, Wong WS, Swamy S, Becq J, Murray LJ, Cheetham RK. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics. 2012 Jul 15;28(14):1811-7. doi: 10.1093/bioinformatics/bts271. Epub 2012 May 10. PubMed PMID: 22581179.
Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L, and Parkinson H. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Research, 2014, Vol. 42 (Database issue): D1001-D1006.
#!/usr/bin/env nextflow
// Default parameter values to run tests
params.bamdna="$baseDir/../test_data/dna/*.bam"
params.bamrna="$baseDir/../test_data/rna/*.bam"
rnabams=file(params.bamrna)
dnabams=file(params.bamdna)
params.incdna = '1'
params.design="$baseDir/../test_data/design.txt"
design_file = file(params.design)
params.genome="/project/shared/bicf_workflow_ref/GRCh38"
dbsnp="$params.genome/dbSnp.vcf.gz"
indel="$params.genome/GoldIndels.vcf.gz"
btoolsgenome = file("$params.genome/genomefile.txt")
knownindel=file(indel)
gatkref=file("$params.genome/genome.fa")
index_path = file(params.genome)
dbsnp=file(dbsnp)
genebed = file("$params.genome/gencode.genes.chr.bed")
snpeff_vers = 'GRCh38.82';
if (params.genome == '/project/shared/bicf_workflow_ref/GRCm38') {
snpeff_vers = 'GRCm38.82';
}
if (params.genome == '/project/shared/bicf_workflow_ref/GRCh37') {
snpeff_vers = 'GRCh37.75';
}
def fileMap = [:]
dnabams.each {
final fileName = it.getFileName().toString()
prefix = fileName.lastIndexOf('/')
fileMap[fileName] = it
}
rnabams.each {
final fileName = it.getFileName().toString()
prefix = fileName.lastIndexOf('/')
fileMap[fileName] = it
}
def bamset = []
new File(params.design).withReader { reader ->
def hline = reader.readLine()
def header = hline.split("\t")
prefixidx = header.findIndexOf{it == 'SampleID'};
dnaidx = header.findIndexOf{it == 'DNAbam'};
rnaidx = header.findIndexOf{it == 'RNAbam'};
if (dnaidx == -1) {
dnaidx = rnaidx
}
while (line = reader.readLine()) {
def row = line.split("\t")
if (fileMap.get(row[rnaidx]) != null) {
bamset << tuple(row[prefixidx],fileMap.get(row[rnaidx]),fileMap.get(row[dnaidx]))
}
}
}
if( ! bamset) { error "Didn't match any input files with entries in the design file" }
Channel
.from(bamset)
.set { initbam }
process gatkbam {
publishDir "$baseDir/output", mode: 'copy'
input:
set pair_id, file(rbam), file(dbam) from initbam
output:
set pair_id,file("${pair_id}.rna.bam"),file("${pair_id}.rna.bai") into gatkbam
set pair_id,file("${pair_id}.rg_added_sorted.bam"),file("${pair_id}.unmapped.bam"),file(dbam) into svbam
file("${pair_id}.rna.bam") into sambam
file("${pair_id}.rna.bai") into samidx
file("${pair_id}.rna.bam") into ssbam
file("${pair_id}.rna.bai") into ssidx
script:
"""
module load gatk/3.5 samtools/intel/1.3 speedseq/20160506 picard/1.127
java -Xmx4g -jar \$PICARD/picard.jar AddOrReplaceReadGroups INPUT=${rbam} O=${pair_id}.rg_added_sorted.bam SO=coordinate RGID=${pair_id} RGLB=tx RGPL=illumina RGPU=barcode RGSM=${pair_id}
sambamba markdup -t 20 -r ${pair_id}.rg_added_sorted.bam ${pair_id}.dedup.bam
java -Xmx4g -jar \$GATK_JAR -T SplitNCigarReads -R ${gatkref} -I ${pair_id}.dedup.bam -o ${pair_id}.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
java -Xmx4g -jar \$GATK_JAR -l INFO -R ${gatkref} --knownSites ${dbsnp} -I ${pair_id}.split.bam -T BaseRecalibrator -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -o ${pair_id}.recal_data.grp -nt 1 -nct 30
java -Xmx4g -jar \$GATK_JAR -T PrintReads -R ${gatkref} -I ${pair_id}.split.bam -BQSR ${pair_id}.recal_data.grp -o ${pair_id}.rna.bam -nt 1 -nct 8
sambamba view -t 30 -f bam -F "unmapped" -o ${pair_id}.unmapped.bam ${pair_id}.rg_added_sorted.bam
"""
}
process svcall {
publishDir "$baseDir/output", mode: 'copy'
cpus 32
input:
set pair_id,file(rbam),file(unmap),file(dnabam) from svbam
output:
set pair_id,file("${pair_id}.genefusions.txt")
script:
if (params.incdna == '1')
"""
module load integrate/0.2.6 samtools/intel/1.3 speedseq/20160506
sambamba index ${dnabam}
sambamba index ${rbam}
sambamba index ${unmap}
Integrate fusion -reads ${pair_id}.reads.txt -sum ${pair_id}.summary.tsv -ex ${pair_id}.exons.tsv -bk ${pair_id}.breakpoints.tsv -vcf ${pair_id}.bk_sv.vcf -bedpe ${pair_id}.fusions.bedpe ${index_path}/genome.fa ${index_path}/annot.ensembl.txt ${index_path}/bwts/ ${rbam} ${unmap} ${dnabam}
perl $baseDir/scripts/compile_genefusion_results.pl -i ${pair_id} -r ${index_path}
"""
else
"""
Integrate fusion -reads ${pair_id}.reads.txt -sum ${pair_id}.summary.tsv -ex ${pair_id}.exons.tsv -bk ${pair_id}.breakpoints.tsv -vcf ${pair_id}.bk_sv.vcf -bedpe ${pair_id}.fusions.bedpe ${index_path}/genome.fa ${index_path}/annot.ensembl.txt ${index_path}/bwts/ ${rbam} ${unmap}
perl $baseDir/scripts/compile_genefusion_results.pl -i ${pair_id} -r ${index_path}
"""
}
process gatkgvcf {
//publishDir "$baseDir/output", mode: 'copy'
cpus 30
input:
set pair_id,file(gbam),file(gidx) from gatkbam
output:
file("${pair_id}.gatk.g.vcf") into gatkgvcf
script:
"""
java -Xmx10g -jar \$GATK_JAR -R ${gatkref} -D ${dbsnp} -T HaplotypeCaller -stand_call_conf 20 -stand_emit_conf 20.0 -dontUseSoftClippedBases -A FisherStrand -A QualByDepth -A VariantType -A DepthPerAlleleBySample -A HaplotypeScore -A AlleleBalance -variant_index_type LINEAR -variant_index_parameter 128000 --emitRefConfidence GVCF -I ${gbam} -o ${pair_id}.gatk.g.vcf -L ${genebed}
"""
}
process gatk {
publishDir "$baseDir/output", mode: 'copy'
cpus 4
input:
file(gvcf) from gatkgvcf.toList()
output:
file("final.gatkpanel.vcf.gz") into gatkvcf
script:
"""
module load gatk/3.5 bedtools/2.25.0 snpeff/4.2 vcftools/0.1.11
java -Xmx16g -jar \$GATK_JAR -R ${gatkref} -D ${dbsnp} -T GenotypeGVCFs -o final.gatk.vcf -nt 4 --variant ${(gvcf as List).join(' --variant ')}
java -Xmx16g -jar \$GATK_JAR -R ${gatkref} -T VariantFiltration -V final.gatk.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o final.filtgatk.vcf
vcf-annotate -n --fill-type final.filtgatk.vcf | java -jar \$SNPEFF_HOME/SnpSift.jar filter '((QUAL >= 10) & (MQ > 40) & (DP >= 10))' |bgzip > final.gatkpanel.vcf.gz
"""
}
process mpileup {
publishDir "$baseDir/output", mode: 'copy'
cpus 32
input:
file(gbam) from sambam.toList()
file(gidx) from samidx.toList()
output:
file("final.sampanel.vcf.gz") into samvcf
script:
"""
module load samtools/intel/1.3 bedtools/2.25.0 bcftools/intel/1.3 snpeff/4.2 vcftools/0.1.11
ls *.bam > final.bam.list
samtools mpileup -t 'AD,ADF,ADR,INFO/AD,SP' -ug -Q20 -C50 -f ${index_path}/genome.fa -b final.bam.list | bcftools call --format-fields gq,gp -vmO z -o final.sam.vcf.gz
vcf-annotate -n --fill-type final.sam.vcf.gz | java -jar \$SNPEFF_HOME/SnpSift.jar filter '((QUAL >= 10) & (MQ >= 20) & (DP >= 10))' |bgzip > final.sampanel.vcf.gz
"""
}
process speedseq {
publishDir "$baseDir/output", mode: 'copy'
cpus 32
input:
file(gbam) from ssbam.toList()
file(gidx) from ssidx.toList()
output:
file("final.sspanel.vcf.gz") into ssvcf
file("final.complex.vcf.gz") into complexvcf
script:
"""
module load samtools/intel/1.3 bedtools/2.25.0 bcftools/intel/1.3 snpeff/4.2 speedseq/20160506 vcftools/0.1.11
speedseq var -q 10 -t 32 -o final.ssvar ${index_path}/genome.fa ${gbam}
vcf-annotate -n --fill-type -n final.ssvar.vcf.gz | java -jar \$SNPEFF_HOME/SnpSift.jar filter '((QUAL >= 10) & (DP >= 10))' |bgzip > final.sspanel.vcf.gz
java -Xmx10g -jar \$SNPEFF_HOME/SnpSift.jar filter "(TYPE='complex')" final.sspanel.vcf.gz |bgzip> final.complex.vcf.gz
"""
}
process integrate {
publishDir "$baseDir/output", mode: 'copy'
input:
file(ssvcf)from ssvcf
file(samvcf) from samvcf
file(gvcf) from gatkvcf
file(complex) from complexvcf
output:
file("final.integrate.vcf.gz") into finalvcf
script:
"""
module load bedtools/2.25.0 snpeff/4.2 bcftools/intel/1.3 samtools/intel/1.3 jags/4.2.0
module load vcftools/0.1.11
zcat ${ssvcf} | bedtools intersect -v -header -a stdin -b ${complex} | bgzip > ss.shuff.vcf.gz
vcf-shuffle-cols -t ${ssvcf} ${gvcf} | vcf-sort | bedtools intersect -v -header -a stdin -b ${complex} | bgzip > gatk.shuff.vcf.gz
vcf-shuffle-cols -t ${ssvcf} ${samvcf} | bedtools intersect -v -header -a stdin -b ${complex} | bgzip > sam.shuff.vcf.gz
tabix ss.shuff.vcf.gz
tabix gatk.shuff.vcf.gz
tabix sam.shuff.vcf.gz
vcf-compare ss.shuff.vcf.gz sam.shuff.vcf.gz gatk.shuff.vcf.gz > vcf_compare.out
vcf-isec -f --prefix integrate ss.shuff.vcf.gz sam.shuff.vcf.gz gatk.shuff.vcf.gz
perl $baseDir/scripts/baysic_blc.pl -c ${complex} -e $baseDir -f ${index_path}/genome.fa ss.shuff.vcf.gz sam.shuff.vcf.gz gatk.shuff.vcf.gz
"""
}
process annotate {
publishDir "$baseDir/output", mode: 'copy'
input:
file(intvcf) from finalvcf
output:
file("annot.vcf.gz") into annotvcf
script:
if (params.genome == '/project/shared/bicf_workflow_ref/GRCh38')
"""
module load bedtools/2.25.0 snpeff/4.2 jags/4.2.0
java -Xmx10g -jar \$SNPEFF_HOME/snpEff.jar -no-intergenic -lof -c \$SNPEFF_HOME/snpEff.config ${snpeff_vers} final.integrate.vcf.gz | java -Xmx10g -jar \$SNPEFF_HOME/SnpSift.jar annotate ${index_path}/dbSnp.vcf.gz - | java -Xmx10g -jar \$SNPEFF_HOME/SnpSift.jar annotate ${index_path}/clinvar.vcf.gz - | java -Xmx10g -jar \$SNPEFF_HOME/SnpSift.jar annotate ${index_path}/ExAC.vcf.gz - | java -Xmx10g -jar \$SNPEFF_HOME/SnpSift.jar annotate ${index_path}/cosmic.vcf.gz - | java -Xmx10g -jar \$SNPEFF_HOME/SnpSift.jar dbnsfp -v -db ${index_path}/dbNSFP.txt.gz - | java -Xmx10g -jar \$SNPEFF_HOME/SnpSift.jar gwasCat -db ${index_path}/gwas_catalog.tsv - |bgzip > annot.vcf.gz
"""
else
"""
java -Xmx10g -jar \$SNPEFF_HOME/snpEff.jar -no-intergenic -lof -c \$SNPEFF_HOME/snpEff.config ${snpeff_vers} final.integrate.vcf.gz |bgz
ip > annot.vcf.gz
"""
}
#!/usr/bin/perl -w
#count_venn.vcf
use Getopt::Long qw(:config no_ignore_case no_auto_abbrev);
my %opt = ();
my $results = GetOptions (\%opt,'fasta|f=s','help|h','prefix|p=s','complex|c=s','execdir|e=s');
my @zipvcf = @ARGV;
unless($opt{fasta}) {
$opt{fasta} = '/project/shared/bicf_workflow_ref/GRCh38/hs38DH.fa';
}
unless($opt{prefix}) {
$opt{prefix} = 'baysic';
}
my $total = 0;
open FASTA, "<$opt{fasta}\.fai" or die "unable to find fai index for $opt{fasta}, please index with samtools";
while (my $line = <FASTA>) {
chomp($line);
my ($chr,$length,$offset,$linelen,$linebytes) = split(/\t/,$line);
$total += $length;
}
my %ct = ();
my %snpbins = ();
open CTS, ">baysic.cts" or die $!;
print CTS "Estimated sum: ".$total,"\n";
open VC, "<vcf_compare.out" or die $!;
while (my $line = <VC>) {
my @key = ();
next if($line =~ m/#/);
if ($line =~ m/^VN\s+(.+)/) {
my ($ct,@flist) = split(/\s+/,$1);
foreach my $dset (@zipvcf) {
if (grep(/$dset/,@flist)) {
push @key, 1;
}else {
push @key, 0;
}
}
$key = join("",@key);
$ct{$key} = $ct;
$total -= $ct;
}
}
my @g = bits(scalar(@zipvcf));
$ct{$g[0]} = $total;
foreach (@g) {
$ct{$_} = 0 unless ($ct{$_});
print CTS join("\t",$_,$ct{$_}),"\n";
}
system("Rscript $opt{execdir}\/scripts/lca.R -c baysic.cts -s baysic.stats");
my @key1 = split(//,$g[-1]);
my @key2;
foreach $i (0..$#key1) {
push @key2, $i if ($key1[$i] > 0);
}
open STATS, "baysic.stats" or die $!;
my @keeppos;
while (my $line = <STATS>) {
chomp($line);
next unless ($line =~ m/postprobs\[(\d+)\]\s+(\S+)/);
my ($index,$prob) = ($1,$2);
next unless ($prob >= 0.8);
my $key = $g[$index-1];
@key1 = split(//,$key);
@key2 = ();
foreach $i (0..$#key1) {
push @key2, $i if ($key1[$i] > 0);
}
my $subset = 'integrate'.join('_',@key2).'.vcf.gz';
push @keeppos, $subset if (-e $subset);
}
push @keeppos, $opt{complex} if ($opt{complex});
system("vcf-concat ".join(" ",@keeppos)." |vcf-sort |bgzip > final.integrate.vcf.gz");
sub bits { glob join "", map "{0,1}", 1..$_[0] }