process_getData
Purpose:
- Download input data from the consortium
Input:
- bdbag.zip (see example at /archive/BICF/shared/GUDMAP.RBK/RNA/data/Study_Q-Y4H0.zip)
Outputs for later processes:
- file.csv
- Experiment Settings.csv
- Experiment.csv
- *.fastq.gz
Process:
- Remove all lines in fetch.txt (within bdbag.zip) whose filename does not end in .fastq.gz (see the sketch after the Tools list)
- Run BDBAG to fetch the remaining files
Tools:
- Python v3.7 or lower, depending on compatibility
- Pandas v0.25.1 (if Python is used for fetch.txt filtering)
- BDBAG v1.5.5
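A minimal sketch of the fetch.txt filtering with pandas, assuming the standard BagIt fetch.txt layout (url, length, filename separated by tabs); the bag path and column names here are illustrative, not from the source:

```python
# Keep only fetch.txt entries whose filename ends in .fastq.gz.
# Assumes the BagIt layout: url <TAB> length <TAB> filename.
import pandas as pd

fetch = pd.read_csv("bag/fetch.txt", sep="\t", header=None,
                    names=["url", "length", "filename"])
fastq_only = fetch[fetch["filename"].str.endswith(".fastq.gz")]
fastq_only.to_csv("bag/fetch.txt", sep="\t", header=False, index=False)
```

With the non-fastq lines removed, bdbag's fetch mode (e.g. `bdbag --resolve-fetch all bag/`) should pull only the remaining files.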
Activity
- Gervaise Henry changed milestone to %v0.0.1
- Gervaise Henry changed the description
- Gervaise Henry assigned to @ghenry and unassigned @s181706
- Gervaise Henry added ~291 label
- Gervaise Henry (Author, Owner): Need to figure out the deriva-auth session
- Gervaise Henry (Author, Owner): From Laura Pearlman of the GUDMAP/RBK data hub:
Hi Gervaise, The short answer is that yes, our underlying software does allow for an easier way to do this, but gudmap isn't currently set up to support this. I'll get that process started (it involves registration with third parties and can take a while). You'll still need to do a manual login once, but that will generate a long-lived revocable token that you can store and reuse. -- Laura
Workaround until the reusable token gets generated...
The problem:
- bdbag requires manual authentication through deriva-auth
- on the BioHPC cluster, deriva-auth authentication crashes with an error and doesn't generate the temporary tokens
- deriva-auth cannot get through the proxy on UTSW network when run locally
Workaround:
- connect the local computer to the internet outside the UTSW network
- authenticate with deriva-auth locally (both GUDMAP and RBK: www.gudmap.org and www.rebuildingakidney.org)
- connect to the UTSW network and copy the deriva cookie file with tokens to BioHPC $HOME (see the sketch after this list)
- run the Nextflow pipeline well before the tokens expire (nominally 15 minutes; in practice they last much longer, though how much longer is unknown)
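A hedged sketch of the copy step, run from the local machine once deriva-auth has written fresh tokens; the credential path (~/.deriva/credential.json) and the BioHPC hostname are assumptions, not from the source:

```python
# Copy the local deriva credential/cookie file to BioHPC $HOME so bdbag
# can authenticate there. Path and hostname below are hypothetical, and
# the remote ~/.deriva directory must already exist.
import subprocess
from pathlib import Path

cred = Path.home() / ".deriva" / "credential.json"
subprocess.run(
    ["scp", str(cred), "user@biohpc.example.edu:.deriva/credential.json"],
    check=True,  # fail loudly if the copy does not succeed
)
```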
To install deriva/bdbag:

```
module load python/3.6.4-anaconda
pip install deriva-client
pip install bdbag
```
Edited by Gervaise Henry
- Gervaise Henry added ~293 label and removed ~291 label
- Gervaise Henry (Author, Owner): To containerize, we may need to take the deriva cookie file as input and place it in $HOME inside the container so bdbag can authenticate (see the sketch below).
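One hedged way to do this is to bind-mount the host's credential file into the container user's $HOME at run time; the image name, paths, and bag location below are hypothetical:

```python
# Launch the containerized getData step with the deriva credential file
# mounted read-only into the container user's $HOME (here /root).
import subprocess
from pathlib import Path

cred = Path.home() / ".deriva" / "credential.json"
subprocess.run([
    "docker", "run", "--rm",
    "-v", f"{cred}:/root/.deriva/credential.json:ro",
    "getdata-image:latest",  # hypothetical image name
    "bdbag", "--resolve-fetch", "all", "/data/bag",
], check=True)
```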
- Gervaise Henry added ~291 label and removed ~293 label
- Gervaise Henry (Author, Owner): Separate getData into 2 processes:
- splitData to split the study bdbag into replicate bdbags (1 fastq pair each, if PE) so that...
- getData can bdbag fetch each replicate in parallel (see the sketch below)
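A hedged sketch of the split, assuming each replicate can be recognized from its fastq filename; the replicate-ID regex below is hypothetical, since the GUDMAP/RBK naming convention isn't given here:

```python
# Write one fetch.txt per replicate so getData can fetch bags in parallel.
import pandas as pd

fetch = pd.read_csv("bag/fetch.txt", sep="\t", header=None,
                    names=["url", "length", "filename"])
# Derive a replicate ID from the filename (regex is hypothetical);
# rows that don't match are dropped by the groupby below.
fetch["replicate"] = fetch["filename"].str.extract(r"(Rep_[^_/.]+)",
                                                   expand=False)
for rep, group in fetch.groupby("replicate"):
    group[["url", "length", "filename"]].to_csv(
        f"{rep}.fetch.txt", sep="\t", header=False, index=False)
```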
- Gervaise Henry (Author, Owner): Next steps:
- split the docker images for process 1 (splitData) and process 2 (getData): splitData only needs Python with pandas, argparse, and re, while getData only needs bdbag (and perhaps deriva-client); use a shared container for both processes, renamed to be similar to GUDMAP/RBK file transfer
- output fastq pairs into an object so that, when the experiment is PE, we can link fastq pairs downstream
- unit tests for the current 2 processes
- modify to auto-download the bagit with deriva-download-cli (see the sketch below)
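A hedged sketch of the deriva-download-cli call; the argument order (hostname, download spec, output directory) follows my reading of the deriva-client docs and should be checked against the installed version, and the spec filename is hypothetical:

```python
# Auto-download the bagit instead of exporting it manually.
import subprocess

subprocess.run([
    "deriva-download-cli",
    "www.gudmap.org",       # catalog hostname
    "download-spec.json",   # download config (hypothetical filename)
    "./output",             # where the bag lands
], check=True)
```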
Edited by Gervaise Henry
- Gervaise Henry mentioned in issue #13 (closed)
- Gervaise Henry mentioned in issue #14 (closed)
- Gervaise Henry mentioned in issue #15 (closed)
- Gervaise Henry mentioned in merge request !2 (merged)
- Gervaise Henry closed via merge request !2 (merged)
- Gervaise Henry removed Doing label