process_getData
Purpose:
- Download input data from the consortium
Input:
- bdbag.zip (see example at /archive/BICF/shared/GUDMAP.RBK/RNA/data/Study_Q-Y4H0.zip)
Outputs for later processes:
- file.csv
- Experiment Settings.csv
- Experiment.csv
- *.fastq.gz
Process:
- Remove all lines in fetch.txt (within bdbag.zip) whose filename does not end in .fastq.gz (see the sketch after the Tools list)
- Run BDBAG to fetch the remaining files
Tools:
- Python v3.7 or lower, depending on compatibility
- Pandas v0.25.1 (if Python is used for fetch.txt filtering)
- BDBAG v1.5.5
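A minimal sketch of the fetch.txt filtering with pandas, assuming the standard BagIt fetch.txt layout (url, length, filename separated by tabs); the bag path and column names here are illustrative, not from the source:

```python
# Keep only fetch.txt entries whose filename ends in .fastq.gz.
# Assumes the BagIt layout: url <TAB> length <TAB> filename.
import pandas as pd

fetch = pd.read_csv("bag/fetch.txt", sep="\t", header=None,
                    names=["url", "length", "filename"])
fastq_only = fetch[fetch["filename"].str.endswith(".fastq.gz")]
fastq_only.to_csv("bag/fetch.txt", sep="\t", header=False, index=False)
```

With the non-fastq lines removed, bdbag's fetch mode (e.g. `bdbag --resolve-fetch all bag/`) should pull only the remaining files.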
Activity
- Gervaise Henry changed milestone to %v0.0.1
- Gervaise Henry changed the description
- Gervaise Henry assigned to @ghenry and unassigned @s181706
- Gervaise Henry added ~291 label
- Gervaise Henry (Author, Owner): Need to figure out the deriva-auth session
- Gervaise Henry (Author, Owner): From Laura Pearlman of the GUDMAP/RBK data hub:
Hi Gervaise, The short answer is that yes, our underlying software does allow for an easier way to do this, but gudmap isn't currently set up to support this. I'll get that process started (it involves registration with third parties and can take a while). You'll still need to do a manual login once, but that will generate a long-lived revocable token that you can store and reuse. -- Laura
Workaround until the reusable token gets generated...
The problem:
- bdbag requires manual authentication through deriva-auth
- on the BioHPC cluster, deriva-auth authentication crashes with an error and doesn't generate the temporary tokens
- deriva-auth cannot get through the proxy on UTSW network when run locally
Workaround:
- connect the local computer to the internet outside the UTSW network
- authenticate with deriva-auth locally (both GUDMAP and RBK: www.gudmap.org and www.rebuildingakidney.org)
- connect to the UTSW network and copy the deriva cookie file with tokens to BioHPC $HOME (see the sketch after this list)
- run the Nextflow pipeline well before the tokens expire (nominally 15 minutes; in practice they last much longer, though how much longer is unknown)
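A hedged sketch of the copy step, run from the local machine once deriva-auth has written fresh tokens; the credential path (~/.deriva/credential.json) and the BioHPC hostname are assumptions, not from the source:

```python
# Copy the local deriva credential/cookie file to BioHPC $HOME so bdbag
# can authenticate there. Path and hostname below are hypothetical, and
# the remote ~/.deriva directory must already exist.
import subprocess
from pathlib import Path

cred = Path.home() / ".deriva" / "credential.json"
subprocess.run(
    ["scp", str(cred), "user@biohpc.example.edu:.deriva/credential.json"],
    check=True,  # fail loudly if the copy does not succeed
)
```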
To install deriva/bdbag:

```
module load python/3.6.4-anaconda
pip install deriva-client
pip install bdbag
```
Edited by Gervaise Henry
- Gervaise Henry added ~293 label and removed ~291 label
- Gervaise Henry (Author, Owner): To containerize, we may need to take the deriva cookie file as input and place it in $HOME inside the container so bdbag can authenticate (see the sketch below).
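One hedged way to do this is to bind-mount the host's credential file into the container user's $HOME at run time; the image name, paths, and bag location below are hypothetical:

```python
# Launch the containerized getData step with the deriva credential file
# mounted read-only into the container user's $HOME (here /root).
import subprocess
from pathlib import Path

cred = Path.home() / ".deriva" / "credential.json"
subprocess.run([
    "docker", "run", "--rm",
    "-v", f"{cred}:/root/.deriva/credential.json:ro",
    "getdata-image:latest",  # hypothetical image name
    "bdbag", "--resolve-fetch", "all", "/data/bag",
], check=True)
```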
- Gervaise Henry added ~291 label and removed ~293 label
- Gervaise Henry (Author, Owner): Separate getData into 2 processes:
- splitData to split the study bdbag into replicate bdbags (1 fastq pair each, if PE) so that...
- getData can bdbag fetch each replicate in parallel (see the sketch below)
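A hedged sketch of the split, assuming each replicate can be recognized from its fastq filename; the replicate-ID regex below is hypothetical, since the GUDMAP/RBK naming convention isn't given here:

```python
# Write one fetch.txt per replicate so getData can fetch bags in parallel.
import pandas as pd

fetch = pd.read_csv("bag/fetch.txt", sep="\t", header=None,
                    names=["url", "length", "filename"])
# Derive a replicate ID from the filename (regex is hypothetical);
# rows that don't match are dropped by the groupby below.
fetch["replicate"] = fetch["filename"].str.extract(r"(Rep_[^_/.]+)",
                                                   expand=False)
for rep, group in fetch.groupby("replicate"):
    group[["url", "length", "filename"]].to_csv(
        f"{rep}.fetch.txt", sep="\t", header=False, index=False)
```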
- Gervaise Henry (Author, Owner): Next steps:
- split the docker images for process 1 (splitData) and process 2 (getData): splitData only needs Python with pandas, argparse, and re, while getData only needs bdbag (and perhaps deriva-client); use a shared container for both processes, renamed to be similar to GUDMAP/RBK file transfer
- output fastq pairs into an object so that, when the experiment is PE, we can link fastq pairs downstream
- unit tests for the current 2 processes
- modify to auto-download the bagit with deriva-download-cli (see the sketch below)
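A hedged sketch of the deriva-download-cli call; the argument order (hostname, download spec, output directory) follows my reading of the deriva-client docs and should be checked against the installed version, and the spec filename is hypothetical:

```python
# Auto-download the bagit instead of exporting it manually.
import subprocess

subprocess.run([
    "deriva-download-cli",
    "www.gudmap.org",       # catalog hostname
    "download-spec.json",   # download config (hypothetical filename)
    "./output",             # where the bag lands
], check=True)
```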
Edited by Gervaise Henry
- Gervaise Henry mentioned in issue #13 (closed)
- Gervaise Henry mentioned in issue #14 (closed)
- Gervaise Henry mentioned in issue #15 (closed)
- Gervaise Henry mentioned in merge request !2 (merged)
- Gervaise Henry closed via merge request !2 (merged)
- Gervaise Henry removed Doing label