SlopeTree from Raquel Bromberg - Repo for BioHPC web service project

Some fixes / optimizations added by DCT

SlopeTree was developed on Linux Ubuntu and compiled using g++ (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4.

For all commands listed below, the character "$" indicates the command line and should never actually be typed.

**************************************************************************************************************
CONTENTS OF README:

1) COMPILING SLOPETREE
2) RUNNING SLOPETREE WITHOUT FILTERING OR A REFERENCE SET (most basic run)
3) ST-TREES WITH MOBILE ELEMENTS REMOVED
4) ST-TREES FILTERED BY CONSERVATION
5) USING THE PAIR-WISE CORRECTION FOR HORIZONTAL GENE TRANSFER

**************************************************************************************************************
1) COMPILING SLOPETREE

SlopeTree includes a Makefile which should allow for simple compilation of the entire package. Navigate to the SlopeTree main directory and enter on the command line:

    $make clean
    $make all

**************************************************************************************************************
2) RUNNING SLOPETREE WITHOUT FILTERING OR A REFERENCE SET (most basic run)

INPUT: The input for SlopeTree is files in FASTA format, with the files for each organism contained in a single directory named after that organism (an example of this is in Examples/Ex1/).

(a) Create a new, empty directory for the run (e.g. test_run). In this directory, create a directory named FAA. Copy all directories containing FASTA files to FAA. Each directory must contain at least 1 FASTA file; if one of these directories contains multiple FASTA files (for instance, plasmids), they must all come from the same organism. FAA should contain only directories, with directory names being organism names. Refer to Examples/Ex1/ for an example of this.

(b) Download taxdump.tar.gz located at ftp://ftp.ncbi.nih.gov/pub/taxonomy and unpack.
This download can be used for all future SlopeTree runs. SlopeTree specifically uses nodes.dmp and names.dmp from this archive.

(c) Once the set of organisms has been moved to the FAA directory and the taxdump archive has been downloaded and unpacked, run SlopeTree with the following command:

    $bash dSTm.sh <full path to the main directory> <tag length, default=20> <B for Bacteria, A for Archaea, and O for Other> <full path to the directory where taxdump was unpacked>

This script will run all necessary SlopeTree programs.

OUTPUT:
- A distance matrix, located in DISTANCE_MATRICES in the main run directory: there will be 2 distance matrices in DISTANCE_MATRICES at the end of this run, one with the original organism names, the other with the original names plus taxonomic information.
- A file called HGT_pairs.txt which lists the pairs flagged for the horizontal gene transfer correction in (5). As discussed further in (5), it is advised to apply the mobile element and conservation filters (3 and 4) to the data first, then to run the SlopeTree routines presented in this section (2), and THEN to apply the pair-wise HGT correction (5).
- A directory called DATA with all the plots (number of unique matches between every pair per bit score) in a file called data.txt.

EXAMPLE: In the Examples directory, Ex1 contains a FAA/ directory with a small number of bacterial proteomes. The command to run SlopeTree on this set would be:

    $bash dSTm.sh Examples/Ex1/ 20 B <full path to the directory where the taxdump contents (specifically names.dmp and nodes.dmp) were unpacked>

The above command requires the full path to the taxdump files.

**************************************************************************************************************
3) ST-TREES WITH MOBILE ELEMENTS REMOVED

Step 3.1) Create a set of conserved proteins from a small, taxonomically diverse (within a single domain) set of organisms. We provide 2 reference sets in this package, located in REFS/.
Bacteria_REF/ consists of 30 diverse bacteria (used in the SlopeTree publication). Archaea_REF/ consists of 10 diverse archaea (used in the SlopeTree publication). These can be used, or users can create their own diverse sets.

(a) Create a new main directory (e.g. make_ref_set/) and create an FAA directory containing the reference set (same as was done in (2) above, using sets provided in REFS/Bacteria_REF, REFS/Archaea_REF, or user-chosen sets).

(b) Run the pFilt script:

    $bash pFilt.sh <main directory 1> <tag length> <A for Archaea, B for Bacteria, O for Other>

(c) Then write out the new set of proteomes:

    $./fpwrite <main directory 1> -f 10 -o <filtering parameter>

We recommend a filtering parameter of 7 for this operation. The bounds for this parameter are [0,10], with lower values producing less stringent filtering.

OUTPUT FOR STEP 3.1: The fpwrite program creates four new directories in the main directory. Depending on the parameters used, the new directories should have names of the format FAA_csv_10_7 and FAA_ncsv_10_7 (the 7 in these names corresponds to the filtering parameter chosen). 'csv' stands for conserved, while 'ncsv' stands for non-conserved. (For the run specifically described here, the FAA_ref_csv_10_7 and FAA_ref_ncsv_10_7 directories should be empty and can be ignored.)

Step 3.2) Create a set of alphabetically sorted k-mers from Step 3.1's conserved set of proteins.

(d) Create a new main directory (e.g. make_csv_kmers). Copy the 'csv' directory from the above run (e.g. FAA_csv_10_7) to the new main directory and rename the folder FAA.

(e) Run the dMT.sh script:

    $bash dMT.sh <main directory 2> <tag length> <A for Archaea, B for Bacteria, O for Other>

OUTPUT FOR STEP 3.2: This will produce a set of directories in the new main directory. One of these directories will be called MERGED_TAGS. This directory will be used as an input in the mobile-element filtering command below.

Step 3.3) Filter the real (i.e.
the proteomes you wish to filter of mobile elements) input set.

(f) Create a third main directory (e.g. filter_me) and, within this directory, a new directory FAA with the input organisms you would like to filter of mobile elements.

(g) Create sets of alphabetically sorted k-mers for each organism individually. This may not be necessary if the mobile element filter is being run on an organism already passed through (2): if the main directory already contains a TAGS directory and an *_info.txt file, the following command is unnecessary.

    $bash pME.sh <main directory 3> <tag length> <A for Archaea, B for Bacteria, O for Other>

(h) Run the mobile element filtering command:

    $./mef <main directory 3> <full path to the contents of MERGED_TAGS> -m

OUTPUT FOR STEP 3.3 (all in the main directory for Step 3.3):
- A directory called FAA_mobelim, which consists of the input proteomes minus the mobile elements.
- A directory called FAA_mobkept, which consists of all the mobile elements that were removed from FAA_mobelim.
- A file called mefilter_discrepancies.txt which lists the proteome IDs of proteomes in which nearly every protein was marked as a mobile element due to the presence of multiple, nearly identical genomes in the organism's FASTA input. It is advised that these organisms be removed from the analysis at some point (e.g. pruned out of the final distance matrix prior to tree-building). The file <main directory name>_info.txt identifies the organism associated with each of these IDs.
- A file called ME_size_reduction.txt which lists all organism ordinals, the original number of proteins per organism, and the number of mobile elements removed.
- If the -m option is used (as it was above), a directory called MEFILTERED is created with details on the proteins removed for every organism.
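
A quick check that the Step 3.3 outputs listed above are present can catch a failed or partial run before moving on. The sketch below is a hypothetical helper (check_mef_outputs is not part of SlopeTree). It only tests for the names given in the list above, and it assumes all four are always written directly into the main directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of SlopeTree: verify that the outputs of
# Step 3.3 named in this README exist in the given main directory.
check_mef_outputs() {
    local main_dir="$1"
    local missing=0
    local d f

    # Directories written by the mobile element filter.
    for d in FAA_mobelim FAA_mobkept; do
        if [ ! -d "$main_dir/$d" ]; then
            echo "missing directory: $d"
            missing=1
        fi
    done

    # Report files written by the mobile element filter.
    for f in mefilter_discrepancies.txt ME_size_reduction.txt; do
        if [ ! -f "$main_dir/$f" ]; then
            echo "missing file: $f"
            missing=1
        fi
    done

    # Returns 0 when everything was found, 1 otherwise.
    return "$missing"
}
```

For example, after sourcing the function: check_mef_outputs <main directory 3> && echo "Step 3.3 outputs present"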
Step 3.4) Move the FAA_mobelim directory to a new directory, change its name from FAA_mobelim to FAA, and then run SlopeTree on it as in (2), OR use this directory as input to the conservation filtering tool described in (4).

**************************************************************************************************************
4) ST-TREES FILTERED BY CONSERVATION

4A) Filtering by means of a reference set.

(a) Create a main directory (e.g. filter_by_csv). Within this directory, create an FAA directory with the organisms you would like to filter, and another directory, FAA_ref, containing the organisms you would like to filter your set against. We provide 2 reference sets in REFS, one of bacteria and another of archaea (used in the SlopeTree publication). Users may also create their own reference set. If the same organism is present in the reference set and the main set, SlopeTree will ignore its presence in the main set (the file *_reference_repeats.txt lists these instances of redundant inputs). ALL organisms, in both the reference and main sets, are included (i.e. filtered) in the analysis.

(b) Run the filtering code:

    $bash pFilt.sh <path to main directory> <tag length> <A for Archaea, B for Bacteria, O for Other>

(c) Then write out the new proteomes, according to the filtering level desired:

    $./fpwrite <path to main directory> -f 10 -o <filtering level>

The filtering level parameter (referred to as o in the SlopeTree publication) has bounds [0,10], with a lower value producing less filtered proteomes. Depending on the purpose of the filtering, different levels of filtering may be appropriate. Given that writing out the new proteomes is a fast operation, we recommend users experiment with different filtering parameters for the fpwrite command. We consider filtering at 5 or more to be aggressive filtering, which usually reduces proteomes by more than half.
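
Because fpwrite is fast, several filtering levels can be written out in one pass and compared. The following is a minimal sketch, assuming that fpwrite sits in the current directory, that pFilt.sh has already been run on the main directory, and that filtering levels are integers. The function name write_filter_levels and the DRY_RUN variable are hypothetical conveniences, not SlopeTree features.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, not part of SlopeTree: run fpwrite once per
# filtering level. With DRY_RUN=1 the commands are printed, not executed.
write_filter_levels() {
    local main_dir="$1"
    shift
    local level
    for level in "$@"; do
        # The filtering level is bounded to [0,10] according to this README.
        if [ "$level" -lt 0 ] || [ "$level" -gt 10 ]; then
            echo "skipping out-of-range level: $level" >&2
            continue
        fi
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "./fpwrite $main_dir -f 10 -o $level"
        else
            ./fpwrite "$main_dir" -f 10 -o "$level"
        fi
    done
}
```

Setting DRY_RUN=1 before calling write_filter_levels <path to main directory> 1 3 5 prints the three fpwrite commands without running them, which is a cheap way to confirm the levels before writing any proteomes.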
4B) FILTERING WITHOUT A REFERENCE SET

It is also possible to perform conservation/stability filtering without a reference set. To do this, simply perform all steps in (4A) WITHOUT creating the directory FAA_ref. Filtering by conservation without a reference set is generally looser (fewer proteins removed) than filtering against a reference set, and may require a higher filtering parameter (o). This filtering is particularly appropriate for sets of strains.

OUTPUT FOR 4A AND 4B:
- New proteomes are written to new directories within the main directory, with names of the format FAA_csv_10_3 and FAA_ref_csv_10_3 (the 3 would indicate a filtering level of 3 having been selected). The reference set (if used) and main set do not need to be disjoint, and the new ref directory has also been filtered.
- A file in the main directory called problematic_inputs.txt, which lists the ordinals of organisms showing unusual copy number (e.g. over- or under-representation of conserved k-mers).
- A file in the main directory called proteome_sizes.txt, which lists the value of o (2nd column), the organism ordinal (3rd column), the original proteome size (4th column) and the size after filtering (5th column).

The new directories, FAA_csv_10_3 and FAA_ref_csv_10_3 (again, assuming a reference was used), or some combination, can be moved to a new main directory and renamed FAA and FAA_ref; SlopeTree can then be run on them as in (2). The reference used here, or any reference set, does not have to be included for step (2).

**************************************************************************************************************
5) USING THE PAIR-WISE CORRECTION FOR HORIZONTAL GENE TRANSFER

After running SlopeTree (as described in (2)), a file HGT_pairs.txt should have been created in the main directory. These are pairs SlopeTree flagged as having unusual statistics or curvature, which may be involved in HGT.
We advise applying some filtering (mobile elements and some level of conservation) to the initial data if the pair-wise HGT correction is to be used; this significantly reduces the number of pairs flagged for correction (many of which are false positives). It is not necessary to create a new directory for this correction.

(a) To run the correction:

    $./fh <main directory that has already been processed by the SlopeTree routines in (2)>

OUTPUT FOR 5:
- An updated distance matrix in DISTANCE_MATRICES.
- A file in the main directory called hgt_msc_for_filtering.txt which lists the proteins that were removed by the pair-wise correction for every flagged pair.
- A file in DATA/ called data_after_hgt_cleanup.txt with the new plots for those pairs that were modified.
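
Since the correction relies on the HGT_pairs.txt file produced in (2), a small guard can avoid running fh on a directory that has not been processed yet. This is a minimal sketch only: the function name run_hgt_correction and the FH_BIN variable are hypothetical, and fh is assumed to be in the current directory unless FH_BIN says otherwise.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, not part of SlopeTree: run the pair-wise HGT
# correction only if the run in (2) has produced HGT_pairs.txt.
run_hgt_correction() {
    local main_dir="$1"
    # FH_BIN is an assumed override for the location of the fh program.
    local fh_bin="${FH_BIN:-./fh}"

    if [ ! -f "$main_dir/HGT_pairs.txt" ]; then
        echo "HGT_pairs.txt not found in $main_dir; run dSTm.sh first" >&2
        return 1
    fi

    "$fh_bin" "$main_dir"
}
```

Called as run_hgt_correction <main directory>, the function returns a non-zero status without touching the directory when HGT_pairs.txt is absent.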