Commit ecf11784 authored by Raquel Bromberg

readme.txt updated

parent e1929e68
Slopetree from Raquel Bromberg - Repo for BioHPC web service project
Some fixes / optimizations added by DCT
1. COMPILING SLOPETREE
SlopeTree includes a Makefile which should allow for simple compilation of the entire package. Navigate to the SlopeTree main directory and enter at the command line:
$make clean
$make all
2. RUNNING SLOPETREE WITHOUT A REFERENCE SET (most basic run)
SlopeTree takes FASTA files as input. Create a directory for the run (e.g. test_run); this is referred to below as the main directory. In this directory, create a directory named FAA. FAA should contain only directories, one per organism and named after that organism, and each organism directory should contain that organism's FASTA files (1 or more).
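The layout described above can be sketched as follows. This is a minimal illustration and not part of SlopeTree: the organism names, the file name proteome.faa and the placeholder protein sequence are all hypothetical, and the run directory is created under a temporary directory so that nothing existing is overwritten.

```shell
# Sketch of the expected input layout (all names below are hypothetical).
demo=$(mktemp -d)                 # scratch location for the demonstration
run_dir="$demo/test_run"          # the main directory for the run

# One directory per organism inside FAA, named after the organism.
for organism in Escherichia_coli Bacillus_subtilis; do
    mkdir -p "$run_dir/FAA/$organism"
    # Placeholder FASTA file; real runs use the organism's proteome files.
    printf '>%s_protein_1\nMKTAYIAKQRQISFVKSHFSRQ\n' "$organism" \
        > "$run_dir/FAA/$organism/proteome.faa"
done

# FAA should contain only directories: list anything that is not one.
stray=$(find "$run_dir/FAA" -mindepth 1 -maxdepth 1 ! -type d)
if [ -z "$stray" ]; then
    echo "FAA layout OK"
else
    echo "Unexpected non-directory entries in FAA:" "$stray"
fi
```

The final check is a convenience for catching FASTA files copied directly into FAA instead of into an organism directory.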
Download taxdump.tar.gz, located at ftp://ftp.ncbi.nih.gov/pub/taxonomy, and unpack it.
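One possible way to script the download and unpacking is sketched below. The function name fetch_taxdump, the use of curl and the destination directory are assumptions of this sketch, not part of SlopeTree; the function is only defined here and must be called explicitly, since it needs network access.

```shell
# Sketch: fetch and unpack the NCBI taxonomy dump into a chosen directory.
# The function name and the choice of curl are assumptions, not part of SlopeTree.
fetch_taxdump() {
    dest="$1"                             # directory to unpack into
    mkdir -p "$dest" || return 1
    curl -fsS -o "$dest/taxdump.tar.gz" \
        "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz" || return 1
    tar -xzf "$dest/taxdump.tar.gz" -C "$dest"
}

# Usage (needs network access), with a hypothetical destination:
#   fetch_taxdump "$HOME/taxdump"
```

The destination passed to the function is then the directory to give as the last argument of dSTm.sh.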
Once the set of organisms has been moved to the FAA directory, run SlopeTree with the following command:
$bash dSTm.sh <full path to the main directory> <tag length, default=20> <B for Bacteria, A for Archaea, and O for Other> <full path to the directory where taxdump was unpacked>
The final distance matrix will be written out to DISTANCE_MATRICES in the main directory.
In the Examples directory, Ex1 contains formatted data for a run like the one shown above (it still requires the local taxdump files).
$bash dSTm.sh Examples/Ex1/ 20 B <full path to directory where taxdump was unpacked>
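Because dSTm.sh is documented as taking full paths, a relative path can be converted before the call. The sketch below is only an illustration under stated assumptions: the directories are hypothetical stand-ins created in a temporary location, and the command is printed, not executed, so the sketch runs without SlopeTree present.

```shell
# Sketch: turn relative paths into full paths before calling dSTm.sh.
demo=$(mktemp -d)
mkdir -p "$demo/test_run/FAA" "$demo/taxdump"   # hypothetical stand-in directories
cd "$demo" || exit 1

# cd into each directory in a subshell and print its absolute path.
main_dir=$(cd test_run && pwd)
tax_dir=$(cd taxdump && pwd)

tag_length=20   # the default tag length
domain=B        # B for Bacteria, A for Archaea, O for Other

# Printed rather than executed, so the sketch runs without SlopeTree present.
echo bash dSTm.sh "$main_dir" "$tag_length" "$domain" "$tax_dir"
```

Dropping the echo on the last line would run the command for real, from a directory where dSTm.sh is available.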
3. ST-TREES WITH MOBILE ELEMENTS REMOVED
This first requires creating a set of alphabetically sorted k-mers, taken from conserved proteins, using a small, taxonomically diverse (within a single domain) set of organisms. We provide 2 reference sets in this package, located in REFS. Bacteria_REF/ consists of 30 diverse bacteria (used in the SlopeTree publication). Archaea_REF/ consists of 10 diverse archaea (used in the SlopeTree publication). These can be used or users can create their own. Create a new main directory (e.g. make_ref_set/) and create the FAA directory containing the reference set (same as in (2)).
Run the pFilt script:
$bash pFilt.sh <main directory> <tag length>
Then write out the new set of proteomes:
$./fpwrite <main directory> -f 10 -o <filtering parameter>
We recommend a filtering parameter of 7 for this operation. The bounds for this parameter are [0,10], with lower values producing less stringent filtering.
The fpwrite program creates a pair of new directories in the main directory. Depending on the parameters used, the new directories should have names of the format FAA_csv_10_4 and FAA_ncsv_10_4. 'csv' stands for conserved, while 'ncsv' stands for non-conserved. Copy the 'csv' directory to a new main directory and rename the copied folder to FAA.
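The copy-and-rename step can be sketched as below. It assumes fpwrite was run with -f 10 -o 7, so that the conserved directory is named FAA_csv_10_7 (inferred from the name format above, not confirmed); the main directory names and the organism name are hypothetical placeholders.

```shell
# Sketch: copy the conserved ('csv') proteomes into a fresh main directory as FAA.
demo=$(mktemp -d)
old_main="$demo/make_ref_set"
new_main="$demo/ref_tags"                 # hypothetical name for the new main directory

# Stand-in for the output of fpwrite (assumed name for -f 10 -o 7).
mkdir -p "$old_main/FAA_csv_10_7/Organism_A"
printf '>protein_1\nMKTAYIAKQR\n' > "$old_main/FAA_csv_10_7/Organism_A/proteome.faa"

# Copy the directory and give the copy the name FAA in one step.
mkdir -p "$new_main"
cp -r "$old_main/FAA_csv_10_7" "$new_main/FAA"

ls "$new_main/FAA"
```

Copying instead of moving leaves the original output in place, so a different filtering parameter can still be compared against it later.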
Run the dMT.sh script:
$bash dMT.sh <new main directory> <tag length> <A for Archaea, B for Bacteria, O for Other>
This will produce a set of directories, including one called MERGED_TAGS. This directory will be used as an input in the mobile-element filtering command below.
Create a new main directory and, within this directory, a new directory FAA containing the input organisms from which you would like to filter out mobile elements.
Run the mobile element filtering command:
./mef <main directory to organisms you would like to filter> <full path to the contents of MERGED_TAGS>
This will create a directory FAA_mobelim, consisting of the input proteomes minus the mobile elements. Move this directory to a new directory, change its name from FAA_mobelim to FAA, and then run SlopeTree on it as in (2).
4. ST-TREES FILTERED BY CONSERVATION
4A. FILTERING BY MEANS OF A REFERENCE SET
Create a main directory. Within this directory, create an FAA directory, with the organisms you would like to filter, and another directory, FAA_ref, containing the organisms you would like to filter your set against. We provide 2 reference sets in REFS, one of bacteria and another of archaea (used in the SlopeTree publication). Users may also create their own reference sets.
Run the filtering code:
$bash pFilt.sh <path to main directory> <tag length> <A for Archaea, B for Bacteria, O for Other>
Then write out the new proteomes, according to the filtering level desired:
$./fpwrite <path to main directory> -f 10 -o <filtering level>
The filtering level parameter has bounds [0,10], with a lower value producing less filtered proteomes. The new proteomes are written to new directories within the main directory, with names of the format FAA_csv_10_3 and FAA_ref_csv_10_3. The reference set and main set do not need to be disjoint, and the new ref directory has also been filtered.
Depending on the purpose of the filtering, different levels of filtering may be appropriate. Given that writing out the new proteomes is a fast operation, we recommend that users experiment with different filtering parameters. We consider a filtering level of 5 or more to be aggressive filtering, which reduces proteomes by well over half their proteins.
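Experimenting with several filtering levels can be sketched as a simple loop. This is a dry run under stated assumptions: the main directory path is a hypothetical placeholder, the chosen levels are arbitrary examples within [0,10], and the run variable is a device of this sketch that prints each command instead of executing it.

```shell
# Sketch: try several filtering levels in turn (dry run by default).
main_dir="/path/to/main_directory"   # hypothetical path; replace with your own
run="echo"                           # set run="" to execute instead of printing

for level in 1 3 5 7; do
    # Each level lies within the documented bounds [0,10].
    $run ./fpwrite "$main_dir" -f 10 -o "$level"
done
```

With run left as echo, the loop only lists the four fpwrite commands, which makes it easy to check the arguments before running them.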
The new directories, FAA_csv_10_3 and FAA_ref_csv_10_3, can be moved to a new main directory and renamed FAA and FAA_ref, and SlopeTree can then be run on them as in (2). The reference set does not have to be included for this run.
4B. FILTERING WITHOUT A REFERENCE SET
It is also possible to perform conservation/stability filtering without a reference set. To do this, simply perform all steps in (4A), WITHOUT creating the directory FAA_ref.
5. USING THE PAIR-WISE CORRECTION FOR HORIZONTAL GENE TRANSFER
After running SlopeTree, a file HGT_pairs.txt should have been created in the main directory. This file lists the pairs that SlopeTree flagged as having unusual statistics or curvature, and which may therefore be involved in HGT.
To run the correction:
./fh <main directory>
This will create an updated distance matrix in DISTANCE_MATRICES.
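A small guard before running the correction might look like the sketch below. Treating a missing or empty HGT_pairs.txt as "nothing to correct" is an assumption of this sketch, not documented SlopeTree behaviour; the main directory and the file contents are hypothetical stand-ins, and the fh command is printed instead of being executed.

```shell
# Sketch: run the HGT correction only if a non-empty HGT_pairs.txt exists.
demo=$(mktemp -d)
main_dir="$demo/test_run"                            # hypothetical main directory
mkdir -p "$main_dir"
printf 'placeholder\n' > "$main_dir/HGT_pairs.txt"   # stand-in; real content comes from SlopeTree

if [ -s "$main_dir/HGT_pairs.txt" ]; then
    # Printed rather than executed, so the sketch runs without SlopeTree present.
    action="./fh $main_dir"
else
    action="skip: no flagged pairs found"
fi
echo "$action"
```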