GenLibrary: Tools for generating diverse sequence libraries
The main purpose of this package is to provide genLibrary.py, a command-line tool that, given a list of Uniprot IDs, downloads sequences for all disordered regions in those proteins, subdivides those regions into windows of a user-specified length, calculates properties of the windowed sequences, and returns a diverse set of sequence windows. The submodules in the package provide functions for calculating several predictors of phase separation.
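To make the windowing step concrete, here is a minimal sketch of splitting one sequence into fixed-length windows. The function name `sliding_windows` and the `step` parameter are illustrative assumptions, not part of the package; genLibrary.py may choose window positions differently (for example, non-overlapping windows).

```python
from typing import List


def sliding_windows(sequence: str, window_length: int, step: int = 1) -> List[str]:
    """Return all full-length windows of `sequence`, advancing `step` residues at a time.

    Hypothetical helper for illustration only; not the package's own function.
    """
    if window_length <= 0 or step <= 0:
        raise ValueError("window_length and step must be positive")
    # A sequence shorter than the window yields an empty list.
    last_start = len(sequence) - window_length
    return [sequence[i:i + window_length] for i in range(0, last_start + 1, step)]


if __name__ == "__main__":
    for window in sliding_windows("MSGQQNYGRS", window_length=8):
        print(window)
```

Setting `step` equal to `window_length` gives non-overlapping windows; trailing residues that do not fill a whole window are dropped in this sketch.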
Installation
Prerequisites
- If necessary, download and install Java. You can check whether Java is installed by executing
java -version
in a terminal.
- If necessary, install Python. You can check whether Python is installed by running
python --version
- If necessary, install pip. You can check whether pip is installed by running
pip --version
- If necessary, install git. You can check whether git is installed by running
git --version
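The four checks above can also be done in one go from Python. This is a small convenience sketch, not something shipped with the package; `find_tools` is a hypothetical name. It only reports whether each executable is on your PATH and does not verify versions.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_tools(names: Iterable[str] = ("java", "python", "pip", "git")) -> Dict[str, Optional[str]]:
    """Map each tool name to its location on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print("{:<8}{}".format(name, path or "NOT FOUND"))
```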
Get software and data used by genLibrary.py
- Download and install the PLAAC software as described here. By default, genLibrary.py expects the PLAAC .jar file to be located in $HOME/bin/plaac/cli/target/.
- Download the PScore software and databases from Julie Forman-Kay's website (look for 'IDR Phase Separation Predictor based on propensity for pi-pi contacts'). By default, genLibrary.py expects the PScore database files to be located in $HOME/bin/PScore/DBS.
- GenLibrary requires the latest version of Alex Holehouse's localCIDER package available on GitHub. Execute the following to install it:
pip install --upgrade git+https://github.com/Pappulab/localCIDER.git
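Before running genLibrary.py, it can help to confirm that the default locations mentioned above exist. The following is a minimal sketch; `check_default_dirs` and `has_jar` are hypothetical helpers, and the exact file names that genLibrary.py looks for inside these directories are not checked here.

```python
from pathlib import Path
from typing import Dict

# Default locations stated above; adjust these if you installed elsewhere.
DEFAULT_DIRS = {
    "PLAAC": Path.home() / "bin" / "plaac" / "cli" / "target",
    "PScore": Path.home() / "bin" / "PScore" / "DBS",
}


def check_default_dirs(dirs: Dict[str, Path] = DEFAULT_DIRS) -> Dict[str, bool]:
    """Report whether each expected directory exists."""
    return {name: path.is_dir() for name, path in dirs.items()}


def has_jar(directory: Path) -> bool:
    """Return True if the directory exists and contains at least one .jar file."""
    return directory.is_dir() and any(directory.glob("*.jar"))


if __name__ == "__main__":
    for name, found in check_default_dirs().items():
        print("{:<8}{}".format(name, "found" if found else "MISSING"))
    print("PLAAC .jar present:", has_jar(DEFAULT_DIRS["PLAAC"]))
```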
Install GenLibrary package
- In a terminal, execute the following command:
pip install git+https://git.swmed.edu/s181646/genlibrary.git
Check that the installation ran successfully by running
genLibrary.py --help
Usage
To get a library of disordered segments given a list of Uniprot IDs with the default options:
genLibrary.py -u UNIPROT_ID_FILE -m -o OUTPUT_FILE
This will return a library of 10,000 80-residue sequences. Use -n N_SEQUENCES and -w WINDOW_LENGTH to change these options. By default, the dimensionality of the sequence parameter space is reduced via principal component analysis, retaining the first principal components that cumulatively explain 99% of variance. This can be changed by providing a value in the interval [0, 1] with the -v VARIANCE option.
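The variance cutoff can be illustrated with a short sketch. Given the fraction of variance explained by each principal component (in decreasing order, as any PCA implementation reports them), it returns how many leading components are needed to reach the threshold. The function name is hypothetical, and whether genLibrary.py treats the boundary as "at least" or "more than" the threshold is an assumption here.

```python
from typing import Sequence


def n_components_for_variance(explained_variance_ratio: Sequence[float],
                              threshold: float = 0.99) -> int:
    """Smallest number of leading components whose cumulative ratio reaches `threshold`.

    Hypothetical helper; assumes the ratios are sorted in decreasing order.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    cumulative = 0.0
    for k, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return k
    # Floating-point round-off can leave the total just under the threshold;
    # in that case keep every component.
    return len(explained_variance_ratio)


if __name__ == "__main__":
    ratios = [0.7, 0.2, 0.095, 0.005]  # made-up values for illustration
    print(n_components_for_variance(ratios))
```

A lower value passed to -v would therefore keep fewer components, and a value of 1 would keep all of them.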
Alternatively, given a list of Uniprot IDs, you can generate a library including all regions of the sequences (disordered and ordered) by leaving out the -m flag in the command above.
Finally, you can generate a library from a FASTA file by using
genLibrary.py -f FASTA_FILE -o OUTPUT_FILE
Note that this option does not calculate disorder predictions.
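If you drive genLibrary.py from a Python script, the three modes described above can be assembled into an argument list as sketched below. `build_command` is a hypothetical wrapper, not part of the package. It uses only the flags documented in this section, and it assumes that -n, -w, and -v can also be combined with -f, which the text above does not state explicitly.

```python
from typing import List, Optional


def build_command(output_file: str,
                  uniprot_id_file: Optional[str] = None,
                  fasta_file: Optional[str] = None,
                  disordered_only: bool = True,
                  n_sequences: Optional[int] = None,
                  window_length: Optional[int] = None,
                  variance: Optional[float] = None) -> List[str]:
    """Assemble a genLibrary.py argument list (hypothetical helper)."""
    if (uniprot_id_file is None) == (fasta_file is None):
        raise ValueError("provide exactly one of uniprot_id_file or fasta_file")
    if variance is not None and not 0.0 <= variance <= 1.0:
        raise ValueError("variance must lie in [0, 1]")

    cmd = ["genLibrary.py"]
    if uniprot_id_file is not None:
        cmd += ["-u", str(uniprot_id_file)]
        if disordered_only:
            cmd.append("-m")  # restrict the library to disordered regions
    else:
        # FASTA input: no disorder predictions are calculated, so -m is omitted.
        cmd += ["-f", str(fasta_file)]

    # Optional settings are passed only when the caller overrides the defaults.
    for flag, value in (("-n", n_sequences), ("-w", window_length), ("-v", variance)):
        if value is not None:
            cmd += [flag, str(value)]

    cmd += ["-o", str(output_file)]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_command("OUTPUT_FILE", uniprot_id_file="UNIPROT_ID_FILE")))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the package is installed.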
Notes
This package was developed using Python 3.7.3 and tested on Linux. Sequence parameters are calculated in parallel using all available CPUs. This uses the Python multiprocessing module, which should work on Windows, but I haven't tested it.
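For readers unfamiliar with the multiprocessing pattern, here is a minimal sketch of computing a per-sequence value across all CPUs. The property used (`hydrophobic_fraction`, with an arbitrarily chosen residue set) is a toy stand-in, not one of the parameters the package calculates, and the package's own code may be organized differently.

```python
import multiprocessing
from typing import List


def hydrophobic_fraction(sequence: str) -> float:
    """Toy property: fraction of residues in an arbitrary hydrophobic set."""
    hydrophobic = set("AILMFVWY")
    if not sequence:
        return 0.0
    return sum(residue in hydrophobic for residue in sequence) / len(sequence)


def compute_in_parallel(sequences: List[str]) -> List[float]:
    """Apply hydrophobic_fraction to every sequence using a worker pool."""
    # Pool() with no argument starts one worker per CPU reported by the OS.
    with multiprocessing.Pool() as pool:
        return pool.map(hydrophobic_fraction, sequences)


if __name__ == "__main__":
    # This guard matters on Windows, where worker processes are started by
    # re-importing the main module rather than by forking.
    print(compute_in_parallel(["MAIL", "GSGS", "KKDE"]))
```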