GenLibrary: Tools for generating diverse sequence libraries

The main purpose of this package is to provide genLibrary.py, a command-line tool that, given a list of Uniprot IDs, downloads the sequences of all disordered regions in those proteins, subdivides those regions into windows of a user-specified length, calculates properties of the windowed sequences, and returns a diverse set of sequence windows. The submodules in the package provide functions for calculating several predictors of phase separation.
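The windowing step can be illustrated with a minimal sketch. This is not the code used by genLibrary.py: the function name window_sequence, the step parameter, and the choice to drop trailing residues that do not fill a whole window are all assumptions made for illustration.

```python
from typing import List, Optional


def window_sequence(sequence: str, window_length: int,
                    step: Optional[int] = None) -> List[str]:
    """Split a sequence into fixed-length windows (hypothetical helper).

    Assumption: trailing residues that do not fill a whole window are dropped.
    """
    if window_length <= 0:
        raise ValueError("window_length must be positive")
    if step is None:
        step = window_length  # non-overlapping windows by default (assumption)
    if step <= 0:
        raise ValueError("step must be positive")
    last_start = len(sequence) - window_length
    # range() is empty when the sequence is shorter than one window
    return [sequence[i:i + window_length]
            for i in range(0, last_start + 1, step)]


if __name__ == "__main__":
    print(window_sequence("ABCDEFGHIJ", 4))          # ['ABCD', 'EFGH']
    print(window_sequence("ABCDEFGHIJ", 4, step=2))  # ['ABCD', 'CDEF', 'EFGH', 'GHIJ']
```

Whether the real tool uses overlapping or non-overlapping windows is not stated here, so the step parameter is left adjustable.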

Installation

Prerequisites

  1. If necessary, download and install Java. You can check whether Java is installed by executing java -version in a terminal.
  2. If necessary, install Python. You can check if Python is installed by running python --version.
  3. If necessary, install pip. You can check if pip is installed by running pip --version.
  4. If necessary, install git. You can check if git is installed by running git --version.
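The four checks above can also be run in one go. The following is a small sketch using only the Python standard library; the helper name find_missing_tools is hypothetical, and it only tests whether each executable is on the PATH, not which version is installed. On some systems the interpreter is named python3 rather than python, so adjust the list as needed.

```python
import shutil


def find_missing_tools(tools=("java", "python", "pip", "git")):
    """Return the names in `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All prerequisites were found on PATH.")
```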

Get software and data used by genLibrary.py

  1. Download and install the PLAAC software as described here. By default, genLibrary.py expects the PLAAC .jar file to be located in $HOME/bin/plaac/cli/target/.
  2. Download the PScore software and databases from Julie Forman-Kay's website (look for 'IDR Phase Separation Predictor based on propensity for pi-pi contacts'). By default, genLibrary.py expects the PScore database files to be located in $HOME/bin/PScore/DBS.
  3. GenLibrary requires the latest version of Alex Holehouse's localCIDER package, which is available on GitHub. Execute the following to install it: pip install --upgrade git+https://github.com/Pappulab/localCIDER.git.

Install GenLibrary package

  1. In a terminal, execute the following command: pip install git+https://git.swmed.edu/s181646/genlibrary.git. Check that the installation succeeded by running genLibrary.py --help.

Usage

To get a library of disordered segments given a list of Uniprot IDs with the default options:

genLibrary.py -u UNIPROT_ID_FILE -m -o OUTPUT_FILE

This will return a library of 10,000 sequences, each 80 residues long. Use -n N_SEQUENCES and -w WINDOW_LENGTH to change these options. By default, the dimensionality of the sequence parameter space is reduced via principal component analysis, retaining the first principal components that cumulatively explain 99% of the variance. This threshold can be changed by providing a value in the interval [0, 1] with the -v VARIANCE option.
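To make the variance threshold concrete, here is a minimal sketch of how the number of retained components follows from the per-component variances. It is not taken from genLibrary.py: the function name n_components_for_variance is hypothetical, and the input is assumed to be the variances of the principal components, already sorted in decreasing order.

```python
from itertools import accumulate
from typing import Sequence


def n_components_for_variance(component_variances: Sequence[float],
                              threshold: float = 0.99) -> int:
    """Return how many leading components are needed to reach `threshold`.

    Assumption: `component_variances` is sorted in decreasing order.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    total = sum(component_variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    for count, cumulative in enumerate(accumulate(component_variances), start=1):
        # Small tolerance guards against floating-point round-off near the threshold
        if cumulative / total >= threshold - 1e-12:
            return count
    return len(component_variances)


if __name__ == "__main__":
    variances = [5.0, 3.0, 1.5, 0.5]  # made-up values for illustration
    print(n_components_for_variance(variances, 0.90))  # 3
    print(n_components_for_variance(variances, 0.99))  # 4
```

With these made-up variances the cumulative fractions are 0.5, 0.8, 0.95, and 1.0, so a threshold of 0.90 keeps three components and the default of 0.99 keeps all four. In this sketch at least one component is always kept, even for a threshold of 0.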

Alternatively, given a list of Uniprot IDs, you can generate a library including all regions of the sequences (disordered and ordered) by leaving out the -m flag in the command above.

Finally, you can generate a library from a FASTA file by using genLibrary.py -f FASTA_FILE -o OUTPUT_FILE. Note that this option does not calculate disorder predictions.

Notes

This package was developed using Python 3.7.3 and tested on Linux. Sequence parameters are calculated in parallel using all available CPUs. This relies on the Python multiprocessing module, which should also work on Windows, but I haven't tested it there.