|
Below is a table describing some of the programs I (and others) have written in the course of my research (or just for fun, or a combination of the two). They are intended to be run from the command line and require some basic familiarity with UNIX. The command-line syntax is very similar to that of the GCG package of sequence analysis software. If you routinely process large numbers of sequences (hundreds to thousands of sequences < 10 kb, not a genome's worth), some of these utilities may be useful. Most perform the same tasks as other, far more elegant software that is already out there, but a few I haven't seen duplicated (such as classify.py, fastaview.py, and treeparser.py). I use do.py constantly.

Be warned: these are "working" versions, which means they may not work at all. I'm constantly updating them as I find bugs, and I'm grateful to anyone who points out problems.

I've tried to keep the syntax simple and consistent (using the cp.py module). All of the information needed to run a program can be obtained by invoking the program name with the -h option; sometimes I've included examples in the documentation. The general syntax for these programs is as follows:

    programname.py -option1[=value] -option2[=value] inputfile1 inputfile2 ...

Some features:
    fastaview.py -w=60 -blo=10 -con align1.fasta align2.fasta

or, equivalently:

    fastaview.py -con align1.fasta -blo=10 align2.fasta -w=60

Most of these programs write to standard output (i.e., the screen) by default. This output can usually be directed to a file using the -out=outfilename option; alternatively, output can always be redirected to a file like this:

    fastaview.py align.fasta > outputfile

Output can also be directed to a pager:

    fastaview.py align.fasta | less

Some (hopefully most) of the sequence utilities can accept a piped string as input. Thus, the first example above could have been written like this:

    cat align1.fasta align2.fasta | fastaview.py -w=60 -blo=10 -con

Piping also lets you combine multiple programs with standard UNIX commands to do complex things with single commands. Here's a somewhat exaggerated example: v3_x4.fasta is a small set of nucleotide sequences retrieved from a database search for gp120 V3 sequences that use CXCR4. The following command will find sequences from France (FR), sort the names alphabetically, remove gap characters, translate the sequences, and display the result in an interleaved format for easy viewing (this must be entered as one unbroken line).

    fastanames.py v3_x4.fasta | grep FR | sort | extract.py -l=- v3_x4.fasta | translate.py -degap | fastaview.py | less

The above command is equivalent to the following (except for the creation of the intermediate files):

    fastanames.py v3_x4.fasta > v3_x4_namelist
    grep FR v3_x4_namelist > v3_x4_fromfrance
    sort -o v3_x4_fromfrance v3_x4_fromfrance
    extract.py v3_x4.fasta -l=v3_x4_fromfrance -out=v3_x4_fromfrance.fasta
    translate.py -degap v3_x4_fromfrance.fasta
    fastaview.py v3_x4_fromfrance.fasta -out=v3_x4_fromfrance.interleaved
    less v3_x4_fromfrance.interleaved

Note that some of these programs will also accept stdin as the value of an option. In this case, stdin is represented as a dash (-), as in the piped example above, where stdin is sent to the -l option of extract.py.
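The option syntax and the dash convention described above can be sketched in a few lines of Python. This is only an illustration written for this page in current Python 3, not the code in cp.py: the names parse_args and open_input are hypothetical, and the rule that a lone dash counts as a file name rather than an option is an assumption.

```python
import io
import sys


def parse_args(argv):
    """Split argv into (options, files) for the -name[=value] style.

    Hypothetical helper, NOT the cp.py interface. A token that starts
    with '-' and is longer than one character is an option; anything
    else, including a lone '-', is treated as an input file name.
    """
    options, files = {}, []
    for token in argv:
        if token.startswith('-') and len(token) > 1:
            name, sep, value = token[1:].partition('=')
            # A flag given without '=value' is stored as True.
            options[name] = value if sep else True
        else:
            files.append(token)
    return options, files


def open_input(name, stdin=sys.stdin):
    """Return a readable file object; a dash means standard input."""
    if name == '-':
        return stdin
    return open(name)


# Options and files may be interleaved in any order.
opts_a, files_a = parse_args('-w=60 -blo=10 -con align1.fasta align2.fasta'.split())
opts_b, files_b = parse_args('-con align1.fasta -blo=10 align2.fasta -w=60'.split())
print(opts_a == opts_b and files_a == files_b)
```

With this scheme, an argument such as -l=- parses to the option l with the value '-', which open_input then maps to standard input.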
My implementation of these piping features is not terribly consistent yet. Look here for a brief explanation of piping and command redirection (and a good general UNIX reference). For more examples, see each program's documentation (use the -h option) and "Phylogenetic trees made quick and dirty".

Requirements

All of these programs are written in Python and require at least Python version 2.1 (some may need v2.2, which I recommend getting in any case). I wrote them on a Mac running OS X v10.1, and routinely use them on various flavors of UNIX. I've never tried to run them on a Windows machine. Note that these programs expect input files to have UNIX line breaks; Mac or Windows line breaks will cause errors (convert files using m2u.py and u2m.py; see the table below).

Installation

Users at UNC with an isis account

This software can be executed from any machine that has AFS mounted (any of the scientific servers can access it) if your $PATH variable is set properly. You may have to add Python to your account for the programs to work, using the command:

    ipm add python

A script to add the location of this software to your PATH can be executed by typing:

    /afs/isis.unc.edu/home/n/g/nghoffma/public/add_nh_profile

You must log out and log back in for the changes to take effect. Don't do this if you have modified the file ~/public/.profile.personal to customize your environment: instead, look at /afs/isis.unc.edu/home/n/g/nghoffma/public/.profile.personal to see how to modify your own profile.

Installing local copies

Download the individual files from the links in the table below, or (preferably) get a tarball with all of the sources here. Unpack the tar archive with the command

    tar xvf noahprogs.tar

The programs must be executable. You can determine whether a file is executable by a) listing it using "ls -F": executables will be followed by an asterisk (*); or b) using ls -l [filename] (see the ls man page).
These should be executable if expanded using tar, but if they are not, make them so:

    chmod +x *.py

Almost all of these programs require the cp.py module, and most need Seq.py and SeqIO.py as well (see table below); just keep them all in the same directory and they should work fine. These files must be placed in a directory included in both the system's path (path or PATH) and Python's search path (PYTHONPATH). You can find such a directory with the commands "echo $PATH" and "echo $PYTHONPATH". It is preferable to add a directory to your path by modifying your login script. The location and name of this file vary between operating systems; ask your system administrator if you don't know how to do this. If you're using Mac OS 10.2.x, you can also see the file /usr/share/tcsh/examples/README for instructions, and read this thread on the Apple Unix discussion page.

If you are completely lost, follow these instructions (skip the first command if ~/bin already exists; this might not work if you're using a shell other than tcsh or csh).

    % mkdir ~/bin
    % mv [wherever it is]/noahprogs.tar ~/bin
    % tar xvf ~/bin/noahprogs.tar
    % chmod +x ~/bin/noahprogs/*
    % set path = ( $path ~/bin/noahprogs )
    % set PATH = ( $PATH ~/bin/noahprogs )
    % set PYTHONPATH = ( $PYTHONPATH ~/bin/noahprogs )

You'll have to do the last three steps each time you log in if you don't add commands like these to your login script; the other steps you should only have to do once.

For those who know Python...

I've found that any moderately complex task involving the manipulation of sequences can be made much less labor intensive with just a little bit of code. The modules cp.py, Seq.py, and SeqIO.py make it easy to write little throwaway scripts for very specific tasks, as well as some of the more complex programs below.
For example, here's all the code you need to translate a FASTA-format file of nucleotide sequences (removing gap characters first) and keep only the translated sequences that are longer than 100 amino acids and contain no stop codons.
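#!/usr/bin/env python
import sys
import Seq
import SeqIO

def main():
    # Read the FASTA file named on the command line, stripping gap characters.
    fasta = SeqIO.readFasta(sys.argv[1], degap=1, output='list')
    for seq in fasta:
        pep = Seq.translate(seq)
        # Keep peptides longer than 100 residues with no stop codon ('*').
        if len(pep) > 100 and pep.getSeq().find('*') == -1:
            # Python 2 print statement; the trailing comma suppresses the newline.
            print SeqIO.seqToFasta(pep),

main()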
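The filter above depends on Seq.py and SeqIO.py. For comparison, here is a rough standard-library sketch of the same idea in current Python 3. It is not a port of those modules: the names read_fasta, translate, and filter_peptides are made up for this example, and writing 'X' for any codon that is not in the standard table is an assumption, since the behavior of Seq.translate on ambiguous codons isn't described here.

```python
import io

BASES = 'TCAG'
# Standard genetic code, with amino acids listed in TCAG codon order.
AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, ''.join(chunks)


def translate(nucleotides):
    """Remove gaps ('-' and '.'), then translate complete codons in frame 1."""
    seq = nucleotides.upper().replace('-', '').replace('.', '').replace('U', 'T')
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    # Assumption: codons missing from the table (e.g. containing N) become 'X'.
    return ''.join(CODON_TABLE.get(codon, 'X') for codon in codons)


def filter_peptides(handle, min_length=100):
    """Yield (name, peptide) for peptides longer than min_length with no stop."""
    for name, seq in read_fasta(handle):
        pep = translate(seq)
        if len(pep) > min_length and '*' not in pep:
            yield name, pep


# Tiny demonstration with an in-memory file and a low length cutoff.
demo = io.StringIO('>short\nATG---GCC\n>stop\nATGTAAGCC\n')
for name, pep in filter_peptides(demo, min_length=1):
    print('>' + name)
    print(pep)
```

In the demonstration, only the first record is printed, because the second contains an internal stop codon. As in the original script, a sequence that ends in a stop codon is also dropped.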
Note that some of the functionality of cp.py is found in the standard getopt Python module (though cp.py does not use getopt). I probably wouldn't have bothered writing it if I had known about Greg Ward's Optik module. Also, check out the other Python sequence analysis projects out there, especially at biopython.org (a great resource that I didn't use nearly enough when writing this code). Feel free to copy, modify, and post any of this code; I'd be grateful for an email if anyone actually gets any use out of it.

Credits

Seq.py (the sequence object) and SeqIO.py started as a collaboration between me and Wolfgang Resch, a former graduate student in the Swanstrom Lab (now at NIH). Wolfgang wrote much of the original code in these two modules, and contributed to many of the other programs listed below. The sequence classification algorithm used in classify.py was adapted (i.e., copied) from a Java implementation by Jason Walker. The important parts of the C code in mandel.py are from someplace on the web; my apologies to the author for failing to remember the source.
|