Email Me
noah_hoffman[at symbol]med.unc.edu
 
  


Software


Below is a table describing some of the programs I (and others) have written in the course of my research (or just for fun, or a combination of the two). They are intended to be run on the command line, and require some basic familiarity with UNIX. The command line syntax is very similar to that of the GCG package of sequence analysis software. If you routinely process large numbers of sequences (I'm talking hundreds to thousands of sequences < 10Kb, not a genome's worth), some of these utilities may be useful. Most perform the same tasks as other, far more elegant software that is already out there, but a few I haven't seen duplicated (like classify.py, fastaview.py, treeparser.py). I use do.py constantly.

Use

Be warned: these are "working" versions, which means they may not work at all. I'm constantly updating them as I find bugs. I'm eternally grateful to anyone who points out problems.

I've tried to keep the syntax simple and consistent (using the cp.py module). All of the information needed to run this software can be obtained by invoking the program name with the -h option; sometimes I've included examples in the documentation.

The general syntax for these programs is as follows:

programname.py -option1[=value] -option2[=value] inputfile1 inputfile2 ...

Some features:

  • Options are always preceded by a '-'.
  • If an option can be assigned a value, the value is given after an '=' sign without a space (think of -option=value as a single word).
  • Values containing a space or other non-alphanumeric characters must be quoted.
  • The order of options and input files on the command line is not important.
For example, the following command instructs fastaview.py to read two fasta format sequence alignments (align1.fasta and align2.fasta), output an interleaved alignment of line width 60 (-w=60), block size 10 (-blo=10), and include a consensus (-con).
fastaview.py -w=60 -blo=10 -con align1.fasta align2.fasta
or, equivalently:
fastaview.py -con align1.fasta -blo=10 align2.fasta -w=60
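
To make the syntax rules above concrete, here is a minimal sketch in Python 3 of how a command line in this style could be split into options and input files. The function name parse_command_line is made up for illustration; this is not the cp.py interface, and cp.py may well handle cases (such as abbreviated option names) that this sketch ignores.

```python
def parse_command_line(argv):
    """Split arguments into (options, input_files).

    Hypothetical helper for illustration only; this is not the cp.py API.
    """
    options = {}
    input_files = []
    for arg in argv:
        if arg.startswith('-') and len(arg) > 1:
            # "-name=value" is treated as a single word; split it at the first '='.
            name, sep, value = arg[1:].partition('=')
            # An option given without '=' (e.g. -con) is recorded as True.
            options[name] = value if sep else True
        else:
            # Anything else, including a bare '-', is treated as an input file.
            input_files.append(arg)
    return options, input_files


if __name__ == '__main__':
    first = parse_command_line(
        ['-w=60', '-blo=10', '-con', 'align1.fasta', 'align2.fasta'])
    second = parse_command_line(
        ['-con', 'align1.fasta', '-blo=10', 'align2.fasta', '-w=60'])
    # The two orderings from the example above give the same result.
    print(first == second)
```

Because options go into a dictionary and input files into a list, the position of the options relative to the files does not matter, while the relative order of the input files is preserved.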

Most of these programs write to standard output (i.e., the screen) by default. This output can usually be directed to a file using the -out=outfilename option; alternatively, output can always be redirected to a file like this:

fastaview.py align.fasta > outputfile
Output can also be directed to a pager:
fastaview.py align.fasta | less

Some (hopefully most) of the sequence utilities can accept a piped string as input. Thus, the above example could have been written like this:

cat align1.fasta align2.fasta | fastaview.py -w=60 -blo=10 -con
Piping also lets you combine multiple programs with standard UNIX commands to do complex things with single commands. Here's a somewhat exaggerated example: this is a small set of nucleotide sequences retrieved from a database search for gp120 V3 sequences that use CXCR4. The following command will find sequences from France (FR), sort the names alphabetically, remove gap characters, translate the sequences, and display the result in an interleaved format for easy viewing (this must be entered as one unbroken line).
fastanames.py v3_x4.fasta | grep FR | sort | extract.py -l=- v3_x4.fasta | translate.py -degap | fastaview.py | less
The above command is equivalent to the following (except for the creation of the intermediate files):
fastanames.py v3_x4.fasta > v3_x4_namelist
grep FR v3_x4_namelist > v3_x4_fromfrance
sort -o v3_x4_fromfrance v3_x4_fromfrance
extract.py v3_x4.fasta -l=v3_x4_fromfrance -out=v3_x4_fromfrance.fasta
translate.py -degap v3_x4_fromfrance.fasta > v3_x4_fromfrance_pep.fasta
fastaview.py v3_x4_fromfrance_pep.fasta -out=v3_x4_fromfrance.interleaved
less v3_x4_fromfrance.interleaved

Note that some of these programs will also accept stdin as the value of an option. In this case, stdin is represented as a dash (-), as in the example above, where stdin is sent to the -l option of extract.py. My implementation of these piping features is not terribly consistent yet.
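
The dash convention is easy to support in your own scripts. The following Python 3 sketch shows one way to do it, assuming a hypothetical helper called read_name_list that reads a list of sequence names either from a named file or, when given '-', from standard input. It illustrates the idea only and is not how extract.py is actually written.

```python
import io
import sys


def read_name_list(source, stdin=sys.stdin):
    """Return the non-empty lines of a list file, or of stdin if source is '-'.

    Hypothetical helper; the stdin parameter exists so the function can be
    exercised without a real pipe.
    """
    if source == '-':
        lines = stdin.readlines()
    else:
        with open(source) as handle:
            lines = handle.readlines()
    return [line.strip() for line in lines if line.strip()]


if __name__ == '__main__':
    # Simulate names arriving on a pipe.
    piped = io.StringIO('seqB_FR\nseqA_FR\n\n')
    print(read_name_list('-', stdin=piped))
```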

Look here for a brief explanation of piping and command redirection (and a good general UNIX reference). For more examples, see each program's documentation (use the -h option), and Phylogenetic trees made quick and dirty.

Requirements

All of these are written in Python, and require at least Python version 2.1 (some may need v2.2, which I recommend getting in any case). I wrote them on a Mac running OS X v10.1, and routinely use them on various flavors of UNIX. I've never tried to run them on a Windows machine. Note that these programs expect input files to have UNIX line breaks: Mac or Windows line breaks will cause errors (convert files using m2u.py and u2m.py - see table below).
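
If you would rather normalize line breaks inside a script of your own, the conversion is short. This is a minimal Python 3 sketch, assuming the file contents have been read as bytes; the function name to_unix_line_breaks is made up here, and this is not the code in m2u.py.

```python
def to_unix_line_breaks(data):
    """Convert Windows ('\\r\\n') and old Mac ('\\r') line breaks to UNIX ('\\n').

    Takes and returns bytes. Hypothetical helper for illustration.
    """
    # Replace Windows breaks first so '\r\n' does not turn into two line feeds.
    return data.replace(b'\r\n', b'\n').replace(b'\r', b'\n')


if __name__ == '__main__':
    mac_style = b'>seq1\rACGT\r'
    windows_style = b'>seq1\r\nACGT\r\n'
    print(to_unix_line_breaks(mac_style) == to_unix_line_breaks(windows_style))
```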

Installation

Users at UNC with an isis account

This software can be executed from any machine that has AFS mounted (any of the scientific servers can access it) if your $PATH variable is set properly. For the programs to work, you may first have to add Python to your account with the command:
ipm add python
A script to add the location of this software to your PATH can be executed by typing:
/afs/isis.unc.edu/home/n/g/nghoffma/public/add_nh_profile

You must log out and log back in for the changes to take effect. Don't do this if you have modified the file ~/public/.profile.personal to customize your environment: instead, look at /afs/isis.unc.edu/home/n/g/nghoffma/public/.profile.personal to see how to modify your own profile.

Installing local copies

Download the individual files from the links in the table below, or (preferably) get a tarball with all of the sources here. Unpack the tar archive with the command

tar xvf noahprogs.tar
The programs must be executable. You can determine if a file is executable by a) listing it using "ls -F": executables will be followed by an asterisk (*); or b) using ls -l [filename] (see the ls man page). These should be executable if expanded using tar, but if they are not, make them so:
chmod +x *.py
Almost all of these programs require the cp.py module, and most need Seq.py and SeqIO.py as well (see table below); just keep them all in the same directory and they should work fine. These files must be placed in a directory included in both the system's search path (path or PATH) and Python's search path (PYTHONPATH). You can check which directories are already included with the commands "echo $PATH" and "echo $PYTHONPATH". It is preferable to add a directory to your path by modifying your login script. The location and name of this file vary between operating systems; ask your system administrator if you don't know how to do this. If you're using Mac OS 10.2.x, you can also see the file /usr/share/tcsh/examples/README for instructions, and read this thread on the Apple Unix discussion page. If you are completely lost, follow these instructions (skip the first command if ~/bin already exists; this might not work if you're using a shell other than tcsh or csh).
% mkdir ~/bin
% mv [wherever it is]/noahprogs.tar ~/bin
% cd ~/bin
% tar xvf noahprogs.tar
% chmod +x ~/bin/noahprogs/*
% set path = ( $path ~/bin/noahprogs )
% setenv PYTHONPATH ~/bin/noahprogs
You'll have to do the last two steps each time you log in if you don't add commands like these to your login script; the other steps you should only have to do once. In csh and tcsh, setting path also updates PATH, and environment variables such as PYTHONPATH are set with setenv rather than set. If PYTHONPATH already has a value, append to it instead with "setenv PYTHONPATH ${PYTHONPATH}:$HOME/bin/noahprogs".
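
To confirm that the setup worked, you can check both variables from Python. The sketch below (Python 3) assumes the programs live in ~/bin/noahprogs as in the instructions above; the helper name directory_is_listed is made up for this example.

```python
import os


def directory_is_listed(directory, path_value):
    """Return True if directory is one of the entries in a PATH-style string."""
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return target in entries


if __name__ == '__main__':
    install_dir = '~/bin/noahprogs'  # assumed install location
    for variable in ('PATH', 'PYTHONPATH'):
        listed = directory_is_listed(install_dir, os.environ.get(variable, ''))
        print(variable, 'ok' if listed else 'missing')
```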

For those who know Python...

I've found that any moderately complex task involving the manipulation of sequences can be made much less labor intensive with just a little bit of code. The modules cp.py, Seq.py, and SeqIO.py make it easy to write little throwaway scripts for very specific tasks, as well as some of the more complex programs below. For example, here's all the code you need to translate a fasta format file of nucleotide sequences (removing gap characters first), and discard any translated sequence that contains a stop codon or isn't at least 100 amino acids long.

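#! /usr/bin/env python
# Translate the nucleotide sequences in a fasta file and print the peptides
# that are at least 100 amino acids long and contain no stop codon.
import SeqIO, Seq, sys
def main():
	# Read every sequence from the file named on the command line, removing gaps.
	fasta = SeqIO.readFasta(sys.argv[1], degap=1, output='list')
	for seq in fasta:
		pep = Seq.translate(seq)
		# Keep peptides that are long enough and have no stop codon ('*').
		if len(pep) >= 100 and pep.getSeq().find('*') == -1:
			# The trailing comma suppresses the extra newline (Python 2 print).
			print SeqIO.seqToFasta(pep),
main()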
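
For readers without these modules at hand, the same task can be sketched with nothing but the standard library. The version below is written for Python 3 and is only an approximation of the script above: the names read_fasta, translate, and keep are made up for this example, the codon table is the standard genetic code, and the handling of ambiguous codons (translated as X) is an assumption rather than a description of what Seq.py does.

```python
import sys

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = 'TCAG'
AMINO_ACIDS = ('FFLLSSSSYY**CC*W' 'LLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRR' 'VVVVAAAADDEEGGGG')
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def read_fasta(lines):
    """Yield (name, sequence) pairs from fasta-format lines, removing gaps."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(chunks)
            name, chunks = line[1:], []
        elif line:
            # Drop the gap characters '-' and '.' before storing the residues.
            chunks.append(line.replace('-', '').replace('.', ''))
    if name is not None:
        yield name, ''.join(chunks)


def translate(dna):
    """Translate in frame 1; unknown codons become 'X', a trailing partial codon is ignored."""
    dna = dna.upper().replace('U', 'T')
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return ''.join(CODON_TABLE.get(codon, 'X') for codon in codons)


def keep(pep, min_length=100):
    """True for peptides of at least min_length residues with no stop codon."""
    return len(pep) >= min_length and '*' not in pep


if __name__ == '__main__' and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for name, dna in read_fasta(handle):
            pep = translate(dna)
            if keep(pep):
                print('>%s\n%s' % (name, pep))
```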

Note that some of the functionality of cp.py is found in the standard getopt Python module (though cp.py did not use getopt). I probably wouldn't have bothered if I had known about Greg Ward's Optik module. Also, check out the other python sequence analysis projects out there, especially at biopython.org (a great resource that I didn't use nearly enough when writing this code).

Feel free to copy, modify, and post any of this code - I'd be grateful for an email if anyone actually got any use out of it.

Credits

Seq.py (the sequence object) and SeqIO.py started as a collaboration between me and Wolfgang Resch, a former graduate student in the Swanstrom Lab (now at NIH). Wolfgang wrote much of the original code in these two modules, and contributed to many of the other programs listed below. The sequence classification algorithm used in classify.py was adapted (i.e., copied) from a Java implementation by Jason Walker. The important parts of the C code in mandel.py are from someplace on the web - my apologies to the author for failing to remember the source.

Sequence formatting

  • fasta2lines.py: Writes sequences in fasta format to a file in which each name and sequence string are contained on a single line (names and sequences are tab-delimited). Useful for exporting sequences to Excel or for sorting by name or sequence. [Documentation]
  • lines2fasta.py: Converts a file in which each line contains [name] [sequence] to fasta format. Name and sequence strings may be separated by a tab or space characters. [Documentation]
  • fasta2msf.py: Converts fasta format to GCG MSF format. [Documentation]
  • fasta2phylip.py: Converts a fasta format sequence alignment to phylip2 (sequential) format. Mercilessly truncates sequence names to 10 characters. [Documentation]
  • fastaview.py: Converts fasta format sequence alignments to an interleaved format more suitable for viewing. Can also display differences from the consensus. Block size and line length can be specified. [Documentation]
  • selex2fasta.py: Converts an alignment in selex format to fasta format. Selex is the default output format used by hmmer. [Documentation]

Sequence manipulation utilities

  • assemble.py: Concatenates sequences contained in fasta files according to a list file. Sequences may be provided in one or more fasta files. [Documentation]
  • classify.py: Classifies sequences in an alignment according to one or two "rules" specifying amino acid composition at defined positions. Rules are written as nested, parenthesized pairs of expressions; an expression consists of a position followed by a character (e.g., 3R). Build the rules by joining pairs of expressions with parentheses. [Documentation]
  • codonalign.py: Aligns nucleotide sequences based on an aligned peptide sequence. Neither alignment need be sorted or in any particular order. [Documentation]
  • consensus.py: Calculates the consensus for an alignment of nucleotide or protein sequences. [Documentation]
  • degap.py: Removes gap characters from sequences in fasta format. [Documentation]
  • encode.py: Numerically encodes sequences in a fasta file. [Documentation]
  • extract.py: Creates a new fasta file containing the subset of sequences defined in list. Neither file need be in any particular order. Sequences in list but not in fasta are ignored. [Documentation]
  • fastanames.py: Reads names in a tfa file. Prints to stdout by default. [Documentation]
  • gapstrip.py: Removes columns from a sequence alignment in which a specified percent of sequences contain gap characters. Useful for preparing alignments for phylogenetic analysis. [Documentation]
  • translate.py: Translates a fasta format file containing nucleotide sequences. Can filter translated sequences according to length or the presence of unwanted characters (like X, *). [Documentation]
  • trim.py: Excludes a specified set of columns from a sequence alignment. [Documentation]

Phylogenetic analysis

  • paupboot.py: Creates an execution script for various analyses using PAUP. Nucleotide sequences are supplied either as one or more fasta format files or in a single nexus format file. [Documentation]
  • treeparser.py: Modifies Newick format trees from PAUP (NEXUS format) and phylip phylogenetics programs. Can be used to change taxon names, node labels, and branch lengths. The default input format is from PAUP 4.0b10. [Documentation]

General utilities

  • do.py: Iterates a command over a list. The list can be supplied using a list file with the -l argument, as a list of arguments, as an expression to be evaluated or expanded, or as some combination of these methods. This program has some of the functionality of the UNIX find and xargs commands, but has simpler syntax, and has some nice filename replacement features. [Documentation]
  • shuffle.py: Randomizes the lines of a file and writes the first n entries. Use to pick a random subset of items from a list. [Documentation]
  • m2u.py: Converts Mac-style line breaks ('\r') to UNIX style ('\n'). [Documentation]
  • u2m.py: Converts UNIX-style line breaks ('\n') to Mac style ('\r'). [Documentation]
  • lconv: Converts line breaks of input file(s) regardless of input file type. The type of conversion is specified according to the first argument. Replace [files] with - to read stdin. [Documentation]
  • unixname.py: Makes filenames UNIX safe. Replaces all whitespace and non-alphanumeric characters with a character specified in c. [Documentation]
  • col.py: Gets a specified column from an input file. [Documentation]

The Seq Object and Sequence I/O modules

  • Seq.py: The Sequence object. Contains a nucleotide sequence plus information such as name, a header string, accession number. [Documentation]
  • SeqIO.py: Module containing functions for reading and writing sequence information in various formats, and for translation of nucleotide sequences. Lots of useful functions in here! [Documentation]
  • Dictionaries.py: Contains various dictionaries for translation, encoding, etc. of nucleotide and amino acid sequences. [Documentation]
  • Patterns.py: Collection of regular expression patterns used by sequence handling programs. [Documentation]
  • Utility.py: Miscellaneous utility functions. [Documentation]

Command parsing for Python scripts

  • cp.py: A module for parsing arguments from the command line. This module was designed to permit the rapid construction of programs with a simple, uniform command-line interface modeled roughly after the syntax used by GCG software. Gracefully handles pipes, opens files for input and output, and automatically generates a help string. [Documentation]
  • cp_template.py: A template for building scripts using the cp, SeqIO, and Seq modules. [Documentation]

LaTeX

  • endnlib.py: Parses, modifies, and writes a reference database in EndNote export format. [Documentation]
  • texref.py: Converts all citations enclosed by temporary citation markers in the format [!Author, Year, Label!] to BibTeX format citations (i.e., \cite{Label}). [Documentation]

Fun

  • mandel.py: Calls mandel.c to create an array of numbers generated using the Mandelbrot formula. Attempts to compile mandel.c if mandel isn't found on PATH. Writes an 8-bit binary file which can be viewed in NIH Image or Photoshop, for example. [Documentation]

Last modified Wed Dec 15 23:43:27 2004