@hackage clustertools0.1.2

Tools for manipulating sequence clusters

To build these tools, you will need a Haskell compiler (most likely GHC), together with my bioinformatics library and the SimpleArgs module (both downloadable from http://malde.org/~ketil/biohaskell/).

This package contains the following tools:

filter - remove unwanted sequences from a clustering.

     Usage: filter seq.list < cluster.L > cluster2.L
     cluster2.L will only contain sequence labels found in seq.list
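As a rough illustration of what filter does, here is a minimal Python sketch (not the actual Haskell implementation). It assumes a "label"-formatted clustering holds one cluster per line with whitespace-separated labels, and that clusters left empty are dropped. Both are assumptions, and filter_clustering is a hypothetical name.

```python
def filter_clustering(cluster_lines, keep):
    """Yield each cluster restricted to the labels found in `keep`.

    Assumption: one cluster per line, labels separated by whitespace.
    Assumption: clusters that end up empty are dropped from the output.
    """
    for line in cluster_lines:
        kept = [label for label in line.split() if label in keep]
        if kept:
            yield " ".join(kept)


if __name__ == "__main__":
    keep = {"seq1", "seq3"}
    clusters = ["seq1 seq2 seq3", "seq4", "seq5 seq1"]
    for cluster in filter_clustering(clusters, keep):
        print(cluster)
```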

hist - produce a histogram of cluster sizes from a "label"-formatted clustering.
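The kind of histogram hist produces can be sketched in Python as follows. This is not the tool's own code; it assumes one cluster per line with whitespace-separated labels, and cluster_size_histogram is a hypothetical name. The tool's actual output layout may differ.

```python
from collections import Counter


def cluster_size_histogram(cluster_lines):
    """Return sorted (cluster size, number of clusters) pairs.

    Assumption: one cluster per line, labels separated by whitespace.
    Blank lines are ignored.
    """
    sizes = (len(line.split()) for line in cluster_lines if line.strip())
    return sorted(Counter(sizes).items())


if __name__ == "__main__":
    for size, count in cluster_size_histogram(["a b", "c", "d e", "f g h"]):
        print(size, count)
```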

clusc - compare clusterings, calculating numerous pair-based and entropy-based indices.

xcerpt - given a file containing a list of sequence labels (e.g. a "label" formatted clustering), extract matching sequences from a FASTA file. Like "agrep -d '^>'" without the bugs.

     Usage: xcerpt list.txt fasta.seq
     creates "fasta.seq.match" and "fasta.seq.rest"
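The splitting that xcerpt performs might look roughly like this in Python. This is a sketch, not the actual implementation: it assumes a record's label is the first whitespace-delimited word after '>', and split_fasta is a hypothetical name.

```python
def split_fasta(fasta_lines, wanted):
    """Split FASTA lines into (match, rest) by record label.

    Assumption: the label is the first word after '>' on the header line.
    Lines appearing before the first header are placed in `rest`.
    """
    match, rest = [], []
    target = rest
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            label = fields[0] if fields else ""
            target = match if label in wanted else rest
        target.append(line)
    return match, rest


if __name__ == "__main__":
    fasta = [">s1 first read", "ACGT", ">s2", "GG", "TT", ">s3", "A"]
    match, rest = split_fasta(fasta, {"s1", "s3"})
    print("\n".join(match))
    print("\n".join(rest))
```

The two returned lists correspond to the contents of the ".match" and ".rest" files.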

add_single - add singletons to a clustering.

     Usage: add_single all.L clustering.L
     creates "clustering.L_s", listing all sequences in all.L but not in clustering.L, one per line.
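The set difference behind add_single can be sketched in Python like this. It is not the tool's own code; it assumes both files hold whitespace-separated labels, and singletons is a hypothetical name.

```python
def singletons(all_lines, clustering_lines):
    """Return labels present in `all_lines` but absent from the clustering.

    Assumption: labels are separated by whitespace in both inputs.
    Order of first appearance is preserved and duplicates are skipped.
    """
    clustered = {label for line in clustering_lines for label in line.split()}
    seen = set()
    result = []
    for line in all_lines:
        for label in line.split():
            if label not in clustered and label not in seen:
                seen.add(label)
                result.append(label)
    return result


if __name__ == "__main__":
    # Each remaining label would be written on its own line.
    print("\n".join(singletons(["a b", "c", "d"], ["a c"])))
```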

ace2contigs - parse an ACE assembly file, and output the contigs in a FASTA file (named by tacking on .fasta to the ACE file name), and the corresponding quality information (.qual).

ace2fasta - parse an ACE assembly, and output each assembly in a separate FASTA-formatted file, with the necessary gaps inserted to align the sequences (suitable for import into e.g. Seaview).

ace2clusters - parse an ACE assembly, and output clusters composed of the sequences used for each contig. The format is similar to TGICL's, with cluster output as one line consisting of a '>' and the contig name, and the next line containing the names of the sequences that comprise the cluster.
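To make the output format concrete, the following Python sketch writes clusters in the two-line layout described above. It does not parse ACE files; the contig and read names are made up, the space separator between sequence names is an assumption, and format_clusters is a hypothetical name.

```python
def format_clusters(contigs):
    """Format (contig name, read names) pairs as two lines per cluster.

    Line 1: '>' followed by the contig name.
    Line 2: the names of the member sequences.
    Assumption: sequence names are separated by single spaces.
    """
    out = []
    for name, reads in contigs:
        out.append(">" + name + "\n")
        out.append(" ".join(reads) + "\n")
    return "".join(out)


if __name__ == "__main__":
    example = [("Contig1", ["read1", "read2"]), ("Contig2", ["read3"])]
    print(format_clusters(example), end="")
```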

clusterlibs - given a table of regular expressions and library names, along with a clustering (TGICL-format), output a table of clusters with the library name prepended to the sequences.
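The labelling step of clusterlibs could be sketched in Python as below. This is only an illustration under several assumptions: the first matching expression wins, the library name is joined to the label with ':', and labels matching nothing are left unchanged. None of these details is specified here, and tag_with_library is a hypothetical name.

```python
import re


def tag_with_library(label, table, sep=":"):
    """Prepend the library name of the first matching pattern to `label`.

    `table` is a list of (regular expression, library name) pairs.
    Assumption: first match wins; unmatched labels are returned as-is.
    Assumption: `sep` joins the library name and the label.
    """
    for pattern, library in table:
        if re.search(pattern, label):
            return library + sep + label
    return label


if __name__ == "__main__":
    table = [(r"^EST", "liver"), (r"^GB", "brain")]  # made-up example table
    cluster = ["EST001", "GB42", "X1"]
    print(" ".join(tag_with_library(label, table) for label in cluster))
```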