@hackage rbr0.8.6

Mask nucleotide (EST) sequences in Fasta format

New in this version (0.8)

Mostly a maintenance release, but at least we have

  • Cabalized install

  • Sparse mode completed and optimized

  • More reasonable default parameters

Acquiring RBR

I'll try to keep a selection of source and binaries at:

http://www.ii.uib.no/~ketil/bioinformatics/downloads/software

If you need binaries for other architectures, drop me a mail at ketil@ii.uib.no. The latest version should always be available from my darcs repo:

darcs get http://www.ii.uib.no/~ketil/bioinformatics/repos/rbr

Installation instructions

I'm working on a smoother installation process, but this is how it currently works.

You need GHC (http://haskell.org/ghc). Everything is tested against version 6.8. Older versions might also work, but expect to make some modifications to the code for them.

You first need to get and install my 'bio' library, which is available from the same website. You can use Cabal to build and install the binary. At the top level, do

chmod +x Setup.hs

./Setup.hs configure    (add --prefix=$HOME if you don't have root access)

./Setup.hs build

{sudo} ./Setup.hs install

If you want to go the more manual route, cd to the src subdirectory, and

make rbr    -- builds a dynamically linked executable

or: make rbr_s -- builds a statically linked executable

The main development platform is Linux/x86, so expect that to be the best supported. In order to build on an aging Sun with an old gcc (2.95), I had to comment out 'hooks.o' from the Makefile, and the static build didn't work either. I'm investigating this, but perhaps it suffices to have a current GCC available, and/or a newer Solaris.

Usage

I've no real manual page yet, but 'rbr --help' should list the available options. Basically, masking is determined by examining word frequencies of a certain word length (-k), estimating a distribution around the "modal interval" of the word frequencies with a certain stringency (-s), and masking words whose frequencies exceed the mean of this distribution by a certain number of standard deviations (-t). Defaults are -k 16 -s 1.1 -t 5.0 if -L (lower case masking) is specified and -k 16 -s 2.0 -t 8.0 if -n (masking with 'n') is specified. Lower case is now the default.
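To make the thresholding idea concrete, here is a minimal Python sketch. It is not RBR's implementation: it skips the "modal interval" estimation (-s) entirely and simply uses the mean and standard deviation of all word counts, and the function name mask_repeats and its parameters are made up for illustration. Only the roles of k (word length) and t (number of standard deviations) follow the description above.

```python
from collections import Counter
from statistics import mean, pstdev


def mask_repeats(seqs, k=16, t=5.0):
    """Lower-case every k-word whose count exceeds mean + t * sd.

    Simplification (an assumption of this sketch): the mean and sd are
    taken over all distinct word counts, not over a distribution fitted
    around the modal interval as RBR does.
    """
    # Count every overlapping word of length k across all sequences.
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    values = list(counts.values())
    threshold = mean(values) + t * pstdev(values)

    masked = []
    for s in seqs:
        chars = list(s)
        for i in range(len(s) - k + 1):
            if counts[s[i:i + k]] > threshold:
                # Mask all positions covered by the over-represented word.
                for j in range(i, i + k):
                    chars[j] = chars[j].lower()
        masked.append("".join(chars))
    return masked


if __name__ == "__main__":
    # A poly-A run next to a short non-repetitive sequence.
    print(mask_repeats(["A" * 12, "ACGTTGCA"], k=3, t=2.0))
```

With this toy input, the word AAA occurs ten times while every other word occurs once, so only the poly-A sequence is lower-cased.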

To mask more aggressively, you can reduce the stringency, lower the deviation threshold, or both. Conversely, increase them if you want more conservative masking. In general, the differences are small, and typically you can compensate for a decrease in one parameter with an increase in the other.

Shorter word lengths will be more tolerant of SNPs and read errors, but increase the variance. Longer word lengths will be less tolerant, but have less variance. In addition, word lengths beyond 16 will be slower.
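The tolerance claim follows from a simple count: a single substitution changes every word that overlaps it, which is up to k words. The Python sketch below illustrates this on a made-up sequence; shared_words is a hypothetical helper for this example, not part of RBR.

```python
def shared_words(a, b, k):
    """Count the positions at which two equal-length sequences have identical k-words."""
    return sum(a[i:i + k] == b[i:i + k] for i in range(len(a) - k + 1))


if __name__ == "__main__":
    original = "ACGT" * 5                           # 20 bases
    variant = original[:10] + "T" + original[11:]   # one substitution at index 10
    for k in (4, 8):
        total = len(original) - k + 1
        print(k, shared_words(original, variant, k), "of", total)
```

Here the substitution destroys 4 of 17 words at k=4, but 8 of 13 words at k=8, so a longer word length loses a larger share of the evidence to each error.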

There's also a --sparse=X option that will store a fraction (but at least every X'th) of the words. This will reduce memory consumption proportionally.

RBR's memory usage can be limited with options to the run time system's garbage collector. A good rule of thumb may be to limit it to 80-90% of available physical memory, which will avoid paging to disk. If RBR is compiled with hooks.o linked in, this will be the default, but if other behaviour is desired, you can use "+RTS -MxxxM -RTS" to limit heap use to xxxMB¹. See the GHC documentation (http://haskell.org/ghc) for more on this. Usually, you'll get better performance by supplying -HxxxM as well (this will reduce GC time, again see the GHC docs).
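If you want to compute the limit rather than guess it, a small helper like the following Python sketch can build the option string. The function names and the 85% default are assumptions for illustration, the M suffix is treated as 1024*1024 bytes, and the physical memory lookup relies on os.sysconf names that are available on Linux but not on every platform.

```python
import os


def rts_memory_flags(physical_bytes, fraction=0.85):
    """Build a '+RTS -MxxxM -RTS' string limiting the heap to a fraction of memory."""
    # Assumption of this sketch: one 'M' is 1024 * 1024 bytes.
    limit_mb = int(physical_bytes * fraction) // (1024 * 1024)
    return f"+RTS -M{limit_mb}M -RTS"


def physical_memory_bytes():
    """Total physical memory in bytes (works on Linux; may fail elsewhere)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


if __name__ == "__main__":
    # Print flags to append to the rbr command line.
    print(rts_memory_flags(physical_memory_bytes()))
```

The printed string is meant to be appended to your usual rbr command line.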

The -v option gives some feedback while RBR runs, which is nice if you're using it interactively.

There is also a server mode, where RBR will index a data set, and listen on stdin for sequence names, and answer on stdout with the original sequence, the masked sequence, and the distribution of word frequencies along the sequence.

¹) GHC version 6.6 and earlier had a bug that would cause memory consumption to be measured incorrectly if the system allocated it in an unusual order. The fix will be in subsequent releases of GHC.