@hackage sequor0.3.2

sequor 0.3.0

AUTHOR: Grzegorz Chrupała gchrupala@lsv.uni-saarland.de

Sequor is a sequence labeler based on Collins's sequence perceptron. Sequor has a flexible feature template language and is meant mainly for NLP applications such as Part of Speech tagging, syntactic chunking or Named Entity labeling.

This version of Sequor includes SemiNER, a named-entity labeler, with pre-trained models for German. For details see ./lib/seminer/README

INSTALLATION

See installation instructions: http://code.google.com/p/sequor/wiki/INSTALL

USAGE

With Sequor you can learn a model from sequences that have been manually annotated with labels, and then apply that model to label new data. It is intended mainly for linguistic data, for example to learn Part of Speech tagging, syntactic chunking or Named Entity labeling.

    Usage: sequor command [OPTION...] [ARG...]

    train: train model
      train [OPTION...] TEMPLATE-FILE TRAIN-FILE MODEL-FILE
        --rate=NUM            learning rate
        --beam=INT            beam size
        --iter=INT            number of iterations
        --min-count=INT       minimum feature frequency for label dictionary
        --heldout=FILE        path to heldout data
        --hash                use hashing instead of feature dictionary
        --hash-sample=INT     sample size to estimate number of features when hashing
        --hash-max-size=INT   maximum size of parameter vector when hashing

    predict: predict using model
      predict MODEL-FILE

    version: print version
      version

    help: print usage information
      help
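The --hash options replace the feature dictionary with a parameter vector of bounded size. The Python sketch below illustrates the general feature-hashing idea only. It is not Sequor's implementation: the names hashed_index, score and MAX_SIZE, and the choice of CRC32 as the hash function, are assumptions made for illustration.

```python
import zlib

MAX_SIZE = 2 ** 20  # hypothetical cap, playing the role of --hash-max-size


def hashed_index(feature: str, size: int = MAX_SIZE) -> int:
    """Map a feature string to a slot in a parameter vector of the given size."""
    # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(feature.encode("utf-8")) % size


def score(features, weights):
    """Sum the weights of the slots that the active features hash to."""
    return sum(weights[hashed_index(f, len(weights))] for f in features)


weights = [0.0] * 1024
# A perceptron-style update touches only the slot the feature hashes to.
weights[hashed_index("suffix=en", len(weights))] += 0.1
print(score(["suffix=en", "word=Union"], weights))
```

No string-to-index dictionary has to be stored, at the cost of occasional collisions in which two features share one weight.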

Data files should be in the UTF-8 encoding.

The data directory contains example data annotated with syntactic chunk labels. To train a model on it and then label the test set:

    ./bin/sequor train data/all.features data/train.conll model \
        --rate 0.1 --beam 10 --iter 5 --min-count 50 --hash \
        --heldout data/devel.conll

    ./bin/sequor predict model < data/test.conll > data/test.labels
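To check the predictions against the gold labels in the test file, something like the following Python sketch could be used. It assumes that the predict output contains one label per token line, with blank lines between sentences as in the input; this layout is an assumption and is not documented here. The function names read_labels and token_accuracy are hypothetical.

```python
def read_labels(lines, column=-1):
    """Collect one label per non-blank line, taking the given whitespace-separated field."""
    return [line.split()[column] for line in lines if line.strip()]


def token_accuracy(gold_lines, predicted_lines):
    """Fraction of tokens whose predicted label equals the gold label."""
    gold = read_labels(gold_lines)            # last field of each CoNLL line
    predicted = read_labels(predicted_lines)  # assumed: one label per line
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted data have different token counts")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Tiny in-memory demonstration with two tokens, one of them labeled correctly.
gold = ["der d ART I-NC O", "Union Union NN I-NC ORG", ""]
pred = ["O", "O", ""]
print(token_accuracy(gold, pred))  # 0.5
```

With real files, the two arguments would be file objects such as open("data/test.conll", encoding="utf-8") and open("data/test.labels", encoding="utf-8").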

FEATURE TEMPLATE SYNTAX

Sequor uses a small language for specifying the feature templates to use when learning. This section gives an informal overview of this language.

Sequor uses the simple CoNLL format for its input files. In this format, sentences are separated by blank lines and each line represents a single token (word). Every token should have the same fixed number of space-separated fields, the last of which is the label, e.g.

    der d ART I-NC O
    Europäischen europäisch ADJA I-NC ORG
    Union Union NN I-NC ORG
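As an illustration of this format, the Python sketch below reads such data into a list of sentences, each a list of field lists. The function name read_conll is hypothetical and is not part of Sequor.

```python
def read_conll(lines):
    """Group token lines into sentences; a blank line ends the current sentence."""
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        if fields:
            current.append(fields)
        elif current:
            sentences.append(current)
            current = []
    if current:  # the last sentence may lack a trailing blank line
        sentences.append(current)
    return sentences


sample = """der d ART I-NC O
Europäischen europäisch ADJA I-NC ORG
Union Union NN I-NC ORG
"""
sentences = read_conll(sample.splitlines())
tokens = [row[:-1] for row in sentences[0]]  # input fields
labels = [row[-1] for row in sentences[0]]   # the last field is the label
print(labels)  # ['O', 'ORG', 'ORG']
```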

The template language treats the input sentence as a matrix of features (i.e. field values) and allows you to select those features and apply transformations to them.

The language consists of a number of predefined functions. By calling these functions with particular arguments you specify the feature set to use. As an example, consider the following template:

    Cat [ Cell 0 0, Suffix 2 (Cell 0 0), Row -1, Row 1 ]

It specifies the following features: the first field of the current token, the two-character suffix of the first field of the current token, all the fields of the previous token and all the fields of the following token.
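To make the example template concrete, the Python sketch below mimics it on a small sentence. It is an informal re-creation, not Sequor's code: the helper names are hypothetical, the matrix is assumed to hold only the input fields (without the label column), and the features are returned as plain strings without the positional information a real feature extractor would presumably attach.

```python
def cell(sentence, i, r, c):
    """Field c of the token r rows away from token i; empty if out of range."""
    j = i + r
    if 0 <= j < len(sentence) and 0 <= c < len(sentence[j]):
        return [sentence[j][c]]
    return []


def row(sentence, i, r):
    """All fields of the token r rows away from token i; empty if out of range."""
    j = i + r
    return list(sentence[j]) if 0 <= j < len(sentence) else []


def suffix(n, features):
    """Last n characters of each feature (the whole feature if it is shorter)."""
    return [f[max(len(f) - n, 0):] for f in features]


def cat(*feature_lists):
    """Concatenate several feature lists into one."""
    return [f for fs in feature_lists for f in fs]


def example_template(sentence, i):
    """Cat [ Cell 0 0, Suffix 2 (Cell 0 0), Row -1, Row 1 ] at token i."""
    return cat(
        cell(sentence, i, 0, 0),
        suffix(2, cell(sentence, i, 0, 0)),
        row(sentence, i, -1),
        row(sentence, i, 1),
    )


sentence = [
    ["der", "d", "ART", "I-NC"],
    ["Europäischen", "europäisch", "ADJA", "I-NC"],
    ["Union", "Union", "NN", "I-NC"],
]
print(example_template(sentence, 1))
```

For the middle token this yields its first field, the suffix "en", and the fields of the tokens before and after it.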

Functions:

    Cell r c
        Selects the field in row r and column c.
    Rect r c r' c'
        Selects all features in the rectangle whose upper-left corner is in
        row r, column c and whose lower-right corner is in row r', column c'.
    Row r
        Selects all features in row r.
    MarkNull f
        If the feature does not exist, replaces it with a NULL mark.
        Typically used when the absence of a feature is significant, e.g. to
        mark the beginning of the sentence.
    Index f
        Marks the feature f for use in indexing the label dictionary.
    Cat [f1,f2,...,fn]
        Selects the features in the list.
    Cart f f'
        Creates the Cartesian product of feature sets f and f'. If f and f'
        are singletons, simply conjoins the two features.
    Lower f
        Maps f to lower-case characters.
    Suffix i f
        Takes the suffix of length i characters of feature f.
    Prefix i f
        Takes the prefix of length i characters of feature f.
    WordShape f
        Creates a specification of which character classes, such as
        lower-case and upper-case letters, digits or punctuation, occur in
        feature f.

Remarks: Rows are indexed relative to the current token (0). Columns are indexed starting with 0. Functions which take features as arguments can be passed either singleton features or sequences of features. If passed a sequence they are applied to each of its elements. For example (Suffix 3 (Row 0)) will return the sequence of features formed by taking the suffix of length 3 of each field of row 0.
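The remark about sequences, together with MarkNull and Cart, can be sketched in Python as follows. This is an informal model with hypothetical helper names, not Sequor's implementation. In particular, the "<NULL>" string and the "|" separator used to conjoin features are assumptions, and the matrix is assumed to hold only the input fields.

```python
from itertools import product

NULL = "<NULL>"  # hypothetical marker; the NULL mark Sequor actually uses may differ


def cell(sentence, i, r, c):
    """Field c of the token r rows away from token i; empty if out of range."""
    j = i + r
    if 0 <= j < len(sentence) and 0 <= c < len(sentence[j]):
        return [sentence[j][c]]
    return []


def row(sentence, i, r):
    """All fields of the token r rows away from token i; empty if out of range."""
    j = i + r
    return list(sentence[j]) if 0 <= j < len(sentence) else []


def suffix(n, features):
    """Applied to a sequence, takes the suffix of every element."""
    return [f[max(len(f) - n, 0):] for f in features]


def lower(features):
    """Lower-cases every element of the sequence."""
    return [f.lower() for f in features]


def mark_null(features):
    """Replaces a missing feature (an empty selection) with the NULL marker."""
    return features if features else [NULL]


def cart(fs, gs):
    """Conjoins every feature in fs with every feature in gs."""
    return [f + "|" + g for f, g in product(fs, gs)]


sentence = [
    ["der", "d", "ART", "I-NC"],
    ["Europäischen", "europäisch", "ADJA", "I-NC"],
    ["Union", "Union", "NN", "I-NC"],
]
print(suffix(3, row(sentence, 0, 0)))    # ['der', 'd', 'ART', '-NC']
print(mark_null(row(sentence, 0, -1)))   # ['<NULL>'], no token before the first
print(cart(lower(cell(sentence, 1, 0, 0)), cell(sentence, 1, 0, 2)))
```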

For more examples, see the files all.features and example.features in the data directory.