@hackage maxent-learner-hw 0.1.1

Hayes and Wilson's maxent learning algorithm for phonotactic grammars.

Maxent Phonotactic Learner

A tool for automatically inferring phonotactic grammars from a lexicon and using those grammars to generate random text, based on Hayes and Wilson's "A Maximum Entropy Model of Phonotactics and Phonotactic Learning". This package can be used both as a Haskell library and as a command line tool.

To compile this package, run stack build in the root of this repository. Run stack haddock to build the library documentation. The library may be useful if you wish to use a custom set of candidate constraints beyond the generators offered by the command line tool.

Command line usage

The command line tool (phono-learner-hw) has two commands: learn, which infers grammars, and gensalad, which generates random text using those grammars. The learn command takes the name of a lexicon file as an argument and outputs a grammar (note that learning is quite slow). By default the candidate constraints consist of single classes and bigrams; several more constraint types can be added with options. The gensalad command takes a grammar generated by learn and uses it to generate random text. Both commands also accept global options to write their final results to a file, to use a custom feature table for generating natural classes, and to control how text is divided into segments.
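To make the role of the constraints concrete, here is a minimal Python sketch of how a maxent grammar in the style of Hayes and Wilson scores a word: each constraint is a sequence of natural classes with a weight, the score of a word is the weighted sum of its violations, and its maxent value is exp(-score). This is an illustration only and not the package's Haskell implementation; the function names, the classes, and the weights are all hypothetical.

```python
import math

def count_violations(word, constraint):
    """Count the positions where consecutive segments of `word` all fall
    in the corresponding classes of `constraint` (a tuple of sets)."""
    n = len(constraint)
    return sum(
        all(seg in cls for seg, cls in zip(word[i:i + n], constraint))
        for i in range(len(word) - n + 1)
    )

def score(word, grammar):
    """Weighted sum of violations over all (constraint, weight) pairs."""
    return sum(weight * count_violations(word, constraint)
               for constraint, weight in grammar)

def maxent_value(word, grammar):
    """Unnormalized maxent value: higher scores mean less likely words."""
    return math.exp(-score(word, grammar))

# Hypothetical classes and weights, for illustration only.
NASAL = {"n"}
STOP = {"t"}
GRAMMAR = [
    ((NASAL,), 0.5),      # single-class constraint
    ((STOP, STOP), 2.0),  # bigram constraint
]

WORD = ["a", "t", "t", "n"]
```

Under these made-up weights, WORD violates each constraint once, so score(WORD, GRAMMAR) is 0.5 + 2.0 = 2.5.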

The command line works as follows:

phono-learner-hw COMMAND [-t|--featuretable CSVFILE] ([-c|--charsegs] | [-w|--wordsegs] | [--fierosegs]) [-n|--samples ARG] [-o|--output OUTFILE]
Option                      Description
-t, --featuretable CSVFILE  Use the features and segment list from a feature table in CSV format (a table for IPA is used by default).
-c, --charsegs              Use characters as segments (default).
-w, --wordsegs              Separate segments by spaces.
--fierosegs                 Parse segments by repeatedly taking the longest possible match and use ' to break up unintended digraphs (used for Fiero orthography).
-n, --samples N             Number of samples to use for salad generation.
-o, --output OUTFILE        Record final output to OUTFILE as well as stdout.
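
The longest-match segmentation described for --fierosegs can be sketched as follows. This is a minimal Python illustration, not the package's Haskell code. The function name and the segment inventory are hypothetical, and raising an error on unmatched input is an assumption, since the behaviour in that case is not documented here.

```python
def segment_longest_match(word, inventory, breaker="'"):
    """Split `word` into segments by repeatedly taking the longest prefix
    found in `inventory`. `breaker` only separates segments and is never
    emitted itself (assumes no segment in `inventory` contains it)."""
    max_len = max(len(s) for s in inventory)
    segments = []
    i = 0
    while i < len(word):
        if word[i] == breaker:
            i += 1  # skip the breaker; it just blocks a digraph reading
            continue
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(word) - i), 0, -1):
            candidate = word[i:i + length]
            if candidate in inventory:
                segments.append(candidate)
                i += length
                break
        else:
            # Assumption: unmatched input is treated as an error.
            raise ValueError(f"no segment matches at position {i} in {word!r}")
    return segments

# Hypothetical inventory with two digraphs, for illustration only.
INVENTORY = {"a", "aa", "n", "g", "ng"}
```

With this inventory, "aang" is read as the segments aa, ng, while "aan'g" is read as aa, n, g because the ' prevents n and g from being taken together.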
phono-learner-hw learn LEXICON [--thresholds THRESHOLDS] [-f|--freqs] [-e|--edges] [-3|--trigrams COREFEATURES] [-l|--longdistance SKIPFEATURES] [GLOBALOPTIONS]
Option                           Description
--thresholds THRESHOLDS          Thresholds to use for candidate selection (default is [0.01, 0.1, 0.2, 0.3]).
-f, --freqs                      Lexicon file contains word frequencies.
-e, --edges                      Allow constraints involving word boundaries.
-3, --trigrams COREFEATURES      Allow trigram constraints where at least one class uses a single one of the following features (space separated in quotes).
-l, --longdistance SKIPFEATURES  Allow constraints with two classes separated by a run of characters, optionally restricted to characters that all have one of the following features.
phono-learner-hw gensalad GRAMMAR [GLOBALOPTIONS]

Example usage

The following two commands calculate a grammar from Hayes and Wilson's Shona test data, using their selection of trigram restrictions, and then generate random text from that grammar.

phono-learner-hw learn ShonaLearningData.txt -f -e -3 "syllabic consonantal sonorant" -t ShonaFeatures.csv -w -o ShonaGrammar.txt
phono-learner-hw gensalad ShonaGrammar.txt -t ShonaFeatures.csv -w -o ShonaSalad.txt

Feature Table Format

To use a feature table other than the default IPA one, you may define it in CSV format (RFC 4180). The segment names are defined by the first row (they may be any strings, as long as no two are the same) and the feature names are defined by the first column (they are not hard-coded). Data cells should contain +, -, or 0 for binary features and + or 0 for privative features (where we do not want a minus set that could form classes).

As a simple example, consider the following CSV file, defining three segments (a, n, and t), and two features (vowel and nasal).

     ,a,n,t
vowel,+,-,-
nasal,0,+,-
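
To illustrate how such a table gives rise to natural classes, here is a small Python sketch that picks out the segments matching a set of feature values. It is an illustration only and not the package's API; the name natural_class and the dictionary layout are hypothetical.

```python
# The example table above as a nested dictionary: feature -> segment -> value.
TABLE = {
    "vowel": {"a": "+", "n": "-", "t": "-"},
    "nasal": {"a": "0", "n": "+", "t": "-"},
}

def natural_class(table, spec):
    """Return the set of segments whose value for every feature in `spec`
    equals the requested value ('+' or '-')."""
    segments = next(iter(table.values())).keys()
    return {
        seg for seg in segments
        if all(table[feature][seg] == value for feature, value in spec.items())
    }
```

Here natural_class(TABLE, {"vowel": "-"}) gives the set containing n and t, and natural_class(TABLE, {"vowel": "-", "nasal": "-"}) narrows it to t alone. A segment marked 0 for a feature belongs to neither its plus class nor its minus class, which is why a privative feature written with only + and 0 never yields a minus class.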

If a row contains a different number of cells (separated by commas) than the header line, it is rejected as invalid and does not define a feature (and will not be displayed in the formatted feature table). If the CSV entered has duplicate segment names, no segments, or no valid features, the entire table is rejected (indicated by a red border around the text area; green is normal) and the last valid table is used and displayed.
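
The validation rules above can be sketched in Python as follows. This is a hedged illustration, not the package's Haskell parser. The function name is hypothetical, and two behaviours are assumptions not stated above: surrounding whitespace in cells is stripped, and a row containing any value other than +, -, or 0 is rejected.

```python
import csv
import io

def parse_feature_table(text):
    """Parse a CSV feature table into {feature: {segment: value}}.
    Returns None when the whole table is rejected."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return None
    header = rows[0]
    # The first header cell is ignored; the rest are segment names.
    segments = [cell.strip() for cell in header[1:]]  # stripping is an assumption
    if not segments or len(set(segments)) != len(segments):
        return None  # no segments, or duplicate segment names
    features = {}
    for row in rows[1:]:
        if len(row) != len(header):
            continue  # wrong number of cells: the row defines no feature
        values = [cell.strip() for cell in row[1:]]
        if any(value not in ("+", "-", "0") for value in values):
            continue  # assumption: rows with other cell values are rejected
        features[row[0].strip()] = dict(zip(segments, values))
    return features or None  # no valid features rejects the whole table

EXAMPLE = "     ,a,n,t\nvowel,+,-,-\nnasal,0,+,-\n"
```

Calling parse_feature_table(EXAMPLE) yields the vowel and nasal features for the segments a, n, and t. A caller that wants the "last valid table" behaviour described above could keep the previous result whenever the function returns None.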


Copyright © 2016-2017 George Steel and Peter Jurgec.

This project is supported by the University of Toronto Advancing Teaching and Learning in Arts and Science (ATLAS) grant to Peter Jurgec.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.