@hackage unicode-data0.6.0

Access Unicode Character Database (UCD)

<h1 id="readme">README</h1>
<p><code>unicode-data</code> provides Haskell APIs to efficiently access the Unicode character database. <a href="#performance">Performance</a> is the primary goal in the design of this package.</p>
<p>The Haskell data structures are generated programmatically from the <a href="https://www.unicode.org/ucd/">Unicode character database</a> (UCD) files. The latest Unicode version supported by this library is <a href="https://www.unicode.org/versions/Unicode15.1.0/"><code>15.1.0</code></a>.</p>
<p>See the <a href="https://hackage.haskell.org/package/unicode-data">Haddock documentation</a> for the API reference.</p>
<h2 id="performance">Performance</h2>
<p><code>unicode-data</code> is up to <em>5 times faster</em> than <code>base</code> ≤ 4.17 (see <a href="#partial-integration-of-unicode-data-into-base">partial integration into <code>base</code></a>).</p>
<p>The following benchmark compares the time taken, in milliseconds, to process all Unicode code points (except surrogates, private-use areas and unassigned code points), for <code>base-4.16</code> (GHC 9.2.6) and this package (v0.4).
Machine: 8 × AMD Ryzen 5 2500U on Linux.</p>
<pre><code>All
  Unicode.Char.Case.Compat
    isLower
      base:         OK (1.19s)
        17.1 ms ± 241 μs
      unicode-data: OK (0.52s)
        3.58 ms ± 125 μs, 0.21x
    isUpper
      base:         OK (0.63s)
        17.5 ms ± 359 μs
      unicode-data: OK (1.02s)
        3.58 ms ±  48 μs, 0.21x
    toLower
      base:         OK (0.59s)
        16.3 ms ± 524 μs
      unicode-data: OK (0.80s)
        5.63 ms ± 129 μs, 0.35x
    toTitle
      base:         OK (3.91s)
        14.9 ms ± 427 μs
      unicode-data: OK (2.84s)
        5.31 ms ±  37 μs, 0.36x
    toUpper
      base:         OK (2.12s)
        15.4 ms ± 234 μs
      unicode-data: OK (0.86s)
        5.80 ms ± 159 μs, 0.38x
  Unicode.Char.General
    generalCategory
      base:         OK (1.16s)
        16.6 ms ± 534 μs
      unicode-data: OK (0.62s)
        4.14 ms ± 103 μs, 0.25x
    isAlphaNum
      base:         OK (0.62s)
        17.1 ms ± 655 μs
      unicode-data: OK (0.97s)
        3.59 ms ±  51 μs, 0.21x
    isControl
      base:         OK (0.63s)
        17.6 ms ± 494 μs
      unicode-data: OK (0.57s)
        3.59 ms ±  90 μs, 0.20x
    isMark
      base:         OK (0.34s)
        17.6 ms ± 695 μs
      unicode-data: OK (1.00s)
        3.59 ms ±  67 μs, 0.20x
    isPrint
      base:         OK (1.22s)
        17.7 ms ± 492 μs
      unicode-data: OK (1.92s)
        3.56 ms ±  27 μs, 0.20x
    isPunctuation
      base:         OK (2.23s)
        16.6 ms ± 619 μs
      unicode-data: OK (1.05s)
        3.60 ms ±  52 μs, 0.22x
    isSeparator
      base:         OK (1.15s)
        16.6 ms ± 439 μs
      unicode-data: OK (0.49s)
        3.60 ms ±  85 μs, 0.22x
    isSymbol
      base:         OK (2.11s)
        16.1 ms ± 553 μs
      unicode-data: OK (1.05s)
        3.58 ms ±  62 μs, 0.22x
  Unicode.Char.General.Compat
    isAlpha
      base:         OK (0.58s)
        17.2 ms ± 502 μs
      unicode-data: OK (1.02s)
        3.58 ms ±  50 μs, 0.21x
    isLetter
      base:         OK (8.57s)
        16.4 ms ± 553 μs
      unicode-data: OK (1.05s)
        3.58 ms ±  79 μs, 0.22x
    isSpace
      base:         OK (1.09s)
        7.56 ms ± 159 μs
      unicode-data: OK (0.97s)
        3.58 ms ±  46 μs, 0.47x
  Unicode.Char.Numeric.Compat
    isNumber
      base:         OK (0.58s)
        15.7 ms ± 462 μs
      unicode-data: OK (0.58s)
        3.58 ms ± 107 μs, 0.23x
</code></pre>
<h3 id="partial-integration-of-unicode-data-into-base">Partial integration of <code>unicode-data</code> into <code>base</code></h3>
<p>Since <code>base</code> 4.18, <code>unicode-data</code> has been <em>partially</em> <a
href="https://gitlab.haskell.org/ghc/ghc/-/merge_requests/8072">integrated into GHC</a>, so there should be no relevant performance difference. However, using <code>unicode-data</code> lets you select the <em>exact</em> version of Unicode to support, rather than relying on the version supported by GHC.</p>
<h2 id="unicode-database-version-update">Unicode database version update</h2>
<p>To update the Unicode version, change the version number in <code>ucd.sh</code>.</p>
<p>To download the Unicode database, run <code>ucd.sh download</code> from the top-level directory of the repo. This fetches the database into <code>./ucd</code>.</p>
<pre><code>$ ./ucd.sh download
</code></pre>
<p>To generate the Haskell data structure files from the downloaded database files, run <code>ucd.sh generate</code> from the top-level directory of the repo.</p>
<pre><code>$ ./ucd.sh generate
</code></pre>
<h2 id="running-property-doctests">Running property doctests</h2>
<p>Temporarily add <code>QuickCheck</code> to the <code>build-depends</code> of the library, then run:</p>
<pre><code>$ cabal build
$ cabal-docspec --check-properties --property-variables c
</code></pre>
<h2 id="licensing">Licensing</h2>
<p><code>unicode-data</code> is an <a href="https://github.com/composewell/unicode-data">open source</a> project available under a liberal <a href="LICENSE">Apache-2.0 license</a>.</p>
<h2 id="contributing">Contributing</h2>
<p>As an open project, we welcome contributions.</p>
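<h2 id="usage-sketch">Usage sketch</h2>
<p>As a rough illustration of the API, the sketch below calls three of the functions named in the benchmark: <code>generalCategory</code> and <code>isAlphaNum</code> from <code>Unicode.Char.General</code>, and <code>toUpper</code> from <code>Unicode.Char.Case.Compat</code>. It assumes that <code>unicode-data</code> is listed in your <code>build-depends</code> and that these functions have the same types as their <code>Data.Char</code> counterparts. The qualified aliases <code>Case</code> and <code>General</code> are chosen here for readability and are not part of the package. Check the Haddock documentation for the exact exports of your version.</p>
<pre><code>-- A minimal sketch, not an official example from the package.
-- Assumes unicode-data is available as a dependency.
import qualified Unicode.Char.Case.Compat as Case
import qualified Unicode.Char.General as General

main :: IO ()
main = do
  -- Look up the general category of a code point.
  print (General.generalCategory 'A')
  -- Predicate from the same module (assumed to return Bool).
  print (General.isAlphaNum '7')
  -- Simple case mapping, applied one character at a time.
  putStrLn (map Case.toUpper "unicode")
</code></pre>
<p>Importing the modules qualified keeps these names from clashing with the ones exported by <code>Data.Char</code>.</p>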