@hackage unicode-data0.6.0
Access Unicode Character Database (UCD)
Installation
Tested Compilers
Dependencies (2)
Dependents (6)
@hackage/streamly, @hackage/commonmark, @hackage/unicode-data-security, @hackage/unicode-transforms, @hackage/unicode-data-names, @hackage/unicode-data-scripts
Package Flags
dev-has-icu
(off by default)
Use ICU for test and benchmark. Intended for development on the repository.
README
unicode-data
provides Haskell APIs to efficiently access the Unicode
character database. Performance is the primary goal in the
design of this package.
The Haskell data structures are generated programmatically from the
Unicode character database (UCD) files.
The latest Unicode version supported by this library is
15.1.0
.
Please see the Haddock documentation for reference documentation.
Performance
unicode-data
is up to 5 times faster than base
≤ 4.17 (see
partial integration to base
).
The following benchmark compares the time taken in milliseconds to process all
the Unicode code points (except surrogates, private use areas and unassigned),
for base-4.16
(GHC 9.2.6) and this package (v0.4).
Machine: 8 × AMD Ryzen 5 2500U on Linux.
All
Unicode.Char.Case.Compat
isLower
base: OK (1.19s)
17.1 ms ± 241 μs
unicode-data: OK (0.52s)
3.58 ms ± 125 μs, 0.21x
isUpper
base: OK (0.63s)
17.5 ms ± 359 μs
unicode-data: OK (1.02s)
3.58 ms ± 48 μs, 0.21x
toLower
base: OK (0.59s)
16.3 ms ± 524 μs
unicode-data: OK (0.80s)
5.63 ms ± 129 μs, 0.35x
toTitle
base: OK (3.91s)
14.9 ms ± 427 μs
unicode-data: OK (2.84s)
5.31 ms ± 37 μs, 0.36x
toUpper
base: OK (2.12s)
15.4 ms ± 234 μs
unicode-data: OK (0.86s)
5.80 ms ± 159 μs, 0.38x
Unicode.Char.General
generalCategory
base: OK (1.16s)
16.6 ms ± 534 μs
unicode-data: OK (0.62s)
4.14 ms ± 103 μs, 0.25x
isAlphaNum
base: OK (0.62s)
17.1 ms ± 655 μs
unicode-data: OK (0.97s)
3.59 ms ± 51 μs, 0.21x
isControl
base: OK (0.63s)
17.6 ms ± 494 μs
unicode-data: OK (0.57s)
3.59 ms ± 90 μs, 0.20x
isMark
base: OK (0.34s)
17.6 ms ± 695 μs
unicode-data: OK (1.00s)
3.59 ms ± 67 μs, 0.20x
isPrint
base: OK (1.22s)
17.7 ms ± 492 μs
unicode-data: OK (1.92s)
3.56 ms ± 27 μs, 0.20x
isPunctuation
base: OK (2.23s)
16.6 ms ± 619 μs
unicode-data: OK (1.05s)
3.60 ms ± 52 μs, 0.22x
isSeparator
base: OK (1.15s)
16.6 ms ± 439 μs
unicode-data: OK (0.49s)
3.60 ms ± 85 μs, 0.22x
isSymbol
base: OK (2.11s)
16.1 ms ± 553 μs
unicode-data: OK (1.05s)
3.58 ms ± 62 μs, 0.22x
Unicode.Char.General.Compat
isAlpha
base: OK (0.58s)
17.2 ms ± 502 μs
unicode-data: OK (1.02s)
3.58 ms ± 50 μs, 0.21x
isLetter
base: OK (8.57s)
16.4 ms ± 553 μs
unicode-data: OK (1.05s)
3.58 ms ± 79 μs, 0.22x
isSpace
base: OK (1.09s)
7.56 ms ± 159 μs
unicode-data: OK (0.97s)
3.58 ms ± 46 μs, 0.47x
Unicode.Char.Numeric.Compat
isNumber
base: OK (0.58s)
15.7 ms ± 462 μs
unicode-data: OK (0.58s)
3.58 ms ± 107 μs, 0.23x
Partial integration of unicode-data
into base
Since base
4.18, unicode-data
has been
partially integrated to GHC,
so there should be no relevant difference. However, using unicode-data
allows
to select the exact version of Unicode to support, therefore not relying on
the version supported by GHC.
Unicode database version update
To update the Unicode version please update the version number in
ucd.sh
.
To download the Unicode database, run ucd.sh download
from the top
level directory of the repo to fetch the database in ./ucd
.
$ ./ucd.sh download
To generate the Haskell data structure files from the downloaded database
files, run ucd.sh generate
from the top level directory of the repo.
$ ./ucd.sh generate
Running property doctests
Temporarily add QuickCheck
to build depends of library.
$ cabal build
$ cabal-docspec --check-properties --property-variables c
Licensing
unicode-data
is an open source
project available under a liberal Apache-2.0 license.
Contributing
As an open project we welcome contributions.