@hackage unicode-transforms 0.4.0.1

Unicode normalization

Categories
- Text Processing
License
BSD-3-Clause
Maintainer
harendra.kumar@gmail.com
Versions
- 0.4.0.1 Wed, 5 Feb 2025
- 0.4.0 Sun, 9 Jan 2022
- 0.3.8 Fri, 23 Jul 2021
- 0.3.7.1 Sat, 24 Jul 2021
- 0.3.7 Sat, 24 Jul 2021
- 0.3.6 Fri, 14 Jun 2019

Installation
In your cabal file, add unicode-transforms to the build-depends field.
Tested Compilers
- 9.8.1
- 9.6.3
- 9.4.7
- 9.2.8
- 9.0.2
- 8.10.7
- 8.8.4
- 8.6.5
- 8.4.4
- 8.2.2
- 8.0.2
Dependencies (5)
- base >=4.8 && <4.23
- bytestring >=0.9 && <0.13
- ghc-prim >=0.2 && <0.14
- text >=1.1.1 && <=1.2.5.0 || >=2.0 && <2.2
- unicode-data >=0.2 && <0.9
Dependents (11)
@hackage/slugify, @hackage/commonmark, @hackage/unicode-collation, @hackage/stack, @hackage/url-slug, @hackage/lex-applicative, and 5 others

Package Flags

- dev (off by default): Developer build
- bench-show (off by default): Use bench-show to compare benchmarks
- has-icu (off by default): Use text-icu for benchmark and test comparisons
- has-llvm (off by default): Use llvm backend (faster) for compilation
- use-gauge (off by default): Use gauge instead of tasty-bench for benchmarking

Unicode Transforms

Fast Unicode 14.0.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).

What is normalization?

Unicode characters with adornments (e.g. Á) can be represented in two different forms: as a single composed character (U+00C1 = Á) or as a sequence of decomposed characters (U+0041 (A) followed by the combining acute accent U+0301 ( ́ ) = Á). The two forms are encoded as different byte sequences, but they look exactly the same to a human reader.

A plain byte comparison may report that two strings are different even though they are equivalent. To compare strings for equivalence, we first need to convert both to the same normalized form using the Unicode Character Database. For example:

>> :set -XOverloadedStrings
>> import Data.Text.Normalize
>> normalize NFC "\193" == normalize NFC "\65\769"
True
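
The same check can be written as a small standalone program. This is a minimal sketch that uses only normalize and the NFC and NFD modes from Data.Text.Normalize; the helper name canonicallyEqual is hypothetical and is not part of the package.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (NFC, NFD), normalize)

-- | Hypothetical helper (not part of unicode-transforms): compare two
-- texts after converting both to NFC.
canonicallyEqual :: T.Text -> T.Text -> Bool
canonicallyEqual a b = normalize NFC a == normalize NFC b

main :: IO ()
main = do
  let composed   = "\193"    :: T.Text  -- U+00C1
      decomposed = "\65\769" :: T.Text  -- U+0041 U+0301
  -- A plain comparison sees two different code point sequences.
  print (composed == decomposed)                -- False
  -- After normalization the two forms compare equal.
  print (canonicallyEqual composed decomposed)  -- True
  -- NFD splits the composed character; NFC joins the decomposed pair.
  print (T.length (normalize NFD composed))     -- 2
  print (T.length (normalize NFC decomposed))   -- 1
```

Normalizing both sides to the same form matters more than which form is chosen: comparing an NFC string against an NFD string would still report a difference.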

Performance

The table below compares the normalization performance of this package (v0.3.7) with that of the text-icu package, which uses the ICU C++ library (ICU4C 65.1), on macOS. The benchmarks measure the time taken, in milliseconds, to normalize files in different languages and normalization forms with each package. A positive % Diff means unicode-transforms is faster; a negative value means ICU is faster. unicode-transforms outperforms ICU in 19 of the 28 benchmarks.

Benchmark       unicode-transforms(ms) ICU(ms)    % Diff
--------------- ---------------------- -------   --------
NFKD/Korean                       7.78   37.10    +376.87
NFD/Korean                        7.86   37.06    +371.50
NFKD/Vietnamese                   6.85   12.48     +82.20
NFKD/Deutsch                      2.17    3.55     +63.30
NFKD/English                      1.71    2.78     +62.30
NFKC/Korean                       4.77    7.65     +60.28
NFD/Deutsch                       2.24    3.53     +57.41
NFD/English                       1.76    2.77     +57.32
NFC/Vietnamese                   10.66   16.63     +56.00
NFKC/Vietnamese                  10.95   16.58     +51.43
NFD/Devanagari                    6.48    8.68     +34.10
NFC/Devanagari                    6.77    8.49     +25.48
NFD/AllChars                      6.18    7.41     +19.91
NFD/Japanese                      7.80    9.20     +17.99
NFKC/Devanagari                   7.33    8.48     +15.74
NFKD/Japanese                     8.71   10.05     +15.39
NFD/Vietnamese                    5.94    6.83     +14.99
NFKD/Devanagari                   7.59    8.68     +14.27
NFKD/AllChars                     9.80   10.66      +8.82
NFKC/Deutsch                      3.21    3.18      -0.72
NFC/Korean                        4.62    4.38      -5.35
NFKC/English                      2.21    2.06      -6.88
NFC/English                       2.19    2.04      -7.21
NFKC/AllChars                    14.67    9.75     -50.51
NFC/Deutsch                       3.02    1.95     -54.39
NFKC/Japanese                    12.46    5.42    -129.93
NFC/AllChars                      9.72    3.58    -171.63
NFC/Japanese                     11.90    3.04    -292.04
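
The source does not state how the % Diff column is computed. The rows appear consistent with a signed difference taken relative to the faster of the two timings, and the sketch below encodes that assumption. The function name percentDiff is hypothetical, and because the timings in the table are rounded, the results only approximate the published values.

```haskell
-- | Signed percentage difference relative to the faster timing.
-- Positive when the first argument (unicode-transforms) is faster.
-- ASSUMPTION: this formula is inferred from the table rows; it is not
-- taken from the package's benchmark code.
percentDiff :: Double -> Double -> Double
percentDiff ut icu
  | ut <= icu = (icu - ut) / ut * 100
  | otherwise = negate ((ut - icu) / icu * 100)

main :: IO ()
main = do
  -- NFKD/Korean: about 376.9 (the table shows +376.87)
  print (percentDiff 7.78 37.10)
  -- NFC/Japanese: about -291.4 (the table shows -292.04)
  print (percentDiff 11.90 3.04)
```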

Talks

Functional Conf 2018 Video | Functional Conf 2018 Slides

Contributing

Please use https://github.com/harendra-kumar/unicode-transforms to raise issues or send pull requests.
