ucm – An ICU UCM compiler for Erlang

UCM stands for Unicode Character Mapping, an internal format of the ICU Project (International Components for Unicode) for defining the relationship between Unicode and some other character encoding. This library compiles such mappings into completely self-contained Erlang modules that perform the corresponding conversions.
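
As a rough illustration of what "self-contained" means here, the following hand-written sketch shows the general shape such a module could take: plain function clauses, with no NIFs and no runtime dependencies. It is not the output of this library. The module name ucm_sketch, the function names to_unicode/1 and from_unicode/1, and the choice of substitution characters are all assumptions made for the example. The two non-ASCII entries are taken from code page 437.

```erlang
%% Hypothetical, hand-written illustration only; NOT generated by this
%% library, and not its actual API. All names here are assumptions.
-module(ucm_sketch).
-export([to_unicode/1, from_unicode/1]).

%% Decode a binary in the legacy encoding into a list of code points.
-spec to_unicode(binary()) -> [char()].
to_unicode(Bin) ->
    [decode(Byte) || <<Byte>> <= Bin].

%% Encode a list of code points into a binary in the legacy encoding.
-spec from_unicode([char()]) -> binary().
from_unicode(CodePoints) ->
    << <<(encode(CP))>> || CP <- CodePoints >>.

%% Each clause corresponds to one CHARMAP line of a .ucm file,
%% e.g. "<U00C7> \x80 |0".
decode(16#80) -> 16#00C7;             % C with cedilla
decode(16#81) -> 16#00FC;             % u with diaeresis
decode(B) when B < 16#80 -> B;        % ASCII range maps to itself
decode(_) -> 16#FFFD.                 % assumed fallback: U+FFFD

encode(16#00C7) -> 16#80;
encode(16#00FC) -> 16#81;
encode(CP) when CP >= 0, CP < 16#80 -> CP;
encode(_) -> $?.                      % assumed substitution character
```

Because every mapping is an ordinary function clause, a module of this kind can be loaded and called like any other Erlang code, for example ucm_sketch:to_unicode(<<16#80, $A>>).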

USAGE

For one-off converters, this module can simply be used on its own from erl. The interface is akin to that of yecc; run rebar3 edoc to see the specifics.

To treat a UCM as an Erlang source file,

MOTIVATION

We found the existing solutions for bringing arbitrary character encodings to Erlang unsatisfactory:

  • icu4e and i18n are both NIFs for the full icu4c. iconv is a NIF for libiconv.
    • The ICU Project concerns itself with many aspects of internationalisation, several of which are indispensable for certain applications. However, carrying all of this along just to deal with some legacy codepages is a cognitive burden too heavy to justify.
    • NIFs are undesirable unless strictly necessary. ABI and dependency headaches hurt quality of life during development and deployment, not to mention the library's chances of working unmodified in the future. At runtime, NIFs throw scheduler guarantees out the window and pose a risk to overall VM stability.
  • erl-creole is a pure Erlang library for converting between Unicode and a limited number of common Japanese encodings. It achieves this by generating Erlang code from .ucm files, and could theoretically be extended by adding more mappings; however:
    • The code to turn mappings into Erlang is written in Common Lisp.
    • The mappings are manually stripped-down versions, missing information critical to correctness and efficiency.

So, this module attempts to hit a sweet spot:

  • Pure Erlang all the way down – Erlang that generates Erlang.
  • Concerned only with character sets, like iconv.
  • Flexible.
  • Compatible with unmodified mappings from ICU.

CREDITS

For expediency and consistency, much of the scaffolding (the user-facing functions, handling of files, and so on) is lifted piecemeal from yecc, only trimmed of cruft and dressed with specs and EDoc comments.

CAVEATS

For everyone

Error reporting is not comprehensive, nor will it ever be. This library is primarily intended for use with existing mappings from the ICU Project, and the only errors reported are those that prevent compilation outright. Most are caught early as failures to parse to a rather rigid structure; only a few obvious sanity checks are in place beyond that. While significant validation is beyond the scope of this library, those seeking to write new mappings can find a wealth of tools in the ICU Project itself.

For those familiar with ICU

  • Delta mappings do not produce modules that depend on other modules at runtime, which would be the analogue of how converter objects work in icu4c. Instead, every module is self-contained. This may be wasteful on some level, but it's worth the headaches it could prevent.