Notes on Unicode

What is Unicode?

Why Unicode? The Unicode Standard and its associated specifications provide programmers with a single universal character encoding, extensive descriptions, and a vast amount of data about how characters function. The specifications and data describe how to form words and break lines; how to sort text in different languages; how to format numbers, dates, times, and other elements appropriate to different languages; how to display languages whose written form flows from right to left, such as Arabic and Hebrew, or whose written form splits, combines, and reorders, such as languages of South Asia. These specifications include descriptions of how to deal with security concerns regarding the many “look-alike” characters from alphabets around the world. Without the properties and algorithms in the Unicode Standard and its associated specifications, interoperability between different implementations would be impossible, and much of the vast breadth of the world’s languages would lie outside the reach of modern software. — The Unicode Standard

The Unicode Standard is composed of a core specification and a series of Annexes. The Standard is supplemented with Technical Standards and Technical Reports. Here’s the list.

The Unicode Common Locale Data Repository (CLDR) is the largest and most extensive standard repository of locale data available. It’s the recommended data source for JavaScript’s Intl objects, defined in ECMA-402.

Technical Report #17: Unicode Character Encoding Model defines useful terminology, but I’ve only skimmed it so I’ll possibly be misusing some of the terms below.

Unicode organization

Unicode is organized in 17 planes of 65,536 (2¹⁶) code points each. The code points in each plane have values in the U+xx0000 to U+xxFFFF range, with xx corresponding to the plane number (00 through 10 in hex). There are a total of 1,114,112 code points, and the largest one is U+10FFFF.

The first plane is called the Basic Multilingual Plane (BMP), spanning the code points U+0000 through U+FFFF. The planes other than BMP are collectively known as astral planes.
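Because every plane is exactly 2¹⁶ code points wide, the plane of a code point is just its value with the low 16 bits shifted away. A minimal sketch (the `planeOf` helper is my own, not part of any standard API):

```javascript
// Plane = code point divided by 0x10000 (65,536).
// Plane 0 is the BMP; planes 1-16 are the astral planes.
function planeOf(codePoint) {
  return codePoint >>> 16;
}

planeOf(0x0041);   // U+0041 'A' → 0 (BMP)
planeOf(0x1f600);  // U+1F600 😀 → 1 (an astral plane)
planeOf(0x10ffff); // highest code point → 16
```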

Some code point ranges in the BMP are reserved, most notably the surrogate range U+D800–U+DFFF used by the UTF-16 encoding.

Private-use characters and non-characters

The Unicode Private-Use Characters, Noncharacters & Sentinels FAQ page explains the differences, offers guidance on their use, and explains what to do when you encounter them in your inputs.

Encodings

UTF-8

UTF-8 encodes Unicode code points as sequences of one to four bytes. Its design gives the encoding some useful properties, which Nic Barker explains in this 38-minute YouTube video.

The 128 code points that comprise the ASCII encoding require 7 bits, and are thus always represented with a high 0 bit in a single byte, which allows for compatibility between ASCII and UTF-8.

Non-ASCII code points are encoded as multi-byte sequences, with a high 1 bit in every byte. The leading byte starts with 11, and any continuation bytes start with 10. The length of the sequence is encoded as the number of leading 1s in the leading byte.

Code points from the BMP require at most three bytes; code points from astral planes require four bytes.
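These byte counts can be observed with the standard TextEncoder API, which encodes JavaScript strings to UTF-8:

```javascript
const encoder = new TextEncoder();

// One byte for ASCII…
encoder.encode('A').length;  // → 1 (U+0041)
// …two for code points up to U+07FF…
encoder.encode('é').length;  // → 2 (U+00E9)
// …three for the rest of the BMP…
encoder.encode('€').length;  // → 3 (U+20AC)
// …and four for astral code points.
encoder.encode('😀').length; // → 4 (U+1F600)
```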

UTF-16

UTF-16 uses one or two 16-bit code units per code point. A single code unit encodes a code point in the BMP.

When a code point requires two code units, they form a surrogate pair, drawn from a range reserved in the BMP: high surrogates U+D800–U+DBFF, followed by low surrogates U+DC00–U+DFFF.

Surrogate pairs are used to represent a total of 1,048,576 code points outside the BMP. Along with the 63,488 code points from BMP addressable with one code unit, UTF-16 is capable of representing 1,112,064 Unicode code points. The limits of UTF-16 addressability define the size of the Unicode repertoire.
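The mapping between an astral code point and its surrogate pair is simple arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A sketch (`toSurrogatePair` is a hypothetical helper, not a built-in):

```javascript
// Encode an astral code point (U+10000-U+10FFFF) as a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // 20 bits remain
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

toSurrogatePair(0x1f600); // → [0xd83d, 0xde00], the two code units of "😀"
```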

Byte-order mark

Code units in UTF-16 are represented by 16-bit data types called words. Each word is a pair of bytes. The order in which a computer places these two bytes within a word depends on the computer architecture’s endianness, either big-endian (BE) or little-endian (LE). Since communication and storage protocols often operate on bytes rather than words, it may be useful to specify the endianness.

You can make the endianness explicit by declaring one of the two UTF-16BE or UTF-16LE encoding types.

An alternative is to prefix the data with the byte-order mark (BOM), having the value U+FEFF ZERO WIDTH NO-BREAK SPACE. An application reading this data can look for either U+FEFF or its reversal U+FFFE to figure out if its own endianness matches. The U+FFFE code point is defined to be a non-character, so it’s unlikely to otherwise appear in a text stream.

(Since UTF-8 works with sequences of bytes, it doesn’t need a BOM.)
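A sketch of BOM-based detection, assuming the input is a raw byte array (the `detectEndianness` helper is hypothetical):

```javascript
// Inspect the first two bytes of a UTF-16 stream for a byte-order mark.
// U+FEFF serialized big-endian is FE FF; little-endian is FF FE.
function detectEndianness(bytes) {
  if (bytes[0] === 0xfe && bytes[1] === 0xff) return 'BE';
  if (bytes[0] === 0xff && bytes[1] === 0xfe) return 'LE';
  return null; // no BOM; fall back to a default or heuristics
}

detectEndianness([0xff, 0xfe, 0x41, 0x00]); // → 'LE' ("A" in UTF-16LE)
```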

UTF-32

UTF-32 uses four bytes for every code point.

JavaScript strings and Unicode

JavaScript’s handling of Unicode text is surprising, but it reflects the language’s internal representation of strings.

JavaScript strings are implemented as sequences of 16-bit integers, with each element in the string corresponding to a UTF-16 code unit.

The language does not impose restrictions on these elements, so strings can contain invalid surrogate sequences — either a surrogate missing its pair, or a surrogate pair in the wrong order.

The .isWellFormed() String method checks for lone surrogates, and .toWellFormed() replaces them with the U+FFFD � replacement character. The latter can be used to preprocess strings for web APIs that throw on ill-formed strings, such as encodeURI().

Inside string literals, UTF-16 code units can be escaped with the \uXXXX syntax, where each X represents a hexadecimal digit.

Things involving string elements are ‘Unicode-unaware’. They will reflect the underlying UTF-16 code units, with individual surrogates each counting as an element:
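For instance, with an astral character such as U+1F600:

```javascript
const s = '😀'; // U+1F600, stored as the surrogate pair \uD83D \uDE00

s.length;        // → 2 (code units, not characters)
s.charAt(0);     // → '\uD83D' (a lone high surrogate)
s.charCodeAt(0); // → 0xD83D
s.split('');     // → ['\uD83D', '\uDE00']
s.slice(0, 1);   // → '\uD83D' (slicing can split a surrogate pair)
```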

Unicode-aware string things

Mercifully, iterating on a string is defined to operate on Unicode code points instead of UTF-16 code units.

This lets us use Array.from(string) or [...string] instead of string.split('') to obtain a correct sequence of Unicode code points, as long as the string is well-formed. (Lone surrogates still appear as themselves in the resulting array.)

The .codePointAt(index) and String.fromCodePoint() methods are the Unicode-aware counterparts to .charCodeAt(index) and String.fromCharCode().

Note: Because codePointAt() still works with string element indices, it risks returning lone UTF-16 surrogate code units. To cover your bases, read codePointAt(0) from each element of the [...string.toWellFormed()] array.

As a counterpart to \uXXXX escapes, Unicode code points can be written into string literals with the \u{…} syntax that takes one to six hexadecimal digits.

Regular expressions also have a u flag which puts them in unicode-aware mode that enables \u{…} escapes and changes how Unicode characters in the regex are interpreted. There’s also a newer v flag.
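One visible difference: in unicode-aware mode, `.` matches a whole code point rather than a single code unit:

```javascript
// Without the u flag, `.` matches one UTF-16 code unit…
'😀'.match(/^.$/);  // → null (the string is two code units long)
// …with it, `.` matches one code point.
'😀'.match(/^.$/u); // → a match on '😀'

// The flag also enables \u{…} escapes inside the pattern:
/\u{1F600}/u.test('😀'); // → true
```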

Unicode grapheme clusters

What we recognize as ‘a character’ may be represented by a sequence of one or more Unicode code points. To segment a text into user-perceived characters, Unicode introduces the concept of grapheme clusters.

To split a JavaScript string into grapheme clusters, we must use the Intl.Segmenter object:

function splitIntoClusters(str) {
	const segmenter = new Intl.Segmenter('en', {
		granularity: 'grapheme'
	});
	return Array.from(
		segmenter.segment(str)
	).map(it => it.segment);
}

To illustrate the granularity of the various split methods, here’s how they handle the flag of Romania:

/* Split into UTF-16 code units */
"🇷🇴".split('')
// → ["\ud83c", "\uddf7", "\ud83c", "\uddf4"]

/* Split into Unicode code points */
Array.from("🇷🇴")
// → ["🇷", "🇴"]

/* Split into Unicode grapheme clusters */
splitIntoClusters("🇷🇴")
// → ["🇷🇴"]

Unicode in databases

MySQL and MariaDB have a character set called utf8 which is actually an alias for utf8mb3, a deprecated encoding that only supports the BMP. Actual UTF-8 is utf8mb4.

As for the collation, pick the one that implements the more up-to-date version of the Unicode Collation Algorithm (UCA). For MariaDB at the time of writing, that’s utf8mb4_uca1400_ai_ci.

Problematic Unicode characters

Tim Bray and Paul Hoffman authored RFC 9839: Unicode Character Repertoire Subsets, a document that discusses subsets of the Unicode character repertoire for use in protocols and data formats and specifies three subsets recommended for use in IETF specifications. It aims to guide spec authors away from some of the ‘bad Unicode’ code points.

Bits and bobs

Miscellaneous links from the archives:

Emoji presentation sequences

A note from the drafts folder:

I never looked into why some characters render as symbols ⚠︎ in some browsers but as emojis ⚠️ in others. It turns out 371 Unicode characters have two variants: a monochrome, simplified text presentation and a full-color emoji presentation. These variants can be selected by appending U+FE0E VARIATION SELECTOR-15 for text or U+FE0F VARIATION SELECTOR-16 for emoji to the symbol. (Kudos to Nathan for this great tip!) Among Unicode reference websites, crowd favorite codepoints.net seems to be the only one that shows this piece of information about characters.
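The two presentation sequences for U+26A0 WARNING SIGN, as a quick sketch:

```javascript
const warning = '\u26A0'; // ⚠ WARNING SIGN, default presentation

const textStyle = warning + '\uFE0E';  // ⚠︎ VS15 requests text presentation
const emojiStyle = warning + '\uFE0F'; // ⚠️ VS16 requests emoji presentation

// Each sequence is two code points, but renders as one character:
[...emojiStyle].length;         // → 2
emojiStyle.codePointAt(1);      // → 0xFE0F
```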