Whether it’s log statements, text labels, the words in a book or a post to social media, pretty much all the Swift code we write deals with text in some form or another, so having a clear and detailed understanding of how characters and strings work in Swift is critical.
Over the next two articles, my aim is to help you gain this understanding so that, by the end of them, you have a much clearer picture of how strings and characters work in Swift. Before we start though, there is some background information we need to look at first.
We’re going to start off by looking at Unicode, the international text encoding standard that underpins Swift’s Character and String types. We’ll look at how the standard came into being, its basic mechanics and how it is used in practice. Once armed with that knowledge, we’ll turn our attention back to Swift in the next article.
There, we’ll take a close look at the Character and String types in Swift. We’ll see how each of these data types is underpinned by the Unicode standard and how we can use them to perform a range of common string and character operations such as combining strings, modifying strings and testing strings for equality.
For now though, let’s focus our attention on Unicode. We’ll start with a bit of history.
Note: It may be best to read the remainder of this article within Safari as Safari seems to have better support for rendering of Unicode characters than Chrome.
History
Encoding
At their core, computers are binary beasts. They understand ones and zeros and not much more than that. Granted, there is some flexibility here (in that those ones and zeros can be arranged into bytes, or groups of bytes, to represent larger and larger numbers) but fundamentally computers are limited to storing and interpreting numerical information only.
For early computer scientists this posed a bit of a problem. They wanted to store and represent other types of information, information that wasn’t just numerical: information like the characters and words that make up this article. Their solution was encodings.
These early scientists worked out that by assigning a numerical value to each of the characters they wanted to represent, they could encode the characters as a series of numbers. When a computer later read those numbers back from memory, it could interpret them and translate them back into characters for display on screen or for printing.
ASCII
One of the earliest forms of encoding was the American Standard Code for Information Interchange, commonly known as ASCII.
Developed in 1963 (and still in use today), the ASCII encoding standard is a 7-bit encoding that uses the lower 7 bits of an 8-bit byte to map the letters of the English alphabet, the numerical digits 0 – 9 and some additional punctuation and control characters, such as tabs, spaces and carriage returns, to the integer values 0 – 127.
The ASCII encoding standard was great for English text that used Western characters such as the letters A – Z, but it did have its drawbacks, the biggest one being the lack of support for non-English languages such as Chinese, Russian and Arabic.
The problem was that the 128 values afforded by ASCII’s 7 bits simply didn’t provide enough room to map all the characters needed for these additional languages. There was, however, a potential solution.
As mentioned, the ASCII standard only used the lowest 7 of the possible 8 bits in a byte to perform its encoding. This left the 8th bit (and a potential 128 additional values in the range 128 – 255) up for grabs to encode additional characters.
Code Pages
Computer manufacturers jumped on this idea, developing different code pages: additional character mappings that used the undefined values in the range 128 – 255 to encode characters beyond the ASCII character set.
Space Issues
Unfortunately, there were still problems. The additional 128 values simply didn’t provide enough space to encode the characters from all the writing systems around the world, and in certain languages (such as Chinese, Japanese and Korean) the sheer number of characters alone would simply not fit within the additional 128 values that were available.
Incompatibilities
To work around this fact, manufacturers developed a number of different code pages, with each code page mapping different sets of characters into these 128 values. The side effect was that different encodings used the same number to represent different characters (and in some cases different numbers were used to represent the same character).
On top of this there was the additional complication that early computer systems could only have one code page active at a time. This wasn’t a problem on individual systems where documents were created, saved and opened on a single machine, but people were increasingly starting to exchange documents – producing them on one platform (using one encoding) and then opening them on a different platform (using a different encoding).
In the cases where the encodings didn’t match, characters that were mapped to the values between 128 – 255 would often be corrupted.
In the years that followed there were some improvements. Computer systems developed the ability to support multiple code pages on a document-by-document basis and, as long as individual documents reported their encodings correctly, document corruption started to become a thing of the past. Despite this, there were still limits.
If all the characters required for the document were supported within a single encoding scheme then everything was fine, but writing multi-language documents, documents that required characters beyond those that could be fitted within a single encoding, remained a pipe-dream.
To try to solve this problem, computer scientists went back to the drawing board, looking at how they could encode a larger number of characters within a single encoding scheme. Their answer was obvious: use a larger number of bits to encode each character.
Adding More Bits
The decision to use more bits of storage for individual characters went a long way to solving the problems we’ve talked about. 16-bit (or two-byte) encodings were a common choice and provided around 65,536 values that could be used to encode additional characters and symbols. It wasn’t all plain sailing though. These 16-bit encodings still had problems, the primary one being byte order.
Byte Order
In computer systems there are two ways in which multi-byte values can be stored. Some systems, known as big-endian systems, store the bytes that contain the most significant bits of a number first, with the subsequent bytes that make up the number being stored at higher memory addresses. Other systems store the bytes containing the least significant bits first and store the remaining (more significant) bytes at higher memory addresses. These are known as little-endian systems.
Given these two different approaches to storing multi-byte values in memory, the order in which those bytes are read back is extremely important and it is this problem that plagued 16-bit character encodings.
On top of this, some encodings were designed that didn’t use a fixed number of bytes to store their values. These encodings were called variable-width encodings: characters encoded with small values were stored using just a single byte of storage, whilst other characters, those mapped to larger values, were stored using two bytes.
Interpreters, then, not only had to determine the number of bytes of storage a particular encoding was using but also the order in which those bytes were stored if they were to read back the correct values from memory.
Both of these things added complexity to the character encoding and decoding mechanisms and in 1987, major tech companies such as Apple and NeXT started working on a solution to address both the space issues and the issues with byte order. Their goal was to design a universal character encoding system that could cover all of the world’s writing systems whilst avoiding as many of the character encoding issues as they could. The result, first released in October 1991, was version 1.0.0 of the Unicode Standard.
Unicode
The Unicode Standard is an international standard for encoding, representing and processing text in different writing systems. At its most basic, the Unicode Standard defines a unique number for every character or symbol used in writing and covers nearly all the world’s writing systems, both past and present. This is no small feat.
The current Unicode Standard (version 8.0) supports 129 scripts and 14 different symbol collections (symbols such as mathematical symbols, emoji and mahjong tiles). The standard also lists 5 additional scripts that are “in current use in living communities” and 18 “archaic or ‘dead’ scripts” that aren’t yet supported.
The Unicode Codespace
In its original form, the Unicode Standard was defined as a 16-bit encoding system. Like the other 16-bit encodings we talked about earlier, this provided space for around 65,536 characters, but after a few years the standard was extended to use 21 bits of storage instead. The result is the current Unicode codespace (the range of all potential values that can be used for encoding): 1,114,112 different values in the range 0 to 0x10FFFF inclusive.
Code Points and Unicode Scalars
Each value within the codespace is referred to as a code point. Code points are written as hexadecimal values prefixed with U+ (a capital letter U and a plus sign).
In addition to the concept of code points, the Unicode Standard also defines the concept of Unicode scalars. Unicode scalars are a subset of the Unicode code points and are defined to be those code points in the range U+0000 to U+D7FF inclusive or U+E000 to U+10FFFF inclusive.
If you are paying attention, you will have noticed that this leaves a gap in the Unicode codespace between U+D800 and U+DFFF inclusive. These values are allocated to the Unicode surrogate pair code points, which are in turn divided into two groups: the high-surrogate (U+D800 to U+DBFF) and low-surrogate (U+DC00 to U+DFFF) code points. We’ll look at these in more detail when we look at UTF-16.
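To make this a little more concrete, here’s a quick sketch in Swift (assuming a reasonably recent toolchain) showing how a code point maps onto the Unicode.Scalar type, and how values in the surrogate range are rejected because they aren’t scalars:

```swift
// A minimal sketch: Unicode.Scalar's failable UInt32 initialiser accepts
// values in the scalar ranges described above and rejects everything else,
// including the surrogate range U+D800 to U+DFFF.
if let grin = Unicode.Scalar(UInt32(0x1F600)) {
    print("U+\(String(grin.value, radix: 16, uppercase: true))")  // U+1F600
    print(Character(grin))                                        // 😀
}

let surrogate = Unicode.Scalar(UInt32(0xD800))  // a surrogate code point
print(surrogate as Any)                         // nil – not a Unicode scalar
```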
Organising the Unicode Codespace
Now, at 1,114,112 different code points, there is no getting away from the fact that the Unicode codespace is pretty large. To help manage this, the Unicode Standard organises the codespace into 17 planes, with each plane containing 65,536 code points.
The first plane, Plane 0, is called the Basic Multilingual Plane (BMP). This plane contains most of the characters and symbols you’ll commonly encounter.
After the Basic Multilingual Plane come the supplementary planes.
The first of the supplementary planes is Plane 1 – the Supplementary Multilingual Plane (SMP). The Supplementary Multilingual Plane contains the characters for a number of historic scripts, Egyptian hieroglyphs, cuneiform scripts, historic and modern musical notation, mathematical alphanumerics, emoji and other pictographic sets, and game symbols such as playing cards, mahjong tiles and dominoes.
Beyond this are the remaining supplementary planes.
Room For Expansion
In general, the remaining supplementary planes are largely empty. Not all Unicode scalar values are assigned encoded characters and, in reality, only around 10% of them are actually in use.
Within these unused scalars the Unicode Standard also builds in a number of private use areas.
These private use areas are designated areas within the Unicode character space that, by definition, will never have any characters assigned to them. These areas can be used by third-party organisations to define their own character mappings without conflicting with the assignments of the Unicode Standard.
Duplicates
Now, with so many characters being encoded within the Unicode standard, it is not uncommon for seemingly identical characters to be encoded multiple times with different Unicode Scalars.
For example, the C (U+0043 LATIN CAPITAL LETTER C) and the С (U+0421 CYRILLIC CAPITAL LETTER ES) look visually identical when rendered, but they are actually different characters; the key is that they have different meanings. By encoding characters with different meanings using different code points, the Unicode Standard is able to retain the meaning of these separate characters whilst simplifying conversion from legacy encodings. So although those characters look identical, they are distinct. There are, however, some true duplicates built into the Unicode Standard.
For example, the ASCII Latin alphabet is represented twice within the Unicode Standard: once in the Basic Latin Unicode block (U+0000 to U+007F, the first 128 characters of the standard) and once in the Halfwidth and Fullwidth Forms Unicode block (U+FF00 to U+FFEF). There is also another group of characters that falls under a broader definition of “duplicate”.
These characters have a different visual appearance or behaviour but in reality represent the same abstract character. Examples include the letters of the Greek alphabet, which are encoded both as Greek characters and as mathematical symbols, and the Roman numerals, which are encoded in the range U+2160 to U+2188 in addition to the standard Basic Latin letters.
So, as you can see, the Unicode Standard may contain what appear to be a number of duplicate characters, but when considered in the wider context of the entire Unicode codespace, these characters have been encoded as different code points because they have different meanings. You might think this is wasteful in terms of storage, but in reality there is still a lot of room for expansion within the Unicode Standard. The standard also has a trick that helps extend things even further: the concept of combining character sequences.
Combining Character Sequences
In the previous sections we’ve talked about how individual Unicode scalars can be used to represent individual characters. [Combining character sequences](http://unicode.org/faq/char_combmark.html) build on this idea by using special sequences of one or more Unicode scalars that, when combined, produce a single perceived character, also known as a grapheme.
For example, in Unicode, certain accented characters such as the letter é can be represented in two different forms. In its precomposed form, the character is encoded as a single Unicode scalar (U+00E9 LATIN SMALL LETTER E WITH ACUTE). In its decomposed form, it is represented as a collection of Unicode code points, sometimes referred to as an extended grapheme cluster. In the case of the decomposed form of é, the extended grapheme cluster consists of a pair of Unicode scalars (U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT) and, when presented, the U+0301 COMBINING ACUTE ACCENT scalar is graphically applied to the preceding scalar, turning the character e into the character é when rendered by a Unicode-aware text rendering system. In Unicode terminology, these two forms, the precomposed form and the decomposed form, are said to be canonically equivalent.
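As a quick, hedged sketch of what this looks like in Swift (the details of Swift’s string handling are the subject of the next article), both forms render as the same perceived character and, because Swift compares strings using canonical equivalence, they compare as equal:

```swift
// The same perceived character, built from different scalar sequences.
let precomposed = "\u{00E9}"          // é as a single scalar
let decomposed  = "\u{0065}\u{0301}"  // e followed by a combining acute accent

print(precomposed.unicodeScalars.count)  // 1
print(decomposed.unicodeScalars.count)   // 2
print(precomposed == decomposed)         // true – canonically equivalent
```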
It doesn’t just stop with accented characters though. Combining character sequences can also be used to represent a range of other combined characters.
Enclosing Marks
Combining character sequences can also be used to represent enclosing marks.
Enclosing marks are marks that are applied to their preceding character and completely enclose the character they are being applied to.
For example, we can apply the U+20DD COMBINING ENCLOSING CIRCLE mark to our previous example to create a new combined symbol: é⃝.
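A small sketch of this in Swift (how well the result renders depends on your font support):

```swift
// e + combining acute accent + combining enclosing circle
let enclosed = "\u{0065}\u{0301}\u{20DD}"
print(enclosed)        // é⃝ – rendering depends on font support
print(enclosed.count)  // 1 – the marks extend the grapheme, so it's one Character
```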
Ligatures and Digraphs
Ligatures and digraphs are yet another example.
Originating in cursive handwriting, where characters were often joined together, ligatures were produced to solve the problem of visually representing multiple letters as a single combined glyph (or visual representation).
For example, when English text is rendered, it generally doesn’t look great if the upper part of a letter f is too close to the dot of a letter i, or if the letters f and l are used in combination, as the spacing goes a bit wonky. In these cases, the letters can be combined when rendered into a single glyph that visually merges them into one combined representation.
Now, you might be wondering why I’m mentioning this. For the most part, typographical issues (the issues of how text is presented when rendered) are specifically excluded from the Unicode Standard, but there are a few constructs built into the standard that do relate to ligatures specifically.
Included mostly for compatibility with legacy print media, the Unicode Standard includes a number of precomposed characters in the range U+FB00 to U+FB4F that are used specifically to represent ligatures. Examples include the ligatures ff, fi, fl, ffi and ffl, which can be used to represent these letter combinations if the rendering engine supports them. With that said, they should be used sparingly as the Unicode Standard officially discourages their use (see the link above).
Glyph Variants
Ok, let’s move on and look at another example of combining sequences – glyph variants.
We’ve already talked about the concept of glyphs as being the visual representation of a particular character or symbol.
The thing is, in Unicode, there is not always a one-to-one mapping between characters and the glyphs that are used to display them. Some rendering engines provide multiple glyph variants for a single abstract character and the Unicode Standard builds in the concept of variation selectors which can be used to affect which particular glyph is used when the character is rendered.
At their most basic, variation selectors are nothing more than additional code points built into the Unicode Standard but, like the combining marks we’ve looked at previously, they provide a mechanism by which the appearance of the preceding character can be affected when rendered.
There are 256 variation selectors built into the Unicode Standard (VS1 through VS256, occupying the range U+FE00 to U+FE0F in the Variation Selectors block and U+E0100 to U+E01EF in the Variation Selectors Supplement block). These selectors are used in both the standardized variation sequences defined in the Unicode Standard itself and the ideographic variation sequences that are submitted to the Unicode Consortium by third parties and, once registered, can be used by anyone.
One example of standardised variation sequences in practice is emoji styles. Many emoji, and some “normal” characters, come in two forms: a colourful “emoji style” and a black-and-white, more symbol-like “text style”. Variation selectors allow us to select which particular style we would like to use when the character is rendered.
For example, the glyph ☕ (U+2615 HOT BEVERAGE) has two variants – a text style (☕︎) and an emoji style (☕). To select between these two styles, we take the base character (U+2615) and combine it with either the U+FE0E VARIATION SELECTOR-15 variation selector for the text style or the U+FE0F VARIATION SELECTOR-16 variation selector for the emoji style.
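In Swift string-literal terms, that might look something like the following sketch (whether you actually see two different renderings depends on the fonts installed on your system):

```swift
// U+2615 HOT BEVERAGE with the two presentation styles selected explicitly.
let textStyle  = "\u{2615}\u{FE0E}"  // text style  (VARIATION SELECTOR-15)
let emojiStyle = "\u{2615}\u{FE0F}"  // emoji style (VARIATION SELECTOR-16)
print(textStyle, emojiStyle)
```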
Emoji Modifiers
In addition to the variation selectors above, certain emoji, such as those that represent people and body parts, also support modifiers that reflect human diversity, such as skin tone.
By default, human emoji in Unicode use a generic, non-realistic skin tone such as grey or bright yellow, but the Unicode Standard also supplies five symbol modifiers in the range U+1F3FB through U+1F3FF which allow us to select a more realistic skin tone.
These modifiers are based on the skin tones of the Fitzpatrick scale, a recognised standard in dermatology, and when used in conjunction with human emoji they select the particular skin tone that will be used when the emoji is rendered.
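Here’s a quick sketch of a modifier in action (the emoji and modifier I’ve picked are just examples; recent Swift versions treat the resulting sequence as a single character):

```swift
// 👋 WAVING HAND SIGN, plain and with EMOJI MODIFIER FITZPATRICK TYPE-5 applied.
let wave      = "\u{1F44B}"
let tonedWave = "\u{1F44B}\u{1F3FE}"

print(wave, tonedWave)  // 👋 👋🏾
print(tonedWave.count)  // 1 on recent Swift versions – one perceived character
```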
Let’s leave variation selectors there for now and move on and look at something a little different – Regional Indicator Symbols.
Regional Indicator Symbols
As you might expect, national flags are another set of symbols that are supported by the Unicode Standard. What might surprise you though is that the Unicode Standard doesn’t define code points for national flag symbols directly.
Instead, the standard defines a method for composing a flag symbol from two [regional indicator symbols](https://en.wikipedia.org/wiki/Regional_Indicator_Symbol), a set of 26 Unicode code points (representing the letters A – Z) ranging from U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A to U+1F1FF REGIONAL INDICATOR SYMBOL LETTER Z, which are used to spell out the two-letter ISO country code of the country in question.
When combined, these two-letter country codes are rendered as a single Unicode glyph representing the country’s flag.
For example, we could combine the Unicode scalar U+1F1EC REGIONAL INDICATOR SYMBOL LETTER G with the Unicode scalar U+1F1E7 REGIONAL INDICATOR SYMBOL LETTER B and, when rendered in a Unicode-aware rendering system, the result would be the flag of the United Kingdom. In similar fashion, the Unicode scalar U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U combined with the Unicode scalar U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER S would result in the flag of the United States of America.
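As a small illustrative sketch in Swift:

```swift
// Flags built from pairs of regional indicator symbols.
let unitedKingdom = "\u{1F1EC}\u{1F1E7}"  // G + B → 🇬🇧
let unitedStates  = "\u{1F1FA}\u{1F1F8}"  // U + S → 🇺🇸

print(unitedKingdom, unitedStates)
print(unitedKingdom.count)  // 1 – the pair forms a single perceived character
```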
There are, however, a couple of things to note. Whether a combination of regional indicator symbols is actually displayed as a flag depends on your device’s font support. If you use a regional indicator symbol combination for which no glyph exists, it will be displayed as multiple letters but will still be treated as a single unit. In addition, you also need to be careful about grapheme cluster boundaries.
The Unicode Standard doesn’t actually state that a grapheme cluster consisting of regional indicator symbols should contain only two Unicode scalars (see the [grapheme cluster boundary rules](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules)), so there is nothing stopping you listing multiple regional indicator symbols in sequence. However, to have them treated as separate characters, you need to give the rendering engine a hint about where the boundary of each symbol is. To this end, it is usually best to separate individual flags with a non-printing character such as U+200B ZERO WIDTH SPACE; this way, rendering engines are able to determine the boundaries of each cluster and treat each regional flag as a separate character.
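A brief sketch of that suggestion (the zero-width space is invisible when printed, but it keeps the two flags in separate grapheme clusters):

```swift
// Two flags separated by a zero-width space so they can't merge into one cluster.
let zwsp  = "\u{200B}"
let flags = "\u{1F1EC}\u{1F1E7}" + zwsp + "\u{1F1FA}\u{1F1F8}"

print(flags)        // 🇬🇧​🇺🇸
print(flags.count)  // 3 – two flags plus the (invisible) zero-width space
```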
Ok, that’s pretty much it for the core of the Unicode Standard. Let’s now move on and look at normalisation.
Normalisation Forms
In earlier sections we’ve talked about how the Unicode Standard contains duplicate characters and also how, through the use of extended grapheme clusters, it supports precomposed and decomposed forms of the same character.
As we’ve seen, this all works well if all we’re worried about is how those characters are stored and presented on screen, but when it comes to text-processing software, we have to account for the presence of these different forms, especially when it comes to things like text equivalence.
The question then is: how exactly do we determine text equality when the text we’re looking at could be encoded in a number of different forms? The most obvious option would be to compare two pieces of text on a code-point-by-code-point basis. If we think about it though, that’s not a great solution for characters like the é we talked about earlier. Remember that this character can be represented in either a precomposed form (U+00E9 LATIN SMALL LETTER E WITH ACUTE) or a decomposed form (U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT). Instead, we need a way of determining whether two strings are canonically equivalent (i.e. they have the same visual appearance and, from a human perspective, are essentially the same character), and the good news is that the Unicode Standard defines a number of normalisation algorithms that can help us with this.
The Unicode standard supports two formal types of equivalence – canonical equivalence (NF) and compatibility equivalence (NFK).
Canonical equivalence is a fundamental equivalence between characters, or sequences of characters, that represent the same abstract character and, when displayed, result in the same visual appearance. Our character é is a prime example of this: in Unicode terms, the precomposed and decomposed forms we talked about earlier are said to be canonically equivalent as they result in the same character being visually rendered.
Compatibility equivalence is a slightly weaker form of equivalence. With compatibility equivalence, characters are deemed to be equivalent if they represent the same abstract character but may have visually distinct representations when rendered. The ff ligature we talked about earlier is a good example: the Unicode scalar U+FB00 LATIN SMALL LIGATURE FF is defined in Unicode to be compatibility equivalent (but not canonically equivalent) to the sequence of two U+0066 LATIN SMALL LETTER F scalars.
In Unicode terms, normalisation is “a process of removing alternative representations of equivalent sequences from textual data, to convert that data into a form that can be binary-compared for equivalence”.
So when it comes to actually normalising text in Unicode in order to compare it, there are four different transformations we can use:
– Normalisation Form C (NFC) – Normalisation Form Canonical Composition – Characters are normalised by decomposing them by canonical equivalence (transforming any precomposed characters into their canonical equivalents) before performing a canonical composition (combining multiple characters into their precomposed forms using canonical equivalence).
– Normalisation Form D (NFD) – Normalisation Form Canonical Decomposition – Performs canonical decomposition before arranging multiple combining characters in a specific order. This ensures that the sequence of combining marks is unique after decomposition.
– Normalisation Form KC (NFKC) – Normalisation Form Compatibility Composition – Characters are decomposed by compatibility equivalence and then recomposed by canonical equivalence.
– Normalisation Form KD (NFKD) – Normalisation Form Compatibility Decomposition – Characters are decomposed by compatibility equivalence and then any combining characters are arranged in a specific order again ensuring that they are unique.
For the purposes of comparing text, it doesn’t really matter whether you normalise to the composed (C) or decomposed (D) forms. The decomposed form (D) is slightly faster than the composed form (C), since it doesn’t have to recompose the characters after decomposing them, but ultimately any comparison should have the same outcome.
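If you want to experiment with these forms, Foundation exposes the four normalisation transforms as properties on String; here’s a small sketch:

```swift
import Foundation

let decomposed = "e\u{0301}"  // é in decomposed (NFD) form

print(decomposed.precomposedStringWithCanonicalMapping.unicodeScalars.count)  // NFC: 1
print(decomposed.decomposedStringWithCanonicalMapping.unicodeScalars.count)   // NFD: 2

// The compatibility forms (NFKC / NFKD) also fold characters like the ﬀ ligature:
print("\u{FB00}".precomposedStringWithCompatibilityMapping)  // NFKC: "ff"
print("\u{FB00}".decomposedStringWithCompatibilityMapping)   // NFKD: "ff"
```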
There is one word of caution though. If you have any bright ideas about storing text in a normalised form then generally the answer is – don’t.
The transformations outlined above can, and often do, alter the meaning of the text being transformed, so if you intend to use the stored text for anything other than comparison, it’s generally a bad idea, as the stored text may no longer carry the same meaning as the original.
Unicode Transformation Formats
Ok, so we’ve talked about how Unicode represents characters in an abstract way, and we’ve talked about how we can uniquely identify characters through the use of Unicode code points. What we’ve not yet talked about is how those code points are stored in memory or written to file. We’ll cover that next.
As we’ve discussed, the Unicode Standard uses 21-bit code points to uniquely identify different abstract characters, but how those bits are stored in memory is also extremely important. This is where the Unicode transformation formats (UTFs), sometimes called encodings, come in.
The Unicode transformation formats (UTF-8, UTF-16 and UTF-32) each define a set of rules for how Unicode code points are transformed to and from bytes stored in memory. Each encoding is based around the concept of a code unit, a code unit being the minimum number of bits that can be used to represent a unit of encoded text in memory. Each encoding uses a code unit of a different size: the UTF-8 encoding uses 8 bits, the UTF-16 encoding uses 16 and (unsurprisingly) the UTF-32 encoding uses 32. However, the exact number of code units each encoding uses to represent a Unicode code point varies. The easiest to understand is UTF-32.
UTF-32
UTF-32 is the most straightforward of the three Unicode encoding schemes. It is a fixed-length encoding and uses a single 32-bit code unit to represent each and every encoded Unicode code point. Any Unicode code point is simply stored as a 32-bit value that translates directly to the corresponding 21-bit code point it represents. Pretty simple.
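A tiny sketch of that direct mapping, using Swift’s unicodeScalars view (each scalar’s value is exactly the 32-bit code unit UTF-32 would store):

```swift
// In UTF-32, each code point becomes one 32-bit code unit holding the same value.
for scalar in "é€😀".unicodeScalars {
    let codeUnit: UInt32 = scalar.value
    print(String(codeUnit, radix: 16))  // e9, 20ac, 1f600
}
```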
There is a bit of a downside to this simplicity though. Since a Unicode code point requires only 21 bits of storage and the UTF-32 encoding uses a fixed 32 bits, the UTF-32 encoding is actually pretty inefficient. For every encoded code point, at least one byte will never be used, and when encoding text over low-bandwidth links where every bit and byte counts, this overhead can simply be too much. For this reason, UTF-32 isn’t commonly seen in the wild and one of the other UTF encodings is usually used instead. This brings me on to UTF-16.
UTF-16
UTF-16 is a variable-width encoding and is the default encoding for the Unicode Standard. It encodes the Unicode characters in the range U+0000 to U+FFFF (essentially those in the Basic Multilingual Plane (BMP)) in a single 16-bit code unit of the same value. Since the BMP encompasses pretty much all of the commonly used characters, this makes the UTF-16 encoding relatively efficient, with only the more rarely used characters (those between U+10000 and U+10FFFF inclusive) requiring two UTF-16 code units to encode them. This pair of 16-bit code units is referred to in the Unicode Standard as a surrogate pair. So, in some cases UTF-16 uses a single 16-bit code unit, and in others it uses two. How, then, does an interpreter know which is being used?
To avoid ambiguous byte sequences and to make detection of surrogate pairs easy, the Unicode Standard reserves the range of code points between U+D800 and U+DFFF for the exclusive use of the UTF-16 encoding. As we touched on earlier, the Unicode Standard guarantees that code points within this range will never have a character assigned to them. This means that when an interpreter sees a value that falls into this range, it immediately knows it’s dealing with a UTF-16 surrogate pair, but it still doesn’t necessarily know which code unit within that pair it has encountered. For this, we need to dig a little further.
Within each surrogate pair there are obviously two code units. In order to distinguish between them, the Unicode Standard defines things so that the first code unit in the pair will always be in the range 0xD800 to 0xDBFF inclusive and the second code unit will always be in the range 0xDC00 to 0xDFFF inclusive. It does this through the following encoding algorithm.

First, a value of 0x010000 is subtracted from the code point that is being encoded. This leaves a 20-bit number in the range 0x0 to 0x0FFFFF inclusive. The top ten bits of that value (a number in the range 0x0 to 0x03FF inclusive) are then added to 0xD800 to give the first 16-bit code unit (also known as a high surrogate). This code unit will be in the range 0xD800 to 0xDBFF inclusive.
Next, the low ten bits of the number (also in the range 0x0 to 0x03FF inclusive) are added to 0xDC00 to give the second 16-bit code unit (also known as a low surrogate). This code unit will be in the range 0xDC00 to 0xDFFF inclusive. To decode a code point, the process is reversed.
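Here’s a hedged sketch of that algorithm in Swift (the function name and the example code point are mine); Swift’s built-in utf16 view produces the same code units, which gives us an easy way to check the result:

```swift
// Encode a code point above U+FFFF as a UTF-16 surrogate pair.
func surrogatePair(for codePoint: UInt32) -> (high: UInt16, low: UInt16) {
    let value = codePoint - 0x10000                // 20-bit value
    let high  = UInt16(0xD800 + (value >> 10))     // top 10 bits → high surrogate
    let low   = UInt16(0xDC00 + (value & 0x3FF))   // low 10 bits → low surrogate
    return (high, low)
}

let pair = surrogatePair(for: 0x1F600)  // 😀 U+1F600
print(String(pair.high, radix: 16), String(pair.low, radix: 16))  // d83d de00

// Swift's utf16 view yields the same pair of code units:
print("😀".utf16.map { String($0, radix: 16) })  // ["d83d", "de00"]
```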
The key here is that the two resulting ranges, 0xD800 to 0xDBFF inclusive for the high surrogates and 0xDC00 to 0xDFFF inclusive for the low surrogates, are mutually exclusive. By dividing the surrogate values into two separate, non-overlapping ranges, any text interpreter is able to tell immediately whether it’s looking at the first or second code unit of a surrogate pair. Pretty clever. But there is one more thing.
We talked earlier about the importance of byte order when encoding values to and from memory and the Unicode Standard also has to deal with this.
Transformation Formats and Byte Order
As we’ve seen, the UTF-16 and UTF-32 encodings both use code units that consist of multiple bytes. In the case of UTF-32 this means four bytes and in the case of UTF-16 it means two, but in both cases the order in which those bytes are written to and read from memory is highly important.
To help with this, both the UTF-16 and UTF-32 encodings incorporate the concept of a byte order mark (BOM).
A byte order mark is a special character (U+FEFF) that is placed at the start of a file or character stream to indicate the endianness (or byte order) of the code units that follow.
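For example (a quick sketch using Foundation; the exact byte order you see depends on the machine you run it on), encoding a string as UTF-16 typically prepends the BOM:

```swift
import Foundation

// String.Encoding.utf16 uses the host byte order and includes a BOM.
if let data = "Hi".data(using: .utf16) {
    print(data.map { String(format: "%02X", $0) })
    // e.g. ["FF", "FE", "48", "00", "69", "00"] on a little-endian machine
}
```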
In a big-endian system, interpreters will read these first couple of bytes and see the sequence 0xFE followed by 0xFF. In little-endian systems, they’ll see the sequence reversed, with a value of 0xFF being followed by 0xFE. By paying attention to which way around these bytes appear, interpreters can determine the order of all of the remaining bytes in the file or stream. It’s a neat little trick, but it does add complexity, and this may be the reason why UTF-16, despite being the default encoding for the Unicode Standard, is not that common in the wild. What is common, however, is UTF-8.
UTF-8
So, we’ve talked about UTF-32 and we saw that, due to its unused bytes, it isn’t the most efficient of encodings, but in practice UTF-16 often isn’t much better.
In Unicode, the first 256 code points (U+0000 to U+00FF inclusive) are identical to those of the ISO-8859-1 (Latin-1) encoding and cover most English and Western European characters and digits. When encoding English and Western European text in UTF-16, this commonly means that the upper 8 bits of each 16-bit code unit are never used and are simply set to zero, leading to the encoding potentially using more storage space than it really needs. To address these issues, the UTF-8 encoding was developed.
As mentioned earlier, the UTF-8 encoding uses 8-bit code units and, like UTF-16, is a variable-width encoding, using between one and four code units to encode an individual code point.
The code points from 0 – 127 are mapped directly to a single code unit (making UTF-8 identical to ASCII for text that only contains these characters).
The following 1,920 code points are then encoded using two code units and all the remaining code points in the Basic Multilingual Plane use three. Finally, code points in all the remaining Unicode planes are encoded using 4 code units.
There are a couple of benefits to UTF-8.
Firstly, since UTF-8 is based on 8-bit code units, it doesn’t suffer from the byte-ordering issues that plague the UTF-16 and UTF-32 encodings. Secondly, due to its variable-width nature, it is also space efficient, especially for Western languages. In general, then, the UTF-8 encoding is seen as the best encoding for the storage of Unicode text and has become the de facto standard for file formats, network protocols and web APIs.
There is, however, one issue. With Unicode code points being encoded using a variable number of 8-bit code units, the UTF-8 encoding still needs a way for interpreters to determine exactly how many code units are being used to encode an individual code point. This is done by using some special bits at the start of each code unit.
In UTF-8, each byte starts with a few bits that tell an interpreter whether it is a single-byte code point, the start of a multi-byte code point or a continuation of a previous multi-byte code point. The bit patterns it uses are as follows.
For single-byte ASCII characters (the first 128 characters of the Unicode Standard), each byte is encoded with the high-order bit set to zero as follows:
– 0xxx xxxx (A single-byte code point)
For multi-byte code points, the first code unit starts with a few bits that tell the interpreter whether it should read 1, 2 or 3 subsequent bytes for the encoded Unicode code point:
– 110x xxxx (One more byte will follow)
– 1110 xxxx (Two more bytes will follow)
– 1111 0xxx (Three more bytes will follow)
Finally, any subsequent byte that is a continuation of a previous multi-byte code point is encoded to start with the following bits:
– 10xx xxxx (A continuation of a multi-byte code point)
Through a combination of these little hints, any text interpreter is able to work out the boundaries of each encoded code point. Pretty cool.
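You can see these bit patterns for yourself with a quick sketch over Swift’s utf8 view (the sample characters are just examples of 1-, 2-, 3- and 4-byte code points):

```swift
// Print the UTF-8 bytes of each character in binary to show the lead/continuation bits.
let samples: [Character] = ["A", "é", "€", "😀"]
for character in samples {
    let bytes = String(character).utf8.map { String($0, radix: 2) }
    print(character, bytes)
}
// A ["1000001"]
// é ["11000011", "10101001"]
// € ["11100010", "10000010", "10101100"]
// 😀 ["11110000", "10011111", "10011000", "10000000"]
```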
Summary
Anyway, there we have it: Unicode. It’s a pretty big standard with some pretty big players behind it, designed to solve both the current and future character encoding problems of all of the world’s texts, and by now you should hopefully have a much greater understanding of how it works.
In the next post, we’ll turn our focus back to Swift. We’ll use the background Unicode knowledge we’ve gained from this article to look at how the String and Character types are implemented in Swift, and we’ll see how certain methods on those types give us a peek back into this Unicode world.