Understanding The Unicode Standard (part 1): UTF-32 & UTF-16


Unicode is an open standard: a universal character encoding system. More concretely, Unicode is a table that associates the characters of virtually every writing system used in Europe, America, Asia, and Africa with a unique code point. The standard can accommodate a little over 1.1 million individual code points, making it sustainable for the foreseeable future. Today, roughly 150,000 characters are assigned, and the complete, up-to-date list of characters in the standard can be found here: Unicode 15.0 Character Code Charts
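
To make the idea of a code point concrete, here is a tiny Python sketch (my own illustration, not part of the standard itself) that maps a few characters to their code points and back:

# Map characters to their Unicode code points and back (Python 3)
for ch in ["a", "é", "あ", "😃"]:
    code_point = ord(ch)                          # the code point as an integer
    print(ch, "-> U+%04X ->" % code_point, chr(code_point))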

 

We have all heard of ASCII. ASCII is an American standard that represents the characters of the English alphabet in a seven-bit format, meaning up to 128 characters can be defined with it. In the early days of the internet and the worldwide adoption of computers, this raised a big issue: how can a Japanese speaker exchange text with an American one?

 

In reality, ASCII could not accommodate the thousands of kanji characters, or even the additional 71 hiragana characters, on top of its already full list of 128 code points. The Unicode Consortium therefore defined several encoding forms to solve this issue:

 

  • UTF-32
  • UTF-16
  • UTF-8

 

Each one has its own uses and its own encoding/decoding rules. Today, UTF-8 is the most widely used encoding, and also the most memory efficient.

 

In this article and the couple that follow, we will discuss each encoding's encoding/decoding process and efficiency, and go through a mini-lab to experiment with each method, with a particular focus on how UTF-8 works.

 

There are also three great video resources about Unicode that I loved and recommend looking up.

 

 

UTF-32

 

The American Standard Code for Information Interchange (ASCII) uses a seven-bit encoding for English characters. ASCII is well known but rarely used on its own by modern computers because it is limited to 128 characters and covers only the English alphabet.

 

UTF-32 is a Unicode consortium standard that works as follows:

 

  • Each character is associated with a unique code point
  • Each code point is encoded with 32 bits or 4 bytes. 

 

As such, UTF-32 offers an easy way to read characters from memory, as every 32 bits (four bytes) represent exactly one character.

 

The letter a would then be:

 

U+00000061 in Hexadecimal 

00000000 00000000 00000000 01100001 in binary (four bytes) 
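
As a quick check, the same four bytes can be produced in Python; this is a minimal sketch using the built-in 'utf-32-be' codec (big-endian and without a byte order mark, so it matches the byte layout shown above):

# Encode the letter 'a' with UTF-32 (big-endian, no byte order mark)
encoded = "a".encode("utf-32-be")
print(encoded.hex(" "))                               # 00 00 00 61
print(" ".join(f"{byte:08b}" for byte in encoded))    # 00000000 00000000 00000000 01100001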

 

To write Hello world, this would be the result: 

 

U+00000048 U+00000065 U+0000006c U+0000006c U+0000006f U+00000020 U+00000077  U+0000006f U+00000072 U+0000006c U+00000064 

in Hexadecimal

 

Which converts to:

 

00000000 00000000 00000000 01001000   (H)
00000000 00000000 00000000 01100101   (e)
00000000 00000000 00000000 01101100   (l)
00000000 00000000 00000000 01101100   (l)
00000000 00000000 00000000 01101111   (o)
00000000 00000000 00000000 00100000   (space)
00000000 00000000 00000000 01110111   (w)
00000000 00000000 00000000 01101111   (o)
00000000 00000000 00000000 01110010   (r)
00000000 00000000 00000000 01101100   (l)
00000000 00000000 00000000 01100100   (d)

In binary. 

 

That is 264 bits of pure zero padding: three wasted bytes for every character. 
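
A short sketch reproduces those numbers (the one-byte-per-character ASCII baseline is my choice for the comparison; Python's 'utf-32-be' codec is used again):

# Compare UTF-32 with a one-byte-per-character ASCII baseline
text = "Hello world"
utf32 = text.encode("utf-32-be")                  # 4 bytes per character
ascii_bytes = text.encode("ascii")                # 1 byte per character
print(len(utf32) * 8, "bits in UTF-32")           # 352
print(len(ascii_bytes) * 8, "bits in ASCII")      # 88
print((len(utf32) - len(ascii_bytes)) * 8, "bits of zero padding")   # 264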

 

For an English document of 2,000 words averaging five characters per word, the text would take about 70 kilobits with ASCII, compared to 320 kilobits with UTF-32. That is more than four times the storage for exactly the same content. This is why UTF-32 is barely ever used, except when memory and storage are not an issue and reading or processing speed is the priority. 
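
The same back-of-the-envelope arithmetic, written out (the 2,000-word, five-characters-per-word document is the assumption taken from the paragraph above; spaces and punctuation are ignored):

# Rough size of a 2,000-word English document
chars = 2000 * 5                          # 10 000 characters
ascii_bits = chars * 7                    # 7 bits per ASCII character
utf32_bits = chars * 32                   # 32 bits per UTF-32 character
print(ascii_bits, "bits  (~70 kilobits)")                   # 70000
print(utf32_bits, "bits (~320 kilobits)")                   # 320000
print(round(utf32_bits / ascii_bits, 2), "times larger")    # 4.57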

 

UTF-16

 

UTF-16 goes as follows: 

 

  • The first 65 536 code points (U+0000 to U+FFFF) are each represented with a single 16-bit code unit
  • Characters beyond that limit are represented with what are called surrogate pairs. 

 

So what are surrogate pairs? Let’s consider the following character to encode using UTF-16: 😃, the Smiling Face with Open Mouth emoji. Its code point is U+1F603, or 128 515 in decimal. 
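
Before walking through the math, a quick Python sketch shows the two cases side by side: a character within the first 65 536 code points takes a single 16-bit code unit, while our emoji needs two (the 'utf-16-be' codec is used to keep the byte order mark out of the output):

# Count UTF-16 code units for an ASCII character, a non-ASCII character and an emoji
for ch in ["a", "€", "😃"]:
    data = ch.encode("utf-16-be")
    print(f"U+{ord(ch):04X}:", len(data) // 2, "code unit(s):", data.hex(" "))
# U+0061: 1 code unit(s): 00 61
# U+20AC: 1 code unit(s): 20 ac
# U+1F603: 2 code unit(s): d8 3d de 03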

 

To encode our emoji using UTF-16, we will:

 

  • First, we subtract 65 536 (0x10000) from the code point (U+1F603)

 

128 515 - 65 536 = 62 979

 

  • The result is converted to binary, then padded with leading zeros to a 20-bit value. 

 

(62 979)10 = (1111011000000011)2 => (0000111101 1000000011)2

 

  • Then the 20-bit value is split into two binary codes of 10 bits each

 

A=> (0000111101)2 => (61)10 & B=>(1000000011)2 => (515)10

 

  • Next, A and B are combined with the surrogate pairs. But what are those?

 

Note

With surrogate pairs, a Unicode code point from range U+D800 to U+DBFF (called "high surrogate") gets combined with another Unicode code point from range U+DC00 to U+DFFF (called "low surrogate") to generate a whole new character, allowing the encoding of over one million additional characters.

~ Microsoft

 

Surrogate pairs are two reserved ranges of code point values, used only by UTF-16, that encode characters with code points at U+10000 and beyond. Each range contains 1 024 values, organized as follows:

 

  • High surrogates: D800 to DBFF (55296 to 56319)
  • Low surrogates: DC00 to DFFF (56320 to 57343)

 

Those are surrogate pairs. 

 

  • Next, we add D800 (the high surrogate base) to A (the first part of the code) and DC00 (the low surrogate base) to B (the second part of the code). We then obtain: 

 

A => 61 + 55 296 = 55 357 => ( 1101100000111101 )2 = D83D in hexadecimal

B => 515 + 56 320 = 56 835 => ( 1101111000000011 )2 = DE03 in hexadecimal

 

  • And there you have it. These two new values are the two 16-bit code units that represent our emoji in UTF-16; a code sketch of the full procedure follows after the note below. 



Note: The surrogate pair mechanism allows UTF-16 to encode all 1 112 064 code points that are, or will ever be, part of Unicode.  
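
Putting all the steps together, here is a small Python sketch of the procedure described above (the function name encode_surrogate_pair is my own; the built-in 'utf-16-be' codec is only used at the end as a cross-check):

# Encode a code point above U+FFFF as a UTF-16 surrogate pair
def encode_surrogate_pair(code_point: int) -> tuple[int, int]:
    offset = code_point - 0x10000         # step 1: subtract 65 536
    high = 0xD800 + (offset >> 10)        # top 10 bits + high surrogate base
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits + low surrogate base
    return high, low

high, low = encode_surrogate_pair(0x1F603)        # our 😃 emoji
print(f"{high:04X} {low:04X}")                    # D83D DE03
print("😃".encode("utf-16-be").hex(" "))          # d8 3d de 03, the same pair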

 

 

ChatGPT's input on UTF-16 

 

For a better look at why this procedure was chosen, ChatGPT says: 



The design of the UTF-16 encoding method was based on the needs of the Unicode character set, which is a standard for encoding and representing text in all writing systems of the world. UTF-16 was designed to be a variable-length encoding that could represent all Unicode characters, including those that require more than 16 bits of space.

 

The decision to use surrogate pairs in UTF-16 was made to maintain compatibility with existing systems and software that could only handle 16-bit values. By using surrogate pairs to represent characters that require more than 16 bits, UTF-16 was able to support the full range of Unicode characters without requiring a complete overhaul of existing systems.

 

The specific formula used to calculate the surrogate pair values was carefully designed to ensure that they would be unique and easily distinguishable from other values in the UTF-16 encoding. The range of code points reserved for surrogate pairs, U+D800 to U+DFFF, was chosen because it is not used for any other characters in the Unicode standard.

 

Overall, the design of the UTF-16 encoding and the use of surrogate pairs were carefully considered and chosen to ensure maximum compatibility, efficiency, and flexibility for encoding and representing all characters in the Unicode character set.



Key Takeaways:

 

  • UTF-32 is great because it does not require any intensive calculation to read characters from memory. However, it consumes a lot of memory, and most systems and application designs simply ignore it. 
  • UTF-16 encodes the first 65 536 code points as a single 16-bit code unit. Beyond that point, each character is encoded with two 16-bit code units determined by the surrogate pairs. Although UTF-16 offers better storage usage, it still wastes quite a lot of space when a text contains only ASCII characters. 
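
To close the loop, decoding simply runs the procedure in reverse; this sketch (my own illustration, mirroring the encoder above) recovers the original code point from the two surrogate values we computed for the emoji:

# Recover the original code point from a UTF-16 surrogate pair
def decode_surrogate_pair(high: int, low: int) -> int:
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

code_point = decode_surrogate_pair(0xD83D, 0xDE03)
print(f"U+{code_point:X}", chr(code_point))       # U+1F603 😃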

 

The following article [ https://genzis.net/blog/understanding-the-unicode-standard-part-2-utf-8-or-ucs-1 ] will detail UTF-8, the most commonly used encoding method. 

