ASCII to UTF
👷‍♂️ Software Architecture Series — Part 29.
ASCII (American Standard Code for Information Interchange)
Whenever we deal with text on a computer system, ASCII codes are involved in some way or the other. Formalized in 1967, ASCII remains one of the most important standards in the computer industry. It is a 7-bit code, which means it uses binary codes 0000000 through 1111111, for a total of 2⁷ = 128 possible codes (00h to 7Fh in hexadecimal).
Here is a snippet of the upper-case alphabet characters and their corresponding codes: 'A' is 41h, 'B' is 42h, and so on up to 'Z' at 5Ah.
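A minimal Python sketch, using only the standard library, can generate this mapping for the whole upper-case range:

```python
import string

# Print each upper-case letter with its ASCII code in decimal and hexadecimal.
for ch in string.ascii_uppercase:
    code = ord(ch)                      # ord() returns the character's code point
    print(f"{ch}: {code:3d} ({code:02X}h)")
```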
ASCII was not the first standard of its kind; it is, in fact, an upgrade over the famous Baudot code, which was used in teletypewriters, devices with only about 30 keys and a spacebar. The Baudot code was devised by Émile Baudot, an officer in the French Telegraph Service. It remained popular into the 1960s for sending and receiving text messages called telegrams.
The Baudot code was a 5-bit code, which allows only 2⁵ = 32 possible codes. As you can guess, 32 codes are not sufficient to represent even the English alphabet: the upper-case and lower-case letters alone add up to 52 different characters. To represent all the characters that normally occur in English text without resorting to shift codes, ASCII was formalized as a 7-bit code with 128 possibilities.
Why 7-bit, not 8-bit?
8-bit bytes are now the standard unit of computer memory, and although ASCII is technically a 7-bit code, it is almost universally stored as an 8-bit value. The extra (8th) bit in the byte is typically set to 0 for standard ASCII characters. For example:
The character ‘A’ in 7-bit binary is 1000001, but it is stored in a byte as 01000001.
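A short Python sketch (using nothing beyond the built-ins ord() and format()) makes that padding visible:

```python
# 'A' has ASCII code 65: seven significant bits, stored in an 8-bit byte.
code = ord("A")
print(format(code, "07b"))   # 1000001  -> the 7-bit ASCII code
print(format(code, "08b"))   # 01000001 -> as stored, with the 8th bit set to 0
```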
When ASCII was first developed, 7 bits were chosen because 128 codes were sufficient to represent all the characters needed at the time, including control characters (such as newline and tab). It also left the 8th bit available for other uses, such as error checking (parity bits) or extending the character set (as in extended ASCII). In extended ASCII, the first 128 codes, with hexadecimal values 00h through 7Fh, are defined exactly as in the original ASCII character set, while the remaining 128 codes are available for whatever use the designer of a particular extension chooses.
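To make this concrete, here is a small Python sketch using Latin-1 (ISO 8859-1), one of the many extended-ASCII character sets; the choice of Latin-1 is mine, purely for illustration:

```python
# In Latin-1, codes 00h-7Fh match ASCII; codes 80h-FFh hold extra characters.
print("A".encode("latin-1"))    # b'A'    -> 41h, same as plain ASCII
print("£".encode("latin-1"))    # b'\xa3' -> A3h, one of the "upper" 128 codes
print("€".encode("latin-1", errors="replace"))  # b'?' -> the euro sign has no Latin-1 code
```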
As the name suggests, the American Standard Code for Information Interchange, abbreviated ASCII, was developed to represent the English language from an American perspective. When computers began to be used globally, a harsh realization set in: even 128 or 256 possible codes (7-bit or 8-bit) are not sufficient to represent the languages of the world. In fact, they cannot even cover the variants of English across different countries. For example, for currency the $ sign is used in America, but in England a pound sign (yes, now we can type £, but we will come to that shortly) is required. And on a global scale, we are talking about languages like French, German, Hindi, and many more.
Extended ASCII was used to accommodate accented letters and non-Latin alphabets; however, it was extended in different ways at different times, with no compatibility between the variants. For example, a variant extended to support Japanese could not also support Hindi.
UTF
The need of the hour was one unambiguous character encoding system that could fit all the world's languages. Several major companies from across the globe joined forces in 1988 and began developing an alternative to ASCII known as Unicode. Originally, Unicode was intended to be a 16-bit code, giving it enough leeway to support every language in use across the globe. Being 16-bit meant that every character in Unicode would require 2 bytes, with values ranging from 0000h through FFFFh, enough to represent 65,536 different characters. And following in the footsteps of the extended ASCII character set, the first 128 characters of Unicode, codes 0000h through 007Fh, are the same as the ASCII characters. A Unicode value is written in hexadecimal, prefixed with a capital U and a plus sign. For example, 'A' is U+0041, '$' is U+0024, '£' is U+00A3, and '€' is U+20AC.
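A minimal Python sketch (the character choices above are only illustrative) prints these code points in the standard U+ notation:

```python
# ord() gives the Unicode code point; format it as U+XXXX with at least four hex digits.
for ch in "A$£€":
    print(f"{ch} -> U+{ord(ch):04X}")
```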
However, the shift from an 8-bit character code to a 16-bit character code came with its own challenges, the major one being that different computers read 16-bit values in different ways. For example, the character ‘A’ is represented by U+0041. In a 16-bit encoding, this is stored as the two bytes 00h and 41h. When storing or transmitting a Unicode value in a 16-bit system, the order of those bytes depends on whether the system is big-endian or little-endian.
In a big-endian system, the most significant byte is stored first, so the character ‘A’ with Unicode U+0041 would be stored as 00h 41h. In a little-endian system, the least significant byte is stored first, so U+0041 would be stored as 41h 00h. However, this also creates the chance of misinterpretation. Consider a file containing the bytes 20h ACh. A big-endian system would read this as U+20AC, which is the euro sign (€). A little-endian system, however, would read it as U+AC20, which is the character 갠 in the Korean Hangul alphabet.
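Python's built-in UTF-16 codecs make this ambiguity easy to reproduce; the sketch below decodes the same two bytes with both byte orders:

```python
raw = bytes([0x20, 0xAC])          # the two bytes 20h ACh, with no other context
print(raw.decode("utf-16-be"))     # € : read big-endian as U+20AC
print(raw.decode("utf-16-le"))     # 갠 : read little-endian as U+AC20
```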
This problem was solved with the byte order mark (BOM), a special Unicode character with the code U+FEFF, placed at the beginning of a Unicode text file to indicate its byte order (endianness). In big-endian format the BOM is stored as FEh FFh, and in little-endian format it is stored as FFh FEh. When a system reads these two bytes, it can recognize whether the data that follows is in big-endian or little-endian format.
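As a quick sketch, the constants in Python's codecs module show the two byte sequences, and the generic "utf-16" codec prepends the appropriate BOM automatically:

```python
import codecs

print(codecs.BOM_UTF16_BE)       # b'\xfe\xff' -> BOM as stored in big-endian files
print(codecs.BOM_UTF16_LE)       # b'\xff\xfe' -> BOM as stored in little-endian files
print("A".encode("utf-16"))      # BOM followed by 'A', using this machine's byte order
```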
With time, Unicode grew significantly, and along with it grew the complexities and requirements around it. Apart from supporting the world's living languages, it was expected to support extinct scripts, preserving them for historical study, as well as the delightful new symbols that have become an integral part of our daily conversations: emojis. It was soon realized that a 16-bit character set would not suffice for these growing needs, and Unicode has since been expanded to a 21-bit code space, supporting over one million different characters.
Although Unicode was expanded to support every possible kind of text, computer systems do not treat it with the same uniformity. Different platforms, languages, and applications have varying requirements for how text is stored, transmitted, and processed, stemming from factors such as memory constraints, compatibility needs, and performance considerations. As a result, several different methods have been defined for storing and transmitting Unicode text, collectively called Unicode transformation formats (UTFs).
Unicode transformation formats (UTF)
The most obvious and straightforward format is UTF-32, a 32-bit encoding scheme that represents every possible Unicode character as a fixed 32-bit (4-byte) value. All characters, whether simple Latin letters or complex ideographs, are stored as 4 bytes. Since each character is exactly 4 bytes, calculating the position of a character within a string is straightforward: you can jump directly to the nth character by multiplying n by 4 bytes.
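A short Python sketch illustrates this fixed-width indexing (the example string is mine):

```python
text = "Hello😊"
encoded = text.encode("utf-32-be")        # every character occupies exactly 4 bytes

n = 5                                      # index of the emoji
nth = encoded[4 * n : 4 * (n + 1)]         # jump straight to character n
print(nth.decode("utf-32-be"))             # 😊
print(len(encoded))                        # 24 bytes for 6 characters
```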
However, the UTF-32 format is bloated: any text represented in UTF-32 takes four times as much memory, storage, and bandwidth as its ASCII equivalent. To save space, other transformation formats were developed. In UTF-16, the most common characters (those in the Basic Multilingual Plane (BMP), which covers most of the world's writing systems) are represented using 2 bytes (16 bits), while characters outside the BMP (rare symbols, emoji, etc.) are encoded using 4 bytes (a pair of 16-bit values called a surrogate pair). This offers a good balance between memory efficiency and the ability to represent all Unicode characters, but navigating within strings becomes a challenge, as some characters take 2 bytes and others take 4.
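This sketch shows the difference in width; the byte counts come from Python's UTF-16 codec without a BOM, and the sample characters are mine:

```python
for ch in ("A", "€", "😊"):
    size = len(ch.encode("utf-16-be"))     # no BOM, so only the character's own bytes
    print(f"{ch!r}: U+{ord(ch):04X}, {size} bytes in UTF-16")
# 'A' and '€' sit in the BMP (2 bytes); '😊' needs a surrogate pair (4 bytes).
```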
The most common format, used extensively throughout the internet, is UTF-8. It uses variable-length encoding, where a character can take 1 to 4 bytes depending on its Unicode code point. ASCII characters in the range U+0000 to U+007F take 1 byte, which makes UTF-8 backward compatible with ASCII. Characters in the range U+0080 to U+07FF are stored as 2 bytes, and so on. Web pages and most modern applications prefer UTF-8 for its efficiency and compatibility, while UTF-16 and UTF-32 can be more suitable in environments where character indexing and simplicity are priorities.
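Here is a final sketch of those widths, again using Python's built-in codecs (the sample characters are my own choice):

```python
for ch in ("A", "£", "€", "😊"):
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
# 'A' (1 byte) is plain ASCII; '£' takes 2 bytes, '€' 3 bytes, and '😊' 4 bytes.
```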
So, next time you encounter an emoji, you can look up its Unicode code point at www.unicode.org 😊.