Unicode Showcase

6 min readFeb 27, 2024

////////////////////////////////////////////////////////////////////////////

Unicode 🐞𓀀 過 _________________________

maXbox Starter 120 — Code with Unicode in ᛗᚨᛪᛒᛟᛪ.

“Language is a virus — Laurie Anderson.

Source: 1295_412_Zeosutils_sha64_uc2unicode.txt

Unicode is a character encoding system that assigns a codepoint (value) to every character and symbol in the world’s languages. Unicode is the only encoding system that ensures you may get or combine data using any combination of languages because no other encoding stand-ard covers all languages with codepoints.

So today we deal with unicode codepoint code!

XML, Java, JavaScript, LDAP, OpenSSL and other web-based technologies all require Unicode like jap. 過労死 karō shi.

UTF-8, a variable length encoding method in which one represents each written symbol to four-byte code, and UTF-16, a fixed width encoding scheme (fixed as a surrogate pair) in which a two-byte code represents each written symbol, are the two most prevalent Unicode implementations for computer systems.

Be aware of the difference between UTF and Unicode:
UTF-8/16/32 is an encoding system — Unicode is a character set definition.

Lets start with a list of emoji signs out from the album Improver:

BTB BeatTheBlues: 🐞
PC PlayChess: 📌
CC CaryCopper: 🎧
EG EatGarlic: 💮
EM ExerceMeditation: 🦋
PR PractiseRunning: 🚀
BNS BettNordSued: 🔭
OB OlemosBruja: 🦢
LWN ListenWhiteNoise: 🌈
surprise: 😅

🐞📌🎧💮🦋🚀🔭🦢🌈

And this is how we code it (UTF-16 LE (little endian)):

writeln(‘PC PlayChess: ‘#$d83d#$dccc)
writeln(‘CC CaryCopper: ‘#$d83c#$dfa7);
writeln(‘EG EatGarlic: ‘#$d83d#$dcae);
writeln(’EM ExerceMeditation: ‘#$d83e#$dd8b);
writeln(‘PR PractiseRunning: ‘#$d83d#$de80);
writeln(‘BNS BettNordSued: ‘#$d83d#$dd2d);
writeln(‘OB OlemosBruja: ‘#$d83e#$dda2);
writeln(‘LWN ListenWhiteNoise: ‘#$d83c#$df08);
writeln(‘surprise sign: ‘#$d83d#$de05); //code point : 01 F6 05 u+0001f605 UTF32

As you can see with the surprise sign, we have a code point: u+0001f605 and this is also UTF-32, so each codepoint results in UTF-32. For example we can encode this sign as a QR-code and the surprise sign is: 😅

aQRCode:= TDelphiZXingQRCode.Create;
QRCodBmp:= TBitmap.Create;
form1:= getform2(700,500,123,'QR Draw UC PaintPerformPlatform PPP5');
try
 aQRCode.Data:= aTextline;
 aQRCode.Encoding:= qrcAuto; //TQRCodeEncoding(cmbEncoding.ItemIndex);

But then we get ?? or ð Qreader | Online QR Code Reader

so what’s wrong with this idea of 🀫 ? QR-code is text based!

Lets take another sign — the black star as unicode study:

https://www.compart.com/en/unicode/U+2605

Unicode Character “★” (U+2605)
Name: Black Star[1]
Unicode Version: 1.1 (June 1993)[2]
Block: Miscellaneous Symbols, U+2600 — U+26FF[3]
Plane: Basic Multilingual Plane, U+0000 — U+FFFF[3]
Script: Code for undetermined script (Zyyy) [4]

So where’s the (UTF-32) codepoint: u+00002605

UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding used to encode Unicode code points. In UTF-32, each code point is represented by exactly 32 bits (four bytes). Until the formal balloting is concluded, the term UTF-32 can be used to refer to the subset of UCS-4 characters that are in the range of valid Unicode code points. UTF-32 is the only fixed-length encoding, in contrast to all other Unicode transformation formats, which are variable-length encodings so it represents one Unicode code point and is exactly equal to that code point’s numerical value like u+00002605 as ★.

So what should I use? UTF8 or UTF16?

The main difference between them is that UTF8 is backwards compatible with ASCII. As long as you only use the first 128 characters, an application that is not Unicode aware can still process the data (which may be an advantage or disadvantage, depending on your scenario).

https://stackoverflow.com/questions/9818617/what-should-i-use-utf8-or-utf16

Pic1: 1295_Tut120_Unicdoeblock_Screenshot

Taking a skin-toned emoji, e.g. “🤦🏻‍♂️” is seventeen bytes: ‘\xf0\x9f\xa4\xa6\xf0\x9f\x8f\xbb\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f.

We can write this as UTF-16:

\ud83e\udd26\ud83c\udffb\u200d\u2642\ufe0f

or UTF-32:

u+0001f926u+0001f3fbu+0000200du+00002642u+0000fe0f

but wait those are more than four bytes! Yes those are 5 codepoints with fixed length and we encode as UTF-16 LE:

writeln(#$d83e#$dd26#$d83c#$dffb#$200d#$2642#$fe0f) //🤦🏻‍♂️ = 🤦🏻+ ♂️

>>> 🤦🏻‍♂️

as a proof: 5 Codepunkte gefunden — Codepoints

5 Codepunkte gefunden — Codepoints

https://codepoints.net/search?q=%F0%9F%A4%A6%F0%9F%8F%BB%E2%80%8D%E2%99%82%EF%B8%8F&na=

Note that if you wish a file to be encoded as UTF-8 when you save it you need to specify that when you save it. Like this:

const CP_UTF8 = 65001;

UTF-8 consists of 1, 2, 3, or 4 bytes per character. The codepoint U+1F605 is correctly encoded as #$F0#$9F#$98#$85.

UTF-16 consists of 2 or 4 bytes per character. The 4 byte sequences are needed to encode codepoints beyond U+FFFF (such as most Emojis). Only UCS-2 is limited to codepoints U+0000 to U+FFFF (this applies to Windows NT versions before 2000).

Pic2: 1295_tutor120_grapheme_edit.png

Unicode Character Examples

☸☹☺☻☼☾☿
한국어
日本語 (◕‿◕)
中文
ქართული
ไทย
বাংলা
فارسی
العربية
עברית
Українська
Русский ✞
Ελληνικά 🤦🏻‍♂️
Čšâêçñà một trò

UC Source Representation

Why we cannot use single “\u” to represent smiley within a string? Because when \u escape was designed, all Unicode chars could be represented by 2 bytes or 4 hexadecimal digits. So there are always 4 hexadecimal digits after \u in a Java string literal.
To represent a larger value of Unicode you need a larger hexadecimal number but that will break existing Java strings. So there Java or Delphi uses same approach as UTF-16 representation.

As an example, the letter A is represented in Unicode as U+0041 and in Ansi as just 41. So converting that would be pretty simple, but you must find out how the Unicode character is encoded. The most common are UTF-16 and UTF-8. UTF 16, is basically 2 bytes per character, but even that is an oversimplification, as a character may have more bytes. UTF-8 sounds as if it means 1 byte per character but can be 2 or 3. To further complicate matters, UTF-16 can be little endian or big endian. (U+0041 or U+4100).

Where it makes no sense in a language way is if you wanted to for example convert the arabic letter ain U+0639 ع to Ansi on an English locale machine. You can’t. Because you probably missing the code page mapping letter for the code point!

Pic3: 1295_tutor120_grapheme_edit2.png

In general, character set of hundreds thousands entries cannot be converted to character set of 127 entries without some loss of information or encoding scheme. If so, then you can use the Ord standard function to get the Unicode code-point value of whatever Unicode character you have.

An approach could be:

1.Encode the string to UTF-8 using the UTF8Encode function, which
returns a UTF8String type.

2.Convert the UTF-8 string to a byte array using the BytesOf function,
which returns a TBytes type.

3.Create a UTF-32 string using the TIdTextEncoding.UTF32.GetString
method, which takes a TBytes parameter and returns a UnicodeString
type.

Pic4: tutor119_regex_multicod.png

Last point in literally sense is surrogate as you can see in Pic3 above above.

Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16 that do not fit into a single 16-bit value. Leading surrogates, also called high surrogates, are encoded from D80016 to DBFF16, and trailing surrogates, or low surrogates, from DC0016 to DFFF16. They are called surrogates, since they do not represent characters directly, but only as a pair.

FAQ — UTF-8, UTF-16, UTF-32 & BOM (unicode.org)

By the way I never succeeded in receiving a Unicode sign from an Arduino board:

Conclusion

UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero as there are far fewer than 2 Unicode code points, needing actually only 21 bits). UTF-32 is a fixed-length encoding, in contrast to all other Unicode transformation formats, which are variable-length encodings.

A character set is a list of characters with unique numbers (these numbers are sometimes referred to as “code points”).

For example, in the Unicode character set, the number for A is 41.

Script: softwareschule.ch/examples/unicode3.htm

As a pdf: Unicode Tutor 120 (softwareschule.ch)