Cerner and the Apache Software Foundation

At the beginning of this year, we announced that Cerner became a bronze-level sponsor of the non-profit Apache Software Foundation (ASF). Many of the open source projects we use and contribute to are under the ASF umbrella, so supporting the mission and work of the ASF is important to us.

We’re happy to announce that Cerner has now increased our sponsorship of the ASF to become a silver-level sponsor. Open source continues to play an integral role in both our architecture and our engineering culture. We’ve blogged and spoken at conferences about how several ASF projects serve as core foundational components of our architecture, and several of our tech talks have focused on ASF projects.

Further increasing our sponsorship of the ASF reaffirms our continued support for an organization that provides homes for numerous open source projects that are important not only to us, but the larger development community.

Closures & Currying in JavaScript

Preface

I have been asked many times what closures are and how they work. There are many resources available for learning this concept, but they are not always clear to everyone. This led me to put together my own approach to explaining it.

I will supply code samples. //> denotes an output or return.

Before discussing closures, it is important to review how functions work in JavaScript.

Introduction to functions

If a function does not have a return statement, it will implicitly return undefined, which brings us to the simplest functions.

Noop

Noop typically stands for no operation; it accepts any arguments, does nothing with them, and returns undefined.

function noop() {}
noop("cat"); //> undefined

Identity

The identity function takes in a value and returns it.

function identity(value) {
  return value;
}

identity("cat"); //> "cat"
identity({a: "dog"}); //> Object {a: "dog"}

The important thing to note here is that the variable (value) passed in is bound to that function’s scope. This means that it is available to everything inside the function and unavailable outside of it. There is one exception: objects are passed by reference, which will prove useful with closures and currying.

Functions that evaluate to functions

Functions are first-class citizens in JavaScript, which means that they are objects. Since they are objects, they can take functions as parameters, have methods bound to them, and even return functions.

function foo() {
  return function () {
    return true;
  }
}

foo()(); //> true

This is a function that returns a function which returns true.

Functions take arguments, and those arguments can be values or reference types, such as functions. If you return a function, you are returning a reference to that function, not a new copy (even if the function was only just created in order to be returned).

Closures

Creating a closure is nothing more than accessing a variable outside of a function’s scope (using a variable that is neither bound on invocation nor defined in the function body).

To elaborate, the parent function’s variables are accessible to the inner function. If the inner function uses its parent’s (or parent’s parent’s, and so on) variables, they will persist in memory as long as the accessing function(s) are still referenceable. In JavaScript, referenceable variables are not garbage collected.

Let’s review the identity function:

function identity(a) { return a; }

The value, a, is bound inside of the function and is unavailable outside of it; there is no closure here. For a closure to be present, there would need to be a function within this function that would access the variable a.

Why is this important?

  • Closures provide a way to associate data with a method that operates on that data.
  • They enable private variables in a global world.
  • Many patterns, including the fairly popular module pattern, rely on closures to work correctly.

Due to these strengths, and many more, closures are used everywhere. Many popular libraries utilize them internally.
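For instance, here is a minimal sketch of the module pattern mentioned above (a hypothetical counter module; the names are mine for illustration). The returned methods close over the count variable, keeping it private:

var counter = (function () {
  var count = 0; // Private - only reachable through the closures below

  return {
    increment: function () {
      count += 1;
      return count;
    },
    reset: function () {
      count = 0;
    }
  };
}());

counter.increment(); //> 1
counter.increment(); //> 2
counter.count; //> undefined (count is private)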

Let’s take a look at an example of closure in action:

function foo(x) {
  function bar(y) {
    console.log(x + y);
  }

  bar(2);
}

foo(2); // will log 4 to the console

The outer function (foo) takes a variable (x), which is bound to that function when invoked. When the internal function (bar) is invoked, x (2) and y (2) are added together, then logged to the console as 4. bar is able to access foo’s x variable because bar is created within foo’s scope.

The takeaway here is that bar can access foo’s variables because it was created within foo’s scope. A function can access variables in its own scope and up the chain to the global scope. It cannot access the scopes of functions declared within it or parallel to it.

Note, however, that a function inside of a function doesn’t have to reference variables outside of its own scope, and when it doesn’t, no closure is created. Recall the example function which returned a function that evaluates to true:

function foo(x) {
  // does something with x or not
  return function () {
      return true;
  }
}

foo(7)(); //> true

No matter what is passed to foo, a function that evaluates to true is returned. A closure only exists when a function accesses variable(s) outside of its immediate scope.

This leads to an important implication of closures: they enable you to define a data set once and reuse it across invocations. We’re talking about private variables here.

Without closures, you recreate the data per function call if you want to keep it private.

function foo() {
  var private = [0, 1, 2]; // Imaginary large data set - instantiated per invocation

  console.log(private);
}

foo(); //> [0, 1, 2]

We can do better! With a closure, we can keep the data set private yet instantiate it only once.

var bar = (function () {
  var private = [0, 1, 2]; // Same large imaginary data set - only instantiated once

  // As long as this function exists, it has a reference to the private variable
  return function () {
    console.log(private);
  }
}());

bar(); //> [0, 1, 2]

By utilizing a closure here, our big imaginary data set only has to be created once. Given the way garbage collection (automatic memory freeing) works in JavaScript, the existence of the internal function (which is returned and assigned to the variable bar) keeps the private variable from being freed, so it remains available for subsequent calls. This is really advantageous when you consider large data sets that may be created via Ajax requests, which have to go over the network.

Currying

Currying is the process of transforming a function with many arguments into an equivalent function with fewer arguments.

That sounds cool, but why would I care about that?

  • Currying can help you make higher order factories.
  • Currying can help you avoid continuously passing the same variables.
  • Currying can capture and remember various things, including state.

Let’s pretend that we have a function (curry) defined and set onto the Function prototype which turns a function into a curried version of itself. Please note that this is not a built-in feature of JavaScript.
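Since we’re pretending curry exists, here is one minimal sketch of what such a helper might look like (strictly speaking this caches arguments via partial application; the implementation is mine, not a standard one):

// A sketch of a curry helper - not built into JavaScript.
// It returns a copy of the function with the given arguments pre-applied.
Function.prototype.curry = function () {
  var fn = this;
  var cached = Array.prototype.slice.call(arguments);

  return function () {
    return fn.apply(this, cached.concat(Array.prototype.slice.call(arguments)));
  };
};

Note how a closure does the heavy lifting: the returned function remembers fn and cached long after curry itself has returned.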

function msg(msg1, msg2) {
  return msg1 + ' ' + msg2 + '.';
}

var hello = msg.curry('Hello,');

console.log(hello('Sarah Connor')); // Hello, Sarah Connor. 
console.log(msg('Goodbye,', 'Sarah Connor')); // Goodbye, Sarah Connor. 

By currying the msg function so the first argument is cached as “Hello,”, we can call a simpler function, hello, that only requires one argument to be passed. Doesn’t this sound similar to what a closure might be used for?
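In fact, we can achieve the same effect with a plain closure (makeHello is a hypothetical factory, written here to make the parallel explicit):

function makeHello(greeting) {
  // The returned function closes over greeting.
  return function (name) {
    return greeting + ' ' + name + '.';
  };
}

var helloAgain = makeHello('Hello,');
helloAgain('Sarah Connor'); //> "Hello, Sarah Connor."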

Discussions of functional programming concepts are often met with a sense of resistance.

The thing is, you’ve probably already been functionally programming all along. If you use jQuery, you certainly already do.

$("some-selector").each(function () {
  $(this).fadeOut();
  // other stuff to justify the each
});

Another place you may have seen this is in the map function for arrays.

var myArray = [0, 1, 2];
console.log(myArray.map(function (val) {
  return val * 2;
}));

//> [0, 2, 4]

Conclusion

We’ve seen some examples of closures and how they can be useful. We’ve seen what currying is and, more importantly, that you’ve likely already been functionally programming even if you didn’t realize it. There is a lot more to learn about closures and currying, as well as functional programming in general.

I ask you to:

  1. Work with closures and get the hang of them.
  2. Give currying a shot.
  3. Embrace functional programming as an additional tool that you can utilize to enhance your programs and development workflow.


Bonus

Check out how you can utilize closures and currying to manage state within a stateful function:

function setFoo(state) {
  if (state === "a") { // Specific state
      return function () {
          console.log("State a for the win!");
      };
  } else if (state) { // Default state
      return function () {
        console.log("Default state");
      };
  }
  // Empty function since no state is desired. This avoids invocation errors.
  return function () {};
}

var foo = setFoo("a"); // Set to the specific state (a)
foo(); //> "State a for the win!";

foo = setFoo(true); // Set foo to its default state
foo(); //> "Default state"

foo = setFoo(); // Set foo to not do anything
foo(); //> undefined
// etc

Bonus 2

Check out how closures and currying can be used to create higher-order functions that create methods on the fly: http://jsfiddle.net/GneatGeek/A9WRb/
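In case the fiddle is unavailable, here is a small sketch of the general idea (my own example, not the fiddle’s code): a factory that uses a closure to stamp out accessor methods on the fly.

function makeAccessor(prop) {
  // Each call produces a new function that closes over prop.
  return function (obj) {
    return obj[prop];
  };
}

var getName = makeAccessor("name");
getName({ name: "Sarah Connor" }); //> "Sarah Connor"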

Intern HackFest 2014

Ten teams of two to four Cerner interns competed in a week-long HackFest this summer, working to solve any problem they put their minds to. The competition culminated in a presentation and judging of projects, with prizes of Raspberry Pi kits for each member of the second-place team and Leap Motions for each member of the winning team. From mobile apps to machine learning algorithms to drones, this year’s Summer Intern HackFest has been one for the books.

Our team, Rubber Duck Dynasty, was made up of Umer Khan (University of Notre Dame), Ryan Boccabella (University of Notre Dame), MaKenzie Kalb (Vanderbilt University), and Jake Gould (University of Kansas).

We were excited to get to work on the first night, once the week-long competition had commenced. Since the beginning of the summer, all of us had been impressed with the caliber of talent Cerner brought into the Software Engineer Internship program, and all nine teams we were up against were made up of remarkably smart, driven college students from all over the country. One of the most difficult parts of the HackFest was deciding on an interesting and competitive project that could feasibly be completed in only a week (without too many sleepless nights). One of our four team members was on the iOS team and convinced us that an iOS game was the way to go. We wanted to make a game that we would be excited to show our friends as well as the judges.

We ended up building an app called Encore. It is a turn-based musical game revolving around the creation and mirroring of three-second tunes between users. Tunes are created using four arpeggio-based tones from real piano, guitar, or tenor trombone recordings. The initiating iOS device sends the data to the Parse server using the Parse API for iOS. Parse stores this data on the server and sends a push notification to the receiving iOS device. Each time a new game is created, an activity is logged on the server to keep track of the game data. When the receiving user selects the game, the app downloads the game data from the server and starts the game. Once the game data is downloaded, the app decodes an array of dictionaries of instrument keys and times and converts the array into audio playback; this allowed for faster upload and download times, as well as significantly smaller game data files. The receiving user hears the tune and immediately attempts to replay it. Scoring is accomplished using the Needleman-Wunsch algorithm for sequence alignment. The receiving user then has their chance to create a tune, and the melodious competition continues.

Over the week, we began to get to know our teammates even better than we probably wanted to. Passion is the main word that comes to mind when we reminisce about this highlight week of our summer. From the uncertainty of overhearing other groups huddled in a room talking excitedly about cutting-edge technologies, to the shrieks of excitement when a test finally passed (which perhaps woke many a consulting-intern roommate), this HackFest was filled with memories all around. As we went out for a celebratory dinner the night before Monday morning’s presentations, the satisfaction of completion was sweet in the air. Sitting there, playing our noisy pride and joy on our phones at the table, we agreed that the week had already been an excellent experience…and the real judging hadn’t even started yet.

Sound checks were full of nerves and excitement the morning we presented our project. Each team had a mere five minutes to “sell” what had consumed more of the past week than sleep, a challenge everyone was hoping to ace. Later that afternoon, when the esteemed judges Chris Finn, Michelle Brush, and Jenni Syed were announced as the event began, the caliber of the resources Cerner provides for its many interns was standing right in front of us. We heard from many enthusiastic, impressive groups that afternoon. The presentations showcased many feats of great teamwork and skill: a recommendation engine, a dashboard for developers, a chat website, a facial-recognition Android app, an iOS game, a machine learning algorithm, a Twitter-controlled drone, and a music website.

After a delicious ice cream break while scores were deliberated and the judges provided valuable feedback for each team, the moment of anticipation was upon us. All teams certainly finished the day with the ultimate reward of new skills learned, friends made, and a fantastic project that some are undoubtedly still building on. As the first and second place teams were called to the stage, Team Rubber Duck Dynasty was surprised and thrilled to be among them. And as the runner-up, Team Marky Mark and the Funky Bunch, received their Raspberry Pi kits, we were amazed to find out each of us was taking home our very own Leap Motion.

We returned to our actual teams late that afternoon, proud of our accomplishments and brand-new owners of cutting-edge technology. We received congratulations from our superiors and mentors, many of whom had been our biggest encouragers to participate and our supporters throughout the week. The numerous empowered associates who have guided us through this summer have been an unbelievable community – a community that all of us are incredibly grateful to have been a part of.

The Plain Text Is a Lie

There is no such thing as plain text

“But I see .txt files all the time,” you say. “My source code is plain text,” you claim. “What about web pages?!” you frantically ask. True, each of those things is composed of text. The plain part is the problem. Plain denotes default or normal, and there is no such thing. Computers store and transmit text in a number of ways, each of them anything but plain. If you write software, design websites, or test systems where even a single character of text is accepted as input, displayed as output, transmitted to another system, or stored for later – please read on to learn why the plain text is a lie!

The topic of text handling applies to many disciplines:

  • UX/web designers – Your UX is the last mile of displaying text to users.
  • API developers – Your APIs should tell your consumers what languages, encodings and character sets your service supports.
  • DBAs – You should know what kinds of text your database can handle.
  • App developers – Your apps should not crash when non-English characters are encountered.

After reading this article you will …

  • … understand why text encodings are important.
  • … have some best practices for handling text in your tool belt.
  • … know a bit about how computers deal with text.

This topic has been extensively written about already. I highly recommend reading Joel Spolsky’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). You should also read up on how your system handles strings. Then, go read how the APIs you talk to send/receive strings. Pythonistas, check out Ned Batchelder’s Pragmatic Unicode presentation.

OK, let’s get started!

Part I – Gallery of FAIL or “When text goes wrong, by audience”

Let’s start off by demonstrating how text handling can fail, and fail hard. The following screen shots and snippets show some of the ways text handling can fail and who should care about the type of failure.

UX and web people

The above image shows the English Wikipedia article on Résumés with garbled text. Garbled text can happen if your web pages don’t specify an encoding or character set in your markup; specifying the wrong encoding can also cause it. XML and JavaScript need correct character sets too. It’s important to note that no error or exception was raised here: the text looks wrong to the user, but the failure happens silently.

The article on Tokyo above is displayed in a language (Aramaic) that my fonts don’t support. Instead of each symbol, we see a box with a number identifying the un-showable character. If you think that example is too contrived, here is a more commonly used symbol: a sixteenth note from sheet music. Many perfectly valid characters are not supported by widely used fonts, and specialized web fonts might not support the characters you need.

API developers

//Fetch the Universal Declaration of Human Rights in Arabic
documentAPIClient.getTitle(docID=123)

The result of this API call (example source) is similar to the last two examples: nonsense text. This can happen if the client and server use different text encodings. By the way, this situation happens so often that there’s a term for it: Mojibake.

Here are some client/server scenarios resulting in Mojibake:

  • The server didn’t document their encoding and the client guessed the wrong encoding.
  • The server or client inherits the encoding of its execution environment (virtual machine, OS, parent process, etc.), but the execution environment’s settings changed from their original values.

DBAs

Database systems can be misconfigured such that characters sent to the database are not stored accurately. In this example, the offending characters are replaced with the imaginatively named Replacement Character (“�”). The original characters are forever lost. Worse still, replacement characters will be returned by your queries and ultimately shown to your users. Sometimes, offending characters will instead be omitted from the stored value or replaced with the nearest matching supported character. In both scenarios the database has mangled the original data.

App developers

org.scalatest.exceptions.TestFailedException: "d[é]funt" did not equal "d[é]funt"
at org.scalatest.MatchersHelper$.newTestFailedException(MatchersHelper.scala:160)
...
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

The top image shows the 500 page of an app that crashed on improperly encoded text. In the Scala error message (bottom), a properties file was read in ISO-8859-1 encoding but had UTF-8 encoded bytes in it, which caused the unit test to fail.

Your source code, web pages, properties files, and any other text artifact you work with has an encoding. Every tool in your development tool chain (local server, terminal, editor, browser, CI system, etc.) is a potential failure point if these encodings are not honored.

Part II – Avoid text handling problems

Ghost in the machine

You’ve seen examples of failure and (hopefully) are wondering how such failures can be avoided. To avoid failure you must ask yourself one question: “Can my system store and transmit a ghost?”

GHOST (code point U+1F47B) is a valid (albeit weird) part of the Unicode standard. Unicode is a system of storing and manipulating text that supports thousands of languages. Using Unicode properly will go a long way toward preventing text handling problems. Thus, if your system can store, transmit, read and write GHOST, then you’re doing it right. But how to handle this GHOST?

Some Terminology

You need to know some terms before the rest of this article will make any sense.

  • Unicode object – A datatype that lets you operate on Unicode text.
  • Byte-string – A sequence of bytes (octets).
  • Encode – To turn a Unicode object into a byte-string, where the bytes follow an encoding.
  • Encoding (noun) – A standard about what a byte means, like: when you see 01000001, it means “A”.
  • Decode – The inverse operation of encode: to turn a byte-string of a certain encoding into a Unicode object.

Remembering the difference between encode and decode can be difficult. One trick to keep them straight is to think of Unicode objects as the ideal state of being (thanks, Joel Spolsky) and byte-strings as strange, cryptic sequences. Encoding turns the ideal into a bunch of cryptic bytes, while decoding un-weirds a bunch of bytes back into the ideal state; something we can reason about. Some systems use different terms but the ideas still apply. For example: Java Strings are Unicode objects and you can encode/decode to/from byte-strings with them.
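To make the terminology concrete, here is a quick sketch in JavaScript (assuming an environment such as a modern browser or Node.js where TextEncoder and TextDecoder are available) that encodes and decodes a Unicode string:

var ghost = "\u{1F47B}"; // GHOST, U+1F47B

// Encode: Unicode object -> UTF-8 byte-string.
var bytes = new TextEncoder().encode(ghost);

// Decode: byte-string -> Unicode object.
var roundTripped = new TextDecoder("utf-8").decode(bytes);

console.log(bytes); //> Uint8Array [240, 159, 145, 187]
console.log(roundTripped === ghost); //> true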

Now that you’ve got the necessary terminology under your belt, let’s prevent text handling problems in our system by making a sandwich; a Unicode sandwich!

Make a Unicode sandwich

Analogy credit: Ned Batchelder coined the Unicode sandwich analogy in his Pragmatic Unicode presentation at PyCon 2012 (video). It’s so clever that I can’t resist re-using it in this article!


In this analogy the pieces of bread on the top and bottom are regions of your code where you deal with byte-strings. The meat in the middle is where your system deals in Unicode objects. The top bread is input into your system such as database query results, file reads or HTTP responses. The bottom bread is output from your system such as writing files or sending HTTP responses. The meat is your business logic.

Good sandwiches are meaty

Your goal is to keep the bread thin and the meat thick. You can achieve this by decoding from byte-strings to Unicode objects as early as you can; perhaps immediately after arrival from another system. Similarly, you should do your encoding from Unicode objects into byte-strings at the last possible moment, such as right before transmitting text to another system.
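For example, here is a minimal sketch of the sandwich in Node.js (input.txt and output.txt are hypothetical file names):

var fs = require("fs");

// Top bread: decode bytes into a Unicode string as early as possible.
var text = fs.readFileSync("input.txt", { encoding: "utf-8" });

// Meat: business logic operates purely on Unicode strings.
var shouted = text.toUpperCase();

// Bottom bread: encode back to bytes at the last possible moment.
fs.writeFileSync("output.txt", shouted, { encoding: "utf-8" });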

Working with Unicode inside your system gives you a common ground of text handling that will largely avoid the errors we’ve seen at the top of this article. If you don’t deal in Unicode inside your system then you are limiting the languages you support at best and exposing yourself to text handling bugs at worst!

The best sandwich bread is UTF-8

Your system ultimately needs to send and receive byte-strings at some point, so you must choose an encoding for your byte-strings. Encodings are not created equal! Some encodings only support one language. Some support only similar languages (for example, German and French but not Arabic). Never assume your system will only encounter languages you speak or write! Ideally you will choose encodings that support a great many languages.

UTF-8 is the best general purpose encoding for your byte-strings. You’ll learn why UTF-8 is an excellent encoding choice later in this article in the Unicode: One standard to rule them all section. For now I recommend you:

  • Choose UTF-8 for all byte-strings.
  • Configure your system to use this encoding explicitly. Do not rely on the parent system (OS, VM, etc.) to provide an encoding since system settings might change over time.
  • Document your encoding choice in both public facing and internal documentation.

The UTF-8 encoding supports all the text you’d ever want. Yet, in this imperfect world you might be forced to use a more limited encoding such as ISO-8859-1 or Windows-1252 when interfacing with other systems. Working with a limited encoding presents problems when decoding to and encoding from Unicode: not every encoding supports the full Unicode range of characters. You must test how your system converts between your byte-strings and Unicode objects. In other words, test between the meat and the bread.

Testing between the meat and the bread

The critical areas to test are where byte-strings are decoded to Unicode objects and where Unicode objects are encoded into byte-strings. If you’ve followed the advice of this article thus far, then the rest of your app logic should operate exclusively on Unicode objects. Here is a handy table of how to test the regions of your system that encode and decode:

Scenario: My input encoding doesn’t support full Unicode.
Test strategy: Test that supported non-English characters† are faithfully decoded to Unicode.

Scenario: My output encoding doesn’t support full Unicode.
Test strategy: Test that supported non-English characters† are faithfully encoded to byte-strings, and test that your system behaves correctly when asked to encode unsupported characters.

Scenario: My input (output) encoding supports full Unicode.
Test strategy: Test that non-English characters† are faithfully decoded (encoded) to (from) Unicode.

† English characters and Arabic numerals (0 - 9) are bad test cases because their byte values are identical across many encodings.

Correctly is in the eye of the beholder. Some systems choose to raise an exception. Others choose to replace the offending character with a replacement character. Lastly, some systems simply omit the offending character. The choice is up to you, but they’re all terrible. Seriously, just use UTF-8.
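As a sketch of those choices in JavaScript (again assuming TextDecoder is available), decoding invalid bytes can either substitute the Replacement Character or raise an error:

// 0xFF can never appear in valid UTF-8.
var bad = new Uint8Array([0xff]);

// Default behavior: substitute the Replacement Character.
new TextDecoder("utf-8").decode(bad); //> "�"

// Opt-in behavior: throw instead of substituting.
new TextDecoder("utf-8", { fatal: true }).decode(bad); // throws a TypeError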

Unicode sandwich applies to new projects and legacy systems

Using UTF-8 for I/O, Unicode inside and testing the in-between points will save you from pain and bugs. If you’re building a new system then you have the opportunity to design it with Unicode in mind. If you have an existing system, it is worth your time to audit how your system handles text.

With the practical stuff out of the way, let’s dive deeper into computers and text!

Part III – Encodings, Unicode and how computers handle text

We’ve talked about how you should use Unicode, encodings and byte-strings in your system to handle text. You may be wondering why text handling is so painful at times. Why are there so many encodings and why don’t they all work together in harmony? I’ll attempt to explain a bit of history behind text handling in computers. Understanding this history should shed some light on why text handling can be so painful.

To make things interesting, let’s pretend we are inventing how computers will handle text. Also assume we live in the United States and speak only English. That’s a pretty ignorant assumption for real world software development, but it simplifies our process.

ASCII: Works great (if you want to ignore most of the world)

Our challenge is to invent how computers handle text. Morse code is an encoding that pre-dates digital computers but provides a model for our approach: Each character has a transmission sequence of dots and dashes to represent it. We’ll need to make a few changes and additions though…


Rather than dots and dashes we can use 1’s and 0’s (binary). Let’s also use a consistent number of bits per character so that it’s easy to know when one character ends and another begins. To support US English we need to map a binary sequence to each of the following:

  • a-z
  • A-Z
  • 0-9
  • " " (space)
  • !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
  • Control characters like “ring a bell”, “make a new line”, etc.

That’s 95 printable characters plus some control characters, for a total of 128. 128 is 2⁷, so we can send these characters in seven-bit sequences. Since computers use eight-bit bytes, let’s decide to send eight bits per character but ignore the eighth bit. We have just invented the ASCII encoding!
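As a quick illustration in JavaScript (whose classic character-code APIs work fine for the ASCII range):

"A".charCodeAt(0); //> 65
(65).toString(2); //> "1000001" (seven bits)
String.fromCharCode(65); //> "A"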

ASCII forms the root influence of many text encodings still used today. In fact, at one time ASCII was the law: U.S. President Lyndon B. Johnson mandated that all computers purchased by the United States federal government support ASCII in 1968.


International and OEM standards: Supporting other languages

Starting with similar languages to US English

We need more space to pack in more symbols if we want to support other languages and other symbols like currencies. It seems reasonable to assume that people typically deal with a block of languages that are geographically or politically related, and, when we’re lucky, those languages share many of the same symbols. Given that assumption, we can create several standards, each one for a block of languages!

For each block, we can keep the first 128 characters as-is from ASCII (identical bit sequences) so that the US English characters and Arabic numerals are still supported. We can then use the eighth bit for data instead of ignoring it. That would give us eight bits per character and a total of 256 characters to work with (double ASCII’s paltry 128). Now let’s apply that eighth bit.

A bunch of countries in Western Europe use the same Latin alphabet plus special diacritics (also known as accent marks) like ü, é or ß. In fact, we can pack enough extra characters into those last 128 slots to support 29 other languages, like Afrikaans, German, Swahili and Icelandic. Our Western European language block encoding is ready! We call this type of encoding a single-byte encoding because every character is represented by exactly one byte.
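In JavaScript terms (TextDecoder can decode legacy single-byte encodings, though TextEncoder only produces UTF-8), one byte is one character under such an encoding:

// 0xE9 is "é" in ISO-8859-1: one byte, one character.
new TextDecoder("iso-8859-1").decode(new Uint8Array([0xe9])); //> "é"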


Additional single byte encodings for other language blocks

We can repeat the same process we used to create our Western European language encoding to develop other single-byte encodings for other language blocks, each a 256-character set! To give one more example, let’s build a single-byte encoding for Arabic.

Again, we take the first 128 ASCII characters as-is, then fill up the last 128 with the Arabic alphabet. We’ve got some space left over: Arabic has some diacritics as well, so let’s use some of the leftover slots to hold diacritic marks that are only valid when combined with other letters.

Some languages, such as Chinese, Japanese and Korean, don’t even fit in 256 characters. That’s OK; we’ll just use multiple bytes per character to get more room. As you may have guessed, these encodings are called multibyte encodings. Sometimes we choose to use the same number of bytes for every character (fixed-width multibyte encodings) and sometimes we choose different byte lengths (variable-width multibyte encodings) to save space.

Ratifying our encodings to standards

After we’ve built several of these encodings (Russian, Greek, Simplified Chinese, etc.) we can ratify them as international standards, such as the ISO-8859 family for single-byte encodings. We previously built ISO-8859-1 (Western European) and ISO-8859-6 (Latin/Arabic). International standards for multibyte encodings exist too. People who use the same standard can communicate without problems.

The international standards like ISO-8859 are only part of the story. Companies like Microsoft and IBM created their own standards (so-called OEM standards or code pages). Some OEM standards map to international standards, some almost-but-not-quite map (see Windows-1252) and some are completely different.

Our standards have problems

Our standards and code pages are better than ASCII but there are a number of problems remaining:

  • How do we intermix different languages in the same document?
  • What if our standards run out of room for new symbols?
  • There is no Rosetta Stone to allow communication between systems that use different encodings.

Enter Unicode.

Unicode: One standard to rule them all


As mentioned earlier, Unicode is a single standard supporting thousands of languages. Unicode addresses the limitations of byte encodings by operating at a higher level than simple byte representations of characters. The foundation of Unicode is an über list of symbols chosen by a multinational committee.

Unicode keeps a gigantic numbered list of all the symbols of all the supported languages. The items in this list are called code points and are not concerned with bytes, how computers represent them, or what they look like on screen. They’re just numbered items, like:

a LATIN SMALL LETTER A – U+0061

東 CJK UNIFIED IDEOGRAPH-6771 (Pinyin: dōng; Chaizi: shi, ba, ri) – U+6771

☃ SNOWMAN – U+2603
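In JavaScript (ES2015 added code-point-aware string methods), you can inspect and build these items directly:

"a".codePointAt(0); //> 97 (0x61, i.e. U+0061)
String.fromCodePoint(0x2603); //> "☃"
"\u{1F47B}".codePointAt(0).toString(16); //> "1f47b"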

We have virtually unlimited space to work with: the Unicode standard supports a maximum of 1,114,112 items. That is more than enough to express the world’s active written languages, some historical languages and miscellaneous symbols. Some of the slots are even left undefined for users to decide what they mean. These spaces have been used for wacky things like Klingon and Elvish.

Fun fact: Apple Inc. uses U+F8FF in the Private Use Area of Unicode for their logo symbol (). If you don’t see the Apple logo in parentheses in the preceding sentence, then your system doesn’t agree with Apple’s use of U+F8FF.

OK, we have our gigantic list of code points. All we need to do is devise an encoding scheme to encode Unicode objects (which we now know are lists of code points) into byte-strings for transmission over the wire to other systems.

UTF-8

UTF-8 encodes every Unicode code point as a sequence of one to four bytes. Here are some cool features of UTF-8:

  • Popularity – It has been the dominant encoding of the World Wide Web since 2010.
  • Simplicity – No need to transmit byte order information or worry about endianness in transmissions.
  • Backwards compatibility – The first 128 byte sequences are identical to ASCII.
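You can see the variable width directly (again via TextEncoder, which always produces UTF-8):

var enc = new TextEncoder();
enc.encode("A").length; //> 1 (U+0041)
enc.encode("é").length; //> 2 (U+00E9)
enc.encode("☃").length; //> 3 (U+2603)
enc.encode("\u{1F47B}").length; //> 4 (U+1F47B, GHOST)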

UCS-2: Old and busted

UCS-2 is a fixed width, two-byte encoding. In the mid-nineties, Unicode added code points that cannot be expressed in the two-byte system. Thus, UCS-2 is deprecated in favor of UTF-16.

  • UCS-2 was the original Java String class’s internal representation
  • CPython 2 and 3 use UCS-2 if compiled with default options
  • Microsoft Windows OS API used UCS-2 prior to Windows 2000

UTF-16: UCS-2++

UTF-16 extends UCS-2 by adding support for the code points that can’t be expressed in a two-byte system. You can find UTF-16 in:

  • Windows 2000 and later’s OS API
  • The Java String class
  • .NET environment
  • OS X and iOS’s NSString type

UTF-32: Large bytes, simple representation

UTF-32 is a simple 1:1 mapping of code points to four-byte values. CPython uses UTF-32 for its internal representation of Unicode if compiled with a certain flag.

Conclusion

We’ve seen how text handling can go wrong. We’ve learned how to design and test our systems with Unicode in mind. Finally, we’ve learned a bit of the history of text encodings. There is a lot more to the topic of text, but for now I ask you to do the following:

  1. Examine your system to see if you’re using Unicode inside
  2. Use UTF-8 when reading and writing data
  3. Know that the plain text is a lie!

Thanks for reading!