Intern HackFest 2014

Ten teams of two to four Cerner interns competed in a week-long HackFest this summer, working to solve any problem they put their minds to. The competition culminated in a presentation and judging of projects, with prizes of Raspberry Pi kits for each member of the second-place team and Leap Motions for each member of the winning team. From mobile apps, to machine learning algorithms, to drones…this year’s Summer Intern HackFest has been one for the books.

Our team, Team Rubber Duck Dynasty, was made up of Umer Khan (University of Notre Dame), Ryan Boccabella (University of Notre Dame), MaKenzie Kalb (Vanderbilt University), and Jake Gould (University of Kansas).

We were excited to get to work the first night of the week-long competition. Since the beginning of the summer, all of us had been impressed with the caliber of talent Cerner brought into the Software Engineer Internship program. All nine teams we were up against were made up of remarkably smart, driven college students from all over the country. One of the most difficult parts of the HackFest was deciding on an interesting and competitive project that could feasibly be completed in only a week (without too many sleepless nights). One of our four team members was on the iOS team, and convinced us that an iOS game was the way to go. We wanted to make a game that we would be excited to show our friends as well as the judges.

We ended up building an app called Encore. It is a musical turn-based game revolving around the creation and mirroring of three-second tunes between users. Tunes are created using four arpeggio-based tones from real piano, guitar, or tenor trombone recordings. The initiating iOS device sends the data to the Parse server using the Parse API for iOS. Parse stores this data on the server and sends a push notification to the receiving iOS device. Each time a new game is created, an activity is logged on the server to keep track of the game data. When the receiving user selects the game, the app downloads the game data from the server and starts the game. Once the game data is downloaded, the app decodes an array of dictionaries of instrument key and time and converts that array into audio playback; this allowed for faster upload and download times, as well as significantly smaller game data files. The receiving user hears and immediately attempts to replay the tune. Scoring is accomplished using a Needleman-Wunsch algorithm for sequence alignment. The receiving user then has their chance to create a tune, and the melodious competition continues.
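For a sense of how that scoring works: Needleman-Wunsch builds a dynamic-programming table that rewards matching notes and penalizes substitutions and gaps. The sketch below is a minimal Java illustration of the idea, with made-up note values and scoring weights; it is not Encore’s actual iOS implementation.

// Minimal Needleman-Wunsch sketch: aligns two note sequences and returns a
// similarity score. The match/mismatch/gap weights here are illustrative only.
public final class TuneScorer {
    private static final int MATCH = 1, MISMATCH = -1, GAP = -1;

    public static int align(int[] original, int[] attempt) {
        int n = original.length, m = attempt.length;
        int[][] score = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) score[i][0] = i * GAP;   // all gaps in the attempt
        for (int j = 0; j <= m; j++) score[0][j] = j * GAP;   // all gaps in the original
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = score[i - 1][j - 1]
                        + (original[i - 1] == attempt[j - 1] ? MATCH : MISMATCH);
                int up = score[i - 1][j] + GAP;
                int left = score[i][j - 1] + GAP;
                score[i][j] = Math.max(diag, Math.max(up, left));
            }
        }
        return score[n][m];  // higher score means a closer replay of the original tune
    }

    public static void main(String[] args) {
        int[] tune = {0, 2, 2, 3};    // hypothetical key indices for the original tune
        int[] replay = {0, 2, 3, 3};  // the receiving user's attempt
        System.out.println(align(tune, replay));  // prints 2 with these weights
    }
}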

Over the week, we got to know our teammates even better than we probably wanted to. Passion is the main word that comes to mind when we reminisce on this highlight of our summer. From the uncertainty of overhearing other groups huddled in a room talking excitedly about cutting-edge technologies, to the shrieks of excitement when a test finally passed (which perhaps woke many a consulting-intern roommate), this HackFest was filled with memories all around. As we went for a celebratory dinner the night before the presentations Monday morning, the satisfaction of completion was sweet in the air. Sitting there, playing our noisy pride and joy on our phones at the table, we agreed that the week had already been an excellent experience…and we hadn’t even started the real judging yet.

Sound checks were full of nerves and excitement the morning we presented our project. Each team had a mere five minutes to “sell” what had been more time consuming than sleep over the past week, a challenge everyone was hoping to ace. Later that afternoon, when the esteemed judges Chris Finn, Michelle Brush, and Jenni Syed were announced as the event began, the caliber of the resources Cerner provides for its many interns was standing right in front of us. We heard from many enthusiastic, impressive groups that afternoon. The presentations showcased many feats of great teamwork and skill: a recommendation engine, a developer dashboard, a chat website, a facial recognition Android app, an iOS game, a machine learning algorithm, a Twitter-controlled drone, and a music website.

After a delicious ice cream break while scores were deliberated and after the judges provided valuable feedback for each team, the moment of anticipation was upon us. All teams certainly completed the day with the ultimate reward of new skills learned, friends made, and a fantastic project that some are undoubtedly still building on. As the first and second place teams were called to the stage, Team Rubber Duck Dynasty was surprised and thrilled to be among them. And as the runner-up, Team Marky Mark and the Funky Bunch, received their Raspberry Pi kits, we were amazed to find out each of us was taking home our very own Leap Motion.

We returned to our actual teams late that afternoon, proud of our accomplishments and brand new owners of a cutting-edge technology. We received the congratulations of our superiors and mentors, many of whom were our biggest encouragers to participate and supporters throughout the week. The numerous empowered associates that have guided us through this summer have been an unbelievable community – a community that all of us are incredibly grateful to have been a part of.

The Plain Text Is a Lie

There is no such thing as plain text

“But I see .txt files all the time,” you say. “My source code is plain text,” you claim. “What about web pages?!” you frantically ask. True, each of those things is made of text. The plain part is the problem. Plain denotes default or normal. There is no such thing. Computers store and transmit text in a number of ways, and each is anything but plain. If you write software, design websites, or test systems where even a single character of text is accepted as input, displayed as output, transmitted to another system, or stored for later, please read on to learn why the plain text is a lie!

The topic of text handling applies to many disciplines:

  • UX/web designers – Your UX is the last mile of displaying text to users.
  • API developers – Your APIs should tell your consumers what languages, encodings and character sets your service supports.
  • DBAs – You should know what kinds of text your database can handle.
  • App developers – Your apps should not crash when non-English characters are encountered.

After reading this article you will …

  • … understand why text encodings are important.
  • … have some best practices for handling text in your tool belt.
  • … know a bit about how computers deal with text.

This topic has been extensively written about already. I highly recommend reading Joel Spolsky’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). You should also read up on how your system handles strings. Then, go read how the APIs you talk to send/receive strings. Pythonistas, check out Ned Batchelder’s Pragmatic Unicode presentation.

OK, let’s get started!

Part I – Gallery of FAIL or “When text goes wrong, by audience”

Let’s start off by demonstrating how text handling can fail, and fail hard. The following screenshots and snippets show some of the ways things go wrong and who should care about each type of failure.

UX and web people

The above image shows the English Wikipedia article on Résumés with garbled text. Garbled text can happen if your web pages don’t specify an encoding or character set in your markup. Specifying the wrong encoding can also cause garbled text. XML and JavaScript need correct character sets too. It’s important to note that no error or exception was raised here. The text looks wrong to the user, but the failure happens silently.

This article on Tokyo above is displayed in a language (Aramaic) that my fonts don’t support. Instead of a symbol, we see a box with a number identifying the un-showable character. If you think that example is too contrived, here is a more commonly used symbol: a 16th note from sheet music. Many perfectly valid characters are not supported by widely used fonts. Specialized web fonts might not support the characters you need.

API developers

//Fetch the Universal Declaration of Human Rights in Arabic
documentAPIClient.getTitle(docID=123)

The result of this API call (example source) is similar to the last two examples: nonsense text. This can happen if the client and server use different text encodings. By the way, this situation happens so often that there’s a term for it: Mojibake.

Here are some client/server scenarios resulting in Mojibake:

  • The server didn’t document their encoding and the client guessed the wrong encoding.
  • The server or client inherits the encoding of its execution environment (virtual machine, OS, parent process, etc.), but the environment’s settings have changed from their original values.

DBAs

Database systems can be misconfigured such that characters sent to the database are not stored accurately. In this example, the offending characters are replaced with the imaginatively-named Replacement Character (“�”). The original characters are forever lost. Worse still, replacement characters will be returned by your queries and ultimately shown to your users. Sometimes, offending characters will be omitted from the stored value or replaced with a nearest match supported character. In both scenarios the database has mangled the original data.

App developers

org.scalatest.exceptions.TestFailedException: "d[é]funt" did not equal "d[é]funt"
at org.scalatest.MatchersHelper$.newTestFailedException(MatchersHelper.scala:160)
...
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

The top image shows the 500 error page of an app that crashed on improperly encoded text. In the Scala error message (bottom), a properties file was read with ISO-8859-1 encoding but contained UTF-8 encoded bytes. This caused the unit test to fail.

Your source code, web pages, properties files, and any other text artifact you work with has an encoding. Every tool in your development tool chain (local server, terminal, editor, browser, CI system, etc.) is a potential failure point if these encodings are not honored.

Part II – Avoid text handling problems

Ghost in the machine

You’ve seen examples of failure and (hopefully) are wondering how such failures can be avoided. To avoid failure you must ask yourself one question: “Can my system store and transmit a ghost?”

GHOST (code point U+1F47B) is a valid (albeit weird) part of the Unicode standard. Unicode is a system of storing and manipulating text that supports thousands of languages. Using Unicode properly will go a long way to prevent text handling problems. Thus, if your system can store, transmit, read and write GHOST then you’re doing it right. But how to handle this GHOST?

Some Terminology

You need to know some terms before the rest of this article will make any sense.

  • Unicode object – A datatype that lets you operate on Unicode text.
  • Byte-string – A sequence of bytes (octets).
  • Encode – To turn a Unicode object into a byte-string, where the bytes follow an encoding.
  • Encoding (noun) – A standard about what a byte means, like: when you see 01000001, it means “A”.
  • Decode – The inverse operation of encode: to turn a byte-string of a certain encoding into a Unicode object.

Remembering the difference between encode and decode can be difficult. One trick to keep them straight is to think of Unicode objects as the ideal state of being (thanks, Joel Spolsky) and byte-strings as strange, cryptic sequences. Encoding turns the ideal into a bunch of cryptic bytes, while decoding un-weirds a bunch of bytes back into the ideal state: something we can reason about. Some systems use different terms, but the ideas still apply. For example, Java Strings are Unicode objects, and you can encode them to and decode them from byte-strings.
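To make that concrete, here is a minimal round trip in Java using the GHOST character from earlier; the String plays the role of the Unicode object and the byte[] plays the role of the byte-string.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodeDecodeDemo {
    public static void main(String[] args) {
        String ghost = new String(Character.toChars(0x1F47B)); // GHOST, U+1F47B

        // Encode: Unicode object -> byte-string (UTF-8 needs four bytes for GHOST)
        byte[] encoded = ghost.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(encoded)); // [-16, -97, -111, -69]

        // Decode: byte-string -> Unicode object
        String decoded = new String(encoded, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(ghost)); // true
    }
}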

Now that you’ve got the necessary terminology under your belt, let’s prevent text handling problems in our system by making a sandwich; a Unicode sandwich!

Make a Unicode sandwich

Analogy credit: Ned Batchelder coined the Unicode sandwich analogy in his Pragmatic Unicode presentation at PyCon 2012 (video). It’s so clever that I can’t resist re-using it in this article!


In this analogy the pieces of bread on the top and bottom are regions of your code where you deal with byte-strings. The meat in the middle is where your system deals in Unicode objects. The top bread is input into your system such as database query results, file reads or HTTP responses. The bottom bread is output from your system such as writing files or sending HTTP responses. The meat is your business logic.

Good sandwiches are meaty

Your goal is to keep the bread thin and the meat thick. You can achieve this by decoding from byte-strings to Unicode objects as early as you can; perhaps immediately after arrival from another system. Similarly, you should do your encoding from Unicode objects into byte-strings at the last possible moment, such as right before transmitting text to another system.
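Here is a minimal sketch of the sandwich in Java, assuming a simple file-in/file-out workflow; the file names and the trivial “business logic” are placeholders.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SandwichSketch {
    public static void main(String[] args) throws IOException {
        // Top bread: bytes come in from the outside world; decode immediately.
        byte[] inputBytes = Files.readAllBytes(Paths.get("input.txt"));
        String text = new String(inputBytes, StandardCharsets.UTF_8);

        // Meat: business logic operates only on Unicode Strings.
        String processed = text.trim().toUpperCase();

        // Bottom bread: encode back to bytes at the last possible moment.
        Files.write(Paths.get("output.txt"), processed.getBytes(StandardCharsets.UTF_8));
    }
}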

Working with Unicode inside your system gives you a common ground of text handling that will largely avoid the errors we’ve seen at the top of this article. If you don’t deal in Unicode inside your system then you are limiting the languages you support at best and exposing yourself to text handling bugs at worst!

The best sandwich bread is UTF-8

Your system ultimately needs to send and receive byte-strings at some point, so you must choose an encoding for your byte-strings. Encodings are not created equal! Some encodings only support one language. Some support only similar languages (for example, German and French but not Arabic). Never assume your system will only encounter languages you speak or write! Ideally you will choose encodings that support a great many languages.

UTF-8 is the best general purpose encoding for your byte-strings. You’ll learn why UTF-8 is an excellent encoding choice later in this article in the Unicode: One standard to rule them all section. For now I recommend you:

  • Choose UTF-8 for all byte-strings.
  • Configure your system to use this encoding explicitly (see the sketch after this list). Do not rely on the parent system (OS, VM, etc.) to provide an encoding since system settings might change over time.
  • Document your encoding choice in both public facing and internal documentation.
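On the second point, the difference between relying on the parent system and being explicit is easy to see in Java; this is just an illustrative snippet, not tied to any particular service.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {
    public static void main(String[] args) {
        String resume = "Résumé";

        // Implicit: uses the platform default, which can change between environments.
        byte[] implicitBytes = resume.getBytes();
        System.out.println("Platform default charset: " + Charset.defaultCharset());

        // Explicit: always UTF-8, no matter where the code runs.
        byte[] explicitBytes = resume.getBytes(StandardCharsets.UTF_8);
        System.out.println(implicitBytes.length + " vs " + explicitBytes.length + " bytes");
    }
}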

The UTF-8 encoding supports all the text you’d ever want. Yet, in this imperfect world you might be forced to use a more limited encoding such as ISO-8859-1 or Windows-1252 when interfacing with other systems. Working with a limited encoding presents problems when decoding to and encoding from Unicode: not every encoding supports the full Unicode range of characters. You must test how your system converts between your byte-strings and Unicode objects. In other words, test between the meat and the bread.

Testing between the meat and the bread

The critical areas to test are where byte-strings are decoded to Unicode objects and where Unicode objects are encoded into byte-strings. If you’ve followed the advice of this article thus far, the rest of your app logic should operate exclusively on Unicode objects. Here is a handy table of how to test the regions of your system that encode and decode:

Scenario: My input encoding doesn’t support full Unicode.
Test strategy: Test that non-English characters† are faithfully decoded to Unicode.

Scenario: My output encoding doesn’t support full Unicode.
Test strategy: Test that supported non-English characters† are faithfully encoded to byte-strings, and test that your system behaves correctly when asked to encode unsupported characters.

Scenario: My input (or output) encoding supports full Unicode.
Test strategy: Test that non-English characters† are faithfully decoded to (or encoded from) Unicode.

† English characters and Arabic numerals (0 - 9) are bad test cases because their byte values are identical across many encodings.

“Correctly” is in the eye of the beholder. Some systems choose to raise an exception. Others choose to replace the offending character with a replacement character. Lastly, some systems simply omit the offending character. The choice is up to you, but they’re all terrible. Seriously, just use UTF-8.
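Java, for example, exposes both behaviors, which makes them easy to unit test. A small sketch (the sample string is arbitrary):

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class UnsupportedCharacterDemo {
    public static void main(String[] args) {
        String input = "Tōkyō"; // 'ō' (U+014D) is not in ISO-8859-1

        // Behavior 1: silent replacement -- the unmappable 'ō' becomes '?'.
        byte[] replaced = input.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(replaced, StandardCharsets.ISO_8859_1)); // T?ky?

        // Behavior 2: fail loudly -- an exception is raised instead.
        CharsetEncoder strict = StandardCharsets.ISO_8859_1.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.encode(CharBuffer.wrap(input));
        } catch (CharacterCodingException e) {
            System.out.println("Refused to encode: " + e);
        }
    }
}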

Unicode sandwich applies to new projects and legacy systems

Using UTF-8 for I/O, Unicode inside and testing the in-between points will save you from pain and bugs. If you’re building a new system then you have the opportunity to design it with Unicode in mind. If you have an existing system, it is worth your time to audit how your system handles text.

With the practical stuff out of the way, let’s dive deeper into computers and text!

Part III – Encodings, Unicode and how computers handle text

We’ve talked about how you should use Unicode, encodings and byte-strings in your system to handle text. You may be wondering why text handling is so painful at times. Why are there so many encodings and why don’t they all work together in harmony? I’ll attempt to explain a bit of history behind text handling in computers. Understanding this history should shed some light on why text handling can be so painful.

To make things interesting, let’s pretend we are inventing how computers will handle text. Also assume we live in the United States and speak only English. That’s a pretty ignorant assumption for real world software development, but it simplifies our process.

ASCII: Works great (if you want to ignore most of the world)

Our challenge is to invent how computers handle text. Morse code is an encoding that pre-dates digital computers but provides a model for our approach: Each character has a transmission sequence of dots and dashes to represent it. We’ll need to make a few changes and additions though…


Rather than dots and dashes we can use 1’s and 0’s (binary). Let’s also use a consistent number of bits per character so that it’s easy to know when one character ends and another begins. To support US English we need to map a binary sequence to each of the following:

  • a-z
  • A-Z
  • 0-9
  • “ ”(space)
  • !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
  • Control characters like “ring a bell”, “make a new line”, etc.

That’s 95 printable characters plus some control characters, for a total of 128 characters. 128 is 2^7, so we can send these characters in seven-bit sequences. Since computers use eight-bit bytes, let’s decide to send eight bits per character but leave the eighth bit unused. We have just invented the ASCII encoding!
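As a quick illustration of the fixed-width idea, here is a throwaway Java snippet showing that ‘A’ fits comfortably in seven bits:

public class AsciiBits {
    public static void main(String[] args) {
        // Every ASCII character fits in seven bits; 'A' is 65, or 1000001 in binary.
        char c = 'A';
        System.out.println((int) c + " -> " + Integer.toBinaryString(c)); // 65 -> 1000001
    }
}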

ASCII forms the root influence of many text encodings still used today. In fact, at one time ASCII was the law: U.S. President Lyndon B. Johnson mandated that all computers purchased by the United States federal government support ASCII in 1968.


International and OEM standards: Supporting other languages

Starting with similar languages to US English

We need more space to pack in more symbols if we want to support other languages and other symbols like currencies. It seems reasonable that people typically deal with a block of languages that are geographically or politically related, and when we’re lucky those languages share many of the same symbols. Given that assumption we can create several standards; each one for a block of languages!

For each block, we can keep the first 128 characters as-is from ASCII (identical bit sequences) so that the US English characters and Arabic numerals are still supported. We can then use the eighth bit for data instead of ignoring it. That gives us eight bits per character and a total of 256 characters to work with (double ASCII’s paltry 128). Now let’s put that eighth bit to use.

A bunch of countries in Western Europe use the same Latin alphabet plus special diacritics (also known as accent marks) like ü or é or ß. In fact, we can pack enough extra characters into those last 128 slots to support 29 other languages like Afrikaans, German, Swahili, and Icelandic. Our Western European language block encoding is ready! We call this type of encoding a single-byte encoding because every character is represented by exactly one byte.
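For instance, in this Western European encoding (standardized later as ISO-8859-1), ‘é’ occupies one of those upper 128 slots and takes exactly one byte; a quick way to check in Java:

import java.nio.charset.StandardCharsets;

public class SingleByteDemo {
    public static void main(String[] args) {
        // 'é' uses the eighth bit: a single byte with value 0xE9 (233) in ISO-8859-1.
        byte[] bytes = "é".getBytes(StandardCharsets.ISO_8859_1);
        System.out.printf("%d byte(s), value 0x%X%n", bytes.length, bytes[0] & 0xFF);
    }
}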


Additional single byte encodings for other language blocks

We can repeat the process we used to create our Western European language encoding to develop other single-byte encodings for other language blocks, each a 256-character set! To give one more example, let’s build a single-byte encoding for Arabic.

Again, we take the first 128 ASCII characters as-is, then fill up the last 128 with the Arabic alphabet. We’ve got some space left over. Arabic has some diacritics as well, so let’s use some of the leftover slots to hold diacritic marks that are only valid when combined with other letters.

Some languages, such as Chinese, Japanese, and Korean, don’t even fit in 256 characters. That’s OK; we’ll just use multiple bytes per character to get more room. As you may have guessed, these encodings are called multibyte encodings. Sometimes we choose to use the same number of bytes for every character (fixed-width multibyte encodings) and sometimes we choose different byte lengths (variable-width multibyte encodings) to save space.

Ratifying our encodings to standards

After we’ve built several of these encodings (Russian, Greek, Simplified Chinese, etc.) we can ratify them as international standards, such as ISO-8859 for single-byte encodings. We previously built ISO-8859-1 (Western European) and ISO-8859-6 (Latin/Arabic). International standards for multibyte encodings exist too. People who use the same standard can communicate without problems.

The international standards like ISO-8859 are only part of the story. Companies like Microsoft and IBM created their own standards (so-called OEM standards or code pages). Some OEM standards map to international standards, some almost-but-not-quite map (see Windows-1252), and some are completely different.

Our standards have problems

Our standards and code pages are better than ASCII but there are a number of problems remaining:

  • How do we intermix different languages in the same document?
  • What if our standards run out of room for new symbols?
  • There is no Rosetta Stone to allow communication between systems that use different encodings.

Enter Unicode.

Unicode: One standard to rule them all


As mentioned earlier, Unicode is a single standard supporting thousands of languages. Unicode addresses the limitations of byte encodings by operating at a higher level than simple byte representations of characters. The foundation of Unicode is an über list of symbols chosen by a multinational committee.

Unicode keeps a gigantic numbered list of all the symbols of all the supported languages. The items in this list are called code points and are not concerned with bytes, how computers represent them, or what they look like on screen. They’re just numbered items, like:

a LATIN SMALL LETTER A – U+0061

東 Pinyin: dōng, Chaizi: shi,ba,ri – U+6771

☃ SNOWMAN – U+2603

We have virtually unlimited space to work with. The Unicode standard supports a maximum of 1,114,112 code points. That is more than enough to express the world’s active written languages, some historical languages, and miscellaneous symbols. Some of the slots are even left undefined for users to decide what they mean. These spaces have been used for wacky things like Klingon and Elvish.

Fun fact: Apple Inc. uses U+F8FF in the Private Use Area of Unicode for their logo symbol (). If you don’t see the Apple logo in parentheses in the preceding sentence, then your system doesn’t agree with Apple’s use of U+F8FF.

OK, we have our gigantic list of code points. All we need to do is devise an encoding scheme to encode Unicode objects (which we now know are sequences of code points) into byte-strings for transmission over the wire to other systems.

UTF-8

UTF-8 encodes every Unicode code point as a sequence of one to four bytes (a quick demonstration follows the list below). Here are some cool features of UTF-8:

  • Popularity – It has been the dominant encoding of the World Wide Web since 2010.
  • Simplicity – No need to transmit byte order information or worry about endianness in transmissions.
  • Backwards compatibility – The first 128 byte sequences are identical to ASCII.
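Here is the variable-width behavior mentioned above, demonstrated in Java with characters that have appeared earlier in this article:

import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // UTF-8 is variable width: ASCII stays at one byte, other characters grow as needed.
        String[] samples = {
            "A",                                    // U+0041 -> 1 byte (same as ASCII)
            "é",                                    // U+00E9 -> 2 bytes
            "東",                                   // U+6771 -> 3 bytes
            new String(Character.toChars(0x1F47B))  // GHOST, U+1F47B -> 4 bytes
        };
        for (String s : samples) {
            System.out.println(s + " -> " + s.getBytes(StandardCharsets.UTF_8).length + " byte(s)");
        }
    }
}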

UCS-2: Old and busted

UCS-2 is a fixed width, two-byte encoding. In the mid-nineties, Unicode added code points that cannot be expressed in the two-byte system. Thus, UCS-2 is deprecated in favor of UTF-16.

  • UCS-2 was the original internal representation of the Java String class
  • CPython 2 (and CPython 3 before 3.3) uses UCS-2 if compiled with default options
  • The Microsoft Windows OS API used UCS-2 prior to Windows 2000

UTF-16: UCS-2++

UTF-16 extends UCS-2 by adding support for the code points that can’t be expressed in a two-byte system (a short Java illustration follows the list). You can find UTF-16 in:

  • Windows 2000 and later’s OS API
  • The Java String class
  • .NET environment
  • OS X and iOS’s NSString type
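Because the Java String class is UTF-16 internally, the extension mechanism (surrogate pairs) is easy to observe; a short illustrative snippet:

public class SurrogatePairDemo {
    public static void main(String[] args) {
        // GHOST (U+1F47B) lies outside the two-byte range, so UTF-16 stores it
        // as a surrogate pair: two char values that together form one code point.
        String ghost = new String(Character.toChars(0x1F47B));
        System.out.println(ghost.length());                          // 2 UTF-16 code units
        System.out.println(ghost.codePointCount(0, ghost.length())); // 1 code point
    }
}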

UTF-32: Large bytes, simple representation

UTF-32 is a simple 1:1 mapping of code points to four-byte values. CPython uses UTF-32 for its internal representation of Unicode if compiled with a certain flag.

Conclusion

We’ve seen how text handling can go wrong. We’ve learned how to design and test our systems with Unicode in mind. Finally, we’ve learned a bit of the history of text encodings. There is a lot more to the topic of text, but for now I ask you to do the following:

  1. Examine your system to see if you’re using Unicode inside
  2. Use UTF-8 when reading and writing data
  3. Know that the plain text is a lie!

Thanks for reading!

ShipIt - 24-hour Hackathon for Millennium+ Platform Dev

At the end of March, some of our teams held their first 24-hour hackathon, titled ShipIt: Millennium+ Services FedEx Day. We had 41 participants in 15 teams working on 15 unique projects. The idea was inspired by several teams spending a few hours every so often to work on different projects. After reading about Atlassian’s hack days, we decided to hold one of our own.

The event was initially announced early in February to give teams time to work it into their project plans. The schedule was to start at 10 AM on a Thursday and wrap up at 10 AM on Friday. Teams then presented their projects and were free to leave for the weekend (and catch up on some sleep). Each team was free to choose the project it wanted to work on, with the limitation that it should be something which could be deployed somewhere within 24 hours (there were bonus points involved for deployed projects). The winning prize not only included bragging rights, but also the ‘Golden Keyboard’, which will be a traveling trophy.

Behold, the Golden Keyboard:

We had reserved a large room off campus to get everyone away from their daily routines. Teams jumped into their projects as soon as the hack day started on March 27th. Plenty of food and snacks were on hand, with lunch and dinner delivered to keep everyone fed. A hackavision dashboard (a Sinatra application which subscribed to a list of Atom feeds) was created to track all the GitHub commits by the teams.

The projects had amazing breadth, using technologies such as Neo4j, Riemann, and Capistrano and languages such as Ruby and Clojure. Teams not only learned new languages and projects in the 24 hours, but also had most of them fully functional and deployed at the end of the hackathon.

There were other activities as well, such as playing Xbox and watching the NCAA Basketball Tournament, which proved to be great breaks throughout the night. Motivational movies, like Robin Hood: Men in Tights, were also on tap through the night.

By 10 am on Friday morning on March 28th, everyone was ready to present their projects. We had four judges representing different areas of expertise. The demos were awesome and judges had a tough time picking the top three projects. Third place went to the SplunkOverflow team, who worked on a Maven plugin that would build site documentation for Thrift RPC services. Second place went to the Short Circuit team, who improved the performance of hash calculations in our Storm topologies. First place (and the Golden Keyboard) went to the Minions team, who created “lando”, a set of services that supported monitoring and management tasks on JVM-based Thrift RPC services.

All in all, it was an exciting 24 hours where teams showed their innovative abilities. All the projects were demoed to a larger audience about a week later. Most of the projects started during ShipIt are being enhanced further by the teams during their team-level hack time or scheduled projects.

This was our first hackathon, but it won’t be our last! We will have at least one more later this year and plan on making it a recurring event, with the Golden Keyboard traveling around with the winning team. It was amazing to see what people can do in a short amount of time; with the flexibility of choosing what you want to work on, the end result will always be something cool. We hope to continue the innovative thinking, not only through team-level hack days, but also through larger hack days.

Scaling People With Apache Crunch

Starting the Big Data Journey

When a company first starts to play with Big Data, it typically involves a small team of engineers trying to solve a specific problem. The team decides to experiment with scalable technologies, either due to outside guidance or research that makes them applicable to the problem. The team begins with the basics of Big Data, spending time learning and prototyping. They learn about HDFS, flirt with HBase or other NoSQL stores, write the required WordCount example, and start to figure out how the technologies can fit their needs. The group’s immersion into Big Data deepens as they move beyond a prototype into a real product.

The company sees the success of using Big Data technologies and the possibilities to solve difficult problems, tackle new endeavors, and open new doors. The company’s scalability problem has now shifted from the technical architecture to a people problem. The small team of experts cannot satisfy the demand, and transferring their accumulated knowledge to new teams is a significant investment. Both new and experienced engineers face a steep learning curve to get up to speed. Learning the API is not difficult, but challenges typically occur when applying the WordCount example to complex problems. Making the mental jump from processing homogeneous data that produces a single output to a complex processing pipeline involving heterogeneous inputs, joins, and multiple outputs is difficult for even a skilled engineer.

Cerner has developed a number of Big Data solutions, each demonstrating the 3 V’s of data (variety, velocity, and volume). The complexity of the problems being solved, evolving functionality, and required integration across teams led Cerner to look beyond simple MapReduce. Cerner began to focus on how to construct a processing infrastructure that naturally aligned with the way the processing is described. Looking through the options available for processing pipelines, including Hive, Pig, and Cascading, Cerner finally arrived at Apache Crunch. Using Crunch’s concepts, we found that we could easily translate how we described a problem into concepts we could code. Additionally, the API was well suited for our complicated data models and for building integration contracts between teams.

When we describe a problem we often talk about its flow through the various processing steps. The processing flow is made up of data from multiple sources, several transformations, various joins, and finally the persistence of the data. Looking at an example problem of transforming raw data into a normalized object for downstream consumers, we might encounter a problem similar to the diagram below.

If we apply this problem to the raw MapReduce framework we begin to see problems absent in the standard WordCount example. The heterogeneous data and models, the custom join logic, and the follow-up grouping by key all could result in extra code or difficulty fitting this processing into a single MapReduce job. When our problem expands to multiple MapReduce jobs we have to write custom driver code or bring in another system like Oozie to chain the workflow together. Additionally, even if we could fit the steps neatly into one or two MapReduce jobs, that careful orchestration and arrangement could become imbalanced as we introduce new processing needs into the workflow.

This problem is common and fairly basic with respect to some of the processing needs we faced at Cerner. For experienced MapReduce developers this problem might cause only a momentary pause to design, but that is due to their expertise. For those less skilled, imagine being able to break this problem down into processing steps you understand and POJO-like models with which you are already familiar. Breaking this problem down using Apache Crunch, we can see how we can articulate the problem and still take advantage of the processing efficiency of MapReduce.

Building a Processing Pipeline with Apache Crunch

Apache Crunch allows developers to construct complicated processing workflows as pipelines. Pipelines are directed acyclic graphs (DAGs) of input data that is transformed through functions and groupings to produce output data. When a developer is done constructing the pipeline, Apache Crunch calculates the appropriate processing steps and submits them to the execution engine. In this example we will talk about using Apache Crunch in the context of MapReduce, but it also supports running on Apache Spark. It should be noted that pipelines are lazily executed: no work will be done until the pipeline is executed.

To begin processing we need a pipeline instance on which we will construct our DAG. To create an MRPipeline we need the typical Hadoop Configuration instance for the cluster and the driver class for the processing.

Pipeline pipeline = new MRPipeline(Driver.class, conf);

With a pipeline instance available the next step is to describe the inputs to the processing using at least one Crunch Source. A pipeline must contain at least one source but could read from multiple. Apache Crunch provides implementations for the typical inputs such as Sequence Files, HFiles, Parquet, Avro, HBase, and Text. As an example if we were to read data out of a text file we might write code like the following:

PType<String> ptype = Avros.strings();
PCollection<String> refDataStrings = pipeline.read(new TextFileSource(path, ptype));

This code utilizes the TextFileSource to generate a collection of Java Strings from files at a certain path. The code also introduces two additional Apache Crunch concepts: PCollections and PTypes. A PCollection represents potential data elements to process. Since a pipeline is lazily executed, it is not a physical representation of all of the elements. A PCollection cannot be created directly; it can only be read from a source or produced by transforming another collection. Apache Crunch also has special forms of PCollections, PTable and PGroupedTable, which are useful in performing join operations on the data. A PType is a concept that hides serialization and deserialization from pipeline developers. In this example the developer works with native Java strings instead of dealing with wrapper classes like Hadoop’s Writable Text class.

Processing based on Java Strings is error prone, so typically developers would transform the data into a model object that is easier to work with. Transformation of a PCollection is done through a DoFn. A DoFn processes a single element of the PCollection into zero or more alternate forms depending on its logic. The bulk of a processing pipeline’s logic resides in implementations of DoFn. A custom implementation of a DoFn requires extending the DoFn class and defining the input and output types. This allows Crunch to provide compile-time checking as transformations are applied to collections.

class ConvertReferenceDataFn extends DoFn<String, RefData>{
     public void process (String input, Emitter<RefData> emitter) {
       RefData data = //processing logic;
       emitter.emit(data);
    }
}

...

PType<String> ptype = Avros.strings();
PCollection<String> refDataStrings = pipeline.read(new TextFileSource(path, ptype));
PCollection<RefData> refData =
  refDataStrings.parallelDo(new ConvertReferenceDataFn(), Avros.records(RefData.class));

Recalling the previous example processing problem, we see that we need to perform join and grouping operations based on a key. Instead of converting the string into a RefData object, it would be better to convert it into a key/value pair (e.g. Pair<String, RefData>). Apache Crunch has a PTable<K, V>, which is simply a special form of PCollection<Pair<K, V>>. Adjusting the function, we can instead produce the key/value pair.

class ConvertReferenceDataFn extends DoFn<String, Pair<String, RefData>>{
     public void process (String input, Emitter<Pair<String, RefData>> emitter) {
       RefData data = //processing logic;
       String id = //extract id;
       emitter.emit(new Pair(id, data));
    }
}

...

PType<String> ptype = Avros.strings();
PCollection<String> refDataStrings = pipeline.read(new TextFileSource(path, ptype));
PTable<String, RefData> refData =
  refDataStrings.parallelDo(new ConvertReferenceDataFn(),
    Avros.tableOf(Avros.strings(), Avros.records(RefData.class)));

Utilizing the PTable<String, RefData> collection, we can then join it with another similarly keyed PTable using one of the many prebuilt join implementations. The built-in join functionality helps avoid developing custom implementations of a common data processing pattern.

More functions are applied to the joined data to continue the processing workflow. The processing of the data is distributed in separate tasks across the cluster. In the example problem we need all of the data for a given key to be grouped to a single task for processing.

PTable<String, Model> data = ...;
PGroupedTable<String, Model> groupedModels = data.groupByKey();

A pipeline requires at least one collection of data to be persisted to a target. Crunch provides the standard targets for data, but consumers can also easily create new custom targets.

//persist Avro models
pipeline.write(models, new AvroFileTarget(path));

When constructing the processing pipeline for the example problem, we end up with an executable program that looks like the following:

public void run(){
    Pipeline pipeline = new MRPipeline(Driver.class, conf);

    //Read data from sources
    PType<String> ptype = Avros.strings();
    PCollection<String> refDataStrings = pipeline.read(new TextFileSource(path1, ptype));
    PCollection<String> refModelStrings = pipeline.read(new TextFileSource(path2, ptype));

    //Convert the Strings into models
    PTable<String, RefData> refData =
      refDataStrings.parallelDo(new ConvertReferenceDataFn(),
        Avros.tableOf(Avros.strings(), Avros.records(RefData.class)));
    PTable<String, RefModel> refModel =
      refModelStrings.parallelDo(new ConvertReferenceModelFn(),
        Avros.tableOf(Avros.strings(), Avros.records(RefModel.class)));

    //Join the separate inputs together.
    PTable<String, Pair<RefData, RefModel>> joinedDataModel = refModel.join(refData);

    //Apply a similar DoFn to convert the Pair<RefData, RefModel> into a single object
    PTable<String, Model> models = ...;

    //Filter out data that is not useful
    PTable<String, Model> filteredModels = models.filter(new FilterModelFn());

    //Group the data by key to have all model instances with the same key in a single location
    PGroupedTable<String, Model> groupedModels = filteredModels.groupByKey();

    //Convert the grouped Models into a single model object if they share the same key.
    PCollection<PersonModel> personModels =
      groupedModels.parallelDo(new ConvertPersonModelFn(), Avros.records(PersonModel.class));

    //Write out the Person Model objects
    pipeline.write(personModels, new AvroFileTarget(path));

    //At this point the pipeline has been constructed but nothing executed.  
    //Therefore tell the pipeline to execute.
    PipelineResult result = pipeline.done();
}

class ConvertReferenceDataFn extends DoFn<String, Pair<String, RefData>>{
     public void process (String input, Emitter<Pair<String, RefData>> emitter) {
       RefData data = //processing logic;
       String id = //extract id;
       emitter.emit(new Pair(id, data));
    }
}

class ConvertReferenceModelFn extends DoFn<String, Pair<String, RefModel>>{
     public void process (String input, Emitter<Pair<String, RefModel>> emitter) {
       RefModel model = //processing logic;
       String id = //extract id;
       emitter.emit(new Pair(id, model));
    }
}

class FilterModelFn extends FilterFn<Pair<String, Model>>{
     public boolean accept (Pair<String, Model> model) {
       boolean include = //logic to apply to the model
       return include;
    }
}

class ConvertPersonModelFn extends MapFn<Pair<String, Iterable<Model>>, PersonModel>{
     public PersonModel map (Pair<String, Iterable<Model>> models) {
       PersonModel model = //apply grouping logic to generate model from many items.
       return model;
    }
}

It is easy to see how the processing steps and flow of the original diagram map to Apache Crunch concepts.

Executing the code as written above would cause Apache Crunch to calculate an execution graph spread over two MapReduce jobs.

In this case, having a MapReduce job that is reduce-only is less than ideal, but the team has been able to focus on correctness and functionality first. Focus can now shift to performance tuning or adjusting algorithms as appropriate. The solid foundation of functionality and the simplicity of the concepts allow developers to easily understand how the processing pipeline works. That ease of understanding allows the team to refactor and iterate with confidence.

This blog post is essentially my script for my North America ApacheCon 2014 presentation. Slides are available here.
