Among the many compelling talks attendees have come to expect each year at the Strange Loop conference was a session by Ben Johnson providing an overview of Raft, a new distributed consensus protocol that originated from research at Stanford University.
What is distributed consensus?
Distributed consensus can be described as the act of reaching agreement among a collection of machines cooperating to solve a problem. With the rise of open source distributed computing and storage platforms, consensus algorithms have become essential tools for replication, and thus, serve to enhance resiliency by eliminating single points of failure.
Examples of distributed consensus in action can often be elusive because such protocols are ordinarily buried inside core systems, and consequently, are largely invisible to application developers. For example, a relational database in a clustered configuration would typically employ a consensus algorithm to coordinate commits with other replicas. And similarly, Apache ZooKeeper, a popular distributed synchronization service used in projects such as HBase and Solr, utilizes a consensus protocol to achieve fault-tolerance by replicating its configuration repository across many servers.
Raft is about understandability and practicality
The current Raft paper argues that while Paxos has historically dominated both academic and commercial discourse with respect to distributed consensus, the protocol itself is too complicated to reason about and that a more understandable algorithm was needed, not only for educational purposes, but also to serve as a foundation for building practical systems.
An obvious question arises for the inquisitive reader: what makes Raft better than Paxos? Having personally implemented a replicated log using the Paxos algorithm, I was naturally curious about how Raft approached the problem of solving consensus. It is worth noting, however, that comparing Raft and Paxos can be a bit misleading. Even though both address the fundamental problem of reaching consensus among a network of connected machines, Paxos is more academic in nature and primarily concerned with the mechanics of consensus, whereas Raft is oriented around the practical challenges of implementing a replicated log.
Paxos is about theory
The seminal work done by Leslie Lamport in 1989 with the design of the Paxos protocol was an important step forward in establishing a theoretical foundation for achieving consensus in asynchronous distributed systems. His contributions were largely academic and centered around reaching agreement on a single value, thus relying on software engineers to translate these ideas into practical solutions, such as replicated databases, which must decide on many values. However, the actual requirements necessary to build a real system, including areas such as leader election, failure detection, and log management, are not present in the Paxos specification, yet add a degree of complexity that almost always significantly alters the original protocol. This is precisely where the Raft designers correctly argue that the absence of specificity leads to great difficulty in applying Paxos to real world problems. A subsequent paper by Lamport in 2001 does an excellent job of making the original protocol accessible to practitioners, and to some degree, proposes techniques for designing a replicated log, but it stops short of being prescriptive in the way that Raft does.
You say tomāto, I say tomäto
Raft is unique in many ways compared to typical Paxos implementations, but despite those differences, both are grounded in a similar set of core principles. For example, Raft requires leader election to occur strictly before any new values can be appended to the log, whereas a Paxos implementation would make the election process an implicit outcome of reaching agreement. The Raft designers claim that doing so has the consequence of simplifying log management, particularly with respect to edge cases in which a succession of leadership changes can result in log discrepancies, but the tradeoff is that leader election in Raft is more complicated than its counterpart in Paxos.
What both protocols acknowledge, though, is that leader election is imperative if systems want to ensure progress. The notion of progress simply means that a system eventually does something useful. An important discovery in 1985, called the FLP Impossibility Result, proved that no deterministic protocol can guarantee consensus in an asynchronous distributed system in which even a single process may fail. The practical implication follows: a system that cannot reach consensus is a system that cannot make progress. To be clear, the finding did not state that consensus was unreachable, just that some executions cannot reach it in bounded time. As a consequence, leader election, combined with timeouts, is often used as a technique for eliminating a class of conditions under which reaching agreement could take an arbitrarily long period of time. Interestingly, the Paxos algorithm as originally described by Lamport makes no guarantees about progress, so implementations are compelled to incorporate timeouts as a compensatory measure. Raft, on the other hand, is prescriptive about the use of timeouts.
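The timeout mechanics can be sketched in a few lines. The following is an illustrative sketch rather than code from any actual Raft implementation; the class and method names are my own, and the 150-300 ms range follows the values suggested in the Raft paper:

```java
import java.util.Random;

// Illustrative sketch of a Raft-style randomized election timeout.
// A follower that hears no heartbeat from a leader before its deadline
// becomes a candidate; randomizing the timeout makes it unlikely that
// two followers time out simultaneously and repeatedly split the vote.
public class ElectionTimer {
    static final int MIN_TIMEOUT_MS = 150; // range suggested in the Raft paper
    static final int MAX_TIMEOUT_MS = 300;

    private final Random random;
    private long deadlineMs;

    public ElectionTimer(Random random, long nowMs) {
        this.random = random;
        reset(nowMs);
    }

    // Pick a fresh random deadline; called at startup and on every heartbeat.
    public void reset(long nowMs) {
        int timeout = MIN_TIMEOUT_MS + random.nextInt(MAX_TIMEOUT_MS - MIN_TIMEOUT_MS);
        deadlineMs = nowMs + timeout;
    }

    // True once the node should give up on the leader and start an election.
    public boolean shouldStartElection(long nowMs) {
        return nowMs >= deadlineMs;
    }
}
```

Because each node draws its own deadline from the interval, some node usually times out well before the others and wins the election uncontested.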
Raft wins on accessibility
One especially interesting component of the Raft specification is the mechanism for coordinating changes to cluster membership. The protocol employs a novel approach in which joint consensus is reached using two overlapping majorities, i.e. quorums defined by both the old and new cluster configuration, thereby supporting dynamic elasticity without disruption to operations.
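The commit rule under joint consensus is compact enough to sketch. This is an illustrative rendering, not code from the Raft paper; the names are my own:

```java
import java.util.Set;

// Sketch of Raft's joint-consensus commit rule: during a membership
// change, agreement must be reached in BOTH the old and the new
// cluster configuration before an entry counts as committed.
public class JointConsensus {

    // True when the acknowledging servers form a strict majority of cfg.
    static boolean majority(Set<String> acks, Set<String> cfg) {
        long count = cfg.stream().filter(acks::contains).count();
        return count > cfg.size() / 2;
    }

    // Committed only under overlapping majorities of old and new configs.
    static boolean committed(Set<String> acks, Set<String> oldCfg, Set<String> newCfg) {
        return majority(acks, oldCfg) && majority(acks, newCfg);
    }
}
```

Note that a set of servers can form a majority of one configuration but not the other; only an overlapping majority commits, which is what keeps the transition safe.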
Raft has clearly been embraced by the software development community, as evidenced by nearly 40 open source implementations in a variety of languages. Even though Paxos is beautifully elegant in describing the essence of distributed consensus, the absence of a comprehensive and prescriptive specification has rendered it inaccessible and notoriously difficult to implement in practical systems.
The use of open source software has become nearly ubiquitous in contemporary software development and it is no different for us, here at Cerner. We have been using open source software, directly and indirectly, for decades. Over the past decade, we’ve grown in maturity both in our use of open source software as well as our participation in open source communities. Our associates have long been contributors to open source communities, including helping users, logging bugs and enhancements, and submitting patches. Cerner associates also spearheaded the development of the Java Reference Implementation of the Direct Project.
Recently, we’ve decided to take another step in the open source journey by releasing complete projects on our GitHub organization. Although these projects seem small, they are a big step for us and just the beginning of what we hope to open up and share in the future. We hope you’ll check out these projects and participate in their development.
Project: knife-tar
Source: https://github.com/cerner/knife-tar
Knife-tar is a Chef tool for uploading and downloading Chef components as tar files. It can be used to create backups of your chef-server or to upload released Chef artifacts from a repository.
Project: scrimp
Source: https://github.com/cerner/scrimp
Scrimp is a tool for interactively testing Thrift services in a web browser. It’s meant to fill the same role that browser-based REST clients fill for web services. Given the IDL files for the services, it provides a UI to help construct requests, invoke services, and display formatted responses.
Providing opportunities for students to gain experience in software development and grow the skills necessary to excel in their careers after graduation is a top priority for Cerner Engineering. Our annual Software Intern Program gives students insight into the design, implementation, testing, deployment, and maintenance of large-scale software projects, which are often beyond the scope of the typical academic experience.
In 2013, we had a total of 224 interns across all business segments with 107 placed into the Software Intern Program to grow their experience as Software Engineers. Each intern was placed on an engineering team, paired with a dedicated mentor, and immediately began learning to use industry-standard tools and processes. They worked on real projects that will ultimately end up in the hands of our clients and play a part in transforming healthcare through the use of technology.
This summer, interns worked on a wide variety of projects, ranging from Eclipse RCP applications and web services to iPad applications and Big Data in the cloud. They used Java, iOS, Ruby, .NET, and many other technologies, ensuring that each intern learned something new and came away with a new set of skills at the end of the summer experience.
But it’s not all work – the summer schedule was full of opportunities to learn about Kansas City and healthcare, network with other interns in the program, and have as much fun as possible. This year, we kicked off the internship program at a Sporting KC soccer game, including a private tailgating event. Interns also got a chance to experience the Tucker Leadership Lab, attend lunchtime tech talks, participate in a CodeRetreat, attend our annual Developer’s Conference (DevCon), go to Worlds of Fun, and take tours around Kansas City to learn more about the different parts of town and what makes this a great place to live and work.
One of the most fun events this past summer was the Software Intern Hackfest, an optional event in which interns formed teams of 2-4 with a simple goal:
- Dream it.
- Build it.
- Present it.
The teams were given a week to build anything they wanted and presented the end result to a panel of Cerner software developers for prizes and glory. We took some time to sit down with two of the winning teams to get their thoughts on the Hackfest and their overall experience with the internship program.
Grand Prize winners: Team Golf
Addison Shaw, Richard Habeeb
Cerner: Describe your project and how you created it.
A&R: When we came up with the project, we thought, in one week, how far could we push ourselves? Let’s do a game. We wanted a challenge.
When we were coming up with the design of the game, it was more of a “create as you go” project. We didn’t just sit down and immediately come up with a game of aliens, cows and robots. We started out with a framework.
We wanted the game to be dynamically generated, and we wanted it to be a traditional survival game.
We knew it had to be 2-d. We built the framework with a grid structure. We made all the graphics ourselves with a pixelater.
Cerner: What were your biggest takeaways?
Richard: I called it free motivation. It’s a competition, and there’s that side of me that wants to do really well. There’s also a deadline. It’s a chance to show off your work to others. The project was very self-directed. We came up with the tools and decided it ourselves. The creativity part was awesome.
Cerner: What was the most fun?
Addison: I had Richard over to my house one day. We bought a bunch of snack food and red bull and worked on the game. My mom got us a pizza and it was really fun! We also did some trash talking with other interns. That was fun.
Cerner: What will you do with your prize? (Raspberry Pi)
Addison: It was an awesome prize! I want to do something with home automation. We are living in a house [at college] this year. Maybe I’ll create something like clap on lights.
Richard: I do robotics a lot during the school year and I was thinking of working with it at our club.
Crowd Favorite: Mad Scientists
Alex Valentine, George Li, Jack Miles
Cerner: Describe your project and how you created it.
Alex: It was my idea to make a game. For the longest time, we didn’t know what we wanted to do. Originally we were going to make a 2-D game. We thought it would be easy not making that 3rd dimension. There is a game that puts 2D and 3D together. It looked really simple and easy. So we went with that.
Jack: A few of us weren’t familiar with C#.
George: We all liked to play video games! We used Unity to create it. Someone told me if you want to learn Git you should forget everything about SVN. That didn’t necessarily happen for me.
Cerner: How did you communicate the tasks that needed to be done?
Alex: We assigned people different aspects of the game. I took characters.
George: I took responsibilities for weapons.
Jack: I took charge of enemies.
Cerner: How did you deal with integration?
Jack: For the longest time Git was giving me problems. I was on a Mac and they were on PCs.
Alex: We learned that we should never commit directly to the master branch. Making mistakes and then figuring out how to fix them really helped me learn Git.
George: I learned more about Git than about creating the game and ended up giving a lightning talk on Git. The Unity engine isn’t set up well for source control. We would have to diff in different stuff.
Cerner: You learned some new technologies – what else did you learn?
Alex: Start early. Be efficient.
Jack: If you’re working on new technologies, learn each technology before going too deep. That way when we’re trying to do real work, you aren’t worrying about how to use it.
George: It was about Wednesday when we still didn’t have anything working. When he told me, “I’ll be happy if it shoots,” I knew we were in trouble. [Editor’s Note: At the competition, the main character was able to shoot, but only backwards, which was a huge crowd pleaser]
Alex: When you realize that what you were going to do and your deadline are not going to jive, what do you do? We decided we wanted one of everything instead of having multiples of everything. We needed to scale our project down a bit.
Cerner: What was the most fun?
Alex: The lunch meetings were pretty entertaining!
Jack: The “fun” of frustration. The [other teams'] presentations were fun for me because, although we set the goal and didn’t quite make it, you could tell there were a lot of people in the same situation. Everyone who has worked on software has been in our position.
For more information about the Cerner Summer Intern Program, visit Cerner Careers Opportunities for College Students.
Improving the state of healthcare through innovation requires investing in others to join you on the journey; not just for today, but for the decades to come.
Project Lead the Way has established the Computer Science and Software Engineering course, which teaches computational thinking to high school students and will pilot in 60 schools across the country this fall. The course provides exposure to a wide variety of computational and computer science concepts: students can program a story or game in Scratch, write a mobile application for Android, and learn about knowledge discovery and data mining, computer simulation, cybersecurity, GUI programming, web development, version control, and agile software development.
Recently, I had the privilege of working with experts across the country to define the curriculum for Project Lead the Way as well as the opportunity to sit in on the training sessions given to the pilot teachers, hosted at Cerner’s Innovation Campus. Bennett Brown, Director of Curriculum and Instruction, led the class into the deep end to solve the hard problems their students would face throughout the course. Experts from organizations such as Cerner, Lawrence Livermore National Laboratory, Purdue University, Carnegie Mellon University, the University of Virginia, and more augmented the training sessions.
This course follows in the footsteps of other Project Lead the Way courses in engineering and biomedical science currently operating in over 4,700 schools with more than 10,000 trained teachers.
What I Learned
My role in the class was to teach one simple section and provide some practical answers from the industry trenches. I’ll let you in on a secret: I’m pretty sure I learned more about teaching and how to teach novices than I taught on any particular topic.
Problems and projects in PLTW are designed to be approachable by all students; there is no ceiling for the advanced students who want to go beyond what is specifically covered. The teachers receiving training were given the same end-of-unit problems that their students will face during the year and had to work out solutions in a pair programming environment, often working in their dorms or in the Cerner training rooms until 10pm. The experience level of teachers ranged from those having engineering degrees and multiple years of industry experience before going into teaching to some who had never programmed in their lives before this class.
What I learned was that even in this environment of mixed skill levels, the activity/project/problem-based approach surprisingly requires just a minimal introduction to a new concept. After simple activities which build quick wins, natural curiosity swells up within each of the students; skill and understanding start to form, not just learning a particular technology but how to problem solve using software.
For example, during the App Inventor sections, teachers received step-by-step examples of putting together programming blocks to create a simple Android application before working through one Agile sprint with their partner that evening to come up with their own application from scratch. The next morning, pairs went through their demos in a rotating show-and-tell fashion, showing a variety of apps, including a flash card game as well as a “one of these things is not like the other” game using the accelerometer for randomization and featuring the appropriate Sesame Street music.
One night during the course I pulled down the Codea app for my iPad and showed my own kids the relationship between a few behaviors in a physics app and the corresponding code in Lua. Three hours later, all of the iPads in the house were engaged in hacking.
In addition to some great instruction over the course from Bennett Brown on a variety of problems in genetics, chaos theory, Tkinter and GUI programming, other highlights included:
Vic Castillo, Group Leader in Quantitative Risk Analysis at Lawrence Livermore National Laboratory with a PhD in computational physics, gave a great three-hour survey presentation on what he sees as the “big three” topics in the future of computing: computer simulation and modeling (a subject near and dear to my own origin stories in computing), 3-D printing, and mobile robotics. The talk featured almost a dozen demos of cases where modeling was used to learn something new about a real-world system or design. Using just the NetLogo multi-agent programmable modeling environment developed at Northwestern University, Vic showed a variety of models that could be built with only a basic knowledge of the Logo programming language: for instance, modeling the heat absorption of a home design and how different designs and window configurations affect the internal temperature of the living space, “solving” the game Lunar Lander, or showing how a FIRST Robotics team modeled the software algorithms incorporated into their TurtleBot (think Roomba + netbook + Kinect sensor).
Joanne Cohoon talked about achieving diversity in STEM classrooms, the state of the job market (50% of available jobs involve some need for computational thinking), and how teachers can create classroom environments that attract students who might otherwise rule themselves out from the start based on the stereotypes and environments commonly surrounding STEM in schools. With half the workforce needing some kind of computational skill, we can’t afford to drive anyone away.
Peter Chapman, a PhD student at Carnegie Mellon University and Technical Leader of the picoCTF project joined us via Skype to describe the program. Set up along the storyline of an interactive adventure called Toaster Wars, picoCTF is a competition among high school students to solve as many of 57 challenges as they can using whatever means necessary–hacking, decrypting, reverse engineering, breaking or whatever. The challenges force students to get their hands dirty, learning on their own via documentation and their browser the role of cookies in web security, how to use grep and tar to find a secret key in a file, and other examples designed to give students the experience of tackling unknown problems using the resources available to them.
Cerner interns and former Lee’s Summit High School Team Driven First Robotics team members Dakota Ewigman and Victoria Utter taught a condensed version of the KC Power Source Android App Camp using MIT App Inventor which set the stage for more advanced App Inventor instruction provided by Dave O’Larte around calling and integrating web services and open APIs. We had developed a simple ballot and question and answer database-backed web service that let students and teachers hack around independently. In the category of unintended learning, and no doubt inspired by their picoCTF hacking, some of the teachers quickly found out how to troll other schools’ services by injecting funny questions and answers about the course material.
Why It’s Cool
At one point during the week, Bennett Brown shared with me the growth potential for PLTW’s CSE program. In the engineering and biomedical engineering pathways, the year-over-year growth has been tremendous which is paving the way for an even faster uptake of the computer science offering. Such an important and widespread channel offers a great opportunity to influence the future of our entire industry and communities. With around 20% of the total pilot schools in this curriculum, Kansas City is getting a head start on building this pool of talent. We can use little advantages such as these to increase the value of KC as a hub for computing and entrepreneurial activity.
An instructor asked me at one point, “What is Cerner looking for in high school students who take this course?” I had the sense that the question sought an answer more about a specific technology or language. I don’t really see that as the focus. Instead, I think what we want is defined more by sparking that interest in solving hard problems and coming away with the understanding that these problems exist in their community and that there are people in the community at companies like Cerner solving those problems every day.
Why It Matters
We have a lot to accomplish in “engineering health,” and K-12 outreach programs can feel like a long game that often take a back seat to immediate concerns around designing and shipping solutions. But seeing how even the less computing-savvy teachers started to get hooked on programming their Android devices or trying to get to the next level in picoCTF while beating their heads against man pages and HTTP dumps, it became clear to me that by getting involved in these programs we can accelerate not only the numbers of students comfortable with computational thinking, but also give them a network of relationships and experiences in their own communities to return to as their career and learning progresses. The long game may not be as long as we think.
When I graduated from college, I thought I understood what it meant to develop software in the real world. It required process. It required troubleshooting. It required quality. However, to me, process meant waterfall. Troubleshooting meant trying a few things and then asking for help. Quality meant manual testing.
Agile methods were not unheard of when I graduated in 2001. My professors noted that iterative development was better than waterfall; they just only taught waterfall. Debuggers had been around since the 1950s, but my classmates and I still debugged with what I call “Hi Mom!” techniques. (We peppered our code with print statements.) Kent Beck had written the JUnit framework 4 years before, but it wasn’t entrenched in the Java culture yet. So it’s not surprising that my education didn’t cover these topics.
It took a few painful experiences in the real world to make me realize the way I programmed in college wasn’t the best way to engineer software. I needed to adopt some new practices.
Not much has changed in terms of software education. Being a part of Cerner’s software engineer training program, I am able to ask every group of new engineers three questions:
“Do you use an agile process?”
“Do you use a debugger when troubleshooting your code?”
“Do you write automated unit tests?”
Cerner has had explosive growth in engineering, so I’ve asked those questions of hundreds of recent graduates. Almost no one says yes. This told me that while colleges are doing a great job of teaching computer science, many schools are not teaching best practices in software development.
Until recently, Cerner wasn’t doing that great of a job of teaching them either. Our training program covered them, but we still saw the new engineers struggle to understand agile development, debug their code, and write their first unit test. One of Cerner’s core values is if you recognize something is broken, you are empowered to fix it. I knew our training program wasn’t working. It became my job to fix it.
Before tackling the whole program, I tried a little experiment. I wanted to see what it would take to get engineers in training to write just one unit test. At the time, training included a class on JUnit. In spite of the class, only 5% of the engineers were writing unit tests for training assignments.
To correct this, I started telling the engineers that I would take points off assignments that didn’t have a unit test. The idea was to create structural motivation. We immediately saw 40% of the engineers writing unit tests. A step forward, but it wasn’t enough.
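For context, the bar was intentionally low. The class and test below are hypothetical stand-ins, not actual training assignments, but one passing JUnit test of roughly this size was all we asked for:

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Hypothetical class under test, standing in for a training assignment.
class GreetingBuilder {
    String greet(String name) {
        return "Hello, " + name + "!";
    }
}

// The kind of minimal JUnit 4 test we expected each engineer to write.
public class GreetingBuilderTest {
    @Test
    public void greetAddressesTheUserByName() {
        assertEquals("Hello, Ada!", new GreetingBuilder().greet("Ada"));
    }
}
```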
The biggest obstacle to broader use of unit tests in training was that the engineers didn’t know how to include the testing framework in their Java projects. That, more than the effort of writing the tests, was keeping them from doing something we expected of them.
Something was wrong. We were teaching Maven in our training program. If you are not familiar with it, Maven helps you manage your project builds, and as a result, it helps you manage your dependencies. The engineers were already attending a class that taught them how to add dependencies to Java projects. They just weren’t able to associate what they had learned with the goal of bringing JUnit into their projects. They weren’t making the connection.
This connection was missed because engineers were learning about Maven in the absence of a problem. They were being told it’s an important tool, but we hadn’t given them a reason to use it. Later, when they did encounter the problem – “How do I add the JUnit jar to my project?” – it was too late. They had forgotten about Maven.
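The missing piece was small. Once Maven is in the picture, bringing JUnit into a project is a single declaration in the pom.xml (the version shown is illustrative):

```xml
<!-- In pom.xml: declaring JUnit as a test-scoped dependency is all it
     takes for Maven to put the jar on the test classpath. -->
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.11</version>
  <scope>test</scope>
</dependency>
```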
The key was to move the Maven training closer to when they needed the information. This is called “Just in Time Teaching.” It became the first requirement of the new program.
Another interesting aspect of my experiment is that 40% would write the tests even given the delay between training and practice. It should be obvious to anyone who’s ever taken a college programming class that some programmers can get it from lectures alone. Others have to practice. Any one-size-fits-all approach to training is flawed. The second requirement for the new program was that it must flex to meet the skills and learning styles of the engineers.
With these goals in mind, I started the redesign of Cerner’s training program. My first step was to interview a large sample of our software leaders. I asked them what they wanted engineers to learn. Time and time again, the top answers would be agile development, debugging, and automated unit tests. Surprisingly, it was not a list of technologies like iOS, Hadoop, JSON, or REST.
The resumes of our newly hired engineers are full of languages and technologies. However, when asked what they wanted of new engineers, our lead architects described practices. If Cerner could get engineers to improve in practices, we could take the great engineers we were hiring and immediately make them more productive.
The scary thing is that sharing knowledge is easy, but changing people’s behavior is hard. Once I realized our problems were about software development behavior and not knowledge, I realized we would need to completely rebuild the way Cerner trains its new engineers. The result is DevAcademy.
Imagine you are a new engineer starting at Cerner. In your first week at Cerner, you report to the DevAcademy. The first two weeks focus on in-class instruction and assessment. The goal here is to introduce core software development behaviors and then assess your skills.
After completing the first two weeks of instruction, you join what we call the DevCenter and are assigned a real project. However, that project isn’t assigned by your team. You get to pick. The projects come from all over Cerner including web applications and services, tools to make engineers more productive, and even contributing to open source projects used by Cerner. In picking a project, you are telling Cerner the types of work you find interesting. This helps us determine the best place for you across our diverse range of solutions and technologies.
While working on that first project, you have a dedicated mentor. You are expected to make progress on the project while receiving feedback. You also get just-in-time training on user stories, source code management, unit testing, and scrum. You get to use Cerner’s software development ecosystem in an isolated, safe environment.
Once you show readiness to join a team, you are allowed to demo your work to the teams that have open positions. Those teams can then pick the engineer that best fits their team. In this way, Cerner makes sure you are assigned to the right team for both Cerner and you.
DevAcademy recognizes that you should never stop learning, so the program continues well into your first few months on your team. You are offered classes on different technologies and more advanced topics as part of an elective-based training plan. You work with your manager to decide which classes to take. It’s Cerner’s way of making sure all of our engineers continue to grow.
We’ve had 150 engineers join DevAcademy since it was launched. I’ve had the privilege of seeing the new engineers struggle and then succeed on their projects. I’ve seen the light come on when they realize the usefulness of good development practices and apply them effectively. I’ve seen them get excited about git and other powerful tools that they didn’t have the opportunity to learn during their formal educations. The best part of my job is that I’ve seen many very good engineers start down the career-long path towards becoming really great ones.
In software development, we solve problems. As we solve these problems, we build connections in our minds of how to look at a problem, relate it to previous problems and solutions, and re-apply past approaches and techniques.
These habits can harden into dogmatic ways of thinking and limit design choices to the select technologies we’ve used in the past. As we all know, you have to continually learn new technologies and different ways of thinking to stay current in the ever-changing landscape of software development. Unfortunately, keeping up to date on technologies and approaches isn’t an easy behavior to maintain when you have many other priorities.
This spring, a few engineers wanted to bring focus to this important behavior trait by giving a tech talk on “Honing your Craft,” which discussed how to continuously strengthen and refine skills. Not only did we want engineers that we work with on a daily basis to be part of the tech talk; we wanted engineers from other teams at Cerner to be involved too. Also, we didn’t just want to talk about it; we wanted to pose a challenge to put people into action and re-enforce these behaviors.
At the end of the talk, we announced the challenge: 30 days of code. It was a challenge like many other “30 day” programs, but it centered on learning new aspects of software development, learning new languages, or just hacking up a tool you find useful for your day-to-day development. The goal was that after 30 days, new habits and behaviors would be established to help promote continuous learning.
To make this a social learning experience, we built a Ruby web application called “mural” using async_sinatra and eventmachine to consume our GitHub Enterprise instance and show gists containing a code comment of “30_days_of_code”. The result was really interesting. Similar to following a specific hashtag on Twitter, we were following code snippets of fellow developers. Since most of them were gists, they were small enough that you could easily see what people were doing (without having to go through a mountain of code). On this dashboard, a score was calculated from the count of your posts and displayed with your avatar. Having scores displayed in descending order added a little peer pressure to keep people active in the challenge.
By the time I got back to my desk, I was already receiving questions about the URL for the app so people could verify their posts were showing. Soon, people wanted GitHub repos to show up alongside gists, so a pull request came in for that. We then wanted anonymous gists to be pulled in as well, which required developing our own crawler since those weren’t exposed through the gist search. It was apparent that people wanted to share what they were doing, and people wanted to see it.
At the end of the challenge, we returned to the auditorium to highlight some of the posts that came in over the past month. During the 30 days, we had 124 gists posted and 17 different contributions in repos. People were showing their skills over a wide range of technologies. Examples were:
- Building a plugin to invoke Jenkins commands over a phone (using Siri)
- Ruby scripts that interact with our GitHub Enterprise API to send out emails when pull requests or branches are getting old (executed periodically through Jenkins)
- A Clojure application that looks for pull requests in GitHub organizations with two “+1” comments (alerting which pull requests may be candidates to close out)
- Illustrating Crucible code review interactions by extracting data with Python and visualizing it with D3
- Using Node.js to flash lights on a Raspberry Pi when a web service health check is failing
- Presenting statistics from a storm cluster with Rickshaw
Eleven people presented what they worked on and learned. It was amazing to see all of the different ideas people came up with in this time frame.
Even more interesting was how quickly people learned from the ideas of others. Not only were developers sharing their code snippets, but they were also sharing the problem they were attempting to solve or the idea of what they wanted to invent. For example, the Ruby script that alerted the last committer of a dead branch through email spawned other implementations that would send alerts based on different pieces of GitHub data (e.g., old pull requests). Sharing these ideas in their early stages (through code snippets) really accelerates the rate at which an idea can be seeded in other minds and helps inspire even more innovation and learning.
This wasn’t only a challenge of what we would build, but it was also an experiment of what can happen by taking a small portion of your day and doing something different. By structuring this exercise around a formal challenge that included a competitive aspect, there was additional motivation to get involved and stay involved to the end.
In summary, find a way to take a little time out of your day to try something different; you will be amazed with the different perspectives that you gain.
This is the blog form of the Thinking in MapReduce talk at StampedeCon 2013. I’ve linked to existing resources for some items discussed in the talk, but the structure and major points are here.
We programmers have had it pretty good over the years. In almost all cases, hardware scaled up faster than data size and complexity. Unfortunately, this is changing for many of us. Moore’s Law has taken on a new direction; we gain power with parallel processing rather than faster clock cycles. More importantly, the volume of data we need to work with has grown exponentially.
Tackling this problem is tremendously important in healthcare. At the most basic level, healthcare data is too often fragmented and incomplete: an individual’s medical data is spread across multiple systems for different venues of care. Such fragmentation means no one has the complete picture of a person’s health and that means decisions are made with incomplete information. One thing we’re doing at Cerner is securely bringing together this information to enable a number of improvements, ranging from better-informed decisions to understanding and improving the health of entire populations of people. This is only possible with data at huge scale.
This is also opening new opportunities; Peter Norvig shows in the Unreasonable Effectiveness of Data how simple models over many data points can perform better than complex models with fewer points. Our challenge is to apply this to some of the most complicated and most important data sets that exist.
New problems and new solutions
Our first thought may be to tackle such problems using the proven, successful strategy of relational databases. This has lots of advantages, especially the ACID semantics that are easy to reason about and make strong guarantees about correctness. The downside is such guarantees require strong coordination between machines involved and in many cases the cost of that coordination grows as the square of data size. Such models should be used whenever they can, but to reason about huge data sets holistically means we have to consider different tradeoffs.
So we need new approaches for these problems. Some are clear upfront: as data becomes too large to scale up on single machines, we must scale out across many. Going further, we reach a point where we have too much data to move across a network — so rather than moving data to our computation, we must move computation to data.
In fact, these simple assertions form the foundation of MapReduce: we move computation to data by running map functions across individual records without moving them over the network, then merge and combine, or reduce, the output of those functions into a meaningful result. Word count is the prototypical example of this pattern in action. MapReduce implementations such as Hadoop’s actually offer a bit more than this, with the following phases:
- Map — transform or filter individual input records
- Combine — optional partial merge of map outputs in the mapping process, usually for efficiency
- Shuffle and Sort — sort the output of map operations by an arbitrary key and partition map output across reducers
- Reduce — process the shuffled map output in sorted order, emitting the final result
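As a sketch of these phases, the classic word count can be modeled in plain Python, with the shuffle-and-sort step that Hadoop performs for us simulated by grouping map output by key (the sample records here are invented for illustration):

```python
from collections import defaultdict

# Hypothetical input: each "record" is a line of text.
records = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit a (word, 1) pair for every word, record by record.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Shuffle and sort: group map output by key, in sorted key order.
# (Hadoop does this between the map and reduce phases.)
grouped = defaultdict(list)
for line in records:
    for key, value in map_fn(line):
        grouped[key].append(value)

# Reduce: sum the counts collected for each word.
def reduce_fn(key, values):
    return (key, sum(values))

counts = dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))
print(counts)
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

Because each call to `map_fn` touches only one record, the map phase can run on whatever machine already holds that record; only the small (word, count) pairs need to cross the network to a reducer.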
We have our building blocks: we can split data across many machines and apply simple functions against them. Hadoop and MapReduce support this pattern well. Now we need to answer two questions: How do we use these building blocks effectively and how do we create higher-level value on top of them?
The first step is to maximize parallelism. The most efficient MapReduce jobs shift as much work into the map phase as possible, even to the point where little or no data needs to be sent across the network to the reducer. We can gauge the gains made by scaling out by applying Amdahl’s Law, where the parallel fraction is the work we can do in map tasks versus the more serial reduce-side operations.
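As a rough illustration of that Amdahl’s Law estimate, with made-up numbers: a job that does 95% of its work in parallel map tasks scales far better across 100 nodes than one that is only half parallel.

```python
# Amdahl's Law: speedup = 1 / ((1 - p) + p / n), where p is the
# fraction of work that is parallel (map-side) and n is the number
# of workers. The figures below are illustrative, not measured.
def speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# 95% of the work in map tasks across 100 nodes...
print(speedup(0.95, 100))  # ~16.8x
# ...versus 50% map-side, where the extra nodes barely help.
print(speedup(0.50, 100))  # ~1.98x
```

The lesson is that the serial (reduce-side) fraction dominates quickly, which is why pushing work into the map phase pays off so much.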
The second step is to compose our map, combine, shuffle, sort, and reduce primitives into higher-level operations. For example:
- Join — Send distinct inputs to map tasks, and combine them with a common key in the reducers.
- Map-Side Join — When one data set is much smaller than another, it may be more efficient to simply load it in each map task, eliminating the reduce phase overhead outright.
- Aggregation — summarize large data sets into compact results that are cheap to query and consume
- Loading into external systems — The output of the above operations can be exported to dedicated tools like R to do further analysis.
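A minimal sketch of the first of these, a reduce-side join, using invented patient and visit records: each map task tags its records by source, the shuffle groups them by the common key, and the reducer pairs the two sides back together.

```python
from collections import defaultdict

# Two hypothetical inputs sharing a common key (a patient id).
patients = [(1, "Ada"), (2, "Grace")]
visits = [(1, "2013-01-05"), (1, "2013-02-11"), (2, "2013-03-02")]

# Map: tag each record with its source so the reducer can tell
# the two inputs apart.
def map_join(records, tag):
    for key, value in records:
        yield (key, (tag, value))

# Shuffle: group the tagged records by key.
grouped = defaultdict(list)
for key, tagged in list(map_join(patients, "P")) + list(map_join(visits, "V")):
    grouped[key].append(tagged)

# Reduce: for each key, pair every patient record with every
# visit record sharing that key.
joined = []
for key, tagged in sorted(grouped.items()):
    names = [v for t, v in tagged if t == "P"]
    dates = [v for t, v in tagged if t == "V"]
    for name in names:
        for date in dates:
            joined.append((key, name, date))
print(joined)
```

A map-side join would instead load the small `patients` list into memory in every map task and do the lookup there, skipping the shuffle entirely.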
Beyond that, the above operations can be composed into sophisticated process flows to take data from several complex sources, join it together, and distill it down into useful knowledge. The book MapReduce Design Patterns discusses all of these patterns and more.
Understanding the above patterns is important but much like how higher-level languages have grown dominant, higher-level libraries have replaced direct MapReduce jobs. At Cerner, we make extensive use of Apache Crunch for our processing infrastructure and of Apache Hive for querying data sitting in Hadoop.
Reasoning About the System
Most of software development’s history has focused on variations of Place-Oriented Programming, where we keep data in objects or database rows and apply change by updating that data in place. Yet such a model doesn’t align with MapReduce; when mass-processing very large data sets, the complexity and inefficiency of individual updates becomes overwhelming, and the system becomes too complicated to run or reason about. The result is a simple axiom for processing pipelines: start with the questions you want to ask and then transform the data to answer them. Re-processing huge data sets at any time is what Hadoop does best, and we can leverage that to view the world as pure functions of our data rather than trying to juggle in-place updates.
In short, the MapReduce view of the world is a holistic function of your raw data. There are techniques for processing incremental change and persisting processing steps for efficiency but these are optimizations. Start by processing all data holistically and adjust from there.
The paper From Databases to Dataspaces discusses a new view of integrating and leveraging data. A similar idea has entered the lexicon under the label “Data Lake,” but the principles align: securely bring structured and unstructured data together and apply massive computation to it at any time for any new need. Existing systems are good at efficiently executing known query paths but require a lot of up-front work, either creating new data models or building out infrastructure for the immediate need. Conversely, Hadoop and MapReduce let us ask questions of our data, in parallel and at massive scale, without that up-front build-out.
This becomes more powerful as Hadoop becomes a more general fabric for computation. Projects like Spark can be layered on top of Hadoop to significantly improve processing time for many jobs. SQL- and search-based systems allow faster interrogation of data directly in Hadoop to a wider set of users and domain-specific data models can be quickly computed for new needs.
Ultimately, the gap between the discovery of a novel question and our ability to answer it is shrinking dramatically. The rate of innovation is increasing.