Managing 30,000 Logging Events Per Day With Splunk

Our team works on a patient-facing web application with a thousand live clients and more than 2,315,000 users. On average, this high traffic results in more than 40,000 visits and 300,000 page views daily, generating about 30,000 logging events. A considerable portion of these events are at the information and warning levels, intended to aid proactive monitoring and to identify potential issues caused by client misconfiguration.

Before Splunk

To handle this large volume of logging, our team created a rotational support role to manually monitor the logs at regular intervals each day. We built a custom log viewer that aggregated application logs, and the engineer in the support role was expected to watch this tool to identify issues. Although we were able to identify problems such as bad client builds or service errors, this approach was neither efficient nor accurate in quickly determining end-user impact. Since there was no way to tag previously identified and resolved issues, newer support engineers often lacked the knowledge to react to a problem. This led to unnecessary escalation and engagement of next-tier support.

Below: Our old log aggregator, which we used to identify the occurrences (2) of a log event with a given stack trace.

Splunk Round 1

Once we migrated to Splunk, we were very excited about the capabilities it offered, especially around searching and data visualization. In addition to searching logs more effectively, we were able to extract meaningful information from our logs unlike before. Splunk gave us the ability to identify client and end-user impact, down to the user id, across all events in our logs [see image below]. This helped us gain more insight into our problems and trends in terms of impact to users. For a particular problem, we could quickly conclude whether all clients were affected, whether the affected clients were over a virtual network only, or whether the issue was isolated to a specific user. This gave us the ability to determine the impact of issues coming into our logs, including the area of the site affected and the frequency of occurrence.

Below: Once we extracted meaningful fields in our logs, we could identify impact. In this case, an issue is spread across 9 orgs and 28 users.
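
To make this concrete, a search along the following lines surfaces that kind of impact summary. (This is an illustrative sketch; the index and field names — app_logs, org_id, user_id, error_message — are hypothetical stand-ins for our actual field extractions.)

index=app_logs level=ERROR
| stats dc(org_id) AS affected_orgs, dc(user_id) AS affected_users BY error_message
| sort -affected_users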

Although we had cleared some of the hurdles that made log monitoring difficult shortly after moving to Splunk, monitoring logs for issues was still not an easy job. It was possible to overlook issues, since there was no effective way of studying trends. Initially, we created dashboards that helped identify organizations having problems. This was slightly useful, but it failed to provide the more important graphical representation of the different types of issues occurring for a particular client, or for all clients, at a given time.

Below: Reports like these weren’t very helpful. Clients with more users tend to have more errors, so this trend doesn’t necessarily indicate that a client is experiencing downtime.

Splunk Round 2

It didn’t take us long to realize that we had to get a better handle on our logs to stay on top of increasing traffic caused by a growing user base. Although we were able to identify frequently occurring errors, we still needed a more effective way to identify known issues, service issues, configuration issues and application issues. In order to do that, we needed something more meaningful than a stack trace to track issues. We needed to tag events, and to do that, we turned to the eventtypes feature offered by Splunk.

Eventtypes are applied at search time and allow you to tag matching events so you can categorize your search results. Because they are applied at search time, we were able to add new eventtypes and have them applied historically throughout our logs. This also gave us the ability to tweak our eventtypes to cover more known issues as we continued identifying them. Once we found a way to take advantage of eventtypes, we came up with a query that created a stacked timechart of eventtypes, where each eventtype represented a known issue (a sketch follows the steps below). Reaching this improved level of production monitoring required the following:

  1. Create a lowest-priority eventtype that catches all problems and label it “Unknown Issue.”
  2. Go through “Unknown Issue” events and create higher-priority eventtypes that describe each problem in English. Once an issue is logged in our bug tracking system, tag the eventtype with that id for easy tracking.
  3. Repeat daily.
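
As a sketch of how these pieces fit together (the stanza names, searches, and fields below are illustrative, not our production definitions), an eventtype is defined in eventtypes.conf with a search and a priority, and the stacked timechart then splits error counts by the eventtype field:

# eventtypes.conf: a specific known issue and the lowest-priority catch-all
[known_issue_auth_misconfig]
search = index=app_logs "AuthenticationException" "invalid credentials"
priority = 2

[unknown_issue]
search = index=app_logs level=ERROR
priority = 10

# Search powering the stacked timechart of known issues
index=app_logs level=ERROR
| timechart count BY eventtype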

Below: Eventtypes give us the ability to see known problems that happen over time. We can even see known problems broken down by client.

Once we had our frequently occurring problems categorized, we were able to break them down even further. We could identify problems caused by configuration in our application layer, problems that required escalation, and problems for which client-side contacts needed to be engaged.

Below: We now have the ability to track impact to users from clients not taking a service package [left], or from improper Service Configuration [right].

Alerting

We’ve also started taking advantage of Splunk’s alerting. With its powerful search abilities, we have scheduled searches that trigger an alert when a particular condition is met. For example, when a client has misconfigured certain credentials, causing authentication errors all over the site, we can engage support immediately to get it resolved.
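
A scheduled search along these lines could back such an alert, firing whenever any results are returned. (Again, the index, fields, and threshold are hypothetical; they would be tuned to your own extractions and volumes.)

index=app_logs "AuthenticationException" earliest=-15m
| stats count AS errors, dc(user_id) AS affected_users BY org_id
| where errors > 50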

What’s Next?

Although we have a better understanding of our logs now, things can get even better. We plan on continually categorizing our logs so that monitoring our system becomes simple for everyone. Once all of our known issues are categorized, we would like a scheduled search that can identify anomalies in the logs. This would be highly beneficial for finding out whether a release introduces new issues.

Since our site depends on multiple services, most service problems are resolved by escalated support. We are currently working on identifying problems that have known resolutions, along with the people who need to be contacted to perform the resolution steps. Eventually we would like to send alerts/emails from Splunk directly to Cerner’s general support for these issues.

We also plan on integrating Jira into Splunk with the help of the Splunk Jira app. This will give us the ability not only to track issues in our logs, but also to view their current status (investigation, assigned, fixed, resolved). This closes the loop on finding new issues, tracking their impact, and finally seeing them through to resolution. Splunk has been extremely exciting to work with and has been an invaluable asset to our team. We’d love to continue the conversation on how we can improve our usage of Splunk and how others are using it as well.

Cerner and the Apache Software Foundation

At the beginning of this year, we announced that Cerner became a bronze-level sponsor of the non-profit Apache Software Foundation (ASF). Many of the open source projects we use and contribute to are under the ASF umbrella, so supporting the mission and work of the ASF is important to us.

We’re happy to announce that Cerner has now increased our sponsorship of the ASF to become a silver-level sponsor. Open source continues to play an integral role in both our architecture and engineering culture. We’ve blogged and spoken at conferences about how several ASF projects are core foundational components in our architecture and several of our tech talks have focused on ASF projects.

Further increasing our sponsorship of the ASF reaffirms our continued support for an organization that provides homes for numerous open source projects that are important not only to us, but the larger development community.

Closures & Currying in JavaScript

Preface

I have been asked many times what closures are and how they work. There are many resources available for learning this concept, but they are not always clear to everyone. This has led me to put together my own approach to explaining it.

I will supply code samples; in them, //> denotes an output or return value.

Before discussing closures, it is important to review how functions work in JavaScript.

Introduction to functions

If a function does not have a return statement, it will implicitly return undefined, which brings us to the simplest functions.

Noop

Noop typically stands for no operation: it takes any arguments, does nothing with them, and returns undefined.

function noop() {}
noop("cat"); //> undefined

Identity

The identity function takes in a value and returns it.

function identity(value) {
  return value;
}

identity("cat"); //> "cat"
identity({a: "dog"}); //> Object {a: "dog"}

The important thing to note here is that the variable (value) passed in is bound to that function’s scope. This means that it is available to everything inside the function and is unavailable outside of it. There is one exception: objects are passed by reference, so changes made to an object’s contents inside a function are visible outside of it. This will prove useful with closures and currying.
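
To illustrate that exception (this snippet is my addition, not part of the original examples): a primitive is copied into the function’s scope, but an object argument refers to the caller’s object, so mutations are visible outside.

function rename(pet) {
  pet.a = "wolf"; // mutates the caller's object
}

var animal = {a: "dog"};
rename(animal);
animal.a; //> "wolf"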

Functions that evaluate to functions

Functions are first-class citizens in JavaScript, which means that they are objects. Since they are objects, they can take functions as parameters, have methods bound to them, and even return functions.

function foo() {
  return function () {
    return true;
  };
}

foo()(); //> true

This is a function that returns a function which returns true.

Functions take arguments, and those arguments can be values or reference types, such as functions. If you return a function, it is that same function object you are returning, not a new copy (even though it might have just been created in order to be returned).

Closures

Creating a closure is nothing more than accessing a variable outside of a function’s scope (using a variable that is neither bound on invocation nor defined in the function body).

To elaborate, the parent function’s variables are accessible to the inner function. If the inner function uses its parent’s (or parent’s parent’s, and so on) variables, then they will persist in memory as long as the accessing function(s) are still referenceable. In JavaScript, referenceable variables are not garbage collected.
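
Here is a small illustration of that persistence (a counter, added for clarity): each call to the returned function reads and updates the same count variable, which stays alive because the returned function is still referenceable.

function makeCounter() {
  var count = 0; // persists as long as the returned function is referenceable

  return function () {
    count += 1;
    return count;
  };
}

var next = makeCounter();
next(); //> 1
next(); //> 2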

Let’s review the identity function:

function identity(a) { return a; }

The value, a, is bound inside of the function and is unavailable outside of it; there is no closure here. For a closure to be present, there would need to be a function within this function that would access the variable a.

Why is this important?

  • Closures provide a way to associate data with a method that operates on that data.
  • They enable private variables in a global world.
  • Many patterns, including the fairly popular module pattern, rely on closures to work correctly.

Due to these strengths, and many more, closures are used everywhere. Many popular libraries utilize them internally.

Let’s take a look at an example of closure in action:

function foo(x) {
  function bar(y) {
    console.log(x + y);
  }

  bar(2);
}

foo(2); // will log 4 to the console

The outer function (foo) takes a variable (x), which is bound within that function’s scope when it is invoked. When the internal function (bar) is invoked, x (2) and y (2) are added together, then logged to the console as 4. bar is able to access foo’s variable x because bar is created within foo’s scope.

The takeaway here is that bar can access foo’s variables because it was created within foo’s scope. A function can access variables in its own scope, and up the chain to the global scope. It cannot access the scopes of functions declared within it or parallel to it.

Note that a function inside of a function doesn’t have to reference variables outside of its own scope. Recall the example of a function that returned a function which evaluated to true:

function foo(x) {
  // does something with x, or not
  return function () {
    return true;
  };
}

foo(7)(); //> true

No matter what is passed to foo, a function that evaluates to true is returned. A closure only exists when a function accesses a variable(s) outside of its immediate scope.

This leads into an important implication of closures: they enable you to define a data set once. We’re talking about private variables here.

Without closures, you recreate the data on every function call if you want to keep it private.

function foo() {
  var privateData = [0, 1, 2]; // Imaginary large data set - instantiated per invocation

  console.log(privateData);
}

foo(); //> [0, 1, 2]

We can do better! With a closure, we can save it to a variable that is private, but only instantiated once.

var bar = (function () {
  var privateData = [0, 1, 2]; // Same large imaginary data set - only instantiated once

  // As long as this function exists, it has a reference to the private variable
  return function () {
    console.log(privateData);
  };
}());

bar(); //> [0, 1, 2]

By utilizing a closure here, our big imaginary data set only has to be created once. Given the way garbage collection (automatic memory freeing) works in JavaScript, the existence of the internal function (which is returned and assigned to the variable bar) keeps the private variable from being freed, and thus available for subsequent calls. This is really advantageous when you consider large data sets that may be created via Ajax requests, which have to go over the network.

Currying

Currying is the process of transforming a function that takes many arguments into an equivalent function that takes fewer arguments.

That sounds cool, but why would I care about that?

  • Currying can help you make higher order factories.
  • Currying can help you avoid continuously passing the same variables.
  • Currying can remember various things, including state.

Let’s pretend that we have a function (curry) defined and set onto the function prototype, which turns a function into a curried version of itself. Please note that this is not a built-in feature of JavaScript.

function msg(msg1, msg2) {
  return msg1 + ' ' + msg2 + '.';
}

var hello = msg.curry('Hello,');

console.log(hello('Sarah Connor')); // Hello, Sarah Connor.
console.log(msg('Goodbye,', 'Sarah Connor')); // Goodbye, Sarah Connor.

By currying the msg function so the first variable is cached as “Hello,”, we can call a simpler function, hello, that only requires one argument to be passed. Doesn’t this sound similar to what a closure might be used for?
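
For completeness, here is one minimal way such a curry helper could be written (a sketch of partial application in the spirit of the example above, not an official implementation):

Function.prototype.curry = function () {
  var fn = this;
  var preset = Array.prototype.slice.call(arguments);

  return function () {
    // Combine the preset arguments with those supplied at call time.
    return fn.apply(this, preset.concat(Array.prototype.slice.call(arguments)));
  };
};

Notice that it is a closure over preset that does the caching, which is exactly why currying and closures feel so closely related. (Extending Function.prototype is convenient for demonstration, though a standalone helper function is usually safer in shared code.)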

Discussions of functional programming concepts are often met with a sense of resistance.

The thing is, you’ve probably already been functionally programming all along. If you use jQuery, you certainly already do.

$("some-selector").each(function () {
  $(this).fadeOut();
  // other stuff to justify the each
});

Another place you may have seen this is in the map function for arrays.

var myArray = [0, 1, 2];
console.log(myArray.map(function (val) {
  return val * 2;
}));

//> [0, 2, 4]
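
Closures and currying show up naturally here as well. For instance (my example, not from the original), a function that closes over a factor produces reusable map callbacks:

function multiplier(factor) {
  // The returned function closes over factor.
  return function (val) {
    return val * factor;
  };
}

[0, 1, 2].map(multiplier(2)); //> [0, 2, 4]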

Conclusion

We’ve seen some examples of closures and how they can be useful. We’ve seen what currying is and more importantly that you’ve likely already been functionally programming even if you didn’t realize it. There is a lot more to learn with closures and currying as well as functional programming.

I ask you to:

  1. Work with closures and get the hang of them.
  2. Give currying a shot.
  3. Embrace functional programming as an additional tool that you can utilize to enhance your programs and development workflow.


Bonus

Check out how you can utilize closure and currying to manage state throughout a stateful function:

function setFoo(state) {
  if (state === "a") { // Specific state
    return function () {
      console.log("State a for the win!");
    };
  } else if (state) { // Default state
    return function () {
      console.log("Default state");
    };
  }
  // Empty function since no state is desired. This avoids invocation errors.
  return function () {};
}

var foo = setFoo("a"); // Set to the specific state (a)
foo(); //> "State a for the win!"

foo = setFoo(true); // Set foo to its default state
foo(); //> "Default state"

foo = setFoo(); // Set foo to not do anything
foo(); //> undefined
// etc

Bonus 2

Check out how closures and currying can be used to create higher-order functions that create methods on the fly: http://jsfiddle.net/GneatGeek/A9WRb/

Intern HackFest 2014

Ten teams of two to four Cerner interns competed in a week-long HackFest this summer, working to solve any problem they put their minds to. The competition culminated in a presentation and judging of projects, with prizes of Raspberry Pi Kits for each member of the second-place team and Leap Motions for each member of the winning team. From mobile apps to machine learning algorithms to drones, this year’s Summer Intern HackFest has been one for the books.

We called ourselves Team Rubber Duck Dynasty; the team was made up of Umer Khan (University of Notre Dame), Ryan Boccabella (University of Notre Dame), MaKenzie Kalb (Vanderbilt University), and Jake Gould (University of Kansas).

We were excited to get to work the first night of the week-long competition. Since the beginning of the summer, all of us had been impressed with the caliber of talent Cerner brought into the Software Engineer Internship program. All nine teams we were up against were made up of remarkably smart, driven college students from all over the country. One of the most difficult parts of the HackFest was deciding on an interesting and competitive project that could feasibly be completed in only a week (without too many sleepless nights). One of our four team members was on the iOS team, and he convinced us that an iOS game was the way to go. We wanted to make a game that we would be excited to show our friends as well as the judges.

We ended up building an app called Encore. It is a musical turn-based game revolving around the creation and mirroring of three-second tunes between users. Tunes are created using four arpeggio-based tones from real piano, guitar, or tenor trombone recordings. The initiating iOS device sends the data to the Parse server using the Parse API for iOS. Parse stores this data on the server and sends a push notification to the receiving iOS device. Each time a new game is created, an activity is logged on the server to keep track of the game data. When the receiving user selects the game, the app downloads the game data from the server and starts the game. Once the app downloads the game data, it decodes an array of dictionaries of instrument key and time and converts the array into audio playback; this allowed for faster upload and download times, as well as significantly smaller game data files. The receiving user hears the tune and immediately attempts to replay it. Scoring is accomplished using a Needleman-Wunsch algorithm for sequence alignment. The receiving user then has their chance to create a tune, and the melodious competition continues.
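
To sketch the scoring idea (our real implementation lived in the iOS app; the match, mismatch, and gap values and the note representation here are illustrative), Needleman-Wunsch fills a dynamic-programming table in which each cell takes the best of a match/mismatch step or a gap step:

function alignmentScore(target, attempt, match, mismatch, gap) {
  var rows = target.length + 1;
  var cols = attempt.length + 1;
  var table = [];
  var i, j, diag;

  // The first row and column accumulate gap penalties.
  for (i = 0; i < rows; i++) {
    table[i] = [i * gap];
  }
  for (j = 1; j < cols; j++) {
    table[0][j] = j * gap;
  }

  // Each cell is the best of a diagonal (match/mismatch) or a gap move.
  for (i = 1; i < rows; i++) {
    for (j = 1; j < cols; j++) {
      diag = table[i - 1][j - 1] + (target[i - 1] === attempt[j - 1] ? match : mismatch);
      table[i][j] = Math.max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap);
    }
  }

  return table[rows - 1][cols - 1]; // Best global alignment score
}

// One wrong note out of three, with match = 1, mismatch = -1, gap = -1:
alignmentScore(["C", "E", "G"], ["C", "E", "A"], 1, -1, -1); //> 1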

Over the week, we got to know our teammates even better than we probably wanted to. Passion is the main word that comes to mind when we reminisce on this highlight week of our summer. From the uncertainty of overhearing other groups huddled in a room talking excitedly about cutting-edge technologies, to the shrieks of excitement when a test finally passed (which perhaps woke many a consulting intern roommate), this HackFest was filled with memories all around. As we went for a celebratory dinner the night before Monday morning’s presentations, the satisfaction of completion was sweet in the air. Sitting there, playing our noisy pride and joy on our phones at the table, we agreed that the week had already been an excellent experience…and we hadn’t even started the real judging yet.

Sound checks were full of nerves and excitement the morning we presented our project. Each team had a mere five minutes to “sell” what had been more time-consuming than sleep over the past week, a challenge everyone was hoping to ace. Later that afternoon, when the esteemed judges Chris Finn, Michelle Brush, and Jenni Syed were announced as the event began, the caliber of the resources Cerner provides for its many interns was standing right in front of us. We heard from many enthusiastic, impressive groups that afternoon. The presentations showcased many feats of great teamwork and skill: a recommendation engine, a dashboard for developers, a chatting website, a facial recognition Android app, an iOS game, a machine learning algorithm, a Twitter-controlled drone, and a music website.

After a delicious ice cream break while scores were deliberated and the judges provided valuable feedback for each team, the moment of anticipation was upon us. All teams certainly completed the day with the ultimate reward of new skills learned, friends made, and a fantastic project that some are undoubtedly still building off of. As the first- and second-place teams were called to the stage, Team Rubber Duck Dynasty was surprised and thrilled to be among them. And as the runner-up, Team Marky Mark and the Funky Bunch, received their Raspberry Pi Kits, we were amazed to find out each of us was taking home our very own Leap Motion.

We returned to our actual teams late that afternoon, proud of our accomplishments and brand new owners of a cutting-edge technology. We received the congratulations of our superiors and mentors, many of whom were our biggest encouragers to participate and supporters throughout the week. The numerous empowered associates that have guided us through this summer have been an unbelievable community – a community that all of us are incredibly grateful to have been a part of.