This past spring, I had the pleasure of attending SREcon18 in Santa Clara, California. If you have never heard of SREcon or the term SRE, let me digress for a moment to explain. SRE, or Site Reliability Engineer, was coined by Google employees back in 2003 when a team of software engineers was tasked with running a production environment. It’s the new hotness in the technology world, so an internet search will turn up a bunch of topics. If you are interested in learning more, I recommend sticking with the more trusted sources:
Since SRE concepts have been codified, organized, and documented by the creators, the above resources are easy to explore without getting conflicting information from other entities who have taken the ideas and modified them to fit their organization. There is a lot of excitement in the industry around the SRE concept, and there’s a lot of overlap between SRE and DevOps. Having worked in this industry for a while now, I see this as a much-needed evolution in the operations and development roles.
SREcon started in 2014, so it is relatively new, and it is hosted by the Usenix Association. Usenix has been hosting technical conferences since 1975, and one of their tenets is to host vendor-neutral events. I have been to a few vendor-hosted conferences over the past few years, and they are always heavy on the sales and light on the ideas. The sales aspect of these conferences got to the point where I was burned out and looking for something idea-based. SREcon was the complete inverse: heavy on the ideas and light on the sales. It fit my need perfectly. My only complaint, and this is minor, was that most of the speakers selected this year were backed by big-name tech companies, e.g. Google, Facebook, LinkedIn, and Netflix. A fellow attendee I was talking to mentioned that they enjoyed the company diversity this year over previous years, though. Apparently, the tone of the conference in previous years was Google-heavy, i.e. “This is the Google Way,” which is not surprising since that’s where it all started. The conference was held in the heart of Silicon Valley, so it was reasonable for many of the presenters to come from companies in the valley. I fully expect to see more diversity in the coming years. SREcon19 was announced to be held in Brooklyn, NY, so I would not be surprised if many East Coast companies are represented. I should also note that the conference I attended was SREcon Americas. Showcasing another aspect of diversity, there are also SREcons for Europe/Middle East/Africa and Asia/Australia, so it has become a worldwide conference.
The idea-sharing focus of SREcon played an important role in my decision to attend, since I’m leading an automation development team in CernerWorks℠. CernerWorks is Cerner’s managed services organization, responsible for hosting, managing, and monitoring our clients’ systems in order to provide the most reliable, highest-performing, and most cost-effective delivery of technology services for healthcare. My team is focused on automation in the client-aligned systems management space. That’s just a fancy way of saying that our focus is on the non-cloud solutions. We are focusing on solution provisioning, configuration management, and upgrades. The work we are doing aligns well with the tenets of SRE, DevOps, and other philosophies that have been discussed in the past few years. I am very interested in these philosophies simply because change is hard, especially when it comes to changing a culture with an established mentality of doing things a certain way. The technical part of our job is easier, in my opinion, since technology is always changing.
Some of the more interesting sessions I attended talked about patterns and behaviors that we can learn from other industries. For example, the way that firefighters and other first responders react to incidents, and the command structures they put in place, outline ways that we can improve how we deal with incidents. Having a plan and structure when engaging with an incident has many benefits. It allows you to define what normal operations for your team look like versus what emergency operations look like. Determining the chain of command and defining pre-set roles that anyone can fit into can help shorten the time to respond and clarify the emergency command hierarchy of the team. Other industries, such as oil and gas refineries, must have strong contingency plans in place in order to not go boom when things go south. There are a lot of patterns, behaviors, and philosophies that we can learn from almost any industry out there if we open our eyes.
A key topic at the conference was learning from failures and mistakes. The idea of a blameless postmortem was discussed at length in multiple talks and conversations. The need for such a thing stems from our human nature to want to assign blame and fault when dealing with problems. But in many ways this type of behavior causes actions to be hidden out of fear of retribution. A blameless postmortem comes from the idea that learning and prevention are more important than assigning blame. Etsy, for example, has an annual award for the employee who’s made the biggest mistake. The culture they are building is one where accidents are a source of learning rather than a source of embarrassment. This is quite an interesting concept.
Having worked in the technology industry for years without ever visiting Silicon Valley, I found being there surreal. Seeing big company names on the side of buildings and realizing those were the corporate headquarters of companies whose software I have interacted with for years was an experience. One night, I took a trip to the Googleplex to walk around the campus. It was a beautiful campus with trees and flowers everywhere, and the weather was amazing. The campus was abuzz with people and the kind of activity that you would expect to take place at Google late at night. While I was walking, a thought started forming in my mind about the type of work we do at Cerner. That thought continued to form later at the conference when I spoke with development teams from various big-name tech companies. It was rewarding to talk about how we are using technology to make healthcare better. These conversations would usually end with the team from the big tech company expressing admiration, respect, and praise for the work we are doing. The Googles, Facebooks, Twitters, and the like are doing interesting things, and it’s easy to get distracted by those names, but the problems that we are solving have real impact on the lives of our fellow humans, and we shouldn’t take that for granted.
SREcon18 was worth the trip, but if you couldn’t attend this year’s conference, missed previous years, or can’t make future SREcons, you don’t need to worry about missing out. Usenix has an open access policy, which means that slides, videos, and audio of all the talks are available after the conference. I strongly recommend reviewing the program for this year’s conference and watching any of the talks that catch your fancy.