Recently I was fortunate enough (thanks Skyscanner!) to attend the O’Reilly Software Architecture Conference in San Francisco. Given our ongoing move to a cloud-based, microservice-oriented architecture at Skyscanner, I thought it would be interesting to go and see what other internet-scale companies were doing in this area.
In this blog post I want to recap and share some of the interesting things I learnt.
My pick of the conference
I’ll start with my highlight of the conference: a talk by Susan Fowler (@susanthesquark), a “Software Reliability Engineer” at Uber.
Susan’s talk resonated with me because it presented a solution to a problem that I don’t think we’ve solved at Skyscanner and one that we’re likely to run into more frequently the further we travel down the microservice road.
The talk was titled “Microservice Standardisation” and was about creating a “framework” which allows an organisation to build trust between teams, build truly production-ready microservices and create an overarching product which is reliable, scalable and performant.
In the talk, we were told that Uber has more than 1,300 microservices powering their product. As you can imagine, in an architecture like this each microservice must rely on several others to do its job, each of which in turn probably relies on further services.
Working in this environment, Uber found that it was difficult for development teams to know which other services (and teams) they could trust. If a team is building a component which is critical and needs to be “always” available, how do they know which other services they can use without compromising their availability goal? This became even more complicated for them as they started to heavily utilise the independent scaling of microservices. How do teams know that their dependencies will scale appropriately with them, as a feature is rolled out and becomes popular?
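To put rough numbers on why this matters: availability compounds across dependencies. A service that is itself flawless but calls out to three dependencies, each available 99.9% of the time, can offer at best 0.999 × 0.999 × 0.999, or about 99.7%, availability. If you don’t know what your dependencies guarantee, you can’t honestly promise anything about your own service.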
Uber’s answer to this was to create what they call “production-readiness standards” which all their services must adhere to before they can be trusted with production traffic.
These standards should be:
- Global (local standards don’t build organisational trust)
- General enough to be applicable to all microservices
- Specific enough to be quantifiable and able to produce measurable results (otherwise they aren’t useful)
- Inherently able to give guidance on how to architect, build and/or run a service (otherwise they aren’t helpful to developers)
I found the last point quite interesting; the example given was “Availability”. Availability is an important metric to quantify for microservices and a good way to build trust through SLAs. However, availability is a goal, not a standard. Telling a team to “make your service more available” isn’t terribly helpful. The key is to ask what brings a service closer to the goal, in this case what makes it more available. The answers to that question can then become standards.
The standards that Uber use are:
- Stability
- Reliability
- Scalability
- Fault tolerance
- Catastrophe-preparedness
- Performance
- Monitoring
- Documentation
Being quite mature with this approach, Uber have now automated much of the application of these standards in their infrastructure, processes and pipelines.
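As a toy sketch of what that kind of automation might look like (the checks and thresholds here are my own invention, not Uber’s actual implementation), each standard becomes a measurable predicate that a service either passes or fails, and the pass rate gives you a score:

```python
# Toy production-readiness scorer: each standard becomes a measurable check.
# The specific checks and thresholds are illustrative, not Uber's real ones.

def check_availability(service: dict) -> bool:
    return service.get("availability_30d", 0.0) >= 0.999

def check_monitoring(service: dict) -> bool:
    return service.get("has_dashboards", False) and service.get("has_alerts", False)

def check_documentation(service: dict) -> bool:
    return service.get("runbook_updated_days_ago", 999) <= 90

CHECKS = {
    "availability": check_availability,
    "monitoring": check_monitoring,
    "documentation": check_documentation,
}

def readiness_score(service: dict) -> float:
    """Fraction of the standards this service currently meets."""
    return sum(check(service) for check in CHECKS.values()) / len(CHECKS)

# Example: good uptime and monitoring, but a stale runbook.
svc = {"availability_30d": 0.9995, "has_dashboards": True,
       "has_alerts": True, "runbook_updated_days_ago": 200}
print(f"readiness: {readiness_score(svc):.0%}")  # readiness: 67%
```

A score like this is exactly the sort of number you can put on a leaderboard.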
One of the interesting closing remarks which Susan made was that they now have production-readiness leaderboards for all the services on screens throughout their offices. At first it might seem like this would be kind of cruel, however she said that it has fostered some awesome conversations between their engineers. The boards have become a jumping-off point for “water cooler” conversations about best practices and little tips and tricks, which all feed back into the system to build more trust and better services.
I’d love to see us have something like this at Skyscanner, so that we can all hold our services to high, measurable standards.
An Agile Architect’s Framework for Navigating Complexity
Rewinding to the half-day tutorials on the first day of the conference, I really enjoyed “An Agile Architect’s Framework for Navigating Complexity” in the afternoon session.
This was a pretty hands-on session with lots of group activities and discussion, so I was glad I had fuelled up on the afternoon snacks (only in America: Starbucks coffee and assorted candy).
The session started with us all taking an experience from our own day-to-day work, writing a small headline about it and sketching the architecture. This was basically to get us thinking about a specific piece of work which would form the basis of the rest of the workshop.
We then used coloured and numbered dots to answer questions about the experience on large pieces of paper on the walls, in much the same way as you’d use sticky notes in a retrospective.
The purpose of doing this with a team is that you can then use the data to formulate hypotheses about what is working well, what isn’t and how difficult or important it will be to tackle any issues.
For example, if someone had marked their experience as “having happened frequently” on one of the boards and selected that it “involved customers” on another, you’d probably think that was quite a high-priority issue to look at. You’d then be able to look at some of the other boards and find factors that could be addressed going forward. This also works for positive experiences that you might want to happen more frequently, or that you want to make easier for the team to achieve.
The second half of the workshop was to take the experiences that you’d identified using the techniques above and to apply the Cynefin framework to further categorise them into achievable actions.
For example, if one of the experiences falls into the obvious portion of the Cynefin diagram, you might choose not to take any invasive action other than keeping an eye on it. However, if an experience falls into the complex or complicated domains, you’d be more likely to spend some effort addressing it and producing a plan of action to tackle the issue in future sprints.
Netflix
Two Netflix employees, Dianne March (Director of Engineering) and Scott Mansfield (Senior Software Engineer), gave interesting presentations.
Both presentations were quite heavily based on “here’s what we do at Netflix, take from that what you will” rather than giving any specific guidance, but both were well worth hearing.
Dianne’s presentation covered their version of “build it, run it” and being a good Netflix citizen, rather than having a command-and-control type structure. She also talked a bit about the problems Netflix have had in the past with AWS regions going offline and how that affected the business. This led on to Chaos Monkey and to them running monthly traffic drains from one region to another, to make sure they aren’t building in any region-specific dependencies.
Scott’s talk was even more specific, covering EVCache and how Netflix use this technology they’ve developed to achieve low latency and high throughput. The talk was incredibly detailed and technical so I won’t go into it here, but check out his slides to get an idea of the material he covered.
An off-the-cuff comment he made during this presentation, which I found quite interesting, was that despite Chaos Monkey running practically all the time, AWS still kills far more of their instances than Chaos Monkey does. If ever there was an illustration of the need to develop for the wild west that is the cloud, that is it!
Kubernetes
Playing Tetris live on stage while demoing Kubernetes: enough said.
Containerisation at Pinterest
This was another interesting one. There wasn’t anything terribly surprising in the talk; it was more interesting as a chance to see how well we do at Skyscanner compared to other large, well-known tech companies.
Pinterest seemed to have taken a rocky road to their current architecture. They first had a big problem with random deployment failures, as their infrastructure and machines required a lot of manual intervention, creating inconsistencies across their fleet. They tried to solve this with various technologies, including “Teletraan”, their open-sourced deployment system. Unfortunately, this didn’t fully meet their needs, so they’re currently moving toward a mix of Teletraan and Docker.
Microservices – The Supporting Cast
The last session I want to mention specifically is “Microservices – The Supporting Cast”. This talk was by Randy Layman from Pindrop, a company who provide fraud detection software to financial institutions.
The talk basically covered three supporting services which Pindrop run alongside their business-logic services.
First was their API gateway, a concept with which most of you will likely be familiar. Something potentially unique that Pindrop build into their API gateway, however, is the ability to protect themselves from bad requests.
The story behind this was a client installing a new VoIP phone system which didn’t quite speak the same language as their previous installation. The change triggered a series of events which caused a DDoS-like effect downstream.
With their new API gateway, however, they are now able to detect bad traffic like this and either drop it or reshape it into an understandable structure before forwarding it downstream.
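As a rough illustration of the idea (the schema, field names and aliases here are entirely my own invention, not Pindrop’s actual format), a gateway filter might map known legacy field names onto the current schema, drop anything that still can’t be made sense of, and forward only the fields downstream services understand:

```python
# Hypothetical gateway filter: reshape salvageable requests, drop the rest.
# All field names and aliases are illustrative, not Pindrop's real schema.

REQUIRED_FIELDS = {"caller_id", "call_start", "audio_ref"}

# Known renames from older client integrations (e.g. a legacy VoIP system).
LEGACY_ALIASES = {"callerId": "caller_id", "start": "call_start"}

def filter_request(payload: dict) -> dict | None:
    """Return a normalised payload, or None if the request should be dropped."""
    # Reshape: map legacy field names onto the current schema.
    reshaped = {LEGACY_ALIASES.get(key, key): value for key, value in payload.items()}

    # Drop: anything still missing required fields can't be understood.
    if not REQUIRED_FIELDS <= reshaped.keys():
        return None

    # Forward only the fields downstream services expect.
    return {key: reshaped[key] for key in REQUIRED_FIELDS}
```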
The next service was what they call a “cleaner”. This was quite specific to their business, but I could see how a similar process could be applied to more general scenarios. The cleaner is responsible for identifying sensitive or personally identifiable information and removing it before transmission to downstream services which don’t require it. Depending on whether and how the data is needed, it can be handled in various ways (there’s a small sketch after the list below):
- If the data isn’t required, it can simply be removed
- If the data is needed but can be anonymised, it’s replaced by a generated token
- Finally, if the data is required, they still do the token step. However, in this version the original data is stored within a datastore, from which it can be retrieved later. This means they have only one datastore holding personally identifiable or sensitive information, while still allowing other services to access the data, which makes their lives much easier when it comes to compliance audits.
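Here’s a minimal sketch of that tokenisation idea (the field names and the in-memory “vault” are mine for illustration; a real cleaner would of course use a properly secured datastore):

```python
import secrets

# Illustrative policy: how each field is treated before forwarding.
REMOVE = {"ssn"}             # never needed downstream
ANONYMISE = {"email"}        # replace with a token, discard the original
TOKENISE = {"phone_number"}  # replace with a token, keep original in the vault

# Stand-in for the single secured datastore holding sensitive values.
vault: dict[str, str] = {}

def clean(record: dict) -> dict:
    """Strip or tokenise sensitive fields before a record goes downstream."""
    cleaned = {}
    for field, value in record.items():
        if field in REMOVE:
            continue  # drop the field entirely
        elif field in ANONYMISE:
            cleaned[field] = secrets.token_hex(8)  # irreversible placeholder
        elif field in TOKENISE:
            token = secrets.token_hex(8)
            vault[token] = value  # retrievable later, but only via the vault
            cleaned[field] = token
        else:
            cleaned[field] = value
    return cleaned
```

The nice property is that everything outside the vault can treat tokens as opaque strings, so only one system ever needs to be in scope for an audit.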
Finally, he talked about their auth proxy, which uses Kong and JWT (JSON Web Tokens). There wasn’t anything terribly groundbreaking in this section, but it was interesting to hear how they use JWT, which sounded like a really nice tool that I hadn’t come across before.
The other takeaway from this section was that they deliberately separate their auth proxy from any specific service. This ensures that requests can’t reach services without going through the proxy first. And because they use JWT to encode requests via the proxy, even if a request could be made around the proxy, the downstream service wouldn’t be able to understand it.
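To illustrate the JWT half of that (using the PyJWT library; the claims and shared secret here are placeholders I’ve made up), the proxy signs a token that downstream services verify, so anything the proxy didn’t sign is simply rejected:

```python
import jwt  # PyJWT: pip install pyjwt

SECRET = "shared-secret-placeholder"  # in practice, a securely managed key

# At the auth proxy: wrap the caller's identity and claims in a signed token.
token = jwt.encode({"sub": "client-123", "scope": "fraud-reports:read"},
                   SECRET, algorithm="HS256")

# At the downstream service: decoding verifies the signature, so a request
# that somehow bypassed the proxy arrives unsigned and fails this step.
claims = jwt.decode(token, SECRET, algorithms=["HS256"])
print(claims["sub"])  # "client-123"
```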
Overall I really enjoyed the conference. The quality of the talks was generally great, and it was brilliant to go and listen to how others are tackling their software architecture, especially around microservices and the cloud. The conference is coming to New York and London next year, and I’d recommend attending if you’re interested in this sort of thing.
Thanks again to Skyscanner for sending me along.