Last week I attended the O’Reilly Software Architecture Conference 2017 in London. The conference started on Monday with the usual welcome and keynotes and ran until Wednesday, ending with half-day tutorials just as Velocity started to kick off in the same venue.
The best one-liners
Possibly funny, potentially profound. They stuck with me either way.
“Our service design follows ‘Vegas Rules’. What happens in the region, stays in the region” – Rick Fast (great name), Expedia.
The delivery certainly helped this one, but I thought it was a good way to remember an important design principle.
“Our definition of done is – Live in production, collecting telemetry, which examines the hypothesis behind the change” – Martin Woodward, Microsoft
This line made my day; I think it's an awesome definition of done. Coming from the Experimentation Services squad in Skyscanner, I really hope we can move toward adopting this.
“Box, arrow, box, arrow, cylinder – classic architecture” – Andrew Morgan, SpectoLabs
Every system diagram you’ve ever drawn
“Distributed systems constantly run in a degraded state – something will always be wrong” – Uwe Friedrichsen, codecentric
At first it seemed like a throwaway statement for a few laughs, but at any reasonable scale it is probably true. If our system is always going to be running in an imperfect state, perhaps we should take more time to make sure the degraded version still offers customers a great experience.
Top 2 talks of the conference
I saw quite a few really good talks over the two days, not to mention the half day tutorials on the Wednesday, but I want to call out two specifically.
Building the Next-Gen Edge at Expedia – Rick Fast
The synopsis of Rick's talk mentioned loads of things we've tackled or are tackling at Skyscanner – refactoring a monolith, moving to a distributed web front end, configuration, experimentation, self-service architecture, bot blocking and traffic management. On top of that I was really interested to hear about the work of another big player in the travel space – suffice to say there wasn't another talk I was considering during this slot.
The first thing that struck me while listening to Rick speak was how similar our stacks actually are. Obviously there are implementation and technology differences, but it was more a variation on a theme than a paradigm shift:
- Geo-routing to 5 AWS regions
- Outer edge is Akamai
- Inner edge service at the front of each region for service routing and other things (a bit like our scaffolding service and varnish cache)
- An old monolith for some core pages
- Distributed front end for new development
- Experimentation used for traffic splits when launching new services (their experiment platform is called Abacus) – see the small sketch after this list
- Dynamic configuration (for the inner edge system)
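The experiment-driven traffic split is easy to picture in code. Here is a minimal sketch of the general idea (hash a stable user id into a bucket and send a configured slice of traffic to the new service); the names and percentages are made up for illustration, and it's the pattern rather than how Abacus actually works.

```python
# Hypothetical illustration of an experiment-style traffic split for
# launching a new service behind an existing one.
import hashlib

def bucket(user_id, experiment="new-search-service", buckets=100):
    # Hash user id + experiment name so each experiment gets its own split.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

def route(user_id, new_service_percent=5):
    # Send a fixed percentage of users to the new service, the rest to the monolith.
    return "new-service" if bucket(user_id) < new_service_percent else "monolith"

# e.g. route("user-42") -> "monolith" or "new-service", stable per user
```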
That’s not to say they weren’t doing some really interesting things. Here’s a couple which caught my attention.
Their inner edge is a Java application which has a plugin type architecture. One of the plugin modules they run handles bot blocking.
For this they analyse the context of the request, generating a bot confidence score and a sentiment about the bot – good or bad.
Using the confidence score together with the sentiment they take different actions. For example, if the confidence score is low they might redirect the request to a captcha. For a high confidence bot they know to be "good", think Google, they'll let it through to a suitable page. My favourite, though, was for a high confidence "bad" bot. He said they don't use this often – let it through but change all the prices to ruin the scraped results.
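To make the idea concrete, here is a tiny sketch of what a confidence-plus-sentiment decision could look like. The thresholds and helper functions are entirely made up; this is just the shape of the logic, not Expedia's plugin code.

```python
# Hypothetical sketch of acting on a bot confidence score plus sentiment.
def redirect_to_captcha(request):
    return {"action": "captcha", "request": request}

def route_to_origin(request, mangle_prices=False):
    return {"action": "origin", "mangle_prices": mangle_prices, "request": request}

def handle_request(request, confidence, sentiment):
    """confidence is a 0-1 bot score, sentiment is 'good' or 'bad'."""
    if confidence < 0.5:
        # Unsure whether it's a bot at all: challenge it rather than block it.
        return redirect_to_captcha(request)
    if sentiment == "good":
        # Known-good bot (think Google): let it through to a suitable page.
        return route_to_origin(request)
    # High-confidence bad bot: let it through but poison the prices so
    # scraped results are worthless.
    return route_to_origin(request, mangle_prices=True)
```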
They have dynamic configuration for the inner edge which teams can control using a self service portal (called Lobot). The source of truth is jsonb in a Postgres database, however changes are pushed out to machines via Consul.
A really cool part of this was how they fed back to users in the UI that their changes had propagated, without having any pre-defined notion of the current state of the system and which nodes exist.
First the self service system saves the changes to the database and then publishes them to the Consul servers in each region.
Next the self service system asks Consul for a list of edge instances which are currently connected.
The changes are synced to the Consul agents on the edge instances. When a change is received the node writes a message to SQS (or writes an error message).
The self service portal can then match the list of instances it received earlier to messages coming from SQS and can show progress as they update, until the change is fully propagated.
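Here is a rough sketch of that feedback loop from the portal's point of view: take the instance list Consul reported, then tick instances off as acknowledgement messages arrive on SQS. The queue URL, message shape and instance list are assumptions purely for illustration; this is the pattern, not Lobot's actual code.

```python
# Hypothetical sketch of tracking config propagation via SQS acknowledgements.
import json
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/config-acks"  # made up
sqs = boto3.client("sqs")

def wait_for_propagation(change_id, expected_instances):
    """expected_instances: the edge nodes Consul said were connected."""
    pending = set(expected_instances)
    while pending:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            ack = json.loads(msg["Body"])
            if ack.get("change_id") == change_id and ack.get("node") in pending:
                if ack.get("status") == "ok":
                    pending.discard(ack["node"])
                else:
                    print(f"{ack['node']} reported an error: {ack.get('error')}")
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        done = len(expected_instances) - len(pending)
        print(f"{done}/{len(expected_instances)} instances updated")
```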
Practically everything that happens in their stack gets pushed into Kinesis – logs, metrics, even the requests and responses made by services in the stack.
At the end of this stream is a system called Haystack, which they have open sourced in the last few months.
You can read more about Haystack over at the GitHub page, but it sounded like a really interesting tool for making sense of what is happening when things go wrong. I also thought it was a slightly different take on the tracing space compared to XRay, App Dynamics, New Relic etc.
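For what it's worth, the "push everything onto the stream" part is roughly a one-liner per event with boto3. A minimal sketch, assuming a hypothetical stream name and record shape:

```python
# Hypothetical sketch of publishing edge events onto a Kinesis stream.
import json
import boto3

kinesis = boto3.client("kinesis")

def publish_event(event_type, payload, request_id):
    record = {"type": event_type, "request_id": request_id, "payload": payload}
    kinesis.put_record(
        StreamName="edge-events",          # made-up stream name
        Data=json.dumps(record).encode(),  # Kinesis expects bytes
        PartitionKey=request_id,           # keeps one request's events together
    )

# e.g. publish_event("response", {"status": 200, "latency_ms": 42}, "abc-123")
```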
How we moved 65,000+ Microsofties to DevOps in the public cloud – Martin Woodward
Martin's talk was an interesting and frank look at moving to DevOps within Microsoft. He's a Principal Program Manager working on VSTS, which powers parts of Azure and is now used by all Microsoft teams for development.
He touched on a lot of interesting things, both organisational and technical; I'll try to cover a few here.
For big features they typically start a rollout using feature flags which are gradually opened up to sets of users. First to the team, then select internal users, wider internal opt-in, then external with opt-in.
The opt-in part was interesting as they monitor the number of users who opt-in to a feature only to later opt-out. Possibly a leading indicator of a “bad” feature.
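Here is a minimal sketch of what those rollout rings could look like as a flag check. The stage names and user attributes below are my own invention, not how VSTS implements it.

```python
# Hypothetical sketch of a staged (ring-based) feature flag check.
from enum import IntEnum

class Stage(IntEnum):
    OFF = 0
    TEAM = 1             # the team building the feature
    INTERNAL_SELECT = 2  # hand-picked internal users
    INTERNAL_OPT_IN = 3  # any internal user who opts in
    EXTERNAL_OPT_IN = 4  # external users who opt in
    EVERYONE = 5

def is_enabled(flag_stage, user):
    """Return True if the feature should be on for this user (a dict of attributes)."""
    if flag_stage >= Stage.EVERYONE:
        return True
    if user.get("on_feature_team") and flag_stage >= Stage.TEAM:
        return True
    if user.get("internal"):
        if user.get("selected") and flag_stage >= Stage.INTERNAL_SELECT:
            return True
        return bool(user.get("opted_in")) and flag_stage >= Stage.INTERNAL_OPT_IN
    return bool(user.get("opted_in")) and flag_stage >= Stage.EXTERNAL_OPT_IN
```

Recording per-feature opt-ins and later opt-outs then gives you exactly the "leading indicator of a bad feature" signal he described.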
He also mentioned that they tend to "dark launch" features, especially if they've got an event coming up where it'll be announced to a fanfare. Apparently at one such event they turned on a feature as the live demo was starting, only to uncover a cascading failure which hadn't previously presented itself. Cue red faces all round. Now they dark launch.
When it comes to a full roll out they start with what he called a canary data centre which is only used by internal users, I guess you can afford such luxuries when you are running the platform. From here they go to the smallest external data centre, then the largest to get the extremes of traffic. Penultimately they go to the highest latency data centre before releasing everywhere.
Live Site Culture / Production First Mindset / Security Mindset
If you’ve been involved in a production incident you might recognise some of these themes.
Every alert that a system generates must be actionable and represent a real issue.
When there is an incident a "bridge" call is created with the people that are required. For each service they have a DRI, or "Directly Responsible Individual", who will be brought into the call (these people are on a rotation like our 24/7 rotas).
After an incident they do a root cause analysis and run transparent postmortems, much in the same way we do. Anything that comes out of a postmortem has a two sprint deadline to be fixed, which is usually up to 6 weeks away.
In postmortems, if bad alerts were raised or the wrong DRI was paged it is considered just as serious as the problem itself – anything that increases the time to resolution is bad.
They also periodically run security war games where a group will try and break into the system while another tries to detect them. Their mantra is to accept that "breaches happen", so it is better that an internal team finds them and that they can detect breaches as quickly as possible. Emailing everyone a screen grab of the hacking team logged into VSTS as one of the heads of engineering is a fun prize when they manage to succeed.
The cool thing about the war games is that they have incrementally become harder for the attacking team, as the more obvious flaws are found and fixed.
The principles around teams he described weren’t a million miles from some that we have at Skyscanner.
There were some differences though. Firstly, they seemed to favour physically co-located teams more than we do. Secondly, they favour a three-week sprint structure, starting with a sprint plan communicated to the group (tribe) and ending with a "what we accomplished" message including a short over-the-shoulder video demo. The video demos must have zero production value, as people got a bit carried away initially!
The most surprising thing for me was that their teams typically disband after 12-18 months. At this point program managers pitch their projects and engineers pick a top 3 which they’d like to work on. They aim to give everyone their first choice, although occasionally go to 2nd. That’s how teams are formed for the next 12-18 months.
Effective Testing with API Simulation and (Micro)Service Virtualisation
The morning tutorial I attended was called “Effective Testing with API Simulation and (Micro)Service Virtualisation”.
Testing our microservices without spinning up versions of their dependencies, writing custom mocks or using sandbox stacks is something I don't think we've quite mastered, so this tutorial seemed quite interesting to me.
The tutorial was mainly a hands-on lab using an open source tool called Hoverfly to simulate a dummy Flights API that they had supplied (I did find it quite amusing that it was a flight API).
We started by setting up Hoverfly as a proxy between ourselves and the API and allowing it to record the communication.
From this it was able to do simple playback of responses it had seen.
We built on this first by playing around with the request matching rules, making them a bit smarter and able to respond to more general requests than the ones we’d explicitly made.
Next we added templating to the rules so that request parameters, in this case from and to places, could be used in the response.
Finally we looked at the plugins which allow you to write custom Python, Java or Node code to intercept requests and build responses.
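As a flavour of the plugin approach, here is a minimal Python middleware sketch. Hoverfly hands the request/response pair to the script as JSON on stdin and expects the (possibly modified) JSON back on stdout; the exact field names below are from memory, so treat them as an assumption and check the Hoverfly docs.

```python
#!/usr/bin/env python3
# Minimal Hoverfly middleware sketch: read the request/response pair from
# stdin, tweak the response, write the JSON back to stdout.
# Field names are assumed from memory rather than copied from the docs.
import json
import sys

def main():
    payload = json.loads(sys.stdin.read())
    response = payload.get("response", {})

    # Example tweak: stamp every simulated response so it's obvious during
    # testing that it came from the simulation rather than the real API.
    headers = response.setdefault("headers", {})
    headers["X-Simulated-By"] = ["hoverfly-middleware"]

    payload["response"] = response
    sys.stdout.write(json.dumps(payload))

if __name__ == "__main__":
    main()
```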
Overall it seemed like a really nifty tool and I’ll certainly be looking at how we could apply it to our deployments in Experimentation Services.
Resilient Software Design In A Nutshell
The afternoon session was totally different – it was three hours of really intense information, but it was awesome.
Uwe Friedrichsen rattled through a whole tool box of patterns and techniques for resilient design; thankfully not every pattern on his (very full) overview slide was covered, or we might not have finished yet.
What I really liked about the session was that it was presented as a tool box with pros and cons rather than a “do this” approach.
I’ve got a 300 line text file on my computer with notes from the talk so clearly I can’t do it justice in a paragraph or two but I really enjoyed it and learnt a lot.
Here’s some of the things he covered:
- Availability = MTTF / (MTTF + MTTR), i.e. mean time to failure divided by the sum of mean time to failure and mean time to recovery (for example, an MTTF of 999 hours with an MTTR of 1 hour gives 99.9% availability)
- Failure types
- Activation paths
- Dismissing reusability (a controversial one)
- Detection methods – timeout, circuit breakers, acks, monitoring (see the small sketch after this list)
- Recovery methods – retry, rollback, rollforward, reset, failover
- Mitigation methods – fall back, queues, back pressure, share load
- Prevention – routine maintenance
- Complementing – redundancy, idempotency, stateless, escalation
- Other stuff – backup requests, marked data, anti-fragility, error injection, relaxed temporal constraints
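To give a flavour of the tool box, here is a compressed sketch of two of the patterns: a circuit breaker for detection and a bounded retry for recovery. The thresholds and timings are arbitrary illustrations of mine, not recommendations from the tutorial.

```python
# Hypothetical sketch of two resilience patterns: circuit breaker + retry.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open - failing fast")
            self.opened_at = None  # half-open: let one call probe the dependency
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

def retry(fn, attempts=3, delay=0.5):
    """Recovery pattern: retry a (hopefully transient) failure a bounded number of times."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay * (attempt + 1))  # simple linear back-off
```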
A great takeaway from the tutorial was what he called the "Architecture Cycle":
1. Make initial core architectural decisions
2. Design bulkheads
3. Select detection, recovery and mitigation patterns
4. Optionally augment with prevention and complementing patterns
5. Learn (measure and learn – don't guess, know)
6. Assess pattern selection (value vs cost – keep the balance)
7. Repeat from step 2 (revisit the core if needed)
Phew – that was a whirlwind tour of my time at Software Architecture Conference, London 2017. I hope you enjoyed the write-up as much as I enjoyed the conference.