Raymond Davies – Software Developer

Software Developer, homebrewer and general geek

Here’s to the data scientists, growth hackers and designers — February 12, 2018

Here’s to the data scientists, growth hackers and designers

Currently I lead the Experimentation Services Squad within Skyscanner. Below are my reflections on moving from a dev-only team to a multi-discipline squad.

When I joined Skyscanner we had “Flights Teams” in Edinburgh who would work on any part of the flights product – front end to back end, stopping short of database work by consuming various APIs created by other teams.

We tried a couple of project teams around this time, but shortly afterward started moving toward the squad model as we all hurried to read “The Goal”. The squad model we started with has changed a lot between then and now, and it’s the latest evolution of my current squad that I want to discuss.

The teams you’ve worked in might have been different, but up until now the teams and squads I’ve been involved in have all been very developer focused – made up almost exclusively of developers, with a project manager or product owner attached, often working across several teams.

This has worked well for the squads I’ve been in, without a doubt, especially recently when our squad’s focus was in the area of internal “Developer Enablement”. We completed a migration to AWS, built SDKs in collaboration with another squad in the tribe and got a long way toward releasing our new experimentation platform under this model. We knew what we wanted to do and how to do it, and we just got on with it.

More recently, however, the team make-up has started to change. At first we gained a data scientist, a little later a designer joined the team and most recently we were joined by a growth hacker. Conveniently this happened as we began moving toward an internal product focus within a tribe focused on helping growth teams win in their markets.

The Good Bits

Now obviously this isn’t going to be a model that is useful or necessary for every team, however, as a team of developers moving from enablement to building an internal product for a largely non-developer user base, this has been incredibly useful.

There are many things, little and large, that have come out of discussions between the different discipline groups we have now that we’d not have considered alone.

Adding data science to the team has allowed us to really drill down and understand the fundamentals of the experiment tool we’re offering. Questions around allocating users, building statistically valid experiments and how to analyse experiments without bias or incorrect interpretation have all been given significant thought by an expert in this area. This information can then be communicated back to the team to inform how we build out the next versions of the experimentation platform.

Having a member of the growth discipline has given us an invaluable insight into the primary user group for the platform going forward, especially useful as this is a group we’ve had a relatively small level of interaction with before now. How do growth teams run experiments? What do they need from an experimentation tool? How do we make the ideal experimentation work flow easy to use and understand? All these types of questions can be investigated and thoroughly thought-out and worked into our roadmap.

Finally, having a designer in the team has helped us pull together the requirements from data science, growth and engineering into a coherent vision. We can now take input from data science on the perfect experimentation tool, add the needs of growth teams and bring this together with a fantastic UX to end up with a system that is easy to understand and use. Having a designer dedicated to our team has been fantastic. Being able to fully explore upcoming development tasks from a design perspective means that we have well thought out user flows and interfaces built out, ready for any of the developers to pick up and execute. No more tacky elements tacked on without much thought just to complete the current card.

Having all this extra input has been incredibly eye-opening; never before have we had such a clear understanding of the rest of the business, what we need to build and why.

The Challenges

Of course, having this extra input doesn’t happen magically and as a team new to integrating other disciplines into our workflow we’ve had a few challenges along the way. It isn’t rocket science but hopefully it’ll be useful if you are facing a similar situation.

Firstly I’d highly recommend going through a goal-setting exercise, or at least taking a look at the “why” of your team again – perhaps working to a back-brief style template, which you might be familiar with if you’ve come across “The Art of Action”.

This was especially useful for us as we had recently changed goals. It allowed us to take our stated execution plan and break it down – where could each of the disciplines add value? In which order could we do things so that any data science research, user testing or growth analysis could take place ahead of the developers picking up the work? Doing this made sure we were all aligned on a vision and that everyone, not just the developers, had meaningful goals to aim toward.

Another challenge we had was making sure we got input from each of the disciplines at the right time. There is little point in a developer picking up a card only to be blocked waiting on a data science investigation, or querying why the design doesn’t match reality. This sounds pretty obvious, and all it takes is a little forward thinking to solve, but when you’re used to creating cards that anyone can work through end to end it can be easy to forget. Calling this out in planning / pre-planning and going through the backlog with an eye for this type of issue every so often can be helpful.

I also want to mention meetings. In changing our goal and tribe, plus adding new team members, we’ve had more meetings recently than I’d care to admit – some have gone amazingly well and others have been terrible. As a team our number one take-away from this process has been to embrace small targeted groups for meetings. Rather than having a mega full-squad meeting, with members in different locations, break the meeting down into sub-groups. In our case that can mean data science, growth and design coming up with a shared vision as a small group before presenting it back to the developers, or perhaps just a designer and a developer having a catch-up on a specific topic. For this to be effective everyone needs to be comfortable that they can’t know everything that is going on all the time, but as long as you can communicate the headlines back to the group via Confluence, Slack or something else, this is much more effective than mega meetings, in our experience. We’ve also found that sending out a Confluence pre-read and agenda is super helpful in making sure we can have short, action-oriented meetings – push back on meetings without an agenda!

Finally – call out meetings that aren’t working. We found this helpful when we mistakenly called a mega meeting at short notice; 10 minutes in, it was clear we weren’t going to get what we needed out of the meeting. The call connection wasn’t great and we were spending more time asking people to repeat what they’d said than discussing the issues. At this point it made most sense to give everyone their time back and come at the problem again from a different angle with smaller groups. You don’t necessarily have to cancel the meeting, but be brave enough to call out meetings which aren’t working toward actions and save everyone useful daylight hours.

In Conclusion

I really hope that more teams in Skyscanner move toward a multi-discipline approach, especially in product areas or where we’re developing internal tools for a specific non-developer audience. The close collaboration between disciplines within a team leads to amazing outcomes and well-informed decisions, without the overhead of ongoing cross team meetings, competing time commitments and the other little things that get in the way of collaboration over the course of a day.

Before we had it we didn’t know we needed it. When it first arrived it was a little overwhelming, but now we wouldn’t want to live without it.

Here’s to the data scientists, growth hackers, designers and all the other disciplines out there!

Software Architecture Conference 2017, London – Best Of — October 23, 2017

Software Architecture Conference 2017, London – Best Of


Last week I attended O’Reilly Software Architecture Conference 2017 in London. The conference started on Monday with the usual welcome and keynotes, running until Wednesday, ending with half day tutorials just as Velocity started to kick off in the same venue.

The best one-liners

Feasibly funny, potentially profound. They stuck with me either way.

“Our service design follows ‘Vegas Rules’. What happens in the region, stays in the region” – Rick Fast (great name), Expedia.

The delivery certainly helped this one but I thought it was a good way to remember an important design principle.

“Our definition of done is – Live in production, collecting telemetry, which examines the hypothesis behind the change” – Martin Woodward, Microsoft

This line made my day – I think it’s an awesome definition of done. Coming from the Experimentation Services squad in Skyscanner I really hope we can move toward adopting this.

“Box, arrow, box, arrow, cylinder – classic architecture” – Andrew Morgan, SpectoLabs

Every system diagram you’ve ever drawn (smile)

“Distributed systems constantly run in a degraded state – something will always be wrong” – Uwe Friedrichsen, codecentric

At first it seemed like a throwaway statement for a few laughs, but at a reasonable scale I think it’s probably true. If our system is always going to be running in an imperfect state, perhaps we should take more time to make sure the degraded version still offers customers a great experience.

Top 2 talks of the conference

I saw quite a few really good talks over the two days, not to mention the half day tutorials on the Wednesday, but I want to call out two specifically.

Building the Next-Gen Edge at Expedia – Rick Fast

The synopsis of Rick’s talk mentioned loads of things we’ve tackled or are tackling at Skyscanner – refactoring a monolith, moving to a distributed web front end, configuration, experimentation, self-service architecture, bot blocking and traffic management. On top of that I was really interested to hear about the work of another big player in the travel space – suffice to say there wasn’t another talk I was considering during this slot.


The first thing that struck me while listening to Rick speak was how similar our stacks actually are. Obviously there are implementation and technology differences, but it was more a variation on a theme than a paradigm shift:

  • Geo-routing to 5 AWS regions
  • Outer edge is Akamai
  • Inner edge service at the front of each region for service routing and other things (a bit like our scaffolding service and varnish cache)
  • An old monolith for some core pages
  • Distributed front end for new development
  • Experimentation used for traffic splits when launching new services (their experiment platform is called Abacus)
  • Dynamic configuration (for the inner edge system)

That’s not to say they weren’t doing some really interesting things. Here are a couple which caught my attention.

Bot Blocking

Their inner edge is a Java application which has a plugin type architecture. One of the plugin modules they run handles bot blocking.


For this they analyse the context of the request, generating a bot confidence score and sentiment about the bot – good, bad, ugly, indifferent.

Using the confidence score together with the sentiment they take different actions. For example if the confidence score is low they might redirect the request to a captcha. For a high confidence bot they know to be “good”, think Google, they’ll let it through to a suitable page. My favourite though, was for a high confidence “bad” bot. He said they don’t use this often – let it through but change all the prices to ruin scraped results (big grin)
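
As a rough illustration, the decision step might look something like this – the thresholds and action names are entirely made up for the example, not Expedia’s real values:

```python
# Combine a bot-confidence score with a sentiment label to pick an
# edge action. Thresholds and action names are invented.

def bot_action(confidence: float, sentiment: str) -> str:
    """Map a bot confidence score (0-1) and sentiment to an action."""
    if confidence < 0.5:
        # Not confident it's a bot at all: challenge rather than block.
        return "redirect-to-captcha"
    if sentiment == "good":
        # High-confidence good bot (think Google): let it through.
        return "allow"
    if sentiment == "bad":
        # High-confidence bad bot: serve poisoned prices to ruin scrapes.
        return "serve-fake-prices"
    # Ugly / indifferent: allow, but keep watching.
    return "allow"
```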

Dynamic Configuration

They have dynamic configuration for the inner edge which teams can control using a self service portal (called Lobot). The source of truth is jsonb in a Postgres database, however changes are pushed out to machines via Consul.


A really cool part of this was how they fed back to users in the UI that their changes had propagated, without having any pre-defined notion of the current state of the system and which nodes exist.

First the self service system saves the changes to the database and then publishes them to the Consul servers in each region.

Next the self service system asks Consul for a list of edge instances which are currently connected.

The changes are synced to the Consul agents on the edge instances. When a change is received the node writes a message to SQS (or writes an error message).

The self service portal can then match the list of instances it received earlier to messages coming from SQS and can show progress as they update, until the change is fully propagated.
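
The propagation check boils down to set matching between the instance list from Consul and the ack messages from SQS. Here’s a minimal sketch – the message shape and field names are my own assumptions:

```python
# Compare the instance list Consul reported with the ack messages
# consumed from SQS, to report config propagation progress.

def propagation_progress(expected_instances, ack_messages):
    """Report which edge instances have applied a config change.

    expected_instances: instance ids Consul listed as connected.
    ack_messages: (instance_id, status) pairs consumed from SQS.
    """
    acked = {i for i, status in ack_messages if status == "ok"}
    errored = {i for i, status in ack_messages if status != "ok"}
    pending = set(expected_instances) - acked - errored
    return {
        "done": not pending and not errored,
        "acked": sorted(acked),
        "errored": sorted(errored),
        "pending": sorted(pending),
    }
```

The portal can poll something like this until the change is fully applied, showing the pending list as a progress indicator in the meantime.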


Practically everything that happens in their stack gets pushed into Kinesis – logs, metrics, even the requests and responses made by services in the stack.

At the end of this stream is a system called Haystack, which they have open sourced in the last few months.

You can read more about Haystack over at the GitHub page but it sounded like a really interesting tool for making sense of what is happening when things go wrong. I also thought it was a slightly different take on the tracing space compared to XRay, App Dynamics, New Relic etc.

How we moved 65,000+ Microsofties to DevOps in the public cloud – Martin Woodward

Martin’s talk was an interesting and frank look at moving to DevOps within Microsoft. He’s a Principal Program Manager working on VSTS, which powers parts of Azure and all Microsoft teams now use it for development.

He touched on a lot of interesting things both organisational and technical, I’ll try and cover a few here.



For big features they typically start a rollout using feature flags which are gradually opened up to sets of users. First to the team, then select internal users, wider internal opt-in, then external with opt-in.

The opt-in part was interesting as they monitor the number of users who opt-in to a feature only to later opt-out. Possibly a leading indicator of a “bad” feature.
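
The ring progression and the opt-out signal could be sketched like this – a toy illustration with made-up ring names and event shapes, not Microsoft’s actual implementation:

```python
# Ring-based rollout plus an opt-out rate as a "bad feature" signal.
# Ring names and event shapes are invented for the example.

ROLLOUT_RINGS = ["team", "internal-select", "internal-opt-in",
                 "external-opt-in", "everyone"]

def feature_enabled(user_ring: str, current_ring: str) -> bool:
    """A feature is on for every ring at or before the current one."""
    return ROLLOUT_RINGS.index(user_ring) <= ROLLOUT_RINGS.index(current_ring)

def opt_out_rate(events) -> float:
    """Fraction of opt-in users who later opted out - a possible
    leading indicator of a 'bad' feature."""
    opted_in = {user for user, action in events if action == "opt-in"}
    opted_out = {user for user, action in events
                 if action == "opt-out" and user in opted_in}
    return len(opted_out) / len(opted_in) if opted_in else 0.0
```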

He also mentioned that they also tend to “dark launch” features, especially if they’ve got an event coming up where it’ll be announced to a fanfare. Apparently at one such event they turned on a feature just as the live demo was starting, only to uncover a cascading failure which hadn’t previously presented itself. Cue red faces all round. Now they dark launch.

When it comes to a full roll out they start with what he called a canary data centre which is only used by internal users, I guess you can afford such luxuries when you are running the platform. From here they go to the smallest external data centre, then the largest to get the extremes of traffic. Penultimately they go to the highest latency data centre before releasing everywhere.

Live Site Culture / Production First Mindset / Security Mindset

If you’ve been involved in a production incident you might recognise some of these themes.

Every alert that a system generates must be actionable and represent a real issue.

When there is an incident a “bridge” call is created with the people that are required. For each service they have a DRI (“Directly Responsible Individual”) who will be brought into the call (these people are on a rotation like our 24/7 rotas).

After an incident they do a root cause analysis and run transparent postmortems, much in the same way we do. Anything that comes out of a postmortem has a two sprint deadline to be fixed, which is usually up to 6 weeks away.

In postmortems, if bad alerts were raised or the wrong DRI was paged it is considered just as serious as the problem itself – anything that increases the time to resolution is bad.

They also periodically run security war games where a group will try and break into the system while another tries to detect them. Their mantra is to accept that “breaches happen”, so it is better that an internal team finds them and that they can detect breaches as quickly as possible. Emailing everyone a screen grab of the hacking team logged into VSTS as one of the heads of engineering is a fun prize when they manage to succeed.

The cool thing about the war games is that they have incrementally become harder for the attacking team, as less and less obvious flaws are found and fixed.


The principles around teams he described weren’t a million miles from some that we have at Skyscanner.


There were some differences though, firstly they seemed to favour physically located teams more than we do. Second they favour a 3 week sprint structure. Starting with a sprint plan communicated to the group (tribe) and ending with a “what we accomplished” message including a short over the shoulder video demo. The video demos must have zero production value as people got a bit carried away initially!

The most surprising thing for me was that their teams typically disband after 12-18 months. At this point program managers pitch their projects and engineers pick a top 3 which they’d like to work on. They aim to give everyone their first choice, although occasionally go to 2nd. That’s how teams are formed for the next 12-18 months.

Tutorial takeaways

Effective Testing with API Simulation and (Micro)Service Virtualisation

The morning tutorial I attended was called “Effective Testing with API Simulation and (Micro)Service Virtualisation”.

Testing our microservices without spinning up versions of their dependencies, writing custom mocks or using sandbox stacks is something I don’t think we’ve quite mastered, so this tutorial seemed quite interesting to me.

The tutorial was mainly a hands-on lab using an open source tool called Hoverfly to simulate requests from a dummy flights API that they had supplied (the fact it was a flights API I did find quite amusing).


We started by setting up Hoverfly as a proxy between ourselves and the API and allowing it to record the communication.

From this it was able to do simple playback of responses it had seen.

We built on this first by playing around with the request matching rules, making them a bit smarter and able to respond to more general requests than the ones we’d explicitly made.

Next we added templating to the rules so that request parameters, in this case from and to places, could be used in the response.

Finally we looked at the plugins which allow you to write custom Python, Java or Node code to intercept requests and build responses.
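
The record-then-generalise workflow can be illustrated with a toy matcher – this is a concept sketch in plain Python to show the idea, not Hoverfly’s actual API:

```python
import re

class RecordingSimulator:
    """Toy record/playback simulator in the spirit of the tutorial -
    a concept sketch only, not Hoverfly's real interface."""

    def __init__(self):
        self.rules = []  # (compiled pattern, response template)

    def record(self, path, response):
        """Capture an exact request path and its response."""
        self.rules.append((re.compile(re.escape(path) + "$"), response))

    def add_rule(self, pattern, template):
        """Looser matching with templating: named groups captured from
        the request path are substituted into the response."""
        self.rules.append((re.compile(pattern + "$"), template))

    def simulate(self, path):
        """Play back the first matching response, or None."""
        for pattern, template in self.rules:
            match = pattern.match(path)
            if match:
                return template.format(**match.groupdict())
        return None
```

With a templated rule like r"/flights/(?P&lt;src&gt;\w+)/(?P&lt;dst&gt;\w+)" a request for /flights/EDI/LHR gets a generated response without ever having been recorded, which is roughly the progression the lab walked through.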

Overall it seemed like a really nifty tool and I’ll certainly be looking at how we could apply it to our deployments in Experimentation Services.

Resilient Software Design In A Nutshell

The afternoon session was totally different – it was three hours of really intense information, but it was awesome.


Uwe Friedrichsen rattled through a whole tool box of patterns and techniques for resilient design, thankfully not all of the ones on that slide were covered or we might not have finished yet.

What I really liked about the session was that it was presented as a tool box with pros and cons rather than a “do this” approach.

I’ve got a 300 line text file on my computer with notes from the talk so clearly I can’t do it justice in a paragraph or two but I really enjoyed it and learnt a lot.

Here’s some of the things he covered:

  • Availability = mean time to failure / (mean time to failure + mean time to recovery)
  • Failure types
  • Isolation
  • Activation paths
  • Dismissing reusability (a controversial one)
  • Detection methods – timeout, circuit breakers, acks, monitoring
  • Recovery methods – retry, rollback, rollforward, reset, failover
  • Mitigation methods – fall back, queues, back pressure, share load
  • Prevention – routine maintenance
  • Complementing – Redundancy, idempotency, stateless, escalation
  • Other stuff – backup requests, marked data, anti-fragility, error injection, relaxed temporal constraints
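
To make the first bullet concrete, here’s the availability formula as code with a worked example (the numbers are invented):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A service failing every 30 days (720h) that takes 2h to recover
# achieves 720 / 722, roughly 99.7% - note that shrinking recovery
# time improves availability just as much as preventing failures.
```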

A great takeaway from the tutorial was what he called the “Architecture Cycle”:

  1. Make initial core architectural decisions
  2. Design bulkheads
  3. Select detection, recovery and mitigation patterns
  4. Optionally augment with prevention and complementing patterns
  5. Implement
  6. Deploy
  7. Measure
  8. Learn (Measure and learn. Don’t guess, know)
  9. Assess pattern selection (Value vs cost – Keep the balance)
  10. Repeat from 2 (Revisit core if needed)

Phew – That was a whirl-wind tour of my time at Software Architecture Conference, London 2017. I hope you enjoyed the write up as much as I enjoyed the conference (smile)

Migrating a running service to AWS — October 2, 2017

Migrating a running service to AWS

At Skyscanner I work on our internal Experimentation and Configuration platform called Dr Jekyll (the UI) and Mr Hyde (the API).

Recently we migrated these systems from our private data centres to AWS. Below is an excerpt from a blog I wrote for our internal company blog, detailing some of the interesting learnings our squad had from undertaking the migration project.

Akamai Routing – You are our saviour!

Services using the Mr Hyde API usually poll it using a background thread to pull in experimentation data. This means that these calls aren’t on the critical path, and within reason, aren’t time critical.

However, native apps call the Mr Hyde API on application start, from anywhere in the world, and need to be able to receive experiment and other data within a one second window. If any calls can’t be completed in this window they will be aborted and the app will start without them.

In our data centre setup, we came in under this budget globally for ~70% of requests.

For the initial stab at an AWS setup we deployed to all Skyscanner AWS regions and put the service behind the Skyscanner API gateway. With fingers crossed we set up some Monitis checks to ping our service so we could monitor response times.

Non-Akamai Routing

Sadly, all regions outside of EMEA (in reality only really central Europe) were now well above the 1 second threshold for the vast majority of requests. We also observed frequent latency spikes when the AWS routing tables switched to send requests to a more distant region. For example, a request originating close to the Singapore region would normally be served by ap-southeast-1, however routing frequently switched to ap-northeast-1 in Tokyo – severely increasing response time in the process.

To try and address this problem we put our API behind a geo-aware route in Akamai. I was initially sceptical that adding another link in the request chain would result in a significant improvement in response time – Akamai’s super-fast network proved me wrong!

Akamai Routing

Behind Akamai we eliminated the latency spikes from the strange routing behaviour and reduced overall response time to a level where globally we matched or improved on the data centre’s performance.

Recent observations show that the AWS setup now beats our data centre figures, with between 1–5% more of all requests making it under the boundary time.

Migrating Data – Lambda can work well

I’m sure these two words, migrating data, strike fear into the hearts of many a developer. Typically, a pretty time consuming and sometimes manual process coming near the end of a long project. Do we have to bring over all that horrible inconsistent legacy data to ruin our lovely new system? Sigh.

In our case we did have to bring our old data with us as we needed to move seamlessly from one system to the next. To our advantage however was the small overall size of our database – at the time stored within the data centre Couchbase cluster.

We wanted to frequently “sync” our AWS database, now Postgres, with the Couchbase data. Firstly, this would allow us to work on proving the AWS setup using live data. Additionally, it would mean that we could at any time close the old system, wait for the next sync and then turn on the new system with practically no down time.

To do this we took advantage of scheduled Lambdas. Our Lambda would request the most recent data via our production API, which was already publicly accessible due to native apps calling it. The Postgres table would then be truncated and the new data inserted.

Simple and effective, making use of our existing API and also reusing code that we’d already written elsewhere. Overall Lambda was a really handy platform to get a CRON type task up and running easily and reliably.
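
The handler itself needs very little code. Here’s a hypothetical sketch with the API call and the database write injected as callables – every name is illustrative rather than our real code:

```python
# Sketch of the scheduled sync Lambda described above. The production
# API call and the Postgres truncate-and-insert are passed in as
# callables so the core logic stays trivial to test.

def sync_handler(fetch_current_data, replace_table):
    """Pull the latest data from the old system's public API and
    overwrite the new Postgres table with it."""
    rows = fetch_current_data()   # e.g. an HTTP GET against the API
    replace_table(rows)           # e.g. TRUNCATE + bulk INSERT in one txn
    return {"synced": len(rows)}

# On AWS this would be wrapped as something like
#     def handler(event, context):
#         return sync_handler(api_fetch, postgres_replace)
# and triggered on a schedule (CloudWatch Events / EventBridge).
```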


S3 Replication Lag

Amazon’s S3 service gets used and abused for all sorts of things these days and it’s easy to assume it’ll do everything you throw at it perfectly.

To enable us to run our Mr Hyde API in all Skyscanner regions we decided to write our data to S3 and use the built-in replication to copy the data from eu-west-1 to the other 3 regions. This path seemed far easier than working out a replication strategy for our RDS database, especially given RDS only offers read replication between the master and a single remote replica.

Initially we triggered a write from the database into S3 every 5 minutes, regardless of changes. While doing this we set up dashboards to monitor the replication status of the remote buckets. What we found were frequent periods where buckets would remain in a pending state for anything between 30 minutes and several hours. Clearly this lag would be unacceptable in the production system.

We then changed our strategy to only write a new file to S3 when there were actual changes to the database. Since making this change we no longer observe any long replication lags between the local and remote buckets.

Our takeaway – if you make frequent updates to your data and can live with a long tail on “eventual” consistency then the built-in S3 replication is probably OK. If on the other hand you make frequent updates and also require more immediate consistency, it might not be the best option.

Moving to only write new data to S3 when there were real changes worked well, allowing us to continue to get the benefit of built in, hassle free, cross region replication.
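
The write-on-change check is simple to sketch – hash the serialised data and skip the upload when nothing changed. A toy illustration, not our production code:

```python
import hashlib
import json

def upload_if_changed(data, last_hash, upload):
    """Only write a new object to S3 when the data actually changed,
    so replication isn't constantly re-copying identical files.
    Returns the (possibly unchanged) hash and whether a write happened."""
    body = json.dumps(data, sort_keys=True).encode()
    digest = hashlib.sha256(body).hexdigest()
    if digest == last_hash:
        return last_hash, False   # no change - skip the S3 write
    upload(body)                  # e.g. s3.put_object(Bucket=..., Key=..., Body=body)
    return digest, True
```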

Final thoughts

Finally, I’d just like to call out the awesome efforts by other teams in Skyscanner’s Developer Enablement Tribe that made it as easy as possible to get from data centres to AWS. Building on top of their work made our lives so much easier.

Being able to cookie cut a new service, easily deploy it through a build pipeline to a shared cluster, put it behind an API gateway in a few lines of config and to then immediately see metrics in a dashboard is amazing when you step back and think about it.

The development environment at Skyscanner has come a long way from when Dr Jekyll was first deployed into our data centres using Team City and Ansible scripts.

Long may that progress continue!

Software Architecture Conference – Best Of — November 25, 2016

Software Architecture Conference – Best Of

Recently I was fortunate enough (thanks Skyscanner!) to attend O’Reilly Software Architecture Conference in San Francisco. Given our ongoing move to a cloud based, micro-service oriented architecture at Skyscanner I thought it would be interesting to go and see what other internet scale companies were doing in this area.

In this blog post I want to recap and share some of the interesting things I learnt.


My pick of the conference

I’ll start with my highlight of the conference, a talk by Susan Fowler (@susanthesquark) a “Software Reliability Engineer” at Uber.

Susan’s talk resonated with me because it presented a solution to a problem that I don’t think we’ve solved at Skyscanner and one that we’re likely to run into more frequently the further we travel down the microservice road.

The talk was titled “Microservice Standardisation” and was about creating a “framework” which allows an organisation to build trust between teams, build truly production ready microservices and create an overarching product which is reliable, scalable and performant.


In the talk, we were told that Uber has more than 1300 microservices powering their product. As you can imagine, in an architecture like this each microservice must rely on several others to do its job, each of which in turn probably rely on further services.

Working in this environment Uber found that it was difficult for development teams to know which other services (and teams) they could trust. If a team is building a component which is critical and needs to be “always” available – how do they know which other services they can use, without compromising their availability goal? This became even more complicated for them as they started to heavily utilise the independent scaling of microservices. How do teams know that their dependencies will scale appropriately with them, as a feature is rolled out and becomes popular?


Uber’s answer to this was to create what they call “production-readiness standards” which all their services must adhere to before they can be trusted with production traffic.

These standards should be:

  • Global (local standards don’t build organisational trust)
  • General enough to be applicable to all microservices
  • Specific enough to be quantifiable and able to produce measurable results (otherwise they aren’t useful)
  • Inherently give guidance on how to architect, build and/or run a service (otherwise they aren’t helpful to developers)

The last point I found quite interesting; the example given was “Availability”. This is a good way to quantify an important metric for microservices and to build trust through SLAs. However, “availability” is a goal, not a standard. Telling a team to “make your service more available” isn’t terribly helpful. The key is to think about what brings a service closer to the goal, in this case to be more available. The answer to this question can then become a standard.

The standards that Uber use are:

  • Stability
  • Reliability
  • Scalability
  • Performance
  • Fault-tolerance
  • Catastrophe-preparedness
  • Monitoring
  • Documentation

Being quite mature with this approach Uber have now automated much of the application of these standards in their infrastructure, processes and pipelines.

One of the interesting closing remarks which Susan made was that they now have production readiness leaderboards for all the services on screens throughout their offices. At first it might seem like this would be kind of cruel, however she said that it has fostered some awesome conversations between their engineers. The boards have become a jumping-off point for “water cooler” conversations about best practices and little tips and tricks which all feed back into the system to build more trust and better services.

I’d love to see us have something like this at Skyscanner so that we can all hold our services to high measurable standards.

Other highlights

An Agile Architects Framework for Navigating Complexity

Rewinding to the half day tutorials on first day of the conference I really enjoyed “An Agile Architects Framework for Navigating Complexity” in the afternoon session.

This was a pretty hands on session with lots of group activities and discussion so I was glad I had fuelled up on the afternoon snacks – only in America – Starbucks Coffee and assorted candy.


The session started with us all taking an experience from our own day to day, writing a small headline about it and sketching the architecture. This was basically to get us thinking about a specific piece of work which would form the basis of the rest of the workshop.

We then used coloured and numbered dots to answer questions about the experience on large pieces of paper on the walls, in much the same way as you’d use sticky notes in a retrospective.

The purpose of doing this with a team is that you can then use the data to formulate hypotheses about what is working well, what isn’t and how difficult or important it will be to tackle any issues.

For example, if someone had marked their experience as “having happened frequently” on one of the boards and selected that it “involved customers” on another, you’d probably think that was quite a high priority issue to look at. You’d then be able to look at some of the other boards and find factors that could be addressed going forward. This also works for positive examples that you might want to make happen more frequently or enable the team to achieve more easily.

The second half of the workshop was to take the experiences that you’d identified using the techniques above and to apply the Cynefin framework to further categorise them into achievable actions.


For example, if one of the experiences falls into the obvious portion of the Cynefin diagram you might choose not to take any invasive action other than to keep an eye on it.  However, if an experience fell into the complex or complicated domains you’d be more likely to spend some effort in addressing it and producing a plan of action to tackle the issue in future sprints.


Two Netflix employees, Dianne Marsh (Director of Engineering) and Scott Mansfield (Senior Software Engineer), gave interesting presentations.


Both presentations were quite heavily based on “here’s what we do at Netflix, take from that what you will” rather than giving any specific guidance, but were both interesting.

Dianne’s presentation covered their version of “build it, run it” and being a good Netflix citizen, rather than having a command and control type structure. She also talked a bit about the problems Netflix had had in the past with AWS regions going offline and how that had affected the business. This led on to Chaos Monkey and them running monthly traffic drains from one region to another, to make sure they aren’t building in any region-specific dependencies.


Scott’s talk was even more specific, covering EVCache and how Netflix use this technology they’ve developed to achieve low latency and high throughput. The talk was incredibly detailed and technical so I won’t go into detail here, but check out his slides to get an idea of the stuff he was talking about.

An off-the-cuff comment he made during the presentation, which I found quite interesting, was that despite Chaos Monkey running practically all the time, AWS still kills far more of their instances than Chaos Monkey does. If ever there was an illustration of the need to develop for the wild west that is the cloud, that is it!


Playing Tetris live on stage while demoing Kubernetes – enough said.


Containerisation at Pinterest

This was another interesting one. There wasn’t anything terribly surprising in the talk, but it was more interesting from the point of view of seeing how well we do at Skyscanner compared to other large and well known tech companies.


Pinterest seemed to have taken a rocky road to their current architecture. They first had a big problem with random deployment failures as their infrastructure and machines required a lot of manual intervention, creating inconsistencies across their fleet. They tried to solve this with various technologies including “Teletraan” their open sourced deployment system. Unfortunately, this didn’t meet their needs fully so they’re currently moving more toward a mix of Teletraan and Docker.

Microservices – The Supporting Cast

The last session I want to mention specifically is “Microservices – The Supporting Cast”. This talk was by Randy Layman from Pindrop, a company who provide fraud detection software to financial institutions.


The talk basically covered three supporting services which Pindrop have alongside their business logic services.

First was their API Gateway, a concept with which most of you will likely be familiar. Something potentially unique that Pindrop build into their API gateways, however, is the ability to protect themselves from bad requests.

The story behind this was a client installing a new VoIP phone system which didn’t quite speak the language of their previous installation. The change triggered a series of events which caused a DDOS-like effect downstream.

With their new API Gateway, however, they are now able to detect bad traffic like this, drop it if appropriate, or reshape it into an understandable structure before forwarding it downstream.

The next service was what they call a “cleaner”. This was quite specific to their business, but I could see how a similar process could be applied to more general scenarios. The cleaner is responsible for identifying sensitive or personally identifiable information and removing it before transmission to downstream services which don’t require it. Depending on whether and how the data is needed, it can be removed in various ways:

  • If the data isn’t required, it can simply be removed
  • If the data is needed but can be anonymised, it’s replaced by a generated token
  • Finally, if the data is required, they still do the token step. However, in this version the original data is stored in a datastore, from which it can be retrieved later. This means only one data store holds personally identifiable or sensitive information, while other services can still access the data, making their lives much easier in terms of compliance audits etc.
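To make the tokenisation step concrete, here is a minimal Python sketch of the idea (my own illustration, not Pindrop’s implementation – the field names and the in-memory `token_store` standing in for their single sensitive datastore are invented):

```python
import uuid

# Hypothetical in-memory stand-in for the single sensitive datastore.
token_store = {}

def clean(record, sensitive_fields, required_fields):
    """Strip or tokenise sensitive fields before a record goes downstream."""
    cleaned = dict(record)
    for field in sensitive_fields:
        if field not in cleaned:
            continue
        if field in required_fields:
            # Replace with a token; keep the original so it can be
            # retrieved later from the one sensitive datastore.
            token = uuid.uuid4().hex
            token_store[token] = cleaned[field]
            cleaned[field] = token
        else:
            # Not needed downstream - simply remove it.
            del cleaned[field]
    return cleaned

record = {'caller_id': '+441311234567', 'notes': 'hello', 'dob': '1980-01-01'}
cleaned = clean(record, sensitive_fields={'caller_id', 'dob'},
                required_fields={'caller_id'})
# 'dob' is dropped entirely; 'caller_id' is now a token that
# resolves back to the original value via token_store.
```

Downstream services only ever see the token, so a compliance audit has a single datastore to worry about.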

Finally, he talked about their auth proxy which uses Kong and JWT (JSON Web Tokens). There wasn’t anything terribly groundbreaking in this section, but it was interesting to hear how they use JWT, which sounded like a really nice tool that I hadn’t heard of before.

The other takeaway from this section was that they deliberately separate their auth proxy from any specific services. The reason is that they can then ensure requests can’t reach services without going through the proxy first. Also, because they use JWT to encode requests via the proxy, even if a request could be made around the proxy, the downstream service wouldn’t be able to understand it.
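If JWT is new to you too, the core of it is small enough to sketch in stdlib Python. This is a toy HS256 sign/verify round trip (illustrative only – in practice you’d use a proper JWT library, and Pindrop’s setup runs through Kong):

```python
import base64, hashlib, hmac, json

def b64url(data: bytes) -> str:
    # JWT uses URL-safe base64 with the padding stripped.
    return base64.urlsafe_b64encode(data).rstrip(b'=').decode()

def sign(payload: dict, secret: bytes) -> str:
    header = b64url(json.dumps({'alg': 'HS256', 'typ': 'JWT'}).encode())
    body = b64url(json.dumps(payload).encode())
    signing_input = f'{header}.{body}'.encode()
    sig = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f'{header}.{body}.{sig}'

def verify(token: str, secret: bytes) -> dict:
    header, body, sig = token.split('.')
    signing_input = f'{header}.{body}'.encode()
    expected = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError('bad signature')
    # Re-pad before decoding, since the padding was stripped on encode.
    return json.loads(base64.urlsafe_b64decode(body + '=' * (-len(body) % 4)))

token = sign({'sub': 'raymond'}, b'proxy-secret')
claims = verify(token, b'proxy-secret')
```

A service that doesn’t hold the secret can neither mint a valid token nor trust one, which is exactly why routing everything through the proxy works as a choke point.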


Overall I really enjoyed the conference, the quality of the talks was generally great and it was brilliant to go and listen to how others are tackling their software architecture, especially around microservices and the cloud. The conference is coming to New York and London next year and I’d recommend attending if you’re interested in this sort of stuff.

Thanks again to Skyscanner for sending me along.



Breaking The Monolith — July 18, 2016

Breaking The Monolith

A few months ago I spoke at my first conference, ddd.scot in Edinburgh, in front of a couple of hundred developers from all over Scotland.


After attending Velocity conference last year I was really inspired and excited by the talks I’d seen. The people I saw speak were great and had a lot of interesting information and experience to share. Once I returned home I started thinking that it would be awesome to talk at a few conferences myself, and to share the cool things I’ve been working on.

I first submitted my talk idea two months before the conference date, at which point the submitted talks were voted on by people planning to attend. Not really expecting to be selected I waited to hear back from the organisers once the votes were tallied.

To my surprise however I was selected to speak in the largest room – better get writing that talk!


The weeks passed as I prepared my talk and practiced it in front of my Skyscanner colleagues, girlfriend and cat. Before long almost all of the 400 tickets had been sold.


On the morning of the conference I was up bright and early to drive out to the venue near Edinburgh Airport and to get ready for my talk.


I was starting to get nervous, especially as the audience rolled in. However, as I kicked off and moved past the first few slides I started to find my flow. That was the hard bit over – all downhill from here, I was thinking to myself. Early support was also rolling in on Twitter, not that I could see it at the time.


45 minutes plus some questions later and I was all done and dusted, phew!

Despite my initial nerves I really enjoyed speaking, and sort of wished I could have got back up and done it all over again – it was a bit like playing in my band, actually.

It was lunch time at this point though and a sandwich was calling me – time to take to Twitter to see how I’d been received.


I also got a great review over at techneuk.com who were in attendance.

Overall speaking at ddd.scot was really enjoyable and something I highly recommend for anyone else who is considering doing something similar. No matter how nervous you might be it really is worth doing.

Thanks to everyone who came to see me talk, hopefully I’ll see you at a conference again as I’d love to keep doing talks like this in the future.


If you are interested you can find my slides here and view a recording of my talk (audio isn’t great) here.

Configuration as a Service — January 8, 2016
Easy “Read the Docs” Workflow — October 7, 2015

Easy “Read the Docs” Workflow

If you’re a developer in an organisation with many engineers and/or teams or are involved in open source projects I’m sure you’ll be aware of the need for good, up to date documentation and quick start guides.

Having this documentation readily available can make the difference between your project being a success or a failure as people’s first experiences can often shape their opinion of a piece of software. Imagine the difference between someone who gets up and running quickly after following your documentation compared with someone who couldn’t find your documentation and is left frustrated and stuck.

If you’ve followed documentation for some popular open source projects online you’ll have likely used “Read the Docs”, perhaps without knowing it.

It’s a really good, easy to browse and update documentation system which automatically builds search and indexing functionality from your docs.


At Skyscanner we’ve recently started using a private Read the Docs instance for our internal documentation.  It’s a great tool to help cross team collaboration.

The documentation on Read the Docs is generated from restructured text files (.rst) which you can store in your source control system along with your code and have automatically pushed to a Read the Docs site when you check in.

Documentation that lives with the code is really convenient and having it automatically update when you check in takes a lot of the pain out of maintaining a good set of docs for your project – all good so far.

However, one snag you might hit if, like me, you are new to the .rst format is that it isn’t always obvious how your document will look until you see it built into the template you are using.

Raw restructured text is not terribly intuitive, I must say.
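For illustration, here’s a small made-up snippet of raw reStructuredText – section underlines, directives and all:

```rst
My Project
==========

A short *quick start* guide.

.. code-block:: python

   print("hello docs")

Installation
------------

#. Install the package
#. Run ``my-project init``
```

It’s hard to picture how the underlines and directives will render until you’ve seen them built.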


Worry not!  Grunt can come to the rescue and give you a way to live preview your documentation changes before you push them to a Read the Docs instance.

Firstly you need to install Sphinx, the generator which creates the readable docs from the restructured text files.

pip install sphinx will do the trick for that.

You’ll now be able to build documentation manually using sphinx-build or by creating a makefile and make.bat (depending on your OS preference) using the sphinx-quickstart command.  If you are starting a new documentation project the quick start command is a great way to get up and running as it generates all the structure and files you need to start.

That’s all well and good, but you don’t want to have to run these commands every time you make a change to your existing documentation project – enter our friend grunt.

You’ll need to do an npm install in your project to install grunt and a few other dependencies.  You can get the full list by checking out my package.json in the code block below

{
    "private" : true,
    "name" : "documentation-project",
    "author" : "Raymond Davies <me@me.net>",
    "version" : "0.0.1",
    "dependencies" : {},
    "devDependencies" : {
        "grunt" : "^0.4.5",
        "grunt-cli" : "^0.1.13",
        "grunt-shell" : "1.1.2",
        "coffee-script": "^1.9.2",
        "load-grunt-tasks": "~0.3.0",
        "grunt-contrib-watch": "^0.6.1"
    }
}
You can then add a gruntfile like ours (in the code block below) which uses grunt-watch to keep an eye on your changes and grunt-shell to run either the makefile or make.bat, live updating your local preview while you work – awesome.

module.exports = (grunt) ->
    # Load grunt tasks automatically
    require('load-grunt-tasks')(grunt)

    grunt.initConfig
        shell:
            buildHtml:
                command: 'make.bat html'
                options:
                    execOptions:
                        cwd: __dirname + '/docs'
        watch:
            options:
                livereload: true
            files: ['docs/**/*.rst']
            tasks: ['shell:buildHtml']

    grunt.registerTask('default', ['shell:buildHtml', 'watch'])

If you are using PyCharm (which has good .rst syntax highlighting) it is really easy to do the above using the Grunt panel at the bottom of the screen.


Hopefully this example will help make your documentation workflow even easier and there will be more great projects with matching documentation out there for us all to enjoy!

Stimulating Simulations — September 23, 2015

Stimulating Simulations

For some time I’ve been really interested in simulation games, from SimCity to Creatures; more specifically, I find zero-player games really fascinating. I like the idea that the interactions of many small and often simple rules build together to create complex systems and, hopefully, a fun game or interesting simulation. The ultimate example of both a zero-player game and a situation in which “simple rules give rise to complex systems” is, in my opinion, Conway’s Game of Life (also a fun programming task when learning a new language).
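As a taste of just how little code the Game of Life needs, here’s a minimal Python sketch – it tracks a set of live coordinates, which is equivalent to flipping bits in a 2D array:

```python
from collections import Counter

def step(live):
    """One Game of Life generation: live is a set of (x, y) cells."""
    # Count how many live neighbours each cell (live or dead) has.
    counts = Counter((x + dx, y + dy)
                     for x, y in live
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # A cell survives with 2-3 live neighbours; a dead cell with
    # exactly 3 is born.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

# A "blinker" oscillates between a horizontal and a vertical bar.
blinker = {(0, 1), (1, 1), (2, 1)}
vertical = step(blinker)  # {(1, 0), (1, 1), (1, 2)}
```

Two rules, one function – and yet gliders, guns and all the rest emerge from it.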


One of the things I find compelling about these types of simulation is how simply they can be represented – the game of life is nothing more than a 2D array where each position is either on or off.

Having never been one for fancy graphics, I think embracing your constraints in true 37signals style and focusing on building the rules underneath the simplest interface possible is a lot of fun. I also find building up the rules over time and seeing how each new rule affects the simulation really interesting.

My first attempt at building something like this was a predator prey simulation which started with the most basic representations you could possibly imagine in the command prompt (and eventually became my honours project).


Eventually once I was happy with my rules I polished up the interface to support some of the more advanced features of the simulation and to make it more useable and interesting. However at the end of the day it is really only a representation of rules being applied across a few 2D arrays.


Recently, as part of my squad learning day at work, I decided to take this type of idea and apply it to the web as a simulation / game where multiple players would be able to interact with a system in order to affect the outcome in one way or another – eventually in a competitive game style.

To get started I used OAuth and Facebook login to create unique users, allowing them to join the simulation. I wanted to be able to track unique users and use profile images etc, but didn’t want to implement all of this behind the scenes, so Facebook login seemed like a good starting point.


I found the .NET OAuth libraries to be a little fragmented, especially when (depending on project settings) they sometimes come down with the Membership Provider, which I found to be even more fragmented across versions of MVC. Thankfully, once I got OAuth working I was able to bin the Membership Provider and run a fairly simple, stripped down authentication model.

After someone logs in they are moved along to the main page of the application and in the process the browser tells the server side app the size of the view port. This is important as it is used by the program to work out how big a map it can produce.

As with my previous examples I went for a simple representation of the world, this time using HTML5 canvas to draw the map based on 32x32px tiles.

The canvas API was really easy to use and well documented.  The only slight issue I ran into was a condition on first load where it would try to draw the map before the tiles had been downloaded, resulting in black empty squares. The simple solution was to add onload functions for the Image objects and to only draw the map once they had all loaded.


The server then creates a “procedurally” generated game map, based on the view port / tile size, which is different each time you start.  To help with generating the map I found an open source library called LibNoise which generates noise maps.  I use this noise map to generate the terrain – the lowest points become water, middle points are grass and the peaks become forest areas.
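The terrain step is essentially a pair of thresholds over the noise values. A sketch in Python (my actual project was .NET, and these cut-off values are invented for illustration):

```python
def terrain_for(noise_value):
    """Map a noise sample in [0, 1] to a tile type."""
    if noise_value < 0.35:   # the lowest points become water
        return 'water'
    if noise_value < 0.75:   # middle points are grass
        return 'grass'
    return 'forest'          # the peaks become forest

row = [terrain_for(v) for v in (0.1, 0.5, 0.9)]
# -> ['water', 'grass', 'forest']
```

Tweaking the two thresholds changes the water-to-land ratio of every map the generator produces.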


Once the terrain is generated the largest unbroken piece of land is found and a “town” is generated, the size of which is based on the size of the full map.  I then use my best attempt at implementing A* pathfinding to draw a road from the town to one of the edges of the map (prioritising edges that are far away to make things more interesting).

The path finding algorithm is probably one of the most interesting (and weirdest) things I’ve implemented in a while, I guess because it is quite removed from the web development I normally do. To build it I followed the steps in this tutorial, writing the code as I went, and to my surprise (barring a few edge cases I needed to fix) it worked once I was done, and I had a nice road from the town to the edge of the map.
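For anyone curious, the core of grid-based A* fits in surprisingly few lines. This is a minimal Python sketch with a Manhattan-distance heuristic, not the exact code from my project:

```python
import heapq

def a_star(start, goal, walkable):
    """Find a shortest path on a grid; walkable(cell) gates entry."""
    def h(c):  # Manhattan distance heuristic (admissible on a 4-grid)
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])

    frontier = [(h(start), start)]   # priority queue of (f-score, cell)
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, current = heapq.heappop(frontier)
        if current == goal:
            path = []                # walk the parents back to the start
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        x, y = current
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            new_cost = cost[current] + 1
            if walkable(nxt) and (nxt not in cost or new_cost < cost[nxt]):
                cost[nxt] = new_cost
                came_from[nxt] = current
                heapq.heappush(frontier, (new_cost + h(nxt), nxt))
    return None  # no route, e.g. the town is walled in

# Route across a 5x5 map, detouring around a wall at x == 2, y < 4.
blocked = {(2, y) for y in range(4)}
ok = lambda c: 0 <= c[0] < 5 and 0 <= c[1] < 5 and c not in blocked
path = a_star((0, 0), (4, 0), ok)
```

Swap the `walkable` check for terrain rules (roads avoiding water, say) and the same function draws the road from the town to the map edge.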

I then add a certain number of townspeople to the map; the number of people is based on the map size. Most of them are placed in the town but a few are placed randomly throughout the map.

In what time I had left I implemented a very simple gathering rule for the townspeople whereby if they have nothing in their inventory they will move toward the nearest group of trees to collect “resources”.  Once they have done some collecting they will return to the town to drop off what they have, again using my probably less than perfect implementation of A*.


At the last minute I added spawning of enemies, in the form of skeleton zombies, with the intention of having them chase the townspeople.  I wanted to use this to add another “mode” to the townspeople where they would stop the normal gathering behaviour and run away from danger if a zombie came within a certain radius.


Unfortunately though I ran out of time so the zombies can only watch on from the side lines, envious of the townspeople and their fancy movement skills….

Overall this was a fun learning day, building on previous projects I’ve worked on in my spare time. Moving over to the web using Facebook login, OAuth and the canvas was also interesting, giving me a bit of an insight into the Facebook API and OAuth (something I’d use in future projects rather than writing a whole logon system myself). The most interesting part, though, was the pathfinding as it is so different from the stuff I work on day to day.

My First Memcache – Working with Python session state from a .NET background — May 29, 2015

My First Memcache – Working with Python session state from a .NET background

As a web developer it’s important to know about the different methods for preserving state available to you and the advantages and disadvantages of each. Speed of access, security, transmission costs, data size, volatility and shared access across web front ends are the types of things that need to be considered when deciding where parts of your session and state will be stored.

If you are coming from a .NET background, like me, you’ll be aware of how to set up an application’s session state, in memory, in a SQL database or on a separate state server, and how to access this from C# code.  You’ll also be aware of the different .NET caching options and how to cache both MVC action results and arbitrary objects.  Finally you’ll have seen the Request/Response.Cookies object allowing you to manage cookie storage.

Setting up a new .NET application with the concept of session and some form of state preservation across multiple sessions would be simple.  Good times.

However what about a Python application using Flask?

Recently someone in web apps *cough Dave cough* found this browser based version of battleships that we absolutely didn’t play at work.

It got me thinking that it would be a neat challenge to try and replicate it in Python as a project to learn more about Flask applications. So I fired up PyCharm and got my Flask skeleton working; however, it wasn’t long before I realised I didn’t know how to access the Flask equivalents of the .NET objects I described earlier, and that they’d be crucial to an application like this.

Step up the Flask session object and Flask Cache backed by a memcached server on EC2.


<disclaimer> Before I begin I’m clearly not a Flask or Python expert and this is just what I found from my initial investigation coming from .NET land. </disclaimer>


Firstly I wanted the idea of a “player” in my game which required me to have a session and some form of unique id.  I was surprised that, unlike in .NET, Flask doesn’t generate its own sessions when users arrive.  This also meant that there wasn’t a session id for each user like there is in .NET sessions.  I did like how clean this meant my sessions / cookies were though, no asp.net garbage taking up space.

The session object in Flask is basically just a wrapper around cookies, however for “session” type things where you might have account data or things you don’t want the user to be able to modify it is advantageous to use session over cookies directly.  The reason for this is that going through the session object creates secure cookies that are encrypted and decrypted for you in the background.

All you need to do to access this functionality is to supply a “secret key” to your application, for example app.secret_key = "mysecret". The documentation recommends using os.urandom(24) to generate a key to use.

After this the session object behaves like a dictionary, for example:

session['sessionid'] = sessionid

if 'sessionid' in session:
    sessionid = session['sessionid']

I decided to base my session IDs on uuid.uuid4() which supplied substantial GUIDs.

Caching / Application State

Once I had my session and “players” I wanted a way for the application to connect two sessions for a game and to remember the state of a game at the application, rather than session, level. For this I used Flask Cache, first in “simple” development mode and later backed by a memcached server hosted in our AWS sandbox.

Getting memcached running on an EC2 instance was a piece of cake – yum install and then run the memcached command (with -vv if you want to see the connections being made from your application etc). The only thing you really need to do in the way of configuration is to open access to the server on the default port memcached uses to communicate, 11211 (UDP and TCP).

One important note is that you’d generally not open up access to your memcached server to the internet at large, and would instead restrict access to known IPs, otherwise anyone can add and read data from the server.  Memcached does have an authentication method, however I didn’t enable this for my test.

Getting the libraries necessary to connect Flask Cache to memcache on my Windows machine was a bit of a pain in the ass. I eventually managed after going from Python3 back to 2.7 and doing some fiddling around with different libraries.  I get the impression from my reading that this would have been much easier on Linux.

Once I got going the cache object was very easy to use to store arbitrary objects, and I quickly saw my access to the memcached server in the logs:

app.cache.set('users', users, timeout=500)


You can also use the cache to store the results of functions and templates too, similar to how you might use .NET output caching:



@cache.cached(timeout=50)
def index():
    return render_template('index.html')

@cache.cached(timeout=50, key_prefix='all_comments')
def get_all_comments():
    comments = do_serious_dbio()
    return [x.author for x in comments]

cached_comments = get_all_comments()

All in all a very enjoyable dive into creating a stateful application using Flask, and despite the slight difficulty in getting the correct libraries on Windows, everything worked as expected and was easy to plug into my app.

If you know of any good alternatives to the stuff I’ve mentioned or other things you think I should look at please let me know in the comments.

Consul for Service Discovery and Registration — May 6, 2015

Consul for Service Discovery and Registration

You might remember that in my last blog I talked about using AWS Lambda. In that blog one of the possible use cases for Lambda I mentioned was service discovery and registration.

Service discovery is an important area of interest and development at my work going forward. As we move further into the microservice squad model and push our stack into the cloud, knowing which services are available, the health of those services, where they are and their configuration becomes more and more important.

In the more “uncertain” or “unpredictable” world of AWS, running our services in a multi-tenant environment, we are aiming to create an architecture in which failure is expected and gracefully tolerated. We are also aiming to create services where changes in demand are automatically detected and responded to, scaling accordingly – adding and removing resources in real time.

Building in this way will allow us to fully utilise the fantastic opportunities working in the cloud brings and will also ensure we have a first class resilient and scalable web application. However to achieve failure resilience and on demand scaling we need to be able to answer service discovery type questions.

Enter – Consul

I had a look at Zookeeper, etcd and Consul which all aim to address the service discovery question.

There were a couple of little things that put me off Zookeeper and etcd. However I liked the look of how Consul worked and the website was also a lot nicer to browse for “getting started” type information which was in its favour. I also noticed that Consul is built by the same people that make Vagrant, which pushed me over the edge, so I chose to work with Consul.


My Test Environment

My test environment was 3 Amazon Linux AMI based machines running in our sandbox account.

The first box was my Consul “server” – you can think of this as the controller of the Consul cluster. In a production environment you’d have more than one of these to be fault tolerant. The other two boxes were Consul “agents”, which in the real world would be nodes associated with our services. I also stuck Nginx on the agents to make them slightly more like real boxes.


Besides doing a yum install, this was perhaps the easiest installation I’ve seen. You download a zip file, unzip it, copy the binary it contains into somewhere like /usr/local/bin and you’re done!

Typing “consul” verifies everything is ready to go:

[screenshot: output of the consul command]

This step is the same regardless of whether the machine is going to be a server or an agent.

Starting a Consul Cluster

I then started up the first two machines, the first as the server and the other as an agent. Doing so is as simple as running the “consul agent” command and either passing (or not passing) it the -server flag. Normally you’d run this in the background, possibly using something like supervisor, however for simplicity I was doing this by hand.

At this point the consul server and agent came to life, the server even elected itself king of its little one node world, however at this point they don’t know about each other.


Connecting the dots

To connect the agent to the server I needed to issue the “consul join” command with the IP of the server. This is what a node would do if it was added to a cluster in order to announce itself to the world.


One of the interesting claims of Consul is that if a new node contacts one of the existing nodes in a cluster to join, that existing node will “gossip” to all the others, letting them know about the new one.

To test this I got my third machine online and asked it to join the cluster by contacting the existing agent, not mentioning the server at all. It told me it was accepted into the cluster, so I went over to the server and ran the “consul members” command to list all the nodes it knew about. As if by magic – all three machines were listed.


That’s great but where is the useful information?

HTTP “Service” API

Now I had a little cluster running I wanted to know what it could tell me.

Using a very simple JSON config file I told one of the agents that it was a “web” service running Nginx on port 80.
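The config file was along these lines (Consul’s service definition format; the tag here is my own):

```json
{
  "service": {
    "name": "web",
    "tags": ["nginx"],
    "port": 80
  }
}
```

Dropping a file like this into the agent’s config directory is all it takes to register the node as a “web” service.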

I was then able to use the Consul HTTP API to query the cluster for what was running as a “web” service. Doing so from one of the other nodes returned some JSON listing the service, what it was running, its IP address and node name – all useful information if I were another box looking to send requests its way.


Depending on the configuration file on the boxes and the query you issue the HTTP API can be quite powerful in finding out what is running in a cluster and how to speak to it. There is also a DNS API which supplies the same information in a different format.

Key Value Store

Another way to pass information around the cluster is to use Consul’s key value store. This was probably the most interesting part for me as it would enable us to replace the costly “requirements end point” calls which the website scaffolding currently has to make each time it builds a page for an ESI component.

To put something into the key value store an agent in a cluster can call its own HTTP API and any changes will be distributed to the rest of the cluster.

To mimic how requirements might work in this system I added keys using the path


For example:

curl -X PUT -d '' http://localhost:8500/v1/kv/homepage/node1/endpointconfiguration

curl -X PUT -d '' http://localhost:8500/v1/kv/homepage/node1/nodeipaddress

curl -X PUT -d '6400' http://localhost:8500/v1/kv/homepage/node1/version

After issuing these requests I was able to read them back again from the other nodes, resulting in the following (the values are base64 encoded):

[screenshot: consul key value query results]

I was pleasantly surprised to see both my simple data and my JSON object successfully stored, proving that this could be a valid alternative to requirements end point calls.
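As an aside, decoding those base64 values is a one-liner in Python. Here’s a small stdlib helper – the sample response is hand-made to match the shape of Consul’s KV API, not captured from my cluster:

```python
import base64, json

def decode_kv(response_body):
    """Turn a Consul /v1/kv JSON response into {key: decoded value}."""
    return {entry['Key']: base64.b64decode(entry['Value']).decode()
            for entry in json.loads(response_body)}

# Shape matches Consul's KV API: a list of entries with base64 "Value"s.
sample = json.dumps([
    {'Key': 'homepage/node1/version',
     'Value': base64.b64encode(b'6400').decode()},
])
values = decode_kv(sample)  # {'homepage/node1/version': '6400'}
```

An application layer consuming the KV store would run something like this over the response before acting on the configuration.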


I couldn’t really fault Consul; it was incredibly easy to set up and play with, and seems to provide everything required to solve the service discovery problem out of the box.

I’d like to have a go at setting up a more automated cluster, using Ansible for deployment and having nodes register and deregister themselves automatically, to get a fuller understanding of the system. I’d also be interested in writing an application layer to consume the Consul information and make decisions based on it, to see how easy this would be to implement. However off the back of what I’ve tried so far I can’t imagine it being terribly difficult.

If you have some spare time I’d also recommend watching this video from dotScale last year where Mitchell Hashimoto (one of the founders of HashiCorp) takes you through how Consul works – https://www.youtube.com/watch?v=tQ99V7QjEHc