At Skyscanner I work on our internal Experimentation and Configuration platform called Dr Jekyll (the UI) and Mr Hyde (the API).
Recently we migrated these systems from our private data centres to AWS. Below is an excerpt from a post I wrote for our internal company blog, detailing some of the interesting lessons our squad learned from undertaking the migration project.
Akamai Routing – You are our saviour!
Services using the Mr Hyde API usually poll it using a background thread to pull in experimentation data. This means that these calls aren’t on the critical path, and within reason, aren’t time critical.
However, native apps call the Mr Hyde API on application start, from anywhere in the world, and need to be able to receive experiment and other data within a one-second window. If any calls can’t be completed in this window they will be aborted and the app will start without them.
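To make that constraint concrete, here’s a rough sketch of the kind of time-boxed call a client might make on startup. It’s Python purely for illustration (the real callers are native apps), and the endpoint name is made up:

```python
import requests

# Hypothetical endpoint, for illustration only.
MR_HYDE_URL = "https://example.skyscanner.net/mr-hyde/v1/experiments"

def fetch_experiments_or_defaults(timeout_seconds=1.0):
    """Fetch experiment data at app start; if the call can't complete
    inside the time budget, abort and start without it."""
    try:
        response = requests.get(MR_HYDE_URL, timeout=timeout_seconds)
        response.raise_for_status()
        return response.json()
    except requests.RequestException:
        return {}  # start the app with no experiment data
```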
In our data centre setup, we came in under this budget globally for ~70% of requests.
For the initial stab at an AWS setup we deployed to all Skyscanner AWS regions and put the service behind the Skyscanner API gateway. With fingers crossed we set up some Monitis checks to ping our service so we could monitor response times.
Sadly, all regions outside of EMEA (in practice, really only central Europe) were now well above the one-second threshold for the vast majority of requests. We also observed frequent latency spikes when the AWS routing tables switched to send requests to a more distant region. For example, a request originating close to the Singapore region would normally be served by ap-southeast-1, but routing frequently switched to ap-northeast-1 in Tokyo, severely increasing response time in the process.
To try and address this problem we put our API behind a geo-aware route in Akamai. I was initially sceptical that adding another link in the request chain would result in a significant improvement in response time, but Akamai’s super-fast network proved me wrong!
Behind Akamai we eliminated the latency spikes from the strange routing behaviour and reduced overall response time to a level where globally we matched or improved on the data centre’s performance.
Recent observations show that the AWS setup now beats our data centre figures, with 1-5% more of all requests making it under the boundary time.
Migrating Data – Lambda can work well
I’m sure these two words, migrating data, strike fear into the hearts of many a developer. Typically it’s a pretty time-consuming and sometimes manual process that comes near the end of a long project. Do we have to bring over all that horrible inconsistent legacy data to ruin our lovely new system? Sigh.
In our case we did have to bring our old data with us, as we needed to move seamlessly from one system to the next. To our advantage, however, was the small overall size of our database, at the time stored within the data centre Couchbase cluster.
We wanted to frequently “sync” our AWS database, now Postgres, with the Couchbase data. Firstly, this would allow us to work on proving the AWS setup using live data. Additionally, it would mean that we could at any time close the old system, wait for the next sync and then turn on the new system with practically no down time.
To do this we took advantage of scheduled Lambdas. Our Lambda would request the most recent data via our production API, which was already publicly accessible due to native apps calling it. The Postgres table would then be truncated and the new data inserted.
Simple and effective, making use of our existing API and also reusing code that we’d already written elsewhere. Overall, Lambda was a really handy platform for getting a cron-style task up and running easily and reliably.
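For illustration, a scheduled Lambda handler along these lines might look something like the sketch below. The endpoint, table name and schema are made up, and psycopg2 would need to be packaged with the function:

```python
import json
import os
import urllib.request

import psycopg2  # packaged with the Lambda deployment or via a layer

# Hypothetical endpoint, for illustration only.
EXPERIMENTS_URL = "https://example.skyscanner.net/mr-hyde/v1/experiments"


def handler(event, context):
    """Scheduled Lambda: pull the latest data from the production API,
    then replace the contents of the Postgres table with it."""
    with urllib.request.urlopen(EXPERIMENTS_URL, timeout=10) as response:
        experiments = json.load(response)

    conn = psycopg2.connect(os.environ["DATABASE_DSN"])
    try:
        # Truncate and re-insert inside a single transaction so readers
        # never observe a half-empty table.
        with conn, conn.cursor() as cur:
            cur.execute("TRUNCATE experiments")
            for experiment in experiments:
                cur.execute(
                    "INSERT INTO experiments (id, payload) VALUES (%s, %s)",
                    (experiment["id"], json.dumps(experiment)),
                )
    finally:
        conn.close()
```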
S3 Replication Lag
Amazon’s S3 service gets used and abused for all sorts of things these days and it’s easy to assume it’ll do everything you throw at it perfectly.
To enable us to run our Mr Hyde API in all Skyscanner regions we decided to write our data to S3 and use the built-in replication to copy the data from eu-west-1 to the other 3 regions. This path seemed far easier than working out a replication strategy for our RDS database, especially given RDS only offers read replication between the master and a single remote replica.
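Setting up the replication itself is only a little configuration per destination bucket. A rough boto3 sketch, shown for a single destination region, looks something like this (the bucket names and role ARN are made up, and versioning must already be enabled on both buckets):

```python
import boto3

# Illustrative bucket names and role ARN; S3 replication also requires
# versioning to be enabled on both the source and destination buckets.
s3 = boto3.client("s3", region_name="eu-west-1")

s3.put_bucket_replication(
    Bucket="jekyll-experiments-eu-west-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-experiment-data",
                "Status": "Enabled",
                "Prefix": "",
                "Destination": {
                    "Bucket": "arn:aws:s3:::jekyll-experiments-ap-southeast-1"
                },
            }
        ],
    },
)
```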
Initially we triggered a write from the database into S3 every 5 minutes, regardless of changes. While doing this we set up dashboards to monitor the replication status of the remote buckets. What we found were frequent periods where buckets would remain in a pending state for anything between 30 minutes and several hours. Clearly this lag would be unacceptable in the production system.
We then changed our strategy to only write a new file to S3 when there were actual changes to the database. Since making this change we no longer observe any long replication lags between the local and remote buckets.
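A minimal sketch of that approach, assuming we keep a hash of the last uploaded snapshot in the object’s metadata (the bucket and key names are made up):

```python
import hashlib
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "jekyll-experiments-eu-west-1"  # illustrative name
KEY = "experiments.json"                 # illustrative key


def upload_if_changed(experiments):
    """Write the snapshot to S3 only when its content has actually changed,
    so replication isn't asked to copy identical data every few minutes."""
    body = json.dumps(experiments, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(body).hexdigest()

    try:
        head = s3.head_object(Bucket=BUCKET, Key=KEY)
        if head["Metadata"].get("content-sha256") == digest:
            return False  # nothing new, nothing to replicate
    except s3.exceptions.ClientError:
        pass  # object doesn't exist yet; fall through and upload

    s3.put_object(
        Bucket=BUCKET,
        Key=KEY,
        Body=body,
        Metadata={"content-sha256": digest},
    )
    return True
```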
Our takeaway: if you make frequent updates to your data and can live with a long tail on “eventual” consistency, then the built-in S3 replication is probably OK. If, on the other hand, you make frequent updates and also require more immediate consistency, it might not be the best option.
Moving to only write new data to S3 when there were real changes worked well, allowing us to continue to get the benefit of built-in, hassle-free, cross-region replication.
Finally, I’d just like to call out the awesome efforts by other teams in Skyscanner’s Developer Enablement Tribe that made it as easy as possible to get from data centres to AWS. Building on top of their work made our lives so much easier.
Being able to cookie cut a new service, easily deploy it through a build pipeline to a shared cluster, put it behind an API gateway in a few lines of config and to then immediately see metrics in a dashboard is amazing when you step back and think about it.
The development environment at Skyscanner has come a long way from when Dr Jekyll was first deployed into our data centres using Team City and Ansible scripts.
Long may that progress continue!