Raymond Davies – Software Developer

Software Developer, homebrewer and general geek

Migrating a running service to AWS — October 2, 2017

Migrating a running service to AWS

At Skyscanner I work on our internal Experimentation and Configuration platform called Dr Jekyll (the UI) and Mr Hyde (the API).

Recently we migrated these systems from our private data centres to AWS. Below is an excerpt from a post I wrote for our internal company blog – detailing some of the interesting lessons our squad learned from undertaking the migration project.

Akamai Routing – You are our saviour!

Services using the Mr Hyde API usually poll it using a background thread to pull in experimentation data. This means that these calls aren’t on the critical path, and within reason, aren’t time critical.
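To make that concrete, here is a minimal sketch of that kind of background poller in Python. The endpoint, interval and data shape are all hypothetical; this illustrates the pattern, not how the real clients are written.

import threading
import time

import requests

EXPERIMENTS_URL = "https://mr-hyde.example.com/experiments"  # hypothetical endpoint
POLL_INTERVAL_SECONDS = 60                                   # hypothetical interval

latest_experiments = {}  # last known-good data, read by the rest of the service

def poll_forever():
    global latest_experiments
    while True:
        try:
            response = requests.get(EXPERIMENTS_URL, timeout=5)
            response.raise_for_status()
            latest_experiments = response.json()
        except requests.RequestException:
            # Off the critical path: keep serving the previous data and retry later.
            pass
        time.sleep(POLL_INTERVAL_SECONDS)

threading.Thread(target=poll_forever, daemon=True).start()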

However, native apps call the Mr Hyde API on application start, from anywhere in the world, and need to be able to receive experiment and other data within a one second window. If any calls can’t be completed in this window they will be aborted and the app will start without them.

In our data centre setup, we came in under this budget globally for ~70% of requests.

For the initial stab at an AWS setup we deployed to all Skyscanner AWS regions and put the service behind the Skyscanner API gateway. With fingers crossed we set up some Monitis checks to ping our service so we could monitor response times.

Non-Akamai Routing

Sadly, all regions outside of EMEA (in reality only really central Europe) were now well above the 1 second threshold for the vast majority of requests. We also observed frequent latency spikes when the AWS routing tables switched to send requests to a more distant region. For example, a request originating close to the Singapore region would normally be served by ap-southeast-1; however, routing frequently switched to ap-northeast-1 in Tokyo – severely increasing response time in the process.

To try to address this problem we put our API behind a geo-aware route in Akamai. I was initially sceptical that adding another link in the request chain would result in a significant improvement in response time – Akamai’s super-fast network proved me wrong!

Akamai Routing

Behind Akamai we eliminated the latency spikes from the strange routing behaviour and reduced overall response time to a level where globally we matched or improved on the data centre’s performance.

Recent observations show that the AWS setup now beats our data centre figures, with 1–5% more of all requests making it under the boundary time.

Migrating Data – Lambda can work well

I’m sure these two words, migrating data, strike fear into the hearts of many a developer. It’s typically a pretty time-consuming and sometimes manual process, coming near the end of a long project. Do we have to bring over all that horrible inconsistent legacy data to ruin our lovely new system? Sigh.

In our case we did have to bring our old data with us as we needed to move seamlessly from one system to the next. To our advantage however was the small overall size of our database – at the time stored within the data centre Couchbase cluster.

We wanted to frequently “sync” our AWS database, now Postgres, with the Couchbase data. Firstly, this would allow us to work on proving the AWS setup using live data. Additionally, it would mean that we could at any time close the old system, wait for the next sync and then turn on the new system with practically no down time.

To do this we took advantage of scheduled Lambdas. Our Lambda would request the most recent data via our production API, which was already publicly accessible due to native apps calling it. The Postgres table would then be truncated and the new data inserted.

Simple and effective, making use of our existing API and reusing code that we’d already written elsewhere. Overall, Lambda was a really handy platform for getting a cron-type task up and running easily and reliably.
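As a rough illustration (not our actual function), a scheduled Lambda doing this kind of sync could look something like the Python sketch below. The API URL, environment variable, table name and column layout are all made up for the example.

import json
import os

import psycopg2
import requests

API_URL = "https://mr-hyde.example.com/experiments"  # hypothetical public endpoint

def handler(event, context):
    """Scheduled Lambda: pull the latest data from the production API and
    replace the contents of the Postgres table with it."""
    rows = requests.get(API_URL, timeout=10).json()

    conn = psycopg2.connect(os.environ["POSTGRES_DSN"])  # hypothetical connection string
    try:
        with conn, conn.cursor() as cur:
            cur.execute("TRUNCATE TABLE experiments")  # hypothetical table name
            for row in rows:
                cur.execute(
                    "INSERT INTO experiments (id, payload) VALUES (%s, %s)",
                    (row["id"], json.dumps(row)),
                )
    finally:
        conn.close()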

S3 Replication Lag

Amazon’s S3 service gets used and abused for all sorts of things these days and it’s easy to assume it’ll do everything you throw at it perfectly.

To enable us to run our Mr Hyde API in all Skyscanner regions we decided to write our data to S3 and use the built-in replication to copy the data from eu-west-1 to the other 3 regions. This path seemed far easier than working out a replication strategy for our RDS database, especially given RDS only offers read replication between the master and a single remote replica.

Initially we triggered a write from the database into S3 every 5 minutes, regardless of changes. While doing this we set up dashboards to monitor the replication status of the remote buckets. What we found were frequent periods where buckets would remain in a pending state for anything between 30 minutes and several hours. Clearly this lag would be unacceptable in the production system.

We then changed our strategy to only write a new file to S3 when there were actual changes to the database. Since making this change we no longer observe any long replication lags between the local and remote buckets.
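A minimal sketch of that “only write when something actually changed” check, assuming the data can be serialised and a content hash kept in the S3 object’s metadata (the bucket and key names are made up):

import hashlib
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "mr-hyde-data-eu-west-1"  # hypothetical source bucket
KEY = "experiments.json"           # hypothetical object key

def upload_if_changed(data):
    """Write the data file to S3 only when its content hash has changed,
    so cross-region replication isn't triggered by no-op writes."""
    body = json.dumps(data, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(body).hexdigest()

    try:
        existing = s3.head_object(Bucket=BUCKET, Key=KEY)
        if existing["Metadata"].get("content-sha256") == digest:
            return False  # nothing changed, skip the write
    except s3.exceptions.ClientError:
        pass  # object doesn't exist yet, fall through to the upload

    s3.put_object(Bucket=BUCKET, Key=KEY, Body=body,
                  Metadata={"content-sha256": digest})
    return True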

Our takeaway – if you make frequent updates to your data and can live with a long tail on “eventual” consistency, then the built-in S3 replication is probably OK. If, on the other hand, you make frequent updates and also require more immediate consistency, it might not be the best option.

Moving to only write new data to S3 when there were real changes worked well, allowing us to continue to get the benefit of built-in, hassle-free, cross-region replication.

Final thoughts

Finally, I’d just like to call out the awesome efforts by other teams in Skyscanner’s Developer Enablement Tribe that made it as easy as possible to get from data centres to AWS. Building on top of their work made our lives so much easier.

Being able to cookie cut a new service, easily deploy it through a build pipeline to a shared cluster, put it behind an API gateway in a few lines of config and to then immediately see metrics in a dashboard is amazing when you step back and think about it.

The development environment at Skyscanner has come a long way from when Dr Jekyll was first deployed into our data centres using TeamCity and Ansible scripts.

Long may that progress continue!

Software Architecture Conference – Best Of — November 25, 2016

Software Architecture Conference – Best Of

Recently I was fortunate enough (thanks Skyscanner!) to attend the O’Reilly Software Architecture Conference in San Francisco. Given our ongoing move to a cloud-based, microservice-oriented architecture at Skyscanner I thought it would be interesting to go and see what other internet-scale companies were doing in this area.

In this blog post I want to recap and share some of the interesting things I learnt.

My pick of the conference

I’ll start with my highlight of the conference, a talk by Susan Fowler (@susanthesquark), a “Software Reliability Engineer” at Uber.

Susan’s talk resonated with me because it presented a solution to a problem that I don’t think we’ve solved at Skyscanner and one that we’re likely to run into more frequently the further we travel down the microservice road.

The talk was titled “Microservice Standardisation” and was about creating a “framework” which allows an organisation to build trust between teams, build truly production ready microservices and create an overarching product which is reliable, scalable and performant.

In the talk, we were told that Uber has more than 1300 microservices powering their product. As you can imagine, in an architecture like this each microservice must rely on several others to do its job, each of which in turn probably rely on further services.

Working in this environment Uber found that it was difficult for development teams to know which other services (and teams) they could trust. If a team is building a component which is critical and needs to be “always” available – how do they know which other services they can use, without compromising their availability goal? This became even more complicated for them as they started to heavily utilise the independent scaling of microservices. How do teams know that their dependencies will scale appropriately with them, as a feature is rolled out and becomes popular?

Uber’s answer to this was to create what they call “production-readiness standards” which all their services must adhere to before they can be trusted with production traffic.

These standards should be:

  • Global (local standards don’t build organisational trust)
  • General enough to be applicable to all microservices
  • Specific enough to be quantifiable and able to produce measurable results (otherwise they aren’t useful)
  • Inherently give guidance on how to architect, build and/or run a service (otherwise they aren’t helpful to developers)

I found the last point quite interesting; the example given was “Availability”. Availability is a good way to quantify an important metric for microservices and to build trust through SLAs. However, “availability” is a goal, not a standard. Telling a team to “make your service more available” isn’t terribly helpful. The key is to think about what brings a service closer to the goal, in this case being more available. The answer to this question can then become a standard.

The standards that Uber use are:

  • Stability
  • Reliability
  • Scalability
  • Performance
  • Fault-tolerance
  • Catastrophe-preparedness
  • Monitoring
  • Documentation

Being quite mature with this approach Uber have now automated much of the application of these standards in their infrastructure, processes and pipelines.

One of the interesting closing remarks which Susan made was that they now have production readiness leaderboards for all the services on screens throughout their offices. At first it might seem like this would be kind of cruel, however she said that it has fostered some awesome conversations between their engineers. The boards have become a jumping off point for “water cooler” conversations about best practices and little tips and tricks which all feed back into the system to build more trust and better services.

I’d love to see us have something like this at Skyscanner so that we can all hold our services to high measurable standards.

Other highlights

An Agile Architect’s Framework for Navigating Complexity

Rewinding to the half day tutorials on the first day of the conference, I really enjoyed “An Agile Architect’s Framework for Navigating Complexity” in the afternoon session.

This was a pretty hands on session with lots of group activities and discussion so I was glad I had fuelled up on the afternoon snacks – only in America – Starbucks Coffee and assorted candy.

The session started with us all taking an experience from our own day to day, writing a small headline about it and sketching the architecture. This was basically to get us thinking about a specific piece of work which would form the basis of the rest of the workshop.

We then used coloured and numbered dots to answer questions about the experience on large pieces of paper on the walls, in much the same way as you’d use sticky notes in a retrospective.

The purpose of doing this with a team is that you can then use the data to formulate hypotheses about what is working well, what isn’t and how difficult or important it will be to tackle any issues.

For example, if someone had marked their experience as “having happened frequently” on one of the boards and selected that it “involved customers” on another, you’d probably think that was quite a high priority issue to look at. You’d then be able to look at some of the other boards and find factors that could be addressed going forward. This also works for positive examples that you might want to make happen more frequently or enable the team to achieve more easily.

The second half of the workshop was to take the experiences that you’d identified using the techniques above and to apply the Cynefin framework to further categorise them into achievable actions.

For example, if one of the experiences falls into the obvious portion of the Cynefin diagram you might choose not to take any invasive action other than to keep an eye on it.  However, if an experience fell into the complex or complicated domains you’d be more likely to spend some effort in addressing it and producing a plan of action to tackle the issue in future sprints.

Netflix

Two Netflix employees, Dianne Marsh (Director of Engineering) and Scott Mansfield (Senior Software Engineer), gave interesting presentations.

Both presentations were quite heavily based on “here’s what we do at Netflix, take from that what you will” rather than giving any specific guidance, but were both interesting.

Dianne’s presentation covered their version of “build it, run it” and being a good Netflix citizen, rather than having a command and control type structure. She also talked a bit about the problems Netflix had had in the past with AWS regions going offline and how that had affected the business. This led on to Chaos Monkey and them running monthly traffic drains from one region to another, to make sure they aren’t building in any region-specific dependencies.

Scott’s talk was even more specific, covering EVCache and how Netflix use this technology they’ve developed to enable them to be low latency and high throughput. The talk was incredibly detailed and technical so I won’t go into detail, but check out his slides to get an idea of the stuff he was talking about.

An off the cuff comment that he made during this presentation which I found quite interesting was that despite Chaos Monkey running practically all the time, AWS still kills far more of their instances than Chaos Monkey. If ever there was an illustration of the need to develop for the wild west which is the cloud, that is it!

Google

 Playing Tetris live on stage while demoing Kubernetes, enough said.

Containerisation at Pinterest

This was another interesting one. There wasn’t anything terribly surprising in the talk, but it was more interesting from the point of view of seeing how well we do at Skyscanner compared to other large and well known tech companies.

Pinterest seemed to have taken a rocky road to their current architecture. They first had a big problem with random deployment failures as their infrastructure and machines required a lot of manual intervention, creating inconsistencies across their fleet. They tried to solve this with various technologies including “Teletraan”, their open-sourced deployment system. Unfortunately, this didn’t meet their needs fully so they’re currently moving more toward a mix of Teletraan and Docker.

Microservices – The Supporting Cast

The last session I want to mention specifically is “Microservices – The Supporting Cast”. This talk was by Randy Layman from Pindrop, a company that provides fraud detection software to financial institutions.

The talk basically covered three supporting services which Pindrop have alongside their business logic services.

First was their API Gateway, a concept with which most of you will likely be familiar. Something potentially unique that Pindrop build into their API gateways, however, is the ability to protect themselves from bad requests.

The story behind this was a client installing a new VoIP phone system which didn’t quite speak the language of their previous installation. The change triggered a series of events which caused a DDoS-like effect downstream.

With their new API Gateway, however, they are now able to detect bad traffic like this, drop it if appropriate, or reshape it into an understandable structure before forwarding it downstream.

The next service was what they call a “cleaner”. This was quite specific to their business but I could see how a similar process could be applied to various more general scenarios. The cleaner is responsible for identifying sensitive or personally identifiable information and removing it before transmission to downstream services which don’t require it. Depending on if and how the data is needed it can be removed in various ways:

  • If the data isn’t required, it can simply be removed
  • If the data is needed but can be anonymised, it’s replaced by a generated token
  • Finally, if the data is required they still do the token step. However, in this version the original data is stored within a datastore, from which it can be retrieved later. This means they only have one data store which holds personally identifiable or sensitive information, while still allowing other services to access the data, making their lives much easier in terms of compliance audits etc. A rough sketch of this tokenisation step follows below.
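Here is a very rough sketch of what that kind of cleaning and tokenisation step could look like. It is purely illustrative: the field names, policy shape and in-memory store are my own inventions, not Pindrop’s implementation.

import secrets

# Stand-in for the single datastore that holds the original sensitive values.
pii_store = {}

def clean(record, policy):
    """Strip or tokenise sensitive fields before forwarding a record downstream.

    policy maps field name -> "remove", "anonymise" or "tokenise".
    """
    cleaned = dict(record)
    for field, action in policy.items():
        if field not in cleaned:
            continue
        if action == "remove":
            del cleaned[field]
        else:
            token = secrets.token_hex(16)
            if action == "tokenise":
                # Keep the original retrievable, but only from this one store.
                pii_store[token] = cleaned[field]
            cleaned[field] = token
    return cleaned

# Example: downstream services see a token instead of the caller's phone number.
safe_record = clean(
    {"call_id": 42, "phone_number": "+44 7700 900123"},
    {"phone_number": "tokenise"},
)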

Finally, he talked about their auth proxy, which uses Kong and JWT (JSON Web Tokens). There wasn’t anything terribly ground-breaking in this section, but it was interesting to hear how they use JWT, which sounded like a really nice tool that I hadn’t heard of before.

The other takeaway from this section was that they deliberately separate their auth proxy from any specific services. The reason for this is that it ensures requests can’t reach services without going through the proxy first. Also, because they use JWT to encode the requests via the proxy, even if a request could be made around the proxy, the downstream service wouldn’t be able to understand it.
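For anyone who, like me, hadn’t met JWTs before, verifying one in a downstream service is only a few lines. The sketch below uses the PyJWT library with a made-up shared secret; Pindrop’s actual setup uses Kong rather than anything hand-rolled.

import jwt  # PyJWT

SHARED_SECRET = "change-me"  # hypothetical signing key shared with the auth proxy

def verify_request_token(token):
    """Reject any request whose JWT wasn't signed by the auth proxy."""
    try:
        claims = jwt.decode(token, SHARED_SECRET, algorithms=["HS256"])
    except jwt.InvalidTokenError:
        return None  # didn't come through the proxy, or was tampered with
    return claims  # e.g. {"sub": "client-123", "scope": "fraud-api"}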

Summary

Overall I really enjoyed the conference; the quality of the talks was generally great and it was brilliant to go and listen to how others are tackling their software architecture, especially around microservices and the cloud. The conference is coming to New York and London next year and I’d recommend attending if you’re interested in this sort of stuff.

Thanks again to Skyscanner for sending me along.

 

 

Breaking The Monolith — July 18, 2016

Breaking The Monolith

A few months ago I spoke at my first conference, ddd.scot in Edinburgh, in front of a couple of hundred developers from all over Scotland.

After attending Velocity conference last year I was really inspired and excited by the talks I’d seen. The people I saw speak were great and had a lot of interesting information and experience to share. Once I returned home I started thinking that it would be awesome to talk at a few conferences myself, and to share the cool things I’ve been working on.

I first submitted my talk idea two months before the conference date, at which point the submitted talks were voted on by people planning to attend. Not really expecting to be selected I waited to hear back from the organisers once the votes were tallied.

To my surprise however I was selected to speak in the largest room – better get writing that talk!

The weeks passed as I prepared my talk and practiced it in front of my Skyscanner colleagues, girlfriend and cat. Before long almost all of the 400 tickets had been sold.

On the morning of the conference I was up bright and early to drive out to the venue near Edinburgh Airport and to get ready for my talk.

I was starting to get nervous, especially as the audience rolled in, however as I kicked off and moved past the first few slides I started to find my flow. That was the hard bit over, all downhill from here, I was thinking to myself. Early support was also rolling in on Twitter, not that I could see it at the time.

45 minutes plus some questions later and I was all done and dusted, phew!

Despite my initial nerves I really enjoyed speaking and sort of wished I could have got back up and done it all over again, it was a bit like playing in my band actually.

It was lunch time at this point though and a sandwich was calling me – time to take to Twitter to see how I’d been received.

I also got a great review over at techneuk.com who were in attendance.

Overall speaking at ddd.scot was really enjoyable and something I highly recommend for anyone else who is considering doing something similar. No matter how nervous you might be it really is worth doing.

Thanks to everyone who came to see me talk, hopefully I’ll see you at a conference again as I’d love to keep doing talks like this in the future.

If you are interested you can find my slides here and view a recording of my talk (audio isn’t great) here.

Configuration as a Service — January 8, 2016
Easy “Read the Docs” Workflow — October 7, 2015

Easy “Read the Docs” Workflow

If you’re a developer in an organisation with many engineers and/or teams or are involved in open source projects I’m sure you’ll be aware of the need for good, up to date documentation and quick start guides.

Having this documentation readily available can make the difference between your project being a success or a failure as people’s first experiences can often shape their opinion of a piece of software. Imagine the difference between someone who gets up and running quickly after following your documentation compared with someone who couldn’t find your documentation and is left frustrated and stuck.

If you’ve followed documentation for some popular open source projects online you’ll have likely used “Read the Docs”, perhaps without knowing it.

It’s a really good, easy to browse and update system for documentation which automatically builds search and indexing functionality from your documentation.

At Skyscanner we’ve recently started using a private Read the Docs instance for our internal documentation.  It’s a great tool to help cross team collaboration.

The documentation on Read the Docs is generated from restructured text files (.rst) which you can store in your source control system along with your code and have automatically pushed to a Read the Docs site when you check in.

Documentation that lives with the code is really convenient and having it automatically update when you check in takes a lot of the pain out of maintaining a good set of docs for your project – all good so far.

However, one snag you might hit if, like me, you are new to the .rst format is that it isn’t always obvious how your document will look until you see it built into the template you are using.

Raw restructured text on its own is not terribly intuitive, I must say.

Worry not!  Grunt can come to the rescue and give you a way to live preview your documentation changes before you push them to a Read the Docs instance.

Firstly you need to install Sphinx, the generator which creates the readable docs from the restructured text files.

pip install sphinx will do the trick for that.

You’ll now be able to build documentation manually using sphinx-build or by creating a makefile and make.bat (depending on your OS preference) using the sphinx-quickstart command.  If you are starting a new documentation project the quick start command is a great way to get up and running as it generates all the structure and files you need to start.

That’s all well and good, but you don’t want to have to run these commands every time you make a change to your existing documentation project – enter our friend grunt.

You’ll need to do an npm install in your project to install grunt and a few other dependencies. You can get the full list by checking out my package.json in the code block below:

 
{
    "private" : true,
    "name" : "documentation-project",
    "author" : "Raymond Davies <me@me.net>",
    "version" : "0.0.1",
    "dependencies" : {},
    "devDependencies" : {
        "grunt" : "^0.4.5",
        "grunt-cli" : "^0.1.13",
        "grunt-shell" : "1.1.2",
        "coffee-script": "^1.9.2",
        "load-grunt-tasks": "~0.3.0",
        "grunt-contrib-watch": "^0.6.1"
    }
}

You can then add a gruntfile like ours (in the code block below) which uses grunt-watch to keep an eye on your changes and grunt-shell to run either the makefile or the make.bat, live updating your local preview while you work – awesome.

 
module.exports = (grunt) ->
#   Load grunt tasks automatically
    require('load-grunt-tasks')(grunt)

    grunt.initConfig(
        shell: {
            buildHtml: {
                command: 'make.bat html'
                options: {
                    execOptions: {
                        cwd: __dirname +  '/docs'
                    }
                }
            }
        }
        watch:
            options:
                livereload: true

            files: [
                'docs/**/*.rst'
            ]
            tasks: [
                'shell:buildHtml'
            ]
    )

    grunt.registerTask('default', ['shell:buildHtml'])

If you are using PyCharm (which has good .rst syntax highlighting) it is really easy to do the above using the Grunt panel at the bottom of the screen.

Hopefully this example will help make your documentation workflow even easier and there will be more great projects with matching documentation out there for us all to enjoy!

Stimulating Simulations — September 23, 2015

Stimulating Simulations

For some time I’ve been really interested in simulation games, from SimCity to Creatures; however, more specifically I find zero-player games really fascinating. I like the idea that the interactions of many small and often simple rules build together to create complex systems and hopefully a fun game or interesting simulation. The ultimate example of both a zero-player game and a situation in which “simple rules give rise to complex systems” is, in my opinion, Conway’s Game of Life (also a fun programming task to do when learning a new language).

One of the things I find compelling about these types of simulation is how simply they can be represented – the game of life is nothing more than a 2D array where each position is either on or off.
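To show just how little is needed, here is a minimal sketch of one Game of Life generation over that 2D array (cells outside the grid are treated as dead):

def step(grid):
    """Advance a 2D list of 0/1 cells by one Game of Life generation."""
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = sum(
                grid[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols
            )
            # Live cells survive with 2 or 3 neighbours; dead cells are born with exactly 3.
            nxt[r][c] = 1 if neighbours == 3 or (grid[r][c] and neighbours == 2) else 0
    return nxt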

Having never been one for fancy graphics I think embracing your constraints in true 37signals style and focusing on building the rules underneath the simplest interface possible is a lot of fun. I also find building up the rules over time and seeing how each new rule affects the simulation really interesting.

My first attempt at building something like this was a predator prey simulation which started with the most basic representations you could possibly imagine in the command prompt (and eventually became my honours project).

Eventually once I was happy with my rules I polished up the interface to support some of the more advanced features of the simulation and to make it more useable and interesting. However at the end of the day it is really only a representation of rules being applied across a few 2D arrays.

Recently, as part of my squad learning day at work, I decided to take this type of idea and apply it to the web as a simulation / game where multiple players would be able to interact with a system in order to affect the outcome in one way or another – eventually in a competitive game style.

To get started I used OAuth and Facebook login to create unique users, allowing them to join the simulation. I wanted to be able to track unique users and be able to use profile images etc but didn’t want to implement all of this stuff behind the scenes, so Facebook login seemed like a good starting point.

I found the .NET OAuth libraries to be a little fragmented, especially when (depending on project settings) they come bundled with the Membership Provider, which I found to be even more fragmented across versions of MVC. Thankfully, once I got my OAuth working I was able to bin the Membership Provider and run a fairly simple and stripped-down authentication model.

After someone logs in they are moved along to the main page of the application and in the process the browser tells the server side app the size of the view port. This is important as it is used by the program to work out how big a map it can produce.

As with my previous examples I went for a simple representation of the world, this time using HTML5 canvas to draw the map based on 32x32px tiles.

The canvas API was really easy to use and well documented.  The only slight issue I ran into was a condition on first load where it would try to draw the map before the tiles had been downloaded, resulting in black empty squares. The simple solution was to add onload functions for the Image objects and to only draw the map once they had all loaded.

The server then creates a “procedurally” generated game map, based on the view port / tile size, which is different each time you start.  To help with generating the map I found an open source library called LibNoise which generates noise maps.  I use this noise map to generate the terrain – the lowest points become water, middle points are grass and the peaks become forest areas.
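The terrain step really is just thresholding the noise values, something like the sketch below (the thresholds are arbitrary and this isn’t the code from my project):

def terrain_from_noise(noise_map, water_level=0.35, forest_level=0.75):
    """Map normalised noise values (0..1) to tiles: low = water, mid = grass, high = forest."""
    def tile(value):
        if value < water_level:
            return "water"
        if value < forest_level:
            return "grass"
        return "forest"
    return [[tile(value) for value in row] for row in noise_map]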

Once the terrain is generated the largest unbroken piece of land is found and a “town” is generated, the size of which is based on the size of the full map.  I then use my best attempt at implementing A* pathfinding to draw a road from the town to one of the edges of the map (prioritising edges that are far away to make things more interesting).

The pathfinding algorithm is probably one of the most interesting (and weirdest) things I’ve implemented in a while, I guess because it is quite removed from the web development that I normally do. To build it I followed the steps in this tutorial, writing the code as I went, and to my surprise (barring a few edge cases I needed to fix) it worked once I was done and I had a nice road from the town to the edge of the map.
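For reference, a generic grid-based A* looks roughly like the sketch below. This is a textbook version in Python, not the code from my project, and it assumes a grid of walkable / blocked cells with a Manhattan-distance heuristic.

import heapq

def a_star(grid, start, goal):
    """Find a path on a 2D grid of walkable (True) / blocked (False) cells.

    Returns a list of (row, col) steps from start to goal, or None if unreachable.
    """
    def heuristic(a, b):  # Manhattan distance
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    open_set = [(heuristic(start, goal), 0, start, None)]  # (f, g, node, parent)
    came_from = {}
    best_cost = {start: 0}

    while open_set:
        _, cost, node, parent = heapq.heappop(open_set)
        if node in came_from:
            continue  # already expanded via a cheaper route
        came_from[node] = parent
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        r, c = node
        for neighbour in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = neighbour
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc]:
                new_cost = cost + 1
                if new_cost < best_cost.get(neighbour, float("inf")):
                    best_cost[neighbour] = new_cost
                    f = new_cost + heuristic(neighbour, goal)
                    heapq.heappush(open_set, (f, new_cost, neighbour, node))
    return None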

I then add a certain number of townspeople to the map; the number of people is based on the map size. Most of them are placed in the town but a few are placed randomly throughout the map.

In what time I had left I implemented a very simple gathering rule for the townspeople whereby if they have nothing in their inventory they will move toward the nearest group of trees to collect “resources”.  Once they have done some collecting they will return to the town to drop off what they have, again using my probably less than perfect implementation of A*.

At the last minute I added spawning of enemies, in the form of skeleton zombies, with the intention of having them chase the townspeople.  I wanted to use this to add another “mode” to the townspeople where they would stop the normal gathering behaviour and run away from danger if a zombie came within a certain radius.

Unfortunately though I ran out of time, so the zombies can only watch on from the sidelines, envious of the townspeople and their fancy movement skills….

Overall this was a fun learning day, building on previous projects I’ve worked on in my spare time. Moving over to the web using Facebook login, OAuth and the canvas was also interesting, giving me a bit of an insight into the Facebook API and using OAuth (something I’d use in future projects rather than writing a whole logon system myself). The most interesting part though was the pathfinding, as it is so different from the stuff I work on day to day.

My First Memcache – Working with Python session state from a .NET background — May 29, 2015

My First Memcache – Working with Python session state from a .NET background

As a web developer it’s important to know about the different methods for preserving state available to you and the advantages and disadvantages of each. Speed of access, security, transmission costs, data size, volatility and shared access across web front ends are the types of things that need to be considered when deciding where parts of your session and state will be stored.

If you are coming from a .NET background, like me, you’ll be aware of how to set up an application’s session state, in memory, in a SQL database or on a separate state server, and how to access this from C# code.  You’ll also be aware of the different .NET caching options and how to cache both MVC action results and arbitrary objects.  Finally you’ll have seen the Request/Response.Cookies object allowing you to manage cookie storage.

Setting up a new .NET application with the concept of session and some form of state preservation across multiple sessions would be simple.  Good times.

However what about a Python application using Flask?

Recently someone in web apps *cough Dave cough* found this browser based version of battleships that we absolutely didn’t play at work.

It got me thinking that it would be a neat challenge to try to replicate it in Python as a project to learn more about Flask applications. So I fired up PyCharm and got my Flask skeleton working; however, it wasn’t long before I realised I didn’t know how to access the Flask equivalents of the .NET objects I described earlier, and that they’d be crucial to an application like this.

Step up the Flask session object and Flask Cache backed by a memcached server on EC2.

<disclaimer> Before I begin I’m clearly not a Flask or Python expert and this is just what I found from my initial investigation coming from .NET land. </disclaimer>

Sessions

Firstly I wanted the idea of a “player” in my game which required me to have a session and some form of unique id.  I was surprised that, unlike in .NET, Flask doesn’t generate its own sessions when users arrive.  This also meant that there wasn’t a session id for each user like there is in .NET sessions.  I did like how clean this meant my sessions / cookies were though, no asp.net garbage taking up space.

The session object in Flask is basically just a wrapper around cookies, however for “session” type things where you might have account data or things you don’t want the user to be able to modify it is advantageous to use session over cookies directly.  The reason for this is that going through the session object creates secure cookies that are encrypted and decrypted for you in the background.

All you need to do to access this functionality is to supply a “secret key” to your application, for example app.secret_key = “mysecret”.  The documentation recommends using os.urandom(24) to generate a key to use.

After this the session object behaves like a dictionary, for example:

session['sessionid'] = sessionid

or

if 'sessionid' in session:

     sessionid = session['sessionid']

I decided to base my session IDs on uuid.uuid4() which supplied substantial GUIDs.

Caching / Application State

Once I had my session and “players” I wanted a way for the application to be able to connect two sessions for a game and to remember the state of a game at the application, rather than session, level. For this I used Flask Cache, first in “simple” development mode and later backed by a memcached server hosted in our AWS sandbox.
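Wiring Flask Cache up is just a config dictionary. A sketch of roughly what mine looked like (the memcached host below is a placeholder):

from flask import Flask
from flask_cache import Cache  # the Flask-Cache extension

app = Flask(__name__)

# "simple" keeps the cache in-process while developing; pointing it at the
# EC2-hosted memcached server is just a change to this config block.
cache = Cache(app, config={
    "CACHE_TYPE": "memcached",
    "CACHE_MEMCACHED_SERVERS": ["ec2-xx-xx-xx-xx.compute.amazonaws.com:11211"],  # placeholder host
})
app.cache = cache  # so the rest of the app can call app.cache.set / app.cache.get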

Getting memcached running on an EC2 instance was a piece of cake – yum install and then use the memcached command (with -vv if you want to see the connections being made from your application etc). The only thing you really need to do in the way of configuration is to open access to the server on the default port memcached uses to communicate, 11211 UDP and TCP.

One important note is that you’d generally not open up access to your memcached server to the internet at large, and would instead restrict access to known IPs, otherwise anyone can add and read data from the server.  Memcached does have an authentication method, however I didn’t enable this for my test.

Getting the libraries necessary to connect Flask Cache to memcache on my Windows machine was a bit of a pain in the ass. I eventually managed after going from Python3 back to 2.7 and doing some fiddling around with different libraries.  I get the impression from my reading that this would have been much easier on Linux.

Once I got going the cache object was very easy to use to store arbitrary objects, and I quickly saw my access to the memcached server in the logs:

app.cache.set('users', users, timeout=500)

app.cache.get('users')

You can also use the cache to store the results of functions and templates too, similar to how you might use .NET output caching:

@app.route('/')

@cache.cached(timeout=50)

def index():

    return render_template('index.html')

or

@cache.cached(timeout=50, key_prefix='all_comments')

def get_all_comments():

    comments = do_serious_dbio()

    return [x.author for x in comments]

cached_comments = get_all_comments()

All in all a very enjoyable dive into creating a stateful application using Flask, and despite the slight difficulty in getting the correct libraries on Windows, everything worked as expected and was easy to plug into my app.

If you know of any good alternatives to the stuff I’ve mentioned or other things you think I should look at please let me know in the comments.

Consul for Service Discovery and Registration — May 6, 2015

Consul for Service Discovery and Registration

You might remember that in my last blog I talked about using AWS Lambda. In that blog one of the possible use cases for Lambda I mentioned was service discovery and registration.

Service discovery is an important area of interest and development at my work going forward. As we move further into the microservice squad model and push our stack into the cloud, knowing which services are available, the health of services, where they are and their configuration becomes more and more important.

In the more “uncertain” or “unpredictable” world of AWS, running our services in a multi-tenant environment, we are aiming to create an architecture in which failure is expected and gracefully tolerated. We are also aiming to create services where changes in demand are automatically detected and scaled for accordingly – adding and removing resources in real time.

Building in this way will allow us to fully utilise the fantastic opportunities working in the cloud brings and will also ensure we have a first class resilient and scalable web application. However to achieve failure resilience and on demand scaling we need to be able to answer service discovery type questions.

Enter – Consul

I had a look at Zookeeper, etcd and Consul which all aim to address the service discovery question.

There were a couple of little things that put me off Zookeeper and etcd. However I liked the look of how Consul worked and the website was also a lot nicer to browse for “getting started” type information which was in its favour. I also noticed that Consul is built by the same people that make Vagrant, which pushed me over the edge, so I chose to work with Consul.

My Test Environment

My test environment was 3 Amazon Linux AMI based machines running in our sandbox account.

The first box was my Consul “server” – you can think of this as the controller of the Consul cluster. In a production environment you’d have more than one of these to be fault tolerant. The other two boxes were Consul “agents”, which in the real world would be nodes associated with our services. I also stuck Nginx on the agents to make them slightly more like real boxes.

Installation

Besides doing a yum install, this was perhaps the easiest installation I’ve seen. You download a zip file, unzip it and copy the binary it contains into somewhere like /usr/local/bin and you’re done!

Typing “consul” verifies everything is ready to go.

This step is the same regardless of whether the machine is going to be a server or an agent.

Starting a Consul Cluster

I then started up the first two machines, the first as the server and the other as an agent. Doing so is as simple as running the “consul agent” command and either passing (or not passing) it the -server flag. Normally you’d run this in the background, possibly using something like supervisor, however for simplicity I was doing this by hand.

At this point the consul server and agent came to life, the server even elected itself king of its little one node world, however at this point they don’t know about each other.

Connecting the dots

To connect the agent to the server I needed to issue the “consul join” command with the IP of the server. This is what a node would do if it was added to a cluster in order to announce itself to the world.

One of the interesting claims of Consul is that if a new node contacts one of the existing nodes in a cluster to join, that existing node will “gossip” to all the others, letting them know about the new one.

To test this I got my third machine online and asked it to join the cluster by contacting the existing agent, not mentioning the server at all. It told me it was accepted into the cluster, so I went over to the server and ran the “consul members” command to list all the nodes it knew about. As if by magic – all three machines were listed.

That’s great but where is the useful information?

HTTP “Service” API

Now I had a little cluster running I wanted to know what it could tell me.

Using a very simple JSON config file I told one of the agents that it was a “web” service running Nginx on port 80.

I was then able to use the Consul HTTP API to query the cluster for what was running as a “web” service. Doing so from one of the other nodes returned some JSON listing the service, what it was running, its IP address and node name. All useful information if I was another box looking to send requests this way.

Depending on the configuration file on the boxes and the query you issue the HTTP API can be quite powerful in finding out what is running in a cluster and how to speak to it. There is also a DNS API which supplies the same information in a different format.
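For example, asking the local agent the same “what is running as a web service?” question from Python is a couple of lines against the catalog endpoint (a sketch using the requests library):

import requests

# Ask the local Consul agent which nodes are providing the "web" service.
for entry in requests.get("http://localhost:8500/v1/catalog/service/web").json():
    print(entry["Node"], entry["Address"], entry["ServicePort"])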

Key Value Store

Another way to pass information around the cluster is to use Consul’s key value store. This was probably the most interesting part for me as it would enable us to replace the costly “requirements end point” calls which the website scaffolding currently has to make each time it builds a page for an ESI component.

To put something into the key value store an agent in a cluster can call its own HTTP API and any changes will be distributed to the rest of the cluster.

To mimic how requirements might work in this system I added keys using the path

/SERVICE-NAME/NODE-NAME/KEY

For example:

curl -X PUT -d '' http://localhost:8500/v1/kv/homepage/node1/endpointconfiguration

curl -X PUT -d '10.10.10.10' http://localhost:8500/v1/kv/homepage/node1/nodeipaddress

curl -X PUT -d '6400' http://localhost:8500/v1/kv/homepage/node1/version

After issuing these requests I was able to read them back again from the other nodes (the values come back base64 encoded).

I was pleasantly surprised to see both my simple data and my JSON object successfully stored, proving that this could be a valid alternative to requirements end point calls.
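Reading the keys back programmatically and decoding the base64 values is similarly straightforward. A small sketch against the same KV endpoint used above:

import base64

import requests

# Recursively read everything stored under homepage/node1 and decode the values.
response = requests.get("http://localhost:8500/v1/kv/homepage/node1?recurse")
for item in response.json():
    value = base64.b64decode(item["Value"]).decode("utf-8") if item.get("Value") else ""
    print(item["Key"], "=", value)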

Conclusion

I couldn’t really fault Consul; it was incredibly easy to set up and play with and seems to provide everything that would be required to solve the service discovery problem out of the box.

I’d like to have a go at setting up a more automated cluster, using Ansible for deployment and having nodes register and deregister themselves automatically, to get a fuller understanding of the system. I’d also be interested in writing an application layer to consume the Consul information and make decisions based on it, to see how easy this would be to implement. However off the back of what I’ve tried so far I can’t imagine it being terribly difficult.

If you have some spare time I’d also recommend watching this video from dotScale last year where Mitchell Hashimoto (one of the founders of Hashi Corp) takes you through how Consul works – https://www.youtube.com/watch?v=tQ99V7QjEHc

AWS Lambda First Impressions — April 13, 2015

AWS Lambda First Impressions

AWS Lambda

At work we have adopted the Spotify squad model. In my squad we’ve started doing “learning days”.

Each sprint one person in the team takes a day to do some experimentation and learning on an interesting new technology, technique, platform etc.

Each week ideas are put forward in our planning meeting and the team votes for the idea they find most interesting.

The person who suggested the winning idea then gets to investigate it for a day during the upcoming sprint. At the end of that day we have a short talk back and demo session where the topic is discussed.

So without further ado, here is what I found out about AWS Lambda.

What is Lambda?

Lambda is a fairly new service that is provided by Amazon Web Services. Basically it allows you to run NodeJS functions on Amazon’s compute resources in response to events and triggers. One of the big selling points for me was that absolutely all you need to do is write the code and upload it, Amazon manages getting it onto a box and running it for you when your events are triggered.

If you are using Lambda to plumb together existing AWS services such as S3, RDS, SQS etc you don’t even need to write the event triggering or plumbing code, you can configure it all in the relevant control panels online.

Why did I choose to learn about Lambda?

Firstly I was intrigued by having nothing to do with managing the underlying compute resource. As a developer I want to be able to write code that makes an impact, I don’t particularly want to worry about managing underlying computing resources. I was also interested in the event driven nature of running these functions and how they could be used.

During the talk back the squad and I discussed some interesting uses for Lambda including log file parsing (or other large files and streams of data), scraping, service registration and user tracking.

What did I build?

I was quite interested in using Lambda for service registration because a system like this would be useful for registering ESI services (which is part of the usual squad work we do).

As a proof of concept I created two S3 buckets, one to receive data and another as an output location. Whenever an ESI service’s requirements file (which details the version and dependencies of the service in JSON format) was uploaded to the input bucket my lambda function was triggered.

The function reads and parses the JSON file and outputs a txt file containing the version of the service to the output bucket. Basically, to prove how easy reading a file like this would be.
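At the time Lambda functions were written in Node (mine was), but to give a flavour of how little code is involved, here is a roughly equivalent sketch in Python using boto3. The output bucket name and the top-level “version” field are assumptions for the example.

import json

import boto3

s3 = boto3.client("s3")
OUTPUT_BUCKET = "esi-requirements-output"  # hypothetical output bucket

def handler(event, context):
    """Triggered by an S3 upload: read the requirements JSON and write out its version."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        requirements = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

        s3.put_object(
            Bucket=OUTPUT_BUCKET,
            Key=key + ".version.txt",
            Body=str(requirements["version"]).encode("utf-8"),  # assumes a top-level "version" field
        )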

My original idea was that the output would be put onto an SQS queue, however I just ran out of time so decided to use S3 for output as I already had the code required to upload to a bucket.

Conclusion

I really enjoyed working with AWS; I’ve done it in the past and it was great to get back into it with some of the newer services like Lambda. If you haven’t used AWS, get an account and have a go – the free tier will easily cover any experimentation you do.

Having only briefly used Lambda without much previous Node experience I found it surprisingly easy to create something potentially useful. Given a bit more time and the right application I think it could be incredibly powerful and I’ll be keeping it in the back of my mind when we are thinking of how to implement new systems.

Adding Device Detection & Web Optimisation To An Existing Web Application / Architecture — February 7, 2015

Adding Device Detection & Web Optimisation To An Existing Web Application / Architecture

In a commercial environment we don’t always have the luxury to change our existing systems or architecture to accommodate the latest best practices or to respond quickly to a changing market.

In general, visitors and session numbers are in decline on desktop and on the rise on mobile.  Due to this it is becoming more and more important to provide the best, and most performant, experience possible to mobile users.

Part of this of course, is general good practice for web performance across the board.  Another is a tailored mobile experience which, in the modern internet landscape, needs to reach beyond simple responsive design.

It was a combination of these factors that recently led me to implement a web optimisation and device detection solution which could be applied to an existing system with little or no change.

Nginx

I started with Nginx working as a reverse proxy, simply serving unmodified requests, at the front of our stack as a proof of concept. Where exactly this sits will depend on your stack; however, behind a load balancer / traffic manager (Stingray, HAProxy) and in front of the top layer of the application obviously makes sense.

Being behind traffic management lets you route requests which, for whatever reason, aren’t suitable for optimisation around this service. This can be very convenient while evaluating and tuning your system as it allows you to bring in a component, service or market at a time rather than a “big bang”.

The other advantage of this is that should these new servers fail, your load balancer can be configured to bypass them as part of a failover pool. This is especially true if your next layer is an ESI/caching layer such as Varnish.

Mod PageSpeed

Once you are running Nginx (or Apache) and are proxying requests to your application, adding modpagespeed to give you a whole load of awesome optimisations is pretty simple.

In both cases the pagespeed module is a relatively simple install.  Tuning the module on the other hand can, depending on the complexity of your architecture, be more tricky. Between traffic management and module configuration you’ll need to make sure your requests end up in the correct place.  There are also some “interesting” configuration settings that can cause problems if you aren’t aware of them.  I’ll cover these at the end of the post in the “Pitfalls” section.

Modpagespeed will automatically minify and combine resources, in-line other resources (including some images, which is very cool) and will perform HTML and image compression and optimisation among other things. It isn’t an excuse for lazy development practices by any means, but it does help speed things up and is very worthwhile given how little effort it is to activate.

Device Atlas

Device Atlas is a product which supplies comprehensive device and capability detection. It’s very easy to build as part of Nginx, and once you do its detection properties are available to you inside the Nginx config during each request.

The simplest way to supply this information to your application is to pass the desired variables downstream via request headers.

For example:

proxy_set_header DA_IS_MOBILE_PHONE $da_isMobilePhone;

Your application will then receive this information in the headers of each request allowing it to make device or capability intelligent decisions.
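On the application side this is just reading a request header. As an illustration (the framework choice and the exact header value format are assumptions on my part), in a Flask app it might look like this:

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/")
def index():
    # Header added by the proxy_set_header line above; the exact value format
    # depends on the Device Atlas module, so check for the common cases.
    raw = request.headers.get("DA_IS_MOBILE_PHONE", "")
    is_mobile = raw.lower() in ("1", "true")
    # e.g. let the template skip adverts and large secondary content on mobile.
    return render_template("index.html", is_mobile=is_mobile)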

Typically, the easiest first step to make with this information is to stop sending mobile devices content or resources which are never actually used.

Adverts, large images and other secondary content are often hidden by responsive design on smaller screen sizes. Simply saving bytes and time by not sending these items to mobile devices is a great first step in improving your mobile performance.

Where you go from here will obviously depend on your application and situation, but having the ability to tailor the user’s experience to the device they are using is very powerful.

Metrics

As you’re embarking on a project that will involve adding a new system to your architecture with the goal of improving performance you are going to want to measure:

  • The new system to analyse how it behaves and to ensure it isn’t causing problems
  • The (hopefully increased) performance of your application.

After all you’re going to want to take all the credit within your organisation for this uplift in performance with your awesome graphs and stats!

Three pieces of open source software that work brilliantly with the tools I’ve already discussed are Diamond, Statsd and Graphite.

Diamond is a daemon written in Python which can collect system metrics (CPU, RAM, disk space, networking etc) and publish them to Graphite. It’s really easy to install – download node, clone the repo, customise the config file and start the daemon. Diamond can also be used as a collector for Statsd metrics, which is useful so that you only have one process sending metrics to Graphite.

Graphite comes in two parts. Firstly, the backend service “carbon” collects the metrics which you send to it and stores them for retrieval later. The other part of Graphite is its graphing interface, allowing you to navigate and build graphs from the metrics you send. The graphs Graphite produces out of the box aren’t exactly what you’d call “sexy”. They are, however, very functional and they have a number of built-in functions and tools to allow you to shape the data in a meaningful way.

Finally, Statsd is a daemon (built by Etsy) that can listen for custom metrics from applications and systems that communicate with it using a simple interface. You can send metrics to Statsd directly from the Nginx config, making it a no-brainer to track what Nginx and modpagespeed are doing.

For example by putting the following inside a server block:

    # Set the server that you want to send stats to.
    statsd_server localhost;
    # Randomly sample 10% of requests so that you do not overwhelm your statsd server.
    # Defaults to sending all statsd (100%).
    statsd_sample_rate 10; # 10% of requests
    # Increment "pagespeed.requests" by 1 whenever any request hits this server.
    statsd_count "pagespeed.requests.total" 1;

Statsd can also be used easily inside Python, C#, Ruby and Java code so it can be quite a convenient tool to introduce to your stack.
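For example, using the statsd Python package and pointing it at the same daemon (the metric names and the work being timed are just placeholders):

import statsd

# Point at the same Statsd daemon the Nginx config above reports to.
stats = statsd.StatsClient("localhost", 8125)

def render_page():
    return "<html>...</html>"  # stand-in for the real work being timed

stats.incr("myapp.requests.total")      # count an event
with stats.timer("myapp.render_time"):  # time a block of work
    render_page()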

It’s also worth noting that all of these components can be configured to run under Supervisor.  Having supervisor’s web interface to monitor and start/stop/restart processes can be really handy, especially if your infrastructure team limit access to servers.

Logging

In this section I want to mention something specific to Nginx, if you are not using Nginx you can skip this.

First, be aware that Nginx can, depending on your settings and traffic levels, create some absolutely monstrous logs.

Logs are in general obviously very useful. Huge logs however, which are difficult to read and obfuscate problems due to the volume of data they contain, aren’t useful in the slightest.

To solve the problem of having large logs taking up space and being difficult to use effectively, I investigated other solutions and found Graylog.

Graylog is a system which can be used to collect, store and analyse logs from many different systems. Its searchable interface beats trawling log files every day of the week.

Sending your logs to Graylog from Nginx is a one-line config change, which is awesome…HOWEVER…it isn’t available in all versions of Nginx!

If you want to use Graylog with Nginx make sure you use version 1.7.1 or later, or your life will be more difficult.

Pitfalls

Resources

The first pitfall you’ll want to avoid is under-speccing these machines. If they are working hard, serving lots of traffic and optimising plenty of requests and resources, they’ll need a reasonable amount of memory and CPU power.

Obviously this all depends on traffic levels but I’d suggest at least 8GB of RAM (we eventually settled on 16GB).

Disk space is the hottest resource for these machines.  The whole pagespeed resource cache is stored on disk so depending on the size of your site this cache can become very large.

As I mentioned before, the log files, if enabled, can also contribute to this issue. One quite silly problem that I observed was Nginx writing to the error log when disk space is low and resources aren’t being rewritten. This sounds logical enough, however it just makes the logs grow larger and larger on a busy system, eventually running the free disk space down to zero.

The final resource question is of course, how many of these servers will I need?

That’s a difficult question to answer and there of course isn’t a single answer. Your number will depend on traffic levels, caching rates, backend performance, payload size and a whole host of other factors.  We’re serving more than 2 million unique visitors a day, and although we have more capacity, I think we could comfortably get away with running 16 of these boxes at a time.

Device Atlas Headers

We decided to create our Device Atlas headers following the underscore naming convention that they use in their documentation.

The problem with this, and using underscores in headers in general, is that they are part of CGI legacy. Due to this, Nginx (and some other servers) silently drop them on arrival. This might not be a problem if you are simply sending them downstream to your application – this will work as you’d expect.

However, if for testing purposes, you’d like to spoof these headers by overriding them in a program like Fiddler you’ll have problems as they’ll be dropped on arrival at the Nginx server.

You can either change your configuration not to use underscores or you can enable underscores by using the following statement:

underscores_in_headers on;

Request Body & Header Sizes

A final pitfall to avoid is failing requests and errors due to large requests and headers being sent to these servers.

The following configuration lines which set the buffer, header and body sizes which Nginx will accept should stop any issues related to request size.

    # Buffer Size Stuff
    proxy_buffering on;
    proxy_buffers 8 32k;
    proxy_buffer_size 64k;
    proxy_temp_file_write_size 256k;
    proxy_temp_path /path/ 1 2;
    large_client_header_buffers 8 32k;
    client_max_body_size 256k;

Simply substitute the size values for suitable ones based on the requests you have monitored entering your systems.

Note that if you are using Varnish you’ll also need to replicate similar settings in the Varnish config.

Conclusion

Adding these programs together on top of an Nginx server isn’t difficult and all of the components I’ve mentioned are free and open source (besides Device Atlas, though it does have a free version). Furthermore, they can be added in front of an existing system with little or no modification.

Given the ease of installation and the fact it can give significant performance gains to an already running system, it seems a fairly obvious choice to at least try when looking to conquer an ever faster and more mobile-friendly world.