Interview: For the FT’s Sarah Wells it’s problem first, tech second


It’s easy to forget that not everyone has a perfectly automated pipeline in place, and as exciting as projects such as Kubernetes sound, most companies have a long way to go before using them makes sense. To give you an idea of how other people navigate the changing waters of technology, and maybe help you find inspiration for your own projects, DevClass will regularly talk to leading practitioners, who’ll share their success stories and struggles and let you know what’s on the horizon.

Sarah Wells is Technical Director for Operations and Reliability at the Financial Times. She has been a developer for over 15 years and regularly shares her experience at conferences all over the world. Before her current position, she worked on building a semantic publishing platform that provides access to all the FT’s published content and metadata via APIs. Last spring, she gave a keynote in front of more than 4,300 attendees at KubeCon and CloudNativeCon Europe in Copenhagen.

DevClass: To get things started and draw a better picture of your working environment, could you tell us a bit about how the tech teams at the Financial Times are organised?

Sarah Wells: Teams are pretty empowered, so there is a lot of variety in how they organise themselves. Some teams use Jira, some Trello, and some use physical boards. Some have scrum masters, others don’t. Generally the approach is more kanban-like, with WIP (work in progress) limits, than scrum-like, with sprints.

We don’t tend to have long backlogs any more; we try to be outcome-driven, and to flex based on what we find out as we go. I’d worry if I had more than three things lined up. I have a longer backlog list in my mind, but it’s only when I start planning the next few months that I really decide on the key things I want the team to do.

We do use the Spotify team health check in most teams as a way to see how we are doing in terms of culture. It opens up some interesting discussions and the trends tell us a lot.

From time to time you mention DevOps, a practice which is supposed to unify software development (Dev) and operations (Ops), in your talks. How has following this approach worked for you?

DevOps has been part of a big culture change at the FT. I think you can say DevOps is culture change.

Automation freed us up to move fast. Delivery teams run their own systems in hours and are there for escalation out of hours. That in turn means we have built more resilient (although also likely more complex) systems.

With Serverless being the next big thing, some seem to think that this will mean another change in the relationship between development and operations. Since you started looking into it, do you feel this will be true for your teams?

Collaboration and sharing are valued, and important.

Serverless doesn’t change things with any of that. We won’t need as many people applying patches to VMs but we still need people to manage AWS policies, tooling and costs.

Serverless is still in its early stages, with problems in areas such as multi-cloud, security and monitoring. What would you like providers to concentrate on right now, to make their offerings more useful/appealing to your team?

Multi-cloud isn’t a major issue for me, and I do hear lots of people linking elements from AWS and GCP with success.

The big issue for me is observability. Our healthcheck standard doesn’t really work for Lambdas: you don’t want to fire up a Lambda just to ask it if it can still connect to a system it depends on. So we need to find other ways to know whether things are working. At the moment, it all feels a bit basic.

At CloudNativeCon you mentioned that your team is looking into chaos engineering at the moment. How did you end up with this method and how are your first steps going so far?

I think people have been doing this kind of thing for a while now but didn’t have a cool name for it. We were talking about what Netflix was doing with Chaos Monkey three years ago. In fact, the FT had their own simple version called the chaos snail (as it was written in shell).

What is different for me is the idea of running specific experiments, with a hypothesis and an evaluation of whether what you expected to happen did, where the Chaos Monkey was a bit more lucky dip. And there is tooling coming out for this – you don’t have to build stuff yourself. We haven’t played with tooling yet although I’m interested in Gremlin.

When you have a microservice architecture, you have things failing all the time. Every request that goes over the network has a certain percentage chance of failing or timing out, and the more you make, the more likely that will happen somewhere. So you are generally in a state where something isn’t quite working but the resilience you have built in protects you. But you need to test that resilience works the way you expect it to. Often, you find out that it doesn’t. For example (one that my colleague Euan quoted in his talk at Continuous Lifecycle London), we discovered when testing a full data centre outage that the failover mechanism actually didn’t work if either data centre wasn’t there!
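
The point about failure odds growing with request volume can be made concrete with a little arithmetic. The following is a minimal sketch, not anything taken from the FT's systems: it assumes every request fails independently with the same probability, which real networks only approximate, and the function name and the example failure rate are hypothetical.

```python
def chance_of_any_failure(per_request_failure_rate: float, request_count: int) -> float:
    """Return the probability that at least one of `request_count` requests fails.

    Assumption (illustrative only): each request fails independently with the
    same probability, so P(no failure at all) = (1 - p) ** n.
    """
    if not 0.0 <= per_request_failure_rate <= 1.0:
        raise ValueError("per_request_failure_rate must be between 0 and 1")
    if request_count < 0:
        raise ValueError("request_count must not be negative")
    return 1.0 - (1.0 - per_request_failure_rate) ** request_count


if __name__ == "__main__":
    # Hypothetical failure rate of 0.1 per cent per request.
    for request_count in (1, 10, 100, 1000):
        probability = chance_of_any_failure(0.001, request_count)
        print(f"{request_count:>5} requests: {probability:.1%} chance of at least one failure")
```

Under these assumptions, a one-in-a-thousand failure rate already gives roughly a 63 per cent chance of at least one failure across 1,000 requests, which is why built-in resilience, and testing that it works, matters more as the number of calls grows.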

Continuous Integration isn’t new either, but if one takes a look at the Kubernetes ecosystem, it seems to be only just landing now. What are the challenges in this particular context and what would you wish for in terms of “cloud native CI”?

We’ve been doing Continuous Integration for a long time, and naturally did that with our container stack too. What’s different for me is more about microservice architecture: many separate deployables. You need a deployment mechanism where you don’t have to manually set up new pipelines for a new service and where you can change every existing pipeline via templating.

I think we will get different ways of doing deployments. Alexis Richardson at Weaveworks talks a lot about GitOps, the idea that everything lives in source control and operations on that source are what sets off provisioning changes and deployments. I like that and we do a fair bit of this at the FT.

It needs to be easy to see the deployment state of all services. On my last team, we moved away from a purely Git commit-and-merge-driven deployment to Jenkins pipelines, to give developers a better overview.
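
The idea mentioned in this answer, of generating every service's pipeline from one shared template so that a single change reaches all of them, can be sketched as follows. This is an illustration only and not the FT's setup: the pipeline format, the service names, the registry address and the deploy script are all hypothetical, and Python's standard `string.Template` stands in for whatever templating a real CI system provides.

```python
from string import Template

# Hypothetical shared pipeline definition; editing it changes every service's pipeline.
PIPELINE_TEMPLATE = Template(
    "pipeline: $service\n"
    "stages:\n"
    "  - build: docker build -t $registry/$service:$version .\n"
    "  - test: docker run $registry/$service:$version run-tests\n"
    "  - deploy: ./deploy.sh $service $version\n"
)


def render_pipeline(service: str, version: str, registry: str = "registry.example.com") -> str:
    """Render the pipeline definition for one service from the shared template."""
    return PIPELINE_TEMPLATE.substitute(service=service, version=version, registry=registry)


def render_all(services: dict[str, str]) -> dict[str, str]:
    """Render pipelines for every service; `services` maps service name to version."""
    return {name: render_pipeline(name, version) for name, version in services.items()}


if __name__ == "__main__":
    # Adding a new service is one new entry here, not a hand-built pipeline.
    pipelines = render_all({"content-api": "1.4.0", "metadata-api": "2.0.1"})
    for definition in pipelines.values():
        print(definition)
```

Keeping the template in source control alongside the list of services also fits the GitOps approach described above, since a commit to either is what would trigger the change.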

What is your team’s success rate with adopting new technologies?

It’s really common at the FT to try stuff out in a quick, possibly hacky way and then when we see it working to go back around. We often try a SaaS solution or a community edition to get going quickly, and often stick with that but may then upgrade or install in house.

Do you have a time budget for proofs of concept, or have there been times you weren’t sure whether something wasn’t working because it wasn’t right for you or because it was still in an early-stage testing phase?

Doing evaluations quickly means it generally isn’t too painful to move on. We probably have abandoned some things that we may have got working if we’d taken longer but to be honest, things nowadays need to be easy to get going quickly. This means above all good documentation and self service installation. Without that, something else will get used.

Many businesses hear about new technologies but are hesitant to give them a go because their revenue depends on working, supportable systems. How does using leading edge technology like the FT did with Docker three years ago fit with the need for boring tech in business?

If the value you get is worth the complexity of building and, in particular, operating something, you can choose something new. Just be aware of the likely cost. Using containers saved us a lot of money on VMs and a fair amount of time on VM provisioning. We thought it was worth a more complicated operational landscape. Once Kubernetes approached maturity though, we moved to that.

Since your teams use open source products, how do you handle compliance, since it seems to be a concern for many?

We have a procurement process which includes technical due diligence and GDPR concerns. Anything that we pay over a certain amount for, or that stores personal data, or that is used in a critical system, or that meets a couple of other criteria, will go through that. And we often pay for support; for example, that’s the approach we have taken with CoreOS.

We have a security checklist and we do consider the implications of using new tools.

How does your team “give back” to the community? Is committing to open source just encouraged or actively supported?

A team at the FT leads development on polyfill.io, which provides a service that returns polyfills that are suitable for the requesting browser. But other than that, while we support people contributing to open source, we don’t have a specific strategy around that as far as I am aware.

What are your main tech struggles at the moment?

My challenges at the moment are around my new role, which involves setting up a new approach to operations and reliability tooling at the FT.

I don’t think it’s particularly a tech struggle, it’s a cultural change. We had a DevOps transformation in our delivery teams. We need to extend that further and improve the tooling available for our first line operations team.

Are there any new tools or technologies you see popping up right now that you’d be interested in trying?

I’m lucky: I get the chance to try the things that interest me, because normally the stuff I’m interested in is what I think will solve a problem I’m interested in. The problem comes first.

What are five key ideas you can share with teams that want to do more tech experiments?

  • Start small
  • Think what you/your team/the company will get the most value out of
  • Hack days get people used to trying things out. You need a lot of encouragement to get everyone comfortable with this.
  • 10 per cent days set a culture where this is fine. You have to require a goal and a presentation of results. (There is this idea of giving people 10 per cent of their working week to do research and development without them having to worry about day-to-day tasks, but on my team we prefer to do one full day a fortnight instead, hence “10 per cent day”.) You also have far fewer arguments: you just get people to try it in a 10 per cent day.
  • Celebrate it in a way that rewards doing it, not based around success. You want failures too, and probably 50 per cent of the things we try end up being things we decide not to go ahead with. A day of work to get to that conclusion is a good investment.

Is there any tool you couldn’t do without in your day-to-day work?

Sadly, because I’m now a Tech Director and not coding day to day, the tools that make the most day-to-day impact for me are probably Slack and Google Meet. Slack has largely replaced email, particularly for alerting, and we use the integrations quite heavily. Google Meet because we have teams in multiple locations and it’s the best way to make sure communication is equal, i.e. everyone at their laptops on one call, not most people in a room and others remote struggling to hear.

You seem to go to quite a few conferences, what/who has been the most inspiring conference session/speaker for you recently?

There was a fantastic track at QCon London this year about ethics in code, curated by Anne Currie and Gareth Rushgrove. All the talks were interesting and thought provoking. I think this is a discussion that’s starting to gain momentum. There was a conference on this in London in July, coed:ethics.

I also enormously enjoyed KubeCon Europe keynotes from Oliver Beattie telling the story of a production outage at Monzo, and Simon Wardley giving a whistlestop tour of strategy, value mapping and why serverless is going to win…

Also at KubeCon, Vallery Lancey, from Checkfront, talking about “Challenges to Writing Cloud Native Applications” was great. A really good overview.

Thank you for taking the time to talk to us.