How GitOps is Driving Cloud Native at Scale

GitOps is great, isn’t it? What’s that, I hear you ask. Simply put, now that all infrastructure can be virtualized, GitOps is about managing a description of what that infrastructure needs to look like (written as a text file), alongside the application that’s going to run on it. Hold onto that word ‘managing’.

The concept of infrastructure-as-code managed in the same way as software code may be simple, but its consequences are powerful. Thence GitOps, the term coined by Alexis Richardson, CEO and co-founder at Weaveworks: ‘git’ being the code repository of choice for cloud-native applications, and ‘ops’ because, well, isn’t everything about that these days?

Weaveworks’ own GitOps workflow solution, FluxCD, has just graduated from the incubator factory that is the Cloud Native Computing Foundation (CNCF) – no mean feat given the hoops through which it will have had to jump. “We had security auditors all over the code,” said Alexis when I caught up with him about it.

FluxCD is not the only kid on the block: ArgoCD for example, led by teams at Intuit, Codefresh, and others, has also achieved CNCF graduation. Two competing solutions aren’t a problem – they work in different ways and suit different use cases.

And what of those powerful consequences? Well. Driving GitOps is the clear and present need to manage configuration data in massively distributed, potentially highly changeable application environments. In the increasingly containerized space of cloud-native applications, this same driver spawned orchestration engines such as Docker Swarm and Kubernetes, as well as the need for cloud observability tooling – a.k.a. how the heck do we identify a problem when we don’t even know where our software is running?

In the cloud native space, this generally means that any applications that have achieved their goals of delivering at scale – cue examples that follow the Netflix architecture – need to keep on top of how they deploy their software and then how they manage it at the same scale. Do so and you can achieve great things.

For example, having all three in place (GitOps, orchestration and observability) is vital to scenarios such as machine-to-machine communications and driverless cars. In the telecoms space, in which the latest generation of wireless (5G) is cloud-native by design, the ability to deliver software and configuration updates in parallel and at scale only becomes possible by adopting such principles as GitOps. “You can update forty thousand telco towers without touching them. That just wouldn’t be possible otherwise,” remarks Alexis, referring to Weaveworks’ partnership with Deutsche Telekom.

GitOps is neat. However, there’s a lot to unpack in the phrase “manage configuration data” from the fifth paragraph above: this isn’t all about moving left to right, from application/infrastructure design to deployment and then into operations. Close to my heart, and something I’ve written about before, is an issue central to all things DevOps – that, in our drive to innovate at speed, we have sacrificed our ability to manage what we have created.

This inability to close the DevOps infinity loop can be likened to a firehose spluttering out trace data, incident reports, user experience metrics and the like, showering the development side of the house with bits and pieces of information without any real prioritization or controls. It’s a mess, often meaning (I am told, anecdotally) that developers don’t know what to work on next in terms of fixes, so they just get on with what they were going to do anyway, such as new functionality.

Elsewhere I’ve talked about the governance gap between innovation strategy (“Let’s build some cloud native stuff”) and delivery. It’s a reason why I latched onto Value Stream Management early on as a way of building visibility across the pipeline; it’s also why I was keen to learn more about Atlassian’s move squarely into the IT service management space.

GitOps solves for the governance gap, not by adding dashboards and controls – at least, not by themselves. Rather, a fundamental principle of GitOps is that configuration information is pushed in the same way as code and then not tampered with post-deployment, unless it can’t be helped.

These two concepts are enshrined in the heart of GitOps tooling, as otherwise it’s just stuff that I bet looks good on a whiteboard. From the Open GitOps site, the full set of principles is as follows:

1. Declarative – a system needs to be documented in advance through declared statements rather than having to discern the system from its runtime configuration.

2. Versioned and Immutable – this is the bit about storing these infrastructure declarations alongside application code, in a version-controlled repository such as git.

3. Pulled Automatically – now we’re talking about how the desired system is always built based on its declared configuration rather than by tinkering.

4. Continuously Reconciled – this is the coolest and most important bit: if you do go and tweak the runtime configuration, the tooling should detect the change, and trigger a fix.
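To make principles 3 and 4 a little more concrete, here is a minimal sketch in Python of what “detect the change, and trigger a fix” could look like. It is an illustration only, not how FluxCD or ArgoCD are implemented: the `Drift` class, the `detect_drift` and `reconcile` functions, and the configuration keys are all hypothetical, and plain dictionaries stand in for the git repository and the running system.

```python
from dataclasses import dataclass


@dataclass
class Drift:
    """One difference between declared and running configuration."""
    key: str
    declared: object  # value in the repository (None means "should not exist")
    running: object   # value observed at runtime (None means "missing")


def detect_drift(declared: dict, running: dict) -> list[Drift]:
    """Compare declared state with running state and list every difference.

    Assumption for this sketch: None is never a real configuration value,
    so it can double as the marker for "key absent".
    """
    drifts = []
    for key in sorted(declared.keys() | running.keys()):
        want, have = declared.get(key), running.get(key)
        if want != have:
            drifts.append(Drift(key, want, have))
    return drifts


def reconcile(declared: dict, running: dict) -> list[Drift]:
    """Bring running state back in line with declared state; return what was fixed."""
    drifts = detect_drift(declared, running)
    for d in drifts:
        if d.declared is None:
            running.pop(d.key, None)     # not declared, so remove it
        else:
            running[d.key] = d.declared  # restore the declared value
    return drifts


# Hypothetical declared state, as it might be read from a git checkout.
declared = {"replicas": 3, "image": "shop:1.4.2"}
# Someone has hand-edited the live system.
running = {"replicas": 5, "image": "shop:1.4.2", "debug": True}

fixed = reconcile(declared, running)
for d in fixed:
    print(f"drift on {d.key!r}: declared={d.declared!r}, running={d.running!r}")
```

The design point is that the fix always flows one way: the running system is corrected to match the repository, never the reverse. A real tool would run this comparison on a schedule or on change events, and would raise an alert alongside the fix.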

Tools such as FluxCD and ArgoCD enact these principles. Fascinatingly, they work with the fact that engineers aren’t going to want to slow down how they build stuff; they just enforce the rule that you can’t tamper with it once it’s done – and if you do, an alert will be raised. This can cause pushback from people who want to enact changes on the running system, rather than changing the source of truth, says Alexis. “People say there’s high latency, they often haven’t set their system up right.”

I’m making this point as clearly and directly as I can, because of the dangers of (can I call it) GitOps-washing. Just delivering on the first two principles above, or simply storing infrastructure-as-code information in git, does not mean GitOps is being done. Either it’s a closed loop with alert-driven configuration drift identification and reconciliation, or it’s just another pipeline.

Nor is this merely about principles: it is about benefits. That point earlier about rolling out updates to forty thousand telco towers? That’s only possible if the sources of deployment friction are minimized or removed altogether and if the resulting environment can be operationally managed based on a clear-as-possible understanding of what it looks like. “There’s no other operating model that really scales,” remarks Alexis, and he’s right.
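A toy model shows why the pull-based approach scales this way. The sketch below is a hedged illustration in Python, not a description of the Deutsche Telekom deployment or of any real tool: `ConfigRepo` is a hypothetical stand-in for a git repository, `SiteAgent` is a hypothetical agent running at each site, and the fleet size and configuration keys are made up.

```python
import hashlib
import json


class ConfigRepo:
    """Hypothetical stand-in for a git repository: an append-only list of versions."""

    def __init__(self):
        self._versions = []  # each entry is (revision id, config snapshot as JSON)

    def commit(self, config: dict) -> str:
        snapshot = json.dumps(config, sort_keys=True)
        # Content-derived id, only loosely analogous to a git commit hash.
        revision = hashlib.sha256(snapshot.encode()).hexdigest()[:12]
        self._versions.append((revision, snapshot))
        return revision

    def head(self) -> tuple[str, dict]:
        revision, snapshot = self._versions[-1]
        return revision, json.loads(snapshot)


class SiteAgent:
    """Hypothetical agent at one site; it pulls, and nobody pushes to it."""

    def __init__(self, name: str):
        self.name = name
        self.revision = None
        self.running = {}

    def sync(self, repo: ConfigRepo) -> bool:
        """Pull the latest declared state; return True if anything had to change."""
        revision, declared = repo.head()
        if revision == self.revision and declared == self.running:
            return False
        self.running = declared
        self.revision = revision
        return True


repo = ConfigRepo()
agents = [SiteAgent(f"site-{i}") for i in range(1000)]  # made-up fleet size

repo.commit({"firmware": "2.0"})
first_rollout = sum(agent.sync(repo) for agent in agents)

repo.commit({"firmware": "2.1"})  # one commit is the whole "rollout"
second_rollout = sum(agent.sync(repo) for agent in agents)

idle_pass = sum(agent.sync(repo) for agent in agents)  # nothing new to pull

agents[0].running["firmware"] = "hand-edited"  # manual tampering at one site
repair_pass = sum(agent.sync(repo) for agent in agents)

print(first_rollout, second_rollout, idle_pass, repair_pass)
```

The operator’s effort is one commit regardless of fleet size, and the same `sync` step that rolls out a new version also repairs a site that has drifted.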

Ultimately this goes to the heart of what it means to be agile in the digital world. Agility is not about controlled chaos or breaking things without ever really creating them: it succeeds with ways of working and accompanying tooling that align with the needs of innovation at scale. Yes, GitOps is great, but only if all its facets are adopted wholesale – GitOps lite is no GitOps at all.