The Case Against Gitops

For a long time, the default method Ops teams have used to introduce changes to an environment has been gitops. It works reasonably well, but it’s not exactly lightweight or nimble. We assume that the safest way to deploy any change is to have it tested, reviewed, merged, validated, and finally applied to a cluster.

On the other hand, feature flagging has been common practice on development teams for more than ten years. Development teams change the running state of their applications all the time without going through the gitops workflow. Why are we ok with this for application behavior but not for configuration changes? It turns out we actually are ok with configuration changes like this.

Ops accepts on-the-fly changes to configuration all the time in the form of HPAs, VPAs, mutating webhooks, and annotations on resources that third-party operators can act on. Why hasn’t this been extended beyond these use cases yet? I’d argue the only reason is a lack of adequate tooling. It doesn’t have to be like this anymore.
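For illustration, a standard HPA like the sketch below continuously rewrites a Deployment’s replica count in the live cluster, and nobody opens a PR for each scaling event (the names here are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # The controller changes spec.replicas on the fly to hold
          # average CPU utilization near this target.
          averageUtilization: 75
```

The HPA controller mutates running configuration continuously, yet we consider this safe because the guardrails (min/max replicas, target metrics) were reviewed up front.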

What changed?

With the release of Crossplane v2, it suddenly became very easy to do server-side rendering of YAML manifests. Crossplane’s composition pipelines, which can run additional functions beyond simple manifest templating, have also opened the door to much more advanced platform management techniques.

An example:

I recently vibe coded a function called function-backstage. It reads a label on an XR, fetches details about the corresponding entity in Backstage, and adds those details to the composition’s pipeline context so that functions further down the pipeline can use them to conditionally render Kubernetes manifests.

Here’s what this looks like in code:

XR

A Database associated with your company’s example-service project:

apiVersion: crossplane.my-company.com/v1alpha1
kind: Database
metadata:
  name: example-service
  namespace: default
  labels:
    backstage.fn.crossplane.io/kind: Component
    backstage.fn.crossplane.io/name: example-service
    backstage.fn.crossplane.io/namespace: engineering
spec:
  engine: postgres
  version: 18.1

Backstage entity body

The data returned from Backstage about the example-service:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: example-service
  namespace: engineering
spec:
  developmentStage: experimental
  owner: [email protected]

Composition

Using that entity to provision databases using different providers:

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: database
spec:
  pipeline:
    - step: fetch-backstage-entity-data
      functionRef:
        name: function-backstage
      input:
        apiVersion: backstage.fn.crossplane.io/v1beta1
        kind: Input
        spec:
          apiEndpoint: "https://backstage.my-company.com/api/catalog"
          entityRef:
            nameLabel: "backstage.fn.crossplane.io/name"
            kindLabel: "backstage.fn.crossplane.io/kind"
            namespaceLabel: "backstage.fn.crossplane.io/namespace"
          policy:
            # Fail to render anything if there is an error fetching the entity
            resolution: Required

    - step: create-resources
      functionRef:
        name: function-go-templating
      input:
        apiVersion: gotemplating.fn.crossplane.io/v1beta1
        kind: GoTemplate
        inline:
          template: |
            # Defining the XR so it can be more easily used in the template
            {{ $xr := .observed.composite.resource }}

            # The entity is stored in a key called "backstage.fn.crossplane.io/entity"
            # We define a shortcut to that here
            {{ $entity := index .context "backstage.fn.crossplane.io/entity" }}

            # CNPG is used for experimental services
            {{ if eq $entity.spec.developmentStage "experimental" }}
              ---
              apiVersion: postgresql.cnpg.io/v1
              kind: Cluster
              metadata:
                name: {{ $xr.metadata.name }}
                namespace: {{ $xr.metadata.namespace }}
                labels:
                  owner: {{ $entity.spec.owner }}
              spec:
                imageName: ghcr.io/cloudnative-pg/postgresql:{{ $xr.spec.version }}
                ...

            # AWS RDS is used for stable services
            {{ else if eq $entity.spec.developmentStage "stable" }}
              ---
              # Use AWS RDS if this is a stable service
              apiVersion: rds.aws.m.upbound.io/v1beta1
              kind: Cluster
              metadata:
                name: {{ $xr.metadata.name }}
                namespace: {{ $xr.metadata.namespace }}
                labels:
                  owner: {{ $entity.spec.owner }}
              spec:
                forProvider:
                  engine: {{ $xr.spec.engine }}
                  engineVersion: {{ $xr.spec.version }}
                  region: us-west-1
                  tags:
                    owner: {{ $entity.spec.owner }}
                  ...

            {{ end }}
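
Given the example-service XR and entity above (and assuming a label-safe owner value such as database-team), the pipeline would render a CNPG Cluster, because the entity’s developmentStage is experimental. A sketch of the rendered output, with the remaining spec fields elided as in the template:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-service        # from the XR
  namespace: default           # from the XR
  labels:
    owner: database-team       # from the Backstage entity
spec:
  imageName: ghcr.io/cloudnative-pg/postgresql:18.1
```

Changing developmentStage to stable in Backstage and reconciling would swap this resource for an AWS RDS Cluster, without touching the XR or the composition.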

Why this is a big deal

Without this type of tooling, you would need to make changes to your infrastructure, deployment configurations, etc. manually. Now we can expose simple web forms to developers through the developer portal, asking for the things we traditionally asked for in Slack or ticketing systems. Developers can autonomously answer our questions about security, compliance, or reliability, and the infrastructure is built automatically. Because the guardrails are built into these compositions, developers can do whatever they want without asking for permission or assistance; Ops doesn’t need to be involved at all. This is a major shift in how platform engineering teams can structure infrastructure and offer self-service workflows: we build the guardrails and then get out of the developers’ way.

Here are some other examples of how this workflow could be used:

  • Kubernetes Deployments could be packaged as fully locked-down services (low resource limits, aggressive egress rate limiting, node isolation, etc.). Users can add context about their service as the project matures, and as more details are provided, more access to infrastructure resources is made available to the workload. Developers would be able to move fast with experimental services (even pushing directly to production) in a safe manner.
  • Permission to deploy to specific environments can be granted automatically when company-defined prerequisites are met. For example: you must have custom metrics and dashboards in place before you can deploy to a performance testing environment. When a link to a dashboard is provided, an ArgoCD ApplicationSet can be updated automatically to allow syncing to the new environment.
  • A function-slo could check whether a service has exceeded its monthly error budget. If so, you could modify the progressive rollout strategy currently in place so that changes bake for longer intervals at smaller percentages of traffic, or add additional CI jobs to the source repository for enhanced scrutiny of changes.
  • A function-central-reporting could store all rendered manifests in a database. This would let you reference resources spread across multiple clusters, and use the status of those resources to render other compositions, without granting excessive Kubernetes API access. You could have multiple isolated Kubernetes/Crossplane control planes dedicated to things like access management, data layer infrastructure, compute layer infrastructure, etc.

There are really endless options here.

These function and pipeline features are exactly why Crossplane is a significant improvement over Helm and Terraform for internal services and infrastructure. You could probably build similar experiences for end users with Helm or Terraform if you really wanted to, but it would be much more complicated, and in the end you would essentially be rebuilding Crossplane. Why reinvent the wheel?

What about gitops?

The principles of gitops don’t really change, honestly. You still need changes to be tested, reviewed, merged, and validated before they are applied to the cluster. The main difference is that the thing we put our primary effort into testing and reviewing is the composition, and less so the XRs that consume it. Once we know the guardrails are in place, it’s ok to loosen our grip on the running configuration of a service.

When adopting this pattern, gitops simply stops being the exclusive method of modifying a service. What we have now is almost an ops-centric approach to feature flags. Gitops is still valuable in some situations, but for others it’s overkill. When building your compositions, consider what you really want approved through the PR process and what you are comfortable automating. That gives you a pretty good indicator of when to use gitops and when not to.

With this new generation of tooling available, there isn’t any reason we can’t have services fully automated and presented to development teams via self-service interfaces. Platform teams assume the responsibility of building and testing the compositions. The development teams can take full control over the XRs and the infrastructure associated with their applications by extension.

With all this said, platform teams need to remember to build their platforms with empathy for the end user. It’s almost never enough to build a composition in isolation. Compositions need to ship with well-written, easily digestible docs that explain what each option in your XRD does and the impact of configuring it.