Beyond CI/CD: GitLab's DevOps vision

By Mark Pundsack, GitLab

October 11, 2017

How we're building GitLab into the complete DevOps toolchain. With GitLab 10.0, we shipped Auto DevOps for the Community and Enterprise Editions. Read on for an in-depth look at our strategy behind it, and beyond.

I recently met with my colleagues Joe and Courtland to give them the lowdown on GitLab's DevOps vision: where we've come from and where we're headed. You can watch the video of our discussion or check out the lightly edited transcript below. You can also jump into the rabbit hole, starting with the meta issue for GitLab DevOps.

CI/CD: Where we've come from

CI/CD/Beyond CD

When I joined GitLab about a year ago, I created a vision document for CI/CD, and outlined a lot of the key things that I thought were missing in CI/CD in general, and going beyond CD. I literally called one section "beyond CD" because I didn't have a name for it then.

And in that document, I created an example pipeline to characterize all this stuff, to show how the pieces fit together into a development lifecycle.

Example pipeline

I love this diagram not only because it's complex and scary, but because when we started, we had maybe four boxes filled in, and now we have 10 or 12 filled in. To start with, we had code management and, obviously, builds and tests. And we kind of did deployment, but not really.

Since then, we've added review apps, a specific example of deployments, which is really awesome. We also added a more formalized mechanism for doing deployments: actually recording deployments and deployment histories, keeping track of environments, and everything else. Then we added Canary Deployments in 9.2 and code quality in 9.3. We added system monitoring with Prometheus in 9.0.

We don't yet have what I called "business monitoring," which could mean monitoring revenue, or clicks, or whatever you care about; but that's coming. We don't yet have load testing, but the Prometheus team is thinking about that. We don't yet have a plan for feature flags, but I think it's a really important part.

And then we have this other dimension of pipelines, which is the relationship between different codebases (or projects), and in 9.3 we introduced the first version of multi-project pipelines.

So we've gone from a core view of three or four boxes to where 90 percent is complete. That's pretty awesome.

It became obvious to me that we were viewing the scope with this hard line: developer-focused rather than ops-focused. For example, we'll deploy into production, and we might even watch the metrics related to your code in production, but we're not going to monitor your entire production app, because that's operations, and that's clearly out of scope, right?

Where we're headed: Beyond CD

What hit me a few months ago is, "Why is that out of scope? That's ridiculous. No, we're going to keep going. We're going to go past production into operations." Most of this still applies, but instead of just monitoring the system as it relates to a merge request, what about monitoring the system for network errors, outages, or dependency problems? What if we don't stop at production, and monitor things that are typically ops-related that may not involve a developer at all?

Then I realized that this thing I called Beyond CD, maybe it's really DevOps. Maybe the whole thing is DevOps.

The DevOps tool chain

To offer some context: DevOps is hard to define, because everybody defines it slightly differently. Sometimes DevOps is defined as the intersection of development, operations, and quality assurance.

DevOps Venn diagram

Image by Rajiv.Pant, derived from Devops.png, CC BY 3.0

For the most part, my personal interest in DevOps has been in that intersection. We do great code management; we've done that for quite a while. How do we get that code into production? How do we get it into QA?

Review apps are a great example that fits squarely in that tiny little triangle in the middle of the Venn diagram. You take your code and you deploy it, which is an operations thing, but you have it deployed in a temporary, ephemeral app, just for QA people (or designers, product managers, or anyone who is not a primary coder), so they can test your application for quality assurance, feature assurance, or whatever.

But now, I'm looking beyond the intersection. Here's the DevOps tool chain definition from Wikipedia:

DevOps Toolchain

Image by Kharnagy (Own work) CC BY-SA 4.0, via Wikimedia Commons

Well, that's everything! That's not the intersection; that's the union of everything from code, to releasing, to monitoring. And that's where things get confusing. Sometimes when people talk about DevOps, they're not talking about all of your code stuff. It's the intersection parts that are the interesting parts of DevOps. It's the parts where we let developers get their code into production easily. That slice, that intersection of the Venn diagram, that's the interesting part about DevOps.

Having said that, as a product company, we are going to deliver things that are pretty squarely on the development side, and, eventually, we're going to deliver things that are pretty squarely on the operations side. At some point, we may have an operations dashboard that lets you understand your dependencies in your network infrastructure, and your routers, and your whatever. That's pretty far-fetched at this point, but it could happen. Why not? Just have GitLab be your one operations dashboard, and then it's not just about the intersection of DevOps, it's the whole DevOps tool chain.

So, that is the whirlwind, high-level summary of where we've been, and a little bit about where we're going. Now let's get into specific issues.

The Ops Dashboard – #1788

We have a monitoring dashboard that's very developer-centric. What about taking that same content and slicing it from the operator's perspective? For a moment, ignore all the stuff below; let's just pretend there are only the four boxes at the top:

Ops view of monitoring and deploy board

So an operator might want to know, "What's the state of production?" If I'm a developer I can go into a project, into environments, see the production environment for that project, and I can see what the status is. But what if I want to see all production environments? As an operations person, I care a little less about individual projects than I care about "production." So this is giving me the overview of "production." All of these little boxes would represent production deploys of projects that you have in your GitLab infrastructure.

The view is deliberately convoluted because we had just introduced sub-groups and I wanted to make sure this mechanism expanded. So ignore all the stuff below and just look at the top-level dashboards. Or maybe one level down, which is still pretty complicated, but let's say your marketing organization had different properties than your other developer operations; you'd be able to see really quickly what the status is. If something's red, you'd be able to click down and see details.
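As a rough illustration of that roll-up idea (a sketch only, not how GitLab implements it), a group's status can be computed as the worst production status found anywhere beneath it. The status names, the `SEVERITY` ranking, and the nested-dict layout below are all assumptions made for the example.

```python
# Hypothetical ranking of production statuses, from best to worst.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


def rollup(node):
    """Return the worst production status found in a group tree.

    A node is either a status string (a project's production environment)
    or a dict mapping names to child nodes (a group or sub-group).
    An empty group is treated as "ok".
    """
    if isinstance(node, str):
        return node
    worst = "ok"
    for child in node.values():
        status = rollup(child)
        if SEVERITY[status] > SEVERITY[worst]:
            worst = status
    return worst


# Made-up groups and projects, purely for illustration.
org = {
    "marketing": {"blog": "ok", "landing-page": "critical"},
    "platform": {"api": "warning", "workers": "ok"},
}
print(rollup(org))              # critical
print(rollup(org["platform"]))  # warning
```

Because each group reports the worst status below it, a red box at the top level always leads to at least one red box as you click down.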

Ops view - service health

Ops view - pod health

You'd be able to see graphs like this, which are similar to what we already provide, but from the other angle. As a developer I'm looking at the deploy, and saying, "Oh, how did my deploy affect my performance?" But this is saying, "How's production? Is anything wrong with my entire production suite?"

This is really just scratching the surface of the ops views of things, but I think it's going to become much more important as people embrace DevOps. You want your developers to be talking the same language as your operations people. In a lot of organizations, it's already the same people; there are no separate operations people. Developers push code to production, and they're paged if something goes wrong. In others, developers and operators are separate, but they want to work together towards DevOps.

Either way, you want to be using the same tools. You want to be able to point to, for example, a memory bump that your operations people should also be able to see. But if they're using completely different tools, like New Relic and Datadog, that kind of sucks. So let's give them the same tools.

Pipeline view of environments – #28698

I particularly love this proposal, and I really want to see this happen soon.

The environments page today is just a list of environments showing the last deployment. The picture tells you who deployed, which is good, and you can see that the commit is from the same SHA as staging, which is kind of nice. I can see the deploy board, and if there's a deploy ongoing, I'm able to see the state as it rolls out. We don't yet show you the current health of these pods; once they're deployed, all we know is that they're deployed. This is how the environment view is today, and it's centered around deployments.

Environments list

 Current Environment view

You can click through to see the deployment history, and this is actually really valuable because I can see who deployed things and how long ago. If something went wrong in production, I can really quickly roll back and let the developers have some space to go and figure out what went wrong.

Deployment history

 Current Deployment History view

But this proposal turns it around to have more of a DevOps view of the thing.

Pipeline view of environments

 Proposed pipeline view of Environments

The idea is to take the same application, and instead of just looking at a list of environments, I'd be looking at columns with lots of review apps, and some number of staging environments, and a production environment. Instead of just showing you the SHA, we would show you, for example, what merge requests have been merged into staging that are not yet in production. That's a great marriage of these two views, that you'd be able to see the diff between them.
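The "diff between environments" described above boils down to a set difference. The following is a minimal sketch under that assumption; the function name and the merge request IDs are made up for illustration and do not reflect the actual proposal's data model.

```python
def pending_merge_requests(staging_mrs, production_mrs):
    """Return merge requests deployed to staging but not yet to production.

    The order of the staging list is preserved so the result reads
    like a deployment history.
    """
    in_production = set(production_mrs)
    return [mr for mr in staging_mrs if mr not in in_production]


# Hypothetical merge request references.
staging = ["!101", "!102", "!103", "!104"]
production = ["!101", "!102"]
print(pending_merge_requests(staging, production))  # ['!103', '!104']
```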

This list, although it's just a mockup, shows maybe the last five things that were in production, or what was included in the last deploy, or whatever works best for your environment. Showing what's in the last deploy might be enough, but for people who deploy 17 times a day, maybe that's a little less useful, and we just show history.

But then what about building in more of the operations kind of stuff, and saying, "Alright, what's the state of my pods?" Here we were flagging where the error rate exceeded a threshold and there's some alert that popped up. And here we're showing this automatic rollback kind of stuff, but basically just really building on this ops view. Of course this is still a DevOps view, in the sense that I'm looking at an individual project. So, one permutation of that would marry that ops view of all of production. Or if I'm looking at a microservices kind of thing, where there are five or 100 different projects, and I want to see the status of all those really quickly. See #28707.

Dependency security – #28566

So, here, the idea is that you've deployed something in production, and some module or something that you depend on has been updated, not by you, but by the community, or someone else.

The easiest and most naive way to approach this is that with the next merge request, or next CI/CD run, we would go and check to see if anything's outdated. And we might fail your CI/CD because of this.

It would make much more sense to run this stuff automatically. Even if, for example, nobody pushes for seven days, and in the middle of that, there's a security release; just proactively run stuff and notify me. So, that's sort of a second iteration of thinking about how you would notify somebody, and tell them, "Oh, you've got a security change. You should go in and do something about it."

Now, the third iteration is, "Well, what would you do with that information?" You'd go and maybe give it to your junior developer to go and make the change, and point to the new version. And then, of course, you need to test that it works. So, you're going to create a merge request, and then test it, to make sure that it still functions properly.

Well, why notify somebody, and tell the junior developer to go and do this? Why don't we just do it for you? Why don't we just go and submit the merge request for you, and then tell you what the results are? And, in fact, let's go further, and say, "Hey, it passed. We just deployed into production for you." Why would you have a security vulnerability in place any longer than necessary?

And instead of having 100 alerts about 100 projects or microservices that all need to get updated, you just get alerts about the three that failed, the ones that actually have some weird dependency that the update didn't work with. And then, you can focus on real problems.
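To make the three iterations concrete, here is a small sketch of the triage logic: find outdated dependencies, try the update, and only surface the projects where the updated build fails. It is an illustration under simplifying assumptions, not a description of the feature: versions are compared by plain inequality, and `tests_pass` is a hypothetical callback standing in for "open a merge request and run its pipeline."

```python
from typing import Callable, Dict, List


def outdated(pinned: Dict[str, str], latest: Dict[str, str]) -> Dict[str, str]:
    """Map each dependency whose pinned version differs from the latest
    known version to that latest version (simplified: no version ordering)."""
    return {
        name: latest[name]
        for name, version in pinned.items()
        if name in latest and latest[name] != version
    }


def triage(
    projects: Dict[str, Dict[str, str]],
    latest: Dict[str, str],
    tests_pass: Callable[[str, Dict[str, str]], bool],
) -> Dict[str, List[str]]:
    """Sort projects into: nothing to do, updated automatically, or needs a human."""
    result: Dict[str, List[str]] = {
        "up_to_date": [],
        "auto_updated": [],
        "needs_attention": [],
    }
    for project, pinned in projects.items():
        bumps = outdated(pinned, latest)
        if not bumps:
            result["up_to_date"].append(project)
        elif tests_pass(project, bumps):
            # Hypothetical happy path: the update merge request passed,
            # so it could be merged and deployed without alerting anyone.
            result["auto_updated"].append(project)
        else:
            result["needs_attention"].append(project)
    return result


# Made-up projects and versions; the lambda pretends only "legacy" fails its tests.
projects = {
    "api": {"rails": "5.0"},
    "web": {"rails": "5.1"},
    "legacy": {"rails": "4.2"},
}
report = triage(projects, {"rails": "5.1"}, lambda project, bumps: project != "legacy")
print(report["needs_attention"])  # ['legacy']
```

Only the `needs_attention` list would turn into alerts, which is the point of the passage: the noise drops from every outdated project to the few that genuinely need someone to look.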

Dependency security

So, that's a glimpse at how we're thinking about this.

This would definitely be an enterprise-level feature. And again, we've fleshed out some ideas and it's unscheduled, but it does really tie into the ops mindset.

Question: Enterprise Edition features

Courtland: You mentioned that sort of automation would be an Enterprise Edition feature. Can you talk a little bit more about why a smaller development team, like under 100 developers, wouldn't get value out of something like that?

Mark: So, this is where things get a little tricky, because of course, smaller developer teams would get value out of that too. Everybody would get value out of that. Some of it has to do with proportionality. One test I like to use is: is there some other way you could achieve the same thing, using workarounds, and we're just making it easier? And that's a good case, here. You can already do this, but we're going to automate it. And automation is something that affects larger companies a lot more, because they've got hundreds of projects, with thousands of developers. At that scale they just can't keep up manually, so the automation is worth it. Whereas, if you've got a small team with a single project, you're pretty much on top of it. And if something changes, yeah, you just go ahead and fix it; you're aware of it. The bigger challenges are when you're just not aware of how this thing might affect one project that somebody's almost forgotten about.

The other thing is that, just to be blunt, our concept that Enterprise Edition is only for more than X people is a little flawed. It's more that it applies more to those companies: they value it more, and they'd be willing to pay more for it, or however you judge value there. Clearly, small companies would value all this automation, and everything else, but they're not going to get as much incremental value out of it as a larger company would.

The other way to look at it is that this is pretty advanced stuff, and frankly, it doesn't deserve to be free, open source. It's probably really complicated stuff, and you're going to have to pay there. Maybe there'd be levels to it, right? There'd be a version that gives you an alert: we'll run this test once a day. Or even just have a blog post about how to do this: you set up a recurring, scheduled pipeline job, once a day, to test if any of your dependencies have been updated. And you can do that today and then it would alert you. But to automate it, to actually create a merge request for you, and everything else? Well, that's an Enterprise feature. It's not that version checking isn't important for everybody, but the automation around it really, really matters for larger companies. Does that make sense?

Courtland: Yeah, I mean, I think that the first way you described it, in that, "Yeah, everyone gets some value out of a feature like this, but the overwhelming value and use for this is in larger development teams," that resonated.

SLO and auto revert – #1661

This is a feature showing how we're thinking about auto reverting something. We've got canary deployments, and there's another feature that we're not currently working on and haven't scheduled: incremental rollout, so that you would not just roll out to a single canary, or a bucket of canaries, but it would slowly increment: 1 percent, then 5 percent, then 25 percent. But let's say, at some point during the rollout, you detect an error.


This is a mockup of what it would look like. You're like, "Oh, error rates increased by something above our threshold; let's revert that one, go back, and create a new issue, and alert somebody to take a look at it." Lately, I'm thinking that I don't know if I really want to automatically roll back, versus just stop it in its canary form, and say, "Well, it's canary. Let's let canary be there, so you can debug the canary, but just don't let the canary go on further."

Error rate exceeding is a pretty tough one. But let's say memory bumps up, and you might be like, "Yeah, we added something, and it's using more memory, and we're okay with that. Don't stop my deploy just because it's using more memory." There might need to be human intervention in there, but somewhere along this line we're automating a lot of the deploy stuff.
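The "stop it in its canary form" variant can be sketched as a loop that advances through rollout percentages and holds at the first step where the error rate crosses a threshold, rather than rolling back. This is only an illustration: the step values beyond 1, 5, and 25 percent, the threshold, and the `error_rate_at` callback are assumptions, and a real system would likely need the human-intervention hook mentioned above for metrics such as memory.

```python
def incremental_rollout(steps, error_rate_at, threshold):
    """Advance through rollout percentages in order.

    Holds at the first step whose observed error rate exceeds the threshold,
    leaving that step in place for debugging instead of reverting it.
    Returns (percent_reached, halted).
    """
    reached = 0
    for percent in steps:
        reached = percent
        if error_rate_at(percent) > threshold:
            return reached, True
    return reached, False


# Made-up error rates observed at each rollout percentage.
observed = {1: 0.001, 5: 0.002, 25: 0.08, 100: 0.002}
print(incremental_rollout((1, 5, 25, 100), observed.get, threshold=0.05))  # (25, True)
```

In this run the rollout stops at 25 percent and never reaches 100, which mirrors the idea of letting the canary stay up so it can be inspected.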

Onboarding and adoption – #32638

Onboarding and adoption is a really big issue, with lots of different ideas for how to improve onboarding, how to get people actually using Idea to Production, and how to improve Auto Deploy. There are not a lot of visuals, so I won't really talk about it, but it's definitely one of our top priorities; the next most important thing we're working on.

Cloud development – #32637

Cloud development is the idea that setting up your local host machine is actually kind of a pain sometimes. Especially with microservices, where each service can be in its own language, you don't want to maintain Java, and Ruby, and Node, and all these other versions of dependencies, and every time something switches, you've got to reinstall a new version of stuff. Or even these days, you might develop on an iPad, and you don't have a local host to compile things.

Cloud9 is the biggest, best-known thing from an IDE perspective, and Amazon bought them a little while ago. But even aside from the IDE portion of it, it's just being able to develop in the cloud, and being able to make some changes, and then push them back; commit them to a repo.

We have a little bit of a demo like this, right now, with our web terminal. So, if you have Kubernetes, you see this terminal button, and it just pops up the terminal right in the staging server. And I can actually go ahead and edit a file there, and... I just made a live change into my staging app.

Now, generally speaking, I would not actually recommend you do that, because I'm messing with my staging app, and that's not what it's for. It makes an awesome little demo, but it's not what you should do. What we want to do is come up with a way that people could do that, but have it be not on your staging app, but in maybe a dev environment that is specifically for this purpose. But also, after you make your changes, and test them, and run them live, you can then go and commit them back to version control, and close that loop. So there's a whole bunch of issues related to that. And to be honest, it was what we were hoping that Koding would have provided for us, and we have an integration with them, but it hasn't really worked out the way that we had hoped. And so, we're looking at alternatives, and we think we can probably do this ourselves.

Anyway, that's a big thing to flesh out.

GitLab PaaS – #32820

Heroku is awesome, because it gives you this really great platform that's easy to use, and gives you all this functionality on top of Amazon. Five or six years ago it was super, brain-meltingly awesome to get people to do ops. As a developer, I don't have to be aware of how to do ops; Heroku just does ops for us.

GitLab PaaS is basically the idea that you've got a lot of these components, and we're not going to invent them all from scratch. We're going to rely on Kubernetes, for example. But on top of Kubernetes, we could make an awesome environment for ops. An ops environment, or a platform as a service. And so, there's an issue to discuss what it would take to do that. At some point in time, this is a big item for us. If we can make it really easy for you to fully manage your ops environment via GitLab, and maybe, for example, never touch the Kubernetes dashboard, never touch any of the tools, just use the GitLab tools to do this, that's pretty powerful.

Sort of related is an idea in the onboarding stuff, that on GitLab.com we can actually provide you with a Kubernetes cluster; maybe a shared cluster. We have to worry about security, of course. But imagine if you were a brand new user on GitLab.com, and you push up an app, and you have nothing in there specifically for GitLab, you just push up your code, and GitLab is like, "Oh, that's a Ruby app. Okay, I know how to build Ruby apps. Oh, and I also know how to test Ruby apps. I'm just going to go and test them automatically for you." And, "Oh, by the way, I know how to deploy this. I'm just going to go ahead and deploy this to production." And we'll make a, whatever the hell, some domain so that it's not going to affect your actual production. But if you wanted to, you would just point your DNS over to this production app, and you've got the production app running on GitLab infrastructure. And that's, really, what Heroku provided, right?

But that also is an onboarding thing for us to make it really easy. Because if we want everybody to have CI, well, let's turn it on for you. That's pretty awesome. If we want everybody to have CD, we can't just turn it on for you, because you have to have a place to deploy it to. So, if we just provided you a Kubernetes cluster ("everybody gets a cluster"), then you just got a place. And, I mean, we'll severely limit it. We'll make it limited in some way, so that you're not going to run the production stuff for long there. Or if you do, you have to pay for it. But we're not going to try and make money off of the production resources. We want to make money off of making it really easy. So, really, what we want to do is encourage you to, then, go and spin up your own Kubernetes cluster, say, on Google. And we'll make a nice little link that says, "Go and spin up a cluster on GKE." We'll make that really, really easy, but to make it super easy, for some number of days, we can just provide you that cluster, automatically.

Feature flags – #779

Feature flags are really about decoupling delivery from deployment. It's the idea that you make your code, you deploy it, but you haven't turned it on, so it's not delivered yet. And the idea there is that it means you can merge into the mainline more often, because it's not affecting anybody. And, also, it really helps because you can do things like: when I do deliver, I can deliver it for certain people, just GitLab employees or just the Beta group, and then I can control that rollout. So then, if there's an error rate spike, well, it's just a few people and I know who they are, and they're going to complain to me. It's no big deal. But I can test things out, get it polished, fix the problems, before rolling it out. And then, you can also do things like roll it out to 10 percent of the people, 50 percent of the people, whatever. It's all about reducing risk, and improving quality, and fundamentally about getting things into your mainline quicker. So, it's ops-ish, in that sense, but it's, really, still pretty fully on dev.
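A minimal sketch of the two delivery controls described above, group targeting and percentage rollout, might look like the following. Everything here is illustrative: the function names, the flag and group names, and the choice to bucket users by hashing are assumptions, not a description of how GitLab plans to implement feature flags.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Deterministically map a (flag, user) pair to a bucket from 0 to 99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag, user_id, groups, allowed_groups=(), percent=0):
    """Decide whether deployed code is 'delivered' to this user.

    The flag is on if the user belongs to any allowed group (for example
    employees or a beta group), or if the user's bucket falls inside the
    current rollout percentage.
    """
    if set(allowed_groups) & set(groups):
        return True
    return bucket(flag, user_id) < percent


# Hypothetical flag and groups: employees always see it, everyone else
# only once the percentage dial is turned up.
print(is_enabled("new-editor", "alice", {"employees"}, {"employees"}))  # True
print(is_enabled("new-editor", "bob", set(), {"employees"}, percent=0))  # False
```

Hashing the flag name together with the user ID keeps each user's answer stable between requests, and raising `percent` from 10 to 50 only ever adds users, so nobody flips back and forth during a rollout.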

Artifact management – #2752

Artifact management has become a hot topic lately. We already have a container registry for Docker image artifacts, and we also have file-based artifacts that you can pass between jobs, and pass between pipelines, and even pass between cross-project pipelines. And we have ways to download them, and browse them, but if those artifacts happen to be things like Maven, Ruby, or Node modules, and you want to publish them, and then consume them in other pipelines, we don't have a formal way to do that.

And you could, obviously, publish to the open source RubyGems, for example. But if you want a private Gem, that is only consumed by your team... Maybe that's not as big for Ruby developers, but Java developers do that all the time. A lot of Java developers use Artifactory or Sonatype Nexus. In order to complete the DevOps tool chain, we need to have some first-class support for that, either by bundling in one of these other providers, or by adding layers, and APIs, on top of our existing artifacts. My personal pet favorite right now is, let's say we can just tag our existing artifact, and say, "Oh, this is a Maven type of artifact," and then we expose that via an API and so then you can declare that in another project, and it would just consume the APIs, and just know how to do that. But it would also use our built-in authentication so you don't have to set up creds and do all this declaration; you can be like, "Oh, I've got access to this project and this project, so I can get the artifacts, and I can consume it all really easily."

Auto DevOps – #35712

Note: We shipped the first iteration of Auto DevOps in 10.0

So, let's talk about Auto DevOps. This spans from the near-term to the very long-term. It's great that we do a lot of DevOps, and in a very simplistic way, it's like, "Oh, but shouldn't we just make this stuff automatic?" The way I phrase it is, we should provide the best practices in an easy and default way. You can set up a GitLab CI YAML, but you have to actively go and do that. But, really, every project should be running some kind of CI. So, why don't we just detect when you've pushed up a project; we'll just build it, and we'll go and test it, because we know how to do testing. Today, with Auto Deploy, we already use Auto Build, with build packs. We will automatically detect, I think, one of seven different languages, and automatically build your Java app, or Ruby, or Node... and we use Heroku's build packs, actually, to do this build. And so we build that up, and when using Auto Deploy, we'll go ahead and deploy that. You still have to, obviously, have a Kubernetes cluster in order to do that, so it's not fully automated if you don't have that. But if you've got Kubernetes, hey, this is literally one click. You pick from a menu, say, "Oh, I'm on Kubernetes," and then hit submit, and you've got Auto Deploy and Auto Build.
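The detection step can be pictured as a lookup of well-known marker files in the pushed repository. The sketch below is a deliberately simplified illustration of that idea; it is not the detection logic used by Heroku's build packs or by GitLab, and the `MARKERS` table covers only three languages chosen for the example.

```python
# Simplified, assumed mapping from a marker file to the language it suggests.
# Order matters: the first marker found wins.
MARKERS = (
    ("Gemfile", "ruby"),
    ("package.json", "node"),
    ("pom.xml", "java"),
)


def detect_language(filenames):
    """Guess a project's language from the files at its top level.

    Returns None when no known marker file is present, in which case
    nothing could be built or tested automatically.
    """
    present = set(filenames)
    for marker, language in MARKERS:
        if marker in present:
            return language
    return None


print(detect_language(["Gemfile", "app.rb", "README.md"]))  # ruby
print(detect_language(["README.md"]))                       # None
```

A `None` result is the case where "just push and it works" cannot apply, and the project would fall back to an explicitly written CI configuration.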

But one of the things we don't have is Auto CI. And that's a little annoying, but it's one of the things we want to pick up, and actually, hopefully our CTO, Dmitriy, is going to pick that up in Q3; it's one of his OKRs. Heroku, themselves, actually extended build packs to do testing, and so that means that there's at least five build packs that know how to test these languages. And so, hey, let's use that. But even if that doesn't work, there's a lot of other things we can do. Other companies have all this stuff automated, as well. So if we can't use Heroku CI, being able to say, "Oh, this is this language; we know how to test this language," we'll be making that automatic.

Automatic means multiple levels of things. Is it a wizard that configures this stuff for me? Is it a one-click checkbox that says, "Yes, turn on Auto CI," or is it templates that I can easily add into my GitLab CI YAML? I think, in order to qualify as auto, it shouldn't be templates. It shouldn't be blog posts that tell me how to do it. That's just CI. It should be, literally, just "I pushed and it worked," or at most a checkbox or two.

Let's go further: what other things could we just automate here? And not automate strictly for the purposes of automation, but to bring best practices to people. So, you have to actively work hard to turn these things off. If you don't want CI, then shut it off, but by default you should have this.

So, this is a really, really long list of things that will take us forever to get to. The first ones have links, because we're tracking real issues for this. Auto Metrics is a great one. If you're running certain languages, you should just be able to, really easily, go and just pull the right information out of there. But whatever, the list is huge.

But the idea is that we can build up this Auto DevOps, even the marketing term, and start talking about it in that way, and to not just say that GitLab is great for your DevOps and is a complete DevOps tool chain. But, in fact, we do all this stuff for you automatically.

There's a lot to be done to make this fully automated. And what percentage of projects can we really do? Auto Deploy is a great example that only works for web apps. If it's not a web app, we can't just deploy it. What would it mean? We deploy it, and it just wouldn't function. If you made a command line app, what would deploy even mean? Or if it's a Maven, or really any kind of module that you bundled up and released, that's not the same thing as a deploy. So, maybe we need an Auto Release. It's not on this list, but maybe it should be. But within the web app space, we can do some of this stuff automatically.

So that's it. Everything you ever wanted to know about DevOps.
