mirror of
https://github.com/gravitational/teleport
synced 2024-10-19 08:43:58 +00:00
Add RFD 134: Cloud first deployments (#28072)
This commit is contained in:
parent
1df1ef89b4
commit
432c61a39a
217
rfd/0134-cloud-first-deployment.md
Normal file
---
authors: Roman Tkachenko (roman@goteleport.com),
         Zac Bergquist (zac@goteleport.com),
         Jim Bishopp (jim@goteleport.com),
         RJ (rjones@goteleport.com),
         Sasha Klizhentas (sasha@goteleport.com)
state: draft
---

# RFD 134 - Cloud-first deployment

## Required approvers

- Engineering: @russjones, @jimbishopp, @zmb3
- Product: @klizhentas, @xinding33

## What

This RFD describes the Cloud-first deployment approach, as well as the release
process changes required to support it.

## Why

Our current feature delivery process is not compatible with the company's shift
to "SaaSify" Teleport. The new (or rather, extended) Cloud-focused process aims
to address shortcomings like:

- Long release cycles where marquee new features get delivered to Cloud users
  (roughly) once a quarter and are tied to major self-hosted releases.
- Slow and monolithic release pipeline that always builds all artifacts and
  does not allow for very fast Cloud rollouts.
- Engineers being disconnected from Cloud deployments and unaware of when their
  changes are getting rolled out.

## Self-hosted release process (current)

Before diving into the new process, let's set the stage with a high-level overview
of the current release process.

The current release versioning scheme follows semver, with quarterly major
releases and monthly minor releases.

Major releases are rolled out to Cloud quarterly, with a month's delay. Major
features are reserved for major releases, so Cloud customers also get major
features with a month's delay.

The release process is triggered by the creation of a GitHub tag and builds the
full suite of artifacts, which takes 2-3 hours, so it's unrealistic to do
multiple deploys a day.

Cloud upgrades happen once a week, to a release published the previous week, and
engineers often don't have a clear idea of when their change is being
deployed.

## Cloud release process

In the Cloud-first world, not every release that's deployed on Cloud has a
corresponding self-hosted release tag. For this reason, we can't reuse
the current versioning scheme and apply it to both Cloud and self-hosted.

Instead, we want to get to a place where Cloud and self-hosted releases are
decoupled, with Cloud releases being regularly cut from the main development
branch and self-hosted releases cut from their respective release branches.

Although the concept of semantic versioning loses most of its meaning with the
continuous release train, we do need to keep it for Cloud releases to be able
to use the automatic upgrades system we built.

With this in mind, the Cloud version will consist of:

- Major: "development" version number on master branch (latest stable self-hosted release + 1).
- Minor/patch: always 0.
- Prerelease: `-cloud.X.Y`, where `X` is a timestamp of when the tag was cut
  and `Y` is the corresponding commit hash, e.g. `v14.0.0-cloud.20230712T123633.aabbcc`.

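
As a rough illustration of the scheme above, the following Python sketch builds
and parses a Cloud version tag. The function names, the regular expression, and
the `%Y%m%dT%H%M%S` timestamp layout are assumptions inferred from the single
example tag in this RFD; they are not an existing Teleport API, and the RFD does
not specify a timezone for the timestamp.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the example tag v14.0.0-cloud.20230712T123633.aabbcc.
TIMESTAMP_LAYOUT = "%Y%m%dT%H%M%S"
CLOUD_TAG = re.compile(
    r"^v(?P<major>\d+)\.0\.0-cloud\.(?P<ts>\d{8}T\d{6})\.(?P<commit>[0-9a-f]+)$"
)


def cloud_version(latest_stable_major, cut_at, commit):
    """Build a Cloud tag: major is latest stable self-hosted + 1, minor/patch are 0."""
    major = latest_stable_major + 1
    return f"v{major}.0.0-cloud.{cut_at.strftime(TIMESTAMP_LAYOUT)}.{commit}"


def parse_cloud_version(tag):
    """Split a Cloud tag into (major, cut time, commit hash); reject anything else."""
    match = CLOUD_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a Cloud release tag: {tag!r}")
    cut_at = datetime.strptime(match["ts"], TIMESTAMP_LAYOUT)
    return int(match["major"]), cut_at, match["commit"]


tag = cloud_version(13, datetime(2023, 7, 12, 12, 36, 33), "aabbcc")
print(tag)  # v14.0.0-cloud.20230712T123633.aabbcc
print(parse_cloud_version(tag))
```

One property worth noting: semver compares alphanumeric prerelease identifiers
lexically, so a fixed-width timestamp like the one assumed here would make later
Cloud tags within the same major sort after earlier ones.
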
Once it is time for the next self-hosted major release (every 3-4 months), the
new release branch will be cut and the major component of the Cloud release
version will increment.

```
                                                   14.0.0-alpha.1     14.0.0     14.0.1
branch/v14                                         |------------------|----------|---------....---->
                                                   |
                                                   |
            14.0.0-cloud.20230712T123633.aabbcc    |     15.0.0-cloud...  15.0.0-cloud...    16.0.0-cloud...    16.0.0-cloud...
master -----|-------....---------------------------|-----|----------------|-------....--|----|------------------|--------------....---->
                                                                                        |
                                                                                        |
branch/v15                                                                              |-----------------|---------|---------...---->
                                                                                        15.0.0-alpha.1    15.0.0    15.0.1
```

Once a self-hosted release branch is cut, subsequent self-hosted minor/patch
releases are published without any changes to the current process, but on a
less frequent cadence than Cloud releases off of master. Changes to the release
branches will follow the same "develop on master, backport to release branch"
workflow we're using now.

### Cloud release artifacts

The self-hosted release and promotion process is slow and takes 2-3 hours
because it builds a full suite of artifacts. Cloud users need most of these
artifacts as well for the agents they run on their own infrastructure; however,
the Cloud control plane needs only a very limited subset of all artifacts. For
this reason, the Cloud release and promotion process will be split into two parts:

- The workflow that builds artifacts required for the control plane: amd64
  binaries and Docker images.
- The workflow that builds remaining artifacts such as deb/rpm packages and
  Helm charts.

This will allow us to produce rapid releases that affect only control plane
services (auth, proxy) while still having the ability to push remaining artifacts
when needed.

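
The split described above could be modelled roughly as follows. This is only a
sketch: the workflow names, artifact identifiers, and the `workflows_to_run`
helper are hypothetical and chosen for illustration, and the rule for deciding
when the second workflow is "needed" is an assumption, not something this RFD
defines.

```python
# Hypothetical names; the RFD only says which artifact families each workflow builds.
WORKFLOW_ARTIFACTS = {
    "control-plane": ["amd64-binaries", "docker-images"],
    "remaining-artifacts": ["deb-packages", "rpm-packages", "helm-charts"],
}

# Services the RFD names as control plane services.
CONTROL_PLANE_SERVICES = {"auth", "proxy"}


def workflows_to_run(affected_components):
    """Pick workflows for a release given the components it touches.

    Assumption: the control plane workflow always runs, and the remaining
    artifacts are only built when something outside auth/proxy is affected.
    """
    workflows = ["control-plane"]
    if set(affected_components) - CONTROL_PLANE_SERVICES:
        workflows.append("remaining-artifacts")
    return workflows


print(workflows_to_run({"auth", "proxy"}))
print(workflows_to_run({"proxy", "ssh-agent"}))
```
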
Both workflows will run in GitHub Actions and will reuse existing bits of the
full self-hosted release pipeline.

### Cloud promotion process

The Cloud release promotion process will be staged, with a different class of
tenants being upgraded at each stage:

1. Canary
2. Internal
3. Trials
4. Team & Enterprise

The canary tenants will be a set of precreated clusters with some data in them
which engineers will use for connecting their agents and testing their features.
Canary clusters will be updated automatically once a Cloud release is cut.

Before promoting the release further, engineers will test any significant change
in the cloud staging environment and then explicitly sign off that the change is
ready for deployment by applying a label (e.g. `cloud/verified`) on the master
PR corresponding to the change. The release manager will then approve the
further rollout.

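
A minimal sketch of the promotion gate described above, under stated
assumptions: the `Change` type, the `next_stage` helper, and the PR numbers are
hypothetical, and the RFD does not say whether sign-off is re-checked at every
stage, so this sketch simply applies the same gate before each promotion.

```python
from dataclasses import dataclass, field

# Stage order taken from the list above; the identifiers themselves are made up.
STAGES = ["canary", "internal", "trials", "team-enterprise"]
VERIFIED_LABEL = "cloud/verified"  # example label name given in the RFD


@dataclass
class Change:
    pr: int                      # master PR number (hypothetical values below)
    significant: bool            # whether the change needs explicit sign-off
    labels: set = field(default_factory=set)


def unverified(changes):
    """Return PR numbers of significant changes still missing the sign-off label."""
    return [c.pr for c in changes if c.significant and VERIFIED_LABEL not in c.labels]


def next_stage(current, changes, manager_approved):
    """Return the stage the release may move to, or `current` if it is blocked."""
    index = STAGES.index(current)
    if index == len(STAGES) - 1:
        return current  # already rolled out to the last class of tenants
    if unverified(changes) or not manager_approved:
        return current  # missing engineer sign-off or release manager approval
    return STAGES[index + 1]


changes = [Change(pr=101, significant=True, labels={VERIFIED_LABEL}),
           Change(pr=102, significant=False)]
print(next_stage("canary", changes, manager_approved=True))
```
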
Once the change is rolled out to production tenants, engineers will be expected
to monitor error rates and Cloud metrics relevant to their changes.

Let's now look at some of the edge case scenarios in the release process and
how we'll be handling them.

### Behavior-incompatible changes

Teleport's component compatibility guarantees become somewhat obsolete in a
Cloud-first world. Historically, we've been reserving behavior-incompatible
changes for major releases. The concept of a major release, however, becomes
non-existent in the Cloud-first world where any change, backwards compatible
or not, gets deployed to all users within days.

When applied to the Cloud release model, and accounting for potential delays in
agent automatic upgrades, the compatibility requirement becomes:

> Any Cloud release must always be compatible with the oldest version of an
> agent that's connected to Cloud and enrolled in automatic upgrades.

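
The requirement above can be read as a check against the oldest enrolled agent.
The sketch below is illustrative only: the helper names and sample versions are
hypothetical, and "compatible" is approximated as "at most one major version
behind the release", which is an assumption made for this example rather than a
rule stated in this RFD.

```python
def parse(version):
    """Turn 'v14.0.0-cloud.20230712T123633.aabbcc' or '13.3.8' into (major, minor, patch)."""
    core = version.lstrip("v").split("-", 1)[0]  # drop the 'v' prefix and any prerelease
    return tuple(int(part) for part in core.split("."))


def oldest_enrolled_agent(agents):
    """agents is a list of (version, enrolled_in_auto_upgrades) pairs."""
    enrolled = [parse(version) for version, is_enrolled in agents if is_enrolled]
    return min(enrolled) if enrolled else None


def is_release_compatible(release, agents, max_major_skew=1):
    """Check a Cloud release against the oldest connected, enrolled agent.

    Assumption: compatibility is modelled as a maximum major-version skew.
    """
    oldest = oldest_enrolled_agent(agents)
    if oldest is None:
        return True  # no enrolled agents, so nothing constrains the release
    return parse(release)[0] - oldest[0] <= max_major_skew


agents = [("13.3.8", True), ("12.4.1", False), ("14.0.0", True)]
print(oldest_enrolled_agent(agents))
print(is_release_compatible("v14.0.0-cloud.20230712T123633.aabbcc", agents))
```

Agents that are not enrolled in automatic upgrades are deliberately ignored,
mirroring the wording of the requirement.
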
Compatibility requirements affect the control plane (proxy, auth and other
services we host for our users), agents and clients:

- Control plane. We have full control over it, deploy its components in tandem
  and make sure they are always compatible with each other.
- Agents deployed by users. We always keep backwards compatibility between
  adjacent releases and rely on automatic upgrades to make sure they don't drift.
- Clients. Clients require automatic upgrades to avoid breakage on
  behavior-incompatible changes. This is [in progress](https://github.com/gravitational/cloud/issues/4880).

For self-hosted releases, the compatibility guarantees will stay the same:
https://goteleport.com/docs/faq/#version-compatibility.

### Rollbacks

Bad releases can happen even with the multi-staged rollout process. In that
situation the process will be:

- For the control plane, we will prefer rolling forward in most cases to a new
  release that either fixes the issue or reverts the offending change. The
  control plane release process described above will be fast enough to be able
  to push the new release quickly.
  - In a scenario where a major critical path is affected and a full "roll
    forward" process is not fast enough, the deployment team will manually
    perform release rollback using the same process that is used today.
- For the agents, the rollback mechanism is supported via the automatic upgrade
  system's critical/version endpoints described in [RFD 109](https://github.com/gravitational/teleport/blob/master/rfd/0109-cloud-agent-upgrades.md).
- For the clients, which don't support automatic upgrades yet, the details are
  TBD as part of the [automatic upgrades](https://github.com/gravitational/cloud/issues/4880)
  work but we will likely need to build some automatic downgrade mechanism in
  case a client was auto-upgraded but a corresponding GitHub release was pulled.

### Security releases

The existing security release process is compatible with Cloud-first deployment.

Both `-cloud.X` and regular self-hosted tags will be published from the
`teleport-private` repository and the process will be as follows:

- The Cloud control plane is upgraded using the promotion mechanism described above.
- The security release is promoted to Cloud repositories, prompting agent auto-upgrades.
- Self-hosted private releases are published using the existing process.

While the security embargo is in effect, if another Cloud release needs to be
published, it will be published from the `teleport-private` repository to include
the security fix, as in the current process.

No changes are expected in the `teleport-private` and `teleport` relationship
with regards to security release publishing in the short term.

### Prerequisites

Prerequisites for implementing the Cloud-first deployment process:

- Cloud-specific promotion workflows (control plane and full).
- Automation for release changeset sign-off via GitHub PR labels.
- Canary cluster(s).
- Automatic upgrades for `tsh` and other clients.
- Metric and error rate monitoring dashboards.