---
authors: Forrest Marshall (forrest@goteleport.com)
state: draft
---

# RFD 90 - Upgrade System

## Required Approvers

* Engineering: @klizhentas && (@zmb3 || @rosstimothy || @espadolini)
* Product: (@klizhentas || @xinding33)

## What

System for automatic upgrades of teleport installations.

## Why

Teleport must be periodically updated in order to integrate security patches. Regular updates also ensure that users can take advantage of improvements in stability and performance.

Outdated teleport installations impose additional burdens on us and on our users. Teleport does not currently assist with upgrades in any way, and the burden of manual upgrades can be prohibitive. Reducing the friction of upgrades is beneficial both in terms of security and user experience. Doing this may also indirectly lower our own support load.

Upgrades may be particularly beneficial for deployments where instances may run on infrastructure that is not directly controlled by cluster administrators (teleport cloud being a prime example).

## Intro

### Suggested Reading

While not required, it is helpful to have some familiarity with [The Update Framework](https://theupdateframework.com/) when reading this RFD. TUF is a flexible framework for securing upgrade systems. It provides a robust framework for key rotation, censorship detection, package validation, and much more.

### High-Level Goals

1. Maintain or improve the security of teleport installations by keeping them better updated and potentially providing more secure paths to upgrade.
2. Improve the experience of teleport upgrade/administration by reducing the need for manual intervention.
3. Improve the auditability of teleport clusters by providing insight into, and policy enforcement for, the versioning of teleport installations.
4. Support a wide range of uses by providing a flexible and extensible set of tools with support for things like user-provided upgrade scripts and custom target selection.
5. Provide options for a wide range of deployment contexts (bare-metal, k8s, etc).
6. Offer a simple "batteries included" automatic upgrade option that requires minimal configuration and "just works" for most non-containerized environments.

### Abstract Model Overview

This document proposes a modular system capable of supporting a wide range of upgrade strategies, with the intention being that the default or "batteries included" upgrade strategy will be implemented primarily as a set of interchangeable components which can be swapped out and/or extended.

The proposed system consists of at least the following components:

- `version-directive`: A static resource that describes the desired state of versioning across the cluster. The directive includes matchers which allow the auth server to match individual teleport instances with both the appropriate installation target and the appropriate installation method. This resource may be periodically generated by teleport or by some custom external program. It may also be manually created by an administrator. See the [version directives](#version-directives) section for details.

- `version-controller`: An optional/pluggable component responsible for generating the `version-directive` based on some dynamic state (e.g. a server which publishes package versions and hashes). A builtin `version-controller` would simply be a configuration resource from the user's perspective. A custom/external version controller would be any program with the permissions necessary to update the `version-directive` resource.
  See the [version controllers](#version-controllers) section for details.

- *version reconciliation loop*: A control loop that runs in the auth server which compares the desired state as specified by the `version-directive` with the current state of teleport installations across the cluster. When mismatches are discovered, the appropriate `installer`s are run. See the [Version Reconciliation Loop](#version-reconciliation-loop) section for details.

- `installer`: A component capable of attempting to effect installation of a specific target on one or more teleport hosts. The auth server needs to know enough to at least start the process, but the core logic of a given installer may be an external command (e.g. the proposed `local-script` installer would cause each teleport instance in need of upgrade to run a user-supplied script locally and then restart). From the perspective of a user, an installer is a teleport configuration object (though that configuration object may only be a thin hook into the "real" `installer`). Whether or not the teleport instance being upgraded needs to understand the installer will vary depending on type. See the [Installers](#installers) section for details.

There is room in the above model for a lot more granularity, but it gives us a good framework for reasoning about how state is handled within the system. Version controllers generate version directives describing what releases should be running where and how to install them. The version reconciliation loop reconciles desired state with actual state, and invokes installers as needed. Installers attempt to effect installation of the targets they are given.

### Implementation Phases

Implementation will be divided into a series of phases consisting of 1 or more separate PRs/releases each:

- Setup: Changes required for the inventory status/control model, but not necessarily specific to the upgrade system.
- Notification-Only System: Optional phase intended to deliver value sooner at the cost of overall feature development time.
- Script-Based Installs MVP: Early MVP supporting only manual control and simple script-based installers for non-auth instances.
- TUF-Based System MVP: Fully functional but minimalistic upgrade system based on TUF (still excludes auth instances).
- Stability & Polish: Additional necessary features (including auth installs and local rollbacks). Represents the point at which the core upgrade system can be considered "complete".
- Extended Feature Set: A collection of nice-to-haves that we're pretty sure folks are going to want.

See the [implementation plan](#implementation-plan) section for a detailed breakdown of the required elements for each phase.

## Details

### Usage Scenarios

Hypothetical usage scenarios that we would like to be able to support.

#### Notification-Only Usecase

Cluster administrators may or may not be using teleport's own install mechanisms, but they want teleport to be able to inform them when instances are outdated, and possibly generate alerts if teleport is running a deprecated version and/or there is a newer security patch available.

In this case we want options for both displaying the suggested version (or just a "needs upgrade" badge) on inventory views, and also probably some kind of cluster-wide alert that can be made difficult to ignore (e.g. a message on login or a banner in the web UI). We also probably want an API that will support plugins that can emit alerts to external locations (e.g. a slack channel).
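
As a rough sketch of how small that plugin surface could be, consider the following. Every name here (`VersionAlert`, `AlertSource`, etc.) is hypothetical and does not correspond to an existing teleport API; the point is only that a notification plugin needs little more than "list the active version alerts and forward them somewhere":

```go
package notifier

import "context"

// VersionAlert is a hypothetical representation of a cluster versioning alert
// (e.g. "security patch available" or "version reached end of life").
type VersionAlert struct {
	Severity string // e.g. "sec", "eol"
	Message  string
	Count    int // number of affected instances
}

// AlertSource is the minimal surface such a plugin would need from the
// cluster: a way to list currently active version alerts.
type AlertSource interface {
	GetVersionAlerts(ctx context.Context) ([]VersionAlert, error)
}

// forwardAlerts fetches the active alerts and hands each one to a sink
// (e.g. a function that posts to a slack webhook).
func forwardAlerts(ctx context.Context, src AlertSource, notify func(VersionAlert) error) error {
	alerts, err := src.GetVersionAlerts(ctx)
	if err != nil {
		return err
	}
	for _, alert := range alerts {
		if err := notify(alert); err != nil {
			return err
		}
	}
	return nil
}
```
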
In this usecase teleport is serving up recommendations based on external state, so a client capable of discovery (e.g. the `tuf` version controller) is required, but the actual ability to effect installations may not be necessary.

#### Minimal Installs Usecase

Cluster administrators manually specify the exact versions/targets for the cluster, and have a specific in-house installation script that should be used for upgrades. The teleport cluster may not even have any access to the public internet. The installation process is essentially a black box from teleport's perspective. The target may even be an internally built fork of teleport.

In this case, we want to provide the means to specify the target version and desired script. Teleport should then be able to detect when an instance is not running the target that it ought to, and invoke the install script. Teleport should not care about the internals of how the script plans to perform the installation. Instances that require upgrades run the script, and may be required to perform a graceful restart if the script succeeds. The script may expect inputs (e.g. version string), and there may be different scripts to run depending on the nature of the specific node (e.g. `prod-upgrade`/`staging-upgrade`, or `apt-upgrade`/`yum-upgrade`), but things like selecting the correct processor architecture or OS compatibility are likely handled by the script.

In this minimal usecase teleport's role is primarily that of a coordinator. It detects when and where user-provided scripts should be run, and invokes them. All integrity checks are the responsibility of the user-provided script, or the underlying install mechanism that it invokes.

#### Automatic Installs Usecase

Cluster administrators opt into a mostly "set and forget" upgrade policy which keeps their teleport cluster up to date automatically. They may wish to stay at a specific major version, but would like to have patches and minor backwards-compatible improvements come in automatically. They want features like maintenance schedules to prevent a node from being upgraded when they need it available, and automatic rollbacks where nodes revert to their previous installation version if they are unhealthy for too long. They also want teleport to be able to upgrade itself without dependencies.

This usecase requires the same coordination powers as the minimal usecase, but also a lot more. Teleport needs to be able to securely and reliably detect new releases when they become available. Teleport needs to be able to evaluate new releases in the context of flexible upgrade policies and select which releases (and which targets within those releases) are appropriate and when they should be installed. Teleport needs to be able to download and verify installation packages, upgrade itself, and monitor the health of newly installed versions.

In this maximal usecase, teleport is responsible for discovery, selection, coordination, validation, and monitoring. Most importantly, teleport must do all of this in a secure and reliable manner. The potential fallout from bugs and vulnerabilities is greater than in the minimal usecase.

#### Plan/Apply Usecase

Cluster administrators want automatic discovery of new versions and the ability to trigger automatic installs, but they want manual control over when installs happen. They may also wish for additional controls/checks such as multiparty approval for upgrades and/or the ability to perform dry runs that attempt to detect potential problems early.

The core `plan`/`apply` usecase is mostly the same as automatic installs (minus the automatic part), but the more advanced workflows require additional features. Multiparty approval and dry runs both necessitate a concept of "pending" version directives, and dry runs require that all installers expose a dry run mode of some kind.

#### Hacker Usecase

Cluster administrators march to the beat of their own drum. They want to know the latest publicly available teleport releases, but skip all prime number patches. Nodes can only be upgraded if it is low tide in their region, the moon is waxing, and the ISS is on the other side of the planet. They want to use teleport's native download and verification logic, but they also need to start the downloaded binary in a sandbox first to ensure it won't trigger their server's self-destruct. If rollback is necessary, the rollback reason and timestamp need to be steganographically encoded into a picture of a turtle and posted to instagram.

This usecase has essentially the same requirements as the automatic installs usecase, with one addition: it necessitates *loose coupling* of components.

### Security

Due to the pluggable nature of the system proposed here, it is difficult to make *general* statements about the security model. This is because most of the responsibilities of an upgrade system (package validation, censorship resistance, malicious downgrade resistance, etc) are responsibilities that fall to the pluggable components. That being said, we can lay down some specific principles:

- Version controllers should have some form of censorship detection (e.g. the TUF controller verifies that the package metadata it downloads has been recently re-signed by a hot key to prove liveness). Teleport will provide a `stale_after` field for version directives so that failure to gather new state is warned about, but additional warnings generated by the controller itself are encouraged.

- Installers must fail if they are not provided with sufficient target information to ensure that the acquired package matches the target (e.g. if installation is delegated to an external package manager that is deemed trusted this might be as simple as being explicit about the version, but in the case of the TUF installer this means rejecting target specifications that don't include all required TUF metadata).

- We should encourage decentralized trust. The TUF-based system should leverage TUF's multisignature support to ensure that compromise of a single key cannot compromise installations. We should also provide tools to help those using custom installation mechanisms to avoid single-point failures as well (e.g. multiparty approval for pending `version-directive`s), and the ability to cross-validate their sources with the TUF validation metadata pulled from our repos.

- Teleport should have checks for invalid state-transitions independently of any specific controller.

#### TUF Security

We won't be reiterating all of the attack vectors that (correct) usage of TUF is intended to protect against. I suggest at least reading the [attacks](https://theupdateframework.github.io/specification/v1.0.28/index.html#goals-to-protect-against-specific-attacks) section of the specification. Instead we will zoom in on how we intend to use TUF for our purposes.

TUF provides a very good mechanism for securely getting detailed package metadata distributed to clients, including sufficient information to verify downloaded packages, and to ensure that censorship and tampering can be detected.
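
To make the "verify downloaded packages" part concrete: whatever the distribution channel, an installer consuming this metadata should refuse any artifact whose length or digest does not match the signed target. A minimal sketch of that check follows; names like `Target` and `VerifyArtifact` are hypothetical, not an existing teleport API:

```go
package installer

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Target is a hypothetical subset of the metadata an installer receives for a
// single artifact (e.g. as distributed via signed TUF targets metadata).
type Target struct {
	Version   string
	Length    int64
	SHA256Hex string
}

// VerifyArtifact rejects a downloaded artifact unless both its length and
// SHA-256 digest match the target metadata. Installers fail closed if the
// required fields are missing.
func VerifyArtifact(artifact []byte, t Target) error {
	if t.Length == 0 || t.SHA256Hex == "" {
		return fmt.Errorf("target %q is missing length/hash metadata; refusing to install", t.Version)
	}
	if int64(len(artifact)) != t.Length {
		return fmt.Errorf("artifact length %d does not match target metadata %d", len(artifact), t.Length)
	}
	want, err := hex.DecodeString(t.SHA256Hex)
	if err != nil {
		return fmt.Errorf("malformed sha256 in target metadata: %v", err)
	}
	sum := sha256.Sum256(artifact)
	if !bytes.Equal(sum[:], want) {
		return fmt.Errorf("artifact sha256 mismatch for version %s", t.Version)
	}
	return nil
}
```

Failing closed when the metadata is incomplete lines up with the installer requirements listed in the security principles above.
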
The trick to making sure a TUF-based system really lives up to the promise of the framework is to have a good model for how the TUF metadata is generated and signed in the first place. This is where we come to the heart of our specific security model.

We will leverage TUF's thresholded signature system and Go's ability to produce deterministic builds in order to establish isolated cross-checks that can independently produce the same TUF metadata for a given release. At a minimum, we will have two separate signers:

- Build Signer: Our existing build infrastructure will be extended to generate and sign TUF metadata for all release artifacts (or at least the subset that can be built deterministically).

- Verification Signer: A separate environment isolated from the main build system will independently build all deterministic artifacts. All metadata will be independently generated and signed by this system.

With this dual system in place, we can ensure that compromised build infrastructure cannot compromise the upgrade system (and be able to detect compromises essentially immediately). If we can manage to fully isolate the two environments such that no teleport team member has access to both environments, we should be able to secure the upgrade system from any single compromise short of a direct compromise of our public repositories.

All of the above presumes that no exploits are found in TUF itself, or its official Go library, such that TUF's core checks (multisignature verification, package/metadata validation, etc) could be directly or indirectly circumvented. The TUF spec has been audited multiple times, but the most recent audit as of the time of writing was performed in 2018 and did not cover the Go implementation specifically. In order to further mitigate potential TUF-related issues, we will wrap all download and TUF metadata retrieval operations in our own custom API with required TLS authentication. TUF metadata will be used only as an additional verification check, and will not be used to discover the identity from which a package should be downloaded (i.e. malicious TUF metadata won't be able to change _where_ we download a package from). The intent here will be to ensure that a vulnerability in TUF itself cannot be exploited without also compromising the TLS client and/or our own servers directly. This means we won't be taking advantage of TUF's ability to support unauthenticated mirrors, but since we have no immediate plans to support that feature anyhow, adding this further layer of security has no meaningful downside.

### Inventory Control Model

- Auth servers exert direct control over non-auth instance upgrades via a bidirectional GRPC control stream.
- Non-auth instances advertise detailed information about the current installation, and implement handlers for control stream messages that can execute whatever local component is required for running a given install method (e.g. executing a specific script if the `local-script` installer is in use).
- Each control stream is registered with a single auth server, so each auth server is responsible for triggering the upgrade of a subset of the server inventory. In order to reduce thundering herd effects, upgrades will be rolling with some reasonable default rate.
- Upgrade decisions are level-based. Remote downgrades and retries are an emergent property of a level-based system, and won't be given special treatment.
- The auth server may skip a directive that it recognizes as resulting in an incompatible change in version (e.g. skipping a full major version).
- By default, semver pre-release installations are not upgraded (e.g. `1.2.3-debug.2`).
- In order to avoid nearly doubling the amount of backend writes for existing large clusters (most of whose instances run only the ssh service), the existing "node" resource (which would be more accurately described as the `ssh_server` resource) will be repurposed to represent a server installation which may or may not be running an ssh service. Whether or not other services would also benefit from unification in this way can be evaluated on a case-by-case basis down the road.
- In order to support having a single control stream per teleport instance (rather than separate control streams for each service) we will need to refactor how instance certs are provisioned. Currently, separate certs are granted for each service running on an instance, with no single certificate ever encoding all the permissions granted by the instance's join token.

Hypothetical GRPC spec:

```protobuf
// InventoryService is a subset of the AuthService (broken out for readability).
service InventoryService {
  // InventoryControlStream is a bidirectional stream that handles presence and
  // control messages for peripheral teleport installations.
  rpc InventoryControlStream(stream ClientMessage) returns (stream ServerMessage);
}

// ClientMessage is sent from the client to the server.
message ClientMessage {
  oneof Msg {
    // Hello is always the first message sent.
    ClientHello Hello = 1;
    // Heartbeat periodically updates status.
    Heartbeat Heartbeat = 2;
    // LocalScriptInstallResult notifies of the result of a local-script install attempt.
    LocalScriptInstallResult LocalScriptInstallResult = 3;
  }
}

// ServerMessage is sent from the server to the client.
message ServerMessage {
  oneof Msg {
    // Hello is always the first message sent.
    ServerHello Hello = 1;
    // LocalScriptInstall instructs the client to perform a local-script
    // upgrade operation.
    LocalScriptInstall LocalScriptInstall = 2;
  }
}

// ClientHello is the first message sent by the client and contains
// information about the client's version, identity, and claimed capabilities.
// The client's certificate is used to validate that it has *at least* the capabilities
// claimed by its hello message. Subsequent messages are evaluated by the limits
// claimed here.
message ClientHello {
  // Version is the currently running teleport version.
  string Version = 1;
  // ServerID is the unique ID of the server.
  string ServerID = 2;
  // Installers is a list of supported installers (e.g. `local-script`).
  repeated string Installers = 3;
  // ServerRoles is a list of teleport server roles (e.g. ``).
  repeated string ServerRoles = 4;
}

// Heartbeat periodically updates status.
message Heartbeat {
  // TODO
}

// ServerHello is the first message sent by the server.
message ServerHello {
  // Version is the currently running teleport version.
  string Version = 1;
}

// LocalScriptInstall instructs a teleport instance to perform a local-script
// installation.
message LocalScriptInstall {
  // Target is the install target metadata.
  map<string, string> Target = 1;
  // Env is the script env variables.
  map<string, string> Env = 2;
  // Shell is the optional shell override.
  string Shell = 3;
  // Script is the script to be run.
  string Script = 4;
}

// LocalScriptInstallResult informs the auth server of the result of a local-script
// installer running. This is a best-effort message since some local-script installers
// may restart the process as part of the installation.
message LocalScriptInstallResult {
  bool Success = 1;
  string Error = 2;
}
```
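
To illustrate how a non-auth instance might consume this stream, here is a rough client-side sketch. It assumes Go types generated from the draft spec above (names like `ServerMessage_LocalScriptInstall` follow the usual protoc-gen-go conventions but are assumptions, not a finalized API), and it reports results on a best-effort basis since the script may restart the process:

```go
package inventory

import (
	"context"
	"os"
	"os/exec"
)

// handleControlStream is a sketch of the client side of the hypothetical
// InventoryControlStream. It runs local-script installs as instructed and
// reports the outcome on a best-effort basis.
func handleControlStream(ctx context.Context, stream InventoryService_InventoryControlStreamClient) error {
	for {
		msg, err := stream.Recv()
		if err != nil {
			return err // stream closed or broken; caller is expected to re-establish it
		}
		install, ok := msg.Msg.(*ServerMessage_LocalScriptInstall)
		if !ok {
			continue // e.g. ServerHello; nothing to do in this sketch
		}
		shell := install.LocalScriptInstall.Shell
		if shell == "" {
			shell = "/bin/sh"
		}
		// Run the user-supplied script with the provided env.
		cmd := exec.CommandContext(ctx, shell, "-c", install.LocalScriptInstall.Script)
		cmd.Env = os.Environ()
		for k, v := range install.LocalScriptInstall.Env {
			cmd.Env = append(cmd.Env, k+"="+v)
		}
		result := &LocalScriptInstallResult{Success: true}
		if runErr := cmd.Run(); runErr != nil {
			result.Success = false
			result.Error = runErr.Error()
		}
		// Best-effort: the script may have already restarted this process.
		_ = stream.Send(&ClientMessage{Msg: &ClientMessage_LocalScriptInstallResult{LocalScriptInstallResult: result}})
	}
}
```
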
### Inventory Status and Visibility

We face some non-trivial constraints when trying to track the status and health of ongoing installations. These aren't problems per se, but they are important to keep in mind:

- Teleport instances are ephemeral and can be expected to disappear quite regularly, including mid-install. As such, we can't make a hard distinction between a node disappearing due to normal churn, and a node disappearing due to a critical issue with the install process.

- Backend state related to teleport instances is not persistent. A teleport instance should have its associated backend state cleaned up in a reasonable amount of time, and the auth server should handle instances for which no backend state exists gracefully.

- The flexible/modular nature of the upgrade system means that there is a very significant benefit to minimizing the complexity of a component's interface/contract. E.g. a `local-script` installer that just runs an arbitrary script is much easier for a user to deal with than one that must expose discrete download/run/finalize/rollback steps.

- Ordering in distributed systems is hard.

With the above in mind, let's look at some basic ideas for how to track installation state:

- Immediately before triggering a local install against a server, the auth server must update that server's corresponding backend resource with some basic info about the install attempt (time, installer, current version, target version, etc). The presence of this information does not guarantee that an install attempt was ever made (e.g. the auth server might have crashed after writing, but before sending).

- Auth servers will use CompareAndSwap operations when updating server resources to avoid overwriting concurrent updates from other auth servers. This is important because we don't want two auth servers to send install messages to the same instance in quick succession, and we also don't want to accidentally lose information related to install attempts.

- An instance *may*, but is not required to, send various status updates related to an install attempt after it has been triggered. As features are added to the upgrade system (e.g. local rollbacks), new messages with special meanings can be added to improve the reliability and safety of rollouts.

- Auth servers will make inferences based on the available information attached to server inventory resources to roughly divide them into the following states:

  - `VersionParity`: server advertises the correct version (or no version directive matches the server) and the server was not recently sent any install messages.

  - `NeedsInstall`: server advertises a different version than the one specified in its matching version directive, and no recent install attempts have been made.

  - `InstallTriggered`: install was triggered recently enough that it is unclear what the result is.

  - `RecentInstall`: server was recently sent a local install message, and is now advertising a version matching the target of that message. Whether recency in this case should be measured in time, number of heartbeats, or some combination of both is an open question, but it is likely that we'll need to tolerate some overlap where heartbeats advertising two different versions are interleaved. We should try to limit this possibility, but eliminating it completely is unreasonable.

  - `ChurnedDuringInstall`: server appears to have gone offline immediately before, during, or immediately after an installation.
    It is impossible to determine whether this was caused by the install attempt, but for a given environment there is some portion/rate of churn that, if exceeded, is likely significant.

  - `ImplicitInstallFault`: server is online but seems to have failed to install the new version for some reason. It's possible that the server never got the install message, or that it performed a full install and rollback but could not update its status for some reason.

  - `ExplicitInstallFault`: server is online and seems to have failed to install the new version for some reason, but has successfully emitted at least one error message. For a `local-script` installer this likely just means that the script had a non-zero exit code, but for a builtin installer we may have a failure message with sufficient information to be programmatically actionable (e.g. `Rollback` vs `DownloadFailed`).

- By aggregating the counts of servers in the above states by target, version, and installer, the auth servers can generate health metrics to assess the state of an ongoing rollout, potentially halting it if some threshold is reached (e.g. `max_churn`).

Hypothetical inventory view:

```
$ tctl inventory ls
Server ID                            Version Services    Status
------------------------------------ ------- ----------- -----------------------------------------------
eb115c75-692f-4d7d-814e-e6f9e4e94c01 v0.1.2  ssh,db      installing -> v1.2.3 (17s ago)
717249d1-9e31-4929-b113-4c64fa2d5005 v1.2.3  ssh,app     online (32s ago)
bbe161cb-a934-4df4-a9c5-78e18b599601 v0.1.2  ssh         churned during install -> v1.2.3 (6m ago)
5e6d98ef-e7ec-4a09-b3c5-4698b10acb9e v0.1.2  k8s         online, must install >= v1.2.2 (eol) (38s ago)
751b8b44-5f96-450d-b76a-50504aa47e1f v1.2.3  ssh         online (14s ago)
3e869f3f-8caa-4df3-aa5c-0a85e884a240 v1.2.3  db          offline (12m ago)
166dc9b9-fc85-44a0-96ca-f4bec069aa92 v1.2.1  k8s         online, must install >= v1.2.2 (sec) (12s ago)
f67dbc3a-2eff-42c8-87c2-747ee1eedb56 v1.2.1  proxy       online, install soon -> v1.2.3 (46s ago)
9db81c94-558a-4f2d-98f9-25e0d1ec0214 v1.2.2  k8s         online, install recommended -> v1.2.3 (20s ago)
5247f33a-1bd1-4227-8c6e-4464fee2c585 v1.2.3  auth        online (21s ago)
...

Warning: 1 instance(s) need upgrade due to newer security patch (sec).
Warning: 1 instance(s) need upgrade due to having reached end of life (eol).
```
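
The aggregation above relies on inferring these states from heartbeat data plus the recorded install attempts, rather than anything the instance reports explicitly. A rough sketch of what that inference might look like (all type and field names are hypothetical, and the recency window is deliberately left abstract):

```go
package inventory

// InstallState is the inferred status of an instance with respect to its
// matched target. The variants mirror the states described above.
type InstallState int

const (
	VersionParity InstallState = iota
	NeedsInstall
	InstallTriggered
	RecentInstall
	ChurnedDuringInstall
	ImplicitInstallFault
	ExplicitInstallFault
)

// serverRecord is a hypothetical merged view of the persistent backend
// resource and live control-stream status for one instance.
type serverRecord struct {
	Online          bool
	AdvertisedVer   string
	TargetVer       string // empty if no directive matches this server
	LastAttemptVer  string // target of the most recent recorded install attempt, if any
	AttemptRecent   bool   // the attempt falls within the "recency" window
	ReportedFailure bool   // the instance sent an explicit failure message
}

// classify infers an InstallState from the available evidence. The recency
// window (time, heartbeat count, or both) is left as an open question.
func classify(s serverRecord) InstallState {
	switch {
	case !s.Online && s.AttemptRecent:
		return ChurnedDuringInstall
	case s.TargetVer == "" || (s.AdvertisedVer == s.TargetVer && !s.AttemptRecent):
		return VersionParity
	case s.ReportedFailure:
		return ExplicitInstallFault
	case s.AttemptRecent && s.AdvertisedVer == s.LastAttemptVer:
		return RecentInstall
	case s.AttemptRecent:
		return InstallTriggered
	case s.LastAttemptVer != "":
		return ImplicitInstallFault
	default:
		return NeedsInstall
	}
}
```
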
Some kind of status summary should also exist for the version-control system as a whole. I'm still a bit uncertain about how this should be formatted and what all should be in it, but key points like the current versioning source, targets, and installers should be covered, as well as stats on recent installs/faults/churns:

```
$ tctl version-control status
Directive:
  Source:    tuf/default
  Status:    active
  Promotion: auto

Installers:
  Kind         Name        Status  Recent Installs Installing Faults Churned
  ------------ ----------- ------- --------------- ---------- ------ -------
  tuf          default     enabled 6               2          1      1
  local-script apt-install enabled 3               2          -      2

Inventory Summary:
  Current Version Target Version Count Recent Installs Installing Faults Churned
  --------------- -------------- ----- --------------- ---------- ------ -------
  v1.2.3          v2.3.4         12    -               4          1      3
  v2.3.4          -              10    9               -          -      -
  v3.4.5-beta.1   -              2     -               -          -      -
  v0.1.2          -              1     -               -          -      -

Critical Versioning Alerts:
  Version Alert                             Count
  ------- --------------------------------- -----
  v1.2.3  Security patch available (v2.3.4) 12
  v0.1.2  Version reached end of life       1
```

### Version Reconciliation Loop

The version reconciliation loop is a level-triggered control loop that is responsible for determining and applying state-transitions in order to make the current inventory versioning match the desired inventory versioning. Each auth server runs its own version reconciliation loop which manages the server control streams attached to that auth server.

The core job of the version reconciliation loop is fairly intuitive (compare desired state to actual state, and launch installers to correct the difference). To get a better idea of how it should work in practice, we need to look at the caveats that make it more complex:

- We need to use a rolling update strategy with a configurable rate, which means that not all servers eligible for installation will actually have installation triggered on a given iteration. The version directive may change mid-rollout, so simply blocking the loop on a given directive until it has been fully applied isn't reasonable.

- We need to monitor cluster-wide health of ongoing installations and pause installations if we see excess failures/churn, which means that aggregating information about failures is a key part of the reconciliation loop's job.

- We should avoid triggering installs against servers that recently made an install attempt (regardless of success/failure), and we should also avoid sending install messages to servers that just connected or are in the process of graceful shutdown. This means that a server's eligibility for installation is a combination of both persistent backend records and "live" control stream status.

Given the above, the reconciliation loop is a bit more complex, but still falls into three distinct phases:

1. Setup: Load cluster-level upgrade system configuration, the active `version-directive`, churn/fault stats, etc.

2. Reconciliation: Match servers to target and installer, and categorize them by their current install eligibility given recent install attempts, control stream status, etc.

3. Application: Determine the number of eligible servers that will actually be slated for install given the current target rollout rate, update their backend states with a summary of the install attempt that is about to be made (skipping servers which had their installation status concurrently updated), and pass them off to installer-specific logic.

As much as possible, we want the real "decision making" power to rest with the `version-controller` rather than the version reconciliation loop. That being said, the version reconciliation loop will have some internal rules that it evaluates to make sure that directives, as applied to the current server inventory, do not result in any invalid state-transitions (e.g. it will refuse to change the target arch for a given server, or skip a major version).
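
A skeleton of a single pass of that per-auth-server loop, organized around the three phases above, might look roughly like the following. All helpers (directive matching, rollout stats, CAS updates, installer dispatch) are hypothetical stand-ins for the behavior described in this section:

```go
package versioncontrol

import "context"

// reconcileOnce sketches one iteration of the version reconciliation loop.
func (a *AuthServer) reconcileOnce(ctx context.Context) error {
	// Phase 1: Setup. Load upgrade-system config, the active directive, and
	// current churn/fault stats.
	directive, err := a.getActiveVersionDirective(ctx)
	if err != nil || directive == nil {
		return err // nothing to do without a directive
	}
	stats := a.loadRolloutStats(ctx)
	if stats.ChurnRate() > directive.MaxChurn() {
		return nil // rollout paused until health recovers
	}

	// Phase 2: Reconciliation. Match locally connected instances to a
	// (target, installer) pair and filter down to install-eligible servers.
	var eligible []matchedServer
	for _, srv := range a.localControlStreams() {
		match, ok := directive.Match(srv)
		if !ok {
			continue // no compatible target/installer for this server
		}
		if srv.Advertises(match.Target.Version) || srv.RecentInstallAttempt() || srv.JustConnected() || srv.ShuttingDown() {
			continue
		}
		eligible = append(eligible, matchedServer{srv: srv, match: match})
	}

	// Phase 3: Application. Respect the rolling rate, record each attempt via
	// CompareAndSwap (skipping servers that were concurrently updated), then
	// hand off to installer-specific logic.
	for _, m := range capToRolloutRate(eligible, directive.RolloutRate()) {
		if err := a.recordInstallAttemptCAS(ctx, m); err != nil {
			continue // another auth server updated this server first
		}
		a.runInstaller(ctx, m)
	}
	return nil
}
```
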
### Version Directives

#### The Version Directive Resource

The `version-directive` resource is the heart of the upgrade system. It is a static resource that describes the current desired state of the cluster and how to get to that state. This is achieved through a series of matchers which are used to pair servers with installation targets and installers. At its core, a `version-directive` can be thought of as a function of the form `f(server) -> optional(target, installer)`.

Installation targets are arbitrary attribute mappings that must *at least* contain `version`, but may contain any additional information as well. Certain metadata is understood by teleport (e.g. `fips:yes|no`, `arch:amd64|arm64|...`), but additional metadata (e.g. `sha256sum:12345...`) is simply passed through to the installer.

The target to be used for a given server is the first target that *is not incompatible* (i.e. no attempt to find the "most compatible" target is made). A target is incompatible with a server if that server's version cannot safely upgrade/downgrade to the target version, *or* if the target specifies a build attribute that differs from a build attribute of the current installation (e.g. `fips:yes` when the current build is `fips:no`). We don't require that all build attributes are present since not all systems require knowledge of said attributes. It is the responsibility of an installer to fail if it is not provided with sufficient target attributes to perform the installation safely (e.g. the `tuf` installer would fail if the target passed to it did not contain the expected length and hash data).

The first compatible installer from the installer list will be selected. Compatibility will be determined *at least* by the version of the instance, as older instances may not support all installer types. How rich the compatibility checks should be here is an open question. I am wary of being too "smart" about it (per-installer selectors, pre-checking expected attributes, etc), as too much customization may result in configurations that are harder to review and more likely to silently misbehave.

Within the context of installation target matching, version compatibility for a given server is defined as any version within the inclusive range of `vN.0.0` through `vN+1.*`, where `N` is the current major version of the server. Stated another way, upgrades may keep the major version the same, or increment it by one major version. Downgrades may revert as far back as the earliest release of the current major version. Downgrades to an earlier major version are not supported.

All matchers in the `version-directive` resource are lists of matchers that are checked in sequence, with the first matching entry being selected. If a server matches a specific sub-directive, but no installation targets and/or installers in that sub-directive are compatible, that server has no defined `(target, installer)` tuple.
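
Expressed as code, the compatibility window above boils down to a simple major-version check. This is a sketch assuming versions are normalized with a leading `v` (as required by `golang.org/x/mod/semver`); the function name is hypothetical:

```go
package directive

import (
	"strconv"
	"strings"

	"golang.org/x/mod/semver"
)

// targetCompatible reports whether a server currently running `current` may
// be moved to `target` under the rule above: the target's major version must
// equal N or N+1, where N is the current major version. This also rules out
// downgrades below vN.0.0, since those would require an earlier major version.
func targetCompatible(current, target string) bool {
	if !semver.IsValid(current) || !semver.IsValid(target) {
		return false
	}
	// By default, pre-release installations (e.g. v1.2.3-debug.2) are left alone.
	if semver.Prerelease(current) != "" {
		return false
	}
	curMajor, err := strconv.Atoi(strings.TrimPrefix(semver.Major(current), "v"))
	if err != nil {
		return false
	}
	tgtMajor, err := strconv.Atoi(strings.TrimPrefix(semver.Major(target), "v"))
	if err != nil {
		return false
	}
	return tgtMajor == curMajor || tgtMajor == curMajor+1
}
```
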
Beyond matching installation targets to servers, the `version-directive` also supports some basic time constraints to assist in scheduling, and a `stale_after` field which will be used by teleport to determine if the directive is old enough to start emitting warnings about it (especially useful if directives are generated by external plugins which might otherwise fail silently).

Example `version-directive` resource:

```yaml
# version directive is a singleton resource that is either supplied by a user,
# or periodically generated by a version controller (e.g. tuf, plugin, etc).
# this represents the desired state of the cluster, and is used to guide a control
# loop that matches install targets to appropriate nodes and installers.
kind: version-directive
version: v1
metadata:
  name: version-directive
spec:
  nonce: 2
  status: enabled
  version_controller: static-config
  config_id:
  stale_after:
```