
authors: Forrest Marshall (forrest@goteleport.com)
state: draft

RFD 90 - Upgrade System

Required Approvers

  • Engineering: @klizhentas && (@zmb3 || @rosstimothy || @espadolini)
  • Product: (@klizhentas || @xinding33)

What

System for automatic upgrades of teleport installations.

Why

Teleport must be periodically updated in order to integrate security patches. Regular updates also ensure that users can take advantage of improvements in stability and performance. Outdated teleport installations impose additional burdens on us and on our users. Teleport does not currently assist with upgrades in any way, and the burden of manual upgrades can be prohibitive.

Reducing the friction of upgrades is beneficial both in terms of security and user experience. Doing this may also indirectly lower our own support load.

Automatic upgrades may be particularly beneficial for deployments where instances run on infrastructure that is not directly controlled by cluster administrators (teleport cloud being a prime example).

Intro

Suggested Reading

While not required, it is helpful to have some familiarity with The Update Framework when reading this RFD. TUF is a flexible framework for securing upgrade systems. It provides a robust framework for key rotation, censorship detection, package validation, and much more.

High-Level Goals

  1. Maintain or improve the security of teleport installations by keeping them better updated and potentially providing more secure paths to upgrade.

  2. Improve the experience of teleport upgrade/administration by reducing the need for manual intervention.

  3. Improve the auditability of teleport clusters by providing insight into, and policy enforcement for, the versioning of teleport installations.

  4. Support a wide range of uses by providing a flexible and extensible set of tools with support for things like user-provided upgrade scripts and custom target selection.

  5. Provide options for a wide range of deployment contexts (bare-metal, k8s, etc).

  6. Offer a simple "batteries included" automatic upgrade option that requires minimal configuration and "just works" for most non-containerized environments.

Abstract Model Overview

This document proposes a modular system capable of supporting a wide range of upgrade strategies, with the intention being that the default or "batteries included" upgrade strategy will be implemented primarily as a set of interchangeable components which can be swapped out and/or extended.

The proposed system consists of at least the following components:

  • version-directive: A static resource that describes the desired state of versioning across the cluster. The directive includes matchers which allow the auth server to match individual teleport instances with both the appropriate installation target, and the appropriate installation method. This resource may be periodically generated by teleport or by some custom external program. It may also be manually created by an administrator. See the version directives section for details.

  • version-controller: An optional/pluggable component responsible for generating the version-directive based on some dynamic state (e.g. a server which publishes package versions and hashes). A builtin version-controller would simply be a configuration resource from the user's perspective. A custom/external version controller would be any program with the permissions necessary to update the version-directive resource. See the version controllers section for details.

  • version reconciliation loop: A control loop that runs in the auth server which compares the desired state as specified by the version-directive with the current state of teleport installations across the cluster. When mismatches are discovered, the appropriate installers are run. See the Version Reconciliation Loop section for details.

  • installer: A component capable of attempting to effect installation of a specific target on one or more teleport hosts. The auth server needs to know enough to at least start the process, but the core logic of a given installer may be an external command (e.g. the proposed local-script installer would cause each teleport instance in need of upgrade to run a user-supplied script locally and then restart). From the perspective of a user, an installer is a teleport configuration object (though that configuration object may only be a thin hook into the "real" installer). Whether or not the teleport instance being upgraded needs to understand the installer will vary depending on type. See the Installers section for details.

There is room in the above model for a lot more granularity, but it gives us a good framework for reasoning about how state is handled within the system. Version controllers generate version directives describing what releases should be running where and how to install them. The version reconciliation loop reconciles desired state with actual state, and invokes installers as needed. Installers attempt to effect installation of the targets they are given.
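
To make these contracts more concrete, the following Go sketch shows one way the core components could fit together. All type and method names here are illustrative, not part of any existing teleport API.

package versioncontrol

import "context"

// Target is an attribute map describing an installation target. It must
// contain at least "version"; other attributes (e.g. "arch", "fips",
// "sha256") are passed through to the installer that consumes them.
type Target map[string]string

// Server is the subset of instance state needed for matching (illustrative).
type Server struct {
    ID      string
    Version string
    Labels  map[string]string
}

// Directive is the desired-state mapping described above: effectively
// f(server) -> optional(target, installer).
type Directive interface {
    Match(s Server) (t Target, installer string, ok bool)
}

// Controller periodically produces a new draft directive from some source of
// truth (a TUF repository, static configuration, an external plugin, ...).
type Controller interface {
    GenerateDirective(ctx context.Context) (Directive, error)
}

// Installer attempts to effect installation of a target on a server, e.g. by
// sending an install message down that server's control stream.
type Installer interface {
    Install(ctx context.Context, s Server, t Target) error
}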

Implementation Phases

Implementation will be divided into a series of phases consisting of 1 or more separate PRs/releases each:

  • Setup: Changes required for inventory status/control model, but not necessarily specific to the upgrade system.

  • Notification-Only System: Optional phase intended to deliver value sooner at the cost of overall feature development time.

  • Script-Based Installs MVP: Early MVP supporting only manual control and simple script-based installers for non-auth instances.

  • TUF-Based System MVP: Fully functional but minimalistic upgrade system based on TUF (still excludes auth instances).

  • Stability & Polish: Additional necessary features (including auth installs and local rollbacks). Represents the point at which the core upgrade system can be considered "complete".

  • Extended Feature Set: A collection of nice-to-haves that we're pretty sure folks are going to want.

See the implementation plan section for detailed breakdown of the required elements for each phase.

Details

Usage Scenarios

Hypothetical usage scenarios that we would like to be able to support.

Notification-Only Usecase

Cluster administrators may or may not be using teleport's own install mechanisms, but they want teleport to be able to inform them when instances are outdated, and possibly generate alerts if teleport is running a deprecated version and/or there is a newer security patch available.

In this case we want options for both displaying suggested version (or just a "needs upgrade" badge) on inventory views, and also probably some kind of cluster-wide alert that can be made difficult to ignore (e.g. a message on login or a banner in the web UI). We also probably want an API that will support plugins that can emit alerts to external locations (e.g. a slack channel).

In this usecase teleport is serving up recommendations based on external state, so a client capable of discovery (e.g. the tuf version controller) is required, but the actual ability to effect installations may not be necessary.

Minimal Installs Usecase

Cluster administrators manually specify the exact versions/targets for the cluster, and have a specific in-house installation script that should be used for upgrades. The teleport cluster may not even have any access to the public internet. The installation process is essentially a black box from teleport's perspective. The target may even be an internally built fork of teleport.

In this case, we want to provide the means to specify the target version and desired script. Teleport should then be able to detect when an instance is not running the target that it ought to, and invoke the install script. Teleport should not care about the internals of how the script plans to perform the installation. Instances that require upgrades run the script, and may be required to perform a graceful restart if the script succeeds. The script may expect inputs (e.g. version string), and there may be different scripts to run depending on the nature of the specific node (e.g. prod-upgrade, staging-upgrade or apt-upgrade, yum-upgrade), but things like selecting the correct processor architecture or OS compatibility are likely handled by the script.

In this minimal usecase teleport's role is primarily as a coordinator. It detects when and where user provided scripts should be run, and invokes them. All integrity checks are the responsibility of the user-provided script, or the underlying install mechanism that it invokes.

Automatic Installs Usecase

Cluster administrators opt into a mostly "set and forget" upgrade policy which keeps their teleport cluster up to date automatically. They may wish to stay at a specific major version, but would like to have patches and minor backwards-compatible improvements come in automatically. They want features like maintenance schedules to prevent a node from being upgraded when they need it available, and automatic rollbacks where nodes revert to their previous installation version if they are unhealthy for too long. They also want teleport to be able to upgrade itself without dependencies.

This usecase requires the same coordination powers as the minimal usecase, but also a lot more. Teleport needs to be able to securely and reliably detect new releases when they become available. Teleport needs to be able to evaluate new releases in the context of flexible upgrade policies and select which releases (and which targets within those releases) are appropriate and when they should be installed. Teleport needs to be able to download and verify installation packages, upgrade itself, and monitor the health of newly installed versions.

In this maximal usecase, teleport is responsible for discovery, selection, coordination, validation, and monitoring. Most importantly, teleport must do all of this in a secure and reliable manner. The potential fallout from bugs and vulnerabilities is greater than in the minimal usecase.

Plan/Apply Usecase

Cluster administrators want automatic discovery of new versions and the ability to trigger automatic installs, but they want manual control over when installs happen. They may also wish for additional controls/checks such as multiparty approval for upgrades and/or the ability to perform dry runs that attempt to detect potential problems early.

The core plan/apply usecase is mostly the same as automatic installs (minus the automatic part), but the more advanced workflows require additional features. Multiparty approval and dry runs both necessitate a concept of "pending" version directives, and dry runs require that all installers expose a dry run mode of some kind.

Hacker Usecase

Cluster administrators march to the beat of their own drum. They want to know the latest publicly available teleport releases, but skip all prime number patches. Nodes can only be upgraded if it is low tide in their region, the moon is waxing, and the ISS is on the other side of the planet. They want to use teleport's native download and verification logic, but they also need to start the downloaded binary in a sandbox first to ensure it won't trigger their server's self-destruct. If rollback is necessary, the rollback reason and timestamp need to be steganographically encoded into a picture of a turtle and posted to instagram.

This usecase has essentially the same requirements as the automatic installs usecase, with one addition. It necessitates loose coupling of components.

Security

Due to the pluggable nature of the system proposed here, it is difficult to make general statements about the security model. This is because most of the responsibilities of an upgrade system (package validation, censorship resistance, malicious downgrade resistance, etc) are responsibilities that fall to the pluggable components. That being said, we can lay down some specific principles:

  • Version controllers should have some form of censorship detection (e.g. the TUF controller verifies that the package metadata it downloads has been recently re-signed by a hot key to prove liveness). Teleport will provide a stale_after field for version directives so that failure to gather new state is warned about, but additional warnings generated by the controller itself are encouraged.

  • Installers must fail if they are not provided with sufficient target information to ensure that the acquired package matches the target (e.g. if installation is delegated to an external package manager that is deemed trusted this might be as simple as being explicit about version, but in the case of the TUF installer this means rejecting target specifications that don't include all required TUF metadata).

  • We should encourage decentralized trust. The TUF based system should leverage TUF's multisignature support to ensure that compromise of a single key cannot compromise installations. We should also provide tools to help those using custom installation mechanism to avoid single-point failures as well (e.g. multiparty approval for pending version-directives), and the ability to cross-validate their sources with the TUF validation metadata pulled from our repos.

  • Teleport should have checks for invalid state-transitions independently of any specific controller.

TUF Security

We won't be re-iterating all of the attack vectors that (correct) usage of TUF is intended to protect against. I suggest at least reading the attacks section of the specification. Instead we will zoom in on how we intend to use TUF for our purposes.

TUF provides a very good mechanism for securely getting detailed package metadata distributed to clients, including sufficient information to verify downloaded packages, and to ensure that censorship and tampering can be detected.

The trick to making sure a TUF-based system really lives up to the promise of the framework is to have a good model for how the TUF metadata is generated and signed in the first place. This is where we come to the heart of our specific security model. We will leverage TUF's thresholded signature system and go's ability to produce deterministic builds in order to establish isolated cross-checks that can independently produce the same TUF metadata for a given release. At a minimum, we will have two separate signers:

  • Build Signer: Our existing build infrastructure will be extended to generate and sign TUF metadata for all release artifacts (or at least the subset that can be built deterministically).

  • Verification Signer: A separate environment isolated from the main build system will independently build all deterministic artifacts. All metadata will be independently generated and signed by this system.

With this dual system in place, we can ensure that compromised build infrastructure cannot compromise the upgrade system (and be able to detect compromises essentially immediately).

If we can manage to fully isolate the two environments such that no teleport team member has access to both environments, we should be able to secure the upgrade system from any single compromise short of a direct compromise of our public repositories.

All of the above presumes that no exploits are found in TUF itself, or its official go library, such that TUF's core checks (multisignature verification, package/metadata validation, etc) could be directly or indirectly circumvented. The TUF spec has been audited multiple times, but the most recent audit as of the time of writing was performed in 2018 and did not cover the go implementation specifically.

In order to further mitigate potential TUF related issues, we will wrap all download and TUF metadata retrieval operations in our own custom API with required TLS authentication. TUF metadata will be used only as an additional verification check, and will not be used to discover the identity from which a package should be downloaded (i.e. malicious TUF metadata won't be able to change where we download a package from). The intent here will be to ensure that a vulnerability in TUF itself cannot be exploited without also compromising the TLS client and/or our own servers directly. This means we won't be taking advantage of TUF's ability to support unauthenticated mirrors, but since we have no immediate plans to support that feature anyhow, adding this further layer of security has no meaningful downside.

Inventory Control Model

  • Auth servers exert direct control over non-auth instance upgrades via bidirectional GRPC control stream.

  • Non-auth instances advertise detailed information about the current installation, and implement handlers for control stream messages that can execute whatever local component is required for running a given install method (e.g. executing a specific script if the local-script installer is in use).

  • Each control stream is registered with a single auth server, so each auth server is responsible for triggering the upgrade of a subset of the server inventory. In order to reduce thundering herd effects, upgrades will be rolling with some reasonable default rate.

  • Upgrade decisions are level-based. Remote downgrades and retries are an emergent property of a level-based system, and won't be given special treatment.

  • The auth server may skip a directive that it recognizes as resulting in an incompatible change in version (e.g. skipping a full major version).

  • By default, semver pre-release installations are not upgraded (e.g. 1.2.3-debug.2).

  • In order to avoid nearly doubling the amount of backend writes for existing large clusters (whose instances are predominantly ssh services), the existing "node" resource (which would be more accurately described as the ssh_server resource) will be repurposed to represent a server installation which may or may not be running an ssh service. Whether or not other services would also benefit from unification in this way can be evaluated on a case-by-case basis down the road.

  • In order to support having a single control stream per teleport instance (rather than separate control streams for each service) we will need to refactor how instance certs are provisioned. Currently, separate certs are granted for each service running on an instance, with no single certificate ever encoding all the permissions granted by the instance's join token.

Hypothetical GRPC spec:

// InventoryService is a subset of the AuthService (broken out for readability)
service InventoryService {
    // InventoryControlStream is a bidirectional stream that handles presence and
    // control messages for peripheral teleport installations.
    rpc InventoryControlStream(stream ClientMessage) returns (stream ServerMessage);
}


// ClientMessage is sent from the client to the server.
message ClientMessage {
    oneof Msg {
        // Hello is always the first message sent.
        ClientHello Hello = 1;
        // Heartbeat periodically updates status.
        Heartbeat Heartbeat = 2;
        // LocalScriptInstallResult notifies the auth server of the result of a local-script install attempt.
        LocalScriptInstallResult LocalScriptInstallResult = 3;
    }
}

// ServerMessage is sent from the server to the client.
message ServerMessage {
    oneof Msg {
        // Hello is always the first message sent.
        ServerHello Hello = 1;
        // LocalScriptInstall instructs the client to perform a local-script
        // upgrade operation.
        LocalScriptInstall LocalScriptInstall = 2;
    }
}

// ClientHello is the first message sent by the client and contains
// information about the client's version, identity, and claimed capabilities.
// The client's certificate is used to validate that it has *at least* the capabilities
// claimed by its hello message. Subsequent messages are evaluated by the limits
// claimed here.
message ClientHello {
    // Version is the currently running teleport version.
    string Version = 1;
    // ServerID is the unique ID of the server.
    string ServerID = 2;
    // Installers is a list of supported installers (e.g. `local-script`).
    repeated string Installers = 3;

    // ServerRoles is a list of teleport server roles.
    repeated string ServerRoles = 4; 
}

// Heartbeat periodically updates status.
message Heartbeat {
    // TODO
}

// ServerHello is the first message sent by the server. 
message ServerHello {
    // Version is the currently running teleport version.
    string Version = 1;
}



// LocalScriptInstall instructs a teleport instance to perform a local-script
// installation.
message LocalScriptInstall {
    // Target is the install target metadata.
    map<string, string> Target = 1;
    // Env is the script env variables.
    map<string, string> Env = 2;
    // Shell is the optional shell override.
    string Shell = 3;
    // Script is the script to be run.
    string Script = 4;
}


// LocalScriptInstallResult informs the auth server of the result of a local-script installer
// running. This is a best-effort message since some local-script installers may restart
// the process as part of the installation.
message LocalScriptInstallResult {
    bool Success = 1;
    string Error = 2; 
}
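
For illustration, here is a minimal, self-contained Go sketch of what the instance-side handling of a LocalScriptInstall message might look like. The function and parameter names are hypothetical; a real handler would operate on types generated from the spec above.

package main

import (
    "context"
    "fmt"
    "os"
    "os/exec"
    "time"
)

// runLocalScriptInstall runs a user-supplied install script and reports a
// best-effort result, mirroring the LocalScriptInstall / LocalScriptInstallResult
// exchange above. report may never be called if the script restarts the
// process as part of the installation.
func runLocalScriptInstall(ctx context.Context, shell, script string, env map[string]string, report func(success bool, errMsg string)) {
    if shell == "" {
        shell = "/bin/sh" // default when no shell override is provided
    }
    cmd := exec.CommandContext(ctx, shell, "-c", script)
    cmd.Env = os.Environ()
    for k, v := range env {
        // Target attributes (e.g. VERSION) are exposed to the script as env vars.
        cmd.Env = append(cmd.Env, k+"="+v)
    }
    if err := cmd.Run(); err != nil {
        report(false, err.Error())
        return
    }
    report(true, "")
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
    defer cancel()
    runLocalScriptInstall(ctx, "", `echo "installing teleport ${VERSION:?}"`,
        map[string]string{"VERSION": "1.2.3"},
        func(ok bool, errMsg string) { fmt.Println("success:", ok, errMsg) },
    )
}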

Inventory Status and Visibility

We face some non-trivial constraints when trying to track the status and health of ongoing installations. These aren't problems per se, but they are important to keep in mind:

  • Teleport instances are ephemeral and can be expected to disappear quite regularly, including mid-install. As such, we can't make a hard distinction between a node disappearing due to normal churn, and a node disappearing due to a critical issue with the install process.

  • Backend state related to teleport instances is not persistent. A teleport instance should have its associated backend state cleaned up in a reasonable amount of time, and the auth server should handle instances for which no backend state exists gracefully.

  • The flexible/modular nature of the upgrade system means that there is a very significant benefit to minimizing the complexity of a component's interface/contract. E.g. a local-script installer that just runs an arbitrary script is much easier for a user to deal with than one that must expose discrete download/run/finalize/rollback steps.

  • Ordering in distributed systems is hard.

With the above in mind, let's look at some basic ideas for how to track installation state:

  • Immediately before triggering a local install against a server, the auth server must update that server's corresponding backend resource with some basic info about the install attempt (time, installer, current version, target version, etc). The presence of this information does not guarantee that an install attempt was ever made (e.g. the auth server might have crashed after writing, but before sending).

  • Auth servers will use CompareAndSwap operations when updating server resources to avoid overwriting concurrent updates from other auth servers. This is important because we don't want two auth servers to send install messages to the same instance in quick succession, and we also don't want to accidentally lose information related to install attempts.

  • An instance may, but is not required to, send various status updates related to an install attempt after it has been triggered. As features are added into the upgrade system (e.g. local rollbacks) new messages with special meanings can be added to improve the reliability and safety of rollouts.

  • Auth servers will make inferences based on the available information attached to server inventory resources to roughly divide them into the following states:

    • VersionParity: server advertises the correct version (or no version directive matches the server) and the server was not recently sent any install messages.
    • NeedsInstall: server advertises different version than the one specified in its matching version directive, and no recent install attempts have been made.
    • InstallTriggered: install was triggered recently enough that it is unclear what the result is.
    • RecentInstall: server was recently sent a local install message, and is now advertising a version matching the target of that message. Whether recency in this case should be measured in time, number of heartbeats, or some combination of both is an open question, but it is likely that we'll need to tolerate some overlap where heartbeats advertising two different versions are interleaved. We should try to limit this possibility, but eliminating it completely is unreasonable.
    • ChurnedDuringInstall: server appears to have gone offline immediately before, during, or immediately after an installation. It is impossible to determine whether this was caused by the install attempt, but for a given environment there is some portion/rate of churn that, if exceeded, is likely significant.
    • ImplicitInstallFault: server is online but seems to have failed to install the new version for some reason. It's possible that the server never got the install message, or that it performed a full install and rollback, but could not update its status for some reason.
    • ExplicitInstallFault: server is online and seems to have failed to install the new version for some reason, but has successfully emitted at least one error message. For a local-script installer this likely just means that the script had a non-zero exit code, but for a builtin installer we may have a failure message with sufficient information to be programmatically actionable (e.g. Rollback vs DownloadFailed).
  • By aggregating the counts of servers in the above states by target, version, and installer, the auth servers can generate health metrics to assess the state of an ongoing rollout, potentially halting it if some threshold is reached (e.g. max_churn). A sketch of this aggregation follows below.
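
Here is a minimal Go sketch of that aggregation, assuming per-server states mirroring the list above and illustrative churn/fault limits expressed as fractions of attempted installs:

package main

import "fmt"

// InstallState mirrors the inferred per-server states described above.
type InstallState int

const (
    VersionParity InstallState = iota
    NeedsInstall
    InstallTriggered
    RecentInstall
    ChurnedDuringInstall
    ImplicitInstallFault
    ExplicitInstallFault
)

// shouldHaltRollout aggregates states for servers matched to a single
// (target, installer) pair and decides whether to pause further installs.
// churnLimit and faultLimit are fractions of attempted installs
// (an illustrative policy, loosely corresponding to churn_limit/fault_limit).
func shouldHaltRollout(states []InstallState, churnLimit, faultLimit float64) bool {
    var attempted, churned, faulted int
    for _, s := range states {
        switch s {
        case InstallTriggered, RecentInstall, ChurnedDuringInstall,
            ImplicitInstallFault, ExplicitInstallFault:
            attempted++
        }
        switch s {
        case ChurnedDuringInstall:
            churned++
        case ImplicitInstallFault, ExplicitInstallFault:
            faulted++
        }
    }
    if attempted == 0 {
        return false
    }
    return float64(churned)/float64(attempted) > churnLimit ||
        float64(faulted)/float64(attempted) > faultLimit
}

func main() {
    states := []InstallState{RecentInstall, RecentInstall, ChurnedDuringInstall, InstallTriggered}
    fmt.Println(shouldHaltRollout(states, 0.05, 0.10)) // true: 1/4 churned > 5%
}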

Hypothetical inventory view:

$ tctl inventory ls
Server ID                               Version    Services       Status
------------------------------------    -------    -----------    -----------------------------------------------
eb115c75-692f-4d7d-814e-e6f9e4e94c01    v0.1.2     ssh,db         installing -> v1.2.3 (17s ago)
717249d1-9e31-4929-b113-4c64fa2d5005    v1.2.3     ssh,app        online (32s ago)
bbe161cb-a934-4df4-a9c5-78e18b599601    v0.1.2     ssh            churned during install -> v1.2.3 (6m ago)
5e6d98ef-e7ec-4a09-b3c5-4698b10acb9e    v0.1.2     k8s            online, must install >= v1.2.2 (eol) (38s ago)
751b8b44-5f96-450d-b76a-50504aa47e1f    v1.2.3     ssh            online (14s ago)
3e869f3f-8caa-4df3-aa5c-0a85e884a240    v1.2.3     db             offline (12m ago)
166dc9b9-fc85-44a0-96ca-f4bec069aa92    v1.2.1     k8s            online, must install >= v1.2.2 (sec) (12s ago)
f67dbc3a-2eff-42c8-87c2-747ee1eedb56    v1.2.1     proxy          online, install soon -> v1.2.3 (46s ago)
9db81c94-558a-4f2d-98f9-25e0d1ec0214    v1.2.2     k8s            online, install recommended -> v1.2.3 (20s ago)
5247f33a-1bd1-4227-8c6e-4464fee2c585    v1.2.3     auth           online (21s ago)
...

Warning: 1 instance(s) need upgrade due to newer security patch (sec).
Warning: 1 instance(s) need upgrade due to having reached end of life (eol).

Some kind of status summary should also exist for the version-control system as a whole. I'm still a bit uncertain about how this should be formatted and what it should contain, but key points like the current versioning source, targets, and installers should be covered, as well as stats on recent installs/faults/churns:

$ tctl version-control status
Directive:
  Source: tuf/default
  Status: active
  Promotion: auto

Installers:
  Kind            Name           Status     Recent Installs    Installing    Faults    Churned
  ------------    -----------    -------    ---------------    ----------    ------    -------
  tuf             default        enabled    6                  2             1         1
  local-script    apt-install    enabled    3                  2             -         2

Inventory Summary:
  Current Version    Target Version    Count    Recent Installs    Installing    Faults    Churned
  ---------------    --------------    -----    ---------------    ----------    ------    -------
  v1.2.3             v2.3.4            12       -                  4             1         3
  v2.3.4             -                 10       9                  -             -         -
  v3.4.5-beta.1      -                 2        -                  -             -         -
  v0.1.2             -                 1        -                  -             -         -

Critical Versioning Alerts:
  Version    Alert                                Count
  -------    ---------------------------------    -----
  v1.2.3     Security patch available (v2.3.4)    12
  v0.1.2     Version reached end of life          1

Version Reconciliation Loop

The version reconciliation loop is a level-triggered control loop that is responsible for determining and applying state-transitions in order to make the current inventory versioning match the desired inventory versioning. Each auth server runs its own version reconciliation loop which manages the server control streams attached to that auth server.

The core job of the version reconciliation loop is fairly intuitive (compare desired state to actual state, and launch installers to correct the difference). To get a better idea of how it should work in practice, we need to look at the caveats that make it more complex:

  • We need to use a rolling update strategy with a configurable rate, which means that not all servers eligible for installation will actually have installation triggered on a given iteration. The version directive may change mid rollout, so simply blocking the loop on a given directive until it has been fully applied isn't reasonable.

  • We need to monitor cluster-wide health of ongoing installations and pause installations if we see excess failures/churn, which means that aggregating information about failures is a key part of the reconciliation loop's job.

  • We should avoid triggering installs against servers that recently made an install attempt (regardless of success/failure), and we should also avoid sending install messages to servers that just connected or are in the process of graceful shutdown. This means that a server's eligibility for installation is a combination of both persistent backend records, and "live" control stream status.

Given the above, the reconciliation loop is a bit more complex, but still falls into three distinct phases:

  1. Setup: load cluster-level upgrade system configuration, active version-directive, churn/fault stats, etc.

  2. Reconciliation: Match servers to target and installer, and categorize them by their current install eligibility given recent install attempts, control stream status, etc.

  3. Application: Determine the number of eligible servers that will actually be slated for install given current target rollout rate, update their backend states with a summary of the install attempt that is about to be made (skipping servers which had their installation status concurrently updated), and pass them off to installer-specific logic.

As much as possible, we want the real "decision making" power to rest with the version-controller rather than the version reconciliation loop. That being said, the version reconciliation loop will have some internal rules that it evaluates to make sure that directives, as applied to the current server inventory, do not result in any invalid state-transitions (e.g. it will refuse to change the target arch for a given server, or skip a major version).
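
Here is a rough, self-contained Go sketch of a single pass through the three phases described above. The types are toy stand-ins: the real loop would operate on control streams and backend resources, use CompareAndSwap when recording install attempts, and derive its pause decision from churn/fault stats.

package main

import "fmt"

// Minimal illustrative types; the real loop would not look like this.
type server struct {
    ID, Version       string
    RecentlyAttempted bool
}

type directive struct {
    TargetVersion string
    Installer     string
}

// match is a stand-in for full matcher evaluation: every server maps to the
// single target/installer pair in this toy directive.
func (d directive) match(s server) (version, installer string, ok bool) {
    return d.TargetVersion, d.Installer, true
}

// reconcileOnce sketches one pass of the loop: setup, reconciliation, and
// rate-limited application, returning the server IDs slated for install.
func reconcileOnce(d directive, inventory []server, rolloutPaused bool, installsPerPass int) []string {
    // Phase 1: setup is represented here by the rolloutPaused flag, which the
    // real loop would derive from churn/fault stats and cluster config.
    if rolloutPaused {
        return nil
    }
    // Phase 2: reconciliation - find servers whose version diverges from the
    // directive and which are currently eligible for an install attempt.
    var eligible []server
    for _, s := range inventory {
        target, _, ok := d.match(s)
        if !ok || s.Version == target || s.RecentlyAttempted {
            continue
        }
        eligible = append(eligible, s)
    }
    // Phase 3: application - slate only a rate-limited subset; the real loop
    // would CompareAndSwap each server resource before sending the install.
    if len(eligible) > installsPerPass {
        eligible = eligible[:installsPerPass]
    }
    ids := make([]string, 0, len(eligible))
    for _, s := range eligible {
        ids = append(ids, s.ID)
    }
    return ids
}

func main() {
    inv := []server{{"a", "1.2.2", false}, {"b", "1.2.3", false}, {"c", "1.2.1", true}}
    fmt.Println(reconcileOnce(directive{"1.2.3", "local-script"}, inv, false, 1)) // [a]
}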

Version Directives

The Version Directive Resource

The version-directive resource is the heart of the upgrade system. It is a static resource that describes the current desired state of the cluster and how to get to that state. This is achieved through a series of matchers which are used to pair servers with installation targets and installers. At its core, a version-directive can be thought of as a function of the form f(server) -> optional(target,installer).

Installation targets are arbitrary attribute mappings that must at least contain version, but may contain any additional information as well. Certain metadata is understood by teleport (e.g. fips:yes|no, arch:amd64|arm64|...), but additional metadata (e.g. sha256sum:12345...) is simply passed through to the installer. The target to be used for a given server is the first target that is not incompatible (i.e. no attempt to find the "most compatible" target is made). A target is incompatible with a server if that server's version cannot safely upgrade/downgrade to that target version, or if the target specifies a build attribute that differs from a build attribute of the current installation (e.g. fips:yes when current build is fips:no). We don't require that all build attributes are present since not all systems require knowledge of said attributes.

It is the responsibility of an installer to fail if it is not provided with sufficient target attributes to perform the installation safely (e.g. the tuf installer would fail if the target passed to it did not contain the expected length and hash data). The first compatible installer from the installer list will be selected. Compatibility will be determined at least by the version of the instance, as older instances may not support all installer types. How rich of compatibility checks we want to support here is an open question. I am wary of being too "smart" about it (per-installer selectors, pre-checking expected attributes, etc), as too much customization may result in configurations that are harder to review and more likely to silently misbehave.

Within the context of installation target matching, version compatibility for a given server is defined as any version within the inclusive range of vN.0.0 through vN+1.*, where N is the current major version of the server. Stated another way, upgrades may keep the major version the same, or increment it by one major version. Downgrades may revert as far back as the earliest release of the current major version. Downgrades to an earlier major version are not supported.
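
Here is a minimal sketch of that compatibility rule, assuming plain semver version strings (helper names are illustrative):

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// majorVersion extracts the major component from a version string such as
// "v2.3.4" or "2.3.4-rc.1".
func majorVersion(v string) (int, error) {
    v = strings.TrimPrefix(v, "v")
    parts := strings.SplitN(v, ".", 2)
    return strconv.Atoi(parts[0])
}

// targetCompatible reports whether a server currently at currentVersion may
// move to targetVersion: the target may keep the same major version or move
// up by exactly one; downgrades below the current major are not allowed.
func targetCompatible(currentVersion, targetVersion string) bool {
    cur, err := majorVersion(currentVersion)
    if err != nil {
        return false
    }
    tgt, err := majorVersion(targetVersion)
    if err != nil {
        return false
    }
    return tgt == cur || tgt == cur+1
}

func main() {
    fmt.Println(targetCompatible("v2.3.4", "v3.0.1")) // true: next major
    fmt.Println(targetCompatible("v2.3.4", "v2.0.0")) // true: downgrade within major
    fmt.Println(targetCompatible("v2.3.4", "v4.0.0")) // false: skips a major
    fmt.Println(targetCompatible("v2.3.4", "v1.9.9")) // false: earlier major
}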

All matchers in the version-directive resource are lists of matchers that are checked in sequence, with the first matching entry being selected. If a server matches a specific sub-directive, but no installation targets and/or installers in that sub-directive are compatible, that server has no defined (target,installer) tuple.

Beyond matching installation targets to servers, the version-directive also supports some basic time constraints to assist in scheduling, and a stale_after field which will be used by teleport to determine if the directive is old enough to start emitting warnings about it (especially useful if directives are generated by external plugins which might otherwise fail silently).

Example version-directive resource:

# version directive is a singleton resource that is either supplied by a user,
# or periodically generated by a version controller (e.g. tuf, plugin, etc).
# this represents the desired state of the cluster, and is used to guide a control
# loop that matches install targets to appropriate nodes and installers.
kind: version-directive
version: v1
metadata:
  name: version-directive
spec:
  nonce: 2
  status: enabled
  version_controller: static-config
  config_id: <random-value>
  stale_after: <time>
  not_before: <time>
  not_after: <time>
  directives:
    - name: Staging
      targets:
        - version: 2.3.4
          fips: yes
        - version: 2.3.4
          fips: no
      installers:
        - kind: script
          name: apt-install
      selectors:
        - labels:
            env: staging
          services: [db,ssh] # unspecified matches all services *except* auth
        - labels:
            env: testing
          services: [db,ssh]

    - name: Prod
      targets:
        - version: 1.2.3
          fips: yes
      installers:
        - kind: script
          name: apt-install
      selectors:
        - labels:
            env: prod

The above example covers the core information needed to effectively orchestrate installations, but it does not quite cover an equally pressing need: providing reliable visibility into what instances are in need of security patches and/or are running deprecated/eol versions. We cover more nuanced mechanisms for dealing with customizable notifications in later sections, but it seems important that we also provide a mechanism for establishing a very basic security/deprecation "floor" that can be baked into the version directive. Something that lets us say "warn about versions before X" regardless of the details of our specific server -> version mapping that is in effect at the moment.

Exact syntax is TBD, but something like this would be sufficient:

critical_floor:
  end_of_life: v1 # all releases earlier than v2 are EOL
  security_patches:
    - version: v2.3.4 # v2 releases prior to v2.3.4 need to be upgraded to at least v2.3.4
      desc: 'CVE-12345: teleport may become sentient and incite robot uprising'
    - version: v3.4.5 # v3 releases prior to v3.4.5 need to be upgraded to at least v3.4.5
      desc: 'prime factorization proven trivial, abandon hope all ye who enter here'
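
To illustrate how such a floor could be evaluated against an instance version, here is a hedged Go sketch following the structure of the example above (the golang.org/x/mod/semver package is used for comparisons; the type and function names are illustrative):

package main

import (
    "fmt"

    "golang.org/x/mod/semver"
)

// securityPatch pairs a minimum safe version with a description, mirroring
// the critical_floor example above.
type securityPatch struct {
    Version string // instances on this major below this version need upgrade
    Desc    string
}

// criticalFloorAlerts returns warnings for an instance version given an
// end-of-life major version cutoff (e.g. "v1") and a list of security patches.
func criticalFloorAlerts(instanceVersion, endOfLife string, patches []securityPatch) []string {
    var alerts []string
    if semver.Compare(semver.Major(instanceVersion), endOfLife) <= 0 {
        alerts = append(alerts, "version has reached end of life (eol)")
    }
    for _, p := range patches {
        if semver.Major(instanceVersion) == semver.Major(p.Version) &&
            semver.Compare(instanceVersion, p.Version) < 0 {
            alerts = append(alerts, "security patch available: "+p.Desc+" (sec)")
        }
    }
    return alerts
}

func main() {
    patches := []securityPatch{{Version: "v2.3.4", Desc: "CVE-12345"}}
    fmt.Println(criticalFloorAlerts("v2.3.1", "v1", patches))
    // [security patch available: CVE-12345 (sec)]
    fmt.Println(criticalFloorAlerts("v1.8.0", "v1", patches))
    // [version has reached end of life (eol)]
}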

Version Directive Flow

Up to this point, we've been fairly vague about what happens between a version-directive being created by the initial controller that generates it, and becoming the new desired state for the cluster. In order to reason about this intervening space, it is good to start by taking stock of what features we would like to eventually support that take effect between generating the initial directive, and final application of that directive:

  • Mapping/Plugins: Some intervening process takes a version-directive generated by the originating controller and modifies it in some way. Some examples of this might be a plugin that applies a custom filter to installation targets, or a scheduler that creates custom start/end times for the directive.

  • Plan/Apply Workflow: It is reasonable to assume that not everyone will want new installation targets to be selected automatically, and we should provide a workflow that permits previewing the new target state before applying it.

  • Multiparty Approval: Upgrading sensitive infrastructure can be a big deal. Providing an equivalent to the plan/apply workflow that also supports multiparty approval (think access requests but for changing the version directive) seems like an obvious feature that we'll want to land eventually.

  • Notifications/Recommendations: When using a plan/apply or multiparty approval workflow, being able to be notified when new versions are available seems reasonable and useful. Ideally, it should be possible to provide both the means for external plugins to generate notifications (e.g. via slack), and also for teleport's own interfaces to mark servers as being eligible for upgrade.

  • Live Modality/Selection: Not all configurations work for all scenarios. It seems reasonable that we will eventually want to support workflows that allow some concept of differing configurations or directives, either by providing the ability to have multiple distinct configurations available at the same time (e.g. plan <variant-a> vs plan <variant-b>), or to allow some form of live subselection (e.g. plan --servers env=prod).

  • Dry-Run: Similar to a plan phase, it might be nice to be able to execute dry runs of potential directives. What a dry run entails varies by installer, but "download and verify without installing" is a reasonable interpretation for local installers at least. It might even be possible to cache a package that was downloaded during a dry run and install it immediately during a normal install, though caching may open a new attack vector and would require careful thought. This is a lower priority feature, but it is useful to keep in mind so that we don't select an architecture that precludes it as a possibility.

Each of the features described above requires some amount of engineering specific to itself, but they also have an overlapping set of needs that we can use to inform the basic directive flow. We'll cover the high-level flow itself, and then examine why it meets our needs.

Directives will come in three distinct flavors: "draft", "pending", and "active". Draft and pending directives will be stored as sets keyed by <kind>/<name> and <uuid> respectively. The active directive will be the singleton directive representing the current desired state, as discussed elsewhere. This storage model will be used to enforce a specific "flow" through the following operations (a storage sketch follows the list):

  • WriteDraft: A draft is written out by its generating controller/plugin to /drafts/<kind>/<name>. By convention, kind and name are the kind and name of the controller that writes the draft. The effect of this is that subscribing to write events on the key /drafts/tuf/default is essentially equivalent to consuming a stream of the tuf/default controller's outputs. Drafts include information about when they become stale, ensuring easy detection if a controller is offline, even if it is external to teleport.

  • FreezeDraft: The latest version of the target draft is copied and stored at a random UUID. Frozen drafts are stored as an immutable sub-field within a "pending directive" object which encodes additional information that allows teleport to make decisions about the pending directive (e.g. an approval policy in the case of a multiparty approval scenario).

  • PromotePending: Target pending directive overwrites the "active" singleton, becoming the new target state of the cluster.
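
The following Go sketch shows this flow against a generic key-value store, using the key layout described above. The store type is a toy stand-in; the real implementation would use the teleport backend with compare-and-swap semantics.

package main

import (
    "crypto/rand"
    "encoding/hex"
    "fmt"
)

// store is a stand-in for the cluster backend.
type store map[string]string

// writeDraft stores the latest output of a controller/plugin under its
// conventional /drafts/<kind>/<name> key.
func (s store) writeDraft(kind, name, directive string) {
    s["/drafts/"+kind+"/"+name] = directive
}

// freezeDraft copies the current draft into an immutable pending entry keyed
// by a random ID, and returns that ID for later review or promotion.
func (s store) freezeDraft(kind, name string) (string, error) {
    draft, ok := s["/drafts/"+kind+"/"+name]
    if !ok {
        return "", fmt.Errorf("no draft at %s/%s", kind, name)
    }
    buf := make([]byte, 16)
    if _, err := rand.Read(buf); err != nil {
        return "", err
    }
    id := hex.EncodeToString(buf)
    s["/pending/"+id] = draft
    return id, nil
}

// promotePending overwrites the active singleton with the pending directive,
// making it the new desired state of the cluster.
func (s store) promotePending(id string) error {
    pending, ok := s["/pending/"+id]
    if !ok {
        return fmt.Errorf("no pending directive %q", id)
    }
    s["/active"] = pending
    return nil
}

func main() {
    s := store{}
    s.writeDraft("tuf", "default", "directive-v2")
    id, _ := s.freezeDraft("tuf", "default")
    _ = s.promotePending(id)
    fmt.Println(s["/active"]) // directive-v2
}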

With the above flow defined, we can now look at how we might implement our desired features:

  • Mapping/Plugins: Each intermediate plugin loads some upstream draft, performs its modifications, and writes them to some downstream draft. E.g. a scheduler plugin might load from drafts/tuf/default and write to drafts/scheduler/default.

  • Plan/Apply Workflow: Invoking tctl version-control plan freezes the latest draft with an associated attribute indicating that it is frozen for a plan operation. The frozen draft is used to generate a summary of changes to be displayed to the user (e.g. number of nodes that would be upgraded and to what versions). If the user likes what they see, they can run tctl version-control apply <id> to promote the pending directive. If no action is taken, the pending directive expires after a short time.

  • Multiparty Approval: Essentially the same workflow as Plan/Apply, except with tsh commands instead, possibly with slightly different wording (e.g. propose/apply), and an additional tsh version-control review command. The auth server freezes the target along with an approval policy, and waits for sufficient approvals before permitting promotion.

  • Notifications/Recommendations: Teleport and/or external plugins periodically load the latest draft directive and compare it to current cluster state. Where the draft recommends a different version, users are notified and the recommended version is displayed when listing servers.

  • Live Modality/Selection: While we want apply commands to "just work" if users only have one controller/pipeline, we can also support selecting drafts by name (e.g. tctl version-control plan foo/bar) so that users can configure their clusters to present multiple alternative drafts that can be compared and selected between.

  • Dry-Run: Invoking tctl version-control dry-run <id> marks a pending directive for dry run. Auth servers invoke installers in dry-run mode (for those that support it), and periodically embed stats about the state of the dry run (churns, faults, etc) as attributes on the pending draft object for some time period. Since dry runs still trigger installers, multiparty approval would need to define approval thresholds for invoking dry runs. As noted in the previous dry run discussion, this feature is tricky and probably of lower priority than the others on this list.

High-Level Configuration

Some configuration parameters are independent of specific controllers/installers (namely rollouts and promotion policies), and are best controlled from a central configuration object, rather than having competing configurations attached to each controller. In addition, it is desirable to provide a simple single-step operation for enabling automatic upgrades in our "batteries included" usecase. With this in mind, we will provide a top-level configuration object that can conveniently control the key parameters of the upgrade system:

kind: version-control-config
version: v1
spec:
  enabled: yes
  
  rolling_install:
    churn_limit: 5% # percent or count
    fault_limit: 10
    rate: 20%/h # <percent or count>/<h|m>
  
  promotion:
    strategy: automatic # set to 'manual' for plan/apply workflow
    from: tuf/default

  notification:
    from: tuf/latest # defaults to using the value from `promotion.from`

  # shorthand for the more verbose syntax of the version-directive resource with support
  # for wildcards in the version string. Version controllers can use these as templates
  # to build concrete actionable directives using targets from the latest matching version.
  # This is an optional feature, since any controllers we write will also support verbose
  # templates in their own config objects, but simple rules like this will likely be sufficient
  # for many usecases, and are generic enough for us to assume that all future controllers
  # should be able to support them.
  basic_directives:
    - name: Prod
      version: v1.1.* # at least major version must be specified
      server_labels:
        env: prod
    - name: Staging
      version: v1.2.*
      server_labels:
        env: staging

The above configuration object should be all a user needs to activate automatic updates (once we've implemented the tuf controller and installer). Additional controller-specific features will be accessible by using a custom configuration (e.g. tuf/my-tuf-controller), but the default should "just work" in most cases.
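
For illustration, the <percent or count>/<h|m> shorthand used by rolling_install.rate could be parsed along these lines (a sketch only; the final grammar and field names are not settled):

package main

import (
    "fmt"
    "strconv"
    "strings"
    "time"
)

// rollRate represents a parsed rolling install rate such as "20%/h" or "5/m":
// either a percentage of matched servers or an absolute count, per hour or minute.
type rollRate struct {
    Value     float64
    IsPercent bool
    Per       time.Duration
}

// parseRollRate parses the "<percent or count>/<h|m>" shorthand used by the
// hypothetical rolling_install.rate field above.
func parseRollRate(s string) (rollRate, error) {
    var r rollRate
    amount, unit, ok := strings.Cut(s, "/")
    if !ok {
        return r, fmt.Errorf("rate %q must have the form <amount>/<h|m>", s)
    }
    switch unit {
    case "h":
        r.Per = time.Hour
    case "m":
        r.Per = time.Minute
    default:
        return r, fmt.Errorf("unsupported rate unit %q", unit)
    }
    if strings.HasSuffix(amount, "%") {
        r.IsPercent = true
        amount = strings.TrimSuffix(amount, "%")
    }
    v, err := strconv.ParseFloat(amount, 64)
    if err != nil {
        return r, fmt.Errorf("invalid rate amount %q: %v", amount, err)
    }
    r.Value = v
    return r, nil
}

func main() {
    fmt.Println(parseRollRate("20%/h")) // {20 true 1h0m0s} <nil>
    fmt.Println(parseRollRate("5/m"))   // {5 false 1m0s} <nil>
}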

In the event that a mapping/plugin strategy (as described in the Version Directive Flow section) is in use, the promotion.from field should be the draft output location of the final plugin in the chain. If using the manual promotion strategy this field is optional but omitting it will cause tctl version-control plan to always require an explicit target.

Version Controllers

A version-controller is an abstract entity that periodically generates a draft version-directive. It may be a loop that runs within the auth server, an external plugin, or just a human manually creating directives as needed. A builtin controller is a control loop that runs within teleport capable of generating version directives. The only builtin controller that is currently part of the development plan is the TUF controller, though we may also introduce a simpler "notification only" controller that can't be used to trigger updates, but could be used to suggest that installations are out of date.

TUF Version Controller

The TUF version controller will be based on go-tuf and will maintain TUF client state within the teleport backend (TUF clients are stateful, since they need to support concepts like key rotation). When enabled, the TUF controller will periodically sync with a TUF repository that we maintain, discover available packages, and generate a version-directive with the necessary metadata for the tuf installer to securely verify said packages.

The details of the TUF protocol are complex enough that I won't try to reiterate them here, but the complexity is mostly in the process by which the per-package metadata is securely distributed. The output generated by the TUF controller will be very simple. In addition to standard target information (version, arch, etc), it will include a size in bytes and one or more hashes.

Custom configurations can be supplied, but in the interest of convenience a tuf/default controller will be automatically activated if referenced by the version-control-config, which will seek to fill the directive templates specified there.

Example custom configuration:

kind: version-controller
version: v1
sub_kind: tuf
metadata:
  name: my-tuf-controller
spec:
    status: enabled
    directives:
      - name: Staging
        target_selectors:
          - version: 7.*
        server_selectors:
          - labels:
              env: staging
      - name: Prod
        target_selectors:
          - version: 7.2.*
        server_selectors:
          - labels:
              env: prod
      - name: Minimum
        target_selectors:
          - version: 6.*
        server_selectors:
          - labels:
              '*': '*'

note: Generally speaking, TUF is fips compatible, but I have yet to assess what, if any, additional work may be needed to get the tuf controller working on fips teleport builds. It is possible that we may end up supporting the tuf controller on non-fips builds earlier if this process ends up being complex.

Notification-Only Install Controller

The TUF install controller is going to be a fairly substantial undertaking, with various moving parts needing to come together behind the scenes (e.g. deterministic compilation). This is why the MVP release is intended to support only manually-constructed directives and local-script installers.

It may still be desirable to provide a means of using the notification workflow before TUF has landed. We could achieve this by providing a simple "low stakes" controller that produces notification-only version directives, usable for displaying recommended versions in inventory lists, but not suitable for providing sufficient information for package validation.

An example of a notification-only install controller would be a github-releases controller, which periodically scrapes the teleport repo's releases page. While the information contained there isn't sufficient for robust package validation, it's more than sufficient for displaying a "recommended version" in an output like tctl inventory ls.

If we wanted to go with a compromise between prioritizing full TUF features and prioritizing fast delivery of notifications, we could establish a beta/preview TUF repo which did not provide any package hashes, but did serve a list of recommended install versions, including metadata indicating which versions were security releases. While this would take more time to deliver than a minimal "scraper", it would allow us the ability to spend our efforts on work that could be mostly re-used during the main TUF development phase.

Installers

An installer is a mechanism for attempting to install a target on a server or set of servers. Conceptually, installers fall into two categories:

  • Local Installers: A local installer runs on the teleport instance that needs the installation. Each local installer type needs to be supported by the instance being upgraded. From the point of view of the version reconciliation loop a local installer is a divergent function of the form f(server_control_stream, target).

  • Remote Installers: A remote installer runs on a teleport instance other than the instance(s) being updated. Remote installers need to provide a selector for the controlling host on which they need to be run. Remote installers are invoked for sets of servers and may be invoked multiple times for overlapping sets, making idempotence essential. From the point of view of the version reconciliation loop a remote installer is a function of the form f(host_control_stream, servers, target).

Different installers have different required target attributes (e.g. the tuf installer requires package size and hashes). Installers must reject any target which is missing any attribute required by that installer's security model.

Local-Script Install Controller

The local-script installer is the simplest and most flexible installer, and the first one we will be implementing. It runs the provided script on the host that is in need of upgrade, providing a basic mechanism for inserting target information (e.g. version) as env variables.

While sanity is generally the responsibility of the user for this installer, we can assist by enforcing strict limits on allowed characters for inputs/vars (e.g. ^[a-zA-Z0-9\.\-_]*$). This should be in addition to any rules we create for specific values (e.g. target.version).
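
Here is a minimal sketch of that kind of input validation, using the character set above (the function name is illustrative):

package main

import (
    "fmt"
    "regexp"
)

// safeValue matches the restricted character set proposed for values that are
// interpolated into local-script installer env vars (e.g. target.version).
var safeValue = regexp.MustCompile(`^[a-zA-Z0-9\.\-_]*$`)

// validateScriptEnv rejects any env value containing characters outside the
// allowed set, before it is ever handed to a shell.
func validateScriptEnv(env map[string]string) error {
    for k, v := range env {
        if !safeValue.MatchString(v) {
            return fmt.Errorf("env var %q contains disallowed characters", k)
        }
    }
    return nil
}

func main() {
    fmt.Println(validateScriptEnv(map[string]string{"VERSION": "1.2.3"}))           // <nil>
    fmt.Println(validateScriptEnv(map[string]string{"VERSION": "1.2.3; rm -rf /"})) // error
}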

The initial version of the local-script installer will be as bare-bones as possible:

# an installer attempts to apply an installation target to a node. this is an example
# of an installer that gets passed from the auth server to the node so that the node
# itself can run it, but some installers may run somewhere other than the node itself
# (e.g. if invoking some API that remotely upgrades teleport installs). The auth server
# uses the version-directive to determine which installers should be run for which nodes
# and with which targets.
kind: installer
sub_kind: script
version: v1
metadata:
  name: apt-install
spec:
  enabled: yes
  env:
      "VERSION": '{target.version}'
  shell: /bin/bash
  install.sh: |
    set -euo pipefail
    apt install teleport-${VERSION:?}    

Possible future improvements include:

  • Additional scripts for special operations (e.g. dry_run.sh, rollback.sh, etc).

  • Piping output into our session recording system so that install scripts can be played back (seems useful).

  • Special teleport subcommands meant to be invoked inside of install scripts (e.g. for verifying tuf metadata against an arbitrary file).

TUF Install Controller

The TUF install controller will not need to be configured by users. It will be the default install controller used whenever the TUF version-controller is active. It will download the appropriate package from get.gravitational.com and perform standard TUF verification (hash + size).
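
The verification step amounts to downloading at most the expected number of bytes, hashing them, and comparing against the target metadata before the package is put in place. A hedged sketch follows; the URL, file names, and function signature are illustrative, not the real implementation.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "net/http"
    "os"
)

// downloadAndVerify fetches a package and verifies its length and sha256
// against target metadata before the installer is allowed to use it.
func downloadAndVerify(url string, wantLen int64, wantSHA256, dst string) error {
    resp, err := http.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("unexpected status %s", resp.Status)
    }

    out, err := os.CreateTemp("", "teleport-pkg-*")
    if err != nil {
        return err
    }
    defer out.Close()

    h := sha256.New()
    // Refuse to read more than the expected length (+1 so oversized bodies are detected).
    n, err := io.Copy(io.MultiWriter(out, h), io.LimitReader(resp.Body, wantLen+1))
    if err != nil {
        return err
    }
    if n != wantLen {
        return fmt.Errorf("unexpected package length: got %d, want %d", n, wantLen)
    }
    if got := hex.EncodeToString(h.Sum(nil)); got != wantSHA256 {
        return fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantSHA256)
    }
    // Only after verification succeeds is the package moved into place.
    return os.Rename(out.Name(), dst)
}

func main() {
    // Illustrative values only; a real target would carry the length and hashes
    // produced by the TUF controller.
    err := downloadAndVerify("https://get.gravitational.com/teleport-vX.Y.Z.tar.gz",
        123456, "0000000000000000000000000000000000000000000000000000000000000000",
        "/tmp/teleport.tar.gz")
    fmt.Println(err)
}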

Since the download+verify functionality will be present in teleport anyhow, it may be useful to expose it as hidden subcommands that could be used inside of scripts, which could allow users to inject their own special logic within the normal tuf installation flow.

Remote-Script Install Controller

  • Affects installation indirectly by running a user-provided script on a pre-determined host (not the host in need of upgrade).

  • Intended as a simple means of hooking into systems such as k8s, where the teleport version is controlled via a remote API, though that does not preclude us making official remote install controllers for specific APIs down the road.

  • Details of functionality are TBD, but the basic idea will be that we will mirror the functionality of local-script wherever possible, and add an additional server selector that is used to determine where the installer should be run.

  • Q: Should the list of target servers be provided to the script? Is that even useful? It seems more likely that scripts will be written per externally managed set, though that could be a failure of imagination on my part.

TUF CI and Repository

In order to enable the TUF version controller, we will need to maintain CI that generates and signs TUF metadata, and maintain a TUF repository. Details of how the TUF repository will be hosted are still TBD, but TUF repositories are basically static files, so distribution should be fairly straightforward. We may be able to simply distribute it via a git repo.

We will leverage deterministic builds and TUF's multisignature support to harden ourselves against CI compromise. Our standard build pipeline will generate and sign one set of package hashes, and another set will be generated and signed by a separate isolated env.

TUF repositories prove liveness via periodic resigning with a "hot" key (not the keys used for package signing). This hot key should be isolated from the package signing keys, so we're likely looking at two new isolated envs that need to be added in addition to the modifications to our existing CI.

note: some initial work was done to get deterministic builds working on linux packages. We know it's possible (and might even still be working), but don't currently have test coverage for build determinism. This will be an important part of the prerequisite work to get the TUF system online. We don't need to add TUF support for all build targets at once, so we may specifically target reliable signing of amd64/linux packages first.

Rollbacks

Rollbacks will come in two flavors:

  1. Remote rollback: Version directive is changed to target an older version. Older version is installed via normal install controller. Requires the new teleport installation to work at least well enough to perform any functions required by the install controller.

  2. Local rollback: The previous teleport installation remains cached during the upgrade, and some local process monitors the health of the new version. If the new version remains unhealthy for too long, it is forcibly terminated and the previous installation is replaced.

The first option is an emergent property of the level-triggered system and will be supported from the beginning. Teleport won't bother to distinguish between an upgrade and a downgrade. No special downgrade logic is required for this option to work.

The second option will require a decent amount of specialized support and will be added later down the line. Script installers would likely need to be amended in some way to work correctly with a local rollback scheme. The details of how exactly local rollbacks should function are TBD. Some possibilities include:

  • Initially install new versions to a pending location (e.g. /usr/local/bin/teleport.pending). Have teleport automatically fork a background monitor and exec into the pending binary if it is detected on startup. If the background monitor observes that its requirements are met, it moves the pending binary to the active location, replacing the previous install (a rough sketch follows this list).

  • Formally embrace the idea of multiple concurrently installed teleport versions and provide a thin "proxy binary" that can seamlessly exec into the current target version based on some filesystem config, potentially launching a background monitor of a different version first depending on said config. This has the downside of introducing a new binary, but the upside of eliminating the need for messy move/rename schemes.

  • Fully install the new version, creating a backup of the previous version first. Rely on an external mechanism for ensuring that the monitor/revert process gets run (e.g. by registering a new systemd unit). This has the upside of probably being compatible with script-based installers without any changes (teleport could create the backup and register the unit before starting the script), but has the downside of introducing an external dependency.
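
To make the first option above a bit more concrete, the background monitor could look roughly like the sketch below. This is only a sketch under assumed paths, timings, and health semantics (all of which are TBD); it illustrates the "promote on sustained health, roll back on deadline" shape and omits the fork/exec plumbing.

package rollback

import (
    "context"
    "os"
    "time"
)

const (
    activePath  = "/usr/local/bin/teleport"         // placeholder
    pendingPath = "/usr/local/bin/teleport.pending" // placeholder
)

// monitor promotes the pending binary if the new process stays healthy for
// long enough, and removes it (so the next start execs the still-installed
// previous version) if it does not stabilize before the deadline.
func monitor(ctx context.Context, healthy func() bool) error {
    deadline := time.After(10 * time.Minute)   // placeholder rollback deadline
    ticker := time.NewTicker(15 * time.Second) // placeholder poll interval
    defer ticker.Stop()

    var healthySince time.Time
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-deadline:
            // New version never stabilized: drop the pending binary and roll back.
            return os.Remove(pendingPath)
        case <-ticker.C:
            if !healthy() {
                healthySince = time.Time{} // reset the stability window
                continue
            }
            if healthySince.IsZero() {
                healthySince = time.Now()
            }
            if time.Since(healthySince) >= 2*time.Minute {
                // Stable long enough: make the pending binary the active install.
                return os.Rename(pendingPath, activePath)
            }
        }
    }
}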

UX

Static Configuration CLI UX

Static configuration objects will be managed via tctl's normal get/create resource commands.

Enabling the version control system (notification-only):

$ cat > vcc.yaml <<EOF
kind: version-control-config
version: v1
spec:
  enabled: yes

  notification:
    alert_on:
      - security-patch
    from: github-releases/default
EOF

$ tctl create vcc.yaml

Enabling the version control system (manual upgrades):

$ cat > vcc.yaml <<EOF
kind: version-control-config
version: v1
spec:
  enabled: yes
  promotion:
    from: tuf/default

  basic_directives:
    - name: All Servers
      version: v1.2.*
      selector:
        labels:
          '*': '*'
EOF

$ tctl create vcc.yaml

Configuring a custom TUF controller:

$ cat > vc.yaml <<EOF
kind: version-controller
version: v1
sub_kind: tuf
metadata:
  name: my-tuf-controller
spec:
  status: enabled
  directives:
    - name: Staging
      target_selectors:
        - version: 7.*
      server_selectors:
        - labels:
            env: staging
    - name: Prod
      target_selectors:
        - version: 7.2.*
      server_selectors:
        - labels:
            env: prod
EOF

$ tctl create vc.yaml

Version Directive Flow CLI UX

The version directive flow will be managed via the tctl version-control family of subcommands.

Manually creating a custom directive:

$ cat > vdd.yaml <<EOF
kind: version-directive
version: v1
sub_kind: custom
metadata:
  name: my-directive
spec:
  status: enabled
  directives:
    - name: Staging
      targets:
        - version: 2.3.4
      installers:
        - kind: script
          name: apt-install
      selectors:
        - labels:
            env: staging
          services: [db,ssh]

    - name: Prod
      targets:
        - version: 1.2.3
          fips: yes
      installers:
        - kind: script
          name: apt-install
      selectors:
        - labels:
            env: prod

    - name: Minimum
      targets:
        - version: 1.2.0
          fips: no
      installers:
        - kind: script
          name: apt-install
      selectors:
        - labels:
            '*': '*'
EOF

$ tctl version-control create-draft vdd.yaml

Plan/apply workflow:

$ tctl version-control plan custom/my-draft
Directive custom/my-draft frozen with ID 'bba14536-0ad9-4b14-a071-1296d570e52e'...

Warning: Sub-directive "Staging" proposes version newer than current auth version (will not take effect until auth is upgraded).

Estimated Changes:
  Current Version    Target Version    Count    Sub-Directive
  ---------------    --------------    -----    -------------
  v1.2.3             v2.3.4            12       Staging
  v1.2.1             v1.2.3            2        Prod

Estimated Unaffected Instances: 52

help: you can run 'tctl version-control apply bba14536-0ad9-4b14-a071-1296d570e52e' to enable these changes.

$ tctl version-control apply bba14536-0ad9-4b14-a071-1296d570e52e
Successfully promoted pending directive 'bba14536-0ad9-4b14-a071-1296d570e52e'.
help: run 'tctl version-control status' to monitor rollout progress.

CLI Recommendations and Alerts UX

Recommended version info will be added to the normal server status info in user-facing interfaces (tsh inventory ls to start, but other per-server displays could also include it). Cluster-level alerts (e.g. due to a security patch becoming available or a major version reaching EOL) will be displayed on login, and could be expanded to other "frequently used" commands if need be.

Recommended version, displayed as part of status in tsh inventory ls:

$ tsh inventory ls
Server ID                               Version    Services       Status
------------------------------------    -------    -----------    -----------------------------------------------
eb115c75-692f-4d7d-814e-e6f9e4e94c01    v0.1.2     ssh,db         installing -> v1.2.3 (17s ago)
9db81c94-558a-4f2d-98f9-25e0d1ec0214    v1.2.2     k8s            online, upgrade recommended -> v1.2.3 (20s ago)
b170f8f1-e369-4e10-9a04-5fb33b8e40d5    v1.2.2     ssh            online, upgrade recommended -> v1.2.3 (45s ago)
5247f33a-1bd1-4227-8c6e-4464fee2c585    v1.2.3     auth           online
...

Alerts related to available security patches and EOL show up on login for those with sufficient permissions (exact permissions TBD, but if you have blanket read for server inventory, that should be sufficient):

$ tsh login cluster.example.com
[...]
> Profile URL:        https://cluster.example.com:3080
  Logged in as:       alice
  Cluster:            cluster.example.com
  Roles:              populist, dictator
  Logins:             alice
  Kubernetes:         disabled
  Valid until:        2022-04-05 10:20:13 +0000 UTC [valid for 12h0m0s]
  Extensions:         permit-agent-forwarding, permit-port-forwarding, permit-pty

WARNING: Cluster "cluster.example.com" contains instance(s) eligible for security patch.

Web UI Recommendations and Alerts UX

GUIs aren't really my area of expertise, and I'm not certain whether we're going to port the unified "inventory" view to the web UI, but here are some ideas that I think are good starting points:

  • An "alerts" section under the "Activity" dropdown that can list cluster-level alerts about version-control now, and possibly other related alerts as well down the road.

  • Some kind of small but visually distinct banner alert that shows up on login but can be minimized/dismissed and/or a badge on the activity dropdown indicating that alerts exist.

  • Color-coded badges for some or all of the following per-instance states:

    • upgrade available
    • eol/deprecated version
    • security update available

Hypothetical Docs

Some hypothetical documentation snippets to help us imagine how comprehensible this system will be to end users.

Quickstart

Teleport's update system uses pluggable components to make it easy to get the exact behavior you're looking for. The simplest way to get started with teleport's upgrade system is to use the builtin TUF controller and installer, based on The Update Framework.

You can enable these components like so:

$ cat > vcc.yaml <<EOF
kind: version-control-config
version: v1
spec:
  enabled: yes
  promotion:
    strategy: manual
    from: tuf/default
EOF

$ tctl create vcc.yaml

Once enabled, teleport will automatically detect new releases and draft an update plan for your cluster. You can run tctl version-control plan to preview the latest draft's effect on your cluster and run tctl version-control apply <id> to accept it if everything is to your liking. Ex:

$ tctl version-control plan
Draft tuf/default frozen with ID 'bba14536-0ad9-4b14-a071-1296d570e52e'...

Estimated Changes:
  Current Version    Target Version    Count    Sub-Directive
  ---------------    --------------    -----    -------------
  v1.2.3             v1.3.5            12       Default

Estimated Unaffected Instances: 2

help: you can run 'tctl version-control apply bba14536-0ad9-4b14-a071-1296d570e52e' to enable these changes.

$ tctl version-control apply bba14536-0ad9-4b14-a071-1296d570e52e
Successfully promoted pending directive 'bba14536-0ad9-4b14-a071-1296d570e52e'.
help: run 'tctl version-control status' to monitor rollout progress.

Note that we didn't tell teleport which version to install. By default, teleport looks for the latest release of the major version you are already on (though it will notify you if newer major versions are available). If we want to perform a major version upgrade, we need to provide explicit configuration. Explicit versions or version ranges can be specified using a basic_directive:

$ cat > vcc.yaml <<EOF
kind: version-control-config
version: v1
spec:
  enabled: yes
  promotion:
    strategy: manual
    from: tuf/default

  basic_directives:
    - name: All servers
      version: v2.3.*
      installer: tuf/default
      selector:
        labels:
          '*': '*'
        services: ['*']
EOF

$ tctl create -f vcc.yaml

You can read the above configuration as "the latest v2.3.X release should be installed on all servers, using the default install method". We specify a version matcher, installer, and instance selector (wildcard labels and services match all instances). Teleport then creates a draft proposal matching our configuration in the background.

If you run tctl version-control plan immediately after creating/updating the config, you might see an error like Draft tuf/default appears outdated (config has been changed) or Draft tuf/default has not been generated. This is normal. Teleport needs to download and verify detailed release metadata in order to generate a draft. This may take a few seconds.

Customization

  • TODO

Implementation Plan

There are a lot of individual components and features discussed in this RFD. As such, implementation will be divided into phases with multiple iterative releases consisting of subsets of the final feature set.

Inventory Status/Control Setup

This phase sees no meaningful user-facing features added, but is the building block upon which most of the rest of the features are built.

Instance-level status and control stream:

  • Refactor agent certificate logic to support advertising multiple system roles on a single cert (currently each service has its own disjoint certificate).

  • Implement per-instance bidirectional GRPC control stream capable of advertising all services running on a given teleport instance, and accepting commands directly from the controlling auth server.

Improved inventory version tracking:

  • Improve teleport's self-knowledge so that instances can heartbeat detailed build attributes (arch, target os, fips status, etc).

  • Add new server inventory resource and tctl inventory ls command for viewing all instances w/ build info and services.

With the above changes in place, we will have the ability to reliably inspect per-instance state regardless of running services, and each auth server will have a bidirectional handle to its connected nodes, allowing for real-time signaling.
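
For illustration only, the instance-level advertisement described above might carry something like the following (field names are placeholders, not the final protobuf):

package inventory

// InstanceHello is an illustrative version of the message an instance sends
// when it opens its control stream: one advertisement covering every service
// it runs plus its build attributes.
type InstanceHello struct {
    ServerID string   // single instance-level identity shared by all services
    Version  string   // e.g. "v9.0.1"
    Services []string // e.g. ["ssh", "db"]
    Arch     string   // e.g. "amd64"
    OS       string   // e.g. "linux"
    FIPS     bool     // fips build status
}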

Notification-Only System (?)

note: This step is optional, but might allow us to provide more value to users much sooner.

Implement a notification-only upgrade controller and basic version-directive resource, without any concept of an "active" directive, and with no reconciliation loop or installers. The notification-only controller would only detect the existence of new versions, without providing any of the strong censorship resistance or package validation of the TUF-based controller. Instead, its purpose would be to generate a very basic version-directive that could be used to display the recommended version for teleport instances.

In theory, the ability to display a recommended version and/or generate notifications is less "core" functionality, and could be added in a later step with less overall development effort. Once the TUF controller exists, using its output for notifications would be easy. That said, it may be more valuable to deliver a pretty good way of informing users that they ought to upgrade sooner, rather than waiting on a very robust way of automatically upgrading that happens to bring notifications along with it.

Regardless of ordering, notifications in general depend on the following components that need to be built anyway:

  • The target+server matching part of the version-directive resource (installer matching comes later).

  • The draft phase of the version directive flow.

  • The basic top-level config API (get/put/del).

  • The basic version controller configuration API (get/put/del) (only required if we want to support a controller configuration other than default).

  • The tctl version-control status command (though not all fields will be available yet).

Creating the notification-only system first will also necessitate an additional builtin version-controller that would not otherwise be needed. Luckily, it can be very simple (e.g. a github release page scraper), since it will explicitly not be usable for actual upgrades.
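
For a sense of how simple such a controller can be, the sketch below just polls the public GitHub releases API for the latest tag. The mapping from a tag to a version-directive is elided, and none of this is intended as the final design.

package notify

import (
    "context"
    "encoding/json"
    "fmt"
    "net/http"
)

// latestReleaseTag asks the public GitHub API for the most recent teleport
// release tag. Nothing here is cryptographically verified beyond TLS, which
// is why this source is only suitable for recommendations, never installs.
func latestReleaseTag(ctx context.Context) (string, error) {
    const url = "https://api.github.com/repos/gravitational/teleport/releases/latest"
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return "", err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("unexpected status: %v", resp.Status)
    }
    var release struct {
        TagName string `json:"tag_name"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&release); err != nil {
        return "", err
    }
    return release.TagName, nil // e.g. "v9.3.7"
}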

Script-Based Upgrades MVP

With the core work done for inventory status/control, we can move on to a barebones MVP/pathfinder for installers, version directives, and the version reconciliation loop. We will implement a no-frills version of these components with the goal of supporting one specific use-case: manually creating a basic version-directive resource and having a user-provided script run on all servers that don't match the directive.

This phase will be a bit of a pathfinder, with a focus on weeding out any issues with the proposed design of the core system. It will also provide an early preview for users that are interested in reducing per-instance upgrade work, but are still willing to get their hands dirty. Finally, this will mark the point after which manual upgrade of non-auth instances can (theoretically) end, as new versions that support new installers can be "bootstrapped" using older installers.

The components that must be developed for this phase are as follows:

  • Per-instance installation attempt status info.

  • The version reconciliation loop (minus more advanced features like the ability to trigger remote installers); a rough sketch follows this list.

  • The version directive resource (mostly complete already if we did the notification-only system first), and the version directive flow.

  • The local-script installer, and basic installer configuration API (get/put/del).

  • The tctl version-control plan/tctl version-control apply commands.

  • A rudimentary version of the rollout health monitoring and automatic-pause system.

  • Interactive tctl version-control setup command.
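
A heavily simplified sketch of the reconciliation loop from the list above, just to pin down its level-triggered shape. All types and helpers here are invented for illustration (including the first-match precedence and exact-version comparison); the real loop also has to handle version patterns, services/fips matching, per-instance attempt status, rollout health/pause state, and rate limiting.

package reconcile

// Instance is a simplified view of a connected teleport instance.
type Instance struct {
    ID      string
    Version string
    Labels  map[string]string
}

// SubDirective is a simplified view of one entry in a version-directive.
type SubDirective struct {
    Name      string
    Target    string            // desired version
    Installer string            // e.g. "script/apt-install" or "tuf/default"
    Selector  map[string]string // label selector ('*': '*' matches everything)
}

// reconcile matches each instance against the sub-directives in order and
// dispatches an install wherever the running version differs from the target.
func reconcile(instances []Instance, directives []SubDirective, install func(Instance, SubDirective)) {
    for _, inst := range instances {
        for _, d := range directives {
            if !matches(d.Selector, inst.Labels) {
                continue
            }
            if inst.Version != d.Target {
                install(inst, d)
            }
            break // first matching sub-directive wins for this instance
        }
    }
}

func matches(selector, labels map[string]string) bool {
    for k, want := range selector {
        if k == "*" && want == "*" {
            return true
        }
        if got, ok := labels[k]; !ok || (want != "*" && got != want) {
            return false
        }
    }
    return true
}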

TUF-Based System MVP

This phase sees the beginning of "batteries included" functionality. We will be adding the TUF-based version controller and installer, as well as setting up supporting CI and repository infrastructure. In this phase, teleport will start being able to detect and install new versions on its own (though this will still be a "preview" feature and not recommended for production).

Development in this phase will be split between core teleport changes and build/infra work. The core teleport work will be as follows:

  • Basic version controller configuration API (if not added in notification-only phase).

  • Internal TUF client implementation w/ stateful components stored in teleport backend.

  • Builtin TUF version controller (basically just a control loop that runs the client and then converts TUF package metadata to version-directive format).

  • Rudimentary TUF installer (no local rollbacks yet, so this is basically just download, validate, and replace).

  • Basic notification/version recommendations (if not added in notification-only phase).

Build system/infra work:

  • Get deterministic builds working (they might still work, since I did get them mostly functional a while back, but build determinism isn't covered by tests, so we can't currently rely on it).

  • Set up isolated automation for independently building, hashing, and signing teleport releases.

  • Add hashing + signing to existing build pipeline (different keypair).

  • Set up TUF repository with thresholded signing so that compromise of one of the two build envs does not compromise the TUF repository. TUF repositories are just static files, so this can be hosted just about anywhere, though there is some regular re-signing by a "hot" key that is used to prove liveness.

Stability & Polish

The timeline for this phase isn't linear and the individual changes aren't interdependent like in previous phases, but we're moving out of the realm of a preview/MVP feature and that means polish and stability improvements. In no particular order:

  • Officially move TUF components out of preview (good time to try our first public repo key rotation?).

  • Implement local rollbacks.

  • Extend upgrade system to support upgrading auth servers.

  • Extend TUF repository to cover more package types (deterministic docker images are theoretically possible I hear).

  • Add remote-script installer.

  • Improve upgrade visibility (e.g. create "session recordings" for local-script installers).

  • Tackle outstanding feedback & any issues that have been uncovered prior to moving to extended feature set.

Extended Feature Set

  • Multiparty approval and dry run workflows.

  • Notification plugins (e.g. slack notifications for very outdated instances).

  • Other remote installers (e.g. k8s).

Other Stuff

Anonymized Metrics

While not uniquely related to the upgrade system, we are going to start looking toward supporting opt-in collection of anonymized metrics from users. The first instance of this new feature will appear alongside the TUF-based system in the form of additional optional headers that can be appended to TUF metadata requests and can be aggregated by the TUF server.

The heart of the anonymized metrics system will be two new abstract elements to be added to cluster-level state (which configuration object they should appear in is TBD):

enabled: yes|no
random_value: <random-bytes>

If a user chooses to enable anonymized metrics for a cluster, a random value with reasonably large entropy will be generated. This will form the basis for an anonymous identifier that lets us distinguish metrics from different clusters without the identifier revealing anything about a cluster's identity. The random value can be used directly as an identifier, or as a MAC key used to hash some other value. I lean toward a scheme where the presented identifier rotates periodically (e.g. monthly). Combined with the right amount of "bucketing" of any scalar values, this should help prevent the emergence of any "long term" narratives tied to a single identifier, thereby further improving anonymization.

I am currently leaning toward using the random value to create a keyed/salted hash of the current year/month in GMT (YYYY-MM), such that each month is effectively a separate dataset with separate identifiers. This kind of scheme would both produce cleaner datasets and improve anonymity by effectively causing all clusters across the ecosystem to rotate their IDs simultaneously. I'm still thinking this through, so maybe there are issues with this particular angle, but the aforementioned properties are appealing.
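
As one concrete (and still hypothetical) construction, the presented identifier could be an HMAC of the current GMT year/month, keyed with the cluster's stored random value:

package metrics

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "time"
)

// monthlyID derives the identifier presented for the current month: the same
// cluster yields a stable ID within a month and an unlinkable ID across
// months (assuming the random value stays secret). Sketch only; the exact
// construction is still TBD.
func monthlyID(randomValue []byte, now time.Time) string {
    month := now.UTC().Format("2006-01") // YYYY-MM
    mac := hmac.New(sha256.New, randomValue)
    mac.Write([]byte(month))
    return hex.EncodeToString(mac.Sum(nil))
}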

To start with, cluster identifiers will be the only data the user is actually opting into sharing. The TUF server will already know the version of the client calling it, and whether the request comes from an open source or enterprise build. The optional cluster identifier is what transforms this information from useful per-request debug info into a meaningful metric about the state of the teleport ecosystem. By using the cluster ID to deduplicate requests, we will start to be able to make more informed guesses about the size of the teleport ecosystem and the distribution of teleport versions across it.

We will therefore end up collecting datapoints of the following format:

version: <semver>
flavor: <enum: oss|ent>
id: <random-string>

A few notes about working with this system:

  • If scalar values are added in the future (e.g. cluster size), they will need to be bucketed s.t. no one cluster identifier is unique enough to be traceable across ID rotations, or unique enough to be correlated with a given user should their cluster size (approximate or not) be shared for any reason.

  • Cluster identifiers (both the seed/salt value and the ephemeral identifier) should be treated as secrets and not emitted in any logs or in any tctl commands that don't include --with-secrets.

  • Addition of any new metrics in the future should be subject to heightened scrutiny and cynicism. A healthy dose of 'professional paranoia' is beneficial here.

Open Questions

  • It seems reasonable that folks should be able to specifically watch for security patches and have them automatically installed, or have special notifications generated just for them. It may even be good to have such a feature come as one of the pre-configured controllers alongside the default controllers (e.g. tuf/security-patches). How should we handle this? I'm currently leaning toward introducing a new build attribute that can be filtered for (e.g. version=v1.*,fips=yes,security-patch=yes), but it's possible that there are better ways to go about this (e.g. separate repos or the concept of "release channels").

  • How explicitly should local nodes require opt-in? We obviously don't want to run any installers on a node that doesn't include explicit opt-in, but should we require explicit opt-in for specific installers? (e.g. auto_upgrade: true vs permit_upgrade: ['local-script/foo', 'tuf/default'])

  • How are we going to handle auth server upgrades? Since auth servers can't do rolling upgrades, this is a lot trickier than upgrades of other kinds. We can obviously coordinate via backend state, but it's basically impossible to distinguish between shutdown and bugs or performance issues.

  • Some folks have a lot of leaf clusters. It may be burdensome to manage upgrade states for individual clusters separately. Should we consider providing some means of having root clusters coordinate the versioning of leaf clusters? At what point are they so integrated that the isolation benefits are nullified and the users would be better off with a monolithic cluster?