---
authors: Forrest Marshall (forrest@goteleport.com)
state: draft
---

# RFD 90 - Upgrade System

## Required Approvers

* Engineering: @klizhentas && (@zmb3 || @rosstimothy || @espadolini)
* Product: (@klizhentas || @xinding33)

## What

System for automatic upgrades of teleport installations.

## Why

Teleport must be periodically updated in order to integrate security patches. Regular
updates also ensure that users can take advantage of improvements in stability and
performance. Outdated teleport installations impose additional burdens on us and on
our users. Teleport does not currently assist with upgrades in any way, and the burden
of manual upgrades can be prohibitive.

Reducing the friction of upgrades is beneficial both in terms of security and user
experience. Doing this may also indirectly lower our own support load.

Upgrades may be particularly beneficial for deployments where instances may run on
infrastructure that is not directly controlled by cluster administrators (teleport cloud
being a prime example).

## Intro

### Suggested Reading

While not required, it is helpful to have some familiarity with [The Update Framework](https://theupdateframework.com/)
when reading this RFD. TUF is a flexible framework for securing upgrade systems. It provides a robust framework
for key rotation, censorship detection, package validation, and much more.

### High-Level Goals

1. Maintain or improve the security of teleport installations by keeping them better
   updated and potentially providing more secure paths to upgrade.

2. Improve the experience of teleport upgrade/administration by reducing the need for
   manual intervention.

3. Improve the auditability of teleport clusters by providing insight into, and policy
   enforcement for, the versioning of teleport installations.

4. Support a wide range of uses by providing a flexible and extensible set of tools
   with support for things like user-provided upgrade scripts and custom target selection.

5. Provide options for a wide range of deployment contexts (bare-metal, k8s, etc).

6. Offer a simple "batteries included" automatic upgrade option that requires minimal
   configuration and "just works" for most non-containerized environments.

### Abstract Model Overview

This document proposes a modular system capable of supporting a wide range
of upgrade strategies, with the intention being that the default or "batteries
included" upgrade strategy will be implemented primarily as a set of interchangeable
components which can be swapped out and/or extended.

The proposed system consists of at least the following components:

- `version-directive`: A static resource that describes the desired state of
  versioning across the cluster. The directive includes matchers which allow
  the auth server to match individual teleport instances with both the appropriate
  installation target, and the appropriate installation method. This resource may
  be periodically generated by teleport or by some custom external program. It may
  also be manually created by an administrator. See the [version directives](#version-directives)
  section for details.

- `version-controller`: An optional/pluggable component responsible for generating the
  `version-directive` based on some dynamic state (e.g. a server which publishes
  package versions and hashes). A builtin `version-controller` would simply be a
  configuration resource from the user's perspective. A custom/external version
  controller would be any program with the permissions necessary to update the
  `version-directive` resource. See the [version controllers](#version-controllers)
  section for details.

- *version reconciliation loop*: A control loop that runs in the auth server which
  compares the desired state as specified by the `version-directive` with the current
  state of teleport installations across the cluster. When mismatches are discovered,
  the appropriate `installer`s are run. See the [Version Reconciliation Loop](#version-reconciliation-loop)
  section for details.

- `installer`: A component capable of attempting to effect installation
  of a specific target on one or more teleport hosts. The auth server needs to know
  enough to at least start the process, but the core logic of a given installer
  may be an external command (e.g. the proposed `local-script` installer would cause
  each teleport instance in need of upgrade to run a user-supplied script locally and
  then restart). From the perspective of a user, an installer is a teleport
  configuration object (though that configuration object may only be a thin hook into
  the "real" `installer`). Whether or not the teleport instance being upgraded
  needs to understand the installer will vary depending on type. See the [Installers](#installers)
  section for details.

There is room in the above model for a lot more granularity, but it gives us a
good framework for reasoning about how state is handled within the system. Version
controllers generate version directives describing what releases should be running where
and how to install them. The version control loop reconciles desired state with actual
state, and invokes installers as needed. Installers attempt to effect installation of
the targets they are given.

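As a rough illustration only, the relationships between these components might be sketched as the following Go interfaces (all names here are hypothetical, not actual teleport APIs):

```go
// Hypothetical sketch of how the components described above might fit together;
// none of these names are actual teleport APIs.
package versioncontrol

// Target is an installation target: an arbitrary attribute mapping that must
// at least contain a version (e.g. {"version": "1.2.3", "fips": "no"}).
type Target map[string]string

// Server is a minimal view of a teleport instance as seen by the auth server.
type Server struct {
	ID      string
	Version string
	Labels  map[string]string
}

// Installer attempts to effect installation of a target on a server (for a
// local installer this amounts to sending a message down the instance's
// control stream).
type Installer interface {
	Install(server Server, target Target) error
}

// Directive is the desired state: f(server) -> optional(target, installer).
type Directive interface {
	Match(server Server) (target Target, installer Installer, ok bool)
}

// Controller periodically generates a draft directive from some dynamic state
// (e.g. a TUF repository, or a static user-supplied config).
type Controller interface {
	GenerateDirective() (Directive, error)
}
```
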
### Implementation Phases

Implementation will be divided into a series of phases consisting of 1 or more
separate PRs/releases each:

- Setup: Changes required for inventory status/control model, but not necessarily
  specific to the upgrade system.

- Notification-Only System: Optional phase intended to deliver value sooner
  at the cost of overall feature development time.

- Script-Based Installs MVP: Early MVP supporting only manual control and
  simple script-based installers for non-auth instances.

- TUF-Based System MVP: Fully functional but minimalistic upgrade system
  based on TUF (still excludes auth instances).

- Stability & Polish: Additional necessary features (including auth installs
  and local rollbacks). Represents the point at which the core upgrade system
  can be considered "complete".

- Extended Feature Set: A collection of nice-to-haves that we're pretty sure
  folks are going to want.

See the [implementation plan](#implementation-plan) section for a detailed breakdown
of the required elements for each phase.

## Details

### Usage Scenarios

Hypothetical usage scenarios that we would like to be able to support.

#### Notification-Only Usecase

Cluster administrators may or may not be using teleport's own install mechanisms, but they
want teleport to be able to inform them when instances are outdated, and possibly generate
alerts if teleport is running a deprecated version and/or there is a newer security patch
available.

In this case we want options for both displaying a suggested version (or just a "needs upgrade"
badge) on inventory views, and also probably some kind of cluster-wide alert that can be
made difficult to ignore (e.g. a message on login or a banner in the web UI). We also
probably want an API that will support plugins that can emit alerts to external locations
(e.g. a slack channel).

In this usecase teleport is serving up recommendations based on external state, so a client
capable of discovery (e.g. the `tuf` version controller) is required, but the actual
ability to effect installations may not be necessary.

#### Minimal Installs Usecase

Cluster administrators manually specify the exact versions/targets for the cluster, and
have a specific in-house installation script that should be used for upgrades. The teleport cluster
may not even have any access to the public internet. The installation process is essentially a
black box from teleport's perspective. The target may even be an internally built
fork of teleport.

In this case, we want to provide the means to specify the target version and desired script. Teleport
should then be able to detect when an instance is not running the target that it ought to, and
invoke the install script. Teleport should not care about the internals of how the script plans to
perform the installation. Instances that require upgrades run the script, and may be required to perform
a graceful restart if the script succeeds. The script may expect inputs (e.g. version string), and
there may be different scripts to run depending on the nature of the specific node (e.g. `prod-upgrade`, `staging-upgrade`
or `apt-upgrade`, `yum-upgrade`), but things like selecting the correct processor architecture or OS
compatibility are likely handled by the script.

In this minimal usecase teleport's role is primarily as a coordinator. It detects when and where
user-provided scripts should be run, and invokes them. All integrity checks are the responsibility
of the user-provided script, or the underlying install mechanism that it invokes.

#### Automatic Installs Usecase

Cluster administrators opt into a mostly "set and forget" upgrade policy which keeps their teleport
cluster up to date automatically. They may wish to stay at a specific major version, but would like
to have patches and minor backwards-compatible improvements come in automatically. They want features
like maintenance schedules to prevent a node from being upgraded when they need it available, and automatic
rollbacks where nodes revert to their previous installation version if they are unhealthy for too long.
They also want teleport to be able to upgrade itself without dependencies.

This usecase requires the same coordination powers as the minimal usecase, but also a lot more. Teleport
needs to be able to securely and reliably detect new releases when they become available. Teleport needs
to be able to evaluate new releases in the context of flexible upgrade policies and select which releases
(and which targets within those releases) are appropriate and when they should be installed. Teleport needs
to be able to download and verify installation packages, upgrade itself, and monitor the health of newly
installed versions.

In this maximal usecase, teleport is responsible for discovery, selection, coordination, validation, and
monitoring. Most importantly, teleport must do all of this in a secure and reliable manner. The potential
fallout from bugs and vulnerabilities is greater than in the minimal usecase.

#### Plan/Apply Usecase

Cluster administrators want automatic discovery of new versions and the ability to trigger automatic installs,
but they want manual control over when installs happen. They may also wish for additional controls/checks such
as multiparty approval for upgrades and/or the ability to perform dry runs that attempt to detect potential
problems early.

The core `plan`/`apply` usecase is mostly the same as automatic installs (minus the automatic part), but
the more advanced workflows require additional features. Multiparty approval and dry runs both necessitate
a concept of "pending" version directives, and dry runs require that all installers expose a dry run mode
of some kind.

#### Hacker Usecase

Cluster administrators march to the beat of their own drum. They want to know the latest publicly available
teleport releases, but skip all prime number patches. Nodes can only be upgraded if it is low
tide in their region, the moon is waxing, and the ISS is on the other side of the planet. They want to use
teleport's native download and verification logic, but they also need to start the downloaded binary in
a sandbox first to ensure it won't trigger their server's self-destruct. If rollback is necessary, the rollback
reason and timestamp need to be steganographically encoded into a picture of a turtle and posted to instagram.

This usecase has essentially the same requirements as the automatic installs usecase, with one addition. It necessitates
*loose coupling* of components.

### Security

Due to the pluggable nature of the system proposed here, it is difficult to make *general* statements about
the security model. This is because most of the responsibilities of an upgrade system (package validation,
censorship resistance, malicious downgrade resistance, etc) are responsibilities that fall to the pluggable
components. That being said, we can lay down some specific principles:

- Version controllers should have some form of censorship detection (e.g. the TUF controller verifies that
  the package metadata it downloads has been recently re-signed by a hot key to prove liveness). Teleport will
  provide a `stale_after` field for version directives so that failure to gather new state is warned about, but
  additional warnings generated by the controller itself are encouraged.

- Installers must fail if they are not provided with sufficient target information to ensure that the acquired
  package matches the target (e.g. if installation is delegated to an external package manager that is deemed trusted
  this might be as simple as being explicit about version, but in the case of the TUF installer this means rejecting
  target specifications that don't include all required TUF metadata).

- We should encourage decentralized trust. The TUF-based system should leverage TUF's multisignature support
  to ensure that compromise of a single key cannot compromise installations. We should also provide tools
  to help those using custom installation mechanisms to avoid single-point failures as well (e.g. multiparty approval
  for pending `version-directive`s), and the ability to cross-validate their sources with the TUF validation
  metadata pulled from our repos.

- Teleport should have checks for invalid state-transitions independently of any specific controller.

#### TUF Security

We won't be re-iterating all of the attack vectors that (correct) usage of TUF is intended to protect against.
I suggest at least reading the [attacks](https://theupdateframework.github.io/specification/v1.0.28/index.html#goals-to-protect-against-specific-attacks)
section of the specification. Instead we will zoom in on how we intend to use TUF for our purposes.

TUF provides a very good mechanism for securely getting detailed package metadata distributed to clients,
including sufficient information to verify downloaded packages, and to ensure that censorship and
tampering can be detected.

The trick to making sure a TUF-based system really lives up to the promise of the framework is to have
a good model for how the TUF metadata is generated and signed in the first place. This is where we come
to the heart of our specific security model. We will leverage TUF's thresholded signature system and
go's ability to produce deterministic builds in order to establish isolated cross-checks that can
independently produce the same TUF metadata for a given release. At a minimum, we will have two separate
signers:

- Build Signer: Our existing build infrastructure will be extended to generate and sign TUF metadata for
  all release artifacts (or at least the subset that can be built deterministically).

- Verification Signer: A separate environment isolated from the main build system will independently build
  all deterministic artifacts. All metadata will be independently generated and signed by this system.

With this dual system in place, we can ensure that compromised build infrastructure cannot compromise the
upgrade system (and be able to detect compromises essentially immediately).

If we can manage to fully isolate the two environments such that no teleport team member has access to both
environments, we should be able to secure the upgrade system from any single compromise short of a direct
compromise of our public repositories.

All of the above presumes that no exploits are found in TUF itself, or its official go library, such that TUF's core
checks (multisignature verification, package/metadata validation, etc) could be directly or indirectly circumvented.
The TUF spec has been audited multiple times, but the most recent audit as of the time of writing was performed in
2018 and did not cover the go implementation specifically.

In order to further mitigate potential TUF-related issues, we will wrap all download and TUF metadata retrieval
operations in our own custom API with required TLS authentication. TUF metadata will be used only as an additional
verification check, and will not be used to discover the identity from which a package should be downloaded
(i.e. malicious TUF metadata won't be able to change _where_ we download a package from).
The intent here will be to ensure that a vulnerability in TUF itself cannot be exploited without also
compromising the TLS client and/or our own servers directly. This means we won't be taking advantage of TUF's
ability to support unauthenticated mirrors, but since we have no immediate plans to support that feature anyhow,
adding this further layer of security has no meaningful downside.

### Inventory Control Model

- Auth servers exert direct control over non-auth instance upgrades via a bidirectional
  GRPC control stream.

- Non-auth instances advertise detailed information about the current installation, and
  implement handlers for control stream messages that can execute whatever local component
  is required for running a given install method (e.g. executing a specific script if the
  `local-script` installer is in use).

- Each control stream is registered with a single auth server, so each auth server is
  responsible for triggering the upgrade of a subset of the server inventory. In order
  to reduce thundering herd effects, upgrades will be rolling with some reasonable default
  rate.

- Upgrade decisions are level-based. Remote downgrades and retries are an emergent
  property of a level-based system, and won't be given special treatment.

- The auth server may skip a directive that it recognizes as resulting in an incompatible
  change in version (e.g. skipping a full major version).

- By default, semver pre-release installations are not upgraded (e.g. `1.2.3-debug.2`).

- In order to avoid nearly doubling the amount of backend writes for existing large
  clusters (all of whose instances are predominantly ssh services), the existing "node"
  resource (which would be more accurately described as the `ssh_server` resource) will
  be repurposed to represent a server installation which may or may not be running an ssh
  service. Whether or not other services would also benefit from unification in this way
  can be evaluated on a case-by-case basis down the road.

- In order to support having a single control stream per teleport instance (rather than
  separate control streams for each service) we will need to refactor how instance
  certs are provisioned. Currently, separate certs are granted for each service running
  on an instance, with no single certificate ever encoding all the permissions granted by the
  instance's join token.

Hypothetical GRPC spec:

```protobuf
// InventoryService is a subset of the AuthService (broken out for readability)
service InventoryService {
    // InventoryControlStream is a bidirectional stream that handles presence and
    // control messages for peripheral teleport installations.
    rpc InventoryControlStream(stream ClientMessage) returns (stream ServerMessage);
}


// ClientMessage is sent from the client to the server.
message ClientMessage {
    oneof Msg {
        // Hello is always the first message sent.
        ClientHello Hello = 1;
        // Heartbeat periodically updates status.
        Heartbeat Heartbeat = 2;
        // LocalScriptInstallResult notifies of installation failures.
        LocalScriptInstallResult LocalScriptInstallResult = 3;
    }
}

// ServerMessage is sent from the server to the client.
message ServerMessage {
    oneof Msg {
        // Hello is always the first message sent.
        ServerHello Hello = 1;
        // LocalScriptInstall instructs the client to perform a local-script
        // upgrade operation.
        LocalScriptInstall LocalScriptInstall = 2;
    }
}

// ClientHello is the first message sent by the client and contains
// information about the client's version, identity, and claimed capabilities.
// The client's certificate is used to validate that it has *at least* the capabilities
// claimed by its hello message. Subsequent messages are evaluated by the limits
// claimed here.
message ClientHello {
    // Version is the currently running teleport version.
    string Version = 1;
    // ServerID is the unique ID of the server.
    string ServerID = 2;
    // Installers is a list of supported installers (e.g. `local-script`).
    repeated string Installers = 3;

    // ServerRoles is a list of teleport server roles (e.g. ``).
    repeated string ServerRoles = 4;
}

// Heartbeat periodically updates status.
message Heartbeat {
    // TODO
}

// ServerHello is the first message sent by the server.
message ServerHello {
    // Version is the currently running teleport version.
    string Version = 1;
}


// LocalScriptInstall instructs a teleport instance to perform a local-script
// installation.
message LocalScriptInstall {
    // Target is the install target metadata.
    map<string, string> Target = 1;
    // Env is the script env variables.
    map<string, string> Env = 2;
    // Shell is the optional shell override.
    string Shell = 3;
    // Script is the script to be run.
    string Script = 4;
}


// LocalScriptInstallResult informs the auth server of the result of a local-script installer
// running. This is a best-effort message since some local-script installers may restart
// the process as part of the installation.
message LocalScriptInstallResult {
    bool Success = 1;
    string Error = 2;
}
```

### Inventory Status and Visibility

We face some non-trivial constraints when trying to track the status and health of ongoing
installations. These aren't problems per se, but they are important to keep in mind:

- Teleport instances are ephemeral and can be expected to disappear quite regularly, including
  mid-install. As such, we can't make a hard distinction between a node disappearing due to normal
  churn, and a node disappearing due to a critical issue with the install process.

- Backend state related to teleport instances is not persistent. A teleport instance should have its
  associated backend state cleaned up in a reasonable amount of time, and the auth server should
  handle instances for which no backend state exists gracefully.

- The flexible/modular nature of the upgrade system means that there is a very significant benefit
  to minimizing the complexity of a component's interface/contract. E.g. a `local-script`
  installer that just runs an arbitrary script is much easier for a user to deal with than one that
  must expose discrete download/run/finalize/rollback steps.

- Ordering in distributed systems is hard.

With the above in mind, let's look at some basic ideas for how to track installation state:

- Immediately before triggering a local install against a server, the auth server must update
  that server's corresponding backend resource with some basic info about the install attempt
  (time, installer, current version, target version, etc). The presence of this information
  does not guarantee that an install attempt was ever made (e.g. the auth server might have crashed
  after writing, but before sending).

- Auth servers will use CompareAndSwap operations when updating server resources to avoid
  overwriting concurrent updates from other auth servers. This is important because we don't
  want two auth servers to send install messages to the same instance in quick succession, and
  we also don't want to accidentally lose information related to install attempts.

- An instance *may*, but is not required to, send various status updates related to an install
  attempt after it has been triggered. As features are added into the upgrade system (e.g. local rollbacks)
  new messages with special meanings can be added to improve the reliability and safety of rollouts.

- Auth servers will make inferences based on the available information attached to server inventory
  resources to roughly divide them into the following states:
  - `VersionParity`: server advertises the correct version (or no version directive matches the server) and
    the server was not recently sent any install messages.
  - `NeedsInstall`: server advertises a different version than the one specified in its matching version directive,
    and no recent install attempts have been made.
  - `InstallTriggered`: install was triggered recently enough that it is unclear what the result is.
  - `RecentInstall`: server has recently been sent a local install message, and is now advertising a version matching
    the target of that message. Whether recency in this case should be measured in time, number of heartbeats, or
    some combination of both is an open question, but it is likely that we'll need to tolerate some overlap where
    heartbeats advertising two different versions are interleaved. We should try to limit this possibility, but
    eliminating it completely is unreasonable.
  - `ChurnedDuringInstall`: server appears to have gone offline immediately before, during, or immediately
    after an installation. It is impossible to determine whether this was caused by the
    install attempt, but for a given environment there is some portion/rate of churn that, if exceeded, is likely
    significant.
  - `ImplicitInstallFault`: server is online but seems to have failed to install the new version for some
    reason. It's possible that the server never got the install message, or that it performed a full install
    and rollback, but could not update its status for some reason.
  - `ExplicitInstallFault`: server is online and seems to have failed to install the new version for some
    reason, but has successfully emitted at least one error message. For a `local-script` installer this likely
    just means that the script had a non-zero exit code, but for a builtin installer we may have a failure
    message with sufficient information to be programmatically actionable (e.g. `Rollback` vs `DownloadFailed`).

- By aggregating the counts of servers in the above states by target, version, and installer the auth servers
  can generate health metrics to assess the state of an ongoing rollout, potentially halting it if some threshold
  is reached (e.g. `max_churn`).

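As a rough illustration, the state inference described above might look something like the following Go sketch (types, field names, and cutoffs are hypothetical, not actual teleport code):

```go
// Hypothetical sketch of inferring an install state from the information
// available on a server's backend resource and control stream.
package inventory

import "time"

type InstallState int

const (
	VersionParity InstallState = iota
	NeedsInstall
	InstallTriggered
	RecentInstall
	ChurnedDuringInstall
	ImplicitInstallFault
	ExplicitInstallFault
)

// ServerStatus summarizes what the auth server knows about one instance.
type ServerStatus struct {
	Online           bool
	AdvertisedVer    string
	TargetVer        string    // empty if no directive matches this server
	LastInstallStart time.Time // zero if no install attempt is recorded
	LastInstallVer   string    // target version of the last recorded attempt
	ReportedFault    bool      // instance explicitly reported an install error
}

// classify applies the rough decision tree described above; `window` bounds how
// long an install attempt counts as "recent" (the exact cutoffs here are arbitrary).
func classify(s ServerStatus, now time.Time, window time.Duration) InstallState {
	recent := !s.LastInstallStart.IsZero() && now.Sub(s.LastInstallStart) < window
	switch {
	case recent && !s.Online:
		return ChurnedDuringInstall
	case recent && s.ReportedFault:
		return ExplicitInstallFault
	case recent && s.AdvertisedVer == s.LastInstallVer:
		return RecentInstall
	case recent && now.Sub(s.LastInstallStart) < window/2:
		return InstallTriggered // too soon to judge the outcome
	case recent:
		return ImplicitInstallFault
	case s.TargetVer != "" && s.AdvertisedVer != s.TargetVer:
		return NeedsInstall
	default:
		return VersionParity
	}
}
```
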
Hypothetical inventory view:

```
$ tctl inventory ls
Server ID                             Version  Services     Status
------------------------------------  -------  -----------  -----------------------------------------------
eb115c75-692f-4d7d-814e-e6f9e4e94c01  v0.1.2   ssh,db       installing -> v1.2.3 (17s ago)
717249d1-9e31-4929-b113-4c64fa2d5005  v1.2.3   ssh,app      online (32s ago)
bbe161cb-a934-4df4-a9c5-78e18b599601  v0.1.2   ssh          churned during install -> v1.2.3 (6m ago)
5e6d98ef-e7ec-4a09-b3c5-4698b10acb9e  v0.1.2   k8s          online, must install >= v1.2.2 (eol) (38s ago)
751b8b44-5f96-450d-b76a-50504aa47e1f  v1.2.3   ssh          online (14s ago)
3e869f3f-8caa-4df3-aa5c-0a85e884a240  v1.2.3   db           offline (12m ago)
166dc9b9-fc85-44a0-96ca-f4bec069aa92  v1.2.1   k8s          online, must install >= v1.2.2 (sec) (12s ago)
f67dbc3a-2eff-42c8-87c2-747ee1eedb56  v1.2.1   proxy        online, install soon -> v1.2.3 (46s ago)
9db81c94-558a-4f2d-98f9-25e0d1ec0214  v1.2.2   k8s          online, install recommended -> v1.2.3 (20s ago)
5247f33a-1bd1-4227-8c6e-4464fee2c585  v1.2.3   auth         online (21s ago)
...

Warning: 1 instance(s) need upgrade due to newer security patch (sec).
Warning: 1 instance(s) need upgrade due to having reached end of life (eol).
```

Some kind of status summary should also exist for the version-control system as a whole. I'm still
a bit uncertain about how this should be formatted and what all should be in it, but key points
like the current versioning source, targets, and installers should be covered, as well as stats
on recent installs/faults/churns:

```
$ tctl version-control status
Directive:
  Source:    tuf/default
  Status:    active
  Promotion: auto

Installers:
  Kind          Name         Status   Recent Installs  Installing  Faults  Churned
  ------------  -----------  -------  ---------------  ----------  ------  -------
  tuf           default      enabled  6                2           1       1
  local-script  apt-install  enabled  3                2           -       2

Inventory Summary:
  Current Version  Target Version  Count  Recent Installs  Installing  Faults  Churned
  ---------------  --------------  -----  ---------------  ----------  ------  -------
  v1.2.3           v2.3.4          12     -                4           1       3
  v2.3.4           -               10     9                -           -       -
  v3.4.5-beta.1    -               2      -                -           -       -
  v0.1.2           -               1      -                -           -       -

Critical Versioning Alerts:
  Version  Alert                              Count
  -------  ---------------------------------  -----
  v1.2.3   Security patch available (v2.3.4)  12
  v0.1.2   Version reached end of life        1
```

### Version Reconciliation Loop

The version reconciliation loop is a level-triggered control loop that is responsible for determining
and applying state-transitions in order to make the current inventory versioning match the desired
inventory versioning. Each auth server runs its own version reconciliation loop which manages the
server control streams attached to that auth server.

The core job of the version reconciliation loop is fairly intuitive (compare desired state to actual
state, and launch installers to correct the difference). To get a better idea of how it should work
in practice, we need to look at the caveats that make it more complex:

- We need to use a rolling update strategy with a configurable rate, which means that not all
  servers eligible for installation will actually have installation triggered on a given iteration.
  The version directive may change mid-rollout, so simply blocking the loop on a given directive until
  it has been fully applied isn't reasonable.

- We need to monitor cluster-wide health of ongoing installations and pause installations if we see excess
  failures/churn, which means that aggregating information about failures is a key part of the reconciliation
  loop's job.

- We should avoid triggering installs against servers that recently made an install attempt (regardless
  of success/failure), and we should also avoid sending install messages to servers that just connected
  or are in the process of graceful shutdown. This means that a server's eligibility for installation is
  a combination of both persistent backend records, and "live" control stream status.

Given the above, the reconciliation loop is a bit more complex, but still falls into three distinct phases:

1. Setup: load cluster-level upgrade system configuration, active `version-directive`, churn/fault stats, etc.

2. Reconciliation: Match servers to target and installer, and categorize them by their current
   install eligibility given recent install attempts, control stream status, etc.

3. Application: Determine the number of eligible servers that will actually be slated for install
   given current target rollout rate, update their backend states with a summary of the install attempt
   that is about to be made (skipping servers which had their installation status concurrently updated),
   and pass them off to installer-specific logic.

As much as possible, we want the real "decision making" power to rest with the `version-controller` rather
than the version reconciliation loop. That being said, the version reconciliation loop will have some
internal rules that it evaluates to make sure that directives, as applied to the current server inventory,
do not result in any invalid state-transitions (e.g. it will refuse to change the target arch for a given
server, or skip a major version).

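A rough Go sketch of a single pass through these three phases follows; all types and hooks here are hypothetical stand-ins, not actual teleport code:

```go
// Hypothetical sketch of one pass of the reconciliation loop run by an auth server.
package versioncontrol

// InstallPlan pairs an eligible server with the target and installer chosen for it.
type InstallPlan struct {
	ServerID  string
	Target    map[string]string
	Installer string
}

// RolloutStats aggregate recent install outcomes across the cluster.
type RolloutStats struct {
	Churned int
	Faults  int
}

// Reconciler is one auth server's view of the loop; the function fields stand in
// for directive matching, control-stream bookkeeping, backend writes, and
// installer invocation.
type Reconciler struct {
	ChurnLimit int
	FaultLimit int
	BatchSize  int // servers slated per pass, derived from the rolling rate

	Stats         func() RolloutStats
	Eligible      func() []InstallPlan   // matched servers minus recent attempts, fresh connections, drains
	RecordAttempt func(InstallPlan) bool // CompareAndSwap write; false if concurrently updated
	Trigger       func(InstallPlan)      // hand off to installer-specific logic
}

// Tick performs a single pass of the reconciliation loop.
func (r *Reconciler) Tick() {
	// Phase 1: setup - load churn/fault stats and halt the rollout if health
	// thresholds are exceeded.
	stats := r.Stats()
	if stats.Churned > r.ChurnLimit || stats.Faults > r.FaultLimit {
		return
	}

	// Phase 2: reconciliation - pair servers with (target, installer) and drop
	// servers that are currently ineligible.
	eligible := r.Eligible()

	// Phase 3: application - slate only as many servers as the rolling rate
	// allows, record each attempt before sending, and skip servers whose state
	// was concurrently updated by another auth server.
	if len(eligible) > r.BatchSize {
		eligible = eligible[:r.BatchSize]
	}
	for _, plan := range eligible {
		if !r.RecordAttempt(plan) {
			continue
		}
		r.Trigger(plan)
	}
}
```
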
### Version Directives

#### The Version Directive Resource

The `version-directive` resource is the heart of the upgrade system. It is a static resource that describes
the current desired state of the cluster and how to get to that state. This is achieved through a series of
matchers which are used to pair servers with installation targets and installers. At its core, a `version-directive`
can be thought of as a function of the form `f(server) -> optional(target,installer)`.

Installation targets are arbitrary attribute mappings that must *at least* contain `version`, but may contain
any additional information as well. Certain metadata is understood by teleport (e.g. `fips:yes|no`,
`arch:amd64|arm64|...`), but additional metadata (e.g. `sha256sum:12345...`) is simply passed through
to the installer. The target to be used for a given server is the first target that
*is not incompatible* (i.e. no attempt to find the "most compatible" target is made). A target is
incompatible with a server if that server's version cannot safely upgrade/downgrade to that target version,
*or* if the target specifies a build attribute that differs from a build attribute of the current
installation (e.g. `fips:yes` when current build is `fips:no`). We don't require that all build attributes
are present since not all systems require knowledge of said attributes.

It is the responsibility of an installer to fail if it is not provided with sufficient target attributes to perform
the installation safely (e.g. the `tuf` installer would fail if the target passed to it did not contain the
expected length and hash data). The first compatible installer from the installer list will be selected. Compatibility
will be determined *at least* by the version of the instance, as older instances may not support all installer
types. How rich a set of compatibility checks we want to support here is an open question. I am wary of being too
"smart" about it (per-installer selectors, pre-checking expected attributes, etc), as too much customization
may result in configurations that are harder to review and more likely to silently misbehave.

Within the context of installation target matching, version compatibility for a given server is defined
as any version within the inclusive range of `vN.0.0` through `vN+1.*`, where `N` is the current major
version of the server. Stated another way, upgrades may keep the major version the same, or increment it
by one major version. Downgrades may revert as far back as the earliest release of the current major
version. Downgrades to an earlier major version are not supported.

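A minimal Go sketch of this compatibility rule (a hypothetical helper, not actual teleport code):

```go
// Hypothetical helper implementing the major-version compatibility rule above.
package versioncontrol

import (
	"fmt"
	"strings"
)

// majorVersion extracts N from a version string of the form "vN.x.y" (or "N.x.y").
func majorVersion(version string) (int, error) {
	var major int
	_, err := fmt.Sscanf(strings.TrimPrefix(version, "v"), "%d", &major)
	return major, err
}

// compatibleTarget reports whether a server currently at currentVer may be moved
// to targetVer: same major version (upgrade or downgrade within the major), or
// exactly one major version higher (upgrade only).
func compatibleTarget(currentVer, targetVer string) bool {
	cur, err1 := majorVersion(currentVer)
	tgt, err2 := majorVersion(targetVer)
	if err1 != nil || err2 != nil {
		return false
	}
	return tgt == cur || tgt == cur+1
}
```
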
All matchers in the `version-directive` resource are lists of matchers that are checked in sequence, with
the first matching entry being selected. If a server matches a specific sub-directive, but no installation
targets and/or installers in that sub-directive are compatible, that server has no defined `(target,installer)`
tuple.

Beyond matching installation targets to servers, the `version-directive` also supports some basic time
constraints to assist in scheduling, and a `stale_after` field which will be used by teleport to determine
if the directive is old enough to start emitting warnings about it (especially useful if directives are generated
by external plugins which might otherwise fail silently).

Example `version-directive` resource:

```yaml
# version directive is a singleton resource that is either supplied by a user,
# or periodically generated by a version controller (e.g. tuf, plugin, etc).
# this represents the desired state of the cluster, and is used to guide a control
# loop that matches install targets to appropriate nodes and installers.
kind: version-directive
version: v1
metadata:
  name: version-directive
spec:
  nonce: 2
  status: enabled
  version_controller: static-config
  config_id: <random-value>
  stale_after: <time>
  not_before: <time>
  not_after: <time>
  directives:
  - name: Staging
    targets:
    - version: 2.3.4
      fips: yes
    - version: 2.3.4
      fips: no
    installers:
    - kind: script
      name: apt-install
    selectors:
    - labels:
        env: staging
      services: [db,ssh] # unspecified matches all services *except* auth
    - labels:
        env: testing
      services: [db,ssh]

  - name: Prod
    targets:
    - version: 1.2.3
      fips: yes
    installers:
    - kind: script
      name: apt-install
    selectors:
    - labels:
        env: prod
```

The above example covers the core information needed to effectively orchestrate installations, but it
does not quite cover an equally pressing need: providing reliable visibility into which instances are in need of
security patches and/or are running deprecated/eol versions. We cover more nuanced mechanisms for dealing
with customizable notifications in later sections, but it seems important that we also provide a mechanism for
establishing a very basic security/deprecation "floor" that can be baked into the version directive. Something
that lets us say "warn about versions before X" regardless of the details of the specific server -> version
mapping that is in effect at the moment.

Exact syntax is TBD, but something like this would be sufficient:

```yaml
critical_floor:
  end_of_life: v1 # all releases earlier than v2 are EOL
  security_patches:
  - version: v2.3.4 # v2 releases prior to v2.3.4 need to be upgraded to at least v2.3.4
    desc: 'CVE-12345: teleport may become sentient and incite robot uprising'
  - version: v3.4.5 # v3 releases prior to v3.4.5 need to be upgraded to at least v3.4.5
    desc: 'prime factorization proven trivial, abandon hope all ye who enter here'
```

#### Version Directive Flow

Up to this point, we've been fairly vague about what happens between a `version-directive` being
created by the initial controller that generates it, and becoming the new desired state for the
cluster. In order to reason about this intervening space, it is good to start by taking stock of
what features we would like to eventually support that take effect between generating the initial
directive, and final application of that directive:

- Mapping/Plugins: Some intervening process takes a `version-directive` generated by the
  originating controller and modifies it in some way. Some examples of this might be a plugin that applies
  a custom filter to installation targets, or a scheduler that creates custom start/end times for
  the directive.

- Plan/Apply Workflow: It is reasonable to assume that not everyone will want new installation
  targets to be selected automatically, and we should provide a workflow that permits previewing
  the new target state before applying it.

- Multiparty Approval: Upgrading sensitive infrastructure can be a big deal. Providing an equivalent
  to the `plan`/`apply` workflow that also supports multiparty approval (think access requests but for
  changing the version directive) seems like an obvious feature that we'll want to land eventually.

- Notifications/Recommendations: When using a plan/apply or multiparty approval workflow, being able
  to be notified when new versions are available seems reasonable and useful. Ideally, it should be possible
  to provide both the means for external plugins to generate notifications (e.g. via slack), and also for
  teleport's own interfaces to mark servers as being eligible for upgrade.

- Live Modality/Selection: Not all configurations work for all scenarios. It seems reasonable that
  we will eventually want to support workflows that allow some concept of differing configurations
  or directives, either by providing the ability to have multiple distinct configurations available at
  the same time (e.g. `plan <variant-a>` vs `plan <variant-b>`), or to allow some form of live subselection
  (e.g. `plan --servers env=prod`).

- Dry-Run: Similar to a `plan` phase, it might be nice to be able to execute dry runs of potential
  directives. What a dry run entails varies by installer, but "download and verify without installing"
  is a reasonable interpretation for local installers at least. It might even be possible to cache a
  package that was downloaded during a dry run and install it immediately during a normal install. Note
  that caching may present a new attack vector and implementing it would require careful thought to
  prevent new attack vectors from being introduced. This is a lower priority feature, but it is useful
  to keep in mind so that we don't select an architecture that precludes it as a possibility.

Each of the features described above requires some amount of engineering specific to itself, but they
also have an overlapping set of needs that we can use to inform the basic directive flow. We'll cover
the high-level flow itself, and then examine why it meets our needs.

Directives will come in three distinct flavors: "draft", "pending", and "active". Draft and pending
directives will be stored as sets keyed by `<kind>/<name>` and `<uuid>` respectively. The active directive
will be the singleton directive representing the current desired state, as discussed elsewhere. This
storage model will be used to enforce a specific "flow" through the following operations:

- `WriteDraft`: A draft is written out by its generating controller/plugin to `/drafts/<kind>/<name>`.
  By convention, `kind` and `name` are the kind and name of the controller that writes the draft. The effect
  of this is that subscribing to write events on the key `/drafts/tuf/default` is essentially equivalent to
  consuming a stream of the `tuf/default` controller's outputs. Drafts include information about when they
  become stale, ensuring easy detection if a controller is offline, even if it is external to teleport.

- `FreezeDraft`: The latest version of the target draft is copied and stored at a random UUID. Frozen drafts
  are stored as an immutable sub-field within a "pending directive" object which encodes additional information
  that allows teleport to make decisions about the pending directive (e.g. an approval policy in the case
  of a multiparty approval scenario).

- `PromotePending`: The target pending directive overwrites the "active" singleton, becoming the new target
  state of the cluster.

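A minimal Go sketch of this draft/pending/active storage flow (a hypothetical in-memory API, not actual teleport code or its backend layout):

```go
// Hypothetical sketch of the draft -> pending -> active flow described above.
package versioncontrol

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// DirectiveSpec stands in for the full version-directive payload.
type DirectiveSpec struct {
	Name string
	// matchers, targets, installers, stale_after, etc.
}

// PendingDirective is an immutable frozen draft plus decision metadata
// (e.g. an approval policy for multiparty approval).
type PendingDirective struct {
	ID     string
	Frozen DirectiveSpec
}

// Store models the three directive flavors.
type Store struct {
	drafts  map[string]DirectiveSpec    // keyed by "<kind>/<name>"
	pending map[string]PendingDirective // keyed by random id
	active  *DirectiveSpec              // singleton desired state
}

func NewStore() *Store {
	return &Store{
		drafts:  make(map[string]DirectiveSpec),
		pending: make(map[string]PendingDirective),
	}
}

// WriteDraft publishes a controller/plugin output; watching "drafts/tuf/default"
// is effectively consuming the tuf/default controller's output stream.
func (s *Store) WriteDraft(kind, name string, spec DirectiveSpec) {
	s.drafts[kind+"/"+name] = spec
}

// FreezeDraft copies the latest draft into an immutable pending directive.
func (s *Store) FreezeDraft(kind, name string) (PendingDirective, error) {
	spec, ok := s.drafts[kind+"/"+name]
	if !ok {
		return PendingDirective{}, fmt.Errorf("no draft %s/%s", kind, name)
	}
	id := make([]byte, 16)
	if _, err := rand.Read(id); err != nil {
		return PendingDirective{}, err
	}
	p := PendingDirective{ID: hex.EncodeToString(id), Frozen: spec}
	s.pending[p.ID] = p
	return p, nil
}

// PromotePending replaces the "active" singleton with a previously frozen directive.
func (s *Store) PromotePending(id string) error {
	p, ok := s.pending[id]
	if !ok {
		return fmt.Errorf("no pending directive %q", id)
	}
	s.active = &p.Frozen
	return nil
}
```
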
With the above flow defined, we can now look at how we might implement our desired features:

- Mapping/Plugins: Each intermediate plugin loads some upstream draft, performs its modifications, and
  writes them to some downstream draft. E.g. a `scheduler` plugin might load from `drafts/tuf/default`
  and write to `drafts/scheduler/default`.

- Plan/Apply Workflow: Invoking `tctl version-control plan` freezes the latest draft with an associated
  attribute indicating that it is frozen for a plan operation. The frozen draft is used to generate a
  summary of changes to be displayed to the user (e.g. number of nodes that would be upgraded and to
  what versions). If the user likes what they see, they can run `tctl version-control apply <id>` to
  promote the pending directive. If no action is taken, the pending directive expires after a short time.

- Multiparty Approval: Essentially the same workflow as Plan/Apply, except with `tsh` commands instead,
  possibly with slightly different wording (e.g. `propose`/`apply`), and an additional
  `tsh version-control review` command. The auth server freezes the target along with an approval policy,
  and waits for sufficient approvals before permitting promotion.

- Notifications/Recommendations: Teleport and/or external plugins periodically load the latest draft
  directive and compare it to current cluster state. Where the draft recommends a different version,
  users are notified and the recommended version is displayed when listing servers.

- Live Modality/Selection: While we want `apply` commands to "just work" if users only have one
  controller/pipeline, we can also support selecting drafts by name
  (e.g. `tctl version-control plan foo/bar`) so that users can configure their clusters to present multiple
  alternative drafts that can be compared and selected between.

- Dry-Run: Invoking `tctl version-control dry-run <id>` marks a pending directive for dry run. Auth servers
  invoke installers in dry-run mode (for those that support it), and periodically embed stats about the
  state of the dry run (churns, faults, etc) as attributes on the pending draft object for some time period.
  Since dry runs still trigger installers, multiparty approval would need to define approval thresholds for
  invoking dry runs. As noted in the previous dry run discussion, this feature is tricky and probably of lower
  priority than the others on this list.

### High-Level Configuration

Some configuration parameters are independent of specific controllers/installers (namely rollouts and
promotion policies), and are best controlled from a central configuration object, rather than having
competing configurations attached to each controller. In addition, it is desirable to provide a simple
single-step operation for enabling automatic upgrades in our "batteries included" usecase. With this in
mind, we will provide a top-level configuration object that can conveniently control the key parameters
of the upgrade system:

```yaml
kind: version-control-config
version: v1
spec:
  enabled: yes

  rolling_install:
    churn_limit: 5% # percent or count
    fault_limit: 10
    rate: 20%/h # <percent or count>/<h|m>

  promotion:
    strategy: automatic # set to 'manual' for plan/apply workflow
    from: tuf/default

  notification:
    from: tuf/latest # defaults to using the value from `promotion.from`

  # shorthand for the more verbose syntax of the version-directive resource with support
  # for wildcards in the version string. Version controllers can use these as templates
  # to build concrete actionable directives using targets from the latest matching version.
  # This is an optional feature, since any controllers we write will also support verbose
  # templates in their own config objects, but simple rules like this will likely be sufficient
  # for many usecases, and are generic enough for us to assume that all future controllers
  # should be able to support them.
  basic_directives:
  - name: Prod
    version: v1.1.* # at least major version must be specified
    server_labels:
      env: prod
  - name: Staging
    version: v1.2.*
    server_labels:
      env: staging
```

The above configuration object should be all a user needs to activate automatic updates (once we've implemented
the `tuf` controller and installer). Additional controller-specific features will be accessible by using a
custom configuration (e.g. `tuf/my-tuf-controller`), but the default should "just work" in most cases.

In the event that a mapping/plugin strategy (as described in the Version Directive Flow section) is in use,
the `promotion.from` field should be the draft output location of the final plugin in the chain. If using
the `manual` promotion strategy this field is optional, but omitting it will cause `tctl version-control plan`
to always require an explicit target.

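For illustration, parsing the `rate` syntax shown above (`<percent or count>/<h|m>`) might look something like the following Go sketch (a hypothetical helper, not actual teleport code):

```go
// Hypothetical parser for the rolling_install rate syntax, e.g. "20%/h" or "5/m".
package versioncontrol

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// Rate describes how many servers (absolute count or percent of inventory) may
// be slated for install per interval.
type Rate struct {
	Amount  float64
	Percent bool
	Per     time.Duration
}

func parseRate(s string) (Rate, error) {
	parts := strings.SplitN(s, "/", 2)
	if len(parts) != 2 {
		return Rate{}, fmt.Errorf("invalid rate %q", s)
	}
	var r Rate
	switch parts[1] {
	case "h":
		r.Per = time.Hour
	case "m":
		r.Per = time.Minute
	default:
		return Rate{}, fmt.Errorf("invalid rate interval %q", parts[1])
	}
	amount := parts[0]
	if strings.HasSuffix(amount, "%") {
		r.Percent = true
		amount = strings.TrimSuffix(amount, "%")
	}
	value, err := strconv.ParseFloat(amount, 64)
	if err != nil {
		return Rate{}, fmt.Errorf("invalid rate amount %q: %v", parts[0], err)
	}
	r.Amount = value
	return r, nil
}
```
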
### Version Controllers

A `version-controller` is an abstract entity that periodically generates a draft `version-directive`. It may
be a loop that runs within the auth server, an external plugin, or just a human manually creating
directives as needed. A *builtin* controller is a control loop that runs within teleport capable of generating
version directives. The only builtin controller that is currently part of the development plan is
the TUF controller, though we may also introduce a simpler "notification only" controller that can't be
used to trigger updates, but could be used to suggest that installations are out of date.

#### TUF Version Controller

The TUF version controller will be based on [go-tuf](https://github.com/theupdateframework/go-tuf)
and will maintain TUF client state within the teleport backend (TUF clients are stateful, since they
need to support concepts like key rotation). When enabled, the TUF controller will periodically sync
with a TUF repository that we maintain, discover available packages, and generate a version-directive
with the necessary metadata for the tuf installer to securely verify said packages.

The details of the TUF protocol are complex enough that I won't try to reiterate them here, but the
complexity is mostly in the process by which the per-package metadata is securely distributed. The output
generated by the TUF controller will be very simple. In addition to standard target information
(`version`, `arch`, etc), it will include a size in bytes and one or more hashes.

Custom configurations can be supplied, but in the interest of convenience a `tuf/default` controller
will be automatically activated if referenced by the `version-control-config`, which will seek to fill
the directive templates specified there.

Example custom configuration:

```yaml
kind: version-controller
version: v1
sub_kind: tuf
metadata:
  name: my-tuf-controller
spec:
  status: enabled
  directives:
  - name: Staging
    target_selectors:
    - version: 7.*
    server_selectors:
    - labels:
        env: staging
  - name: Prod
    target_selectors:
    - version: 7.2.*
    server_selectors:
    - labels:
        env: prod
  - name: Minimum
    target_selectors:
    - version: 6.*
    server_selectors:
    - labels:
        '*': '*'
```

*note*: Generally speaking, TUF is fips compatible, but I have yet to assess what, if any, additional
work may be needed to get the tuf controller working on fips teleport builds. It is possible that
we may end up supporting the tuf controller on non-fips builds earlier if this process ends up being
complex.

#### Notification-Only Install Controller

The TUF install controller is going to be a fairly substantial undertaking, with various moving parts
needing to come together behind the scenes (e.g. deterministic compilation). This is why the MVP release
is intended to support only manually-constructed directives and `local-script` installers.

It may still be desirable to provide a means of using the notification workflow before
TUF has landed. We could achieve this by providing a simple "low stakes" controller that produces
notification-only version directives, usable for displaying recommended versions in inventory lists,
but not suitable for providing sufficient information for package validation.

An example of a notification-only install controller would be a `github-releases` controller, which
periodically scraped the teleport repo's releases page. While the information contained there isn't
sufficient for robust package validation, it's more than sufficient for displaying a "recommended version"
in an output like `tctl inventory ls`.

If we wanted to go with a compromise between prioritizing full TUF features and prioritizing fast delivery
of notifications, we could establish a beta/preview TUF repo which did not provide any package hashes,
but did serve a list of recommended install versions, including metadata indicating which versions were
security releases. While this would take more time to deliver than a minimal "scraper", it would allow us
to spend our efforts on work that could be mostly re-used during the main TUF development
phase.

### Installers

An installer is a mechanism for attempting to install a target on a server or set of servers.
Conceptually, installers fall into two categories:

- Local Installers: A local installer runs on the teleport instance that needs the installation.
  Each local installer type needs to be supported by the instance being upgraded. From the point of
  view of the version reconciliation loop a local installer is a divergent function of the form
  `f(server_control_stream, target)`.

- Remote Installers: A remote installer runs on a teleport instance other than the instance(s)
  being updated. Remote installers need to provide a selector for the controlling host on which
  they need to be run. Remote installers are invoked for sets of servers and may be invoked
  multiple times for overlapping sets, making idempotence essential. From the point of view of
  the version reconciliation loop a remote installer is a function of the form
  `f(host_control_stream, servers, target)`.

Different installers have different required target attributes (e.g. the `tuf` installer requires
package size and hashes). Installers must reject any target which is missing any attribute
required by that installer's security model.

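As a sketch, the two categories might be expressed as interfaces like the following (hypothetical Go types mirroring the earlier sketches, not actual teleport APIs):

```go
// Hypothetical sketch of the two installer categories described above.
package versioncontrol

type Target map[string]string

type Server struct{ ID string }

// ControlStream stands in for the bidirectional GRPC control stream that the
// auth server holds for a connected instance.
type ControlStream interface {
	Send(msg any) error
}

// LocalInstaller runs on the instance being upgraded: the auth server only has
// to trigger it over that instance's own control stream,
// i.e. f(server_control_stream, target).
type LocalInstaller interface {
	Install(serverStream ControlStream, target Target) error
}

// RemoteInstaller runs on some other, pre-determined host and acts on a set of
// servers; overlapping invocations are possible, so it must be idempotent,
// i.e. f(host_control_stream, servers, target).
type RemoteInstaller interface {
	Install(hostStream ControlStream, servers []Server, target Target) error
}
```
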
#### Local-Script Install Controller

The `local-script` installer is the simplest and most flexible installer, and the first one we will
be implementing. It runs the provided script on the host that is in need of upgrade, providing
a basic mechanism for inserting target information (e.g. `version`) as env variables.

While sanity is generally the responsibility of the user of this controller, we can assist by enforcing
strict limits on allowed characters for inputs/vars (e.g. `^[a-zA-Z0-9\.\-_]*$`). This should be in
addition to any rules we create for specific values (e.g. `target.version`).

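A minimal Go sketch of that kind of input validation (a hypothetical helper, not actual teleport code):

```go
// Hypothetical validation of env values interpolated into a local-script install.
package versioncontrol

import (
	"fmt"
	"regexp"
)

// safeValue is the allowlist pattern mentioned above.
var safeValue = regexp.MustCompile(`^[a-zA-Z0-9\.\-_]*$`)

// validateScriptEnv rejects any env value containing characters outside the
// allowlist, preventing shell metacharacters from being smuggled into the script.
func validateScriptEnv(env map[string]string) error {
	for key, value := range env {
		if !safeValue.MatchString(value) {
			return fmt.Errorf("env var %q contains disallowed characters", key)
		}
	}
	return nil
}
```
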
The initial version of the local-script installer will be as bare-bones as possible:

```yaml
# an installer attempts to apply an installation target to a node. this is an example
# of an installer that gets passed from the auth server to the node so that the node
# itself can run it, but some installers may run somewhere other than the node itself
# (e.g. if invoking some API that remotely upgrades teleport installs). The auth server
# uses the version-directive to determine which installers should be run for which nodes
# and with which targets.
kind: installer
sub_kind: script
version: v1
metadata:
  name: apt-install
spec:
  enabled: yes
  env:
    "VERSION": '{target.version}'
  shell: /bin/bash
  install.sh: |
    set -euo pipefail
    apt install teleport-${VERSION:?}
```

Possible future improvements include:

- Additional scripts for special operations (e.g. `dry_run.sh`, `rollback.sh`, etc).

- Piping output into our session recording system so that install scripts can be played
  back (seems useful).

- Special teleport subcommands meant to be invoked inside of install scripts (e.g. for
  verifying tuf metadata against an arbitrary file).

#### TUF Install Controller

The TUF install controller will not need to be configured by users. It will be the
default install controller used whenever the TUF `version-controller` is active.
It will download the appropriate package from `get.gravitational.com` and perform standard
TUF verification (hash + size).

Since the download+verify functionality will be present in teleport anyhow, it may be useful
to expose it as hidden subcommands that could be used inside of scripts, which could allow
users to inject their own special logic within the normal tuf installation flow.
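
For illustration, the core of that verification step could look roughly like the following, assuming the
expected length and sha256 digest come from already-verified TUF metadata. The function names are
hypothetical.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verifyTarget rejects a downloaded package whose length or sha256 digest does
// not match the values recorded in the (already verified) TUF metadata.
func verifyTarget(pkg []byte, wantLen int64, wantSHA256 string) error {
	if int64(len(pkg)) != wantLen {
		return fmt.Errorf("unexpected package size: got %d, want %d", len(pkg), wantLen)
	}
	want, err := hex.DecodeString(wantSHA256)
	if err != nil {
		return fmt.Errorf("malformed expected digest: %v", err)
	}
	sum := sha256.Sum256(pkg)
	if !bytes.Equal(sum[:], want) {
		return fmt.Errorf("package digest does not match TUF metadata")
	}
	return nil
}

func main() {
	pkg := []byte("example package bytes")
	sum := sha256.Sum256(pkg)
	fmt.Println(verifyTarget(pkg, int64(len(pkg)), hex.EncodeToString(sum[:]))) // <nil>
}
```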
#### Remote-Script Install Controller

- Affects installation indirectly by running a user-provided script on a pre-determined host
  (not the host in need of upgrade).

- Intended as a simple means of hooking into systems such as k8s, where the teleport version is
  controlled via a remote API, though that does not preclude us making official remote install
  controllers for specific APIs down the road.

- Details of functionality are TBD, but the basic idea will be that we will mirror the functionality of
  `local-script` wherever possible, and add an additional server selector that is used to
  determine where the installer should be run.

- Q: Should the list of target servers be provided to the script? Is that even useful? It seems
  more likely that scripts will be written per externally managed set, though that could be a
  failure of imagination on my part.

### TUF CI and Repository

In order to enable the TUF version controller, we will need to maintain CI that generates and signs
TUF metadata, and maintain a TUF repository. Details of how the TUF repository will be hosted are
still TBD, but TUF repositories are basically static files, so distribution should be fairly straightforward.
We may be able to simply distribute it via a git repo.

We will leverage deterministic builds and TUF's multisignature support to harden ourselves against CI
compromise. Our standard build pipeline will generate and sign one set of package hashes, and another set
will be generated and signed by a separate isolated env.

TUF repositories prove liveness via periodic resigning with a "hot" key (not the keys used for package signing).
This hot key should be isolated from the package signing keys, so we're likely looking at two new isolated
envs that need to be added in addition to the modifications to our existing CI.

*note*: some initial work was done to get deterministic builds working on linux packages. We know it's possible
(and might even still be working), but we don't currently have test coverage for build determinism. This will be
an important part of the prerequisite work to get the TUF system online. We don't need to add TUF support for
all build targets at once, so we may specifically target reliable signing of amd64/linux packages first.

### Rollbacks

Rollbacks will come in two flavors:

1. Remote rollback: The version directive is changed to target an older version, and the older version is
   installed via the normal install controller. This requires the new teleport installation to work at least
   well enough to perform any functions required by the install controller.

2. Local rollback: The previous teleport installation remains cached during the upgrade, and some local process
   monitors the health of the new version. If the new version remains unhealthy for too long, it is forcibly
   terminated and the previous installation is restored.

The first option is an emergent property of the level-triggered system and will be supported from the beginning.
Teleport won't bother to distinguish between an upgrade and a downgrade, so no special downgrade logic is required
for this option to work.

The second option will require a decent amount of specialized support and will be added later down the line. Script
installers would likely need to be amended in some way to work correctly with a local rollback scheme. The
details of how exactly local rollbacks should function are TBD. Some possibilities include:

- Initially install new versions to a pending location (e.g. `/usr/local/bin/teleport.pending`). Have teleport
  automatically fork a background monitor and `exec` into the pending binary if it is detected on startup. If the
  background monitor observes that its requirements are met, it moves the pending binary to the active location,
  replacing the previous install.

- Formally embrace the idea of multiple concurrently installed teleport versions
  and provide a thin "proxy binary" that can seamlessly `exec` into the current target version based on some filesystem
  config, potentially launching a background monitor of a different version first depending on said config. This has the
  downside of introducing a new binary, but the upside of eliminating the need for messy move/rename schemes.

- Fully install the new version, creating a backup of the previous version first. Rely on an external mechanism for
  ensuring that the monitor/revert process gets run (e.g. by registering a new systemd unit). This has the upside of
  probably being compatible with script-based installers without any changes (teleport could create the backup and register
  the unit before starting the script), but has the downside of introducing an external dependency.

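As a very rough sketch of the first option above, the background monitor's core loop might look something
like this. The paths, grace period, and `healthy` check are all placeholders; nothing here reflects an
actual teleport implementation.

```go
package main

import (
	"os"
	"time"
)

const (
	activePath  = "/usr/local/bin/teleport"         // placeholder
	pendingPath = "/usr/local/bin/teleport.pending" // placeholder
)

// healthy stands in for whatever signal the monitor would actually use
// (e.g. successful heartbeats from the newly exec'd pending binary).
func healthy() bool { return true }

// monitorPending promotes the pending binary if it stays healthy for the whole
// grace period, and otherwise removes it so the previous install stays active.
func monitorPending(grace time.Duration) error {
	deadline := time.Now().Add(grace)
	for time.Now().Before(deadline) {
		if !healthy() {
			return os.Remove(pendingPath) // roll back: discard the pending binary
		}
		time.Sleep(10 * time.Second)
	}
	return os.Rename(pendingPath, activePath) // promote: replace the active install
}

func main() {
	_ = monitorPending(5 * time.Minute)
}
```
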
## UX
### Static Configuration CLI UX

Static configuration objects will be managed via `tctl`'s normal `get`/`create` resource
commands.

Enabling the version control system (notification-only):

```bash
$ cat > vcc.yaml <<EOF
kind: version-control-config
version: v1
spec:
  enabled: yes

  notification:
    alert_on:
      - security-patch
    from: github-releases/default
EOF

$ tctl create vcc.yaml
```

Enabling the version control system (manual upgrades):

```bash
$ cat > vcc.yaml <<EOF
kind: version-control-config
version: v1
spec:
  enabled: yes
  promotion:
    from: tuf/default

  basic_directives:
    - name: All Servers
      version: v1.2.*
      selector:
        labels:
          '*': '*'
EOF

$ tctl create vcc.yaml
```

Configuring a custom TUF controller:

```bash
$ cat > vc.yaml <<EOF
kind: version-controller
version: v1
sub_kind: tuf
metadata:
  name: my-tuf-controller
spec:
  status: enabled
  directives:
    - name: Staging
      target_selectors:
        - version: 7.*
      server_selectors:
        - labels:
            env: staging
    - name: Prod
      target_selectors:
        - version: 7.2.*
      server_selectors:
        - labels:
            env: prod
EOF

$ tctl create vc.yaml
```

### Version Directive Flow CLI UX

The version directive flow will be managed via the `tctl version-control` family of subcommands.

Manually creating a custom directive:

```bash
$ cat > vdd.yaml <<EOF
kind: version-directive
version: v1
sub_kind: custom
metadata:
  name: my-directive
spec:
  status: enabled
  directives:
    - name: Staging
      targets:
        - version: 2.3.4
      installers:
        - kind: script
          name: apt-install
      selectors:
        - labels:
            env: staging
          services: [db,ssh]

    - name: Prod
      targets:
        - version: 1.2.3
          fips: yes
      installers:
        - kind: script
          name: apt-install
      selectors:
        - labels:
            env: prod

    - name: Minimum
      targets:
        - version: 1.2.0
          fips: no
      installers:
        - kind: script
          name: apt-install
      selectors:
        - labels:
            '*': '*'
EOF

$ tctl version-control create-draft vdd.yaml
```

Plan/apply workflow:

```bash
$ tctl version-control plan custom/my-draft
Directive custom/my-draft frozen with ID 'bba14536-0ad9-4b14-a071-1296d570e52e'...

Warning: Sub-directive "Staging" proposes version newer than current auth version (will not take effect until auth is upgraded).

Estimated Changes:
  Current Version    Target Version    Count    Sub-Directive
  ---------------    --------------    -----    -------------
  v1.2.3             v2.3.4            12       Staging
  v1.2.1             v1.2.3            2        Prod

Estimated Unaffected Instances: 52

help: you can run 'tctl version-control apply bba14536-0ad9-4b14-a071-1296d570e52e' to enable these changes.

$ tctl version-control apply bba14536-0ad9-4b14-a071-1296d570e52e
Successfully promoted pending directive 'bba14536-0ad9-4b14-a071-1296d570e52e'.
help: run 'tctl version-control status' to monitor rollout progress.
```

### CLI Recommendations and Alerts UX

Recommended version info will be added as part of normal server status info for user-facing interfaces
(`tsh inventory ls` to start, but other per-server displays could also include it). Cluster-level alerts
(e.g. due to a security patch becoming available or a major version reaching EOL) will be displayed on
login, and could be expanded to other "frequently used" commands if need be.

Recommended version, displayed as part of status in `tsh inventory ls`:

```bash
$ tsh inventory ls
Server ID                            Version Services    Status
------------------------------------ ------- ----------- -----------------------------------------------
eb115c75-692f-4d7d-814e-e6f9e4e94c01 v0.1.2  ssh,db      installing -> v1.2.3 (17s ago)
9db81c94-558a-4f2d-98f9-25e0d1ec0214 v1.2.2  k8s         online, upgrade recommended -> v1.2.3 (20s ago)
b170f8f1-e369-4e10-9a04-5fb33b8e40d5 v1.2.2  ssh         online, upgrade recommended -> v1.2.3 (45s ago)
5247f33a-1bd1-4227-8c6e-4464fee2c585 v1.2.3  auth        online
...
```

Alerts related to available security patches and EOL show up on login for those with sufficient permissions
(exact permissions are TBD, but if you have blanket read for server inventory, that should be sufficient):

```bash
$ tsh login cluster.example.com
[...]
> Profile URL:        https://cluster.example.com:3080
  Logged in as:       alice
  Cluster:            cluster.example.com
  Roles:              populist, dictator
  Logins:             alice
  Kubernetes:         disabled
  Valid until:        2022-04-05 10:20:13 +0000 UTC [valid for 12h0m0s]
  Extensions:         permit-agent-forwarding, permit-port-forwarding, permit-pty

WARNING: Cluster "cluster.example.com" contains instance(s) eligible for security patch.
```

### Web UI Recommendations and Alerts UX

GUIs aren't really my area of expertise, and I'm not certain if we're going to opt to actually
port the unified "inventory" view to the web UI, but here are some ideas that I think are good
starting points:

- An "alerts" section under the "Activity" dropdown that can list cluster-level alerts about
  version-control now, and possibly other related alerts as well down the road.

- Some kind of small but visually distinct banner alert that shows up on login but can be
  minimized/dismissed, and/or a badge on the activity dropdown indicating that alerts exist.

- Color-coded badges for some or all of the following per-instance states:
  - upgrade available
  - eol/deprecated version
  - security update available

## Hypothetical Docs

Some hypothetical documentation snippets to help us imagine how comprehensible this system
will be to end users.

### Quickstart

Teleport's update system uses pluggable components to make it easy to get the exact behavior
you're looking for. The simplest way to get started with teleport's upgrade system is to use
the builtin TUF controller and installer, based on [The Update Framework](https://theupdateframework.com/).

You can enable these components like so:

```bash
$ cat > vcc.yaml <<EOF
kind: version-control-config
version: v1
spec:
  enabled: yes
  promotion:
    strategy: manual
    from: tuf/default
EOF

$ tctl create vcc.yaml
```

Once enabled, teleport will automatically detect new releases and draft an update plan
for your cluster. You can run `tctl version-control plan` to preview the latest draft's
effect on your cluster and run `tctl version-control apply <id>` to accept it if
everything is to your liking. Ex:

```bash
$ tctl version-control plan
Draft tuf/default frozen with ID 'bba14536-0ad9-4b14-a071-1296d570e52e'...

Estimated Changes:
  Current Version    Target Version    Count    Sub-Directive
  ---------------    --------------    -----    -------------
  v1.2.3             v1.3.5            12       Default

Estimated Unaffected Instances: 2

help: you can run 'tctl version-control apply bba14536-0ad9-4b14-a071-1296d570e52e' to enable these changes.

$ tctl version-control apply bba14536-0ad9-4b14-a071-1296d570e52e
Successfully promoted pending directive 'bba14536-0ad9-4b14-a071-1296d570e52e'.
help: run 'tctl version-control status' to monitor rollout progress.
```

Note that we didn't tell teleport what version we want to install. By default, teleport looks
for the latest releases for the major version you are already on (though it will notify you
if newer major versions are available). If we want to perform a major version upgrade, we need
to provide explicit configuration. Explicit versions or version ranges can be specified using
a `basic_directive`:

```bash
$ cat > vcc.yaml <<EOF
kind: version-control-config
version: v1
spec:
  enabled: yes
  promotion:
    strategy: manual
    from: tuf/default

  basic_directives:
    - name: All servers
      version: v2.3.*
      installer: tuf/default
      selector:
        labels:
          '*': '*'
        services: ['*']
EOF

$ tctl create -f vcc.yaml
```

You can read the above configuration as "the latest v2.3.X release should be installed on
all servers, using the default install method". We specify a version matcher, installer,
and instance selector (wildcard labels and services match all instances). Teleport then
creates a draft proposal that matches our configuration in the background.

If you run `tctl version-control plan` immediately after creating/updating the config, you might
see an error like `Draft tuf/default appears outdated (config has been changed)`
or `Draft tuf/default has not been generated`. This is normal. Teleport needs to download
and verify detailed release metadata in order to generate a draft. This may take a
few seconds.

### Customization

- TODO

## Implementation Plan

There are a lot of individual components and features discussed in this RFD. As such,
implementation will be divided into phases with multiple iterative releases consisting
of subsets of the final feature set.

### Inventory Status/Control Setup

This phase sees no meaningful user-facing features added, but is the building block upon
which most of the rest of the features are built.

Instance-level status and control stream:

- Refactor agent certificate logic to support advertising multiple system roles on a
  single cert (currently each service has its own disjoint certificate).

- Implement per-instance bidirectional GRPC control stream capable of advertising all
  services running on a given teleport instance, and accepting commands directly from the
  controlling auth server.

Improved inventory version tracking:

- Improve teleport's self-knowledge so that instances can heartbeat detailed build
  attributes (arch, target os, fips status, etc).

- Add new server inventory resource and `tctl inventory ls` command for viewing all
  instances w/ build info and services.

With the above changes in place, we will have the ability to reliably inspect per-instance
state regardless of running services, and each auth server will have a bidirectional handle
to its connected nodes, allowing for real-time signaling.

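For a sense of what the control stream's initial advertisement might carry, here is an illustrative
sketch of a per-instance hello payload with build attributes. The type and field names are invented for
this example and are not the actual wire format.

```go
package main

import "fmt"

// BuildInfo captures the build attributes that target selectors would match
// against (illustrative only).
type BuildInfo struct {
	Version string // e.g. "v1.2.3"
	OS      string // e.g. "linux"
	Arch    string // e.g. "amd64"
	FIPS    bool
}

// InstanceHello is sent once when the per-instance control stream is established.
type InstanceHello struct {
	ServerID string
	Services []string // e.g. ["ssh", "db"]
	Build    BuildInfo
}

func main() {
	fmt.Printf("%+v\n", InstanceHello{
		ServerID: "eb115c75-692f-4d7d-814e-e6f9e4e94c01",
		Services: []string{"ssh", "db"},
		Build:    BuildInfo{Version: "v1.2.3", OS: "linux", Arch: "amd64"},
	})
}
```
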
### Notification-Only System (?)

*note*: This step is optional, but might allow us to provide more value to users much
sooner.

Implement a notification-only upgrade controller and basic `version-directive` resource,
without any concept of having an "active" directive, and no reconciliation loop or installers.
The notification-only controller would serve only for detecting the existence of new versions,
without providing any of the strong censorship resistance or package validation of the
tuf based controller. Instead, the purpose of this controller would be to generate a very
basic `version-directive` that could be used to display the *recommended* version for
teleport instances.

In theory, the ability to display recommended versions and/or generate notifications is
a less "core" functionality, and could be added in a later step with less overall
development effort. Once the TUF controller exists, using its output for notifications
would be easy. That being said, it may be more valuable to deliver a pretty good way of
informing users that they ought to upgrade sooner, rather than waiting on a very robust
way of automatically upgrading that happens to bring notifications along with it.

Regardless of ordering, notifications in general depend on the following components
that need to be built anyway:

- The target+server matching part of the `version-directive` resource (installer matching
  comes later).

- The draft phase of the [version directive flow](#version-directive-flow).

- The basic top-level config API (get/put/del).

- The basic version controller configuration API (get/put/del) (only required if
  we want to support a controller configuration other than `default`).

- The `tctl version-control status` command (though not all fields will be available yet).

Creating the notification-only system first will also necessitate an additional
builtin `version-controller` that would not otherwise be needed. Luckily, it can
be *very* simple (e.g. a github release page scraper), since it will be explicitly
*not* usable for actual upgrades.

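To illustrate just how minimal such a scraper could be, here is a sketch that assumes the public GitHub
releases API is an acceptable source. The function name is illustrative, and the resulting tag would only
ever feed the *recommended* version, never an install.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// latestReleaseTag fetches the tag name of the newest published release of repo.
func latestReleaseTag(repo string) (string, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(fmt.Sprintf("https://api.github.com/repos/%s/releases/latest", repo))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %v", resp.Status)
	}
	var release struct {
		TagName string `json:"tag_name"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&release); err != nil {
		return "", err
	}
	return release.TagName, nil
}

func main() {
	tag, err := latestReleaseTag("gravitational/teleport")
	if err != nil {
		panic(err)
	}
	// The tag (e.g. "v9.1.2") would become the recommended version in a
	// generated version-directive; it is never used to drive installs.
	fmt.Println("latest release:", tag)
}
```
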
### Script-Based Upgrades MVP

With the core work done for inventory status/control, we can move on to a barebones
MVP/pathfinder for installers, version directives, and the version reconciliation loop.
We will implement a no-frills version of these components with the goal of supporting
one specific use-case: manually creating a basic `version-directive` resource and
having a user-provided script run on all servers that don't match the directive.

This phase will be a bit of a pathfinder, with a focus on weeding out any issues with
the proposed design of the core system. It will also provide an early preview for users
that are interested in reducing per-instance upgrade work, but are still willing to get
their hands dirty. Finally, this will mark the point after which manual upgrade of
non-auth instances can (theoretically) end, as new versions that support new installers
can be "bootstrapped" using older installers.

The components that must be developed for this phase are as follows:

- Per-instance installation attempt status info.

- The version reconciliation loop (minus more advanced features like being able to
  trigger remote installers).

- The version directive resource (mostly complete already if we did the notification-only
  system first), and the version directive flow.

- The `local-script` installer, and basic installer configuration API (get/put/del).

- The `tctl version-control plan`/`tctl version-control apply` commands.

- A rudimentary version of the rollout health monitoring and automatic-pause system.

- Interactive `tctl version-control setup` command.

### TUF-Based System MVP

This phase sees the beginning of "batteries included" functionality. We will be adding
the TUF-based version controller and installer, as well as setting up supporting
CI and repository infrastructure. In this phase, teleport will start being able to
detect and install new versions on its own (though this will still be a "preview" feature
and not recommended for production).

Development in this phase will be split between core teleport changes and build/infra
work. The core teleport work will be as follows:

- Basic version controller configuration API (if not added in notification-only phase).

- Internal TUF client implementation w/ stateful components stored in teleport backend.

- Builtin TUF version controller (basically just a control loop that runs the client and
  then converts TUF package metadata to version-directive format).

- Rudimentary TUF installer (no local rollbacks yet, so this is basically just
  download, validate, and replace).

- Basic notification/version recommendations (if not added in notification-only phase).

Build system/infra work:

- Get deterministic builds working (they might still work, since I did get them mostly
  functional a while back, but this isn't covered by tests, so it's basically meaningless).

- Set up isolated automation for independently building, hashing, and signing teleport releases.

- Add hashing + signing to existing build pipeline (different keypair).

- Set up TUF repository with thresholded signing so that compromise of one of the two build
  envs does not compromise the TUF repository. TUF repositories are just static files, so this
  can be hosted just about anywhere, though there is some regular re-signing by a "hot" key
  that is used to prove liveness.

### Stability & Polish

The timeline for this phase isn't linear and the individual changes aren't interdependent like in
previous phases, but we're moving out of the realm of a preview/MVP feature and that means polish
and stability improvements. In no particular order:

- Officially move TUF components out of preview (good time to try our first public repo key rotation?).

- Implement local rollbacks.

- Extend the upgrade system to support upgrading auth servers.

- Extend the TUF repository to cover more package types (deterministic docker images are theoretically
  possible, I hear).

- Add the `remote-script` installer.

- Improve upgrade visibility (e.g. create "session recordings" for `local-script` installers).

- Tackle outstanding feedback & any issues that have been uncovered prior to moving to the
  extended feature set.

### Extended Feature Set

- Multiparty approval and dry run workflows.

- Notification plugins (e.g. slack notifications for very outdated instances).

- Other remote installers (e.g. k8s).

## Other Stuff
### Anonymized Metrics

While not uniquely related to the upgrade system, we are going to start looking toward supporting opt-in
collection of anonymized metrics from users. The first instance of this new feature will appear alongside
the TUF-based system in the form of additional optional headers that can be appended to TUF metadata requests
and can be aggregated by the TUF server.

The heart of the anonymized metrics system will be two new abstract elements to be added to cluster-level state
(which configuration object they should appear in is TBD):

```
enabled: yes|no
random_value: <random-bytes>
```

If a user chooses to enable anonymized metrics for a cluster, a random value with some reasonably large entropy
will be generated. This will form the basis for an anonymous identifier that will allow us to distinguish between
metrics from different clusters without the identifier revealing anything about that cluster's identity.
The random value can be used directly as an identifier, or as a MAC key used to hash some other value. I lean
toward preferring a scheme where the presented identifier rotates periodically (e.g. monthly).
If combined with the right amount of "bucketing" of any scalar values, this should help us prevent the emergence of
any "long term" narratives related to a single identifier, thereby further improving anonymization.

I am currently leaning toward the idea of using the random value to create a keyed/salted hash of the current year/month
in GMT (`YYYY-MM`), s.t. each month is effectively a separate dataset with separate identifiers. This kind of scheme would
both produce cleaner datasets and improve anonymity by effectively causing all clusters across the ecosystem to
rotate their IDs simultaneously. I'm still thinking this through, so maybe there are issues with this particular angle, but
the aforementioned properties are appealing.
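
A minimal sketch of that derivation, assuming an HMAC-SHA256 over the UTC `YYYY-MM` string keyed by the
cluster's `random_value`; the function name is illustrative only.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// monthlyMetricsID derives the identifier presented for the current month.
// A fresh value appears automatically whenever the month rolls over.
func monthlyMetricsID(randomValue []byte, now time.Time) string {
	mac := hmac.New(sha256.New, randomValue)
	mac.Write([]byte(now.UTC().Format("2006-01"))) // e.g. "2022-04"
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	seed := []byte("replace-with-high-entropy-random-value")
	fmt.Println(monthlyMetricsID(seed, time.Now()))
}
```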

To start with, cluster identifiers will be the only data the user is actually opting into sharing. The TUF server
will know the version of the client calling it, and whether it is an open source or enterprise request. The optional
cluster identifier is what transforms this information from being just useful per-request debug info
into a meaningful metric about the state of the teleport ecosystem. By using the cluster ID to deduplicate requests,
we will start to be able to make more informed guesses about the scope of the teleport ecosystem and the distribution
of teleport versions across it.

We will therefore end up collecting datapoints of the following format:

```
version: <semver>
flavor: <enum: oss|ent>
id: <random-string>
```

A few notes about working with this system:

- If scalar values are added in the future (e.g. cluster size), they will need to be bucketed s.t. no one
  cluster identifier is unique enough to be traceable across ID rotations, or unique enough to be correlated
  with a given user should their cluster size (approximate or not) be shared for any reason.

- Cluster identifiers (both the seed/salt value and the ephemeral identifier) should be treated as secrets and
  not emitted in any logs or in any tctl commands that don't include `--with-secrets`.

- Addition of any new metrics in the future should be subject to heightened scrutiny and cynicism. A healthy
  dose of 'professional paranoia' is beneficial here.

### Open Questions

- It seems reasonable that folks should be able to specifically watch for security patches and
  have them automatically installed, or have special notifications generated just for them. It may even
  be good to have such a feature come as one of the pre-configured controllers alongside the
  `default` controllers (e.g. `tuf/security-patches`). How should we handle this? I'm currently
  leaning toward introducing a new build attribute that can be filtered for
  (e.g. `version=v1.*,fips=yes,security-patch=yes`), but it's possible that there are better ways
  to go about this (e.g. separate repos or the concept of "release channels").

- How explicitly should local nodes require opt-in? We obviously don't want to run any installers
  on a node that doesn't include explicit opt-in, but should we require explicit opt-in for specific
  installers? (e.g. `auto_upgrade: true` vs `permit_upgrade: ['local-script/foo', 'tuf/default']`)

- How are we going to handle auth server upgrades? Since auth servers can't do rolling upgrades,
  this is a lot trickier than upgrades of other kinds. We can obviously coordinate via backend state,
  but it's basically impossible to distinguish between shutdown and bugs or performance issues.

- Some folks have a *lot* of leaf clusters. It may be burdensome to manage upgrade states for individual
  clusters separately. Should we consider providing some means of having root clusters coordinate
  the versioning of leaf clusters? At what point are they so integrated that the isolation benefits are
  nullified and the users would be better off with a monolithic cluster?