teleport/rfd/0025-hsm.md
Zac Bergquist e174004612
Update RFD statuses (#24454)
We haven't been good about going back and marking RFDs as implemented,
and it's helpful when looking at old designs to know if they ever made
their way into the product.
2023-04-13 17:22:46 +00:00

14 KiB

authors: Andrew Lytvynov (andrew@goteleport.com), Nic Klaassen (nic@goteleport.com) state: implemented (v7.2)

RFD 25 - Hardware security module (HSM) support

What

Support HSM devices for CA private key storage.

Why

In simple terms, HSMs are hardware devices that store private keys and expose only cryptographic operations using them (like sign/verify and encrypt/decrypt). HSMs provide stronger protections for storing private keys compared to disks or remote databases. Private keys stored in the HSM can be non-exportable, meaning they will never leave the device. HSMs are also required by some regulation regimes, like FIPS and PCI.

Details

Background

First, some background on HSMs and Teleport CAs.

HSM operations

It's important to understand that HSMs are a product category (like "storage devices" or "monitors") and not a specific standard (like NVMe, SATA or HDMI). There are commonalities between HSM devices, but also many vendor-specific variations, extensions and features.

That said, most HSM devices support the following minimum set of operations that Teleport needs:

  • generating asymmetric key pairs, with persistence
    • using user-specific parameters, like algorithm and key size
    • the set of valid parameters is limited (e.g. you can have RSA but not EdDSA)
  • exporting the public key of the pair
  • using those keys for signing/verification of arbitrary data
  • lookup of keys by ID specified during creation
  • import of an existing private key from disk

Some devices support other interesting operations, but they are not universal:

  • backup/restore/migration of keys to a different HSM device
  • remote access

HSM interfaces

By far, the most ubiquitous interface to HSM devices is PKCS#11. It is often despised because of complexity and usage difficulty. PKCS#11 works by loading a module (.so library) provided by the vendor, which has a known C API. To cover the largest number of devices, we have to support PKCS#11.

Cloud HSM (and KMS) products are also available from the biggest cloud providers, usually over a custom API. AWS CloudHSM is an exception - it mounts a real HSM over PKCS#11. Since cloud HSM could become a popular request, we should design with the assumption of multiple HSM interfaces. But we will not implement their support initially.

Each HSM vendor typically also provides a custom SDK or library for their specific product. We should not support these in favor of the standard PKCS#11 interface.

yubihsm-connector is an interesting HTTP-to-USB proxy for the device. It allows a single HSM to be shared by multiple clients. It doesn't seem to do authentication though and only works for YubiHSM 2.

HSM key export

Despite popular belief, HSMs don't guarantee that private keys can't be stolen. Using key wrapping and special attributes on private keys, it's possible to copy keys created on one device to a different one or extract them in plaintext. Yubico even provides CLI tools for this, specific to YubiHSM.

This may be used to distribute a single CA private key to multiple HSMs over the network. However, it significantly weakens the security properties of using an HSM in the first place.

Teleport CAs

Each Teleport cluster has multiple CAs:

  • Host CA (SSH + TLS)
  • User CA (SSH + TLS)
  • JWT CA (TLS only)

Private keys in each CA are stored as a list to enable CA rotation. For example, when rotating a Host CA, we shuffle the key fields of the existing CA object instead of creating a brand new backend object. The expectation of only 1 CA of each type is deeply integrated in Teleport.

Architecture

Each Auth server may have an HSM device available, configured in teleport.yaml. Auth servers create and manage unique CA keys in their local HSMs, distinct from the keys on other Auth servers. Each Auth server encodes its local HSM information in the private key field of the CA backend object (instead of the PEM-encoded raw private key).

CA storage schema has to support multiple active private keys. Auth servers decide on which key to use based on their local configuration.

CA rotation also needs to update, to support multiple active key pairs and multiple trusted key pairs.

Implementing key creation in the Auth server (instead of asking users to manually create keys) gives us the flexibility to change key algorithms in the future. Support for multiple private keys sets us up for CA backup/restore/migration and full federation in the future.

PKCS#11 library

There are several PKCS#11 libraries available, we'll use crypto11. This library allows both key creation and usage, and is a nice usable abstraction over the raw PKCS#11 protocol.

We could use miekg/pkcs11 directly, which represents the PKCS#11 API as closely as possible. But we'd end up with a lot of verbose code, essentially reimplementing most of crypto11 (which is based on miekg/pkcs11 anyway).

CA rotation

Current CA rotation logic operates on one CA type at a time and performs a phased rotation to give clients and servers a chance to fetch updated CA certs list and reissue their credentials with the new CA.

The rotation mechanism switches new and old CA keys to be "signing" key and "trusted" keys in different phases:

  • "signing" key is the first item in the keys list (e.g. CertificateAuthorityV2.Spec.TLSKeyPairs[0]) and is used to sign any new certificates
  • "trusted" key is the second item in the keys list and is trusted for certificate validation, but does not sign any new certificates

This index-based approach will break down with multiple active CA keys (in different HSMs on different Auth servers). Instead, we'll update the storage schema to support multiple "signing" and "trusted" keys for each CA.

Clients will trust all "signing" + "trusted" CA certs for validation. Auth servers will use their corresponding HSM key for signing (see below).

Backend storage

The current backend schema stores CA keys as lists:

message CertAuthoritySpecV2 {
    // Note: some fields are omitted below.

    // SSH keys. CheckingKeys are public keys, SigningKeys are private keys.
    repeated bytes CheckingKeys = 3 [ (gogoproto.jsontag) = "checking_keys,omitempty" ];
    repeated bytes SigningKeys = 4 [ (gogoproto.jsontag) = "signing_keys,omitempty" ];

    // TLS keys and certs.
    repeated TLSKeyPair TLSKeyPairs = 7 [ (gogoproto.nullable) = false, (gogoproto.jsontag) = "tls_key_pairs,omitempty" ];

    // JWT keys.
    repeated JWTKeyPair JWTKeyPairs = 10 [ (gogoproto.nullable) = false, (gogoproto.jsontag) = "jwt_key_pairs,omitempty" ];
}

This looks like it supports multiple private keys, but in practice only the first item of each list is used for signing. Since we can't add a bool signing flag to SSH keys (because they are byte slices), there isn't an incremental way to mark multiple signing keys.

I propose revamping the schema entirely and doing a migration to move existing data:

message CertAuthoritySpecV2 {
    // Note: some fields are omitted below.

    // Deprecated fields for backwards-compatibility.
    // Delete in teleport v8.
    repeated bytes CheckingKeys = 3 [ (gogoproto.jsontag) = "checking_keys,omitempty" ];
    repeated bytes SigningKeys = 4 [ (gogoproto.jsontag) = "signing_keys,omitempty" ];
    repeated TLSKeyPair TLSKeyPairs = 7 [ (gogoproto.nullable) = false, (gogoproto.jsontag) = "tls_key_pairs,omitempty" ];
    repeated JWTKeyPair JWTKeyPairs = 10 [ (gogoproto.nullable) = false, (gogoproto.jsontag) = "jwt_key_pairs,omitempty" ];

    // New fields in teleport v7.
    CAKeySet ActiveKeys = 11;
    // Note: this field is called AdditionalTrustedKeys instead of just
    // TrustedKeys to make it clear that these are not the only keys used to
    // validate certificates.
    // Callers must validate using ActiveKeys + AdditionalTrustedKeys.
    CAKeySet AdditionalTrustedKeys = 12;
}

message CAKeySet {
    repeated SSHKeyPair SSH = 1;
    repeated TLSKeyPair TLS = 2;
    repeated JWTKeyPair JWT = 3;
}

message SSHKeyPair {
    bytes PublicKey = 1;
    bytes PrivateKey = 2;
    PrivateKeyType PrivateKeyType = 3;
}

// TLSKeyPair already exists.
message TLSKeyPair {
    bytes Cert = 1;
    bytes Key = 2;
    // New field.
    PrivateKeyType KeyType = 3;
}

// JWTKeyPair already exists.
message JWTKeyPair {
    bytes PublicKey = 1;
    bytes PrivateKey = 2;
    // New field.
    PrivateKeyType PrivateKeyType = 3;
}

enum PrivateKeyType {
    RAW = 0;
    PKCS11 = 1;
}

Clients will use a concatenation of both CAKeySets. Auth servers will use ActiveKeys to sign new certificates. CA rotation can now swap ActiveKeys and AdditionalTrustedKeys instead of relying on indexes.

When PrivateKeyType is set to PKCS11, the PrivateKey/Key fields contain encoded information about the PKCS#11 key, as a JSON object prefixed with pkcs11::

pkcs11:{"host_id": "host-uuid", "key_id": "key-uuid"}

Where host-uuid is the UUID of the Auth server that created the key and key_id is the UUID of the key set in the CKA_ID attribute on the HSM.

Auth servers, when signing, will use:

  • if configured with PKCS#11, the ActiveKeys PKCS#11 key that belongs to their host (matched via host_id field)
  • if not configured with PKCS#11, the first non-PKCS#11 item in ActiveKeys

Configuration

Auth servers need PKCS#11 configuration, passed in via teleport.yaml:

auth_service:
  ca_key_params:
    pkcs11:
      # Path to PKCS#11 module from the HSM vendor.
      module_path: "/path/to/pkcs11.so"
      # CKA_LABEL of the hSM, set during HSM initialization.
      label: "myhsm"
      # PIN for connecting to the HSM, set during HSM initialization.
      # This can also be a path to a file containing the PIN.
      pin: "password"

If pkcs11 field is not set, Auth server will start with plaintext keys stored in the backend. If Auth server fails to connect to the token at startup with these parameters, the process will exit.

The fields from PKCS#11 configuration will also be added to the auth_server API resource as read-only.

Migration

Migration of existing clusters to use HSMs will involve a CA rotation. CA rotation is needed to propagate the CA certificates associated with the new private key to all users and hosts. Without the rotation, all certificates issued by this Auth server will not be trusted by any client.

On startup, Auth server will fetch existing CAs from the backend and:

  • if no CAs exist, create a key in HSM and register it in ActiveKeys
    • this takes care of brand new Teleport clusters
  • if CAs exist, but don't have an HSM key from this host registered in ActiveKeys, create a key in HSM and register it in AdditionalTrustedKeys
    • this key will be used to sign the Admin identity, necessary for tctl
    • this key is not expected to be trusted by the rest of the cluster until a CA rotation is completed, but will be trusted by the local auth server.
    • print a warning on startup and in tctl status until a CA rotation is completed.
    • all user and jwt signing operations against this Auth server will fail until rotation; when adding a new HSM-enabled Auth server to the cluster, admins should not route any traffic to it until the rotation is complete
  • if CAs exist and have an HSM key from this host, load the key and use it for signing
    • if the key is not found in the HSM (e.g. it was factory reset), the Auth server will exit; to resolve this, admins have to remove /var/lib/teleport/host_uuid to explicitly change the UUID and dis-associate the old HSM key from this instance; from there the Auth server behaves as above when registering a new key

When rotation is initiated through a different Auth server, Auth servers with HSMs will add a new PKCS#11 key reference to CA AdditionalTrustedKeys in the init phase of rotation. Admins should not perform a rotation with --grace_period=0 (or any very short value, less than 1 minute) so that all auth servers have a chance to add their keys. A message will be logged when this is completed.

Corner Cases

  1. HSM keys are deleted, or a new/different HSM is configured
  • The auth server will fail to find the existing keys and exit on startup.
  • The admin should remove /var/lib/teleport/host_uuid to explicitly change the UUID and dis-associate the old HSM key from this instance; from there the Auth server behaves as if it is brand new.
  1. HSM is de-configured (switching back to raw keys)
  • The auth server will add AdditionalTrusted raw keys and require a CA rotation, see Migration.
  1. HA cluster has a mix of HSM and non-HSM auth servers
  • This is supported.
  • Assume the cluster is in the middle of migrating, it's expected to migrate one auth server at a time.
  • Print a warning on startup that not all key types match to remind the admin to update all auth servers if they are trying to setup HSM support.
  1. Multiple auth servers connected to the same HSM
  • This is supported (might be common with CloudHSM).
  • All keys created in the HSM will be labelled with the UUID of the auth server, so the auth servers will not realize they are connected to the same HSM and there should be no difference.

Reliability

Using HSMs is inherently more fragile than using plaintext keys. Auth servers become much less dynamic and are tied to a specific machine with an HSM device. CA rotation becomes more complex, with new failure modes.

This design attempts to automate as much HSM management as possible, but still relies on human operators to:

  • provision and initialize HSM devices
  • not remove/reset/swap HSM devices when in use
  • carefully manage teleport.yaml on Auth servers to ensure correct PKCS#11 configuration
  • trigger CA rotations when adding new Auth servers with HSMs
    • and check logs / tctl status to ensure the new HSM keys are in use

Compatibility

HSM support should support any PKCS#11-compatible device, but will be tested with: