---
authors: Michael McAllister (michael.mcallister@goteleport.com)
state: canceled
---
# RFD 60 - gRPC Backend
## What
A new [backend](https://github.com/gravitational/teleport/tree/v8.3.1/lib/backend) that sends requests to a user-defined gRPC server for persistence. Initially this targets cluster state; audit events will come later (in a subsequent RFD).
## Why
As Cloud-Hosted Teleport continues to grow, so does our need for greater flexibility in not only which persistence layer we use (be it a currently supported backend like DynamoDB, or otherwise) but _how_ it is implemented.
Take, for instance, the current implementation of the [DynamoDB backend](https://github.com/gravitational/teleport/tree/v8.3.1/lib/backend/dynamo), whose configuration allows specifying the table name but does not presently allow customizing the partition key. This means it is not possible to co-locate multiple installations of Teleport within the same DynamoDB table, so we are bounded by [AWS account limits](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ServiceQuotas.html#limits-tables) when deploying multiple installations.
Other backend solutions have their own challenges regarding authentication and authorization. For instance, most databases have limits on the number of users/roles that can be defined, which we would need to leverage to ensure strong data isolation between deployments. Using an authentication method that would allow one tenant to access another tenant's data is not an option.
Implementing a gRPC client within Teleport allows not only the Cloud team but also external developers to implement:
- A custom persistence layer that may not be supported by Teleport natively
- Custom caching logic, connection pooling, etc.
- Custom database schema, partition scheme, etc.
- Custom authentication and authorization (mTLS, LDAP, etc.)
## Details
### Scope
It is proposed that the Teleport core codebase hold the protocol buffer definitions and the corresponding client.
To keep this RFD brief, the Cloud team's (aspirational) implementation of the corresponding gRPC server is out of scope and not discussed in detail; it will also be wholly owned by the Cloud team and will not be part of the core codebase.
Additionally, audit events are initially out of scope but will be implemented at a later date.
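To illustrate the shape of the in-core client, below is a minimal sketch of a `backend.Backend` implementation that simply proxies calls to the generated gRPC stubs. The import paths, generated identifiers, and error handling are assumptions for illustration only, not a final design; only `Get` and `Close` are shown.

```go
package grpcbackend

import (
	"context"

	"google.golang.org/grpc"

	// Hypothetical import path for the generated stubs; the final location
	// and package name are not decided by this RFD.
	pb "github.com/gravitational/teleport/api/gen/proto/backendservice"
	"github.com/gravitational/teleport/lib/backend"
)

// Backend proxies every backend.Backend call to a user-defined
// BackendService gRPC server.
type Backend struct {
	client pb.BackendServiceClient
	conn   *grpc.ClientConn
}

// New dials the configured gRPC endpoint and returns the backend.
func New(ctx context.Context, addr string, opts ...grpc.DialOption) (*Backend, error) {
	conn, err := grpc.DialContext(ctx, addr, opts...)
	if err != nil {
		return nil, err
	}
	return &Backend{client: pb.NewBackendServiceClient(conn), conn: conn}, nil
}

// Get translates a backend Get call into the Get RPC.
func (b *Backend) Get(ctx context.Context, key []byte) (*backend.Item, error) {
	item, err := b.client.Get(ctx, &pb.GetRequest{Key: key})
	if err != nil {
		// In practice gRPC status codes would be mapped to trace errors here.
		return nil, err
	}
	return &backend.Item{Key: item.Key, Value: item.Value}, nil
}

// Close tears down the underlying connection.
func (b *Backend) Close() error {
	return b.conn.Close()
}
```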
### Protobuf Definition
The proposed protobuf definition aims to mirror the interface that is defined in [backend.go](https://github.com/gravitational/teleport/blob/cf162af679f3c136b0cc5a7c5bfcd8bba14afdaa/lib/backend/backend.go#L41-L91)
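For reference, an abridged (and possibly slightly paraphrased) view of that Go interface looks roughly like the following; consult the linked `backend.go` for the authoritative signatures.

```go
// Abridged from lib/backend/backend.go; see the link above for the
// authoritative definition.
type Backend interface {
	// Create creates item if it does not exist.
	Create(ctx context.Context, i Item) (*Lease, error)
	// Put puts value into backend (creates if it does not exist, updates otherwise).
	Put(ctx context.Context, i Item) (*Lease, error)
	// CompareAndSwap compares the expected item with the existing one and replaces it.
	CompareAndSwap(ctx context.Context, expected Item, replaceWith Item) (*Lease, error)
	// Update updates the value in the backend.
	Update(ctx context.Context, i Item) (*Lease, error)
	// Get returns a single item or a not found error.
	Get(ctx context.Context, key []byte) (*Item, error)
	// GetRange returns a range of items.
	GetRange(ctx context.Context, startKey []byte, endKey []byte, limit int) (*GetResult, error)
	// Delete deletes an item by key.
	Delete(ctx context.Context, key []byte) error
	// DeleteRange deletes a range of items.
	DeleteRange(ctx context.Context, startKey []byte, endKey []byte) error
	// KeepAlive updates the TTL on the lease.
	KeepAlive(ctx context.Context, lease Lease, expires time.Time) error
	// NewWatcher returns a new event watcher.
	NewWatcher(ctx context.Context, watch Watch) (Watcher, error)
}
```

With that interface in mind, the proposed protobuf definition is: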
```protobuf
message Item {
  // Key is a key of the key value item
  bytes key = 1;
  // Value is a value of the key value item
  bytes value = 2;
  // Expires is an optional record expiry time
  google.protobuf.Timestamp expires = 3 [
    (gogoproto.stdtime) = true,
    (gogoproto.nullable) = false
  ];
  // ID is a record ID, newer records have newer ids
  sfixed64 id = 4;
  // LeaseID is a lease ID, could be set on objects with TTL
  sfixed64 lease_id = 5;
}

message Lease {
  // Key is a key of the key value item
  bytes key = 1;
  // ID is a record ID, newer records have newer ids
  sfixed64 id = 2;
}

message Watch {
  // resume_last_id is the last item index we saw, and we want to resume from any event after that point
  sfixed64 resume_last_id = 1;
}

message CompareAndSwapRequest {
  // Expected is the existing item
  Item expected = 1;
  // replace_with is the new item to swap
  Item replace_with = 2;
}

message GetRequest {
  // Key is a key of the key value item
  bytes key = 1;
}

message GetRangeRequest {
  // start_key is the starting key to request in the range request
  bytes start_key = 1;
  // end_key is the ending key to request in the range request
  bytes end_key = 2;
  // limit is the maximum number of results to return
  int32 limit = 3;
}

message GetRangeResult {
  repeated Item items = 1;
}

message DeleteRequest {
  // Key is a key of the key value item
  bytes key = 1;
}

message DeleteRangeRequest {
  // start_key is the starting key to request in the range request
  bytes start_key = 1;
  // end_key is the ending key to request in the range request
  bytes end_key = 2;
}

message KeepAliveRequest {
  Lease lease = 1;
  google.protobuf.Timestamp expires = 2 [
    (gogoproto.stdtime) = true,
    (gogoproto.nullable) = false
  ];
}

message Event {
  // OpType specifies operation type
  enum OpType {
    UNKNOWN = 0;
    UNRELIABLE = 1;
    INVALID = 2;
    INIT = 3;
    PUT = 4;
    DELETE = 5;
    GET = 6;
  }
  OpType type = 1;
  Item item = 2;
}

service BackendService {
  // Create creates item if it does not exist
  rpc Create(Item) returns (Lease);
  // Put upserts value into backend
  rpc Put(Item) returns (Lease);
  // CompareAndSwap compares item with existing item and replaces it with the replace_with item
  rpc CompareAndSwap(CompareAndSwapRequest) returns (Lease);
  // Update updates value in the backend
  rpc Update(Item) returns (Lease);
  // Get returns a single item or not found error
  rpc Get(GetRequest) returns (Item);
  // GetRange returns query range
  rpc GetRange(GetRangeRequest) returns (GetRangeResult);
  // Delete deletes item by key, returns NotFound error if item does not exist
  rpc Delete(DeleteRequest) returns (google.protobuf.Empty);
  // DeleteRange deletes range of items with keys between start_key and end_key
  rpc DeleteRange(DeleteRangeRequest) returns (google.protobuf.Empty);
  // KeepAlive keeps object from expiring and updates the lease on the existing object;
  // expires contains the new expiry to set on the lease.
  // Some backends may ignore expires in case the lease is managed server side.
  rpc KeepAlive(KeepAliveRequest) returns (google.protobuf.Empty);
  // NewWatcher returns a new event watcher
  rpc NewWatcher(Watch) returns (stream Event);
}
```
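To make the server-side contract concrete, here is a minimal, non-production sketch of a `BackendService` implementation an external developer could write, backed by an in-memory map. The generated package path and stub names (`UnimplementedBackendServiceServer`, `RegisterBackendServiceServer`, field casing) are assumptions about the generated code, and only a few RPCs are shown.

```go
package main

import (
	"context"
	"net"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/types/known/emptypb"

	// Hypothetical import path for the generated server stubs.
	pb "example.com/backendservice/gen"
)

// memoryServer is a toy BackendService backed by a map.
// A real implementation would talk to DynamoDB, Postgres, etc.
type memoryServer struct {
	pb.UnimplementedBackendServiceServer

	mu    sync.Mutex
	items map[string]*pb.Item
}

// Put upserts the item and returns its lease.
func (s *memoryServer) Put(ctx context.Context, item *pb.Item) (*pb.Lease, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[string(item.Key)] = item
	return &pb.Lease{Key: item.Key, Id: item.Id}, nil
}

// Get returns the stored item or a NotFound status.
func (s *memoryServer) Get(ctx context.Context, req *pb.GetRequest) (*pb.Item, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	item, ok := s.items[string(req.Key)]
	if !ok {
		return nil, status.Error(codes.NotFound, "key not found")
	}
	return item, nil
}

// Delete removes the item, returning NotFound if it is absent.
func (s *memoryServer) Delete(ctx context.Context, req *pb.DeleteRequest) (*emptypb.Empty, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.items[string(req.Key)]; !ok {
		return nil, status.Error(codes.NotFound, "key not found")
	}
	delete(s.items, string(req.Key))
	return &emptypb.Empty{}, nil
}

func main() {
	lis, err := net.Listen("tcp", ":1992")
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer() // mTLS credentials would be added here in practice
	pb.RegisterBackendServiceServer(srv, &memoryServer{items: map[string]*pb.Item{}})
	_ = srv.Serve(lis)
}
```

A production implementation would replace the map with the actual persistence layer, implement the remaining RPCs (including the `NewWatcher` stream), and terminate mTLS.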
### Configuration
In order for the gRPC client (the Teleport side) to talk to a server, at minimum the address of the server providing the gRPC service will be required. Additionally, the following existing configuration properties will be reused:
- buffer size for client-side buffering
- mTLS details (CA, cert, key), as currently used for etcd; a sketch of how these could be loaded follows this list
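As a sketch of how the client could turn those mTLS properties into gRPC transport credentials (the helper name and wiring below are illustrative, not part of the proposal):

```go
package grpcbackend

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// dialOptions loads the mTLS material referenced by the storage
// configuration and builds gRPC transport credentials from it.
func dialOptions(caFile, certFile, keyFile string) ([]grpc.DialOption, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("failed to parse CA certificate")
	}
	creds := credentials.NewTLS(&tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      pool,
	})
	return []grpc.DialOption{grpc.WithTransportCredentials(creds)}, nil
}
```

The resulting dial options would then be passed to the client constructor along with the configured server address.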
Represented in YAML, the storage stanza of the configuration could look like:
```yaml
storage:
  type: grpc
  server: endpoint.example.com:1992
  tls_ca_file: /secrets/grpc/ca.crt
  tls_cert_file: /secrets/grpc/tls.crt
  tls_key_file: /secrets/tls.key
  audit_events_uri: ['dynamodb://example-ddb-table.events'] # Future iteration will support grpc://endpoint.example.com:1992
  audit_sessions_uri: s3://example-bucket/sessions
  continuous_backups: true
  auto_scaling: true
  read_min_capacity: 20
  read_max_capacity: 100
  read_target_value: 50.0
  write_min_capacity: 10
  write_max_capacity: 100
  write_target_value: 70.0
```