authors: Grzegorz Zdunek (grzegorz.zdunek@goteleport.com)
state: draft

RFD 97 - Teleport Connect usage metrics

Required Approvers

  • Engineering: @ravicious @zmb3
  • Product: @klizhentas @xinding33

What

Collect downloads and usage metrics of Teleport Connect.

Why

Currently, the team has no information on how many users download, install, and use Teleport Connect on a daily basis. To plan the development of the product effectively, the team should also know how widely new features are adopted, which features are the most popular, and which are problematic for users.

Details

Collecting events

Events for Connect will be collected on the client side. The TypeScript code will have a stateless metrics service that forwards events to a gRPC handler exposed by the tsh daemon, which will ultimately submit them to a service called prehog. To avoid flooding the backend with a large number of small requests, events will be batched before being sent to prehog. The batching mechanism has already been implemented in UsageReporter, which will be used for collecting cluster events. The tsh daemon will try to reuse as much of that code as possible (by providing its own batching parameters and submit function). Events will be sent once every hour (this may change) and before the app closes.
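The batching idea can be illustrated with a minimal Go sketch. It is not the UsageReporter code: the `Batcher`, `Event`, and `SubmitFunc` names, the size-based flush, and the choice to drop a batch when submission fails are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical, simplified usage event.
type Event struct {
	Name       string
	Properties map[string]string
}

// SubmitFunc sends one batch to the backend (prehog in this RFD).
type SubmitFunc func(batch []Event) error

// Batcher buffers events and submits them in batches instead of one by one.
type Batcher struct {
	mu       sync.Mutex
	pending  []Event
	maxBatch int
	submit   SubmitFunc
}

// NewBatcher takes the batching parameter and the submit function from the caller.
func NewBatcher(maxBatch int, submit SubmitFunc) *Batcher {
	return &Batcher{maxBatch: maxBatch, submit: submit}
}

// Add buffers an event and flushes as soon as the buffer reaches maxBatch.
func (b *Batcher) Add(e Event) error {
	b.mu.Lock()
	b.pending = append(b.pending, e)
	full := len(b.pending) >= b.maxBatch
	b.mu.Unlock()
	if full {
		return b.Flush()
	}
	return nil
}

// Flush submits everything buffered so far. In this sketch a failed batch
// is dropped rather than retried.
func (b *Batcher) Flush() error {
	b.mu.Lock()
	batch := b.pending
	b.pending = nil
	b.mu.Unlock()
	if len(batch) == 0 {
		return nil
	}
	return b.submit(batch)
}

func main() {
	b := NewBatcher(2, func(batch []Event) error {
		fmt.Printf("submitting %d event(s)\n", len(batch))
		return nil
	})
	_ = b.Add(Event{Name: "connect.cluster.login"})
	_ = b.Add(Event{Name: "connect.fileTransfer.run"}) // buffer is full, 2 events are submitted
	_ = b.Add(Event{Name: "connect.accessRequest.create"})
	_ = b.Flush() // e.g. on the hourly tick or before the app closes
}
```

Passing the submit function in from outside mirrors the idea that the tsh daemon supplies its own batching parameters and submit function while sharing the batching logic.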

Using an authorized endpoint provided by the cluster's Auth Server was considered, but it does not work well for Connect for a few reasons:

  • Some events may not belong to any cluster (at the time of writing this RFD there is no such event, but the solution should be future-proof).
  • A batch can contain events from multiple clusters.
  • A batch can be sent after the session expires.

Anonymization

NOTE: The anonymization solution described below applies only to events that are associated with a cluster. Events that do not belong to any cluster but contain sensitive data will have to be anonymized in a different way.

Each event that contains sensitive data, such as a cluster name, needs to be anonymized. This will be done in the tsh daemon, the same way as in the Auth Server: using HMAC with the unique cluster id as the key. Connect will reuse the same code. The only issue with anonymizing events client-side is that the cluster id is kept in the Auth Server and is not available to the client. To remedy this, when the app starts and retrieves cluster information, it should also retrieve the cluster id, create an anonymizer, and store it in the cluster struct.
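As a rough illustration of HMAC-based anonymization keyed by the cluster id, consider the Go sketch below. It is not the Auth Server code that Connect would reuse: the `Anonymizer` type, the choice of SHA-256 as the hash, and the base64 output encoding are assumptions, and the cluster id is a made-up value.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// Anonymizer replaces sensitive strings with keyed hashes.
type Anonymizer struct {
	key []byte
}

// NewAnonymizer uses the cluster id as the HMAC key.
func NewAnonymizer(clusterID string) *Anonymizer {
	return &Anonymizer{key: []byte(clusterID)}
}

// Anonymize returns base64(HMAC-SHA256(key, value)).
// SHA-256 and base64 are assumptions of this sketch.
func (a *Anonymizer) Anonymize(value string) string {
	mac := hmac.New(sha256.New, a.key)
	mac.Write([]byte(value))
	return base64.StdEncoding.EncodeToString(mac.Sum(nil))
}

func main() {
	// Hypothetical cluster id; in the RFD it is retrieved together with
	// the rest of the cluster information when the app starts.
	a := NewAnonymizer("hypothetical-cluster-id")
	fmt.Println(a.Anonymize("example.teleport.sh")) // anonymized cluster_name
	fmt.Println(a.Anonymize("alice"))               // anonymized user_name
}
```

Because the key is the cluster id, the same user name always maps to the same value within one cluster but to a different value in another cluster. This keeps events from one cluster correlatable without exposing the original names.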

Storing events

Batches of anonymized events will be sent to a public endpoint in prehog (intended for use only by Connect) that translates them into PostHog's data model.

Connect events will share the same project with cluster and website events. This will allow queries that need both sources of data, such as calculating the percentage of users who log in to a cluster with Connect.

Some event properties, such as the OS, can be saved as user properties. These properties are then stored directly on each event. For example, when the first emitted event sets a user property os: windows, each subsequent event will have this property set.

To differentiate events coming from multiple application instances, each event needs to have a distinct_id field. It will be set to a UUID generated by Connect, with the connect. prefix. The value will be created on the first start and stored in a file in the app data directory, so it will not change between restarts.

NOTE: As stated above, Connect events are tied to the application instance (or just the client machine). This means that a PostHog Person for Connect and a Person for a cluster are different things. They should not be merged.

User agreement

On start, Teleport Connect will ask the user to opt in to sending anonymized metrics and usage data, with the standard message "Are you OK sending anonymized usage data about Teleport Connect? This will help us to improve the product".

If the user refuses, Connect will not send any usage data.

How will collecting metrics support product development?

In the initial version, it should help with getting answers to the following questions:

How many unique users download and use Teleport Connect today?

To answer the first part of the question, download counts from goteleport.com/download are needed. These will be collected from the CloudFront CDN access logs.

To calculate how many users use Teleport Connect on a daily basis, a metric like DAU (Daily Active Users) can be used. This metric can be based on a specific event, but in this case it should be calculated using any event. For example, a user who logged in only once on a given day to refresh certs for a DB proxy connection can be considered active.

Usage of each feature will be measured based on the events from the Events section. They will make it possible to generate various statistics, such as the most common kinds of connections, or simply to show the usage of particular features like Access Requests.

Usage per platform will be measured in two ways:

  • Based on the download count for each platform.
  • Based on real usage: every event will contain the OS field, and these events will then be aggregated by unique users.

How does usage grow or shrink over time?

PostHog allows creating Trends based on DAU. They can be used to show how usage changes over a given period of time.

Events

connect.cluster.login

Successful login to a cluster.

Event properties:

  • cluster_name: string (anonymized)
  • user_name: string (anonymized)
  • connector_type: string
  • os: string (set once as a user property)
  • arch: string (set once as a user property) - CPU architecture
  • os_version: string (set as a user property)
  • connect_version: string (set as a user property)
  • distinct_id: string

connect.protocol.run

Starting a connection over a given protocol.

Event properties:

  • cluster_name: string (anonymized)
  • user_name: string (anonymized)
  • protocol: one of ssh/proxy_db/kube
  • distinct_id: string

connect.accessRequest.create

Creating an access request.

Event properties:

  • cluster_name: string (anonymized)
  • user_name: string (anonymized)
  • kind: one of role/resource
  • distinct_id: string

connect.accessRequests.review

Reviewing an access request.

Event properties:

  • cluster_name: string (anonymized)
  • user_name: string (anonymized)
  • distinct_id: string

connect.accessRequests.assumeRole

Assuming a requested role.

Event properties:

  • cluster_name: string (anonymized)
  • user_name: string (anonymized)
  • distinct_id: string

connect.fileTransfer.run

Running file transfer.

Event properties:

  • cluster_name: string (anonymized)
  • user_name: string (anonymized)
  • direction: one of upload/download
  • distinct_id: string