---
authors: Tobiasz Heller (tobiasz.heller@goteleport.com)
state: draft
---
# RFD 0118 - Scalable audit logs
## Required Approvers
* Engineering: @rosstimothy && @zmb3
* Security: @reed
* Product: (@xinding33 || @klizhentas)
## What
Allow Teleport to use a combination of SNS, SQS, Athena and S3 to provide a
scalable and searchable audit log mechanism.
In this RFD we focus on integrating a scalable datastore with the existing
interfaces. A separate RFD will cover UI changes and advanced search
capabilities.
## Why
The motivation is explained in the [Cloud RFD](https://github.com/gravitational/cloud/pull/3062).
## Solution
* Ingestion phase - Auth instances send events in proto format to an AWS SNS
  topic backed by an SQS queue
* Transform and store phase - A single auth instance consumes events from the
  queue in batches and produces Parquet files stored in an S3 bucket for
  long-term storage
* Query phase - Athena queries the S3 bucket with Parquet files partitioned by
  date (partition info is stored in a Glue table)
```mermaid
flowchart LR
subgraph IngestPhase[Ingest phase]
t1A1[Auth 1]
t1A2[Auth 2]
t1topic[SNS]
t1A1 --> |events proto/json|t1topic
t1A2 --> |events proto/json|t1topic
t1queue1[SQS]
end
subgraph TransformStorePhase[Transform and store phase]
t1SinkS3[Auth]
t1S3folder[S3 Bucket Long Term storage]
t1SinkS3 --> |Parquet file|t1S3folder
end
subgraph QueryPhase[Query phase]
Athena[Athena]
t1GlueTable[Glue table]
end
t1topic --> t1queue1
t1queue1 --> |consumer|t1SinkS3
t1S3folder <--> Athena
```
The SNS + SQS components are used because they allow us to buffer events and to
extend the solution later, for example with an Export API over the queue or a
Lambda that alerts on certain events.
### Ingestion phase
The new `EmitAuditEvent` implementation will consist of the following steps:
1. Check the message size and either trim the event or upload it via S3
2. Marshal the event to proto
3. Send the message to SNS
Currently DynamoDB supports payloads of at most 400 KB, while SNS + SQS support
a maximum message size of 256 KB. In rare cases events can be larger than
256 KB. We will use a mechanism similar to the [extended SNS library for
Java](https://docs.aws.amazon.com/sns/latest/dg/large-message-payloads.html).
It allows specifying an S3 bucket to which messages larger than the limit are
uploaded. The client then sends only the S3 link to the payload over SNS/SQS.
An SNS/SQS message consists of a `payload` and `messageAttributes`. The
`payload` can only be a valid UTF-8 string.
`messageAttributes` will be used on the SQS side to determine the type of the
payload. This allows us to extend the format later, for example by adding
compression before base64.
We will use two kinds of payloads for now:
1. Base64-encoded proto event marshaled as the OneOf type from apievents.
It will be sent with the attribute `raw_proto_event`.
2. Base64-encoded proto of a new message with the S3 location of the payload.
It will be sent with the attribute `s3_event`.
`s3_event` will use the following proto message:
```proto
message S3EventPayload {
  string path = 1;
  // Custom KMS key for server-side encryption.
  string ckms = 2;
}
```
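
The choice between the two payload kinds could be sketched as follows. This is only an illustration under stated assumptions: `outgoing`, `uploadFunc` and `preparePayload` are hypothetical names, and the upload hook stands in for whatever code writes the event to S3 and marshals `S3EventPayload`. The size check is done on the base64 length, since base64 grows the data by roughly 4/3.

```go
package main

import "encoding/base64"

// Attribute values taken from the text above.
const (
	attrRawProtoEvent = "raw_proto_event"
	attrS3Event       = "s3_event"
)

// maxMessageBytes is the 256 KB SNS/SQS limit cited above. Reserving
// headroom for message attributes is left out of this sketch.
const maxMessageBytes = 256 * 1024

// outgoing is a hypothetical container for what would be published to SNS.
type outgoing struct {
	Attribute string // payload type, sent as a message attribute
	Body      string // base64 text, always valid UTF-8
}

// uploadFunc is a hypothetical hook: it stores the raw event in S3 and
// returns the marshaled S3EventPayload that points at it.
type uploadFunc func(raw []byte) ([]byte, error)

// preparePayload sends small events inline and large ones via S3.
func preparePayload(marshaledEvent []byte, upload uploadFunc) (outgoing, error) {
	if base64.StdEncoding.EncodedLen(len(marshaledEvent)) <= maxMessageBytes {
		return outgoing{
			Attribute: attrRawProtoEvent,
			Body:      base64.StdEncoding.EncodeToString(marshaledEvent),
		}, nil
	}
	pointer, err := upload(marshaledEvent)
	if err != nil {
		return outgoing{}, err
	}
	return outgoing{
		Attribute: attrS3Event,
		Body:      base64.StdEncoding.EncodeToString(pointer),
	}, nil
}
```

Note that with this check a raw event of about 192 KB is already too large to be sent inline.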
#### Proto vs JSON
We could use either JSON or proto as the format for passing data over SNS/SQS.
Proto should be at least 2x smaller and faster to marshal/unmarshal.
The advantage of JSON is that we don't need to know the message structure at
all.
Since auth both emits and processes events, it should always contain the
newest version of the proto definitions and be able to decode them. In rare
cases during an update there could be auth instances running 2 different
versions (although we recommend rolling updates), but a simple NACK on unknown
messages, followed by a retry on an updated instance, should solve the issue.
We decided to go with proto.
### Transform and store phase
The consumer will be implemented in one of the auth instances. We will use a
locking mechanism acquired on the backend, so that only a single instance does
the job. There is already a mechanism for that called
[RunWhileLocked](https://github.com/gravitational/teleport/blob/11eaf9657dcdd9f4c8b73a3880c5648db0139aec/lib/backend/helpers.go#L137-L171).
It checks the backend at a 250ms interval to see whether the lock can be
acquired. I think it makes sense to make that interval configurable in the
`RunWhileLocked` function and set it to 10s. The lock TTL should be set to 30s.
It will be automatically refreshed while the job is still running, so the TTL
only matters if an auth instance dies and another instance should take the
lead.
The consumer will fetch events from the queue and write them to S3 in batches,
flushing every `INTERVAL` or once `MAX_BUFFER_ITEMS` is reached, whichever
comes first.
Flow of actions:
1. Fetch events from the queue
2. Group events by date in `YYYY-MM-DD` format based on UTC time (a batch can
contain events from different dates, for example during a migration)
3. Write Parquet files to S3
4. Delete messages from the queue (aka ack)
A batch delete request to the queue accepts at most 10 messages. This means
that some messages may not be acked (due to a failure) even though the S3 files
have already been written. The resulting duplicates are handled during the
query phase.
If writing a Parquet file fails, the whole batch should be NACKed.
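
Acking in chunks of 10 might look like the sketch below. `deleteBatchFunc` is a hypothetical stand-in for the SQS batch delete call, not the AWS SDK signature. Handles from failed chunks are returned so the caller can log them. Those messages are redelivered later and show up as duplicates.

```go
package main

import "errors"

// maxDeleteBatch mirrors the 10-message limit mentioned above.
const maxDeleteBatch = 10

// errPartialAck signals that at least one chunk could not be deleted.
var errPartialAck = errors.New("some messages were not acknowledged")

// deleteBatchFunc is a hypothetical stand-in for the batch delete call.
type deleteBatchFunc func(receiptHandles []string) error

// ackAll deletes messages in chunks of at most maxDeleteBatch. It keeps
// going after a failed chunk and returns the handles that were not deleted.
func ackAll(receiptHandles []string, del deleteBatchFunc) ([]string, error) {
	var failed []string
	for start := 0; start < len(receiptHandles); start += maxDeleteBatch {
		end := start + maxDeleteBatch
		if end > len(receiptHandles) {
			end = len(receiptHandles)
		}
		chunk := receiptHandles[start:end]
		if err := del(chunk); err != nil {
			failed = append(failed, chunk...)
		}
	}
	if len(failed) > 0 {
		return failed, errPartialAck
	}
	return nil, nil
}
```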
We will store basic information (`event_time`, `event_type`,
`session_id`, `audit_id(uid)`, `user`) as top-level columns in the Parquet
files. Additionally, there will be an `event_data` column that stores a string
containing the marshaled data of the whole audit event.
Data in S3 will be stored under the following path:
`$S3_EVENTS_LOCATION/year-month-day/<suffix-generated-by-worker+timestamp>.parquet`
Object locking must be used to prevent tampering with events. It must be
enabled when the bucket is created. The bucket should be different from the one
used for session recordings.
Parquet files will use Snappy compression.
Data retention should be defined at the bucket level when the bucket is
created.
### Query phase
When running a query, Athena first checks the Glue table and its schema. An AWS
Glue table is used to store and retrieve table metadata for the Amazon S3 data.
The table metadata lets the Athena query engine know how to find, read, and
process the data that you want to query. We will use partition projection to
avoid manually creating partitions.
Creating the table and database should be done by the tenant operator. The DDL
is included here just to provide more context.
```sql
CREATE EXTERNAL TABLE auditevents_tenantid (
  `uid` string,
  `session_id` string,
  `event_type` string,
  `user` string,
  `event_time` timestamp,
  `event_data` string
)
PARTITIONED BY (
  event_date DATE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION "s3://teleport-cloud-tenants-audit-logs/tenantid/"
TBLPROPERTIES (
  "projection.enabled" = "true",
  "projection.event_date.type" = "date",
  "projection.event_date.format" = "yyyy-MM-dd",
  "projection.event_date.range" = "NOW-4YEARS,NOW",
  "projection.event_date.interval" = "1",
  "projection.event_date.interval.unit" = "DAYS",
  "storage.location.template" = "s3://teleport-cloud-tenants-audit-logs/tenantid/${event_date}/",
  "classification" = "parquet",
  "parquet.compression" = "SNAPPY"
)
```
Example queries:
```sql
/* get events for given date */
SELECT DISTINCT event_data, event_time, uid
FROM auditevents_tenantid
WHERE event_date = date('2023-02-14')
ORDER BY event_time DESC, uid DESC;

/* get events for specific db instance */
SELECT DISTINCT event_data
FROM auditevents_tenantid
WHERE event_date >= date('2023-02-14')
  AND event_type = 'db.session.query'
  AND json_extract_scalar(event_data, '$.db_instance') = 'production.postgres';
```
Querying data from Athena is a combination of 3 operations:
1. `startQueryExecution` (starts a new query)
2. `getQueryExecution` (checks the execution status)
3. `getQueryResults` (downloads the query results)
Results of a query execution are stored in an S3 bucket (either the default for
the workgroup or one specified in the StartQueryExecution call).
`getQueryResults` downloads the results from that bucket.
The `ExecutionParameters` field of the StartQueryExecution endpoint must be
used to pass query parameters. This approach protects us from SQL injection.
`getQueryExecution` will be called at a defined interval taken from the config
(defaults to 100ms).
#### Pagination support
Both `SearchEvents` and `SearchSessionEvents` support pagination of results by
accepting `startKey` and `limit` as part of their signature.
When querying large amounts of data in Athena, the recommended pattern is to
run the query without a limit only once and use `getQueryResults` to iterate
over the results.
Because Athena stores query results on S3, they can be downloaded by specifying
`queryID` and an optional `offsetKey`.
We have decided not to follow that pattern because it exposes us to the risk of
one user reading another user's results. If a malicious user can guess a
queryID and offsetKey, they can get data from someone else's query. Guessing a
queryID (uuid) and offsetKey is unlikely, but it would result in an RBAC
bypass, because the RBAC for SearchSessionEvents is non-trivial. If a user has
the `session.list` permission with a specific where condition such as
`contains(session.participants, user.metadata.name)`, then by guessing a
queryID and offsetKey the user bypasses RBAC, because we would download
existing results instead of executing the query.
The workaround is to use standard SQL pagination with a limit and to always
re-execute the query instead of downloading stored results.
```sql
SELECT event_time, uid, event_data
FROM athena_table
WHERE ...
AND (event_time, uid) < (event_time_from_start_key, uid_from_start_key)
ORDER BY event_time DESC, uid DESC LIMIT 5000
```
### Configuration
The audit logger can be configured in a similar manner to dynamo or
firestore - by using query parameters.
The following parameters will be used for configuration:
```
glueTableName - required
glueDatabaseName - required
getQueryExecutionSleepTime - optional, default 100ms
snsTopicARN - required
snsS3LocationForLargeEvents - required
athenaWorkgroup - optional, defaults to the default workgroup
athenaResultsS3Path - optional, defaults to the location defined in the workgroup
sqsURL - required
batchInterval - optional, defaults to 1min
maxBatchSize - optional, defaults to 20000 events (+/- 10MB)
QPS - optional, queries per second in athena search.events, defaults to 20 req/s
```
Example configuration can look like:
```
athena://glueDatabaseName.glueTableName?snsTopicARN=aaa&athenaResultsS3Path=s3://bbb
```
Configuration using URL query params seems a bit hacky, but we decided to keep
using it for the MVP.
### Infrastructure setup
In the MVP Teleport won't set up any infrastructure. In the cloud version, the
tenant operator will handle it. For self-hosted customers, we will provide docs
on how to set up the infrastructure manually before using Athena-based search.
In the future, bootstrapping of the infrastructure could be added to the
Teleport codebase.
### Rate limiting of search events
Athena Service Quotas can be tight in certain cases (for example Teleport Cloud
with tenants sharing a quota pool). To address that, we decided to introduce a
new rate limiting mechanism which works per auth instance for all users, not
per IP.
So far there seems to be no need for that kind of rate limiting in other places
of the Teleport codebase, so I suggest passing it as an additional parameter to
`athena` and implementing the rate limit just inside `athena`.
Alternatively, we can extend `ClusterAuditConfigSpecV2` with a new type which
defines `ServiceLevelRateLimit`. For now it would contain just one field,
`QPS`, which defines the number of queries per second and affects only read
operations. It may turn out that we need more granularity for other services;
that's why I think we should start with just a query param to `athenasearch`
and rework it later if we have other use cases.
### Security
We will leverage the encryption mechanisms provided by S3 (either SSE-S3 or
SSE-KMS), which work with Athena. Protection against event tampering is
provided by the object lock mechanism on S3. Athena itself does not store data
(only metadata in the Glue table). All data is stored on S3.
SQS and SNS should also be configured with encryption.
### UX
It's worth mentioning that the proposed solution will result in slightly slower
rendering of the audit logs page (up to 1.5s).
Moreover, due to the buffer interval (1-15min, recommended 1min), the latest
events shown in the UI can be delayed by up to the value of the buffer
interval.
In this RFD we focus on integrating a scalable datastore with the existing
interfaces. A separate RFD will cover UI changes and advanced search
capabilities.