Commit graph

26 commits

Author SHA1 Message Date
Marco André Dinis 307dc99400
Metrics: expose install method counter (#30327)
* Metrics: expose install method counter

This PR adds a new metric that exposes the number of servers currently
running grouped by their install method.

Note: install method is a list of strings, so the metric sorts its values
and then joins them by "," to create a single identifier.

* do not mutate original install methods list
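
A minimal sketch of the sort-and-join step described above, written in Python for brevity rather than taken from the Teleport code. The function names and the install method values are hypothetical; only the behaviour (sort, join with ",", leave the input list untouched) comes from the commit message.

```python
def install_method_label(methods):
    """Build one identifier from a list of install methods."""
    # sorted() returns a new list, so the caller's list is not mutated.
    return ",".join(sorted(methods))


def count_by_install_method(servers):
    """Count servers per joined install-method identifier."""
    counts = {}
    for methods in servers:
        label = install_method_label(methods)
        counts[label] = counts.get(label, 0) + 1
    return counts


# Hypothetical install method names, in two different orders.
servers = [["systemctl", "dockerfile"], ["dockerfile", "systemctl"]]
print(count_by_install_method(servers))
```

Sorting before joining means two servers that report the same methods in a different order end up under the same label value.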
2023-08-18 15:38:14 +00:00
Roman Tkachenko 3c0f7fc779
Added Prometheus metric for created access requests (#29761) 2023-07-29 15:07:11 +00:00
Forrest a84681a8e4
upgrader monitoring and alerts (#28951)
* add rate limit stream helper

* upgrader metrics & alert

* add docs for discovering upgrade enroll prospects

* update prehog protos

* Update docs/pages/management/operations/enroll-agent-into-automatic-updates.mdx

Co-authored-by: Paul Gottschling <paul.gottschling@goteleport.com>

---------

Co-authored-by: Paul Gottschling <paul.gottschling@goteleport.com>
2023-07-17 14:42:27 +00:00
Tobiasz Heller 279b64177c
athena audit logs - add metrics (#26331) 2023-05-25 17:22:40 +00:00
Justinas Stankevičius aec3669d17
Hosted plugin manager prerequisites (#23922)
* Expose Ping() in bare auth server

* Handle both pointer and bare PluginStatusV1

* Add metric name

* Add StatusSink

* Run GCI

* Move comment back to auth_with_roles

* Update lib/auth/auth.go

Co-authored-by: Alan Parra <alan.parra@goteleport.com>

* Rework SetStatus

* Inline TryEmitStatus and use a proper context

* Fix copyright notice

* Fix bug in statusFromStatusCode

* Test statusFromResponse

* Add link to Slack API schema

* Refactor statusFromStatusCode

* Expand comment for Ping()

* Add basic check for status in slack test

* Address nits

---------

Co-authored-by: Alan Parra <alan.parra@goteleport.com>
2023-04-11 15:24:25 +00:00
Edward Dowling e1856cd8cb
Add incomplete session upload metric to the teleport namespace (#20351)
Also change name to match prometheus naming practices
2023-01-18 19:43:40 +00:00
Edward Dowling a4f972bbc4
Add metric for incomplete file uploads (#19724) 2023-01-16 16:58:54 +00:00
Vitor Enes 87f706d0ec
Track active migrations in Prometheus and tctl top (#19520)
This commit adds a new Prometheus gauge `teleport_migrations` that
tracks for each migration if it is active (1) or not (0).

This gauge is then leveraged in `tctl top` to show a set of active
migrations.
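
The gauge semantics above can be sketched as follows. This is an illustration in Python, not the Teleport implementation: the class, its methods, and the migration names are hypothetical, and a plain dictionary stands in for a real Prometheus gauge.

```python
class MigrationGauge:
    """One value per migration: 1 while active, 0 otherwise."""

    def __init__(self, known_migrations):
        # Every known migration starts out inactive.
        self.values = {name: 0 for name in known_migrations}

    def set_active(self, name, active):
        self.values[name] = 1 if active else 0

    def active(self):
        """Names a dashboard could list as active migrations."""
        return sorted(n for n, v in self.values.items() if v == 1)


gauge = MigrationGauge(["migration_a", "migration_b"])  # hypothetical names
gauge.set_active("migration_b", True)
print(gauge.active())
```

Reporting an explicit 0 for inactive migrations, instead of omitting them, lets a consumer distinguish "not active" from "not reported".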
2022-12-22 19:37:44 +00:00
Tim Buckley fba02d9f9d
Add a new usage reporter (#18142)
* [draft] Add a new usage reporter

This adds a new usage reporter service to the auth server. It's
disabled by default in OSS and can only be turned on via startup hook
in Cloud / Enterprise. In OSS, the audit log wrapper is never
configured and any usage events are sent to a no-op discard reporter.

Usage events are defined in prehog and can be sent to the new
UsageReporter Service on the auth server. An audit event wrapper is
used to capture certain events that are otherwise difficult to hook.
Events are anonymized before submission, then held in a non-blocking
queue for batching and submission purposes.

* Remove dead code

* Add SubmitUsageEvent RPC to Auth.

This adds a new SubmitUsageEvent RPC to the Auth API that external
clients (e.g. the UI) can use to submit usage events externally.

* Slight refactor for unit testing

* Add Prometheus metrics and add initial working prehog submitter

* Add more metrics, tweak prehog client, and add unit tests

* Further tweak http transport settings based on Teleport defaults

* Add missing metrics

* Fix goimports

* Add new UI usage events

* Update e ref

* Add prehog directly for now. Improve logging.

* update prehog

* Add new prehog events; use username from request identity

* add HTTP server for user events

* Add username back to pre-onboard events

* unauthenticated user events

* Fix userevent build error

* Use event-provided username where appropriate

* Move barebones prehog reqs to lib/prehog and generate here.

Also, use prod tunable values.

* Fix license lints

* De-flake tests by adding unfortunate amounts of synchronization.

* Add missing license header

* Misc PR cleanup for review

* Update lib/events/usageevents/usageevents.go

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

* Address a batch of review comments

Adds `anonymizer.AnonymizeString` and parent loggers

* Update e ref

* Clean up comments

* Remove onboard prefix from recovery code event

* Address another batch of feedback

* Use defaults.HTTPClient()

* Remove a noisy log message

* Demote noisy log message to debug

* Temporarily revert e ref for merge
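
As a rough illustration of the pipeline described in the first item of this commit (anonymize, enqueue without blocking, submit in batches), here is a sketch in Python. Everything in it is an assumption made for the example: the class and field names, the use of HMAC-SHA256 for anonymization, the choice to drop events when the queue is full, and the queue and batch sizes. It is not the prehog client or the Teleport usage reporter.

```python
import hashlib
import hmac
import queue


class UsageReporter:
    def __init__(self, key, max_queued=1000, batch_size=50, submit=print):
        self.key = key
        self.batch_size = batch_size
        self.submit = submit          # callable that receives one batch
        self.q = queue.Queue(maxsize=max_queued)
        self.dropped = 0

    def anonymize(self, value):
        # Assumption: a keyed hash stands in for the real anonymizer.
        return hmac.new(self.key, value.encode(), hashlib.sha256).hexdigest()

    def add(self, event):
        """Anonymize and enqueue; never block the caller."""
        anonymized = dict(event, user=self.anonymize(event["user"]))
        try:
            self.q.put_nowait(anonymized)
            return True
        except queue.Full:
            self.dropped += 1         # assumption: drop on overflow
            return False

    def flush(self):
        """Drain the queue in batches; return the number of batches sent."""
        sent = 0
        while not self.q.empty():
            batch = []
            while len(batch) < self.batch_size:
                try:
                    batch.append(self.q.get_nowait())
                except queue.Empty:
                    break
            if batch:
                self.submit(batch)
                sent += 1
        return sent
```

In this sketch a caller invokes `add` on the hot path and some periodic task calls `flush`, so a slow submission endpoint cannot stall the code that produces events.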

Co-authored-by: Michelle Bergquist <michelle.bergquist@goteleport.com>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
2022-12-05 17:13:54 +00:00
Carson Anderson 1b758ce929
Add grpc server and client metrics to Teleport (#11534)
Adds grpc metrics to the auth and proxy services, with the option to enable grpc latency via the metrics service.
2022-04-04 16:55:31 +00:00
Carson Anderson 4054c79c7e
Add metric to track number of ssh connect attempts (#11240)
* add ssh connect attempts metric

* fix help message wording

Co-authored-by: Paul Gottschling <paul.gottschling@goteleport.com>
2022-03-24 20:34:00 +00:00
Carson Anderson 266811f33e
add teleport_connected_resources metric (#9603)
This adds the Prometheus metric teleport_connected_resources. The gauge increments when the keepalive is established and decrements whenever the connection is broken or closed.
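
The increment-on-connect, decrement-on-close behaviour can be sketched like this. It is a Python illustration under assumptions: the `keepalive` helper and the per-type key are hypothetical, and a `Counter` stands in for the Prometheus gauge.

```python
from collections import Counter
from contextlib import contextmanager

# Stand-in for the gauge; keyed by a hypothetical resource type label.
connected_resources = Counter()


@contextmanager
def keepalive(resource_type):
    """Track one connection for as long as the with-block runs."""
    connected_resources[resource_type] += 1
    try:
        yield
    finally:
        # Runs on a clean close and on an error alike.
        connected_resources[resource_type] -= 1


with keepalive("node"):
    print(connected_resources["node"])
print(connected_resources["node"])
```

Putting the decrement in a `finally` clause keeps the gauge accurate when the connection breaks with an error rather than closing cleanly.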
2022-02-16 20:19:28 +00:00
Carson Anderson edff37226c
Add Prometheus metrics for cache events and stale events (#9826)
This adds two Prometheus metrics, teleport_cache_events and teleport_cache_stale_events, with one label indicating the service.
2022-02-11 09:14:42 -07:00
Carson Anderson b384de6007
Add teleport_reverse_tunnels_connected Prometheus metric (#9698)
Adds teleport_reverse_tunnels_connected Prometheus metric which tracks reverse tunnels connected to the proxy server by type.

* Update prometheus help

Co-authored-by: Paul Gottschling <paul.gottschling@goteleport.com>

* Update metrics wording

Co-authored-by: Paul Gottschling <paul.gottschling@goteleport.com>
2022-02-02 20:52:19 +00:00
Carson Anderson a8a57b19f8
Add metric tracking number of Teleport agents joined to cluster (#9749)
Adds the Prometheus metric teleport_registered_servers, a gauge indicating the number of unique Teleport instances connected to the cluster, by version.

Co-authored-by: Zac Bergquist <zmb3@users.noreply.github.com>
Co-authored-by: Paul Gottschling <paul.gottschling@goteleport.com>
2022-02-02 18:47:21 +00:00
Carson Anderson 6e3c703ddb
Add teleport_build_info Prometheus metric to Teleport (#9595)
Adds teleport_build_info metric to Teleport providing the gitref, version, and Go version.
2022-01-05 21:17:54 +00:00
Russell Jones 85b6727f8f Added metrics for missing SSH tunnels.
Added metrics and logging for missing SSH reverse tunnels. This is
useful for debugging to find if nodes are discovering all proxies.
2021-10-15 18:04:28 -07:00
rosstimothy fb0ab2b9b7
Watcher System Metrics (#8338)
* add event watcher prometheus metrics and a new tctl top tab to visualize them
2021-09-28 12:16:03 -04:00
Eugene Yakubovich 67c0eb3b4c Add restricted session
Adds the ability to block network traffic on SSH sessions.
The deny/allow lists of IPs are specified in teleport.yaml file.
Supports both IPv4 and IPv6 communication.

This feature currently relies on enhanced recording for
cgroup management so that needs to be enabled as well.

-- Design rationale:
This patch uses Linux Security Module (LSM) hooks, specifically
security_socket_connect and security_socket_sendmsg, to control
egress traffic. The LSM provides two advantages over socket filtering
program types.
- It's executed early enough that the task information is available.
  This makes it easy to report PID, COMM, etc.
- It becomes a model for extending restrictions beyond networking.

The set of enforced cgroups is stored in a BPF hash map and the
deny/allow lists are stored in BPF trie maps. An IP address is
first checked against the allow list. If found, it's checked for
an override in the deny list. The policy is default deny. However,
if the NetworkRestrictions API object is absent, all traffic is allowed.
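
The decision order in the paragraph above can be sketched in user space as follows. This is a Python illustration only: the patch does the lookup in BPF trie maps, whereas this sketch uses a linear scan over networks, and the function name and the dictionary layout are hypothetical.

```python
import ipaddress


def is_allowed(addr, restrictions):
    """Apply allow-then-deny with default deny to one address."""
    # No restrictions object at all: everything is allowed.
    if restrictions is None:
        return True
    ip = ipaddress.ip_address(addr)
    allow = [ipaddress.ip_network(n) for n in restrictions["allow"]]
    deny = [ipaddress.ip_network(n) for n in restrictions["deny"]]
    # Default deny: the address must match the allow list first.
    if not any(ip in net for net in allow):
        return False
    # A deny entry overrides a matching allow entry.
    return not any(ip in net for net in deny)


rules = {"allow": ["10.0.0.0/8"], "deny": ["10.1.0.0/16"]}
for addr in ("10.2.3.4", "10.1.2.3", "192.0.2.1"):
    print(addr, is_allowed(addr, rules))
```

Note the asymmetry: an empty allow list blocks everything, while a missing restrictions object blocks nothing.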

IPv4 addresses are additionally registered in IPv6 trie (as mapped)
to account for dual stacks. However, it is unclear if this is sufficient,
as 4-to-6 transition mechanisms use a multitude of translation and
tunneling methods.
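
To make the "registered in IPv6 trie (as mapped)" step concrete, here is a small Python sketch of converting an IPv4 network into its IPv4-mapped IPv6 form (`::ffff:0:0/96` plus the IPv4 bits). The helper name is hypothetical, and the sketch covers only the mapped representation, not the other transition methods the message mentions.

```python
import ipaddress


def to_mapped_v6(network):
    """Return the IPv4-mapped IPv6 network for an IPv4 network."""
    net = ipaddress.ip_network(network)
    if net.version == 6:
        return net
    # The 96-bit ::ffff: prefix is followed by the 32 IPv4 bits,
    # so the prefix length grows by 96.
    return ipaddress.IPv6Network(
        "::ffff:{}/{}".format(net.network_address, 96 + net.prefixlen)
    )


mapped = to_mapped_v6("10.0.0.0/8")
print(mapped)
print(ipaddress.ip_address("::ffff:10.1.2.3") in mapped)
```

With both the original and the mapped entry registered, a dual-stack socket that connects to `::ffff:10.1.2.3` is matched by the same rule as a plain IPv4 connection to `10.1.2.3`.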
2021-07-16 16:49:04 -07:00
jane quin 7c9fd8e50d
Add additional Prometheus Metrics (#6511) 2021-04-28 15:46:27 -07:00
Andrew Lytvynov 3004b65019 proxy: add proxy_ssh_sessions_total metric
This is similar to server_interactive_sessions_total, but tracks all
SSH sessions through a proxy.
2020-09-18 20:57:34 +00:00
Andrew Lytvynov 96375c7d3d tctl: fix tctl top colors on dark terminals
If we leave `TextStyle` empty on UI elements, it will use the default
foreground color defined by the terminal (light for dark terminals and
vice versa). Same goes for `BorderStyle`.

A few other tweaks to UI and source metrics:
- update table ratios to prevent hiding output rows on short (height)
  terminal windows
- update tab selector style to use bold/underline instead of colors to
  mark selected tab
- print `No data` in histogram tables when there are no values
- don't report the local cluster in `remote_clusters` metric
2020-08-19 22:17:17 +00:00
Andrew Lytvynov cd1344a4a5 Add prometheus metric mirroring /readyz state
This allows users to get the health of their nodes from their Prometheus
metrics pipeline instead of polling /readyz separately.

Updates #3700
2020-05-14 18:08:10 +00:00
Russell Jones 77e8b63470 Enhanced Session Recording.
Added package cgroup to orchestrate cgroups. Only cgroup2 is supported,
because cgroup2 cgroups have unique IDs that can be correlated with BPF
events.

Added bpf package that contains three BPF programs: execsnoop,
opensnoop, and tcpconnect. The bpf package starts and stops these
programs, as well as correlating their output with Teleport sessions
and emitting it to the audit log.

Added support for Teleport to re-exec itself before launching a shell.
This allows Teleport to start a child process, capture its PID, place
the PID in a cgroup, and then continue the process. Once the process is
continued it can be tracked by its cgroup ID.

Reduced the total number of connections to a host so Teleport does not
quickly exhaust all file descriptors. This happens very quickly when
disk events, which are generated at a very high rate, are emitted to
the audit log.

Added tarballs for exec sessions. Updated session.start and session.end
events with additional metadata. Updated the format of session tarballs
to include enhanced events.

Added file configuration for enhanced session recording. Added code to
start up enhanced session recording and pass the package to SSH nodes.
2019-12-02 15:10:39 -08:00
Alexander Klizhentas 6b5935fb71
Use RADIX trees for prefix matching. (#2666)
Buffer fan-out used a simple prefix match
in a loop, which resulted in high CPU load
with many connected watchers.

This commit switches to RADIX trees for
prefix matching, which reduces CPU load
substantially for 5K+ connected watchers.
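
To illustrate why a tree lookup helps here, the sketch below uses a plain character trie in Python. It is a simplification, not the radix tree from the commit (a radix tree also compresses single-child chains), and the class, keys, and watcher names are hypothetical. The point it shows is that finding the watchers for a key costs one walk along the key, independent of how many watchers are registered.

```python
class PrefixTrie:
    def __init__(self):
        self.children = {}
        self.watchers = []

    def insert(self, prefix, watcher):
        """Register a watcher for every key starting with prefix."""
        node = self
        for ch in prefix:
            node = node.children.setdefault(ch, PrefixTrie())
        node.watchers.append(watcher)

    def match(self, key):
        """Return every watcher whose prefix is a prefix of key."""
        node = self
        found = list(node.watchers)
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                break
            found.extend(node.watchers)
        return found


trie = PrefixTrie()
trie.insert("/nodes/", "w1")
trie.insert("/nodes/default/", "w2")
trie.insert("/roles/", "w3")
print(trie.match("/nodes/default/n1"))
```

The loop-based alternative would compare the key against each registered prefix in turn, which is the per-event cost that grows with the number of watchers.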
2019-04-22 15:28:04 -07:00
Sasha Klizhentas 8356ae6a74 Use in-memory cache for the auth server API.
This commit expands the usage of the caching layer
for auth server API:

* Introduces in-memory cache that is used to serve all
Auth server API requests. This is done to achieve scalability
on 10K+ node clusters, where each node fetches certificate authorities,
roles, users and join tokens. It is not possible to scale the
DynamoDB backend or other backends to 10K reads per second
on a single shard or partition. The solution is to introduce
an in-memory cache of the backend state that is always used
for reads.

* In-memory cache has been expanded to support all resources
required by the auth server.

* Experimental `tctl top` command has been introduced to display
common single node metrics.

Replace SQLite Memory Backend with BTree

The SQLite in-memory backend was suffering from
high tail latencies under load (up to 8 seconds
at the 99.9th percentile in load configurations).

This commit replaces the SQLite memory caching
backend with an in-memory BTree backend that
brought tail latencies down to 2 seconds (99.9th percentile)
and improved overall performance.
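
For readers unfamiliar with ordered in-memory backends, here is a small Python sketch of the idea. It deliberately swaps the BTree for a sorted key list searched with `bisect`, which gives the same ordered range reads in a few lines. The class, its methods, and the key layout are hypothetical and are not the Teleport backend interface.

```python
import bisect


class MemoryBackend:
    """Ordered key-value store kept entirely in memory."""

    def __init__(self):
        self.keys = []    # always sorted
        self.items = {}

    def put(self, key, value):
        if key not in self.items:
            bisect.insort(self.keys, key)
        self.items[key] = value

    def get(self, key):
        return self.items.get(key)

    def get_range(self, start, end):
        """Return (key, value) pairs with start <= key <= end, in order."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_right(self.keys, end)
        return [(k, self.items[k]) for k in self.keys[lo:hi]]


backend = MemoryBackend()
backend.put("/nodes/a", 1)
backend.put("/roles/x", 3)
backend.put("/nodes/b", 2)
print(backend.get_range("/nodes/", "/nodes/\xff"))
```

Keeping keys ordered is what makes prefix-style range reads cheap: the store locates the two ends of the range by binary search and returns the slice between them.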
2019-04-12 14:23:09 -07:00