Commit graph

10 commits

Author SHA1 Message Date
Zac Bergquist 072956e4a0
docs: clarify /healthz and /readyz (#11085)
- Rename the page, since it's about diagnostics rather than metrics
  alone
- Change major section headings to H2s so they apper in the table of
  contents
- Move information about heartbeats and recovery to an H3 so it's
  more visible

Updates #10799

Co-authored-by: Paul Gottschling <paul.gottschling@goteleport.com>
2022-03-17 16:46:12 +00:00
NajiObeid 86a6abcfcb
lazy init of prometheus collectors (#6561)
* lazy init of prometheus collectors

* incorporate metrics intorduced in #6271

* linting

* tests

* pr changes

* tests

* pr changes
2021-05-19 11:53:36 -04:00
a-palchikov 6684c37103
Use fake clock consistently in units tests. (#5263)
Use fake clock consistently in units tests.
2021-01-12 12:10:00 +01:00
a-palchikov 7c87576a8b
flaky tests: consistent logging (#4849)
* Update logrus package to fix data races
* Introduce a logger that uses the test context to log the messages so they are output if a test fails for improved trouble-shooting.
* Revert introduction of test logger - simply leave logger configuration at debug level outputting to stderr during tests.
* Run integration test for e as well
* Use make with a cap and append to only copy the relevant roles.
* Address review comments
* Update integration test suite to use test-local logger that would only output logs iff a specific test has failed - no logs from other test cases will be output.
* Revert changes to InitLoggerForTests API
* Create a new logger instance when applying defaults or merging with file service configuration
* Introduce a local logger interface to be able to test file configuration merge.
* Fix kube integration tests w.r.t log
* Move goroutine profile dump into a separate func to handle parameters consistently for all invocations
2020-12-07 15:35:15 +01:00
Andrew Lytvynov ba6c4a1354 Get teleport /readyz state from heartbeats instead of cert rotation
Heartbeats are more frequent and result in more up-to-date /readyz
status. Concretely, it goes from ~10min status update to <1m.
Also, refactored the state tracking code to track the status of
individual teleport components (auth/proxy/node).
2020-09-14 23:55:35 +00:00
Andrew Lytvynov ba9c394a83 Start /readyz state tracking in stateStarting instead of stateOK
This only matters for nodes. The new stateStarting will be in effect
until the node successfully joins the cluster. This means that /readyz
for nodes will return '400 Bad Request' instead of '200 OK' until it
joins.

Updates #3700
2020-05-19 22:40:02 +00:00
Andrew Lytvynov cd1344a4a5 Add prometheus metric mirroring /readyz state
This allows users to get the health of their nodes from prometheus
metrics pipeline instead of polling readyz separately.

Updates #3700
2020-05-14 18:08:10 +00:00
Sasha Klizhentas f40df845db Events and GRPC API
This commit introduces several key changes to
Teleport backend and API infrastructure
in order to achieve scalability improvements
on 10K+ node deployments.

Events and plain keyspace
--------------------------

New backend interface supports events,
pagination and range queries
and moves away from buckets to
plain keyspace, what better aligns
with DynamoDB and Etcd featuring similar
interfaces.

All backend implementations are
exposing Events API, allowing
multiple subscribers to consume the same
event stream and avoid polling database.

Replacing BoltDB, Dir with SQLite
-------------------------------

BoltDB backend does not support
having two processes access the database at the
same time. This prevented Teleport
using BoltDB backend to be live reloaded.

SQLite supports reads/writes by multiple
processes and makes Dir backend obsolete
as SQLite is more efficient on larger collections,
supports transactions and can detect data
corruption.

Teleport automatically migrates data from
Bolt and Dir backends into SQLite.

GRPC API and protobuf resources
-------------------------------

GRPC API has been introduced for
the auth server. The auth server now serves both GRPC
and JSON-HTTP API on the same TLS socket and uses
the same client certificate authentication.

All future API methods should use GRPC and HTTP-JSON
API is considered obsolete.

In addition to that some resources like
Server and CertificateAuthority are now
generated from protobuf service specifications in
a way that is fully backward compatible with
original JSON spec and schema, so the same resource
can be encoded and decoded from JSON, YAML
and protobuf.

All models should be refactored
into new proto specification over time.

Streaming presence service
--------------------------

In order to cut bandwidth, nodes
are sending full updates only when changes
to labels or spec have occured, otherwise
new light-weight GRPC keep alive updates are sent
over to the presence service, reducing
bandwidth usage on multi-node deployments.

In addition to that nodes are no longer polling
auth server for certificate authority rotation
updates, instead they subscribe to event updates
to detect updates as soon as they happen.

This is a new API, so the errors are inevitable,
that's why polling is still done, but
on a way slower rate.
2018-12-10 17:20:24 -08:00
Cove Schneider 8b299e9c28 spelling cleanup 2018-11-15 12:44:51 -08:00
Russell Jones c18e33b71f Support different ready states. 2018-11-05 15:00:32 -08:00