* Replace Upload completer grace period logic with session tracker checking to accurately determine whether an upload has been abandoned
* Update session tracker expiration to be 1 hour, and dynamically extend it while the session is active.
Fixes #11065.
This commit:
- ensures that `TeleportReadyEvent` is only produced when all components that send heartbeats (i.e. call [`process.onHeartbeat`](16bf416556/lib/service/service.go (L358-L366))) are ready
- changes `TeleportProcess.registerTeleportReadyEvent` so that it returns a count of these components (let's call it `componentCount`)
- uses `componentCount` to also ensure that `stateOK` is only reported when all the components have sent their heartbeat, thus fixing #11065
Because it is hard to know when `TeleportProcess.registerTeleportReadyEvent` needs to be updated, the following checks are meant to detect a wrong count as soon as it is introduced:
1. if `componentCount` is lower than it should be, then the service fails to start (due to #11725)
2. if `componentCount` is higher than it should be, then an error is logged in function `processState.getStateLocked`.
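The idea of reporting an OK state only after every heartbeating component has reported can be sketched as follows. This is a minimal illustration, not the Teleport implementation: the `readiness` type, its methods, and the component names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// readiness is a hypothetical tracker used only for illustration.
type readiness struct {
	mu             sync.Mutex
	componentCount int             // number of components expected to heartbeat
	seen           map[string]bool // components whose heartbeat was observed
}

func newReadiness(componentCount int) *readiness {
	return &readiness{componentCount: componentCount, seen: make(map[string]bool)}
}

// onHeartbeat records a successful heartbeat from the named component.
// Repeated heartbeats from the same component are counted once.
func (r *readiness) onHeartbeat(component string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.seen[component] = true
}

// state returns "ok" only once every expected component has reported.
func (r *readiness) state() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.seen) >= r.componentCount {
		return "ok"
	}
	return "starting"
}

func main() {
	r := newReadiness(2)
	r.onHeartbeat("auth")
	fmt.Println(r.state()) // starting
	r.onHeartbeat("proxy")
	fmt.Println(r.state()) // ok
}
```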
* Throw startup error if `TeleportReadyEvent` is not emitted
Before this commit, the `TeleportReadyEvent` was only waited for when a
process reload occurred. Thus, if a bug exists in the code that emits
this event (as is currently the case, since the `MetricsReady` and
`WindowsDesktopReady` events are never emitted), such a bug may go
unnoticed for a while.
This commit ensures that the `TeleportReadyEvent` is always waited for
on startup, and throws an error if the event is not emitted (after some
timeout).
This commit also:
- removes the `MetricsReady` event (as this is not produced by a
component that sends heartbeats, which is the case of every other
event required by the `TeleportReadyEvent` event mapping)
- ensures that `WindowsDesktopReady` event is emitted
- refactors some of the code in `lib/service/supervisor.go`
- moves the event mapping registration to a new `registerTeleportReadyEvent` function
Introduce Database Certificate Authority. New CA is used by Database Access to sign database certificates making them independent from Host CA.
Co-authored-by: Marek Smoliński <marek@goteleport.com>
* Intercept and update error message when there is a certificate error joining a node.
* Log out error hint and return full wrapped error.
* Updated error message.
* Always use in-memory caches
This also cleans up now-useless fields and constants related to on-disk
caches.
* Remove the cache tombstone mechanism
As we're never reopening the same cache backend twice, this is no longer
useful.
* Warn if a cache directory exists on disk
We can't remove it automatically because we might be in the middle of an
upgrade with an old version of Teleport still running.
* Provide error info on data dir rights
* Added a similar message about appropriate access for when a Teleport configuration file (/etc/teleport.yaml) fails to load due to a permission error.
The Forwarder type has been replaced with the new GRPC/streaming based
session recording and was only used in tests.
The RecordSessions param is never consulted, as it was replaced with
AuditWriter's RecordOutput param a couple of years ago.
- Rename the page, since it's about diagnostics rather than metrics
alone
- Change major section headings to H2s so they appear in the table of
contents
- Move information about heartbeats and recovery to an H3 so it's
more visible
Updates #10799
Co-authored-by: Paul Gottschling <paul.gottschling@goteleport.com>
* Use BEGIN IMMEDIATE to start transactions
This makes it so all transactions grab a write lock
rather than a read lock that can be upgraded in case of
a write; in case of multiple writers (which, in our
case, can only happen during a restart as the new
process reopens the same sqlite database) this will
prevent two transactions from attempting to upgrade
their lock, which would cause a SQLITE_BUSY error in
one of them. In regular operation this shouldn't cause
a performance hit, as we're using a single connection
to the sqlite database (guarded by locks on the Go side)
anyway.
* Escape path in sqlite connection URL
This makes it so that the sqlite backend supports paths with ? in them.
* Close process storage on TeleportProcess shutdown
This aligns the behavior of Shutdown with that of Close.
* Allow specifying the journal mode in sqlite
This will let sqlite backend users specify WAL mode in their config
file, and will allow us to specify alternate journal modes for our
on-disk caches in the future.
This also removes sqlite memory mode, as it's not used anywhere because
of its poor query performance compared to our in-memory backend, and
cleans up a bit of old cruft, and runs process storage in FULL sync
mode - it's very seldom written to and holds important data.
* Fix goroutine and memory leak in watchCertAuthorities
The CA Watcher was blocking both on writing to a channel when the watcher
was closed and on HTTP calls that had no request timeout or context passed
to cause cancellation.
All resourceWatcher implementations with a bug that could cause them to block
forever on a channel write were fixed by selecting on both the write and ctx.Done.
Add context.Context to all Get/Put/Post/Delete methods on the auth HTTPClient to
force callers to propagate context. Previously, all calls used context.TODO, which
prevented requests from being properly cancelled.
Add context propagation to RotateCertAuthority, RotateExternalCertAuthority,
GetCertAuthority, GetCertAuthorities. This is needed to get the correct ctx
from the CertAuthorityWatcher all the way down to the HTTPClient that makes
the call.
Closes #10648
The Migrate method on the Backend interface was not implemented by any
backends.
Migration should be implemented in the New method of backends so they
can be sure migration happens before any background processes are
started.
The upload completer scans for uploads that need to be completed,
likely due to an error or process restart. Prior to this change,
it only completed uploads that had 1 or more parts. Since completing
an upload is what cleans up the directory on disk (or in the case of
cloud storage, finishes the multipart upload), it was possible
for us to leave behind empty directories (or multipart uploads)
for uploads with no parts.
This change makes it valid to complete uploads with no parts, which
ensures that these directories get cleaned up.
Also fix an issue with the GCS uploader, which failed to properly calculate
the upload ID from the path. This is because strings.Split(s, "/") returns an empty
string as the last element when s ends with a /.
Updates #9646
Passwordless endpoints are rate limited because they allow unauthenticated
challenge generation. The endpoint rate limits are applied in addition to
(pre-existing) storage limits.
Setting limits to Auth only would be sufficient, but it seems best to apply
limits to Proxy as well, so we may spare Auth of unnecessary load.
Auth already has a framework for RPC rate limiting, so we took advantage of it.
The solution for the Proxy is rather simple - the handler is decorated with the
appropriate limits.
#9160
* Fix shadowing of grpcServer variable
* Add rate limiting for CreateAuthenticateChallenge
* Add rate limiting for /mfa/login/begin
* Safe parallel tests
* Add certificate renewal bot
This adds a new `tbot` tool to continuously renew a set of
certificates after registering with a Teleport cluster using a
similar process to standard node joining.
This makes some modifications to user certificate generation to allow
for certificates that can be renewed beyond their original TTL, and
exposes new gRPC endpoints:
* `CreateBotJoinToken` creates a join token for a bot user
* `GenerateInitialRenewableUserCerts` exchanges a token for a set of
certificates with a new `renewable` flag set
A new `tctl` command, `tctl bots add`, creates a bot user and calls
`CreateBotJoinToken` to issue a token. A bot instance can then be
started using a provided command.
* Cert bot refactoring pass
* Use role requests to split renewable certs from end-user certs
* Add bot configuration file
* Use `teleport.dev/bot` label
* Remove `impersonator` flag on initial bot certs
* Remove unnecessary `renew` package
* Misc other cleanup
* Do not pass through `renewable` flag when role requests are set
This adds additional restrictions on when a certificate's `renewable`
flag is carried over to a new certificate. In particular, it now also
denies the flag when either role requests are present, or the
`disallowReissue` flag has been previously set.
In practice `disallow-reissue` would have prevented any undesired
behavior but this improves consistency and resolves a TODO.
* Various tbot UX improvements; render SSH config
* Fully flesh out config template rendering
* Fix rendering for SSH configuration templates
* Added `String()` impls for destination types
* Improve certificate renewal logging; show more detail
* Properly fall back to default (all) roles
* Add mode hints for files
* Add/update copyright headers
* Add stubs for tbot init and watch commands
* Add gRPC endpoints for managing bots
* Add `CreateBot`, `DeleteBot`, and `GetBotUsers` gRPC endpoints
* Replace `tctl bot (add|rm|ls)` implementations with gRPC calls
* Define a few new constants, `DefaultBotJoinTTL`, `BotLabel`,
`BotGenerationLabel`
* Fix outdated destination flag in example tbot command
* Bugfix pass for demo
* Fixed a few nil pointer derefs when using config from CLI args
* Properly create destination if `--destination-dir` flag is used
* Remove improper default on CLI flag
* `DestinationConfig` is now a list of pointers
* Address first wave of review feedback
Fixes the majority of smaller issues caught by reviewers, thanks all!
* Add doc comments for bot.go functions
* Return the token TTL from CreateBot
* Split initial user cert issuance from `generateUserCerts()`
Issuing initial renewable certificate ended up requiring a lot of
hacks to skip checks that prevented anonymous bots from getting
certs even though we'd verified their identity elsewhere (via token).
This reverts all those hacks and splits initial bot cert logic into a
dedicated `generateInitialRenewableUserCerts()` function which should
make the whole process much easier to follow.
* Set bot traits to silence log messages
* tbot log message consistency pass
* Resolve lints
* Add config tests
* Remove CreateBotJoinToken endpoint
Users should instead use the CreateBot/DeleteBot endpoints.
* Create a fresh private key for every impersonated identity renewal
* Hide `config` subcommand
* Rename bot label prefix to `teleport.internal/`
* Use types.NewRole() to create bot roles
* Clean up error handling in custom YAML unmarshallers
Also, add notes about the supported YAML shapes.
* Fetch proxy host via gRPC Ping() instead of GetProxies()
* Update lib/auth/bot.go
Co-authored-by: Zac Bergquist <zmb3@users.noreply.github.com>
* Fix some review comments
* Add renewable certificate generation checks (#10098)
* Add renewable certificate generation checks
This adds a new validation check for renewable certificates that
maintains a renewal counter as both a certificate extension and a
user label. This counter is used to ensure only a single certificate
lineage can exist: for example, if a renewable certificate is stolen,
only one copy of the certificate can be renewed as the generation
counter will not match
When renewing a certificate, first the generation counter presented
by the user (via their TLS identity) is compared to a value stored
with the associated user (in a new `teleport.dev/bot-generation`
label field). If they aren't equal, the renewal attempt fails.
Otherwise, the generation counter is incremented by 1, stored to the
database using a `CompareAndSwap()` to ensure atomicity, and set on
the generated certificate for use in future renewals.
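The generation check can be illustrated with an in-memory store. This is a sketch under assumptions: `userStore`, `renew`, and the user name are hypothetical, and the compare step and the atomic update are folded into a single compare-and-swap rather than mirroring the real backend calls.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errGenerationMismatch = errors.New("certificate generation mismatch")

// userStore is a hypothetical stand-in for the backend holding each
// user's current generation counter.
type userStore struct {
	mu          sync.Mutex
	generations map[string]uint64
}

// compareAndSwap sets the generation to next only if it equals expected.
func (s *userStore) compareAndSwap(user string, expected, next uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.generations[user] != expected {
		return false
	}
	s.generations[user] = next
	return true
}

// renew checks the generation presented by the certificate and returns
// the generation to embed in the newly issued certificate.
func (s *userStore) renew(user string, presented uint64) (uint64, error) {
	next := presented + 1
	if !s.compareAndSwap(user, presented, next) {
		return 0, errGenerationMismatch
	}
	return next, nil
}

func main() {
	s := &userStore{generations: map[string]uint64{"bot-example": 1}}
	fmt.Println(s.renew("bot-example", 1)) // 2 <nil>
	// A stolen copy still at generation 1 can no longer renew.
	fmt.Println(s.renew("bot-example", 1)) // 0 certificate generation mismatch
}
```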
* Add unit tests for the generation counter
This adds new unit tests to exercise the generation counter checks.
Additionally, it fixes two other renewable cert tests that were
failing.
* Remove certRequestGeneration() function
* Emit audit event when cert generations don't match
* Fully implement `tctl bots lock`
* Show bot name in `tctl bots ls`
* Lock bots when a cert generation mismatch is found
* Make CompareFailed responses from validateGenerationLabel() more actionable
* Update lib/services/local/users.go
Co-authored-by: Nic Klaassen <nic@goteleport.com>
* Backend changes for tbot IoT and AWS joining (#10360)
* backend changes
* add token permission check
* pass ctx from caller
Co-authored-by: Roman Tkachenko <roman@goteleport.com>
* fix comment typo
Co-authored-by: Roman Tkachenko <roman@goteleport.com>
* use UserMetadata instead of Identity in RenewableCertificateGenerationMismatch event
* Client changes for tbot IoT joining (#10397)
* client changes
* delete replaced APIs
* delete unused tbot/auth.go
* add license header
* don't unnecessarily fetch host CA
* log fixes
* s/tunnelling/tunneling/
Co-authored-by: Zac Bergquist <zmb3@users.noreply.github.com>
* auth server addresses may be proxies
Co-authored-by: Zac Bergquist <zmb3@users.noreply.github.com>
* comment typo fix
Co-authored-by: Zac Bergquist <zmb3@users.noreply.github.com>
* move *Server methods out of auth_with_roles.go (#10416)
Co-authored-by: Tim Buckley <tim@goteleport.com>
Co-authored-by: Zac Bergquist <zmb3@users.noreply.github.com>
Co-authored-by: Roman Tkachenko <roman@goteleport.com>
Co-authored-by: Nic Klaassen <nic@goteleport.com>
* Address another batch of review feedback
Add `Role.SetMetadata()`, simplify more `trace.WrapWithMessage()`
calls, clear some TODOs and lints, and address other misc feedback
items.
* Fix lint
* Add missing doc comments to SaveIdentity / LoadIdentity
* Remove pam tag from tbot build
* Update note about bot lock deletion
* Another pass of review feedback
Ensure all requestable roles exist when creating a bot, adjust the
default renewable cert TTL down to 1 hour, and check types during
`CompareAndSwapUser()`
Co-authored-by: Zac Bergquist <zmb3@users.noreply.github.com>
Co-authored-by: Nic Klaassen <nic@goteleport.com>
Co-authored-by: Roman Tkachenko <roman@goteleport.com>
Add support for Database Access for Redis, covering self-hosted standalone and cluster instances. Teleport requires mTLS in order to connect to a Redis instance, which is only supported in Redis 6.0+. RESP2 is currently the only supported protocol.
* Record desktop sessions
Here we introduce a new protobuf type (DesktopRecording) that contains
an encoded TDP message, and update AuditWriter to treat these similarly
to SessionPrint events (which are used for SSH session recordings).
We also add a desktop session playback endpoint, temporarily located at
/webapi/sites/:site/desktopplaybacktest/:session
which streams TDP messages from a recorded session over
a websocket interface.
* update session end (#9795)
* Updates SessionEnd event with fields needed for frontend
* removes the clock which didn't need to be passed
* Add `Recorded` field to `WindowsDesktopSessionEnd` (#9839)
* Adds the SessionRecording field to WindowsDesktopSessionEnd event to mimic SessionEnd events (useful for easy integration with frontend).
* 14 should have been 12
* removing test logic
* switches SessionRecording to simple boolean Recorded
* session recording websocket (#9908)
* Updates the websocket address
* updates desktopPlaybackHandle to restart playback once it reaches the end
* adds playback state and synchronization logic for ensuring that goroutines aren't leaked
* adds toggle functionality for play/pause
* fix for the fact that urls are case insensitive
* moves desktop_playback to its own file, fixes mistaken comment about how websocket.JSON.Receive works, fixes error messaging, wraps playbackState.Close in a sync.Once
* Adds a cond variable for the two goroutines, but doesn't solve the spinning loop problem for the hanging logic
* Adds a cancel-able context which is cancelled in ps.Close() in order to avoid a spinning loop in the websocket.Handler
* Moves the majority of playback logic into the playbackState, which is now renamed to the more accurate playbackPlayer.
* changes pp.hangWhilePaused to pp.waitWhilePaused
* Moves the context out of NewPlaybackPlayer and the playbackPlayer
struct, wraps playback goroutines in playbackPlayer.Play(ctx) in
order to comply with context semantics.
* removing unnecessary warnings
* touchups
* record screen size (#9992)
* Adds an OnRecv that's similar to OnSend, for emitting audit events for particular incoming tdp messages
* Send full `DesktopRecording` event as json over playback websocket. (#10052)
* playback websocket now sends a json representation of the DesktopRecording event rather than just the raw tdp message, in order for us to have timing data on the frontend
* updating json.Marshal to utils.FastMarshal
* Removing unnecessary comment
* playback end event (#10088)
* Adds an end event so that the playback player knows to set the progress bar to its end state
* making the end message a json
* if the marshal fails we don't want to send a message over websocket
* Use a static string
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
* Add participants to session end event
Desktop sessions are not joinable, so the participants list always
has a single member - the user who started the session.
This will ensure that our example role for RBAC for sessions
(which depends on the participants field) will work for desktop
sessions.
* Minor cleanup
* Only record sessions when enabled
In order for desktop sessions to be recorded, session recording
must be enabled in the cluster's session recording config and
at least one of the user's roles must enable it.
* Cleanup
* Start to address review comments
* Move TDP event handlers out of connectRDP
* Address more review comments and add some tests
* Add playback streaming test
* Consistent comments
* Fix tests
* Don't log PNG frames that exceed the size of a protobuf
Since the PNG frame message in our desktop protocol is unbounded,
it is theoretically possible for a message to exceed the size limit
of a single protobuf.
In practice, this is unlikely to occur with any legitimate RDP traffic,
as the bitmaps are at most 64x64 pixels and compressed in PNG form.
Rather than complicating the protocol to allow for PNGs to be split
across events, we simply refuse to log anything this big.
* Mark RFD 48 implemented
Co-authored-by: Isaiah Becker-Mayer <isaiah@goteleport.com>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
* Add more lint coverage
golangci-lint doesn't pick up subdirectories with their own go.mod,
which left certain directories unlinted. To get around this we can
run golangci-lint directly against those submodules.
* Dynamically resolve reverse tunnel address
The reverse tunnel address is currently a static string that is
retrieved from config and passed around for the duration of a
service's lifetime. When the `tunnel_public_address` is changed
on the proxy and the proxy is then restarted, all established
reverse tunnels over the old address will fail indefinitely.
As a means to get around this, #8102 introduced a mechanism
that would cause nodes to restart if their connection to the
auth server was down for a period of time. While this did
allow the nodes to pick up the new address after they
restarted, it was meant to be a stopgap until a more robust
solution could be applied.
Instead of using a static address, the reverse tunnel address
is now resolved via a `reversetunnel.Resolver`. Anywhere that
previously relied on the static proxy address now will fetch
the actual reverse tunnel address via the webclient by using
the Resolver. In addition this builds on the refactoring done
in #4290 to further simplify the reversetunnel package. Since
we no longer track multiple proxies, all the left over bits
that did so have been removed to accommodate using a dynamic
reverse tunnel address.
Rather than requiring a password for the LDAP service account,
we can use a Teleport-issued certificate to authenticate.
This works because AD must already be configured to trust the Teleport
CA in order for desktop access to function.
Fixes #8921
- Ensure that the dial request uses proper "server ID" format,
which is <uuid>.<cluster_name>
- Update reverse tunnel agent to handle tunnel connections
to desktops
Fix flaky unit tests
Addresses issues causing failures in TestCache_Backoff, TestTeleportProcess_reconnectToAuth
and TestResourceWatcher_Backoff. By utilizing FakeClock.BlockUntil tests ensure that the clock
will not be advanced until retry.After has been called. Move retry duration channels to config in order to allow them to be buffered by tests.
Now 'verify-full', 'verify-ca' and 'insecure' modes can be used when connecting to a database. 'verify-full' is the default and the strictest mode. 'verify-ca' skips the server-name check. 'insecure' accepts any certificate provided by a database.
Prior to this change, desktop access only respected locks
on users or roles. This introduces a desktop as a lock target,
preventing new connections and terminating existing connections
to a locked desktop.
Note: when a lock is created, connection attempts will fail
with the generic "websocket connection failed" error.
This will be addressed with #8584.
Updates #8742
Fixes gravitational/teleport-private#78
LAT-APP21-3
Change the multiplexer from opt-out to opt-in for protocol listeners.
The multiplexer previously always created new listeners for each protocol (SSH, TLS, DB) and its Config contained opt-out configurations (DisableTLS, DisableSSH, DisablePostgres). When callers didn't explicitly disable a protocol, new connections for that protocol would never close and leak a goroutine. This exposed a 3-line DoS whereby multiple connections could be passed to the multiplexer for a protocol that was not being serviced, eventually resulting in file descriptor limits being hit, which then prevented the teleport process from operating (see example in issue).
Changing to opt-in means new protocols can be added to the multiplexer without requiring all existing callers to be changed (to opt-out of the new protocol). Forgetting to opt-out would expose new code to DoS without compile-time, test, or operational notification.
Rather than rely on opt-out flags in the config, this change creates listeners only when explicitly requested by callers. The existing getter methods on multiplexer were changed to create listeners when called. And multiplexer protocol detection now closes connections when a listener hasn't been created. This also allowed for the protocol detection routine to be simplified.
Move cache and resourceWatcher watchers from a 10s retry to a jittered backoff retry up to ~1min. Replace the
reconnectToAuthService interval with a retry to add jitter and backoff there as well for when a node restarts due to
changes introduced in #8102.
Fixes #6889.