This commit includes a 7 character fix in lib/service/connect.go to call
connector.Close() instead of connector.Client.Close() when a new client
fails to ping the auth server.
connector.Close() correctly avoids closing the client if it is a shared
copy of the Instance client.
The call to connector.Client.Close() was causing intermittent problems
where reconnectToAuthService could get stuck repeatedly trying to use
the same client that was just closed.
This appears to be fixed now that the Instance client is not being
improperly closed by other components.
I discovered this issue because it manifested itself in flaky failures
of TestHSMMigrate, where logs indicated that the Instance client was
being repeatedly reused but the connection was never successful
```
{"caller":"service/connect.go:1057","component":"proc:18","level":"info","message":"Reusing Instance client for Proxy. additionalSystemRoles=[Proxy]","pid":"34558.18","timestamp":"2023-09-27T21:30:05Z"}
{"caller":"service/connect.go:166","component":"proc:18","level":"debug","message":"Connected client: Identity(Proxy, cert(c90e905c-76e7-4c68-803b-ba364167ec6f.testcluster issued by testcluster:173887050308815087166604899475019267945),trust root(testcluster:322819974523436048061473591931335284057),trust root(testcluster:173887050308815087166604899475019267945),trust root(testcluster:135083743987735629230336583041497316143))","pid":"34558.18","timestamp":"2023-09-27T21:30:05Z"}
{"caller":"service/connect.go:98","component":"proc:18","level":"debug","message":"Connected client Proxy failed to execute test call: rpc error: code = Canceled desc = grpc: the client connection is closing. Node or proxy credentials are out of sync.","pid":"34558.18","timestamp":"2023-09-27T21:30:05Z"}
time="2023-09-27T21:30:13Z" level=warning msg="connection problem: readfrom tcp 172.18.0.2:38114->172.18.0.2:41065: use of closed network connection *net.OpError" dest="172.18.0.2:41065" source="172.18.0.2:37496" trace.component=loadbalancer trace.fields="map[listen:8ce2fe8a89f0:0]"
time="2023-09-27T21:30:13Z" level=warning msg="Failed to forward connection: readfrom tcp 172.18.0.2:38114->172.18.0.2:41065: use of closed network connection." trace.component=loadbalancer trace.fields="map[listen:8ce2fe8a89f0:0]"
time="2023-09-27T21:30:17Z" level=warning msg="Failed to create inventory control stream: rpc error: code = Canceled desc = grpc: the client connection is closing."
{"caller":"service/connect.go:124","component":"proc:18","level":"debug","message":"Retrying connection to auth server after waiting 41.323026451s.","pid":"34558.18","timestamp":"2023-09-27T21:30:46Z"}
{"caller":"service/connect.go:189","component":"proc:18","level":"debug","message":"Connected state: rotating servers (mode: manual, started: Sep 27 2023 21:29:24 UTC, ending: Sep 29 2023 03:29:24 UTC).","pid":"34558.18","timestamp":"2023-09-27T21:30:46Z"}
{"caller":"service/connect.go:1057","component":"proc:18","level":"info","message":"Reusing Instance client for Proxy. additionalSystemRoles=[Proxy]","pid":"34558.18","timestamp":"2023-09-27T21:30:46Z"}
...repeating...
```
The HSM tests have become flaky in the past when reload/reconnect bugs like
this have been introduced, but they are long tests that are a bit tricky
to run locally and issues like this one can be difficult to diagnose.
To try to improve our chances of catching these issues in the future,
I've written a new test that starts up an Auth and Proxy process and
repeatedly reloads both of them, asserting that the reload is always
successful in a reasonable amount of time.
The new test is able to catch the bug every time I have run it locally,
usually in ~4 out of the 8 parallel invocations to runs.
I have not seen any failures with the fix applied.
The entire test completes in ~12 seconds on my local machine.
This change adds the `client_id` optio nto the Discovery Service for
Azure VMs, which sets the client ID of the managed identity for discovered
nodes to use when joining the cluster. This allows the discovered nodes
to be discovered while having multiple managed identities assigned.
* Check to make sure defaultAllowRules matches preset roles.
The tests for role defaults now ensure that `defaultAllowRules` will add
the rules that are defined in the `access`, `auditor`, and `editor` presets.
Additionally, several missing rules have been added into `defaultAllowRules`.
* Use presets for defaultAllowRules.
* Remove comments referring to defaultAllowRules in preset definitions.
* Kill agent after removing workspace on logout
Quick shutdown of an agent now induces a 3 second timeout if there are
active connections (#31869). I changed the logout procedure so that we
first remove the workspace (and thus close all tabs) and only then kill
the agent. This makes it so that if there were any open connections from
the app, we'll close them before killing the agent, which means the app
will close without that 3 second timeout.
* Make Start Agent button's handler ignore errors
* Avoid rendering labels if there's no node
This could happen if someone started and stopped the agent and then clicked
Start Agent with expired certs. agentNode would go from undefined to defined
back to undefined, but it seems that <Transition> doesn't unmount immediately.
* clusters.Cluster.Logout: Make error handling more explicit
!trace.IsNotFound(err) returns true both when the error is nil and when
it's not nil but the error is not NotFound. Meaning that code after the
conditional would run only when err is NotFound.
I added a bogus error to test something after the conditional and then
spent 30 minutes wondering what's going on.
* Display an empty ring if agent is set up but not running
This makes it consistent with the Connections list which also displays
an empty ring if there are no connections.
* Expand NavigationMenu stories, add story for agent starting
* Avoid calculating indicator status if user cannot use feature
* Copy progress icon from Cloud's ProgressBar
* Return 'not-configured' early if agent is not configured
* Add story for when agent is configured but not started, reorder stories
* Revert "Error when redundant prefixes are detected in events. (#32652)"
This reverts commit 84dbc45e1d.
* Error when redundant prefixes are detected in events.
When creating a new events watcher, redundant prefixes will be detected
and produce an error. This should prevent developer mistakes where watched
prefixes overlap, causing subsets of events not to be parsed. This has been
verified manually.
* Fix buffer_test.go
* Update lib/services/local/events.go
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
---------
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
* re-add agentless node manual installation docs
* fix linter issues
* Add missing closing paren
Co-authored-by: Paul Gottschling <paul.gottschling@goteleport.com>
* use tabs in manual install guide
* removed more ScopedBlocks I missed
* add note about upgrading to v14
* add redirect and notes linking the two guides to one another
* addressed feedback
* Apply suggestions from code review
Co-authored-by: Paul Gottschling <paul.gottschling@goteleport.com>
* fix links
---------
Co-authored-by: Paul Gottschling <paul.gottschling@goteleport.com>
* Fix remote pool of signed certs when exec into leaf clusters
This PR fixes the list of acceptable CAs from the leaf cluster when
exec into a leaf cluster pod.
Fixes#32380
Signed-off-by: Tiago Silva <tiago.silva@goteleport.com>
* add unit test
---------
Signed-off-by: Tiago Silva <tiago.silva@goteleport.com>
* Revise Docker handling in OS compatibility script
This commit revises how Docker containers are interacted with in build-test-compat.sh. Optimized Docker image pulling process by pulling images in parallel to speed up the testing process. Makefile targets in Github workflow are also parallelized to speed up the build process.
* Simplify and parallel docker logic
* Add ExistingMFAResponse to DeleteMFADeviceSyncRequest
* Update generated protos
* Support challenge-based deletion on DeleteMFADeviceSync
* Refactor TestDeleteMFADeviceSync; use optional mode and drop spares
* Make spacing of Connect My Computer status more consistent
* Add server labels to Running story, add ErrorWithAlertAndLogs story
* Change copy depending on whether the agent is running
* Fix proxyVersion in story
Co-authored-by: Grzegorz Zdunek <gzdunek@users.noreply.github.com>
---------
Co-authored-by: Grzegorz Zdunek <gzdunek@users.noreply.github.com>
* An attempt to fix our failing builds
* Add merge_group condition to checkout step in workflows
This update adds a condition to the checkout step in various GitHub workflows to ensure it only runs when the event_name is "merge_group".
* Fix syntax
* Use v4 tag for checkout action instead of pinned commit
Co-authored-by: Reed Loden <reed@goteleport.com>
---------
Co-authored-by: Reed Loden <reed@goteleport.com>
* DiscoveryConfig: add service with rbac support
This PR adds the DiscoveryConfig service protected by RBAC rules.
A PR will follow that uses this service to expose the service in the
gRPC server.
* review pt1
* Error when redundant prefixes are detected in events.
When creaeting a new events watcher, redundant prefixes will be detected
and produce an error. This should prevent developer mistakes where watched
prefixes overlap, causing subsets of events not to be parsed. This has been
verified manually.
* Add in test for event watcher verification.
* Run GCI.
* Make yubikey unit test interactive and add to test plan.
* Move yubikey hardware signer method tests to interactive yubikey test.
* Remove hardware key interactive unit test from testplan
This change drains unused SSH channels and requests to prevent a
situation where an attacker could repeatedly open channels and
send data that won't be read, causing Teleport to eventually run
out of memory.
Standardize `examples/teleport-usage` to use the same base image
and other (general) build commands as `integrations/kube-agent-updater`
and `integrations/operator`.
The main change is moving from `debian:stable-slim` to `distroless/static-debian12`.
* Do not pass --insecure to tshd in dev mode by default
We ran into a bug related to certs in tshd that wasn't uncovered during
development because of the --insecure flag being passed by default.
* Add --debug flag
We need a way to pass --debug to tshd from a packaged app.
* Add test for auto disconnect (disconnect happens, query updates the timer).
* Fix: In database service, `clientConn` returned from `MonitorConn`
was never used, causing unjustified idle timeouts.
`tctl edit` was always performing a forceful update in the same way
that `tctl create -f` was. This prevents optimistic from being
enforced during the update step of the edit command and thus nullifies
some of the usefulness of the feature(preventing concurrent updates
to a resource).
Since not all resources support Update operations, some only support
Upsert, and optimistic locking will slowly be added one resource at
at time the new behavior was only implemented for user resources.
The UpdateHandlers will be updated in follow up PRs when the resource
has support for optimistic locking added.
* Allow agent compatibility state to be 'unknown'
* Show warning icon instead of read dot as the status indicator
* Add a comment
* Remove unnecessary check for agent compatibility
* Avoid creating portal if topBarContainerRef is not present
* Check compatibility when autostarting, not before autostart
* Flip autoStart to false on failed autostart
* Don't show Alert from CompatibilityError if current action has failed
* Run prettier
---------
Co-authored-by: Rafał Cieślak <rafal.cieslak@goteleport.com>
* Fix Access List Members cache and eventing.
Two things were happening that were shadowing the Access List members cache
and eventing.
1. In the cache collections, the wrong reader was being assigned to the
lookup map. The correct reader was being used elsewhere, however, so the
caching tests appear to have still been working.
2. The watcher in lib/services/local/events.go apparently collapses prefixes
if they overlap. Prefix `access_list_members` is encompassed by
`access_list`, so the access list members prefix was eliminated from the
watcher. As a result, access list member events were being processed by
the access list parser, which resulted in non-critical warnings.
Local testing and dogfooding has yielded that this has had no apparent impact,
at least in situations without cache propagation. However, I've got a feeling
that this could affect situations with multiple auth servers.
While I'm here, I've eliminated the pointer-to-pointer logic in the access
list unmarshaling, which was excised elsewhere and should be excised here as
well.
* Use ExactKey, fix accessListMemberParser as well.