* kube: emit audit events using process context
Using the request context can prevent audit events from being emitted
if the client disconnects and the request context is closed.
We shouldn't lose audit events like that.
Also, log all response errors from exec handler.
* kube: cleanup forwarder code
Rename a few config fields to be more descriptive.
Avoid embedding unless necessary, to keep the package API clean.
* kube: cache only user certificates, not the entire session
The expensive part that we need to cache is the client certificate.
Making a new one requires a round-trip to the auth server, plus entropy
for crypto operations.
The rest of clusterSession contains request-specific state, and only
adds problems if cached.
For example: clusterSession stores a reference to a remote teleport
cluster (if needed); caching requires extra logic to invalidate the
session when that cluster disappears (or its tunnels drop out). The same
problem happens with kubernetes_service tunnels.
Instead, the forwarder now picks a new target for each request from the
same user, providing a kind of "load-balancing".
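The per-request target selection can be sketched roughly like this; the type and function names are illustrative, not the forwarder's actual API:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// kubeTarget is a hypothetical stand-in for a registered kubernetes_service endpoint.
type kubeTarget struct {
	Addr string
}

// pickTarget chooses a fresh target for each request instead of pinning
// one inside a cached session, giving a simple form of load-balancing.
func pickTarget(targets []kubeTarget) (kubeTarget, error) {
	if len(targets) == 0 {
		return kubeTarget{}, errors.New("no kubernetes services registered")
	}
	return targets[rand.Intn(len(targets))], nil
}

func main() {
	targets := []kubeTarget{{Addr: "10.0.0.1:3026"}, {Addr: "10.0.0.2:3026"}}
	t, err := pickTarget(targets)
	if err != nil {
		panic(err)
	}
	fmt.Println("forwarding to", t.Addr)
}
```

Because nothing request-specific is cached, a target that disappears (or whose tunnel drops) simply stops being picked on subsequent requests.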
* Init session uploader in kubernetes service
It's started in all other services that upload sessions (app/proxy/ssh),
but was missing here. Because of this, the session storage directory for
async uploads wasn't created on disk, which caused interactive sessions
to fail.
* Update logrus package to fix data races
* Introduce a logger that uses the test context to log messages so they are output if a test fails, for improved troubleshooting.
* Revert introduction of test logger - simply leave logger configuration at debug level outputting to stderr during tests.
* Run integration test for e as well
* Use make with a cap and append to only copy the relevant roles.
* Address review comments
* Update integration test suite to use test-local logger that would only output logs iff a specific test has failed - no logs from other test cases will be output.
* Revert changes to InitLoggerForTests API
* Create a new logger instance when applying defaults or merging with file service configuration
* Introduce a local logger interface to be able to test file configuration merge.
* Fix kube integration tests w.r.t. logging
* Move goroutine profile dump into a separate func to handle parameters consistently for all invocations
This commit fixes #4695.
Teleport in async recording mode sends all events to disk,
and uploads them to the server later.
It uploads some events synchronously to the audit log so
they show up in the global event log right away.
However, if the auth server is slow, the fanout blocks the session.
This commit keeps the fanout of those events fast, but makes it
nonblocking and non-failing, so sessions will not hang unless the
disk writes hang.
It adds a backoff period and a timeout after which some events may be
lost, but the session continues without locking.
A proxy running in pre-5.0 mode (e.g. with local kubeconfig) should
register an entry in `tsh kube clusters`.
After upgrading to 5.0, without migration to kubernetes_service, all the
new `tsh kube` commands will work as expected.
* Add labels to KubernetesCluster resources
Plumb from config to the registered object, keep dynamic labels updated.
* Check kubernetes RBAC
Checks are in some CRUD operations on the auth server and in the
kubernetes forwarder (both proxy or kubernetes_service).
The logic is essentially copy-paste of the TAA version.
Updated storage configuration to apply not only to DynamoDB in the
backend package but also to DynamoDB in the events package. This allows
configuring continuous backups and auto scaling for the events table.
This change has several parts: cluster registration, cache updates,
routing and a new tctl flag.
> cluster registration
Cluster registration means adding `KubernetesClusters` to `ServerSpec`
for servers with `KindKubeService`.
`kubernetes_service` instances will parse their kubeconfig or local
`kube_cluster_name` and add them to their `ServerSpec` sent to the auth
server. They are effectively declaring that "I can serve k8s requests
for k8s cluster X".
> cache updates
This is just cache plumbing for `kubernetes_service` presence, so that
other teleport processes can fetch all kube services. It was missed
in the previous PR implementing CRUD for `kubernetes_service`.
> routing
Now the fun part - routing logic. This logic lives in
`/lib/kube/proxy/forwarder.go` and is shared by both `proxy_service`
(with kubernetes integration enabled) and `kubernetes_service`.
The target k8s cluster name is passed in the client cert, along with k8s
users/groups information.
`kubernetes_service` only serves requests for its direct k8s cluster
(from `Forwarder.creds`) and doesn't route requests to other teleport
instances.
`proxy_service` can serve requests:
- directly to a k8s cluster (the way it works pre-5.0)
- to a leaf teleport cluster (also same as pre-5.0, based on
`RouteToCluster` field in the client cert)
- to a `kubernetes_service` (directly or over a tunnel)
The last two modes require the proxy to generate an ephemeral client TLS
cert to do an outbound mTLS connection.
> tctl flag
A flag `--kube-cluster-name` for `tctl auth sign --format=kubernetes`
which allows generating client certs for non-default k8s cluster name
(as long as it's registered in a cluster).
I used this for testing, but it could be used for automation too.
Added support for an identity-aware, RBAC-enforcing, mutually
authenticated web application proxy to Teleport.
* Updated services.Server to support application servers.
* Updated services.WebSession to support application sessions.
* Added CRUD RPCs for "AppServers".
* Added CRUD RPCs for "AppSessions".
* Added RBAC support using labels for applications.
* Added JWT signer as a services.CertAuthority type.
* Added support for signing and verifying JWT tokens.
* Refactored dynamic label and heartbeat code into standalone packages.
* Added application support to web proxies and new "app_service" to
proxy mutually authenticated connections from proxy to an internal
application.
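The JWT signing and verification mentioned above can be sketched minimally. Teleport signs with a CA keypair; an HMAC-SHA256 token is used here only to keep the illustration self-contained with the standard library:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strings"
)

// signJWT builds a minimal HS256 JWT from a JSON payload and a shared key.
func signJWT(payload string, key []byte) string {
	enc := base64.RawURLEncoding
	header := enc.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	body := enc.EncodeToString([]byte(payload))
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(header + "." + body))
	return header + "." + body + "." + enc.EncodeToString(mac.Sum(nil))
}

// verifyJWT recomputes the signature and compares it in constant time.
func verifyJWT(token string, key []byte) bool {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return false
	}
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(parts[0] + "." + parts[1]))
	want := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(want), []byte(parts[2]))
}

func main() {
	key := []byte("secret")
	tok := signJWT(`{"sub":"alice","app":"grafana"}`, key)
	fmt.Println(verifyJWT(tok, key))            // true
	fmt.Println(verifyJWT(tok, []byte("wrong"))) // false
}
```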
* Implement kubernetes_service registration and startup
The new service now starts, registers (locally or via a join token) and
heartbeats its presence to the auth server.
This service can handle k8s requests (like a proxy) but cannot route
them to remote teleport clusters. Proxies will be responsible for
routing those.
The client (tsh) will not yet go to this service until proxy routing is
implemented. I manually tweaked the server address in kubeconfig to test it.
You can also run `tctl get kube_service` to list all registered
instances. The self-reported info is currently limited - only listening
address is set.
* Address review feedback
This is a shorthand for the larger kubernetes section:
```
proxy_service:
  kube_listen_addr: "0.0.0.0:3026"
```
is equivalent to:
```
proxy_service:
  kubernetes:
    enabled: yes
    listen_addr: "0.0.0.0:3026"
```
This shorthand is meant to be used with the new `kubernetes_service`:
https://github.com/gravitational/teleport/pull/4455
It reduces confusion when both `proxy_service` and `kubernetes_service`
are configured in the same process.
This commit fixes #4598
Config with multiple event backends was crashing on 4.4:
```yaml
storage:
  audit_events_uri: ['dynamodb://streaming', 'stdout://', 'dynamodb://streaming2']
```
* Fix local etcd test failures when etcd is not running
* Add kubernetes_service to teleport.yaml
This plumbs config fields only, they have no effect yet.
Also, remove `cluster_name` from `proxy_service.kubernetes`. This field
will only exist under `kubernetes_service` per
https://github.com/gravitational/teleport/pull/4455
* Handle IPv6 in kubernetes_service and rename label fields
* Disable k8s cluster name defaulting in user TLS certs
Need to implement service registration first.
`require` is a sister package to `assert` that terminates the test on
failure. `assert` records the failure but lets the test proceed, which
is unintuitive.
Also update all existing tests to match.
The cluster name from this field plus all clusters from kubeconfig are
stored on the auth server via heartbeats.
This info will later be used to route k8s requests back to proxies.
Updates https://github.com/gravitational/teleport/issues/3952
This commit introduces GRPC API for streaming sessions.
It adds structured events and sync streaming
that avoids storing events on disk.
You can find design in rfd/0002-streaming.md RFD.
* Split remote cluster watching from reversetunnel.AgentPool
Separating the responsibilities:
- AgentPool takes a proxy (or LB) endpoint and manages a pool of agents
for it (each agent is a tunnel to a unique proxy process behind the
endpoint)
- RemoteClusterTunnelManager polls the auth server for a list of trusted
clusters and manages a set of AgentPools, one for each trusted cluster
Previously, AgentPool did both of the above.
Also, bundling some cleanup in the area:
- better error when dialing through tunnel and directly both fail
- rename RemoteKubeProxy to LocalKubernetes to better reflect the
meaning
- remove some dead code and simplify config structs
* reversetunnel: factor out track.Key
ClusterName is the same for all Agents in an AgentPool. track.Tracker
needs to only track proxy addresses.
* Always collect metrics about top backend requests
Previously, it was only done in debug mode, which left some tabs in
`tctl top` empty when the auth server was not in debug mode.
* backend: use an LRU cache for top requests in Reporter
This LRU cache tracks the most frequent recent backend keys. All keys in
this cache map to existing labels in the requests metric. Any evicted
keys are also deleted from the metric.
This will keep an upper limit on our memory usage while still always
reporting the most active keys.
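A minimal sketch of an LRU with an eviction callback follows; it is illustrative of the technique, not the actual Reporter code, and the key names are made up:

```go
package main

import (
	"container/list"
	"fmt"
)

// lru tracks the most recently used keys up to a fixed capacity and calls
// onEvict for each key pushed out, so the caller can delete the matching
// metric label and keep memory bounded.
type lru struct {
	capacity int
	order    *list.List // front = most recently used
	entries  map[string]*list.Element
	onEvict  func(key string)
}

func newLRU(capacity int, onEvict func(string)) *lru {
	return &lru{capacity: capacity, order: list.New(), entries: make(map[string]*list.Element), onEvict: onEvict}
}

// Touch marks a key as recently used, evicting the least recent on overflow.
func (l *lru) Touch(key string) {
	if el, ok := l.entries[key]; ok {
		l.order.MoveToFront(el)
		return
	}
	l.entries[key] = l.order.PushFront(key)
	if l.order.Len() > l.capacity {
		oldest := l.order.Back()
		l.order.Remove(oldest)
		k := oldest.Value.(string)
		delete(l.entries, k)
		l.onEvict(k)
	}
}

func main() {
	var evicted []string
	cache := newLRU(2, func(k string) { evicted = append(evicted, k) })
	cache.Touch("/roles")
	cache.Touch("/nodes")
	cache.Touch("/roles")  // refresh: /roles is now the most recent
	cache.Touch("/tokens") // evicts /nodes, the least recently used
	fmt.Println(evicted)   // [/nodes]
}
```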
Heartbeats are more frequent and result in a more up-to-date /readyz
status. Concretely, status updates go from ~10 minutes to under 1 minute.
Also, refactored the state tracking code to track the status of
individual teleport components (auth/proxy/node).
This allows users to manually switch to a different algorithm by:
- setting the config file field
- running "tctl auth rotate"
If the config file field is not set, the existing signing algorithm of
the CA is preserved.
Store the signing algorithm alongside the CA private key. When reading
old CAs that don't have it set, default to the UNKNOWN proto enum, which
corresponds to the old SHA1-based signing algorithm.
The only time you get a SHA2 signature is when creating a fresh cluster
and generating a new CA. This can be disabled in the config.
This allows users to override the SHA2 signing algorithms we default to
now for compatibility with the (very) old OpenSSH versions.
For host and user certs, use the CA signing algo for their own
handshakes. This allows us to propagate the signing algo from auth
server everywhere else.
List of fixed items:
```
integration/helpers.go:1279:2 gosimple S1000: should use for range instead of for { select {} }
integration/integration_test.go:144:5 gosimple S1009: should omit nil check; len() for nil slices is defined as zero
integration/integration_test.go:173:5 gosimple S1009: should omit nil check; len() for nil slices is defined as zero
integration/integration_test.go:296:28 gosimple S1019: should use make(chan error) instead
integration/integration_test.go:570:41 gosimple S1019: should use make(chan interface{}) instead
integration/integration_test.go:685:40 gosimple S1019: should use make(chan interface{}) instead
integration/integration_test.go:759:33 gosimple S1019: should use make(chan string) instead
lib/auth/init_test.go:62:2 gosimple S1021: should merge variable declaration with assignment on next line
lib/auth/tls_test.go:1658:22 gosimple S1024: should use time.Until instead of t.Sub(time.Now())
lib/backend/dynamo/dynamodbbk.go:420:5 gosimple S1004: should use !bytes.Equal(expected.Key, replaceWith.Key) instead
lib/backend/dynamo/dynamodbbk.go:656:12 gosimple S1039: unnecessary use of fmt.Sprintf
lib/backend/etcdbk/etcd.go:458:5 gosimple S1004: should use !bytes.Equal(expected.Key, replaceWith.Key) instead
lib/backend/firestore/firestorebk.go:407:5 gosimple S1004: should use !bytes.Equal(expected.Key, replaceWith.Key) instead
lib/backend/lite/lite.go:317:5 gosimple S1004: should use !bytes.Equal(expected.Key, replaceWith.Key) instead
lib/backend/lite/lite.go:336:6 gosimple S1004: should use !bytes.Equal(value, expected.Value) instead
lib/backend/memory/memory.go:365:5 gosimple S1004: should use !bytes.Equal(expected.Key, replaceWith.Key) instead
lib/backend/memory/memory.go:376:5 gosimple S1004: should use !bytes.Equal(existingItem.Value, expected.Value) instead
lib/backend/test/suite.go:327:10 gosimple S1024: should use time.Until instead of t.Sub(time.Now())
lib/client/api.go:1410:9 gosimple S1003: should use strings.ContainsRune(name, ':') instead
lib/client/api.go:2355:32 gosimple S1019: should use make([]ForwardedPort, len(spec)) instead
lib/client/keyagent_test.go:85:2 gosimple S1021: should merge variable declaration with assignment on next line
lib/client/player.go:54:33 gosimple S1019: should use make(chan int) instead
lib/config/configuration.go:1024:52 gosimple S1019: should use make(services.CommandLabels) instead
lib/config/configuration.go:1025:44 gosimple S1019: should use make(map[string]string) instead
lib/config/configuration.go:930:21 gosimple S1003: should use strings.Contains(clf.Roles, defaults.RoleNode) instead
lib/config/configuration.go:931:22 gosimple S1003: should use strings.Contains(clf.Roles, defaults.RoleAuthService) instead
lib/config/configuration.go:932:23 gosimple S1003: should use strings.Contains(clf.Roles, defaults.RoleProxy) instead
lib/service/supervisor.go:387:2 gosimple S1001: should use copy() instead of a loop
lib/tlsca/parsegen.go:140:9 gosimple S1034: assigning the result of this type assertion to a variable (switch generalKey := generalKey.(type)) could eliminate type assertions in switch cases
lib/utils/certs.go:140:9 gosimple S1034: assigning the result of this type assertion to a variable (switch generalKey := generalKey.(type)) could eliminate type assertions in switch cases
lib/utils/certs.go:167:40 gosimple S1010: should omit second index in slice, s[a:len(s)] is identical to s[a:]
lib/utils/certs.go:204:5 gosimple S1004: should use !bytes.Equal(certificateChain[0].SubjectKeyId, certificateChain[0].AuthorityKeyId) instead
lib/utils/parse/parse.go:116:45 gosimple S1003: should use strings.Contains(variable, "}}") instead
lib/utils/parse/parse.go:116:6 gosimple S1003: should use strings.Contains(variable, "{{") instead
lib/utils/socks/socks.go:192:10 gosimple S1025: should use String() instead of fmt.Sprintf
lib/utils/socks/socks.go:199:10 gosimple S1025: should use String() instead of fmt.Sprintf
lib/web/apiserver.go:1054:18 gosimple S1024: should use time.Until instead of t.Sub(time.Now())
lib/web/apiserver.go:1954:9 gosimple S1039: unnecessary use of fmt.Sprintf
tool/tsh/tsh.go:1193:14 gosimple S1024: should use time.Until instead of t.Sub(time.Now())
```
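As a concrete example, the S1004 items above replace a Compare-with-zero pattern with `bytes.Equal`; the function name here is made up for illustration:

```go
package main

import (
	"bytes"
	"fmt"
)

// keysDiffer shows the post-fix form: !bytes.Equal replaces the
// equivalent but clumsier bytes.Compare(a, b) != 0.
func keysDiffer(expected, replaceWith []byte) bool {
	return !bytes.Equal(expected, replaceWith)
}

func main() {
	fmt.Println(keysDiffer([]byte("/nodes/a"), []byte("/nodes/b"))) // true
	fmt.Println(keysDiffer([]byte("/nodes/a"), []byte("/nodes/a"))) // false
}
```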
TeleportProcess can have multiple listeners per type during graceful
restart. Return an error from address getters to avoid flaky behavior.
These getters should only get called from tests.
This is primarily for tests to fetch the actual listening port of these
endpoints when config has port 0.
But it's also convenient for callers to avoid digging into the config
fields.
If kube.public_addr is not set and kube.listen_addr uses a non-standard
port, the client can't discover it. Advertise kube.listen_addr same as
we advertise it for SSH.
Also, override client.TeleportClient.KubeProxyAddr with info from
proxy's Ping response. Existing value in that field comes from
~/.tsh/profile and can contain the wrong value.
This only matters for nodes. The new stateStarting will be in effect
until the node successfully joins the cluster. This means that /readyz
for nodes will return '400 Bad Request' instead of '200 OK' until it
joins.
Updates #3700
Fixed findings:
```
lib/sshutils/server_test.go:163:2: SA4006: this value of `clt` is never used (staticcheck)
clt, err := ssh.Dial("tcp", srv.Addr(), &cc)
^
lib/sshutils/server_test.go:91:3: SA5001: should check returned error before deferring ch.Close() (staticcheck)
defer ch.Close()
^
lib/shell/shell_test.go:33:2: SA4006: this value of `shell` is never used (staticcheck)
shell, err = GetLoginShell("non-existent-user")
^
lib/cgroup/cgroup_test.go:111:2: SA9003: empty branch (staticcheck)
if err != nil {
^
lib/cgroup/cgroup_test.go:119:2: SA5001: should check returned error before deferring service.Close() (staticcheck)
defer service.Close()
^
lib/client/keystore_test.go:138:2: SA4006: this value of `keyCopy` is never used (staticcheck)
keyCopy, err = s.store.GetKey("host.a", "bob")
^
lib/client/api.go:1604:3: SA4004: the surrounding loop is unconditionally terminated (staticcheck)
return makeProxyClient(sshClient, m), nil
^
lib/backend/test/suite.go:156:2: SA4006: this value of `err` is never used (staticcheck)
result, err = s.B.GetRange(ctx, prefix("/prefix/c/c1"), backend.RangeEnd(prefix("/prefix/c/cz")), backend.NoLimit)
^
lib/utils/timeout_test.go:84:2: SA1019: t.Dial is deprecated: Use DialContext instead, which allows the transport to cancel dials as soon as they are no longer needed. If both are set, DialContext takes priority. (staticcheck)
t.Dial = func(network string, addr string) (net.Conn, error) {
^
lib/utils/websocketwriter.go:83:3: SA4006: this value of `err` is never used (staticcheck)
utf8, err = w.encoder.String(string(data))
^
lib/utils/loadbalancer_test.go:134:2: SA4006: this value of `out` is never used (staticcheck)
out, err = Roundtrip(frontend.String())
^
lib/utils/loadbalancer_test.go:209:2: SA4006: this value of `out` is never used (staticcheck)
out, err = RoundtripWithConn(conn)
^
lib/srv/forward/sshserver.go:582:3: SA4004: the surrounding loop is unconditionally terminated (staticcheck)
return
^
lib/service/service.go:347:4: SA4006: this value of `err` is never used (staticcheck)
i, err = auth.GenerateIdentity(process.localAuth, id, principals, dnsNames)
^
lib/service/signals.go:60:3: SA1016: syscall.SIGKILL cannot be trapped (did you mean syscall.SIGTERM?) (staticcheck)
syscall.SIGKILL, // fast shutdown
^
lib/config/configuration_test.go:184:2: SA4006: this value of `conf` is never used (staticcheck)
conf, err = ReadFromFile(s.configFileBadContent)
^
lib/config/configuration.go:129:2: SA5001: should check returned error before deferring reader.Close() (staticcheck)
defer reader.Close()
^
lib/kube/kubeconfig/kubeconfig_test.go:227:2: SA4006: this value of `err` is never used (staticcheck)
tlsCert, err := ca.GenerateCertificate(tlsca.CertificateRequest{
^
lib/srv/sess.go:720:3: SA4006: this value of `err` is never used (staticcheck)
result, err := s.term.Wait()
^
lib/multiplexer/multiplexer_test.go:169:11: SA1006: printf-style function with dynamic format string and no further arguments should use print-style function instead (staticcheck)
_, err = fmt.Fprintf(conn, proxyLine.String())
^
lib/multiplexer/multiplexer_test.go:221:11: SA1006: printf-style function with dynamic format string and no further arguments should use print-style function instead (staticcheck)
_, err = fmt.Fprintf(conn, proxyLine.String())
^
```
All changes should be noop, except for
`integration/integration_test.go`.
The integration test was ignoring `recordingMode` test case parameter
and always used `RecordAtNode`. When switching to `recordingMode`, test
cases with `RecordAtProxy` fail with a confusing error about missing
user agent. Filed https://github.com/gravitational/teleport/issues/3606
to track that separately and unblock enabling `structcheck` linter.
The node first tries using the token with auth server. If that fails, it
tries the same address as a proxy server.
If both fail, user only sees the error from the latter attempt. If using
CA pin, this error will be "x509: certificate signed by unknown
authority", which is confusing.
Log both errors, and mention that a fallback is happening. The output
looks like:
ERRO [AUTH] Failed to register through auth server: "my-hostname" [3e53a982-afd1-4d2f-8864-54c25fbe5865] can not join the cluster with role Node, the token is not valid; falling back to trying the proxy server auth/register.go:123
ERRO [PROC:1] Node failed to establish connection to cluster: failed to register through proxy server: x509: certificate signed by unknown authority. time/sleep.go:149
If node hasn't fully initialized before getting stopped (such as when
join token isn't valid), most pointer vars in `initSSH` will be nil.
Handle that cleanly.
* Add monorepo
* Add reset/passwd capability for local users (#3287)
* Add UserTokens to allow password resets
* Pass context down through ChangePasswordWithToken
* Rename UserToken to ResetPasswordToken
* Add auto formatting for proto files
* Add common Marshaller interfaces to reset password token
* Allow enterprise "tctl" reuse OSS user methods (#3344)
* Pass localAuthEnabled flag to UI (#3412)
* Added LocalAuthEnabled prop to WebConfigAuthSetting struct in webconfig.go
* Added LocalAuthEnabled state as part of webCfg in apiserver.go
* update e-refs
* Fix a regression bug after merge
* Update tctl CLI output msgs (#3442)
* Use local user client when resolving user roles
* Update webapps ref
* Add and retrieve fields from Cluster struct (#3476)
* Set Teleport versions for node, auth, proxy init heartbeat
* Add and retrieve fields NodeCount, PublicURL, AuthVersion from Clusters
* Remove debug logging to avoid log pollution when getting public_addr of proxy
* Create helper func GuessProxyHost to get the public_addr of a proxy host
* Refactor newResetPasswordToken to use GuessProxyHost and remove publicUrl func
* Remove webapps submodule
* Add webassets submodule
* Replace webapps sub-module reference with webassets
* Update webassets path in Makefile
* Update webassets
1b11b26 Simplify and clean up Makefile (#62) https://github.com/gravitational/webapps/commit/1b11b26
* Retrieve cluster details for user context (#3515)
* Let GuessProxyHost also return proxy's version
* Unit test GuessProxyHostAndVersion & GetClusterDetails
* Update webassets
4dfef4e Fix build pipeline (#66) https://github.com/gravitational/webapps/commit/4dfef4e
* Update e-ref
* Update webassets
0647568 Fix OSS redirects https://github.com/gravitational/webapps/commit/0647568
* update e-ref
* Update webassets
e0f4189 Address security audit warnings Updates "minimist" package which is used by 7y old "optimist". https://github.com/gravitational/webapps/commit/e0f4189
* Add new attr to Session struct (#3574)
* Add fields ServerHostname and ServerAddr
* Set these fields on newSession
* Ensure webassets submodule during build
* Update e-ref
* Ensure webassets before running unit-tests
* Update E-ref
Co-authored-by: Lisa Kim <lisa@gravitational.com>
Co-authored-by: Pierre Beaucamp <pierre@gravitational.com>
Co-authored-by: Jenkins <jenkins@gravitational.io>
Spring cleaning!
A very mechanical cleanup using several linters (unused, deadcode,
structcheck). Build and tests still pass so no behavior should be
affected.
This commit resolves #3227
In IOT mode, 10K nodes are connecting back to the proxies, putting
a lot of pressure on the proxy cache.
Before this commit, the proxy's only cache option was a persistent
sqlite-backed cache. The advantage of such caches is that proxies can
keep working after reboots while auth servers are unavailable.
The disadvantage is that the sqlite backend breaks down under many
concurrent reads due to performance issues.
This commit introduces the new cache configuration option, 'in-memory':
```yaml
teleport:
  cache:
    # default value is sqlite;
    # the only supported values are sqlite or in-memory
    type: in-memory
```
This cache mode allows two m4.4xlarge proxies to handle 10K IOT mode connected
nodes with no issues.
The second part of the commit disables the timer-based cache reload,
which caused inconsistent view results for 10K displayed nodes, with
servers disappearing from the view.
The third part of the commit increases the buffering of the channels
carrying discovery requests 10x. The channels were overfilling with 10K
nodes, and nodes were being disconnected. The logic no longer treats
channel overflow as a reason to close the connection. This is possible
due to changes in the discovery protocol that allow target nodes to
handle missing entries, duplicate entries, or conflicting values.
If the user enabled enhanced session recording in file configuration but
the binary was built without BPF support (like on macOS), Teleport exits
right away with a message explaining that the operating system does not
support enhanced session recording.
* Make Teleport log its version upon service start #3145
This change implements a resolution to issue #3145. The version and Git ref are output when component start information is logged.
https://github.com/gravitational/teleport/issues/3145
* fix merge artifact
Added package cgroup to orchestrate cgroups. Only cgroup2 is supported,
because cgroup2 cgroups have unique IDs that can be correlated with BPF
events.
Added the bpf package, which contains three BPF programs: execsnoop,
opensnoop, and tcpconnect. The bpf package starts and stops these
programs, correlates their output with Teleport sessions, and emits the
results to the audit log.
Added support for Teleport to re-exec itself before launching a shell.
This allows Teleport to start a child process, capture its PID, place
the PID in a cgroup, and then continue the process. Once continued, the
process can be tracked by its cgroup ID.
Reduced the total number of connections to a host so Teleport does not
quickly exhaust all file descriptors. That happens very quickly when
disk events, which occur at a very high rate, are emitted to the audit
log.
Added tarballs for exec sessions. Updated session.start and session.end
events with additional metadata. Updated the format of session tarballs
to include enhanced events.
Added file configuration for enhanced session recording. Added code to
start enhanced session recording and pass the package to SSH nodes.
* Support resource-based bootstrapping for backend.
Outside of static configuration, most of the persistent state of an
auth server exists as a collection of resources, stored in its
backend. The resource API also forms the basis of Teleport's more
advanced dynamic configuration options.
This commit extends the usefulness of the resource API by adding
the ability to bootstrap backend state with a set of previously
exported resources. This allows the resource API to serve as a
rudimentary backup/migration tool.
Notes: This feature is a work in progress and very easy to misuse;
while it will prevent you from overwriting the state of an existing
auth server, it won't stop you from bootstrapping into a wildly
misconfigured state. In general, resource-based bootstrapping is
not a complete solution for backup or migration.
* update e-ref
Update utils.CertChecker to only check key and certificate algorithms
when in FIPS mode. Otherwise accept keys and certificates generated with
any algorithm.
This commit implements #2872.
Similarly to `file://`, the `stdout://` scheme can be used alongside an
existing external scheme to log audit events to stdout:
```yaml
audit_events_uri: ['dynamodb://events', 'stdout://']
```
Just like the `file://` scheme, the `stdout://` scheme can only be used
when an external events backend and session uploader are defined, so
that all audit upload and search features of teleport keep working.
When attempting to guess the IP address of a remote host to add to the
host certificate, always remove the port.
Improve logging, so it's clear when a nodes host certificate changes due
to the principals list being updated.
Don't heartbeat address for nodes connected to clusters over a reverse
tunnel. Print warning to users if listen_addr or public_addr are set as
these are not used.
Return the tunnel address in the following preference order:
1. Reverse Tunnel Public Address.
2. SSH Proxy Public Address.
3. HTTP Proxy Public Address.
4. Tunnel Listen Address.
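The preference order above can be sketched as a simple fallback chain; the parameter names are illustrative, not the actual config fields:

```go
package main

import (
	"errors"
	"fmt"
)

// tunnelAddr returns the first address that is set, in the preference
// order listed above.
func tunnelAddr(tunnelPublic, sshPublic, httpPublic, tunnelListen string) (string, error) {
	for _, addr := range []string{tunnelPublic, sshPublic, httpPublic, tunnelListen} {
		if addr != "" {
			return addr, nil
		}
	}
	return "", errors.New("no tunnel address configured")
}

func main() {
	// Tunnel public address is unset, so the SSH proxy public address wins.
	addr, _ := tunnelAddr("", "proxy.example.com:3023", "", "0.0.0.0:3024")
	fmt.Println(addr) // proxy.example.com:3023
}
```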
Added "--fips" flag to "teleport start" command which can start
Enterprise in FedRAMP/FIPS 140-2 mode.
In FIPS mode, Teleport configures the TLS and SSH servers with FIPS
compliant cryptographic algorithms. In FIPS mode, if non-compliant
algorithms are chosen, Teleport will fail to start. In addition,
Teleport checks if the binary was compiled against an approved
cryptographic module (BoringCrypto) and fails to start if it was not.
If a client, like tsh, tries to use non-FIPS encryption, like NaCl,
those requests are also rejected.
In the IOT case (whenever teleport nodes connect
to the proxy), there is no need
to create ReverseTunnel objects in the backend,
as there is always one reverse tunnel per node.
This commit removes the logic that created
reverse tunnel objects in the backend in IOT cases
and refactors some other parts of the code.
Instantiate agent pool (and agent) with a reference to the reverse
tunnel server.
Pass list of principals to agents when initiating a transport dial
request.
The above two changes allow the agent to look up principals in local
site when attempting to connect to a node within a trusted cluster.
Whenever many IOT-style nodes connect
back to the web proxy server, they all
call the /find endpoint to discover the configuration.
This new endpoint is designed to be fast and not
hit the database.
In addition, every proxy reverse tunnel
connection handler was fetching auth servers;
this commit adds caching for the auth servers
on the proxy side.
Updated services.ReverseTunnel to support type (proxy or node). For
proxy types, which represent trusted cluster connections, when a
services.ReverseTunnel is created, it's created on the remote side with
name /reverseTunnels/example.com. For node types, services.ReverseTunnel
is created on the main side as /reverseTunnels/{nodeUUID}.clusterName.
Updated services.TunnelConn to support type (proxy or node). For proxy
types, which represent trusted cluster connections, tunnel connections
are created on the main side under
/tunnelConnections/remote.example.com/{proxyUUID}-remote.example.com.
For nodes, tunnel connections are created on the main side under
/tunnelConnections/example.com/{proxyUUID}-example.com. This allows
searching for tunnel connections by cluster then allows easily creating
a set of proxies that are missing matching services.TunnelConn.
The reverse tunnel server has been updated to handle heartbeats from
proxies as well as nodes. Proxy heartbeat behavior has not changed.
Heartbeats from nodes now add remote connections to the matching local
site. In addition, the reverse tunnel server now proxies connections to
the Auth Server for requests that are already authenticated (a second
authentication to the Auth Server is required).
For registration, nodes try and connect to the Auth Server to fetch host
credentials. Upon failure, nodes now try and fallback to fetching host
credentials from the web proxy.
To establish a connection to an Auth Server, nodes first try and connect
directly, and if the connection fails, fallback to obtaining a
connection to the Auth Server through the reverse tunnel. If a
connection is established directly, node startup behavior has not
changed. If a node establishes a connection through the reverse tunnel,
it creates an AgentPool that attempts to dial back to the cluster and
establish a reverse tunnel.
When nodes heartbeat, they also heartbeat if they are connected directly
to the cluster or through a reverse tunnel. For nodes that are connected
through a reverse tunnel, the proxy subsystem now directs the reverse
tunnel server to establish a connection through the reverse tunnel
instead of directly.
When sending discovery requests, the domain field has been replaced with
tunnelID. The tunnelID field is either the cluster name (same as before)
for proxies, or {nodeUUID}.example.com for nodes.
Buffer fan-out used a simple prefix match
in a loop, which resulted in high CPU load
with many connected watchers.
This commit switches to radix trees for
prefix matching, which reduces CPU load
substantially with 5K+ connected watchers.
This commit expands the usage of the caching layer
for auth server API:
* Introduces in-memory cache that is used to serve all
Auth server API requests. This is done to achieve scalability
on 10K+ node clusters, where each node fetches certificate authorities,
roles, users and join tokens. It is not possible to scale
the DynamoDB backend or other backends to 10K reads per second
on a single shard or partition. The solution is to introduce
an in-memory cache of the backend state that is always used
for reads.
* In-memory cache has been expanded to support all resources
required by the auth server.
* Experimental `tctl top` command has been introduced to display
common single node metrics.
Replace SQLite Memory Backend with BTree
The SQLite in-memory backend was suffering from
high tail latencies under load (up to 8 seconds
at the 99.9th percentile in load configurations).
This commit replaces the SQLite in-memory caching
backend with an in-memory BTree backend, which
brought tail latencies down to 2 seconds (99.9th percentile)
and improved overall performance.
This commit hex encodes trusted cluster names
in target addresses for kubernetes SNI proxy.
For example, assuming public address of Teleport
Kubernetes proxy is main.example.com, and trusted
cluster is remote.example.com, resulting target
address added to kubeconfig will look like
k72656d6f74652e6578616d706c652e636f6d0a.main.example.com
And Teleport Proxy's DNS Name will include wildcard:
'*.main.example.com' in addition to 'main.example.com'
Note that no dots are in the SNI address thanks to hex encoding.
This allows administrators to avoid manually updating
the list of public_addr sections every time a trusted cluster is
added, and to use the wildcard DNS name instead.
The following address:
remote.example.com.main.example.com would not have matched
*.main.example.com per the DNS wildcard spec.
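The encoding above can be reproduced with the standard library. This is a sketch, not Teleport's actual function: the "k" prefix and trailing newline are taken from the worked example above (the hex string in it decodes to "remote.example.com\n").

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// kubeClusterAddr (hypothetical name) builds the SNI routing address:
// the trusted cluster name is hex encoded so the label contains no dots
// and therefore matches the *.main.example.com wildcard.
func kubeClusterAddr(clusterName, proxyPublicAddr string) string {
	// "k" prefix and trailing newline as in the example above
	return "k" + hex.EncodeToString([]byte(clusterName+"\n")) + "." + proxyPublicAddr
}

func main() {
	fmt.Println(kubeClusterAddr("remote.example.com", "main.example.com"))
	// k72656d6f74652e6578616d706c652e636f6d0a.main.example.com
}
```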
This commit switches Teleport proxy to use impersonation
API instead of the CSR API.
This allows Teleport to work on EKS, GKE and all
other CNCF-compatible clusters.
This commit updates helm chart RBAC as well.
It introduces an extra configuration flag in the proxy_service
configuration section:
```yaml
proxy_service:
  # kubeconfig_file is used for scenarios
  # when Teleport Proxy is deployed outside
  # of the kubernetes cluster
  kubeconfig_file: /path/to/kube/config
```
It deprecates the similar flag in auth_service:
```yaml
auth_service:
  # DEPRECATED. THIS FLAG IS IGNORED
  kubeconfig_file: /path/to/kube/config
```
Created *utils.TrackingConn that wraps the server side net.Conn and is
used to track how much data is transmitted and received over the
net.Conn. At the close of a connection (close of a *srv.ServerContext)
the total data transmitted and received is emitted to the Audit Log.
This commit allows additional configuration
for the `audit_sessions_uri` parameter:
`audit_sessions_uri: s3://example.com/path?region=us-east-1`
The additional query parameter `region`, if set, will override
the default `region` from the `audit` section.
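Parsing the override could look like the following sketch, built on the standard net/url package; the function name and signature are illustrative, not Teleport's actual code.

```go
package main

import (
	"fmt"
	"net/url"
)

// regionFromURI (hypothetical helper) parses an audit_sessions_uri value
// and returns the bucket host plus the effective region: the "region"
// query parameter, when present, overrides the default from the audit
// section.
func regionFromURI(uri, defaultRegion string) (bucket, region string, err error) {
	u, err := url.Parse(uri)
	if err != nil {
		return "", "", err
	}
	region = defaultRegion
	if r := u.Query().Get("region"); r != "" {
		region = r
	}
	return u.Host, region, nil
}

func main() {
	bucket, region, _ := regionFromURI("s3://example.com/path?region=us-east-1", "us-west-2")
	fmt.Println(bucket, region) // example.com us-east-1
}
```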
This commit introduces several key changes to
Teleport backend and API infrastructure
in order to achieve scalability improvements
on 10K+ node deployments.
Events and plain keyspace
--------------------------
The new backend interface supports events,
pagination and range queries,
and moves away from buckets to
a plain keyspace, which aligns better
with DynamoDB and Etcd, which feature similar
interfaces.
All backend implementations now
expose an Events API, allowing
multiple subscribers to consume the same
event stream and avoid polling the database.
Replacing BoltDB, Dir with SQLite
-------------------------------
The BoltDB backend does not support
two processes accessing the database at the
same time. This prevented Teleport deployments
using the BoltDB backend from being live-reloaded.
SQLite supports reads/writes by multiple
processes and makes Dir backend obsolete
as SQLite is more efficient on larger collections,
supports transactions and can detect data
corruption.
Teleport automatically migrates data from
Bolt and Dir backends into SQLite.
GRPC API and protobuf resources
-------------------------------
GRPC API has been introduced for
the auth server. The auth server now serves both GRPC
and JSON-HTTP API on the same TLS socket and uses
the same client certificate authentication.
All future API methods should use GRPC; the HTTP-JSON
API is considered obsolete.
In addition, some resources like
Server and CertificateAuthority are now
generated from protobuf service specifications in
a way that is fully backward compatible with
original JSON spec and schema, so the same resource
can be encoded and decoded from JSON, YAML
and protobuf.
All models should be refactored
into new proto specification over time.
Streaming presence service
--------------------------
In order to cut bandwidth, nodes
send full updates only when changes
to labels or spec have occurred; otherwise
new lightweight GRPC keep-alive updates are sent
to the presence service, reducing
bandwidth usage on multi-node deployments.
In addition, nodes no longer poll the
auth server for certificate authority rotation
updates; instead they subscribe to event updates
to detect changes as soon as they happen.
Since this is a new API, errors are inevitable,
which is why polling is retained, but
at a much slower rate.
This commit reduces traffic consumed
by the teleport cluster by polling for CA
status less frequently.
It also addresses a bug in cert regeneration
that checked for the wrong principals.
This commit improves performance of teleport with
hundreds of connected trusted clusters.
The TLS handshake protocol expects the server to send a
list of trusted certificate authorities to the client,
and the client must present a certificate signed by one of them.
In Teleport's current implementation, every remote cluster
client is signed by a local certificate authority and is not
cross-signed.
Auth server now expects clients to announce the
remote cluster they are connecting from using SNI.
Auth server will send only certificate authorities
of the cluster announced via SNI.
An alternative idea is to cross-sign the certificate
of the remote cluster client. We will explore
this idea in future releases.
This commit also removes unnecessary reads
from the database to check the remote server status
that slows down user interface and other clients.
This is done at the expense of proxies showing
servers as offline when an individual
proxy does not have the connection. This is
a small UI price to pay for not reading
the database, as the proxy will eventually
get the connection thanks to the discovery
protocol.
Fixes #1986
When deployed outside of the kubernetes cluster,
teleport now reads all configuration from a kubernetes
config file supplied via a parameter.
The auth server then passes information about the
target API server back to the proxy.
Whenever critical services in teleport exit
with errors, the system should shut down immediately
and exit with an error. This has not been the case
since the 2.7 release.
When many nodes join the cluster or rotate certificates,
the auth server was forced to generate many private/public
key pairs simultaneously, creating a bottleneck
on the auth server side.
This commit pushes the private/public key generation
logic back to clients, relieving the pressure on the
auth server.
This commit moves proxy kubernetes configuration
to a separate nested block to provide more fine
grained settings:
```yaml
auth:
  kubernetes_ca_cert_path: /tmp/custom-ca
proxy:
  enabled: yes
  kubernetes:
    enabled: yes
    public_addr: [custom.example.com:port]
    api_addr: kubernetes.example.com:443
    listen_addr: localhost:3026
```
1. The kubernetes config section is explicitly enabled
or disabled. It is disabled by default.
2. The public address in the kubernetes section
is propagated to the tsh profile.
The other part of the commit updates the Ping
endpoint to send proxy configuration back to
the client, including the kubernetes public address
and SSH listen address.
Clients update their profile according to the configuration
received from the proxy.
This commit fixes #1970.
The original process has started but failed to join the cluster
and repeatedly reconnects to it. This process is not
ready yet, but can still process HUP (reload) events.
1. HUP event is sent to a parent process.
2. The parent process forks a child process and
awaits a message from the child on the signal pipe.
3. If the child process fails to connect to the cluster
as well, it does not emit Ready event and a message
is never sent to the parent.
4. The parent process fails to receive a message and
assumes that the child process has failed to start.
As a result of this, there are two processes
both trying to connect to the cluster.
This commit changes the behavior by adding an extra step:
if the child process fails to enter the ready state and it is aware that
it was forked by the parent process, it initiates self-shutdown.
This commit fixes #1610.
A new readyz endpoint is added to the existing
/metrics and /healthz endpoints activated by the
--diag-addr flag:
`teleport start --diag-addr=127.0.0.1:1234`
The readyz endpoint will report 503 if the node or
proxy failed to connect to the cluster, and 200 OK
otherwise.
Additional prometheus gauges report connection
counts for trusted and remote clusters:
```
remote_clusters{cluster="one"} 1
remote_clusters{cluster="two"} 1
trusted_clusters{cluster="one",state="connected"} 0
trusted_clusters{cluster="one",state="connecting"} 0
trusted_clusters{cluster="one",state="disconnected"} 0
trusted_clusters{cluster="one",state="discovered"} 1
trusted_clusters{cluster="one",state="discovering"} 0
```
This issue updates #1986.
This is an initial, experimental implementation that will
be updated with tests and edge cases prior to the production 2.7.0 release.
Teleport proxy adds support for Kubernetes API protocol.
Auth server uses Kubernetes API to receive certificates
issued by Kubernetes CA.
Proxy intercepts and forwards API requests to the Kubernetes
API server and captures live session traffic, making
recordings available in the audit log.
Tsh login now updates kubeconfig configuration to use
Teleport as a proxy server.
Flaky tests in teleport integration suite uncovered a problem.
It is possible that main cluster rotates certificate authority,
and will try to dial to the remote cluster with new credentials
before the remote cluster could fetch the new CA to trust.
To fix this, the "update_clients" phase was split into two phases:
* Init and Update clients
The init phase does nothing on the main cluster except generating
new certificate authorities that are trusted but not yet used in the
cluster.
This phase exists to give remote clusters the opportunity
to update their list of trusted certificate authorities
of the main cluster before the main cluster reconnects with new clients
in the "Update clients" phase.
* Cache services.ClusterConfig within srv.ServerContext for the duration
of a connection.
* Create a single websocket between the browser and the proxy for all
terminal bytes and events.
This commit fixes #1741
* If the bolt backend was used as the default,
new teleport versions continue using it as the default to prevent
regressions on start.
* Otherwise, the dir backend is used as the default.
This commit fixes #1803, fixes #1889
* Adds support for public_addr for Proxy and Auth
* Parameter advertise_ip now supports host:port format
* Fixes incorrect output for tctl get proxies
* Fixes duplicate output of some error messages.
This commit implements #1860
During the rotation procedure, the issuing TLS and SSH
certificate authorities are re-generated and all internal
components of the cluster re-register to get new
credentials.
The rotation procedure is based on a distributed
state machine algorithm - certificate authorities have
explicit rotation state and all parts of the cluster sync
local state machines by following transitions between phases.
An operator can launch CA rotation in auto or manual mode.
In manual mode the operator moves the cluster between rotation states
and watches the states of the components to sync.
In auto mode state transitions are happening automatically
on a specified schedule.
The design documentation is embedded in the code:
lib/auth/rotate.go
This fixes the race with systemd reload.
P - parent, C - child
During live reload scenario,
the following happens:
P -> forks C
P -> blocks on pipe read
C -> writes to pipe
C -> writes pid file
P <- reads message from pipe
P <- shuts down
However, there is a race:
P -> forks C
P -> blocks on pipe read
C -> writes to pipe
P <- reads message from pipe
P <- shuts down
C -> writes pid file
In this case the parent process exited
before the child process wrote the new pid file,
which makes systemd think that the main process
is down and stop both processes.
This fix changes the sequence to:
P -> forks C
P -> blocks on pipe read
C -> writes pid file
C -> writes to pipe
P <- reads message from pipe
P <- shuts down
to make sure the race can't happen any more.
This commit allows teleport parent process to track
the status of the forked child process using os.Pipe.
The child process signals success to the parent process by writing
to the pipe.
This allows HUP and USR2 to be more intelligent as they
can now detect the failure or success of the process.
This PR improves session recording:
* Nodes and proxies always buffer recorded sessions
to disk during the session, which improves performance
and makes the recording more resilient to network failures.
* Async uploader running on proxy or node always uploads the
session tarball to the audit log server.
* Audit log server is the only component uploading
to the S3 or any other API.
fixes #1785, fixes #1776
This commit fixes several issues with output:
First, teleport start now prints output
matching the quickstart guide and sets default
console logging to ERROR.
The SIGCHLD handler now only collects PIDs of
processes forked during live restart,
to avoid confusing other wait calls that
no longer have process status to collect.
Updates #1755
Design
------
This commit adds support for pluggable events and
sessions recordings and adds several plugins.
If external session recording storage
is used, nodes or proxies (depending on configuration)
store the session recordings locally and
then upload the recordings in the background.
Non-print session events are always sent to the
remote auth server as usual.
If remote events storage is used, auth
servers download recordings from it during playback.
DynamoDB event backend
----------------------
Transient DynamoDB backend is added for events
storage. Events are stored with default TTL of 1 year.
External lambda functions should be used
to forward events from DynamoDB.
The audit_table_name parameter in the storage section
turns on the DynamoDB backend.
The table will be auto-created.
S3 sessions backend
-------------------
If audit_sessions_uri is set to s3://bucket-name,
the node or proxy (depending on recording mode)
will start uploading the recorded sessions
to the bucket.
If the bucket does not exist, teleport will
attempt to create a bucket with versioning and encryption
turned on by default.
Teleport will turn on bucket-side encryption for the tarballs
using aws:kms key.
File sessions backend
---------------------
If audit_sessions_uri is set to file:///folder,
teleport will start writing tarballs to this folder instead
of sending recordings to the auth server.
This is helpful for plugin writers who can use fuse or NFS
mounted storage to handle the data.
Working dynamic configuration.
Fixes #1698.
* Added a sync.Pool for gzip.Writer objects that were
allocating a lot of large objects on the heap.
* Reshuffled signal handling: SIGQUIT now triggers
graceful shutdown, just like in Nginx.
* Signal USR1 prints helpful diagnostic info to stderr.
* Removed gops endpoint and flags.
* Fixed logs in some places.
* Debug flag now adds extra pprof handlers to diagnostic
endpoint.
Fixes #1671
* Add notes about TOS agreements for AMI
* Use specific UID for Teleport instances
* Use encrypted EFS for session storage
* By default, scale up auto scaling groups to the number of AZs
* Move dashboard to local file
* Fix dynamo locking bug
* Move PID writing, fixing the enterprise pid-file
* Add reload method for teleport units
This commit introduces signal handling.
Parent teleport process is now capable of forking
the child process and passing listeners file descriptors
to the child.
The parent process can then gracefully shut down
by tracking the number of current connections and
closing listeners once that number goes to 0.
Here are the signals handled:
* USR2 signal will cause the parent to fork
a child process and pass listener file descriptors to it.
Child process will close unused file descriptors
and will bind to the used ones.
At this point two processes, the parent
and the forked child, will be serving requests.
After looking at the traffic and the log files,
administrator can either shut down the parent process
or the child process if the child process is not functioning
as expected.
* TERM, INT signals will trigger graceful process shutdown.
Auth, node and proxy processes will wait until the number
of active connections goes down to 0 and will exit after that.
* KILL, QUIT signals will cause immediate non-graceful
shutdown.
* HUP signal combines USR2 and TERM signals in a convenient
way: the parent process will fork a child process and
self-initiate graceful shutdown. This is more convenient
than the USR2/TERM sequence, but less agile and robust:
if the connection to the parent process drops and
the new process exits with an error, administrators
can lock themselves out of the environment.
Additionally, the boltdb backend has to be phased out,
as it does not support reads/writes by two concurrent
processes. This required refactoring of the dir
backend to use file locking to allow inter-process
collaboration on read/write operations.
* Do not log EOF errors, avoid polluting logs
* Trim space from tokens when reading from file
* Do not use dir based caching
The caching problem deserves a separate explanation.
The directory backend is not concurrency friendly - it has a
fundamental design flaw: multiple goroutines writing to the
same file corrupt cache data.
This requires either redesign of the backend or switching
to boltdb backend for caching.
Boltdb backend uses transactions and is safe for concurrent
access. This PR changes local cache to use boltdb instead
of the dir backend that is now used only in tests.
Add support for extra principals for the proxy.
The proxy section already supports the public_addr
property, which is used in tctl users add
output.
Use the value from this property to update
the host SSH certificate for the proxy service.
```yaml
proxy_service:
  public_addr: example.com:3024
```
With the configuration above, the proxy host
certificate will contain the example.com principal
in its SSH principals list.
Support configuration for web and reverse tunnel
proxies to listen on the same port.
* Default configs are not changed, for backwards compatibility.
* If an administrator configures the web and reverse tunnel
addresses to be on the same port, multiplexing is turned on.
* In trusted cluster configuration, reverse_tunnel_addr
defaults to web_addr.
* Session events are delivered in continuous
batches in a guaranteed order with every event
and print event ordered from session start.
* Each auth server writes to a separate folder
on disk to make sure that no two processes write
to the same file at a time.
* When retrieving sessions, auth servers fetch
and merge results recorded by each auth server.
* Migrations and compatibility modes are in place
for older clients not aware of the new format,
but compatibility mode is not NFS friendly.
* On disk migrations are launched automatically
during auth server upgrades.
This commit introduces mutual TLS authentication
for the auth server API.
The auth server multiplexes HTTP over SSH (the existing
protocol) and HTTP over TLS (the new protocol)
on the same listening socket.
Nodes and users authenticate with Teleport 2.5.0
using mutual TLS, except in backwards-compatibility
cases.
* Allow external audit log plugins
* Add support for auth API server plugins
* Add license file path configuration parameter (not used in open-source)
* Extend audit log with user login events
If the user running teleport is a member of the adm group,
create the directory and all subdirectories
accessible to admins.
Remove obsolete migrations required for pre 2.3 releases.
This is a fix for a file descriptor leak in the audit log server caused
by a design issue:
Session file descriptors in audit log were opened on demand
when the session event or byte stream chunk was reported.
AuditLog server relied on SessionEnd event to close the
file descriptors associated with the session.
However, when SessionEnd event does not arrive (e.g.
there is a timeout or disconnect), the file descriptors
were not closed. This commit adds periodic clean up
of inactive sessions.
SessionEnd is now used as an optimization measure
to close the files, but is not used as the only
trigger to close files.
Now, idle sessions will close file descriptors
after periods of inactivity and will reopen the file
descriptors when session activity resumes.
SessionLogger was not designed to open/close files
multiple times, as it was resetting offsets
every time the session files were opened. This
change fixes that condition as well.
This was fixed running the `misspell` linter in fix mode using
`gometalinter`. The exact command I ran was:
```
gometalinter --vendor --disable-all -E misspell --linter='misspell:misspell -w {path}:^(?P<path>.*?\.go):(?P<line>\d+):(?P<col>\d+):\s*(?P<message>.*)$' ./...
```
Some typos were fixed by hand on top of it.
Instead of quietly changing behavior because the `DEBUG` envar was set to
true, Teleport now explicitly requires the scary --insecure flag to enable
this behavior.
BoltDB backend is now compatible with how all backends should
initialize.
Also all BoltDB-specific code/constants have been consolidated inside of
`backend.boltbk` package.
Originally Teleport had facilities to configure events/recordings via two
separate backends.
In reality those two objects (session events and session recordings)
need each other, and currently there is only one implementation of it.
The old structures were unused. This commit is 100% dead code removal.
- Added ability to read AWS config from `~/.aws` directory for testing
- Fixed TTL bug in DynamoDB back-end
- Made FS back-end return similar error types as Boltdb does
- Cleaned up buggy tests for DynamoDB
- Removed unnecessary locks everywhere in code
Functionality:
The `teleport` binary now serves web assets from its own binary file,
unless the `DEBUG` environment variable is set to "1" or "true", in
which case it will look for ../web/dist (as located in the github repo),
which can be used for development.
Design:
To avoid accumulating 3rd-party dependencies with a ton of extra
features and licenses, this implementation uses a minimalistic
implementation of the http.FileSystem interface on top of an embedded ZIP
archive.
1. The assets are zipped into assets.zip during build process
2. assets.zip gets appended to the end of `teleport` binary
3. The resulting file is converted into a self-extracting ZIP
4. Teleport opens itself using the built-in zip unarchiver, and loads
the assets on demand.
Notes:
1. LOC is tiny (dozens)
2. RAM consumption is CONSTANT regardless of the ZIP size, about 500Kb
increase vs load-from-file, and most of it is linking zip archive
code from the standard library. Tested with a 20MB ZIP archive.
This backend can be enabled by optionally adding a new build flag.
See lib/backend/dynamo/README.md for details.
It should not affect default Teleport builds.
Instead of trying to achieve a full "offline" operation, this commit
honestly converts previous attempts to a "caching access point client"
behavior.
Closes #554
I know comments are very lacking right now. Once things are stable I will add
proper comments. Minimal manual testing of the U2F registration API was done
with a hardware U2F key. Some of the code may need to be cleaned up later to
remove excessively long variable names...
Currently we return an error right away if the username/password combo is wrong.
It's difficult to do U2F without revealing either whether a user exists or
whether the password is correct. Returning an error immediately reveals whether
the user/password combo is valid. Waiting until we get a signed response
from the U2F device to announce whether the user/pass combo is valid can reveal
which users exist, since we need to return a keyHandle in the U2F SignRequest,
and generating fake keyHandles for nonexistent users is difficult to get
right since there is no rigid format for keyHandle.
What works:
1. You have to start all 3: node, proxy and auth.
2. Login using 'tsh' (so it will create a cert)
3. Then you can shut 'auth' down.
4. Proxy and node will stay up and tsh will be able to login.
What doesn't work:
1. Auth updates are not visible to proxy/node (like new servers)
2. Not sure if "trusted clusters" will work.
At this stage I have an in-memory snapshot of a "cluster state" which
can be kept by nodes in-memory not requiring the auth connection to be
up 100% of the time.
Node and proxy are now both using this snapshot instead of a live
connection to the auth server.
Next steps:
- Make node and proxy continue to work after the auth is killed.
- Make the snapshot persistent.
- Make node & proxy use persistence and be able to restart with the auth
server down.
IMPORTANT:
Also found an interesting case where process identity is generated (on
first start). Previously there wasn't any kind of locking, and concurrent
identity initialization was possible. While it's not clear if this can
cause any real-world issue, I have refactored it into a separate
lock-protected function.
Teleport configuration now has a new field: NoAudit (false by default,
which means audit is always on).
When this option is set, Teleport will not record events and will not
record sessions.
It's implemented by adding a "DiscardLogger" which implements the same
interface as the real logger, and it's plugged into the system instead.
NOTE: this option is not exposed in teleport in any way: no config file,
no switch, etc. I quickly needed it for Telecast.
* Downgraded many messages from `Debug` to `Info`
* Edited messages so they're not verbose and not too short
* Added "context" to some
* Added logical teleport component as [COMPONENT] at the beginning of
many, making logs **vastly** easier to read.
* Added one more logging level option when creating Teleport (only
Teleconsole uses it for now)
The output with 'info' severity now looks extremely clean.
This is startup, for example:
```
INFO[0000] [AUTH] Auth service is starting on turing:32829 file=utils/cli.go:107
INFO[0000] [SSH:auth] listening socket: 127.0.0.1:32829 file=sshutils/server.go:119
INFO[0000] [SSH:auth] is listening on 127.0.0.1:32829 file=sshutils/server.go:144
INFO[0000] [Proxy] Successfully registered with the cluster file=utils/cli.go:107
INFO[0000] [Node] Successfully registered with the cluster file=utils/cli.go:107
INFO[0000] [AUTH] keyAuth: 127.0.0.1:56886->127.0.0.1:32829, user=turing file=auth/tun.go:370
WARN[0000] unable to load the auth server cache: open /tmp/cluster-teleconsole-client781495771/authservers.json: no such file or directory file=auth/tun.go:594
INFO[0000] [SSH:auth] new connection 127.0.0.1:56886 -> 127.0.0.1:32829 vesion: SSH-2.0-Go file=sshutils/server.go:205
INFO[0000] [AUTH] keyAuth: 127.0.0.1:56888->127.0.0.1:32829, user=turing.teleconsole-client file=auth/tun.go:370
INFO[0000] [AUTH] keyAuth: 127.0.0.1:56890->127.0.0.1:32829, user=turing.teleconsole-client file=auth/tun.go:370
INFO[0000] [Node] turing connected to the cluster 'teleconsole-client' file=service/service.go:158
INFO[0000] [AUTH] keyAuth: 127.0.0.1:56892->127.0.0.1:32829, user=turing file=auth/tun.go:370
INFO[0000] [SSH:auth] new connection 127.0.0.1:56890 -> 127.0.0.1:32829 vesion: SSH-2.0-Go file=sshutils/server.go:205
INFO[0000] [SSH:auth] new connection 127.0.0.1:56888 -> 127.0.0.1:32829 vesion: SSH-2.0-Go file=sshutils/server.go:205
INFO[0000] [Node] turing.teleconsole-client connected to the cluster 'teleconsole-client' file=service/service.go:158
INFO[0000] [Node] turing.teleconsole-client connected to the cluster 'teleconsole-client' file=service/service.go:158
INFO[0000] [SSH] received event(SSHIdentity) file=service/service.go:436
INFO[0000] [SSH] received event(ProxyIdentity) file=service/service.go:563
```
You can easily tell that auth, ssh node and proxy have successfully started.
We had this flag in the configuration forever, but apparently it was
being ignored.
It allows teleport proxy to start without HTTP UI enabled. This is
useful for proxies that strictly proxy and do nothing else.
I ran into this bug the first time I used this flag for Telecast; it
did not work, so I fixed it.
Teleport YAML config now has a new configuration variable for internal
use by Gravitational:
```yaml
teleport:
seed_config: true
```
If set to 'true', Teleport treats YAML configuration simply as a seed
configuration on first start.
If set to 'false' (default for OSS version), Teleport will throw away
its back-end config, treating YAML config as the only source of truth.
Specifically, for now, the following settings are thrown away if not
found in YAML:
- trusted authorities
- reverse tunnels
- Friendly error messages when parsing configuration and establishing
connection
- Bugs related to "first start" vs subsequent starts (reverse tunnels
added to the YAML file won't be seen upon restart)
- Nicer logging
1. tctl auth export now dumps both user & host keys if the --type flag is missing
2. created fixtures for testing key imports: they're in
fixtures/trusted_clusters
3. configuration parser reads "trusted_clusters" files expecting the
output of tctl auth export
1. data_dir is now a global setting in teleport.yaml (instead of being
inside of "storage" sub-section)
2. changing data_dir in one place causes all of teleport to use it,
not just bolt backends.
3. moving auth server to listen on non-default ports properly adjusts
the global auth_servers setting
4. `tctl` now accepts the -c flag just like Teleport, so you can pass
`teleport.yaml` to it.
Fixes #432, fixes #431, fixes #430
TunClient always tries to dial the statically configured auth server
first, before trying "discovered" ones.
The rationale is that --auth flag must override whatever dynamic auth
servers have been discovered (because sometimes their IPs are wrong, if
advertise-ip was misconfigured)
Closes #416, fixes #416