* Added support for connecting API client through tunnel proxy and web proxy addresses (with identity file).
* Added concurrent dialing logic to dial several possible dialing combinations and seamlessly return the first client to connect.
After a recent local C compiler upgrade, I started getting these
warnings when building teleport:
```
\# github.com/mattn/go-sqlite3
sqlite3-binding.c: In function 'sqlite3SelectNew':
sqlite3-binding.c:123303:10: warning: function may return address of local variable [-Wreturn-local-addr]
123303 | return pNew;
| ^~~~
sqlite3-binding.c:123263:10: note: declared here
123263 | Select standin;
| ^~~~~~~
```
Upgrading to the latest version clears those.
Here's the full changelog: https://github.com/mattn/go-sqlite3/compare/v1.10.0...v1.14.6
This is just a refactoring without functional changes. Pull all the u2f
handling spread across multiple client and server packages into one
place.
Also clean up an obsolete vendored dependency, unrelated to this PR.
* Upgrade github.com/gravitataional/trace to v1.1.12
We were a few versions behind. In particular this versions lets us use
stdlib's `errors.Is/As` to inspect errors.
* Bump trace to 1.1.13
Co-authored-by: Andrew Lytvynov <andrew@goteleport.com>
This commit fixes#5177
Initial implementation uses dir backend as a cache and is OK
for small clusters, but will be a problem for many proxies.
This implementation uses Go autocert that is quite limited
compared to Caddy's certmagic or lego.
Autocert has no OCSP stapling and no locking for cache for example.
However, it is much simpler and has no dependencies.
It will be easier to extend to use Teleport backend as a cert cache.
```yaml
proxy_service:
public_addr: ['example.com']
# ACME - automatic certificate management environment.
#
# It provisions certificates for domains and
# valid subdomains in public_addr section.
#
# The sudomains are valid if there is a registered application.
# For example, app.example.com will get a cert if app is a regsitered
# application access app. The sudomain cookie.example.com is not.
#
# Teleport acme is using TLS-ALPN-01 challenge:
#
# https://letsencrypt.org/docs/challenge-types/#tls-alpn-01
#
acme:
# By default acme is disabled.
enabled: true
# Use a custom URI, for example staging is
#
# https://acme-staging-v02.api.letsencrypt.org/directory
#
# Default is letsencrypt.org production URL:
#
# https://acme-v02.api.letsencrypt.org/directory
uri: ''
# Set email to receive alerts and other correspondence
# from your certificate authority.
email: 'alice@example.com'
```
* Add logger attributes to be able to propagate logger from tests for identifying tests
* Add test case for Server's DeepCopy.
* Update test to using the testing package directly. Update dependency after upstream PR.
* Update logrus package to fix data races
* Introduce a logger that uses the test context to log the messages so they are output if a test fails for improved trouble-shooting.
* Revert introduction of test logger - simply leave logger configuration at debug level outputting to stderr during tests.
* Run integration test for e as well
* Use make with a cap and append to only copy the relevant roles.
* Address review comments
* Update integration test suite to use test-local logger that would only output logs iff a specific test has failed - no logs from other test cases will be output.
* Revert changes to InitLoggerForTests API
* Create a new logger instance when applying defaults or merging with file service configuration
* Introduce a local logger interface to be able to test file configuration merge.
* Fix kube integration tests w.r.t log
* Move goroutine profile dump into a separate func to handle parameters consistently for all invocations
This change has several parts: cluster registration, cache updates,
routing and a new tctl flag.
> cluster registration
Cluster registration means adding `KubernetesClusters` to `ServerSpec`
for servers with `KindKubeService`.
`kubernetes_service` instances will parse their kubeconfig or local
`kube_cluster_name` and add them to their `ServerSpec` sent to the auth
server. They are effectively declaring that "I can serve k8s requests
for k8s cluster X".
> cache updates
This is just cache plumbing for `kubernetes_service` presence, so that
other teleport processes can fetch all of kube services. It was missed
in the previous PR implementing CRUD for `kubernetes_service`.
> routing
Now the fun part - routing logic. This logic lives in
`/lib/kube/proxy/forwarder.go` and is shared by both `proxy_service`
(with kubernetes integration enabled) and `kubernetes_service`.
The target k8s cluster name is passed in the client cert, along with k8s
users/groups information.
`kubernetes_service` only serves requests for its direct k8s cluster
(from `Forwarder.creds`) and doesn't route requests to other teleport
instances.
`proxy_service` can serve requests:
- directly to a k8s cluster (the way it works pre-5.0)
- to a leaf teleport cluster (also same as pre-5.0, based on
`RouteToCluster` field in the client cert)
- to a `kubernetes_service` (directly or over a tunnel)
The last two modes require the proxy to generate an ephemeral client TLS
cert to do an outbound mTLS connection.
> tctl flag
A flag `--kube-cluster-name` for `tctl auth sign --format=kubernetes`
which allows generating client certs for non-default k8s cluster name
(as long as it's registered in a cluster).
I used this for testing, but it could be used for automation too.
`require` is a sister package to `assert` that terminates the test on
failure. `assert` records the failure but lets the test proceed, which
is un-intuitive.
Also update all existing tests to match.
This commit introduces GRPC API for streaming sessions.
It adds structured events and sync streaming
that avoids storing events on disk.
You can find design in rfd/0002-streaming.md RFD.
* Always collect metrics about top backend requests
Previously, it was only done in debug mode. This makes some tabs in
`tctl top` empty, when auth server is not in debug mode.
* backend: use an LRU cache for top requests in Reporter
This LRU cache tracks the most frequent recent backend keys. All keys in
this cache map to existing labels in the requests metric. Any evicted
keys are also deleted from the metric.
This will keep an upper limit on our memory usage while still always
reporting the most active keys.
This is a result of "go mod vendor".
You'll notice that some versions have changed. This is due to the
transient module dependencies that dep wasn't aware of.
For example:
- Gopkg.lock imported cloud.google.com/go v0.41.0 and
github.com/fsouza/fake-gcs-server v1.11.6
- github.com/fsouza/fake-gcs-server v1.11.6 has a go.mod file that
depends on cloud.google.com/go v0.43.3:
https://github.com/fsouza/fake-gcs-server/blob/v1.11.6/go.mod#L4
- therefore, "go mod vendor" bumped cloud.google.com/go to v.0.43.3
Same transient dependency version bumps got applied to some other
modules.
A few are also removed via "go mod tidy".
By importing the magic `k8s.io/client-go/plugin/pkg/client/auth`
package, we compile-in support for auth plugins: gcp, azure, oidc,
openstack.
Without this, users can't provide a kubeconfig using those authn
methods to a proxy.
Buffer fan out used simple prefix match
in a loop, what resulted in high CPU load
on many connected watchers.
This commit switches to RADIX trees for
prefix matching what reduces CPU load
substantially for 5K+ connected watchers.
This commit expands the usage of the caching layer
for auth server API:
* Introduces in-memory cache that is used to serve all
Auth server API requests. This is done to achieve scalability
on 10K+ node clusters, where each node fetches certificate authorities,
roles, users and join tokens. It is not possible to scale
DynamoDB backend or other backends on 10K reads per seconds
on a single shard or partition. The solution is to introduce
an in-memory cache of the backend state that is always used
for reads.
* In-memory cache has been expanded to support all resources
required by the auth server.
* Experimental `tctl top` command has been introduced to display
common single node metrics.
Replace SQLite Memory Backend with BTree
SQLite in memory backend was suffering from
high tail latencies under load (up to 8 seconds
in 99.9%-ile on load configurations).
This commit replaces the SQLite memory caching
backend with in-memory BTree backend that
brought down tail latencies to 2 seconds (99.9%-ile)
and brought overall performance improvement.
This commit introduces several key changes to
Teleport backend and API infrastructure
in order to achieve scalability improvements
on 10K+ node deployments.
Events and plain keyspace
--------------------------
New backend interface supports events,
pagination and range queries
and moves away from buckets to
plain keyspace, what better aligns
with DynamoDB and Etcd featuring similar
interfaces.
All backend implementations are
exposing Events API, allowing
multiple subscribers to consume the same
event stream and avoid polling database.
Replacing BoltDB, Dir with SQLite
-------------------------------
BoltDB backend does not support
having two processes access the database at the
same time. This prevented Teleport
using BoltDB backend to be live reloaded.
SQLite supports reads/writes by multiple
processes and makes Dir backend obsolete
as SQLite is more efficient on larger collections,
supports transactions and can detect data
corruption.
Teleport automatically migrates data from
Bolt and Dir backends into SQLite.
GRPC API and protobuf resources
-------------------------------
GRPC API has been introduced for
the auth server. The auth server now serves both GRPC
and JSON-HTTP API on the same TLS socket and uses
the same client certificate authentication.
All future API methods should use GRPC and HTTP-JSON
API is considered obsolete.
In addition to that some resources like
Server and CertificateAuthority are now
generated from protobuf service specifications in
a way that is fully backward compatible with
original JSON spec and schema, so the same resource
can be encoded and decoded from JSON, YAML
and protobuf.
All models should be refactored
into new proto specification over time.
Streaming presence service
--------------------------
In order to cut bandwidth, nodes
are sending full updates only when changes
to labels or spec have occured, otherwise
new light-weight GRPC keep alive updates are sent
over to the presence service, reducing
bandwidth usage on multi-node deployments.
In addition to that nodes are no longer polling
auth server for certificate authority rotation
updates, instead they subscribe to event updates
to detect updates as soon as they happen.
This is a new API, so the errors are inevitable,
that's why polling is still done, but
on a way slower rate.