minio/internal
Klaus Post b890bbfa63
Add local disk health checks (#14447)
The main goal of this PR is to handle the situation where disks stop
responding to operations. This generally causes a file descriptor (FD)
build-up that eventually crashes the server.

This adds detection of hung disks, where calls on a disk get stuck.

We add functionality to `xlStorageDiskIDCheck` so that it keeps
track of the number of concurrent requests on a given disk.

At most 100 concurrent operations are allowed. Once this limit is reached,
new requests are blocked (but not rejected) while we monitor the
state of the disk.

If no requests have been completed or updated within a 15-second
window, we mark the disk as offline. Blocked requests are then
unblocked and return a "faulty disk" error.

New requests will be rejected until the disk is marked OK again.
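
The limit-and-stall logic described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `diskTracker`, `newDiskTracker`, `acquire` and `progress` are hypothetical names, and treating every release as progress is a simplification. Only the "faulty disk" error and the limit/window behaviour come from the description.

```go
package main

import (
	"errors"
	"sync/atomic"
	"time"
)

// errFaultyDisk mirrors the "faulty disk" error named in the commit message.
var errFaultyDisk = errors.New("faulty disk")

// diskTracker is a hypothetical, simplified health tracker for one disk.
type diskTracker struct {
	tokens       chan struct{} // one token per allowed concurrent operation
	lastProgress int64         // unix nanoseconds of the last completed or updated operation
	faulty       int32         // set to 1 once the disk has been marked faulty
	stall        time.Duration // window without progress before the disk is marked faulty
}

// newDiskTracker allows maxConcurrent operations at once. stall must be
// at least 4ns so that the polling interval below stays positive.
func newDiskTracker(maxConcurrent int, stall time.Duration) *diskTracker {
	t := &diskTracker{tokens: make(chan struct{}, maxConcurrent), stall: stall}
	for i := 0; i < maxConcurrent; i++ {
		t.tokens <- struct{}{}
	}
	t.progress()
	return t
}

// progress records that some operation completed or made progress.
func (t *diskTracker) progress() {
	atomic.StoreInt64(&t.lastProgress, time.Now().UnixNano())
}

// acquire rejects immediately if the disk is faulty. Otherwise it blocks
// until a token is free; while blocked it polls for progress and gives up
// with errFaultyDisk once nothing has progressed within the stall window.
func (t *diskTracker) acquire() (release func(), err error) {
	if atomic.LoadInt32(&t.faulty) == 1 {
		return nil, errFaultyDisk
	}
	tick := time.NewTicker(t.stall / 4)
	defer tick.Stop()
	for {
		select {
		case <-t.tokens:
			return func() {
				t.progress() // simplification: every release counts as progress
				t.tokens <- struct{}{}
			}, nil
		case <-tick.C:
			if atomic.LoadInt32(&t.faulty) == 1 {
				return nil, errFaultyDisk
			}
			last := time.Unix(0, atomic.LoadInt64(&t.lastProgress))
			if time.Since(last) > t.stall {
				atomic.StoreInt32(&t.faulty, 1)
				return nil, errFaultyDisk
			}
		}
	}
}
```

With the values from the description, a caller would construct `newDiskTracker(100, 15*time.Second)`, call `acquire()` before each disk operation, and call the returned `release` function when the operation finishes.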

Once a disk has been marked faulty, a check runs every 5 seconds that
attempts to write and read back a file. As long as this fails, the disk
remains faulty.

To prevent many long-running requests from marking the disk faulty, we
implement a callback feature that lets these operations update the
status while they are running.

We add a reader and writer wrapper that will update the status on each
successful read/write operation. This should allow fine enough granularity
that a slow, but still operational disk will not go 15 seconds without
any operation making progress.

Note that errors themselves are not enough to mark a disk faulty. 
A nil (or io.EOF) error will mark a disk as "good".

* Make concurrent disk setting configurable via `_MINIO_DISK_MAX_CONCURRENT`.

* de-couple IsOnline() from disk health tracker

The purpose of IsOnline() is to ensure that we
reconnect the drive only when the drive was

- disconnected from the network: we need to validate
  that the drive is "correct" and is the same drive
  that belongs to this server.

- replaced: we have to format it, since we
  support hot swapping of drives.

IsOnline() is not meant for taking the drive offline
when it is hung; that is not useful. We can let the
drive stay online and instead return errors for the
relevant calls.

* return errFaultyDisk for DiskInfo() call

Co-authored-by: Harshavardhana <harsha@minio.io>

Possible future improvements:

* Unify the REST server and local xlStorageDiskIDCheck. This would also improve stats significantly.
* Allow reads/writes to be aborted by the context.
* Add usage stats, concurrent count, blocked operations, etc.
2022-03-09 11:38:54 -08:00
..
arn run gofumpt cleanup across code-base (#14015) 2022-01-02 09:15:06 -08:00
auth add gocritic/ruleguard checks back again, cleanup code. (#13665) 2021-11-16 09:28:29 -08:00
bpool run gofumpt cleanup across code-base (#14015) 2022-01-02 09:15:06 -08:00
bucket Disallow delete replication for tag based rules (#14167) 2022-01-24 15:22:20 -08:00
color rename all remaining packages to internal/ (#12418) 2021-06-01 14:59:40 -07:00
config fix: regression from refactor in AMQP notification (#14455) 2022-03-02 21:35:48 -08:00
crypto run gofumpt cleanup across code-base (#14015) 2022-01-02 09:15:06 -08:00
disk run gofumpt cleanup across code-base (#14015) 2022-01-02 09:15:06 -08:00
dsync tests: Clean up dsync package (#14415) 2022-03-01 11:14:28 -08:00
etag rename all remaining packages to internal/ (#12418) 2021-06-01 14:59:40 -07:00
event Add authorization header to HEAD requests (#14510) 2022-03-09 10:48:56 -08:00
fips tls: add TLS 1.3 ciphers to the list of supported ciphers (#13158) 2021-09-07 09:57:32 -07:00
handlers run gofumpt cleanup across code-base (#14015) 2022-01-02 09:15:06 -08:00
hash fix: enable go1.17 github ci/cd (#12997) 2021-08-18 18:35:22 -07:00
http Send deployment id and minio version in http header (#14378) 2022-02-23 13:36:01 -08:00
init Disable AVX512 on Darwin (#13550) 2021-11-01 08:03:16 -07:00
ioutil Add local disk health checks (#14447) 2022-03-09 11:38:54 -08:00
jwt run gofumpt cleanup across code-base (#14015) 2022-01-02 09:15:06 -08:00
kms kes: remove unnecessary error conversion (#14459) 2022-03-03 09:42:37 -08:00
lock run gofumpt cleanup across code-base (#14015) 2022-01-02 09:15:06 -08:00
logger improve logs, fix banner formatting (#14456) 2022-03-03 13:21:16 -08:00
lsync run gofumpt cleanup across code-base (#14015) 2022-01-02 09:15:06 -08:00
mountinfo run gofumpt cleanup across code-base (#14015) 2022-01-02 09:15:06 -08:00
pubsub rename all remaining packages to internal/ (#12418) 2021-06-01 14:59:40 -07:00
rest cleanup dsync tests and remove net/rpc references (#14118) 2022-01-18 12:44:38 -08:00
s3select select: add MISSING operator support (#14406) 2022-02-25 12:31:19 -08:00
smart run gofumpt cleanup across code-base (#14015) 2022-01-02 09:15:06 -08:00
sync/errgroup rename all remaining packages to internal/ (#12418) 2021-06-01 14:59:40 -07:00