Don't assert mg_initialized due to device addition race

During device removal stress tests, we noticed that we were tripping 
the assertion that mg_initialized was true. After investigation, it was 
determined that the mg in question was the embedded log metaslab 
group for a newly added vdev; the normal mg had been initialized (by 
metaslab_sync_reassess, via vdev_sync_done). However, because the spa 
config alloc lock is not held as writer across both calls to 
metaslab_sync_reassess, it is possible for an allocation to happen 
between the two metaslab_groups being initialized. Because the metaslab 
code doesn't check the group in question, just the vdev's main mg, it 
is possible to get past the initial check in vdev_allocatable and 
later fail due to the assertion.

We simply remove the assertions. We could also consider locking the 
ALLOC lock around the reassess calls in vdev_sync_done, but that risks 
deadlocks. We could check the actual target mg in vdev_allocatable, 
but that risks racing with a passivation that comes in after that 
check but before the assertion. We still won't be able to actually 
allocate from the metaslab group if no metaslabs are ready, so this 
change shouldn't break anything.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #15818
This commit is contained in:
Paul Dagnelie 2024-01-29 10:36:42 -08:00 committed by GitHub
parent 401c3563d4
commit 8161b73272
No known key found for this signature in database
GPG key ID: B5690EEEBB952194

View file

@ -5105,7 +5105,6 @@ metaslab_group_alloc(metaslab_group_t *mg, zio_alloc_list_t *zal,
int allocator, boolean_t try_hard)
{
uint64_t offset;
ASSERT(mg->mg_initialized);
offset = metaslab_group_alloc_normal(mg, zal, asize, txg, want_unique,
dva, d, allocator, try_hard);
@ -5256,8 +5255,6 @@ metaslab_alloc_dva(spa_t *spa, metaslab_class_t *mc, uint64_t psize,
goto next;
}
ASSERT(mg->mg_initialized);
/*
* Avoid writing single-copy data to an unhealthy,
* non-redundant vdev, unless we've already tried all