AK: Rehash HashTable in-place instead of shrinking

As seen on TV, HashTable can get "thrashed", i.e. it has a bunch of deleted buckets that count towards the load factor. This means that hash tables which are large enough for their contents still need to be resized. This was fixed in 9d8da16 with a workaround that shrinks the HashTable back down in these cases, as after the resize and re-hash the load factor is very low again. However, that's not a good solution. If you insert and remove repeatedly around a size boundary, you might get frequent resizes, which involve frequent re-allocations.

The new solution is an in-place rehashing algorithm that I came up with. (Do complain to me, I'm at fault.) Basically, it iterates the buckets and re-hashes the used buckets while marking the deleted slots empty. The issue arises with collisions in the re-hash. For this reason, there are two kinds of used buckets during the re-hashing: the normal "used" buckets, which are old and are treated as free space, and the "re-hashed" buckets, which are new and treated as used space, i.e. they trigger probing. Therefore, the procedure for relocating a bucket's contents is as follows:

- Locate the "real" bucket of the contents with the hash. That bucket is the starting point for the target bucket, and the current (old) bucket is the bucket we want to move.
- While we still need to move the bucket:
  - If we're the target, something strange happened last iteration or we just re-hashed to the same location. We're done.
  - If the target is empty or deleted, just move the bucket. We're done.
  - If the target is a re-hashed full bucket, we probe by double-hashing our hash as usual. This moves our target for the next iteration.
  - If the target is an old full bucket, we swap the target and to-move buckets. The moved contents are now at the correct location, and the former target's contents, which still need to find a new place, are now in the bucket to move. So we can just continue with the loop; the target is re-obtained from the bucket to move.

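The steps above can be sketched on a toy table. This is a minimal illustration, not the AK implementation: `ToyTable`, `home()` and `count()` are hypothetical names, keys are plain `unsigned` values, the capacity is fixed, and collisions are resolved with linear probing instead of the double hashing that HashTable uses, so that probing is guaranteed to visit every bucket.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class State { Free, Used, Deleted, Rehashed };

struct Bucket {
    State state = State::Free;
    unsigned key = 0;
};

// Hypothetical toy table: fixed capacity, unsigned keys, linear probing.
struct ToyTable {
    std::vector<Bucket> buckets;

    explicit ToyTable(std::size_t capacity) : buckets(capacity) {}

    // Index of the "real" (home) bucket for a key.
    std::size_t home(unsigned key) const { return (key * 2654435761u) % buckets.size(); }

    std::size_t count(State s) const
    {
        std::size_t n = 0;
        for (auto const& b : buckets)
            if (b.state == s)
                ++n;
        return n;
    }

    // Assumes the key is not present yet and the table is not full.
    void insert(unsigned key)
    {
        assert(count(State::Used) < buckets.size());
        std::size_t i = home(key);
        while (buckets[i].state == State::Used)
            i = (i + 1) % buckets.size();
        buckets[i].key = key;
        buckets[i].state = State::Used;
    }

    Bucket* find(unsigned key)
    {
        std::size_t i = home(key);
        for (std::size_t probes = 0; probes < buckets.size(); ++probes) {
            Bucket& b = buckets[i];
            if (b.state == State::Free)
                return nullptr; // A free bucket ends the probe chain.
            if (b.state == State::Used && b.key == key)
                return &b;
            i = (i + 1) % buckets.size(); // Deleted buckets keep the chain alive.
        }
        return nullptr;
    }

    void remove(unsigned key)
    {
        if (Bucket* b = find(key))
            b->state = State::Deleted;
    }

    void rehash_in_place()
    {
        std::size_t const capacity = buckets.size();
        for (std::size_t i = 0; i < capacity; ++i) {
            Bucket& to_move = buckets[i];
            if (to_move.state == State::Deleted)
                to_move.state = State::Free;
            if (to_move.state != State::Used)
                continue; // Free or already re-hashed: nothing to do.

            std::size_t target = home(to_move.key);
            for (;;) {
                Bucket& t = buckets[target];
                if (target == i) {
                    // We are our own target: the key stays where it is.
                    to_move.state = State::Rehashed;
                    break;
                }
                if (t.state == State::Free || t.state == State::Deleted) {
                    // Empty target: move the key and free the old bucket.
                    t.key = to_move.key;
                    t.state = State::Rehashed;
                    to_move.state = State::Free;
                    break;
                }
                if (t.state == State::Rehashed) {
                    // Re-hashed target counts as occupied: probe onwards.
                    target = (target + 1) % capacity;
                    continue;
                }
                // Old used target: swap, then relocate the evicted key next.
                std::swap(to_move.key, t.key);
                t.state = State::Rehashed;
                target = home(to_move.key);
            }
        }
        // Second pass: turn the temporary marker back into the normal state.
        for (auto& b : buckets)
            if (b.state == State::Rehashed)
                b.state = State::Used;
    }
};
```

After filling such a table, removing some keys and calling `rehash_in_place()`, no bucket should remain in the `Deleted` or `Rehashed` state, and every surviving key should still be reachable through `find()`. The loop terminates because bucket `i` itself is never in the `Rehashed` state while it is being processed, so linear probing eventually reaches either a usable bucket or `i`.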
This happens for each and every bucket, though some buckets are "coincidentally" moved before their point of iteration is reached. Either way, this guarantees full in-place movement (even without stack storage) and therefore space complexity of O(1). Time complexity is amortized O(2n), assuming a good hashing function.

This leads to a performance improvement of ~30% on the benchmark introduced with the last commit.

Co-authored-by: Hendiadyoin1 <leon.a@serenityos.org>
2024-07-22 10:36:24 +00:00 · 2022-03-07 23:56:54 +01:00 · 2022-03-07 23:56:54 +01:00 · 49d29c8298
parent e73e579446
commit 49d29c8298
1 changed file with 137 additions and 15 deletions
--- a/AK/HashTable.h
+++ b/AK/HashTable.h
@@ -403,7 +403,7 @@ public:
        --m_size;
        ++m_deleted_count;

-        shrink_if_needed();
+        rehash_in_place_if_needed();
    }

    template<typename TUnaryPredicate>
@@ -421,7 +421,7 @@ public:
            m_deleted_count += removed_count;
            m_size -= removed_count;
        }
-        shrink_if_needed();
+        rehash_in_place_if_needed();
        return removed_count;
    }

@@ -454,6 +454,11 @@ private:

    ErrorOr<void> try_rehash(size_t new_capacity)
    {
+        if (new_capacity == m_capacity && new_capacity >= 4) {
+            rehash_in_place();
+            return {};
+        }
+
        new_capacity = max(new_capacity, static_cast<size_t>(4));
        new_capacity = kmalloc_good_size(new_capacity * sizeof(BucketType)) / sizeof(BucketType);

@@ -491,6 +496,136 @@ private:
        MUST(try_rehash(new_capacity));
    }

+    void rehash_in_place()
+    {
+        // FIXME: This implementation takes two loops over the entire bucket array, but avoids re-allocation.
+        //        Please benchmark your new implementation before you replace this.
+        //        The reason is that because of collisions, we use the special "rehashed" bucket state to mark already-rehashed used buckets.
+        //        Because we of course want to write into old used buckets, but already rehashed data shall not be touched.
+
+        // FIXME: Find a way to reduce the cognitive complexity of this function.
+
+        for (size_t i = 0; i < m_capacity; ++i) {
+            auto& bucket = m_buckets[i];
+
+            // FIXME: Bail out when we have handled every filled bucket.
+
+            if (bucket.state == BucketState::Rehashed || bucket.state == BucketState::End || bucket.state == BucketState::Free)
+                continue;
+            if (bucket.state == BucketState::Deleted) {
+                bucket.state = BucketState::Free;
+                continue;
+            }
+
+            auto const new_hash = TraitsForT::hash(*bucket.slot());
+            if (new_hash % m_capacity == i) {
+                bucket.state = BucketState::Rehashed;
+                continue;
+            }
+
+            auto target_hash = new_hash;
+            auto const to_move_hash = i;
+            BucketType* target_bucket = &m_buckets[target_hash % m_capacity];
+            BucketType* bucket_to_move = &m_buckets[i];
+
+            // Try to move the bucket to move into its correct spot.
+            // During the procedure, we might re-hash or actually change the bucket to move.
+            while (!(bucket_to_move->state == BucketState::Free || bucket_to_move->state == BucketState::Deleted)) {
+
+                // If we're targeting ourselves, there's nothing to do.
+                if (to_move_hash == target_hash % m_capacity) {
+                    bucket_to_move->state = BucketState::Rehashed;
+                    break;
+                }
+
+                if (target_bucket->state == BucketState::Free || target_bucket->state == BucketState::Deleted) {
+                    // We can just overwrite the target bucket and bail out.
+                    new (target_bucket->slot()) T(move(*bucket_to_move->slot()));
+                    target_bucket->state = BucketState::Rehashed;
+                    bucket_to_move->state = BucketState::Free;
+
+                    if constexpr (IsOrdered) {
+                        swap(bucket_to_move->previous, target_bucket->previous);
+                        swap(bucket_to_move->next, target_bucket->next);
+
+                        if (target_bucket->previous)
+                            target_bucket->previous->next = target_bucket;
+                        else
+                            m_collection_data.head = target_bucket;
+                        if (target_bucket->next)
+                            target_bucket->next->previous = target_bucket;
+                        else
+                            m_collection_data.tail = target_bucket;
+                    }
+                } else if (target_bucket->state == BucketState::Rehashed) {
+                    // If the target bucket is already re-hashed, we do normal probing.
+                    target_hash = double_hash(target_hash);
+                    target_bucket = &m_buckets[target_hash % m_capacity];
+                } else {
+                    VERIFY(target_bucket->state != BucketState::End);
+                    // The target bucket is a used bucket that hasn't been re-hashed.
+                    // Swap the data into the target; now the target's data resides in the bucket to move again.
+                    // (That's of course what we want, how neat!)
+                    swap(*bucket_to_move->slot(), *target_bucket->slot());
+                    bucket_to_move->state = target_bucket->state;
+                    target_bucket->state = BucketState::Rehashed;
+
+                    if constexpr (IsOrdered) {
+                        // Update state for the target bucket, we'll do the bucket to move later.
+                        swap(bucket_to_move->previous, target_bucket->previous);
+                        swap(bucket_to_move->next, target_bucket->next);
+
+                        if (target_bucket->previous)
+                            target_bucket->previous->next = target_bucket;
+                        else
+                            m_collection_data.head = target_bucket;
+                        if (target_bucket->next)
+                            target_bucket->next->previous = target_bucket;
+                        else
+                            m_collection_data.tail = target_bucket;
+                    }
+
+                    target_hash = TraitsForT::hash(*bucket_to_move->slot());
+                    target_bucket = &m_buckets[target_hash % m_capacity];
+
+                    // The data is already in the correct location: Adjust the pointers
+                    if (target_hash % m_capacity == to_move_hash) {
+                        bucket_to_move->state = BucketState::Rehashed;
+                        if constexpr (IsOrdered) {
+                            // Update state for the bucket to move as it's not actually moved anymore.
+                            if (bucket_to_move->previous)
+                                bucket_to_move->previous->next = bucket_to_move;
+                            else
+                                m_collection_data.head = bucket_to_move;
+                            if (bucket_to_move->next)
+                                bucket_to_move->next->previous = bucket_to_move;
+                            else
+                                m_collection_data.tail = bucket_to_move;
+                        }
+                        break;
+                    }
+                }
+            }
+            // After this, the bucket_to_move either contains data that rehashes to itself, or it contains nothing as we were able to move the last thing.
+            if (bucket_to_move->state == BucketState::Deleted)
+                bucket_to_move->state = BucketState::Free;
+        }
+
+        for (size_t i = 0; i < m_capacity; ++i) {
+            if (m_buckets[i].state == BucketState::Rehashed)
+                m_buckets[i].state = BucketState::Used;
+        }
+
+        m_deleted_count = 0;
+    }
+
+    void rehash_in_place_if_needed()
+    {
+        // This signals a "thrashed" hash table with many deleted slots.
+        if (m_deleted_count >= m_size && should_grow())
+            rehash_in_place();
+    }
+
    template<typename TUnaryPredicate>
    [[nodiscard]] BucketType* lookup_with_hash(unsigned hash, TUnaryPredicate predicate) const
    {
@@ -544,19 +679,6 @@ private:
    [[nodiscard]] size_t used_bucket_count() const { return m_size + m_deleted_count; }
    [[nodiscard]] bool should_grow() const { return ((used_bucket_count() + 1) * 100) >= (m_capacity * load_factor_in_percent); }

-    void shrink_if_needed()
-    {
-        // Shrink if less than 20% of buckets are used, but never going below 16.
-        // These limits are totally arbitrary and can probably be improved.
-        bool should_shrink = m_size * 5 < m_capacity && m_capacity > 16;
-        if (!should_shrink)
-            return;
-
-        // NOTE: We ignore memory allocation failure here, since we can continue
-        //       just fine with an oversized table.
-        (void)try_rehash(m_size * 2);
-    }
-
    void delete_bucket(auto& bucket)
    {
        bucket.slot()->~T();