linux/fs/btrfs
Filipe Manana 639bd575b7 btrfs: fix race leading to unnecessary transaction commit when logging inode
When logging an inode we may often have to fallback to a full transaction
commit, either because a new block group was allocated, there is some case
we can not deal with without a transaction commit or some error like an
ENOMEM happened. However after we fallback to a transaction commit, we
have a time window where we can make the next attempt to log any inode
commit the next transaction unnecessarily, adding additional overhead and
increasing latency.

A sequence of steps that leads to this issue is the following:

1) The current open transaction has a generation of 1000;

2) A new block group is allocated, and as a consequence we must make sure
   any attempts to commit a log fallback to a transaction commit, so
   btrfs_set_log_full_commit() is called from btrfs_make_block_group().
   This sets fs_info->last_trans_log_full_commit to 1000;

3) Task A is holding a handle on transaction 1000 and tries to log inode X.
   Once it gets to start_log_trans(), it calls btrfs_need_log_full_commit()
   which returns true, since fs_info->last_trans_log_full_commit has a
   value of 1000. So we end up returning EAGAIN and propagating it up to
   btrfs_sync_file(), where we commit transaction 1000;

4) The transaction commit task (task A) sets the transaction state to
   unblocked (TRANS_STATE_UNBLOCKED);

5) Some other task, task B, starts a new transaction with a generation of
   1001;

6) Some stuff is done with transaction 1001, some btree blocks COWed, etc;

7) Transaction 1000 has not fully committed yet, we are still writing all
   the extent buffers it created;

8) Some new task, task C, starts an fsync of inode Y, gets a handle for
   transaction 1001, and it gets to btrfs_log_inode_parent() which does
   the following check:

     if (fs_info->last_trans_log_full_commit > last_committed) {
         ret = 1;
         goto end_no_trans;
     }

   At that point last_trans_log_full_commit has a value of 1000 and
   last_committed (value of fs_info->last_trans_committed) has a value of
   999, since transaction 1000 has not yet committed - it is either still
   writing out dirty extent buffers, its super blocks or unpinning
   extents.

   As a consequence we return 1, which gets propagated up to
   btrfs_sync_file(), which will then call btrfs_commit_transaction()
   for transaction 1001.

   As a consequence we have an unnecessary second transaction commit, we
   previously committed transaction 1000 and now commit transaction 1001
   as well, resulting in more overhead and increased latency.

So fix this double transaction commit issue simply by removing that check,
because all we need to do is wait for the previous transaction to finish
its commit, which we already do later when starting the log transaction at
start_log_trans(), because there we acquire the tree_log_mutex lock, which
is held by a transaction commit and only released after the transaction
commits its super blocks.

Another issue that check has is that it reads last_trans_log_full_commit
without using READ_ONCE(), which is incorrect since that member of
struct btrfs_fs_info is always updated with WRITE_ONCE() through the
helper btrfs_set_log_full_commit().

This double transaction commit issue can actually be triggered quite often
in long runs of dbench, since besides the creation of new block groups
that force inode logging to fallback to a transaction commit, there are
cases where dbench asks to fsync a directory which had files in it that
were previously renamed or subdirectories that were removed, resulting in
the inode logging to fallback to a full transaction commit.

This patch belongs to a patch set that is comprised of the following
patches:

  btrfs: fix race causing unnecessary inode logging during link and rename
  btrfs: fix race that results in logging old extents during a fast fsync
  btrfs: fix race that causes unnecessary logging of ancestor inodes
  btrfs: fix race that makes inode logging fallback to transaction commit
  btrfs: fix race leading to unnecessary transaction commit when logging inode
  btrfs: do not block inode logging for so long during transaction commit

Performance results are mentioned in the change log of the last patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:07 +01:00
..
tests btrfs: remove recalc_thresholds from free space ops 2020-12-09 19:16:06 +01:00
acl.c
async-thread.c
async-thread.h
backref.c btrfs: pass root owner to read_tree_block 2020-12-08 15:54:07 +01:00
backref.h
block-group.c btrfs: implement log-structured superblock for ZONED mode 2020-12-09 19:16:04 +01:00
block-group.h btrfs: load free space cache asynchronously 2020-12-08 15:54:03 +01:00
block-rsv.c btrfs: introduce mount option rescue=ignorebadroots 2020-12-08 15:53:41 +01:00
block-rsv.h
btrfs_inode.h btrfs: skip unnecessary searches for xattrs when logging an inode 2020-12-08 15:54:12 +01:00
check-integrity.c btrfs: drop casts of bio bi_sector 2020-12-09 19:16:05 +01:00
check-integrity.h
compression.c btrfs: drop casts of bio bi_sector 2020-12-09 19:16:05 +01:00
compression.h btrfs: compression: move declarations to header 2020-10-07 12:06:55 +02:00
ctree.c btrfs: simplify return values in setup_nodes_for_search 2020-12-08 15:54:13 +01:00
ctree.h btrfs: remove inode number cache feature 2020-12-09 19:16:05 +01:00
delalloc-space.c btrfs: add btrfs_reserve_data_bytes and use it 2020-10-07 12:06:52 +02:00
delalloc-space.h
delayed-inode.c btrfs: make btrfs_delayed_update_inode take btrfs_inode 2020-12-08 15:54:10 +01:00
delayed-inode.h btrfs: make btrfs_delayed_update_inode take btrfs_inode 2020-12-08 15:54:10 +01:00
delayed-ref.c
delayed-ref.h
dev-replace.c btrfs: check and enable ZONED mode 2020-12-09 19:16:03 +01:00
dev-replace.h
dir-item.c btrfs: locking: rip out path->leave_spinning 2020-12-08 15:54:02 +01:00
discard.c btrfs: don't miss async discards after scheduled work override 2020-12-08 15:54:05 +01:00
discard.h btrfs: cleanup btrfs_discard_update_discardable usage 2020-12-08 15:54:02 +01:00
disk-io.c btrfs: remove inode number cache feature 2020-12-09 19:16:05 +01:00
disk-io.h btrfs: move btrfs_find_highest_objectid/btrfs_find_free_objectid to disk-io.c 2020-12-09 19:16:05 +01:00
export.c btrfs: locking: rip out path->leave_spinning 2020-12-08 15:54:02 +01:00
export.h
extent-io-tree.h btrfs: use fixed width int type for extent_state::state 2020-12-08 15:54:13 +01:00
extent-tree.c btrfs: set the lockdep class for extent buffers on creation 2020-12-08 15:54:07 +01:00
extent_io.c btrfs: drop casts of bio bi_sector 2020-12-09 19:16:05 +01:00
extent_io.h btrfs: use fixed width int type for extent_state::state 2020-12-08 15:54:13 +01:00
extent_map.c
extent_map.h
file-item.c btrfs: drop casts of bio bi_sector 2020-12-09 19:16:05 +01:00
file.c btrfs: disable fallocate in ZONED mode 2020-12-09 19:16:04 +01:00
free-space-cache.c btrfs: remove recalc_thresholds from free space ops 2020-12-09 19:16:06 +01:00
free-space-cache.h btrfs: remove recalc_thresholds from free space ops 2020-12-09 19:16:06 +01:00
free-space-tree.c btrfs: locking: rip out path->leave_spinning 2020-12-08 15:54:02 +01:00
free-space-tree.h
inode-item.c btrfs: locking: rip out path->leave_spinning 2020-12-08 15:54:02 +01:00
inode.c btrfs: remove inode number cache feature 2020-12-09 19:16:05 +01:00
ioctl.c btrfs: remove inode number cache feature 2020-12-09 19:16:05 +01:00
Kconfig btrfs: switch to iomap for direct IO 2020-10-07 12:06:57 +02:00
locking.c btrfs: remove the recurse parameter from __btrfs_tree_read_lock 2020-12-08 15:54:09 +01:00
locking.h btrfs: remove the recurse parameter from __btrfs_tree_read_lock 2020-12-08 15:54:09 +01:00
lzo.c
Makefile btrfs: remove inode number cache feature 2020-12-09 19:16:05 +01:00
misc.h
ordered-data.c btrfs: switch cached fs_info::csum_size from u16 to u32 2020-12-08 15:53:59 +01:00
ordered-data.h btrfs: remove unnecessary local variables for checksum size 2020-12-08 15:54:00 +01:00
orphan.c
print-tree.c btrfs: pass root owner to read_tree_block 2020-12-08 15:54:07 +01:00
print-tree.h btrfs: pretty print leaked root name 2020-10-07 12:12:20 +02:00
props.c
props.h
qgroup.c btrfs: pass root owner to read_tree_block 2020-12-08 15:54:07 +01:00
qgroup.h
raid56.c btrfs: drop casts of bio bi_sector 2020-12-09 19:16:05 +01:00
raid56.h
rcu-string.h
reada.c btrfs: pass the owner_root and level to alloc_extent_buffer 2020-12-08 15:54:07 +01:00
ref-verify.c btrfs: use btrfs_read_node_slot in walk_down_tree 2020-12-08 15:54:06 +01:00
ref-verify.h
reflink.c btrfs: make btrfs_cont_expand take btrfs_inode 2020-12-08 15:54:12 +01:00
reflink.h
relocation.c btrfs: remove inode number cache feature 2020-12-09 19:16:05 +01:00
root-tree.c btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations 2020-10-07 12:12:13 +02:00
scrub.c btrfs: implement log-structured superblock for ZONED mode 2020-12-09 19:16:04 +01:00
send.c btrfs: send: use helpers to access root_item::ctransid 2020-12-08 15:53:51 +01:00
send.h btrfs: send: avoid copying file data 2020-10-07 12:13:17 +02:00
space-info.c btrfs: kill the RCU protection for fs_info->space_info 2020-10-07 12:13:19 +02:00
space-info.h btrfs: add btrfs_reserve_data_bytes and use it 2020-10-07 12:06:52 +02:00
struct-funcs.c btrfs: use unaligned helpers for stack and header set/get helpers 2020-10-07 12:13:23 +02:00
super.c btrfs: remove inode number cache feature 2020-12-09 19:16:05 +01:00
sysfs.c btrfs: introduce ZONED feature flag 2020-12-08 15:54:16 +01:00
sysfs.h btrfs: split and refactor btrfs_sysfs_remove_devices_dir 2020-10-07 12:12:21 +02:00
transaction.c btrfs: remove inode number cache feature 2020-12-09 19:16:05 +01:00
transaction.h btrfs: return bool from btrfs_should_end_transaction 2020-12-08 15:54:16 +01:00
tree-checker.c btrfs: tree-checker: annotate all error branches as unlikely 2020-12-08 15:54:15 +01:00
tree-checker.h
tree-defrag.c btrfs: locking: remove all the blocking helpers 2020-12-08 15:54:01 +01:00
tree-log.c btrfs: fix race leading to unnecessary transaction commit when logging inode 2020-12-09 19:16:07 +01:00
tree-log.h btrfs: make fast fsyncs wait only for writeback 2020-10-07 12:06:56 +02:00
ulist.c
ulist.h
uuid-tree.c btrfs: remove unnecessary casts in printk 2020-12-08 15:53:52 +01:00
volumes.c btrfs: drop casts of bio bi_sector 2020-12-09 19:16:05 +01:00
volumes.h btrfs: get zone information of zoned block devices 2020-12-09 19:15:57 +01:00
xattr.c btrfs: skip unnecessary searches for xattrs when logging an inode 2020-12-08 15:54:12 +01:00
xattr.h
zlib.c
zoned.c btrfs: implement log-structured superblock for ZONED mode 2020-12-09 19:16:04 +01:00
zoned.h btrfs: implement log-structured superblock for ZONED mode 2020-12-09 19:16:04 +01:00
zstd.c