linux/fs
Filipe Manana 3a8b36f378 Btrfs: fix data loss in the fast fsync path
When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.

Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:

1. fs_info->last_trans_committed == N - 1 and current transaction is
   transaction N (fs_info->generation == N);

2. do a buffered write;

3. fsync our inode, this clears our inode's full sync flag, starts
   an ordered extent and waits for it to complete - when it completes
   at btrfs_finish_ordered_io(), the inode's last_trans is set to the
   value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
   btrfs_set_inode_last_trans);

4. transaction N is committed, so fs_info->last_trans_committed is now
   set to the value N and fs_info->generation remains with the value N;

5. do another buffered write, when this happens btrfs_file_write_iter
   sets our inode's last_trans to the value N + 1 (that is
   fs_info->generation + 1 == N + 1);

6. transaction N + 1 is started and fs_info->generation now has the
   value N + 1;

7. transaction N + 1 is committed, so fs_info->last_trans_committed
   is set to the value N + 1;

8. fsync our inode - because it doesn't have the full sync flag set,
   we only start the ordered extent and don't wait for it to complete
   (that happens only in a later phase), therefore the inode's
   last_trans field still has the value N + 1 set previously by
   btrfs_file_write_iter(), and so we have:

       inode->last_trans <= fs_info->last_trans_committed
           (N + 1)              (N + 1)

   Which made us not log the last buffered write and exit the fsync
   handler immediately, returning success (0) to user space and resulting
   in data loss after a crash.
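
The counter values in the steps above can be traced with a small toy model.
This is only a sketch: plain shell integers stand in for the kernel fields,
N = 10 is an arbitrary choice, and old_fsync_skips is a hypothetical helper
that models just the comparison shown in step 8, not the whole fsync handler.

```shell
#!/bin/sh
# Toy model of the scenario; nothing here touches a real filesystem.
# All variable and function names are illustrative assumptions.

N=10                               # arbitrary transaction number
generation=$N                      # models fs_info->generation
last_trans_committed=$((N - 1))    # models fs_info->last_trans_committed
inode_last_trans=0                 # models the inode's last_trans

# Models only the comparison from step 8.
# Prints 1 if the fsync would return early without logging, else 0.
old_fsync_skips() {
    if [ "$inode_last_trans" -le "$last_trans_committed" ]; then
        echo 1
    else
        echo 0
    fi
}

# Steps 2-3: buffered write, fsync, ordered extent completes.
inode_last_trans=$generation              # N

# Step 4: transaction N is committed.
last_trans_committed=$generation          # N

# Step 5: second buffered write; the write path sets generation + 1.
inode_last_trans=$((generation + 1))      # N + 1

# Step 6: transaction N + 1 is started.
generation=$((generation + 1))            # N + 1

# Step 7: transaction N + 1 is committed.
last_trans_committed=$generation          # N + 1

# Step 8: fsync without the full sync flag; the ordered extent has not
# completed yet, so inode_last_trans is still the value from step 5.
skipped=$(old_fsync_skips)
echo "last_trans=$inode_last_trans committed=$last_trans_committed skipped=$skipped"
```

With N = 10 this prints "last_trans=11 committed=11 skipped=1", which is the
early return described in step 8.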

This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodically commits the
current transaction (about every 30 seconds by default).
The body of the test is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file 'foo', the one we check for data loss.
  # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
  # bit from its flags (btrfs inode specific flags).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                  -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io

  # Now create one other file and 2 directories. We will move this second file
  # from one directory to the other later because it forces btrfs to commit its
  # currently open transaction if we fsync the old parent directory. This is
  # necessary to trigger the data loss bug that affected btrfs.
  mkdir $SCRATCH_MNT/testdir_1
  touch $SCRATCH_MNT/testdir_1/bar
  mkdir $SCRATCH_MNT/testdir_2

  # Make sure everything is durably persisted.
  sync

  # Write another 8Kb of data to our file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io

  # Move our 'bar' file into a new directory.
  mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar

  # Fsync our first directory. Because it had a file moved into some other
  # directory, this made btrfs commit the currently open transaction. This is
  # a condition necessary to trigger the data loss bug.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1

  # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
  # data we wrote previously to be persisted and available if a crash happens.
  # This did not happen with btrfs, because of the transaction commit that
  # happened when we fsynced the parent directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now check that all the data we wrote before is available.
  echo "File content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0040000

Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000

So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is <= fs_info->last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.
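
As a rough illustration of the changed skip decision, here is a sketch based
only on the description above, not on the kernel source. The helper names
old_skip and new_skip and their argument order are hypothetical, old_skip
models only the last_trans comparison, and the "not in the log" input for the
failing case is an assumption.

```shell
#!/bin/sh
# Arguments for both helpers (all illustrative):
#   $1 in_log (0/1)   $2 full_sync (0/1)
#   $3 inode last_trans   $4 fs_info last_trans_committed
# Each helper prints 1 if the fsync would be skipped, else 0.

# Comparison without the full sync condition.
old_skip() {
    if [ "$3" -le "$4" ]; then echo 1; else echo 0; fi
}

# Skip if the inode is already in the log, or if this is a full sync
# and last_trans <= last_trans_committed.
new_skip() {
    if [ "$1" -eq 1 ] || { [ "$2" -eq 1 ] && [ "$3" -le "$4" ]; }; then
        echo 1
    else
        echo 0
    fi
}

# Fast fsync (full_sync = 0), inode assumed not in the log, both
# counters equal: the old comparison skips, the new check does not.
echo "old fast fsync: $(old_skip 0 0 11 11)"
echo "new fast fsync: $(new_skip 0 0 11 11)"

# Full sync with nothing newer than the last commit can still be skipped.
echo "new full sync:  $(new_skip 0 1 11 11)"

# An inode already in the log is skipped regardless of the counters.
echo "new in log:     $(new_skip 1 0 12 11)"
```

In this model the last_trans comparison is consulted only when full_sync is
1, so a stale last_trans value cannot cause a fast fsync to return early.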

Also because btrfs_file_write_iter no longer sets inode->last_trans to
fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
bail out if last_trans is 0, otherwise something as simple as the following
example wouldn't log the second write on the last fsync:

  1. write to file

  2. fsync file

  3. fsync file
       |--> btrfs_inode_in_log() returns true and it set last_trans to 0

  4. write to file
       |--> btrfs_file_write_iter() no longer sets last_trans, so it
            remained with a value of 0
  5. fsync
       |--> inode->last_trans == 0, so it bails out without logging the
            second write

A test case for xfstests will be sent soon.

CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05 17:28:32 -08:00
..
9p assorted conversions to %p[dD] 2014-11-19 13:01:20 -05:00
adfs
affs fs/affs/file.c: remove obsolete pagesize check 2014-12-13 12:42:52 -08:00
afs Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2014-12-11 14:27:06 -08:00
autofs4 assorted conversions to %p[dD] 2014-11-19 13:01:20 -05:00
befs befs: remove dead code 2014-12-13 12:42:51 -08:00
bfs
btrfs Btrfs: fix data loss in the fast fsync path 2015-03-05 17:28:32 -08:00
cachefiles assorted conversions to %p[dD] 2014-11-19 13:01:20 -05:00
ceph ceph: use %zu for len in ceph_fill_inline_data() 2015-01-08 20:36:56 +03:00
cifs cifs: make new inode cache when file type is different 2014-12-22 14:16:21 -06:00
coda coda_venus_readdir(): use file_inode() 2014-12-11 16:28:12 -05:00
configfs assorted conversions to %p[dD] 2014-11-19 13:01:20 -05:00
cramfs
debugfs Driver core patches for 3.19-rc1 2014-12-14 16:10:09 -08:00
devpts
dlm Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-10 16:10:49 -08:00
ecryptfs Fixes for filename decryption and encrypted view plus a cleanup 2014-12-19 18:15:12 -08:00
efivarfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-10 16:10:49 -08:00
efs
exofs Boaz Harrosh - Fix broken email address 2014-10-19 20:22:32 +03:00
exportfs move d_rcu from overlapping d_child to overlapping d_alias 2014-11-03 15:20:29 -05:00
ext2 ext2: Convert to private i_dquot field 2014-11-10 10:06:10 +01:00
ext3 ext3: Convert to private i_dquot field 2014-11-10 10:06:10 +01:00
ext4 Revert a potential seek_data/hole regression which shows up when using 2015-01-06 14:05:40 -08:00
f2fs f2fs: avoid to ra unneeded blocks in recover flow 2014-12-08 14:19:09 -08:00
fat fat: fix data past EOF resulting from fsx testsuite 2014-12-13 12:42:51 -08:00
freevxfs
fscache fs/fscache/object-list.c: use __seq_open_private() 2014-10-13 17:52:21 +01:00
fuse fuse: add memory barrier to INIT 2015-01-06 10:45:35 +01:00
gfs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-10 16:10:49 -08:00
hfs fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp 2014-12-10 17:41:16 -08:00
hfsplus hfsplus: fix longname handling 2014-12-18 19:08:10 -08:00
hostfs
hpfs
hppfs vfs: make first argument of dir_context.actor typed 2014-10-31 17:48:54 -04:00
hugetlbfs mm: convert i_mmap_mutex to rwsem 2014-12-13 12:42:45 -08:00
isofs isofs: Fix unchecked printing of ER records 2014-12-19 11:29:24 +01:00
jbd jbd: Deletion of an unnecessary check before the function call "iput" 2014-11-18 10:15:29 +01:00
jbd2 Lots of bugs fixes, including Zheng and Jan's extent status shrinker 2014-12-12 09:28:03 -08:00
jffs2 jffs2: Drop bogus if in comment 2014-11-28 18:23:44 -08:00
jfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-10 16:10:49 -08:00
kernfs kernfs: Fix kernfs_name_compare 2015-01-09 15:51:08 -08:00
lockd LOCKD: Fix a race when initialising nlmsvc_timeout 2015-01-05 19:40:53 -08:00
logfs
minix
ncpfs Merge branch 'akpm' (patchbomb from Andrew) 2014-12-10 18:34:42 -08:00
nfs NFSv4: Remove incorrect check in can_open_delegated() 2015-01-05 19:40:54 -08:00
nfs_common
nfsd nfsd: fix fi_delegees leak when fi_had_conflict returns true 2015-01-07 13:38:21 -05:00
nilfs2 nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races 2014-12-10 17:41:16 -08:00
nls
notify sched, fanotify: Deal with nested sleeps 2015-01-09 11:18:12 +01:00
ntfs assorted conversions to %p[dD] 2014-11-19 13:01:20 -05:00
ocfs2 ocfs2: fix the wrong directory passed to ocfs2_lookup_ino_from_name() when link file 2015-01-08 15:10:51 -08:00
omfs FS/OMFS: block number sanity check during fill_super operation 2014-10-14 02:18:22 +02:00
openpromfs
overlayfs Merge branch 'iov_iter' into for-next 2014-12-08 20:39:29 -05:00
proc Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-12-19 13:26:08 -08:00
pstore Driver core patches for 3.19-rc1 2014-12-14 16:10:09 -08:00
qnx4
qnx6
quota vfs: Remove i_dquot field from inode 2014-11-10 10:06:18 +01:00
ramfs
reiserfs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2014-12-16 15:46:01 -08:00
romfs
squashfs Squashfs: Add LZ4 compression configuration option 2014-11-27 18:48:44 +00:00
sysfs sysfs/kernfs: make read requests on pre-alloc files use the buffer. 2014-11-07 10:54:38 -08:00
sysv
ubifs UBIFS: fix a couple bugs in UBIFS xattr length calculation 2014-11-07 12:32:22 +02:00
udf udf: Reduce repeated dereferences 2014-12-21 22:42:37 +01:00
ufs fs/ufs/balloc.c: remove unused variable 2014-10-14 02:18:20 +02:00
xfs xfs: update for 3.19-rc1 2014-12-12 09:48:17 -08:00
aio.c aio: Skip timer for io_getevents if timeout=0 2014-12-13 17:50:20 -05:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c assorted conversions to %p[dD] 2014-11-19 13:01:20 -05:00
binfmt_elf.c Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus 2014-12-11 17:56:37 -08:00
binfmt_elf_fdpic.c handle suicide on late failure exits in execve() in search_binary_handler() 2014-10-09 02:39:00 -04:00
binfmt_em86.c syscalls: implement execveat() system call 2014-12-13 12:42:51 -08:00
binfmt_flat.c
binfmt_misc.c unfuck binfmt_misc.c (broken by commit e6084d4) 2014-12-17 08:27:14 -05:00
binfmt_script.c syscalls: implement execveat() system call 2014-12-13 12:42:51 -08:00
binfmt_som.c
block_dev.c fs: add freeze_super/thaw_super fs hooks 2014-11-17 10:35:17 +00:00
buffer.c fs: clarify rate limit suppressed buffer I/O errors 2014-10-21 13:55:11 -06:00
char_dev.c fs/char_dev.c: remove pointless assignment from __register_chrdev_region() 2014-12-10 17:41:04 -08:00
compat.c vfs: make first argument of dir_context.actor typed 2014-10-31 17:48:54 -04:00
compat_binfmt_elf.c
compat_ioctl.c
coredump.c coredump: add %i/%I in core_pattern to report the tid of the crashed thread 2014-10-14 02:18:21 +02:00
dcache.c Merge branch 'iov_iter' into for-next 2014-12-08 20:39:29 -05:00
dcookies.c
direct-io.c fuse: honour max_read and max_write in direct_io mode 2014-09-26 21:16:51 -04:00
drop_caches.c mm: vmscan: invoke slab shrinkers from shrink_zone() 2014-12-13 12:42:48 -08:00
eventfd.c fs: Convert show_fdinfo functions to void 2014-11-05 14:13:23 -05:00
eventpoll.c fs: Convert show_fdinfo functions to void 2014-11-05 14:13:23 -05:00
exec.c syscalls: implement execveat() system call 2014-12-13 12:42:51 -08:00
fcntl.c vfs: renumber FMODE_NONOTIFY and add to uniqueness check 2015-01-08 15:10:52 -08:00
fhandle.c
file.c fs/file.c: replace get_unused_fd() with get_unused_fd_flags(0) 2014-12-10 17:41:10 -08:00
file_table.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-10-13 11:28:42 +02:00
filesystems.c
fs-writeback.c writeback: fix a subtle race condition in I_DIRTY clearing 2014-11-04 10:42:23 -07:00
fs_pin.c
fs_struct.c
inode.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-16 15:53:03 -08:00
internal.h take the targets of /proc/*/ns/* symlinks to separate fs 2014-12-10 21:30:20 -05:00
ioctl.c Merge branch 'for-3.19' of git://linux-nfs.org/~bfields/linux 2014-12-16 15:25:31 -08:00
Kconfig overlay filesystem 2014-10-24 00:14:38 +02:00
Kconfig.binfmt binfmt_elf: allow arch code to examine PT_LOPROC ... PT_HIPROC headers 2014-11-24 07:45:02 +01:00
libfs.c move d_rcu from overlapping d_child to overlapping d_alias 2014-11-03 15:20:29 -05:00
locks.c locks: fix NULL-deref in generic_delete_lease 2015-01-13 07:00:55 -05:00
Makefile Merge branch 'nsfs' into for-next 2014-12-10 21:31:59 -05:00
mbcache.c
mount.h common object embedded into various struct ....ns 2014-12-04 14:31:00 -05:00
mpage.c vfs: guard end of device for mpage interface 2014-10-09 22:25:53 -04:00
namei.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-16 15:53:03 -08:00
namespace.c mnt: Fix a memory stomp in umount 2014-12-18 11:22:02 -08:00
no-block.c
nsfs.c take the targets of /proc/*/ns/* symlinks to separate fs 2014-12-10 21:30:20 -05:00
open.c Merge branch 'for-3.19' of git://linux-nfs.org/~bfields/linux 2014-12-16 15:25:31 -08:00
pipe.c
pnode.c mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers. 2014-12-02 10:46:50 -06:00
pnode.h
posix_acl.c
proc_namespace.c vfs: make mounts and mountstats honor root dir like mountinfo does 2014-12-17 08:27:15 -05:00
read_write.c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2014-12-14 20:36:37 -08:00
readdir.c vfs: make first argument of dir_context.actor typed 2014-10-31 17:48:54 -04:00
select.c
seq_file.c fs, seq_file: fallback to vmalloc instead of oom kill processes 2014-12-13 12:42:49 -08:00
signalfd.c fs: Convert show_fdinfo functions to void 2014-11-05 14:13:23 -05:00
splice.c vfs: export do_splice_direct() to modules 2014-10-24 00:14:35 +02:00
stack.c
stat.c
statfs.c
super.c vfs: Remove i_dquot field from inode 2014-11-10 10:06:18 +01:00
sync.c kill f_dentry uses 2014-11-19 13:01:25 -05:00
timerfd.c fs: Convert show_fdinfo functions to void 2014-11-05 14:13:23 -05:00
utimes.c
xattr.c new helper: audit_file() 2014-11-19 13:01:26 -05:00