linux/fs
Filipe Manana 678886bdc6 Btrfs: fix fs corruption on transaction abort if device supports discard
When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl call
can see those ranges in a block group's free space cache and perform a discard
on the ranges, so the same issue can happen without "-o discard" as well.

This causes corruption, affecting one or multiple btree nodes (in the worst
case leaving the fs unmountable) because some of those ranges (the ones in
the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
referred by the last committed super block - breaking the rule that anything
that was committed by a transaction is untouched until the next transaction
commits successfully.

I ran into this while running in a loop (for several hours) the fstest that
I recently submitted:

  [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim

The corruption always happened when a transaction aborted and then fsck complained
like this:

   _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
   *** fsck.btrfs output ***
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   read block failed check_tree_block
   Couldn't open file system

In this case 94945280 corresponded to the root of a tree.
Using frace what I observed was the following sequence of steps happened:

   1) transaction N started, fs_info->pinned_extents pointed to
      fs_info->freed_extents[0];

   2) node/eb 94945280 is created;

   3) eb is persisted to disk;

   4) transaction N commit starts, fs_info->pinned_extents now points to
      fs_info->freed_extents[1], and transaction N completes;

   5) transaction N + 1 starts;

   6) eb is COWed, and btrfs_free_tree_block() called for this eb;

   7) eb range (94945280 to 94945280 + 16Kb) is added to
      fs_info->pinned_extents (fs_info->freed_extents[1]);

   8) Something goes wrong in transaction N + 1, like hitting ENOSPC
      for example, and the transaction is aborted, turning the fs into
      readonly mode. The stack trace I got for example:

      [112065.253935]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
      [112065.254271]  [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
      [112065.254567]  [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
      [112065.261674]  [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
      [112065.261922]  [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
      [112065.262211]  [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
      [112065.262545]  [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
      [112065.262771]  [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
      [112065.263105]  [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
      (...)
      [112065.264493] ---[ end trace dd7903a975a31a08 ]---
      [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
      [112065.264997] BTRFS info (device sdc): forced readonly

   9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
      fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
      turn calls btrfs_destroy_pinned_extent();

   10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
       marked as dirty in fs_info->freed_extents[], and for each one
       it calls discard, if the fs was mounted with "-o discard", and
       adds the range to the free space cache of the respective block
       group;

   11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
       sees the free space entries and performs a discard;

   12) After an umount and mount (or fsck), our eb's location on disk was full
       of zeroes, and it should have been untouched, because it was marked as
       dirty in the fs_info->pinned_extents tree, and therefore used by the
       trees that the last committed superblock points to.

Fix this by not performing a discard and not adding the ranges to the free space
caches - it's useless from this point since the fs is now in readonly mode and
we won't write free space caches to disk anymore (otherwise we would leak space)
nor any new superblock. By not adding the ranges to the free space caches, it
prevents other code paths from allocating that space and write to it as well,
therefore being safer and simpler.

This isn't a new problem, as it's been present since 2011 (git commit
acce952b02).

Cc: stable@vger.kernel.org  # any kernel released after 2011-01-06
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-10 12:22:31 -08:00
..
9p 9p: switch to %p[dD] 2014-10-09 02:39:04 -04:00
adfs adfs: add __printf verification, fix format/argument mismatches 2014-08-08 15:57:24 -07:00
affs fs/affs: remove redundant sys_tz declarations 2014-10-14 02:18:22 +02:00
afs Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-10-13 16:23:15 +02:00
autofs4 autofs4: d_manage() should return -EISDIR when appropriate in rcu-walk mode. 2014-10-14 02:18:16 +02:00
befs fs/befs/btree.c: remove typedef befs_btree_node 2014-10-14 02:18:20 +02:00
bfs fs/bfs: use bfs prefix for dump_imap 2014-08-08 15:57:24 -07:00
btrfs Btrfs: fix fs corruption on transaction abort if device supports discard 2014-12-10 12:22:31 -08:00
cachefiles FS-Cache fixes 2014-10-14 08:40:15 +02:00
ceph ceph: fix flush tid comparision 2014-11-13 22:19:05 +03:00
cifs [CIFS] Remove obsolete comment 2014-10-17 17:17:12 -05:00
coda fs/coda: use linux/uaccess.h 2014-08-08 15:57:20 -07:00
configfs fs/configfs: use pr_fmt 2014-06-04 16:53:53 -07:00
cramfs fs/cramfs/inode.c: use linux/uaccess.h 2014-08-08 15:57:25 -07:00
debugfs fs: debugfs: remove trailing whitespace 2014-07-09 16:58:21 -07:00
devpts fs/devpts/inode.c: convert printk to pr_foo() 2014-06-06 16:08:14 -07:00
dlm dlm: fix missing endian conversion of rcom_status flags 2014-10-14 15:11:48 -05:00
ecryptfs fs: limit filesystem stacking depth 2014-10-24 00:14:39 +02:00
efivarfs fs/efivarfs/super.c: use static const for dentry_operations 2014-06-04 16:54:14 -07:00
efs fs/efs/namei.c: return is not a function 2014-08-08 15:57:18 -07:00
exofs Boaz Harrosh - Fix broken email address 2014-10-19 20:22:32 +03:00
exportfs fs/exportfs/expfs.c: kernel-doc warning fixes 2014-06-04 16:54:14 -07:00
ext2 percpu_counter: add @gfp to percpu_counter_init() 2014-09-08 09:51:29 +09:00
ext3 ext3: Don't check quota format when there are no quota files 2014-10-22 09:02:48 +02:00
ext4 ext4: make ext4_ext_convert_to_initialized() return proper number of blocks 2014-10-30 10:53:17 -04:00
f2fs f2fs: support volatile operations for transient data 2014-10-07 11:54:41 -07:00
fat fat: remove redundant sys_tz declaration 2014-10-14 02:18:20 +02:00
freevxfs
fscache fs/fscache/object-list.c: use __seq_open_private() 2014-10-13 17:52:21 +01:00
fuse vfs: Make d_invalidate return void 2014-10-09 02:38:57 -04:00
gfs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-10-13 11:28:42 +02:00
hfs fs/hfs/hfs_fs.h: remove redundant sys_tz declaration 2014-10-14 02:18:20 +02:00
hfsplus Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
hostfs hostfs: support rename flags 2014-08-07 14:40:09 -04:00
hpfs fs/hpfs/dnode.c: fix suspect code indent 2014-08-08 15:57:22 -07:00
hppfs
hugetlbfs fs/hugetlbfs/inode.c: remove null test before kfree 2014-06-04 16:54:11 -07:00
isofs isofs: avoid unused function warning 2014-11-19 13:09:37 -05:00
jbd fs, jbd: use a more generic hash function 2014-10-22 10:02:04 +02:00
jbd2 jbd2: use a better hash function for the revoke table 2014-10-30 10:53:17 -04:00
jffs2 [jffs2] kill wbuf_queued/wbuf_dwork_lock 2014-10-09 02:39:01 -04:00
jfs Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-10-13 16:23:15 +02:00
kernfs vfs: Remove unnecessary calls of check_submounts_and_drop 2014-10-09 02:38:56 -04:00
lockd File locking related changes for v3.18 (pile #1) 2014-10-11 13:21:34 -04:00
logfs fs/logfs/readwrite.c: kernel-doc warning fixes 2014-08-06 18:01:12 -07:00
minix minix zmap block counts calculation fix 2014-08-08 15:57:20 -07:00
ncpfs fs/ncpfs/dir.c: remove redundant sys_tz declaration 2014-10-14 02:18:16 +02:00
nfs NFS: Don't try to reclaim delegation open state if recovery failed 2014-11-12 17:19:04 -05:00
nfs_common lockd: move lockd's grace period handling into its own module 2014-09-17 16:33:11 -04:00
nfsd nfsd4: fix crash on unknown operation number 2014-10-23 13:39:51 -04:00
nilfs2 nilfs2: improve the performance of fdatasync() 2014-10-14 02:18:20 +02:00
nls
notify fanotify: fix notification of groups with inode & mount marks 2014-11-13 16:17:06 -08:00
ntfs NTFS: Bump version to 2.1.31. 2014-10-16 12:53:35 +01:00
ocfs2 fix breakage in o2net_send_tcp_msg() 2014-11-05 15:21:18 -05:00
omfs FS/OMFS: block number sanity check during fill_super operation 2014-10-14 02:18:22 +02:00
openpromfs
overlayfs ovl: ovl_dir_fsync() cleanup 2014-11-20 16:40:02 +01:00
proc mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared 2014-10-14 02:18:28 +02:00
pstore pstore: Fix duplicate {console,ftrace}-efi entries 2014-10-15 13:51:33 -07:00
qnx4
qnx6 fs/qnx6: update debugging to current functions 2014-08-08 15:57:26 -07:00
quota quota: Properly return errors from dquot_writeback_dquots() 2014-10-22 09:08:03 +02:00
ramfs fs/ramfs/file-nommu.c: replace count*size kzalloc by kcalloc 2014-08-08 15:57:18 -07:00
reiserfs fs/reiserfs/journal.c: fix sparse context imbalance warning 2014-10-14 02:18:20 +02:00
romfs fs/romfs/super.c: add blank line after declarations 2014-08-08 15:57:25 -07:00
squashfs fs/squashfs/super.c: logging cleanup 2014-08-06 18:01:13 -07:00
sysfs kernfs: move the last knowledge of sysfs out from kernfs 2014-06-03 08:11:18 -07:00
sysv
ubifs UBIFS: Fix trivial typo in power_cut_emulated() 2014-09-30 09:29:44 +03:00
udf udf: Fix loading of special inodes 2014-10-09 13:06:14 +02:00
ufs fs/ufs/balloc.c: remove unused variable 2014-10-14 02:18:20 +02:00
xfs xfs: track bulkstat progress by agino 2014-11-07 08:33:52 +11:00
aio.c percpu_ref: add PERCPU_REF_INIT_* flags 2014-09-24 13:31:50 -04:00
anon_inodes.c
attr.c fs,userns: Change inode_capable to capable_wrt_inode_uidgid 2014-06-10 13:57:22 -07:00
bad_inode.c bad_inode: add ->rename2() 2014-08-07 14:40:09 -04:00
binfmt_aout.c handle suicide on late failure exits in execve() in search_binary_handler() 2014-10-09 02:39:00 -04:00
binfmt_elf.c handle suicide on late failure exits in execve() in search_binary_handler() 2014-10-09 02:39:00 -04:00
binfmt_elf_fdpic.c handle suicide on late failure exits in execve() in search_binary_handler() 2014-10-09 02:39:00 -04:00
binfmt_em86.c
binfmt_flat.c fs/binfmt_flat.c: make old_reloc() static 2014-06-04 16:54:21 -07:00
binfmt_misc.c binfmt_misc: work around gcc-4.9 warning 2014-10-14 02:18:16 +02:00
binfmt_script.c
binfmt_som.c
block_dev.c Return short read or 0 at end of a raw device, not EIO 2014-10-31 06:33:26 -04:00
buffer.c fs: clarify rate limit suppressed buffer I/O errors 2014-10-21 13:55:11 -06:00
char_dev.c
compat.c vfs: move getname() from callers to do_mount() 2014-10-09 02:39:16 -04:00
compat_binfmt_elf.c
compat_ioctl.c Bluetooth: Move HCI socket definitions into its own header file 2014-07-11 13:53:04 +03:00
coredump.c coredump: add %i/%I in core_pattern to report the tid of the crashed thread 2014-10-14 02:18:21 +02:00
dcache.c vfs: fix reference leak in d_prune_aliases() 2014-11-19 13:07:20 -05:00
dcookies.c
direct-io.c fuse: honour max_read and max_write in direct_io mode 2014-09-26 21:16:51 -04:00
drop_caches.c fs: convert use of typedef ctl_table to struct ctl_table 2014-06-06 16:08:16 -07:00
eventfd.c
eventpoll.c eventpoll: fix uninitialized variable in epoll_ctl 2014-09-10 15:42:12 -07:00
exec.c handle suicide on late failure exits in execve() in search_binary_handler() 2014-10-09 02:39:00 -04:00
fcntl.c security: make security_file_set_fowner, f_setown and __f_setown void return 2014-09-09 16:01:36 -04:00
fhandle.c
file.c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-10-13 15:44:12 +02:00
file_table.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-10-13 11:28:42 +02:00
filesystems.c
fs-writeback.c sched: Remove proliferation of wait_on_bit() action functions 2014-07-16 15:10:39 +02:00
fs_pin.c make fs/{namespace,super}.c forget about acct.h 2014-08-07 14:40:09 -04:00
fs_struct.c
inode.c mm: allow drivers to prevent new writable mappings 2014-08-08 15:57:31 -07:00
internal.h vfs: export __inode_permission() to modules 2014-10-24 00:14:35 +02:00
ioctl.c
Kconfig overlay filesystem 2014-10-24 00:14:38 +02:00
Kconfig.binfmt
libfs.c locks: plumb a "priv" pointer into the setlease routines 2014-10-07 14:06:12 -04:00
locks.c locks: flock_make_lock should return a struct file_lock (or PTR_ERR) 2014-10-07 14:06:13 -04:00
Makefile ovl: rename filesystem type to "overlay" 2014-11-20 16:39:59 +01:00
mbcache.c fs/mbcache: replace __builtin_log2() with ilog2() 2014-06-25 22:08:29 -04:00
mount.h vfs: Add a function to lazily unmount all mounts from any dentry. 2014-10-09 02:38:55 -04:00
mpage.c vfs: guard end of device for mpage interface 2014-10-09 22:25:53 -04:00
namei.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-11-02 10:28:43 -08:00
namespace.c vfs: introduce clone_private_mount() 2014-10-24 00:14:36 +02:00
no-block.c
open.c vfs: add i_op->dentry_open() 2014-10-24 00:14:35 +02:00
pipe.c
pnode.c get rid of propagate_umount() mistakenly treating slaves as busy. 2014-08-30 18:31:41 -04:00
pnode.h
posix_acl.c
proc_namespace.c namespaces: Use task_lock and not rcu to protect nsproxy 2014-07-29 18:08:50 -07:00
read_write.c cachefiles_write_page(): switch to __kernel_write() 2014-10-09 02:39:05 -04:00
readdir.c fanotify: create FAN_ACCESS event for readdir 2014-06-04 16:53:52 -07:00
select.c
seq_file.c fs/seq_file: fallback to vmalloc allocation 2014-07-03 09:21:54 -07:00
signalfd.c
splice.c vfs: export do_splice_direct() to modules 2014-10-24 00:14:35 +02:00
stack.c fs: fix comment for 'CONFIG_LBADF' 2014-08-26 09:35:56 +02:00
stat.c
statfs.c
super.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-10-13 11:28:42 +02:00
sync.c Export sync_filesystem() for modular ->remount_fs() use 2014-09-05 08:16:21 -07:00
timerfd.c timerfd: Remove an always true check 2014-08-27 11:17:48 +02:00
utimes.c
xattr.c vfs: Deduplicate code shared by xattr system calls operating on paths 2014-10-12 17:09:10 -04:00