1140 commits

Author SHA1 Message Date
Sergey Senozhatsky 65922cb5ce ext4: unused variables cleanup in fs/ext4/extents.c
ext4 extents cleanup:

  . remove unused `*ex' from check_eofblocks_fl
  . remove unused `*eh' from ext4_ext_map_blocks


Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-23 14:08:27 -04:00
Feng Tang 6de9843dab ext4: remove redundant set_buffer_mapped() in ext4_da_get_block_prep()
The map_bh() call will have already set the buffer_head to mapped.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-23 14:05:03 -04:00
Jiaying Zhang 0562e0bad4 ext4: add more tracepoints and use dev_t in the trace buffer
- Add more ext4 tracepoints.
- Change ext4 tracepoints to use dev_t field with MAJOR/MINOR macros
so that we can save 4 bytes in the ring buffer on some platforms.
- Add sync_mode to ext4_da_writepages, ext4_da_write_pages, and
ext4_da_writepages_result tracepoints. Also remove for_reclaim
field from ext4_da_writepages since it is usually not very useful.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-21 21:38:05 -04:00
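The space saving described in 0562e0bad4 can be sketched in userspace. This is only an illustration: it assumes a device number packed into 32 bits with a 20-bit minor field, and the DEMO_ macros and both structs are hypothetical stand-ins, not the actual tracepoint definitions.

```c
#include <stdint.h>

/* Hypothetical stand-ins for MAJOR/MINOR/MKDEV, assuming a 20-bit
 * minor number packed into a 32-bit device number. */
#define DEMO_MINORBITS 20
#define DEMO_MINORMASK ((1U << DEMO_MINORBITS) - 1)
#define DEMO_MAJOR(dev) ((unsigned int)((dev) >> DEMO_MINORBITS))
#define DEMO_MINOR(dev) ((unsigned int)((dev) & DEMO_MINORMASK))
#define DEMO_MKDEV(ma, mi) ((((uint32_t)(ma)) << DEMO_MINORBITS) | (mi))

/* Trace record storing major and minor separately: two 32-bit fields. */
struct demo_entry_split {
	int32_t dev_major;
	int32_t dev_minor;
};

/* Trace record storing one packed device number: one 32-bit field. */
struct demo_entry_packed {
	uint32_t dev;
};
```

The packed record is decoded with DEMO_MAJOR()/DEMO_MINOR() only when the event is printed, so the per-event cost in the buffer is the single 32-bit field.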
Eric Sandeen 4596fe0767 ext4: don't kfree uninitialized s_group_info members
We can call kfree on uninitialized members of the s_group_info array
on the error path.  We can avoid this by kzalloc'ing the array.

This doesn't entirely solve the oops on mount if we fail down this
path; failed_mount4: frees the sbi, for one, which gets referenced
later in the failed mount paths - I haven't worked that out yet.

https://bugzilla.kernel.org/show_bug.cgi?id=30872

Reported-by: Eugene A. Shatokhin <dame_eugene@mail.ru>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-21 21:25:13 -04:00
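The idea behind 4596fe0767 can be sketched with a userspace analogue, using calloc() in place of kzalloc() and free() in place of kfree(). The struct, function name, and fail_at parameter are hypothetical and exist only to exercise the error path; the sketch also assumes, as on mainstream platforms, that all-zero bytes represent a NULL pointer.

```c
#include <stdlib.h>

/* Hypothetical userspace analogue of a pointer array such as
 * s_group_info. */
struct demo_group_table {
	void **members;
	size_t count;
};

/*
 * Fill the table. Because calloc() zero-fills the array, every slot
 * that was never assigned is NULL, and the error path can free all
 * slots unconditionally: free(NULL) is a no-op.
 * fail_at simulates an allocation failure at that index.
 */
static int demo_group_table_init(struct demo_group_table *t,
				 size_t count, size_t fail_at)
{
	size_t i;

	t->members = calloc(count, sizeof(*t->members));
	if (!t->members)
		return -1;
	t->count = count;

	for (i = 0; i < count; i++) {
		if (i == fail_at)
			goto err;	/* simulated failure */
		t->members[i] = malloc(16);
		if (!t->members[i])
			goto err;
	}
	return 0;

err:
	for (i = 0; i < count; i++)
		free(t->members[i]);	/* unfilled slots are NULL */
	free(t->members);
	t->members = NULL;
	t->count = 0;
	return -1;
}
```

With malloc() instead of calloc() for the array, the cleanup loop would pass uninitialized pointers to free(), which is the class of bug the commit describes.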
Robin Dong 21149d611e ext4: add missing space in printk's in __ext4_grp_locked_error()
While performance-testing an ext4 filesystem, we observed a
warning like this:

EXT4-fs error (device sda7): ext4_mb_generate_buddy:718: group 259825901 blocks in bitmap, 26057 in gd

instead, it should be

"group 2598, 25901 blocks in bitmap, 26057 in gd"

Reviewed-by: Coly Li <bosong.ly@taobao.com>
Cc: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-21 20:39:22 -04:00
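The effect of the missing separator in 21149d611e can be reproduced with a small sketch. The helper below is hypothetical: it formats the group prefix and the message body into one buffer, which approximates how two consecutive printk() fragments end up adjacent on one log line.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: build "group N" followed by the message body.
 * with_separator selects whether ", " is emitted between the two
 * parts, i.e. the fixed versus the broken output.
 */
static int demo_format_grp_error(char *buf, size_t len, unsigned int group,
				 int with_separator, unsigned int in_bitmap,
				 unsigned int in_gd)
{
	return snprintf(buf, len, "group %u%s%u blocks in bitmap, %u in gd",
			group, with_separator ? ", " : "", in_bitmap, in_gd);
}
```

Called with group 2598 and 25901 blocks, the version without the separator yields the fused number 259825901 seen in the report.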
Tao Ma a56e69c28a ext4: add FITRIM to compat_ioctl.
FITRIM isn't handled in compat_ioctl, so a 32-bit program can't issue it
on a 64-bit platform.  Add it to compat_ioctl.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-20 23:16:58 -04:00
Amir Goldstein d67d121834 ext4: handle errors in ext4_clear_blocks()
Checking return code from ext4_journal_get_write_access() is important
with snapshots, because this function invokes COW, so may return new
errors, such as ENOSPC.

ext4_clear_blocks() now returns < 0 for fatal errors, in which case,
ext4_free_data() is aborted.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-20 22:59:02 -04:00
Amir Goldstein 537a03103c ext4: unify the ext4_handle_release_buffer() api
There are two wrapper functions which do exactly the same thing:
ext4_journal_release_buffer(), and ext4_handle_release_buffer().  In
addition, ext4_xattr_block_set() calls jbd2_journal_release_buffer()
directly.

Unify all of the code to use ext4_handle_release_buffer(), and get rid
of ext4_journal_release_buffer().

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-20 22:57:02 -04:00
Amir Goldstein ef60789302 ext4: handle errors in ext4_rename
Checking return code from ext4_journal_get_write_access() is important
with snapshots, because this function invokes COW, so may return new
errors, such as ENOSPC.

We move the call to ext4_journal_get_write_access earlier in the
function, to simplify error handling in the case that this function
returns an error.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-20 21:18:44 -04:00
Theodore Ts'o 688f869ce3 ext4: Initialize fsync transaction ids in ext4_new_inode()
When allocating a new inode, we need to make sure i_sync_tid and
i_datasync_tid are initialized.  Otherwise, one or both of these two
values could be left initialized to zero, which could potentially
result in BUG_ON in jbd2_journal_commit_transaction.

(This could happen by having journal->commit_request getting set to
zero, which could wake up the kjournald process even though there is
no running transaction, which then causes a BUG_ON via the
J_ASSERT(j_running_transaction != NULL) statement.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-16 17:16:31 -04:00
Mingming Cao 198868f35d ext4: Use single thread to perform DIO unwritten conversion
While testing ext4 on a multi-core machine, we found per-cpu
ext4-dio-unwritten threads converting unwritten extents to written
for IOs completed from the async direct IO patch.  One thread per
filesystem is enough; we don't need per-cpu threads to do the
conversion.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2011-03-05 11:52:45 -05:00
Theodore Ts'o b616844310 ext4: optimize ext4_bio_write_page() when no extent conversion is needed
If no extent conversion is required, wake up any processes waiting for
the page's writeback to be complete and free the ext4_io_end structure
directly in ext4_end_bio() instead of dropping it on the linked list
(which requires taking a spinlock to queue and dequeue the io_end
structure), and waiting for the workqueue to do this work.

This removes an extra scheduling delay before a process waiting for an
fsync() to complete gets woken up, and it also reduces the CPU
overhead for a random write workload.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-28 13:12:38 -05:00
Amir Goldstein d39195c33b ext4: skip orphan cleanup if fs has unknown ROCOMPAT features
Orphan cleanup is currently executed even if the file system has some
number of unknown ROCOMPAT features, which deletes inodes and frees
blocks, which could be very bad for some RO_COMPAT features,
especially the SNAPSHOT feature.

This patch skips the orphan cleanup if it contains readonly compatible
features not known by this ext4 implementation, which would prevent
the fs from being mounted (or remounted) readwrite.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-28 00:53:45 -05:00
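The check described in d39195c33b amounts to testing for feature bits outside the set the implementation knows. A minimal sketch, with a hypothetical function name and plain integer masks standing in for the superblock fields:

```c
#include <stdint.h>

/*
 * Return nonzero when the filesystem advertises any read-only
 * compatible feature bit that this implementation does not know.
 * In that case orphan cleanup, which modifies the filesystem,
 * should be skipped.
 */
static int demo_skip_orphan_cleanup(uint32_t ro_compat,
				    uint32_t known_ro_compat)
{
	return (ro_compat & ~known_ro_compat) != 0;
}
```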
Amir Goldstein 8e8eaabefe ext4: use the nblocks arg to ext4_truncate_restart_trans()
nblocks is passed into ext4_truncate_restart_trans() from
ext4_ext_truncate_extend_restart() with a value different from the default
blocks_for_truncate(), but is being ignored.

The two other calls to ext4_truncate_restart_trans() already pass the
default value, which is then being recalculated inside the function.

Fix the problem by using the passed argument.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
2011-02-27 23:32:12 -05:00
Manish Katiyar 32a9bb57d7 ext4: fix missing iput of root inode for some mount error paths
This assures that the root inode is not leaked, and that sb->s_root is
NULL, which will prevent generic_shutdown_super() from doing extra
work, including calling sync_filesystem(), which ultimately results in
ext4_sync_fs() getting called with an uninitialized struct super,
which is the cause of the crash noted in Kernel Bugzilla #26752.

https://bugzilla.kernel.org/show_bug.cgi?id=26752

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-27 20:42:06 -05:00
Yongqiang Yang 6d9c85eb70 ext4: make FIEMAP and delayed allocation play well together
Fix the FIEMAP ioctl so that it returns all of the page ranges which
are still subject to delayed allocation.  We were missing some cases
if the file was sparse.

Reported by Chris Mason <chris.mason@oracle.com>:
>We've had reports on btrfs that cp is giving us files full of zeros
>instead of actually copying them.  It was tracked down to a bug with
>the btrfs fiemap implementation where it was returning holes for
>delalloc ranges.
>
>Newer versions of cp are trusting fiemap to tell it where the holes
>are, which does seem like a pretty neat trick.
>
>I decided to give xfs and ext4 a shot with a few tests cases too, xfs
>passed with all the ones btrfs was getting wrong, and ext4 got the basic
>delalloc case right.
>$ mkfs.ext4 /dev/xxx
>$ mount /dev/xxx /mnt
>$ dd if=/dev/zero of=/mnt/foo bs=1M count=1
>$ fiemap-test foo
>ext:   0 logical: [       0..     255] phys:        0..     255
>flags: 0x007 tot: 256
>
>Horray!  But once we throw a hole in, things go bad:
>$ mkfs.ext4 /dev/xxx
>$ mount /dev/xxx /mnt
>$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1
>$ fiemap-test foo
>< no output >
>
>We've got a delalloc extent after the hole and ext4 fiemap didn't find
>it.  If I run sync to kick the delalloc out:
>$sync
>$ fiemap-test foo
>ext:   0 logical: [     256..     511] phys:    34048..   34303
>flags: 0x001 tot: 256
>
>fiemap-test is sitting in my /usr/local/bin, and I have no idea how it
>got there.  It's full of pretty comments so I know it isn't mine, but
>you can grab it here:
>
>http://oss.oracle.com/~mason/fiemap-test.c
>
>xfsqa has a fiemap program too.

After the fix, the test results are as follows:
ext:   0 logical: [     256..     511] phys:        0..     255
flags: 0x007 tot: 256
ext:   0 logical: [     256..     511] phys:    33280..   33535
flags: 0x001 tot: 256

$ mkfs.ext4 /dev/xxx
$ mount /dev/xxx /mnt
$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1
$ sync
$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=3
$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=5
$ fiemap-test foo
ext:   0 logical: [     256..     511] phys:    33280..   33535
flags: 0x000 tot: 256
ext:   1 logical: [     768..    1023] phys:        0..     255
flags: 0x006 tot: 256
ext:   2 logical: [    1280..    1535] phys:        0..     255
flags: 0x007 tot: 256

Tested-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-27 17:25:47 -05:00
Theodore Ts'o 4dd89fc625 ext4: suppress verbose debugging information if malloc-debug is off
If CONFIG_EXT4_DEBUG is enabled, then if a block allocation fails due
to disk being full, a verbose debugging message is printed, even if
the malloc-debug switch has not been enabled.  Suppress the debugging
message so that nothing is printed unless malloc-debug has been turned
on.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-27 17:23:47 -05:00
Theodore Ts'o a54aa76108 ext4: don't leave PageWriteback set after memory failure
In ext4_bio_write_page(), if the memory allocation for the struct
ext4_io_page fails, it returns with the page's PageWriteback flag set.
This will end up causing the page not to skip writeback in
WB_SYNC_NONE mode, and in WB_SYNC_ALL mode (i.e., on a sync, fsync, or
umount) the writeback daemon will get stuck forever on the
wait_on_page_writeback() function in write_cache_pages_da().

Or, if journalling is enabled and the file gets deleted, the
journal thread can get stuck in the journal_finish_inode_data_buffers()
call to filemap_fdatawait().

Another place where things can get hung up is in
truncate_inode_pages(), called out of ext4_evict_inode().

Fix this by not setting PageWriteback until after we have successfully
allocated the struct ext4_io_page.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-27 16:43:24 -05:00
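The ordering fix in a54aa76108 follows a general pattern: perform the step that can fail before setting state that others will wait on. A simplified userspace sketch, in which both structs, the function name, and the simulate_failure parameter are hypothetical:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified page: only the writeback flag is modelled. */
struct demo_page {
	int writeback;
};

/* Hypothetical per-page I/O bookkeeping structure. */
struct demo_io_page {
	struct demo_page *page;
};

/*
 * Start writeback on a page. The bookkeeping structure is allocated
 * first; the writeback flag is set only once allocation has succeeded,
 * so a failure leaves the page state untouched.
 * simulate_failure forces the allocation-failure path.
 */
static int demo_start_writeback(struct demo_page *page,
				struct demo_io_page **out,
				int simulate_failure)
{
	struct demo_io_page *io_page;

	io_page = simulate_failure ? NULL : malloc(sizeof(*io_page));
	if (!io_page)
		return -ENOMEM;	/* writeback flag was never set */

	io_page->page = page;
	page->writeback = 1;	/* nothing can fail after this point */
	*out = io_page;
	return 0;
}
```

Had the flag been set before the allocation, the -ENOMEM return would leave a page that looks like it is under writeback with nothing ever scheduled to clear the flag.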
Theodore Ts'o 168fc0223c ext4: move setup of the mpd structure to write_cache_pages_da()
Move the initialization of all of the fields of the mpd structure to
write_cache_pages_da().  This simplifies the code considerably.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 14:09:20 -05:00
Theodore Ts'o 78aaced340 ext4: don't lock the next page in write_cache_pages if not needed
If we have accumulated a contiguous region of memory to be written
out, and the next page can be added to this region, don't bother locking
(and then unlocking) the page before writing out the memory.  In the
unlikely event that the next page was being written back by some other
CPU, we can also skip waiting for that page to finish writeback.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 14:09:14 -05:00
Theodore Ts'o ee6ecbcc5d ext4: remove page_skipped hackery in ext4_da_writepages()
Because the ext4 page writeback codepath had been prematurely calling
clear_page_dirty_for_io(), if it turned out that a particular page
couldn't be written out during a particular pass of
write_cache_pages_da(), the page would have to get redirtied by
calling redirty_page_for_writepage().  Not only was this wasted work,
but redirty_page_for_writepage() would increment wbc->pages_skipped to
signal to writeback_sb_inodes() that buffers were locked, and that it
should skip this inode until later.

Since this signal was incorrect in ext4's case --- which was caused by
ext4's historically incorrect use of write_cache_pages() ---
ext4_da_writepages() saved and restored wbc->pages_skipped to avoid
confusing writeback_sb_inodes().

Now that we've fixed ext4 to call clear_page_dirty_for_io() right
before initiating the page I/O, we can nuke the page_skipped
save/restore hackery, and breathe a sigh of relief.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 14:08:11 -05:00
Theodore Ts'o 9749895644 ext4: clear the dirty bit for a page in writeback at the last minute
Move when we call clear_page_dirty_for_io() to just before we actually
write the page.  This simplifies the code somewhat, and avoids marking
pages as clean and then needing to remark them as dirty later.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 14:08:01 -05:00
Theodore Ts'o 4f01b02c8c ext4: simple cleanups to write_cache_pages_da()
Eliminate duplicate code, unneeded variables, etc., to make it easier
to understand the code.  No behavioral changes were made in this patch.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 14:07:37 -05:00
Theodore Ts'o 8eb9e5ce21 ext4: fold __mpage_da_writepage() into write_cache_pages_da()
Fold the __mpage_da_writepage() function into write_cache_pages_da().
This will give us opportunities to clean up and simplify the resulting
code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 14:07:31 -05:00
Theodore Ts'o 6fd7a46781 ext4: enable mblk_io_submit by default
Now that we've fixed the file corruption bug in commit d50bdd5aa5,
it's time to enable mblk_io_submit by default.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 13:53:09 -05:00
Curt Wohlgemuth c7f5938adc ext4: fix ext4_da_block_invalidatepages() to handle page range properly
If ext4_da_block_invalidatepages() is called because of a
failure from ext4_map_blocks() in mpage_da_map_and_submit(),
it's supposed to clean up -- including unlock -- all the
pages in the mpd structure.  But these values may not match
up, even on a system in which block size == page size:

   mpd->b_blocknr != mpd->first_page
   mpd->b_size != (mpd->next_page - mpd->first_page)

ext4_da_block_invalidatepages() has been using b_blocknr and
b_size; this patch changes it to use first_page and
next_page.

Tested:  I injected a small number (5%) of failures in
ext4_map_blocks() in the case that the flags contain
EXT4_GET_BLOCKS_DELALLOC_RESERVE, and ran fsstress on this
kernel.  Without this patch, I got hung tasks every time.
With this patch, I see no hangs in many runs of fsstress.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 12:27:52 -05:00
Curt Wohlgemuth e0fd9b9076 ext4: mark multi-page IO complete on mapping failure
In mpage_da_map_and_submit(), if we have a delayed block
allocation failure from ext4_map_blocks(), we need to mark
the IO as complete, by setting

      mpd->io_done = 1;

Otherwise, we could end up submitting the pages in an outer
loop; since they are unlocked on mapping failure in
ext4_da_block_invalidatepages(), this will cause a bug check
in mpage_da_submit_io().

I tested this by injecting failures into ext4_map_blocks().
Without this patch, a simple fsstress run will bug check;
with the patch, it works fine.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 12:25:52 -05:00
Coly Li 5a54b2f199 ext4: mballoc: don't replace the current preallocation group unnecessarily
In ext4_mb_check_group_pa(), the current preallocation space is
replaced with a new preallocation space when the two have the same
distance from the goal block.

This doesn't actually gain us anything, so change things so that the
function only switches to the new preallocation group if its distance
from the goal block is strictly smaller than the current preallocation
group's distance from the goal block.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-24 14:10:05 -05:00
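The selection rule from 5a54b2f199 can be sketched as follows. The struct and function names are hypothetical simplifications; only the strict comparison reflects what the commit message describes.

```c
#include <stdlib.h>

/* Hypothetical, simplified preallocation: only its start block. */
struct demo_pa {
	long start;
};

static long demo_distance(long goal, const struct demo_pa *pa)
{
	return labs(goal - pa->start);
}

/*
 * Choose between the current preallocation and a candidate.
 * The candidate wins only if it is strictly closer to the goal;
 * on a tie the current choice is kept.
 */
static const struct demo_pa *demo_pick_pa(long goal,
					  const struct demo_pa *cur,
					  const struct demo_pa *candidate)
{
	if (cur == NULL)
		return candidate;
	if (demo_distance(goal, candidate) < demo_distance(goal, cur))
		return candidate;
	return cur;
}
```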
Coly Li 58696f3ab2 ext4: clarify description of ac_g_ex in struct ext4_allocation_context
Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
2011-02-24 14:10:00 -05:00
Coly Li 7c78605929 mballoc: add comments to ext4_mb_mark_free_simple()
This patch adds comments to ext4_mb_mark_free_simple to make it more
understandable.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
2011-02-24 13:24:25 -05:00
Coly Li 235772da3e ext4: remove unnecessary call to mb_find_buddy() in debugging code
In __mb_check_buddy(), look at the code below:
  591         fstart = -1;
  592         buddy = mb_find_buddy(e4b, 0, &max);
  593         for (i = 0; i < max; i++) {
  594                 if (!mb_test_bit(i, buddy)) {
  595                         MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free);
  596                         if (fstart == -1) {
  597                                 fragments++;
  598                                 fstart = i;
  599                         }
  600                         continue;
  601                 }
  602                 fstart = -1;
  603                 /* check used bits only */
  604                 for (j = 0; j < e4b->bd_blkbits + 1; j++) {
  605                         buddy2 = mb_find_buddy(e4b, j, &max2);
  606                         k = i >> j;
  607                         MB_CHECK_ASSERT(k < max2);
  608                         MB_CHECK_ASSERT(mb_test_bit(k, buddy2));
  609                 }
  610         }
  611         MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info));
  612         MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments);
  613
  614         grp = ext4_get_group_info(sb, e4b->bd_group);
  615         buddy = mb_find_buddy(e4b, 0, &max);

On line 592, buddy is fetched by mb_find_buddy() with order 0.  Between
line 593 and line 615, buddy is not changed, so there is
no need to fetch it again from mb_find_buddy() with order 0.

We can safely remove the second mb_find_buddy() on line 615.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
2011-02-24 13:24:18 -05:00
Coly Li 84b775a354 ext4: code cleanup in mb_find_buddy()
The current code calculates max regardless of whether order is zero,
which is unnecessary.  This cleanup patch sets max to "1 << (e4b->bd_blkbits
+ 3)" only when order == 0.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
2011-02-24 12:51:59 -05:00
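The expression "1 << (e4b->bd_blkbits + 3)" in 84b775a354 is the number of bits in one block: the block size in bytes is 1 << blkbits, and each byte holds 8 = 1 << 3 bits. A small sketch of the arithmetic, with a hypothetical function name:

```c
/*
 * Number of bits in one block of (1 << blkbits) bytes.
 * Shifting by an extra 3 multiplies by 8 bits per byte.
 */
static unsigned int demo_order0_max(unsigned int blkbits)
{
	return 1U << (blkbits + 3);
}
```

For a 4096-byte block (blkbits == 12) this gives 32768.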
Eric Sandeen ea66333694 ext4: enable acls and user_xattr by default
There's no good reason to require the extra step of providing
a mount option for acl or user_xattr once the feature is configured
on; no other filesystem that I know of requires this.

Userspace patches have set these options in default mount options,
and this patch makes them default in the kernel.  At some point
we can start to deprecate the options, perhaps.

For now I've removed default mount option checks in show_options()
to be explicit about what's set, since it's changing the default,
but I'm open to alternatives if desired.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-23 17:51:51 -05:00
Lukas Czerner 5c2ed62fd4 ext4: Adjust minlen with discard_granularity in the FITRIM ioctl
Discard granularity tells us the minimum size of extent that can be
discarded by the device.  If the user supplies a minimum extent that
should be discarded (range.minlen) which is smaller than the discard
granularity, increase minlen to the discard granularity, since there's
no point submitting trim requests that the device will reject anyway.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-23 17:49:51 -05:00
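The adjustment in 5c2ed62fd4 is a lower clamp. A minimal sketch, assuming both values are expressed in the same unit; the function name is hypothetical and unit conversion is left out:

```c
#include <stdint.h>

/*
 * Raise a user-supplied minimum extent length to the device's
 * discard granularity when it is smaller; otherwise leave it alone.
 */
static uint64_t demo_adjust_minlen(uint64_t minlen,
				   uint64_t discard_granularity)
{
	return minlen < discard_granularity ? discard_granularity : minlen;
}
```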
Lukas Czerner 4143179218 ext4: check if device supports discard in FITRIM ioctl
For a device that does not support discard, the FITRIM ioctl returns
-EOPNOTSUPP when blkdev_issue_discard() returns this error code, which
is how the user is informed that the device does not support discard.

If there are no suitable free extents to be trimmed, then FITRIM will
return success even though the device does not support discard, which
could confuse the user.  So check explicitly if the device supports
discard and return an error code at the beginning of the FITRIM ioctl
processing.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-23 12:42:32 -05:00
Lukas Czerner 0b75a84012 ext4: mark file-local functions and variables as static
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-23 12:22:49 -05:00
Alexander V. Lukyanov 5dbd571d87 ext4: allow inode_readahead_blks=0 (linux-2.6.37)
I cannot disable inode-read-ahead feature of ext4 (on 2.6.37):

# echo 0 > /sys/fs/ext4/sda2/inode_readahead_blks 
bash: echo: write error: Invalid argument

On a server with lots of small files and random access this read-ahead makes
performance worse, and I'd like to disable it. I work around this problem
by using value of 1, but it still reads an extra block.

This patch fixes the problem by checking for zero explicitly.

Signed-off-by: Alexander V. Lukyanov <lav@netis.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-21 21:33:21 -05:00
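The fix in 5dbd571d87 is described only as "checking for zero explicitly". One way such a check could look is sketched below; it assumes the original validation rejected anything that is not a power of two, which is an assumption not stated in the commit message, and both function names are hypothetical.

```c
#include <errno.h>

/* True for 1, 2, 4, 8, ...; false for 0 and everything else. */
static int demo_is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Validate a readahead value: zero (disable readahead) is accepted
 * explicitly, otherwise the value must be a power of two.
 */
static int demo_validate_readahead_blks(unsigned long t)
{
	if (t != 0 && !demo_is_power_of_2(t))
		return -EINVAL;
	return 0;
}
```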
Peter Huewe 7dc576158d ext4: Fix sparse warning: Using plain integer as NULL pointer
This patch fixes the warning "Using plain integer as NULL pointer",
generated by sparse, by replacing the offending 0s with NULL.

Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-21 21:01:42 -05:00
Theodore Ts'o da488945f4 ext4: fix compile warnings with EXT4FS_DEBUG enabled
Compiling 2.6.38-rc1 with EXT4FS_DEBUG turned on produces the
following compile warnings.  This patch fixes them.

  CC      fs/ext4/hash.o
  CC      fs/ext4/resize.o
fs/ext4/resize.c: In function 'setup_new_group_blocks':
fs/ext4/resize.c:233:2: warning: format '%#04llx' expects type 'long long
unsigned int', but argument 3 has type 'long unsigned int'
fs/ext4/resize.c:251:2: warning: format '%#04llx' expects type 'long long
unsigned int', but argument 3 has type 'long unsigned int'
  CC      fs/ext4/extents.o
  CC      fs/ext4/ext4_jbd2.o
  CC      fs/ext4/migrate.o

Reported-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-21 20:39:58 -05:00
Eric Sandeen e9e3bcecf4 ext4: serialize unaligned asynchronous DIO
ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.

The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and 
dio_zero_block() will zero out the unwritten portions.  When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.

Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO.  I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write().  But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.

I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex.  So that won't work.

This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment.  I've
tested a backport of this patch with qemu, and it does
avoid the corruption.  It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.

Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.

[tytso@mit.edu: Keep the mutex as a hashed array instead
 of bloating the ext4 inode]

[tytso@mit.edu: Fix up namespace issues so that global
 variables are protected with an "ext4_" prefix.]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 08:17:34 -05:00
Eric Sandeen 2892c15ddd ext4: make grpinfo slab cache names static
In 2.6.37 I was running into oopses with repeated module
loads & unloads.  I tracked this down to:

fb1813f4 ext4: use dedicated slab caches for group_info structures

(this was in addition to the features advert unload problem)

The kstrdup & subsequent kfree of the cache name was causing
a double free.  In slub, at least, if I read it right it allocates
& frees the name itself, slab seems to do something different...
so in slub I think we were leaking -our- cachep->name, and double
freeing the one allocated by slub.

After getting lost in slab/slub/slob a bit, I just looked at other
sized-caches that get allocated.  jbd2, biovec, sgpool all do it
more or less the way jbd2 does.  Below patch follows the jbd2
method of dynamically allocating a cache at mount time from
a list of static names.

(This might also possibly fix a race creating the caches with
parallel mounts running).

[Folded in a fix from Dan Carpenter which fixed an off-by-one error in
the original patch]

Cc: stable@kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 08:12:18 -05:00
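The "list of static names" approach from 2892c15ddd can be sketched as a constant table indexed by size class. The names and the helper are hypothetical; the point is that the strings have static storage, so nothing needs to be duplicated or freed.

```c
#include <stddef.h>

/* Hypothetical static cache names, one per size class. */
static const char * const demo_cache_names[] = {
	"demo_groupinfo_1k",
	"demo_groupinfo_2k",
	"demo_groupinfo_4k",
	"demo_groupinfo_8k",
};

#define DEMO_NR_CACHES \
	(sizeof(demo_cache_names) / sizeof(demo_cache_names[0]))

/*
 * Look up the name for a size class. The bound must be ">=" rather
 * than ">": the last valid index is DEMO_NR_CACHES - 1.
 */
static const char *demo_cache_name(size_t index)
{
	if (index >= DEMO_NR_CACHES)
		return NULL;
	return demo_cache_names[index];
}
```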
Curt Wohlgemuth d50bdd5aa5 ext4: Fix data corruption with multi-block writepages support
This fixes a corruption problem with the multi-block
writepages submittal change for ext4, from commit
bd2d0210cf ("ext4: use bio
layer instead of buffer layer in mpage_da_submit_io").

(Note that this corruption is not present in 2.6.37 on
ext4, because the corruption was detected after the
feature was merged in 2.6.37-rc1, and so it was turned
off by adding a non-default mount option,
mblk_io_submit.  With this commit, which hopefully
fixes the last of the bugs with this feature, we'll be
able to turn on this performance feature by default in
2.6.38, and remove the mblk_io_submit option.)

The ext4 code path to bundle multiple pages for
writeback in ext4_bio_write_page() had a bug: we should
be clearing buffer head dirty flags *before* we submit
the bio, not in the completion routine.

The patch below was tested on 2.6.37 under KVM with the
postgresql script which was submitted by Jon Nelson as
documented in commit 1449032be1.

Without the patch, I'd hit the corruption problem about
50-70% of the time.  With the patch, I executed the
script > 100 times with no corruption seen.

I also fixed a bug to make sure ext4_end_bio() doesn't
dereference the bio after the bio_put() call.

Reported-by: Jon Nelson <jnelson@jamponi.net>
Reported-by: Matthias Bayer <jackdachef@gmail.com>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-02-07 12:46:14 -05:00
Theodore Ts'o dd68314ccf ext4: fix up ext4 error handling
Make sure the correct cleanup happens if we die while trying to
load the ext4 file system.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-03 14:33:49 -05:00
Lukas Czerner 8f021222c1 ext4: unregister features interface on module unload
The ext4 features interface was not properly unregistered, which led to
problems while unloading/reloading the ext4 module. This commit fixes that by
adding proper kobject unregistration code to ext4_exit_fs() as well as
to the fail-path of ext4_init_fs().

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-02-03 14:33:33 -05:00
Eric Sandeen 8f1f745331 ext4: fix panic on module unload when stopping lazyinit thread
https://bugzilla.kernel.org/show_bug.cgi?id=27652

If the lazyinit thread is running, the teardown function
ext4_destroy_lazyinit_thread() has problems:

        ext4_clear_request_list();
        while (ext4_li_info->li_task) {
                wake_up(&ext4_li_info->li_wait_daemon);
                wait_event(ext4_li_info->li_wait_task,
                           ext4_li_info->li_task == NULL);
        }

Clearing the request list will cause the thread to exit and free
ext4_li_info, so then we're waiting on something which is getting
freed.

Fix this up by making the thread respond to kthread_stop, and exit,
without the need to wait for that exit in some other homegrown way.

Cc: stable@kernel.org
Reported-and-Tested-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-03 14:33:15 -05:00
Linus Torvalds 4843456c5c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  quota: Fix deadlock during path resolution
2011-01-21 07:33:37 -08:00
Christoph Hellwig 2fe17c1075 fallocate should be a file operation
Currently all filesystems except XFS implement fallocate asynchronously,
while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
I/O we really want our allocation on disk, especially for the !KEEP_SIZE
case where we actually grow the file with user-visible zeroes.  On the
other hand always committing the transaction is a bad idea for fast-path
uses of fallocate like for example in recent Samba versions.   Given
that block allocation is a data plane operation anyway change it from
an inode operation to a file operation so that we have the file structure
available that lets us check for O_SYNC.

This also includes moving the code around for a few of the filesystems,
and removing the already unneeded S_ISDIR checks given that we only wire
up fallocate for regular files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 02:25:31 -05:00
Christoph Hellwig 64c23e8687 make the feature checks in ->fallocate future proof
Instead of various home grown checks that might need updates for new
flags just check for any bit outside the mask of the features supported
by the filesystem.  This makes the check future proof for any newly
added flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 02:25:30 -05:00
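The mask-based check described in 64c23e8687 can be sketched as below. The DEMO_ flag names and the function are hypothetical stand-ins; the technique is rejecting any bit outside the supported mask, so a flag added later is refused automatically.

```c
#include <errno.h>

/* Hypothetical mode flags standing in for the fallocate flags. */
#define DEMO_FL_KEEP_SIZE	0x01
#define DEMO_FL_PUNCH_HOLE	0x02

/*
 * Reject any mode bit that is not in the supported mask. No
 * per-flag test is needed, so newly introduced flags are refused
 * without touching this function.
 */
static int demo_check_fallocate_mode(int mode, int supported)
{
	if (mode & ~supported)
		return -EOPNOTSUPP;
	return 0;
}
```

A filesystem supporting only DEMO_FL_KEEP_SIZE would pass that single flag as the mask; a request carrying DEMO_FL_PUNCH_HOLE then fails with -EOPNOTSUPP.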
Linus Torvalds 275220f0fc Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
  block: ensure that completion error gets properly traced
  blktrace: add missing probe argument to block_bio_complete
  block cfq: don't use atomic_t for cfq_group
  block cfq: don't use atomic_t for cfq_queue
  block: trace event block fix unassigned field
  block: add internal hd part table references
  block: fix accounting bug on cross partition merges
  kref: add kref_test_and_get
  bio-integrity: mark kintegrityd_wq highpri and CPU intensive
  block: make kblockd_workqueue smarter
  Revert "sd: implement sd_check_events()"
  block: Clean up exit_io_context() source code.
  Fix compile warnings due to missing removal of a 'ret' variable
  fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
  block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
  cfq-iosched: don't check cfqg in choose_service_tree()
  fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
  cdrom: export cdrom_check_events()
  sd: implement sd_check_events()
  sr: implement sr_check_events()
  ...
2011-01-13 10:45:01 -08:00
Linus Torvalds b2034d474b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
  fs: add documentation on fallocate hole punching
  Gfs2: fail if we try to use hole punch
  Btrfs: fail if we try to use hole punch
  Ext4: fail if we try to use hole punch
  Ocfs2: handle hole punching via fallocate properly
  XFS: handle hole punching via fallocate properly
  fs: add hole punching to fallocate
  vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
  fix signedness mess in rw_verify_area() on 64bit architectures
  fs: fix kernel-doc for dcache::prepend_path
  fs: fix kernel-doc for dcache::d_validate
  sanitize ecryptfs ->mount()
  switch afs
  move internal-only parts of ncpfs headers to fs/ncpfs
  switch ncpfs
  switch 9p
  pass default dentry_operations to mount_pseudo()
  switch hostfs
  switch affs
  switch configfs
  ...
2011-01-13 10:27:28 -08:00