Commit graph

31813 commits

Author SHA1 Message Date
boxi liu c4932dbe63 ext4: improve free space calculation for inline_data
In ext4 feature inline_data,it use the xattr's space to store the
inline data in inode.When we calculate the inline data as the xattr,we
add the pad.But in get_max_inline_xattr_value_size() function we count
the free space without pad.It cause some contents are moved to a block
even if it can be
stored in the inode.

Signed-off-by: liulei <lewis.liulei@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
2013-07-01 08:12:37 -04:00
Joe Perches e7c96e8e47 ext4: reduce object size when !CONFIG_PRINTK
Reduce the object size ~10% could be useful for embedded systems.

Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
arguments, passing " " to functions when !CONFIG_PRINTK and still
verifying format and arguments with no_printk.

$ size fs/ext4/built-in.o*
   text	   data	    bss	    dec	    hex	filename
 239375	    610	    888	 240873	  3ace9	fs/ext4/built-in.o.new
 264167	    738	    888	 265793	  40e41	fs/ext4/built-in.o.old

    $ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
    # CONFIG_PRINTK is not set
    CONFIG_EXT4_FS=y
    CONFIG_EXT4_USE_FOR_EXT23=y
    CONFIG_EXT4_FS_POSIX_ACL=y
    # CONFIG_EXT4_FS_SECURITY is not set
    # CONFIG_EXT4_DEBUG is not set

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:37 -04:00
Zheng Liu d3922a777f ext4: improve extent cache shrink mechanism to avoid to burn CPU time
Now we maintain an proper in-order LRU list in ext4 to reclaim entries
from extent status tree when we are under heavy memory pressure.  For
keeping this order, a spin lock is used to protect this list.  But this
lock burns a lot of CPU time.  We can use the following steps to trigger
it.

  % cd /dev/shm
  % dd if=/dev/zero of=ext4-img bs=1M count=2k
  % mkfs.ext4 ext4-img
  % mount -t ext4 -o loop ext4-img /mnt
  % cd /mnt
  % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
  % for ((i=0;i<160;i++)); do cp $i /dev/null &; done
  % perf record -a -g
  % perf report

This commit tries to fix this problem.  Now a new member called
i_touch_when is added into ext4_inode_info to record the last access
time for an inode.  Meanwhile we never need to keep a proper in-order
LRU list.  So this can avoid to burns some CPU time.  When we try to
reclaim some entries from extent status tree, we use list_sort() to get
a proper in-order list.  Then we traverse this list to discard some
entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
time of sorting this list.  When we traverse the list, we skip the inode
that is newer than this time, and move this inode to the tail of LRU
list.  When the head of the list is newer than s_es_last_sorted, we will
sort the LRU list again.

In this commit, we break the loop if s_extent_cache_cnt == 0 because
that means that all extents in extent status tree have been reclaimed.

Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
changed to save a local variable in these functions.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:37 -04:00
Alexey Khoroshilov 2c00ef3ee3 ext4: implement error handling of ext4_mb_new_preallocation()
If memory allocation in ext4_mb_new_group_pa() is failed,
it returns error code, ext4_mb_new_preallocation() propages it,
but ext4_mb_new_blocks() ignores it.

An observed result was:

- allocation fail means ext4_mb_new_group_pa() does not update
  ext4_allocation_context;

- ext4_mb_new_blocks() sets ext4_allocation_request->len (ar->len =
  ac->ac_b_ex.fe_len;) to number of blocks preallocated (512) instead
  of number of blocks requested (1);

- that activates update cycle in ext4_splice_branch():
    for (i = 1; i < blks; i++) <-- blks is 512 instead of 1 here
      *(where->p + i) = cpu_to_le32(current_block++);

- it iterates 511 times and corrupts a chunk of memory including inode
  structure;

- page fault happens at EXT4_SB(inode->i_sb) in ext4_mark_inode_dirty();

- system hangs with 'scheduling while atomic' BUG.

The patch implements a check for ext4_mb_new_preallocation() error
code and handles its failure as if ext4_mb_regular_allocator() fails.

Found by Linux File System Verification project (linuxtesting.org).

[ Patch restructed by tytso to make the flow of control easier to follow. ]

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:36 -04:00
Maarten ter Huurne 6ca792edc1 ext4: fix corruption when online resizing a fs with 1K block size
Subtracting the number of the first data block places the superblock
backups one block too early, corrupting the file system. When the block
size is larger than 1K, the first data block is 0, so the subtraction
has no effect and no corruption occurs.

Signed-off-by: Maarten ter Huurne <maarten@treewalker.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
CC: stable@vger.kernel.org
2013-07-01 08:12:08 -04:00
Jon Ernst 03b40e3496 ext4: delete unused variables
This patch removed several unused variables.

Signed-off-by: Jon Ernst <jonernst07@gmx.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-17 08:56:26 -04:00
Jie Liu 72dac95d44 ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
Return the FIEMAP_EXTENT_UNKNOWN flag as well except the
FIEMAP_EXTENT_DELALLOC because the data location of an
delayed allocation extent is unknown.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
2013-06-12 23:13:59 -04:00
Paul Gortmaker 75497d0607 jbd2: remove debug dependency on debug_fs and update Kconfig help text
Commit b6e96d0067 ("jbd2: use module parameters instead of debugfs
for jbd_debug") removed any need for a dependency on DEBUG_FS.  It
also moved the /sys variables out from underneath the typical debugfs
mount point.  Delete the dependency and update the /sys path to where
the debug settings are currently.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 23:07:51 -04:00
Paul Gortmaker 169f1a2a87 jbd2: use a single printk for jbd_debug()
Since the jbd_debug() is implemented with two separate printk()
calls, it can lead to corrupted and misleading debug output like
the following (see lines marked with "*"):

[  290.339362] (fs/jbd2/journal.c, 203): kjournald2: kjournald2 wakes
[  290.339365] (fs/jbd2/journal.c, 155): kjournald2: commit_sequence=42103, commit_request=42104
[  290.339369] (fs/jbd2/journal.c, 158): kjournald2: OK, requests differ
[* 290.339376] (fs/jbd2/journal.c, 648): jbd2_log_wait_commit:
[* 290.339379] (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: want 42104, j_commit_sequence=42103
[* 290.339382] JBD2: starting commit of transaction 42104
[  290.339410] (fs/jbd2/revoke.c, 566): jbd2_journal_write_revoke_records: Wrote 0 revoke records
[  290.376555] (fs/jbd2/commit.c, 1088): jbd2_journal_commit_transaction: JBD2: commit 42104 complete, head 42079

i.e. the debug output from log_wait_commit and journal_commit_transaction
have become interleaved.  The output should have been:

(fs/jbd2/journal.c, 648): jbd2_log_wait_commit: JBD2: want 42104, j_commit_sequence=42103
(fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: starting commit of transaction 42104

It is expected that this is not easy to replicate -- I was only able
to cause it on preempt-rt kernels, and even then only under heavy
I/O load.

Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Suggested-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 23:04:04 -04:00
Paul Gortmaker cfc7bc896f jbd2: fix duplicate debug label for phase 2
Currently we see this output:

  $git grep phase fs/jbd2
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 1\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 2\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 2\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 3\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 4\n");
  [...]

There is clearly a duplicate label for phase 2, and they are
both active (i.e. not in #if ... #else block).  Rename them to
be "2a" and "2b" so the debug output is unambiguous.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:56:35 -04:00
Paul Gortmaker 0ef54180e0 jbd2: drop checkpoint mutex when waiting in __jbd2_log_wait_for_space()
While trying to debug an an issue under extreme I/O loading
on preempt-rt kernels, the following backtrace was observed
via SysRQ output:

rm              D ffff8802203afbc0  4600  4878   4748 0x00000000
 ffff8802217bfb78 0000000000000082 ffff88021fc2bb80 ffff88021fc2bb80
 ffff88021fc2bb80 ffff8802217bffd8 ffff8802217bffd8 ffff8802217bffd8
 ffff88021f1d4c80 ffff88021fc2bb80 ffff8802217bfb88 ffff88022437b000
Call Trace:
 [<ffffffff8172dc34>] schedule+0x24/0x70
 [<ffffffff81225b5d>] jbd2_log_wait_commit+0xbd/0x140
 [<ffffffff81060390>] ? __init_waitqueue_head+0x50/0x50
 [<ffffffff81223635>] jbd2_log_do_checkpoint+0xf5/0x520
 [<ffffffff81223b09>] __jbd2_log_wait_for_space+0xa9/0x1f0
 [<ffffffff8121dc40>] start_this_handle.isra.10+0x2e0/0x530
 [<ffffffff81060390>] ? __init_waitqueue_head+0x50/0x50
 [<ffffffff8121e0a3>] jbd2__journal_start+0xc3/0x110
 [<ffffffff811de7ce>] ? ext4_rmdir+0x6e/0x230
 [<ffffffff8121e0fe>] jbd2_journal_start+0xe/0x10
 [<ffffffff811f308b>] ext4_journal_start_sb+0x5b/0x160
 [<ffffffff811de7ce>] ext4_rmdir+0x6e/0x230
 [<ffffffff811435c5>] vfs_rmdir+0xd5/0x140
 [<ffffffff8114370f>] do_rmdir+0xdf/0x120
 [<ffffffff8105c6b4>] ? task_work_run+0x44/0x80
 [<ffffffff81002889>] ? do_notify_resume+0x89/0x100
 [<ffffffff817361ae>] ? int_signal+0x12/0x17
 [<ffffffff81145d85>] sys_unlinkat+0x25/0x40
 [<ffffffff81735f22>] system_call_fastpath+0x16/0x1b

What is interesting here, is that we call log_wait_commit, from
within wait_for_space, but we are still holding the checkpoint_mutex
as it surrounds mostly the whole of wait_for_space.  And then, as we
are waiting, journal_commit_transaction can run, and if the JBD2_FLUSHED
bit is set, then we will also try to take the same checkpoint_mutex.

It seems that we need to drop the checkpoint_mutex while sitting in
jbd2_log_wait_commit, if we want to guarantee that progress can be made
by jbd2_journal_commit_transaction().  There does not seem to be
anything preempt-rt specific about this, other then perhaps increasing
the odds of it happening.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:47:35 -04:00
Paul Gortmaker 3ca841c106 jbd2: relocate assert after state lock in journal_commit_transaction()
The state lock is taken after we are doing an assert on the state
value, not before.  So we might in fact be doing an assert on a
transient value.  Ensure the state check is within the scope of
the state lock being taken.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:46:35 -04:00
Dmitry Monakhov 4418e14112 ext4: Fix fsync error handling after filesystem abort
If filesystem was aborted after inode's write back is complete
but before its metadata was updated we may return success
results in data loss.
In order to handle fs abort correctly we have to check
fs state once we discover that it is in MS_RDONLY state

Test case: http://patchwork.ozlabs.org/patch/244297

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:38:04 -04:00
Dmitry Monakhov 06a407f13d ext4: fix data integrity for ext4_sync_fs
Inode's data or non journaled quota may be written w/o jounral so we
_must_ send a barrier at the end of ext4_sync_fs. But it can be
skipped if journal commit will do it for us.

Also fix data integrity for nojournal mode.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:25:07 -04:00
Dmitry Monakhov 9ff8644624 jbd2: optimize jbd2_journal_force_commit
Current implementation of jbd2_journal_force_commit() is suboptimal because
result in empty and useless commits. But callers just want to force and wait
any unfinished commits. We already have jbd2_journal_force_commit_nested()
which does exactly what we want, except we are guaranteed that we do not hold
journal transaction open.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:24:07 -04:00
Theodore Ts'o 981250ca89 ext4: don't use EXT4_FREE_BLOCKS_FORGET unnecessarily
Commit 18888cf088: "ext4: speed up truncate/unlink by not using
bforget() unless needed" removed the use of EXT4_FREE_BLOCKS_FORGET in
the most important codepath for file systems using extents, but a
similar optimization also can be done for file systems using indirect
blocks, and for the two special cases in the ext4 extents code.

Cc: Andrey Sidorov <qrxd43@motorola.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 11:48:29 -04:00
Theodore Ts'o 2ed5724d5a ext4: add cond_resched() to ext4_free_blocks() & ext4_mb_regular_allocator()
For a file systems with a very large number of block groups, if all of
the block group bitmaps are in memory and the file system is
relatively badly fragmented, it's possible ext4_mb_regular_allocator()
to take a long time trying to find a good match.  This is especially
true if the tuning parameter mb_max_to_scan has been sent to a very
large number.  So add a cond_resched() to avoid soft lockup warnings
and to provide better system responsiveness.

For ext4_free_blocks(), if we are deleting a large range of blocks,
and data=journal is enabled so that EXT4_FREE_BLOCKS_FORGET is passed,
the loop to call sb_find_get_block() and to call ext4_forget() can
take over 10-15 milliseocnds or more.  So it's better to add a
cond_resched() here a well.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 11:43:02 -04:00
Theodore Ts'o 20970ba65d ext4: use ext4_da_writepages() for all modes
Rename ext4_da_writepages() to ext4_writepages() and use it for all
modes.  We still need to iterate over all the pages in the case of
data=journalling, but in the case of nodelalloc/data=ordered (which is
what file systems mounted using ext3 backwards compatibility will use)
this will allow us to use a much more efficient I/O submission path.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 14:00:46 -04:00
Theodore Ts'o f4afb4f4e3 ext4: optimize test_root()
The test_root() function could potentially loop forever due to
overflow issues.  So rewrite test_root() to avoid this issue; as a
bonus, it is 38% faster when benchmarked via a test loop:

int main(int argc, char **argv)
{
	int  i;

	for (i = 0; i < 1 << 24; i++) {
		if (test_root(i, 7))
			printf("%d\n", i);
	}
}

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 11:40:37 -04:00
Theodore Ts'o 2f2e09eb15 ext4: add sanity check to ext4_get_group_info()
The group number passed to ext4_get_group_info() should be valid, but
let's add an assert to check this before we start creating a pointer
based on that group number and dereferencing it.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 11:16:43 -04:00
Theodore Ts'o b302ef2d3c ext4: verify group number in verify_group_input() before using it
Check the group number for sanity earilier, before calling routines
such as ext4_bg_has_super() or ext4_group_overhead_blocks().

Reported-by: Jonathan Salwan <jonathan.salwan@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 11:14:31 -04:00
Theodore Ts'o a1d8d9a757 ext4: add check to io_submit_init_bio
The bio_alloc() function can return NULL if the memory allocation
fails.  So we need to check for this.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 10:18:22 -04:00
Jan Kara 5dc23bdd5f ext4: remove ext4_ioend_wait()
Now that we clear PageWriteback after extent conversion, there's no
need to wait for io_end processing in ext4_evict_inode().  Running
AIO/DIO keeps file reference until aio_complete() is called so
ext4_evict_inode() cannot be called.  For io_end structures resulting
from buffered IO waiting is happening because we wait for
PageWriteback in truncate_inode_pages().

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:46:12 -04:00
Jan Kara c724585b62 ext4: don't wait for extent conversion in ext4_punch_hole()
We don't have to wait for extent conversion in ext4_punch_hole() as
buffered IO for the punched range has been flushed and waited upon
(thus all extent conversions for that range have completed).  Also we
wait for all DIO to finish using inode_dio_wait() so there cannot be
any extent conversions pending due to direct IO.

Also remove ext4_flush_unwritten_io() since it's unused now.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:44:36 -04:00
Jan Kara 38b8ff7db4 ext4: Remove wait for unwritten extents in ext4_ind_direct_IO()
We don't have to wait for unwritten extent conversion in
ext4_ind_direct_IO() as all writes that happened before DIO are
flushed by the generic code and extent conversion has happened before
we cleared PageWriteback bit.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:41:29 -04:00
Jan Kara 92e6222dfb ext4: remove i_mutex from ext4_file_sync()
After removal of ext4_flush_unwritten_io() call, ext4_file_sync()
doesn't need i_mutex anymore. Forcing of transaction commits doesn't
need i_mutex as there's nothing inode specific in that code apart from
grabbing transaction ids from the inode. So remove the lock.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:40:39 -04:00
Jan Kara 37b10dd063 ext4: use generic_file_fsync() in ext4_file_fsync() in nojournal mode
Just use the generic function instead of duplicating it.  We only need
to reshuffle the read-only check a bit (which is there to prevent
writing to a filesystem which has been remounted read-only after error
I assume).

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:40:09 -04:00
Jan Kara a115f749c1 ext4: remove wait for unwritten extent conversion from ext4_truncate()
Since PageWriteback bit is now cleared after extents are converted
from unwritten to written ones, we have full exclusion of writeback
path from truncate (truncate_inode_pages() waits for PageWriteback
bits to get cleared on all invalidated pages).  Exclusion from DIO
path is achieved by inode_dio_wait() call in ext4_setattr().  So
there's no need to wait for extent convertion in ext4_truncate()
anymore.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:30:00 -04:00
Jan Kara e83403959f ext4: protect extent conversion after DIO with i_dio_count
Make sure extent conversion after DIO happens while i_dio_count is
still elevated so that inode_dio_wait() waits until extent conversion
is done.  This removes the need for explicit waiting for extent
conversion in some cases.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:27:38 -04:00
Jan Kara b0857d309f ext4: defer clearing of PageWriteback after extent conversion
Currently PageWriteback bit gets cleared from put_io_page() called
from ext4_end_bio().  This is somewhat inconvenient as extent tree is
not fully updated at that time (unwritten extents are not marked as
written) so we cannot read the data back yet.  This design was
dictated by lock ordering as we cannot start a transaction while
PageWriteback bit is set (we could easily deadlock with
ext4_da_writepages()).  But now that we use transaction reservation
for extent conversion, locking issues are solved and we can move
PageWriteback bit clearing after extent conversion is done.  As a
result we can remove wait for unwritten extent conversion from
ext4_sync_file() because it already implicitely happens through
wait_on_page_writeback().

We implement deferring of PageWriteback clearing by queueing completed
bios to appropriate io_end and processing all the pages when io_end is
going to be freed instead of at the moment ext4_io_end() is called.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:23:41 -04:00
Jan Kara 2e8fa54e3b ext4: split extent conversion lists to reserved & unreserved parts
Now that we have extent conversions with reserved transaction, we have
to prevent extent conversions without reserved transaction (from DIO
code) to block these (as that would effectively void any transaction
reservation we did).  So split lists, work items, and work queues to
reserved and unreserved parts.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:21:02 -04:00
Jan Kara 6b523df4fb ext4: use transaction reservation for extent conversion in ext4_end_io
Later we would like to clear PageWriteback bit only after extent
conversion from unwritten to written extents is performed.  However it
is not possible to start a transaction after PageWriteback is set
because that violates lock ordering (and is easy to deadlock).  So we
have to reserve a transaction before locking pages and sending them
for IO and later we use the transaction for extent conversion from
ext4_end_io().

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 13:21:11 -04:00
Jan Kara 3613d22807 ext4: remove buffer_uninit handling
There isn't any need for setting BH_Uninit on buffers anymore.  It was
only used to signal we need to mark io_end as needing extent
conversion in add_bh_to_extent() but now we can mark the io_end
directly when mapping extent.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 13:19:34 -04:00
Jan Kara 4e7ea81db5 ext4: restructure writeback path
There are two issues with current writeback path in ext4.  For one we
don't necessarily map complete pages when blocksize < pagesize and
thus needn't do any writeback in one iteration.  We always map some
blocks though so we will eventually finish mapping the page.  Just if
writeback races with other operations on the file, forward progress is
not really guaranteed. The second problem is that current code
structure makes it hard to associate all the bios to some range of
pages with one io_end structure so that unwritten extents can be
converted after all the bios are finished.  This will be especially
difficult later when io_end will be associated with reserved
transaction handle.

We restructure the writeback path to a relatively simple loop which
first prepares extent of pages, then maps one or more extents so that
no page is partially mapped, and once page is fully mapped it is
submitted for IO. We keep all the mapping and IO submission
information in mpage_da_data structure to somewhat reduce stack usage.
Resulting code is somewhat shorter than the old one and hopefully also
easier to read.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 13:17:40 -04:00
Jan Kara fffb273997 ext4: better estimate credits needed for ext4_da_writepages()
We limit the number of blocks written in a single loop of
ext4_da_writepages() to 64 when inode uses indirect blocks.  That is
unnecessary as credit estimates for mapping logically continguous run
of blocks is rather low even for inode with indirect blocks.  So just
lift this limitation and properly calculate the number of necessary
credits.

This better credit estimate will also later allow us to always write
at least a single page in one iteration.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 13:01:11 -04:00
Jan Kara fa55a0ed03 ext4: improve writepage credit estimate for files with indirect blocks
ext4_ind_trans_blocks() wrongly used 'chunk' argument to decide whether
blocks mapped are logically contiguous. That is wrong since the argument
informs whether the blocks are physically contiguous. As the blocks
mapped are always logically contiguous and that's all
ext4_ind_trans_blocks() cares about, just remove the 'chunk' argument.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:56:55 -04:00
Jan Kara f2d50a65c9 ext4: deprecate max_writeback_mb_bump sysfs attribute
This attribute is now unused so deprecate it.  We still show the old
default value to keep some compatibility but we don't allow writing to
that attribute anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:51:16 -04:00
Jan Kara 39bba40b7a ext4: stop messing with nr_to_write in ext4_da_writepages()
Writeback code got better in how it submits IO and now the number of
pages requested to be written is usually higher than original 1024.
The number is now dynamically computed based on observed throughput
and is set to be about 0.5 s worth of writeback.  E.g. on ordinary
SATA drive this ends up somewhere around 10000 as my testing shows.
So remove the unnecessary smarts from ext4_da_writepages().

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:50:24 -04:00
Jan Kara 5fe2fe895a ext4: provide wrappers for transaction reservation calls
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:37:50 -04:00
Jan Kara 8f7d89f368 jbd2: transaction reservation support
In some cases we cannot start a transaction because of locking
constraints and passing started transaction into those places is not
handy either because we could block transaction commit for too long.
Transaction reservation is designed to solve these issues.  It
reserves a handle with given number of credits in the journal and the
handle can be later attached to the running transaction without
blocking on commit or checkpointing.  Reserved handles do not block
transaction commit in any way, they only reduce maximum size of the
running transaction (because we have to always be prepared to
accomodate request for attaching reserved handle).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:35:11 -04:00
Jan Kara f29fad7210 jbd2: remove unused waitqueues
j_wait_logspace and j_wait_checkpoint are unused.  Remove them.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:24:11 -04:00
Jan Kara fe1e8db598 jbd2: fix race in t_outstanding_credits update in jbd2_journal_extend()
jbd2_journal_extend() first checked whether transaction can accept
extending handle with more credits and then added credits to
t_outstanding_credits.  This can race with start_this_handle() adding
another handle to a transaction and thus overbooking a transaction.
Make jbd2_journal_extend() use atomic_add_return() to close the race.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:22:15 -04:00
Jan Kara 76c3990456 jbd2: cleanup needed free block estimates when starting a transaction
__jbd2_log_space_left() and jbd_space_needed() were kind of odd.
jbd_space_needed() accounted also credits needed for currently
committing transaction while it didn't account for credits needed for
control blocks.  __jbd2_log_space_left() then accounted for control
blocks as a fraction of free space.  Since results of these two
functions are always only compared against each other, this works
correct but is somewhat strange.  Move the estimates so that
jbd_space_needed() returns number of blocks needed for a transaction
including control blocks and __jbd2_log_space_left() returns free
space in the journal (with the committing transaction already
subtracted).  Rename functions to jbd2_log_space_left() and
jbd2_space_needed() while we are changing them.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:12:57 -04:00
Jan Kara 2f387f849b jbd2: remove outdated comment
The comment about credit estimates isn't true anymore. We do what the
comment describes now.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:10:11 -04:00
Jan Kara b34090e5e2 jbd2: refine waiting for shadow buffers
Currently when we add a buffer to a transaction, we wait until the
buffer is removed from BJ_Shadow list (so that we prevent any changes
to the buffer that is just written to the journal).  This can take
unnecessarily long as a lot happens between the time the buffer is
submitted to the journal and the time when we remove the buffer from
BJ_Shadow list.  (e.g.  We wait for all data buffers in the
transaction, we issue a cache flush, etc.)  Also this creates a
dependency of do_get_write_access() on transaction commit (namely
waiting for data IO to complete) which we want to avoid when
implementing transaction reservation.

So we modify commit code to set new BH_Shadow flag when temporary
shadowing buffer is created and we clear that flag once IO on that
buffer is complete.  This allows do_get_write_access() to wait only
for BH_Shadow bit and thus removes the dependency on data IO
completion.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:08:56 -04:00
Jan Kara e5a120aeb5 jbd2: remove journal_head from descriptor buffers
Similarly as for metadata buffers, also log descriptor buffers don't
really need the journal head. So strip it and remove BJ_LogCtl list.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:06:01 -04:00
Jan Kara f5113effc2 jbd2: don't create journal_head for temporary journal buffers
When writing metadata to the journal, we create temporary buffer heads
for that task.  We also attach journal heads to these buffer heads but
the only purpose of the journal heads is to keep buffers linked in
transaction's BJ_IO list.  We remove the need for journal heads by
reusing buffer_head's b_assoc_buffers list for that purpose.  Also
since BJ_IO list is just a temporary list for transaction commit, we
use a private list in jbd2_journal_commit_transaction() for that thus
removing BJ_IO list from transaction completely.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:01:45 -04:00
Jan Kara 97a851ed71 ext4: use io_end for multiple bios
Change writeback path to create just one io_end structure for the
extent to which we submit IO and share it among bios writing that
extent. This prevents needless splitting and joining of unwritten
extents when they cannot be submitted as a single bio.

Bugs in ENOMEM handling found by Linux File System Verification project
(linuxtesting.org) and fixed by Alexey Khoroshilov
<khoroshilov@ispras.ru>.

CC: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 11:58:58 -04:00
Jan Kara 8af8eecc13 ext4: fix overflow when counting used blocks on 32-bit architectures
The arithmetics adding delalloc blocks to the number of used blocks in
ext4_getattr() can easily overflow on 32-bit archs as we first multiply
number of blocks by blocksize and then divide back by 512. Make the
arithmetics more clever and also use proper type (unsigned long long
instead of unsigned long).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-31 19:39:56 -04:00
Jan Kara a60697f411 ext4: fix data offset overflow in ext4_xattr_fiemap() on 32-bit archs
On 32-bit architectures with 32-bit sector_t computation of data offset
in ext4_xattr_fiemap() can overflow resulting in reporting bogus data
location. Fix the problem by typing block number to proper type before
shifting.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-31 19:38:56 -04:00