Commit graph

20526 commits

Author SHA1 Message Date
Andrew Morton 6c5a6cb998 ext4: fix uninitialized variable in ext4_register_li_request
fs/ext4/super.c: In function 'ext4_register_li_request':
fs/ext4/super.c:2936: warning: 'ret' may be used uninitialized in this function

It looks buggy to me, too.

Cc: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:30:17 -05:00
Theodore Ts'o 8aefcd557d ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary
Replace the jbd2_inode structure (which is 48 bytes) with a pointer
and only allocate the jbd2_inode when it is needed --- that is, when
the file system has a journal present and the inode has been opened
for writing.  This allows us to further slim down the ext4_inode_info
structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:29:43 -05:00
Theodore Ts'o 353eb83c14 ext4: drop i_state_flags on architectures with 64-bit longs
We can store the dynamic inode state flags in the high bits of
EXT4_I(inode)->i_flags, and eliminate i_state_flags.  This saves 8
bytes from the size of ext4_inode_info structure, which when
multiplied by the number of the number of in the inode cache, can save
a lot of memory.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:18:25 -05:00
Theodore Ts'o 8a2005d3f8 ext4: reorder ext4_inode_info structure elements to remove unneeded padding
By reordering the elements in the ext4_inode_info structure, we can
reduce the padding needed on an x86_64 system by 16 bytes.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:13:42 -05:00
Theodore Ts'o b05e6ae58a ext4: drop ec_type from the ext4_ext_cache structure
We can encode the ec_type information by using ee_len == 0 to denote
EXT4_EXT_CACHE_NO, ee_start == 0 to denote EXT4_EXT_CACHE_GAP, and if
neither is true, then the cache type must be EXT4_EXT_CACHE_EXTENT.
This allows us to reduce the size of ext4_ext_inode by another 8
bytes.  (ec_type is 4 bytes, plus another 4 bytes of padding)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:13:26 -05:00
Theodore Ts'o 01f49d0b9d ext4: use ext4_lblk_t instead of sector_t for logical blocks
This fixes a number of places where we used sector_t instead of
ext4_lblk_t for logical blocks, which for ext4 are still 32-bit data
types.  No point wasting space in the ext4_inode_info structure, and
requiring 64-bit arithmetic on 32-bit systems, when it isn't
necessary.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:13:03 -05:00
Theodore Ts'o f232109773 ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED
Remove the short element i_delalloc_reserved_flag from the
ext4_inode_info structure and replace it a new bit in i_state_flags.
Since we have an ext4_inode_info for every ext4 inode cached in the
inode cache, any savings we can produce here is a very good thing from
a memory utilization perspective.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:12:36 -05:00
Kazuya Mio ad4fb9cafe ext4: fix 32bit overflow in ext4_ext_find_goal()
ext4_ext_find_goal() returns an ideal physical block number that the block
allocator tries to allocate first. However, if a required file offset is
smaller than the existing extent's one, ext4_ext_find_goal() returns
a wrong block number because it may overflow at
"block - le32_to_cpu(ex->ee_block)". This patch fixes the problem.

ext4_ext_find_goal() will also return a wrong block number in case
a file offset of the existing extent is too big. In this case,
the ideal physical block number is fixed in ext4_mb_initialize_context(),
so it's no problem.

reproduce:
# dd if=/dev/zero of=/mnt/mp1/tmp bs=127M count=1 oflag=sync
# dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 seek=1 oflag=sync
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
 ext logical physical expected length flags
   0     128    67456             128 eof
/mnt/mp1/file: 2 extents found
# rm -rf /mnt/mp1/tmp
# echo $((512*4096)) > /sys/fs/ext4/loop0/mb_stream_req
# dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 oflag=sync conv=notrunc

result (linux-2.6.37-rc2 + ext4 patch queue):
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0    33280             128 
   1     128    67456    33407    128 eof
/mnt/mp1/file: 2 extents found

result(apply this patch):
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0    66560             128 
   1     128    67456    66687    128 eof
/mnt/mp1/file: 2 extents found

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:12:28 -05:00
Namhyung Kim dabd991f9d ext4: add more error checks to ext4_mkdir()
Check return value of ext4_journal_get_write_access,
ext4_journal_dirty_metadata and ext4_mark_inode_dirty. Move brelse()
under 'out_stop' to release bh properly in case of journal error.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:11:16 -05:00
Eric Paris f1dffc4c54 ext4: ext4_ext_migrate should use NULL not 0
ext4_ext_migrate() calls ext4_new_inode() and passes 0 instead of a pointer
to a struct qstr.  This patch uses NULL, to make it obvious to the caller
that this was a pointer.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:11:00 -05:00
Theodore Ts'o f7c21177af ext4: Use ext4_error_file() to print the pathname to the corrupted inode
Where the file pointer is available, use ext4_error_file() instead of
ext4_error_inode().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:55 -05:00
Dan Carpenter f9a62d090c ext4: use IS_ERR() to check for errors in ext4_error_file
d_path() returns an ERR_PTR and it doesn't return NULL.  This is in
ext4_error_file() and no one actually calls ext4_error_file().

Signed-off-by: Dan Carpenter <error27@gmail.com>
2011-01-10 12:10:50 -05:00
Dan Carpenter 13195184a8 ext4: test the correct variable in ext4_init_pageio()
This is a copy and paste error.  The intent was to check
"io_page_cachep".  We tested "io_page_cachep" earlier.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:44 -05:00
Wang Sheng-Hui 1f605b3027 ext2: remove dead code in ext2_xattr_get
Reviewed-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Wang Sheng-Hui <crosslonelyover@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:37 -05:00
Wang Sheng-Hui 6e9510b0e0 ext2,ext3,ext4: clarify comment for extN_xattr_set_handle
Signed-off-by: Wang Sheng-Hui <crosslonelyover@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:30 -05:00
Theodore Ts'o eaeef86718 ext4: clean up ext4_xattr_list()'s error code checking and return strategy
Any time you see code that tries to add error codes together, you
should want to claw your eyes out...

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:07 -05:00
Lukas Czerner 9325963667 ext4: remove warning message from ext4_issue_discard helper
ext4_issue_discard is supposed to be helper for calling discard, however
in case that underlying device does not support discard it prints out
the warning message and clears the DISCARD t_mount_opt flag. Since it
can be (and is) used by others, it should not do anything and let the
caller to handle the error case.

This commit removes warning message and flag setting from
ext4_issue_discard and use it just in place where it is really needed
(release_blocks_on_commit). FITRIM ioctl should not set any flags nor it
should print out warning messages, so get rid of the warning as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2011-01-10 12:09:59 -05:00
Lukas Czerner 4f531501e4 ext4: fix possible overflow in ext4_trim_fs()
When determining last group through ext4_get_group_no_and_offset() the
result may be wrong in cases when range->start and range-len are too
big, because it may overflow when summing up those two numbers.

Fix that by checking range->len and limit its value to
ext4_blocks_count(). This commit was tested by myself with expected
result.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2011-01-10 12:04:55 -05:00
Theodore Ts'o b72143ab3e ext4: Add error checking to kmem_cache_alloc() call in ext4_free_blocks()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-20 07:26:59 -05:00
Joe Perches 0ff2ea7d84 ext4: Use printf extension %pV
Using %pV reduces the number of printk calls and eliminates any
possible message interleaving from other printk calls.

In function __ext4_grp_locked_error also added KERN_CONT to some
printks.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:43:19 -05:00
Joe Perches 94de56ab20 ext4: Use vzalloc in ext4_fill_flex_info()
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:21:02 -05:00
Eric Sandeen af0b44a197 ext4: zero out nanosecond timestamps for small inodes
When nanosecond timestamp resolution isn't supported on an ext4
partition (inode size = 128), stat() appears to be returning
uninitialized garbage in the nanosecond component of timestamps.

EXT4_INODE_GET_XTIME should zero out tv_nsec when EXT4_FITS_IN_INODE
evaluates to false.

Reported-by: Jordan Russell <jr-list-2010@quo.to>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:10:31 -05:00
Theodore Ts'o cad3f00763 ext4: optimize ext4_check_dir_entry() with unlikely() annotations
This function gets called a lot for large directories, and the answer
is almost always "no, no, there's no problem".  This means using
unlikely() is a good thing.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:07:02 -05:00
Jesper Juhl b17b35ec13 ext4: use kmem_cache_zalloc() in ext4_init_io_end()
Use advantage of kmem_cache_zalloc() to remove a memset() call in
ext4_init_io_end() and save a few bytes.

Before:
 [jj@dragon linux-2.6]$ size fs/ext4/page-io.o
    text    data     bss     dec     hex filename
    3016       0     624    3640     e38 fs/ext4/page-io.o
After:
 [jj@dragon linux-2.6]$ size fs/ext4/page-io.o
    text    data     bss     dec     hex filename
    3000       0     624    3624     e28 fs/ext4/page-io.o

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 21:41:55 -05:00
Tobias Klauser 6ca7b13dea ext4: Remove redundant unlikely()
IS_ERR() already implies unlikely(), so it can be omitted here.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 21:38:46 -05:00
Theodore Ts'o b7271b0a39 jbd2: simplify return path of journal_init_common
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:39:38 -05:00
Theodore Ts'o 9a4f6271b6 jbd2: move debug message into debug #ifdef
This is a port to jbd2 of a patch which Namhyung Kim <namhyung@gmail.com>
originally made to fs/jbd.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:36:33 -05:00
Theodore Ts'o ae00b267f3 jbd2: remove unnecessary goto statement
This is a port to jbd2 of a patch which Namhyung Kim <namhyung@gmail.com>
originally made to fs/jbd.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:34:20 -05:00
Theodore Ts'o a1dd533184 jbd2: use offset_in_page() instead of manual calculation
This is a port to jbd2 of a patch which Namhyung Kim <namhyung@gmail.com>
originally made to fs/jbd.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:13:40 -05:00
Theodore Ts'o cfef2c6a55 jbd2: Fix a debug message in do_get_write_access()
'buffer_head' should be 'journal_head'

This is a port of a patch which Namhyung Kim <namhyung@gmail.com> made
to fs/jbd to jbd2.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:07:34 -05:00
Theodore Ts'o 670be5a78a jbd2: Use pr_notice_ratelimited() in journal_alloc_journal_head()
We had an open-coded version of printk_ratelimited(); use the provided
abstraction to make the code cleaner and easier to understand.

Based on a similar patch for fs/jbd from Namhyung Kim <namhyung@gmail.com>

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-17 10:44:16 -05:00
Theodore Ts'o a8901d3487 ext4: Use pr_warning_ratelimited() instead of printk_ratelimit()
printk_ratelimit() is deprecated since it is a global instead of a
per-printk ratelimit.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-17 10:40:47 -05:00
Theodore Ts'o 225db7d35c ext4: Fix up comments in inode.c
This fixes up some broken argument descriptions that Namhyung Kim had
originally submitted for ext3.  This fixes the comments that were
still applicable in ext4.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-16 16:38:26 -05:00
Theodore Ts'o a2595b8aa6 ext4: Add second mount options field since the s_mount_opt is full up
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:30:48 -05:00
Theodore Ts'o 673c610033 ext4: Move struct ext4_mount_options from ext4.h to super.c
Move the ext4_mount_options structure definition from ext4.h, since it
is only used in super.c.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:28:48 -05:00
Theodore Ts'o fd8c37eccd ext4: Simplify the usage of clear_opt() and set_opt() macros
Change clear_opt() and set_opt() to take a superblock pointer instead
of a pointer to EXT4_SB(sb)->s_mount_opt.  This makes it easier for us
to support a second mount option field.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:26:48 -05:00
Linus Torvalds a4851d8f7d Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix typo which broke '..' detection in ext4_find_entry()
  ext4: Turn off multiple page-io submission by default
2010-12-15 12:41:17 -08:00
Tavis Ormandy 462e635e5b install_special_mapping skips security_file_mmap check.
The install_special_mapping routine (used, for example, to setup the
vdso) skips the security check before insert_vm_struct, allowing a local
attacker to bypass the mmap_min_addr security restriction by limiting
the available pages for special mappings.

bprm_mm_init() also skips the check, and although I don't think this can
be used to bypass any restrictions, I don't see any reason not to have
the security check.

  $ uname -m
  x86_64
  $ cat /proc/sys/vm/mmap_min_addr
  65536
  $ cat install_special_mapping.s
  section .bss
      resb BSS_SIZE
  section .text
      global _start
      _start:
          mov     eax, __NR_pause
          int     0x80
  $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
  $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
  $ ./install_special_mapping &
  [1] 14303
  $ cat /proc/14303/maps
  0000f000-00010000 r-xp 00000000 00:00 0                                  [vdso]
  00010000-00011000 r-xp 00001000 00:19 2453665                            /home/taviso/install_special_mapping
  00011000-ffffe000 rwxp 00000000 00:00 0                                  [stack]

It's worth noting that Red Hat are shipping with mmap_min_addr set to
4096.

Signed-off-by: Tavis Ormandy <taviso@google.com>
Acked-by: Kees Cook <kees@ubuntu.com>
Acked-by: Robert Swiecki <swiecki@google.com>
[ Changed to not drop the error code - akpm ]
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-15 12:30:36 -08:00
Aaro Koskinen 6d5c3aa84b ext4: fix typo which broke '..' detection in ext4_find_entry()
There should be a check for the NUL character instead of '0'.

Fortunately the only thing that cares about this is NFS serving, which
is why we didn't notice this in the merge window testing.

Reported-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 21:45:31 -05:00
Theodore Ts'o 1449032be1 ext4: Turn off multiple page-io submission by default
Jon Nelson has found a test case which causes postgresql to fail with
the error:

psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581

Under memory pressure, it looks like part of a file can end up getting
replaced by zero's.  Until we can figure out the cause, we'll roll
back the change and use block_write_full_page() instead of
ext4_bio_write_page().  The new, more efficient writing function can
be used via the mount option mblk_io_submit, so we can test and fix
the new page I/O code.

To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
memory such that the system just at the end of triggering writeback
before running the following sql script:

begin;
create temporary table foo as select x as a, ARRAY[x] as b FROM
generate_series(1, 10000000 ) AS x;
create index foo_a_idx on foo (a);
create index foo_b_idx on foo USING GIN (b);
rollback;

If the temporary table is created on a hard drive partition which is
encrypted using dm_crypt, then under memory pressure, approximately
30-40% of the time, pgsql will issue the above failure.

This patch should fix this problem, and the problem will come back if
the file system is mounted with the mblk_io_submit mount option.

Reported-by: Jon Nelson <jnelson@jamponi.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 15:27:50 -05:00
Linus Torvalds 5111711d3e Merge branch 'for-2.6.37' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.37' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix possible BUG_ON firing in set_change_info
  sunrpc: prevent use-after-free on clearing XPT_BUSY
2010-12-14 11:09:05 -08:00
Linus Torvalds e13cf63f2b Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: prevent RAID level downgrades when space is low
  Btrfs: account for missing devices in RAID allocation profiles
  Btrfs: EIO when we fail to read tree roots
  Btrfs: fix compiler warnings
  Btrfs: Make async snapshot ioctl more generic
  Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
  Btrfs: Fix a crash when mounting a subvolume
  Btrfs: fix sync subvol/snapshot creation
  Btrfs: Fix page leak in compressed writeback path
  Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
  Btrfs: fixup return code for btrfs_del_orphan_item
  Btrfs: do not do fast caching if we are allocating blocks for tree_root
  Btrfs: deal with space cache errors better
  Btrfs: fix use after free in O_DIRECT
2010-12-14 11:08:13 -08:00
Linus Torvalds 073f21ae13 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: verify ioctl retries
  fuse: fix ioctl when server is 32bit
2010-12-14 11:07:39 -08:00
Linus Torvalds 497b5b13c9 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: log timestamp changes to the source inode in rename
2010-12-14 11:06:17 -08:00
Linus Torvalds e97b71ded9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: fix ioctl magic
  ceph: Behave better when handling file lock replies.
  ceph: pass lock information by struct file_lock instead of as individual params.
  ceph: Handle file locks in replies from the MDS.
  ceph: avoid possible null deref in readdir after dir llseek
2010-12-14 11:02:15 -08:00
Linus Torvalds 38971ce2fa Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: Fix panic after nfs_umount()
  nfs: remove extraneous and problematic calls to nfs_clear_request
  nfs: kernel should return EPROTONOSUPPORT when not support NFSv4
  NFS: Fix fcntl F_GETLK not reporting some conflicts
  nfs: Discard ACL cache on mode update
  NFS: Readdir cleanups
  NFS: nfs_readdir_search_for_cookie() don't mark as eof if cookie not found
  NFS: Fix a memory leak in nfs_readdir
  Call the filesystem back whenever a page is removed from the page cache
  NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler
2010-12-14 08:51:12 -08:00
Linus Torvalds caa4a59574 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: remove bogus remapping of error in cifs_filldir()
  cifs: allow calling cifs_build_path_to_root on incomplete cifs_sb
  cifs: fix check of error return from is_path_accessable
  cifs: remove Local_System_Name
  cifs: fix use of CONFIG_CIFS_ACL
  cifs: add attribute cache timeout (actimeo) tunable
2010-12-14 08:49:15 -08:00
Chris Mason 83a50de97f Btrfs: prevent RAID level downgrades when space is low
The extent allocator has code that allows us to fill
allocations from any available block group, even if it doesn't
match the raid level we've requested.

This was put in because adding a new drive to a filesystem
made with the default mkfs options actually upgrades the metadata from
single spindle dup to full RAID1.

But, the code also allows us to allocate from a raid0 chunk when we
really want a raid1 or raid10 chunk.  This can cause big trouble because
mkfs creates a small (4MB) raid0 chunk for data and metadata which then
goes unused for raid1/raid10 installs.

The allocator will happily wander in and allocate from that chunk when
things get tight, which is not correct.

The fix here is to make sure that we provide duplication when the
caller has asked for it.  It does all the dups to be any raid level,
which preserves the dup->raid1 upgrade abilities.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-13 20:07:01 -05:00
Chris Mason cd02dca564 Btrfs: account for missing devices in RAID allocation profiles
When we mount in RAID degraded mode without adding a new device to
replace the failed one, we can end up using the wrong RAID flags for
allocations.

This results in strange combinations of block groups (raid1 in a raid10
filesystem) and corruptions when we try to allocate blocks from single
spindle chunks on drives that are actually missing.

The first device has two small 4MB chunks in it that mkfs creates and
these are usually unused in a raid1 or raid10 setup.  But, in -o degraded,
the allocator will fall back to these because the mask of desired raid groups
isn't correct.

The fix here is to count the missing devices as we build up the list
of devices in the system.  This count is used when picking the
raid level to make sure we continue using the same levels that were
in place before we lost a drive.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-13 20:06:52 -05:00
Chris Mason 68433b73b1 Btrfs: EIO when we fail to read tree roots
If we just get a plain IO error when we read tree roots, the code
wasn't properly sending that error up the chain.  This allowed mounts to
continue when they should failed, and allowed operations
on partially setup root structs.  The end result was usually oopsen
on spinlocks that hadn't been spun up correctly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-13 14:47:58 -05:00