Tales of ext3: lost in space

This week I explained a few colleagues how to expand logical volumes and ext3 filesystems. Nothing special, you would say.

This is pretty straightforward in Linux, do a lvresize and resize2fs and that’s about it. I was surprised when one of my colleagues came back to me with the question why the LV size was greater than the filesystem size reported by df. I was quite sure he did the lvresize but forgot to do the resize2fs, but he was sure he didn’t. I was convinced he was doing something wrong and decided to have a look my self:

# lvdisplay /dev/mapper/VolGroup01-dbexport
..cut..
  LV Size                200.00 GB
..cut..
# df -h /DB/export/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup01-dbexport
                      197G   65G  133G  33% /DB/export

Ok there is a difference of 3GB. Weird. Let’s see if we really can’t grow our filesystem anymore:

# resize2fs /dev/mapper/VolGroup01-dbexport
resize2fs 1.39 (29-May-2006)
The filesystem is already 52428800 blocks long.  Nothing to do!

Hmmm…wtf?

Ok, I had no idea where that 3GB was left. Where did my 3GB go? If it was used by the root-reservation, it wouldn’t be substracted from the filesystem size, and besides that I am root, and the reservation was also set to 0. Perhaps LV is displayed in powers of 10 and df in power of 2? Nahh that doesn’t make sense, and the difference would be bigger that just those 3GB.

Lets see what the actual filesystem size is using dumpe2fs:

# dumpe2fs /dev/mapper/VolGroup01-dbexport
dumpe2fs 1.39 (29-May-2006)
..cut..
Inode count:              26214400
Block count:              52428800
Reserved block count:     0

..cut..

Block size:               4096
Fragment size:            4096

Now let’s do the math: 52428800 * 4096 / 1024^3 = 200 GB.

So actually our filesystem is 200GB, but why does “df” display a size of 197GB?

Let’s see what df actually does:

# strace -fF df -h /DB/export
..cut..
statfs("/DB/export", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=51606190, f_.....
..cut..

It seems df executes statfs(), which seems to return f_blocks=51606190; but I would expect this to be 52428800…

To rule out statfs() really returns f_blocks=51606190 I wrote a small program:

#include <stdio.h>
#include <sys/vfs.h>

int main(int argc, char *argv[])
{
        struct statfs buf;

        statfs(argv[1], &buf);

        printf("f_type is %li\n", buf.f_type);
        printf("f_bsize is %li\n", buf.f_bsize);
        printf("f_blocks is %li\n", buf.f_blocks);
        printf("f_bfree is %li\n", buf.f_bfree);
        printf("f_bavail is %li\n", buf.f_bavail);
        printf("f_files is %li\n", buf.f_files);
        printf("f_ffree is %li\n", buf.f_ffree);
        printf("f_namelen is %li\n", buf.f_namelen);

        return 0;
}

and run it:

# ./statfs /DB/export/
f_type is 61267
f_bsize is 4096
f_blocks is 51606190
f_bfree is 34249226
f_bavail is 34249226
f_files is 26214400
f_ffree is 26213719
f_namelen is 255

Ok, so statfs indeed reports f_blocks=51606190. What does the manual say about this?

struct statfs {
   long    f_type;     /* type of filesystem (see below) */
   long    f_bsize;    /* optimal transfer block size */
   long    f_blocks;   /* total data blocks in file system */
   long    f_bfree;    /* free blocks in fs */
   long    f_bavail;   /* free blocks avail to non-superuser */
   long    f_files;    /* total file nodes in file system */
   long    f_ffree;    /* free file nodes in fs */
   fsid_t  f_fsid;     /* file system id */
   long    f_namelen;  /* maximum length of filenames */
};

f_blocks; /* total data blocks in file system */

Total data blocks? What does this mean? Could this mean all blocks – metadata blocks or something?

Let’s dig the Linux source and see how f_blocks is filled up:

In fs/statfs.c:

..cut..
static int do_statfs_native(struct kstatfs *st, struct statfs __user *p)
{
         struct statfs buf;
..cut..

                 buf.f_type = st->f_type;
                 buf.f_bsize = st->f_bsize;
                 buf.f_blocks = st->f_blocks;
                 buf.f_bfree = st->f_bfree;
                 buf.f_bavail = st->f_bavail;
..cut..

Ok, now kstatfs get’s in this case filled up by ext3_statfs:

static int ext3_statfs (struct dentry * dentry, struct kstatfs * buf)
{
  struct super_block *sb = dentry->d_sb;
  struct ext3_sb_info *sbi = EXT3_SB(sb);
  struct ext3_super_block *es = sbi->s_es;
  u64 fsid;

  if (test_opt(sb, MINIX_DF)) {
          sbi->s_overhead_last = 0;
  } else if (sbi->s_blocks_last != le32_to_cpu(es->s_blocks_count)) {
          unsigned long ngroups = sbi->s_groups_count, i;
          ext3_fsblk_t overhead = 0;
          smp_rmb();

          /*
           * Compute the overhead (FS structures).  This is constant
           * for a given filesystem unless the number of block groups
           * changes so we cache the previous value until it does.
           */

          /*
           * All of the blocks before first_data_block are
           * overhead
           */
          overhead = le32_to_cpu(es->s_first_data_block);

          /*
           * Add the overhead attributed to the superblock and
           * block group descriptors.  If the sparse superblocks
           * feature is turned on, then not all groups have this.
           */
          for (i = 0; i < ngroups; i++) {
                  overhead += ext3_bg_has_super(sb, i) +
                          ext3_bg_num_gdb(sb, i);
                  cond_resched();
          }

          /*
           * Every block group has an inode bitmap, a block
           * bitmap, and an inode table.
           */
          overhead += ngroups * (2 + sbi->s_itb_per_group);
          sbi->s_overhead_last = overhead;
          smp_wmb();
          sbi->s_blocks_last = le32_to_cpu(es->s_blocks_count);
  }

  buf->f_type = EXT3_SUPER_MAGIC;
  buf->f_bsize = sb->s_blocksize;
  buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last;
  buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter);
  buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
  if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
          buf->f_bavail = 0;
  buf->f_files = le32_to_cpu(es->s_inodes_count);
  buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);
  buf->f_namelen = EXT3_NAME_LEN;
  fsid = le64_to_cpup((void *)es->s_uuid) ^
         le64_to_cpup((void *)es->s_uuid + sizeof(u64));
  buf->f_fsid.val[0] = fsid & 0xFFFFFFFFUL;
  buf->f_fsid.val[1] = (fsid >> 32) & 0xFFFFFFFFUL;
  return 0;
}

Ok so:

buf->f_blocks = le32_to_cpu(es->s_blocks_count) – sbi->s_overhead_last;

F_blocks is s_block_count – s_overhead_last;

If you look at the above code you can see how s_overhead_last is calculated:

     /*
      * All of the blocks before first_data_block are
      * overhead
      */
     overhead = le32_to_cpu(es->s_first_data_block);

    ..cut..

     /*
      * Every block group has an inode bitmap, a block
      * bitmap, and an inode table.
      */
      overhead += ngroups * (2 + sbi->s_itb_per_group);
      sbi->s_overhead_last = overhead;

So this explains why df reports a different size than I expected. But what actually got my attention was this part in the above code:

  if (test_opt(sb, MINIX_DF)) {
          sbi->s_overhead_last = 0;

Wtf is MINIX_DF???

And WHY would the behaviour be different with the mount option MINIX_DF?
Let’s see what the manual says about this:

# man mount

..cut..
       bsddf / minixdf
              Set  the behaviour for the statfs system call. The minixdf behaviour is to return in the f_blocks field the total number of blocks of the file system, while the bsddf behaviour (which is the default)
              is to subtract the overhead blocks used by the ext2 file system and not available for file storage. Thus
..cut..

This breaks my klomp…. let’s try this out:

# df -h /DB/export
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup01-dbexport
                      197G   70G  128G  36% /DB/export


# mount -o remount,minixdf /DB/export

# df -h /DB/export
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup01-dbexport
                      200G   73G  128G  37% /DB/export
This entry was posted in Uncategorized. Bookmark the permalink.

Comments are closed.