Tales of ext3: lost in space

This week I explained to a few colleagues how to expand logical volumes and ext3 filesystems. Nothing special, you would say: in Linux this is pretty straightforward, run lvresize followed by resize2fs and that’s about it.
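
For reference, a minimal sketch of the procedure on the volume from this story (the +10G here is made up; resize2fs without an explicit size grows the filesystem to fill the device, and growing ext3 even works online on reasonably recent kernels):

# lvresize -L +10G /dev/VolGroup01/dbexport
# resize2fs /dev/mapper/VolGroup01-dbexport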

So I was surprised when one of my colleagues came back to me asking why the LV size was greater than the filesystem size reported by df. I was quite sure he had done the lvresize but forgotten the resize2fs, but he was sure he hadn’t. Convinced he was doing something wrong, I decided to have a look myself:

# lvdisplay /dev/mapper/VolGroup01-dbexport
..cut..
  LV Size                200.00 GB
..cut..
# df -h /DB/export/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup01-dbexport
                      197G   65G  133G  33% /DB/export

Ok, there is a difference of 3GB. Weird. Let’s see if we really can’t grow our filesystem anymore:

# resize2fs /dev/mapper/VolGroup01-dbexport
resize2fs 1.39 (29-May-2006)
The filesystem is already 52428800 blocks long.  Nothing to do!

Hmmm…wtf?

Ok, so where did those 3GB go? Perhaps the LV size is displayed in powers of 10 and df in powers of 2? Nahh, that doesn’t make sense, and the difference would be bigger than just those 3GB. The root reservation then? No: reserved blocks wouldn’t be subtracted from the filesystem size df reports, I am root anyway, and the reservation was set to 0.
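
(Checking that last claim is easy with tune2fs; the reserved block count below matches the dumpe2fs output further down:)

# tune2fs -l /dev/mapper/VolGroup01-dbexport | grep 'Reserved block count'
Reserved block count:     0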

Let’s see what the actual filesystem size is, using dumpe2fs:

# dumpe2fs /dev/mapper/VolGroup01-dbexport
dumpe2fs 1.39 (29-May-2006)
..cut..
Inode count:              26214400
Block count:              52428800
Reserved block count:     0

..cut..

Block size:               4096
Fragment size:            4096

Now let’s do the math: 52428800 * 4096 / 1024^3 = 200 GB.

So actually our filesystem is 200GB, but why does “df” display a size of 197GB?

Let’s see what df actually does:

# strace -fF df -h /DB/export
..cut..
statfs("/DB/export", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=51606190, f_.....
..cut..

So df does a statfs(), which returns f_blocks=51606190; but I would expect this to be 52428800…

To verify that statfs() really returns f_blocks=51606190, I wrote a small program:

#include <stdio.h>
#include <sys/vfs.h>

int main(int argc, char *argv[])
{
        struct statfs buf;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <path>\n", argv[0]);
                return 1;
        }

        /* Ask the kernel directly, just like df does. */
        if (statfs(argv[1], &buf) != 0) {
                perror("statfs");
                return 1;
        }

        printf("f_type is %li\n", (long) buf.f_type);
        printf("f_bsize is %li\n", (long) buf.f_bsize);
        printf("f_blocks is %li\n", (long) buf.f_blocks);
        printf("f_bfree is %li\n", (long) buf.f_bfree);
        printf("f_bavail is %li\n", (long) buf.f_bavail);
        printf("f_files is %li\n", (long) buf.f_files);
        printf("f_ffree is %li\n", (long) buf.f_ffree);
        printf("f_namelen is %li\n", (long) buf.f_namelen);

        return 0;
}

and ran it:

# ./statfs /DB/export/
f_type is 61267
f_bsize is 4096
f_blocks is 51606190
f_bfree is 34249226
f_bavail is 34249226
f_files is 26214400
f_ffree is 26213719
f_namelen is 255

Ok, so statfs() indeed reports f_blocks=51606190. (And f_type 61267 is 0xEF53, the ext2/ext3 superblock magic, so we are really talking to the right filesystem.) What does the manual say about f_blocks?

struct statfs {
   long    f_type;     /* type of filesystem (see below) */
   long    f_bsize;    /* optimal transfer block size */
   long    f_blocks;   /* total data blocks in file system */
   long    f_bfree;    /* free blocks in fs */
   long    f_bavail;   /* free blocks avail to non-superuser */
   long    f_files;    /* total file nodes in file system */
   long    f_ffree;    /* free file nodes in fs */
   fsid_t  f_fsid;     /* file system id */
   long    f_namelen;  /* maximum length of filenames */
};

f_blocks; /* total data blocks in file system */

Total data blocks? What does that mean? Could it be all blocks minus metadata blocks, or something like that?

Let’s dig into the Linux source and see how f_blocks gets filled:

In fs/statfs.c:

..cut..
static int do_statfs_native(struct kstatfs *st, struct statfs __user *p)
{
         struct statfs buf;
..cut..

                 buf.f_type = st->f_type;
                 buf.f_bsize = st->f_bsize;
                 buf.f_blocks = st->f_blocks;
                 buf.f_bfree = st->f_bfree;
                 buf.f_bavail = st->f_bavail;
..cut..

Ok, and in our case the kstatfs struct gets filled in by ext3_statfs:

static int ext3_statfs (struct dentry * dentry, struct kstatfs * buf)
{
  struct super_block *sb = dentry->d_sb;
  struct ext3_sb_info *sbi = EXT3_SB(sb);
  struct ext3_super_block *es = sbi->s_es;
  u64 fsid;

  if (test_opt(sb, MINIX_DF)) {
          sbi->s_overhead_last = 0;
  } else if (sbi->s_blocks_last != le32_to_cpu(es->s_blocks_count)) {
          unsigned long ngroups = sbi->s_groups_count, i;
          ext3_fsblk_t overhead = 0;
          smp_rmb();

          /*
           * Compute the overhead (FS structures).  This is constant
           * for a given filesystem unless the number of block groups
           * changes so we cache the previous value until it does.
           */

          /*
           * All of the blocks before first_data_block are
           * overhead
           */
          overhead = le32_to_cpu(es->s_first_data_block);

          /*
           * Add the overhead attributed to the superblock and
           * block group descriptors.  If the sparse superblocks
           * feature is turned on, then not all groups have this.
           */
          for (i = 0; i < ngroups; i++) {
                  overhead += ext3_bg_has_super(sb, i) +
                          ext3_bg_num_gdb(sb, i);
                  cond_resched();
          }

          /*
           * Every block group has an inode bitmap, a block
           * bitmap, and an inode table.
           */
          overhead += ngroups * (2 + sbi->s_itb_per_group);
          sbi->s_overhead_last = overhead;
          smp_wmb();
          sbi->s_blocks_last = le32_to_cpu(es->s_blocks_count);
  }

  buf->f_type = EXT3_SUPER_MAGIC;
  buf->f_bsize = sb->s_blocksize;
  buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last;
  buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter);
  buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
  if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
          buf->f_bavail = 0;
  buf->f_files = le32_to_cpu(es->s_inodes_count);
  buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);
  buf->f_namelen = EXT3_NAME_LEN;
  fsid = le64_to_cpup((void *)es->s_uuid) ^
         le64_to_cpup((void *)es->s_uuid + sizeof(u64));
  buf->f_fsid.val[0] = fsid & 0xFFFFFFFFUL;
  buf->f_fsid.val[1] = (fsid >> 32) & 0xFFFFFFFFUL;
  return 0;
}

Ok, so:

buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last;

f_blocks is s_blocks_count minus s_overhead_last.

If you look at the above code you can see how s_overhead_last is calculated:

     /*
      * All of the blocks before first_data_block are
      * overhead
      */
     overhead = le32_to_cpu(es->s_first_data_block);

    ..cut..

     /*
      * Every block group has an inode bitmap, a block
      * bitmap, and an inode table.
      */
      overhead += ngroups * (2 + sbi->s_itb_per_group);
      sbi->s_overhead_last = overhead;
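
Out of curiosity, we can reconstruct s_overhead_last for this filesystem by hand. Assuming the mkfs defaults of the day (32768 blocks per group, 128-byte inodes, 32-byte group descriptors, sparse superblocks), the numbers work out exactly:

    ngroups            = 52428800 / 32768                = 1600
    inode table blocks = (26214400 / 1600) * 128 / 4096  = 512 per group
    bitmaps + itables  = 1600 * (2 + 512)                = 822400 blocks
    superblocks + gdb  = 15 * (1 + 13)                   = 210 blocks
    total overhead     = 822400 + 210                    = 822610 blocks

With sparse superblocks only groups 0, 1 and the powers of 3, 5 and 7 (15 groups below 1600) carry a superblock copy plus the 13 group descriptor blocks (1600 descriptors * 32 bytes, rounded up to whole 4096-byte blocks). And indeed: 52428800 - 822610 = 51606190, exactly the f_blocks that statfs() returned, and 822610 * 4096 / 1024^3 ≈ 3.1GB is our “missing” 3GB.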

So this explains why df reports a different size than I expected. But what actually got my attention was this part of the code above:

  if (test_opt(sb, MINIX_DF)) {
          sbi->s_overhead_last = 0;

Wtf is MINIX_DF???

And WHY would the behaviour be different with the mount option MINIX_DF?
Let’s see what the manual says about this:

# man mount

..cut..
       bsddf / minixdf
              Set  the behaviour for the statfs system call. The minixdf behaviour is to return in the f_blocks field the total number of blocks of the file system, while the bsddf behaviour (which is the default)
              is to subtract the overhead blocks used by the ext2 file system and not available for file storage. Thus
..cut..

Well, that beats everything… let’s try this out:

# df -h /DB/export
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup01-dbexport
                      197G   70G  128G  36% /DB/export


# mount -o remount,minixdf /DB/export

# df -h /DB/export
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup01-dbexport
                      200G   73G  128G  37% /DB/export

AIX getcwd() weirdness

When I came into the office, my colleague immediately confronted me with a weird issue: all of our applications had stopped working due to an issue with an NFS server. Almost all commands and utilities hung and spewed NFS timeouts after a while.

The simplest explanation would be a dependency on the NFS share, but I knew this couldn’t be the case, since the NFS share is only used to transfer some data, and not by any application. At least, it shouldn’t be.

It was quite simple to reproduce this issue on a test machine: I mounted a test NFS share and firewalled it afterwards to simulate a failed NFS server. Examples of commands that then hang: “which”, “pwd”, “enq”, etc.
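
Roughly, the reproduction looked like this (the hostname and paths are made up, and I’m assuming a Linux box as the NFS server; the share is mounted directly in / so that it gets scanned, just like /DDR). On the AIX test client:

# mount nfsbox:/export /deadnfs

Then, on the Linux NFS server, drop the NFS traffic so the mount goes dead:

# iptables -A INPUT -p tcp --dport 2049 -j DROP
# iptables -A INPUT -p udp --dport 2049 -j DROP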

Time to analyze what’s going on:

$ truss -f pwd

... cut ...

156966: getdirent64(3, 0x20002758, 4096) = 3072
156966: statx("./../../.", 0x2FF210B8, 176, 021) = 0
156966: statx("./../../..", 0x2FF210B8, 176, 021) = 0
156966: statx("./../../home", 0x2FF210B8, 176, 021) = 0
156966: statx("./../../fci", 0x2FF210B8, 176, 021) = 0
156966: statx("./../../lib", 0x2FF210B8, 176, 021) = 0
156966: statx("./../../libmgr", 0x2FF210B8, 176, 021) = 0
156966: statx("./../../opt", 0x2FF210B8, 176, 021) = 0
156966: statx("./../../DBDEV", 0x2FF210B8, 176, 021) = 0
156966: statx("./../../tmp", 0x2FF210B8, 176, 021) = 0
156966: statx("./../../usr", 0x2FF210B8, 176, 021) = 0
156966: statx("./../../var", 0x2FF210B8, 176, 021) = 0
156966: statx("./../../dev", 0x2FF210B8, 176, 021) = 0

... cut ...

statx("./../../DDR", 0x2FF210B8, 176, 021) = 0


Now, I know there is a lot of crap in / that really shouldn’t be there… but this is how the company has set up its systems in the past.

So what’s going on here? Commands like pwd, which, enq, etc. execute a statx() on almost every entry in the root filesystem. Since /DDR is a dead NFS mount, the statx() on /DDR blocks indefinitely.

But why would these commands statx() almost every entry in the root filesystem? It doesn’t make sense: it introduces a performance penalty, and a dead NFS mount in the root filesystem even prevents applications from working at all.

Some further investigation, with the dead NFS mount unmounted, showed that the statx() calls stop once the entry matching the current path is reached. To illustrate this, below is the output of “truss -t statx pwd” executed from both /var and /usr.

$ cd /var
$ truss -t statx pwd
statx(0xF01E7D30, 0x2FF20EB0, 76, 0) = 0
statx(0xF01E7D34, 0x2FF21018, 176, 020) = 0
statx("./", 0x2FF21018, 176, 020) = 0
statx("./../", 0x2FF20EB0, 76, 0) = 0
statx("./../.", 0x2FF210C8, 176, 021) = 0
statx("./../..", 0x2FF210C8, 176, 021) = 0
statx("./../home", 0x2FF210C8, 176, 021) = 0
statx("./../fci", 0x2FF210C8, 176, 021) = 0
statx("./../lib", 0x2FF210C8, 176, 021) = 0
statx("./../libmgr", 0x2FF210C8, 176, 021) = 0
statx("./../opt", 0x2FF210C8, 176, 021) = 0
statx("./../DBDEV", 0x2FF210C8, 176, 021) = 0
statx("./../tmp", 0x2FF210C8, 176, 021) = 0
statx("./../usr", 0x2FF210C8, 176, 021) = 0
statx("./../var", 0x2FF210C8, 176, 021) = 0
/var


$ cd /usr
$ truss -t statx pwd
statx(0xF01E7D30, 0x2FF20EB0, 76, 0) = 0
statx(0xF01E7D34, 0x2FF21018, 176, 020) = 0
statx("./", 0x2FF21018, 176, 020) = 0
statx("./../", 0x2FF20EB0, 76, 0) = 0
statx("./../.", 0x2FF210C8, 176, 021) = 0
statx("./../..", 0x2FF210C8, 176, 021) = 0
statx("./../home", 0x2FF210C8, 176, 021) = 0
statx("./../fci", 0x2FF210C8, 176, 021) = 0
statx("./../lib", 0x2FF210C8, 176, 021) = 0
statx("./../libmgr", 0x2FF210C8, 176, 021) = 0
statx("./../opt", 0x2FF210C8, 176, 021) = 0
statx("./../DBDEV", 0x2FF210C8, 176, 021) = 0
statx("./../tmp", 0x2FF210C8, 176, 021) = 0
statx("./../usr", 0x2FF210C8, 176, 021) = 0
/usr

The environment I used was a fairly stripped-down one, and people in #aix on freenode were able to reproduce this behaviour on AIX 5.1, 5.2 and 5.3 systems.

Since this looked like a bug, I opened a support case with IBM. The next day IBM replied with APAR IZ79889, which states:

The performance of getwd is degraded because 
it performs a statx on all directories up to /
in order to find the pathname. If crossing a 
mount point, it does a statx on all directories 
in order to find the inode number that matches 
the mount point.

So getwd()/getcwd() performs a statx() on all directories in order to find the pathname and, when crossing a mount point, the inode number that matches the mount point.
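
The classic getwd() algorithm is easy to sketch. Below is a minimal, illustrative re-implementation in C (my own sketch of the textbook algorithm, not AIX’s actual libc code): it walks up through “..”, and resolves each path component by stat()ing the parent’s entries until the (device, inode) pair matches.

#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

int main(void)
{
        char prefix[4096] = ".";     /* ".", then "./..", "./../..", ... */
        char path[8192], tmp[8192];
        char cwd[4096] = "";
        struct stat cur, up, st;
        struct dirent *e;
        DIR *d;

        if (stat(prefix, &cur) != 0)
                return 1;

        for (;;) {
                snprintf(path, sizeof(path), "%s/..", prefix);
                if (stat(path, &up) != 0)
                        return 1;

                /* At the root, ".." points back at the same directory. */
                if (up.st_dev == cur.st_dev && up.st_ino == cur.st_ino)
                        break;

                /*
                 * Find our name in the parent by stat()ing its entries
                 * until the (device, inode) pair matches. When a mount
                 * point is crossed, the inode number in the dirent is
                 * useless, so entries have to be stat()ed one by one;
                 * this is the statx() storm in the truss output, and a
                 * stat() of a dead NFS mount like /DDR blocks forever.
                 */
                if ((d = opendir(path)) == NULL)
                        return 1;
                tmp[0] = '\0';
                while ((e = readdir(d)) != NULL) {
                        char entry[8192];

                        snprintf(entry, sizeof(entry), "%s/%s",
                                 path, e->d_name);
                        if (stat(entry, &st) == 0 &&
                            st.st_dev == cur.st_dev &&
                            st.st_ino == cur.st_ino) {
                                /* Prepend the component we resolved. */
                                snprintf(tmp, sizeof(tmp), "/%s%s",
                                         e->d_name, cwd);
                                break;
                        }
                }
                closedir(d);
                if (tmp[0] == '\0')
                        return 1;
                snprintf(cwd, sizeof(cwd), "%s", tmp);

                /* Move one level up and repeat. */
                cur = up;
                strncat(prefix, "/..", sizeof(prefix) - strlen(prefix) - 1);
        }

        printf("%s\n", cwd[0] ? cwd : "/");
        return 0;
}

Linux, for comparison, implements getcwd() as a system call, so none of this directory scanning is needed there.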

Needless to say, this is very, very bad 😉

IBM fixed this in later releases.


Fun with DTrace in destructive mode

Most people use DTrace to see what their system is actually doing: which functions or system calls are being executed, what a process is doing, where most time is spent, etc. The possibilities are endless, and DTrace is such an awesome tool that I just can’t stop talking about it.

Not everyone may know that DTrace can not only trace your system and applications, but can also change memory on the fly. I knew this from the beginning, but had never actually taken the time to dig into it, until now.

While reading DTrace’s documentation on copyout and copyoutstr, it immediately became clear to me that this means even more fun, extending the already endless possibilities.

Imagine the following example:

    open64("/etc/vfstab", O_RDONLY);

In this example the open64 system call takes two arguments, seen by DTrace as arg0 and arg1: arg0 is a const char * pointing to the file name, and arg1 contains the flags that specify the access mode, here read-only.

We can easily trace all processes that call open64 and print arg0 and arg1:

    syscall::open64:entry
    {
        printf("%s: %s %i", execname, copyinstr(arg0), arg1);
    }
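
Saved in a file (say open64.d, a name I just made up), this can be run with:

    # dtrace -s open64.d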

which will show something like:

    dtrace: description 'syscall::open64:entry ' matched 1 probe
    CPU ID    FUNCTION:NAME
      1 16560 open64:entry cron: /system/contract/process/template 2
      1 16560 open64:entry cron: /var/adm/lastlog 258
      1 16560 open64:entry cron: /dev/null 0
      1 16560 open64:entry cron: /tmp/croutUSLV2gaKT 1281
      3 16560 open64:entry cron: /system/contract/process/latest 0
      3 16560 open64:entry cron: /system/contract/all/117346/ctl 1
      0 16560 open64:entry cron: /system/contract/process/latest 0
      0 16560 open64:entry cron: /system/contract/all/117347/ctl 1
      2 16560 open64:entry cat: /etc/vfstab 0
      0 16560 open64:entry cron: /system/contract/process/template 2
      0 16560 open64:entry cron: /system/contract/process/latest 0

There is a cat process showing up with /etc/vfstab as arg0, and 0 (O_RDONLY) as arg1. Wouldn’t it be awesome if we could change arg0 into something different, like /etc/hosts, on the fly?

It’s quite simple to do this with DTrace. We only need to run DTrace in destructive mode (be careful!!) and use copyout to rewrite arg0:

    #pragma D option destructive

    syscall::open64:entry
    /execname == "cat" && copyinstr(arg0) == "/etc/vfstab"/
    {
        this->f="/etc/hosts";
        copyout(this->f, arg0, sizeof(this->f) + 1);
    }

It’s important to keep in mind that the string “/etc/hosts” is shorter than “/etc/vfstab”, so it fits in the buffer the application allocated for the original path. Writing data that doesn’t fit in the application’s buffer is dangerous!
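
Running it (the destructive pragma in the script makes dtrace’s -w flag unnecessary; rewrite.d is just the name I saved it under):

    # dtrace -s rewrite.d

A cat /etc/vfstab from another shell now happily opens and prints /etc/hosts, while believing it read /etc/vfstab.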
