Wednesday, September 28, 2011

Performance tuning Linux NFS for VMware ESXi

Recently, I set up RHEL 5.4 as an NFS server connected over a 1Gbit LAN to an ESXi 4.1 host. With a 2.8TB datastore on a hardware SAS controller, the average I/O speed "out of the box" did not exceed 15MB/s! Below are the changes I made to raise average I/O speeds to 54MB/s, with a maximum read speed of 95MB/s.

ESXi Setup
HP c7000 Blade Chassis
ESXi 4.1
HP BL460c
2x146GB SAS RAID 1
22GB PC3 RAM
GbE2 network switch

The NFS server is connected to the same physical switch as the VMkernel port
MTU is set to the standard 1500 bytes

NFS Server Hardware
HP DL180
4GB RAM
1x Intel Quad Core processor
2x 1 Gigabit NICs

MSA Storage Configuration
HP MSA 70 storage with 25x146GB SAS 10K connected to P800 SAS controller (3Gb/s) with 512MB cache

2x 146GB SAS disks in RAID 1, used for the operating system and swap
22x 146GB SAS disks in RAID 5 with a stripe size of 64KB, used to serve NFS data
1x 146GB SAS disk as a hot spare

# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      128G   11G  111G   9% /
/dev/cciss/c0d0p1      99M   15M   79M  16% /boot
tmpfs                 2.0G     0  2.0G   0% /dev/shm
/dev/cciss/c0d1       2.8T  2.5T  158G  95% /mnt/nfs

Using the nfsstat command, we saw 80% reads and 20% writes on /mnt/nfs, so RAID 5 was chosen as a trade-off between capacity and performance. Had the writes outnumbered the reads, RAID 10 would have been more appropriate.

Operating System
32-bit Red Hat Enterprise Linux 5.4

Preliminary Checks
The RAID card and NIC must not be on shared interrupts.
To check for shared interrupts:

cat /proc/interrupts
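
For example, assuming the NFS traffic goes over eth0 and the Smart Array driver is cciss, a quick filter of the interrupt table looks like this:

grep -E 'eth0|cciss' /proc/interrupts

Each device should appear on its own IRQ line rather than sharing one with another device.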

Stop all unnecessary services as shown in the example below.

chkconfig --list
chkconfig --level 3 xfs off
chkconfig --level 3 sendmail off
chkconfig --level 3 gpm off

Configuring the HP P800 SAS Controller Cache Settings

Download the hpacucli utility from the HP website (search for "HP Array Configuration Utility CLI for Linux").

Obtain the slot number of the P800 controller and set the cache ratio to 75% read and 25% write.

hpacucli ctrl all show config detail
hpacucli ctrl slot=2 modify cacheratio=75/25
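
To confirm the new ratio, the controller details can be filtered (slot=2 is from this setup; adjust to your slot number):

hpacucli ctrl slot=2 show | grep -i "cache ratio"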


Configuring the Linux I/O Scheduler
The RAID is hardware based, so there is little benefit in having the kernel reorder I/O requests; the controller's own cache and queue handle that. Changing this setting alone from the default [cfq] to [noop] gave a marked I/O improvement.

echo "noop" > /sys/block/cciss\!c0d0/queue/scheduler
echo "noop" > /sys/block/cciss\!c0d1/queue/scheduler

Configuring Disk Read-Ahead
Set the read-ahead to 512KB (1024 sectors of 512 bytes each)

/sbin/blockdev --setra 1024 /dev/cciss/c0d0
/sbin/blockdev --setra 1024 /dev/cciss/c0d1
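
The current value can be verified with --getra, which reports the read-ahead in 512-byte sectors:

/sbin/blockdev --getra /dev/cciss/c0d1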


Configuring the ext3 filesystem
For faster writes to RAID 5, use the ext3 data=journal mount option, and use noatime to prevent updates to file access times, which would otherwise generate additional writes to disk.

/etc/fstab
/dev/cciss/c0d1 /mnt/nfs ext3 defaults,noatime,data=journal 0 0
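
Switching a mounted filesystem to data=journal requires a full unmount and remount. A minimal sequence, assuming no clients are using the export at the time:

service nfs stop
umount /mnt/nfs
mount /mnt/nfs
service nfs start

Afterwards, "mount | grep /mnt/nfs" should list noatime and data=journal among the active options.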

Configuring NFS Exports
Configure the NFS server to commit writes without delay (no_wdelay) and let the RAID controller cache do the coalescing.
Although using "async" might produce better results, for data integrity I prefer the "sync" option.

/etc/exports
/mnt/nfs *(rw,insecure,all_squash,sync,no_wdelay)
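
After editing /etc/exports, re-export the filesystems and verify the active options:

exportfs -ra
exportfs -v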

Configuring TCP/IP
/etc/sysctl.conf

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65535 16777216
net.core.netdev_max_backlog = 30000
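
Reload the settings and double-check one of the values:

sysctl -p
sysctl net.ipv4.tcp_rmem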

Configuring the Network Interface Card
/sbin/ifconfig eth0 txqueuelen 10000
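
The txqueuelen and read-ahead settings are not persistent either; one simple approach on RHEL 5 is to reapply them from /etc/rc.local at boot (a sketch using the device names from this setup):

/sbin/blockdev --setra 1024 /dev/cciss/c0d0
/sbin/blockdev --setra 1024 /dev/cciss/c0d1
/sbin/ifconfig eth0 txqueuelen 10000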


Configuring the Number of NFS Server Threads
/etc/sysconfig/nfs
RPCNFSDCOUNT=128


The thread count can also be changed at runtime:
echo 128 > /proc/fs/nfsd/threads
ps ax | grep nfs

Reducing CPU I/O Wait Times
High I/O wait is not necessarily a bad thing, since CPUs have become much faster than disk subsystems. Nevertheless, we will tell Linux to flush dirty pages as quickly as possible so that each batch of writes is kept small.

In the Linux kernel, dirty page writeback frequency is controlled by two parameters: vm.dirty_ratio and vm.dirty_background_ratio. Both are expressed as a percentage of memory (free plus reclaimable memory). The first controls when a process will itself start writing out dirty data; the second controls when the kernel thread [pdflush] is woken up to start writing out global dirty data.

dirty_background_ratio should always be less than dirty_ratio; if dirty_background_ratio >= dirty_ratio, the kernel automatically sets it to dirty_ratio / 2.

Newer kernels also have dirty_background_bytes and dirty_bytes, which can be used to define the limits in bytes instead of as a percentage. In this scenario I used vm.dirty_ratio and vm.dirty_background_ratio.

The effect on disk utilisation and I/O wait can be monitored with:

iostat -x /dev/cciss/c0d1
vmstat 1 10

/etc/sysctl.conf
vm.dirty_background_ratio = 1
vm.dirty_expire_centisecs = 1000
vm.dirty_ratio = 10
vm.dirty_writeback_centisecs = 100
vm.vfs_cache_pressure = 40
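
With 4GB of RAM, dirty_background_ratio = 1 means background writeback starts at roughly 40MB of dirty data, and dirty_ratio = 10 throttles writing processes at roughly 400MB. After reloading with sysctl -p, the amount of dirty data waiting to be flushed can be watched in /proc/meminfo:

sysctl -p
grep -E 'Dirty|Writeback' /proc/meminfo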

Verifying NFS Performance
cat /proc/net/rpc/nfsd

rc 0 58708094 377082036
fh 0 0 0 0 0
io 4191757345 576031420
th 128 931051 15634.270 7895.289 27271.733 3240.960 229.407 127.945 68.418 85.028 55.789 1935.526
ra 32 376818531 0 0 0 0 0 0 0 0 0 809
net 435790183 0 435790174 3325
rpc 435788524 0 0 0 0
proc2 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
proc3 22 0 142652 256 106273 8674 0 376819530 58706883 312 118 0 0 303 124 76 0 0 1241 2892 0 0 0
proc4ops 40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

In the above, we are interested in the line that starts with "th"
128 = the number of NFS server threads
931051 = the number of times all 128 threads were busy at once. In this case, the system had been running for a period of 3 weeks.

The remaining ten numbers form a histogram showing, in seconds, how long a given percentage of the threads was busy, in 10% steps.
eg.
15634.270 (up to 10% of the threads were busy)
27271.733 (up to 30% of the threads were busy)

The last three numbers, covering the 80%, 90% and 100% buckets, should be low. If they are high, the system needs additional NFS threads to cope with the load.


Testing a Windows 2003 R2 Guest VM
We used SQLIOStress from Microsoft, a utility designed to thrash disks intensively before SQL Server is deployed in a production environment. The tool also reports any I/O errors found during the tests.

Create a test file larger than the hardware controller cache; in this case we used 1GB.

Checking Network Usage

iptraf
ifconfig eth0
ethtool eth0
netstat -s
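
Things to look for with these tools (eth0 being the interface that serves the NFS clients): the link should have negotiated 1000Mb/s full duplex, the interface should show no errors or drops, and TCP retransmissions should stay low. For example:

ethtool eth0 | grep -E 'Speed|Duplex'
ifconfig eth0 | grep -E 'errors|dropped'
netstat -s | grep -i retrans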