Tech Info 154: 10 Gb Ethernet tuning in VMware ESX/Mac/Linux environments

HELIOS Tech Info #154

Fri, 10 Jan 2014

10 Gb Ethernet tuning in VMware ESX/Mac/Linux environments

Summary

This Tech Info gives tuning advice for 10 Gb Ethernet environments to enable optimum performance. By default, 10 Gb Ethernet already works well for most usage cases. However, when the fastest end-to-end file server performance is needed, additional tuning with a proper setup and optimized parameters can boost performance considerably, almost reaching the 10 Gb limit on the server. Today's clients and servers usually use 1 Gb Ethernet, which allows data transfers of about 100 MB/s. Since every client can easily saturate 1 Gb Ethernet, 10 Gb Ethernet is needed to increase performance. At minimum, servers should be deployed with 10 Gb Ethernet to the switch to serve the majority of 1 Gb clients. For workstations, 10 Gb Ethernet makes sense when file server transfer rates of more than 50 MB/s are needed, e.g. 350 MB/s for uncompressed HD video editing directly from a HELIOS server volume.

The following table shows the maximum throughput of different Ethernet technologies:

Network             Max. throughput   PC clients handling this
10 Mbit Ethernet    1 MB/s            Every PC in the last 20 years
100 Mbit Ethernet   10 MB/s           Every PC in the last 15 years
1 Gbit Ethernet     100 MB/s          500 MHz PCs in the last 10 years
10 Gbit Ethernet    1000 MB/s         Multicore PCs since 2010

 
This table clearly shows that today's PCs can easily utilize Ethernet networks with hundreds of megabytes per second. For medium and larger networks, 10 Gb Ethernet is required to keep up with the client performance.

This Tech Info applies to usage cases with a 10 Gb Ethernet server and 10 Gb Ethernet clients, e.g.:

  • Full HD video editing directly from the server (uncompressed video)
  • Using very large files, e.g. VM images directly from the server
  • Copying very large files between the server and workstation
  • Running backups between workstation and server
  • High-performance computing which requires accessing a large amount of shared files
  • Server to server file synchronization or backups

We expect that this Tech Info provides good advice for many 10 Gb Ethernet environments. We have focused on testing VMware ESX/Mac/Linux environments. The various settings can in general also be applied to other setups.

Please note: This tuning should be applied only in 10 Gb Ethernet environments; 1 Gb Ethernet works fine out of the box. Wrong buffer or packet sizes can degrade performance or introduce incompatibilities.

We look forward to seeing more 10 Gb Ethernet networks.

Table of Contents

  1. Equipment Used for Testing »
  2. Jumbo Frames »
  3. AFP Server DSI Block Size »
  4. Server and Client Network TCP Buffer & Tuning »
  5. References »

Equipment Used for Testing


  • Client
    • Mac Pro, 2x 2.8 GHz quad-core Intel Xeon CPUs
    • 2 GB memory
    • Small Tree 10 GbE card (Intel 8259x), with the then-current driver v. 2.20
    • 10 GbE network card is available in PCI slot 2 and known to the OS as "en3"
  • Server
    • IBM X3650 with VMware ESXi 5
    • QLogic 10 GbE
    • IBM built-in PCI RAID-5 storage
  • VM server setup
    • Virtualized HELIOS Virtual Server Appliance (VSA) with 4 CPUs
    • 2 GB memory
    • 10 GbE network card "vmxnet3" is available in slot 0 and known to the OS as "eth0"
  • Test utility
    • HELIOS LanTest network performance and reliability testing utility

Jumbo Frames

  • Benefit/drawback
    Jumbo frames are Ethernet frames with more than 1500 bytes of payload.

    In this Tech Info we will use the phrase "jumbo frame" for Ethernet frames of 9000 bytes payload.

    The benefit of using jumbo frames is usually less CPU usage, less processing overhead, and the potential of higher network throughput.

    There are no real drawbacks to using jumbo frames, but careful configuration is required.

    In order to use jumbo frames, not only the server and client, but also all the network entities in between, must support jumbo frames of the same size. This includes routers and switches, as well as virtual machine logical switches.
  • Configuration of Mac, Linux, VMware
    Before reconfiguring any of the client or server network interfaces, make sure that basic TCP connectivity is working. See the "Basic TCP connectivity test" section below for more details.
  • Mac configuration
    Setting network card specific parameters can be done via the "System Preferences":

         System Preferences > Network > <select card | e.g.: PCI Ethernet Slot 2> > Advanced > Hardware

         There you can change several parameter settings. The values should be similar to the ones below:

      Configure   : Manually
      Speed       : 10 GbaseSR
      Duplex      : full-duplex,flow-control
      MTU         : Custom
      enter value : 9000


    Confirm with OK/Apply

    Switch back once to verify that the settings were applied.

    If the default settings reappear, you need to configure them using "ifconfig".

    Another option is to set network card specific options in a Terminal session with "ifconfig".

    This must be done as the superuser:

      # sudo ifconfig en3 mtu 9000

    Please note: "ifconfig" will set the "mtu" only temporarily and after a boot this setting will be gone.

    You can set this permanently by adding the above "ifconfig" sequence to one of the boot scripts.

    For example, you can add the line "ifconfig en3 mtu 9000" at the end of "/etc/rc.common".

    You can use an editor like "vi" or "pico" to edit the file.
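
    For example, as the superuser, the line can be appended from a Terminal session like this (a sketch; "en3" is the 10 GbE interface from our test setup):

      # echo 'ifconfig en3 mtu 9000' >> /etc/rc.common
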
  • Linux configuration
    On Linux you can set network card specific options in a "Terminal" session with "ifconfig". This must be done as the super user:

      # ifconfig eth0 mtu 9000

    Please note: "ifconfig" will set the "mtu" only temporarily and after a boot this setting will be gone.

    You can set this permanently by adding the above "ifconfig" sequence to one of the boot scripts.

    On the VSA you can also add the "mtu" change to the "iface" block of the "eth0" interface in the network configuration file "/etc/network/interfaces".

    The block would look similar to this one:

        iface eth0 inet static
        address 192.168.2.1
        netmask 255.255.255.0
        mtu 9000

    You can use an editor like "vi" or "pico" to edit the file.
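
    After editing "/etc/network/interfaces", the new MTU can be applied without a reboot by bouncing the interface (a sketch using the standard Debian "ifupdown" and "iproute" tools; run this from the VM console, not over an SSH session on "eth0"):

      # ifdown eth0 && ifup eth0
      # ip link show eth0        # the output should now include "mtu 9000"
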
  • VMware configuration
    By default, an Intel E1000 card driver is installed for a Linux VM.

    We also tested the VMware "vmxnet3" driver and found it to perform slightly better.

    In order to get this driver, you have to install and configure the VMware Tools in the Linux VM first.
  • Configure the "vmxnet3" driver in vSphere
    In the "vSphere" client, choose the ESXi 5 host on which the virtual machine is running. Select the virtual machine, and in tab "Summary" choose "Edit Settings".

    In the "Hardware" tab choose the button "Add", select "Ethernet Adapter" and continue with "Next>". As "Adapter Type" choose  "VMXNET 3" and select your "Network Connection" you want to connect this adapter to.

    Then continue with Next >, verify the displayed settings and Finish.
  • Setting MTU to 9000
    In the "vSphere" client, choose the ESXi 5 host on which the virtual machine is running. Then choose the tab "Configuration" and from the "Hardware" list box select the "Networking" item.

    Identify the "vSwitch" connected to the 10GbE network and open its "Properties" configuration. Then edit the "vSwitch" configuration "Advanced Properties"/"MTU" and change its value to 9000.
  • Basic TCP connectivity test
    From both client and server, verify that the other machine is reachable via "ping":

      client # ping 192.168.2.1
      PING 192.168.2.1 (192.168.2.1): 56 data bytes
      64 bytes from 192.168.2.1: icmp_seq=0 ttl=64 time=0.354 ms
      64 bytes from 192.168.2.1: icmp_seq=1 ttl=64 time=0.162 ms
      ..

      server # ping 192.168.2.2
      PING 192.168.2.2 (192.168.2.2): 56 data bytes
      64 bytes from 192.168.2.2: icmp_seq=0 ttl=64 time=0.358 ms
      64 bytes from 192.168.2.2: icmp_seq=1 ttl=64 time=0.232 ms
      ..
  • Jumbo packet TCP connectivity test
    Once basic TCP connectivity is working, and the MTU has been set to 9000 on the server, the client and all network entities in between, also verify that packets with a payload of 9000 bytes can be exchanged.

    For this, call "ping" with a packet size value, and optionally a packet count, like this:

      client # ping -c 2 -s 9000 192.168.2.1
      PING 192.168.2.1 (192.168.2.1): 9000 data bytes
      9008 bytes from 192.168.2.1: icmp_seq=0 ttl=64 time=0.340 ms
      9008 bytes from 192.168.2.1: icmp_seq=1 ttl=64 time=0.292 ms

      --- 192.168.2.1 ping statistics ---
      2 packets transmitted, 2 packets received, 0.0% packet loss
      round-trip min/avg/max/stddev = 0.292/0.316/0.340/0.024 ms

      server # ping -c 2 -s 9000 192.168.2.2
      PING 192.168.2.2 (192.168.2.2) 9000(9028) bytes of data.
      9008 bytes from 192.168.2.2: icmp_req=1 ttl=64 time=0.240 ms
      9008 bytes from 192.168.2.2: icmp_req=2 ttl=64 time=0.244 ms

      --- 192.168.2.2 ping statistics ---
      2 packets transmitted, 2 received, 0% packet loss, time 999ms
      rtt min/avg/max/mdev = 0.240/0.242/0.244/0.002 ms
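
    Note that "ping" by default allows IP fragmentation, so the test above can succeed even if some entity in between still uses an MTU of 1500. To rule this out, the "don't fragment" option can additionally be used; the payload then has to leave room for the 20-byte IP and 8-byte ICMP headers (9000 - 28 = 8972 bytes). A sketch ("-D" on Mac OS X, "-M do" on Linux):

      client # ping -c 2 -D -s 8972 192.168.2.1
      server # ping -c 2 -M do -s 8972 192.168.2.2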

AFP Server DSI Block Size

  • What is the DSI block size?
    The so-called DSI block size, or "request quantum", is exchanged by AFP client and server when a new AFP session is established.

    It specifies the maximum amount of data to be processed with one single DSI command.

    By default, EtherShare offers a value of 128 KB. If necessary, this value can be increased.
  • Why increase its value?
    Increasing it makes sense when applications read or write larger amounts of data with a single call.

    If server and clients are sufficiently fast, and the TCP connection is also fast and reliable, increasing the "dsiblocksize" can result in a much higher throughput.
  • Adjust DSI block size preference for EtherShare
    This is done via the HELIOS "prefvalue" command.

    For example, this command sets the "dsiblocksize" preference to 1 MB (1024*1024 bytes):

      # prefvalue -k Programs/afpsrv/dsiblocksize -t int 1048576

    Thereafter a restart of the HELIOS "afpsrv" is required:

      # srvutil stop  -f afpsrv
      # srvutil start -f afpsrv
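
    To check the stored preference afterwards, "prefvalue" can be called with just the key; it should then print the value set above (a sketch; please refer to the HELIOS Base manual for the exact "prefvalue" syntax):

      # prefvalue -k Programs/afpsrv/dsiblocksize
      1048576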

Server and Client Network TCP Buffer & Tuning

  • Client
    On Mac OS X, kernel parameters of interest are:

      Name                   | Default | Max     | Tuned
      =======================|=========|=========|=========
      kern.ipc.maxsockbuf    | 4194304 | 4194304 | 4194304
      net.inet.tcp.recvspace |  131072 | 3727360 | 1000000
      net.inet.tcp.sendspace |  131072 | 3727360 | 3000000

    We conducted our tests with the values from the "Tuned" column.

    Current values can be displayed with "sysctl", e.g.:

      # sysctl net.inet.tcp.recvspace
      net.inet.tcp.recvspace: 131072

    You can also list multiple parameters with one single call to "sysctl":

      # sysctl kern.ipc.maxsockbuf net.inet.tcp.recvspace net.inet.tcp.sendspace
      kern.ipc.maxsockbuf: 4194304
      net.inet.tcp.recvspace: 131072
      net.inet.tcp.sendspace: 131072

    You can set new values with option "-w", e.g.:

      # sysctl -w net.inet.tcp.recvspace=1000000
      net.inet.tcp.recvspace: 1000000

    Please note: These changes are only temporary and will fall back to the defaults after the next boot.

    In order to make these changes permanent, you either have to set them after each boot process with "sysctl -w", or specify them in the  configuration file "/etc/sysctl.conf".

    By default, this file does not exist on Mac OS X 10.8/10.9.

    You can create it with an editor like "pico" or "vi".

    Then enter the parameter=value tuples like this:

      kern.ipc.maxsockbuf=4194304
      net.inet.tcp.recvspace=1000000
      net.inet.tcp.sendspace=3000000

    Please note: Do NOT prefix the parameter=value entries with the "sysctl -w" command.
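
    For immediate effect without a reboot, the tuned values from the table above can also be applied directly with "sysctl -w", e.g. (a sketch; run as the superuser, and remember that these settings do not persist across reboots):

      # sysctl -w kern.ipc.maxsockbuf=4194304
      # sysctl -w net.inet.tcp.recvspace=1000000
      # sysctl -w net.inet.tcp.sendspace=3000000
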
  • Server
    On Debian Linux, kernel parameters of interest are:

      Name              | Default              | Tuned
      ==================|======================|=====================
      net.core.wmem_max | 131071               | 12582912
      net.core.rmem_max | 131071               | 12582912
      net.ipv4.tcp_rmem | 4096 87380 4194304   | 4096 87380 12582912
      net.ipv4.tcp_wmem | 4096 16384 4194304   | 4096 16384 12582912

    For "net.ipv4.tcp_rmem" and "net.ipv4.tcp_wmem" the three values are the minimum, default and maximum buffer size.

    Current values can be displayed with "sysctl", e.g.:

      # sysctl net.core.wmem_max
      net.core.wmem_max: 131071

    You can also list multiple parameters with one single call to "sysctl":

      # sysctl net.core.wmem_max net.core.rmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem
      net.core.wmem_max = 131071
      net.core.rmem_max = 131071
      net.ipv4.tcp_rmem = 4096    87380    4194304
      net.ipv4.tcp_wmem = 4096    16384    4194304

    You can set new values with the option "-w", e.g.:

      # sysctl -w net.core.wmem_max=12582912
      net.core.wmem_max: 12582912

      # sysctl -w 'net.ipv4.tcp_rmem=4096 87380 12582912'
      net.ipv4.tcp_rmem = 4096 87380 12582912

    Don't forget the single quotes around the 'parameter=value' pair.

    These changes are only temporary and will fall back to the defaults after the next boot.

    In order to make these changes permanent, you either have to set them after each boot process with "sysctl", or specify them in the configuration file "/etc/sysctl.conf".

    By default, this file already exists on Debian Linux 6.0.6.

    You can edit it with an editor like "pico" or "vi".

    Then enter the parameter=value tuples like this:

      net.core.wmem_max = 12582912
      net.core.rmem_max = 12582912
      net.ipv4.tcp_rmem = 4096    87380    12582912
      net.ipv4.tcp_wmem = 4096    16384    12582912

    Please note: Do NOT prefix the parameter=value entries with the "sysctl -w" command.
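
    After editing "/etc/sysctl.conf", the new values can also be loaded without a reboot; "sysctl -p" reads "/etc/sysctl.conf" by default and prints the settings it applies, which should include the entries added above (a short sketch):

      # sysctl -p
      net.core.wmem_max = 12582912
      net.core.rmem_max = 12582912
      net.ipv4.tcp_rmem = 4096    87380    12582912
      net.ipv4.tcp_wmem = 4096    16384    12582912
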
  • Server network card settings
    Depending on the network card and driver used, it is possible to adjust certain card characteristics.

    Here the hardware RX/TX ring buffers are of interest. The defaults are:

      # ethtool -g eth0
      Ring parameters for eth0:
      Pre-set maximums:
      RX:         4096
      RX Mini:    0
      RX Jumbo:   0
      TX:         4096
      Current hardware settings:
      RX:         288
      RX Mini:    0
      RX Jumbo:   0
      TX:         512

    We increased these to 1024:

      # ethtool -G eth0 rx 1024
      # ethtool -G eth0 tx 1024

    Please note: These changes are only temporary and will fall back to the defaults after the next boot.

    In order to make these changes permanent, you either have to set them after each boot process with "ethtool", or add them to one of the boot scripts.

    Since we tested on the VSA, we set up a HELIOS startup script which is called during every start/stop of HELIOS.

    The HELIOS startup scripts reside in "/usr/local/helios/etc/startstop". We added a script "01eth0bufs" that runs "ethtool" and sets the number of card buffers each time HELIOS is started.

      ========================================
      #!/bin/sh
      ## set eth0 10gb rx and tx buffers
      # /sbin/ethtool -G eth0 rx 1024
      # /sbin/ethtool -G eth0 tx 1024

      case "$1" in
      pre-start)
          /sbin/ethtool -G eth0 rx 1024  > /dev/null 2>&1
          /sbin/ethtool -G eth0 tx 1024  > /dev/null 2>&1
          ;;

      post-start)
          ;;

      pre-stop)
          ;;

      post-stop)
      #        /sbin/ethtool -G eth0 rx 348  > /dev/null 2>&1
      #        /sbin/ethtool -G eth0 tx 548  > /dev/null 2>&1
          ;;

      *)
          echo "Usage: $0 { pre-start | post-start | pre-stop | post-stop }"
          exit 1
          ;;
      esac

      exit 0
      ========================================

    After you have created this script, make sure to make it executable, e.g. "chmod a+x 01eth0bufs".
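
    To verify that the new values are active, "ethtool -g eth0" can be run again after a HELIOS start; the "Current hardware settings" should then report the increased values (sketched output, shortened):

      # ethtool -g eth0
      ...
      Current hardware settings:
      RX:         1024
      RX Mini:    0
      RX Jumbo:   0
      TX:         1024
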
  • PCI bus performance on Mac OS X
    If you don't get near the expected throughput, verify that the PCI card is using the maximal "Link" settings. HELIOS LanTest can be used to test throughput.

    During testing we found that a cold boot is required in order for the PCI Ethernet card to reach the maximum "Link Width" of x8.

    A warm reboot, whether into the same OS X version or into a different installed OS X version, may result in "Link Width" values as low as x1.

    At that link width you won't be able to achieve 10 GbE throughput.

    Check at:
    About This Mac > More Info > System Report > Hardware > PCI-Cards > ethernet

    that the values are similar to these:
         Link Width: x8
         Link Speed: 2.5 GT/s

    The absolute values may vary depending on the used Ethernet card.
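
    The same information can also be read in a Terminal session with "system_profiler" (a sketch; the exact output format varies between OS X releases and cards):

      $ system_profiler SPPCIDataType
      ...
      Link Width: x8
      Link Speed: 2.5 GT/s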

    Please note: The server and client network TCP buffer tuning has been done by testing in multiple cycles to find out which buffer configurations offer the best and most reliable read/write performance. Results may vary with different operating systems and different 10 Gb Ethernet NICs and drivers. Incorrect network tuning can also result in slower performance or even a faulty or failing network.

References