[Beowulf] Performance tuning for Jumbo Frames

Patrick Geoffray patrick at myri.com
Sat Dec 12 08:40:49 PST 2009


Rahul,

Rahul Nabar wrote:
> I have seen a considerable performance boost for my codes by using
> Jumbo Frames. But are there any systematic tools or strategies to
> select the optimum MTU size?

There is no optimal MTU size. The MTU is the maximum payload you can fit 
in one packet, so there is no drawback to a bigger MTU. Actually, there 
is one with wormhole switching, but switch contention is an issue 
happily ignored by most HPC users.

> external world required of the interfaces) Have you guys found
> performance to be MTU sensitive?

A large MTU means fewer packets for the same amount of data transferred.
In all stack processing, there is a per-packet overhead (decoding 
header, integrity, sequence number, etc) and a per-byte overhead (copy). 
A large MTU reduces the total per-packet overhead because there are 
fewer packets to process.
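
To put rough numbers on this, here is a minimal sketch of the cost model 
described above. The 40-byte header size (IPv4 + TCP without options) 
and the per-packet and per-byte costs are assumptions chosen for 
illustration, not measurements from any real stack.

```python
def packets_needed(total_bytes, mtu, header_bytes=40):
    """Packets required to carry total_bytes of payload.

    Assumes header_bytes of IPv4 + TCP headers (no options) inside each MTU.
    """
    payload = mtu - header_bytes
    if payload <= 0:
        raise ValueError("MTU must be larger than the header size")
    # Integer ceiling division: a partially filled last packet still counts.
    return -(-total_bytes // payload)


def stack_cost_us(total_bytes, mtu, per_packet_us, per_byte_us, header_bytes=40):
    """Toy model: total cost = packets * per-packet cost + bytes * per-byte cost."""
    n = packets_needed(total_bytes, mtu, header_bytes)
    return n * per_packet_us + total_bytes * per_byte_us


if __name__ == "__main__":
    one_mib = 1024 * 1024
    for mtu in (1500, 9000):
        n = packets_needed(one_mib, mtu)
        # Hypothetical costs: 1 us per packet, 0.001 us per byte.
        cost = stack_cost_us(one_mib, mtu, per_packet_us=1.0, per_byte_us=0.001)
        print(f"MTU {mtu}: {n} packets, modelled cost {cost:.1f} us")
```

Under these assumptions, 1 MiB takes 719 packets at MTU 1500 and 118 at 
MTU 9000. The per-byte term is identical in both cases; only the 
per-packet term shrinks.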

Most 10GE NICs have no problem reaching line rate at 1500 Bytes (the 
standard Ethernet MTU); the problem is the host OS stack (mainly TCP), 
where the per-packet overhead is significant. One trick that all 10GE 
NICs worth their salt are doing these days is to fake a large MTU at the 
OS level, while keeping the wire MTU at 1500 Bytes (for compatibility). 
This is called TSO (TCP Segmentation Offload) on the send side and LRO 
(Large Receive Offload) on the receive side. The OS stack is using a 
virtual MTU of 64K and the NIC does segmentation/reassembly in hardware, 
sort of.
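
As a rough illustration of what that segmentation/reassembly amounts to, 
the sketch below splits one 64K buffer into wire-sized pieces and joins 
them again. It is only a software analogy: a real NIC also rewrites 
headers, sequence numbers and checksums per segment, and the 1460-byte 
segment size (1500 minus 40 bytes of IPv4 + TCP headers) is an 
assumption.

```python
def segment(buffer, mss=1460):
    """Split one large send buffer into segments of at most mss bytes.

    Stands in for what the NIC does on transmit under TSO (payload only).
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [buffer[i:i + mss] for i in range(0, len(buffer), mss)]


def reassemble(segments):
    """Join in-order segments back into one large buffer.

    Stands in for what LRO does on receive before the stack sees the data.
    """
    return b"".join(segments)


if __name__ == "__main__":
    big = bytes(64 * 1024)          # one 64K "virtual MTU" buffer
    pieces = segment(big)
    print(len(pieces), "segments on the wire, last one", len(pieces[-1]), "bytes")
    assert reassemble(pieces) == big
```

The OS stack pays its per-packet overhead once for the 64K buffer, 
while the wire still carries 45 standard-sized frames.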

> Also, are there any switch side parameters that can affect the
> performance of HPC codes? Specifically I was trying to run VASP which
> is known to be latency sensitive.

A large MTU has little to no impact on latency.

> I have a 10 Gig E network with a
> RDMA offload card and am getting average latencies (ping pong) using
> rping of around 14 microsecs in the MPI tests.

It is most likely due to the switch. Try back-to-back to measure without 
it. I don't know what hardware you are using, but you can get close to 
10us latency over TCP with a standard 10GE NIC and interrupt coalescing 
disabled. With a NIC supporting OS-bypass (RDMA only makes sense for 
bandwidth), you should cut that at least in half, ideally below 3us.
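
For a quick back-to-back comparison with and without the switch, a TCP 
ping-pong can be sketched as below. This is not rping or an MPI 
benchmark, and the function names are made up for this example. The 
Python interpreter adds overhead of its own, so treat the numbers as 
relative (switch vs. no switch) rather than as the absolute latency of 
the hardware.

```python
import socket
import threading
import time


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def echo_once(listener, size, iterations):
    """Accept one connection and echo `iterations` messages of `size` bytes."""
    conn, _ = listener.accept()
    with conn:
        # Disable Nagle so small messages are sent immediately.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(iterations):
            conn.sendall(recv_exact(conn, size))


def half_round_trip_us(host, port, size=1, iterations=1000):
    """Average half round-trip time in microseconds over `iterations` ping-pongs."""
    with socket.create_connection((host, port)) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        msg = b"x" * size
        start = time.perf_counter()
        for _ in range(iterations):
            sock.sendall(msg)
            recv_exact(sock, size)
        elapsed = time.perf_counter() - start
    return elapsed / iterations / 2 * 1e6


def loopback_demo(iterations=1000):
    """Run both ends on 127.0.0.1; on real hardware run them on two hosts."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=echo_once, args=(listener, 1, iterations))
    server.start()
    try:
        return half_round_trip_us("127.0.0.1", port, 1, iterations)
    finally:
        server.join()
        listener.close()


if __name__ == "__main__":
    print(f"half round trip: {loopback_demo():.1f} us")
```

On two hosts, one side would run echo_once on a listening socket and the 
other half_round_trip_us against it, first through the switch and then 
with a direct cable.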

Patrick


