
Monday, August 14, 2006

Performance: Blocks vs. NAS

One of the top reasons SAN administrators cite for not using NAS is performance. Much has been written about how to make Ethernet work as fast as SCSI/FC for storage. Here is my summary.


Why Block Adapters Go Fast
Block adapters (SCSI & FC HBAs) can go fast for two key reasons. First, the SCSI protocol separates commands from data. An application server and storage server first perform a small exchange that lets both ends prepare for the actual I/O operation. They agree on which direction data will move (Read or Write) and the length of the data. The HBAs on each end identify the location of the data buffers, make sure those buffers are locked down in physical memory (can't be paged out), and build Scatter/Gather lists so the DMA engines can follow the discontiguous physical pages. Finally, they exchange a unique SCSI Tag. This tag is used during the data transfer part of the I/O to tie the data back to the correct SCSI command; more specifically, it tells the DMA engines which Scatter/Gather list to follow during the data transfer.
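
To make that concrete, here is a rough sketch in C of the per-command state both ends might build during that setup exchange. The structure and field names are my own illustration, not taken from any real HBA driver or from the SCSI standard.

    #include <stdint.h>

    #define MAX_SG_ENTRIES 64

    /* One entry in a Scatter/Gather list. In a real HBA, addr would be the
     * physical address of a locked-down (non-pageable) page that the DMA
     * engine can reach directly. */
    struct sg_entry {
        void    *addr;
        uint32_t length;               /* bytes in this segment */
    };

    /* Per-command context built during the command/setup phase,
     * before any data moves. */
    struct scsi_cmd_context {
        uint16_t        tag;           /* the unique SCSI Tag agreed up front */
        int             is_write;      /* direction: Read or Write */
        uint32_t        total_length;  /* total bytes to transfer */
        uint32_t        sg_count;      /* sg_list entries in use */
        struct sg_entry sg_list[MAX_SG_ENTRIES];
    };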


The second reason block HBAs perform so well is that they can do Remote DMA (RDMA) using the SCSI Tag during the data transfer phase. These HBAs are designed specifically to run the SCSI/FC protocols, so they can extract the SCSI Tag and data pointer from a data packet in real time, plug them into the DMA engine (which keeps a table of Scatter/Gather lists for all I/Os in progress), and DMA the data to or from the correct buffer. This gives them zero-copy capability on both the transmit AND receive sides. (A lot of NICs that claim zero copy can really only do it on the transmit side; zero copy on the receive side is harder.)
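
A minimal sketch of that receive path, using the same structures as above: the adapter keys on the tag in each data frame, finds the matching Scatter/Gather list, and lands the payload directly in the application's buffers. The function names are hypothetical, and a memcpy stands in for what the DMA engine does in hardware.

    #include <stdint.h>
    #include <string.h>

    /* Repeated from the sketch above so this compiles on its own. */
    struct sg_entry { void *addr; uint32_t length; };
    struct scsi_cmd_context {
        uint16_t        tag;
        uint32_t        sg_count;
        struct sg_entry sg_list[64];
    };

    /* One context slot per possible SCSI Tag value. */
    static struct scsi_cmd_context *ctx_table[65536];

    /* Hypothetical receive path: look up the command context by tag and
     * place the payload at the right logical offset in the pre-registered
     * buffers, with no intermediate staging copy. */
    static void on_data_frame(uint16_t tag, uint32_t data_offset,
                              const void *payload, uint32_t len)
    {
        struct scsi_cmd_context *ctx = ctx_table[tag];
        if (ctx == NULL)
            return;                      /* unknown tag: drop or report an error */

        uint32_t skip = data_offset;     /* logical offset within the whole I/O */
        for (uint32_t i = 0; i < ctx->sg_count && len > 0; i++) {
            struct sg_entry *e = &ctx->sg_list[i];
            if (skip >= e->length) {     /* segment lies before our offset */
                skip -= e->length;
                continue;
            }
            uint32_t room  = e->length - skip;
            uint32_t chunk = len < room ? len : room;
            memcpy((uint8_t *)e->addr + skip, payload, chunk);
            payload = (const uint8_t *)payload + chunk;
            len  -= chunk;
            skip  = 0;                   /* later segments start at offset 0 */
        }
    }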


These two features, the pre-exchange to set up DMA and the ability to DMA data directly, mean that SCSI and FC can be viewed as a form of RDMA protocol.


Matching This Performance on IP
A problem with traditional Ethernet NICs is that they're designed to handle a wide variety of upper-layer protocols (ULPs), so they don't know how to interpret the data transfer context for any particular ULP. A second problem is that most ULPs don't set up the transfer before moving data anyway.


As Jeff Mogul describes in his paper "TCP offload is a dumb idea whose time has come", the real benefit of TCP Offload Engines (TOEs) isn't offloading TCP; it's that they can offload the ULP and implement RDMA. He has a better term: RDMA-enabled NIC (RNIC). So, like a SCSI or FC HBA, if you limit these RNICs to specific ULPs that separate command setup from data transfer, and build the RNIC so it can extract key fields from the packets to understand the data transfer and tie that context into its DMA engine, then you can match FC performance.


iSCSI is one way to do this. It carries the same SCSI Tag (in effect, a DMA context handle), so you can build an iSCSI RNIC that performs as well as FC - provided the physical link speeds are equivalent.
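
For example, the fields an iSCSI RNIC would key on in each Data-In PDU look roughly like this (a minimal sketch; the names follow RFC 3720, but the real header has many more fields):

    #include <stdint.h>

    /* The handful of iSCSI Data-In PDU fields an RNIC cares about;
     * not the full header layout. */
    struct iscsi_data_in_keys {
        uint32_t initiator_task_tag;    /* plays the role of the SCSI Tag above */
        uint32_t buffer_offset;         /* where this payload lands in the I/O buffer */
        uint32_t data_segment_length;   /* payload bytes following the header */
    };

The Initiator Task Tag is the handle that indexes the DMA-context table, just like the SCSI Tag, and the Buffer Offset says where the payload belongs within the buffers registered at command setup.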


So the next question is: can NAS be accelerated in the same way? As I mentioned in my last post, V4 moves all of NFS onto one port, so the NAS RNIC can trigger off that port number for special processing. Then we need to know whether the NAS ULP does any setup that gives the RNICs a chance to set up DMA context, and whether there is a way to detect that context during data transfer. My questions for investigation:

- NFS originally used RPC. Is this still true for V4? How does RPC move large amounts of data? Does NFS move data, as well as parameters, through RPC?

- Does NFS V4 change anything to facilitate RDMA?