/********************************************************* END OF STYLE RULES *********************************************************/

Friday, September 29, 2006

Part III: And more storage arrays

Continuing my research on all the storage array vendors out there.....

Not a new vendor but I haven't looked at them in a while. 3Par makes a fairly traditional block storage rack with 3U and 4U disk trays, a 4U dual controller modular, dual power supplies, etc. Available with FC and iSCSI SAN interfaces. They provide the usual set of RAID levels and data services including snapshot and remote copy and a 'single pain' management GUI that includes tools for monitoring and managing storage resources, access patterns, and for migrating data between RAID levels.

One feature they have developed is the ability to Underprovision. You can create several volumes that can present LUNs that are larger than the amount of available disk space. You have to leave some amount of disk space in a free pool. Then, as one or more of the underprovisioned LUNs fill up, the controller will automatically take space from the free pool as necessary. Another nice (although I'm not sure new) feature is the ability to migrate data between RAID levels online as access patterns and the desired SLA change.

In summary, nothing bleeding-edge here. They've been around a while so I would hope they have most of the bugs worked out so for someone just needing reliable block storage with some scaleability, ability for snapshot/backup and remote mirror, this might be a good choice.

Not new either. Another fairly standard RAID array in the middle of the pack. Features include FC or iSCSI host interface (4 ports/controller), FC or SATA disks, snapshot, sync and async remote copy, standard RAID levels and also claim to have RAID 6. Available in a 3U, 15-drive dual-controller model. Also sell a 1U RAID head that uses stand-alone JBODs on the back-end.

More mid-range RAID. 2U and 3U RAID trays. FC host, SAS and SATA disks. Snapshot, remote copy, etc.

Block storage provider who's value prop is a set of block data services. Claim to provide the 'Only SAN with Automated Tiered Storage'. I'm not sure I believe that but their RAID subsystem will track properties of data and automatically migrate to different tiers of storage. This would have to be at the block level so they must be doing this at the granularity of some number of blocks.

They also have an underprovisioning (called Thin-Provisioning) feature to let users create LUNs that are larger than the available storage and pull from a free pool as necessary. Also, claim to have CDP. They call it Continuous Snapshots.

Going for low cost. Do a low-end array with ethernet interface running both iSCSI target and NAS. My guess is they use Linux internally on a commodity motherboard and OEM a low-end RAID controller. Targeting small business.

Really going for the low-cost leader in an iSCSI storage array. Have a product called a 'Storage Concentrator' that serves as iSCSI Target. Looks suspiciously like a 1U Dell rack server with the Dell logo replaced with one that says Stonefly. My guess is it runs Linux with an iSCSI target driver such as Wasabi. Also available in a 3U array with a single integrated iSCSI controller. Has battery-backed cache but no controller failover.

Thursday, September 28, 2006

Part II: More new storage arrays

The second of my series of posts on what new array vendors are doing.

Isilon Systems
Isilon is one of the vendors doing Clustered Storage in the form of 2U NAS bricks that interconnect with each other through Gig Ethernet or Infiniband and include a Shared Filesystem that lets them share and balance files across the shared disk space. They focus on very similar servers and workloads as Panasas and BlueArc - scaleable rack servers requiring high-bandwidth access to large files although their material talks about the market differently. It describes the growth of Unstructured Data - large multimedia files requiring high-bandwidth, shared access - essentially the same as HPTC.

Here is the picture from their White Paper:

Their 'secret sauce' is their Distributed File System (DFS) which uses a Distributed Lock Manager (DLM) over the dedicated interconnect which can be either IB of Ether. The DFS integrates volume management and RAID including Reed-Solomon ECC so it can tolerate multiple disk or node failures within a volume/stripe. The DFS handles distributed metadata and file locking so multiple nodes can share access to a file. Includes the ability to rebalance data across nodes to maintain load balancing and something they call 'smartconnect' which a source tells me looks a lot like basic IP path failover and load-balancing. Also provides one integrated view of the data for the management UI. The host interface includes NFS, CIFS. No mention of iSCSI at this point.

One issue I didn't see addressed is how they handle failover when a box fails. Using their EDD/RAID algorithms, the data is still there, but clients will have to re-mount the volume through new IP address. I suspect this isn't handled today. Something like a pNFS MDS is required for that.

Another approach to address similar problems and market as Panasas and BlueArc. Uses commodity HW, like Panasas, but uses existing standard NFS/CIFS for the SAN interface like BlueArc. I saw a reference to them on the web using embedded Linux but someone told my they use a version of BSD. Either way, they should be able to quickly pick up NFS/pNFS enhancements as they evolve while they focus their engineering on enhancing the DFS and management UI. The biggest threat is that open-source Linux DFS's will eventually catch up and standards like pNFS eventually eliminate the need to embedded a DFS in the storage. For now though, looks like a promising approach (provided they've worked out all the failure conditions).

Lefthand is also doing 'Clustered Storage' using rack-mount storage trays built from commodity hardware, SAS and SATA disks, and ethernet interconnect. Like Isilon, they have a distributed file system allowing them to share data, stripe & RAID protect across multiple bricks, and scale to many bricks. Lefthand also distributes their metadata processing so there is no single-point-of-failure.

Lefthand offers a NAS interface although I think most of their installed-base is iSCSI and that's where they focus. One unique feature is they've developed a multipath driver that keeps a map of all the alternate controllers that could take over serving a LUN when one fails. Good idea. I've seen some PR recently about certifying with Windows for iSCSI boot. I don't know how far they've gone with implementing automated services such as DHCP and DNS as I described in my Ideal SAN, but that would support their value prop as a leading provider of iSCSI-based SAN Solutions.

Crosswalk was founded by Jack McDonald who started McData and it looks like they are leveraging their switching heritage. Technically, not an array provider, but they use similar technology and are focusing on similar problems - aggregating storage together into a 'grid' with a global namespace and high availability, scaleability, etc.

Crosswalk is clearly focusing on the HPTC market and I applaud their marketing for being clear about their focus on this segment. Nearly every marketing book I've read and class I've taken makes it clear that segmenting your marketing and defining your unique advantage in that segment is a fundamental requirement for success. Despite this, in my twenty years in storage I've met very few product managers willing to do this ("we might lose a sales opportunity somewhere else...."). But, I digress.

Crosswalk also uses a distributed file system that aggregates data into one namespace, allows shared access to information, implements a DLM, etc. Crosswalk differs in that they take the approach of doing this in an intermediate layer between the storage and the IP Network. Their product is a small grid of high-performance switches serving NFS/CIFS on the front and using legacy FC storage on the back-end. This is their differentiation: "integration of disparate storage resources". Presumably, they leverage their experience at McData implementing high-performance data channels in the box so they can move lots of data with a relatively few nodes in their 'virtualization grid'. Host interface is standard NFS/CIFS.

Given their focus on HPTC where Linux prevails, I would hope they use Linux in their grid and can pick up NFS/pNFS enhancement as they are adopted on HPTC grids. Also, given that 30% of the top-100 supercomputers now use Lustre from ClustreFS, and given their location just down the road from ClusterFS in Boulder, I would assume they are talking. This would make a good platform for running the Lustre OSD target.

Sells 3U and 4U iSCSI arrays. Found limited information on the website about the internal architecture but appears to be block (no NFS) with the usual set of data services. Also talks briefly about some unique data services to let bricks share data and metadata for scaling and availability but it doesn't sound like the same level of sharing as Isilon, Crosswalk, or Lefthand. This looks more like a straightforward, iSCSI block RAID tray. Nothing wrong with that. Over the next several years, as the RAID tray becomes what the disk drive was to the enterprise ten years ago, they are one of the contenders to be one of the survivors, provided they can keep driving down HW cost, manufacturing in high volume, keep reliability high, and keep up with interconnect and drive technology.

Tuesday, September 26, 2006

New Storage Arrays: Part 1

This post is another collection of notes. In this case, notes from reading the websites from several fairly new (to me at least) storage subsystem vendors. I don't have any inside information or access to NDA material on these companies. All my notes and conclusions are the result of reading material on their websites.


Panasas builds a storage array and associated installable file system that closely aligns with my vision of an ideal SAN so, needless to say, I like them. Their focus is HPTC, specifically, today's supercomputers built from many compute nodes running Linux on commodity processors. For these supercomputers, it's critical that multiple compute nodes can efficiently share access to data. To facilitate this, Panasas uses object storage along with a pNFS MetaData Server (MDS). Benefits include:

    Centralize and offload block space management. Compute nodes don't have to spend a lot of effort comparing free/used block lists between each other. A compute node can simply request creation of a storage object.

    Improved Data Sharing. Compute nodes can open objects for exclusive or shared access and do more caching on the client. This is similar to NFS V4. The MDS helps by providing a call-back mechanism for nodes waiting for access.

    Improved Performance Service Levels. Objects include associated properties, and with the data grouped into objects, the storage can be smarter about how to layout the data for maximum performance. This is important for HPTC which may stream large objects.

    Better Security Objects include ACLs and authentication properties for improved security in these multi-node environments.

Panasas uses pNFS concepts but goes beyond pNFS, I think. Compute nodes include the client layout manager so they can stripe data across OSD devices for increased performance (reference my pNFS Notes). They use the MDS server for opens/closes, finding data, and requesting call-backs when waiting for shared data. They get the scaleable bandwidth that results from moving the MDS out of the datapath. More importantly, the MDS provides the central point to keep track of new storage and for new servers to go to find the storage it needs. Supports scaleability of both storage and servers.

Object Properties Panasas objects use the object-oriented concept of public and private properties. Public properties are visible to the Object Storage Device and specify the Object ID, size, and presumably other properties to tell the OSD the SLA it needs. Private properties are not visible to the OSD and are used by the client AND the MDS. They include ACLs, client (layout manager) RAID associations, etc.

iSCSI Panasas runs their OSD via iSCSI over IP/Ethernet. I assume they use RDMA NICs in their OSD array and it's up to the client whether or not to use one. For control communications with the MDS, they use standard RPC.

File System I don't think their filesystem is Lustre. I think they wrote their own client that plugs into the vnode interface on the Linux client. I don't know if their OSDs work with Lustre or not. I would think they would not pass up that revenue opportunity. I think that 30% of the Top-100 supercomputers use Lustre.

Standards I like that Panasas is pursuing and using standards. They understand that this is necessary to grow their business. They claim their OSD protocol is T10 compliant and they are driving the pNFS standard.

Storage Hardware Interesting design that uses 'blades'. From the front, looks like a Drive CRU, but a much deeper card with (2) SATA HDDs. Fits into a 4U rack mount tray. Includes adapters for IB and Myranet, as well as native ethernet/iSCSI interface. Don't know what the price is but, appears to be built from commodity components so ought to be reasonably inexpensive. I didn't see anything about the FW but I'm certain it must be Linux-based.

Summary Again, I like it - a lot. They are aligned with the trend to enable high-performance, scaleable storage on commodity storage AND server hardware (ethernet interconnect, x86/x64 servers running Linux, simple storage using SATA disks). Developing FileSystem and MDS server software to enable this scaling that actually works. Driving it as an open standard including driving pNFS as a standard that is transport-agnostic. By using the open-source process they can take advantage of contributions from the development community. Finally, makes sense to start out in HPTC get established and mature the technology but I see a lot of potential in commercial/enterprise datacenters.


BlueArc is an interesting contrast to Panasas. Both are trying to address the same problem - Scaleable, intelligent, IP network and object-based storage that can support lots of scaleable application servers but, they approach the problem in completely different ways. Panasas, founded by a computer science PhD (Garth Gibson) uses software to combine the power the lots of commodity hardware. BlueArc on the other hand, founded by a EE with a background developing multi-processor servers is addressing the problem with custom high-performance hardware.

The BlueArc product is an NFS/CIFS server that can also serve up blocks via iSCSI. Their goal is scaleability but their premise is that new SW standards such as pNFS and NFS V4++ are too new so they work within the constraints of current, pervasive versions of NFS/CIFS. Their scaleability and ease-of-use comes from from very high performance hardware that can support so many clients that only a few are needed.

Hardware Overview. Uses the four basic components of any RAID or NAS controller: Host Interface, Storage Interface, Non-real-time executive/error handling processor, and Real-time data movement and buffer memory control. Each of these is implemented as independent modules that plug into a common chassis and backplane.

    Chassis/Backplane Chassis with a high-performance backplane. Website explains that it uses "contention-free pipelines" for many concurrent sessions and low-latency interprocessor communications between I/O and processing modules. Claims this is a key to enabling one rack of storage to scale to support many app servers.

    Network Interface Module Custom plug-in hardware module providing the interface to the ethernet-based storage network. Website says includes HW capability to scale to 64k sessions

    File System Modules Plug-in processing modules for running NAS/CIFS/iSCSI. Two types: 'A' modules does higher-level supervisory processing but little data movement. 'B' module actually moves file system data and controls buffer memory.

    Storage Interface Module Back-end FC, SCSI interface and processing. Also does multipathing. Website says it contains much more memory than a typical HBA so it can support more concurrent I/Os

Software The software mainly consists of the embedded FW in the server for NAS/CIFS and filesystem processing. Works with standard CIFS/NFS/iSCSI so no special client software required. The white paper refers to the 'Object Storage' architecture but no OSD interface is supported at this time. Includes volume management (striping, mirroring) for the back-end HW RAID trays.

Summary Again, the advantage is high performance and scaleability due to custom hardware. It uses existing network standards so it can be rolled into a datacenter today and it's ready to go. No special drivers or SW required on the app servers which is nice. Also, since you only need one, or a few of these you don't have the problem of managing lots of them. Similar to the benefits of using a large mainframe vs. rack servers. Also, implemented as a card-cage that lets you start small and grow - sort of like a big Sun E10k SPARC server where you can add CPU and I/O modules.

Keys to success will include three things. One, the ability to keep up with new advances in hardware. Two the ability to keep it simple to manage. Third, and my biggest concern, the ability to mature the custom, closed firmware and remain competitive with data services. This is custom hardware requiring custom firmware. BlueArc needs to continue staffing enough development resources to keep up. This concerns me because I've been at too many companies who tried this approach and just couldn't keep up with commodity HW and open software.

Pillar Data

Pillar builds an integrated rack of storage that includes RAID trays (almost certainly OEM'd), a FC block SAN head and a NAS head which can both be used at the same time sharing the same disk trays, and a management controller. Each is implemented as 19" rack mount modules. There's no bleeding-edge technology here. It's basic block and NAS storage with the common, basic data services such as snapshot, replication, and a little bit of CDP. That appears to be by design and supports their tag-line: 'a sensible alternative'. The executive team is experienced storage executives that know that most datacenter admins are highly risk-averse and their data management processes are probably built around just these few basic data-services so this strategy makes sense as a way to break into the datacenter market.

The unique value here is that both NAS and block are integrated under one simple management interface, you can move (oops, I mean provision) storage between both, and the same data services can be applied to both block and NAS. Most of the new invention here is in the management controller which bundles configuration with wizards, capacity planning, policies for applying data services, and tiered storage management. It allows a user to define three tiers of storage, assign data to those tiers, and presumably the system can track access patterns for, at least the NAS files, and migrate between tiers of storage.

Looking Forward

This looks like a company trying to be the next EMC. It is managed by several mature, experienced executives including several ex-STK VPs. They are building on mature technology and trying to build the trust of enterprise datacenter administrators. The value prop is integration of mature, commonly used technologies - something attractive to many admins who use NAS storage with one management UI from one vendor, block storage from another, and SAN management from yet another.

What's really interesting is when you combine this with their Oracle relationship. They are funded by Larry Ellison. As I described in my post on Disruption and Innovation in Storage, I firmly believe that for enterprise storage, the pendulum has swung back to giving the competitive advantage to companies that can innovate up and down an integrated stack by inventing new interfaces at each layer of the stack. We will never solve today's data management problems with a stack consisting of an application sitting on top of the old POSIX file API, a filesystem that breaks data into meaningless 512-byte blocks for a block volume manager, in-turn talking to a block storage subsystem. So, Oracle is doing the integration of the layers starting from the top by bypassing the filesystem, integrating it's own volume manager and talking directly to RDMA interfaces. Now we have Pillar integrating things from the bottom up. By getting Oracle and Pillar together to invent a new interface, they could create something similar to my vision of an ideal SAN.

In this vision of the future, Oracle provides a bundle of software that can be loaded on bare, commodity hardware platforms. It includes every layer from the DB app, through volume management down to RDMA NIC driver and basic OS services which come from bundling Linux. The commodity x64 blades could include RDMA-capable NICs for high-performance SAN interconnect. Then, using NFS V4++, Oracle and Pillar agree on extended properties for the data objects to tell the Pillar storage subsystem what Service Levels and Compliance steps to apply to the data objects as they are stored, replicated, etc. Over time, to implement new data services or add compliance to new data management laws, Oracle and Pillar can quickly add new data properties to the interfaces up and down the stack. They don't have to wait for SNIA or ANSI to update a standard and they don't have to wait for other players to implement their side of the interface. Microsoft can do this with VDS and their database. With Pillar, Oracle can do it as well.

Tuesday, September 19, 2006

My Ideal SAN, Part II, Data Services

In part one, I talked about the SAN interconnect and network and array services that facilitated the use of lots of scaleable rack servers. Now I want to talk about how to achieve scaleability on the storage side, and about data services that help solve the combined problem of centralizing information, keeping it always available, putting it on the right class of storage, and keeping it secure and compliant with information laws.

Consolidating and Managing the Information with NFS
Early SANs were about consolidating storage HARDWARE, not the information. The storage was partitioned up, zoned in the SAN, and presented exclusively to large servers giving them the impression they were still talking to direct-attached storage. This allowed the server to continue to run data services in the host stack because it virtually owned it's storage. My ideal datacenter uses lots of scaleable rack servers with applications that grow and migrate around. Trying to run the data services spread across all these little app servers/data clients is nearly impossible. The INFORMATION, not just the storage hardware has to be centralized and shared and most of the data services have to run where the storage lives - on the data servers. This means block storage is out. Block storage servers which receive disaggregated blocks of storage with no properties of the data are hopelessly limited in their ability to meaningfully manage and share the information. So, my storage needs to be object-based and, since I'm building this datacenter from scratch, I'm going to use NFS V4++. (If I needed to run this on legacy FC infrastructure, I would use the OSD protocol but, more on that later). With enhanced NFS, the storage servers keep the information in meaningful groupings with properties that let it store the information properly.

Performance and Availability
For high performance and availability I want NFS V4 plus some enhancements. One enhancement is the RPC via RDMA standard being developed by Netapp and Apple. The onboard NIC in the rack servers should be capable of performing RDMA for the RPC ULP as well as iSCSI. For availability, the host stack must support basic IP multipath as well as NFS volume and iSCSI LUN failover. The latter should use the industry-standard symmetric standard, or ANSI T10 ALUA for asymmetric LUN failover. For NFS volume/path failover, the V4 fs_locations is helpful because it allows a storage server to redirect the client to another controller that has access to the same, or a mirrored copy of the data. This helps but, to achieve full availability and scaleability, we need pNFS with it's ability to completely decouple information from any particular piece of HW.

I few weeks ago a posted a few notes on pNFS. pNFS applies the proven concept of using centralized Name and Location services in networks - the same concept that has allowed the internet to grow to millions of nodes. The pNFS Name/Location server can run on the same inexpensive clustered pair of rack servers as the storage DHCP service. With pNFS, instead of mounting a file from a piece of NFS server hardware, clients do a lookup by name, and the pNFS Name/Location server returns a pointer to the storage device(s) where the data currently resides. Now files can move between a variety of small, low-cost, scaleable storage arrays giving high availability. Frequently-accessed data can reside on multiple arrays and an app server can access the nearest copy. For performance, app servers can stripe files across multiple arrays. Finally, with NFS V4 locking semantics, multiple app servers can share common information - something FC/Block SANs have never been able to do effectively.

The Storage Server Hardware
Just like I described using small, scaleable, rack-mount application servers, pNFS now allows doing the same with the storage. My storage arrays would be scaleable - probably 2U/12-drive or 3U/16 drive bricks, some with high-performance SAS disks, and others with low-performance SATA. Some with high-performance mirrorsets, others with lower-performance RAID 5. The interface is ethernet that is RDMA-capable for both iSCSI and NFS/RPC. As I described in part one, they can be configured with iSCSI LUNs and assigned meaningful names and the array registers those with the central name service on the SAN. They can also be configured with NFS volumes that register with pNFS. This gives the ultimate in scaleability, flexibility, lower cost, high availability, and automated configuration. Now, we can talk about how to seriously help manage the data.

Managing the Data
Managing the Data means being able to do four things all at the same time. One, keep the data centralized and shared. Two, keeping it always accessible, in the face of any failures or disasters. Three, putting the right data on the right class of storage, and Four, complying with applicable laws and regulations for securing, retaining, auditing, etc. It's when you put all four of these together that it gets tough with today's SANs.

I already talked about how pNFS with NFS V4++ solves the first two - keeping the data centralized, shared, and 100% accessible. With pNFS, arrays can share files among multiple data clients. Both arrays and data clients can locally and remotely replicate data via IP and the pNFS server allows data clients to find the remote copies in the event of a failure. Similarly, on the application server side, if a server fails, an application can migrate to another server and quickly find the data it needs.

Now I want to talk about how the object nature of NFS allows solving the second two problems. Again, because the data remains in meaningful groupings (files) and has properties along with the ability to add properties over time, the storage servers can now put it on the right class of storage, and apply the right compliance steps. NFS today has some basic properties that let the storage server put the data on the right class of storage. Revision dates and read-only properties allow the storage to put mostly-read data on cheaper RAID 5 volumes. With revision dates, the storage can migrate older data to lower cost SATA/RAID-5 volumes and even eventually down to tape archive. With the names of files, the storage can perform single-instancing. These properties are a start but I would like to see the industry standardize more properties to define the Storage Service Levels data objects require.

Finally, compliance with data laws is where the object nature of NFS can help the most. The problem with these laws is they apply to the Information, not to particular copies of the data. Availability and Consolidation requirements mean the information has to be replicated, archived and shared on the storage network. With NFS, information can be named, and the name service can keep track of where every copy resides. The properties associated with the data can include an ACL and audit trail of who accessed each copy. The storage can retain multiple revisions, or can include an 'archive' property so the storage makes it read-only. The properties can include retention requirements then, once the retention period expires, the storage can delete all copies. These are just a few of the possibilities.

How to Get There?
Some of this development is happening. Enhancements to NFS V4 are being defined and implemented in at least Linux and Solaris. pNFS is being defined and prototyped through open source, with strong participation by Panasas. RDMA for NFS is at least getting defined as a standard. Now we need NICs from either the storage HBA, or ethernet NIC vendors. Some gaps where I don't see enough progress are one, defining more centralized configuration, naming and lookup services for pNFS storage networks. Panasas and the open development community seem to be focusing on HPTC right now. Probably not a bad place to start. That market needs the parallel access to storage they get from pNFS and object-based storage. But, it leaves an opportunity to define the services to automate large SANs for other markets. The other gap is standardizing properties for data objects, specifically for defining Storage Service Levels and Compliance with data laws. These need to be standardized. (I need to check what the SNIA OSD group is doing here).

Notes on Transitioning from Legacy FC SANs
One of the nice features of pNFS's separation of control and data flow is that it doesn't care what transport is used to move the data. They typical datacenter with it's large investment in Fibre Channel will have to leverage that infrastructure. There is no reason the architecture I describe can't use FC in parallel with ethernet with the T10 OSD protocol provided OS drivers are available that connect the OSD driver to the vnode layer. The same data objects with the same properties attached can be transmitted through the OSD protocol over FC. THIS is the value of the T10 OSD spec. It allows an Object-based data management architecture like I described above to leverage the huge legacy FC infrastructure.

Monday, September 11, 2006

My Ideal SAN, Part I, Boot Support

This is the first of what may be several posts where I describe my idea of an ideal SAN using a combination of products and technology available today, technology still being defined in the standards bodies, and some of my own ideas. My ideal SAN will use reasonably priced components, use protocols that automate and centralize configuration and management tasks, will scale to thousands of server and storage nodes, and provides storage service levels that solve real data management problems.

The Interconnect
My SAN will use Ethernet. In part because of cost but mostly because it comes with a true network protocol stack. Also, because I can get scaleable rack-mount servers that come with ethernet on the motherboard so I don't need add-on HBAs. The normal progression for an interconnect, as happened with ethernet and SCSI, is that it starts out as an add-on adapter card costing a couple hundred dollars. Then, as it becomes ubiquitous, it moves to a $20 (or less) chip on the motherboard. Fibre Channel never followed this progression because it's too expensive and complex to use as the interconnect for internal disks, and it never reached wide-enough adoption to justify adding the socket to motherboards. I want rack-mount servers that come ready to go right out of the box and with two dual-ported NICs so I have two ports for the LAN, and two for the SAN. Sun's x64 rack servers and probably others meet this requirement.

To further simplify configuration and management, these rack servers won't have internal disks. They will load pre-configured boot images from LUNs on centralized arrays on the SAN via iSCSI. In spite of my raves about object storage, I don't see any reason to go beyond the block ULP for the boot LUN. The SAN NIC in these servers will be RDMA-capable under iSCSI and include an iSCSI boot BIOS that can locate and load the OS from the correct boot LUN. It finds the boot LUN using the same IP-based protocols that let you take your notebook into a coffee shop, get connected to the internet, and type in a human-readable name like 'google.com' and connect to a remote google server. These are, of course, DHCP and DNS.

I will have a pair of clustered rack-mount servers on the SAN running DHCP and other services. The Internet Engineering Task Force (IETF), the body that defines standards for internet protocols has extended DHCP to add the Boot Device as one of the host configuration options. So replacing or adding a new rack server involves entering its human-readable server name, e.g. websrv23, into its eprom and making sure your central DHCP server has been configured with the boot device/LUN for each of your websrvxx app servers. When the new server powers-on, it broadcasts its name to the DHCP server which replies with websrv23's IP address, boot LUN, and other IP configuration parameters. It can then use a local nameserver to find the boot device by name and then load its operating system. The architect for one very large datacenter who is hoping to move to an IP SAN called these Personality-less Servers.

Array Data Services for Boot Volumes
I want a few data services in my arrays to help manage boot and boot images. First, my ethernet-based arrays will also use DHCP to get their IP address and will register their human-readable array name with the DHCP server. In addition to automating network config, this enables the DHCP application to provide an overview of all the devices on the SAN, and to present the devices as one namespace using meaningful, human-readable names.

One data service I will use to help manage boot volumes is fast volume replication so I can quickly replicate a boot volume, add patches/updates that I want to test out, and present that as a new LUN. I'll have app servers for testing out these new boot images and through DHCP I will route these to boot from the updated boot volumes. Once these are tested, then I want to be able to quickly replicated these back to my production boot volumes.

The other array data service I would like is my own invention that allows me to minimize the number of boot volumes I have to maintain. Ninety-some percent of every boot volume is the same and is read-only. Only a small number of files including page, swap, and log files get written to. I would like a variation of snapshot technology that allows me to create one volume in the array and present that as multiple LUNs. Most reads get satisfied out of the one volume. Writes to the LUN however, get redirected to a small space allocated for each LUN and the array keeps track of which blocks have been written to and any reads to an updated block are read from the per-LUN update space. With this feature I can manage one consistent boot image for each type of server on the SAN.

It's a Real Network
This is why I like iSCSI (for now). You get a real network stack with protocols that let you scale to hundreds or thousands of devices and you can get servers where the SAN interconnect is already built-in. Nothing I've described here (except my common boot volume) is radically new. Ethernet, DHCP, DNS, and even the iSCSI ULP are all mature technologies. Only a few specific new standards and products are needed to actually build this part of my ideal SAN:

    iSCSI BIOS Standard iSCSI Adapters with embedded BIOS are available from vendors such as Emulex and Qlogic but they don't use DHCP to find the boot volume and they're not on the motherboard. We need an agreement for the motherboard-resident BIOS for standard NICs. Intel and Microsoft are the big players here.

    SAN DHCP Server Application We need a DHCP server with the IETF extension for configuring boot volumes. It would be nice if the GUI was customized for SANs with menus for configuring and managing boot volumes and features for displaying the servers and storage on the SAN using the single, human-readable namespace. This app should run on standard Unix APIs so it runs on any Unix.

    The Arrays Finally, we need the arrays that support user-assigned names and use those with DHCP configuration. Maybe iSCSI arrays do this already - I haven't looked. Then, some features to help manage boot volumes would be nice.

If anyone who manages a real SAN is reading this, send me a comment.

Saturday, September 09, 2006

Innovation at the Array Level

The block interface is restricting innovation for Arrays even more than disk drives and we are seeing the rapid commoditization that results from this lack of ability to add meaningful value-add features. A couple years ago array marketers used to talk about segmenting the array market into horizontal tiers based on price, capacity, availability, data services, etc. and they used to have three or more tiers. Today, due to commoditization, this has collapsed into only two tiers, as described to me directly by more than one storage administrator. The top tier still exists with EMC DMX and HDS 9000 and other arrays for the highly paranoid willing to pay these high prices. Below that however, is a single tier of commodity arrays. The sales discussion is pretty simple. "Is it a 2U or 3U box?" "How many disks?" "What capacity?" Then, the next questions are "How cheap is it today and how much cheaper will it be next quarter?"

Some would say this is fine and the natural progression in the storage industry. Arrays become what disk drives where fifteen years ago (the thing that stores the bits as cheaply as possible), and higher level data services move to virtualization engines or back to the host stack.* As an engineer seeking to innovate at the system level however, I can't accept this.

As with disk drives, it's a distributed computing problem and there are improvements that can only be done inside the RAID controller. These include improvements in performance, providing an SLA and improving utilization based on the information being stored, securing the information, and complying with information laws. All this requires knowledge of the information that is stripped away by the block protocol.

Arrays try to do this today the only way they can - through static configuration at the LUN level through management interfaces. One problem with this approach is the granularity is too large (LUNs). Another is it's too static and is difficult to manage, especially when combined with the need to manage switch zoning, host device nodes, etc. Finally they are trying to manage the information by applying properties to the hardware storing the information vs. on the information itself. Take for example, the need to keep a log of any user who accessed a particular legal record. One, you can't use LUNs to store each record (so you can't manage it at the LUN level) and two, information doesn't sit on on one piece of hardware anymore. It gets mirrored locally, maybe remotely, and probably backed-up as well. If you're doing HSM, it might even completely migrate off the original piece of storage hardware. Now remember that the law is to track who accessed the INFORMATION, not a particular copy on one array.

If the storage devices are allowed to keep the record grouped (an object) and app servers and storage agree on a protocol for authentication and a format for the access log, then this becomes a solvable problem. Other ways the array can help manage the data is by storing data on the right class of storage such as RAID 5, mirror, remotely mirrored, or a single non-redundant disk drive. To optimize use of storage, these should be applied at the file, or database record level because required storage service levels change at that granularity.

* Note to array vendors. If this is the direction arrays are going the way to win is clear: follow the example of successful disk companies like Seagate. Build high-volume, high-yield manufacturing capability, get your firmware and interop process fully baked, relentlessly drive down HW costs, and standardize form factors and functionality to enable second-source suppliers.

Friday, September 01, 2006

Innovation at the Disk Drive Component

Subtitled: My Free Advice to the Disk Drive Vendors

The disk drive industry has been severely restricted by the limitations of the block interface. By restricting the functionality a drive can expose to that of a 1980's disk drive, they have been limited to primarily innovating along only one dimension of performance - Capacity. Of course, they have made amazing increases in performance but, much as been written about the growing imbalance between capacity and the ability to access that data in reasonable time as well as support a consistent performance SLA. I've also seen several articles lately about problems with sensitive data left on old disk drives. These point to the need for drive vendors to innovate in more dimensions of performance or, saying it differently, add value in other ways than just increasing capacity and lowering cost.

It's a Distributed Computing Problem
If you talk to engineers who work at layers above the disk drive (RAID controllers, volume managers, file systems), you'll get answers like "the job of a disk driver is just to hold lots of data cheaply, we'll take care of the rest". The problem is, they can never solve problems like security, optimizing performance and providing a consistent SLA as well as if they enlist the help of the considerable processing power embedded in the disk drive itself.

Back in the 60's and early 70s, most of the low-level functions of a disk drive were controlled by the host CPU. Engineers could have said: "Hey, our CPUs are getting so much faster, it's no problem continuing to control all these low-level functions". Instead, as employees of vertically-integrated companies like IBM and DEC, they were able to take a systemic view of the problem. They realized the advances in silicon technology could be better used to embedded a controller in the drive where it could be more efficient at controlling the actuator and spindle motor. So, they actually completely changed the interface to the disk drive - a radical and foreign concept to so many computer engineers today. Now, three decades later we are dealing with a whole new set of data storage problems and, the processing power embedded in the disk drive has grown along with increases in silicon technology. Now, as in the 1970s, the right answer is to distribute some of this processing to the disk processor where it has the knowledge, and is in the right location to handle it.

The first thing to realize is these hard drives already have significant processing power built into their controller and in many cases, have wasted silicon real-estate that be used to add more intelligence. This processing power used for thing like fabricating lies about the disk geometry for OS's and drivers that think drive layout is like it was twenty years ago and want a align data on tracks, remapping around bad sections of media, read-ahead and write-back caching, re-ordering I/Os etc. The problem with these last three, is they are being done without any knowledge of the data, severely limiting it's ability to help overall system performance. We need to enable these processor to combine their knowledge of how the drive mechanics really work, with some knowledge of the properties of the data it is storing.

The first problem to address is the growing disparity between the amount of data stored under a spindle relative to the time it takes mechanical components to access it. For example, if an I/O spans from the end of one track to the beginning of the next, It still takes on the order of a millisecond just to re-align the actuator to the beginning of the track on the next platter. Or, if a track has a media defect, it can take many milliseconds to find the data that has been relocated to a good sector. Drives could save many tens of milliseconds if they just knew how data was grouped together. They could keep related data on the same track and avoid spanning defects. This is, of course, one of the key benefits of moving to an object interface.

The next problem to address is how to support a performance Service Level Agreement (SLA). Tell the drive that an object needs frequent, or fast access so it can locate it where seek times are shortest. Tell the drive that an object contains audio or video to make sure it can stream the data on reads without gaps. Allow the OS and drive to track access patterns so the drive can adjust the SLA and associated access characteristics as the workload changes. This has to be done where the knowledge of the drive characteristics is known.

How to Change the Interface
Of course, at the point I'm not telling the drive vendors anything they don't already know. Seagate, in particular, drove creation of the T10 OSD interface and has been a big advocate of the object interface for drives. The problem is, after almost ten years, they have had limited success. As Christensen pointed out, changing a major interface in a horizontally integrated industry is really hard. No one wants to develop a product to a new interface until there is already an established market. This means, not only are there products that plug into the other side of the interface, but they must be fully mature and 'baked' with an established market. So, the industry sits deadlocked on this chicken-and-egg problem. I think there is hope though and here is my advice on how to create a path out of this deadlock.

1. Up-level the discussion and speak to the right audience
The consumer of the features enabled by OSD drives are File System, RAID, and Database application developers. The T10 spec defines the transport mechanism but, that discussion is highly uninteresting to this audience. They need to know specific value they get by storing objects and they need to understand that it's value they can ONLY get by offloading to the embedded disk processor. In addition, it needs to be expressed in their language - as an object API. This is about storing objects and it maps into the Object-Oriented view of development. It's an interface for persisting objects. These objects have some public properties that can be set by the application to define required performance, security and other attributes to be applied when persisting the data. It's basic OO Design 101.

2. Standardize this higher-level API
Seagate has already gets the need for standards and has done it for the transport protocol. I hope some standardization of the higher-level API is happening in the SNIA OSD Workgroup. For any serious developer to adopt an API built on HW features, the HW must be available from multiple sources and different vendors must provide consistent behavior for some core set of functions. Of course, this lets direct competitors in on the game, but it up-levels the game to a whole new level of value.

3. Leverage open source and community development
I continue to see open source leading the way at innovating across the outdated interfaces. HW vendors who are locked into the limitations of these outdated interfaces have the most to gain by enabling value-add in their layer through open-source software but, they seem to have a blind spot here. Leverage this opportunity! It's not about traditional market analyses of current revenue opportunities. It's about showing the world whole new levels of value that your HW can offer and about getting that to early adopters so those features gain maturity.

Many of the pieces are already there. IBM and Intel have OSD drivers for Linux on Sourceforge. One is coming from Sun for Solaris. File systems are there from ClustreFS and Panasas. Emulex has demo'd a FC driver for Linux. Most of the pieces are there Object Persistence API and disk firmware. Also, the beauty of community development is that you don't have to staff armies of SW developers to do it. A small group focused on evangelizing, and creating, leading, and prototyping open development projects is enough. The developers are out there, the customer problems are there, and the start-ups and VC money are out there looking to create these solutions. Finally, although open-source leads the way on Linux and Open Solaris, if the value prop is compelling enough, developers will find a way to do it by bypassing the block stack in Windows which will, in turn, force Microsoft to support this interface so they can insert Windows back into the value chain.

4. Make the technology available to developers as cheaply as possible
The open development community is not going to leverage new HW features if they can't get the HW. Sounds fairly obvious but the FC industry in particular is missing the boat on this. Lustre and PanFS have been implemented on IP. IBM and Intel's OSD drivers on Sourceforge are for iSCSI. The irony is that Lustre and PanFS, which focus on HPTC where they could most benefit from FC performance, have been forced to move to IP, promoting the misconception that FC has some basic limitations that prevent its use in HPTC compute grids.

Any developer should be able to buy a drive and download OSD FW for it. Ideally, this should include not only a set of expensive FC drives, but also a $99 SATA drive available at Frye's. Hopefully the FW development processes at the drive vendors have evolved to the point where it is modular enough that a small team should be able to take the FW source code for a new drive, plug-in the OSD front-end, and release it on a download site for developers.

5. Participate as part of the developer community
Create an open bug database and monitor and address those issues. As early developers use this FW, they need a way to report problems, track resolution, and generally get the feeling the disk vendors are committed to supporting this new API. In addition, consider opening the source for the OSD interface part of the disk FW. The 'secret sauce' for handling properties can still be kept closed. This will accomplish several things. One, it will drive the de-facto standard (one of the primary reasons for open-sourcing anything). Two, it will enable drive vendors to leverage bug fixes and enhancements from the open-source community. Three, it will help build trust from the database/file system/RAID vendors that this interface really is mature and can be trusted and that they retain some control over the ability to find and fix problems. Fourth, it will help enable second-source vendors to implement consistent basic functionality.

This will take time but the ability to innovate along more dimensions than just capacity and the resulting value-add that customers are willing to pay for is worth the long-term investment. The key requirements to adopting this new interface are to communicate the value of this new functionality to the developers who will use it in terms they understand; make the functionality readily available to them and provide as much of the solution as possible; and build their trust by enabling second source suppliers and using early adopters such as the HPTC and open developer community. Finally, if any drive vendor wants help creating a specific plan, send me a note through the contact link on this blog page and we can talk about a consulting arrangement.