
Friday, November 03, 2006

Seeing What's Next, Fourth and Final Part

The third group that Clayton says to watch for signs of change in an industry is the Non-consumers: in this case, people who have storage and information management problems to solve but are not buying current products because they either can't afford them or can't adapt them to their unique needs. As I mentioned in Part I, the most interesting group of non-consumers is in High Performance Technical Computing (HPTC). They have both problems. One, their research budgets don't allow them to buy expensive supercomputers and RAID arrays, and two, today's arrays don't meet their needs anyway. To build 'cheap' supercomputers, they cluster together lots of x64 compute nodes running Linux. Then, because they spread their HPTC compute task across so many compute nodes, they need storage that enables sharing the data between all those processors. So even if they could afford today's block or NAS arrays, those arrays don't support this level of information sharing anyway.

SANs have enabled IT managers to consolidate their storage HARDWARE, but not necessarily their information. Block arrays don't have the capability to effectively share the same information between multiple application servers. Even if they did, most filesystems still assume they use captive DAS disks anyway. So SAN admins have to partition their storage through LUN masking and switch zoning so that each filesystem instance thinks it's talking to captive, direct-attached storage.
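The partitioning described above can be pictured as a simple masking table. This is a hypothetical sketch of the concept, not any vendor's actual API; the WWNs and LUN names are invented:

```python
# Hypothetical sketch of LUN masking: each initiator (a host HBA,
# identified by its WWN) is allowed to see only a subset of the array's
# LUNs, so each host's filesystem behaves as if it owned captive,
# direct-attached disks.

ARRAY_LUNS = {0: "oracle-data", 1: "oracle-logs", 2: "mail-store", 3: "web-content"}

# Masking table maintained by the SAN admin (WWNs are made up).
LUN_MASKS = {
    "10:00:00:00:c9:2b:11:01": {0, 1},   # database server sees LUNs 0-1
    "10:00:00:00:c9:2b:11:02": {2},      # mail server sees LUN 2
    "10:00:00:00:c9:2b:11:03": {3},      # web server sees LUN 3
}

def visible_luns(initiator_wwn):
    """Return the LUNs this initiator is allowed to address."""
    allowed = LUN_MASKS.get(initiator_wwn, set())
    return {lun: name for lun, name in ARRAY_LUNS.items() if lun in allowed}
```

Notice that no LUN appears in two masks - that disjointness is exactly the "no information sharing" limitation the paragraph describes.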

This won't work for highly parallel HPTC tasks. They need a filesystem that can share files between multiple compute servers. Filesystems such as Sun's QFS do this with block storage, but it's inefficient. Compute nodes have to spend a lot of time communicating with each other to compare which blocks are used and free, and which nodes have which blocks opened for read or write access. It's much more efficient to let the storage server do this, which means, of course, it has to know how data is grouped into files and it has to manage ownership properties with each file. In other words, it needs an object storage interface. NFS is close but won't have the right ownership properties until V4++. So leading HPTC filesystems such as Lustre and Panasas' PanFS invented their own proprietary object storage protocols that run over IP. Lustre users assemble their own object storage device using Linux on a commodity server. Panasas builds an object storage array, providing one of the first commercial products for this market.
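The efficiency argument above can be made concrete with a toy model. In the sketch below the storage server, not the clients, owns block allocation and open-mode arbitration, so each compute node talks only to the server rather than negotiating with every other node. This is purely an illustration of the idea; it is not Lustre's or the OSD standard's actual protocol, and all names are invented:

```python
# Toy object-storage server: the server tracks which blocks belong to
# which file object and which node has each object open, so clients
# never have to compare allocation maps with each other.

class ObjectServer:
    def __init__(self, num_blocks):
        self.free = set(range(num_blocks))   # server owns allocation state
        self.objects = {}                    # object id -> list of blocks
        self.opens = {}                      # object id -> {node: mode}

    def create(self, obj_id, nblocks):
        """Allocate blocks for a new object; clients never see the free map."""
        blocks = [self.free.pop() for _ in range(nblocks)]
        self.objects[obj_id] = blocks

    def open(self, obj_id, node, mode):
        """Grant or refuse access; the server arbitrates conflicting writers."""
        holders = self.opens.setdefault(obj_id, {})
        if mode == "w" and any(m == "w" for n, m in holders.items() if n != node):
            return False
        holders[node] = mode
        return True
```

With N compute nodes, opening a file costs N requests to one server instead of on the order of N*(N-1) node-to-node block negotiations - which is the point the paragraph makes about letting the storage server do the work.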

One of the reasons this disruption is so interesting is that I've talked to several enterprise datacenter managers who want similar capability. They want to move enterprise applications from large servers to racks of scalable x64 blades. They want the ability for multiple instances of an application running on separate compute nodes to share information (not just array hardware). They also want quick re-provisioning, so an application can migrate to a new compute node without the manual reconfiguration of LUN masks and switch zones that block SANs require today. And, of course, they will need SAN security and access logging with this to comply with information laws. HPTC is evolving this technology, mostly through open source. They are evolving object protocols that allow information sharing AND provide the underlying object protocol on which security and compliance features can be added as sustaining enhancements.

Like the old guy in The Graduate whose advice to young Dustin Hoffman is "I have just one word for you: plastics," I have just one word for the future of storage: NAS. NAS has been a classic low-end disruption growing among overshot customers. Netapp has grown with customers who don't need the level of performance and availability of an FC SAN and want the low cost and ease-of-use of an IP network-based filer appliance. In parallel, the sustaining enhancements to give NAS the same performance and availability as FC SANs are in process. These include NFS over RDMA and NFS V4++ enhancements to enable multipathing and client data services. IP-based NAS WILL come up and displace Fibre Channel. No question.

Not only is NAS a low-end disruption, it is also the best candidate to become the next radical up-market sustaining innovation for undershot customers. These customers need storage servers that get data in meaningful groups with properties to let them manage compliance with information laws, manage content, put it on the right tier of storage, etc. Once NAS has the required level of performance and availability, then it just takes a series of straightforward sustaining innovations to give it those capabilities.

Finally, object-based storage on IP is growing as a new-market disruption in HPTC. The FC HBA vendors appear to have no interest in growing up to high-performance computing (while they lose the low end to IP), so these users have moved to IP. They have also avoided NFS because they need sharing properties not available in NFS today, and they don't need the full functionality of the embedded file system in a NAS server. I predict, though, that once those properties, along with RDMA, are available in NAS, someone will build a NAS server optimized for Lustre. Then the economics of commodity NAS arrays will make them attractive to HPTC.

Down at the spinning rust, data will always be stored in groups optimized for the hardware, such as 512-byte blocks. For availability, it will also make sense to group disks together with RAID hardware optimized for those physical blocks, but these block RAID arrays will become commodity devices. The question for these vendors is whether to be the winner in that commodity business, or to embed a filesystem and meaningful data management and compete as a NAS storage server. As for which players will pursue which strategy and who will win, I don't know. Clayton says one of the natural laws is that companies try to move up-market. That says one or more existing disk drive suppliers may try to leverage their existing manufacturing capability and OEM relationships and do RAID trays. It also says existing block RAID vendors will move up to doing NAS servers. EMC has indicated it may go this way by acquiring Rainfinity. In parallel, existing NAS vendors such as Netapp will continue the sustaining innovations to give their products the availability, performance, and data integrity required to compete with Tier 1 block arrays. That will be one of the races to watch: who gets to enterprise-class NAS first.

Wednesday, November 01, 2006

Seeing What's Next Part III, Overshot Customers

In 'Seeing What's Next' Clayton describes signals indicating that a group of customers has been overshot, and some of the industry changes that might result. The key signal is customers who lose interest in new features and are unwilling to pay extra for them - they just want the product to be cheaper and easier to use. This sure sounds like what's happening with midrange and volume disk arrays. Customers used to pay a lot to get RAID arrays from a trusted supplier like EMC who could make them reliable, and they might even pay extra for features like snapshot and replication. Not anymore. Today most customers assume RAID technology has matured to the point where it 'just works', and features like snapshot are like power windows - just assumed to be there.

Another signal of overshoot is the emergence of new standards that disaggregate parts of the value chain. This has happened at both the datapath and management interfaces. Five years ago, when you bought an array, you had to get that vendor's driver stack, which included their multipathing driver and, in the case of EMC, HBAs with custom firmware. Today there is a standard for multipathing built into Solaris, available for Linux, and in development for Windows, and you can use standard Qlogic or Emulex HBAs from your favorite reseller. At the management interface, SNIA has now standardized things so you can buy an array from your favorite RAID vendor and plug it into a separate third-party SAN management tool. The result: a RAID array has become an interchangeable, plug-compatible component like a disk drive.

The next indicator of overshoot is the emergence of a new business model that emphasizes convenience and low price. Sure enough, just this morning I saw an announcement regarding the arrays Dell OEMs from EMC. Five years ago you had to create a relationship with EMC to use their products, use their support, use their management software, and install their HBAs and stack on your servers. Today you go to a click-through order form on a website, a UPS truck delivers the array, and you plug just that component into your datacenter.

One type of change Clayton predicts when there are overshot customers is what he calls displacing innovations. This is an innovation that introduces a new product that plugs into points of modularity (aka standard interfaces). We are seeing this as well with the emergence of virtualization devices that sit between the application servers and storage arrays. Pirus (acquired by Sun), StoreAge (just acquired by LSI), Crosswalk, and Incipient are a few of the startups in this space. Cisco and Brocade are adding this capability, and pNFS is an emerging open standard to move this higher-level virtualization out of the array. The result: RAID trays effectively become what disk drives were ten years ago - the commodity hardware that holds the data behind a higher-level virtualization device.

I think that's the key clue to what's next. RAID arrays become what disk drives were ten years ago. Twenty years ago I developed disk firmware in Digital's disk engineering group. In the early nineties, once the disk interface standardized, it wasn't worth competing with Quantum and Seagate, and we moved up the value chain to do RAID instead. Now the next cycle is happening. HP, Sun, and IBM move up another level and OEM the whole RAID array. EMC has already stated they plan to become a software company, so it won't surprise me if and when they start OEM'ing the basic RAID arrays.

This creates OEM opportunities for the RAID vendors who want to compete in this commodity market. Because this looks so much like what disk drives went through fifteen years ago, I think the role models are companies like Seagate that won in that disruptive cycle. They very precisely standardized form factor and functionality so they could be second sources for each other at the big OEMs, got very good at high-yield, high-volume manufacturing, reduced their product cycle times, and relentlessly drove down cost with each cycle. Who wins this battle? I don't know. Maybe one of the existing RAID suppliers like LSI or Dothill wins. On the other hand, Clayton's books explain that one of the natural laws of business is that companies try to move to the next level of value, so maybe one of today's disk drive companies becomes the next big supplier of RAID trays. Like Clayton says, don't use his book to pick stocks. Use it to see higher-level trends in the industry. Maybe this is one.

Tuesday, October 31, 2006

Seeing What's Next Part II, Undershot Customers

The undershot market includes those enterprise datacenters that are 1) trying to keep the company's mission-critical data on centralized storage servers and serve that data to many application servers on a large storage network; 2) keep that information 100% available even in the face of major disasters; 3) manage an increasing amount of information while their IT budget gets cut every year; and 4) comply with all the laws and regulations for storing, securing, and tracking the various types of information stored in digital form. I've talked to several datacenter managers who simply don't know how to do all of this with the products available today. At best, they can put together various point solutions (often advertised as 'Compliance Solutions' but really just tools for solving a piece of the problem). These solutions require integrating a variety of components and lots of administration to get everything working together. These undershot customers would happily pay more for improved products that provide a relatively simple, integrated solution to these four requirements.

As Clayton describes, incumbent players (EMC, IBM, etc.) are strongly motivated to add new features that they can charge these undershot customers for, especially in a situation like today where traditional block arrays are rapidly becoming commodities. The problem is that sometimes what he terms a 'radical sustaining innovation' is required. This is when a major re-architecture that changes the whole system from end to end is needed to meet new customer needs. An example is when AT&T changed its whole network from analog to digital in the 1970s. They just couldn't move forward and add significant new value without making that end-to-end change.

That's where storage is today. I've repeated that message in this blog but will say it again because Clayton makes this such an important point for understanding innovation. The block interface, which is the primary interface for almost all enterprise-class storage, is over thirty years old now. The basic assumptions the block interface was based on don't apply anymore at these undershot customers. Block-based filesystems assumed storage was a handful of disks owned by the application server. They assumed the storage devices had no intelligence, so they disaggregated information into meaningless blocks. A block storage array has no inherent way to know anything about the information it is given, so it's extremely limited in its ability to manage any information lifecycle, or comply with information laws, or do anything else involving the information. The problem is that now that information has moved out of the application server, away from the filesystem or the database application, and now that it lives in the networked storage server, the information MUST be managed there. This is why an object-based interface such as NFS, CIFS, or OSD, which allows the storage server to understand the information and its properties, is essential. So the question is, who can create this 'radical sustaining innovation' with its changes up and down the stack, from databases and filesystems down to the storage servers?

If we were back in the early 1980s this would be easy. One of the vertically integrated companies such as IBM or DEC would get their architects from each layer of the stack together, create a new set of proprietary interfaces, and develop a new end-to-end solution. Then, if they did their job right, the undershot customers would be happy to buy into this new proprietary solution because it does a better job of solving the four problems above. The problem today is that many of these companies either don't exist anymore, or they've lost the capacity for this kind of radical innovation after twenty years in our layered, standards-based industry.

Who could pull off this radical sustaining innovation spanning both the storage and the application server? Clayton recommends looking at various companies' strengths, past records, and their resources, priorities, and values to attempt to identify who the winners might be. Here's my assessment of the possible candidates.

IBM is probably the farthest along this transition with its StorageTank architecture. StorageTank is a new architecture for both filesystems and storage arrays that does just what I described above - use an object protocol to tell the storage array about the information so it can be managed and tracked. What I don't know is how successful it has been or how committed IBM is to this architecture. Twenty years ago it was in IBM's DNA to invest in a major integrated architecture like this. Whether that's still the case, I don't know. Another question is how well the array stands up as a block array in its own right. Datacenters are heterogeneous. It's fine to implement enhanced, unique functions when working with an IBM app server, but the array must do basic block to support the non-IBM servers as well. Will customers buy StorageTank arrays for this? I don't know.

Microsoft is a strong candidate to be one of the winners here. They have a track record of doing interdependent, proprietary architectures. They did it with their web-services architecture. They are moving into the storage server with their Windows Storage Server (WSS) - a customized version of the Windows OS designed to be used as the embedded OS in a storage server. Although I don't know specifics, I would bet they are already planning CIFS enhancements that will only work between WSS and Windows clients. Another strength of Microsoft is that they understand that, in the end, this is about providing information management services to the applications and gaining the support of developers to use those services. As storage shifts from standard block 'bit buckets' to true information management devices, this application capture becomes more important. It is not part of the DNA of most storage companies. Finally, as we all know, they are willing to take the long-term perspective and keep working at complex new technology like this until they get it right. On the flip side, these undershot customers tend to be the high-end mission-critical datacenter managers who may not trust Microsoft to store and manage their data.

EMC is acting like they're moving in this direction, albeit in a stealthy way. They have a growing collection of software in the host stack, including Powerpath (with Legato additions) and VMWare. They also have a track record of creating and using unique interfaces, so it would not be out of character to start tying their host software to their arrays with proprietary interfaces. What they don't have, though, is a filesystem or, to my knowledge, an object API used by any databases. They also don't have much experience at the filesystem level. They could build a strong value proposition for unstructured data if they acquired the Veritas Foundation Suite from Symantec. An agreement with Oracle on an object API, similar to what Netapp did with the DAFS API, would enable a strategy for structured data.

Oracle is a trusted supplier of complex software for managing information, and they have been increasing their push down the stack. They have always bypassed the filesystem, and now, with Oracle Disk Manager, they are doing their own volume management. Recent announcements relative to Linux indicate they might start bundling Linux so they don't need a third-party OS. With the move to scalable rack servers and the growth of networked storage, they must be running into the problem that it's hard to fully manage the information when it resides down on the storage server. This explains Larry's investment in Pillar. My prediction is that once Pillar gains some installed base in datacenters as a basic block/NAS array, we will see Oracle-specific enhancements to NAS that only work with the Pillar arrays.

Linux and open development
Clayton groups innovation into two types. One type is based on solutions built from modular components designed to standard interfaces - for example, a Unix application developed to the standard Unix API, running on an OS that uses standard SCSI block storage. Here, innovation happens within components of the system, and customers get to pick the best component to build their solution. The second type is systems built from proprietary, interdependent components, such as an application running on IBM's MVS OS which in turn uses a proprietary interface to IBM storage. Because standard interfaces take time to form, the proprietary systems have the advantage when it comes to addressing the latest customer problems. When new problems can't be solved by the old standard interfaces, it's the proprietary system vendors who will lead the way in developing the best new solutions.

What Clayton doesn't factor in, however, is the situation in the computing industry today, where we have open source and open community development. A board member of the Open Source Foundation once explained to me that the main reason to open source something is not to get lots of free labor contributing to your product. The main reason is to establish it as a standard. For example, Apache became the de-facto standard for how to do web serving. This is happening today with NFS enhancements such as NFS over RDMA, pNFS, and V4++. They first get implemented as open source, and others look to those implementations as the example. Because Linux is used both as an application server AND as an embedded OS in storage servers, both sides of a new interface get developed in the open community and can quickly become the de-facto standard. This is what I love most about Linux and open source. Not only is it leading so much innovation within software components, but when the interface becomes the bottleneck to innovation, the community invents a new interface and implements both sides, making that the new standard.

Friday, October 27, 2006

'Seeing What's Next' in the Storage Industry, Part I

I'm reading Clayton Christensen's latest book, titled Seeing What's Next. This is the third book in a series, after The Innovator's Dilemma and The Innovator's Solution. It leverages the theories from those two but, as he says in the preface, "Seeing What's Next shows how to use these theories to conduct an 'outside-in' analysis of how innovation will change an industry." Perfect. Let's do our homework by applying this to the storage industry - specifically array subsystems and associated data services and storage networking products.

Clayton says that to identify significant change in an industry, look at three customer groups: Non-consumers, Undershot customers, and Overshot customers. I see all three of these in the storage industry.

Non-consumers and Storage
Clayton defines these as potential customers who are not buying the products today because they either can't afford them, or for some reason don't have the ability to use the existing products or to apply them to the problem they are trying to solve. Instead they either go without, hire someone else to do it, or cobble together their own 'less-than-adequate' solution. These customers can be important because this is where new technology that appears sub-optimal to existing customers can mature to the point where it becomes attractive to the mainstream. Clayton calls this a 'new-market disruptive innovation'.

I see two such groups for storage subsystems and data services. One group has always been there - the small office/home office market. Most of these customers are still not buying $10k NAS or RAID servers, installing them at home, backing them up, remote mirroring, etc. Instead they put data on a single HDD and maybe manually back up to a second drive (cobble together a less-than-adequate solution), or use an SSP such as Google or Apple's iDisk (hire someone else to do it). A few players are pushing into this space, such as Stonefly and Zetera with their Storage-over-IP technology, but I'm personally not seeing anything that justifies a significant 'new-market disruption'.

The more interesting group that meets the definition of 'non-consumers' is the new group of High-Performance Technical Computing (HPTC) users. These users have big high-performance computing jobs to do, but their research budgets don't allow them to buy multimillion-dollar mainframes and storage subsystems. They have figured out how to build their own parallel computers using lots of commodity x64 rack servers harnessed together with customized Linux operating systems. Part of customizing Linux to run these highly parallelized compute jobs is to use the Lustre File System. Lustre uses its own object storage protocol on top of commodity IP interconnects so that storage can be effectively shared between many compute nodes. Then, on the storage side, they cobble together their own storage servers using Linux with added Lustre target-side components on commodity hardware.

Much of this solution would be considered 'not good enough' by many mainstream enterprise storage customers - ethernet is considered too slow, availability and data integrity are not sufficient, and it requires an 'assemble-it-yourself' Linux storage array. As Clayton's books describe, however, this is exactly how many disruptive innovations start. In addition, the sustaining enhancements to make this acceptable to mainstream datacenters are in process in the open-source community and associated standards bodies. These include RPC over RDMA and 10G ethernet that will exceed current FC performance, improved availability and reliability through enhancements in NFS and pNFS, as well as sustaining Lustre enhancements driven by Cluster File Systems, Inc. I've talked to several managers of large datacenters who are interested in migrating key applications such as Oracle and web services from large mainframes to scalable rack servers and who are watching and waiting for the storage and SAN technology to support that migration. So this one clearly looks like a new-market disruption in process.

Overshot Customers
These are the customers for whom existing products more than meet their goals and who are not willing to pay more for new features. The customers that were in the soon-to-be-extinct segment called 'mid-range storage' meet this definition. They just want reliable RAID storage and MAYBE a few basic data services such as snapshot, etc. I've talked to several ex-midrange customers who know there are low-end arrays that meet their reliability and availability goals, and they just want to know how cheaply they can get them.

The other giveaway that overshot customers exist is the rapid growth of companies supplying what used to be considered not-good-enough technology. We see this in the growth of Netapp and NAS storage. NAS has been considered sub-optimal for datacenter storage for several reasons: no RDMA, limited path failover, and data can't be seamlessly migrated between NAS servers or striped across them. Netapp's growth proves that more customers are finding these limitations acceptable. In parallel, the NFS and pNFS enhancements in process will solve these reliability/availability restrictions. So I'm betting that this is a low-end disruption that will just keep growing.

Undershot Customers
Undershot customers are those who have trouble getting the job done with the products available today and would be willing to pay more for added features that help. These are the customers companies love: they can invent new features and charge more for them. Storage has lots of undershot customers driven by the need to comply with information laws while at the same time consolidating storage on the storage network and keeping it 100% available. The storage industry is putting a lot of energy into claims that it can add features to help these undershot customers. The problem, though, as described in The Innovator's Solution (chapter 5) and in my post on Disruption and Innovation in Data Storage, is that sometimes a component designed to a modular interface can't effectively solve new problems due to the restrictions of the now-outdated standard interface. The inventors of the block interface standard never planned for today's data management problems or the type of highly intelligent, networked storage that we have today. In a case like this, the competitive advantage shifts to an integrated company that can invent a new proprietary interface and provide new functionality on both sides of it (in this case, both the storage server AND the data client/app server). This has been a common theme of mine in this blog. Designing a storage system to meet these new requirements requires a new interface between app server and storage or, stating it in Clayton's terms, the competitive advantage has swung to companies with an interdependent architecture that integrates multiple layers of the stack across a proprietary interface.

I'm going to leave it at that for today and continue in Part II. In the meantime, think about EMC and the host software companies it's been acquiring, or Microsoft with its move to the storage side with VDS, or Oracle and its investment in an array company (Pillar).

Wednesday, October 25, 2006

Confirmation that File Virtualization is hot

SearchStorage released an article today titled File virtualization tops hot technology index. As described here: "File virtualization beat archiving, data classification, encryption and ILM to the top spot," said Robert Stevenson, managing director of TheInfoPro's storage sector. He attributes the interest in file virtualization to network attached storage (NAS) growth in the data center, with average capacity deployments in the last month at 220 terabytes (TB) and the long project timelines needed for block virtualization. "Storage professionals have been focusing their near-term energies and budgets on improving file content management," Stevenson said.

You read it here first
This aligns with my ideal SAN as described in my post on 19-Sep. In my SAN, information is stored in meaningful groupings (files) with associated properties that let the storage servers apply useful data services based on the information being stored. In addition, I want many of the virtualization features available for today's block SANs including mirroring, striping, and remote replication across multiple storage devices. It seems I'm not alone and that Fortune 1000 storage admins are interested in the same thing.

Note to Technology Venture Investors
The article goes on to state that there are no standards around this technology yet, so every file virtualization and namespace technology has a different way of talking to storage services. That is true, but the underlying standards are evolving in the form of pNFS and NFS V4++. As I commented in my pNFS post, helping to evolve pNFS and building the data services and APIs above it top my list of promising start-up opportunities. If I were leading such a startup, I would build this software in open source, on Linux, partner with storage vendors such as NetApp, Panasas, and maybe even EMC, and define the APIs in cooperation with as many ISVs as I could, including Oracle and Microsoft. Such a startup would have a good chance of being acquired by one of these companies.

Tuesday, October 10, 2006

Notes on Various Storage-related ISVs

I'm reading the material on the websites of various storage software startups - trying to get past the grand claims of how their widget will solve all your storage and data management problems and dig out the hidden clues to what their products can, and can't, do. Here they are, in no particular order:

Actually, not a bad website. They provide an algorithm for compressing structured data records. That, in itself, is nothing new, but what's unique is that while the records are compressed, they retain the ability to be searched by keyword (using SQL) to help comply with data laws. So, for example, you could keep all your OLTP transaction records for the last year on nearline VTLs, and the finance or legal department could query them at any time.

The amount of compression varies. According to their website, the algorithm de-duplicates fields in records, keeping only one copy and replacing other copies with references to the single copy of that field. The presumption is that most databases store many copies of the same data in different records. I don't know how to verify that, but they claim you can achieve up to 10:1 compression.
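To make the mechanism concrete, here is a minimal sketch of field-level de-duplication with reference substitution. This is my guess at the general approach from the website's description, not the vendor's actual algorithm, and the keyword search shown is a stand-in for their SQL interface:

```python
# Hypothetical sketch: store each distinct field value once; records
# become lists of integer references into the value table. Searching
# works directly on the compressed form.

def dedup(records):
    values, index = [], {}          # one copy of each distinct field value
    encoded = []
    for rec in records:
        row = []
        for field in rec:
            if field not in index:
                index[field] = len(values)
                values.append(field)
            row.append(index[field])  # reference instead of a copy
        encoded.append(row)
    return values, encoded

def search(values, encoded, term):
    """Keyword search against the compressed records (no expansion needed)."""
    if term not in values:
        return []
    ref = values.index(term)
    return [i for i, row in enumerate(encoded) if ref in row]
```

The claimed up-to-10:1 ratio would then depend entirely on how often field values repeat across records - which is exactly the presumption questioned above.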

Seems like useful technology that meets a real need. What I don't have a good feel for is how to operationalize it. Do you create a snapshot every day, then run it through this application as you copy to the archive? Do you run the daily incremental backup through it, and can the application put that together with previous incremental backups? I'm curious. I'm also curious whether most databases really duplicate that much data and, if they do, how long it will be before they add a built-in feature to compress the database into a similar searchable nearline archive.

Continuity Software
Software that analyzes your SAN topology and identifies data protection and availability risks such as configuration errors, inconsistent LUN mapping, unprotected data volumes, etc. Includes a knowledge-base that provides suggestions for best-practices around disaster recovery and data recovery.

What's not clear is how the application gets the data to analyze and how it gets updates when changes are made. The software is 'agent-less' but they claim it has automation to detect the configuration on its own. They also sell a 'service offering' (translated: you pay for the labor for them to come into your datacenter and do the work). They collect the configuration and enter it into the tool, which in turn shows you the risks.

Scalant produces a layer of system software that turns a compute grid into a 'super cluster' providing HA and load balancing. This is something Sun claimed to be doing a few years ago (anyone remember N1?). It includes three components. Like traditional clusters, it includes a layer on each compute node that monitors the health of that node and provides a heartbeat so others know it's alive. The second component, unlike traditional clusters, is a monitor (which itself runs on an HA-clustered pair of servers) that tracks the overall health of each compute node and receives heartbeats. It also stores the 'context' of the various applications running on the grid. It detects the failure of an application on a compute node and restarts it on another one. The third component is the config and monitoring GUI.
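The heartbeat-plus-central-monitor split described above can be sketched in a few lines. This is a minimal illustration of the pattern, not Scalant's implementation; the class names and the timeout value are made up.

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a node is declared dead

class GridMonitor:
    """Central monitor: receives heartbeats, stores application contexts,
    and reports dead nodes so their apps can be restarted elsewhere."""

    def __init__(self):
        self.last_seen = {}   # node -> timestamp of last heartbeat
        self.contexts = {}    # node -> list of application contexts it runs

    def heartbeat(self, node, now=None):
        self.last_seen[node] = now if now is not None else time.time()

    def place(self, node, app_context):
        self.contexts.setdefault(node, []).append(app_context)

    def check(self, now=None):
        """Return (dead_node, apps) pairs; the caller restarts the apps
        on surviving nodes."""
        now = now if now is not None else time.time()
        failed = []
        for node, seen in list(self.last_seen.items()):
            if now - seen > HEARTBEAT_TIMEOUT:
                failed.append((node, self.contexts.pop(node, [])))
                del self.last_seen[node]
        return failed
```

The per-node agent in a real system would also watch individual applications; here only whole-node failure detection is shown.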

What's interesting to me about this is the implications for the storage network. FC is not a good choice for this type of compute grid. One, it's too expensive and not available onboard these types of scaleable compute nodes. More importantly, it doesn't have good functionality for the sharing and dynamic reconfiguration you really want to support automatic migration of applications around a large compute grid. You really want a SAN like I described in My Ideal SAN.

First, you want sub-pools of compute nodes running the same OS configs and you want easy scaleability. So no internal boot disks to manage. You want IP-based boot with a central DHCP server to route the blade to the right boot LUN. You would like data services in the array so all these sub-pools can boot from the same volume. Then, you would like the Application Contexts managed by the Scalant cluster monitor to include references, by name, of the data volumes that application needs so when it instructs a compute node to startup an app, it knows how to find, and mount the data volumes it needs. Finally, you would like some form of object-based storage that can share data between multiple nodes to support parallel processing as well as HA failover clusters.
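The key idea in that last paragraph is that an Application Context carries the *names* of its data volumes, so whichever node restarts the app can resolve and mount them. A rough sketch, with the directory contents and names entirely illustrative (real code would resolve through DNS/DHCP and run the iSCSI initiator plus a mount):

```python
# Stand-in for a name service mapping volume names to iSCSI targets.
VOLUME_DIRECTORY = {
    "oracle-data": "iqn.2006-10.example:vol.oracle-data",
    "oracle-logs": "iqn.2006-10.example:vol.oracle-logs",
}

def start_app(app_context):
    """Resolve and 'mount' every named volume in the context, then
    return what would be launched. Mounting is stubbed out."""
    mounts = {}
    for name in app_context["volumes"]:
        target = VOLUME_DIRECTORY[name]   # lookup by name, not by device path
        mounts[name] = target             # real code: iscsiadm login + mount
    return app_context["app"], mounts
```

Because the context references volumes by name rather than by device path, the same context works on any compute node in the grid.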

OK, this is the first company I've researched today who doesn't have the smiling person on the homepage. I like them already.

Coppereye is addressing the same problem as Clearpace above. The need to quickly search large transaction histories based on content/keywords. Unlike Clearpace, Coppereye indexes data in place and builds a set of tables that fit in a relatively small amount of additional storage. They claim their algorithms and structure of the tables allow for flexible, and high-speed searches. Although they never explicitly mention structured vs. unstructured data, their site usually talks in the context of searching transactions so I think the focus is structured data. I didn't see a mention of SQL but they do have a graphical UI. Here's their description:

CopperEye Search™ is a specialized search solution that allows business users to quickly find and retrieve specific transactions that may be buried within months or years of saved transaction history. Unlike enterprise search solutions, CopperEye Search is specifically targeted at retrieving records such as credit card transactions, stock trades, or phone call records that would otherwise require a database.

Datacore is not new and is not a startup, although they are a private company. They provide a software suite that lets you turn a standard x86 platform into an in-band virtualization device. A key feature is the ability to under-provision volumes and keep a spare pool that is dynamically added to volumes as necessary. Other features include intelligent caching, data mirroring, snapshots, virtual LUNs, LUN masking, etc. It runs on top of Windows on the virtualization device, so it can use the FC or iSCSI drivers in Windows, including running target mode on top of either.

This looks like a nice product. It's been shipping since 2000 and is up to rev 5 so it ought to be pretty robust and stable. It runs on commodity hardware and can use JBODs for the back-end storage. Provided the SW license is reasonable, this can be a nice way to get enterprise-class data management on some very low-cost hardware.

Avail has developed a SW product that works with the Windows file system to synchronously replicate files among any number of Windows systems. They call it the Wide Area File System (WAFS). It replicates only the changed bytes to minimize data traffic, and it works through the HTTP protocol so it can pass through any firewall that permits HTTP traffic, so it truly works over WANs. It can replicate from a desktop to a server or between desktops. Users always open a local copy of the file, but the local agent gets notified if a change has been made to one of the remote copies and it makes sure any reads return the most recent copy of the data. It does this by implementing a lightweight protocol so that as soon as a file or directory (folder) is updated, all mirrors get quickly notified, although actual data movement may happen in the background.
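"Replicates only the changed bytes" comes down to diffing two versions of a file into (offset, data) patches and shipping just those. Here's a block-granularity sketch of that idea; it's my illustration, not Avail's wire format, and the 4-byte block size is only for demonstration.

```python
def changed_ranges(old: bytes, new: bytes, block=4):
    """Compare block by block; emit an (offset, data) patch for each
    block that differs between the two versions."""
    patches = []
    for off in range(0, max(len(old), len(new)), block):
        o, n = old[off:off + block], new[off:off + block]
        if o != n:
            patches.append((off, n))
    return patches

def apply_patches(old: bytes, patches, new_len):
    """Rebuild the new version on the remote side from the old copy
    plus the shipped patches."""
    buf = bytearray(old[:new_len].ljust(new_len, b"\0"))
    for off, data in patches:
        buf[off:off + len(data)] = data
    return bytes(buf[:new_len])
```

For a large file with a one-byte edit, only the block containing the edit crosses the WAN instead of the whole file.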

Provided this is robust, it's kind of cool technology. It allows both peer-to-peer file sharing as well as backup/replication.

That's all for today. I'll follow up with a Part II in a few days.

Monday, October 09, 2006

Array Chart Rev 2 and Responses to Comments

I added another array vendor to my Array chart - Agami. They are another startup doing scaleable NAS with file-aware services that runs on low-cost commodity hardware. So, they go in the upper left somewhere near Isilon. I updated the chart in my previous post.

Responses to Several Comments
Thanks for the great comments over the last several weeks. Good comments on NFS V4 that I need to do more research on. Here are responses to some of the others:

Good corrections and additions to my description of Isilon. I updated my notes in Part II.

EMC and Innovation
Good point that just because EMC acquired a bunch of companies doesn't make them an 'innovator'. I guess I'm giving credit to those new EMC employees who did some new and unique things before getting acquired by EMC. In particular, I like what VMWare, Rainfinity, Invista, and some of the other Data Services start-ups created.

iSCSI IP overhead and ATA over Ethernet
One comment questioned why anyone would take the overhead of running IP for iSCSI traffic versus just using ATA over Ethernet, and who is using IP to route iSCSI traffic. I'm not enough of an IP expert to really answer that, but I do have two comments. One, I've talked to several datacenter managers who are looking at moving apps from big iron to dual- and quad-x64 rack servers. They're finding they have plenty of spare processing power and, for now at least, wouldn't even notice if they had a more efficient protocol stack. Second, a big reason for using iSCSI is to get the automated network utility protocols such as DHCP and DNS for their Ethernet SAN. Can ATA over Ethernet work in such a network?

Had a comment requesting more insight into what's going on in Sunlabs. I'm not with Sun anymore (which is why I've moved to Blogspot for weblogging), but Sun really has become very open and transparent and you can put together a lot by reading their blogs and by looking at OpenSolaris. Jonathan's blog is always a good way to learn where his head is at. In his post The Rise of the General Purpose System he talks about custom storage hardware getting replaced with commodity HW running specialized, but open-source-based software, and specifically mentions their Thumper project that packs 48 drives into a 4U enclosure with a standard x64 motherboard. Another interesting one is Jeremy Werner's Weblog where he talks about Honeycomb. This is software, based on Solaris, that stores data reliably across a large number of commodity storage platforms (such as Thumpers) and provides a Content Addressable Storage (CAS) API. So, imagine building your own Google inside your datacenter for your company's knowledge base.

Other visible datapoints: Lots of blogs and visibility (including open source) around ZFS and its interesting file-level data services. OpenSolaris.com includes some interesting storage-related projects including iSCSI with both initiator and target-side functionality, and an Object SCSI Disk (OSD) driver. Sun continues to lead enhancements to NFS to give it the availability and performance to finally displace block/FC in the enterprise datacenter. Many highly mission-critical datacenters continue to run SAM-FS to automatically archive and store huge filesystems across both disk arrays and tape libraries.

Put all this together and what you might expect are NAS storage products with an iSCSI option, based on Solaris, running ZFS, with standard AMD64 and Sparc motherboards. They will come in scaleable, rack-mount form factors. You might have the option to run Honeycomb to build large content-searchable storage farms or the option to use SAM to archive data to tape libraries.

Keep the comments coming!

Monday, October 02, 2006

Array Vendor Chart

I've been trying to figure out how to map all these vendors onto a single chart. They can be rated on many criteria so it's hard to reduce them to a two-dimensional chart but, I'm going to try. I believe storage of the future will be based on technology that reliably stores and manages information on scaleable commodity HW much like the trend with rack servers today. So I created a two-dimensional chart that maps new innovation (at managing and storing information) vs. cost. New innovation is on the y-axis and higher is better. Cost is on the x-axis and lower is better. This chart will be an ongoing work-in-progress but, my first revision is shown below.

I've tried to show where the customer groups fall on these scales. For example, mission-critical enterprises are willing to pay for expensive products and service, so they fall far to the right. They have tough data management problems so they need some innovation, but they are too risk-averse to go for very new technology, so they fall midway along the y-axis. SMB is to the left of that. They typically want basic block storage with common features but want to save money, so they don't buy the most expensive equipment. HPTC, in the upper left, is leading the way in innovating technology for solving tough computing problems but typically has limited budgets, so they are using the community development process to drive innovation through open-source software.

Placing the Vendors
Now, where to put the array companies? EMC is clearly the cost leader (loser?) so they go far to the right. In terms of innovation, I give them the benefit of the combined invention of all the companies they've acquired, but they basically stay within their block storage framework and stay away from bleeding-edge technology, so they go in the middle. I put IBM higher on the innovation scale because I think they're on the right track with StorageTank. In the lower left are the simple integrators like StoneFly and Celeros. They basically integrate commodity hardware and software components. Not much innovation, but they provide it at a very low cost. Then, in the upper left is the innovation that I like: software-based invention that leverages commodity arrays and motherboards. Panasas is doing this. I listed ClusterFS on here. They aren't an array vendor, but they do the file system and add-ons to Linux that 30% of the Top-100 supercomputers use to create their storage grid. I haven't quite figured out where to put NetApp. They are driving a lot of important innovation, but I need to check their prices to see where they fit on the price scale.

So here it is, rev 2 of my array vendor chart, with all its errors and probably several missing storage vendors.

Friday, September 29, 2006

Part III: And more storage arrays

Continuing my research on all the storage array vendors out there.....

Not a new vendor but I haven't looked at them in a while. 3Par makes a fairly traditional block storage rack with 3U and 4U disk trays, a modular 4U dual-controller unit, dual power supplies, etc. Available with FC and iSCSI SAN interfaces. They provide the usual set of RAID levels and data services, including snapshot and remote copy, and a 'single pane' management GUI that includes tools for monitoring and managing storage resources, access patterns, and for migrating data between RAID levels.

One feature they have developed is the ability to Underprovision. You can create several volumes that can present LUNs that are larger than the amount of available disk space. You have to leave some amount of disk space in a free pool. Then, as one or more of the underprovisioned LUNs fill up, the controller will automatically take space from the free pool as necessary. Another nice (although I'm not sure new) feature is the ability to migrate data between RAID levels online as access patterns and the desired SLA change.
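The underprovisioning (thin provisioning) mechanic described above is simple to sketch: a LUN advertises a large size, but physical extents are pulled from a shared free pool only when a region is first written. This is a conceptual illustration, not 3Par's design; the extent size and class names are invented.

```python
EXTENT = 1024  # bytes of physical space allocated per extent (illustrative)

class ThinPool:
    """Shared free pool of physical extents, drawn on by all thin LUNs."""
    def __init__(self, physical_extents):
        self.free = physical_extents

class ThinLun:
    def __init__(self, pool, advertised_size):
        self.pool = pool
        self.size = advertised_size   # what the host sees (can exceed pool)
        self.mapped = {}              # logical extent index -> allocated

    def write(self, offset, length):
        """Allocate backing extents for a write on first touch;
        raise if the shared free pool is exhausted."""
        first, last = offset // EXTENT, (offset + length - 1) // EXTENT
        for ext in range(first, last + 1):
            if ext not in self.mapped:
                if self.pool.free == 0:
                    raise RuntimeError("free pool exhausted")
                self.pool.free -= 1
                self.mapped[ext] = True

    def allocated(self):
        return len(self.mapped) * EXTENT
```

Note the failure mode the sketch makes obvious: if every thin LUN fills up at once, the pool runs dry, which is why real arrays alarm on pool utilization thresholds.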

In summary, nothing bleeding-edge here. They've been around a while so I would hope they have most of the bugs worked out so for someone just needing reliable block storage with some scaleability, ability for snapshot/backup and remote mirror, this might be a good choice.

Not new either. Another fairly standard RAID array in the middle of the pack. Features include FC or iSCSI host interface (4 ports/controller), FC or SATA disks, snapshot, sync and async remote copy, standard RAID levels and also claim to have RAID 6. Available in a 3U, 15-drive dual-controller model. Also sell a 1U RAID head that uses stand-alone JBODs on the back-end.

More mid-range RAID. 2U and 3U RAID trays. FC host, SAS and SATA disks. Snapshot, remote copy, etc.

Block storage provider whose value prop is a set of block data services. They claim to provide the 'Only SAN with Automated Tiered Storage'. I'm not sure I believe that, but their RAID subsystem will track properties of data and automatically migrate it to different tiers of storage. This would have to be at the block level, so they must be doing this at the granularity of some number of blocks.

They also have an underprovisioning (called Thin-Provisioning) feature to let users create LUNs that are larger than the available storage and pull from a free pool as necessary. Also, claim to have CDP. They call it Continuous Snapshots.

Going for low cost. Do a low-end array with ethernet interface running both iSCSI target and NAS. My guess is they use Linux internally on a commodity motherboard and OEM a low-end RAID controller. Targeting small business.

Really going after the low-cost-leader position in iSCSI storage arrays. They have a product called a 'Storage Concentrator' that serves as an iSCSI target. Looks suspiciously like a 1U Dell rack server with the Dell logo replaced with one that says Stonefly. My guess is it runs Linux with an iSCSI target driver such as Wasabi. Also available in a 3U array with a single integrated iSCSI controller. Has battery-backed cache but no controller failover.

Thursday, September 28, 2006

Part II: More new storage arrays

The second of my series of posts on what new array vendors are doing.

Isilon Systems
Isilon is one of the vendors doing Clustered Storage in the form of 2U NAS bricks that interconnect with each other through Gig Ethernet or InfiniBand and include a shared filesystem that lets them share and balance files across the shared disk space. They focus on very similar servers and workloads to Panasas and BlueArc - scaleable rack servers requiring high-bandwidth access to large files - although their material talks about the market differently. It describes the growth of Unstructured Data - large multimedia files requiring high-bandwidth, shared access - essentially the same as HPTC.


Their 'secret sauce' is their Distributed File System (DFS), which uses a Distributed Lock Manager (DLM) over the dedicated interconnect, which can be either IB or Ethernet. The DFS integrates volume management and RAID, including Reed-Solomon ECC, so it can tolerate multiple disk or node failures within a volume/stripe. The DFS handles distributed metadata and file locking so multiple nodes can share access to a file. It includes the ability to rebalance data across nodes to maintain load balancing, and something they call 'smartconnect', which a source tells me looks a lot like basic IP path failover and load-balancing. It also provides one integrated view of the data for the management UI. The host interface includes NFS and CIFS. No mention of iSCSI at this point.
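A full Reed-Solomon code (which is what lets Isilon survive *multiple* failures) is beyond a blog sketch, but the single-failure case degenerates to plain XOR parity, which shows the striping idea: split a chunk of data across N nodes plus one parity node, and rebuild any one lost node from the survivors. Function names are mine, for illustration only.

```python
from functools import reduce

def stripe(data: bytes, nodes: int):
    """Split `data` into `nodes` equal chunks plus one XOR parity chunk,
    one chunk per storage node."""
    chunk = (len(data) + nodes - 1) // nodes
    padded = data.ljust(chunk * nodes, b"\0")
    chunks = [padded[i * chunk:(i + 1) * chunk] for i in range(nodes)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))
    return chunks + [parity]

def rebuild(chunks, lost: int):
    """Recover the chunk at index `lost` by XOR-ing all the survivors
    (including the parity chunk)."""
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))
```

Reed-Solomon generalizes this by replacing the single XOR parity with multiple checksum chunks computed over a finite field, which is what buys tolerance of more than one simultaneous failure.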

One issue I didn't see addressed is how they handle failover when a box fails. Using their ECC/RAID algorithms, the data is still there, but clients will have to re-mount the volume through a new IP address. I suspect this isn't handled today. Something like a pNFS MDS is required for that.

Another approach to addressing similar problems and markets as Panasas and BlueArc. Isilon uses commodity HW, like Panasas, but uses the existing standard NFS/CIFS for the SAN interface, like BlueArc. I saw a reference on the web to them using embedded Linux, but someone told me they use a version of BSD. Either way, they should be able to quickly pick up NFS/pNFS enhancements as they evolve while they focus their engineering on enhancing the DFS and management UI. The biggest threat is that open-source Linux DFSs will eventually catch up, and standards like pNFS eventually eliminate the need to embed a DFS in the storage. For now though, this looks like a promising approach (provided they've worked out all the failure conditions).

Lefthand is also doing 'Clustered Storage' using rack-mount storage trays built from commodity hardware, SAS and SATA disks, and ethernet interconnect. Like Isilon, they have a distributed file system allowing them to share data, stripe & RAID protect across multiple bricks, and scale to many bricks. Lefthand also distributes their metadata processing so there is no single-point-of-failure.

Lefthand offers a NAS interface although I think most of their installed-base is iSCSI and that's where they focus. One unique feature is they've developed a multipath driver that keeps a map of all the alternate controllers that could take over serving a LUN when one fails. Good idea. I've seen some PR recently about certifying with Windows for iSCSI boot. I don't know how far they've gone with implementing automated services such as DHCP and DNS as I described in my Ideal SAN, but that would support their value prop as a leading provider of iSCSI-based SAN Solutions.

Crosswalk was founded by Jack McDonald who started McData and it looks like they are leveraging their switching heritage. Technically, not an array provider, but they use similar technology and are focusing on similar problems - aggregating storage together into a 'grid' with a global namespace and high availability, scaleability, etc.

Crosswalk is clearly focusing on the HPTC market and I applaud their marketing for being clear about their focus on this segment. Nearly every marketing book I've read and class I've taken makes it clear that segmenting your marketing and defining your unique advantage in that segment is a fundamental requirement for success. Despite this, in my twenty years in storage I've met very few product managers willing to do this ("we might lose a sales opportunity somewhere else...."). But, I digress.

Crosswalk also uses a distributed file system that aggregates data into one namespace, allows shared access to information, implements a DLM, etc. Crosswalk differs in that they take the approach of doing this in an intermediate layer between the storage and the IP Network. Their product is a small grid of high-performance switches serving NFS/CIFS on the front and using legacy FC storage on the back-end. This is their differentiation: "integration of disparate storage resources". Presumably, they leverage their experience at McData implementing high-performance data channels in the box so they can move lots of data with a relatively few nodes in their 'virtualization grid'. Host interface is standard NFS/CIFS.

Given their focus on HPTC where Linux prevails, I would hope they use Linux in their grid and can pick up NFS/pNFS enhancements as they are adopted on HPTC grids. Also, given that 30% of the top-100 supercomputers now use Lustre from ClusterFS, and given their location just down the road from ClusterFS in Boulder, I would assume they are talking. This would make a good platform for running the Lustre OSD target.

Sells 3U and 4U iSCSI arrays. Found limited information on the website about the internal architecture but appears to be block (no NFS) with the usual set of data services. Also talks briefly about some unique data services to let bricks share data and metadata for scaling and availability but it doesn't sound like the same level of sharing as Isilon, Crosswalk, or Lefthand. This looks more like a straightforward, iSCSI block RAID tray. Nothing wrong with that. Over the next several years, as the RAID tray becomes what the disk drive was to the enterprise ten years ago, they are one of the contenders to be one of the survivors, provided they can keep driving down HW cost, manufacturing in high volume, keep reliability high, and keep up with interconnect and drive technology.

Tuesday, September 26, 2006

New Storage Arrays: Part 1

This post is another collection of notes. In this case, notes from reading the websites from several fairly new (to me at least) storage subsystem vendors. I don't have any inside information or access to NDA material on these companies. All my notes and conclusions are the result of reading material on their websites.


Panasas builds a storage array and associated installable file system that closely aligns with my vision of an ideal SAN so, needless to say, I like them. Their focus is HPTC, specifically, today's supercomputers built from many compute nodes running Linux on commodity processors. For these supercomputers, it's critical that multiple compute nodes can efficiently share access to data. To facilitate this, Panasas uses object storage along with a pNFS MetaData Server (MDS). Benefits include:

    Centralize and offload block space management. Compute nodes don't have to spend a lot of effort comparing free/used block lists between each other. A compute node can simply request creation of a storage object.

    Improved Data Sharing. Compute nodes can open objects for exclusive or shared access and do more caching on the client. This is similar to NFS V4. The MDS helps by providing a call-back mechanism for nodes waiting for access.

    Improved Performance Service Levels. Objects include associated properties, and with the data grouped into objects, the storage can be smarter about how to lay out the data for maximum performance. This is important for HPTC, which may stream large objects.

    Better Security. Objects include ACLs and authentication properties for improved security in these multi-node environments.

Panasas uses pNFS concepts but goes beyond pNFS, I think. Compute nodes include the client layout manager so they can stripe data across OSD devices for increased performance (reference my pNFS Notes). They use the MDS server for opens/closes, finding data, and requesting call-backs when waiting for shared data. They get the scaleable bandwidth that results from moving the MDS out of the datapath. More importantly, the MDS provides the central point to keep track of new storage, and the place for new servers to go to find the storage they need. This supports scaleability of both storage and servers.

Object Properties Panasas objects use the object-oriented concept of public and private properties. Public properties are visible to the Object Storage Device and specify the Object ID, size, and presumably other properties to tell the OSD the SLA it needs. Private properties are not visible to the OSD and are used by the client AND the MDS. They include ACLs, client (layout manager) RAID associations, etc.
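A guess at how that public/private split might be modeled: "public" attributes are visible to (and interpreted by) the OSD, while "private" ones are an opaque blob the OSD stores but only the client and MDS understand. All field names here are illustrative, not Panasas's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class PublicProps:
    """Visible to the Object Storage Device."""
    object_id: int
    size: int
    qos_hint: str = "default"   # tells the OSD what SLA the object needs

@dataclass
class PrivateProps:
    """Opaque to the OSD; interpreted only by the client and the MDS."""
    acl: list = field(default_factory=list)
    raid_map: dict = field(default_factory=dict)  # client-side striping info

@dataclass
class StorageObject:
    public: PublicProps
    private: PrivateProps
```

The design point this captures is that the OSD can make layout and SLA decisions from the public properties without ever needing to understand the filesystem semantics encoded in the private ones.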

iSCSI Panasas runs their OSD via iSCSI over IP/Ethernet. I assume they use RDMA NICs in their OSD array and it's up to the client whether or not to use one. For control communications with the MDS, they use standard RPC.

File System I don't think their filesystem is Lustre. I think they wrote their own client that plugs into the vnode interface on the Linux client. I don't know if their OSDs work with Lustre or not. I would think they would not pass up that revenue opportunity. I think that 30% of the Top-100 supercomputers use Lustre.

Standards I like that Panasas is pursuing and using standards. They understand that this is necessary to grow their business. They claim their OSD protocol is T10 compliant and they are driving the pNFS standard.

Storage Hardware Interesting design that uses 'blades'. From the front, it looks like a drive CRU, but it's a much deeper card with (2) SATA HDDs. Fits into a 4U rack-mount tray. Includes adapters for IB and Myrinet, as well as a native Ethernet/iSCSI interface. I don't know what the price is, but it appears to be built from commodity components so it ought to be reasonably inexpensive. I didn't see anything about the FW, but I'm certain it must be Linux-based.

Summary Again, I like it - a lot. They are aligned with the trend to enable high-performance, scaleable storage on commodity storage AND server hardware (Ethernet interconnect, x86/x64 servers running Linux, simple storage using SATA disks). They are developing filesystem and MDS server software that actually works to enable this scaling, and driving it as an open standard, including driving pNFS as a standard that is transport-agnostic. By using the open-source process they can take advantage of contributions from the development community. Finally, it makes sense to start out in HPTC to get established and mature the technology, but I see a lot of potential in commercial/enterprise datacenters.


BlueArc is an interesting contrast to Panasas. Both are trying to address the same problem - scaleable, intelligent, IP-network and object-based storage that can support lots of scaleable application servers - but they approach the problem in completely different ways. Panasas, founded by a computer science PhD (Garth Gibson), uses software to combine the power of lots of commodity hardware. BlueArc, on the other hand, founded by an EE with a background developing multi-processor servers, is addressing the problem with custom high-performance hardware.

The BlueArc product is an NFS/CIFS server that can also serve up blocks via iSCSI. Their goal is scaleability, but their premise is that new SW standards such as pNFS and NFS V4++ are too new, so they work within the constraints of current, pervasive versions of NFS/CIFS. Their scaleability and ease-of-use come from very high performance hardware that can support so many clients that only a few units are needed.

Hardware Overview. Uses the four basic components of any RAID or NAS controller: Host Interface, Storage Interface, Non-real-time executive/error handling processor, and Real-time data movement and buffer memory control. Each of these is implemented as independent modules that plug into a common chassis and backplane.

    Chassis/Backplane Chassis with a high-performance backplane. Website explains that it uses "contention-free pipelines" for many concurrent sessions and low-latency interprocessor communications between I/O and processing modules. Claims this is a key to enabling one rack of storage to scale to support many app servers.

    Network Interface Module Custom plug-in hardware module providing the interface to the Ethernet-based storage network. The website says it includes HW capability to scale to 64k sessions.

    File System Modules Plug-in processing modules for running NAS/CIFS/iSCSI. Two types: 'A' modules does higher-level supervisory processing but little data movement. 'B' module actually moves file system data and controls buffer memory.

    Storage Interface Module Back-end FC, SCSI interface and processing. Also does multipathing. Website says it contains much more memory than a typical HBA so it can support more concurrent I/Os

Software The software mainly consists of the embedded FW in the server for NAS/CIFS and filesystem processing. Works with standard CIFS/NFS/iSCSI so no special client software required. The white paper refers to the 'Object Storage' architecture but no OSD interface is supported at this time. Includes volume management (striping, mirroring) for the back-end HW RAID trays.

Summary Again, the advantage is high performance and scaleability due to custom hardware. It uses existing network standards so it can be rolled into a datacenter today and it's ready to go. No special drivers or SW required on the app servers which is nice. Also, since you only need one, or a few of these you don't have the problem of managing lots of them. Similar to the benefits of using a large mainframe vs. rack servers. Also, implemented as a card-cage that lets you start small and grow - sort of like a big Sun E10k SPARC server where you can add CPU and I/O modules.

Keys to success will include three things. One, the ability to keep up with new advances in hardware. Two, the ability to keep it simple to manage. Third, and my biggest concern, the ability to mature the custom, closed firmware and remain competitive with data services. This is custom hardware requiring custom firmware. BlueArc needs to continue staffing enough development resources to keep up. This concerns me because I've been at too many companies that tried this approach and just couldn't keep up with commodity HW and open software.

Pillar Data

Pillar builds an integrated rack of storage that includes RAID trays (almost certainly OEM'd), a FC block SAN head and a NAS head - which can both be used at the same time sharing the same disk trays - and a management controller. Each is implemented as a 19" rack-mount module. There's no bleeding-edge technology here. It's basic block and NAS storage with the common, basic data services such as snapshot, replication, and a little bit of CDP. That appears to be by design and supports their tag-line: 'a sensible alternative'. The executive team consists of experienced storage executives who know that most datacenter admins are highly risk-averse and that their data management processes are probably built around just these few basic data services, so this strategy makes sense as a way to break into the datacenter market.

The unique value here is that both NAS and block are integrated under one simple management interface, you can move (oops, I mean provision) storage between both, and the same data services can be applied to both block and NAS. Most of the new invention here is in the management controller, which bundles configuration wizards, capacity planning, policies for applying data services, and tiered storage management. It allows a user to define three tiers of storage, assign data to those tiers, and presumably the system can track access patterns for at least the NAS files and migrate between tiers of storage.
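One plausible reading of that tiered-storage policy engine is an age-based rule: track last access per file and pick a tier by how long the file has been idle. The tier names and thresholds below are my own illustration, not Pillar's.

```python
import time

# (tier name, minimum idle age in seconds before data belongs there)
TIERS = [
    ("tier1", 0),            # hot: accessed recently
    ("tier2", 7 * 86400),    # warm: idle for a week
    ("tier3", 30 * 86400),   # cold: idle for a month
]

def pick_tier(last_access, now=None):
    """Return the coldest tier whose idle-age threshold the file has
    crossed; the policy engine would migrate the file there."""
    age = (now if now is not None else time.time()) - last_access
    chosen = TIERS[0][0]
    for name, threshold in TIERS:
        if age >= threshold:
            chosen = name
    return chosen
```

A real engine would also weigh file size and migration cost, but the core of "track access patterns and migrate between tiers" is just this kind of threshold rule evaluated periodically.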

Looking Forward

This looks like a company trying to be the next EMC. It is managed by several mature, experienced executives including several ex-STK VPs. They are building on mature technology and trying to build the trust of enterprise datacenter administrators. The value prop is integration of mature, commonly used technologies - something attractive to many admins who use NAS storage with one management UI from one vendor, block storage from another, and SAN management from yet another.

What's really interesting is when you combine this with their Oracle relationship. They are funded by Larry Ellison. As I described in my post on Disruption and Innovation in Storage, I firmly believe that for enterprise storage, the pendulum has swung back to giving the competitive advantage to companies that can innovate up and down an integrated stack by inventing new interfaces at each layer of the stack. We will never solve today's data management problems with a stack consisting of an application sitting on top of the old POSIX file API, a filesystem that breaks data into meaningless 512-byte blocks for a block volume manager, in turn talking to a block storage subsystem. So, Oracle is doing the integration of the layers starting from the top by bypassing the filesystem, integrating its own volume manager, and talking directly to RDMA interfaces. Now we have Pillar integrating things from the bottom up. By getting Oracle and Pillar together to invent a new interface, they could create something similar to my vision of an ideal SAN.

In this vision of the future, Oracle provides a bundle of software that can be loaded on bare, commodity hardware platforms. It includes every layer from the DB app, through volume management down to RDMA NIC driver and basic OS services which come from bundling Linux. The commodity x64 blades could include RDMA-capable NICs for high-performance SAN interconnect. Then, using NFS V4++, Oracle and Pillar agree on extended properties for the data objects to tell the Pillar storage subsystem what Service Levels and Compliance steps to apply to the data objects as they are stored, replicated, etc. Over time, to implement new data services or add compliance to new data management laws, Oracle and Pillar can quickly add new data properties to the interfaces up and down the stack. They don't have to wait for SNIA or ANSI to update a standard and they don't have to wait for other players to implement their side of the interface. Microsoft can do this with VDS and their database. With Pillar, Oracle can do it as well.
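To make this concrete, here is a minimal sketch of the kind of mapping a Pillar-like array could apply once extended NFS V4++ properties arrive attached to each data object. Every property name and action string below is invented for illustration; no such interface exists today.

```python
# Hypothetical sketch: per-object properties an Oracle-like stack might
# attach via extended NFS V4++ attributes, and the storage-side actions
# a Pillar-like array could derive from them. All names are invented.

def service_plan(properties):
    """Map hypothetical data-object properties to storage-side actions."""
    actions = []
    if properties.get("service_level") == "gold":
        actions.append("place-on-mirrored-fc")
    else:
        actions.append("place-on-raid5-sata")
    if properties.get("compliance") == "retain-7y":
        actions.append("set-read-only")
        actions.append("retain-until:+7y")
    if properties.get("replicate_remote"):
        actions.append("replicate-to-dr-site")
    return actions

# A redo log needs top service levels and disaster-recovery replication.
redo_log = {"service_level": "gold", "replicate_remote": True}
print(service_plan(redo_log))
```

The point of the sketch is the extensibility: adding a new compliance rule is one new property on the producer side and one new clause on the consumer side, with no standards-body round trip.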

Tuesday, September 19, 2006

My Ideal SAN, Part II, Data Services

In part one, I talked about the SAN interconnect and the network and array services that facilitate the use of lots of scalable rack servers. Now I want to talk about how to achieve scalability on the storage side, and about data services that help solve the combined problem of centralizing information, keeping it always available, putting it on the right class of storage, and keeping it secure and compliant with information laws.

Consolidating and Managing the Information with NFS
Early SANs were about consolidating storage HARDWARE, not the information. The storage was partitioned up, zoned in the SAN, and presented exclusively to large servers, giving them the impression they were still talking to direct-attached storage. This allowed the server to continue to run data services in the host stack because it virtually owned its storage. My ideal datacenter uses lots of scalable rack servers with applications that grow and migrate around. Trying to run data services spread across all these little app servers/data clients is nearly impossible. The INFORMATION, not just the storage hardware, has to be centralized and shared, and most of the data services have to run where the storage lives - on the data servers. This means block storage is out. Block storage servers, which receive disaggregated blocks of storage with no properties of the data, are hopelessly limited in their ability to meaningfully manage and share the information. So, my storage needs to be object-based and, since I'm building this datacenter from scratch, I'm going to use NFS V4++. (If I needed to run this on legacy FC infrastructure, I would use the OSD protocol - more on that later.) With enhanced NFS, the storage servers keep the information in meaningful groupings with properties that let them store the information properly.

Performance and Availability
For high performance and availability I want NFS V4 plus some enhancements. One enhancement is the RPC-via-RDMA standard being developed by Netapp and Apple. The onboard NIC in the rack servers should be capable of performing RDMA for the RPC ULP as well as iSCSI. For availability, the host stack must support basic IP multipath as well as NFS volume and iSCSI LUN failover. The latter should use the industry standards: symmetric LUN access, or ANSI T10 ALUA for asymmetric LUN failover. For NFS volume/path failover, the V4 fs_locations attribute is helpful because it allows a storage server to redirect the client to another controller that has access to the same, or a mirrored, copy of the data. This helps but, to achieve full availability and scalability, we need pNFS with its ability to completely decouple information from any particular piece of HW.
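As a toy illustration of the fs_locations idea, a client holding a list of alternate locations for a volume can re-target a failed server without administrator involvement. The table structure and server names below are simplified inventions, not the actual V4 wire format.

```python
# Toy model of fs_locations-style failover: each exported volume
# advertises the servers holding the same (or a mirrored) copy, and the
# client walks the list when its current server stops responding.
# Names and data structure are invented for illustration.

fs_locations = {
    "/export/db1": ["nfs1.example.com", "nfs2.example.com"],
}

def read_with_failover(path, servers_up):
    """Return a read from the first advertised replica that is reachable."""
    for server in fs_locations[path]:
        if server in servers_up:
            return "reading %s from %s" % (path, server)
    raise IOError("no replica of %s reachable" % path)

# Primary is down; the client transparently lands on the mirror.
print(read_with_failover("/export/db1", {"nfs2.example.com"}))
```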

A few weeks ago I posted a few notes on pNFS. pNFS applies the proven concept of centralized Name and Location services in networks - the same concept that has allowed the internet to grow to millions of nodes. The pNFS Name/Location server can run on the same inexpensive clustered pair of rack servers as the storage DHCP service. With pNFS, instead of mounting a file from a particular piece of NFS server hardware, clients do a lookup by name, and the pNFS Name/Location server returns a pointer to the storage device(s) where the data currently resides. Now files can move between a variety of small, low-cost, scalable storage arrays, giving high availability. Frequently accessed data can reside on multiple arrays, and an app server can access the nearest copy. For performance, app servers can stripe files across multiple arrays. Finally, with NFS V4 locking semantics, multiple app servers can share common information - something FC/block SANs have never been able to do effectively.
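A toy model of the lookup-and-stripe idea might look like the following. The layout format, stripe arithmetic, and device names are all invented for illustration; real pNFS layouts are richer than this.

```python
# Toy pNFS model: a client asks a name/location service where a file's
# data lives, then talks to those data servers directly -- here, striped
# round-robin across three of them. All names/formats are invented.

layout_service = {
    "results.dat": {"stripe_unit": 64 * 1024,
                    "devices": ["osd-a", "osd-b", "osd-c"]},
}

def device_for_offset(filename, offset):
    """Which data server holds the stripe unit containing this offset?"""
    layout = layout_service[filename]
    stripe = offset // layout["stripe_unit"]
    return layout["devices"][stripe % len(layout["devices"])]

assert device_for_offset("results.dat", 0) == "osd-a"
assert device_for_offset("results.dat", 64 * 1024) == "osd-b"
assert device_for_offset("results.dat", 128 * 1024) == "osd-c"
```

Because the client learned the layout by name, the location service can move or re-stripe the file behind the scenes and clients simply get a fresh layout on their next lookup.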

The Storage Server Hardware
Just as I described using small, scalable, rack-mount application servers, pNFS now allows doing the same with the storage. My storage arrays would be scalable - probably 2U/12-drive or 3U/16-drive bricks, some with high-performance SAS disks and others with low-cost SATA. Some with high-performance mirrorsets, others with lower-performance RAID 5. The interface is Ethernet that is RDMA-capable for both iSCSI and NFS/RPC. As I described in part one, they can be configured with iSCSI LUNs assigned meaningful names, and the array registers those with the central name service on the SAN. They can also be configured with NFS volumes that register with pNFS. This gives the ultimate in scalability, flexibility, low cost, high availability, and automated configuration. Now we can talk about how to seriously help manage the data.

Managing the Data
Managing the data means doing four things all at the same time. One, keeping the data centralized and shared. Two, keeping it always accessible in the face of any failures or disasters. Three, putting the right data on the right class of storage. Four, complying with applicable laws and regulations for securing, retaining, auditing, etc. It's when you put all four of these together that it gets tough with today's SANs.

I already talked about how pNFS with NFS V4++ solves the first two - keeping the data centralized, shared, and 100% accessible. With pNFS, arrays can share files among multiple data clients. Both arrays and data clients can locally and remotely replicate data via IP and the pNFS server allows data clients to find the remote copies in the event of a failure. Similarly, on the application server side, if a server fails, an application can migrate to another server and quickly find the data it needs.

Now I want to talk about how the object nature of NFS allows solving the second two problems. Again, because the data remains in meaningful groupings (files) and has properties along with the ability to add properties over time, the storage servers can now put it on the right class of storage, and apply the right compliance steps. NFS today has some basic properties that let the storage server put the data on the right class of storage. Revision dates and read-only properties allow the storage to put mostly-read data on cheaper RAID 5 volumes. With revision dates, the storage can migrate older data to lower cost SATA/RAID-5 volumes and even eventually down to tape archive. With the names of files, the storage can perform single-instancing. These properties are a start but I would like to see the industry standardize more properties to define the Storage Service Levels data objects require.
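A hypothetical tiering policy along these lines, with invented age thresholds and tier names, could be as simple as the sketch below. The storage server can only run a policy like this because it sees whole files with properties such as revision dates, not anonymous blocks.

```python
import datetime

# Sketch of a storage-side tiering policy driven by file properties.
# The thresholds (30 days, 365 days) and tier names are invented.

def choose_tier(mtime, read_only, today):
    """Pick a storage class from a file's revision date and write status."""
    age_days = (today - mtime).days
    if age_days > 365:
        return "tape-archive"          # old data migrates toward archive
    if read_only or age_days > 30:
        return "sata-raid5"            # mostly-read data on cheaper RAID 5
    return "sas-mirror"                # hot, writable data on fast mirrors

today = datetime.date(2006, 9, 19)
assert choose_tier(datetime.date(2006, 9, 1), False, today) == "sas-mirror"
assert choose_tier(datetime.date(2006, 7, 1), False, today) == "sata-raid5"
assert choose_tier(datetime.date(2004, 1, 1), False, today) == "tape-archive"
```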

Finally, compliance with data laws is where the object nature of NFS can help the most. The problem with these laws is they apply to the Information, not to particular copies of the data. Availability and Consolidation requirements mean the information has to be replicated, archived and shared on the storage network. With NFS, information can be named, and the name service can keep track of where every copy resides. The properties associated with the data can include an ACL and audit trail of who accessed each copy. The storage can retain multiple revisions, or can include an 'archive' property so the storage makes it read-only. The properties can include retention requirements then, once the retention period expires, the storage can delete all copies. These are just a few of the possibilities.

How to Get There?
Some of this development is happening. Enhancements to NFS V4 are being defined and implemented in at least Linux and Solaris. pNFS is being defined and prototyped through open source, with strong participation by Panasas. RDMA for NFS is at least being defined as a standard; now we need NICs from either the storage HBA or Ethernet NIC vendors. One gap where I don't see enough progress is defining more centralized configuration, naming, and lookup services for pNFS storage networks. Panasas and the open development community seem to be focusing on HPTC right now. That's probably not a bad place to start - that market needs the parallel access to storage it gets from pNFS and object-based storage - but it leaves an opportunity to define the services that automate large SANs for other markets. The other gap is standardizing properties for data objects, specifically for defining Storage Service Levels and compliance with data laws. These need to be standardized. (I need to check what the SNIA OSD group is doing here.)

Notes on Transitioning from Legacy FC SANs
One of the nice features of pNFS's separation of control and data flow is that it doesn't care what transport is used to move the data. The typical datacenter, with its large investment in Fibre Channel, will want to leverage that infrastructure. There is no reason the architecture I describe can't use FC in parallel with Ethernet via the T10 OSD protocol, provided OS drivers are available that connect the OSD driver to the vnode layer. The same data objects with the same properties attached can be transmitted through the OSD protocol over FC. THIS is the value of the T10 OSD spec. It allows an object-based data management architecture like the one described above to leverage the huge legacy FC infrastructure.

Monday, September 11, 2006

My Ideal SAN, Part I, Boot Support

This is the first of what may be several posts where I describe my idea of an ideal SAN using a combination of products and technology available today, technology still being defined in the standards bodies, and some of my own ideas. My ideal SAN will use reasonably priced components, use protocols that automate and centralize configuration and management tasks, scale to thousands of server and storage nodes, and provide storage service levels that solve real data management problems.

The Interconnect
My SAN will use Ethernet - in part because of cost, but mostly because it comes with a true network protocol stack, and because I can get scalable rack-mount servers that come with Ethernet on the motherboard so I don't need add-on HBAs. The normal progression for an interconnect, as happened with Ethernet and SCSI, is that it starts out as an add-on adapter card costing a couple hundred dollars. Then, as it becomes ubiquitous, it moves to a $20 (or less) chip on the motherboard. Fibre Channel never followed this progression because it's too expensive and complex to use as the interconnect for internal disks, and it never reached wide enough adoption to justify adding the socket to motherboards. I want rack-mount servers that come ready to go right out of the box with two dual-ported NICs, so I have two ports for the LAN and two for the SAN. Sun's x64 rack servers, and probably others, meet this requirement.

To further simplify configuration and management, these rack servers won't have internal disks. They will load pre-configured boot images from LUNs on centralized arrays on the SAN via iSCSI. In spite of my raves about object storage, I don't see any reason to go beyond the block ULP for the boot LUN. The SAN NIC in these servers will be RDMA-capable under iSCSI and include an iSCSI boot BIOS that can locate and load the OS from the correct boot LUN. It finds the boot LUN using the same IP-based protocols that let you take your notebook into a coffee shop, get connected to the internet, and type in a human-readable name like 'google.com' and connect to a remote google server. These are, of course, DHCP and DNS.

I will have a pair of clustered rack-mount servers on the SAN running DHCP and other services. The Internet Engineering Task Force (IETF), the body that defines standards for internet protocols, has extended DHCP to include the boot device as one of the host configuration options. So replacing or adding a new rack server involves entering its human-readable server name, e.g. websrv23, into its EPROM and making sure your central DHCP server has been configured with the boot device/LUN for each of your websrvxx app servers. When the new server powers on, it broadcasts its name to the DHCP server, which replies with websrv23's IP address, boot LUN, and other IP configuration parameters. It can then use a local nameserver to find the boot device by name and load its operating system. The architect of one very large datacenter who is hoping to move to an IP SAN calls these Personality-less Servers.
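For illustration, the BIOS-side lookup could parse a root-path string handed back by DHCP. RFC 4173 defines an iscsi: format roughly like the one assumed below, but this sketch simplifies the details and the parsing is mine, not any vendor's firmware.

```python
# Sketch of boot-time discovery: the iSCSI boot BIOS receives a
# root-path string via DHCP and parses out where its boot LUN lives.
# Format loosely follows RFC 4173; field handling is simplified.

def parse_iscsi_root_path(root_path):
    """Parse "iscsi:<server>:<protocol>:<port>:<lun>:<target-name>"."""
    assert root_path.startswith("iscsi:")
    fields = root_path[len("iscsi:"):].split(":", 4)
    server, protocol, port, lun, target = fields
    return {"server": server,
            "port": int(port) if port else 3260,   # iSCSI default port
            "lun": int(lun) if lun else 0,
            "target": target}

path = "iscsi:192.168.0.5::3260:0:iqn.2006-09.com.example:websrv-boot"
print(parse_iscsi_root_path(path))
```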

Array Data Services for Boot Volumes
I want a few data services in my arrays to help manage boot and boot images. First, my ethernet-based arrays will also use DHCP to get their IP address and will register their human-readable array name with the DHCP server. In addition to automating network config, this enables the DHCP application to provide an overview of all the devices on the SAN, and to present the devices as one namespace using meaningful, human-readable names.

One data service I will use to help manage boot volumes is fast volume replication, so I can quickly replicate a boot volume, add patches/updates that I want to test, and present that as a new LUN. I'll have app servers for testing these new boot images, and through DHCP I will route them to boot from the updated boot volumes. Once these are tested, I want to be able to quickly replicate them back to my production boot volumes.

The other array data service I would like is my own invention that allows me to minimize the number of boot volumes I have to maintain. Ninety-some percent of every boot volume is the same and is read-only. Only a small number of files, including page, swap, and log files, get written to. I would like a variation of snapshot technology that allows me to create one volume in the array and present it as multiple LUNs. Most reads get satisfied out of the one volume. Writes to a LUN, however, get redirected to a small space allocated for each LUN; the array keeps track of which blocks have been written, and any reads of an updated block are satisfied from the per-LUN update space. With this feature I can manage one consistent boot image for each type of server on the SAN.
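The redirect-on-write bookkeeping I'm describing fits in a few lines. This is just a model of the idea, not array firmware; the class and block granularity are invented for illustration.

```python
# Toy model of one read-only base boot image presented as many LUNs,
# with a small per-LUN overlay capturing writes (page, swap, logs).

class SharedBootVolume:
    def __init__(self, base_blocks):
        self.base = base_blocks      # block number -> data (shared, read-only)
        self.overlays = {}           # lun -> {block number: data}

    def write(self, lun, block, data):
        # Redirect the write into this LUN's private update space.
        self.overlays.setdefault(lun, {})[block] = data

    def read(self, lun, block):
        # Updated blocks come from the overlay; everything else is shared.
        return self.overlays.get(lun, {}).get(block, self.base[block])

vol = SharedBootVolume({0: "kernel", 1: "libs", 2: "empty-swap"})
vol.write("websrv23", 2, "swap-data")
assert vol.read("websrv23", 2) == "swap-data"    # this LUN sees its write
assert vol.read("websrv24", 2) == "empty-swap"   # other LUNs see the base
assert vol.read("websrv23", 0) == "kernel"       # shared blocks stay shared
```

One base image plus one small overlay per server replaces dozens of near-identical boot volumes.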

It's a Real Network
This is why I like iSCSI (for now). You get a real network stack with protocols that let you scale to hundreds or thousands of devices and you can get servers where the SAN interconnect is already built-in. Nothing I've described here (except my common boot volume) is radically new. Ethernet, DHCP, DNS, and even the iSCSI ULP are all mature technologies. Only a few specific new standards and products are needed to actually build this part of my ideal SAN:

    iSCSI BIOS Standard iSCSI Adapters with embedded BIOS are available from vendors such as Emulex and Qlogic but they don't use DHCP to find the boot volume and they're not on the motherboard. We need an agreement for the motherboard-resident BIOS for standard NICs. Intel and Microsoft are the big players here.

    SAN DHCP Server Application We need a DHCP server with the IETF extension for configuring boot volumes. It would be nice if the GUI was customized for SANs with menus for configuring and managing boot volumes and features for displaying the servers and storage on the SAN using the single, human-readable namespace. This app should run on standard Unix APIs so it runs on any Unix.

    The Arrays Finally, we need the arrays that support user-assigned names and use those with DHCP configuration. Maybe iSCSI arrays do this already - I haven't looked. Then, some features to help manage boot volumes would be nice.

If anyone who manages a real SAN is reading this, send me a comment.

Saturday, September 09, 2006

Innovation at the Array Level

The block interface is restricting innovation in arrays even more than in disk drives, and we are seeing the rapid commoditization that results from this inability to add meaningful value-add features. A couple of years ago, array marketers talked about segmenting the array market into horizontal tiers based on price, capacity, availability, data services, etc., and they used to have three or more tiers. Today, due to commoditization, this has collapsed into only two tiers, as described to me directly by more than one storage administrator. The top tier still exists, with EMC DMX, HDS 9000, and other arrays for the highly paranoid willing to pay these high prices. Below that, however, is a single tier of commodity arrays. The sales discussion is pretty simple. "Is it a 2U or 3U box?" "How many disks?" "What capacity?" Then, the next questions are "How cheap is it today and how much cheaper will it be next quarter?"

Some would say this is fine and the natural progression in the storage industry: arrays become what disk drives were fifteen years ago (the thing that stores the bits as cheaply as possible), and higher-level data services move to virtualization engines or back to the host stack.* As an engineer seeking to innovate at the system level, however, I can't accept this.

As with disk drives, it's a distributed computing problem, and there are improvements that can only be done inside the RAID controller. These include improvements in performance, providing an SLA, improving utilization based on the information being stored, securing the information, and complying with information laws. All of this requires knowledge of the information that is stripped away by the block protocol.

Arrays try to do this today the only way they can - through static configuration at the LUN level via management interfaces. One problem with this approach is that the granularity is too large (LUNs). Another is that it's too static and difficult to manage, especially when combined with the need to manage switch zoning, host device nodes, etc. Finally, they are trying to manage the information by applying properties to the hardware storing it instead of to the information itself. Take, for example, the need to keep a log of every user who accessed a particular legal record. One, you can't use a LUN to store each record (so you can't manage it at the LUN level), and two, information doesn't sit on one piece of hardware anymore. It gets mirrored locally, maybe remotely, and probably backed up as well. If you're doing HSM, it might even completely migrate off the original piece of storage hardware. Now remember that the law requires tracking who accessed the INFORMATION, not a particular copy on one array.

If the storage devices are allowed to keep the record grouped (an object), and app servers and storage agree on a protocol for authentication and a format for the access log, then this becomes a solvable problem. Another way the array can help manage the data is by storing it on the right class of storage, such as RAID 5, mirrored, remotely mirrored, or a single non-redundant disk drive. To optimize use of storage, these choices should be applied at the file or database-record level, because required storage service levels change at that granularity.
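Here's a rough sketch of how an access log kept as an object property could travel with every copy, so the audit requirement attaches to the information rather than to one array. The class and log format are invented for illustration.

```python
# Sketch: a legal record kept as one object whose access-log property
# follows every replica. Class names and the log format are invented.

class RecordObject:
    def __init__(self, name, payload):
        self.name = name
        self.payload = payload
        self.access_log = []          # property travels with the object

    def read(self, user, timestamp):
        # Every access to the INFORMATION gets logged, per the law.
        self.access_log.append((user, timestamp))
        return self.payload

    def replicate(self):
        # Mirroring/backup carries the log along with the data.
        copy = RecordObject(self.name, self.payload)
        copy.access_log = list(self.access_log)
        return copy

rec = RecordObject("case-1234", "deposition text")
rec.read("alice", "2006-09-09T10:00")
mirror = rec.replicate()
assert mirror.access_log == [("alice", "2006-09-09T10:00")]
```

With a block protocol there is nowhere to hang this log: the record is just anonymous sectors scattered across LUNs.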

* Note to array vendors. If this is the direction arrays are going the way to win is clear: follow the example of successful disk companies like Seagate. Build high-volume, high-yield manufacturing capability, get your firmware and interop process fully baked, relentlessly drive down HW costs, and standardize form factors and functionality to enable second-source suppliers.

Friday, September 01, 2006

Innovation at the Disk Drive Component

Subtitled: My Free Advice to the Disk Drive Vendors

The disk drive industry has been severely restricted by the limitations of the block interface. By restricting the functionality a drive can expose to that of a 1980's disk drive, it has limited drive vendors to innovating along primarily one dimension - capacity. Of course, they have made amazing increases there, but much has been written about the growing imbalance between capacity and the ability to access that data in reasonable time, let alone support a consistent performance SLA. I've also seen several articles lately about problems with sensitive data left on old disk drives. These point to the need for drive vendors to innovate in more dimensions or, saying it differently, to add value in ways other than just increasing capacity and lowering cost.

It's a Distributed Computing Problem
If you talk to engineers who work at layers above the disk drive (RAID controllers, volume managers, file systems), you'll get answers like "the job of a disk drive is just to hold lots of data cheaply; we'll take care of the rest". The problem is, they can never solve problems like security, optimizing performance, and providing a consistent SLA as well as they could by enlisting the help of the considerable processing power embedded in the disk drive itself.

Back in the 60's and early 70's, most of the low-level functions of a disk drive were controlled by the host CPU. Engineers could have said: "Hey, our CPUs are getting so much faster, it's no problem continuing to control all these low-level functions." Instead, as employees of vertically-integrated companies like IBM and DEC, they were able to take a systemic view of the problem. They realized the advances in silicon technology could be better used to embed a controller in the drive, where it could be more efficient at controlling the actuator and spindle motor. So they completely changed the interface to the disk drive - a radical and foreign concept to many computer engineers today. Now, three decades later, we are dealing with a whole new set of data storage problems, and the processing power embedded in the disk drive has grown along with advances in silicon technology. Now, as in the 1970s, the right answer is to distribute some of this processing to the disk processor, which has the knowledge and is in the right location to handle it.

The first thing to realize is that these drives already have significant processing power built into their controllers and, in many cases, have unused silicon real estate that could be used to add more intelligence. This processing power is used today for things like fabricating a fictional disk geometry for OSes and drivers that think drive layout is like it was twenty years ago and want to align data on tracks, remapping around bad sections of media, read-ahead and write-back caching, re-ordering I/Os, etc. The problem with the last three is that they are done without any knowledge of the data, severely limiting their ability to help overall system performance. We need to let these processors combine their knowledge of how the drive mechanics really work with some knowledge of the properties of the data being stored.

The first problem to address is the growing disparity between the amount of data stored under a spindle and the time it takes the mechanical components to access it. For example, if an I/O spans from the end of one track to the beginning of the next, it still takes on the order of a millisecond just to re-align the actuator to the beginning of the track on the next platter. Or, if a track has a media defect, it can take many milliseconds to find the data that has been relocated to a good sector. Drives could save many tens of milliseconds if they just knew how data was grouped together. They could keep related data on the same track and avoid spanning defects. This is, of course, one of the key benefits of moving to an object interface.

The next problem to address is how to support a performance Service Level Agreement (SLA). Tell the drive that an object needs frequent or fast access so it can locate it where seek times are shortest. Tell the drive that an object contains audio or video so it can stream the data on reads without gaps. Allow the OS and drive to track access patterns so the drive can adjust the SLA and associated access characteristics as the workload changes. This has to be done where the drive characteristics are known - inside the drive.
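A sketch of that drive-side decision, with invented zone names, an arbitrary hot-object threshold, and none of the real mechanics:

```python
# Sketch of drive-side SLA handling: streaming objects get contiguous
# extents, and objects that turn hot get promoted to short-seek zones.
# Zone names and the threshold of 3 accesses are invented.

class ObjectPlacement:
    HOT_THRESHOLD = 3

    def __init__(self):
        self.access_count = {}

    def record_access(self, obj):
        # The drive tracks access patterns itself, per object.
        self.access_count[obj] = self.access_count.get(obj, 0) + 1

    def zone_for(self, obj, sla_hint=None):
        if sla_hint == "streaming":
            return "contiguous-track-extent"   # no gaps on sequential reads
        if self.access_count.get(obj, 0) >= self.HOT_THRESHOLD:
            return "outer-short-seek-zone"     # promote hot objects
        return "default-zone"

p = ObjectPlacement()
for _ in range(3):
    p.record_access("hot.dat")
assert p.zone_for("hot.dat") == "outer-short-seek-zone"
assert p.zone_for("cold.dat") == "default-zone"
assert p.zone_for("movie.avi", sla_hint="streaming") == "contiguous-track-extent"
```

Only the drive can make this call, because only the drive knows where its short-seek zones actually are.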

How to Change the Interface
Of course, at this point I'm not telling the drive vendors anything they don't already know. Seagate, in particular, drove creation of the T10 OSD interface and has been a big advocate of the object interface for drives. The problem is, after almost ten years, they have had limited success. As Christensen pointed out, changing a major interface in a horizontally integrated industry is really hard. No one wants to develop a product to a new interface until there is already an established market. That means not only must there be products that plug into the other side of the interface, but they must be fully mature and 'baked', with an established market. So the industry sits deadlocked on this chicken-and-egg problem. I think there is hope, though, and here is my advice on how to create a path out of this deadlock.

1. Up-level the discussion and speak to the right audience
The consumers of the features enabled by OSD drives are file system, RAID, and database application developers. The T10 spec defines the transport mechanism, but that discussion is highly uninteresting to this audience. They need to know the specific value they get by storing objects, and they need to understand that it's value they can ONLY get by offloading to the embedded disk processor. In addition, it needs to be expressed in their language - as an object API. This is about storing objects, and it maps into the object-oriented view of development: an interface for persisting objects. These objects have public properties that can be set by the application to define required performance, security, and other attributes to be applied when persisting the data. It's basic OO Design 101.
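For illustration only, such an object persistence API might look like the sketch below to a file system or database developer. Every class, method, and property name here is hypothetical; no standard API of this shape exists yet.

```python
# Hypothetical object-persistence API: store objects with public
# properties the drive honors when persisting. All names are invented.

class StorageObject:
    def __init__(self, data, performance="normal", secure_erase=False):
        self.data = data
        self.performance = performance     # e.g. "fast-access", "streaming"
        self.secure_erase = secure_erase   # scrub media when deleted

class ObjectStore:
    """Stand-in for an OSD drive's object interface."""
    def __init__(self):
        self._objects = {}
        self._next_id = 0

    def persist(self, obj):
        self._next_id += 1
        self._objects[self._next_id] = obj
        return self._next_id               # object ID replaces block ranges

    def retrieve(self, oid):
        return self._objects[oid]

store = ObjectStore()
oid = store.persist(StorageObject(b"index page", performance="fast-access"))
assert store.retrieve(oid).performance == "fast-access"
```

The pitch to this audience is the property bag, not the transport: set `performance` or `secure_erase` once and let the embedded disk processor do the rest.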

2. Standardize this higher-level API
Seagate already gets the need for standards and has driven one for the transport protocol. I hope some standardization of the higher-level API is happening in the SNIA OSD Workgroup. For any serious developer to adopt an API built on HW features, the HW must be available from multiple sources, and different vendors must provide consistent behavior for some core set of functions. Of course, this lets direct competitors in on the game, but it up-levels the game to a whole new level of value.

3. Leverage open source and community development
I continue to see open source leading the way at innovating across the outdated interfaces. HW vendors who are locked into the limitations of these outdated interfaces have the most to gain by enabling value-add in their layer through open-source software but, they seem to have a blind spot here. Leverage this opportunity! It's not about traditional market analyses of current revenue opportunities. It's about showing the world whole new levels of value that your HW can offer and about getting that to early adopters so those features gain maturity.

Many of the pieces are already there. IBM and Intel have OSD drivers for Linux on SourceForge. One is coming from Sun for Solaris. File systems are there from Cluster File Systems (Lustre) and Panasas. Emulex has demo'd an FC driver for Linux. Most of the missing pieces are the object persistence API and disk firmware. Also, the beauty of community development is that you don't have to staff armies of SW developers to do it. A small group focused on evangelizing, and on creating, leading, and prototyping open development projects, is enough. The developers are out there, the customer problems are there, and the start-ups and VC money are out there looking to create these solutions. Finally, although open source leads the way on Linux and OpenSolaris, if the value prop is compelling enough, developers will find a way to bypass the block stack in Windows, which will in turn force Microsoft to support this interface so they can insert Windows back into the value chain.

4. Make the technology available to developers as cheaply as possible
The open development community is not going to leverage new HW features if they can't get the HW. Sounds fairly obvious but the FC industry in particular is missing the boat on this. Lustre and PanFS have been implemented on IP. IBM and Intel's OSD drivers on Sourceforge are for iSCSI. The irony is that Lustre and PanFS, which focus on HPTC where they could most benefit from FC performance, have been forced to move to IP, promoting the misconception that FC has some basic limitations that prevent its use in HPTC compute grids.

Any developer should be able to buy a drive and download OSD FW for it. Ideally, this should include not only a set of expensive FC drives, but also a $99 SATA drive available at Fry's. Hopefully the FW development processes at the drive vendors have evolved to the point where the FW is modular enough that a small team can take the FW source code for a new drive, plug in the OSD front-end, and release it on a download site for developers.

5. Participate as part of the developer community
Create an open bug database and monitor and address those issues. As early developers use this FW, they need a way to report problems, track resolution, and generally get the feeling that the disk vendors are committed to supporting this new API. In addition, consider opening the source for the OSD interface part of the disk FW. The 'secret sauce' for handling properties can still be kept closed. This will accomplish several things. One, it will drive the de-facto standard (one of the primary reasons for open-sourcing anything). Two, it will enable drive vendors to leverage bug fixes and enhancements from the open-source community. Three, it will help build trust from the database/file system/RAID vendors that this interface really is mature and that they retain some control over the ability to find and fix problems. Four, it will help second-source vendors implement consistent basic functionality.

This will take time but the ability to innovate along more dimensions than just capacity and the resulting value-add that customers are willing to pay for is worth the long-term investment. The key requirements to adopting this new interface are to communicate the value of this new functionality to the developers who will use it in terms they understand; make the functionality readily available to them and provide as much of the solution as possible; and build their trust by enabling second source suppliers and using early adopters such as the HPTC and open developer community. Finally, if any drive vendor wants help creating a specific plan, send me a note through the contact link on this blog page and we can talk about a consulting arrangement.