Storage Thoughts: October 2006

Tuesday, October 31, 2006

Seeing What's Next Part II, Undershot Customers

The undershot market includes those enterprise datacenters that are trying to 1) keep the company's mission-critical data on centralized storage servers and serve that data to many application servers on a large storage network; 2) keep that information 100% available even in the face of major disasters; 3) manage an increasing amount of information while their IT budget gets cut every year; and 4) comply with all the laws and regulations for storing, securing and tracking the various types of information stored in digital form. I've talked to several datacenter managers who simply don't know how to do all of this with the products available today. At best, they can put together various point solutions (often advertised as 'Compliance Solutions' but really just tools for solving a piece of the problem). These solutions require integrating a variety of components and lots of administration to get everything working together. These undershot customers would happily pay more for improved products that provide a relatively simple, integrated solution to these four requirements.

As Clayton describes, incumbent players (EMC, IBM, etc.) are strongly motivated to add new features that they can charge these undershot customers for, especially in a situation like today where traditional block arrays are rapidly becoming commodities. The problem is that sometimes what he terms a 'radical sustaining innovation' is required. This is when a major re-architecture that changes the whole system from end to end is required to meet new customer needs. An example is when AT&T changed its whole network from analog to digital in the 1970s. They just couldn't move forward and add significant new value without making that end-to-end change.

That's where storage is today. I've repeated that message in this blog but will say it again because Clayton makes this such an important point for understanding innovation. The block interface, which is the primary interface for almost all enterprise-class storage, is over thirty years old now. The basic assumptions the block interface was based on don't apply anymore at these undershot customers. Block-based filesystems assumed storage was a handful of disks owned by the application server. They assumed the storage devices had no intelligence, so they disaggregated information into meaningless blocks. A block storage array has no inherent way to know anything about the information it is given, so it's extremely limited in its ability to manage any information lifecycle, or comply with information laws, or do anything else involving the information. The problem is that now that information has moved out of the application server, away from the filesystem or the database application, and now that it lives in the networked storage server, the information MUST be managed there. This is why an object-based interface such as NFS, CIFS, or OSD, which allows the storage server to understand the information and its properties, is essential. So, the question is, who can create this 'radical sustaining innovation' with its changes up and down the stack from databases and filesystems down to the storage servers?
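To make the block vs. object contrast concrete, here is a minimal Python sketch. The class names, the retain_until attribute, and the deletion policy are all hypothetical illustrations; this is not the NFS, CIFS, or OSD protocol, nor any vendor's API. It only shows that a block write carries nothing the server could use to apply a policy, while an object write can carry properties the server can act on.

```python
from dataclasses import dataclass, field
from datetime import date


class BlockStore:
    """Block-style interface: the server sees only block numbers and bytes."""

    def __init__(self):
        self.blocks = {}

    def write(self, lba: int, data: bytes) -> None:
        # Nothing here says which file, record, or owner the bytes belong to.
        self.blocks[lba] = data

    def erase(self, lba: int) -> None:
        # The server cannot tell whether erasing this block violates a policy.
        self.blocks.pop(lba, None)


@dataclass
class StoredObject:
    data: bytes
    attrs: dict = field(default_factory=dict)


class ObjectStore:
    """Object-style interface: data arrives with properties the server can use."""

    def __init__(self):
        self.objects = {}

    def put(self, name: str, data: bytes, **attrs) -> None:
        self.objects[name] = StoredObject(data, dict(attrs))

    def delete(self, name: str, today: date) -> bool:
        # Hypothetical policy: refuse deletion while a retention date is in force.
        retain_until = self.objects[name].attrs.get("retain_until")
        if retain_until is not None and today < retain_until:
            return False
        del self.objects[name]
        return True
```

With this sketch, an object stored with retain_until set to some future date cannot be deleted before that date, and the storage server enforces that on its own. The BlockStore has no equivalent hook, because the information it would need never crosses the interface.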

If we were back in the early 1980s this would be easy. One of the vertically integrated companies such as IBM or DEC would get their architects from each layer of the stack together, create a new set of proprietary interfaces, and develop a new end-to-end solution. Then, if they did their job right, the undershot customers would be happy to buy into this new proprietary solution because it does a better job of solving the four problems above. The problem today is that many of these companies either don't exist anymore, or they've lost the ability for this radical innovation after twenty years in our layered, standards-based industry.

Who could pull off this radical sustaining innovation spanning both the storage and the application server? Clayton recommends looking at various companies' strengths, past records, and their resources, priorities and values to attempt to identify who the winners might be. Here's my assessment of the possible candidates.

IBM
IBM is probably the farthest along this transition with its StorageTank architecture. StorageTank is a new architecture for both filesystems and storage arrays that does just what I described above - it uses an object protocol to tell the storage array about the information so it can be managed and tracked. What I don't know is how successful it has been or how committed IBM is to this architecture. Twenty years ago it was in IBM's DNA to invest in a major integrated architecture like this. Whether that's still the case, I don't know. Another question is how well the array stands up as a block array in its own right. Datacenters are heterogeneous. It's fine to implement enhanced, unique functions when working with an IBM app server but the array must do basic block to support the non-IBM servers as well. Will customers buy StorageTank arrays for this? I don't know.

Microsoft
Microsoft is a strong candidate to be one of the winners here. They have a track record of doing interdependent, proprietary architectures. They did it with their web-services architecture. They are moving into the storage server with their Windows Storage Server (WSS) - a customized version of the Windows OS designed to be used as the embedded OS in a storage server. Although I don't know specifics, I would bet they are already planning CIFS enhancements that will only work between WSS and Windows clients. Another strength of Microsoft is they understand that, in the end, this is about providing information management services to the applications and gaining support of the developers to use those services. As storage shifts from standard block 'bit buckets' to true information management devices this application capture becomes more important. This is not part of the DNA of most storage companies. Finally, as we all know, they are willing to take the long-term perspective and keep working at complex new technology like this until they get it right. On the flip side, these undershot customers tend to be the high-end mission critical datacenter managers who may not trust Microsoft to store and manage their data.

EMC
EMC is acting like they're moving in this direction, albeit in a stealth way. They have a growing collection of software in the host stack including Powerpath (with Legato additions) and VMWare. They also have a track record of creating and using unique interfaces so it would not be out of character to start tying their host software to their arrays with proprietary interfaces. What they don't have though is a filesystem or, to my knowledge, an object API used by any databases. They also don't have much experience at the filesystem level. They could build a strong value proposition for unstructured data if they acquired the Veritas Foundation Suite from Symantec. An agreement with Oracle on an object API, similar to what Netapp did with the DAFS API would enable a strategy for structured data.

Oracle
Oracle is a trusted supplier of complex software for managing information and they have been increasing their push down the stack. They have always bypassed the filesystem and now with Oracle Disk Manager, they are doing their own volume management. Recent announcements relative to Linux indicate they might start bundling Linux so they don't need a third party OS. With the move to scaleable rack servers and the growth of networked storage, they must be running into the problem that it's hard to fully manage the information when it resides down on the storage server. This explains Larry's investment in Pillar. My prediction is that once Pillar gains some installed-base in datacenters as a basic block/NAS array, then we will see Oracle-specific enhancements to NAS that only work with the Pillar arrays.

Linux and open development
Clayton groups innovation into two types. One type is based on solutions built from modular components designed to standard interfaces. For example, a Unix application developed to the standard Unix API runs on an OS that uses standard SCSI block storage. Here innovation happens within components of the system and customers get to pick the best component to build their solution. The second type is systems built from proprietary, interdependent components such as an application running on IBM's MVS OS which in turn uses a proprietary interface to IBM storage. Because standard interfaces take time to form, the proprietary systems have the advantage when it comes to addressing the latest customer problems. When new problems can't be solved by the old standard interfaces, it's the proprietary system vendors who will lead the way in developing the best new solutions.

What Clayton doesn't factor in however, is the situation in the computing industry today where we have open source and open community development. A board member of the Open Source Foundation once explained to me that the main reason to open source something is not to get lots of free labor contributing to your product. The main reason is to establish it as a standard. For example, Apache became the de-facto standard for how to do web serving. This is happening today with NFS enhancements such as NFS over RDMA, pNFS, and V4++. They first get implemented as open source and others look to those implementations as the example. Because Linux is used as both an application server AND as an embedded OS in storage servers, both sides of new proprietary interfaces get developed in the open community and can quickly become the de-facto standard. This is what I love most about Linux and open source. Not only is it leading so much innovation within software components, but when the interface becomes the bottleneck to innovation, the community invents a new interface and implements both sides making that the new standard.

Friday, October 27, 2006

'Seeing What's Next' in the Storage Industry, Part I

I'm reading Clayton Christensen's latest book titled Seeing What's Next. This is the third book in a series after The Innovator's Dilemma and The Innovator's Solution. This book leverages the theories from those two but, as he says in the preface, "Seeing What's Next shows how to use these theories to conduct an "outside-in" analysis of how innovation will change in industry." Perfect. Let's do our homework by applying this to the storage industry - specifically array subsystems and associated data services and storage networking products.

Clayton says that to identify significant change in an industry, look at three customer groups: Non-consumers, Undershot customers, and Overshot customers. I see all three of these in the storage industry.

Non-consumers and Storage
Clayton defines these as potential customers who are not buying the products today because they either can't afford them, or for some reason don't have the ability to use the existing products or to apply them to the problem they are trying to solve. Instead they either go without, hire someone else to do it, or cobble together their own 'less-than-adequate' solution. These customers can be important because it's a place where new technology that appears sub-optimal to existing customers can mature to the point where it becomes attractive to the mainstream. Clayton calls this a 'New-market disruptive innovation'.

I see two such groups for storage subsystems and data services. One group has always been there - the small office/home office market. Most of these customers are still not buying $10k NAS or RAID servers, installing them at home, backing them up, remote mirroring, etc. Instead they put data on a single HDD and maybe manually back up to a second drive (cobble together a less-than-adequate solution), or maybe use an SSP such as Google or Apple's iDisk (hire someone else to do it). A few players, such as StoneFly and Zetera with their Storage-over-IP technology, are pushing into this space, but I'm personally not seeing anything that justifies a significant 'new market disruption'.

The more interesting group that meets the definition of 'non-consumers' is the new group of High-Performance Technical Computing (HPTC) users. These users have big high-performance computing jobs to do but their research budgets don't allow them to buy multimillion dollar mainframes and storage subsystems. They have figured out how to build their own parallel computers using lots of commodity x64 rack servers harnessed together with customized Linux operating systems. Part of customizing Linux to run these highly parallelized compute jobs is to use the Lustre File System. Lustre uses its own object storage protocol on top of commodity IP interconnects so the object-based storage can effectively share storage between many compute nodes. Then, on the storage side, they cobble together their own storage servers using Linux with added Lustre target-side components on commodity hardware.

Much of this solution would be considered 'not-good-enough' by many mainstream enterprise storage customers - ethernet is considered too slow, availability and data integrity are not sufficient, and it requires an 'assemble-it-yourself' Linux storage array. As Clayton's books describe, however, this is exactly how many disruptive innovations start. In addition, the sustaining enhancements to make this acceptable to mainstream datacenters are in process in the open-source community and associated standards bodies. These include RPC over RDMA and 10G ethernet that will exceed current FC performance, improved availability and reliability through enhancements in NFS and pNFS, as well as sustaining Lustre enhancements driven by ClustreFS Inc. I've talked to several managers of large datacenters who are interested in migrating key applications such as Oracle and web services from large mainframes to scaleable rack servers and who are watching and waiting for the storage and SAN technology to support that migration. So, this one clearly looks like a new market disruption in process.

Overshot Customers
These are the customers for whom existing products more than meet their goals and who are not willing to pay more for new features. The customers that were in the soon-to-be-extinct segment called 'mid-range storage' meet this definition. They just want reliable RAID storage and MAYBE a few basic data services such as snapshot, etc. I've talked to several ex-midrange customers who know there are low-end arrays that meet their reliability and availability goals, and they just want to know how cheaply they can get them.

The other giveaway that overshot customers exist is the rapid growth of companies supplying what used to be considered not-good-enough technology. This is what we see with the growth of Netapp and NAS storage. NAS has been considered sub-optimal for datacenter storage for several reasons: no RDMA, limited path failover, and data that can't be seamlessly migrated between, or striped across, NAS servers. Netapp's growth proves that more customers are finding these limitations acceptable. In parallel, the NFS and pNFS enhancements in process will solve these reliability/availability restrictions. So, I'm betting that this is a low-end disruption that will just keep growing.

Undershot Customers
Undershot customers are those who have trouble getting the job done with the products available today and would be willing to pay more for added features that help. These are the customers companies love. They can invent new features and charge more for them. Storage has lots of undershot customers driven by the need to comply with information laws, while at the same time consolidating storage on the storage network, while at the same time keeping it 100% available. The storage industry is putting a lot of energy into claims they can add features to help these undershot customers. The problem though, as described in The Innovator's Solution (chapter 5), and in my post on Disruption and Innovation in Data Storage, is that sometimes a component designed to a modular interface can't effectively solve new problems due to the restrictions of the now outdated standard interface. The inventors of the block interface standard never planned for today's data management problems or the type of highly intelligent, networked storage that we have today. In a case like this, the competitive advantage shifts to an integrated company that can invent a new proprietary interface and provide new functionality on both sides of the interface (in this case both the storage server AND the data client/app server). This has been a common theme of mine in this blog. Designing a storage system to meet these new requirements requires a new interface between app server and storage or, stating it in Clayton's terms, the competitive advantage has swung to companies with an interdependent architecture that integrates multiple layers of the stack across a proprietary interface.

I'm going to leave you with that thought for today and continue with it in Part II. In the meantime, think about EMC and the host software companies it's been acquiring, or Microsoft with its move to the storage side with VDS, or Oracle and its investment in an array company (Pillar).

Wednesday, October 25, 2006

Confirmation that File Virtualization is hot

SearchStorage released an article today titled File virtualization tops hot technology index. As described here: "File virtualization beat archiving, data classification, encryption and ILM to the top spot," said Robert Stevenson, managing director of TheInfoPro's storage sector. He attributes the interest in file virtualization to network attached storage (NAS) growth in the data center, with average capacity deployments in the last month at 220 terabytes (TB) and the long project timelines needed for block virtualization. "Storage professionals have been focusing their near-term energies and budgets on improving file content management," Stevenson said.

You read it here first
This aligns with my ideal SAN as described in my post on 19-Sep. In my SAN, information is stored in meaningful groupings (files) with associated properties that let the storage servers apply useful data services based on the information being stored. In addition, I want many of the virtualization features available for today's block SANs including mirroring, striping, and remote replication across multiple storage devices. It seems I'm not alone and that Fortune 1000 storage admins are interested in the same thing.

Note to Technology Venture Investors
The article goes on to state that there are no standards around this technology yet, so every file virtualization and namespace technology has a different way of talking to storage services. That is true, but the underlying standards are evolving in the form of pNFS and NFS V4++. As I commented in my pNFS post, helping to evolve pNFS and building the data services and APIs above it top my list of promising start-up opportunities. If I were leading such a startup, I would execute a strategy of building this software in open source, on Linux, and would partner with storage vendors such as NetApp, Panasas, and maybe even EMC, and define the APIs in cooperation with as many ISVs as I could, including Oracle and Microsoft. Such a startup has a good chance of then being acquired by one of these companies.

Tuesday, October 10, 2006

Notes on Various Storage-related ISVs

I'm reading the material on the websites for various storage software startups - trying to get past the grand claims of how their widget will solve all your storage and data management problems and dig out the hidden clues to what their products can, and can't do. Here they are, in no particular order:

Clearpace
Actually not a bad website. They provide an algorithm for compressing structured data records. That, in itself is nothing new but what's unique is while compressed, they retain the ability to search based on keywords (using SQL) to help compliance with data laws. So, for example, you could keep all your OLTP transaction records for the last year on nearline VTLs and the finance or legal department could query them at any time.

The amount of compression varies. According to their website, the algorithm de-duplicates fields in records, keeping only one copy and replacing other copies with references to that single copy of the field. The presumption is that most databases store many copies of the same data in different records. I don't know how to verify that but they claim you can achieve up to 10:1 compression.

Seems like useful technology that meets a real need. What I don't have a good feel for is how to operationalize this. Do you create a snapshot every day, then run it through this application as you copy to the archive? Do you run the daily incremental backup through it and can the application put that together with previous incremental backups? I'm curious. I'm also curious whether most databases really duplicate that much data and, if they do, how long will it be before they add a built-in feature to compress the database to create a similar searchable nearline archive.
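The field-level de-duplication described above can be sketched in a few lines of Python. This is only an illustration of the general idea (dictionary-style encoding with search on the encoded form); Clearpace's actual algorithm is not public in what I read, and the class and method names here are my own hypothetical ones.

```python
class DedupedTable:
    """Stores each distinct field value once; records hold references to it."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.values = []        # one copy of each distinct field value
        self.value_ids = {}     # value -> index into self.values
        self.records = []       # each record is a tuple of value indexes

    def _intern(self, value):
        # Keep a single copy of the value and hand back a reference to it.
        if value not in self.value_ids:
            self.value_ids[value] = len(self.values)
            self.values.append(value)
        return self.value_ids[value]

    def insert(self, record):
        self.records.append(tuple(self._intern(record[c]) for c in self.columns))

    def search(self, column, value):
        """Find matching records by comparing references, expanding only the hits."""
        vid = self.value_ids.get(value)
        if vid is None:
            return []
        col = self.columns.index(column)
        return [self._expand(r) for r in self.records if r[col] == vid]

    def _expand(self, refs):
        return {c: self.values[i] for c, i in zip(self.columns, refs)}
```

The point of the sketch is that a query never has to decompress the whole archive: the search value is looked up once, and records are matched by comparing small references. How much space this saves depends entirely on how repetitive the fields really are, which is the open question raised above.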

Continuity Software
Software that analyzes your SAN topology and identifies data protection and availability risks such as configuration errors, inconsistent LUN mapping, unprotected data volumes, etc. Includes a knowledge-base that provides suggestions for best-practices around disaster recovery and data recovery.

What's not clear is how the application gets the data to analyze and how it gets updates when changes are made. The software is 'agent-less' but they claim it has automation to detect the configuration on its own. They also sell a 'service offering' (translated - you pay for the labor for them to come into your datacenter and do the work). They collect the configuration and enter it into the tool, which in turn shows you the risks.

Scalant
Scalant produces a layer of system software that turns a compute grid into a 'super cluster' providing HA and load balancing. This is something Sun claimed to be doing a few years ago (anyone remember N1?). It includes three components. Like traditional clusters, it includes a layer on each compute node that monitors the health of that node and provides a heartbeat so others know it's alive. The second component, unlike traditional clusters, is a monitor (which itself runs on an HA-clustered pair of servers) that monitors the overall health of each compute node and receives heartbeats. It also stores the 'context' of the various applications running on the grid. It detects the failure of an application on a compute node and restarts it on another one. The third component is the config and monitoring GUI.
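The heartbeat-and-restart behavior of the second component can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not Scalant's implementation: the class name, the timeout rule, and the 'least-loaded live node' placement choice are all hypothetical, and time is passed in explicitly so the logic is easy to follow.

```python
class ClusterMonitor:
    """Tracks node heartbeats and restarts applications from failed nodes."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence before a node is presumed dead
        self.last_seen = {}         # node -> time of last heartbeat
        self.placement = {}         # application -> node it currently runs on

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def start(self, app, node):
        self.placement[app] = node

    def live_nodes(self, now):
        return sorted(n for n, t in self.last_seen.items() if now - t <= self.timeout)

    def _load(self, node):
        return sum(1 for n in self.placement.values() if n == node)

    def check(self, now):
        """Move every app on a silent node to a live one; return the moves made."""
        live = self.live_nodes(now)
        moves = []
        for app, node in sorted(self.placement.items()):
            if node not in live and live:
                # Assumed policy: pick the live node hosting the fewest applications.
                target = min(live, key=self._load)
                self.placement[app] = target
                moves.append((app, node, target))
        return moves
```

A real monitor would also have to carry the application 'context' to the new node and fence the failed one; the sketch only covers detection and re-placement.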

What's interesting to me about Scalant's approach is the implications for the storage network. FC is not a good choice for this type of compute grid. One, it's too expensive and not available onboard these types of scaleable compute nodes. Mostly, it doesn't have good functionality for the sharing and dynamic reconfiguration you really want to support automatic migration of applications around a large compute grid. You really want a SAN like I described in My Ideal SAN.

First, you want sub-pools of compute nodes running the same OS configs and you want easy scaleability. So no internal boot disks to manage. You want IP-based boot with a central DHCP server to route the blade to the right boot LUN. You would like data services in the array so all these sub-pools can boot from the same volume. Then, you would like the Application Contexts managed by the Scalant cluster monitor to include references, by name, to the data volumes that application needs, so when it instructs a compute node to start up an app, the node knows how to find and mount the data volumes it needs. Finally, you would like some form of object-based storage that can share data between multiple nodes to support parallel processing as well as HA failover clusters.

CopperEye
OK, this is the first company I've researched today who doesn't have the smiling person on the homepage. I like them already.

CopperEye is addressing the same problem as Clearpace above: the need to quickly search large transaction histories based on content/keywords. Unlike Clearpace, CopperEye indexes data in place and builds a set of tables that fit in a relatively small amount of additional storage. They claim their algorithms and the structure of the tables allow for flexible, high-speed searches. Although they never explicitly mention structured vs. unstructured data, their site usually talks in the context of searching transactions so I think the focus is structured data. I didn't see a mention of SQL but they do have a graphical UI. Here's their description:

CopperEye Search™ is a specialized search solution that allows business users to quickly find and retrieve specific transactions that may be buried within months or years of saved transaction history. Unlike enterprise search solutions, CopperEye Search is specifically targeted at retrieving records such as credit card transactions, stock trades, or phone call records that would otherwise require a database.

Datacore
Datacore is not new and is not a startup although they are a private company. They provide a software suite that lets you turn a standard x86 platform into an in-band virtualization device. A key feature is the ability to under-provision volumes and keep a spare pool that is dynamically added to volumes as necessary. Other features include intelligent caching, data mirroring, snapshots, virtual LUNs, LUN masking, etc. It runs on top of Windows on the native virtualization device so it can use the FC or iSCSI drivers in Windows, including running target mode on top of either.

This looks like a nice product. It's been shipping since 2000 and is up to rev 5 so it ought to be pretty robust and stable. It runs on commodity hardware and can use JBODs for the back-end storage. Provided the SW license is reasonable, this can be a nice way to get enterprise-class data management on some very low-cost hardware.
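The under-provisioning feature mentioned above (what is usually called thin provisioning) can be illustrated with a short Python sketch. This is my own simplified model, not Datacore's design: the class name and the one-block-at-a-time allocation are assumptions made for illustration.

```python
class ThinPool:
    """A shared pool of physical blocks handed out to volumes on first write."""

    def __init__(self, physical_blocks):
        self.free = physical_blocks
        self.volumes = {}   # volume name -> (advertised size, {virtual block: data})

    def create_volume(self, name, advertised_blocks):
        # Creating a volume consumes no physical space, so the advertised
        # sizes of all volumes may add up to more than the pool holds.
        self.volumes[name] = (advertised_blocks, {})

    def write(self, name, block, data):
        size, mapped = self.volumes[name]
        if not 0 <= block < size:
            raise IndexError("block outside advertised volume size")
        if block not in mapped:
            if self.free == 0:
                raise RuntimeError("spare pool exhausted; add physical storage")
            self.free -= 1      # allocate a physical block only on first write
        mapped[block] = data

    def allocated(self, name):
        return len(self.volumes[name][1])
```

The trade-off is visible in the RuntimeError path: volumes can advertise far more capacity than exists, which saves money up front, but someone has to watch the spare pool and add storage before it runs dry.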

Avail
Avail has developed a SW product that works with the Windows File system to synchronously replicate files among any number of Windows systems. They call it the Wide Area File System (WAFS). It replicates only the changed bytes to minimize data traffic, and it works through the HTTP protocol, so it can pass through any firewall that allows HTTP traffic and truly works over WANs. It can replicate from a desktop to a server or between desktops. Users always open a local copy of the file, but the local agent gets notified if a change has been made to one of the remote copies and it makes sure any reads return the most recent copy of the data. It does this by implementing a lightweight protocol so that as soon as a file or directory (folder) is updated, all mirrors get quickly notified, although actual data movement may happen in the background.

Provided this is robust, it's kind of cool technology. It allows both peer-to-peer file sharing as well as backup/replication.
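The 'replicate only the changed bytes' idea can be sketched as below. Avail doesn't describe its algorithm, so this uses a simple fixed-size chunk comparison as a stand-in; the function names and the tiny default chunk size are hypothetical and chosen only to keep the example readable.

```python
def diff_chunks(old: bytes, new: bytes, chunk: int = 4):
    """Return (new_length, [(offset, bytes), ...]) for chunks that differ."""
    changes = []
    for off in range(0, len(new), chunk):
        piece = new[off:off + chunk]
        if old[off:off + chunk] != piece:
            changes.append((off, piece))
    return len(new), changes


def apply_chunks(old: bytes, delta):
    """Rebuild the new version on a mirror from its old copy plus the delta."""
    new_length, changes = delta
    # Start from the old copy, truncated or zero-padded to the new length.
    buf = bytearray(old[:new_length].ljust(new_length, b"\0"))
    for off, piece in changes:
        buf[off:off + len(piece)] = piece
    return bytes(buf)
```

Only the delta needs to cross the WAN; the mirror applies it to the copy it already has. A fixed-offset comparison like this handles in-place edits well but degrades when bytes are inserted in the middle of a file, since everything after the insertion shifts; a production tool would likely need something smarter for that case.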

That's all for today. I'll follow up with a Part II in a few days.

Monday, October 09, 2006

Array Chart Rev 2 and Responses to Comments

I added another array vendor to my Array chart - Agami. They are another startup doing scaleable NAS with file-aware services that runs on low-cost commodity hardware. So, they go in the upper left somewhere near Isilon. I updated the chart in my previous post.

Responses to Several Comments
Thanks for the great comments over the last several weeks. Good comments on NFS V4 that I need to do more research on. Here are responses to some of the others:

Isilon
Good corrections and additions to my description of Isilon. I updated my notes in Part II.

EMC and Innovation
Good point that just because EMC acquired a bunch of companies, that doesn't make it an 'innovator'. I guess I'm giving credit to those new EMC employees who did some new and unique things before getting acquired by EMC. In particular, I like what VMWare, Rainfinity, Invista, and some of the other Data Services start-ups created.

iSCSI IP overhead and ATA over Ethernet
One comment questioned why anyone would take the overhead of running IP for iSCSI traffic versus just using ATA over ethernet, and asked who is using IP to route iSCSI traffic. I'm not enough of an IP expert to really answer that but I do have two comments. One, I've talked to several datacenter managers who are looking at moving apps from big-iron to dual and quad-x64 rack servers. They're finding they have plenty of spare processing power and, for now at least, wouldn't even notice if they had a more efficient protocol stack. Second, a big reason for using iSCSI is to get the automated network utility protocols such as DHCP and DNS for their ethernet SAN. Can ATA over ethernet work in such a network?

Sunlabs
Had a comment requesting more insight into what's going on in Sunlabs. I'm not with Sun anymore (which is why I've moved to blogspot for weblogging), but Sun really has become very open and transparent and you can put together a lot by reading their blogs and by looking at open Solaris. Jonathan's blog is always a good way to learn where his head is at. In his post The Rise of the General Purpose System he talks about custom storage hardware getting replaced with commodity HW running specialized, but open-source-based software and specifically mentions their Thumper project that packs 48 drives into a 4U enclosure with a standard x64 motherboard. Another interesting one is Jeremy Werner's Weblog where he talks about Honeycomb. This is software, based on Solaris, that stores data reliably across a large number of commodity storage platforms (such as Thumpers) and provides a Content Addressable Storage (CAS) API. So, imagine building your own Google inside your datacenter for your company's knowledge base.

Other visible datapoints: Lots of blogs and visibility (including open source) around ZFS and its interesting file-level data services. OpenSolaris.com includes some interesting storage-related projects including iSCSI with both initiator and target-side functionality, and an Object-based Storage Device (OSD) driver. Sun continues to lead enhancements to NFS to give it the availability and performance to finally displace block/FC in the enterprise datacenter. Many highly mission-critical datacenters continue to run SAM-FS to automatically archive and store huge filesystems across both disk arrays and tape libraries.

Put all this together and what you might expect are NAS storage products with an iSCSI option, based on Solaris, running ZFS, with standard AMD64 and Sparc motherboards. They will come in scaleable, rack-mount form factors. You might have the option to run Honeycomb to build large content-searchable storage farms or the option to use SAM to archive data to tape libraries.

Keep the comments coming!

Monday, October 02, 2006

Array Vendor Chart

I've been trying to figure out how to map all these vendors onto a single chart. They can be rated on many criteria so it's hard to reduce them to a two-dimensional chart but, I'm going to try. I believe storage of the future will be based on technology that reliably stores and manages information on scaleable commodity HW much like the trend with rack servers today. So I created a two-dimensional chart that maps new innovation (at managing and storing information) vs. cost. New innovation is on the y-axis and higher is better. Cost is on the x-axis and lower is better. This chart will be an ongoing work-in-progress but, my first revision is shown below.

I've tried to show where the customer groups fall on these scales. For example, mission-critical enterprises are willing to pay for expensive products and service so they fall far to the right. They have tough data management problems so they need some innovation, but they are too risk averse to go for very new technology, so they fall midway along the y-axis. SMB is to the left of that. They typically want basic block storage with common features but want to save money so they don't buy the most expensive equipment. HPTC users, in the upper left, are leading the way in innovating technology for solving tough computing problems but typically have limited budgets, so they are using the community development process to drive innovation through open-source software.

Placing the Vendors
Now, where to put the array companies? EMC is clearly the cost leader (loser?) so they go far to the right. In terms of innovation, I give them the benefit of the combined invention of all the companies they've acquired, but they basically stay within their block storage framework and stay away from bleeding-edge technology so they go in the middle. I put IBM higher on the innovation scale because I think they're on the right track with StorageTank. In the lower left are the simple integrators like StoneFly and Celeros. They basically integrate commodity hardware and software components. Not much innovation, but they provide it at a very low cost. Then, in the upper left is the innovation that I like. Software-based invention that leverages commodity arrays and motherboards. Panasas is doing this. I listed ClustreFS on here. They aren't an array vendor but they do the file system and add-ons to Linux that 30% of the Top-100 supercomputers use to create their storage grid. I haven't quite figured out where to put Netapp. They are driving a lot of important innovation but I need to check their prices to see where they fit on the price scale.

So here it is, rev 2 of my array vendor chart, with all its errors and probably several missing storage vendors.
