Wednesday, August 23, 2006

Notes on Continuous Data Protection (CDP)

I'm using my blog as notebook again today. Today's topic is CDP.

The benefit of CDP is you can restore data to any arbitrary point in time - unlike Snapshot/PIT, where you might snapshot say once every 24 hours. With CDP, if you corrupt your database, you can restore it to its state just before the event, as opposed to Snap where you may have to go back to its state 24 hours ago. So, CDP is for protecting against an application or operator error that corrupts your database. It kind of reminds me of the VMS workstation I had years ago. By default, VMS saved three revisions of each file (and you could make it more). Usually, it just wasted space but occasionally I screwed up a file and was really glad I had the older rev.

The basic technology is similar to Copy-on-Write Snapshot. It can run in a Volume Manager, in the Array, or an intermediate Switch/Virtualization Engine. With COW you allocate some amount of storage that is smaller than the original volume. On a write to the original volume/LUN, the old data, as it existed at the time the snap was taken, is copied to this extra space, the new data is written to the original volume/LUN as usual, and a log of writes gets updated. Reads to the original volume/LUN return the new data, as normal. The snap gets exposed as a new, Read-Only Vol/LUN. On a read of the snap, the log is checked. If reading blocks that have been updated, the old blocks are returned from the extra storage space. If reading blocks that have NOT been updated, they are read from the original Vol/LUN. So, the amount of extra space to allocate is a function of a) how long you want to keep the snap, and b) how much data your apps write to this volume. The downsides of COW are a) you need more storage space, and b) the performance impact of copying blocks on write commands.
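A rough sketch of the COW logic described above, in Python. The class and method names are made up for illustration and a real implementation works on disk blocks, not list entries. It only shows the bookkeeping: copy the old block aside on the first write after the snap, and check the log on snapshot reads.

```python
class CowVolume:
    """Toy copy-on-write snapshot over a list of blocks (hypothetical names)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)   # the original volume/LUN
        self.snap_active = False
        self.saved = {}              # extra space + log: block number -> old data

    def take_snapshot(self):
        self.snap_active = True
        self.saved = {}

    def write(self, block_no, data):
        # Only the first write to a block after the snap copies the old data aside.
        if self.snap_active and block_no not in self.saved:
            self.saved[block_no] = self.blocks[block_no]
        self.blocks[block_no] = data

    def read(self, block_no):
        # Reads of the original volume always return the new data.
        return self.blocks[block_no]

    def read_snapshot(self, block_no):
        # Check the log: updated blocks come from the extra space,
        # untouched blocks are read from the original volume.
        if block_no in self.saved:
            return self.saved[block_no]
        return self.blocks[block_no]


vol = CowVolume(["a0", "b0", "c0"])
vol.take_snapshot()
vol.write(1, "b1")
vol.write(1, "b2")   # second write to the same block: no extra copy
print(vol.read(1), vol.read_snapshot(1), len(vol.saved))
```

Note that the extra space only grows with the number of distinct blocks written, which is why it can be smaller than the original volume.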

CDP is an enhancement. Instead of just copying the old data to the extra space and logging which blocks have been written, CDP time stamps the updated blocks and keeps every copy of blocks that get written multiple times. Then, the administrator can take a snapshot at any point in time and on reads, the CDP logic will find the copy of the block as it existed at that point in time. The downside is a volume that gets lots of writes will require much larger extra space. The performance impact is about the same as COW - a data copy and log update for each write.
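To make the difference from COW concrete, here is a minimal sketch of the time-stamped log, again in Python with hypothetical names. It assumes writes arrive with increasing timestamps and keeps everything in memory. It is only meant to show why every overwritten copy has to be kept.

```python
class CdpVolume:
    """Toy CDP log: every overwritten block is kept with a timestamp."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.history = {}   # block number -> list of (time overwritten, old data)

    def write(self, block_no, data, timestamp):
        # Unlike COW, the old data is saved on EVERY write, not just the first.
        self.history.setdefault(block_no, []).append(
            (timestamp, self.blocks[block_no])
        )
        self.blocks[block_no] = data

    def read_as_of(self, block_no, point_in_time):
        # The first overwrite AFTER the requested time holds the data
        # as it existed at that time.
        for overwritten_at, old_data in self.history.get(block_no, []):
            if overwritten_at > point_in_time:
                return old_data
        # Never overwritten after that time: the current block is the answer.
        return self.blocks[block_no]


cdp = CdpVolume(["a0", "b0"])
cdp.write(1, "b1", timestamp=10)
cdp.write(1, "b2", timestamp=20)
print(cdp.read_as_of(1, 5), cdp.read_as_of(1, 15), cdp.read_as_of(1, 25))
```

The history list grows with every write, which is the "much larger extra space" cost mentioned above.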

Who is doing CDP - The big players

EMC. EMC has two CDP products. The first is RecoverPoint, which is really just a feature of their host-based Replication Manager. I'm not sure if this means it runs in a host VM, or it's just a host-based management front end for these data services running in the array. Need to look into that. Second, EMC bought startup Kashya, which makes CDP to run on their Invista suite that runs on intelligent switches/virtualization platforms (called Connectrix, based on McData). According to the Kashya web site, their CDP also runs on the Cisco MDS switch and IBM SAN Volume Controller, and CDP is a Module in a suite of data services (replication, etc.)

Veritas Released CDP for Windows with 'Backup Exec 10d'. The interesting twist on this is it includes an interface for end-users to find old copies of files themselves. Claim it has a 'Google-like' interface. Can't find any info on CDP for any other platforms.

Netapp I saw a Byte&Switch article claiming Netapp got CDP with the acquisition of Alacritus in April 2005. Looking at their website though, it looks like what Alacritus got them was VTL firmware to create their nearline/SATA backup product, and then they partner with Symantec to bundle Backup Exec for Windows and partner with IBM to bundle the Tivoli CDP product.

IBM Has a Tivoli product that runs on client workstations/PCs/Laptops. Does CDP, replication, and backup at the file level. Stores old copies of files over the network to a NAS product like the Netapp. Also, as stated above, the Kashya FW runs on their SAN Controller virtualization product. IBM probably has something for their servers as well.

Microsoft MS provides CDP through a product called Data Protection Manager. Overview description here. It runs on the CIFS file server but claims that it saves changes at the BYTE level. To minimize performance impact it saves the changes locally on the file server but then asynchronously replicates those to a disk-based DPM Server (probably running VDS). An interesting feature is client PCs can use the VSS API to request snapshots to retrieve their own files. Similar concept to Veritas Backup Exec where end-users have their own interface.

HP HP released a product last April called Continuous Information Capture which runs under Oracle on Solaris or MS Exchange or SQL server. It includes a SW layer on the Oracle app server which sends changes to a 'Recovery Appliance' and the old data to the secondary storage device.
This is the Mendocino product rebranded.

Sun A search on sun.com/storage returned no CDP products but the search did return an 'ILM Vision' white paper where Sun states: "We envision an advanced state of information lifecycle management that involves both pervasiveness and importance. Our vision includes ...." and they go on to list CDP. Good luck Sun.

The Startups

Found a long list of startups doing CDP including Kashya (acquired by EMC), Mendocino, Revivio, StorActive, Topio, XOSoft, Zetta Systems, Mimosa.

Mendocino As above, provided the CIC product to HP. Website talks a lot about the need to be application aware. Working with Sybase, and presumably Oracle on integrated database management tools that use underlying CDP APIs from Mendocino. Seems to be their unique angle on CDP.

Revivio Sells a CDP Appliance. Also pushing the application integration strategy. Talks about their SW Suite for Application Integration and modules for integrating with Oracle, Sybase, SQL, Exchange, Lotus Notes, etc.

Storactive, now called Atempo Quick read of their website looks like they are doing CDP at file level for PCs. Also provides a direct client UI so end users can recover old versions of files on their own.

Topio Provides block-level SW suite for Remote Replication (asynch via IP), Snapshot, and claims replication is time-stamped. Not sure if they really do CDP though.

XOSoft Another application-aware CDP and availability SW suite. Supports Oracle, Exchange, SQL.

Mimosa CDP for MS Exchange. Website says they are a 'strong' partner with Microsoft. Interesting - their banner uses the Sun motorcycle jumper photo with the 'S' on the side - but no mention of Sun partnership.

All the big companies have CDP. Looks like CDP has become just another tool in the data management toolbox along with Remote replication, snapshot, etc. As such, seems to be more of a 'feature' than a 'product'. Seems to be two basic approaches: 1) file CDP focused on desktops/notebooks with direct end-user interface to recover old files; and 2) server CDP designed to run under databases with some amount of database integration to make it usable from database tools.

Tuesday, August 22, 2006

Disruption and Innovation in Data Storage

The Disruption in Demand
I've been in storage for almost twenty years and I can't remember a situation where customers are looking for solutions to so many new data management and storage problems as today. It's driven by the convergence of the needs to consolidate storage resources to increase utilization, provide continuous access to data in the face of any disaster, keep data secure and comply with laws and regulations, and to manage the huge amount of data based on its content and service level requirements. While this disruption in demand is creating a market willing to pay a premium for a solution, the storage industry really doesn't know how to provide a solution so growth has been fairly anemic and products are rapidly commoditizing. Yes, there has been some growth in storage startups, VC funding and new products claiming Compliance Solutions, etc. The problem is that most of these are point products that might provide some small benefit but don't really provide the solution that customers are willing to pay a premium for.

There's a disconnect here of the type described by Christensen and Raynor in chapter 5 of "The Innovator's Solution". In this chapter titled "Getting the Scope of the Business Right" they talk about when industry-standard architectures, integrated from components that comply with standard interfaces, are better vs. when an integrated architecture using new, enhanced interfaces and components has the competitive advantage. The advantage moves back and forth based on the level of technology relative to the problems customers are willing to pay to solve.

In the 1970s computer engineers were still working out how to build a computer system that provided reasonable performance, meaning the OS ran well on the processor, which could move data to/from disks, etc. Engineers needed the flexibility to redesign and enhance these interfaces quickly to make them better (they couldn't wait for a standards body). Then, customers were willing to pay for these enhancements, even if it meant paying the premium to get them from a vertically-integrated company like IBM or DEC, because the increased ability to solve their IT problems with their 1970s datacenter was worth it.

Then, we all know what happened in the 1980s. I was at DEC at the time. VAX/VMS machines were great but we were increasingly adding Cadillac features that most customers didn't need but that we forced them to pay a premium for anyway. In other words, we overshot the market. At the same time, what Christensen calls the Dominant Architecture evolved. Applications used standard APIs such as Unix, which ran on Intel or Motorola instruction sets and talked to storage using block SCSI protocols. This shifted the advantage to solutions built from best-of-breed components at each layer of the stack and created the horizontally-integrated industry that we have today. Over and over I go into datacenters and see this heterogeneous layering with a best-of-breed application such as Oracle, running on Unix, on say Dell platforms, with Veritas VM, Emulex HBAs, to EMC storage. This is a big change from twenty years ago.

What enabled this to happen was that twenty years ago several interfaces froze at the technology level of the early 1980s. The industry agreed that the Unix file system API more or less defined the functionality a file system could provide to applications. They could open, close, and read and write files and the file system could handle a few properties such as read-only and revision dates. There was no point in telling the file system anything more about the data because it didn't know what to do with it. Similarly, the industry agreed that below the filesystem the interface was 512 byte blocks with even the few properties the file system knew about stripped out because a 1985 disk drive barely had enough intelligence to control an actuator. Forget about helping to manage the information. These engineers never imagined today's volume managers or RAID storage servers that have more compute power and lines of code than the average server of the day.

Sure, you can innovate within your layer. Most notably, the disk drive has become a RAID controller with a variety of data services and reliability and availability features but it is still limited by the fact that at the host interface, it has to make sure it looks and acts like a brainless old SCSI disk drive. More than anything, these interfaces into the dominant system architecture define the overall customer value that a particular layer can provide. As Christensen and Raynor describe, this is what ultimately forces the advantage back to a vertically integrated solution and this is caused by a new disruption in demand. For enterprise storage, that disruption is here in the form of the 'Perfect Storm' caused by the convergence of requirements for storage consolidation, continuous availability, compliance with information laws, and the need to manage vast amounts of data based on its information content.

As I talk to senior engineers and architects across the storage industry I see them struggling with these 1980s interfaces. A common answer is to attempt to bypass and disable every other layer of the stack - effectively building a proprietary stack with enhanced, but proprietary interfaces between the layers. I hear this a lot from Database engineers: "We bypass the filesystem and go straight to the SCSI passthrough interface with our own volume manager and turn off caching in the disk drive and we asked Seagate for a mode page where we can turn off seek optimization because we need to control that. Oh, and don't use a RAID controller because that just gets in the way...".

The right answer of course is to let the experts in each layer develop the best products but extend the interfaces to give each layer the right information to help manage the data. Doing all the data management in the database application is not the right answer. Neither is doing it all in the volume manager or the virtualization engine or the RAID controller or the disk drive. It's a distributed computing problem and the optimal solution will allow each layer to add more value in the stack. This is why I'm such a huge fan of Object Storage - both OSD and enhanced NFS. They are enhanced, and extensible interfaces to let each layer, and the total system solve the data management demand disruption in ways the 1980s dominant architecture cannot.

This sounds like goodness for everyone. Engineers unleash a new wave of innovation, storage vendors get to charge a premium for them, customers are happy to pay because they finally get a solution, and investors make a return for the first time in years. As Christensen describes though, this transition from a modular, horizontally integrated industry to offering a new, unique, top to bottom solution is hard. Read the book.

There are signs of hope though. Open source, particularly Linux, is providing a way to break these limiting interfaces. One, because the whole OS is accessible, any interface can be improved. Two, Linux is increasingly used as the embedded OS for storage devices so new storage interfaces can be implemented on both the server and storage side. Lustre is a good example of this. Three, the community development process provides a way for vendors in different layers of the stack to work together to develop new interfaces as is happening with NFS RDMA and pNFS. Also, HPTC, which really hasn't been a driver in new technology for a while, is innovating through the use of new object storage interfaces with products like Lustre and Panasas storage. Finally some vendors are trying, including EMC with Centera and Sun with Honeycomb. These products manage data based on content by extending all the way up the stack through a proprietary interface to a new application-level API. I hope we can agree on an enhanced API and see some real applications.

To summarize this long post, the storage industry is in the midst of a Demand Disruption for solutions to data management problems. The industry has solidified around a set of suppliers for modular components that plug into interfaces that were standardized in the 1980s based on the technology of that time. These modular components are hitting the limits of their ability to add value through these interfaces, causing their products to commoditize. The next wave of innovation, and ability to charge premiums, will come through changes to these interfaces either by systems companies that supply proprietary stacks (or portions thereof), or component suppliers who work together to enhance the interfaces.

In future posts, I will continue to look at new storage products, companies and technology through this lens defined by Christensen and Raynor.

Monday, August 21, 2006

So where are the storage HBA vendors?

If I were Qlgc or Emlx, I would be helping create the standards for NAS over RDMA and helping to implement it in at least the open source operating systems (Linux and Solaris). Both vendors currently advertise iSCSI HBAs on their website but, the problem is, datacenter managers don't want to build iSCSI SANs. They want to build IP SANs that can run all the interesting storage ULPs including NFS and CIFS. So, a key requirement for buying expensive storage NICs is they must be optimized (translated: RDMA-enabled) for all these ULPs. If these guys don't build this product, the NIC vendors will, and Q and Emlx risk falling into the same situation where McBrocade is relative to Cisco.

I/O device vendors could be among the biggest beneficiaries of the open-source movement if they only realized it. For most of their life, their value-add has been limited to the small set of features the OS geeks enable in their DDI (device driver interface). Now, for the first time these IHVs have the chance to participate in architecting and creating the operating system so it enables new features and functions in the hardware. NFS over RDMA is a perfect example of this but, from what I can tell, the Linux and Solaris communities are doing all their development using IB because those are the only RDMA-capable adapters they have.

Oh, and I think NAS via RDMA will be so compelling that MS will be forced to implement it in Windows so don't assume that helping create the technology in open source only gets you the Linux and Solaris business.

By the way, I didn't mention LSI because they have a great future with SAS. The more I look at SAS the more I like it for SMB and WG SANs. It has the right level of connectivity with the simplicity and low cost of SCSI. One key enabler is SAS is already being designed onto motherboards - something that has really hurt FC adoption for small SANs.

Friday, August 18, 2006

iSCSI is good - for now

iSCSI is a good transitional technology. The common perception is that it offers a low-cost way to build a SAN on commodity ethernet components and will be successful in the SMB market. This is true but I actually see the biggest benefits in large datacenters that have hundreds and even thousands of nodes on their storage networks.

The problem with Fibre Channel is that it was never finished. The industry built a physical and a data link layer and then never went any higher in the stack (think back to the seven-layer ISO protocol stack). So, with FC you can build small SANs but, once you go beyond a few dozen nodes, and want to do things like reprovisioning and application migration, it becomes a nightmare. Administrators have to work across three or more different namespaces that have to be administered at separate UIs for app servers, switches, and storage servers. Users and applications care about file system and data and server names. Then that has to be mapped to mount points (e.g. /dev/...) and LUNs. Finally, what the poor SAN administrator has to work with are physical node WWIDs that have to be recorded and copied across the management UIs for hosts, switches and servers.

One of the benefits of moving to an IP SAN is you can leverage all the automated and centralized services such as DHCP, LDAP, SLP, etc. These aren't just aids to help manage the complexity like SAN management SW (which doesn't scale beyond a hundred nodes or so); these truly automate the complexity through central services. With iSCSI you can plug in a new compute blade and it can query a DHCP server to not only get an IP address, but also get the location of its boot LUN. Then, it can query an LDAP server to get the list of LUNs to mount. No scanning of LUNs, or zoning to restrict the set of visible LUNs. The DHCP and LDAP servers automate the process, are managed centrally, and are configured based on human-readable names.

The increased scalability of IP SANs is especially beneficial for data centers that are moving to scalable rack servers. In addition, many of these blade/rack servers have enough IP ports built-in that they are ready to connect right out of the box. The Sun/AMD blades come with four ethernet ports. There is no need to buy and install a separate SAN adapter. For server vendors, the percent of users who attached to FC has never justified putting FC HBAs on the motherboard the way they do with SCSI(SAS) and ethernet.

A benefit of using iSCSI on this IP SAN is it requires the minimum server changes from what is used on today's SANs. SAN admins can still run their favorite local filesystem, volume manager, and multipath driver. Admins can get comfortable with the changes in the transport before making these changes in the server stack. BUT, eventually the benefits of new versions of NFS, and the ability to perform critical data management in the centralized storage servers, become too compelling. This is why iSCSI is an important technology, a big improvement over Fibre Channel, but still another step in technology evolution, and anyone moving to iSCSI should do so as part of a long-term plan towards NAS.

NFS and RPC via RDMA

In my post a couple days ago I left off wondering if NFS could do direct data placement (RDMA) the way FC and iSCSI do. The answer is that IETF specs are in place and projects are underway for both Linux and Open Solaris.

IETF has a standard written by Tom Talpey of Netapp and Brent Callaghan of Apple for how to modify RPC to support RDMA with Scatter/Gather lists and handles for connecting to those DMA contexts similar to the SCSI protocol. IETF also has a standard (same authors) for modifying NFS to use this new RPC over RDMA transport. See the specs here:
    - RPC with RDMA Specification;

    - NFS Direct Data Placement Spec;

Projects to implement these for both Linux and Open Solaris are underway. Solaris is being done through a joint project between Sun and Ohio State:
    - Solaris NFS RDMA Project;

The Linux project is on SourceForge:
    - Linux NFS/RDMA;

Path Failover
As part of Direct Data Placement, NFS adds a Session Context. This is used when clients and servers reconnect to RDMA data for an I/O operation. An interesting side effect of this is it enables path failover similar to Fibre Channel since the full context of an I/O needs to failover.

Together, this takes a major step towards enabling NFS as an efficient, reliable transport that can replace FC in the datacenter while adding the benefits of an object protocol to allow the centralized storage server to play a much bigger role in solving today's data management problems.

Tuesday, August 15, 2006


Parallel NFS (pNFS) extends NFS V4 to separate communication of control (file opens and lookups) and user data. With pNFS, the NFS server, which previously performed all control operations AND moved the data, now just returns 'pointers' telling a client where the data is actually located. This means the pNFS server can be implemented on a fairly inexpensive (although highly available) server such as a clustered pair of x64 rack servers.

Parallel NFS also breaks the restriction that a filesystem must reside on a single server - or as Hildebrand says it: the 'single server' design which binds one network endpoint to all files in a file system. The pNFS server can return more than just a simple file pointer to a client. It can return a descriptor (a LAYOUT) describing locations on multiple storage servers containing portions, or copies of the file. Clients in pNFS implement a layer below the Vnode Ops and above the I/O driver called a 'Layout Driver' that interprets the LAYOUT and routes I/Os to the right regions of the right storage devices to do file I/O. The original goal was for high performance parallel access (hence the name parallel NFS) however, as you might have noticed, this Layout Driver sounds a lot like a Volume Manager in a block stack. Just like a Volume Manager, the Layout Driver could implement Mirrors, RAID 5, Remote Replication, Snapshot, Continuous Data Protection (CDP), or anything else a VM does.
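To show why the Layout Driver looks so much like a Volume Manager, here is a small Python sketch of the simplest case: a striped layout. The StripedLayout structure and route_io function are my own invention for illustration and are not the pNFS layout format. The driver takes a file byte range and turns it into per-server I/Os.

```python
from dataclasses import dataclass


@dataclass
class StripedLayout:
    """Hypothetical layout: file data striped round-robin across servers."""
    servers: list       # storage server names, in stripe order
    stripe_unit: int    # bytes per stripe unit


def route_io(layout, offset, length):
    """Split a file byte range into (server, device_offset, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        unit = offset // layout.stripe_unit       # which stripe unit of the file
        within = offset % layout.stripe_unit      # offset inside that unit
        server = layout.servers[unit % len(layout.servers)]
        # Assumption: each server stores its stripe units back to back.
        dev_offset = (unit // len(layout.servers)) * layout.stripe_unit + within
        chunk = min(layout.stripe_unit - within, end - offset)
        pieces.append((server, dev_offset, chunk))
        offset += chunk
    return pieces


layout = StripedLayout(servers=["s0", "s1"], stripe_unit=4)
print(route_io(layout, offset=2, length=8))
# [('s0', 2, 2), ('s1', 0, 4), ('s0', 4, 2)]
```

A mirroring or snapshot layout would swap out the mapping logic in route_io but keep the same shape: interpret the layout, then send I/Os to the right regions of the right storage servers.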

The third big difference with pNFS is that as a result of separating control and data, these can now run over different transports. The relatively low-bandwidth, short bursts of control information between clients and pNFS servers can run over standard, low-cost ethernet NICs. The data traffic, more suited to RDMA-capable adapters, can run over Fibre Channel, SCSI, or ethernet via a ULP that an adapter can implement to perform RDMA such as iSCSI (see yesterday's post). This also means that the client can use the multipathing driver in the FC or iSCSI stack.

This separation of control and data also means the protocol between the client and the STORAGE server can be either Blocks, Object, or something more like traditional NFS (Files via RPC). If it's blocks, you still need much of the traditional filesystem running on the client and the Layout managed by the pNFS server has to include managing free/used blocks. Objects offer an improvement by moving basic block management to the storage server as well as allowing association of properties with the data objects so the storage server can do a better job storing the data. NFS-like files offer similar benefits with V4 Named Properties but I still have my question from yesterday's post about whether we can build an RDMA NIC to RDMA this data.

At this point, I want to recap the key features of pNFS:

    Separation of Control from Data Traffic and separation of the NFS server from the actual Storage Servers;

    Eliminates the restriction that a file system must reside on one server and allows files to be replicated or spread across multiple Storage Servers under the control of a layer in the client stack similar to today's block volume managers;

    Allows data transfer between clients and Storage Servers over a variety of transports and protocols including FC SANs with their high-performance RDMA adapters and highly available multi-path drivers. Also can include OSD with its ability to centralize block management and associate useful properties to data at an object granularity.

The Layout Manager is one of my favorite parts of pNFS. Hildebrand et al describe the Linux framework for pluggable Layout Drivers being developed jointly between IBM, Netapp, and U Michigan. Here's the Link. This is beautiful. All kinds of features needed to manage data can be implemented here including local and remote sync and async mirrors with seamless failover, COW snapshot, and CDP. This is a start-up opportunity - creating the next Veritas VM but for NAS based on pNFS.

The other part I really like is the freedom to choose your transport protocol. You can use OSD and associate properties so the Storage Server can store the data on the right tier, keep it secure, comply with data management laws and regulations - and do it at the right granularity based on the data. Then you can run it on an efficient transport such as iSCSI via an RDMA NIC, or on your existing FC SAN. Or, you can use NFS V4++ over RDMA. It's your choice.

Monday, August 14, 2006

Performance: Blocks vs. NAS

One of the top reasons SAN administrators cite for not using NAS is performance. Much has been written about how to make ethernet work as fast as SCSI/FC for storage. Here is my summary.

Why Block Adapters Go Fast
Block adapters (SCSI & FC HBAs) can go fast for two key reasons. One, the SCSI protocol separates commands from data. An application server and storage server first perform a small exchange allowing both ends to prepare for the actual I/O operation. They agree on which direction data will move (Read or Write), the length of the data, and the HBAs on each end identify the location of the data buffers, make sure they are locked-down in physical memory (can't be paged out), and build Scatter/Gather lists so the DMA engines can follow these discontiguous physical pages. Finally, they exchange a unique 'SCSI Tag'. This will be used during the data transfer part of the I/O to tie the data back to the correct SCSI command. Or, more specifically it will tell the DMA engines which Scatter/Gather list to follow during the data transfer.

The second reason block HBAs perform so well is they can do Remote DMA (RDMA) using the SCSI tag during the data transfer phase. These HBAs are designed specifically to run the SCSI/FC protocols so in real time they can extract the SCSI Tag and Data Pointer from a data packet, plug that into the DMA engine, which has a table of Scatter/Gather lists for all I/Os in process, and in real time DMA data to/from the correct buffer. This gives it zero-copy capability on both the transmit AND receive sides. (A lot of NICs which claim zero-copy can really only do it on the transmit side. Zero copy on the receive side is harder).
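Here is a toy model of that tag lookup in Python. Everything in it is hypothetical (a real HBA does this in hardware against pinned physical pages), but it shows the two steps: the setup phase registers a Scatter/Gather list under a tag, and each data packet then carries only the tag and an offset, which is enough to place the payload directly in the right buffer.

```python
class DmaEngine:
    """Toy model of tag-based direct data placement (names are made up)."""

    def __init__(self):
        self.sg_table = {}   # tag -> Scatter/Gather list (list of bytearrays)

    def setup(self, tag, sg_list):
        # Command phase: both ends agree on the tag before any data moves.
        self.sg_table[tag] = sg_list

    def receive(self, tag, offset, payload):
        # Data phase: walk the Scatter/Gather list to find where the
        # payload belongs and copy it straight into those buffers.
        for buf in self.sg_table[tag]:
            if offset >= len(buf):
                offset -= len(buf)      # payload starts in a later buffer
                continue
            n = min(len(buf) - offset, len(payload))
            buf[offset:offset + n] = payload[:n]
            payload = payload[n:]
            offset = 0                  # continue at the start of the next buffer
            if not payload:
                break
        if payload:
            raise ValueError("payload runs past the end of the buffers")


engine = DmaEngine()
page_a, page_b = bytearray(4), bytearray(4)   # two discontiguous "pages"
engine.setup(tag=7, sg_list=[page_a, page_b])
engine.receive(tag=7, offset=2, payload=b"WXYZ")
print(page_a, page_b)
```

The point is that no intermediate staging buffer is needed on receive: the tag alone identifies the destination.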

These two features, the pre-exchange to set up DMA and the ability to directly DMA data, mean that SCSI and FC can be viewed as a form of an RDMA protocol.

Matching This Performance on IP
A problem with traditional ethernet NICs is they're designed to handle a wide variety of upper-layer-protocols (ULPs) so they don't know how to interpret the data transfer context for any particular ULP. A second problem is that most ULPs don't setup the transfer before moving data anyway.

As Jeff Mogul describes in his paper: "TCP offload is a dumb idea whose time has come", the real benefit of TCP Offload Engines (TOEs) isn't offloading TCP, it's that it can offload the ULP and implement RDMA. He has a better term: RDMA-enabled NIC (RNIC). So, like a SCSI or FC HBA, if you limit these RNICs to specific ULPs that separate command setup from data transfer, and build the RNIC so it can extract key fields from the packets to understand data transfer and tie that context into its DMA engine, then you can match FC performance.

iSCSI is one way to do this. It has the same SCSI Tag (aka DMA context handle) so you can build an iSCSI RNIC that will perform as well as FC - provided you have equivalent physical speeds.

So the next question is, can NAS be accelerated in the same way? As I mentioned in my last post, V4 moves all of NFS into one port so the NAS RNIC can trigger off that port number for special processing. Then we need to know whether the NAS ULP does any setup to give the RNICs a chance to setup DMA context and if there is a way to detect that context during data transfer. My questions for investigation:

- NFS originally used RPC. Is this true for V4? How does RPC move large amounts of data? Does NFS move data as well as parameters through RPC?

- Does NFS V4 change anything to facilitate RDMA?

Friday, August 11, 2006

Notes on NFS V4

Notes from reading The NFS V4 Protocol, by Spencer Shepler, David Robinson, Robert Thurlow and others.

NFS V4 Goals:
- Improve hetero support (especially with Windows);
- Higher performance;
- Better security;
- Improved data sharing;

Key changes:
- Lease-based file locking (Stateful NFS servers);
- Standard data representation (XDR) (Endian-neutral);
- Elimination of separate daemons & utilities (mount, network lock manager);
- Compound operations - aggregates multiple RPC calls w/single server response;
- Aggregation of File Systems on each server into common namespace (server creates pseudo root to put them together);

Benefits of Stateful Open/Close:
- Matches Windows CIFS semantics;
- Allow exclusive file creates;
- Allows higher performance aggressive caching for clients with exclusive opens;

Benefits of elimination of separate protocols:
All NFS operations combined on one port. Makes it easier to enable NFS through a firewall. Also (need to confirm this) - may make it easier to optimize for use through a TOE.

State and File Locking
Uses client IDs to identify clients. ID is globally unique and changes through reboots so the storage server knows to release locks from prior session. Server uses State IDs for each file to keep track of client locks.
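A small Python sketch of how I read this: the server remembers the latest ID per client, and when the same client shows up with a new ID it assumes a reboot and drops the old session's locks. The class, the "hostA#1" style IDs, and keying by a client name are my simplifications and are not taken from the spec.

```python
class LockServer:
    """Toy stateful server: releases locks when a client returns with a new ID."""

    def __init__(self):
        self.current_id = {}   # client name -> most recent client ID
        self.locks = {}        # filename -> client ID holding the lock

    def register(self, name, client_id):
        old = self.current_id.get(name)
        if old is not None and old != client_id:
            # Same client, new ID: it rebooted, so release the prior session's locks.
            self.locks = {f: cid for f, cid in self.locks.items() if cid != old}
        self.current_id[name] = client_id

    def lock(self, filename, client_id):
        # Grant the lock if it is free or already held by this client ID.
        holder = self.locks.setdefault(filename, client_id)
        return holder == client_id


server = LockServer()
server.register("hostA", "hostA#1")
server.register("hostB", "hostB#1")
got_a = server.lock("report.txt", "hostA#1")         # hostA takes the lock
blocked_b = server.lock("report.txt", "hostB#1")     # hostB is refused
server.register("hostA", "hostA#2")                  # hostA reboots
got_b = server.lock("report.txt", "hostB#1")         # stale lock is gone
print(got_a, blocked_b, got_b)
```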

Concept that client (with non-exclusive open on a file) can cache changes. Delegated on a lease basis. Client should periodically query the server to see if another client has changed the file and cache the changes. Then, of course, the client flushes all changes.

Share Reservation
New in V4. NFS term for an exclusive lock. There must be a lease & lease renewal mechanism but I don't know what it is.

New 'Recommended Attributes'
Includes ACLs, Archive bit, Modification time, create time, access time, Owner, Group, and some other things. The archive and access info will be nice for enabling HSM and archiving.

Named Attributes
A way for a server and client to agree on additional attributes on a PER FILE BASIS (yes!). They are name/value pairs. Could be used to instruct servers to handle files in unique ways. BIG ENABLER FOR INTELLIGENT STORAGE!
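
As an illustration of how per-file name/value pairs could steer server behavior, here is a small sketch. The `NamedAttributeStore` class, the `replica_count` policy function and the attribute name "replicas" are all invented for this example; the protocol does not define any of them.

```python
from collections import defaultdict


class NamedAttributeStore:
    """Toy per-file attribute store: path -> {attribute name: value}."""

    def __init__(self):
        self._attrs = defaultdict(dict)

    def set_attr(self, path, name, value):
        self._attrs[path][name] = value

    def get_attr(self, path, name, default=None):
        # .get() avoids creating an empty entry for files with no attributes.
        return self._attrs.get(path, {}).get(name, default)


def replica_count(store, path):
    """Hypothetical server policy: honor a per-file 'replicas' hint, default 1."""
    return int(store.get_attr(path, "replicas", 1))


store = NamedAttributeStore()
store.set_attr("/export/ledger.db", "replicas", "3")
ledger_copies = replica_count(store, "/export/ledger.db")
scratch_copies = replica_count(store, "/export/scratch.tmp")
```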

More Secure RPC
V4 adds support for something called Generic Security Services (RPCSEC_GSS). This provides better AUTHENTICATION for RPC calls as well as the option to add ENCRYPTION and integrity checksums to RPC calls.

ACLs
V4 adds ACLs, which were not in V2 or V3. Uses the NT ACL model. Values for a User or Group can be ALLOW, DENY, AUDIT, ALARM. Means the server can keep an audit trail and can ALARM if certain users try to access data.
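
A simplified sketch of how an ordered list of ALLOW/DENY/AUDIT/ALARM entries might be evaluated. The `Ace` class, the string permission names and the rule that AUDIT and ALARM fire on every matching attempt are assumptions for illustration; the real protocol uses bitmasks and per-entry flags that are not modeled here.

```python
from dataclasses import dataclass
from enum import Enum


class AceType(Enum):
    ALLOW = "allow"
    DENY = "deny"
    AUDIT = "audit"
    ALARM = "alarm"


@dataclass(frozen=True)
class Ace:
    type: AceType
    who: str            # user or group name
    perms: frozenset    # e.g. frozenset({"read", "write"})


def check_access(acl, principals, perm):
    """Return (allowed, events) for a request by any of the given principals.

    The first matching ALLOW or DENY entry decides; AUDIT and ALARM entries
    never change the decision, they only record an event. No match means deny.
    """
    decision = None
    events = []
    for ace in acl:
        if ace.who not in principals or perm not in ace.perms:
            continue
        if ace.type in (AceType.AUDIT, AceType.ALARM):
            events.append((ace.type.value, ace.who, perm))
        elif decision is None:
            decision = ace.type is AceType.ALLOW
    return decision is True, events


acl = [
    Ace(AceType.ALARM, "mallory", frozenset({"read"})),
    Ace(AceType.DENY, "mallory", frozenset({"read", "write"})),
    Ace(AceType.ALLOW, "staff", frozenset({"read"})),
]
mallory_read = check_access(acl, {"mallory", "staff"}, "read")
alice_read = check_access(acl, {"alice", "staff"}, "read")
alice_write = check_access(acl, {"alice", "staff"}, "write")
```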

Migration/Replication Support (this is cool)
V4 adds a new attribute called fs_locations (it is one of the recommended attributes, not a named attribute). There is an error code telling the client to query this attribute. It will tell the client the NEW location of this data, if it's been migrated. It can also identify ALTERNATE locations so the server can mirror the data.
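
The client side of the migration flow might look roughly like the sketch below: try the read, and on the "moved" error (NFS4ERR_MOVED in the V4 spec) ask the old server for fs_locations and retry at the first location listed. `ToyServer`, `MovedError` and `read_following_migration` are made-up names, and the in-memory dicts stand in for real servers.

```python
class MovedError(Exception):
    """Stand-in for the server's 'file system has moved' error."""


class ToyServer:
    def __init__(self, files=None, fs_locations=None):
        self.files = dict(files or {})
        self.fs_locations = list(fs_locations or [])

    def read(self, path):
        if path in self.files:
            return self.files[path]
        if self.fs_locations:
            raise MovedError(path)  # data used to live here, client must look elsewhere
        raise FileNotFoundError(path)

    def get_fs_locations(self, path):
        return list(self.fs_locations)


def read_following_migration(servers, start, path, max_hops=3):
    """Read path, following fs_locations redirects for at most max_hops servers."""
    name = start
    for _ in range(max_hops):
        server = servers[name]
        try:
            return name, server.read(path)
        except MovedError:
            name = server.get_fs_locations(path)[0]  # take the first location offered
    raise RuntimeError("too many redirects for " + path)


servers = {
    "old": ToyServer(fs_locations=["new"]),
    "new": ToyServer(files={"/export/a.txt": b"hello"}),
}
found_on, data = read_following_migration(servers, "old", "/export/a.txt")
```

A client using the ALTERNATE locations for mirroring could walk the whole list instead of taking only the first entry.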

Thursday, August 10, 2006

Who is using Object Storage and why?

IBM Storage Tank
I think the commercial version of Storage Tank out today does NOT use OSD, but the developers have stated that: "Though not implemented (in the current release), Storage Tank was developed with the intent to use object-based storage devices, and the current research prototype supports object-based storage devices."[1] IBM has an open source implementation of an OSD driver for Linux so we know they are making progress. Their project is on SourceForge. Also, IBM demo'd OSD at Storage Network World in April 2005.

By the way, Intel also has a Linux OSD initiator project on SourceForge.

IBM's stated reasons for wanting OSD are: "Storage devices that have knowledge of the objects (essentially files) being stored can add significant capability to a storage environment. For instance, the device can store specific keys with an object so that host systems that want to read or write the associated file are required to present the proper key before being allowed to access to the object. Such an approach eliminates SAN security problems and makes it feasible to extend Storage Tank beyond the machine room to become a campus-wide file system. Additionally, object storage devices can conceivably optimize data placement (having intimate knowledge of drive geometry) or participate in intelligent caching or remote copy operations that are beyond the ability of current storage devices."

I'm going to go into more on the disk data placement benefit in a later entry. There's a lot more benefit than just that!

Lustre is a high-performance parallel file system for Linux designed for high-performance computing. Lustre is open sourced but primarily created and maintained by Cluster File Systems Inc. According to their website, ten of the top thirty supercomputers, including the number-one-ranked IBM BlueGene computer, run Lustre.

Lustre currently uses a proprietary object protocol on top of IP, although they have stated they WANT to move to a standard. Today, of course, there are no readily available T10 OSD storage devices.

Lustre uses object storage for two main reasons. One, Lustre enables creation of highly parallel and scalable computers using many commodity compute platforms. Centralizing block management in the storage device avoids a whole lot of traffic on the interconnect to synchronize free/used block management. Two, the object storage device can keep track of which compute nodes have which objects (files) open for read and write access, avoiding a whole lot more traffic on the interconnect to share files.

Panasas builds an object array and associated file system layer called PanFS. I think they also work with Lustre but I need to verify this. Panasas also uses a proprietary object interface and has also stated that it wants to move to an open standard. Panasas is also driving pNFS as an open standard for a transport-agnostic metadata server.

Panasas uses Object Storage for many reasons. Similar to Lustre, they focus on high performance supercomputing and want the benefit of centralized block management and file sharing. In addition, Panasas uses the grouping of data to intelligently lay out the data on the media and do smarter caching. This is valuable for HPTC workloads where they may create large data objects where high-speed streaming access is important. Finally, Panasas allows secure, authenticated access to objects.[2]

Sun has demonstrated QFS running on T10 OSD storage and has stated their desire to move to OSD in a presentation at the U. of Minnesota DTC here. Nice slides Harriet - some of these look familiar :-). This presentation cites several benefits. As a high-performance parallel file system, centralizing block management and sharing are two key benefits.

1. Jai Menon, David Pease, Robert Rees, Linda Duyanovich, and Bruce Hillsberg, "IBM Storage Tank - A heterogeneous scalable SAN file system", IBM Systems Journal 42, No. 2, 250-267. Also available online: Storage Tank Paper

2. Panasas, "Object Storage Architecture", White Paper

Wednesday, August 09, 2006

Types of Object Storage

The ANSI SCSI T10 OSD spec is not the only way to build Object Storage. My definition of Object Storage is any storage that:
    1) Groups the data into meaningful groupings (not based on the underlying physical storage) and;

    2) Associates properties with these objects.

Based on this definition, NFS servers were the first object storage. NFS servers store data in meaningful groupings (files) and associate useful properties such as revision history, read-only vs. writeable, and open state (open for read, write, etc.). Although fairly simple, this allows valuable functionality in the storage server. The storage server can use the associated properties to perform ILM functions. The revision history and file grouping enable intelligent backup or HSM. Traditional backup applications can run on the storage server to do incremental backups or full backups of all files (vs. every block in the volume). Archiving/HSM file systems such as SAM-FS for Solaris can archive infrequently-accessed data to tape or lower tiers of storage.

Of course, NFS has had limited adoption due to limitations that are being fixed in NFS V4 and beyond and through the use of TOE NICs. More on that later...

Re-post: More on Blocks

Originally posted Aug 31, 2005
A few weeks ago I was blogging about how block protocols like SCSI were designed around the on-disk sector format and limited intelligence of 1980s disk drives. Clearly, if we were starting from a clean sheet of paper today to design storage for modern datacenters, this is NOT the protocol we would create.

The real problem though isn't just that the physical sector size doesn't apply to today's disk arrays. The problem today has more to do with the separation of the storage and the application/compute server. Storage in today's datacenters sits in storage servers, typically in the form of disk arrays or tape libraries, which are available as services on a network, the SAN. These storage services are used by a number of clients - the application/compute servers. As with any server, you would like some guarantee of the level of service it provides. This includes things like availability of the data, response time, security, failure and disaster tolerance, and a variety of other service levels needed to ensure compliance with laws for data retention and to avoid over-provisioning.

The block protocol was not designed with the notion of service levels. When a data client writes a collection of data, there is no way to specify to the storage server what storage service level is required for that particular data. Furthermore, all data gets broken into 512-byte blocks so there isn't even a way to identify how to group blocks that require a common service level. The workaround today is to use a management interface to apply service levels at the LUN level which is at too high a level and leads to over-provisioning. This gets really complicated when you factor in Information Lifecycle Management (ILM) where data migrates and gets replicated to different classes of storage. This leads to highly complex management software and administrative processes that must tie together management APIs from a variety of storage servers, operating systems, and database and backup applications.

If we were starting from a clean sheet of paper today to design a storage interconnect we would do a couple of things. One, we would use the concept of a variable sized data Object that allows the data client to group related data at a much finer granularity than the LUN. This could be an individual file, or a database record, or any unit of data that requires a consistent storage service level. Two, each data object would include metadata - the information about the object that identifies what service levels, access rights, etc. are required for this piece of data. This metadata stays with the data object as it migrates through its lifecycle and gets accessed by multiple data clients.
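
To show what "metadata that travels with the object" could mean in practice, here is a small sketch in which a storage server picks a tier per object from the object's own metadata rather than from a LUN-wide setting. The `DataObject` class, the metadata keys (`max_latency_ms`, `min_replicas`) and the tier table are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """A variable-sized object that carries its service-level metadata with it."""
    object_id: str
    data: bytes
    metadata: dict = field(default_factory=dict)


def choose_tier(obj, tiers):
    """Return the first (cheapest) tier meeting the object's stated requirements.

    tiers is a list of (name, latency_ms, replicas) ordered cheapest first.
    Objects with no stated requirement land on the cheapest tier.
    """
    need_latency = obj.metadata.get("max_latency_ms", float("inf"))
    need_replicas = obj.metadata.get("min_replicas", 1)
    for name, latency_ms, replicas in tiers:
        if latency_ms <= need_latency and replicas >= need_replicas:
            return name
    raise LookupError("no tier satisfies " + obj.object_id)


tiers = [("bulk", 50, 1), ("fast-mirrored", 5, 2)]
ledger = DataObject("ledger-row-17", b"...", {"max_latency_ms": 10, "min_replicas": 2})
log = DataObject("debug-log", b"...")
ledger_tier = choose_tier(ledger, tiers)
log_tier = choose_tier(log, tiers)
```

Two objects that would share one LUN, and so one service level, under the block model end up on different tiers here.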

Of course there are some things about today's block protocols we would retain such as the separation of command and data. This allows block storage devices and HBAs to quickly communicate the necessary command information to set up DMA engines and memory buffers to subsequently move data very efficiently.

Key players in the storage industry have created just such a protocol in the ANSI standards group that governs the SCSI protocol. The new protocol is called Object-based Storage Device (OSD). OSD is based on variable-sized data objects which include metadata and can run on all the same physical interconnects as SCSI including parallel SCSI, Fibre Channel, and Ethernet. With the OSD protocol, we now have huge potential to enable data clients to specify service levels in the metadata of each data object and to design storage servers to support those service level agreements.

I could go on for many pages about potential service levels that can be specified for data objects. They cover performance, ensuring the right availability, security, including access rights and access logs, compliance with data retention laws, and any storage SLAs a storage administrator may have. I'll talk more about these in future blogs.

Re-post from Storage Networking blog in August '05

Why blocks?
We've been doing a lot of thinking lately about the blocks in block storage. At some level blocks make sense. It makes sense to break the disk media into fixed-size sectors. Disks have done this for years and up until the early 1990s, disk drives had very little intelligence and could only store and retrieve data that was pre-formatted into their native sector size. The industry standardized on 512-byte sectors and file systems and I/O stacks were all designed to operate on these fixed blocks.

Now fast-forward to today. Disk drives have powerful embedded processors in integrated circuits that have wasted silicon real-estate where more could be added. Servers use RAID arrays with very powerful embedded computers that internally operate on RAID volumes with data partitioned into stripes much larger than 512-byte blocks. These arrays use their embedded processors to emulate the 512-byte block interface of a late 1980s disk drive. Then, over on the server side, we still have file systems mapping files down to these small blocks as if they were talking to an old drive.
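
The emulation described above comes down to simple arithmetic: every 512-byte block address the host sends has to be translated into a position inside a much larger stripe unit. A minimal sketch, assuming plain striping with no parity, a 64 KiB stripe unit and four disks (all three are assumptions chosen for illustration, not figures from any particular array):

```python
SECTOR_BYTES = 512


def locate_block(lba, stripe_unit_bytes=64 * 1024, n_disks=4):
    """Map a 512-byte logical block address to (disk, stripe_row, offset_in_unit)."""
    offset = lba * SECTOR_BYTES
    unit_index, offset_in_unit = divmod(offset, stripe_unit_bytes)
    stripe_row, disk = divmod(unit_index, n_disks)
    return disk, stripe_row, offset_in_unit


first_block = locate_block(0)        # start of disk 0
next_unit = locate_block(128)        # 128 sectors = 64 KiB, so the next disk
wrapped = locate_block(515)          # wraps back to disk 0 on the second row
```

With these sizes, 128 consecutive host blocks share one stripe unit, so the array is doing the grouping that the block interface itself cannot express.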

This is what I'm wondering about. Is it time to stop designing storage subsystems that pretend to look like an antique disk drive and is it time to stop writing file systems and I/O stacks designed to spoon-feed data to these outdated disks?

Continuation of my storage weblog from blogs.sun.com/kgibson

This is a continuation of a weblog I started while at Sun where I was writing about making storage more intelligent in order to solve today's data management problems. The key enabler required is a new interface to the storage letting it expose those intelligent features to the application servers.

My last posting was a year ago, before I moved into a different job at Sun. I have now left Sun and want to continue evolving these ideas. I'm going to use this weblog as my personal notebook. Some posts will be semi-coherent essays and others will be just collections of notes as I read various papers. First, for continuity, I'm going to re-post a couple of relevant postings here.

Wednesday, August 02, 2006

Test from MarsEdit

Test from Marsedit running on OS X