Rumble Rumble - Was that an earthquake? Virtualizing the Dedupe.

Hey, what's that noise I just heard this Monday? Ah, yeah, an announcement by Permabit of an OEM deduplication technology called Albireo. Just more dedupe choices, right? Maybe good business for the Permabit guys. Maybe an easy way to get dedupe for OEMs. Doesn't seem all that exciting, does it?

I think there's more to it. If you weren't paying attention on Monday and reading between the lines, you may not have noticed, but the tectonic plates of the storage industry just shifted, and we might all be standing in a slightly different place.

For the first time, we now have a hashing-based capacity optimization technology that can be used and reused outside of a single box or a single vendor's box - and initial OEM licensing seems to suggest we're going to see this happen. Moreover, there are more places this technology could be used than you can shake a stick at. Let's look at some of the possibilities.

Dedupe without overhead - perfect for block
First up is the block virtualization layer within storage arrays, and this is no less than a game changer. You see, Permabit has come up with a deduplication algorithm that can be implemented at the block storage level, behind a controller. Grossly simplified, the way this works is pretty straightforward - as individual IOs come in, the controller still handles the write to disk, but by checking with a Permabit indexing and hashing algorithm engine (a C based software module called Albireo), the storage array can identify duplicate blocks, and write pointers to duplicate data instead of writing a full block. The thing is, this writing of blocks and pointers is barely a modification of how block storage works in the first place, especially within storage systems that perform today's pretty routine snap/copy/clone and thin provisioning tasks. Already, arrays could effectively "filter" out the blank blocks to do thin provisioning. Already, arrays were doing copy on write or redirect on write snaps. So what you get with deduplication looks poised to be little different in implementation than working with a snapshot. Such a system will write out data that is fully usable without the Permabit Albireo deduplication engine (the controller still understands the block pointers without Albireo), but is tons more space efficient. Dedupe for primary storage stands in the wings. Mileage may vary by any given vendor's integration, of course, so time will tell whether this changes the game for primary storage optimization.
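
To make the pointer idea concrete, here is a minimal Python sketch of the write path described above. Everything in it is hypothetical: the DedupeIndex and Controller classes, the use of SHA-256, and the in-memory lists standing in for disk are illustration only, not Permabit's API or any vendor's integration. It also skips overwrite cleanup and reference counting, which a real array would need.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, matching the 4K blocks discussed here


class DedupeIndex:
    """Hypothetical stand-in for an indexing and hashing engine."""

    def __init__(self):
        self._by_hash = {}  # content hash -> physical block number

    def lookup(self, data):
        return self._by_hash.get(hashlib.sha256(data).digest())

    def record(self, data, physical):
        self._by_hash[hashlib.sha256(data).digest()] = physical


class Controller:
    """Toy array controller: it owns the data and the block pointers."""

    def __init__(self, index):
        self.index = index
        self.physical = []   # stand-in for blocks on disk
        self.pointers = {}   # logical block address -> physical block number

    def write(self, lba, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("expected one full block")
        existing = self.index.lookup(data)
        # Verify the bytes before trusting the index hit.
        if existing is not None and self.physical[existing] == data:
            self.pointers[lba] = existing  # duplicate: write a pointer only
            return
        self.physical.append(data)         # new data: write the full block
        new_block = len(self.physical) - 1
        self.index.record(data, new_block)
        self.pointers[lba] = new_block

    def read(self, lba):
        # Reads follow pointers only; the index is never consulted.
        return self.physical[self.pointers[lba]]
```

Write the same block to two logical addresses and only one physical block gets stored. The read path never touches the index, which mirrors the point that the data stays fully usable without the dedupe engine.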

One dedupe to dedupe them all
But Albireo doesn't stop there. The Permabit guys are of course attentive to block storage optimization, so with block in mind it is no surprise that Albireo can work with fixed chunk deduplication of 4K blocks. That matches the on-disk sector size of next generation disk (a la Western Digital's Advanced Format) and is great for disk controllers that want to lay out IO in 4K or bigger chunks. But Albireo can also work with variable sized chunks of data (for the non-dedupe familiar folks, variable means the "segment" of data that the system tries to match to duplicate data can change in size - if you find 4K of duplicate data, then try a little bigger). Variable sized chunks are great for file systems, and a single piece of Albireo software integrated into a storage device can be called by block storage, a file system, object storage, or nearly anything else imaginable. Albireo is purely an indexing and hashing algorithm - that means it figures out the data patterns, and a single Albireo software module will hash out and create a duplication index for any device that can communicate with it. By cross referencing those duplicated data locations, any "storage system" - whether a block volume, an NTFS file system, an object-based system, or CAS - can reference duplicate data any way it pleases. For most vendors, that simply means figuring out from the index a new location for a pointer in a file system (ext3, NTFS, UFS, VMFS, etc.) or in a block system (which is in reality supported by a file system-like block virtualization layer such as EMC's UbFS). Dedupe for anything, in one system.
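
For the curious, here is a rough Python sketch of why variable sized chunks suit file data. It uses content-defined chunking, one common way to get variable chunks, and I am not claiming this is how Albireo does it. The function names, window size, mask, and size limits are all made-up illustration values, and hashing every window with SHA-256 is a simple stand-in for the rolling hash a real system would use.

```python
import hashlib


def fixed_chunks(data, size=4096):
    """Cut at fixed offsets, the way a block device would."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, mask=0x3F, min_size=32, max_size=1024):
    """Cut wherever the hash of the last `window` bytes matches a pattern.

    Boundaries depend on content rather than on offsets, so an insertion
    early in the data tends to disturb only the chunks near it.
    """
    chunks = []
    start = 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        tail = data[i - window + 1:i + 1]
        h = int.from_bytes(hashlib.sha256(tail).digest()[:4], "big")
        if (h & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last cut
    return chunks
```

Insert a single byte at the front of a file and every fixed chunk shifts, so none of them hash the same as before. The content-defined cut points tend to line back up shortly after the insertion, so most variable chunks still match the index.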

Dedupe dedupe everywhere, no wasteful bits anywhere...
The thing is, this makes capacity optimization an intrinsic part of what you're doing, no longer handled by a proprietary file system, or integrated in terribly complex ways. Want to talk about unified storage? This is Unified Storage 2.0. You may have an infrastructure with ubiquitous and common deduplication across NAS and SAN, but you may also have an infrastructure that understands deduplicated data and keeps it optimized as volumes are replicated or moved. It doesn't take much to imagine a couple of block systems that can replicate cross-referencing block pointers between sites - optimized data replication. Meanwhile, the Permabit guys are masters of small, lightweight indexes. For file and object data, I can imagine an Albireo integration where an index is transmitted along with file data to another Albireo system that can correlate the received index with its own index, and re-optimize the data for local storage, potentially with minimal or zero reinflation. Systems may even have multiple Albireo-driven indexes - one for subsets of data replicated to other devices, so additional transmitted data is effectively source-side deduped. The possibilities look to cover all manner of storage technologies. Have a CDP system? Albireo = integrated dedupe. Want to store deduplicated data in the cloud? Integrate Albireo into a NAS gateway, or perhaps you could build a gateway that does Albireo API calls at the object and sub-object levels. This is distributed dedupe that can be useful way outside of a single appliance. Permabit has effectively virtualized dedupe, abstracting it from hardware and file system dependencies, and enabling dedupe anywhere.
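
Here is one way that index-exchange replication could look, as a small Python sketch. This is pure speculation on my part: the Site class, the replicate function, and the "send the digests first, then only the missing chunks" exchange are hypothetical, not anything Permabit has described.

```python
import hashlib


def digest(chunk):
    return hashlib.sha256(chunk).hexdigest()


class Site:
    """Hypothetical replication target with its own chunk index."""

    def __init__(self):
        self.store = {}  # digest -> chunk bytes

    def missing(self, digests):
        # Unique digests this site does not hold yet, in first-seen order.
        return [d for d in dict.fromkeys(digests) if d not in self.store]

    def receive(self, chunks_by_digest):
        self.store.update(chunks_by_digest)

    def rebuild(self, recipe):
        # Reassemble the data from the ordered list of digests.
        return b"".join(self.store[d] for d in recipe)


def replicate(chunks, target):
    """Send the index first, then only the chunks the target lacks."""
    recipe = [digest(c) for c in chunks]
    by_digest = dict(zip(recipe, chunks))
    payload = {d: by_digest[d] for d in target.missing(recipe)}
    target.receive(payload)
    sent_bytes = sum(len(c) for c in payload.values())
    return recipe, sent_bytes
```

Only chunks the far side has never seen cross the wire, and the receiver can rebuild the data from the recipe without ever reinflating it in transit.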

You thought deduplication went mainstream when Data Domain was purchased by EMC? Uh-uh. This is mainstream dedupe. I'll admit I've long watched these Permabit guys, wondering if anything big was going to come of it. Sure, they've had good archival storage, built on high capacity SATA, with great capacity optimization and almost enough performance on top of SATA to be primary storage-like. But the archival storage market can at times be muddy water to swim in, and breaking into the primary storage market can be no less than a cage match with a handful of giants. But if the Permabit guys execute right with Albireo, this could make your whole world end-to-end capacity optimized.

Comments? Twitter: @JeffBoles or Email


posted by: Jeff Boles

Jeff Boles, InfoStor Guest Blogger

Jeff has a broad background of hands-on operational IT management and infrastructure engineering experience, with more than 20 years of experience in the trenches of practicing IT.
Prior to joining the Taneja Group, Jeff was director of an infrastructure and application consulting practice at CIBER and, more recently, an IT manager with a special focus on storage management at the City of Mesa, Ariz.