Data De-Duplication


Beth started this great thread and I wanted to comment on it on my own home turf, as it were.

Data DeDuplication, also known as compression, hasn’t changed since the early days of PKZIP 1.0.

Compression works by identifying like blocks of data and replacing them with a single block and pointers to every place the block was found.  One of the main reasons it works so well in plain-text applications is that there are only so many combinations of ASCII characters that can be found.
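The store-and-pointer idea can be sketched in a few lines of Python. This is a toy illustration of the principle, not how any real compressor works (PKZIP’s DEFLATE uses LZ77 matching plus Huffman coding); the block size and the use of a hash as the “pointer” are arbitrary choices here.

```python
import hashlib

def dedupe_blocks(data: bytes, block_size: int = 8):
    """Store each unique block once; repeats become pointers (hashes)."""
    store = {}   # block hash -> block bytes (each unique block kept once)
    order = []   # one hash per block position: the "pointer" stream
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        h = hashlib.sha256(block).hexdigest()
        store.setdefault(h, block)   # only the first copy is stored
        order.append(h)
    return store, order

def rebuild(store, order) -> bytes:
    """Follow the pointers to reconstruct the original byte stream."""
    return b"".join(store[h] for h in order)

data = b"ABCDEFGH" * 1000               # highly repetitive input
store, order = dedupe_blocks(data)
print(len(data), len(store))            # 8000 bytes in, 1 unique block kept
assert rebuild(store, order) == data    # lossless round trip
```

In a real compressor the pointer is of course far smaller than a SHA-256 hash; the point is only that repeated blocks get stored once.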

I find it interesting that this seemingly old technology has found new life in the form of the seemingly complicated “Data DeDuplication”.

So far – no one has sufficiently explained to me the benefits of using a Data Deduplication product over conventional in-band tape compression.  Obviously offloading compression to something with a real processor might gain you some performance, and maybe even allow compression to happen without causing a tape to ‘shoe-shine’ across the head as it keeps having to back up.  However, I have not yet seen a single example that justifies the cost and effort involved.

Anyone?  Bueller?



    • on May 14, 2008 at 11:06 pm

    If the “Data Deduplication” technique can find identical parts of data that have already been backed up to tape and eliminate the need to back them up twice (or 50 times), that’s where it can pay for itself. It just depends on the company and how “redundant” their storage is. For example, if 1000 users save the same 5MB file to different personal folders, you end up backing up 5GB instead of 5MB and a bunch of softlinks on the tape set…

    • on May 15, 2008 at 12:01 am

    I think you’re missing a key point… why tape at all?

    That, and de-duplication, if done remotely well, is more efficient than standard compression.

  1. Andrew – I think you’re confusing “Single-instance store” with Data DeDuplication. While essentially the same principle, data dedupe looks for data chunks that are identical and uses a store-and-pointer method to compress the data.

    Tim – Tape will always be around, mostly because of instances where you have a requirement, whether it be legal or procedural, to store data off-site. A lot of people will say it’s a good idea to just set up a redundant datacenter and mirror to it, but some companies either don’t have the money for that or are too cheap to spend it.

    Also remember the down-side to mirroring – corruption mirrors just as quickly as data. (Sometimes faster, it seems.) Now the obvious solution to this is point-in-time mirroring, using something like SRDF/AR or Replication Manager to generate the mirrors at specific intervals, but you’re adding so much expense to the mix that it’s not even funny.

    When I worked as SAN manager for Loan To Learn, I implemented a Disk–>Disk–>Tape backup that worked flawlessly. Two weeks worth of backup data was kept on disk, but it was copied off to tape and sent off-site every night. That way in the event of a restore we never had to recall tapes, but they were off in the vault if we needed them and to make the C-level officers feel better.

    My point being I don’t think tape will ever go away, if for no other reason than I don’t think people who don’t know any better will ever be truly comfortable with Disk-Only based backups.

    • on May 15, 2008 at 11:17 am

    I suppose you could call de-duplication a single instance store, but typically single instance stores work at a file level. Data de-duplication works on a variable-size block level. If I save two copies of a file and slightly modify one of them, I only need to save a single copy of the bits in the first file and the delta between the first and second. Also, if you’re comparing data dedupe to tape compression I think you’re missing the bigger picture. Say I have 1TB of data and I do a full backup. With tape and tape compression, I store say 500GB on tape on a good day. Via a dedupe appliance I store 500GB also. Wait a week and do another full backup. I store another 500GB on tape. Via my dedupe appliance I store the changes between that first full and the next, say 10%. So, with only two full backups I’ve got 1TB of data on tape and 550GB on my dedupe appliance. Consider how often you do full backups and your retention period and do the math, you’re storing a whole heck of a lot less data if you dedupe it.
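The arithmetic in this comment is worth writing out. Using the same numbers (1TB fulls compressed roughly 2:1, ~10% change between fulls), the gap widens with every retained full; the retention lengths below are hypothetical, just to show the trend:

```python
# Numbers taken from the comment above; retention lengths are assumed.
first_full_gb = 500                      # 1TB full, compressed ~2:1
weekly_delta_gb = 0.10 * first_full_gb   # ~10% changes between fulls

for fulls in (2, 4, 13):
    tape_gb = fulls * first_full_gb                            # every full stored whole
    dedupe_gb = first_full_gb + (fulls - 1) * weekly_delta_gb  # deltas only after the first
    print(f"{fulls:2d} fulls: tape {tape_gb:6.0f} GB, dedupe {dedupe_gb:6.0f} GB")
```

At two fulls this reproduces the 1TB-vs-550GB figure above; at a quarter’s worth of weekly fulls the tape copy is nearly six times larger.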

  2. Block-level incrementals, right?

    I’ve seen it done before – but as with any incremental backup, the longer you go between fulls the more tedious and time-consuming the restore/recovery procedure is, right?

    I mean in ‘backup-class’ we learned that in the perfect world you’d have the time and storage to do full backups every day. You start from the ideal and work your way backwards.

    What you’re describing as deduplication is EXACTLY how compression algorithms work. It’s almost akin to the difference between bitmap graphics and JPEG-compressed graphics. With a bitmap you’re storing information on every pixel. With JPEG compression you’re storing information on one pixel, then counting how many similar pixels are around. The difference in the resulting picture quality is solely measured by how the compression algorithm measures “similar.”

    I wonder if we’re truly getting into that “nothing new under the sun” state of technology. Everyone’s new breakthrough seems to be a re-branding / re-marketing of the old technology. I don’t think you can put a faster processor and more cache / RAM in a computer and call it a new computer. You can call it an upgrade certainly, but not new.

    • on May 15, 2008 at 11:39 am

    I would agree that the idea has been around for a long time, but it hasn’t been practical to implement on a reasonable scale until relatively recently. Sure, it’s exactly how compression algorithms work, but your typical compression algorithm takes a chunk of data and removes the redundancy in it. Depending on the algorithm that chunk could be measured in KB or MB, and the larger the chunk the better your overall compression ratio. In theory, data deduplication extends that chunk size to be the size of all the data the dedupe engine has ever seen before.
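The variable-size chunking mentioned in this thread is part of what makes the “chunk as big as everything ever seen” idea practical. Here is a toy sketch of content-defined chunking; real appliances use a rolling Rabin fingerprint, and the simple byte sum, window size, and boundary mask below are stand-in assumptions for illustration only:

```python
def chunk(data: bytes, mask: int = 0x3F, window: int = 16):
    """Cut a chunk wherever a rolling value over the last `window` bytes
    hits a fixed bit pattern. Because boundaries depend on the *content*,
    inserting a byte early in a file only reshapes the nearby chunks,
    instead of shifting every fixed-size block after it."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        if sum(data[i - window:i]) & mask == 0:  # pseudo-random cut point
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])  # whatever remains is the last chunk
    return chunks

data = bytes(range(256)) * 20
assert b"".join(chunk(data)) == data  # chunking is lossless
```

This content-defined cut is the reason a dedupe engine can match chunks across files and across backups: identical runs of bytes produce identical chunk boundaries no matter where they sit in the file.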

    People are getting significant storage savings from this. At my previous employer we were doing backup to disk and running out of room. Adding a dedupe appliance allowed us to store 20 to 30x as much data in the same space.

  3. Ok – that’s the first “new” aspect of DeDupe I’ve heard so far. You’re saying that the appliance stores historical data as well as what’s transmitted in-line? That adds a bit to the scheme.

    I also understand that the DeDupe appliance uses variable block-sizes, whereas I believe the older compression engines look at a fixed block-size, say 2114 bytes, and look for redundancies within.

    Now add that to a key-based “de-duplication” engine and you have space-savings and encryption in the same barrel, because once the data is de-duplicated it’s mostly encrypted as it is.

    Ok you have my attention.

    • on May 15, 2008 at 12:44 pm

    You got it, they store a hash table for everything they’ve ever seen. Encryption would be easy to add on at the back end so your data is encrypted at rest, maybe some of the dedup vendors are doing it already. A while back EMC bought a company called Avamar that does all this deduplication at the host side, and I know it does encrypted data in flight, not sure about at rest though. The Avamar appliance not only stores a big hash table for everything it has ever seen, it also shares the hashes it has seen with all the clients it backs up. This way the clients don’t ever send data over the wire that the back-end has already seen, no matter which client originally sent it. It’s a cool idea. In addition to the space savings, I imagine it works quite well for folks who want to do centralized backups for remote offices on WAN links.
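The client-side trick described here comes down to an exchange of hashes before any data moves. A minimal sketch of the shape of the idea (this is not Avamar’s actual protocol; the class and method names are made up for illustration):

```python
import hashlib

class DedupeServer:
    """Toy model of a hash-indexed backup target: it remembers every
    chunk hash it has ever seen, across all clients."""
    def __init__(self):
        self.store = {}  # chunk hash -> chunk bytes

    def missing(self, hashes):
        """Tell the client which hashes the server has never seen."""
        return [h for h in hashes if h not in self.store]

    def put(self, h, chunk):
        self.store[h] = chunk

def backup(server, chunks):
    """Hash locally, ask the server what it lacks, send only that."""
    hashed = [(hashlib.sha256(c).hexdigest(), c) for c in chunks]
    need = set(server.missing([h for h, _ in hashed]))
    sent = 0
    for h, c in hashed:
        if h in need:
            server.put(h, c)
            sent += len(c)
            need.discard(h)  # don't resend duplicates within this backup
    return sent

server = DedupeServer()
client_a = [b"OS image", b"app data A"]
client_b = [b"OS image", b"app data B"]   # shares the OS chunk with A
print(backup(server, client_a))           # 18 bytes sent: both chunks new
print(backup(server, client_b))           # 10 bytes: "OS image" already stored
```

The same mechanism is what makes it attractive over WAN links: a remote office never re-sends a chunk that any office has already backed up.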

    • on May 22, 2008 at 9:19 am

    I’m working on a POC for Avamar right now and it really does work as advertised. We have a 10TB (and growing) VMware environment. Right now we have to dump full backups to tape every night for various reasons. In my Avamar testing I’m seeing a 40-80% reduction for the first backup for any given guest. After the first pass I’m seeing reductions of 99.7 to 99.9%. This is across 18 VM guests of various applications ranging from 4GB to 750GB.

    This works well for us because we have a lot of spare CPU cycles on our VMware boxes at night when we run backups. Once we move to production we are planning on doing away with tape completely for VMware and simply replicating the data to a 2nd Avamar appliance at our remote site.

  4. It’s not whether it works or not, it’s whether the ROI is worth it. If all you’re looking to do is to save money on tapes, well you’re going to have to save a *LOT* of tapes to come out even.

    I’m hoping no one is trying to push this as something to be done to production data? I wouldn’t run any kind of compression against production data for the same reason that I wouldn’t rely on disk compression. It takes processing cycles/time, and puts you further at risk for corruption.

    De-duplication hardware is expensive, it’s another working part to break, and it’s something else an already beleaguered SAN admin has to master.

    • on May 28, 2008 at 10:23 am

    As a real-world use case – we have a requirement to back up our mission-critical databases (rather large) in a way that’s easily restorable; we have to keep 30 days’ worth of backups, and the backups must finish within a very short window so as to not drag down performance. This is where we use our de-dup appliance. Because of all the white-space in the database it compresses really nicely. We can typically project up to 40x with the de-dup technology. Restoring it is fast and painless without having to recall or load LTO4 tapes (which we also use – for long-term archival of the databases).

  5. We used to do the same thing at the pretend-bank – but we used a Clariion as a disk-storage-unit. Now I don’t know that I got 40% compression out of most of the crap we backed-up to that area, most of it was MP3 / WAV format audio from the call-center floor trying to keep the reps honest.

    I know, from past experience, that MP3 format audio just plain doesn’t compress well. I seriously doubt we would have seen “as advertised” compression ratios.

    Again – DeDuplication is simply new-fangled marketing-speak for compression. The basic mechanism for compressing data isn’t anything that PKZIP hasn’t been doing for 20 years.

    • on May 29, 2008 at 9:58 am

    Yep, but it has its benefits in the way it’s implemented today. Prior to these appliances we couldn’t back up 80TB of data online to a single 20TB tape cartridge or disk automagically and do it in a short backup window.

    Though I agree with the old-technology, new-market-name thought… the mainframes could run multiple OSes back in the 60s; now it’s VMware. The Sun E10K had dozens of CPUs way back when; now there are multi-core, multi-processor chips from Intel/AMD (implemented differently). Databases wrote logs for lazy writes; now it’s the journaling file system (and the concept behind many storage appliances to improve write performance)… and so on.

    However, all these technologies have a better implementation, which is what allows them to become so popular in the mainstream computer market.
