In a twist of logic, we are asked to apply the same expenditure thinking to deduplication that we apply to backup technology. We all know that backup technology is largely an unavoidable expenditure: you have to do it. The next rash move is the assumption that all backup technologies will duplicate files like rabbits in a 24x7 springtime frolic. Ergo, all backup technologies need deduplication.
Is that true? Do all backup solutions wildly and randomly copy all of your data all over the place?
No. And whether your backup solution has a duplication factor of 10 or 52, the cost of dedupe products roughly equals the cost of your backup solution. Coincidental, isn’t it?
Let’s return to the guru’s advice. "Storage vendors are typically better at finding a need for their technology in your environment rather than finding a technology that will actually meet your needs." I say we switch gears to meet your needs.
Consequently, any backup solution that duplicates, or to put it another way, any backup solution that recommends deduplication products, will then automatically cost twice as much as you expected. Unless you are absurdly wealthy, and nostalgic for all things archaic, that will not meet your needs.
In addition, deduplication products are one more technology to install, integrate, and manage in your environment. Not only do these deduplication technologies cost the same as your backup solution, they will also double your management, maintenance, support, and long-term planning expenditures.
Stop Duplicating
So, what should you do? Stop duplicating. There are solutions available that back up all the data you need without having to deduplicate. Find a solution where deduplication is not necessary because it does not duplicate data in the first place.
Why would anyone want to buy a backup solution that duplicates your data to the point that you will need to buy deduplication technologies to keep an exploding sea of bits from drowning your IT department? Deduping data created by your backup solution is akin to taking sleeping pills to cancel out an unwarranted overdose of caffeine. Sure, you can do that, but does that make sense, and at what cost?
What you need is a backup solution that offers all the advantages of non-duplicating backup without charging you extra for them.
What should you look for? The key is the database system design. Insist upon a backup solution built on a relational database that keeps tabs on the files it has already backed up and can then identify only the new data or versions that need to be copied. From this baseline, such a system will not need to duplicate backups.
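To make the idea of a catalog concrete, here is a minimal sketch in Python. It assumes a SQLite table named `catalog` and two hypothetical helpers, `open_catalog` and `select_for_backup`, none of which come from any particular backup product. File contents are passed in as bytes to keep the example self-contained.

```python
import hashlib
import sqlite3


def open_catalog(path=":memory:"):
    """Open (or create) the backup catalog. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS catalog ("
        " path TEXT PRIMARY KEY,"
        " digest TEXT NOT NULL)"
    )
    return conn


def select_for_backup(conn, files):
    """Return the paths that are new or changed since the last run.

    `files` maps a path to its current content (bytes). Every path that
    is returned is also recorded in the catalog, so an unchanged file is
    never selected twice.
    """
    to_copy = []
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        row = conn.execute(
            "SELECT digest FROM catalog WHERE path = ?", (path,)
        ).fetchone()
        if row is None or row[0] != digest:
            # New file, or a new version of a known file.
            to_copy.append(path)
            conn.execute(
                "INSERT OR REPLACE INTO catalog (path, digest) VALUES (?, ?)",
                (path, digest),
            )
    conn.commit()
    return to_copy
```

On a first run every file is selected. On later runs only files whose content hash differs from the catalog entry come back, which is the baseline behavior described above. A real product would likely compare size and modification time before hashing, and would track versions rather than overwrite the row.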
Also, for large files, like databases, look for an option that allows block-level comparisons in backups, a form of built-in deduplication, which will help to reduce the duplication of object data. Databases change every few seconds or minutes, and a file-based backup will duplicate the unchanged records within a database file simply because the whole file has to be backed up again whenever any part of it changes. Block-level deduplication reduces both backup time and storage size by backing up only the changed blocks.
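A block-level comparison can be sketched in a few lines of Python. The block size, the choice of SHA-256, and the function names `block_digests` and `changed_blocks` are illustrative assumptions, not the behavior of any specific product.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; real products choose their own


def block_digests(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of `data` and return the digests in order."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).digest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(previous_digests, data, block_size=BLOCK_SIZE):
    """Return {block_index: block_bytes} for blocks that are new or differ
    from the digests recorded at the previous backup."""
    changed = {}
    for index, offset in enumerate(range(0, len(data), block_size)):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        if index >= len(previous_digests) or previous_digests[index] != digest:
            changed[index] = block
    return changed
```

For a large database file in which only a handful of blocks changed since the last run, `changed_blocks` returns just those blocks, so the backup copies a small fraction of the file. This sketch does not handle a file that shrinks; a real implementation would also record the new length.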
In fact, data deduplication has a place in IT shops. Deduplication that addresses production issues for duplicated and replicated files might be of interest to you. Think of the single-instance storage technology now common in email archiving. Attachments common to many emails, for instance, do not need to be duplicated if you have a relational database that can manage pointers. Production is the place to fix duplication. Backup solutions built with 21st-century capabilities (relational databases, virtual storage, policy-based architecture, etc.) should not be duplicating files.
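The pointer idea behind storing one copy of a shared attachment can be illustrated with a small Python sketch. The class `SingleInstanceStore` and its methods are hypothetical, and plain dictionaries stand in for the relational tables an archiver would actually use.

```python
import hashlib


class SingleInstanceStore:
    """Keeps one copy of each distinct attachment; messages hold pointers."""

    def __init__(self):
        self.blobs = {}     # content digest -> attachment bytes (stored once)
        self.pointers = {}  # (message_id, filename) -> content digest

    def attach(self, message_id, filename, content):
        """Record an attachment; store its bytes only if they are new."""
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)
        self.pointers[(message_id, filename)] = digest
        return digest

    def fetch(self, message_id, filename):
        """Follow the pointer back to the single stored copy."""
        return self.blobs[self.pointers[(message_id, filename)]]
```

If the same report is attached to three messages, `pointers` gains three entries while `blobs` holds a single copy, which is the saving the email archivers rely on.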
Most file systems duplicate system files, and replicated operating systems, even in virtual environments, have duplication offenses. There are solutions out there that provide deduplication options for such production environment realities, similar to the single-instance storage indexing now built into email archivers. These solutions are nice, but the actual space savings are only interesting when you have hundreds or thousands of replicated machines.