Big Data – Data Preservation or Simply Corporate Hoarding?
Several years ago my mother passed away. As one of her children, I was faced with the challenge of helping clean out her home before it was put up for sale. As we struggled to empty out each room, I was both amazed and appalled by what we found. There were artifacts from almost every year in school, bank statements from the 1950s, yellowing newspaper clippings, and greeting cards of all types and vintages. Occasionally we’d find a piece that was worth our attention, but the vast majority of saved documents were just waste – pieces of useless information tucked away “just in case” they might someday be needed again.
Unfortunately, many corporations engage in the same sort of “hoarding”. Vast quantities of low-value data and obsolete information are retained on spinning disk or archived on tape media forever, “just in case” they may be needed. Multiple copies of databases, outdated binaries from application updates, copies of log files, ancient directories and files that were never deleted – all continue to consume capacity and resources.
Perhaps this strategy worked in years past, but it has long outlived its usefulness. At the average industry growth rate, the 2.5 petabytes of storage you struggle with today will explode to over 1.0 exabyte within 15 years! That’s a 400-fold increase in your need for storage capacity, backup and recovery, SAN fabric bandwidth, data center floor space, power and cooling, storage management, staffing, disaster recovery, and related support items. The list of resources impacted by storage growth is extensive. In a previous post I identified 46 separate areas that are directly affected by storage growth and must be scaled accordingly. A 400x expansion will require a simply stunning amount of hardware, software, facilities, support services, and other critical resources. Deduplication, compression, and other size-reduction methods may provide temporary relief, but in most cases they simply defer the problem rather than eliminate it.
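To put rough numbers on that projection, here is a minimal sketch of the compound-growth arithmetic. It assumes decimal units (1 exabyte = 1,000 petabytes) and smooth year-over-year compounding; the function names are my own, for illustration only. Under those assumptions, going from 2.5 PB to 1.0 EB in 15 years implies growth of roughly 49% per year.

```python
def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Compound annual growth rate that turns `start` into `end` over `years`."""
    return (end / start) ** (1 / years) - 1


def project_capacity(start: float, annual_rate: float, years: int) -> float:
    """Capacity after `years` of compounding at `annual_rate`."""
    return start * (1 + annual_rate) ** years


start_pb = 2.5     # today's capacity, in petabytes
end_pb = 1000.0    # 1.0 exabyte expressed in petabytes (decimal units assumed)
years = 15

growth_factor = end_pb / start_pb
rate = implied_annual_growth(start_pb, end_pb, years)

print(f"Growth factor: {growth_factor:.0f}x")
print(f"Implied annual growth: {rate:.1%}")
for year in (5, 10, 15):
    print(f"Year {year:>2}: {project_capacity(start_pb, rate, year):,.1f} PB")
```

Swapping in your own starting capacity and observed growth rate gives a quick sense of how soon the curve becomes unmanageable.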
The solution is obvious – reduce the amount of data being saved. Determine what is truly relevant and save only information that has demonstrable residual value. This requires a system of data classification, and a method for managing, migrating, and ultimately expiring files.
Unfortunately, that is much easier said than done. Attempt to perform data categorization manually and you’ll quickly be overwhelmed by the tsunami of data flooding the IT department. Purchase one of the emerging commercial tools for data categorization, and you may be frustrated by how much content is misjudged and assigned to the wrong categories.
Regardless of the challenges, there are very few viable alternatives to data classification for managing massive amounts of information. Far greater emphasis should be placed on identifying and destroying low- or no-value files. (Is there really sound justification for saving last Thursday’s cafeteria menu or knowing who won Employee-of-the-Month last July?) Invest in an automated policy-based management product that allows data to be demoted down through the storage tiers and ultimately destroyed, based on pre-defined company criteria. Something has to “give”, or the quantity of retained data will eventually outpace future IT budget allocations for storage.
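As a rough illustration of what policy-based demotion and expiry might look like, here is a small sketch. It is not modeled on any particular commercial product. The tier names, categories, and retention periods are all hypothetical placeholders, and real retention periods would have to come from your own legal and business requirements.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical storage tiers, fastest and most expensive first.
TIERS = ["tier1_flash", "tier2_disk", "tier3_archive"]


@dataclass(frozen=True)
class RetentionPolicy:
    """Illustrative rule set: idle days before demotion, total days before expiry."""
    demote_after_days: int
    expire_after_days: int


# Placeholder categories and periods, invented for this example only.
POLICIES = {
    "financial_record": RetentionPolicy(demote_after_days=90, expire_after_days=7 * 365),
    "project_document": RetentionPolicy(demote_after_days=180, expire_after_days=3 * 365),
    "transient_notice": RetentionPolicy(demote_after_days=7, expire_after_days=30),
}


@dataclass
class FileRecord:
    path: str
    category: str
    created: date
    last_accessed: date
    tier: str = TIERS[0]


def next_action(record: FileRecord, today: date) -> str:
    """Return 'expire', 'demote:<tier>', or 'keep' for one file."""
    policy = POLICIES[record.category]
    age_days = (today - record.created).days
    idle_days = (today - record.last_accessed).days

    # Expiry wins over demotion: data past its shelf life is destroyed.
    if age_days >= policy.expire_after_days:
        return "expire"

    # Otherwise, idle data moves one tier down, if a lower tier exists.
    tier_index = TIERS.index(record.tier)
    if idle_days >= policy.demote_after_days and tier_index < len(TIERS) - 1:
        return f"demote:{TIERS[tier_index + 1]}"

    return "keep"


today = date(2024, 6, 1)
menu = FileRecord("/share/cafeteria/menu.pdf", "transient_notice",
                  created=date(2024, 4, 1), last_accessed=date(2024, 4, 2))
plan = FileRecord("/share/projects/plan.docx", "project_document",
                  created=date(2023, 6, 1), last_accessed=date(2023, 9, 1))

print(next_action(menu, today))   # expire
print(next_action(plan, today))   # demote:tier2_disk
```

The hard part, as noted above, is assigning the category correctly in the first place. Once that is done, the demotion and expiry logic itself is simple enough to run unattended on a schedule.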
In the end, the winning strategy will be to continually manage information retention, establishing an equilibrium and working toward a goal of near-zero storage growth. It’s time to make data classification by value and projected “shelf-life” a part of the organization’s culture.