Big Data – Data Preservation or Simply Corporate Hoarding?
Several years ago my mother passed away. As one of her children, I was faced with the challenge of helping clean out her home before it was put up for sale. As we struggled to empty each room, I was both amazed and appalled by what we found. There were artifacts from almost every year in school, bank statements from the 1950s, yellowing newspaper clippings, and greeting cards of all types and vintages. Occasionally we’d find a piece that was worth our attention, but the vast majority of the saved documents were just waste: pieces of useless information tucked away “just in case” they might someday be needed again.
Unfortunately, many corporations engage in the same sort of “hoarding”. Vast quantities of low-value data and obsolete information are retained on spinning disk or archived on tape media forever, “just in case” they may be needed. Multiple copies of databases, outdated binaries from application updates, copies of log files, and ancient directories and files that were never deleted all continue to consume capacity and resources.
Perhaps this strategy worked in years past, but it has long outlived its usefulness. At the average industry growth rate, the 2.5 petabytes of storage you struggle with today will explode to over 1.0 exabyte within 15 years! That’s a 400-fold increase in your need for storage capacity, backup and recovery, SAN fabric bandwidth, data center floor space, power and cooling, storage management, staffing, disaster recovery, and related support items. The list of resources impacted by storage growth is extensive. In a previous post I identified 46 separate areas that are directly affected by storage growth and must be scaled accordingly. A 400-fold expansion will require a simply stunning amount of hardware, software, facilities, support services, and other critical resources. Deduplication, compression, and other size reduction methods may provide temporary relief, but in most cases they simply defer the problem rather than eliminate it.
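To make the arithmetic behind that projection concrete, here is a minimal sketch of the compound-growth calculation. The 50% annual rate is an assumption (it matches the CAGR used as an example later in this series, not a measured industry figure), and the function names are purely illustrative.

```python
def projected_capacity_pb(start_pb: float, annual_growth: float, years: int) -> float:
    """Capacity after compounding `annual_growth` (0.5 means 50%) for `years` years."""
    return start_pb * (1 + annual_growth) ** years


def implied_annual_growth(start_pb: float, end_pb: float, years: int) -> float:
    """Constant annual growth rate that takes start_pb to end_pb in `years` years."""
    return (end_pb / start_pb) ** (1 / years) - 1


if __name__ == "__main__":
    # Assumed inputs: 2.5 PB today, 50% growth per year, 15-year horizon.
    future_pb = projected_capacity_pb(2.5, 0.50, 15)
    print(f"Projected capacity: {future_pb:,.0f} PB ({future_pb / 2.5:.0f}x today)")

    # Working backward: what rate turns 2.5 PB into 1,000 PB (1 EB) in 15 years?
    rate = implied_annual_growth(2.5, 1000.0, 15)
    print(f"Implied annual growth: {rate:.1%}")
```

Working backward from the 400-fold figure gives an implied rate of just under 50% per year, so the projection is consistent with that assumed growth rate.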
The solution is obvious – reduce the amount of data being saved. Determine what is truly relevant and save only information that has demonstrable residual value. This requires a system of data classification, and a method for managing, migrating, and ultimately expiring files.
Unfortunately, that is much easier said than done. Attempt to perform data categorization manually and you’ll quickly be overwhelmed by the tsunami of data flooding the IT department. Purchase one of the emerging commercial tools for data categorization, and you may be frustrated by how much content is misjudged and assigned to the wrong categories.
Regardless of the challenges, there are very few viable alternatives to data classification for maintaining massive amounts of information. Far greater emphasis should be placed on identifying and destroying low- or no-value files. (Is there really sound justification for saving last Thursday’s cafeteria menu or knowing who won Employee-of-the-Month last July?) Invest in an automated policy-based management product that allows data to be demoted down through the storage tiers and ultimately destroyed, based on pre-defined company criteria. Something has to “give” or the quantity of retained data will eventually outpace future IT budget allocations for storage.
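As a rough illustration of what “policy-based” can mean in practice, the sketch below decides whether a file should be kept, demoted to a cheaper tier, or expired, based on its category and how long it has sat idle. The category names, the day thresholds, and the `next_action` function are all hypothetical; a real product would define its own criteria and would also need to account for legal holds and regulatory retention rules.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RetentionPolicy:
    demote_after_days: int  # idle days before moving to a cheaper storage tier
    expire_after_days: int  # idle days before the file is destroyed


# Hypothetical company criteria, keyed by data category.
POLICIES = {
    "business_record": RetentionPolicy(demote_after_days=365, expire_after_days=7 * 365),
    "working_file": RetentionPolicy(demote_after_days=90, expire_after_days=2 * 365),
    "transient": RetentionPolicy(demote_after_days=7, expire_after_days=30),
}


def next_action(category: str, last_accessed: date, today: date) -> str:
    """Return 'keep', 'demote', or 'expire' for a file in the given category."""
    policy = POLICIES[category]
    idle_days = (today - last_accessed).days
    if idle_days >= policy.expire_after_days:
        return "expire"
    if idle_days >= policy.demote_after_days:
        return "demote"
    return "keep"


if __name__ == "__main__":
    today = date(2024, 3, 1)
    # Last Thursday's cafeteria menu would fall under "transient" in this scheme.
    print(next_action("transient", date(2024, 1, 1), today))
    print(next_action("business_record", date(2024, 1, 1), today))
```

The point of the design is that the decision comes from a small, reviewable table of criteria rather than from case-by-case judgment, which is what makes it possible to automate.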
In the end, the winning strategy will be to continually manage information retention, establishing an equilibrium and working toward a goal of near-zero storage growth. It’s time to make data classification by value and projected “shelf-life” a part of the organization’s culture.
“Big Data” Getting Bigger? Beware of the Ripple Effect…
Everyone seems to be concerned about the “tsunami of data” that is overwhelming the IT world. However, relatively few people appear to be worried about the “ripple effect” of this growth on other areas that are directly or indirectly impacted by this phenomenon.
Storage growth does not occur in a vacuum. Every gigabyte of data written to disk must also be backed up, managed, transferred, secured, analyzed, protected, and supported. It has a “ripple effect” that can spread throughout the organization, creating problems and resource shortages in many other areas.
A case in point is the backup & recovery process. Every gigabyte stored must be scheduled for backup, so if our data is growing at a 50% CAGR, then demand for backup & recovery services is growing at 50% a year as well. In addition, most companies keep more than one copy of data in the form of supplementary backups, clones, replications, and other forms of duplication. Therefore a single gigabyte of data can exist in multiple areas throughout the organization.
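The multiplying effect of those extra copies can be sketched with a few lines of arithmetic. The starting size of 100 TB and the three additional copies per unit of primary data are illustrative assumptions, not figures from any survey, and `total_footprint_tb` is a hypothetical helper.

```python
def total_footprint_tb(primary_tb: float, extra_copies: float,
                       annual_growth: float, years: int) -> float:
    """Primary data plus its backups, clones, and replicas after `years` of growth."""
    grown_primary = primary_tb * (1 + annual_growth) ** years
    return grown_primary * (1 + extra_copies)


if __name__ == "__main__":
    # Assumed: 100 TB of primary data, 3 extra copies of each TB, 50% CAGR.
    for year in range(4):
        footprint = total_footprint_tb(100.0, 3, 0.50, year)
        print(f"Year {year}: {footprint:,.1f} TB stored in total")
```

Under these assumptions, adding 50 TB of primary data in the first year adds 200 TB to the total that must be stored, managed, and protected.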
The picture above identifies at least 36 specific areas that are impacted by data growth. I’m sure there are others. Gone are the days when problems could quickly be resolved by “just buying more disk”.
It’s time to “think outside the box”. This is no longer a localized issue that can be solved by stove-piped departments and back-room technologists. It is an enterprise-wide challenge that needs the creative minds of many individuals from diverse areas of the organization. Consider bringing in independent Subject Matter Experts from the outside to analyze complex problems, stimulate creative thinking, and discuss how others have attacked similar challenges.
In today’s world of “big data”, there needs to be far greater emphasis on comprehensive planning, designing in architectural efficiency, minimizing the impact on IT infrastructure, and improving the manageability of our entire IT environment. Your future depends on it.