Dark Data is sustainably harmful

by Marcel Rauch

Dark data is data that a company stores but does not use, and it eats up money and energy. How can IT organisations and business units mitigate the problem?

 

Dark data - data which is not being used - is a rampant problem. The rule of thumb: the more data an organisation accumulates, the harder that data is to manage, because it drops out of sight and becomes difficult to access. According to analysts, storing dark data usually costs more than the data is worth. Nevertheless, data has always been hoarded: because it might be needed again, because legal requirements demand that it be stored, or because buying new storage is supposedly cheaper than documenting and categorising the data and assigning it a value and an expiry date.

According to various studies, 30 to 80 percent of all stored data is dark, and 50 percent appears to be a realistic average. Dark data is found in logs, text fields and documents, surveillance videos, audio files, IoT data and graphics. Photos and videos in a tweet are also dark, as is the associated metadata, such as broadcast times, users, #hashtags and information about devices and locations. The same applies to process metadata, such as log files or transactions in an ERP system. The problem becomes serious wherever data is stored indiscriminately.

End of the line: cloud storage

The cloud has encouraged this carefree approach to storage. Economies of scale and the efficient use of storage allow cloud providers to pass on lower prices to customers. But on-site storage is also becoming cheaper: according to calculations by Metrics, the annual cost of a terabyte of SAN storage has fallen by around 50 percent since 2016 to less than 1,000 euros today. At the same time, the amount of data is growing rapidly: IDC Research estimates that the global data volume was more than 50 zettabytes in 2020 and expects it to reach 175 zettabytes by 2025.
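A rough back-of-the-envelope calculation illustrates why cheaper storage does not automatically mean a smaller bill. The sketch below uses the figures quoted in this article (50 and 175 zettabytes, 1,000 euros per terabyte as an upper bound, a 50 percent dark share). Everything else is an assumption made purely for illustration: that a company's data grows in line with the global estimate, that unit costs halve once more over the same period, and that the company holds 500 terabytes.

```python
# Figures quoted in the article
VOLUME_2020_ZB = 50          # global data volume in 2020 (zettabytes)
VOLUME_2025_ZB = 175         # expected global data volume in 2025
COST_PER_TB_EUR = 1_000      # annual SAN cost per terabyte (quoted upper bound)
DARK_SHARE = 0.5             # "realistic average" share of dark data

# Illustrative assumptions, not taken from any study
ASSUMED_UNIT_COST_FACTOR = 0.5    # unit cost halves again by 2025
ASSUMED_COMPANY_VOLUME_TB = 500   # hypothetical company data volume


def total_cost_factor(volume_factor: float, unit_cost_factor: float) -> float:
    """Change in total storage cost when volume and unit cost both change."""
    return volume_factor * unit_cost_factor


def annual_dark_data_cost(volume_tb: float, cost_per_tb: float,
                          dark_share: float) -> float:
    """Annual storage cost attributable to the dark share of the data."""
    return volume_tb * cost_per_tb * dark_share


volume_factor = VOLUME_2025_ZB / VOLUME_2020_ZB
cost_factor = total_cost_factor(volume_factor, ASSUMED_UNIT_COST_FACTOR)
dark_cost = annual_dark_data_cost(
    ASSUMED_COMPANY_VOLUME_TB, COST_PER_TB_EUR, DARK_SHARE
)

print(f"Volume grows by a factor of {volume_factor}")
print(f"Total storage cost changes by a factor of {cost_factor}")
print(f"Dark data costs up to {dark_cost:,.0f} euros per year")
```

Under these assumptions the volume grows 3.5-fold, so the total storage cost still rises by a factor of 1.75 even though the unit price halves, and the hypothetical company spends up to 250,000 euros a year on storing data it does not use.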

Dark data - three challenges

As a result, the savings from falling unit costs can quickly turn into the opposite once the expenditure for electricity, backup and management is set against them, along with the poor sustainability of unrestrained storage. Companies that want to take countermeasures face three main challenges:

  • The legal dimension of personal data, whose storage is tied to a processing purpose. If the data is not documented, it is difficult to act in a legally compliant manner.
  • Accountability - who owns the data once it is no longer assigned to anyone? Ultimately, someone has to decide which data is deleted and when.
  • The knowledge problem: it is often not known whether data exists, where it is located and how it should be interpreted. This is especially true in the heterogeneous IoT environment.

In any case, the challenge cannot be solved quickly; companies need both technical and organisational measures. These include decision criteria that determine which data may be kept and when the "off-cuts" can be deleted. In many cases, it would make sense to appoint a chief data officer who has the necessary authority, takes responsibility and works outside of the organisational units.
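What such decision criteria might look like can be sketched in a few lines. The example below is a minimal illustration, not a recommended policy: the `DataAsset` fields, the `decide` function and the order of the rules are all hypothetical, and real criteria would have to be defined by the organisation and checked against its legal obligations.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DataAsset:
    """Hypothetical catalogue entry describing one data set."""
    name: str
    owner: Optional[str]          # accountable person or unit, if any
    purpose: Optional[str]        # documented processing purpose, if any
    retain_until: Optional[date]  # legal retention deadline, if any
    expires_on: Optional[date]    # business expiry date, if assigned


def decide(asset: DataAsset, today: date) -> str:
    """Return 'keep', 'review' or 'delete' for a data asset."""
    # A legal retention period overrides everything else.
    if asset.retain_until is not None and today <= asset.retain_until:
        return "keep"
    # Undocumented or ownerless data needs a human decision first.
    if asset.owner is None or asset.purpose is None:
        return "review"
    # Documented data without an expiry date has not been fully assessed.
    if asset.expires_on is None:
        return "review"
    # Documented data past its expiry date counts as an off-cut.
    if today > asset.expires_on:
        return "delete"
    return "keep"


today = date(2024, 1, 1)
invoices = DataAsset("invoices", "Finance", "bookkeeping", date(2030, 12, 31), None)
sensor_dump = DataAsset("sensor_dump", None, None, None, None)
campaign = DataAsset("campaign_2022", "Sales", "campaign analysis", None, date(2023, 6, 30))

for asset in (invoices, sensor_dump, campaign):
    print(asset.name, "->", decide(asset, today))
```

The sketch deliberately routes anything without an owner, a purpose or an expiry date to "review" rather than "delete": in this simplified model, deletion is only automatic for data that someone has already documented and dated.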

Data management and data steward

Whether professional data management with a so-called "data steward" is worthwhile depends on the organisation and how it uses its data. Besides storage savings and reduced compliance risks, it also improves data quality and reusability, for example through company-wide data repositories. However, the additional short-term effort is seldom weighed against the medium- and long-term benefits. Moreover, automation can reduce this effort only to a limited extent, because ultimately a human must describe what the respective data represents and how it should be handled. A tool which indexes all data and makes intelligent decisions has yet to be invented.

Marcel Rauch

Marcel Rauch, an industrial engineer, has been working as an IT management consultant for more than ten years. His main areas of expertise are IT benchmarking, IT service catalogues and other data-driven analysis methods for optimising service portfolios and delivery.