January 29, 2020 | Written by: Susanne Beck Kimman
Categorized: Big Data | Inspiration
Today, data is at the core of running any business on this planet. We often do not think about the data itself, but rather about how we store, retrieve, cleanse, enrich, move and secure it. With a wealth of tools to help us do this, it is becoming increasingly important that we use the right sources of information to achieve the results the business needs.
How we categorize data depends somewhat on its structure, but in general we are dealing with three major categories.
Structured data is easy to organise and follows a rigid format. It is the kind of data we store in relational or hierarchical databases, where its elements are addressable for more effective processing and analysis.
Unstructured data is complex, often qualitative information that either has no pre-defined data model or is not organised in a pre-defined manner, so it does not fit neatly into a relational database. It is typically text-heavy but may contain data such as dates, numbers and facts as well.
Semi-structured data is a form of structured data that does not obey the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
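To make the three categories a little more tangible, here is a minimal sketch in Python. The customer record, field names and sample text are invented purely for illustration; the point is only how much structure each form gives you before you start processing.

```python
import csv
import io
import json
import re

# Structured: fixed columns, ready to load into a relational table.
# (The customer data here is made up for illustration.)
structured = io.StringIO(
    "customer_id,name,signup_date\n"
    "42,Ada Lovelace,2020-01-29\n"
)
rows = list(csv.DictReader(structured))

# Semi-structured: keys and nesting mark the fields, but records
# are free to vary in shape from one to the next.
semi_structured = json.loads(
    '{"customer_id": 42,'
    ' "contacts": [{"type": "email", "value": "ada@example.com"}]}'
)

# Unstructured: free text; any dates, numbers or facts it contains
# have to be extracted before they can be queried.
unstructured = "Ada called on 2020-01-29 to ask about her invoice."
dates_found = re.findall(r"\d{4}-\d{2}-\d{2}", unstructured)

print(rows[0]["name"])
print(semi_structured["contacts"][0]["type"])
print(dates_found)
```

The structured and semi-structured samples can be addressed by field name straight away, while the free text needs an extraction step (here a simple regular expression) before anything in it becomes addressable.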
Moving onwards and upwards, phenomena such as public cloud, private cloud, multicloud and hybrid cloud, along with all the “regular” transactional applications, continue to grow, and as a result we are looking at serious data growth. Data is growing faster than ever before, and by this year (2020) about 1.7 megabytes of new information is believed to be created every second for every human being on the planet. Our accumulated digital universe of data will rapidly grow into the zettabytes. Do we intend to use this data? I’m sure we do; it is referred to as the new oil, or the new black.
We have just established that there is a real chance of building a data swamp at warp speed. Do you feel alarmed, or get a strong desire to do some spring cleaning?
Dumping data into a data lake alone won’t accelerate your analytics efforts. Without appropriate governance or quality, data lakes can quickly turn into unmanageable data swamps. Data users know that the data they need lives in these swamps, but without a clear data governance strategy they will be able to neither find it easily nor trust it.
A governed data lake contains clean, relevant data from structured, semi-structured and unstructured sources, and data can easily be found, accessed, managed and protected.
Data coming into my data lake should preferably be properly cleansed, classified and protected, so that the information about assets and metadata is reliable. So you guessed right: I’m all for spring cleaning.
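As a rough illustration of what "cleansed, classified and protected" could look like at the point of ingestion, here is a small Python sketch. Everything in it is an assumption made for the example: the function names, the list of sensitive fields and the choice of hashing as the protection step are hypothetical, and a governance tool would do far more than this.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical rule for this sketch: fields treated as personal data.
SENSITIVE_FIELDS = {"email", "phone"}


@dataclass
class GovernedRecord:
    data: dict
    metadata: dict = field(default_factory=dict)


def cleanse(record: dict) -> dict:
    """Normalise keys, trim string values and drop empty ones."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value not in ("", None):
            cleaned[key.strip().lower()] = value
    return cleaned


def classify(record: dict) -> str:
    """Label the record by whether it carries any sensitive field."""
    return "confidential" if SENSITIVE_FIELDS.intersection(record) else "internal"


def protect(record: dict) -> dict:
    """Replace sensitive values with a SHA-256 digest.

    This is simple pseudonymisation for illustration only, not encryption.
    """
    return {
        key: hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        if key in SENSITIVE_FIELDS
        else value
        for key, value in record.items()
    }


def ingest(record: dict, source: str) -> GovernedRecord:
    """Cleanse, classify and protect one record, keeping metadata about it."""
    cleaned = cleanse(record)
    return GovernedRecord(
        data=protect(cleaned),
        metadata={
            "source": source,
            "classification": classify(cleaned),
            "fields": sorted(cleaned),
        },
    )


governed = ingest({" Email ": " ada@example.com ", "name": "Ada", "fax": ""}, "crm")
print(governed.metadata)
```

The design choice worth noting is that the metadata travels with the record from the moment it enters the lake, which is what later makes the data findable and trustworthy.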
The whole point of a data lake is to store, process, and analyze large volumes of data at a much lower cost. If you don’t get the data part of the data lake right, you won’t get the ROI part of the data lake right. If you don’t plan to get the ROI part of the data lake right, why bother investing in the data lake?
Quoting Gartner Group Research Note (August 2018):
“Metadata management, data quality, data lineage, and data integration, among other things, are crucial prerequisites for a successful data lake”.
I’m in total agreement. We focus a lot on getting regulatory reporting right, or on AI as the answer to any question. However, reports and very clever algorithms all use data, good or bad. It would be rather unfortunate to do regulatory reporting on wrong or misleading data, and a real bummer to have the fanciest AI algorithm and still fail because the data is not correct.
Imagine going into a public library and seeing a pile of books on the floor. I’m sure I’d eventually find what I’m looking for with some persistence, but it would take time, and time is one of the most precious things we have in life, in business as well as in our personal lives, so we want to spend it wisely.
Spring cleaning or clearing the shed is not an easy job, and not necessarily the one thing you just wake up wanting to do on a Sunday morning. But with planning, effort, tooling and a vision to remove clutter and get a line of sight, it is possible and well worth doing. Data is growing now and is unlikely to slow down, so the longer we wait, the harder the job.
If you share the desire to govern data, or at least to read about possible ways of doing so, may I suggest you have a look at IBM InfoSphere Information Governance Catalog (IGC), a web-based tool that allows you to explore, understand and analyse information. You can create, manage and share a common business language, document and enact policies and rules, and track data lineage. Combine it with Watson Knowledge Catalog (WKC) to leverage existing curated data sets and extend your on-premises investment to the cloud. A knowledge catalog allows you to put collected metadata into the hands of knowledge workers, so data science and analytics communities can get easy access to the best assets for their purpose while still adhering to enterprise governance requirements.
The beautiful creatures are found and it’s game on.
If you have any further questions, please do not hesitate to contact me at KIMMAN@dk.ibm.com