June 3, 2021 | Written by: Hannu Löppönen
Data virtualization, which has emerged alongside traditional solutions, aims to achieve the same end results as data lakes and data warehouses. Which is better: self-service virtualization or a comprehensive data warehouse? Is virtualization also a useful method for updating applications via the service interface?
Data virtualization is self-service for data users
Data virtualization is a new, dynamic way to search and utilize data from different data sources. The virtualized view provides the developer with data without having to move or copy it first. A well-known use of virtualization is the development of analytics, but it is also increasingly being used to provide data for applications. Virtualization gives users self-service access so that they can rapidly utilize the information sources of the company and its stakeholders.
Virtualization makes data directly available to users
Does data virtualization compete with data warehousing?
Data warehouses meet users’ anticipated needs well once data sources and data types have been established. However, constant changes sharply increase the costs of data warehousing. Virtualization reduces the need to connect new sources to data warehouses. It also reduces the need to move and edit data between sources and data warehouses, work that is typically done using ETL (Extract, Transform and Load) tools. Allowing analysts to retrieve their data themselves through virtualization reduces dependence on the IT experts needed to operate the loading tools and also speeds up development. Additionally, virtualization reduces the costs associated with telecommunications and storage. Removing unnecessary data frees up space on the data warehouse server and reduces the size of backup copies.
Data virtualization users see the virtualization layer as a database that they can query with various reporting, analytical and development tools. It is also easy to add new data sources without in-depth database knowledge. A well-designed virtualization tool integrates different database technologies, including common relational databases, NoSQL, Hadoop and standard file formats (e.g. CSV and Excel), by combining them into a single SQL view. It also identifies similar database schemas and presents them as a single schema (schema folding). For example, a Sales table can exist in the databases of 10 different source systems, but it appears as a single table in the virtualized view.
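The schema folding idea can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a virtualization layer; it is not how any particular product implements the feature. The function name build_folded_view, the view name sales_all, the column names and the source system names are all hypothetical. Each source system gets its own attached in-memory database with an identically shaped sales table, and one view presents them as a single table.

```python
import sqlite3


def build_folded_view(source_rows):
    """Attach one in-memory database per source system and fold their
    identically shaped sales tables into a single view (sales_all)."""
    # Autocommit mode: SQLite refuses ATTACH inside an open transaction.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    selects = []
    for index, (source_name, rows) in enumerate(source_rows.items()):
        # Source names are embedded in SQL text, so only accept plain identifiers.
        if not source_name.isidentifier():
            raise ValueError(f"unsupported source name: {source_name!r}")
        alias = f"src{index}"
        conn.execute(f"ATTACH DATABASE ':memory:' AS {alias}")
        conn.execute(
            f"CREATE TABLE {alias}.sales (order_id INTEGER, amount REAL)"
        )
        conn.executemany(f"INSERT INTO {alias}.sales VALUES (?, ?)", rows)
        selects.append(
            f"SELECT '{source_name}' AS source_system, order_id, amount "
            f"FROM {alias}.sales"
        )
    # A TEMP view may reference tables in any attached database.
    conn.execute("CREATE TEMP VIEW sales_all AS " + " UNION ALL ".join(selects))
    return conn
```

A user of the folded view queries sales_all like any other table, for example with SELECT source_system, SUM(amount) FROM sales_all GROUP BY source_system, without knowing how many source databases sit behind it.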
Adopting virtualization should not start with replacing existing data warehouses that work well. A view that combines data from warehouses and lakes is an excellent example of virtualization in action.
Virtualized data as a service for applications
In addition to analytical use, virtualization is an asset in development when data needs to be moved from traditional database-driven systems into new applications for mobile users.
Service interfaces, such as the popular REST API, have been developed to provide a fast and flexible way to feed data to applications. Dedicated API gateway solutions have also been developed for API management. However, they do not have advanced tools for data manipulation. Data virtualization includes all the necessary tools, ranging from the technical conversion of data (SQL to REST API) to the aggregation and conversion of its contents and data security management.
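As a rough illustration of the SQL to REST conversion, the sketch below turns the result of an aggregating SQL query into the kind of JSON body a REST endpoint would return. It uses only Python's standard library. The names query_as_json, demo_connection and TOTALS_BY_REGION and the sales table are hypothetical, and a real virtualization product would add the HTTP layer and security management on top.

```python
import json
import sqlite3


def query_as_json(conn, sql, params=()):
    """Run a SQL query and return the rows as a JSON array of objects."""
    cursor = conn.execute(sql, params)
    columns = [description[0] for description in cursor.description]
    return json.dumps([dict(zip(columns, row)) for row in cursor.fetchall()])


def demo_connection():
    """Build a small in-memory table that stands in for a virtualized view."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 10.0), ("north", 20.0), ("south", 5.0)],
    )
    return conn


# Aggregation happens in SQL; the response body only carries the result.
TOTALS_BY_REGION = (
    "SELECT region, SUM(amount) AS total FROM sales "
    "WHERE region = ? GROUP BY region"
)
```

Calling query_as_json(demo_connection(), TOTALS_BY_REGION, ("north",)) yields a JSON array with one object per result row, which a mobile or browser application can consume directly.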
The benefits of virtualization
- Data virtualization delivers the benefits of self-service, such as speed of work and iterative learning to fuel innovation.
- There is no need to copy data from a source to data warehouses or applications, which reduces the costs of the technical platform and development.
- The security risk is reduced. Virtualization can also be used to analyze sensitive data, even when it is prohibited to transfer data from the source to corporate data warehouses or applications.
Data virtualization and security
Many corporate data sources contain sensitive data that cannot be exported as is to a data warehouse or an application. In these cases, virtualization is the solution to the problem. For example, if customers’ personal data is needed for socio-demographic analysis, it can be pre-computed in the source database and the result provided to virtualization users. Security classifications can also be used to specify that sensitive data can only be accessed by those who are authorized to do so. Other people either do not see the data or it appears as a randomized string.
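The masking behavior described above can be sketched as follows. This is an illustrative assumption, not the mechanism of any specific product: the role name pii_reader, the SENSITIVE_COLUMNS set and the column names are hypothetical. Authorized users see the row unchanged, while everyone else sees a randomized string in place of each sensitive value.

```python
import secrets
import string

# Hypothetical security classification: columns treated as sensitive.
SENSITIVE_COLUMNS = {"customer_name", "national_id"}
_ALPHABET = string.ascii_letters + string.digits


def mask_row(row, user_roles):
    """Return the row unchanged for authorized users; otherwise replace
    sensitive values with random strings of the same length."""
    if "pii_reader" in user_roles:
        return dict(row)
    masked = {}
    for column, value in row.items():
        if column in SENSITIVE_COLUMNS:
            length = len(str(value))
            masked[column] = "".join(
                secrets.choice(_ALPHABET) for _ in range(length)
            )
        else:
            masked[column] = value
    return masked
```

Non-sensitive columns such as an age group pass through untouched, so an analyst without the role can still run socio-demographic analysis on the masked rows.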
Data catalogs to support virtualization
Businesses have core knowledge about customers, products, markets, and many other aspects that are essential for business operations. Virtualization users need to know where to find the required data and whether it is reliable.
A data catalog is often compared to a library catalog, which allows users to find the work they want by using a wide range of search criteria.
An advanced virtualization tool can leverage business vocabulary through an integrated data catalog solution. It allows users to see in which database and which part of the database (table and column) they can find the information related to the customer. If the technical metadata regarding the location of a book were missing from the library system, users would have to ask the librarian for help. Similarly, categorization of the business vocabulary used to support virtualization helps users find the information they need without help from IT. Carefully designed management ensures user autonomy and satisfaction.
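A catalog lookup of this kind can be pictured with a small sketch. The entries, the CatalogEntry fields and the find_locations function are hypothetical examples, and a real data catalog holds far richer metadata. The point is only that a business term resolves to a database, table and column without help from IT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    term: str       # business vocabulary term
    database: str   # technical metadata: where the data lives
    table: str
    column: str


# Hypothetical catalog content for illustration only.
CATALOG = [
    CatalogEntry("Customer name", "crm_db", "customers", "full_name"),
    CatalogEntry("Customer segment", "dwh", "dim_customer", "segment_code"),
    CatalogEntry("Net sales", "dwh", "fact_sales", "net_amount"),
]


def find_locations(keyword, catalog=CATALOG):
    """Return every catalog entry whose business term contains the keyword."""
    needle = keyword.casefold()
    return [entry for entry in catalog if needle in entry.term.casefold()]
```

Searching for "customer" returns both customer-related entries along with the database, table and column where each can be found.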
Data catalogs are constantly being developed with AI-supported logic in order to make it easier to find data. For example, machine learning models can be used to automate the mapping of metadata. Similarly, independent correction mechanisms can be built to manage data quality. When bundled, narrow AI solutions start to gradually resemble human-like general AI. A conversation between AI and a user could sound like this:
- AI: What do you want to do?
- User: Put together a marketing campaign plan, a sales forecast, and a purchase proposal to our buyers regarding the spring sale of our company’s outdoor grills.
- AI: Are there any new information sources or do we rely on the ones we have used before?
- User: Run an analysis to see if there are new opportunities.
- AI: I found a consumer study in Chinese on the internet and it seems to be a public study. It requires logging in. Additionally, I found a related paid service offered by a Canadian marketing research company.
- User: Send me both links and I’ll tell you what to do with them.
- User: Let’s include both sources of information. Here is my username for both services.
- AI: Data has been virtualized and analyzed, and recommendations for action have been generated.
- AI: Here are three proposals, which are optimized based on the total margin, and they include consumer target groups, a campaign program, a sales forecast by brand, and a list of potential suppliers with target net prices for purchase negotiations.
- User: Send me proposals with standard descriptions.
- AI: Sent. Would you like to thank me for all this hard work?
- User: Oh yes, I forgot about your humanity algorithms. Thank you very much!
The dialogue above may sound like science fiction. However, there is already a solution on the market that bundles user-assisting tools with narrow AI, as shown below. In the next few years, increasingly comprehensive AI processes will be developed, but humans will still be needed for a long time to come up with new ideas and to weigh in on their company’s values, for example.
Features of the IBM Cloud Pak for Data solution for the development of analytics
Link to the technical implementation of data virtualization
Virtualization brings us one step closer to autonomous task analysis
Data virtualization is not a substitute for data warehouse development or even for data warehouses. However, it will bring new opportunities for users and substantially reduce IT tasks and costs. It also brings us one step closer to a world where AI autonomously performs analysis tasks from beginning to end, starting from data sources, and ending with a business recommendation.
In addition to analytics, data virtualization facilitates the development of applications. It can be likened to super glue, which attaches data from traditional database solutions to modern mobile and browser applications.
IBM Data Virtualization and Watson Knowledge Catalog solutions are part of the comprehensive IBM Cloud Pak for Data product, which includes not only data virtualization and data catalog, but also integrated tools to support analytics and AI development. Please contact me for any further questions: firstname.lastname@example.org
Related blogs written by Hannu Löppönen (in Finnish)
Q&A with Intel: What data virtualization means for the insight-driven enterprise
What is Data Virtualization (video)