While data size is a consideration, it is not the only one, and I'd like to provide an operational consideration here if I may. Most of our customers get hooked on the flexibility well before PB+ scale out becomes a driver. The "cost" for experimentation and mixed data set analytics compared to conventional approaches is what really matters more than size when getting started. There is no schema to get "wrong", no ELT and MDM required before getting started (notice I didn't say there wouldn't be a need for those in the future, however) so the "cost" to get started is low and the price for failure is equally low. Anyway, something to consider.
Those are actual words from an IBM customer. Interesting, no? I think they are right, and I say that as part of our Big Data team. See, what they are really focused on at this point is "Big Flexibility." The size of the data collections will get big over time, but they didn't want the rest of their IT management thinking that they had to start with "big" data sizes to get value from BigInsights. Bingo!
Look at it this way: What we see as the #1 driver of use cases is combining data from disparate sources that arrives in disparate formats. The ability to make use of these collections -- precisely because they are different and have different slices of the overall picture -- is typically the first order of business. Yes, you could design an ETL process and schema to accomplish this, but probably not in a day or even a week! BigInsights provides you with flexibility that enables you to simply load the data -- raw, whatever form it's in -- and then begin analyzing it, which makes these projects viable. Little to no lead time -- just load it and get started. Explore. Experiment. Find the correlations, and then introduce rigor, as required. Once it becomes a stable pattern -- and business value has been confirmed -- you can start the investment in developing a trustworthy and repeatable process at scale. And yes, this is where the "
This financial company has been creating logs for a long time for a variety of reasons: (1) regulatory requirements such as the Basel capital requirements, (2) audit validation, (3) security and fraud detection, and of course (4) the operational logs of what is happening in the system for each transaction. These logs are used for a variety of purposes, but working with them is often a tedious manual process. Debugging a problem is one example. Finger-pointing is common among the systems’ support teams because none of them has a global view, which in turn causes misinformation to spread early in the support case; all of this costs a lot of money and time. For example, a lot of time and effort is wasted dealing with timezone differences, something that can easily be solved by automation.
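As an illustration of the timezone point, here is a minimal Python sketch that normalizes timestamps from different systems to UTC so they can be put on one timeline. It assumes each log timestamp is in ISO 8601 form and carries its own UTC offset; the system names and sample values are hypothetical, not taken from the customer's logs.

```python
from datetime import datetime, timezone


def to_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp that carries a UTC offset and convert it to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Without an offset we cannot know which timezone the system logged in.
        raise ValueError(f"timestamp has no UTC offset: {timestamp!r}")
    return parsed.astimezone(timezone.utc)


# Hypothetical records: (system name, local timestamp with offset).
records = [
    ("web-frontend", "2012-03-01T09:15:02-05:00"),
    ("payment-gateway", "2012-03-01T15:15:01+01:00"),
    ("core-banking", "2012-03-01T14:15:03+00:00"),
]

# Sorting on the UTC value gives the true order of events across systems.
timeline = sorted((to_utc(ts), system) for system, ts in records)
for when, system in timeline:
    print(when.isoformat(), system)
```

Read in local time, the three records look hours apart; once normalized, they turn out to be three consecutive seconds of the same flow.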
The challenge facing this company is to get a comprehensive end-to-end view of the transactions. This is in stark contrast with today's siloed, per-system view, where every system has its own private log. The end-to-end view will bring the important details of the transactions into clear focus, visible to all (not just the silo that happened to have collected that detail in its logs), and able to be related to other details. This applies equally well to operational people, e.g. trying to debug an issue, as to business people, e.g. trying to identify fraud.
For this to be possible, the log records from the different systems that did work on behalf of the business transaction must be correlatable. This is not a simple task because it was not built into the original system design. By combining (out of date) specs, knowledge of how the system works, and experience with the technology, one can start the correlating job by looking at SOAP envelopes, session IDs, transaction IDs, timestamps, etc. The whole process requires a good deal of trial and error combined with common sense.
For all this to be possible, the log files need to contain several important pieces of information: when the transaction entered and left a (sub)system; message IDs, transaction IDs, and correlation IDs; the operation being performed; the application instance that is processing it; the status; and the message itself. Note that this is distinct from the system log, which contains system messages, CPU usage, memory usage, JVM alerts, etc.
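To make the correlation idea concrete, here is a small Python sketch that groups log records from several systems by transaction ID and orders each group by entry time. The `LogRecord` fields mirror the pieces of information listed above, but the field names, the sample data, and the assumption that every record already carries a shared transaction ID are illustrative; in practice, finding that shared key is the trial-and-error part.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    system: str          # (sub)system that wrote the record
    transaction_id: str  # shared key across systems (assumed to be present)
    entered: str         # ISO 8601 UTC time the transaction entered the system
    left: str            # ISO 8601 UTC time the transaction left the system
    operation: str
    status: str


def correlate(records):
    """Group records by transaction ID and order each group by entry time."""
    by_transaction = defaultdict(list)
    for record in records:
        by_transaction[record.transaction_id].append(record)
    # ISO 8601 strings in the same timezone and format sort chronologically.
    return {
        txn: sorted(group, key=lambda r: r.entered)
        for txn, group in by_transaction.items()
    }


# Hypothetical sample records from two systems.
records = [
    LogRecord("core-banking", "txn-1", "2012-03-01T14:15:03Z", "2012-03-01T14:15:04Z", "debit", "OK"),
    LogRecord("web-frontend", "txn-1", "2012-03-01T14:15:01Z", "2012-03-01T14:15:05Z", "transfer", "OK"),
    LogRecord("web-frontend", "txn-2", "2012-03-01T14:16:00Z", "2012-03-01T14:16:01Z", "login", "FAILED"),
]

views = correlate(records)
path = [r.system for r in views["txn-1"]]
print(path)
```

Each value in `views` is the end-to-end view of one business transaction: every system that touched it, in the order it was touched.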
The existing log files are often not accessible because they are on local machines. In order to use these logs to mine information and resolve issues, they need to be copied to a central location; we copy them into a Hadoop filesystem. Before that can happen, though, these log files need to be scrubbed for security reasons: the data is anonymized or de-identified for privacy. At some point in the process, the log records also need to be filtered, transformed, and cleansed. This latter work can be done before insertion or as part of the subsequent Map/Reduce processing.
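A scrubbing step could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the `key=value` log layout, the 16-digit card number pattern, the `customer=` field name, and the secret key are all hypothetical. Customer IDs are replaced with a keyed hash rather than deleted, so that records for the same customer can still be correlated after scrubbing.

```python
import hashlib
import hmac
import re

# Hypothetical secret; in practice it would come from a managed key store.
SECRET_KEY = b"replace-with-a-managed-secret"

# Assumed formats: 16-digit card numbers and a "customer=<id>" field.
CARD_PATTERN = re.compile(r"\b\d{12}(\d{4})\b")
CUSTOMER_PATTERN = re.compile(r"customer=(\w+)")


def pseudonymize(value: str) -> str:
    """Replace a value with a stable keyed hash so it can still be joined on."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def scrub(line: str) -> str:
    """Mask card numbers (keeping the last four digits) and pseudonymize customer IDs."""
    line = CARD_PATTERN.sub(lambda m: "*" * 12 + m.group(1), line)
    line = CUSTOMER_PATTERN.sub(lambda m: "customer=" + pseudonymize(m.group(1)), line)
    return line


raw = "customer=alice42 card=4111111111111111 op=verify_card status=OK"
clean = scrub(raw)
print(clean)
```

Whether keeping the last four digits is acceptable depends on the privacy rules that apply; dropping the capture group masks the whole number instead.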
Once the logs are in Hadoop, a new world opens up and new applications are popping up like mushrooms in damp soil. We described one such application: correlating log files per business transaction for problem resolution. For that to work efficiently and fast, we need an index of all the log records so that related log records can be shown quickly in a UI when the user drills down into them. While that might look like a daunting problem in a petabyte-scale Hadoop system, the application requirements make this reasonable: correlation information is only needed for the last 2 months of data. Still, a sophisticated index is needed: it needs to be distributed for scale and performance, and it must be able to index new data and serve lookups at the same time.
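The retention requirement is what keeps the index manageable. The Python sketch below shows the idea on a single node: an index from transaction ID to record locations that drops anything older than the retention window. It is not the distributed index described above; the class name, the location strings, and the choice of 60 days as an approximation of "2 months" are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=60)  # assumption: "2 months" approximated as 60 days


class CorrelationIndex:
    """Single-node sketch: transaction ID -> locations of its log records."""

    def __init__(self):
        self._entries = defaultdict(list)  # txn id -> [(timestamp, location)]

    def add(self, transaction_id, timestamp, location):
        self._entries[transaction_id].append((timestamp, location))

    def lookup(self, transaction_id):
        """Return record locations for a transaction, oldest first."""
        return [loc for _, loc in sorted(self._entries.get(transaction_id, []))]

    def evict(self, now):
        """Drop index entries that fall outside the retention window."""
        cutoff = now - RETENTION
        for txn in list(self._entries):
            kept = [(ts, loc) for ts, loc in self._entries[txn] if ts >= cutoff]
            if kept:
                self._entries[txn] = kept
            else:
                del self._entries[txn]


index = CorrelationIndex()
# Hypothetical locations in the form "<file>:<byte offset>".
index.add("txn-old", datetime(2012, 3, 1, tzinfo=timezone.utc), "logs-a:0")
index.add("txn-new", datetime(2012, 5, 20, tzinfo=timezone.utc), "logs-b:1024")
index.evict(now=datetime(2012, 6, 1, tzinfo=timezone.utc))
print(index.lookup("txn-new"), index.lookup("txn-old"))
```

The raw logs stay in Hadoop for the deep analytics; only the drill-down index is trimmed.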
Numerous other applications are being built on this infrastructure. Capacity planning takes the historical information in the logs as input. SLA reports are calculated based on the logs. Cross-channel fraud is identified by mining the logs. Analysis of cross-sell is done by running jobs on the log data. The impact of a product launch or a sales campaign can be deduced from the logs. Traffic flows of transactions can identify hot-spots (slow or under-sized systems) or mistakes (a credit card is verified twice). The impact of a problem can be seen quickly (e.g. a high percentage of failures in a system) and remedies can be created (e.g. direct more traffic to an alternate server) before the root cause is even known.
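Two of these checks are simple enough to sketch in a few lines of Python: the failure percentage per system, and operations that were performed more than once within the same transaction (such as a card being verified twice). The record layout and sample data are hypothetical, and at real volumes this would run as a Map/Reduce job rather than in memory.

```python
from collections import Counter

# Hypothetical records: (transaction_id, system, operation, status).
records = [
    ("txn-1", "card-service", "verify_card", "OK"),
    ("txn-1", "card-service", "verify_card", "OK"),
    ("txn-1", "core-banking", "debit", "OK"),
    ("txn-2", "card-service", "verify_card", "FAILED"),
    ("txn-2", "core-banking", "debit", "FAILED"),
]


def failure_rates(records):
    """Return the fraction of failed records per system."""
    totals, failures = Counter(), Counter()
    for _, system, _, status in records:
        totals[system] += 1
        if status == "FAILED":
            failures[system] += 1
    return {system: failures[system] / totals[system] for system in totals}


def repeated_operations(records):
    """Return (transaction, system, operation) keys that occur more than once."""
    counts = Counter((txn, system, op) for txn, system, op, _ in records)
    return sorted(key for key, n in counts.items() if n > 1)


rates = failure_rates(records)
repeats = repeated_operations(records)
print(rates)
print(repeats)
```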
Let's look a little closer at the last use case, “identify the impact of a problem”. It is obvious that this use case could benefit from reporting sooner rather than later. If minimum latency is required, then a product like InfoSphere Streams can accomplish this. Also, the time-sensitive light analytics are compatible with the deep analytics done with Map/Reduce. The former are executed on the data as it is being inserted into Hadoop, while the data is in motion. The latter are performed after the same data has been inserted into Hadoop, when the data is at rest. We have encountered such scenarios in several POCs, and the customers are happy that we can make these scenarios work well together.
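The split between data in motion and data at rest can be sketched in plain Python. This is not InfoSphere Streams code and does not use any Streams API; it only illustrates the pattern under simple assumptions: each record is checked against a small per-system sliding window as it arrives, and the same record is then appended to a store that stands in for Hadoop. The function name, window size, and threshold are hypothetical.

```python
from collections import defaultdict, deque


def ingest(records, store, window_size=4, threshold=0.5):
    """Store every record and yield (system, failure_rate) alerts as data arrives."""
    windows = defaultdict(lambda: deque(maxlen=window_size))
    for record in records:
        # Light analytics on data in motion: recent failure rate per system.
        window = windows[record["system"]]
        window.append(record["status"] == "FAILED")
        # The record still lands in the store for deep analytics at rest.
        store.append(record)
        if len(window) == window_size:
            rate = sum(window) / window_size
            if rate >= threshold:
                yield record["system"], rate


# Hypothetical incoming records for one system.
incoming = [
    {"system": "gateway", "status": "OK"},
    {"system": "gateway", "status": "FAILED"},
    {"system": "gateway", "status": "FAILED"},
    {"system": "gateway", "status": "OK"},
    {"system": "gateway", "status": "FAILED"},
]

store = []  # stands in for the Hadoop filesystem
alerts = list(ingest(incoming, store))
print(alerts)
```

The alert fires while the data is still arriving, and nothing is lost for the later batch jobs because every record reaches the store either way.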
From the above examples, one quickly sees that different teams will benefit from this information: Product Managers from the cross-sell analysis, System Managers and application teams from the traffic flows, Production Support from the SLA reports, and IT Managers from the problem impact analysis.