Big data patterns for adapting real time analytics
Chris Nott
Analysing data in motion offers businesses new opportunities for innovation, by processing information much more quickly as it arrives, and for cost saving, by avoiding the need to store data before it can be put to use. The approach is to implement analytical models and allow data to run past them. These models filter, annotate, transform and enrich the data in real time. These capabilities can be applied, for example, to maintain a picture of what is happening, generate alerts or identify anomalies from data in motion. However, the statistical models built for such analytics applications are static.
Whilst technology designed to process data in motion is not designed to retain the vast majority of data for any length of time, technology designed to store data at rest accumulates history. Data at rest can be analysed to gain insight into changes in the business. This insight can be used to inform the models used to analyse data in motion so that those models can be refreshed, thereby adapting real time analytics. Thus businesses are able to sustain the relevance of the results produced by the analytics.
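To make the refresh loop concrete, here is a minimal sketch in plain Python rather than in Streams or BigInsights. Everything in it is an assumption for illustration: the names ThresholdModel, fit and score_stream are hypothetical, the "model" is a simple mean and standard deviation threshold, and an in-memory list stands in for the history held at rest.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class ThresholdModel:
    """Static model (hypothetical): flags values far from the data it was built on."""
    centre: float
    spread: float
    k: float = 3.0

    def is_anomaly(self, value: float) -> bool:
        return abs(value - self.centre) > self.k * self.spread


def fit(history: List[float], k: float = 3.0) -> ThresholdModel:
    """Build a model from data at rest. A zero spread falls back to 1.0."""
    return ThresholdModel(mean(history), pstdev(history) or 1.0, k)


def score_stream(
    events: Iterable[float],
    model: ThresholdModel,
    history: List[float],
    refresh_every: int,
    window: int,
) -> Iterator[Tuple[float, bool]]:
    """Score each event with the current model, store it, and refresh periodically."""
    for count, value in enumerate(events, start=1):
        flagged = model.is_anomaly(value)   # analytics on data in motion
        history.append(value)               # data also accumulates at rest
        yield value, flagged
        if count % refresh_every == 0:
            # Rebuild the model from the most recent history so that it
            # follows changes in the business.
            model = fit(history[-window:])


if __name__ == "__main__":
    seed = [9.0, 10.0, 11.0, 10.0, 9.0, 11.0]
    stored = list(seed)
    shifted = [50.0, 51.0, 49.0, 50.0, 50.0, 51.0, 49.0, 50.0]
    for value, flagged in score_stream(shifted, fit(seed), stored, refresh_every=4, window=4):
        print(value, "anomaly" if flagged else "normal")
```

In this sketch the values around 50 are flagged until the first refresh, after which the rebuilt model treats them as normal. The model is only replaced every refresh_every events, so it is stale in between, which is the same kind of refresh latency a real deployment has to allow for.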
In this post I shall provide a high level view of one example of how adaptive real time analytics has been deployed to realise value. I shall assume that IBM's InfoSphere Streams is used to implement analytics on data in motion and that InfoSphere BigInsights, IBM's commercial offering with Hadoop, is used for data at rest.
The premise is that what a business wants to monitor changes over time. The challenge is that a business must adapt the associated analytics to maintain awareness. The outline steps of the use case are as follows:
In the following architecture diagram, a data scientist first builds the analytics model using data at rest and deploys that model in Streams to build the picture of the business environment. The analyst is the consumer of that picture.
The implementation of this use case assumes that data reaches BigInsights via Streams. This means that the data is aligned across both technologies.
Control over the introduction of a new area of interest was implemented in Streams because it was felt that this approach provided greater flexibility and more opportunity for reuse.
The design and implementation may need to handle two limitations. Firstly, the refresh of the model is not immediate and this latency may have adverse effects. Secondly, when a new area of interest is promoted, there may be an overlapping period in which data replayed through Streams from BigInsights is duplicated or, worse, a gap in the data used to build the picture of what is going on.
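One possible way to handle the overlap and gap problem is sketched below in plain Python. It is an illustration under stated assumptions, not the approach used in the deployment: it assumes every event carries a monotonically increasing sequence number and that each source delivers events in order, neither of which is stated above. The handoff function and the Event type are hypothetical names.

```python
from typing import Iterable, Iterator, List, Tuple

# (sequence number, payload); the sequence number is an assumption of this sketch
Event = Tuple[int, str]


def handoff(
    replayed: Iterable[Event],
    live: Iterable[Event],
    gaps: List[Tuple[int, int]],
) -> Iterator[Event]:
    """Emit replayed history followed by live data, each event once.

    Duplicates from the overlapping period are dropped, and any missing
    range of sequence numbers is recorded in `gaps` as (first, last).
    """
    last_seq = None
    for source in (replayed, live):
        for seq, payload in source:
            if last_seq is not None and seq <= last_seq:
                continue  # already seen: duplicate from the overlap
            if last_seq is not None and seq > last_seq + 1:
                gaps.append((last_seq + 1, seq - 1))  # missing data
            last_seq = seq
            yield seq, payload


if __name__ == "__main__":
    history = [(1, "a"), (2, "b"), (3, "c")]
    current = [(3, "c"), (4, "d"), (6, "f")]
    missing: List[Tuple[int, int]] = []
    for event in handoff(history, current, missing):
        print(event)
    print("gaps:", missing)
```

Recording gaps rather than raising an error lets the picture continue to be built while still making clear where it is incomplete.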
The key benefit of this pattern is that a business can maintain a view of the important aspects of what is going on as needs change.
The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.