Reviewing a presentation on Streams given by Centerpoint Energy at last year's Information on Demand conference, I noted their comment that, to improve the velocity of analysis, they had considered increasing the capacity of their data warehouse. But a bigger warehouse meant bigger hardware.
Which sparks the question: Does Big Data Require Big Hardware? Emphatically, NO!
Earlier this week, I reviewed Hadoop use cases on the apache.org site, and it was rather surprising to see some of the huge hardware clusters used to perform analytics. But in many of these use cases (sentiment analysis of blogs and tweets, index building of web sites, and so on), there is no need to keep the data any longer than it takes to create the metadata. These use cases are much better suited to real-time analytic processing (RTAP): analyze the data in memory, create the metadata, and discard the original data. In the case of web pages, blogs, and the like, the data is all persisted elsewhere anyway; there is no need for a company to create a second copy of it and incur the cost of managing all that data.
One paper cited the cost of managing data in a warehouse at between $500,000 and $1,000,000 per terabyte of storage. Wikipedia cited $10,000 to $150,000 per terabyte for the initial purchase of a warehouse, with 80% of the total cost of ownership (TCO) coming from monitoring and tuning, and that is before backups, disaster recovery, and the rest. If the initial purchase is only 10% of the TCO, we again see a range of $100K to $1.5M per terabyte.
But doesn't a warehouse appliance or Hadoop lower these costs? Well, yes, but... one article put a terabyte of storage on a Netezza appliance at $2,500, and on Hadoop at merely $250. But again, if the TCO of managing, backing up, and restoring that data is 10x the purchase price, you're up to $2,500 to $25,000 per terabyte.
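The back-of-the-envelope math above can be captured in a few lines. This is only a sketch of the rule of thumb used in this post (initial purchase is roughly 10% of TCO, so TCO is about 10x the per-terabyte price); the function name and the 10% figure are assumptions of this illustration, not figures from any vendor.

```python
def tco_per_tb(initial_cost_per_tb, initial_fraction=0.10):
    """Estimate total cost of ownership per terabyte, assuming the
    initial purchase is a fixed fraction (default 10%) of the TCO."""
    return initial_cost_per_tb / initial_fraction

# Warehouse figures cited above:
print(tco_per_tb(10_000))   # 100000.0  -> ~$100K per TB
print(tco_per_tb(150_000))  # 1500000.0 -> ~$1.5M per TB

# Appliance and Hadoop figures:
print(tco_per_tb(2_500))    # 25000.0   -> ~$25K per TB
print(tco_per_tb(250))      # 2500.0    -> ~$2.5K per TB
```

The point of the exercise is that the purchase price is the small part; whatever storage you buy, the ongoing management multiplies it.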
Streams with real-time analytic processing (RTAP) lets you analyze the data and get the results you need without saving it, eliminating the TCO of managing the data over its lifetime.
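The RTAP pattern described above can be sketched in a few lines of plain Python. This is illustrative only, not Streams SPL or any Streams API; the word lists and function names are invented for the example. The key point is that each record is analyzed in memory and only the derived metadata survives.

```python
# A minimal sketch of the RTAP pattern: analyze each record in memory,
# keep only the derived metadata, and discard the original payload.
from collections import Counter

# Toy sentiment lexicons (illustrative assumptions, not a real model):
POSITIVE = {"great", "love", "fast"}
NEGATIVE = {"slow", "bad", "broken"}

def analyze(stream):
    """Consume a stream of text records, returning aggregate metadata only."""
    sentiment = Counter()
    for record in stream:  # record is analyzed in memory...
        words = set(record.lower().split())
        sentiment["positive"] += len(words & POSITIVE)
        sentiment["negative"] += len(words & NEGATIVE)
        # ...and then goes out of scope: nothing is ever persisted.
    return sentiment

tweets = ["Love how fast the new release is", "The old UI was slow and broken"]
metadata = analyze(tweets)
print(dict(metadata))  # {'positive': 2, 'negative': 2}
```

Only the tiny `sentiment` counter is retained; the raw records carry no storage cost at all, which is exactly why the per-terabyte TCO figures above never apply.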
And more and more customers are finding that Streams' blazing speed means reduced hardware: a 10x reduction in the number of blades for one customer application. Another company, comparing log analysis across 39 different benchmark tests, found that on the same hardware Streams could handle at least 10x the number of events per second as other complex event processing systems. Early work to port a set of applications is indicating a 17x throughput improvement.
Does Big Data Require Big Hardware? NO. By handling 10x the volume on the same hardware, and by eliminating long-term data management costs, a small number of x86 nodes can analyze your big data more effectively.
The question you must ask is: must the data be saved for the future? If not, then look to Streams to save money and improve operational efficiency.