Big data as a concept in IT has come on fast and hard. As with many things in IT, new technology is first used by larger enterprises, and later in the adoption curve small and mid-sized businesses begin to use it. Big data seems to be following the same course.
As big data evolves in the real world, it is being applied to data elements that are not big. Data sets that are small by most standards are being processed by big data tools in ways that are specific to the big data architecture.
Even so, the generally agreed-upon future is that there will be more data, not less; more data sources will be sending data into the enterprise and the speed of data flow will only increase. This is the future playground of big data. One question that comes up about that playground is where it will exist — on premise or in the cloud — and at what points you must consider selecting services.
Definition of a cloud-based big data solution
Like most things that deal with the cloud, defining what exactly the cloud is can be a bit tricky. Many different flavors of cloud exist in the big data space and no one definition is universal (although some are better than others).
First, let's start with a bit of wordplay. The state of big data is reached when the volume, variety, and velocity of incoming data are too much for current relational databases to handle and use in real time. The deployment of big data technologies is the attempt to handle that condition and provide new ways of making productive use of the data, and that means hardware and a new way of organizing data for fast storage and fast reads. This is the essence of big data.
It is also the raison d'etre for Apache Hadoop, MapReduce, and similar projects and products. The cloud-based big data environment needs to be able to reference external data such as enterprise resource planning systems and other on-premise databases, periodically updating it with fresh data. (External here means outside of the big data sandbox.)
That takes care of "storing" the data. Next, you need a way to analyze and present that analysis to where it will affect business processes.
A big data service needs to be able to look at a wide variety of data sources external to the data center, include new data in the data center, accommodate new data elements not yet thought of, and provide a methodology for analyzing and reporting on all of it. These needs for scalability, flexibility, and expandability are well served by a big data environment from a cloud service.
Launching into cloud-based big data
These considerations cover the basic evaluation criteria for launching into big data. Start, experiment, and learn along the way, but the more you define up front what you need from big data, the more focused your experimentation time will be and the quicker you'll get your skill set revved up.
1. Universal real-time indexing of any machine data
This is the core of big data as most people think of it; it is often equated with the open source project Hadoop (see Resources). Don't confuse indexing in Hadoop with an index in a relational database: A Hadoop index is a file index. This way, Hadoop can ingest many different types of data.
Already, companies can be inundated with feeds from radio-frequency ID (RFID) movement, website clicks, and other data that might be structured if the IT people spent some time to make it into structured data and place it in a relational database. That might be worth the investment if you know how the data will be used and how it will be queried and accessed in the future.
Without your having to know the future potential uses of data, Hadoop provides an out. By taking the data just as it comes in, big data passes the data-definition step off until later, when analysis is performed. Hadoop distributes the data over many servers and keeps track of the locations without limiting future use.
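To make the idea of putting off the data-definition step concrete, here is a minimal sketch in plain Python. It is not Hadoop code. The in-memory store, the two record formats, and the parser names (parse_click, parse_rfid) are all assumptions made up for illustration. The point is only that records are stored untouched and a structure is applied later, when someone decides how to analyze them.

```python
import json

# Raw events are stored exactly as they arrive; no schema is enforced on write.
RAW_STORE = []

def ingest(raw_line):
    """Append the raw record untouched (no parsing at ingest time)."""
    RAW_STORE.append(raw_line)

def read_with_schema(parser):
    """Apply a parser chosen at analysis time; skip records it cannot handle."""
    results = []
    for line in RAW_STORE:
        record = parser(line)
        if record is not None:
            results.append(record)
    return results

def parse_click(line):
    """Hypothetical parser for JSON click events."""
    try:
        event = json.loads(line)
    except ValueError:
        return None
    if not isinstance(event, dict) or "url" not in event:
        return None
    return {"user": event.get("user"), "url": event["url"]}

def parse_rfid(line):
    """Hypothetical parser for comma-separated RFID reads: tag,zone,timestamp."""
    parts = line.split(",")
    if len(parts) != 3 or not parts[0].startswith("RFID"):
        return None
    return {"tag": parts[0], "zone": parts[1], "timestamp": parts[2]}

# Mixed feeds land in the same store without any up-front modeling.
ingest('{"user": "u1", "url": "/home"}')
ingest("RFID-0042,dock-3,2013-01-15T08:30:00")
ingest('{"user": "u2", "url": "/cart"}')

# The structure is decided only now, once per analysis.
click_events = read_with_schema(parse_click)
rfid_reads = read_with_schema(parse_rfid)
```

A new parser can be added months later and run over the same raw records, which is the "without limiting future use" property described above.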
2. Free-form search and analysis of real-time and historical data
Storing the data is only part of the way to the goal. The information needs to be relatively easy to recall. The fastest route is a search capability that is quick to set up (quick as in implementation, not response time). Look for a tool set that allows text searches of unstructured data. Apache Lucene (see Resources) is a common tool that provides text indexing and search in a big data environment.
Having a response right on the monitor gives people a warm, fuzzy feeling that everything is being stored correctly and can be accessed. The administrative step for this is to index the contents of the data stored in the distributed nodes. The search queries then access the indexes on the distributed nodes in parallel to provide a faster response.
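The index-then-search-in-parallel pattern can be sketched in a few lines of Python. This is an illustration of the idea only and does not use Lucene's API. The two "nodes" are plain dictionaries, and the document ids and contents are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def build_index(documents):
    """Build an inverted index (term -> set of doc ids) for one node's documents."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_node(index, term):
    """Look up one term in a single node's index."""
    return index.get(term.lower(), set())

def search_all(indexes, term):
    """Query every node's index in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(indexes)) as pool:
        partial_results = pool.map(lambda idx: search_node(idx, term), indexes)
        hits = set()
        for partial in partial_results:
            hits |= partial
    return sorted(hits)

# Hypothetical documents spread over two nodes.
node_a = {"a1": "freezer door left open", "a2": "pallet moved to dock"}
node_b = {"b1": "Freezer temperature alarm", "b2": "customer clicked checkout"}

# The administrative step: index each node's contents once.
indexes = [build_index(node_a), build_index(node_b)]

# The query step: every node answers from its own index at the same time.
freezer_hits = search_all(indexes, "freezer")
```

Each node only searches its own slice of the data, so adding nodes spreads the work rather than lengthening any single lookup.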
3. Automated knowledge discovery from the data
Here's one of the business reasons for going to big data. Just as it can be inefficient to move all the semistructured data into a relational database, performing manual searches and manual reporting is inefficient for analysis.
Data mining and predictive analytics tools are rapidly gaining the ability to use big data both as a data source for analysis and as a database to monitor continually for change. All data-mining tools follow the same process: someone determines the purpose of the analysis, looks at the data, and then develops statistical models that provide insight or make predictions. Those statistical models then need to be deployed in the big data environment to perform continual evaluations. This portion should be automated.
4. Monitor your data and provide real-time alerts
Look for a tool to monitor the data in big data. Tools exist that create queries that are continuously processed, looking for criteria to be met.
I cannot begin to list all the possible uses for real-time monitoring of the data coming into Hadoop. Assuming that most of the in-bound data is unstructured and not destined for a relational database, real-time monitoring is probably the area where a data element is inspected most closely.
For example, you can set an alert when the RFID chip in a frozen food item is stored in a non-frozen area. That alert can go directly to mobile devices that are used in the warehouse, preventing food spoilage.
Customer movements in a store can also be monitored and advertisements targeted at the exact customer standing in front of a specific item can be played on strategically located monitors. (This is futuristic and maybe a tad "Big Brother"-like, but it is possible.)
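The frozen-food alert described above amounts to a standing query that is evaluated against every incoming event. Here is a minimal Python sketch under stated assumptions: the event fields (tag, category, zone), the zone names, and the notify callback are all hypothetical, and a real deployment would read from a live feed rather than a list.

```python
FROZEN_ZONES = {"freezer-1", "freezer-2"}  # hypothetical zone names

def is_violation(event):
    """True when an item tagged as frozen is read outside a frozen zone."""
    return event["category"] == "frozen" and event["zone"] not in FROZEN_ZONES

def monitor(events, notify):
    """Check each incoming event as it arrives and notify on a match."""
    alerts = 0
    for event in events:
        if is_violation(event):
            notify(f"Frozen item {event['tag']} detected in {event['zone']}")
            alerts += 1
    return alerts

# Stand-in for the mobile devices in the warehouse: collect messages in a list.
sent = []
stream = [
    {"tag": "RFID-1", "category": "frozen", "zone": "freezer-1"},
    {"tag": "RFID-2", "category": "frozen", "zone": "dock-3"},
    {"tag": "RFID-3", "category": "dry", "zone": "dock-3"},
]
alert_count = monitor(iter(stream), sent.append)
```

Because the check runs per event, the alert fires as soon as the offending read arrives instead of waiting for a later batch report.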
5. Provide powerful ad hoc reporting and analysis
As with knowledge discovery and automated data mining, analysts need a way to retrieve and summarize the information in the big data cloud environment. The list of vendors whose tools work for reporting from big data seems to grow longer every day.
Some of the tools use Apache Hive and the Hive Query Language (HQL; see Resources). HQL statements are similar to Structured Query Language (SQL) statements and many of the tools that provide familiar styles of reporting from big data use the HQL and Hive interface to run the queries through MapReduce.
Apache Pig is another open source project for reporting and manipulating big data. Its syntax is less like SQL and more like a scripting language. It too runs through MapReduce processing for easy parallel processing.
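To give a feel for what "running through MapReduce" means, the sketch below walks a simple grouped count through map, shuffle, and reduce steps in plain Python. It is a conceptual illustration, not what Hive or Pig actually generate. The query in the comment and the table and column names (clicks, page) are hypothetical.

```python
from collections import defaultdict

# Conceptually equivalent to an HQL statement such as:
#   SELECT page, COUNT(*) FROM clicks GROUP BY page;

def map_phase(record):
    """Emit (key, 1) for each input record, keyed by the grouping column."""
    yield record["page"], 1

def shuffle(pairs):
    """Group emitted values by key, as happens between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle, and reduce in sequence over the input records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

click_rows = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
page_counts = run_job(click_rows)
```

In a real cluster the map step runs on the nodes that hold the data and the reduce step runs per key, which is where the parallelism comes from.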
The cloud-based big data provider should allow both Pig and HQL statements to come from external requesters. That way, the big data store can be queried by people using tools of their own choosing, even using tools that have not yet been created.
6. Provide the ability to rapidly build custom dashboards and views
Like traditional business intelligence project evolution, when people can query big data and produce reports, they want to automate that function and create a dashboard for repetitive viewing with pretty pictures.
Unless people are writing their own Hive statements and using just the Hive shell, most tools have some ability to create dashboard-like views from their query statements. It's a bit early in the deployment of big data to cite many dashboard examples. One prediction, based on the history of business intelligence, is that dashboards will become an important internal delivery vehicle for summarized big data. And going by the history of business intelligence, having good dashboards for big data will be important for getting and maintaining executive support.
7. Scale efficiently to any data volume using commodity hardware
When using a cloud big data service, this is a point of philosophy more than practicality. It is up to the service provider to acquire, provision, and deploy the hardware on which the data resides. The choice of hardware shouldn't matter.
Be thankful, though, when the bill comes that big data was designed to use commodity hardware. There are certain nodes in the architecture where a "high quality" server makes sense. However, the vast majority of nodes (those storing the data) in a big data architecture can be on "lesser quality" hardware.
8. Provide granular, role-based security and access controls
When unstructured data is forced into a relational data world, the complexity of accessing it can keep most people from getting at the data at all. The usual reporting tools won't work. Moving into big data is an active step toward making the complex easier to access. Unfortunately, the same security settings don't usually translate from existing relational systems to big data ones.
Having good security will become more important the more big data is used. Initially, the security can be wide open because no one will know what to do with big data (I'm being sarcastic here). As the company develops more analysis using the data in big data, the results need to be secured, particularly the reports and dashboards, similarly to how reporting from current relational systems is secured.
When getting started with cloud-based big data, be aware of the need to apply security at some point, particularly to the report and dashboard environment. For the beginning, though, I say let the analysts run wild. That's the best way to develop new insight.
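When the time does come to lock down reports and dashboards, a role-based check can be as simple as the following Python sketch. The role names, resource names, and permission strings are assumptions invented for illustration and do not correspond to any particular product's security model.

```python
# Hypothetical mapping of roles to "resource:action" permissions.
ROLE_PERMISSIONS = {
    "analyst": {"raw_data:read", "report:read"},
    "executive": {"dashboard:read"},
    "admin": {
        "raw_data:read",
        "report:read",
        "report:write",
        "dashboard:read",
        "dashboard:write",
    },
}

def is_allowed(roles, resource, action):
    """Return True if any of the user's roles grants the requested permission."""
    needed = f"{resource}:{action}"
    return any(needed in ROLE_PERMISSIONS.get(role, set()) for role in roles)

# Analysts can explore raw data; executives see only finished dashboards.
analyst_can_explore = is_allowed(["analyst"], "raw_data", "read")
executive_can_explore = is_allowed(["executive"], "raw_data", "read")
```

Keeping the rules in one table like this makes it easy to start permissive and tighten access later, one resource at a time.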
9. Support multi-tenancy and flexible deployment
Using the cloud brings up the concept of multi-tenancy — obviously not a consideration in an on-premise big data environment.
Many people do have trepidation about placing critical data in a cloud environment. The point is that the cloud provides the low-cost and quick deployment needed to begin big data projects. Precisely because a cloud provider will put the data in an architecture where hardware resources are shared, the cost is dramatically lower.
All things being equal, it would be nice just to have your data on your servers with somebody else managing the whole setup. That is just not a cost-effective business model when big data needs are intermittent, though. The result is more expense because companies would pay for a lot of idle time, especially during the first projects, when analysts are exploring, playing with, and learning about big data.
10. Integrate and be extensible via documented APIs
Many reading this article might be a couple of big data projects away from writing their own software interfaces to big data. Be aware, though, that it is possible and is done every day.
Big data is designed for access by custom applications. The common access methods use RESTful (Representational State Transfer) application programming interfaces (APIs). These are available for every application in the big data environment — for administrative control, storing data, and reporting on data. Because all of the foundational components of big data are open source, these APIs are well documented and openly available for use. Hopefully, the cloud-based big data provider will allow access to all of the current and future APIs under appropriate security.
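As one example of such an interface, Hadoop's WebHDFS exposes file system operations over HTTP using URLs of the form /webhdfs/v1/<path>?op=<OPERATION>. The Python sketch below only builds those URLs and makes no network call. The host name, port number, and file paths are assumptions for illustration, and your provider's endpoint and authentication setup will differ.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS REST URL of the form /webhdfs/v1/<path>?op=<OP>."""
    query = {"op": op}
    query.update(params or {})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(query)}"

# Hypothetical NameNode address; the port depends on the Hadoop version and setup.
list_url = webhdfs_url("namenode.example.com", 50070, "/data/rfid", "LISTSTATUS")

# The same helper with an extra query parameter naming the requesting user.
open_url = webhdfs_url(
    "namenode.example.com", 50070, "/data/rfid/reads.csv", "OPEN",
    params={"user.name": "analyst"},
)
```

Any HTTP client in any language can then issue a GET against the resulting URL, which is what makes this style of API easy to integrate with custom applications.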
Getting started in cloud-based big data
With the 10 key considerations in mind, select your big data provider. What? Need more information?
Realistically, a big data project starts by doing most everything I've described in batch mode, leaving the real-time aspects until later. By batch, I mean that as tools and processes are learned, the big data environment does not need to run constantly. I suggest looking for a vendor that allows starting and stopping the server instances as needed to minimize cost.
Installing your own on-premise big data environment requires Java™ technology skills and usually Linux® or UNIX® skills as well. With this in mind, ask the prospective cloud vendor how much administrative work needs to be done or how close to turnkey its service is.
One site to visit to learn how to install, test, and maintain a big data environment is BigDataUniversity.com (see Resources). Register there at no cost. Many hours of videos are organized by tracks and the site even offers certificates of completion for many of the tracks. As of this writing, a no-charge download of the e-book Hadoop for Dummies is available.
In parallel with getting training, provision an instance of a big data environment with one of the cloud vendors. Several of the training tracks at BigDataUniversity.com cover installation and use of big data in IBM SmartCloud and on Amazon Web Services. These cloud services (and others) take a lot of the complexity out of installing and deploying your environment. Use the training videos at BigDataUniversity.com to overcome roadblocks that will stymie others in getting big data installed and tested for first-time use.
Fortunately, cloud services take a lot of the maintenance issues in a big data environment off the task list. They will, obviously, take care of hardware and server room needs. You will still have to maintain the data, adding servers and alternative data stores as growth requires.
Big data is a learning and growth experience for everyone. New and different tools are constantly coming onto the market. Existing vendors in the business intelligence space are providing the hooks to use their tools with big data back ends.
Using a cloud-based big data environment makes starting much, much easier. Take advantage of the ease of starting by using a cloud service to step into big data for a small project first. Get your feet wet and learn. Demonstrate the value and then leap confidently into larger projects in the near future.
Resources
- Check out BigDataUniversity.
- Learn more about multitenancy.
- Check out IBM developerWorks' big data area, and find developer and database administrator resources, tutorials, and articles to help you grow your knowledge on big data technology and IBM's integrated big data platform.
- Learn more about big data and how it's used on IBM platforms.
- Create your developerWorks profile today and set up a watch list for topics that interest you. Get connected and stay connected with the developerWorks community.
- In the developerWorks cloud developer resources, discover and share knowledge and experience of application and services developers building their projects for cloud deployment.
- Follow developerWorks on Twitter.
- Watch developerWorks demos ranging from product installation and setup demos for beginners to advanced functionality for experienced developers.
- Learn more about the Hadoop project.
- Learn more about Lucene search technology.
- IBM provides good background on Hive.
- Learn more about Pig, Apache's data analysis package.
- Learn more about IBM SmartCloud, IBM's public cloud infrastructure and platform services.
- Learn more about Amazon Web Services (AWS).
Get products and technologies
- Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement service-oriented architecture efficiently.
- Get involved in the developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.