At some point, any evaluation of a new technology or process should ask the question "What's in it for me?" After all, what does business analytics in "the cloud" really accomplish?
Technology companies, systems designers, and developers have promoted many slightly different definitions of cloud computing, but almost all of them coalesce around these points:
- Cloud computing allows applications and even entire separate systems to share traditional computing resources.
- Cloud computing allows apps and systems to share data.
- Cloud computing leverages the concept of distribution to maintain high availability.
- Cloud computing's greatest benefit to traditional IT is its ability to extend organizational-internal IT both with expertise and resources on an as-needed basis.
- Cloud computing's greatest benefit to organizations is the ability to extend the reach of an organization's product, service, or message in a scheduled, per-cost way.
- Cloud computing's greatest benefit to product and service users is the ability to provide seamless access on an on-demand schedule.
- Cloud computing can be internal, external, or a mix of both.
- Most people already engage cloud computing every time they log on to the Internet; they may just not know it.
The second point (sharing data) is the defining (pardon the pun) criterion in relation to cloud computing and analytics, since a great library of interesting data is already in the cloud or available to cloud systems. The cloud structure provides an almost "endless container" of data, the ideal input to sharpen an analytics tool on.
And since the data is out there, the next logical step is to learn how to analyze it where it lives. In this article, I'll show you how to write a dashboard to get started with business analytics in the cloud, but first, let me define my concept of business analytics.
What do I mean by business analytics?
Business analytics can mean many things, just like cloud computing. In the code example included in this article, I will make it mean analyzing customer support email for the "tone" of customer service correspondence.
I like to use two examples to corral a working definition of business analytics:
- Robert McNamara and the introduction of systems analysis into public policy (now known as policy analysis).
- Game theory, AI theory, and raw statistical analysis.
McNamara and systems analysis
Business analytics in its current form owes much to the pioneering efforts of people like Robert McNamara, the eighth US Secretary of Defense under Presidents Kennedy and Johnson, who used quantitative analysis to aid the World War II effort. In the broadest definition, business analytics is anything and everything (tools, processes, and techniques included) that provides analysis of business data. (Hence the "business" part of business analytics; change the data to "scientific" or "sociological" and you've got analytics that pertains to a different discipline. Same framework, different I/O.)
Most often when Robert McNamara's name is brought up, it is for his association with the tragedies of the Vietnam War, but his legacy is more complex and nuanced than you find in his New York Times obituary as the "architect of a failed war."
After receiving an MBA from Harvard in 1939, McNamara applied quantitative analysis to army efforts in World War II, analyzing the efficiency and effectiveness of B-29 bombers as part of an elite team called Statistical Control. In 1946 he joined Ford as part of a team of "Whiz Kids" who achieved improvements with modern management control systems. He helped turn around Ford Motor Company, ultimately becoming the first non-family president of Ford in 1960.
McNamara is, perhaps, a canonical example of both the potential success — wringing cost-effective profitability out of Ford Motor Company — and the potential failure of business analytics — by not including certain sets of vital input in the analysis of how to conduct the war in Vietnam.
Games and statistical analysis
Some business analytics software makes use of artificial intelligence theory (in the form of data mining), game theory (like the prisoner's dilemma and Nash equilibrium), and raw statistical analysis.
When you combine the discipline of artificial intelligence with computer science, you come up with the practice of data mining, which is simply the process of extracting patterns from data.
This isn't a new concept. Here are two precursors of data mining that were used to discover patterns in data:
- Bayes' theorem (Thomas Bayes, 1702-1761) shows the relation between two conditional probabilities that are the reverse of each other: the probability of A given the occurrence of B, and the probability of B given the occurrence of A.
- Regression analysis (earliest form, method of least squares, Adrien-Marie Legendre in 1805, Carl Friedrich Gauss in 1809) contains techniques for modeling and analyzing several variables (where one is dependent and the others are independent); perfect for creating prediction models based on patterns found in data.
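Both precursors fit in a few lines of Python. The following is a minimal sketch, not part of the downloadable sample: the function names are my own, and every number (the 2% complaint rate, the word frequencies, the email and hours figures) is made up purely for illustration.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A), and P(B|not A) using Bayes' theorem."""
    # Total probability of observing B, whether or not A holds.
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / evidence


def least_squares_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical numbers: 2% of emails are complaints (A); the word "refund" (B)
# appears in 60% of complaints and in 5% of all other emails.
p_complaint_given_refund = bayes_posterior(0.02, 0.60, 0.05)

# Hypothetical data: weekly support emails (x) versus hours spent replying (y).
slope, intercept = least_squares_line([10, 20, 30, 40], [5, 9, 13, 17])
```

With these made-up figures, an email containing "refund" has roughly a 20% chance of being a complaint, even though the word shows up in most complaints, because complaints themselves are rare. The regression half simply recovers the straight line that the sample points were generated from.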
Computer technology (such as cloud computing) has increased the ability to collect, store, access, and manipulate data; because of the massive volumes of data available, direct hands-on analysis is now almost always augmented with some level of automated data-manipulation techniques, such as neural networks, clustering, genetic algorithms, decision trees, and support vector machines.
Beyond the obvious benefit (data mining lets individuals widen the amount of data used in analysis by discovering patterns in large amounts of data), data mining's AI component comes in handy for cloud-related business analytics goals by helping the researcher incorporate different types of data; for example, unstructured as well as structured data. (For more on this, see IBM®'s Big Sheets technology.)
Game theory is the branch of applied mathematics that attempts to mathematically capture behavior in strategic situations where an individual's success in making choices depends on the choices that other participants make.
The "prisoner's dilemma" is perhaps the most famous game theory problem. It involves a scenario in which the "dominant" strategy leads to a non-optimal outcome for everyone involved. This is also loosely related to the problem called the "tragedy of the commons" in which self-interest ensures a worse outcome for the community.
Many game theory problems directly apply to business processes. For example:
- The "prisoner's dilemma" demonstrates why two people might not cooperate even if it is in both their best interests to do so. In the classic example, two suspects are arrested by the police. The police have insufficient evidence for a conviction, and, having separated the prisoners, visit each of them to offer the same deal. If one testifies for the prosecution against the other (defects) and the other remains silent (cooperates), the defector goes free and the silent accomplice receives the full 10-year sentence. If both remain silent, both prisoners are sentenced to only six months in jail for a minor charge. If each betrays the other, each receives a five-year sentence. Each prisoner must choose to betray the other or to remain silent.
- The "Nash equilibrium" (after inventor John Forbes Nash) is a solution of a game involving two or more players in which each player is assumed to know the equilibrium strategies of the other players and no player has anything to gain by changing only his or her own strategy unilaterally. If each player has chosen a strategy and no player can benefit by changing his or her strategy while the other players keep their strategies unchanged, then the current set of strategy choices and the corresponding payoffs constitute a Nash equilibrium.
- The "tragedy of the commons" is a dilemma arising from the situation in which multiple individuals, acting independently, and solely and rationally consulting their own self-interest, will ultimately deplete a shared limited resource even when it is clear that it is not in anyone's long-term interest for this to happen.
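The prisoner's dilemma and the Nash equilibrium can be checked against each other mechanically. The sketch below encodes the sentences from the classic example above (six months, five years, ten years, or freedom) and tests every pair of choices for the Nash condition. The names `SENTENCES` and `is_nash_equilibrium` are my own, chosen for illustration.

```python
from itertools import product

SILENT, BETRAY = "silent", "betray"
STRATEGIES = (SILENT, BETRAY)

# Years in jail for (prisoner 1, prisoner 2), taken from the classic example.
SENTENCES = {
    (SILENT, SILENT): (0.5, 0.5),
    (SILENT, BETRAY): (10, 0),
    (BETRAY, SILENT): (0, 10),
    (BETRAY, BETRAY): (5, 5),
}


def is_nash_equilibrium(choice_1, choice_2):
    """True if neither prisoner can shorten their own sentence by switching alone."""
    years_1, years_2 = SENTENCES[(choice_1, choice_2)]
    # Best sentence each prisoner could get while the other's choice stays fixed.
    best_1 = min(SENTENCES[(alt, choice_2)][0] for alt in STRATEGIES)
    best_2 = min(SENTENCES[(choice_1, alt)][1] for alt in STRATEGIES)
    return years_1 == best_1 and years_2 == best_2


equilibria = [pair for pair in product(STRATEGIES, repeat=2)
              if is_nash_equilibrium(*pair)]
```

The only pair that survives the check is mutual betrayal, which costs the two prisoners ten years in total, compared with one year in total if both had stayed silent. That gap between the stable outcome and the best joint outcome is the dilemma.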
The real power that game theory brings to cloud-related analytics is in the role of developing online algorithms, algorithms that can process input piece-by-piece in a serial fashion (in the order that the input comes in) without having the entire input available at the start of the process. Using this type of algorithm, patterns can emerge from the analysis of existing and additional data so decisions can be made at earlier stages in the analysis, then reviewed when more input is made available.
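To make the idea of an online algorithm concrete, here is a minimal sketch of a running mean that sees each value once, in arrival order, and never stores the full input. The class name `RunningMean` and the stream of satisfaction scores are hypothetical; any per-item statistic could be maintained the same way.

```python
class RunningMean:
    """Online mean: each value is seen once, in arrival order, then discarded."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        """Fold one new value into the mean and return the current estimate."""
        self.count += 1
        # Incremental update: shift the mean toward the new value by 1/count.
        self.mean += (value - self.mean) / self.count
        return self.mean


# Hypothetical stream: customer satisfaction scores arriving one at a time.
tracker = RunningMean()
estimates = [tracker.update(score) for score in [4, 2, 3, 5]]
```

An estimate is available after every single score, so a decision can be made early and then revised as more input arrives, without reprocessing anything already seen.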
Statistical programming languages, simply defined as those designed for statistical applications, are logically a strong component for business analytics in any environment.
The de facto standard for programmatic statistical analysis is the R programming language. R is an implementation of the S programming language combined with lexical scoping semantics inspired by Scheme; it is part of the GNU project and its source code is freely available under the GNU General Public License. Precompiled binary versions are available for various operating systems. You can manipulate R objects directly with C and Java™ code.
The R language provides many useful libraries for performing statistical analysis on large and small data sets. According to expert David Mertz:
The R environment is not intended to be a programming language per se, but rather an interactive tool for exploring data sets, including the generation of a wide range of graphic representations of data properties.
—"Statistical programming with R" series
An open question with any new software development problem though is whether it makes sense to "roll your own" solution or buy off the shelf. I won't answer that question in this article.
Now let's look at how to get started with business analytics by writing your own analytics dashboard program in Python.
Write your own analytics dashboard
In this example, I demonstrate how to read an email account, transform the data into tagged parts of speech, and then use that information to create three-word phrases that get sorted by occurrence. (The Python code is available from downloads.)
Finally, this data gets placed into a chart using the Google Visualization API. Because the code example is long, I break it down into steps and talk about each step separately. You can download the entire code sample and follow along with your own email server (like Gmail).
Figure 1 shows the data processing pipeline.
Figure 1. The data processing pipeline
- Access your email server. Listing 1 shows you how.
Listing 1. Programmatically access your email program
import email
import imaplib

username = "firstname.lastname@example.org"
password = "asecretpassword"
server = "imap.example.com"
folder = "INBOX"

def connect_inbox():
    """Grab data: yield every message in the folder as a parsed email object"""
    mail = imaplib.IMAP4_SSL(server, 993)
    mail.login(username, password)
    mail.select(folder)
    # search() returns a one-element list holding space-separated message numbers
    status, count = mail.search(None, 'ALL')
    try:
        for num in count[0].split():
            # fetch() returns a list whose first item is (envelope, raw message bytes)
            status, data = mail.fetch(num, '(RFC822)')
            yield email.message_from_bytes(data[0][1])
    finally:
        mail.close()
        mail.logout()

def get_plaintext(messages):
    """Retrieve text/plain version of message"""
    for message in messages:
        for part in message.walk():
            if part.get_content_type() == "text/plain":
                yield part
In this code snippet, a connection is made to an IMAP server and each message is yielded back to another function that looks for plain-text message content types, then yields them back.
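If you want to see what the plain-text filter yields without connecting to a mail server, you can feed it a message built locally. This is a small sketch under that assumption: the sample message and its wording are invented, and `get_plaintext` is repeated here only so that the snippet runs on its own.

```python
from email.message import EmailMessage


def get_plaintext(messages):
    """Same logic as Listing 1: yield every text/plain part of every message."""
    for message in messages:
        for part in message.walk():
            if part.get_content_type() == "text/plain":
                yield part


# Hypothetical message built locally so that no IMAP server is required.
sample = EmailMessage()
sample["Subject"] = "Order status"
sample.set_content("I want to cancel my order.")  # becomes the text/plain part
sample.add_alternative("<p>I want to cancel my order.</p>", subtype="html")

plain_parts = list(get_plaintext([sample]))
body = plain_parts[0].get_content().strip()
```

The message has both a plain-text and an HTML alternative, but only the plain-text part comes back, which is what the later tagging step expects.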
- Transform and tag the data's parts of speech using the Python Natural Language Toolkit. The nltk library is used to transform raw text into sentences, and then words, and then finally into pairs of words tagged with their parts of speech, such as verb, noun, etc. This information is then used to create a three-word phrase consisting of a verb on each side of the word "to"; when this pattern is matched, the three-word phrase is yielded back along with the integer 1, so it can be summarized later (Listing 2).
Listing 2. Tagging and transforming
import nltk

def transform(messages):
    """Transforms data by tokenizing and tagging parts of speech"""
    for message in messages:
        # Raw text -> sentences -> words -> (word, tag) pairs
        sentences = nltk.sent_tokenize(str(message))
        sentences = [nltk.word_tokenize(sent) for sent in sentences]
        sentences = [nltk.pos_tag(sent) for sent in sentences]
        yield sentences

def three_letter_phrase(messages):
    """Yields a three word phrase with TO"""
    for message in messages:
        for sentence in message:
            for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
                # Keep only verb + "to" + verb, for example "want to cancel"
                if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'):
                    yield ((w1, w2, w3), 1)
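The pattern match itself is easy to try without installing NLTK or its tagger models. In the sketch below, the sentence is tagged by hand with Penn Treebank-style tags (an assumption standing in for real `nltk.pos_tag` output), and `trigrams` is a plain-Python stand-in for `nltk.trigrams`. The helper name `verb_to_verb_phrases` is my own.

```python
def trigrams(sequence):
    """Return consecutive triples; a plain-Python stand-in for nltk.trigrams."""
    items = list(sequence)
    return zip(items, items[1:], items[2:])


def verb_to_verb_phrases(tagged_sentence):
    """Yield ((w1, w2, w3), 1) for every verb + "to" + verb triple."""
    for (w1, t1), (w2, t2), (w3, t3) in trigrams(tagged_sentence):
        if t1.startswith("V") and t2 == "TO" and t3.startswith("V"):
            yield (w1, w2, w3), 1


# Hand-tagged sentence (assumed tags, not produced by a real tagger).
tagged = [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("cancel", "VB"),
          ("my", "PRP$"), ("order", "NN"), (".", ".")]

phrases = list(verb_to_verb_phrases(tagged))
```

Only "want to cancel" matches, and it is paired with the integer 1 so that the summarizing step can add up repeated occurrences.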
- Summarize the data.
A series of functions uses the MapReduce style to extract, partition, and then summarize the occurrence of these three-word phrases (Listing 3). Note that this is written in a completely sequential fashion, although with a bit of effort it could be converted to be more parallel. For more detailed information on MapReduce, see Resources.
Listing 3. Summarize the data using MapReduce
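from collections import defaultdict
from multiprocessing import Pool

def mapper():
    """Map step: emit (phrase, 1) for every matching phrase in the inbox"""
    messages = connect_inbox()
    text_messages = get_plaintext(messages)
    transformed = transform(text_messages)
    for item, count in three_letter_phrase(transformed):
        yield item, count

def phrase_partition(phrases):
    """Partition step: group the emitted counts by phrase"""
    partitioned_data = defaultdict(list)
    for phrase, count in phrases:
        partitioned_data[phrase].append(count)
    return partitioned_data.items()

def reducer(phrase_key_val):
    """Reduce step: collapse one (phrase, [1, 1, ...]) pair into [phrase, total]"""
    phrase, counts = phrase_key_val
    return [phrase, sum(counts)]

def start_mr(mapper_func, reducer_func, processes=1):
    """Run map, partition, and reduce; only the reduce step uses the pool"""
    with Pool(processes) as pool:
        map_output = mapper_func()
        partitioned_data = phrase_partition(map_output)
        reduced_output = pool.map(reducer_func, partitioned_data)
    return reduced_output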
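To see what the partition and reduce steps do to the data, it helps to run them on a few hand-written pairs. This sketch repeats the two functions from Listing 3 so it runs on its own, replaces the inbox with a hypothetical `map_output` list, and applies the reducer sequentially rather than through a process pool.

```python
from collections import defaultdict


def phrase_partition(phrases):
    """Group the 1s emitted by the map step under their phrase key."""
    partitioned_data = defaultdict(list)
    for phrase, count in phrases:
        partitioned_data[phrase].append(count)
    return partitioned_data.items()


def reducer(phrase_key_val):
    """Collapse one (phrase, [1, 1, ...]) pair into [phrase, total]."""
    phrase, counts = phrase_key_val
    return [phrase, sum(counts)]


# Hypothetical map output; in Listing 3 this comes from the mapper() generator.
map_output = [
    (("want", "to", "cancel"), 1),
    (("need", "to", "return"), 1),
    (("want", "to", "cancel"), 1),
]

reduced = [reducer(item) for item in phrase_partition(map_output)]
```

Because each phrase's list of counts is reduced independently of every other phrase, the list comprehension can be swapped for a parallel map without changing the result.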
- Sort the results and then use the Google Chart API to graph the results into a pie chart (Listing 4).
Listing 4. Visualizing the results
Figure 2. The results in a pie chart
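As a rough illustration of the sorting half of this step only, the sketch below orders reducer output by count and flattens each phrase into a label/value row. The sample data is invented, the row shape is an assumption about what a pie chart needs, and no Google API is called here.

```python
from operator import itemgetter

# Hypothetical reducer output in the [phrase, count] shape used by Listing 3.
reduced_output = [
    [("need", "to", "return"), 1],
    [("want", "to", "cancel"), 4],
    [("like", "to", "know"), 2],
]

# Most frequent phrases first.
ranked = sorted(reduced_output, key=itemgetter(1), reverse=True)

# Label/value rows, one per pie slice.
chart_rows = [(" ".join(phrase), count) for phrase, count in ranked]
```

Sorting before charting also makes it easy to keep only the top few phrases when the inbox produces more slices than a pie chart can show legibly.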
In this article, I've covered some of the theory behind quantitative analysis and business analytics, then provided a proof-of-concept code example. You can easily convert this example into a customer mood indicator for a business that provides customer support by analyzing the "tone" and "mood" of conversations. With a bit more work, it can be plugged into some other engine of logic which can be chained into another engine, and so on.
Analyzing data in the cloud is one unstoppable future direction of the Internet; as businesses increasingly turn to cloud solutions, the need to write or buy business analytics software will only increase. Hopefully, this article has given you some ideas of what you can develop for your business, either through creation or purchase.
|Sample Python code for this article||emailbi.zip||2KB|
- In the developerWorks cloud developer resources, discover and share knowledge and experience of application and services developers building their projects for cloud deployment.
- More developerWorks resources that match this article can be found at open source at developerWorks.
- The next step: Find out how to access IBM Smart Business Development and Test on the IBM Cloud.
Get products and technologies
- See the product images available on the IBM Smart Business Development and Test on the IBM Cloud.
- Join a cloud computing group on developerWorks.
- Read all the great cloud blogs on developerWorks.
- Join the developerWorks community, a professional network and unified set of community tools for connecting, sharing, and collaborating.