Cloud business analytics: Write your own dashboard

Find patterns in multitudes of cloud business analytics data

Business analytics and cloud computing are hot, complex topics; the idea of combining the two could drive away those with less experience. But fear not: The author provides a simple look at the complex history of business analytics, illuminates the common points where both meet, explains the benefits that a cloud environment can bring to business analytics (and vice-versa), and gives you an example for writing your own cloud business analytics application.


Noah Gift, Associate Director Engineer, AT&T Interactive

Noah Gift is an experienced technical leader and software developer at AT&T Interactive. He solves interesting problems in a variety of languages including Python/Iron Python, Erlang, F#, C#, and JavaScript. (He's also worked at Caltech, Disney Feature Animation, Sony Imageworks, and Weta Digital.) A member of the Python Software Foundation, he is also an author of many developerWorks articles and the co-author of Python For Unix and Linux System Administration. He earned a BS in Nutritional Science from Cal Poly San Luis Obispo, an MS in Computer Information Systems from CSULA, and is an MBA Candidate at UC Davis specializing in business analytics, finance, and entrepreneurship. In his spare time, he composes for the piano and runs in marathons. Find him at his web site, on Twitter, or for consulting.



12 January 2011


At some point, the evaluation of any new technology or process should ask the question "What's in it for me?" After all, what does business analytics in "the cloud" really accomplish?

Technology companies, systems designers, and developers have promoted many slightly different definitions of cloud computing, but almost all of them coalesce around these points:

  1. Cloud computing allows applications and even entire separate systems to share traditional computing resources.
  2. Cloud computing allows apps and systems to share data.
  3. Cloud computing leverages the concept of distribution to maintain high availability.
  4. Cloud computing's greatest benefit to traditional IT is its ability to extend organizational-internal IT both with expertise and resources on an as-needed basis.
  5. Cloud computing's greatest benefit to organizations is the ability to extend the reach of an organization's product, service, or message in a scheduled, per-cost way.
  6. Cloud computing's greatest benefit to product and service users is the ability to provide seamless access on an on-demand schedule.
  7. Cloud computing can be internal, external, or a mix of both.
  8. Most people already use cloud computing every time they log on to the Internet; they may just not know it.

Definition #2 is the defining (pardon the pun) criterion in relation to cloud computing and analytics, since a great library of interesting data is already in the cloud or available to cloud systems. The cloud structure provides an almost "endless container" of data, the ideal input to sharpen an analytics tool on.

And since the data is out there, the next logical step is to learn how to analyze it where it lives. In this article, I'll show you how to write a dashboard to get started with business analytics in the cloud, but first, let me define my concept of business analytics.

What do I mean by business analytics?

Business analytics can mean many things, just like cloud computing. In the code example included in this article, I will make it mean analyzing customer support email for the "tone" of customer service correspondence.

I like to use two examples to corral a working definition of business analytics:

  • Robert McNamara and the introduction of systems analysis into public policy (now known as policy analysis).
  • Game theory, AI theory, and raw statistical analysis.

McNamara and systems analysis

Business analytics in its current form owes much to the pioneering efforts of people like Robert McNamara, the eighth US Secretary of Defense under Presidents Kennedy and Johnson, who used quantitative analysis to aid the World War II effort. In the broadest definition, business analytics is anything and everything (tools, processes, and techniques included) that provides analysis of business data. (Hence the "business" part of business analytics; change the data to "scientific" or "sociological" and you've got analytics that pertains to a different discipline. Same framework, different I/O.)

Most often when Robert McNamara's name is brought up, it is for his association with the tragedies of the Vietnam War, but his legacy is more complex and nuanced than you find in his New York Times obituary as the "architect of a failed war."

After receiving an MBA from Harvard in 1939, McNamara applied quantitative analysis to army efforts in World War II to analyze B-29 bombers' efficiency and effectiveness, as part of an elite team called Statistical Control. In 1946 he joined Ford as part of a team of "Whiz Kids" who achieved improvements with modern management control systems. He helped turn around Ford Motor Company, ultimately becoming the first non-family president of Ford in 1960.

McNamara is, perhaps, a canonical example of both the potential success — wringing cost-effective profitability out of Ford Motor Company — and the potential failure of business analytics — by not including certain sets of vital input in the analysis of how to conduct the war in Vietnam.

Games and statistical analysis

Some business analytics software makes use of artificial intelligence theory (in the form of data mining), game theory (like the prisoner's dilemma and Nash equilibrium), and raw statistical analysis.

Artificial intelligence
When you combine the discipline of artificial intelligence with computer science, you come up with the practice of data mining, which is simply the process of extracting patterns from data.

This isn't a new concept. Here are two precursors of data mining that were used to discover patterns in data:

  • Bayes' theorem (Thomas Bayes, 1702-1761) relates two conditional probabilities that are the reverse of each other: the probability of A given the occurrence of B, and the probability of B given the occurrence of A.
  • Regression analysis (earliest form, method of least squares, Adrien-Marie Legendre in 1805, Carl Friedrich Gauss in 1809) contains techniques for modeling and analyzing several variables (where one is dependent and the others are independent); perfect for creating prediction models based on patterns found in data.
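To make these two precursors concrete, here is a minimal sketch in Python. The function names (bayes_posterior, least_squares_fit) and all of the figures are invented for illustration; they are not part of the article's downloadable code.

```python
def bayes_posterior(prior_a, prob_b_given_a, prob_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A), and P(B|not A) using Bayes' theorem."""
    # Total probability of B across both cases (A and not A).
    prob_b = prob_b_given_a * prior_a + prob_b_given_not_a * (1 - prior_a)
    return prob_b_given_a * prior_a / prob_b


def least_squares_fit(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical figures: 20% of support emails are complaints (A); the word
# "refund" (B) appears in 50% of complaints and in 5% of all other emails.
p_complaint_given_refund = bayes_posterior(0.2, 0.5, 0.05)

# Hypothetical data: emails received per day vs. hours spent answering them.
slope, intercept = least_squares_fit([10, 20, 30, 40], [2.0, 4.0, 6.0, 8.0])

print(round(p_complaint_given_refund, 3), round(slope, 3), round(intercept, 3))
```

The first call reverses a conditional probability (from "how often complaints mention refunds" to "how likely an email that mentions refunds is a complaint"); the second produces a simple prediction model from observed pairs.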

Some terms

neural networks (artificial) | interconnecting artificial neurons, programming constructs that mimic the properties of biological neurons, that are used to solve artificial intelligence problems

clustering, cluster analysis | the assignment of a set of observations into subsets called clusters so that attributes of elements in the same cluster are similar

genetic algorithm | a search heuristic (an experience-based technique for problem solving, learning, and discovery) that mimics the process of natural evolution, good for optimization of search problems

decision tree | a predictive model that maps observations about an element to conclusions about the element's target value (how much influence it should have in decisions); in data mining, the DT describes the data, not the decision

support vector machines | a set of methods that analyze data and recognize patterns; SVM takes a set of input data and predicts for each given input which of two possible classes the input is a member of (also known as a non-probabilistic binary linear classifier)

data structure | structured data is data that has been organized to a particular scheme to make it more efficient for storage and manipulation purposes; unstructured data is data that doesn't follow a predefined organization scheme

Computer technology (such as cloud computing) has increased the ability to collect, store, access, and manipulate data; because of the massive volumes of data available, direct hands-on analysis is now almost always augmented with some level of automated data-manipulation techniques, such as neural networks, clustering, genetic algorithms, decision trees, and support vector machines.

Beyond the obvious benefit (data mining lets individuals widen the amount of data used in analysis by discovering patterns in large amounts of data), data mining's AI component serves cloud-related business analytics goals by helping the researcher incorporate different types of data; for example, unstructured as well as structured data. (For more on this, see IBM®'s Big Sheets technology.)

Game theory
Game theory is the branch of applied mathematics that attempts to mathematically capture behavior in strategic situations where an individual's success in making choices depends on the choices that other participants make.

The "prisoner's dilemma" is perhaps the most famous game theory problem. It involves a scenario in which the "dominant" strategy leads to a non-optimal outcome for everyone involved. This is also loosely related to the problem called the "tragedy of the commons" in which self-interest ensures a worse outcome for the community.

Many game theory problems directly apply to business processes. For example:

  • The "prisoner's dilemma" demonstrates why two people might not cooperate even if it is in both their best interests to do so. In the classic example, two suspects are arrested by the police. The police have insufficient evidence for a conviction, and, having separated the prisoners, visit each of them to offer the same deal. If one testifies for the prosecution against the other (defects) and the other remains silent (cooperates), the defector goes free and the silent accomplice receives the full 10-year sentence. If both remain silent, both prisoners are sentenced to only six months in jail for a minor charge. If each betrays the other, each receives a five-year sentence. Each prisoner must choose to betray the other or to remain silent.
  • The "Nash equilibrium" (after inventor John Forbes Nash) is a solution of a game involving two or more players in which each player is assumed to know the equilibrium strategies of the other players and no player has anything to gain by changing only his or her own strategy unilaterally. If each player has chosen a strategy and no player can benefit by changing his or her strategy while the other players keep their strategies unchanged, then the current set of strategy choices and the corresponding payoffs constitute a Nash equilibrium.
  • The "tragedy of the commons" is a dilemma arising from the situation in which multiple individuals, acting independently, and solely and rationally consulting their own self-interest, will ultimately deplete a shared limited resource even when it is clear that it is not in anyone's long-term interest for this to happen.
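The prisoner's dilemma and the Nash equilibrium can be checked by brute force. The sketch below encodes the sentences from the classic example above and tests every pair of choices; the names (SENTENCES, is_nash_equilibrium) are made up for this illustration.

```python
from itertools import product

SILENT, BETRAY = "silent", "betray"
STRATEGIES = (SILENT, BETRAY)

# Years in jail for (prisoner A, prisoner B), from the classic example above.
SENTENCES = {
    (SILENT, SILENT): (0.5, 0.5),
    (SILENT, BETRAY): (10, 0),
    (BETRAY, SILENT): (0, 10),
    (BETRAY, BETRAY): (5, 5),
}


def is_nash_equilibrium(choice_a, choice_b):
    """True if neither prisoner can shorten their own sentence by switching alone."""
    years_a, years_b = SENTENCES[(choice_a, choice_b)]
    a_cannot_improve = all(
        SENTENCES[(alt, choice_b)][0] >= years_a for alt in STRATEGIES)
    b_cannot_improve = all(
        SENTENCES[(choice_a, alt)][1] >= years_b for alt in STRATEGIES)
    return a_cannot_improve and b_cannot_improve


# Every pair of choices where no one gains by changing strategy unilaterally.
equilibria = [pair for pair in product(STRATEGIES, repeat=2)
              if is_nash_equilibrium(*pair)]

# The pair of choices with the fewest total years in jail.
best_joint_outcome = min(SENTENCES, key=lambda pair: sum(SENTENCES[pair]))

print("Nash equilibria:", equilibria)
print("Best joint outcome:", best_joint_outcome)
```

With these sentences, the only equilibrium is mutual betrayal, while mutual silence gives the shortest combined sentence. That gap is the dilemma.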

The real power that game theory brings to cloud-related analytics is in the role of developing online algorithms, algorithms that can process input piece-by-piece in a serial fashion (in the order that the input comes in) without having the entire input available at the start of the process. Using this type of algorithm, patterns can emerge from the analysis of existing and additional data so decisions can be made at earlier stages in the analysis, then reviewed when more input is made available.
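As a rough illustration of the online style of processing (not of game theory itself), the following sketch consumes phrases one at a time and can report a current "leader" after every input, revising it as more data arrives. The class name, the tie-breaking rule, and the sample stream are assumptions made for this example.

```python
from collections import Counter


class OnlinePhraseCounter:
    """Keeps running phrase counts so an answer is available after each input."""

    def __init__(self):
        self.counts = Counter()
        self.leader = None

    def update(self, phrase):
        """Consume one phrase and return the most frequent phrase seen so far.

        The current leader is replaced only when another phrase strictly
        exceeds its count, so ties keep the earlier decision.
        """
        self.counts[phrase] += 1
        if self.leader is None or self.counts[phrase] > self.counts[self.leader]:
            self.leader = phrase
        return self.leader


# Hypothetical stream of phrases arriving in order.
stream = ["want to thank", "need to cancel", "need to cancel",
          "want to thank", "need to cancel"]

counter = OnlinePhraseCounter()
leaders = [counter.update(phrase) for phrase in stream]
print(leaders)
```

The early answer ("want to thank") is usable right away and is revised once "need to cancel" overtakes it, without ever holding the whole input in memory at once.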

Statistical analysis
Statistical programming languages, simply defined as those designed for statistical applications, are logically a strong component for business analytics in any environment.

The de facto standard for programmatic statistical analysis is the R programming language. R is an implementation of the S programming language combined with lexical scoping semantics inspired by Scheme; it is part of the GNU project and its source code is freely available under the GNU General Public License. Precompiled binary versions are available for various operating systems. You can manipulate R objects directly with C and Java™ code.

The R language provides many useful libraries for performing statistical analysis on large and small data sets. According to expert David Mertz:

The R environment is not intended to be a programming language per se, but rather an interactive tool for exploring data sets, including the generation of a wide range of graphic representations of data properties.
"Statistical programming with R" series

An open question with any new software development problem though is whether it makes sense to "roll your own" solution or buy off the shelf. I won't answer that question in this article.

Now let's look at how to get started with business analytics by writing your own analytics dashboard program in Python.


Write your own analytics dashboard

In this example, I demonstrate how to read an email account, transform the data into tagged parts of speech, and then use that information to create three-word phrases that get sorted by occurrence. (The Python code is available from the Download section.)

Finally, this data gets placed into a chart using the Google Visualization API. Because the code example is long, I break it down into steps and talk about each step separately. You can download the entire code sample and follow along with your own email server (like Gmail).

Figure 1 shows the data processing pipeline.

Figure 1. The data processing pipeline
  1. Access your email server. Listing 1 shows you how.
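    Listing 1. Programmatically access your email program
    import email
    import imaplib
    
    username = "myaccount@example.com"
    password = "asecretpassword"
    server = "imap.example.com"
    folder = "INBOX"
    
    def connect_inbox():
        """Connect to the IMAP server and yield each message in the folder"""
        mail = imaplib.IMAP4_SSL(server, 993)
        mail.login(username, password)
        mail.select(folder)
        # search() returns a status and a list holding one space-separated
        # string of message numbers
        status, count = mail.search(None, 'ALL')
        try:
            for num in count[0].split():
                status, data = mail.fetch(num, '(RFC822)')
                # data[0][1] holds the raw message; on Python 3 it is bytes,
                # so use email.message_from_bytes there instead
                yield email.message_from_string(data[0][1])
        finally:
            # Runs when the generator is exhausted or closed
            mail.close()
            mail.logout()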
    
    def get_plaintext(messages):
        """Retrieve text/plain version of message"""
        for message in messages:
            for part in message.walk():
                if part.get_content_type() == "text/plain":
                    yield part

    In this code snippet, a connection is made to an IMAP server and each message is yielded back to another function that looks for plain-text message content types, then yields them back.
  2. Transform and tag the data's parts of speech using the Python Natural Language Toolkit. The nltk library is used to transform raw text into sentences, and then words, and then finally into pairs of words tagged with their parts of speech, such as verb, noun, etc. This information is then used to create a three-word phrase consisting of a verb on each side of the word "to"; when this pattern is matched, the three-word phrase is yielded back along with the integer 1, so it can be summarized later (Listing 2).
    Listing 2. Tagging and transforming
    import nltk
    
    def transform(messages):
        """Transforms data by tokenizing and tagging parts of speech"""
        for message in messages:
            sentences = nltk.sent_tokenize(str(message))
            sentences = [nltk.word_tokenize(sent) for sent in sentences]
            sentences = [nltk.pos_tag(sent) for sent in sentences]
            yield sentences
    
    def three_letter_phrase(messages):
        """Yields a three word phrase with TO"""
        for message in messages:
            for sentence in message:
                for (w1,t1), (w2,t2), (w3,t3) in nltk.trigrams(sentence):
                    if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):
                        yield ((w1,w2,w3), 1)
  3. Summarize the data. A series of functions uses the MapReduce style to extract, partition, and then summarize the occurrence of these three-word phrases (Listing 3). Note that this is written in a completely sequential fashion, although with a bit of effort it could be converted to be more parallel. For more detailed information on MapReduce, see Resources.
    Listing 3. Summarize the data using MapReduce
    from collections import defaultdict
    from multiprocessing import Pool
    
    def mapper():
        messages = connect_inbox()
        text_messages = get_plaintext(messages)
        transformed = transform(text_messages)
        for item,count in three_letter_phrase(transformed):
            yield item, count
    
    def phrase_partition(phrases):
        partitioned_data = defaultdict(list)
        for phrase, count in phrases:
            partitioned_data[phrase].append(count)
        return partitioned_data.items()
    
    def reducer(phrase_key_val):
        phrase, count = phrase_key_val
        return [phrase, sum(count)]
    
    def start_mr(mapper_func, reducer_func, processes=1):
        pool = Pool(processes)
        map_output = mapper_func()
        partitioned_data = phrase_partition(map_output)
        reduced_output = pool.map(reducer_func, partitioned_data)
        return reduced_output
  4. Sort the results and then use the Google Chart API to graph the results into a pie chart (Listing 4).
    Listing 4. Visualizing the results
    from operator import itemgetter
    
    page_template = """
    <html>
    <head>
    <!--Load the AJAX API-->
    <script type="text/javascript" src="https://www.google.com/jsapi"></script>
    <script type="text/javascript">
    
          // Load the Visualization API and the piechart package.
          google.load('visualization', '1', {'packages':['corechart']});
    
          // Set a callback to run when the Google Visualization API is loaded.
          google.setOnLoadCallback(drawChart);
    
          // Callback that creates and populates a data table, 
          // instantiates the pie chart, passes in the data and
          // draws it.
          function drawChart() {
    
          // Create our data table.
            var data = new google.visualization.DataTable();
            data.addColumn('string', '3 Word "To" Phrase');
            data.addColumn('number', 'Occurrences in Inbox');
            data.addRows(%s
    
            );
    
            // Instantiate and draw our chart, passing in some options.
            var chart = 
              new google.visualization.PieChart(document.getElementById('chart_div'));
            chart.draw(data, {width: 1000, height: 700, is3D: true, title:
              'Customer Service Email Phrases'});
          }
    </script>
    </head>
    
    <body>
    <!--Div that will hold the pie chart-->
    <div id="chart_div"></div>
    </body>
    </html>
    
    
    """
    
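    def print_report(sort_list, num=25):
        """Fill the page template with the top phrases and print the page"""
        results = []
        for items in sort_list[0:num]:
            phrase = " ".join(items[0])
            result = [phrase, items[1]]
            results.append(result)
        # The list's repr, such as [['want to know', 3]], doubles as a
        # JavaScript array literal
        page = page_template % results
        print(page)
    
    def main():
        phrase_stats = start_mr(mapper, reducer)
        sorted_phrase_stats = sorted(phrase_stats, key=itemgetter(1), reverse=True)
        print_report(sorted_phrase_stats)
    
    if __name__ == "__main__":
        main()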

    This last visualization step is easier than it looks because the results of the report are simply a string substituted into an HTML and JavaScript template. As you can see in Figure 2, the results get neatly displayed in a pie chart.

    Figure 2. The results in a pie chart
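If you want to experiment with the phrase matching and summarizing logic before connecting to a mail server, a stripped-down sketch like the one below may help. It is not the article's downloadable code: the hand-tagged sample sentences, the helper names (trigrams, verb_to_verb_phrases, summarize), and the use of json.dumps to produce the chart rows are all assumptions made for illustration, and the local trigrams helper merely stands in for nltk.trigrams so that the example runs without NLTK installed.

```python
import json
from collections import defaultdict

# Hand-tagged sentences standing in for the tagged output of the transform
# step. The (word, tag) pairs use Penn Treebank-style tags and are made up.
SAMPLE_MESSAGES = [
    [[("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("cancel", "VB"), ("today", "NN")]],
    [[("We", "PRP"), ("want", "VBP"), ("to", "TO"), ("cancel", "VB")],
     [("Please", "UH"), ("try", "VB"), ("to", "TO"), ("respond", "VB")]],
]


def trigrams(items):
    """Yield consecutive triples from a list (a stand-in for nltk.trigrams)."""
    return zip(items, items[1:], items[2:])


def verb_to_verb_phrases(messages):
    """Yield ((w1, w2, w3), 1) for every verb, TO, verb trigram."""
    for message in messages:
        for sentence in message:
            for (w1, t1), (w2, t2), (w3, t3) in trigrams(sentence):
                if t1.startswith("V") and t2 == "TO" and t3.startswith("V"):
                    yield (w1, w2, w3), 1


def summarize(pairs):
    """Partition the (phrase, 1) pairs by phrase, then sum each partition."""
    partitioned = defaultdict(list)
    for phrase, count in pairs:
        partitioned[phrase].append(count)
    totals = [(" ".join(phrase), sum(counts))
              for phrase, counts in partitioned.items()]
    return sorted(totals, key=lambda item: item[1], reverse=True)


rows = summarize(verb_to_verb_phrases(SAMPLE_MESSAGES))

# A JSON array is also a valid JavaScript array literal, so this string could
# be passed to a call such as data.addRows(...) in a chart page.
rows_as_javascript = json.dumps([list(row) for row in rows])
print(rows_as_javascript)
```

Producing the rows with json.dumps is one way to avoid depending on how Python happens to print a list; whether that matters depends on the characters that show up in your own mail.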

In conclusion

In this article, I've covered some of the theory behind quantitative analysis and business analytics, then provided a proof-of-concept code example. You can easily convert this example into a customer mood indicator for a business that provides customer support by analyzing the "tone" and "mood" of conversations. With a bit more work, it can be plugged into some other engine of logic which can be chained into another engine, and so on.

Analyzing data in the cloud is one unstoppable future direction of the Internet; as businesses increasingly turn to cloud solutions, the need to write or buy business analytics software will only increase. Hopefully, this article has given you some ideas of what you can develop for your business, either through creation or purchase.


Download

Description | Name | Size
Sample Python code for this article | emailbi.zip | 2KB

Resources

Learn

Get products and technologies

  • See the product images available on the IBM Smart Business Development and Test on the IBM Cloud.
