Modified on by fstein
GUEST BLOG BY MICHELLE HUCHETTE:
My name is Michelle Huchette and I am a rising fourth year at the University of Virginia studying Computer Science and Statistics. This summer I was fortunate enough to be a part of the IBM Summit Program as a Technical Sales Intern. In this role I was able to experience what it is like to be an IBM seller by attending customer events and working on various tasks and projects over the course of 11 weeks. A few weeks ago I was challenged with creating a Proof of Technology lab that would interest customers in the field of machine learning. This is a brief overview of the creation and utilization of the model I created to diagnose breast cancer tumors.
The data set used for the lab comes from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) and contains information about breast cancer tumors that helps predict whether each tumor is malignant or benign. It contains 10 measurements of each cell nucleus, captured from images of cell nuclei gathered through a fine needle aspiration (FNA) procedure. The average, standard error, and extreme values across all nuclei in the tumor sample were calculated for each of the following features:
- Texture (standard deviation of gray-scale values)
- Smoothness (local variation in radius length)
- Compactness (perimeter²/area – 1)
- Concavity (severity of concave portions of the contour)
- Concave points (number of concave portions of the contour)
- Fractal dimension (coastline approximation – 1)
After finding a data set, we could create a machine learning model to diagnose breast cancer tumors. To do so, we set up a Watson Studio account on the IBM Cloud platform (https://console.bluemix.net/registration/). Within Watson Studio (https://console.bluemix.net/catalog/services/watson-studio) we created a Jupyter Notebook, which we used to write Python code to work with the data set, create a model, and make predictions, all using Apache Spark as the analytics engine.
The first step in creating the machine learning model was determining which type of model would best fit the data. Spark supports several types of models; the most common are Naïve Bayes, Decision Trees, Random Forests, and Regression Models. Because Naïve Bayes requires a strong independence assumption between the features, that type of model was ruled out. Ultimately, a Logistic Regression Model was chosen since it is often used to model a binary categorical outcome (exactly what we’re dealing with when trying to diagnose a tumor as malignant or benign) and it is good at measuring the relationship between the labels and features.
To start out, the logistic regression model was given the default parameters so that an initial model could be created and improved upon if needed. Once the model was defined, a pipeline was set up, which contains a sequence of stages that run in a specific order. Within a pipeline each stage is either a transformer, which converts one DataFrame into another, or an estimator, whose fit() method trains a model. There are many different stages that you can include in your pipeline, including tokenizers, hashes, normalizers, etc. For this data set, and for the sake of creating an easy-to-follow lab, our pipeline started by using StringIndexer to turn the label (diagnosis) into a form that SparkML could use, encoding the input column as a column of indices ordered by label frequency. Then a VectorAssembler combined the list of feature columns into a single vector column to be used in training the model. A Normalizer was added to scale each vector to unit norm, which can improve the behavior of the algorithm. Lastly, our defined logistic regression instance was added, and IndexToString was used to convert the results of the model back into human-readable form.
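To make the pipeline stages more concrete, here is a minimal sketch in plain Python that imitates what the StringIndexer, VectorAssembler, Normalizer, and IndexToString stages do to the data. It is not the Spark API and not the lab's actual code; the function names, column names, and the three sample rows are made up for illustration.

```python
from collections import Counter


def string_index(labels):
    """Map each label to an index ordered by frequency (most frequent label gets 0.0)."""
    counts = Counter(labels)
    # Sort by descending frequency, breaking ties alphabetically
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    mapping = {lab: float(i) for i, lab in enumerate(ordered)}
    return [mapping[lab] for lab in labels], mapping


def assemble(rows, columns):
    """Combine the chosen columns of each row into a single feature vector."""
    return [[float(row[c]) for c in columns] for row in rows]


def normalize(vector, p=2.0):
    """Scale a vector so that its p-norm equals 1 (zero vectors are returned unchanged)."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    return [x / norm for x in vector] if norm > 0 else list(vector)


def index_to_string(indices, mapping):
    """Convert numeric label indices back to the original label strings."""
    reverse = {idx: lab for lab, idx in mapping.items()}
    return [reverse[i] for i in indices]


# Made-up sample rows; the real data set has many more columns and rows
rows = [
    {"diagnosis": "B", "texture_mean": 3.0, "smoothness_mean": 4.0},
    {"diagnosis": "M", "texture_mean": 6.0, "smoothness_mean": 8.0},
    {"diagnosis": "B", "texture_mean": 1.0, "smoothness_mean": 0.0},
]

labels, mapping = string_index([r["diagnosis"] for r in rows])
vectors = assemble(rows, ["texture_mean", "smoothness_mean"])
features = [normalize(v, p=2.0) for v in vectors]

print(labels)                            # numeric labels fed to the model
print(features)                          # unit-norm feature vectors
print(index_to_string(labels, mapping))  # back to "B" / "M"
```

In the real lab these steps are Spark pipeline stages, so the same transformations are applied consistently to the training data, the test data, and any new tumors to be diagnosed.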
Following the definition of the pipeline, the logistic regression model and pipeline could be used to train and test the model. The data set was split 70/30 into training and test data sets, respectively. The training data set was then used to fit the pipeline and train the model, and the test data set was used to make predictions. The quality of the model was measured using the area under the Receiver Operating Characteristic (ROC) curve for binary classifiers. The curve is produced by plotting the true positive rate (recall/probability of detection) against the false positive rate (fall-out/probability of a false alarm) at various threshold levels, and the reported value is the area under that curve. A value close to 1 suggests that the model performs very well, whereas a value close to 0.5 is about as good as flipping a coin. The logistic regression model created in the steps described above resulted in a value of 0.989, meaning it was able to predict the diagnoses of tumors very well.
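For readers who want to see how the split and the ROC value work, here is a small plain-Python sketch. It is an illustration only, not the Spark evaluator used in the lab. It relies on a well-known equivalence: the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count as half). The function names, the seed, and the sample scores are assumptions.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and split them into training and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def roc_auc(labels, scores):
    """Area under the ROC curve, computed by comparing every positive/negative pair."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


train, test = train_test_split(range(10))
print(len(train), len(test))

# Made-up labels (1 = malignant, 0 = benign) and model scores
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

With the four made-up cases above, three of the four positive/negative pairs are ranked correctly, so the value is 0.75. A model that scores every case the same gets 0.5, which is the coin flip mentioned earlier.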
Even though the model had already proven able to diagnose tumors accurately, it could still be improved. Hyperparameter tuning uses model selection tools that test different parameter values for the pipeline and find the best-performing values. Spark offers two main model selection tools: CrossValidator and TrainValidationSplit. For this project we used a CrossValidator because, even though it can be more expensive for larger data sets, it is more reliable when the data set isn’t sufficiently large, since it evaluates each parameter combination k times rather than just once.
A CrossValidator first splits the data set into “folds”, which are used as separate training and test data set pairs. We set the number of folds for this project to 10, so the CrossValidator generated 10 training/test data set pairs, all of which are used to test the parameters. For each parameter combination, the performance across the 10 pairs is averaged and compared with the averages for the other parameter values tested. We defined a paramGrid which stated the values to be used for the parameters within the pipeline. For this pipeline we could define values for maxIter, elasticNetParam, and regParam, which are parameters of the logistic regression model, or for the normalizer parameter of the pipeline. Our paramGrid for this lab included values for elasticNetParam, which must be between 0 and 1. This is an important parameter because it can make the model closer to a Lasso regression model (coefficients that are not relevant are set to 0) with a value close to 1, or closer to a Ridge regression model (the impact of irrelevant coefficients is minimized without setting them to 0) with a value close to 0. Because of this, the elasticNetParam values in the grid were set to 0, 0.5, and 1 to see which type of regression model would be best for the data. The second parameter defined in the paramGrid belongs to the normalizer from the pipeline. The normalizer rescales each feature vector, and the value set for its parameter represents the p-norm used for normalization. The default value of 2 was previously used, so within the paramGrid the values to be tested were set to 1 and 3.
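The mechanics of the grid and the folds can be sketched in plain Python as follows. This is only an illustration of the idea, not Spark's CrossValidator: the folds here are assigned round-robin rather than randomly, the helper names are invented, and placeholder_score is a hypothetical stand-in for "fit the pipeline on the training fold and score it on the held-out fold". The penalty function follows the elastic net form described in the Spark documentation, where elasticNetParam blends the L1 (Lasso) and L2 (Ridge) terms.

```python
from itertools import product


def build_param_grid(**options):
    """Return every combination of the supplied parameter values as a list of dicts."""
    names = sorted(options)
    return [dict(zip(names, combo)) for combo in product(*(options[n] for n in names))]


def k_fold_indices(n_rows, k):
    """Yield k (train, validation) index pairs; each row is validated exactly once."""
    folds = [list(range(start, n_rows, k)) for start in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(rows, grid, k, fit_and_score):
    """Average fit_and_score over k folds for each candidate and keep the best one."""
    best_params, best_score = None, float("-inf")
    for params in grid:
        scores = []
        for train_idx, val_idx in k_fold_indices(len(rows), k):
            train = [rows[i] for i in train_idx]
            validation = [rows[i] for i in val_idx]
            scores.append(fit_and_score(train, validation, params))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Blend of L1 (value 1, Lasso) and L2 (value 0, Ridge) regularization penalties."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1 + (1.0 - elastic_net_param) * 0.5 * l2)


def placeholder_score(train, validation, params):
    # Hypothetical scorer used only to exercise the loop; a real one would
    # train the pipeline on `train` and return its ROC value on `validation`.
    return 1.0 - abs(params["elasticNetParam"] - 0.5) - 0.01 * params["normalizer_p"]


# Same grid as in the lab: 3 elasticNetParam values x 2 normalizer p values
grid = build_param_grid(elasticNetParam=[0.0, 0.5, 1.0], normalizer_p=[1.0, 3.0])
best_params, best_score = cross_validate(list(range(100)), grid, 10, placeholder_score)
print(len(grid), best_params)
```

With 6 parameter combinations and 10 folds, the pipeline is fit 60 times, which is why cross-validation gets expensive on larger data sets.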
After using a CrossValidator to find the best paramMaps, that model was trained using the testing data set and it was evaluated. The model improved 0.058% due to hyperparameter tuning, meaning the newly defined model was 99.5% accurate.
With an almost perfect predictive model defined, the last step was to grab the undiagnosed tumors from the original data set and use the model to predict their diagnoses with a high level of confidence.
The creation of this highly accurate model shows the power of Machine Learning in bettering the lives of people worldwide. It allows for the augmentation of breast cancer diagnosing and helps ensure that doctors see the patients in dire need of medical attention. Models such as these can help detect cancer earlier, and in more individuals, than doctors can alone. Machine Learning has already started to be implemented in oncology to diagnose tumors, in pathology to analyze bodily fluids, and in detecting rare genetic diseases using facial recognition and deep learning. Machine Learning serves many purposes, from chatbots to augmenting the medical diagnostic process, and with the continued advancements in technology and AI its applications are sure to expand even more.
Do you have Super Powers? Would you like to have Super Powers? I was recently invited to give a talk about AI at the Escape Velocity Science Fiction Conference (https://escapevelocity.events/) put on by the Museum of Science Fiction in Washington, DC. I focused the talk on how our advances in AI technology and augmenting human intelligence (Intelligence Augmentation = IA) are starting to provide humans with Super Powers, once only the realm of sci-fi writers. I’ve seen a lot of technology come and go, and what we are now developing has the most potential of any of the technologies I’ve experienced to help people do more, do it faster, and do things we couldn’t do before. IA is going to have more impact on individuals, our professions, and society than all the previous advancements in computers to date.
John Campbell, the famous editor of Astounding Science Fiction, who published the likes of Asimov and Arthur C. Clarke, pushed his writers to create heroes and foes that had cognitive abilities better than those of humans, or that had different attributes. So too, the comic books that came out featured heroes with unique powers, some cognitive and some around endurance and power. As you know, this vision of achieving super human capabilities has existed for most of recorded history. And it has been a dream of computer scientists for as long as computer scientists have existed too.
We haven’t made a lot of progress in the non-fiction world of creating people that are different: evolution is a VERY slow process. And while the world’s knowledge keeps increasing, people think at pretty much the same speed, with the same memory limitations, as before. Therefore, my talk focused on how we can use technology and data to help us achieve super human capabilities.
Just like we’ve created assembly lines full of machines for our factories, we are starting to create tools to help those of us that are called Knowledge Workers to do our jobs more efficiently and create results that haven’t been possible in the past. Technology will redefine our professions and our jobs within our fields. These changes won’t just provide marginal improvements, they will provide significant new capabilities that will provide higher productivity to our employers, enhance our own well-being, and solve significant problems facing society.
We will know what customers are looking at in every store in the world, what they pass over and what they buy. Some might call that Omni-presence. We will be able to predict who will click on which ad on the web, who will buy which product, and who will get which diseases and which drugs will work for which individual. Is that Precognition or Clairvoyance? We’ll be able to instantly recognize a face in a crowd of thousands and see through objects. Our cars will help us to see black ice on the road and around corners. Our super-hearing will not only hear from a distance, but will allow detection of emotional stress and mental health issues that others might be facing – probably before they themselves realize it. Even better than Superman!
In the government space, the super power of Super Vision will enable us to spot terrorists' pictures among the millions of videos and images collected, as well as detect illegal fishing and logging operations. Precognition will allow us to predict the outbreak of a potential pandemic early enough to mount a robust public health defense, and predict weather events in time to evacuate and prepare emergency operations. In the cybersecurity world, we'll have the super power to detect threat patterns quickly, predict likely fast-fluxing techniques used by the intruders, and provide rapid advice for the response teams.
These Super Powers come from taking all the data the world is now generating – which mostly is going to waste – and analyzing it to find patterns and answers to questions that we couldn’t answer in the past. We’re now creating almost 10 Zettabytes per year – and the amount is increasing exponentially. Analyzing all that data will give us these Super Powers, and as that data grows, so too will our Super Powers. Analyzing all this data requires very sophisticated technology, which IBM and others in the I/T industry are intensely developing. We will do this using Machine Learning, natural language processing (NLP), and reasoning.
My goal in the talk was not to talk about the technology but instead to show how far we have come in creating super human capabilities. I talked about some of the applications of IA – Intelligence Augmentation – to businesses, professions, and society. See the slides for some of the examples I used: https://www.slideshare.net/frankibm/getting-your-super-powers-with-watson-and-ia
I concluded with some discussion on how humans and machines can complement each other so that we can accomplish more together. It is my belief that we will need this collaboration to solve some of society’s hard problems such as climate change, supporting all the people that will soon be on planet earth, and even protecting us from incoming asteroids. The final slide shows 2 famous quotes regarding the value of the combination of people and machines:
- “The hope is that, in not too many years, human brains and computing machines will be coupled together very tightly, and that the resulting partnership will think as no human brain has ever thought and process data in a way not approached by the information-handling machines we know today.” - JCR Licklider, 1960, Professor at MIT
- “The computer is incredibly fast, accurate, and stupid. Man is unbelievably slow, inaccurate, and brilliant. The marriage of the two is a force beyond calculation.” – Leo Cherne, Presidential Advisor
Write to me at: firstname.lastname@example.org
My work this year has taken me from Big Data and Analytics towards Cognitive Computing and what IBM is now dubbing Cognitive Businesses (or Cognitive Government in our case). Cognitive businesses leverage cognitive computing technology (think Watson) to enhance, scale, and accelerate the expertise of their personnel. Below is the summary of the first part of a symposium I co-chaired last week. I'm happy to answer any questions you may have.
The AAAI Fall Symposia on November 12-14 included tracks on AI for Human-Robot Interaction, Cognitive Assistance, Deceptive and Counter-Deceptive Machines, Embedded ML, Self Confidence in Autonomous Systems, and Sequential Decision Making for Intelligent Agents. This post will provide my general impressions of the Cognitive Assistance symposium.
Jerome Pesenti, IBM VP of Watson Core Development, provided the 1st day keynote. He started with the great quote from Fred Jelinek (Cornell/IBM/JHU) that “Every time I fire a linguist, the performance of the speech recognizer goes up.” He then talked about how deep learning is enabling recognition systems that approach or surpass human performance. This led to a lively discussion with the audience on the universality of learning algorithms and whether the machines were learning in the same manner that humans learn something (no). Jerome finished with some applications of Watson including the Oncology Advisor, citizen support (e.g., tax questions), and security (finding relationships between data).
The rest of the morning was filled with examples of cognitive assistance: support for legal tasks such as filing a protective order (Karl Branting), human-computer co-creativity in the classroom (Ashok Goel), and a tool to help SMEs define their vocabulary to find the most relevant content on the web (Elham Khabiri).
Much of the symposium had lunch together, and a lively discussion ensued on cognitive assistance. One topic that I found interesting was ultimate chess, where human-machine teams compete. While these teams have beaten computer-only teams in the past, Murray Campbell noted that the advancements in chess playing computers are decreasing the value-add of humans to the team.
The afternoon session of Day 1 started with 2 interesting talks on cognitive assistance for helping those with cognitive disabilities. Madelaine Sayko described Cog-Aid, which would include a cognitive assessment, a recommender system (based on the assessment), and an intelligent task status manager for starters. Then Daniel Sontag described the Kognit technology program, which includes tracking dementia patients’ behavior using eye tracking and mixed reality displays to help the patient perform activities of daily living. Kevin Burns presented a sense-making approach that could be used by an intelligence analyst to help understand and define the Prior and Posterior probability calculations as new evidence is added. This could eventually be embodied in a cognitive assistant. Next came a presentation by Keith Willett on capturing cybersecurity operational patterns to facilitate knowledge chaining.
The final session of the day was a panel discussion of workforce issues associated with cognitive assistants led by Murray Campbell. Erin Burke of Fordham University Law School talked about how legal education must transition and that she is working at the intersection of law, big data, and cognitive computing. Jim Spohrer, Director of IBM’s University Programs, provided some predictions including that by 2035 everyone will be a manager and will have at least one Cognitive Assistant working for them. A lively discussion ensued with the audience about our forthcoming relationship with Cogs including whether we could trust them, unintended consequences, whether we can build common sense into a Cog, and whether our brains will atrophy as we depend on Cogs.
I’ll cover Day 2 in the next blog post.
In medieval times, Alchemists hoped to convert base metals
into the noble metal gold through the use of a Philosopher's Stone.
Today, in the field of information science, we talk about
Information Alchemy, converting data into information and then into
knowledge. Some people even add a 4th
stage of converting knowledge into wisdom[i], but
that will be for another blog post.
Data is defined as the raw characters or numbers, whereas information is
defined as the processing of that data into various relationships so they have
some meaning. Dr. Eisenberg at the University of Washington describes knowledge as the
“collected, combined, organized, processed information for a purpose.” Over time, it is thought that accumulated and
refined knowledge leads to Wisdom.
This year, the total of all digital data created is forecast
to reach close to 4 Zettabytes, or 4 x 10^21 bytes, according to IDC[ii]. This is nearly four times the 2010 volume and
it is growing rapidly. All of this data
should let us make a smarter and better planet.
However, today we’re drowning in all this data because we don’t have the
time as individuals to process all this information, and we don’t have computer
systems that can turn this data into insight.
But soon that will change.
We are entering a new era in computing which IBM is calling Cognitive
Computing. The first of these systems is
the IBM Watson system which debuted on the Jeopardy! Show 2 years ago. Traditional computing systems have done a
great job with handling data, including storing it and manipulating it into
information. So now we have lots of
financial, inventory, customer, and all sorts of other, mostly numerical, information.
We also have lots of unstructured information such as text,
audio, graphics, and video. We used to say that 80% of the new bytes being
created today were associated with unstructured data, but that number is
probably closer to 90% given all the video being created these days. This text and multimedia information is
human-readable – in fact, it is designed by humans for humans to understand but
is not easily understandable by today’s computers.
And that is a considerable problem. Today, the transformation of information into
knowledge is primarily done in people’s heads.
Not just by scientists, engineers, or financial analysts, but by
everyone who reads an article or watches a video. The time available for people (some would
say skilled people) to analyze information to gain insights (knowledge) is the
limiting factor in the production of new knowledge today. To say this another way, we are now
information-rich, but knowledge-poor.
The goal of the cognitive computing efforts is to remove
this limitation by designing computer systems that can take this abundance of
information, much of it in human readable/viewable formats, and convert it into
knowledge. For example, in the Jeopardy!
IBM Challenge, the Watson computer system analyzed its deep information stores
to find the answer that best answered the clue and the category. It did this feat by utilizing many different
algorithms to attempt to “understand” the text information and a machine
learning (artificial intelligence) scoring system to select the best response.
In a more significant effort, IBM is working with Memorial
Sloan-Kettering and WellPoint (a major BC/BS licensee) to use cognitive
computing technology to assist doctors by helping to identify individualized
treatment options for patients with cancer. It is, in effect, creating knowledge of the
appropriate treatment options from information about the patient’s condition
and medical history, and information from clinical trials and best practices on
While the field of cognitive computing is just beginning, I believe
over the next several years, we will learn how to perform “Information Alchemy”
and we’ll see how this newly created knowledge can benefit our organizations
and our lives.
As the quintessential information-based organizations, government agencies may have the greatest need for "Information Alchemy." Do you see this need? Do you see opportunities for Cognitive Computing at your agency?
Director of IBM’s Analytics
[i] Eisenberg, Mike,
“Information Alchemy: Transforming Data and Information into Knowledge and
Wisdom”, March 30, 2012, http://faculty.washington.edu/mbe/Eisenberg_Intro_to_Information%20Alchemy.pdf
Derechos, Droughts, Hottest July on Record, Shattered
High Temp Records, Greenland Ice Sheet Melts. Just what is going on with the weather these
days? Is this weather really abnormal or
does it just seem to be that way? Is this part of a trend? Does global climate change mean we’ll have
more of these extreme weather events? Being
a data and analytics person, I started looking to see what data analysis had
been done on this subject.
The US Climate Extremes Index[i] provides
a measure to track the occurrence of extreme data (although it doesn’t take
into account Derechos and other severe wind events). The trend of the index (smoothed) has been on
the rise since 1970 and now is at an all time high, as shown below. The Index
was at a record high 46% during the January-July period, over twice the average
value, and surpassing the previous record large CEI of 42 percent which
occurred in 1934. Extremes in warm
daytime temperatures (83 percent) and warm nighttime temperatures (74 percent)
both covered record large areas of the nation, contributing to the record high
year-to-date USCEI value.
This index is
compiled by combining measurements throughout the country (1,218-station US Historical Climatology Network)
that show the percentage of the country impacted by extreme weather in terms of
maximum temperatures much above or below normal, minimum temperatures
above/below normal, percentage of country in severe drought/severe moisture
surplus, percentage of the country with a much greater than normal proportion
of precipitation derived from extreme 1 day events, and the percentage of the
country with a much greater than normal number of days with
The U.S. Global
Change Research Program in 2009 published a study which documented the changing
climate and its impact on the United States. The
study uses 3 standard forms of data analysis: 1) reports on observations, 2)
predictions based on the observed trends, and 3) modeling to better predict future
climate changes based on various assumptions about the amount of heat-trapping
gases in the atmosphere. While the first
two types are based on large quantities of collected data, they use only U.S.
observations. The modeling, however,
must be done on a global basis which substantially increases the amount of data
that must be crunched.
Here are some of the findings as they relate to extreme
Overall Warming of the Climate
Temperatures, on average, in the 1993-2008 period are 1-2ºF
higher than in the 1961-79 baseline. By
the end of the century, the average U.S. temperature is projected to
increase by approximately 7-11ºF under a high emissions model and by
approximately 4-6.5ºF under a lower emissions scenario. The temperature observations show that
warm days and warm nights have become warmer and more frequent, while
cold days and cold nights have become warmer and less frequent in most areas.
More intense, more frequent, and longer-lasting heat waves
In the past several decades, there has been an increasing
trend in high-humidity heat waves, characterized by extremely high nighttime
temperatures. Parts of the South that
currently have about 60 days per year with temperatures over 90ºF are projected
to experience 150 or more days a year above 90ºF under a higher emissions
scenario. In addition to occurring more
frequently, at the end of this century these very hot days are projected to be
about 10ºF hotter than they are today.
Increased extremes of summer dryness and winter wetness with a generally
greater risk of droughts and floods.
Trends in drought have strong regional variations. Over the past 50 years, with increasing
temperatures, the frequency of drought in many parts of the West and Southeast
has increased significantly. Models show
that the Southwest, in particular, is expected to experience increasing drought
as the dry zone just outside of the tropics expands northward with global warming.
Precipitation coming in heavier downpours, with longer dry periods in
While average precipitation over
the nation as a whole increased by about 7% over the past century, the amount
of precipitation falling in the heaviest 1% of rain events increased nearly
20%. One of the outputs of the climate
modeling is to project the probability of certain events. For example, heavy downpours that are now a “1
in 20 year occurrence” are projected to occur about “once every 4-15 years” by
the end of the century. These heavy downpours are expected to be
10-25% heavier by the end of the century than they are now. This will likely cause more flooding events
(flooding depends both upon the weather and the susceptibility of the area to flooding).
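The "1 in 20 year" language can be turned into yearly odds with simple arithmetic. The sketch below is my own illustration, not part of the climate report; it assumes each year is independent of the others, and the function names are invented.

```python
def annual_probability(return_period_years):
    """Chance of the event in any single year for a given return period."""
    return 1.0 / return_period_years


def probability_at_least_once(return_period_years, horizon_years):
    """Chance of seeing the event at least once over a horizon, assuming independent years."""
    p = annual_probability(return_period_years)
    return 1.0 - (1.0 - p) ** horizon_years


# Today's "1 in 20 year" downpour versus the projected "once every 4-15 years"
for period in (20, 15, 4):
    print(period, annual_probability(period), probability_at_least_once(period, 10))
```

A 1-in-20-year downpour has a 5% chance of occurring in any given year. At once every 15 years that rises to about 6.7%, and at once every 4 years it reaches 25%.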
More intense but fewer severe storms
Reports of severe weather such as
tornadoes and severe thunderstorms have increased during the past 50 years.
However the climate study indicates that much of this may be due to better
monitoring technologies, changes in population areas, and increasing public
awareness. Climate models do project an increase in the frequency of
environmental conditions favorable to severe thunderstorms. But the report notes, “the inability to
adequately model the small-scale conditions involved in thunderstorm
development remains a limiting factor in projecting the future character of
severe thunderstorms and other small-scale weather phenomena.[iii]” Advances in modeling and big data analytics,
as well as improved monitoring networks are likely to reduce this limitation in
The June Derecho that hit the Washington metropolitan
area shows an example of the current state of the art in forecasting a severe
storm. The Storm Prediction Center of
NOAA was able to provide approximately 4 hours advance warning of the
storm. Longer term predictions would
require additional data about the atmospheric instability that propelled the
Derecho from Iowa to the Washington
Metro area, as well as better real time modeling.
Shift of storm tracks towards the poles
Cold season storm tracks have been
shifting northward over the last 50 years, with a decrease in the frequency of
storms in mid-latitude areas. The
northward shift is projected to continue, and strong cold season storms are
likely to become stronger and more frequent, with greater wind speeds and more
extreme wave heights.
The climate changes will have an
interesting effect on the so called “lake-effect”. Over the past 50 years, there is a record of
increased lake-effect snowfall near the Great Lakes. As the climate has warmed there is less ice
on the Great Lakes which has allowed greater
evaporation from the surface resulting in heavier snowstorms. Eventually, the temperatures are expected to
rise sufficiently that much of the precipitation will end up falling as rain,
reducing the snow totals.
While trending of individual elements such as temperatures
is useful, accurate predictions require consideration of the interaction
between the climate elements. For
example, there is a mutual enhancement effect between droughts and heat
waves. Heat waves enhance soil drying,
and drier soil heats the air above more since no energy goes into evaporating
the soil moisture. Big data modeling can
show the results of this escalating cycle of warming on the future climate.
The New Normal
So it seems that all this abnormal weather we are seeing
will become the new normal. Forewarned is forearmed.
Analytics Solution Center, Washington, DC
[ii] Global Climate Change
Impacts in the United States,
Thomas R. Karl, Jerry M. Melillo, and Thomas C. Peterson, (eds.) Cambridge University Press, 2009
On July 4th, CERN scientists announced that they
observed a particle that strongly resembles the Higgs boson, a critical element
of the standard model of particle physics.
This particle is thought to be responsible for the characteristic of
mass, which gives objects weight when combined with gravity.
Detection of the Higgs Boson would not have been possible
without the last decade’s advances in processing big data. Joe Incandela, CMS Spokesman at CERN,
explained that if every collision that they scanned was a sand grain, these
sand grains would have filled up an Olympic sized pool over the last 2
years. They had to find the several
dozen or so grains of sand that exhibited characteristics consistent with the Higgs boson.
In addition to developing the Large Hadron Collider, the
CERN teams also developed a data strategy to deal with the data from the
hundreds of millions of particle collisions occurring each second. The sensors record the raw data on billions
of events occurring in the proton collider. These readings are then reconstructed
to show the energy and directions of many particle traces. The data goes through 2 stages of filtering
to reduce the data on 40 million collisions/sec down to 10 million interesting
ones per second, and then to 100 or 200 collisions that are studied in detail.
According to Rolf-Dieter Heuer, director general at CERN, “The
computing power and network is a very important part of the research.” Over
15 Petabytes (15 million Gigabytes) are stored each year. This is distributed through the Worldwide
Large Hadron Collider Computing Grid (WLCG) to each of 11 major Tier 1 centers
around the world, and from there to research centers and individual
scientists. In the U.S., the Open
Science Grid, supported by NSF and DOE, provides much of the compute and
storage power for this work. The
scientists use Monte Carlo simulations for
generating and propagating the physics interactions of the elementary particles
passing through the collider to determine which ones correspond to the
hypothesized behavior of the Higgs Boson.
What they found was a never-before-seen elementary particle
that seems to fit the behavior of the Higgs boson and is very heavy,
approximately 133 proton masses.  Further
data analysis is now needed to ascertain its spin, decay modes, and other
properties.
Think the amount of data generated by the Large Hadron
Collider is huge?  The forthcoming Square
Kilometre Array radio telescope is expected to generate hundreds of Petabytes of
data per day.  More on that in a future
blog.
In the 1980’s, John Naisbitt wrote, “We have for the first
time an economy based on a key resource [information] that is not only
renewable, but self-generating. Running
out of it is not a problem, but drowning in it is.[i]” Little did Naisbitt know how much information
we’d be creating 30 years later. By some
estimates we are generating over 1 zettabyte (1x10^21 bytes) per year[ii].  How do you avoid drowning in all that data,
and gain insights? That is the realm of
Big Data Solutions.
The IBM Analytics Solution Center recently ran a
seminar on Big Data. We started off
talking about the ‘big data conundrum.’
The volume of data is growing so rapidly that the fraction of data that
an enterprise can analyze is decreasing.
Because of this gap, we’re getting ‘dumber’ about our organization and
job over time. This is driving the need
for improved analytics and platform technology that can help us to process this
large volume of data.
What do customers want to do with big data? Popular requests we’ve heard include: I/T log
analytics, RFID tracking and analytics, fraud detection and modeling, risk
modeling, a 360° view of a
person/place/thing, call center record analysis, and fusion of multiple
unstructured objects (e.g., pictures, audio).
Since we now collect so much data, the possibilities are only limited by
your imagination and our ability to extract insights from the data.
In order to process these large volumes of data, special
systems and applications are being deployed.
Many of these are based on the Apache Hadoop middleware which supports a
distributed file system and processing environment for scalability,
flexibility, and fault tolerance. IBM’s
big data platform includes offerings based on Apache’s Hadoop with enhancements
to improve workload optimization, security, and cluster hardening. The IBM offering (BigInsights) also comes
packaged with advanced analytical capabilities for data visualization, text
analysis, and support for machine learning analytics.  One interesting item was the announcement
that the enhancements would be packaged to allow them to work with other Hadoop
distributions, such as Cloudera™ Hadoop.
Another offering discussed in the seminar was the Stream computing
offering designed to efficiently process “data in motion,” such as stock ticker
streams and social media feeds.
One of the biggest challenges given the huge volume of
information is finding the right information.
Governments, utilities, and financial companies have this problem in
particular because of the huge volumes they deal with.  A recent IBM acquisition, Vivisimo, has
developed a next-generation search engine to provide search across multiple big
data and traditional platforms. Vivisimo
provides a scalable search application framework that can perform a federated
search across many different data sources including the web, social media,
content stores, and more traditional structured database systems. One feature that may be particularly
appealing to government agencies and corporate environments is its ability to
map individual access permissions of each data item, authenticate users against
each target system and limit access to information a user would be entitled to
view if they were directly logged into the target system.
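To make the permission-aware federated search idea concrete, here is a minimal sketch. It is not Vivisimo's API; the `Document` class, the `federated_search` function, and the group names are all hypothetical, and real systems would authenticate against each target system rather than compare sets in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    source: str                # which backend the item came from
    title: str
    allowed_groups: frozenset  # groups permitted to view this item


def federated_search(query: str, sources: dict, user_groups: set) -> list:
    """Search every source, keeping only hits the user is entitled to see."""
    hits = []
    for docs in sources.values():
        for doc in docs:
            matches = query.lower() in doc.title.lower()
            # The user may see the item only if they share a group with it.
            permitted = bool(doc.allowed_groups & user_groups)
            if matches and permitted:
                hits.append(doc)
    return hits


sources = {
    "wiki": [Document("wiki", "Budget overview", frozenset({"staff"}))],
    "hr_db": [Document("hr_db", "Budget by employee", frozenset({"hr"}))],
}
results = federated_search("budget", sources, {"staff"})
```

In this sketch a user in the `staff` group sees only the wiki hit, even though both sources contain a matching title, which mirrors the behavior described above: results are limited to what the user could view if logged into each system directly.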
They offer a clever search tool that provides easy
navigation and discovery, using both structured metadata (faceted search) and
keywords that the program dynamically discovers based on analysis of
unstructured content. Vivisimo provides an agile development layer, to allow
users to quickly create applications and dashboards to discover, navigate and
The seminar also featured a customer case study of using big
data for cybersecurity mission operations.  IP traffic is growing at a 29% CAGR, and with it,
the cyber-threats the customer faces. Unfortunately, the customer’s headcount
isn’t growing, so more automated ways are needed to detect and respond to threats.  For this application, timeliness is key:
dealing with threats in real time.  To
identify potential threats, they want to be able to compare current threat and
traffic data to norms from the recent past, and similar periods in the
past. Their solution utilizes the
Netezza data warehouse appliance for near-real-time data and IBM BigInsights
for long-term storage.  The solution eliminates
as many mundane “data retrieval” tasks as possible for the analyst, and provides
the analysts with those datasets that have a high probability of being
“interesting.” In this way, the solution helps the analyst deal with the
extreme data volumes, and yet remains flexible to the changing threat
environment.
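One simple way to compare current traffic against norms from the recent past and from similar past periods is a standard-deviation test. The sketch below is an assumption about how such a check could look, not the customer's actual method; the function name, the sample numbers, and the threshold of 3 are all illustrative.

```python
from statistics import mean, stdev


def is_anomalous(current: float, baseline: list, threshold: float = 3.0) -> bool:
    """Flag a reading more than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all is unusual.
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical connection counts per minute.
recent_past = [980, 1010, 1005, 995, 1020, 990]
same_hour_last_weeks = [1000, 1015, 985, 1005]
baseline = recent_past + same_hour_last_weeks

spike_flagged = is_anomalous(1500, baseline)
normal_flagged = is_anomalous(1012, baseline)
```

Only the readings that get flagged would be handed to an analyst, which is the sense in which a solution like this surfaces the datasets most likely to be "interesting."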
Do you have an opportunity to use massive amounts of data to
accomplish a business/mission objective that couldn’t be done when we were limited
to small volumes of data? Do you have an
innovative solution? We’d like to hear
your stories about big data.
For more on the Big Data seminar, see our ASC website under past events.
[i] Naisbitt, John,
Megatrends: Ten New Directions Transforming Our Lives, NY Warner Communications
Company, 1982, pages 23-24
[ii] IDC Digital Universe
Does your government agency monitor social media for information relevant to your mission? Should it?
IBM's Analytics Solution Center recently held a seminar to explore
how agencies and companies can obtain value and insight using social
media.
Pat Fiorenza discussed how agencies can develop an ROI Model - Return
on Influence Model - for social media. Agencies use social media
analytics to help inform their decision making by gathering
information/research, and to learn what other agencies and citizens are
saying.  Interesting examples from CDC and GovLoop were provided.
Learn more here.
Ed Burek, IBM, talked about how savvy companies are now tapping into
customer-generated content, and how government agencies could do the same to
learn how taxpayers feel about government actions and messaging.  He
gave examples of how regulatory agencies could receive the unvarnished
comments from those impacted by regulations, as well as how they could
stay on top of "negative chatter."  IBM has created a framework to
derive business insight from the vast amounts of social media content that are
now being transmitted.  Called Cognos Consumer Insight, it provides real-time
information on trends and sentiment.
Rick Lawrence, IBM Manager for Machine Learning at Watson Research
Center, next talked about the leading edge of social media analytics.  He
provided examples from the research portfolio on discovering who the
key influencers are, identifying emerging topics of discussion, and
mapping the billions of tweets to concepts that we really care about.
All of the presentations are available on the ASC website under Past Events (May 10, 2012)
Does your agency care about what its constituents are saying about it
on social media? Does your agency need to have real time intelligence
on events within its mission space?  With 340 million tweets per day, 2
million blog posts, and 500 million Facebook updates, how can you find
the important information? Social Media Analytics may be an idea
whose time has come.
Analytics Solution Center
P.S. The Center for the Business of Government issued a new report on Tweeting in Government. Pat provided a good overview here.
At the end of the Super Bowl, people created 12,233 tweets per second.  And it turns out that was less than half the
number of tweets created in Japan
on December 9th, when 25,088 tweets per second were recorded about
the Castle in the Sky anime movie.
That, according to the Chinese, is nothing compared to the 32,312
messages per second sent on their Twitter-like Sina Weibo system during the
beginning of the Chinese New Year.
Within the government space, we’re no strangers to our own Big Data. Whether you’re in the DOD or NASA, the IRS or
SSA, you’ve got your own Big Data to deal with.
Last week, Forrester Research released a report that should help those in
government understand the Big Data Market.
It is called “The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012”
(February 2, 2012).  The IBM technologies evaluated were IBM InfoSphere
BigInsights (IBM’s Hadoop-based offering) and IBM Netezza Analytics.  In this
evaluation, IBM was placed in the Leaders category of the Wave and achieved the
highest possible score in both the Strategy and Market Presence segments.  In
the third segment, Current Offering, IBM received the second highest score.  You
can read the complete report here.
The report by analyst James
Kobielus states, “IBM has the deepest Hadoop platform and application portfolio.”
The IBM Analytics Solution
Center in Washington, DC
also focused on how to handle Big Data at its January 19th
seminar. The seminar covered various
aspects of Big Data including data-in-motion processing software, Hadoop
software, SONAS (scale out network attached storage), and the Netezza data
warehouse appliance.
1. Big Data in Motion
Getting back to the tweeting: if you’re a government agency and you need to get
actionable insights into tens of thousands of tweets per second which might be
about an unfolding crisis, how would you do it?
InfoSphere Streams is unlike anything else in the market in its ability
to ingest, analyze and act on data “in motion” – that is, data is processed and
analyzed at microsecond latencies.
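The core idea of processing data "in motion" is that each item updates a running answer as it arrives, rather than being stored first and queried later. The sketch below illustrates this with a sliding-window counter in plain Python. It is not InfoSphere Streams code; the class name and the alert threshold are hypothetical, and it assumes timestamps arrive in order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events seen in the last `window_seconds`, updated as each event arrives."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp: float) -> int:
        """Record one event and return how many fall inside the current window."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


counter = SlidingWindowCounter(window_seconds=1.0)
alerts = []
for t in [0.0, 0.2, 0.4, 0.5, 0.6, 2.0]:
    if counter.add(t) >= 4:  # hypothetical alert threshold
        alerts.append(t)
```

Here the alert fires at the moment the fourth and fifth tweets land inside the same one-second window, without ever scanning the full history.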
2. Hadoop Big Data
Hadoop is an open source codebase supported by the Apache Software Foundation.  It is designed to process large volumes of
unstructured data. For example, if a government agency wanted to analyze months
of tweets or documents in non-real time, the Hadoop distributed file system
would be a good choice. The enterprise
class IBM Hadoop-based offering, BigInsights, is designed with system
management, security, and performance features that go beyond what is available
in the open source. It provides the
ability to analyze and extract information from a wide variety of data sources,
and promotes data exploration and discovery.
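Hadoop's processing model is commonly described as map and reduce: each record is mapped to a partial result, and partial results are merged. The sketch below shows that shape on a handful of made-up tweets in a single Python process. It does not use any Hadoop API; on a real cluster the map and reduce steps would be distributed across many nodes.

```python
from collections import Counter
from functools import reduce


def map_hashtags(tweet: str) -> Counter:
    """Map step: count the hashtags in a single tweet."""
    return Counter(word.lower() for word in tweet.split() if word.startswith("#"))


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts."""
    return left + right


# Hypothetical sample data.
tweets = [
    "Flooding on Main St #flood #DC",
    "Stay safe everyone #flood",
    "Road closures posted #dc",
]
totals = reduce(reduce_counts, map(map_hashtags, tweets), Counter())
```

Because the reduce step only merges partial counts, the tweets can be split into any number of batches and processed independently, which is what makes this style of analysis scale to months of data.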
3. Scale Out NAS
Network Attached Storage, or NAS, has become a very popular way to provide storage
within an organization.  However, NAS has
a number of limitations when dealing with
Big Data, including the number of objects (files) it can support, support
for very large files, the I/O bandwidth
it can deliver to applications, and fragmented data management across multiple
systems. The IBM SONAS system is
designed to overcome these limitations and look like a very large virtual
system to the applications.
4. Data Warehouse Appliance
Traditional data warehouses, when used for large volumes of structured data, can be costly to
operate and maintain, and can be very slow when used for sophisticated
analysis. The Netezza appliance is a
dedicated device requiring no tuning or storage administration and with special
hardware chips to accelerate the performance of advanced analytics.
Want to learn more?
- More details on the topics can
be found at the ASC Website under Past Events.
- On the educational front, we
provide free online training through BigDataUniversity.com. To
date, more than 13,000 students have registered for courses on Hadoop,
cloud computing and more.
We are working with a broad range of clients to help them define
their big data strategies. We look forward to working with you on your Big Data
The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012,
Forrester Research, Inc., February 2, 2012. The Forrester Wave is copyrighted
by Forrester Research, Inc. Forrester and Forrester Wave are trademarks of
Forrester Research, Inc. The Forrester Wave is a graphical representation of
Forrester's call on a market and is plotted using a detailed spreadsheet with
exposed scores, weightings, and comments. Forrester does not endorse any
vendor, product, or service depicted in the Forrester Wave. Information is
based on best available resources. Opinions reflect judgment at the time and
are subject to change.
In these tough fiscal times, all agencies are going to be
focusing on doing more with less. How
does one get more done with less budget and staff? Consider turning to Analytics.
The consulting firm Nucleus Research has been looking at the
Return on Investment (ROI)
for various types of IT projects.
According to David O’Connell, Principal Analyst at Nucleus Research, “projects
involving analytics have some of the highest ROIs of any projects studied.”
Nucleus Research recently studied an analytics project IBM performed at DC
Water, the local water authority for Washington,
DC. In 2008, IBM began a first-of-a-kind project
using advanced analytics to create a smarter water system that analyzes data on
valves, storm drains, service vehicles, truck routes and more to optimize its
infrastructure. With some pipes and other assets that date to the Civil War,
maintaining high levels of service while replacing older infrastructure is an
The project has resulted in the following benefits from a combination of IBM
Asset Management and Analytics technology and services:
- Field Services trucks can be automatically
routed to optimize work management.  This results in more work orders being
completed each week, as well as up to a 20 percent reduction in fuel costs
from fewer truck rolls and reduced "windshield" time.
- DC Water recaptured $3.8 M in revenue that had been lost to defective or
degrading water meters, because the analytics behind
the advanced metering infrastructure deliver more timely identification and
replacement of those meters.  Revenue was
also recaptured because DC Water can now identify and bill locations where
there is unmetered water usage.
- DC Water has been able to identify
assets most critically in need of repair using predictive analytics, so aging
infrastructure replacement programs can be more accurately scheduled,
preventing costly incidents that reduce service quality, such as outages and
water main breaks.  This reduces both
maintenance labor costs and call center
costs associated with emergency incidents.
Nucleus Research reported in its case
study that the DC Water project resulted in $19.677 M of benefits over 3
years with a cost of $883 K, giving an ROI of 629%.
In 2010, Nucleus Research studied a number of other public
sector analytics projects. The results
from these projects are shown in the chart below. On average, the analytics projects have
resulted in an ROI of almost 600%! This
means that over 3 years, the projects have returned benefits 6 times the
original cost of the projects. The
payback period has been less than a year in all cases. This is important to government agencies because
it means you can see cost savings in the same fiscal year that you invest in an
analytics project.
According to David O’Connell, Principal Analyst at Nucleus
Research, “When government entities adopt
analytics, returns are high for two reasons.
First, waste such as leaky water mains, defective meters, or benefits
overpayments can be identified and eliminated.
Second, by making information more readily available, employees spend
less time looking around for information and more time getting their jobs done.” O’Connell went on to say, “Another improvement is better use of
workers’ time. The more an organization
knows about the public it serves, their needs, and the means of delivering
service, the smarter managers’ decisions are when they hand out workers’
Has your agency implemented any analytics projects? What’s been your experience?
Don't feel comfortable sharing
publicly? I'd be happy to hear your thoughts directly as well (email@example.com).
(net savings year 1 + net savings year 2 + net savings year 3)/3 * 100
Many of you probably saw the news about the Beltway Blockage
on the afternoon of July 8th, and some of you may have been stuck in
the traffic like I was.  I had just read
IBM’s new report, “The Globalization of Traffic Congestion: IBM
2010 Commuter Pain Survey,” but it was little consolation knowing that
traffic delays in Moscow were on average 2.5 hours, even as I watched my
commute time inch towards the second hour.
Transportation is a key governmental function that has
enormous impact on citizens’ well-being.
Traffic congestion adds stress to our lives, retards economic
development, and impacts the environment.
Performance Management is the mandate of the day for
governments at the federal, state, and local levels.  In the past, many government agencies would
measure performance by counts such as the number of roads resurfaced, number of traffic
lights installed, and the number of dollars spent on transportation.  These were input data elements.  A more recent focus, and one that is more
meaningful to citizens, is to measure the outcomes achieved by the government
agencies. In the case of transportation,
an outcome might be the average commute time from one location to another, the
average speed on a roadway, or the volume of traffic (or persons) carried by a
road segment during the peak traffic hour.
Reporting on outcomes is but the first step. The performance achieved must be compared to
the desired quality of service (QoS).
Setting of the QoS goals for transportation and other government
functions is worthy of a public debate because there are invariably tradeoffs,
the major one being how much more one is willing to pay to achieve a better
QoS.
Another step that can be done with the outcome data is to
determine trends and predict what might happen if the trends continue. We call this Predictive Analytics. We can plan the transportation infrastructure
that will be needed if Washington’s
growth continues at the current rate (except for 2008, we have grown
Additionally, the performance data can be analyzed to find patterns. Does the QoS fall short only in certain spots
or at a certain time of day? Why is this
happening? We can build models of the
traffic flows and run simulations to allow us to ask questions such as “Would
an extra off-ramp lane prevent the exiting traffic from backing up on the
Beltway?” Or “Would running an extra
lane Southbound in the morning improve the traffic flows?”
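A "what if" question like the extra off-ramp lane can be explored with even a very small model. The sketch below is a toy queue simulation, not a real traffic model: the arrival numbers and the per-lane capacity are made up, and it ignores everything except how many vehicles arrive and how many the ramp can clear each minute.

```python
def simulate_ramp_queue(arrivals_per_min: list, capacity_per_lane: int, lanes: int) -> list:
    """Return the number of vehicles left waiting at the end of each minute."""
    queue = 0
    history = []
    for arrivals in arrivals_per_min:
        # The ramp clears at most capacity_per_lane * lanes vehicles per minute.
        served = min(queue + arrivals, capacity_per_lane * lanes)
        queue = queue + arrivals - served
        history.append(queue)
    return history


# Hypothetical peak-hour arrivals at one off-ramp, in vehicles per minute.
arrivals = [20, 35, 40, 40, 30, 15]
one_lane = simulate_ramp_queue(arrivals, capacity_per_lane=25, lanes=1)
two_lanes = simulate_ramp_queue(arrivals, capacity_per_lane=25, lanes=2)
```

With these made-up numbers the single lane builds a backlog that peaks at 45 waiting vehicles, while two lanes never back up at all. A real study would calibrate the arrivals and capacities from measured outcome data before drawing conclusions.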
Getting back to the recent tractor-trailer accident, has
anyone done any modeling and simulation of what might happen if I-495 were
blocked by an accident, or a terrorist action?  Do we have alternate routes identified?  Do we have the computer systems to redirect
traffic to these alternate routes and to dynamically change the traffic
patterns on certain roads to facilitate the flow of traffic in what may be
If you’d like to voice your opinion about the traffic
situation in your city, fill in our on-line questionnaire, "Traffic Survey."  Disclaimer:
This is not intended to be a scientific, randomized survey, and I make
no claim to its validity.  However, I
will publish the results in a future blog, if we get enough interest in the
survey.
Give me your thoughts on how analytics might be used to
improve our traffic situation. Write to
me at firstname.lastname@example.org or respond to this
blog.
-Frank Stein, Director, IBM’s Analytics Solution Center
More on Analytics at our website www.ibm.com/ASCdc
As government leaders, do you believe the world is getting
more complex?  More volatile?  If so, you’re not alone: sixty percent of
the CEOs surveyed by IBM in our 2010 CEO Study thought the world was getting
more complex, and even more, 69%, felt the world was getting more
volatile.
For the first time, we also posed a similar set of questions
to college students. These future
leaders viewed the world as even more complex than the CEOs we surveyed. But
they saw less volatility, and significantly less uncertainty than the CEOs (65%
of the CEOs, but only 48% of the students).
Could it be that the students are more acclimated to economic boom/bust
cycles and feel more comfortable with the uncertainty of today’s world?
Or could it be that in the instrumented, interconnected,
collaborative world that they are used to (most of the students never knew a
world without web browsing and many don’t remember the pre-Facebook era), they
feel more comfortable dealing with this complex world? As a student in France put it, “We will have more
information, so it [the world] should be more predictable.”
We found that students who had the greatest sense of
complexity put much more emphasis on the analytics and predictive capabilities
of information. They were 50% more
likely to expect significant impact from increased information than peers who
did not have the same sense of complexity.
And they were 22% more likely to believe that organizations should focus
on insight and intelligence to enable their strategies. Also,
interestingly, students in China
were significantly more likely to prefer a fact- and research-based style of
decision making than their peers around the world. Does that indicate that the Chinese students
have been trained to feel more comfortable dealing with data than their
peers?
With the baby boom heading towards retirement in the coming
years, does this mean the government workers who replace them will be more
comfortable using information and analytical techniques to handle the world’s
problems? Or could it be that complexity
will always rise to be just beyond our ability to manage it with our current
level of technology?
Click here to see the IBM Report: “Inheriting
a complex world”
Click here to see the IBM Report: “2010 Global CEO
Study”
More on Analytics for Government here: www.ibm.com/ASCdc
Do you think our future leaders are inheriting a more
complex world? And do you feel they are
more prepared to manage it?
Comment on this blog or write to me at ASCdc@us.ibm.com
Frank Stein, Director of IBM’s Analytics Solution Center