This blog post is contributed by John N. Ryan. John Ryan has worked in the SMS communication sector since 2001 within an Irish company, Go2mobile Solutions.
I was asked to write a blog entry on identifying patterns within mobile data. Since my background is in SMS (Text Messages), the mobile data used for pattern identification for this blog is in relation to finding nuggets of information within a text messaging corpus. Just as a point of interest, all the images produced for this entry were generated through IBM’s excellent ManyEyes Visualisation tool (www-958.ibm.com).
SMS generates huge volumes of data, due to its popularity. This is especially the case within Ireland; over 2.9 billion messages were sent in the fourth quarter of 2012 (Comreg, 2013). Thus, there are potentially valuable trends for businesses to uncover through data mining, such as analysing customer feedback and reviews that originate via SMS. Data analytics could tell a business whether its customers are satisfied or displeased with a service.
SMS has featured in various types of data analytic research, covering both customer service and spam detection. Some of this research is briefly listed below to give you a flavour of what has been accomplished in this area:
Kroha et al. (2006) investigated whether business-related bulletins sent via SMS could be used to predict financial markets.
Godbole and Roy (2008) built a text mining system, IBM Technology to Automate Customer Satisfaction Analysis (I-TACS), which used VOC (Voice of Customer) data to analyse items such as customer feedback. VOC comprises emails, call agent logs, feedback reports, phone transcripts and SMS text.
The notion that SMS is too short to help identify patterns was dispelled by Leong et al. (2012) when they analysed lecturer performance via student SMS feedback. Their research showed that sentiment could be analysed within such feedback, though there were issues such as emoticons and message incompleteness.
Delany et al. (2012) showed that SMS spam could be detected with up to 97% classification accuracy. In addition, there is currently an SMS Spam Public Dataset within the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) that you could investigate for your own research.
In a nutshell, as these research examples show, effective patterns can be discerned and summarisation of text messaging content is possible. Text mining methods can include understanding the polarity (for example, was the text positive or negative?) as well as visualising the information using images such as Wordclouds.
If you do not have access to corporate warehoused SMS data for your research, then obtaining a publicly available SMS corpus can be challenging due to privacy concerns (Chen and Kan, 2012). However, there is a solution: Chen and Kan (2012) generated an SMS dataset that is freely available to download from http://wing.comp.nus.edu.sg:8080/SMSCorpus/xml.jsp. It consists of 51,654 messages. The messages were extracted for this blog entry using the free data mining software RapidMiner 5.3.005 (https://rapid-i.com/content/view/181/196/).
Wordcloud visualisations summarise this SMS corpus by displaying the more frequently used words (also called “terms”) in larger font sizes than less used terms, as represented in Figure 1. Removing some of those key terms (as in Figure 2) allows time intervals to be clearly emphasised as a trend: time, late, now, morning, later. Thus, it gives you a flavour of the context of the corpus. This would be especially beneficial when compiling an SMS feedback survey, for example, in order to visually comprehend the most emphasised concerns and issues through the most frequently used words.
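To make the idea behind a Wordcloud a little more concrete, here is a minimal Python sketch of the counting step that sits underneath it: tally how often each term occurs, optionally leaving some terms out (as was done for Figure 2). The sample messages and the list of removed terms are made up for illustration; they are not taken from the corpus, and this is not how ManyEyes itself is implemented.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for the real SMS messages.
messages = [
    "Running late, see you later",
    "Are you free now? Call me later",
    "Good morning! See you at lunch time",
]

# Illustrative list of terms to leave out (an assumption; real lists are longer).
REMOVED_TERMS = {"you", "me", "at", "are", "see"}


def term_frequencies(texts, removed_terms=frozenset()):
    """Count how often each lower-cased term occurs, skipping removed terms."""
    counts = Counter()
    for text in texts:
        # Split on anything that is not a letter, digit or apostrophe.
        for term in re.findall(r"[a-z0-9']+", text.lower()):
            if term not in removed_terms:
                counts[term] += 1
    return counts


freq = term_frequencies(messages, REMOVED_TERMS)
top_terms = freq.most_common(3)
print(top_terms)
```

A Wordcloud tool would then map each count to a font size, so the terms with the highest counts appear largest.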
Emotional polarity can be deduced using the sentiment package (Jurka, 2012) within the R software (http://www.r-project.org/). The results produced from this package indicate that the overall polarity of the corpus was negative (as referenced within Table 1), with negative words such as drunken, disturb, weird and dying contributing to that scoring. Again, by using an SMS customer feedback corpus within a corporate scenario, analysis could be run to understand if the statements submitted are positive, negative or neutral (objective).
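As a rough illustration of polarity scoring, the Python sketch below labels each message by counting words from a small positive and a small negative word list. This is a deliberately simplified stand-in, not the method used by the R sentiment package. The negative words are the examples quoted above; the positive words and the sample feedback messages are assumptions made up for the example.

```python
import re
from collections import Counter

# Negative words quoted in the text; the positive list is a made-up assumption.
NEGATIVE = {"drunken", "disturb", "weird", "dying"}
POSITIVE = {"great", "happy", "love", "thanks"}


def polarity(text):
    """Label a message by comparing its positive and negative word counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def corpus_polarity(texts):
    """Count how many messages fall under each polarity label."""
    return Counter(polarity(t) for t in texts)


# Hypothetical customer feedback messages.
feedback = [
    "Thanks, the service was great",
    "Sorry to disturb you, the driver was weird",
    "See you at 5",
]
summary = corpus_polarity(feedback)
print(summary)
```

A real analysis would need a much larger lexicon, or a trained classifier, but the overall shape (score each message, then aggregate over the corpus) stays the same.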
Figure 1 SMS Corpus
Figure 2 SMS Corpus with some key terms removed
There are many other visualisation methods that could be used to mine the content further, such as defining PhraseNets (Figure 3) and Word Trees (where emoticons could be analysed as in Figure 4). Six emotions (Surprise, Anger, Joy, Sadness, Disgust and Fear) could also be scored within the sentiment package in R. Another beneficial text mining package within R is called tm (Feinerer et al., 2008); it should be noted that tm is required when using the previously mentioned sentiment package.
Table 1 SMS Corpus
Figure 3 PhraseNet SMS Corpus
Figure 4 Word Tree SMS Corpus - in this case using Emoticon
In conclusion, while this is a very brief overview, it does provide a flavour of what is possible. Just to recap, there are extremely powerful tools for data analysis such as RapidMiner and R, and excellent visualisation tools such as IBM’s ManyEyes, which in fact has a new Version 2 currently in a beta stage. You should definitely spend some time trying these out.
Some issues to watch out for when analysing SMS include possibly requiring specialised dictionaries to handle abbreviated SMS text (these dictionaries help translate abbreviations such as gr8t to great) and emoticons (for example, :) as happy/smile). These should help to improve the quality of the patterns you find. Of course, an extra benefit of SMS is that, because a message is limited to 160 characters, the mining model derived for analysing it could also potentially be used for reviewing social media platforms such as Twitter.
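The dictionary idea can be sketched in a few lines of Python. The mappings below are purely illustrative assumptions (only gr8t and the smiley come from the examples above); a real project would need a far larger, purpose-built dictionary.

```python
# Illustrative lookup tables (assumptions, apart from "gr8t" and ":)").
ABBREVIATIONS = {"gr8t": "great", "gr8": "great", "l8r": "later", "u": "you"}
EMOTICONS = {":)": "happy", ":(": "sad"}


def normalise(message):
    """Replace known abbreviations and emoticons, token by token."""
    out = []
    for token in message.split():
        if token in EMOTICONS:
            out.append(EMOTICONS[token])
            continue
        # Separate trailing punctuation so "l8r," still matches "l8r".
        core = token.rstrip(".,!?")
        trailing = token[len(core):]
        replacement = ABBREVIATIONS.get(core.lower())
        out.append(replacement + trailing if replacement else token)
    return " ".join(out)


cleaned = normalise("c u l8r, gr8t news :)")
print(cleaned)
```

Running a step like this before counting terms or scoring sentiment means that gr8t and great are treated as the same word.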
If you wish to review any of the previously mentioned references, the reports and research papers are listed below:
Comreg Quarterly Key Data Report Document 13/25 Date: 13th March 2013, http://www.comreg.ie/_fileupload/publications/ComReg1325.pdf
Tao Chen and Min-Yen Kan. Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus. Language Resources and Evaluation. Aug 2012. [doi:10.1007/s10579-012-9197-9]
S.J. Delany, M. Buckley, and D. Greene. SMS spam filtering: Methods and data. Expert Systems with Applications, 2012.
Ingo Feinerer, Kurt Hornik, and David Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1–54, 3 2008. ISSN 1548-7660. URL http://www.jstatsoft.org/v25/i05.
S. Godbole and S. Roy. Text to intelligence: Building and deploying a text mining solution in the services industry for customer satisfaction analysis. In Services Computing, 2008. SCC’08. IEEE International Conference on, volume 2, pages 441–448. IEEE, 2008.
Timothy P. Jurka. Sentiment: Tools for sentiment analysis. Last accessed: 17 February 2013, 2012. URL http://cran.open-source-solution.org/web/packages/sentiment/.
P. Kroha, R. Baeza-Yates, and B. Krellner. Text mining of business news for forecasting. In Database and Expert Systems Applications, 2006. DEXA’06. 17th International Workshop on, pages 171–175. IEEE, 2006.
C.K. Leong, Y.H. Lee, and W.K. Mak. Mining sentiments in SMS texts for teaching evaluation. Expert Systems with Applications, 39(3):2584–2589, 2012.