Purpose of document
This article describes the development process of a mobile activity recognition service. The focus is on the techniques used in IBM SPSS Modeler for cleaning training data, selecting features, choosing the best performing classification algorithm, and validating the model.
The project that is described in this document was developed with IBM SPSS Modeler 15.0 and IBM Worklight 6.1.
Exclusions and exceptions
This document does not go into details for each item that is discussed or provide step-by-step instructions. It is intended to provide a high-level overview of the project, and highlight some analysis techniques that were used. For more details, see the IBM SPSS Modeler documentation and the IBM Worklight documentation.
A basic familiarity with SPSS Modeler is helpful.
One surprising thing that smartphone applications can do is detect the user's current physical activity, such as walking, driving, biking, or standing still. Activity recognition has a wide range of uses in mobile applications, from fitness and health tracking to context-based advertising and employee monitoring. Context-aware applications can customize their behavior based on the current activity. For example, when it searches for businesses near the user's current location, an application can use a large search radius if the user is driving and a smaller radius if the user is walking.
This article describes how we used IBM SPSS Modeler and IBM Worklight to develop an activity recognition service for the Worklight platform. In particular, we focus on how we created the classification decision tree with IBM SPSS Modeler, used SPSS Modeler to choose the best classification algorithm, determined which feature combinations provided the best results, cleaned the training data, tested the decision tree, and handled overfitting.
The road to activity recognition
Implementing the activity recognition algorithm on the mobile platforms was the final step in a rather long development process, which started with building a dedicated application for collecting training data. The process consisted of the following steps:
- Using IBM Worklight, we developed the Sensor Logger app for collecting data from the accelerometer and the GPS.
- Twenty volunteers used the app to collect sensor data on their mobile phones while they did one of the activities (such as walking or running).
- The sensor data was collected from the phones and processed by a feature extraction program that extracted representative features of the data, and stored the features in an IBM DB2® database.
- Using one of the classification algorithms available in IBM SPSS Modeler, the training data was processed to generate a decision tree.
- The decision tree was exported from SPSS Modeler as a PMML (Predictive Model Markup Language) file. We then used the Java™ PMML API to convert the decision tree to code in Java, Objective-C, and C# languages.
- The decision tree code was incorporated into three native versions of an activity recognition API for Android, iOS, and Windows® Phone.
The Sensor Logger application
The Sensor Logger application is an IBM Worklight application that was developed for collecting training data. It samples the accelerometer at a rate of approximately 50 Hz and logs the readings into a local file. Every few seconds it also writes the current speed, which is taken from the GPS, to the log. The app was given to 20 volunteers who installed it on various smartphones, including Android, iPhone, and Windows Phone devices. The volunteers were asked to use the app to record the data while they did the activities, one activity at a time, for at least 10 minutes. Figure 1 shows three screen captures of the Sensor Logger app. To begin recording, the volunteer clicked an activity button on the left screen. The recording screen, in the middle, displays the selected activity and the elapsed recording time. After clicking Stop, the user can save the log file and send it to the development team, as shown on the right screen.
Figure 1: The Sensor Logger application
The sensor logger recording files were processed with a feature extraction program that we developed in Java code. The program sliced the recorded data file into windows of three seconds, and calculated representative features of each window.
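As an illustration of the slicing step, the following minimal Java sketch cuts a series of samples into consecutive three-second windows. It is not the project's feature extraction code: the class name WindowSlicer, the fixed sample rate, the non-overlapping windows, and the dropping of a trailing partial window are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: slices a recorded signal into fixed-length windows. */
final class WindowSlicer {

    /**
     * Splits the samples into consecutive, non-overlapping windows.
     * A trailing partial window is dropped (an assumption of this sketch).
     */
    static List<double[]> slice(double[] samples, int sampleRateHz, int windowSeconds) {
        int windowSize = sampleRateHz * windowSeconds; // for example, 50 Hz * 3 s = 150 samples
        if (windowSize <= 0) {
            throw new IllegalArgumentException("Window size must be positive");
        }
        List<double[]> windows = new ArrayList<>();
        for (int start = 0; start + windowSize <= samples.length; start += windowSize) {
            windows.add(Arrays.copyOfRange(samples, start, start + windowSize));
        }
        return windows;
    }
}
```

Because the app samples at only approximately 50 Hz, a real implementation would more likely cut the windows by timestamp than by sample count.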
The features that we evaluated were mostly based on frequency analysis of the accelerometer signal. To cancel the effect of variations in the phone orientation, we used the magnitude of the accelerometer vector, which is calculated from the measurements of the force along the three axes. We calculated the Fast Fourier Transform (FFT) coefficients by using the Apache Commons Math library. The FFT features were then calculated by dividing the series of FFT coefficients into subranges and taking the sum of the coefficients in each subrange. For example, one of the combinations we tested was the sums of the coefficients in the ranges around 1 Hz, 2 Hz, and so on. We calculated several alternative sets of features, varying the minimum and maximum frequency and the lengths of the subranges. Using SPSS Modeler, we could easily evaluate which set of FFT features provided the best results by creating alternative decision trees and testing the classification accuracy on a subset of the data that was reserved for testing.
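The frequency features can be sketched as follows. This is a minimal, self-contained illustration and not the project's code: to avoid an external dependency it uses a naive O(n²) discrete Fourier transform in place of the Apache Commons Math FFT, and the class name FftBandFeatures, the method names, and the band parameters are hypothetical.

```java
/** Hypothetical sketch of orientation-independent frequency-band features. */
final class FftBandFeatures {

    /** Magnitude of each (x, y, z) accelerometer sample, which does not depend on phone orientation. */
    static double[] magnitude(double[] x, double[] y, double[] z) {
        double[] m = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            m[i] = Math.sqrt(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
        }
        return m;
    }

    /** Naive DFT (stand-in for a library FFT); returns |X[k]| for k = 0..n/2. */
    static double[] spectrum(double[] signal) {
        int n = signal.length;
        double[] mag = new double[n / 2 + 1];
        for (int k = 0; k < mag.length; k++) {
            double re = 0.0;
            double im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = 2.0 * Math.PI * k * t / n;
                re += signal[t] * Math.cos(angle);
                im -= signal[t] * Math.sin(angle);
            }
            mag[k] = Math.hypot(re, im);
        }
        return mag;
    }

    /** Sums the spectrum magnitudes in consecutive bands of width bandHz between minHz and maxHz. */
    static double[] bandSums(double[] spectrum, int windowLength, double sampleRateHz,
                             double minHz, double maxHz, double bandHz) {
        double hzPerBin = sampleRateHz / windowLength; // frequency resolution of one DFT bin
        int bands = (int) Math.round((maxHz - minHz) / bandHz);
        double[] sums = new double[bands];
        for (int k = 0; k < spectrum.length; k++) {
            double f = k * hzPerBin;
            int band = (int) Math.floor((f - minHz) / bandHz);
            if (f >= minHz && band < bands) {
                sums[band] += spectrum[k];
            }
        }
        return sums;
    }
}
```

With a 150-sample window at 50 Hz, each bin covers one third of a hertz, so calling bandSums with minHz = 0.5, maxHz = 5.5, and bandHz = 1.0 yields five features centered on 1 Hz through 5 Hz. Starting above 0 Hz also leaves out the constant component that gravity adds to the magnitude.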
Other features that we used, in addition to the FFT features, were the mean, the variance, and the energy of the signal, plus the speed, received from the GPS.
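The statistical features are simple to compute for each window. The sketch below is illustrative only; in particular, the article does not define "energy", so the definition used here (the mean of the squared samples) is an assumption, as is the use of the population variance and the class name WindowStats.

```java
/** Hypothetical per-window statistics for the accelerometer magnitude signal. */
final class WindowStats {

    static double mean(double[] window) {
        double sum = 0.0;
        for (double v : window) {
            sum += v;
        }
        return sum / window.length;
    }

    /** Population variance (divides by n, not n - 1). */
    static double variance(double[] window) {
        double m = mean(window);
        double sumSquares = 0.0;
        for (double v : window) {
            sumSquares += (v - m) * (v - m);
        }
        return sumSquares / window.length;
    }

    /** Energy, assumed here to be the mean of the squared samples. */
    static double energy(double[] window) {
        double sumSquares = 0.0;
        for (double v : window) {
            sumSquares += v * v;
        }
        return sumSquares / window.length;
    }
}
```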
The calculated features were stored in a DB2 database — a record was created for every 3-second window with the extracted features and the activity for which the data was recorded. Information about the track (log file) was also stored in the database. This information included the smartphone model, the operating system, the user name, and the track name — a unique identifier of the track. These fields were used to filter out subsets of the training data for various tests that are described in the next section.
Creating the classification model
After a large enough set of training data was collected by the volunteers and processed by the feature extraction program, we were ready to analyze the data with IBM SPSS Modeler. Figure 2 shows the SPSS Modeler stream that we used to create and test the classification model. We now describe the function of each node in the stream and how we used it in our analysis.
Figure 2: The SPSS Modeler stream
The Database source node imports the training data from the database by using an SQL query. By changing the query's selection criteria, we filtered subsets of the training data — for example, import only the data that was recorded on Android devices, or exclude the data that was recorded by one of the users. In the properties pane of the Database source node, you can also define the usage type of the fields. This pane is where we set the role of the Activity field as the target field for classification.
The Database source node is connected to a Filter node, which is used to exclude fields from the data. We used the Filter node to test how the exclusion of various features and combinations of features affects the accuracy. The Filter node was also used to create two versions of the decision tree: one with the speed feature and one without it. With two versions, the activity recognition API can perform recognition either with or without the GPS. Running without the GPS saves battery life and allows recognition in areas with no GPS signal.
The next node in the stream is the Partition node, which splits the data randomly into two subsets — training and testing. We used the Partition node with the Analysis node, which presents the accuracy of the model when applied to the testing partition. The Partition node allowed us to do various tests and gradually improve the quality of the training data and the resulting model. We used it to evaluate the accuracy we get with different subsets of features and different classification algorithms, and also evaluate the quality of the training data itself.
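Outside of SPSS Modeler, the idea behind a random partition can be sketched in a few lines of Java. This is not how the Partition node is implemented; the class name RandomPartition, the per-record random draw, and the fixed seed (used so that a split can be reproduced) are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of a random training/testing split. */
final class RandomPartition {

    /**
     * Returns a two-element list: index 0 is the training set, index 1 the testing set.
     * Each record is assigned to training with probability trainFraction.
     */
    static <T> List<List<T>> split(List<T> records, double trainFraction, long seed) {
        Random rng = new Random(seed); // fixed seed makes the split repeatable
        List<T> training = new ArrayList<>();
        List<T> testing = new ArrayList<>();
        for (T record : records) {
            if (rng.nextDouble() < trainFraction) {
                training.add(record);
            } else {
                testing.add(record);
            }
        }
        return List.of(training, testing);
    }
}
```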
The Partition node is connected to the C5.0 node, which uses the C5.0 classification algorithm to build a decision tree. IBM SPSS Modeler offers many classification algorithms, which allowed us to easily evaluate which algorithm works best for our problem by selecting different types of modeling nodes and testing the accuracy with the Partition node and the Analysis node. In addition to the C5.0 node, we tried other decision tree models, including the C&R tree node, the QUEST node, and the CHAID node. We also tried other types of algorithms, such as the KNN node and the Neural Net node. The C5.0 node was chosen because it provided the best accuracy with our training data. Another benefit of the C5.0 node is that the decision tree it produces has a straightforward interpretation, and is easy to convert to Java or Objective-C code.
The Activity Model nugget at the bottom of the stream is the model that is generated by the C5.0 node, by applying the C5.0 algorithm to the training partition. The model nugget is also connected to the Partition node and the Analysis node, allowing us to execute the stream for evaluating the accuracy with the testing partition. Double-click the model nugget to open the model nugget browser, where you can examine the decision tree by using several views and representation options. The model nugget browser also includes the predictor importance chart, which shows which features were more valuable as classification predictors.
In addition to the random partition, we also used Select nodes to split the data based on explicit conditions. This was done in a separate stream, as shown in Figure 3. In the middle of the stream, the two Select nodes, Training and Testing, define conditions for inclusion in the training partition or the testing partition. The conditions are based on the values of data records, such as the operating system, the user name, or the track name. We used this stream for data cleaning, model validation, and validation that the extracted features have no dependency on the mobile operating system.
Figure 3: The SPSS Modeler stream with two Select nodes
Before we accepted a new track (recorded log file) into the training / testing data, we checked the accuracy of classifying the new track with a decision tree that was generated from the data already validated. We configured the Training Select node to discard records that belonged to the new track and the Testing Select node to include only the new track. If the classification accuracy of the new track was lower than the accuracy achieved with random partitioning, we inspected the classification results more closely, by using the output browsers (views) generated by the Analysis node and the Table node. Figure 4 shows the Analysis node output for the classification of a single biking track that was found to be faulty. The recognition accuracy is only 73%, which raises the suspicion that something went wrong with the recording of this track. The coincidence matrix, at the bottom, shows how the predicted results spread across the correct activity and the other activities. It shows that 180 records (time slices) were classified correctly, but 63 records were wrongly classified as Walking. In this case, we found that the user forgot to stop the recording when he stopped biking and started walking. To fix the problem, we edited the track file, deleted lines from the end of the file, and kept only the first two thirds of the file, the part that was recorded while biking.
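To make the coincidence matrix concrete, the following sketch builds one from lists of actual and predicted labels and derives the overall accuracy from it. It only illustrates the concept and is not the Analysis node's implementation; the class name CoincidenceMatrix and the nested-map representation are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical coincidence (confusion) matrix: actual label -> predicted label -> count. */
final class CoincidenceMatrix {

    static Map<String, Map<String, Integer>> build(List<String> actual, List<String> predicted) {
        Map<String, Map<String, Integer>> matrix = new TreeMap<>();
        for (int i = 0; i < actual.size(); i++) {
            matrix.computeIfAbsent(actual.get(i), k -> new TreeMap<>())
                  .merge(predicted.get(i), 1, Integer::sum);
        }
        return matrix;
    }

    /** Fraction of records whose predicted label equals the actual label. */
    static double accuracy(Map<String, Map<String, Integer>> matrix) {
        int correct = 0;
        int total = 0;
        for (Map.Entry<String, Map<String, Integer>> row : matrix.entrySet()) {
            for (Map.Entry<String, Integer> cell : row.getValue().entrySet()) {
                total += cell.getValue();
                if (cell.getKey().equals(row.getKey())) {
                    correct += cell.getValue();
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }
}
```

For a single-activity track, the matrix has one row, and a large count in any column other than the recorded activity is the kind of signal that prompted a closer look at the track.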
Figure 4: Analysis results of a faulty track
For deeper inspection of the results, we also looked at the table output that shows the fields of each record, together with the predicted result. In the early stages of development, this review was also helpful for identifying bugs in our feature extraction program, where wrongly calculated features resulted in wrong classification.
With random partitioning, the records of each track are randomly divided between the training set and the testing set. Thus, the training set and the testing set may not be independent of each other because they include records from the same tracks. Records that are part (time slices) of the same track, recorded by the same user with the same device on the same occasion, are typically closer to each other in terms of feature variance than to records of another track of another user recorded elsewhere. Therefore, testing with random partitioning might be less indicative of the efficacy of the model and might not reveal overfitting. A better way to estimate the accuracy is to split by users: discard one or more users from the training data and use the data that these users recorded for testing. In at least one case, we found that one of the sets of features that we evaluated provided good results with random partitioning, but the results were not so good with a split by users, probably due to overfitting that was hidden by the random partitioning.
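A split on an explicit field, such as the user name, can be sketched as follows. This is an illustration of the idea rather than the Select node logic used in the project; the class name HoldOutSplit and the key-function parameter are hypothetical. The same helper works for any grouping field, so it also covers holding out a single track or a whole operating system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch: hold out all records whose key (user, track, OS) is in a given set. */
final class HoldOutSplit {

    /**
     * Returns a two-element list: index 0 is the training set, index 1 the testing set.
     * Records whose key is in heldOut go to testing; all others go to training.
     */
    static <T> List<List<T>> split(List<T> records, Function<T, String> key, Set<String> heldOut) {
        List<T> training = new ArrayList<>();
        List<T> testing = new ArrayList<>();
        for (T record : records) {
            if (heldOut.contains(key.apply(record))) {
                testing.add(record);
            } else {
                training.add(record);
            }
        }
        return List.of(training, testing);
    }
}
```

Because every record of a held-out user lands in the testing set, no time slices from the same track appear on both sides of the split.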
Validating operating system independence
When we started to collect data from the three operating systems, it was important to verify that we could mix the data and generate a common decision tree. To verify that, we did several tests where we split the data by operating system. For example, we generated a decision tree from the data that was recorded on Android and applied it to the iOS data. The results showed that the accelerometer behaved the same way on the three operating systems and that the data collected from the different operating systems had no inherent differences. This confirmation allowed us to mix the data and generate one decision tree with the same recognition accuracy on Android, iOS, and Windows Phone.
Converting the decision tree to source code
The model nugget browser provides several options to export the decision tree. The tree can be exported as a text file, an HTML file, or a PMML file. The text file is a simple textual representation of the tree, where internal nodes are represented by the predicate that is associated with the node and line indentation is used for the branching structure. This file looks very much like a tree of nested if-then statements and indeed it is easy to manually convert this representation to if-then statements. For our project, however, we preferred to have an automated process so that we can repeatedly create the code and test different versions of the tree on the mobile devices.
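To show what a manual conversion looks like, here is a small invented example. The feature names, thresholds, and activity labels are made up for illustration and are not taken from the project's decision tree, and the indented listing in the comment only approximates the spirit of a text export, whose exact layout may differ.

```java
/**
 * Hypothetical hand conversion of an indented rule listing such as:
 *
 *   speed <= 0.5
 *       variance <= 0.1  => Standing
 *       variance > 0.1   => Walking
 *   speed > 0.5
 *       fft2Hz <= 20.0   => Driving
 *       fft2Hz > 20.0    => Biking
 *
 * All names and numbers are invented for this sketch.
 */
final class HandConvertedTree {

    static String classify(double speed, double variance, double fft2Hz) {
        if (speed <= 0.5) {
            if (variance <= 0.1) {
                return "Standing";
            } else {
                return "Walking";
            }
        } else {
            if (fft2Hz <= 20.0) {
                return "Driving";
            } else {
                return "Biking";
            }
        }
    }
}
```

Each level of indentation in the listing becomes one level of nesting in the code, which is why the conversion is mechanical and lends itself to automation.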
For the automated conversion, we exported the decision tree as a PMML file. We then used a Java program that parses the PMML file and converts the decision tree to nested if-then statements in Java, Objective-C, or C# code. The program is based on the Java PMML API, an open source API for producing and scoring models. The Java PMML API is used to parse the PMML file, and produce the lines of code while traversing the nodes of the tree — an if-then statement for internal nodes and a return statement for leaf nodes.
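The traversal can be sketched as follows. This is not the project's converter: to stay self-contained, the example reads the PMML with the XML DOM parser that ships with the JDK instead of the Java PMML API, handles only SimplePredicate and True predicates, and emits Java-style statements only. The class name PmmlTreeToCode is hypothetical, and falling back to a node's own score when no child predicate matches is an assumption of this sketch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: turns the first PMML TreeModel into nested if-then statements. */
final class PmmlTreeToCode {

    static String generate(String pmmlXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(pmmlXml)));
            Element treeModel = (Element) doc.getElementsByTagName("TreeModel").item(0);
            Element root = childElements(treeModel, "Node").get(0);
            StringBuilder out = new StringBuilder();
            emit(root, "", out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not convert PMML tree", e);
        }
    }

    /** Emits an if block per child node, then a return of this node's score (the leaf case or a fallback). */
    private static void emit(Element node, String indent, StringBuilder out) {
        for (Element child : childElements(node, "Node")) {
            out.append(indent).append("if (").append(condition(child)).append(") {\n");
            emit(child, indent + "    ", out);
            out.append(indent).append("}\n");
        }
        out.append(indent).append("return \"").append(node.getAttribute("score")).append("\";\n");
    }

    private static String condition(Element node) {
        List<Element> predicates = childElements(node, "SimplePredicate");
        if (predicates.isEmpty()) {
            if (!childElements(node, "True").isEmpty()) {
                return "true";
            }
            throw new IllegalArgumentException("Unsupported predicate type");
        }
        Element p = predicates.get(0);
        return p.getAttribute("field") + " " + operator(p.getAttribute("operator"))
                + " " + p.getAttribute("value");
    }

    private static String operator(String pmmlOperator) {
        switch (pmmlOperator) {
            case "lessThan": return "<";
            case "lessOrEqual": return "<=";
            case "greaterThan": return ">";
            case "greaterOrEqual": return ">=";
            case "equal": return "==";
            case "notEqual": return "!=";
            default: throw new IllegalArgumentException("Unsupported operator: " + pmmlOperator);
        }
    }

    private static List<Element> childElements(Element parent, String tagName) {
        List<Element> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element
                    && ((Element) children.item(i)).getTagName().equals(tagName)) {
                result.add((Element) children.item(i));
            }
        }
        return result;
    }
}
```

Generating Objective-C or C# from the same traversal would mostly be a matter of changing the emitted strings, which is one reason a tree walk over the PMML is a convenient basis for multi-language output.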
Implementation on the mobile operating systems
When we finally settled on the best performing decision tree and converted it to source code, we created native implementations of the activity recognition module for the mobile operating systems. The recognition algorithm starts with sampling the accelerometer and, optionally, the GPS, the same way it is done in the Sensor Logger application. Every three seconds (the same window size that is used for the analysis), the data that was collected within the window is processed to extract the features, by using native mathematical libraries for the FFT calculation. Finally, the native decision tree code is applied to get the classification result, which is the recognized activity.
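The on-device loop can be pictured as a buffer that fills with samples and triggers a classification each time a full window is available. The sketch below is a simplified illustration, not the native implementation: the class name StreamingRecognizer is hypothetical, the classifier is passed in as a function that stands in for feature extraction plus the generated decision tree, and counting samples rather than measuring elapsed time is an assumption.

```java
import java.util.function.Function;

/** Hypothetical sketch: buffers samples and classifies each completed window. */
final class StreamingRecognizer {
    private final double[] buffer;
    private final Function<double[], String> classifier;
    private int filled = 0;
    private String lastActivity = "Unknown"; // reported until the first window completes

    StreamingRecognizer(int windowSize, Function<double[], String> classifier) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("Window size must be positive");
        }
        this.buffer = new double[windowSize];
        this.classifier = classifier;
    }

    /** Feeds one accelerometer magnitude sample and returns the most recently recognized activity. */
    String onSample(double magnitude) {
        buffer[filled++] = magnitude;
        if (filled == buffer.length) {
            // Window complete: classify a copy of it, then start filling the next window.
            lastActivity = classifier.apply(buffer.clone());
            filled = 0;
        }
        return lastActivity;
    }
}
```

At a 50 Hz sampling rate, a window size of 150 samples corresponds to the three-second window used in the analysis.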
In developing the mobile activity recognition service we went through the steps of training data collection, feature extraction, model building and validation, and implementation on the mobile operating system. This rather complicated process was simplified by using IBM Worklight for cross-platform mobile development and the rich, yet easy-to-use analytical tools of IBM SPSS Modeler.
The author would like to thank Harry Hornreich, Einat Pickman, Itamar Elem, and Tom Gamliely for their contribution to this project.
- IBM Worklight on developerWorks: Build, run, and manage HTML5, hybrid, and native mobile apps with this open, comprehensive platform for application development.
- Apache Commons Math: Explore a library of lightweight, self-contained mathematics and statistics components to address the most common problems not available in the Java programming language or Commons Lang.
- Java PMML API: Try an open source Java API for producing and scoring models in PMML, an XML-based file format to describe and exchange models created by data mining and machine learning algorithms.
- Find the resources that you need to improve outcomes and control risk in the developerWorks Business analytics zone.
Get products and technologies
- Learn more about the SPSS family of products.