By their nature, trains invite imagination. They conjure visions of exotic travel and long-distance commerce. But trains are also large, complicated, and expensive to repair when they break unexpectedly. For a railway operator, the ability to predict a failure before it happens would be tremendously valuable.
Recently, an IBM® team worked with a European train manufacturer to develop exactly this capability, using IBM SPSS® Modeler and IBM DB2® to analyze data generated by the trains. This article will show how to use SPSS to identify patterns in data. It also will show how IBM DB2 made an unexpected contribution to the project—handling a set of moderately complex mathematical calculations directly within SQL.
Predictions from statistics
On a modern train, more than 1,000 different types of mechanical and electrical events are recorded, including operational events such as “opening door” or “train is braking,” warning events such as “line voltage frequency out of range” or “compression is low in compressor X,” and failure events such as “pantograph is out of order” or “inverter lockout.” More than 2,000 events are collected on each train every day.
IBM proposed to use sequential analysis to determine whether any of these events or event sequences indicate that a more severe problem is about to occur. Sequential analysis uses statistical modeling to detect patterns in data. For example, it can show that if particular grocery store customers purchase baby formula and diapers one day, they are highly likely to purchase stewed fruit (baby food) at some point in the future.
Sequential analysis of events
The IBM team started with the raw event data for 2010: a 20-million-row data set containing the event time, event code, and the GPS location of the event. The team used the sequence node algorithm in IBM SPSS Modeler to analyze the train data. The sequence analysis is a two-stage process for sequential pattern mining 1. First, the algorithm mines the data for frequent sequences and computes an adjacency lattice. The user can then explore the lattice by using specified criteria, such as support and confidence bounds, or by placing restrictions on the antecedent sequence.
Using the IBM SPSS Modeler workbench (see Figure 1), the team defined the input table in a DB2 data source node, added a Select node to remove data that occurred while the trains were in maintenance depots, and defined the input columns in the sequence node algorithm. The team also set the time period for which the sequences were to be studied, specifying that only events occurring in a two-day timeframe were to be considered.
As shown on the screen in Figure 1, the ID field indicates which train is related to a sequence of events, the Time field describes the sequence order, and the Content field contains the code of the event that has occurred.
Figure 1. ID field, Time field, Content field.
The SPSS Modeler rule sets interface shows the rules found by SPSS after the analysis has been completed (see Figure 2). The highlighted rule displays an antecedent sequence (21382, 29203) followed by a consequent sequence (in this case a single event: 18232), with a support rating of 55.7 percent and a confidence rating of 93.1 percent. This means that during 2010, this antecedent sequence appeared in 55.7 percent of trains and was followed by the consequent sequence 93.1 percent of the time.
Figure 2. Rule sets interface
Once the rules have been calculated, the data can be explored using the SPSS Modeler filtering tools (see Figure 3). According to the data, event code 2201 followed event code 2003 in 76.9 percent of the cases and event code 2001 in 84.6 percent of cases. Because event code 2201 indicates a failure that makes the train stop and leads to delays, this information can be used to optimize maintenance planning: if event code 2001 appears, the device related to this code should be inspected as soon as possible to avoid a failure.
Figure 3. A filter applied for event code 2201
Is the train at the station?
As mentioned earlier, the team needed to eliminate data that was recorded while the trains were in a depot for maintenance; the service process triggered sensors in ways that did not reflect the actual behavior of the train systems. The best way to detect events that occurred during maintenance was to use the well-known point-in-polygon algorithm (see Figures 4 and 5) to calculate whether the train’s position at the time an event was recorded corresponded with the location of the maintenance depots. The manufacturer had an existing table with the depot polygons already defined.
As you see in Figure 4, The solution to the point-in-polygon problem is to draw a horizontal line starting at the point in question. If the point is within the polygon, the line will intersect the edges of the polygon an odd number of times.
Figure 4. Point-in-polygon solution
If the horizontal line drawn is aligned with a polygon vertex, then only the segments that are below the line are counted.
Figure 5. Horizontal line aligned with a polygon vertex
Initially, the team thought that the calculation would be performed in a Visual Basic or Java application. But faced with processing 20 million events, the team quickly determined that these solutions would not be fast enough. Instead, they implemented the algorithm in SQL directly in IBM DB2. (Note: IBM DB2 Spatial Extender provides a native function to find whether a given geometry falls within the boundaries of another geometry.)
Implementing point-in-polygon in SQL
To express the point-in-polygon algorithm in SQL, the team first built a pair of functions to transform longitude and latitude coordinates into mathematical X and Y planar coordinates:
- X (longitude) implemented as GetX SQL function:
CASE WHEN LonDeg <= 180 THEN -1 * (LonDeg + (DOUBLE(LonMin) + DOUBLE(LonSec)/100)/60) ELSE (360 - LonDeg - (DOUBLE(LonMin) + DOUBLE(LonSec)/100)/60) END
- Y (latitude) implemented as GetY SQL function:
CASE WHEN LatDeg <= 90 THEN -1 * (LatDeg + (DOUBLE(LatMin) + DOUBLE(LatSec)/100)/60) ELSE ((LatDeg + (DOUBLE(LatMin) + DOUBLE(LatSec)/100)/60)-90) END
Next, the team built a table of the segments of the polygon. They created a POLYSEG
table with the polygon segment coordinates X1,Y1 and X2,Y2, keeping track of the
segment order. They used a
CASE WHEN … THEN ELSE END
structure to detect whether a given point was actually the first point in the table.
If it was, then it became the endpoint for the final segment. Otherwise, the code
moved on to the next point:
Listing 1. POLYSEG table
SELECT p1.DepotID,p1.DepotName,p1.CoordinateRank,p2.GPSOrder, GetX(p1.LonDeg,p1.LonMin,p1.LonSec), GetY(p1.LatDeg,p1.LatMin,p1.LatSec), FROM POLYGONS p1,POLYGONS p2 WHERE p1. DepotID=p2. DepotID AND p2.GPSOrder=CASE WHEN p1.CoordinateRank=(SELECT MAX(p3.CoordinateRank) FROM POLYGONS p3 WHERE p3. DepotID =p1.DepotID) THEN 1 ELSE p1.DepotID +1 END;
To detect the number of intersections to the right from a point that is not aligned with a vertex, start with the formula of the coordinate where the horizontal line intersects a segment (Xi):
Figure 6. Horizontal line intersecting a segment
For a given horizontal line that intersects the segment, use X<=Xi, or:
To avoid any division by zero or overflow, transform all division into multiplication and check for the sign (horizontal lines with Y2=Y1 do not count in the algorithm):
If (Y2-Y1)>0 then X*(Y2-Y1)<=X1*(Y2-Y1)+ (Y-Y1)*(X2-X1)
If (Y2-Y1)<0 then X*(Y2-Y1)>=X1*(Y2-Y1)+ (Y-Y1)*(X2-X1)
WHERE clause to count a segment that is
intersected is the following:
Listing 2. WHERE clause table
WHERE Y2<>Y1 AND (((Y2-Y1)*(TEMP.Y-Y1)>0 AND TEMP.X*(Y2-Y1)*(TEMP.Y-Y1)<=X1*(Y2-Y1)*(TEMP.Y-Y1)+X2-X1) OR ((Y2-Y1)*(TEMP.Y-Y1)<0 AND TEMP.X*(Y2-Y1)*(TEMP.Y-Y1)>=X1*(Y2-Y1)*(TEMP.Y-Y1)+X2-X1)) AND (TEMP.Y>Y1 OR TEMP.Y>Y2) AND (TEMP.Y<Y1 OR TEMP.Y<Y2)
To deal with the points on the line aligned with a segment vertex, clauses that count only the segments that are lower than the line are necessary:
OR (TEMP.X<TEMP.X1 AND TEMP.Y=Y1 AND Y1>Y2)
OR (TEMP.X<TEMP.X2 AND TEMP.Y=Y2 AND Y2>Y1)
Removing events from maintenance depots
To actually count the intersections, identify the points that have an odd number of
intersections, and then filter out those points using
GROUP BY and
HAVING clauses. Notice that
COUNT is an integer. In an odd number division, it
does not retain the decimal part; this is why it can be used to test for an odd
GROUP BY EVENT.ID, P.DepotName HAVING
Finally, the team needed to assemble everything into a table and update the events table with the depot location name or null. The full SQL is:
Listing 3. Full SQL code
CREATE OR REPLACE PROCEDURE UpdateGPSLocation BEGIN DECLARE SQLCODE INTEGER; DECLARE nbrows INTEGER DEFAULT 0; DECLARE batchsize INTEGER DEFAULT 10000; DECLARE t1 TIMESTAMP; DECLARE t2 TIMESTAMP; SET T1=CURRENT TIMESTAMP; loop: LOOP CALL DBMS_OUTPUT.PUT_LINE('Updating '||nbrows||' rows...'); UPDATE EVENTS SET X = 1, Location = (SELECT P.LOCATIONNAME FROM POLYSEGS P, TABLE(values(CASE WHEN LonDeg <= 180 THEN -1 * (LonDeg + (DOUBLE(LonMin) +DOUBLE(LonSec)/100)/60) ELSE (360 - LonDeg - (DOUBLE(LonMin) + DOUBLE(LonSec)/100)/60) END, CASE WHEN LatDeg <= 90 THEN -1 * (LatDeg + (DOUBLE(LatMin) + DOUBLE(LatSec)/100)/60) ELSE ((LatDeg + (DOUBLE(LatMin) + DOUBLE(LatSec)/100)/60)-90) END )) as TEMP(X,Y) WHERE (Y2<>Y1 AND (((Y2-Y1)*(TEMP.Y-Y1)>0 AND TEMP.X*(Y2-Y1)*(TEMP.Y-Y1)<=X1*(Y2-Y1)*(TEMP.Y-Y1)+X2-X1) OR ((Y2-Y1)*(TEMP.Y-Y1)<0 AND TEMP.X*(Y2-Y1)*(TEMP.Y-Y1)>=X1*(Y2-Y1)*(TEMP.Y-Y1)+X2-X1)) AND (TEMP.Y>Y1 OR TEMP.Y>Y2) AND (TEMP.Y<Y1 OR TEMP.Y<Y2)) OR (TEMP.X<TEMP.X1 AND TEMP.Y=Y1 AND Y1>Y2) OR (TEMP.X<TEMP.X2 AND TEMP.Y=Y2 AND Y2>Y1) GROUP BY TMSEVENTS.ID, P.LOCATIONNAME HAVING COUNT(*)-2*(COUNT(*)/2)=1) WHERE ID IN (SELECT T.ID FROM EVENTS as T WHERE T.X <> 1 OR T.X IS NULL AND T.LatMin IS NOT NULL FETCH FIRST 10000 ROWS ONLY); IF SQLCODE = 100 THEN LEAVE loop; END IF; COMMIT; SET nbrows=nbrows + batchsize; END LOOP; SET T1=CURRENT TIMESTAMP; CALL DBMS_OUTPUT.PUT_LINE('T1: '||T1||' T2:'|| T2); END @ CALL UpdateGPSLocation
The final code updates the 20-million-row table in less than one hour, identifying
the events that happen in a maintenance depot. Because this update is performed on a
monthly basis, this performance is acceptable. Additional optimization with
appropriate indexes and
EXPLAIN command usage could be
used to increase performance.
Keep the trains running
The team demonstrated that IBM SPSS Modeler provides the capability to automate failure predictions in real time by calculating confidences on the fly and generating alerts when needed. In addition, DB2 provides advanced SQL features to help handle complex queries, including geospatial data and complex computations. To see the real-world benefits, consider the money saved by reducing the need for emergency repairs. Or think of the lessons learned that can be used to make the next generation of trains more reliable. Or perhaps, simply contemplate the railway traveler: watching the landscape smoothly unfurl while the train rushes quietly to the next exciting destination.
- 1 Rakesh Agrawal and Ramakrishnan Srikant, Mining Sequential Patterns, (San Jose, California: IBM, 1995)
- Learn more about DB2 at the DB2 page on developerWorks.
- Get more information about IBM SPSS predictive analytics software.
- Download a free trial version of DB2 for Linux, UNIX, and Windows.
Dig deeper into Big data and analytics on developerWorks
Get samples, articles, product docs, and community resources to help build, deploy, and manage your cloud apps.
Crazy about Big data and analytics? Sign up for our monthly newsletter and the latest Big data and analytics news.
Software development in the cloud. Register today to create a project.
Evaluate IBM software and solutions, and transform challenges into opportunities.