Modeling trains with SPSS

How sequential analysis can predict the future, and why DB2 might be smarter than you think

The ability to predict a failure before it happens is tremendously valuable. Learn how an IBM team worked with a train manufacturer to develop this capability using IBM SPSS Modeler and IBM DB2. In this instance, the team used SPSS to identify patterns in data and DB2 to handle moderately complex mathematical calculations directly within SQL.

This article was originally published in IBM Data magazine.

Share:

Gregory Neuveglise, Predictive Analytics Solution Architect, IBM

Gregory Neuveglise, predictive analytics solution architect at IBM, has 10 years of experience in implementing business analytics. He specializes in customer relationship optimization and predictive maintenance.



Marc Fiammante, Distinguished Engineer, Europe Software Group Solutions Chief Architect, IBM

Marc Fiammante is an IBM Distinguished Engineer Software Solutions Chied Architect for Europe in the office of the Vice President and CTO of the IBM Software Group for Europe, where he helps large organizations with their SOA and BPM transformations. He is a member of the International Standards Organization (ISO) Standards Committee (SC) 38, which consolidates and creates standards on web services, SOA and cloud. In this organization, Marc is one of the five global co-editors of ISO SC38 "General Technical Principles of Service Oriented Architecture." He is an IBM Master Inventor, a recognized thought leader and innovator, and is a member of the IBM Academy of Technology.



27 July 2011

Also available in Chinese

By their nature, trains invite imagination. They conjure visions of exotic travel and long-distance commerce. But trains are also large, complicated, and expensive to repair when they break unexpectedly. For a railway operator, the ability to predict a failure before it happens would be tremendously valuable.

Recently, an IBM® team worked with a European train manufacturer to develop exactly this capability, using IBM SPSS® Modeler and IBM DB2® to analyze data generated by the trains. This article will show how to use SPSS to identify patterns in data. It also will show how IBM DB2 made an unexpected contribution to the project—handling a set of moderately complex mathematical calculations directly within SQL.

Predictions from statistics

On a modern train, more than 1,000 different types of mechanical and electrical events are recorded, including operational events such as “opening door” or “train is braking,” warning events such as “line voltage frequency out of range” or “compression is low in compressor X,” and failure events such as “pantograph is out of order” or “inverter lockout.” More than 2,000 events are collected on each train every day.

IBM proposed to use sequential analysis to determine whether any of these events or event sequences indicate that a more severe problem is about to occur. Sequential analysis uses statistical modeling to detect patterns in data. For example, it can show that if particular grocery store customers purchase baby formula and diapers one day, they are highly likely to purchase stewed fruit (baby food) at some point in the future.


Sequential analysis of events

The IBM team started with the raw event data for 2010: a 20-million-row data set containing the event time, event code, and the GPS location of the event. The team used the sequence node algorithm in IBM SPSS Modeler to analyze the train data. The sequence analysis is a two-stage process for sequential pattern mining 1. First, the algorithm mines the data for frequent sequences and computes an adjacency lattice. The user can then explore the lattice by using specified criteria, such as support and confidence bounds, or by placing restrictions on the antecedent sequence.

Using the IBM SPSS Modeler workbench (see Figure 1), the team defined the input table in a DB2 data source node, added a Select node to remove data that occurred while the trains were in maintenance depots, and defined the input columns in the sequence node algorithm. The team also set the time period for which the sequences were to be studied, specifying that only events occurring in a two-day timeframe were to be considered.

As shown on the screen in Figure 1, the ID field indicates which train is related to a sequence of events, the Time field describes the sequence order, and the Content field contains the code of the event that has occurred.

Figure 1. ID field, Time field, Content field.
Screen capture from SPSS Modeler

The SPSS Modeler rule sets interface shows the rules found by SPSS after the analysis has been completed (see Figure 2). The highlighted rule displays an antecedent sequence (21382, 29203) followed by a consequent sequence (in this case a single event: 18232), with a support rating of 55.7 percent and a confidence rating of 93.1 percent. This means that during 2010, this antecedent sequence appeared in 55.7 percent of trains and was followed by the consequent sequence 93.1 percent of the time.

Figure 2. Rule sets interface
The rule sets interface, showing the results of the analysis. The sequence node found 12,358 rules.

Once the rules have been calculated, the data can be explored using the SPSS Modeler filtering tools (see Figure 3). According to the data, event code 2201 followed event code 2003 in 76.9 percent of the cases and event code 2001 in 84.6 percent of cases. Because event code 2201 indicates a failure that makes the train stop and leads to delays, this information can be used to optimize maintenance planning: if event code 2001 appears, the device related to this code should be inspected as soon as possible to avoid a failure.

Figure 3. A filter applied for event code 2201
A filter applied for event code 2201. One rule shows that 87.2 percent of the time, event code 2201 was preceded by two occurrences of event code 2001.

Is the train at the station?

As mentioned earlier, the team needed to eliminate data that was recorded while the trains were in a depot for maintenance; the service process triggered sensors in ways that did not reflect the actual behavior of the train systems. The best way to detect events that occurred during maintenance was to use the well-known point-in-polygon algorithm (see Figures 4 and 5) to calculate whether the train’s position at the time an event was recorded corresponded with the location of the maintenance depots. The manufacturer had an existing table with the depot polygons already defined.

As you see in Figure 4, The solution to the point-in-polygon problem is to draw a horizontal line starting at the point in question. If the point is within the polygon, the line will intersect the edges of the polygon an odd number of times.

Figure 4. Point-in-polygon solution
The solution to the point-in-polygon problem is to draw a horizontal line starting at the point in question. If the point is within the polygon, the line will intersect the edges of the polygon an odd number of times.

If the horizontal line drawn is aligned with a polygon vertex, then only the segments that are below the line are counted.

Figure 5. Horizontal line aligned with a polygon vertex
If the horizontal line drawn is aligned with a polygon vertex, then only the segments that are below the line are counted

Initially, the team thought that the calculation would be performed in a Visual Basic or Java application. But faced with processing 20 million events, the team quickly determined that these solutions would not be fast enough. Instead, they implemented the algorithm in SQL directly in IBM DB2. (Note: IBM DB2 Spatial Extender provides a native function to find whether a given geometry falls within the boundaries of another geometry.)


Implementing point-in-polygon in SQL

To express the point-in-polygon algorithm in SQL, the team first built a pair of functions to transform longitude and latitude coordinates into mathematical X and Y planar coordinates:

  • X (longitude) implemented as GetX SQL function: CASE WHEN LonDeg <= 180 THEN -1 * (LonDeg + (DOUBLE(LonMin) + DOUBLE(LonSec)/100)/60) ELSE (360 - LonDeg - (DOUBLE(LonMin) + DOUBLE(LonSec)/100)/60) END
  • Y (latitude) implemented as GetY SQL function: CASE WHEN LatDeg <= 90 THEN -1 * (LatDeg + (DOUBLE(LatMin) + DOUBLE(LatSec)/100)/60) ELSE ((LatDeg + (DOUBLE(LatMin) + DOUBLE(LatSec)/100)/60)-90) END

Next, the team built a table of the segments of the polygon. They created a POLYSEG table with the polygon segment coordinates X1,Y1 and X2,Y2, keeping track of the segment order. They used a CASE WHEN … THEN ELSE END structure to detect whether a given point was actually the first point in the table. If it was, then it became the endpoint for the final segment. Otherwise, the code moved on to the next point:

Listing 1. POLYSEG table
SELECT p1.DepotID,p1.DepotName,p1.CoordinateRank,p2.GPSOrder, 
GetX(p1.LonDeg,p1.LonMin,p1.LonSec), 
GetY(p1.LatDeg,p1.LatMin,p1.LatSec), FROM POLYGONS p1,POLYGONS p2 
WHERE p1. DepotID=p2. DepotID AND p2.GPSOrder=CASE 
WHEN p1.CoordinateRank=(SELECT MAX(p3.CoordinateRank) 
FROM POLYGONS p3 WHERE p3. DepotID =p1.DepotID) THEN 1 ELSE p1.DepotID +1 END;

To detect the number of intersections to the right from a point that is not aligned with a vertex, start with the formula of the coordinate where the horizontal line intersects a segment (Xi):

Xi=X1+(Y-Y1)*(X2-X1)/(Y2-Y1)

Figure 6. Horizontal line intersecting a segment
shows horizontal line intersecting a segment

For a given horizontal line that intersects the segment, use X<=Xi, or:

X<=X1+(Y-Y1)*(X2-X1)/(Y2-Y1)

To avoid any division by zero or overflow, transform all division into multiplication and check for the sign (horizontal lines with Y2=Y1 do not count in the algorithm):

If (Y2-Y1)>0 then X*(Y2-Y1)<=X1*(Y2-Y1)+ (Y-Y1)*(X2-X1)

If (Y2-Y1)<0 then X*(Y2-Y1)>=X1*(Y2-Y1)+ (Y-Y1)*(X2-X1)

The resulting WHERE clause to count a segment that is intersected is the following:

Listing 2. WHERE clause table
WHERE Y2<>Y1 AND (((Y2-Y1)*(TEMP.Y-Y1)>0 
AND TEMP.X*(Y2-Y1)*(TEMP.Y-Y1)<=X1*(Y2-Y1)*(TEMP.Y-Y1)+X2-X1)
OR ((Y2-Y1)*(TEMP.Y-Y1)<0 
AND TEMP.X*(Y2-Y1)*(TEMP.Y-Y1)>=X1*(Y2-Y1)*(TEMP.Y-Y1)+X2-X1)) 
AND (TEMP.Y>Y1 OR TEMP.Y>Y2)
AND (TEMP.Y<Y1 OR TEMP.Y<Y2)

To deal with the points on the line aligned with a segment vertex, clauses that count only the segments that are lower than the line are necessary:

OR (TEMP.X<TEMP.X1 AND TEMP.Y=Y1 AND Y1>Y2)

OR (TEMP.X<TEMP.X2 AND TEMP.Y=Y2 AND Y2>Y1)


Removing events from maintenance depots

To actually count the intersections, identify the points that have an odd number of intersections, and then filter out those points using GROUP BY and HAVING clauses. Notice that COUNT is an integer. In an odd number division, it does not retain the decimal part; this is why it can be used to test for an odd number.

GROUP BY EVENT.ID, P.DepotName HAVING COUNT(*)-2*(COUNT(*)/2)=1)

Finally, the team needed to assemble everything into a table and update the events table with the depot location name or null. The full SQL is:

Listing 3. Full SQL code
CREATE OR REPLACE PROCEDURE UpdateGPSLocation
 BEGIN
  DECLARE SQLCODE INTEGER;
  DECLARE nbrows INTEGER DEFAULT 0;
  DECLARE batchsize INTEGER DEFAULT 10000; 
  DECLARE t1 TIMESTAMP;
  DECLARE t2 TIMESTAMP;
  SET T1=CURRENT TIMESTAMP;
  loop: LOOP
  CALL DBMS_OUTPUT.PUT_LINE('Updating '||nbrows||' rows...');
  UPDATE EVENTS
    SET X = 1,
    Location = (SELECT P.LOCATIONNAME FROM POLYSEGS P,
    TABLE(values(CASE WHEN LonDeg <= 180 THEN -1 * 
    (LonDeg + (DOUBLE(LonMin) +DOUBLE(LonSec)/100)/60)
    ELSE (360 - LonDeg - (DOUBLE(LonMin) + DOUBLE(LonSec)/100)/60) END,
    CASE WHEN LatDeg <= 90 THEN -1 * (LatDeg + (DOUBLE(LatMin) + DOUBLE(LatSec)/100)/60)
    ELSE ((LatDeg + (DOUBLE(LatMin) + DOUBLE(LatSec)/100)/60)-90) END
    )) as TEMP(X,Y)
    WHERE (Y2<>Y1 AND (((Y2-Y1)*(TEMP.Y-Y1)>0 AND
    TEMP.X*(Y2-Y1)*(TEMP.Y-Y1)<=X1*(Y2-Y1)*(TEMP.Y-Y1)+X2-X1)
    OR ((Y2-Y1)*(TEMP.Y-Y1)<0 
    AND TEMP.X*(Y2-Y1)*(TEMP.Y-Y1)>=X1*(Y2-Y1)*(TEMP.Y-Y1)+X2-X1))
    AND (TEMP.Y>Y1 OR TEMP.Y>Y2)
    AND (TEMP.Y<Y1 OR TEMP.Y<Y2))
    OR (TEMP.X<TEMP.X1 AND TEMP.Y=Y1 AND Y1>Y2) 
    OR (TEMP.X<TEMP.X2 AND TEMP.Y=Y2 AND Y2>Y1)
    GROUP BY TMSEVENTS.ID, P.LOCATIONNAME
    HAVING COUNT(*)-2*(COUNT(*)/2)=1)
    WHERE ID IN (SELECT T.ID FROM EVENTS as T
    WHERE T.X <> 1 OR T.X IS NULL
    AND T.LatMin IS NOT NULL
    FETCH FIRST 10000 ROWS ONLY);
  IF SQLCODE = 100 THEN
   LEAVE loop;
 END IF;
 COMMIT;
 SET nbrows=nbrows + batchsize;
END LOOP;
SET T1=CURRENT TIMESTAMP;
CALL DBMS_OUTPUT.PUT_LINE('T1: '||T1||' T2:'|| T2);
END
@
CALL UpdateGPSLocation

The final code updates the 20-million-row table in less than one hour, identifying the events that happen in a maintenance depot. Because this update is performed on a monthly basis, this performance is acceptable. Additional optimization with appropriate indexes and EXPLAIN command usage could be used to increase performance.


Keep the trains running

The team demonstrated that IBM SPSS Modeler provides the capability to automate failure predictions in real time by calculating confidences on the fly and generating alerts when needed. In addition, DB2 provides advanced SQL features to help handle complex queries, including geospatial data and complex computations. To see the real-world benefits, consider the money saved by reducing the need for emergency repairs. Or think of the lessons learned that can be used to make the next generation of trains more reliable. Or perhaps, simply contemplate the railway traveler: watching the landscape smoothly unfurl while the train rushes quietly to the next exciting destination.


Footnote

Resources

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Big data and analytics on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Big data and analytics, Information Management
ArticleID=742682
ArticleTitle=Modeling trains with SPSS
publish-date=07272011