Topic
11 replies Latest Post - ‏2013-10-02T06:01:15Z by FrankRojas
FrankRojas
FrankRojas
23 Posts
ACCEPTED ANSWER

Pinned topic How to classify & remove 5% and 95% tiles outliers?

‏2013-07-08T06:15:49Z |

Looking for good tip.

I have a statistic table consisting of millions of events having 3 key fields PROD (Productivity), TYPE (4 classes)  and SIZE (binned into 3 classifications).

We want to remove the events whose PROD falls in the 5% and 95% tiles from each sample set of TYPE - SIZE  (12 total sets) .... and use the remaining events to do analysis.

So I'd like to see if anyone has a better way to do this in the modeler.   Below I provide high level flow for current method (w/o modeler) and the approach I am taking with modeler.

I am looking to see if anyone has a better way then the modeler approach of FOR loops... thing

  1. Binning does not allows me to bin with catigorical variables.
  2. How do I get  SPSS EXAMINE to work on Modeler so I can use the output table.

CURRENT METHOD  (non modeler - using python scripts)

1. I invoke SPSS EXAMINE on all events":  PROD  by TYPE X SIZE to get the 5% and 95% PROD limits for 12 sets.

2. Extract the 5% and 95% limits from the EXAMINE report

3. Build a SPSS script that classifies each event for each of the 12 sets
         if  (TYPE = t and SIZE = s and   [ 5%_Limit < PROD < 95%_LIMIT] ) then classify as OUTLIER else classify as INTERIER.

4. Execute SPSS script across all events.

APPROACH USING MODELER  (under development)

using standalone script to build the stream dynamically.

create statisticnode   [containing all the events]
create "NewNode"  -- it is an empty statistic Node (used to append all classification)
for t in TYPES
    create selectnode t_NODE   [ TYPE = t ,  connected  from statisticnode]
    for s in SIZES
        create selectnode s_NODE   [SIZE = s;  connected  from t_NODE]
        BIN into quintile (20 tiles)
        reclassify tiles (1 and 20) into OUTLIER and all else into INTERIOR
        appendnode (reclassifynode into NewNode)

 Let me know if someone has a better method.

  • FrankRojas
    FrankRojas
    23 Posts
    ACCEPTED ANSWER

    Re: How to classify & remove 5% and 95% tiles outliers?

    ‏2013-07-09T11:23:58Z  in response to FrankRojas

    Is there a way to get a CUSTOM TABLE with PTILE in modeler?

    I found that I can use statisticsoutputnode to build custom tables that produces a report of Type X Size and PTILE 05 and PTILE 95.   

    Below is statisitc CTABLE command I am trying to do via Modeler.  Is there
    (Prod = Productivity)

    * Custom Tables.
    CTABLES
      /VLABELS VARIABLES=Type3 SIZE Prod DISPLAY=LABEL
      /TABLE Type3 [C] > SIZE [C] > Prod [S][PTILE 05, PTILE 95]
      /CATEGORIES VARIABLES=Type3 SIZE ORDER=A KEY=VALUE EMPTY=EXCLUDE.

    • TedFischer
      TedFischer
      248 Posts
      ACCEPTED ANSWER

      Re: How to classify & remove 5% and 95% tiles outliers?

      ‏2013-07-10T15:51:30Z  in response to FrankRojas

      If you need separate bins for multiple types of data, then using a script is the best way -- that is what scripting was built for.

      If you want to get the results of your Custom Table into Modeler you need to save the results (I think the OMS command will do that) and then reimport into Modeler.  Statistics output generally cannot be fed directly into another Modeler node.

      • FrankRojas
        FrankRojas
        23 Posts
        ACCEPTED ANSWER

        Re: How to classify & remove 5% and 95% tiles outliers?

        ‏2013-07-10T20:23:32Z  in response to TedFischer

        I am using CLEM scripts to do this.    But I was I trying to avoid loops... I have it working with loops and several transforms.... not easy.

        I am familiar with OMS with SPSS statistics on Windows, Ie. .   When we switched to Linux we want to stick to using Modeler only, ie. invoke Statistic nodes. 

        Below is python script invoking spssaux -- my goal is to do this within Modeler
        ----------------------------------------------------------------

        cmd_exm_template =  r"""
        * Custom Tables.
        CTABLES
          /VLABELS VARIABLES=Type3 Size2 Prod_W_T DISPLAY=LABEL
          /TABLE Type3 [C] > Size2 [C] > Prod_W_T [S][PTILE 05, PTILE 95]
          /CATEGORIES VARIABLES=Type3 Size2 ORDER=A KEY=VALUE EMPTY=EXCLUDE.
        """
        datasetName, err = spssaux.CreateDatasetOutput(
            cmd_exm_template,
            omsid='CTables',
            subtype='Custom Table')

        # 1. get data from spssaux
        data = spssdata.Spssdata().fetchall()

        s=""
        for row in data:
            type = str(row[4])[0]
            if ( type == "9" ):
                        continue                #
            cmd_trim = cmd_trim + build_trim_cmd(str(row[4]), str(row[6]), str(row[8]), str(row[9]))

        cmd_trim = cmd_trim_pre + cmd_trim + cmd_trim_post

        SPSS(cmd_trim)    # A derive new variable based on PTILE(5%) and PTILE(95%)

         

  • JonPeck
    JonPeck
    13 Posts
    ACCEPTED ANSWER

    Re: How to classify & remove 5% and 95% tiles outliers?

    ‏2013-07-10T17:29:53Z  in response to FrankRojas

    Ted has replied to the "how" question, but you should be aware that the usual statistical inference procedures are likely to be misleading if you have truncated your data as described.  And with such a large dataset, under reasonable assumptions outliers are not likely to overly influence your results if you leave them in.

    As for CTABLES, if you just want the tables as they would be in Statistics, using a Statistics node will generate an spv file similar to what you would get from Statistics.  If you want the table as data, it would be easiest to save it using OMS in SAV format and read it as a new input in Modeler.

    • FrankRojas
      FrankRojas
      23 Posts
      ACCEPTED ANSWER

      Re: How to classify & remove 5% and 95% tiles outliers?

      ‏2013-07-10T20:28:56Z  in response to JonPeck

      Jon, thanks...

      A couple of points

      1.  the data is skewed so including of outlier does have an impact.    Esp as we are using this for negotiation on cost factors.   There has been a LOT of debate on this within the community so I don't want to re-open that item....

      2. Can modeler read in an spv file?

      3. As I shared with Ted, we're trying to move to a linux environment with only modeler installed... so it is not clear how I would get OMS output via modeler... (don't think this is possible.)..

      Thanks...
       

      • JonPeck
        JonPeck
        13 Posts
        ACCEPTED ANSWER

        Re: How to classify & remove 5% and 95% tiles outliers?

        ‏2013-07-10T20:46:55Z  in response to FrankRojas

        1. Deskewing by transformations such as logs or square roots might help, but the methodology is, of course, entirely up to you.

        2. Modeler cannot read an spv file.  It isn't really a data source format.

        3. The bigger problem is that you could not run the CTABLES command without Statistics.  If you could, you could also use OMS.  The CTABLES output, though, could be saved in any of a wide variety of formats by using OMS.

        Regards,

      • TedFischer
        TedFischer
        248 Posts
        ACCEPTED ANSWER

        Re: How to classify & remove 5% and 95% tiles outliers?

        ‏2013-07-12T20:15:00Z  in response to FrankRojas

        As Jon noted, you do need Statistics installed on the same computer as Modeler is installed to use Statistics output nodes.  However, if you obtain Statistics server, you can run it on the same Linux environment you have Modeler server installed.

        If getting Statistics Server is not an option, and you really do not want to do scripting, then it is actually possible through a complex series of nodes to compute a percentile.  You would need to basically need to sort the data, compute the number of observations in total, join it to the sorted data in a cartesian join, and then keep a running count of the number of records making sure to handle ties.  I think that creating a script to bin each group of data separately would be easier.

  • FrankRojas
    FrankRojas
    23 Posts
    ACCEPTED ANSWER

    Re: How to classify & remove 5% and 95% tiles outliers?

    ‏2013-07-13T09:52:47Z  in response to FrankRojas

    Thanks Ted/Jon!  

    1. I have a script working that selects each group, then bins, and then classifies based on PTILE5 and PTILE95 (quintile 1 and 20).
            Unfortunately, I had to use temp files (.sav) for each group... then have to merge them all in.   It works though.
           The scripting capability did not allow me to merge the 12 nodes all at once... (I believe it is capacity issues... ie. it hung... may need to revisit this) .

    2. Been looking at where the custom table PTILE5 and PTILE95 could be implemented within Modeler.

    The aggregatenode allows one to get the [Sum Mean Min Max SDev] for a aggregated group per the spec.
    On the modelerclient it says it supports [Sum, Mean Min Max SDev Median Count Variance "1st Quartile"  "3rd Quartile" ]

    It would be nice  if aggregatenode were to support  PTILE5 and PTILE95 options. (and all the Custom Table options)  then this would be very helpful.

    Thanks again...

    Updated on 2013-07-13T10:04:21Z at 2013-07-13T10:04:21Z by FrankRojas
  • This reply was deleted by muneharu 2013-09-30T06:58:31Z.
  • muneharu
    muneharu
    1 Post
    ACCEPTED ANSWER

    Re: How to classify & remove 5% and 95% tiles outliers?

    ‏2013-09-30T08:07:28Z  in response to FrankRojas

    Hi,

    Please consider attached stream. 5-95% as below

     

    Attachments

    • This reply was deleted by FrankRojas 2013-10-02T04:31:36Z. Reason for deletion: Incorrect formula
    • FrankRojas
      FrankRojas
      23 Posts
      ACCEPTED ANSWER

      Re: How to classify & remove 5% and 95% tiles outliers?

      ‏2013-10-02T05:59:34Z  in response to muneharu

      Thank you very much!   I converted your stream into a script... see attachment below.

      Please note that I am trying to identify the outlier events (ie. furthest away from the mean).   The PTILE5 and PTILE95 represent the limits of the events closest to the "sampled mean"....

      In the example stream, it generates 49 random values between 0 and 1000.   Assuming a normal distribution we would hope to find:

      • the PTILE5 value  =  50 and
      • the PTILE95 value = 950 .

      Thus events with PROD < PTILE5 and PTILE95 would be trimmed.

      Thus I added a QTILE_FACTOR to the classification function within the DIST_CLASS derivenode formula:

      set ^n.formula_expr = 'if ( PROD < (@GLOBAL_MEAN(PROD) - (1.96*STD_ERROR*' >< ^QTILE_FACTOR >< '))) then "TRIM" else if (  PROD > (@GLOBAL_MEAN(PROD) + (1.96*STD_ERROR*' >< ^QTILE_FACTOR >< '))) then "TRIM" else "NORM" endif endif'

      When I set QTILE_FACTOR = 5.15 I get a good approximation of a PTILE5 and PTILE95 values expected.   (Yet this depends on the distribution of the random #'s.)

      THANKS AGAIN!!!   Good approach.

      Attachments

  • FrankRojas
    FrankRojas
    23 Posts
    ACCEPTED ANSWER

    Re: How to classify & remove 5% and 95% tiles outliers?

    ‏2013-10-02T06:01:15Z  in response to FrankRojas

    One comment, I deal with data sets having millions of events.  

    Based on the sample stream provided by muneharu...

    The creation of DUMMY followed by setglobalsnode causes 2 passes thru the data...it would be nice to have a count value as part of the setglobalsnode (simliar to the aggregatenode).