5 replies Latest Post - ‏2013-09-10T05:30:09Z by RakeshPimplikar
VijayGabale

Pinned topic Problem 1

‏2013-07-18T07:01:13Z |

Do you like analyzing data? Do you like coding algorithms for such analysis? Here is a problem faced by today's top online merchants that can be solved using algorithmic and classification techniques. Apply your analytical skills, crack the problem, and create mechanisms for secure online transactions.

 

Technological advancements in e-commerce have made life very comfortable: customers can buy almost anything while sitting in front of a computer. Doing so requires them to use payment channels such as internet banking or credit cards. If a customer is a novice and does not adhere to security guidelines, hackers or thieves can use these payment modes to commit fraud, causing huge monetary loss to the customer. Today's online transaction systems allow this because it appears as if a genuine user is making all the transactions. To prevent this, the system has to be intelligent enough to detect any anomaly in a transaction being made.

 

Approach 1: One solution to this problem is to use outlier detection techniques from data mining to identify transactions that are not similar to the ones a customer usually makes. Such a detection can signal the system to block the transaction and have it verified before it is executed.
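As a rough illustration of Approach 1, the Python sketch below flags a transaction when any of its attributes lies far from the typical value of that attribute. It uses a per-feature modified z-score (median and median absolute deviation). This technique, the function name `flag_outliers`, the threshold of 3.5, and the toy rows are all assumptions chosen for illustration; the problem statement does not prescribe any particular outlier method, and the contest's stated platform is Weka rather than Python.

```python
from statistics import median


def flag_outliers(rows, threshold=3.5):
    """Label each row 'anomalous' or 'clean' using a per-feature modified z-score.

    rows: list of equal-length lists of numeric feature values (hypothetical layout).
    threshold: cutoff on the modified z-score (3.5 is a common rule of thumb,
    assumed here, not taken from the problem statement).
    """
    n_features = len(rows[0])
    centers, scales = [], []
    for j in range(n_features):
        column = [row[j] for row in rows]
        med = median(column)
        mad = median([abs(value - med) for value in column])
        centers.append(med)
        scales.append(mad)

    labels = []
    for row in rows:
        score = 0.0
        for value, med, mad in zip(row, centers, scales):
            if mad > 0:  # features with zero spread are skipped in this sketch
                score = max(score, 0.6745 * abs(value - med) / mad)
        labels.append("anomalous" if score > threshold else "clean")
    return labels


# Toy data: the last row has an unusually large first attribute.
toy_rows = [
    [1.0, 10.0],
    [1.1, 10.2],
    [0.9, 9.8],
    [1.0, 10.1],
    [1.2, 9.9],
    [50.0, 10.0],
]
toy_labels = flag_outliers(toy_rows)
```

Scoring each attribute separately ignores correlations between attributes, so this is only a baseline to compare other techniques against.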

 

Approach 2: Now suppose you have a list of historical transactions in which every transaction is labeled as 'fraudulent' or 'clean'. This information enables us to use several well-known classification techniques from data mining to flag every new transaction as 'fraudulent' or 'clean'. The classification approach is expected to work better than simple outlier detection, but this may not hold in practice, because fraud detection problems are characterized by huge class imbalance and very limited labeled data. Both properties introduce large errors into classification predictions. Class imbalance means that the ratio of fraudulent transactions to clean transactions is very small, say of the order of 0.1%. Limited labeled data means that only a few transactions (say of the order of 0.5%) in the historical data are labeled.
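To make the class imbalance concrete: at a fraud rate of about 0.1%, a classifier that labels everything 'clean' is about 99.9% accurate while catching no fraud at all. The Python sketch below computes that baseline accuracy and one common counter-measure, inverse-frequency class weights. The weighting formula and the function names are assumptions for illustration, not something the problem statement specifies.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive large weights, so a misclassified fraudulent
    transaction costs more than a misclassified clean one.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Synthetic labels with the ~0.1% fraud rate mentioned above.
example_labels = ["clean"] * 999 + ["fraudulent"]
baseline = majority_baseline_accuracy(example_labels)
weights = balanced_class_weights(example_labels)
```

This is why the judging is based on the F1 score for the rare label rather than on accuracy.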

 

Problem: We provide 2 test datasets of transactions, where every transaction is associated with a transaction id (TXN_ID) and a set of attributes referred to as F1, F2, ..., F38. Your task is to assign appropriate labels to every transaction in these datasets, as guided below.

 

Elimination Round: Use approach 1 (outlier detection) and assign a label ('anomalous' or 'clean') to every transaction in outlier_dataset.csv. Outlier transactions should be flagged as 'anomalous'. You can either use any existing outlier detection technique or come up with a new technique to achieve the objective.

 

Final Round: Use approach 2 (classification) and assign a label ('fraudulent' or 'clean') to every transaction in the test dataset classification_dataset._test.csv. You are provided with a training dataset, classification_dataset_train.csv, in which the true labels for all transactions are given. You can use the training data to guide your classification model, and you can use existing classification techniques. However, keep in mind the peculiar properties of transaction data, i.e., huge class imbalance and very limited labeled data. You may want to design your own classifier to solve this problem.

 

Test Cases: Test dataset for elimination round: outlier_dataset.csv; training dataset for final round: classification_dataset_train.csv; test dataset for final round: classification_dataset._test.csv

 

Development platform: Weka (http://www.cs.waikato.ac.nz/ml/weka/). It implements all standard data mining algorithms, so you can plug in data and get results. The tool also allows you to design your own algorithm.

 

Deliverables:

1. In both rounds, you have to generate a file solution.csv containing a label for every transaction in the test dataset. The format of the file should be as follows. Do not add any other information to this file, not even headers.

<TXN_ID>, <label>

<TXN_ID>, <label>

<TXN_ID>, <label>

…..
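A minimal Python sketch of producing a file in this format is shown below. It follows the template literally, including the space after the comma; whether the judges' parser needs that space is not stated, so treat it as an assumption. The function names and the transaction ids `T1` and `T2` are hypothetical.

```python
def format_solution(predictions):
    """Render (txn_id, label) pairs as '<TXN_ID>, <label>' lines, with no header."""
    return "".join(f"{txn_id}, {label}\n" for txn_id, label in predictions)


def write_solution(predictions, path="solution.csv"):
    """Write the formatted predictions to path, one transaction per line."""
    with open(path, "w", newline="") as handle:
        handle.write(format_solution(predictions))


# Hypothetical transaction ids, for illustration only.
example_text = format_solution([("T1", "clean"), ("T2", "anomalous")])
```

Calling `write_solution(predictions)` then creates solution.csv in the current directory.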

 

2. You also have to submit a report in the final round describing the techniques/algorithms you used in both rounds.

 

Judging criteria: We will compare your labels with the true labels we have. The score of every submission will be based on the F1 score for the 'anomalous' label in the Elimination Round and for the 'fraudulent' label in the Final Round. This link describes how to compute the F1 score: http://en.wikipedia.org/wiki/F1_score
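For reference, the F1 score for one label is the harmonic mean of precision and recall for that label. The Python sketch below computes it from two label lists; the function name and the toy labels are illustrative, and returning 0.0 when there are no true positives is a convention assumed here.

```python
def f1_for_label(true_labels, predicted_labels, positive):
    """F1 score for the `positive` label, e.g. 'anomalous' or 'fraudulent'."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # assumed convention: no true positives means F1 = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: one hit, one false alarm, one miss.
truth = ["fraudulent", "clean", "fraudulent", "clean", "clean"]
guess = ["fraudulent", "fraudulent", "clean", "clean", "clean"]
toy_f1 = f1_for_label(truth, guess, "fraudulent")
```

Note that labelling every transaction 'clean' gives an F1 of 0 for the rare label, however high the accuracy.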

 

1. Score of elimination round (20% weightage)

2. Score of final round (60% weightage)

3. Techniques/Algorithms described in report (20% weightage)

Updated on 2013-08-06T19:55:45Z by RakeshPimplikar
  • RakeshPimplikar

    Submission Format for Problem 1

    ‏2013-08-06T18:26:50Z  in response to VijayGabale

    Hi All,

    You must have received an email from the NTC 2013 organizers about what to submit and in what format. Here are explicit instructions for submitting Problem 1 solutions.

    Elimination Round:

    elimination-<Problem No>-<Your Team Name>-<Your Institute>.zip/tar

    (Example, elimination-2-warmachinerocks-iitb.zip)

    |- code
    |  |- solution.csv
    |
    |- logic (leave this folder blank)
    |- instructions (leave this folder blank)
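One possible way to build such an archive is sketched below with Python's standard `zipfile` module. It assumes the three folders sit at the top level of the archive, which the layout above suggests but does not state outright; the function name is hypothetical.

```python
import zipfile


def build_elimination_archive(archive, solution_csv_text):
    """Create the elimination-round archive layout.

    archive: a file path or a binary file-like object.
    solution_csv_text: the full contents of solution.csv as a string.
    """
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.writestr("code/solution.csv", solution_csv_text)
        # Names ending in '/' are stored as empty directory entries.
        bundle.writestr("logic/", "")
        bundle.writestr("instructions/", "")
```

For example, `build_elimination_archive("elimination-2-warmachinerocks-iitb.zip", text)` would write the archive named in the example above.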

     

    Final Round:

    final-<Problem No>-<Your Team Name>-<Your Institute>.zip/tar 

    (Example, final-2-warmachinerocks-iitb.zip)

    |- code
    |  |- solution.csv
    |
    |- logic
    |  |- logic.pdf / logic.doc / logic.txt (This file should contain the techniques/algorithms/logic that you used for both rounds.)
    |- instructions (leave this folder blank)

     

     

    Email the zip/tar file to us at urirl@in.ibm.com

     

    Updated on 2013-08-06T18:29:08Z by RakeshPimplikar
    • meniman

      Re: Submission Format for Problem 1

      ‏2013-08-08T03:41:52Z  in response to RakeshPimplikar

      Sir, so for the first round we just need to send solution.csv and the code files, leaving the other two folders blank?

       

       

      • RakeshPimplikar

        Re: Submission Format for Problem 1

        ‏2013-08-10T18:47:21Z  in response to meniman

        You need to send only solution.csv, not even code files.

  • RakeshPimplikar

    Validation Dataset for Elimination Round

    ‏2013-08-11T12:14:59Z  in response to VijayGabale

    Hi All,

    Once your outlier detection algorithm for the elimination round is ready, you can use the attached outlier_dataset_validation.csv to validate it. The better the F1 score on the validation dataset, the better the quality of the algorithm.

    How to validate?

    1. Run your outlier detection algorithm on the dataset in the attached outlier_dataset_validation.csv and compare your predicted labels with the true labels given in the file.
    2. Compute the F1 score.
    3. Modify your algorithm and repeat from step 1 until you are satisfied with the F1 score.
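If your algorithm produces a numeric anomaly score per transaction, the loop above can be partly automated by trying several cutoffs and keeping the one with the best F1 score on the validation labels. The Python sketch below assumes such scores exist (higher meaning more anomalous); the function names and toy values are hypothetical.

```python
def f1_for_label(true_labels, predicted_labels, positive):
    """F1 score for the `positive` label (0.0 when there are no true positives)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, true_labels, positive="anomalous", negative="clean"):
    """Return (best_f1, threshold), flagging rows whose score >= threshold."""
    best = (0.0, None)
    for threshold in sorted(set(scores)):
        predicted = [positive if s >= threshold else negative for s in scores]
        f1 = f1_for_label(true_labels, predicted, positive)
        if f1 > best[0]:
            best = (f1, threshold)
    return best


# Toy validation data: two anomalous rows with clearly higher scores.
val_scores = [0.1, 0.2, 0.9, 0.8, 0.3]
val_truth = ["clean", "clean", "anomalous", "anomalous", "clean"]
result = best_threshold(val_scores, val_truth)
```

A cutoff tuned this way fits the validation labels only, so the F1 score on outlier_dataset.csv may turn out lower.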

    IMPORTANT: The validation dataset should be used only to improve the quality of your algorithm. Eventually you have to submit solution.csv for outlier_dataset.csv, which is provided inside NTC_2013_Fraud_Detection_Datasets.zip.

     


  • This reply was deleted by jigarkb 2013-09-11T16:37:14Z.
  • RakeshPimplikar

    Format for Logic.pdf for Final Round Submission

    ‏2013-09-10T05:30:09Z  in response to VijayGabale

    Please follow the document structure below when preparing logic.pdf for the final round submission.

    Title: NTC 2013 - Problem 1 - Fraud Detection

    1. Outlier Detection

        1.1 Explain the step-by-step logic you used to come up with solution.csv for the elimination round.

        1.2 Provide pseudocode for any new algorithms that you came up with. There is no need to provide pseudocode for standard algorithms.

    2. Classification

        2.1 Explain the step-by-step logic you used to come up with solution.csv for the final round.

        2.2 Provide pseudocode for any new algorithms that you came up with. There is no need to provide pseudocode for standard algorithms.

     

    Formatting Instructions:

    - Number of columns in every page: 1 (single column pages)

    - Maximum number of pages: 10

    - Font: Times New Roman

    - Font size: 12 for normal text, 14 or 16 for heading/titles

    - You may include images if they are necessary to explain the logic.

    Updated on 2013-09-10T05:34:54Z by RakeshPimplikar