Do you like to analyze data? Do you like to code algorithms to do such analysis? Here is a problem faced by today's top online transaction merchants that can be solved using algorithmic and classification techniques. Apply your analytical skills, crack the problem, and be the creator of mechanisms for secure online transactions.
Technological advancements in e-commerce have made life very comfortable: customers can buy almost anything while sitting in front of a computer. Doing so requires them to use payment channels like internet banking or credit cards. If a customer is a novice and does not adhere to security guidelines, hackers or thieves can use these payment modes for fraud, causing huge monetary losses to the customer. Today's online transaction systems allow this, because it appears as if a genuine user is making all the transactions. To prevent it, the system has to be intelligent enough to detect any anomaly in a transaction being made.
Approach 1: One solution to this problem is to use outlier detection techniques from data mining to identify transactions that are not similar to those a customer usually makes. This can signal the system to block the transaction and have it verified before it is executed.
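To make the idea concrete, here is a minimal stdlib-only Python sketch of outlier detection using a simple z-score rule on a single feature. The transaction amounts are made-up toy data, and a real solution would use a multivariate method (and the F1–F38 attributes) rather than one column:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return ["anomalous" if abs(v - mean) / stdev > threshold else "clean"
            for v in values]

# Toy example: amounts a customer usually spends, plus one unusual transaction.
amounts = [20.0, 22.5, 19.0, 21.0, 23.0, 20.5, 950.0]
labels = zscore_outliers(amounts, threshold=2.0)
# The 950.0 transaction is flagged 'anomalous'; the rest are 'clean'.
```

This is only a stand-in for the Weka-based techniques the contest expects; it illustrates the "differs from the usual pattern" intuition behind approach 1.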
Approach 2: Now suppose you have a list of historical transactions in which every transaction is labeled as 'fraudulent' or 'clean'. This information lets us use several well-known classification techniques from data mining to flag every transaction as 'fraudulent' or 'clean'. The classification approach is expected to work better than simple outlier detection, but this may not hold in practice, because fraud detection problems exhibit huge class imbalance and very limited labeled data, both of which introduce large errors in classification predictions. Class imbalance means that the ratio of fraudulent transactions to clean transactions is very small, say of the order of 0.1%. Limited labeled data means that only a few transactions (say of the order of 0.5%) are labeled in the historical data.
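One common (if crude) counter to class imbalance is to oversample the minority class in the training data so the classifier sees both classes equally often. A minimal stdlib sketch, with random duplication standing in for more refined resampling methods:

```python
import random

def oversample_minority(rows, labels, minority="fraudulent", seed=0):
    """Duplicate minority-class rows (with replacement) until both classes
    are the same size -- a crude counter to heavy class imbalance."""
    rng = random.Random(seed)
    minority_idx = [i for i, lab in enumerate(labels) if lab == minority]
    majority_idx = [i for i, lab in enumerate(labels) if lab != minority]
    # Draw extra copies of minority rows to close the gap.
    extra = [rng.choice(minority_idx)
             for _ in range(len(majority_idx) - len(minority_idx))]
    idx = majority_idx + minority_idx + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Toy example: 9 clean rows, 1 fraudulent row -> balanced 9 vs 9.
rows = list("abcdefghij")
labels = ["clean"] * 9 + ["fraudulent"]
bal_rows, bal_labels = oversample_minority(rows, labels)
```

Duplicating rows does not add information, so for the real contest data a Weka cost-sensitive classifier or a synthetic-sampling scheme may be a better fit; this only illustrates the balancing idea.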
Problem: We provide two test datasets of transactions, where every transaction is associated with a transaction id (TXN_ID) and a set of attributes referred to as F1, F2, ..., F38. Your task is to assign an appropriate label to every transaction in these datasets, as described below.
Elimination Round: Use approach 1 (outlier detection) to assign a label ('anomalous' or 'clean') to every transaction in outlier_dataset.csv. Outlier transactions should be flagged as 'anomalous'. You can either use any existing outlier detection technique or come up with a new one to achieve the objective.
Final Round: Use approach 2 (classification) to assign a label ('fraudulent' or 'clean') to every transaction in the test dataset classification_dataset._test.csv. You are provided with a training dataset, classification_dataset_train.csv, in which true labels for all transactions are given. You can use the training data to guide your classification model. You can use existing classification techniques; however, keep in mind the peculiar properties of transaction data, i.e., huge class imbalance and very limited labeled data. You may want to design your own classifier to solve this problem.
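As a deliberately simple baseline for the classification step, here is a stdlib-only nearest-centroid classifier: fit one centroid per class from the labeled training rows, then assign each test row the label of the closer centroid. It is a stand-in for the Weka classifiers mentioned below, and the points used here are toy data, not the contest's F1–F38 attributes:

```python
def centroid(points):
    """Component-wise mean of a list of equal-length numeric vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def nearest_centroid_classify(train, train_labels, test):
    """Label each test point with the class of the nearest class centroid."""
    centroids = {
        lab: centroid([p for p, l in zip(train, train_labels) if l == lab])
        for lab in set(train_labels)
    }
    def dist2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(centroids, key=lambda lab: dist2(p, centroids[lab]))
            for p in test]

# Toy data: 'clean' points cluster near (0,0), 'fraudulent' near (10,10).
train = [(0, 0), (1, 0), (0, 1), (10, 10), (9, 10), (10, 9)]
train_labels = ["clean"] * 3 + ["fraudulent"] * 3
predictions = nearest_centroid_classify(train, train_labels, [(1, 1), (8, 9)])
```

A centroid method ignores class imbalance and feature scaling, so on the real data it would likely need the resampling or cost-sensitive adjustments discussed above; it is shown only to fix the train/predict workflow.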
Test Cases: Test dataset for elimination round, outlier_dataset.csv; Training dataset for final round, classification_dataset_train.csv; Test dataset for final round, classification_dataset._test.csv
Development platform: Weka (http://www.cs.waikato.ac.nz/ml/weka/). It has all the standard data mining algorithms implemented, so you can plug in data and get results. The tool also allows you to design your own algorithm.
1. In both rounds, you have to generate a file solution.csv, which contains a label for every transaction from the test dataset. The format of the file should be as follows. Do not add any other information to this file, not even headers.
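The exact column layout is not reproduced above; assuming each row is the TXN_ID followed by its label (the natural reading of the requirement), a headerless solution.csv can be written with Python's csv module:

```python
import csv

def write_solution(txn_ids, labels, path="solution.csv"):
    """Write one 'TXN_ID,label' row per transaction, with no header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for txn_id, label in zip(txn_ids, labels):
            writer.writerow([txn_id, label])

# Hypothetical ids and labels, for illustration only:
write_solution(["T001", "T002"], ["clean", "fraudulent"])
```

Whatever tooling you use, the key constraints from the statement are: one row per test transaction, and nothing else in the file, not even headers.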
2. In the final round, you also have to submit a report describing which techniques/algorithms you used in both rounds.
Judging criteria: We will compare your labels with the true labels we have. The score of every submission will be based on the F1 score for the 'anomalous' label in the Elimination Round and for the 'fraudulent' label in the Final Round. This link describes how to compute the F1 score: http://en.wikipedia.org/wiki/F1_score
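Since scoring hinges on the F1 score of the positive class, it is worth being able to compute it yourself on held-out training data. A stdlib sketch, following the standard definition (F1 is the harmonic mean of precision and recall for the positive label):

```python
def f1_score(true_labels, predicted_labels, positive="fraudulent"):
    """F1 = 2 * precision * recall / (precision + recall) for `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives -> precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy check: 1 true positive, 1 false positive, 1 false negative
# -> precision = recall = 0.5 -> F1 = 0.5
true = ["fraudulent", "clean", "fraudulent", "clean"]
pred = ["fraudulent", "fraudulent", "clean", "clean"]
score = f1_score(true, pred)
```

Note that because of the heavy class imbalance, plain accuracy would look deceptively high (predicting everything 'clean' is ~99.9% accurate but scores F1 = 0), which is presumably why F1 on the rare class is the metric.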
1. Score of elimination round (20% weightage)
2. Score of final round (60% weightage)
3. Techniques/Algorithms described in report (20% weightage)