In our webinar yesterday, the first in the series AI Fraud Detection — Beyond the Textbooks, we ran out of time and weren’t able to address some great questions we had from the audience. Instead of waiting until the next episode, I hope these brief answers will be of help and interest to both those who tuned in as well as our wider fraud prevention audience. (Note: Some of these questions have been lightly edited to provide a clearer context.)
1. “What rules or models can you recommend to detect push-payment fraud?”
I’ve heard more than one type of fraud or payment referred to as a “push payment,” so I may misunderstand your question. But, lacking the opportunity to have a conversation with you, here’s my best answer, and I’ll take it a bit far afield to answer broader questions others may have. My bottom-line answer is that it doesn’t matter that the transaction is a push; similar modeling methods apply. Here’s why:
All fraud models are predictors. I often call them “detectors” but really, they are predicting the future. Speaking precisely, well-crafted fraud models are trained to do one of these tasks:
a. Give the probability the subject transaction, if allowed to execute, will, in the future, be found to have been fraudulent. Or, almost the same:
Give the relative risk ranking that the subject transaction, if allowed to execute, will, in the future, be found to have been fraudulent. (Many fraud models do better at ranking risk than estimating the continuous probability of risk, and that’s good enough for most real-world applications.)
b. Give the probability the subject transaction will, in the future, be found to have been executing at a time during which fraudulent transactions were occurring on the transacting account. Or, again:
Give the relative risk ranking that the subject transaction… (same issue as before.)
So, models for push transactions may require special data or extra, clever features, but inherently they are not so different from other fraud models because they are all about predicting future outcomes.
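To make the point about ranking versus probability concrete, here is a minimal sketch. The scores and labels are made up, and AUC is my choice of ranking measure here, not something covered in the webinar. It shows that a rank-based measure does not change when the scores are distorted in an order-preserving way, even though the distorted scores are no longer usable as probabilities.

```python
def auc(scores, labels):
    """Chance that a random fraud outscores a random genuine (ties count half)."""
    frauds = [s for s, y in zip(scores, labels) if y == 1]
    genuines = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for f in frauds:
        for g in genuines:
            if f > g:
                wins += 1.0
            elif f == g:
                wins += 0.5
    return wins / (len(frauds) * len(genuines))

# Hypothetical model outputs and eventual outcomes (1 = later found fraudulent).
probs = [0.02, 0.10, 0.40, 0.50, 0.80, 0.30]
labels = [0, 0, 1, 0, 1, 0]

# Same ordering, badly distorted values: useless as probabilities,
# but the risk ranking is untouched.
distorted = [p ** 3 for p in probs]

print(auc(probs, labels), auc(distorted, labels))
```

If alerts are worked from the top of a ranked queue, the second set of scores serves just as well as the first.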
2. “How will the growing synthetic identity fraud impact current fraud monitoring and detection systems?”
Great question! (And, thanks for the question on LinkedIn as well — this is a little better answer.) Detection of synthetic accounts, customers, etc. requires special methods at first and then may be folded into conventional, transactional detection. Here’s the general approach I recommend. In practice, the real solutions I’ve worked on wandered a bit, but here’s the more direct process learned from experience…
a. Criminals don’t seem to work very hard at making the behavior of their synths diverse. The account origination data may be diverse due to diversity in the population of stolen identity information, but the behavior of synths after origination is usually pretty stereotypical. (Sometimes you get lucky and even the origination data has some unusual consistency due to where or how the personal data was stolen.)
So, you look closely at the transactional behavior of newer accounts. You probably need to fight with the marketing department a bit to keep some restrictions on the privileges of new accounts, giving you and your system time to collect a bit of transactional behavior. (I’ve seen cases where 3 or 4 transactions were enough to nail most synths.)
To look closely, you perform a cluster analysis on the data using profile-like information summarizing past transaction behavior in some detail. Usually, it’s most feasible to do a clustering with something like k-means, but if you have a lot of attributes in the data (or among useful features computed from the data) you may have a better result applying the clustering to a reduced-dimensionality mapping of the data. You would use a technique like principal components, or SVD (Singular Value Decomposition) to reduce dimensions, then apply the clustering algorithm.
Once you get a clustering, push the number of clusters higher than you would for something like a marketing or credit-segmentation model in order to get some pretty small and tight clusters. If you have many synthetic identities, they are likely to appear together in very few clusters. To pick them out, examine each tight cluster to see if anything looks odd about the members of the cluster. You’ll usually be able to spot the synths. They’ll do odd stuff like take money out of an ATM and put it back in to drive up their apparent activity, or run up a credit balance on a card, or all take out home equity lines of credit (HELOCs) for the same amount, etc.
Note that you might catch some flesh and blood synths — people recruited or coerced by criminals to open accounts in their genuine name and follow orders on how to use them, preparing to take the money and run. (Often, these may be people planning to leave the country. Sometimes travel agencies are involved. Sometimes there are threats to family in another country, etc.) Depending on the law-enforcement practices in your country, be careful as some of these conspirators may be innocent or only aware of a minor crime in the process — don’t let anyone get hurt just because they fell into one of your clusters!
b. Now that you know what differentiates the behavior of the synths by the key characteristics of their clusters, you can build a model to detect that behavior. One way to do that is to add an attribute to account records or profiles, flag all the probable synths in your cluster and train a model to recognize that specific behavior. (And, train it tightly to avoid false positives.) You might also add some new features tuned to report on the characteristic behaviors you detected.
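The clustering step in (a) can be sketched roughly as follows. This is a toy version under stated assumptions: the two profile features and all the values are invented, k-means is hand-written so the example has no dependencies, and the dimensionality-reduction step (principal components or SVD) is left out because there are only two features. On real data you would use a tested library for both steps.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    # Farthest-first initialization: deterministic, spreads centroids out.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda j: dist(p, centroids[j]))
                      for p in points]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assign, centroids

def tight_clusters(points, assign, centroids, min_size=3, max_spread=0.5):
    """Clusters with several members sitting unusually close together."""
    flagged = []
    for j, c in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == j]
        if len(members) < min_size:
            continue
        spread = sum(dist(points[i], c) for i in members) / len(members)
        if spread <= max_spread:
            flagged.append((j, members, spread))
    return flagged

# Hypothetical new-account profiles: (transactions per week, avg amount in $100s).
profiles = [
    (1.0, 1.0), (2.0, 3.0), (3.0, 1.5),                   # ordinary accounts
    (9.0, 9.0), (8.0, 7.0), (10.0, 7.5),                  # ordinary accounts
    (1.0, 9.0), (2.5, 8.0),                               # ordinary accounts
    (9.0, 1.0), (9.1, 1.0), (9.0, 1.1), (9.1, 1.1),       # near-identical behavior
]

assign, centroids = kmeans(profiles, k=4)
flagged = tight_clusters(profiles, assign, centroids)
for cluster_id, members, spread in flagged:
    print(cluster_id, members, round(spread, 3))
```

The flagged cluster is only a candidate list for a human to examine. The `min_size` and `max_spread` cutoffs are placeholders you would tune on your own portfolio.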
You can use a similar approach used by a few, advanced acquirers: peer-group profiles. Acquirers do something like put all the shoe stores in a peer-group. Profile the peer-group and the variance of members on key attributes. Then investigate the shoe stores that don’t look like the other shoe stores. Peer groups can be formed on account characteristics, how they were recruited, etc.
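Here is a minimal sketch of the peer-group idea, with made-up shoe stores and a single made-up attribute (average ticket size). One substitution to note: instead of the mean and variance, this sketch profiles the group with the median and the median absolute deviation (MAD), because a single extreme member inflates the variance and can hide itself in a small group. The 0.6745 scaling and the 3.5 cutoff are common conventions for this kind of robust score, not values from any particular product.

```python
from statistics import median

def peer_outliers(values_by_member, threshold=3.5):
    """Members whose value sits far from the peer-group median (robust z-score)."""
    values = list(values_by_member.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # group is too uniform for this measure; needs another rule
    return [m for m, v in values_by_member.items()
            if 0.6745 * abs(v - center) / mad > threshold]

# Hypothetical peer group: average ticket size in dollars for shoe stores.
avg_ticket = {
    "store_a": 62, "store_b": 58, "store_c": 65,
    "store_d": 60, "store_e": 61, "store_f": 240,
}

print(peer_outliers(avg_ticket))
```

A store that lands on this list is a lead for investigation, not a verdict.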
3. “What’s the best way to prevent fraud on new channels where you don’t have history data to train with or when there’s no fraud-flagged transactions in your data set?”
Another very good question.
a. Start by ensuring your operating environment is collecting all the data that might help in future fraud detection, and ensure the environment leaves enough time for fraud detection before a transaction is committed. (100ms is a practical minimum if you have a high-performance system, otherwise, 300ms — add any time needed to fetch data from any other systems). Side note: You might need to tell some fraud horror stories to management or IT to convince them the fraud system is important. If you need any, I can provide a few!
b. Design “policy rules” in cooperation with higher powers in risk management. These don’t detect fraud so much as set limits on the level of risk the organization is willing to take on a transaction with specific characteristics. And, I recommend you create some tighter rules that are not immediately implemented, but carried in the system and available in case you need to later say “if you had activated this proposed rule you would not have lost that $1 million last week.”
c. Collect every ounce of data you can and arrange for fast reporting of frauds and probable frauds. Don’t wait for financial certainty in fraud tagging to react to new attacks.
d. As you collect data and see a few frauds or probable frauds, start to build simple models using obvious features like spending rate, change in spending rate (more precisely, use log(spending) in the rates; it tends to work well), transaction frequency change, maximum amount, and kinds of merchandise or counterparties if you have such data. Reasonable first models include a simple tree-based model (not a random forest yet), an optimized rule model (like IBM Safer Payments builds), a linear regression (with risk tables like fraud rate in known hot spots), or, more aggressively, a logistic regression model.
e. Watch how your new models behave and start to use them in production with high thresholds so that downstream folks investigating alerts gain some confidence in the system and don’t see too many false-alerts.
f. And, critically important for a new system, build in some flywheel rules that apply across all accounts or big segments of accounts and prevent run-away frauds that may exploit processing weaknesses across many accounts or an entire channel.
g. As you collect more data, build more complex features and more complex models. Those added features may include comparisons of individual accounts to their expected peer-group. (In IBM Safer Payments, one way to do this is to store peer-group membership(s) as list entries and use profile variables specific to the peer-group both for update and for comparison to the subject account’s history.)
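As an illustration of the simple features in step (d), here is one way a "change in log spending rate" feature might be computed. The window lengths, the use of log(1 + rate) to cope with zero spending, and the function name are all my assumptions for the sketch, not a prescribed recipe.

```python
import math

def log_spend_rate_change(recent_amounts, recent_days, base_amounts, base_days):
    """Difference between the log daily spending rate of a short recent
    window and that of a longer baseline window."""
    recent_rate = sum(recent_amounts) / recent_days
    base_rate = sum(base_amounts) / base_days
    # log1p(x) = log(1 + x), so an account with no spending does not blow up.
    return math.log1p(recent_rate) - math.log1p(base_rate)

# Hypothetical account: $600 over the past 30 days, then $400 in the last 2 days.
burst = log_spend_rate_change([250.0, 150.0], 2, [20.0] * 30, 30)

# Hypothetical steady account: the same $20/day in both windows.
steady = log_spend_rate_change([40.0], 2, [20.0] * 30, 30)

print(round(burst, 2), steady)
```

Working in logs makes a jump from $20 to $200 a day look the same as a jump from $200 to $2,000, which is usually what you want from a rate-change feature.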
One of the trickiest parts of this process is to educate portfolio and risk management about the dependence of AI on data and the special risks of a new payment method. Especially, make sure that any introductory promotions get a close look by a fraud expert before they go public.
4. “Are supervised-learning algorithms enough for detecting fraud cases? What can you do to identify or detect a new fraud pattern not in the historical data?”
Supervised learning works well enough, but you must use it with some cleverness. Yes, using supervised learning to detect fraud attacks that are similar to prior fraud attacks leaves you vulnerable to new attack patterns not in the historical data. But, you have another, complementary approach…
You can detect fraud indirectly by modeling how accounts (or other participants) behave and recognize deviations from expected behavior in a generally fraud-like direction. Indeed, most fraud systems on large portfolios lean heavily in this direction while direct detection by recognizing attack patterns has usually been used less, partly because it can be blindsided by new attacks, but more because it requires updating models often and that has been inconvenient for many vendors and users.
As model-building becomes more widespread and more efficient, we are seeing more cases of direct detection, but often paired with indirect detection either in a single model trained on data and features relating to both methods or in two or more models cooperating in an ensemble.
It’s all supervised learning but it’s the difference of modeling an object and modeling its shadow. Clearly, the best way for fraud detection is both.
(By the way, there is a special case that dumbfounds most supervised-learning models. Say you build a model on well-featured credit card data, then show it a transaction for $10 million. It’s a toss-up how the model will respond because it has never seen a transaction anything like that before. You prevent such problems with “hinting” where you make up extreme transactions and label them as frauds for training — doesn’t take many made-up transactions, but one of them might save you a million.)
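A minimal sketch of the "hinting" idea, assuming a training set where each record carries only an amount and a fraud label. The record layout, the range of multipliers, and the `synthetic` marker are illustrative assumptions; a real training set would need plausible values for every feature the model uses.

```python
import random

def make_hints(n, max_seen_amount, rng):
    """Invent n extreme transactions, far beyond anything in the real data,
    and label them as frauds."""
    hints = []
    for _ in range(n):
        factor = 10 ** rng.uniform(1, 4)  # 10x to 10,000x the largest real amount
        hints.append({
            "amount": round(max_seen_amount * factor, 2),
            "is_fraud": 1,
            "synthetic": True,  # keep hints identifiable so they can be excluded from evaluation
        })
    return hints

# Hypothetical real training records.
training = [
    {"amount": 42.10, "is_fraud": 0, "synthetic": False},
    {"amount": 2500.00, "is_fraud": 0, "synthetic": False},
    {"amount": 980.00, "is_fraud": 1, "synthetic": False},
]

max_seen = max(r["amount"] for r in training)
hints = make_hints(5, max_seen, random.Random(42))
training_with_hints = training + hints

print(len(training_with_hints), min(h["amount"] for h in hints))
```

Tagging the made-up records matters: they belong in training, but they should not count when you measure detection performance.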
5. “What is AI? I have had our parent company ask me questions about which tools we are using that use AI. I am not sure what fits the definition of AI.”
Here is my official answer:
Artificial Intelligence is the kind of intelligence one employs when one cannot afford the real thing.
Facetious, yes, but true when you think about it. If you had unlimited money (and time) to spend on having humans examine each transaction for fraud, I suspect detection would be pretty good. In the early days of building fraud models, we would often explain to new modelers that our goal was to build a system that would do what an old-fashioned bank teller in a small town would do — sniff out the suspicious transactions and check them out before completing them.
More seriously, there is no clear line and there is a great deal of marketing noise. Ironically, as AI, especially machine learning, has become more popular, it has become more difficult to recognize. In the past, AI machine learning differed from statistical modeling only in style. Machine-learning practitioners were more cavalier with pushing the assumptions of modeling algorithms, less careful about how model performance was measured and nearly always packed more variables into a model than any self-respecting statistician would allow herself.
But, as experience with today’s large datasets has grown, all of us, AI and not-AI, have learned that machine-learning works much better than anyone expected — it’s a bit of a mystery to all concerned that you can bend the rules of statistical modeling and still get surprisingly good results. So, statisticians have moved more toward the AI practices. At the same time, new AI practitioners are learning that the statisticians pretty well had this territory staked out by about 1930 or so. Yes, a few specific algorithmic tricks have evolved since then, but the basics, developed by pure statisticians, were pretty much in place before any of the kind of machine learning we do today was in vogue.
The “revolution” in AI is not a revolution in theory or understanding. It is a revolution driven by the ubiquity of organized and easily accessed data that started with widespread use of the Internet, and of engineering technique that has continued to hone, refine, and compound model-building methods.
Meanwhile, there are other branches of AI that aren’t getting much notice, but will have their day sometime. (You also might be interested in the “AI Winter,” actually, both of them. See Wikipedia.)
6. “Is a prescriptive model better than predictive models for fraud detection? Do any fraud products currently offer prescriptive models?”
Most fraud-management systems start with prescriptive models, whether or not anyone remembers when today’s systems started. I prefer the term “policy rules” for a prescriptive model or set of rules. These are rules that express what risks the institution is and is not willing to take in the course of their business. They really are management decisions, not detection recipes.
Typically, as predictive models come along and become reliably indicative of fraud risk, policy rules tend to evolve into hybrids that express a policy position in terms of model results. For example: “No transactions with a score over 68 for accounts less than three months old.” That’s prescriptive and defines a policy, but does so in terms of a model result. Rules of this kind are the “rule” in mature systems and usually come about without anyone realizing the shift from strictly prescriptive to hybrid.
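The example rule above might look like this in code. It is a sketch only: treating "three months" as 90 days, the parameter names, and the assumption that higher scores mean higher risk are mine.

```python
from datetime import date

def allow_transaction(score, account_opened, today,
                      max_score_new=68, new_account_days=90):
    """Hybrid policy rule: decline when a young account gets a high model score."""
    account_age_days = (today - account_opened).days
    if account_age_days < new_account_days and score > max_score_new:
        return False  # policy says the institution will not take this risk
    return True

today = date(2024, 6, 1)
print(allow_transaction(75, date(2024, 5, 1), today))   # young account, high score
print(allow_transaction(75, date(2023, 1, 1), today))   # established account
print(allow_transaction(60, date(2024, 5, 1), today))   # young account, low score
```

The thresholds are management decisions; the model only supplies the score they are expressed in.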
So, no. Prescriptive is not better than predictive or vice versa. Both methods need one another in practical applications.
7. “What is uniquely different between debit card and credit card fraud?”
It depends upon the card schemes and policies. In many countries, the two are very similar for fraud-management purposes. But, they are quite different fraud problems in the U.S. so I’ll assume the question is about the U.S. and similar markets, but the ideas apply in many regions. (And, for you experts, I’ll be talking about single-message debit cards versus dual-message credit cards and signature debit cards.) In the U.S., for most fraud-management purposes, credit cards differ from debit cards in these ways:
a. Debit card withdrawal and payment transactions require a PIN, credit card transactions (other than ATM and some branch transactions) do not.
b. Credit cards are heavily used for Internet payments, debit cards much less so.
c. Credit card available-credit levels tend to be higher than debit card available balances — usually.
d. Credit cards may help open up other credit opportunities not influenced by a debit account.
e. Vetting of credit-card accountholders is more rigorous than for debit-card accountholders.
Item (a) makes credit-card transactions easier to compromise. Item (b) exposes credit cards in a high-risk arena not travelled much by debit cards. And, items (b), (c), and (d) make stealing access to a credit card more valuable than stealing access to a debit card.
So, criminals like credit cards better, but they have to get around item (e). That makes account take-over popular, because false originations are not easy (but, of course, some criminals specialize there as well).
Also, because fraud rates are substantially higher on credit cards than on debit cards, detection on credit cards tends to deliver higher detection rates at a given false-positive rate.
Fraud detection system vendors especially prefer protecting credit cards because there’s more money in the fraud, and somewhat easier detection due to higher fraud rates.
And why do issuers put up with the greater risk? Because they make more money on the credit aspect of credit cards.
8. “Ted, do you believe that it would be a good fraud-detection strategy to use different models targeting different angles as opposed to using one generic model to detect all fraud attempts?”
Absolutely, yes. I answered this one in the recorded version of the Webinar and the answer is long, a bit involved, and most of session 4 is on this topic. So, here’s a short version.
There are two main reasons for using multiple models:
a. It sometimes helps to build models that are specific to a particular kind of transaction, account, or fraud type. Some argue that a full-featured complex model like a neural network, support vector machine or random forest, can cope well with multiple situations and it isn’t necessary to partition the detection problem like this. You often hear that tree-based models (like the popular random forests) effectively partition the problem automatically — yes, but not quite, there remain some variables that may be confounded across different sub-problems. Nonetheless, the relationship between model complexity (roughly, the number of adjustable parameters in the model) and available training examples often favors multiple, specialized models. Not very elegant, but often effective in practice, especially in the hands of an expert, experienced modeler.
b. As I mentioned in response to an earlier question, there are basically two main ways to detect fraud: direct recognition of repeated fraud attack patterns, and indirect recognition of frauds as deviations from the expected behavior of an account or participant. One can build a model that encompasses both approaches (indeed, modelers who have not recognized the distinction usually do build hybrid models without knowing it). However, there is good reason to build and combine separate models, each emphasizing one approach or the other, if only because they need renewal on different schedules. (And, these separate models are best used under the supervision of a master-model that learns when and how much to listen to each of the specialist models.)
9. “Which payment method has the current highest fraud detection rate?”
I think I answered the wrong question when addressing this in the live session. I more or less listed the criteria for a very low net fraud rate. But, that’s not where detection is highest. Hands down, one achieves the best detection on a transaction system (payments or otherwise) that has a very high fraud rate and has little or no prior fraud protection. In such cases, the high fraud prevalence (remember that chart in the presentation) makes it easier to achieve lower false-positive rates. One could say there are “more fish in the barrel” to catch. And, a lack of detection pressure on those active in fraud makes them lazy and easier to catch. (The fish are lazy.) Until, of course, they wise up and get more diligent. So, the highest detection rates? Look for the highest fraud rates. (But don’t count originations fraud due to poor risk-management on account creation.)
10. “Hi Ted, what is your opinion of rules-based models vs machine-learning models or will we always have hybrid models to reduce the false-positive rate? Thanks.”
Here’s another one I answered in the recorded live session. But to summarize: There is no fundamental difference because any algebraic model may be expressed as a set of rules and any set of rules may be expressed as an algebraic model (indeed, there is a theorem about this for neural networks). It’s two languages for a common concept.
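A toy illustration of the "two languages" point, using a made-up rule: the same decision written once as a rule and once as a weighted sum passed through a threshold, which is the form of a single neural-network unit. The rule, the weights, and the threshold are invented for the example.

```python
def rule(amount, is_new_account):
    # Rule language: "flag if the amount is over 1000 AND the account is new".
    return amount > 1000 and is_new_account

def algebraic(amount, is_new_account):
    # Formula language: indicator inputs, weights, and a threshold.
    x1 = 1.0 if amount > 1000 else 0.0
    x2 = 1.0 if is_new_account else 0.0
    return (1.0 * x1 + 1.0 * x2 - 1.5) > 0

# The two forms agree on every combination of inputs.
for amount in (500, 1000, 1500):
    for is_new in (False, True):
        print(amount, is_new, rule(amount, is_new), algebraic(amount, is_new))
```

Changing the threshold from 1.5 to 0.5 turns the AND into an OR, which is the kind of adjustment an optimizer can make and a hand-written rule cannot.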
However, because most humans can speak rules easier than they can speak formulae, there tend to be two different kinds of rules with some gradations between.
Some rules are what I call “policy rules.” These say what risks the institution is and is not willing to take — rules constructed directly by humans based on issues beyond fraud detection alone. Other rules are designed to detect frauds rather than just rule them out.
Detection rules may be optimized using machine-learning methods or estimated using more ad hoc methods or semi-automated approaches to optimization — like the Excel spreadsheet rules often pasted over machine-learning models to patch some weakness.
So, we have four types of rules with some gradations between them:
Fully optimized detection rules (with parameters optimized with historical data)
Less than fully optimized detection rules.
The fully optimized rules are just another way of expressing a machine-learning model.
11. “Can you explain why the rareness of fraud makes it harder for a model to detect? As long as there are enough frauds to learn from, why does it matter how many non-frauds there are?”
Because the frauds may get lost as noise in the variation of attributes in the genuine cases.
Think of it this way: if you have just two records, one describing a fraud and one describing a genuine transaction, it will be pretty easy to find some clear differences. But, as you add more genuine cases, the differences become less clear because there is more variation in the attribute values of genuine cases. Indeed, it’s not that unlikely you will even have some genuine cases with exactly the same attribute values as your one fraud case. If there are two of those identicals, and just your one fraud record, the model will likely call all transactions genuine and completely ignore your one fraud case.
I think you get the idea now. In practice, it’s not as clear cut as with only one fraud and thousands of genuines, but the idea is the same.
There are pretty good solutions to this training problem, but they don’t solve the problem of higher false-positive (false-alert) rates on detections in a low-fraud environment.
12. “You said the data about frauds is scattered. Why is that? Why aren’t they organized?”
Fraud is usually about transactions between parties. No transaction, no money changes hands, not much opportunity for fraud. And, in many, modern transaction systems there are lots of parties that might be interacting, many of them using somewhat different channels or tributaries of the channels.
Probably the worst case is the worldwide credit-card system. Each issuer is responsible for its own fraud protection. Despite large processors, there are still hundreds of fraud-detection systems that do not share detailed data. A clever criminal can repeat a good fraud scheme hundreds of times over, once for each issuer, moving on to the next issuer once found out. This can and should be fixed.
13. “Why are detected fraud attempts removed from the data used to train models later? Why not include them in the training so old fraud methods are still in newer models?”
Because you don’t know if the detection was accurate. You only know the outcome of transactions you don’t stop. If you stop a transaction because it looks like a known type of fraud, you never know if it really was fraudulent.
Plus, if you assume interdicted transactions were fraudulent and include them as frauds when training new models, the models will ultimately “ring” like feedback in a public address system — and for the same reason of circular re-amplification.
There are some techniques to reduce this “forgetfulness,” but they are not widely used. We’ll touch on at least one in a coming session.
14. “What did you mean when you said there is zero cost for an unsuccessful fraud and why is that important? Can we change that?”
So, imagine I live in a country hostile to most of the world’s richer nations. I’m an out-of-work computer-science professor struggling to feed my family in a dysfunctional economy where justice is a hope at best. And, I’ve learned to hack into some merchant point-of-sale system where I can steal account credentials. With easily found sites on the Internet, I can anonymously sell the identities I steal without having to meet some sketchy criminal in a dark alley.
Now, I think I have a new variation on my hacking technique and I try it out. I’m so confident of it, I try out 100 variations of the new scheme and 99 of them fail, but one works. What was the cost of my 99 failures? Did I run a risk of being arrested? No. Did I feel guilty about exploiting a big bank in a far-away rich country at odds with my country? Probably not. Did I worry about some cosmic notion of justice? Also, probably not, especially if I can buy a new toy for the kids and take my spouse out to dinner.
No, there was no cost to my trial so I keep trying not just hundreds, but thousands of variations and exploit the few that work.
I’m one of thousands in the swarm intelligence faced by today’s banking industry.
15. “You talked about fraud data being ‘not normal.’ Why does that matter to a machine-learning model?”
The full answer and example is in the next session — it’s an interesting and fairly deep issue often overlooked.
A short version is that during the optimization of model parameters (i.e., learning), each parameter is adjusted based on the sample of attribute values in the historical observations that are relevant to that parameter.
The parameter doesn’t “know” how the attribute values in the historical record are distributed so when several sample values are seen, the parameter-setting mechanism (the “learning algorithm”) can’t tell if the values lean up or down because of random variation or because the population of attribute values naturally leans in one direction. The learning algorithm has to assume some distribution to interpret the sample values it sees and set the best possible overall value for the parameter.
Most designers of such algorithms make the wise choice that normally distributed attribute values are probably the most common across the range of problems the algorithm will be applied to. But, that’s not so often true for financial and payments data, so the parameters are not as well optimized for a given number of observations as they might be if the attribute values were more normally distributed.
In practice, we use transformation features to re-scale important attributes so their values are more normally distributed, but this remains challenging because the transformations are rarely perfect.
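A small sketch of such a transformation, using invented transaction amounts and a plain log transform. Skewness is my choice of yardstick here (zero for a symmetric distribution such as the normal); other measures would do.

```python
import math
from statistics import mean, pstdev

def skewness(xs):
    """Average cubed z-score: 0 for symmetric data, positive for a long right tail."""
    mu, sigma = mean(xs), pstdev(xs)
    return mean([((x - mu) / sigma) ** 3 for x in xs])

# Hypothetical transaction amounts: mostly small, one very large.
amounts = [5, 8, 12, 15, 20, 25, 40, 60, 150, 2500]
logged = [math.log(a) for a in amounts]

print(round(skewness(amounts), 2), round(skewness(logged), 2))
```

The log pulls in the long right tail considerably, but the result is still not symmetric, which is the sense in which these transformations are rarely perfect.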
16. “What do you mean by ‘overfitting’?”
If you keep optimizing the parameters in a complex machine-learning model long enough, it will tend to memorize the data used to train it. The problem is that it starts to lose the forest for the trees and may narrow in on unimportant details in the data rather than broader, more important factors.
A common way to prevent over-training leading to overfitting is to train a system on one set of data and hold out another set of data for testing that is not used in training. As the model is trained, it will perform better and better on both the training examples and the held-out examples. But, eventually, it will continue to do better mimicking the training data, while performing worse on the hold-out sample of examples. That’s when it’s overfitting and one rolls it back to where it achieved peak performance on the hold-out examples.
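The roll-back logic can be sketched like this, given a list of hold-out scores recorded after each training pass. The scores are invented, and the `patience` parameter (how many passes without improvement to tolerate before stopping) is an assumption I added for the sketch.

```python
def best_checkpoint(holdout_scores, patience=3):
    """Return (pass number, score) of the peak hold-out performance, stopping
    once `patience` passes have gone by without a new best."""
    best_epoch, best_score = 0, holdout_scores[0]
    for epoch, score in enumerate(holdout_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break  # hold-out performance has stopped improving
    return best_epoch, best_score

# Hypothetical hold-out performance after each training pass:
# it climbs, peaks, then declines as the model starts to memorize.
holdout = [0.70, 0.78, 0.83, 0.86, 0.87, 0.865, 0.86, 0.85, 0.84]

print(best_checkpoint(holdout))
```

In a real training loop you would save the model at each new best and restore that saved copy when the loop stops.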
Other methods of preventing overfitting are used on various types of models. Tree-based methods are especially prone to overfitting; that’s one reason random forest models are often used to control overfitting of decision trees.
However, when using a model to recognize specific fraud types, overfitting gets a bit more subtle and we’ll talk about it again in session 3.
17. “You said ‘self-learning’ is ‘weaker’ for fraud detection because the model targets are evading detection. Why do you say that?”
Because criminals adapt to changes in the detection system, adapting to their adaptations can turn into chasing one’s own tail. Self-learning schemes that attempt to adapt to changes can contribute, but will generally not be as successful as they are for unconscious or oblivious subjects not trying to evade changes. It’s challenging for humans guiding model development to stay ahead of, or at least close behind, criminals, but they have a better chance of anticipating the criminals’ next move, or at least their general direction.
Also, experience shows that adapting to criminal adaptations usually involves not just changing parameter values as in most self-learning methods, but changing the variables used by the model, introducing new, derived features. (More about this in the next session.)
18. “What did you mean that fraud data is ‘censored’?”
In this context, the word “censored” has a special connotation. It means that a sample is selected from the data by some non-random process favoring some data values over others. It need not be a conscious process, but it’s a process that tends to alter the distribution of values that get through a censoring selection process.
In our case, fraud data tends to be censored by the detection process itself. Since the only frauds we see are the ones we didn’t catch, the caught frauds have been removed from the data sample and they have certain characteristics in common (because we were able to catch them). Those characteristics are systematically removed from the data we use to train our models, and the distribution of naturally occurring characteristics is necessarily biased by this “censoring” selection process. This isn’t all bad because we want to learn the common characteristics of fraudulent transactions, but the data we deal with is then distorted by the absence of the missing, caught transactions.
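A toy simulation of this censoring effect, under deliberately crude assumptions: fraud attempts are described by amount alone, the amounts follow an invented skewed distribution, and the existing detector stops every attempt above a fixed amount.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical fraud attempts, amounts drawn from a skewed distribution.
all_fraud_amounts = [rng.lognormvariate(5.0, 1.0) for _ in range(1000)]

BLOCK_ABOVE = 300.0  # hypothetical detector: stops any attempt above this amount

caught = [a for a in all_fraud_amounts if a > BLOCK_ABOVE]
observed = [a for a in all_fraud_amounts if a <= BLOCK_ABOVE]  # what ends up labeled as fraud

print(f"all attempts:     n={len(all_fraud_amounts)}, mean={mean(all_fraud_amounts):.0f}")
print(f"labeled as fraud: n={len(observed)}, mean={mean(observed):.0f}")
```

A model trained only on the labeled frauds never sees a large fraudulent amount, so it learns a picture of fraud that is shaped by the old detector as much as by the criminals.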
19. “You said card companies were often disappointed by fraud detection systems for debit cards. Please explain.”
For U.S. PIN-protected debit cards, fraud rates are generally much lower than for credit cards issued by the same companies and banks. Because prevalence strongly affects the false-positive rate at lower fraud rates, issuers who expect a false-positive rate similar to what they experienced on higher-fraud-rate credit cards are surprised by a much higher false-positive rate on the fewer frauds on debit cards.
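The arithmetic behind that disappointment can be sketched as follows. I am reading "false-positive rate" here as the number of false alerts per correctly flagged fraud, which is an assumption about usage, and every number below is hypothetical: the same detector is applied to two portfolios that differ only in how common fraud is.

```python
def false_alerts_per_fraud(prevalence, detection_rate, false_alarm_rate):
    """False alerts raised for every fraud correctly flagged.

    prevalence:        fraction of transactions that are fraudulent
    detection_rate:    fraction of frauds the detector flags
    false_alarm_rate:  fraction of genuine transactions the detector flags
    """
    true_alerts = prevalence * detection_rate
    false_alerts = (1.0 - prevalence) * false_alarm_rate
    return false_alerts / true_alerts

# Same hypothetical detector on both portfolios.
DETECTION_RATE = 0.50
FALSE_ALARM_RATE = 0.001

credit = false_alerts_per_fraud(0.001, DETECTION_RATE, FALSE_ALARM_RATE)   # 10 frauds per 10,000
debit = false_alerts_per_fraud(0.0001, DETECTION_RATE, FALSE_ALARM_RATE)   # 1 fraud per 10,000

print(round(credit, 1), round(debit, 1))
```

With these made-up figures, a tenfold drop in the fraud rate produces roughly a tenfold rise in false alerts per detected fraud, with no change at all in the detector.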
20. “Will future sessions be more or less technical?”
We will dig in more deeply to modeling and fraud-management technology, but we’ll try to keep the level of vocabulary and prior experience required about the same as today.
If you missed the live session, you can watch the on-demand version as well as sign up for future episodes here.