#22 - I asked Claude to score fraud AGAIN. Then this happened...
A couple of months ago I ran an experiment that left me reeling:
I found out that LLMs (Claude 3.5 Sonnet specifically) can already function as junior fraud investigators, without any special prompting or tuning.
Ever since, I’ve been thinking about how to explore this further. My first idea was to see what would happen if I enriched the dataset with more data.
Luckily, the good folks over at AtData (no affiliation) helped me do just that: enrich my full dataset with email, IP, and address intelligence. I was curious to see what the effect would be.
It ended up taking me down a whole different road…
Side note: this might get a bit technical, skip to the end if you’re here just for the insights.
Oh, and before we start: this experiment was done with the Sonnet 3.5 model. Since then, Sonnet 3.7 has gone live, and I have every reason to believe it'll perform even better.
Recreating the original experiment
Going into it, I thought it would be super simple.
Last time, once the dataset was ready, it took me a couple of minutes to get the initial results, and a couple of hours to finalize them.
Side note: I’m not going to list the technical setup again, you can read more about it in the link above.
However, this was not my experience this time around.
It took me longer (~a full day) to rerun the experiment with good results. I don’t know if I “got lucky” the first time around, if I did something different, or if something changed with Claude.
This triggered my first insight:
Scoring with LLMs, especially without pristine prompting, is difficult to reproduce consistently. Even when the high-level results look similar, the individual scores change from run to run.
Eventually, though, I managed to reproduce similar results to the initial experiment: fraud was scored 56% higher than non-fraud payments (versus 64% in my first run).
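For reference, when I say “fraud was scored 56% higher,” I mean comparing the average LLM score of fraud payments against non-fraud payments. Here’s a minimal sketch of how you could track that metric across repeated runs (the column names and the per-run files are illustrative assumptions, not my actual pipeline):

```python
import pandas as pd

# Hypothetical output of one scoring run: one LLM score per payment,
# plus the chargeback label ("yes"/"no"). Column names are assumptions.
def fraud_lift(df: pd.DataFrame) -> float:
    """How much higher (in %) fraud is scored vs. non-fraud, on average."""
    fraud_mean = df.loc[df["chargeback"] == "yes", "llm_score"].mean()
    legit_mean = df.loc[df["chargeback"] == "no", "llm_score"].mean()
    return (fraud_mean / legit_mean - 1) * 100

# Because individual scores change between runs, it's worth computing the
# lift per run rather than trusting a single run.
runs = [pd.read_csv(f"scores_run_{i}.csv") for i in range(3)]  # hypothetical files
for i, run in enumerate(runs):
    print(f"run {i}: fraud scored {fraud_lift(run):.0f}% higher than non-fraud")
```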
At this point I tried adding the AtData features and failed miserably. The results actually looked worse, with fraud sometimes being scored lower than non-fraud.
This triggered my second insight:
The more features I added, the more overwhelmed Claude became. It became increasingly difficult for it to choose which datapoints to focus on.
I had to intervene and look at the features myself, and after a quick glance three looked quite promising:
Risk Score: scored the email for how likely it is to be fraudulent.
Longevity: how long ago was this email first seen by AtData.
IP Postal Distance: how far is the IP location from the address location.
Unsurprisingly, these are all hallmarks of fundamental fraud detection logics.
Once I cherry-picked these features, and gave Claude some instructions on how to interpret their values, it finally clicked.
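To give a flavor of what “cherry-picking features plus interpretation hints” can look like in practice, here’s a minimal sketch of a per-payment prompt. The wording, field names, and risk hints are illustrative assumptions, not my exact prompt:

```python
# Illustrative sketch: framing the three AtData-style features for the model,
# with explicit hints on how to read each value. Field names are assumptions.
def build_scoring_prompt(payment: dict) -> str:
    return f"""Act as a fraud investigator and score this payment from 0 (legit) to 100 (fraud).

Payment signals:
- Email risk score: {payment['email_risk_score']} (higher = riskier email)
- Email longevity: {payment['email_longevity_days']} days since this email was first seen (younger = riskier)
- IP-to-postal distance: {payment['ip_postal_distance_km']} km between the IP location and the billing address (farther = riskier)

Return only the numeric score and a one-sentence rationale."""

example = {
    "email_risk_score": 87,
    "email_longevity_days": 3,
    "ip_postal_distance_km": 2400,
}
print(build_scoring_prompt(example))  # this text would then be sent to Claude
```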
In my best experiment using those three AtData features, it scored fraud 105% higher. That was a x2 improvement over my best experiment without those features.
Was I too quick to celebrate LLMs?
Doubling my results should have left me thrilled, but I felt devoid of emotion.
Yes, I was frustrated by the amount of time I wasted on building the datasets, and the time it took me to find the right experiment setup.
But I also felt that even such a simple iteration exposed the limitations of LLMs.
I was even beginning to wonder whether the surprisingly good results I got last time were due to the simplicity of the experiment.
But this time around, I noticed that the more I tried to squeeze out of Claude, the worse Claude performed.
I had to give it more and more hints on how to approach the task, which was in contradiction to what I was trying to achieve.
What frustrated me most was that while I was glancing over the AtData features, I noticed that certain values were strongly correlated with fraud.
I was thinking: “Man, if this were a real case, instead of spending a day trying to get results, I could have written a solid fraud rule within 10 minutes”.
Next thing you know, I was thinking:
“Wait a minute, why don’t I ask Claude to do just that?”
Is Claude a better analyst than it is an investigator?
I created a new dataset: ~2,300 payments, including their raw data and the three AtData features mentioned above. But this time around, I included the labels (which payments got chargebacks).
Next, I prompted Claude as follows:
Act as a Fraud Data Analyst:
* Analyze "joined_dataset_cleaned.xlsx"
* Treat column CG as your fraud label ("yes" = fraud)
* Research a rule that you'd recommend implementing to stop fraud while having as little impact on legit customers as possible (decline as few CG="no" payments as possible).
This was the result I got:
This rule would decline 218 payments (~10% of the flow), while blocking 26 fraud cases (35% of all fraud). Obviously, this isn’t ready to go live, but it’s a good start that took me 3 minutes to produce.
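Numbers like “declines 218 payments, catches 26 fraud cases” are cheap to verify once you have the labeled file. A hedged sketch, assuming pandas-friendly column names and a made-up stand-in rule (not the rule Claude actually proposed):

```python
import pandas as pd

# Hypothetical column names; assume the chargeback label ends up exposed
# as a column called "CG" after loading the file.
df = pd.read_excel("joined_dataset_cleaned.xlsx")
fraud = df["CG"].eq("yes")

# Illustrative stand-in for a candidate rule: decline when the email is
# both risky and brand new. Thresholds and fields are assumptions.
declined = (df["email_risk_score"] >= 80) & (df["email_longevity_days"] <= 7)

print(f"declined: {declined.sum()} payments ({declined.mean():.1%} of the flow)")
print(f"fraud caught: {(declined & fraud).sum()} of {fraud.sum()} "
      f"({(declined & fraud).sum() / max(fraud.sum(), 1):.1%} of all fraud)")
```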
But the real kicker?
Notice how it zeroed in on the exact same three AtData features I did, without me prompting it to? It was much easier for Claude to pick the right datapoints when I asked for a rule instead of a score.
Naturally, my next prompt was:
“Can you think of how to expand the base logic you proposed so we can exclude sub-populations that have a lower rate of fraud in them?”
Here was its response:
On paper, I should have been happy with this. It took me a minute to shave off 30% of the false positives without impacting how much fraud I block.
But I also had to be real:
As a fraud analyst, would I really use these exception logics? Or would I dismiss them as overfitted? I tended towards the latter: they were simply not strong enough to withstand real-world fraud.
However, this response did help me find a slightly different exception, getting my own rule down to blocking only 68 payments (~3% of the flow), 24 of which were fraud (32% of all fraud).
This was already a close-enough performance for me to consider such a rule for production, and it took me less than 10 minutes to research it.
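On the overfitting worry: one quick sanity check is to see whether an exception still holds on data it wasn’t derived from, for example by splitting the payments in two and comparing the rule’s precision on each half. A minimal sketch under the same assumed column names and a made-up exception:

```python
import pandas as pd

df = pd.read_excel("joined_dataset_cleaned.xlsx")
fraud = df["CG"].eq("yes")

# Illustrative rule with an exception carved out (not the actual production rule):
base = (df["email_risk_score"] >= 80) & (df["email_longevity_days"] <= 7)
exception = df["ip_postal_distance_km"] < 50  # e.g. exclude "IP near the billing address"
rule = base & ~exception

# Split by row order (a time-based split is better if you have timestamps)
# and check whether precision survives outside the half the rule was tuned on.
half = len(df) // 2
for name, idx in [("first half", df.index[:half]), ("second half", df.index[half:])]:
    hit = rule.loc[idx]
    precision = (hit & fraud.loc[idx]).sum() / max(hit.sum(), 1)
    print(f"{name}: declines {hit.sum()}, precision {precision:.0%}")
```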
But I wasn’t done yet. I wanted to check how Claude would fare if I took these three features out of the dataset.
And as I suspected, it struggled.
When it wasn’t presented with strong “fraud-splitting” features, it just couldn’t find even the start of a rule I could continue and develop myself.
Is Claude going to replace your fraud team?
So, what did I learn from spending a full day with Claude?
Claude Investigator: with basic setup, Claude can be used for basic fraud scoring at small scale. However, given that scores have some level of randomness, it is probably better to use it as a co-pilot rather than a fully automated agent.
Claude Analyst: Claude helped me research a worthy fraud rule within less than 10 minutes. Co-piloting non-experts in rule creation seems like a classic use-case, as long as you make sure to battle-test exception logics yourself.
Datasets need to be curated: throughout my experiments I’ve seen Claude struggle with datasets that had too many datapoints, or datapoints that weren’t strong enough.
Data enrichment is critical for performance: when scoring, it boosted performance x2; when researching rules, Claude couldn’t manage without it.
How would all of that affect teams?
Non-expert teams:
Teams that don’t have any fraud expertise but suddenly need to manage it can use Claude’s out-of-the-box capabilities to triage fraud attacks. It’s not a recommended solution for the long term, but it can be a great stopgap (= cheap and available).
Junior teams:
Fraud teams that are junior or not yet fully fledged in terms of skillset can also gain a lot from using Claude, whether for dataset creation and cleaning, as an investigation co-pilot, or for quickly identifying rule “stubs” that a human can develop further. Using it will likely increase efficiency considerably.
Senior teams:
I currently don’t see how fraud teams that are highly professional and well-rounded will gain much by using Claude. It can probably still be of value for data manipulation and cleaning, but seasoned fraud teams usually have their environment in order already.
It seems that almost any team, besides the most seasoned, can benefit from using Claude for fraud detection.
Does that mean that these jobs are on the line?
Not if you know how to harness it to become more impactful in how you generate value.
And just before we wrap up things for this week, I’d like to thank AtData again for their support. They are not sponsors, just cool guys.
Did you try to use LLMs to write fraud rules? Hit the Reply button and tell me about your experience!
In the meantime, that’s all for this week.
See you next Saturday.
P.S. If you feel like you're running out of time and need some expert advice with getting your fraud strategy on track, here's how I can help you:
Free Discovery Call - Unsure where to start or have a specific need? Schedule a 15-min call with me to assess if and how I can be of value.
Schedule a Discovery Call Now »
Consultation Call - Need expert advice on fraud? Meet with me for a 1-hour consultation call to gain the clarity you need. Guaranteed.
Book a Consultation Call Now »
Fraud Strategy Action Plan - Is your Fintech struggling with balancing fraud prevention and growth? Are you thinking about adding new fraud vendors or even offering your own fraud product? Sign up for this 2-week program to get your tailored, high-ROI fraud strategy action plan so that you know exactly what to do next.
Sign-up Now »
Enjoyed this and want to read more? Sign up to my newsletter to get fresh, practical insights weekly!