#05 - Claude vs. Fraud: A 3-Hour Game-Changing Test

Ever since LLMs exploded into our lives, I asked myself this question:

How good would they be at detecting fraud?

After spending only 3 hours with Claude.ai, I can say this: Pretty darn good.

In fact, I was shocked by its "out-of-the-box" performance, and my mind is still racing through all the possible implications.

Can it replace Machine Learning models?

Heck, can it replace FraudOps teams?

The answer for both surprised me.

Let's take a deep dive.

(This is about to get a bit technical, so skip to the end if you want just the bottom line).

The Experiment

The basic question I set out to answer was this:

Can an LLM detect fraudulent cases out-of-the-box, and without any training or tuning?

Basically: Can it be operated by a beginner GenAI user like myself?

To run the experiment, I used Claude 3.5 Sonnet. Specifically, I ran it as a Project, which requires a paid account. But I'm certain you can easily replicate the same results with a free one.

I built a small dataset of 99 records, which was pretty much the size limit Claude allowed me. Of these, 21 were confirmed fraud cases, and 78 were randomly selected payments.

The dataset included raw data like basic identity information (name, address, email, phone), card information (type, BIN, issuing country) and payment information (amount, payment source).

No velocity features, no info about the IP other than the address itself, no device details. And as it was a subset of the entire dataset, Claude didn't have the full account event history.

Most importantly - the dataset was a blind one.

I didn't include any data about which of the payments was fraudulent, and didn't even mention how many fraud cases were within the dataset.
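(If you want to replicate the setup, here's a minimal pandas sketch of how such a blind file can be assembled. The file and column names, including the `is_fraud` label, are hypothetical stand-ins for my actual schema.)

```python
import pandas as pd

# Hypothetical full dataset with a ground-truth "is_fraud" label column.
full = pd.read_csv("transactions_full.csv")

# Pull the confirmed fraud cases plus a random sample of other payments.
fraud = full[full["is_fraud"] == 1].sample(21, random_state=42)
legit = full[full["is_fraud"] == 0].sample(78, random_state=42)

# Combine, shuffle, and drop the label so the file Claude sees is truly blind.
blind = (
    pd.concat([fraud, legit])
    .sample(frac=1, random_state=42)  # shuffle the rows
    .drop(columns=["is_fraud"])
)
blind.to_csv("blind_99.csv", index=False)
```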

I then simply uploaded the file to a new chat and entered the following prompt:

In this file you'll see 99 transactions. Please output a file with an added column called "fraud score". Please insert a number between 1-5 to reference how likely this transaction is to be fraudulent. 1 means low risk, 5 means high risk.

Doesn't get any more basic than that.
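I did all of this through the chat UI, but if you'd rather script it, a rough equivalent with the Anthropic Python SDK could look like the sketch below. The model ID and the inline-CSV approach are my assumptions, not part of the original experiment.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Pass the blind dataset inline as CSV text, followed by the exact prompt above.
with open("blind_99.csv") as f:
    csv_text = f.read()

prompt = (
    csv_text
    + "\n\nIn this file you'll see 99 transactions. Please output a file with "
    + 'an added column called "fraud score". Please insert a number between '
    + "1-5 to reference how likely this transaction is to be fraudulent. "
    + "1 means low risk, 5 means high risk."
)

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed model ID
    max_tokens=8192,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)  # the scored output, as text
```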

The Results

The results were surprisingly good straight out of the gate: good transactions averaged a 2.00 score, while fraud averaged 3.14 (I swear, I'm not making this up).

Let's stop for a minute.

Out of the box, without labels, without training, without all the data, and with basic features only - Claude scored fraud 57% higher than the non-fraud population.
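If you want to verify that kind of gap yourself, it's a few lines of pandas once you rejoin Claude's scores with the held-back labels (file and column names here are hypothetical):

```python
import pandas as pd

# Claude's scored output, rejoined with the held-back labels by transaction ID.
scored = pd.read_csv("claude_scored.csv").merge(
    pd.read_csv("labels.csv"), on="transaction_id"
)

# Average score per class, and the relative lift of fraud over non-fraud.
means = scored.groupby("is_fraud")["fraud score"].mean()
print(means)
print(f"Fraud scored {means[1] / means[0] - 1:.0%} higher than non-fraud")
```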

Even more interesting, this was part of the output I received:

Risk Factors Considered:

1. Name matching between cardholder and user

2. Payment method

3. Transaction amount patterns

4. Email address characteristics

5. Account age and signup method

6. IP address patterns

7. Card type and category

8. Mismatched location data

To be clear, none of this was part of my prompt or any other instruction I ever gave Claude.

My next thought was: "That was just the first try - what happens if I actually feed it some review guidelines?"

So I ran four more experiments with added guidelines, and surprisingly, they actually made the results worse.

I'm not going to get too deep into this, but it made it clear to me that prompt engineering would not be that easy. It would probably need to be very dataset/pattern specific to show real gains.

But the surprising bit came next, when I looked at false positive cases:

Payments that were not marked as fraud, but received high scores (4-5).

What I found was very interesting: out of 11 such cases, 4 were actually unreported fraud cases. I proved it by linking them to confirmed fraud in the full 10k+ dataset.
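Surfacing those candidates is the same kind of filter, pointed at the "non-fraud" side. Continuing the earlier sketch (the linking keys `email` and `card_bin` are hypothetical - in practice you'd join on whatever identity fields your data shares):

```python
# High-scoring payments with no fraud label: candidate false positives.
suspects = scored[(scored["fraud score"] >= 4) & (scored["is_fraud"] == 0)]

# Cross-reference them against confirmed fraud in the full 10k+ dataset
# via shared identity fields, to spot unreported fraud.
full = pd.read_csv("transactions_full.csv")
confirmed = full[full["is_fraud"] == 1]
linked = suspects.merge(
    confirmed, on=["email", "card_bin"], suffixes=("", "_confirmed")
)
print(f"{len(suspects)} suspects, "
      f"{linked['transaction_id'].nunique()} link back to confirmed fraud")
```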

Now, admittedly, the high-level results of the last experiment were in line with the first - a 2.19 average score for non-fraud, versus 3.48 for fraud (a 59% gap).

But the interesting bit was looking specifically at the high score cases.

In the medium-high risk band (scores 3-5), Claude caught all 24 fraud cases, while flagging only 23 non-fraud cases.

That's almost a 1:1 false positive ratio - roughly one false alarm per fraud case caught - much better than any unsupervised Machine Learning model out there (ignoring for a minute the skewed dataset).

For the high-risk-only band (scores 4-5), Claude caught 10 fraud cases (~40% of all fraud) while flagging only 7 non-fraud ones.

Very (very) impressive!
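These band counts also fall straight out of the same scored dataframe. A quick sketch, again continuing from above:

```python
# Count fraud vs. non-fraud in each risk band Claude was asked to use.
for lo in (3, 4):
    band = scored[scored["fraud score"] >= lo]
    n_fraud = int((band["is_fraud"] == 1).sum())
    n_legit = int((band["is_fraud"] == 0).sum())
    print(f"scores {lo}-5: {n_fraud} fraud, {n_legit} non-fraud flagged")
```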

Closing Thoughts

Here are my takeaways from this experiment:

LLMs are rapidly catching up with fraud teams: considering the limited data (features and records) and the beginner-level prompt engineering, Claude 3.5 Sonnet performed way above my expectations. I can only imagine the results if I had given it labeled data to learn from. Not to mention what happens once data upload limits are extended.

Democratizing Fraud Detection: Preparing the dataset, running all 5 experiments, and analyzing the results took under 3 hours. This can be done on a free Anthropic account, without any special prompting knowledge. Don't have a fraud team? Claude can help right now.

LLMs excel in low-data environments: Claude produced these results without any labels, and even identified some unreported fraud. The fact that it got no training data to work with means it can operate well in conditions where Machine Learning models would fail hard. And I haven't even mentioned that you don't need to set up a structured data schema.

To be honest, I didn't expect this performance level, especially given the rudimentary experiment conditions.

Now think about how it would perform in two years' time.

A revolution is coming.

That’s all for this week.

See you next Saturday.

P.S. If you feel like you're running out of time and need expert advice on getting your fraud strategy on track, here's how I can help you:

Fraud Strategy Workshop - are you an early-stage Fintech that needs to move fast and with confidence? Book this 1.5-hour workshop to get instant insight into your vulnerabilities and optimization opportunities, plus clear, actionable recommendations that won't burn through your budget.

Book Your Workshop Now >>

Fraud Strategy Transformation Program - are you a growth-stage Fintech in need of performance optimization or expansion of your product offering? Sign up for this 6-8 week program, culminating in a tailor-made, high-ROI roadmap that will unlock world-class performance.

Schedule a Call Now >>

 

Enjoyed this and want to read more? Sign up to my newsletter to get fresh, practical insights weekly!
