Finding Fraud Part One

Credit Card thief

It’s no secret that one of our hottest use cases lately has been Fraud Detection. Awhile back we did a webinar talking about some of the ways you could use Neo4j to fight fraud. Today, I want to augment that webinar with some Cypher queries. Let’s see how it works.

In our model, we have users that use their credit cards to make transactions with different merchants. Fraudsters with stolen credit card data test their stolen numbers, make big purchases, and then disappear. We are interested in finding either where the credit card data may have been stolen or merchants where stolen credit card data is tested.

Let’s start with John. He has made some credit card transactions:

John's credit card transactions

Let’s create this data in our graph. We have a user, some merchants, and some transactions made at those merchants:

CREATE (john:User {name:"John"})
CREATE (m1:Merchant {name:"Computer Store"})
CREATE (m2:Merchant {name:"Gas Station"})
CREATE (m3:Merchant {name:"Jewelry Store"})
CREATE (m4:Merchant {name:"Furniture Store"})
CREATE (tx1:Transaction:Fraudulent {amount: 2000.00, date:datetime()})
CREATE (tx2:Transaction {amount: 35.00, date:datetime() - duration('P1D')})
CREATE (tx3:Transaction {amount: 25.00, date:datetime() - duration('P2D')})
CREATE (tx4:Transaction {amount: 12.00, date:datetime() - duration('P3D')})
CREATE (tx1)-[:AT_MERCHANT]->(m1)
CREATE (tx2)-[:AT_MERCHANT]->(m2)
CREATE (tx3)-[:AT_MERCHANT]->(m3)
CREATE (tx4)-[:AT_MERCHANT]->(m4)

That’s a good start, but we still need to connect John to his transactions. How could we do that? Well, we could just create a MAKES relationship from John to each like so:

CREATE (john)-[:MAKES]->(tx1)
CREATE (john)-[:MAKES]->(tx2)
CREATE (john)-[:MAKES]->(tx3)
CREATE (john)-[:MAKES]->(tx4)

That’s one way to model the relationship, and it works great to know every transaction a particular user made. But what if we wanted to get just the most recent transactions? Well, with this model we’d have to traverse all of them, check their date property, sort and retrieve just a few.

// The last week of John's transactions using the MAKES relationships
MATCH p = (n:User {name:"John"})-[:MAKES]->(tx)
WHERE > datetime() - duration('P7D')

The results would look like this:

John's transactions for the week

That seems like a lot of work. What we can do instead is arrange them in a Linked List. Yeah you know, that thing you learned about in class but never really used outside of whiteboard interviews?

We will make a chain of transactions starting from John to his most recent transaction and going back in time like so:

CREATE (john)-[:PREV_TX]->(tx1)
CREATE (tx1)-[:PREV_TX]->(tx2)
CREATE (tx2)-[:PREV_TX]->(tx3)
CREATE (tx3)-[:PREV_TX]->(tx4)

If we wanted to see these transactions using the PREV_TX relationship chain, we could use the magic star (*) character in Cypher to “keep going” from John down the way.

// The last week of John's transactions using the PREV_TX relationships
MATCH p = (n:User {name:"John"})-[:PREV_TX*]->(tx)
WHERE NONE (tx IN nodes(p) WHERE COALESCE(, datetime()) <= datetime() - duration('P7D'))

John&apos;s transactions represented as a Linked List

Alright, let’s make this interesting. John notices a large transaction in his statement he didn’t make and calls his credit card company. They mark the Transaction as fraudulent.

Detected fraudulent transaction

John is not the only customer. Let’s add Karen, Sheila, and Robert as more users, along with their transactions and merchants they’ve interacted with. I’m not going to put all that Cypher here, but you can see it on this gist.

Karen and Sheila also found transactions they did not make on their statement and called in. Unbeknownst to them (and us for now), they had all visited the same gas station about two weeks ago. How can we use Neo4j to find out this interesting fact?

Finding the source of fraud multiple users

Let’s start by finding all the fraudulent transactions in the last week and see which other transactions these users made — up to two weeks ago. We can make use of the PREV_TX relationship chain to go backwards until we find them all.

 // All the transactions marked fraudulent in the last week and // the transactions that came before them up to two weeks ago.
MATCH p = (fraud:Fraudulent)-[:PREV_TX*]->(tx)
WHERE > datetime() - duration('P7D') AND NONE (tx IN nodes(p) WHERE COALESCE(, datetime()) <= datetime() - duration('P14D'))

From these chains of transactions, we need to find the common links connecting them together.

Chains of transactions

We can find out which merchants these transactions occurred at in order to connect them together. We’ll do that and order them by a count of purchases.

// Top 5 common merchants from fraudulent transaction chains up to two weeks ago.
MATCH p = (fraud:Fraudulent)-[:PREV_TX*]->(tx)
WHERE > datetime() - duration('P7D') AND NONE (tx IN nodes(p) WHERE COALESCE(, datetime()) <= datetime() - duration('P14D'))
WITH nodes(p) AS transactions
UNWIND transactions AS tx
MATCH (tx)-[:AT_MERCHANT]->(merchant)

There is that gas station we were looking for at the top of the list. The fraud could have originated there, or fraudsters may have used that place to test our stolen credit cards. Let’s dig a little further and find out who else is at risk.

Site of fraud found

First, let’s get the users we know were affected by fraud. We’ll want to exclude these users, since it’s too late for them.

// Find the users who reported fraud the last week
MATCH (fraud:Fraudulent)<-[:MAKES]-(user)
WHERE > datetime() - duration('P7D')
WITH COLLECT(user) AS users

Now, let’s go get their transaction chains:

// Find the transactions that happened before the reported fraud in the last 2 weeks
MATCH p = (fraud:Fraudulent)-[:PREV_TX*]->(tx)
WHERE > datetime() - duration('P7D') AND NONE (tx IN nodes(p) WHERE <= datetime() - duration('P14D'))
WITH users, nodes(p) AS transactions

Let’s continue to the merchants involved:

// Find out at which merchant these transactions took place
UNWIND transactions AS tx
MATCH (user)-[:MAKES]->(tx)-[:AT_MERCHANT]->(merchant)

Next, let’s change how we counted transactions. Instead of the total number of transactions, we will use the count of distinct users who made transactions there. This will make a daily coffee shop transaction equal to one instead of once for each day.

// Use the number of distinct users who made transactions as a score
WITH users, merchant, COUNT(DISTINCT user) AS userCount

Now, we’ll find other users who made transactions at those merchants in the last two weeks and filter out the users who already fell victim to fraud.

// Find out who else may be at risk
MATCH (merchant)<-[:AT_MERCHANT]-(tx)<-[:MAKES]-(user)
WHERE > datetime() - duration('P14D') AND NOT user IN users

Then, we’ll collect our set of potential merchants. Different users may be victims to multiple fraudsters that are operating at the same time, so we’ll group our potential merchants by user.

// Which merchants may have been the first sign of fraud
WITH DISTINCT user, COLLECT({merchant:, userCount: userCount}) AS potentials, AVG(userCount) AS averageuserCount
WITH user, averageuserCount, potentials
UNWIND potentials AS potential

Then, we’ll get the merchants where an above average number of users made transactions.

// Return those with higher than average WITH user, potential, averageuserCount
WHERE potential.userCount > averageuserCount
RETURN user, potential.merchant, potential.userCount
ORDER BY user, potential.userCount DESC

Here is the result:

Final count for potential fraud

Surprise! The fraud could have originated at the gas station, but it could also have started at the bowling alley!

// Surprise, it could have been the Bowling Alley!
MATCH p=(m:Merchant {name:"Bowling Alley"})<-[:AT_MERCHANT]-(tx)<-[:MAKES]-(user) RETURN p

Potential beginning of fraud at bowling alley

We could run the fraud query whenever a new fraudulent transaction was reported and freeze the credit cards of any users who may have also been affected to prevent further loses from occurring. If you want to give it a try yourself, start Neo4j and run this Cypher Script to generate the example data and try the queries!

If you have a moderate amount of transactions to deal with, this query works great, but at large scales, there is a performance improvement we can make. Once we have the merchants involved in the chain of fraudulent transactions, and we go find the transactions that occurred with the last two weeks, we run this MATCH statement.

// Find out who else may be at risk
MATCH (merchant)<-[:AT_MERCHANT]-(tx)<-[:MAKES]-(user)
WHERE > datetime() - duration('P14D')

Do you see the problem? If a merchant has lots of transactions and we have kept a long history, we will be looking at many transactions and their date property. What can we do to speed this up? Well, we can promote the date property into the AT_MERCHANT relationship, so it becomes AT_MERCHANT_ON_2019_08_19, for example. If you are a regular, you’ve seen me use this trick here and there, but if you are new, follow those links and watch this explanation. Now, in the query we construct, we only traverse from merchants through the 14 relationship types (one for each day we care about); we do not care about how much history we have.

MATCH (merchant)<-[:AT_MERCHANT_ON_2019_08_19| AT_MERCHANT_ON_2019_08_18| AT_MERCHANT_ON_2019_08_17...]-(tx)<-[:MAKES]-(user)

This change allows us to skip looking at the date property of the transaction, and we’re off to the races. We’ll also want to incorporate more information to come up with a better score than just counting users. We could look at whether the transaction was online or not, if the credit card was present, if they used a chip, pin, zip code, or just magnetic strip for the checkout, what kind of business the merchant is in, and whatever else we think could help.

So, go on and fight fraud with Neo4j. 

This UrIoTNews article is syndicated fromDzone