Detecting fraud on e-commerce orders with Benford’s law

April 14, 2016 - 12:00 am

In this article, Andrea Cirillo, author of the book RStudio for R Statistical Computing Cookbook, explains how to detect fraud on e-commerce orders.

Benford’s law is a popular empirical law that states that the first digits of a population of data will follow a specific logarithmic distribution.
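
For reference, the expected probability of a leading digit d under Benford’s law is log10(1 + 1/d). A minimal base R sketch of that distribution follows; the name benford_probs is ours and is not part of any package:

```r
# Expected first-digit probabilities under Benford's law: P(d) = log10(1 + 1/d)
digits <- 1:9
benford_probs <- log10(1 + 1 / digits)
names(benford_probs) <- digits

round(benford_probs, 3)
#     1     2     3     4     5     6     7     8     9
# 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046
```

Roughly 30% of the values are expected to start with 1, and fewer than 5% with 9.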

This law was observed by Frank Benford around 1938 and has since gained popularity as a way to detect anomalous alterations in populations of data.

Basically, testing a population against Benford’s law means verifying that the first digits of the given population follow the expected distribution. If deviations are discovered, the items related to those deviations are singled out for further analysis.

In this recipe, we will test a population of e-commerce orders against the law, focusing on items deviating from the expected distribution.

Getting ready

This recipe will use functions from the well-documented benford.analysis package by Carlos Cinelli.

We therefore need to install and load this package:

install.packages("benford.analysis")

library(benford.analysis)

In our example, we will use a data frame that stores e-commerce orders, provided within the book as an .Rdata file.

To make it available within your environment, load this file by running the following command (assuming the file is in your current working directory):

load("ecommerce_orders_list.Rdata")
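
If the book’s file is not at hand, the rest of the recipe can still be tried on a stand-in. The sketch below builds a simulated data frame with the same name and the same order_amount column; the order_id column and the log-normal parameters are assumptions made for illustration, and the values are simulated rather than real orders:

```r
# Hypothetical stand-in for the book's data: simulated amounts, NOT real orders.
set.seed(42)  # make the simulation reproducible
n_orders <- 1000

ecommerce_orders_list <- data.frame(
  order_id     = seq_len(n_orders),  # assumed identifier column
  order_amount = round(rlnorm(n_orders, meanlog = 4, sdlog = 1.5), 2)
)

# First-digit analysis needs strictly positive amounts, so drop any that rounded to 0
ecommerce_orders_list <-
  ecommerce_orders_list[ecommerce_orders_list$order_amount > 0, ]

str(ecommerce_orders_list)
```

Run this instead of load(), not after it, since it would overwrite the loaded data frame. Results obtained on the stand-in will not match the figures shown in this recipe.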

How to do it…

Perform the Benford test on the order amounts:

benford_test <- benford(ecommerce_orders_list$order_amount,1)

Plot the test results:
```
plot(benford_test)
```
This will result in the following plot:
Highlight the suspect digits:
```
suspectsTable(benford_test)
```
This will produce a table showing, for each digit, the absolute difference between expected and observed frequencies. Rows are sorted by decreasing difference, so the digits listed first are the most anomalous ones:

> suspectsTable(benford_test)
   digits absolute.diff
1:      5     4860.8974
2:      9     3764.0664
3:      1     2876.4653
4:      2     2870.4985
5:      3     2856.0362
6:      4     2706.3959
7:      7     1567.3235
8:      6     1300.7127
9:      8      200.4623
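
To see what this table ranks, the absolute differences can be approximated in base R. The sketch below uses simulated amounts with a planted batch of orders starting with 5; the first_digit() helper and the data are assumptions for illustration, and the sketch assumes absolute.diff is the absolute gap between observed and expected counts, as described above:

```r
# Simulated amounts (NOT the book's data): a log-uniform base population
# plus a planted batch of amounts starting with 5.
set.seed(1)
amounts <- c(10^runif(2000, min = 1, max = 4),
             runif(300, min = 500, max = 600))

# Hypothetical helper: leading digit of a positive number
first_digit <- function(x) floor(x / 10^floor(log10(x)))

observed <- tabulate(first_digit(amounts), nbins = 9)  # counts for digits 1..9
expected <- length(amounts) * log10(1 + 1 / (1:9))     # Benford counts

suspects <- data.frame(digits = 1:9,
                       absolute.diff = abs(observed - expected))

# Largest deviations first, as in the suspects table
suspects[order(suspects$absolute.diff, decreasing = TRUE), ]
```

With the planted batch, the digit 5 should land in the first row, mirroring the situation in the recipe.
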
Define a function to extract the first digit from each amount:
```
# Return the first `char` characters of each value
left <- function(string, char) {
  substr(as.character(string), 1, char)
}
```

Extract the first digit from each amount:

ecommerce_orders_list$first_digit <- left(ecommerce_orders_list$order_amount,1)

Filter amounts starting with the suspected digit:

suspects_orders <- subset(ecommerce_orders_list,first_digit == 5)

How it works

Step 1 performs the Benford test on the order amounts. In this step, we applied the benford() function to the amounts, passing 1 as the second argument (number.of.digits) so that only the first digit is analyzed. Applying this function means evaluating the distribution of the first digits of the amounts against the expected Benford distribution.
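
What evaluating a distribution against the expected one amounts to can be sketched with base R’s chisq.test(), which compares observed first-digit counts to the Benford probabilities. This is an illustration on simulated data, not a description of the package’s internals; the first_digit() helper is hypothetical:

```r
# Simulated amounts (NOT the book's data), roughly Benford-distributed
set.seed(7)
amounts <- 10^runif(5000, min = 0, max = 4)

# Hypothetical helper: leading digit of each positive amount
first_digit <- function(x) floor(x / 10^floor(log10(x)))

observed_counts <- tabulate(first_digit(amounts), nbins = 9)
benford_probs   <- log10(1 + 1 / (1:9))

# Pearson's chi-squared goodness-of-fit test against the Benford probabilities
fit <- chisq.test(x = observed_counts, p = benford_probs)
fit$statistic  # larger values signal stronger deviation from Benford's law
fit$p.value
```

A small p-value suggests that the first digits do not follow the expected distribution. With very large populations even minor deviations become statistically significant, so the statistic is best read alongside the per-digit differences.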

The function returns an object with the following components:

Object	Sub-object	Description
info		General information
	data.name	Name of the data used
	n	Number of observations used
	n.second.order	Number of observations used for the second-order analysis
	number.of.digits	Number of first digits analyzed
data		Data frame describing the data used
	lines.used	Original lines of the dataset
	data.used	Data used
	data.mantissa	Mantissa of the log data
	data.digits	First digits of the data
s.o.data		Data frame for the second-order analysis
	data.second.order	Differences of the ordered data
	data.second.order.digits	First digits of the second-order analysis
bfd		Data frame of digit-level results
	digits	Groups of digits analyzed
	data.dist	Distribution of the first digits of the data
	data.second.order.dist	Distribution of the first digits of the second-order analysis
	benford.dist	Theoretical Benford distribution
	data.second.order.dist.freq	Frequency distribution of the first digits of the second-order analysis
	data.dist.freq	Frequency distribution of the first digits of the data
	benford.dist.freq	Theoretical Benford frequency distribution
	benford.so.dist.freq	Theoretical Benford frequency distribution of the second-order analysis
	data.summation	Summation of the data values grouped by first digits
	abs.excess.summation	Absolute excess summation of the data values grouped by first digits
	difference	Difference between the data and Benford frequencies
	squared.diff	Chi-squared difference between the data and Benford frequencies
	absolute.diff	Absolute difference between the data and Benford frequencies
mantissa		Data frame of mantissa statistics
	mean.mantissa	Mean of the mantissa
	var.mantissa	Variance of the mantissa
	ek.mantissa	Excess kurtosis of the mantissa
	sk.mantissa	Skewness of the mantissa
MAD		Mean absolute deviation
distortion.factor		Distortion factor
stats		List of htest class statistics
	chisq	Pearson’s chi-squared test
	mantissa.arc.test	Mantissa Arc test
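
Two of these summaries are easy to approximate by hand, which helps when interpreting them. The sketch below assumes that the mantissa is the fractional part of log10 of each value and that MAD is the mean of the absolute differences between observed and expected first-digit proportions; the package may differ in details, and the data is simulated:

```r
set.seed(3)
amounts <- 10^runif(5000, min = 0, max = 4)  # simulated, roughly Benford-distributed

# Mantissa: fractional part of log10(x); close to uniform on [0, 1) under Benford's law
mantissa <- log10(amounts) - floor(log10(amounts))
mean(mantissa)  # expected to be near 0.5
var(mantissa)   # expected to be near 1/12 (about 0.083)

# Mean absolute deviation between observed and expected first-digit proportions
digits        <- floor(10^mantissa)  # leading digit recovered from the mantissa
observed_prop <- tabulate(digits, nbins = 9) / length(amounts)
expected_prop <- log10(1 + 1 / (1:9))
mad_value     <- mean(abs(observed_prop - expected_prop))
mad_value       # values close to 0 indicate close conformity
```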

Step 2 plots the test results. Running plot() on the object returned by the benford() function produces a set of plots showing the following (from the upper-left corner to the bottom-right corner):

First digit distribution
Results of second-order test
Summation distribution for each digit
Results of chi-squared test
Summation differences

If you look carefully at these plots, you will see which digits have a distribution significantly different from the one expected under Benford’s law. Nevertheless, to have a sounder basis for our considerations, we need to look at the suspects table, which shows the absolute differences between expected and observed frequencies. This is what we will do in the next step.

Step 3 highlights the suspect digits. Using suspectsTable(), we can easily discover which digits present the greatest deviation from the expected distribution.

Looking at the so-called suspects table, we can see that the digit 5 appears in the first row, meaning it shows the largest absolute difference. In the next step, we will focus our attention on the orders with amounts having this digit as the first digit.

Step 4 defines a function to extract the first digit from each amount. This function leverages the base R substr() function and returns the first character of the number passed to it as an argument.

Step 5 adds a new column to the investigated dataset, holding the first digit extracted from each amount.
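
Because the helper works on the character representation of each number, it returns the first character rather than the first significant digit. The sketch below shows two cases where they differ and a hypothetical numeric alternative (first_significant_digit() is our name, not part of the recipe or of any package):

```r
# The recipe's string-based helper, repeated here so the sketch is self-contained
left <- function(string, char) substr(as.character(string), 1, char)

left(523.10, 1)   # "5"
left(0.75, 1)     # "0"  -> the leading zero, not the first significant digit
left(-45.9, 1)    # "-"  -> the sign, not a digit

# Hypothetical numeric alternative: first significant digit of a non-zero number
first_significant_digit <- function(x) {
  x <- abs(x)
  floor(x / 10^floor(log10(x)))
}

first_significant_digit(c(523.10, 0.75, -45.9))   # 5 7 4
```

For order amounts that are all positive and at least 1, both approaches agree on the digit, so the recipe’s helper is sufficient there.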

Step 6 filters amounts starting with the suspected digit.

After applying the left() function to our sequence of amounts, we can now filter the dataset, retaining only rows whose amounts have 5 as the first digit. We will now be able to perform analytical testing procedures on those items.
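
One possible follow-up, offered as a suggestion rather than as part of the recipe, is to check whether identical amounts or the same customers recur unusually often among the suspect orders. The sketch uses a toy data frame; its values and the customer_id column are hypothetical:

```r
# Toy stand-in for suspects_orders (hypothetical values and customer_id column)
suspects_orders <- data.frame(
  customer_id  = c("a", "b", "a", "c", "a", "d"),
  order_amount = c(50.00, 512.30, 50.00, 59.99, 50.00, 5.10)
)

# How often does each amount occur? Repeated identical amounts deserve a closer look.
amount_counts <- sort(table(suspects_orders$order_amount), decreasing = TRUE)
head(amount_counts, 3)

# Orders per customer within the suspect set
orders_per_customer <- sort(table(suspects_orders$customer_id), decreasing = TRUE)
orders_per_customer
```

A deviation from Benford’s law is a reason to look closer, not proof of fraud, so checks like these only narrow down which orders to review.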

Summary

In this article, you learned how to apply the R language to an e-commerce fraud detection system.
