How data cleansing unlocks the potential of your data

In 2012, Gartner, a global research and advisory firm, defined Big Data as data that 1) arrives in high volume, 2) comes in high variety, and 3) moves at ever-higher velocity. These three defining characteristics are known as the three V’s.

The volume of transactional data has grown with each advancement in technology. For any bank, retailer, payment service provider or other financial institution, customer transaction data holds a wealth of information: it is a goldmine. A goldmine with no access key, though, is effectively useless. You may as well have no mine at all!

Imagine that you’ve buried a heap of treasure so deep in the earth that you now spend all your time and energy trying to reach it. This analogy mirrors a very real problem in the payments industry today: transactions occur at tens of millions of merchants around the globe, and the merchant information carried in payment transactions is cryptic and nondescript. To date, no process has existed that correctly identifies merchant names. Because of this, banks cannot always easily work out the overall spending tied to a merchant. And if a merchant cannot be identified, no further analysis of that merchant is possible.

I tackled this lingering issue in a recent effort to improve FinanSeer’s efficiency, analytics, and actionable insight. To do so, I started with a list of known merchants, as follows:

Note: An example using a single merchant, CHICK-FIL-A, is shown below. It was selected because there are thousands of CHICK-FIL-A locations globally.

CHICK-FIL-A #00242

CHICK-FIL-A #02373

CHICK-FIL-A #03642

CHICK-FIL-A #03698

CHICK-FIL-A #03888

CHICK-FIL-A #1126

CHICK-FIL-A #02799

CHICK-FIL-A 3213173677

CHICK-FIL-A @doordash

UBER *CHICK-FIL-A

CHICK-FIL-A 1222O4U9790790

CHICK-FIL-A Unit #11

CHICK-FIL-A 3250 Windward Pl.

CHICK-FIL-A *WEB

CHICK-FIL-A #04016

CHICK-FIL-A #00418

CHICK-FIL-A #02138 4285 State Bridge Rd

CHICK-FIL-A  #0836

CHICK-FIL-A #03462

CHICK-FIL-A #02985

CHICK-FIL-A  # 00864

CHICK-FIL-A #01622

CHICK-FIL-A #02809

CHICK-FIL-A #01230

  CHICK-FIL-A #00381

PAYPAL *CHICK-FIL-A #00

PAYPAL *CHICK-FIL-A #01

PAYPAL *CHICK-FIL-A #02

PAYPAL *CHICK-FIL-A #03

Paypal *Chick-Fil-A #00

Paypal *Chick-Fil-A #03

Paypal *Chick-Fil-A #00

Paypal *Chick-Fil-A #01

Paypal *Chick-Fil-A #02

Paypal *Chick-Fil-A #03

Paypal *Chick-Fil-A #04

Paypal *Circle K # 2376


The data cleansing process is tedious. Let’s break it down into steps to make it easier to follow:

  • The problem is that duplicate values persist. Therefore, we first need to identify the unique merchant names, which we do by grouping the data. In doing so, we found that seemingly random numbers embedded in the names prevented clean grouping and the removal of junk data. Even so, grouping and sorting gave us a starting point, because it let us see the different instances of the same merchant.
  • Next, as the example above shows, Chick-Fil-A transactions in some cases combine the merchant name with a store number, address, phone number, and/or acquirer. We have two methods for cleaning this information. 1) We create patterns and use regular expressions (regex) to drive the extraction. A few examples: whatever comes before the asterisk is the acquirer; a group of 10 digits, optionally preceded by a 1 or 0, optionally preceded by a +, and optionally separated by hyphens, is a phone number; and the information following the pound symbol (#) is the store number. 2) Alternatively, we use built-in pattern-matching or text-similarity algorithms such as edit distance.
  • Finally, once the data has been split into its own fields, we can group the merchants. This is useful in cases such as merging the merchant names “Walmart” and “Wal-Mart” into a single entity.
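
The steps in the list above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not FinanSeer’s actual implementation: the regular expressions, the function names (parse_descriptor, group_key, group_merchants, edit_distance), and the order in which the rules are applied are all made up for this example.

```python
import re
from collections import defaultdict

# Patterns modelled on the rules described above; the exact expressions are assumptions.
ACQUIRER_RE = re.compile(r"^([^*]+?)\s*\*\s*")  # everything before an asterisk
STORE_RE = re.compile(r"#\s*(\d+)")              # digits following a pound sign
# Optional "+", optional leading 1 or 0, then 10 digits with optional hyphens.
PHONE_RE = re.compile(r"(?<!\d)\+?[01]?-?\d{3}-?\d{3}-?\d{4}(?!\d)")


def parse_descriptor(raw):
    """Split one raw merchant string into acquirer, store, phone and name."""
    text = raw.strip()
    fields = {"acquirer": None, "store": None, "phone": None}

    match = ACQUIRER_RE.match(text)
    if match:
        fields["acquirer"] = match.group(1).strip().upper()
        text = text[match.end():]

    match = STORE_RE.search(text)
    if match:
        fields["store"] = match.group(1)
        text = text[:match.start()] + " " + text[match.end():]

    match = PHONE_RE.search(text)
    if match:
        fields["phone"] = re.sub(r"\D", "", match.group(0))  # keep digits only
        text = text[:match.start()] + " " + text[match.end():]

    # Whatever is left over is treated as the merchant name.
    fields["name"] = " ".join(text.split()).upper()
    return fields


def group_key(name):
    """Drop punctuation and spaces so 'Wal-Mart' and 'Walmart' share one key."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def group_merchants(descriptors):
    """Group raw merchant strings by the normalised name extracted from each."""
    groups = defaultdict(list)
    for raw in descriptors:
        groups[group_key(parse_descriptor(raw)["name"])].append(raw)
    return dict(groups)


def edit_distance(a, b):
    """Levenshtein distance, computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    sample = ["PAYPAL *CHICK-FIL-A #00", "CHICK-FIL-A  # 00864",
              "CHICK-FIL-A 3213173677", "Paypal *Circle K # 2376"]
    for raw in sample:
        print(raw, "->", parse_descriptor(raw))
    print(group_merchants(sample))
```

The sketch deliberately leaves gaps. Applied literally, the asterisk rule would read “CHICK-FIL-A *WEB” as having CHICK-FIL-A as its acquirer, and street addresses or suffixes such as “@doordash” stay attached to the name, so real data would need additional patterns. The edit_distance helper is where the second method would plug in: names whose keys differ by only a character or two could be proposed as candidates for merging, with the threshold chosen by inspection.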

Other companies have built products around the same concept, but they merely provide an API to integrate with. They do the cleansing and then charge for every call. It is essential to acknowledge that this is not only a significant financial investment, but also an act of sharing the invaluable goldmine that is data.
