Data Matching: Path to Quality Datasets

With the amount of business data growing, more and more options to categorize it appear, resulting in many datasets. Some of their data fields and records overlap, while others include pieces of information that, when merged, can give you a full picture.

While one regularly updated job posting database might suffice for small and medium-sized businesses (SMBs) in search of top talent, multinational software companies might want to add at least technographic and product review datasets to their collection.

Unfortunately, without proper data matching, much business value is lost in between the datasets, forever hidden, and forever unreclaimed.

What is data matching?

Data matching is the method of identifying and linking records from different datasets that represent the same entity. Also known as entity resolution and record linkage, data matching uses machine learning, statistical methods, and occasional manual verification to evaluate if and how data points and records correspond.

A good example of matching data is comparing employee and B2B contacts databases to understand potential clients better. While the former may have location data on a city level, current position, and work experience, the latter may boast location on a state level and email addresses.

However, the B2B contacts database may only have the surname and the first name letter. In such a situation, data matching lets you enrich data and remove duplicates along with spelling errors that make similar entries seem unrelated in the eyes of an algorithm.
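To make the enrichment idea concrete, here is a minimal Python sketch, assuming the employee database stores full first names and cities while the contacts database stores only an initial, a surname, and a state. All records, field names, and the CITY_TO_STATE lookup are invented for illustration.

```python
# Hypothetical lookup that lifts city-level location to state level.
CITY_TO_STATE = {"Austin": "Texas", "Denver": "Colorado"}

# Invented sample data; field names are assumptions, not a real schema.
employees = [
    {"first_name": "Richard", "surname": "Dickinson", "city": "Austin", "position": "CTO"},
    {"first_name": "Rachel", "surname": "Dickinson", "city": "Denver", "position": "Data Analyst"},
]
contacts = [
    {"initial": "R", "surname": "Dickinson", "state": "Texas", "email": "r.dickinson@example.com"},
]

def match_key(initial, surname, state):
    """Build a comparable key from the fields both databases can provide."""
    return (initial.upper(), surname.lower(), state.lower())

def enrich(employees, contacts):
    """Attach the contact email to every employee whose key matches a contact."""
    contact_index = {
        match_key(c["initial"], c["surname"], c["state"]): c for c in contacts
    }
    enriched = []
    for person in employees:
        state = CITY_TO_STATE.get(person["city"], "")
        key = match_key(person["first_name"][0], person["surname"], state)
        merged = dict(person, state=state)
        contact = contact_index.get(key)
        if contact:
            merged["email"] = contact["email"]
        enriched.append(merged)
    return enriched

enriched = enrich(employees, contacts)
for row in enriched:
    print(row)
```

Note that the initial and surname alone would confuse Richard with Rachel. In this sketch, it is the location that tells them apart, which is why combining several fields matters.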

The value of data matching for business

If you read our data matching example above, it should be evident that its business value is somewhere between huge and immeasurable.

Of course, if you need just one database, standardizing, filtering, and deduplicating (or getting already cleaned data) is enough. Otherwise, we recommend starting the process of matching data as soon as possible for these five reasons:

Attaining “golden” records. In the data community, a golden record means the “right” version of a particular entity. This is crucial, especially for sales, as it allows switching from cold-calling to contacting actual leads while already having some background information.
Cleaning data. This also holds significant value for your salespeople as it helps avoid instant turn-offs like misspelled names or gender misidentification. Duplicate records can also lead to two of your guys contacting the same person and offering slightly different deals. Without deduplicating, it might be difficult to spot those separate entries due to filters or other settings in your CRM.
Enabling business intelligence. One dataset might not be enough to do proper market research. Here’s where entity resolution comes into play by comparing and combining records across your company, employee, and other datasets. And with clean data, analysis takes way less time.
Improving customer segmentation. Like it or not, there is no longer a single buyer persona but several. To cater to each of them, customer segmentation is of utmost importance. After the data matching process, you can attribute interests and behaviors to each segment, which positively impacts ROI.
Enhancing compliance. Complying with GDPR and other regulations can be a stumbling block for many B2C businesses. For instance, a contact database you bought might include people who didn’t consent or the same people with different email addresses. Also, record linkage simplifies compliance for B2B companies with the regulations of the Office of Foreign Assets Control (OFAC), which blacklists those facing sanctions.

5 data matching methods

There are multiple ways to categorize data matching methods or techniques, but for the sake of simplicity, we will present only the most common ones.

These five data matching types are exact matching, fuzzy matching, probabilistic matching, machine learning-based matching, and hybrid matching. Let’s discuss each in more detail.

1. Exact matching

Exact matching does exactly what its name implies – it finds exact matches. While this simple record linking technique might work on some quality datasets, using it most often means losing important information.

Let’s say you want to connect the names of potential clients to deduplicate the database. With exact matching, Richard Dickinson and Dick Dickinson will be counted as separate records, even though their email and location are identical. While you can manually check a smaller dataset this way, going through millions of records is not viable, to say the least.
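The limitation is easy to reproduce. Below is a small Python sketch of exact-match deduplication, assuming invented sample records and hypothetical field names.

```python
# Invented sample records for illustration only.
records = [
    {"name": "Richard Dickinson", "email": "r.dickinson@example.com", "city": "Austin"},
    {"name": "Dick Dickinson", "email": "r.dickinson@example.com", "city": "Austin"},
    {"name": "Richard Dickinson", "email": "r.dickinson@example.com", "city": "Austin"},
]

def exact_dedupe(rows, keys):
    """Keep the first row for each distinct combination of the given fields."""
    seen = set()
    unique = []
    for row in rows:
        signature = tuple(row[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(row)
    return unique

by_name = exact_dedupe(records, ["name"])
by_email_city = exact_dedupe(records, ["email", "city"])

print(len(by_name))        # 2: Richard and Dick remain separate records
print(len(by_email_city))  # 1: all three rows share email and city
```

Matching on the name leaves Richard and Dick as two people, while matching on email and city collapses them into one. Which fields to compare is a choice you make, and exact matching cannot tell you which choice is right.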

2. Fuzzy matching

This entity resolution method recognizes similar but not identical instances. Examples include incomplete data, spelling variations, and typos.

Fuzzy attribute matching has its own sub-techniques, such as Levenshtein distance, which counts the minimum number of single-character edits (insertions, deletions, or substitutions) needed to change one word into another. In our case, it takes 5 edits to turn Richard into Dick, a distance that can be interpreted as close enough if your threshold allows it.

The fuzzy matching process will also pair the original with “Richar” or “Richars” Dickinson. This data matching type is great for aligning the US and UK datasets with spelling discrepancies, such as “analog” and “analogue” camera.
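For readers who want to check these numbers, here is a compact Python sketch of Levenshtein distance using the standard dynamic programming approach. It is case-sensitive, which is a simplifying assumption. Real pipelines usually normalize case and whitespace first.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("Richard", "Dick"))     # 5
print(levenshtein("Richard", "Richar"))   # 1
print(levenshtein("Richard", "Richars"))  # 1
print(levenshtein("analog", "analogue"))  # 2
```

The typos sit one edit away from the original, while the nickname sits five edits away. A threshold loose enough to accept Dick will also accept many unrelated names, which is where false positives come from.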

However, the problem with fuzzy matching is that it can produce false negatives and false positives. That said, this applies to probabilistic and machine learning-based matching as well, and the chance of such an error depends heavily on the rules set by the user.

3. Probabilistic matching

This more advanced method of matching attributes uses statistics to determine the chance two records are connected. Here, 0% is no match, and 100% is a full match, meaning the records are identical twins.

Getting back to our Richard and Dick example, the probabilistic approach would note the Levenshtein distance along with the matching emails and locations, giving a final score of, say, 95%. Naturally, the more factors you weigh in, the more accurate the probabilistic matching will be unless you make weighting mistakes.
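One common way to turn field comparisons into a match probability is to weight each field by how often it agrees for true matches (m) versus for unrelated records (u), then combine the weights as log-odds. The Python sketch below follows that idea. Every number in it (the m and u values, the prior, the 0.8 similarity cutoff) is an assumption picked for illustration, not a measured value.

```python
import math
from difflib import SequenceMatcher

# Assumed probabilities: m = P(field agrees | same entity),
# u = P(field agrees | different entities). Illustrative values only.
FIELD_PARAMS = {
    "email": {"m": 0.95, "u": 0.001},
    "city": {"m": 0.90, "u": 0.05},
    "surname": {"m": 0.95, "u": 0.01},
    "first_name": {"m": 0.80, "u": 0.02},
}
PRIOR_MATCH = 0.01  # assumed share of compared pairs that are true matches

def agrees(field, a, b):
    """First names agree if they are similar enough; other fields must be equal."""
    a, b = a.strip().lower(), b.strip().lower()
    if field == "first_name":
        return SequenceMatcher(None, a, b).ratio() >= 0.8
    return a == b

def match_probability(rec_a, rec_b):
    """Combine per-field evidence as log-odds, then convert to a probability."""
    log_odds = math.log(PRIOR_MATCH / (1 - PRIOR_MATCH))
    for field, p in FIELD_PARAMS.items():
        if agrees(field, rec_a[field], rec_b[field]):
            log_odds += math.log(p["m"] / p["u"])
        else:
            log_odds += math.log((1 - p["m"]) / (1 - p["u"]))
    return 1 / (1 + math.exp(-log_odds))

richard = {"first_name": "Richard", "surname": "Dickinson",
           "email": "r.dickinson@example.com", "city": "Austin"}
dick = dict(richard, first_name="Dick")
laura = {"first_name": "Laura", "surname": "Chen",
         "email": "laura.chen@example.org", "city": "Seattle"}

print(f"{match_probability(richard, dick):.4f}")
print(f"{match_probability(richard, laura):.4f}")
```

With these assumed numbers, the differing first name lowers the score for Richard and Dick, but the shared email, surname, and city outweigh it, so the pair still lands above 95%. Change the weights carelessly and the same pair can fall below your threshold, which is the weighting mistake mentioned above.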

4. Machine learning-based matching

This data matching technique relies on you teaching the algorithm to spot connected entities. Usually, it involves labeling both matching and non-matching pairs from which the machine can learn. The complexity of patterns the match algorithm looks for far exceeds that of the other three methods, allowing it to adapt to new data and increase accuracy over time.

At this stage, it becomes problematic to say how the algorithm finds out that Richard and Dick are, in fact, the same person, but it does. Heck, it even matched him with his ex-wife.
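As a toy illustration of learning from labeled pairs, the Python sketch below trains a nearest-centroid classifier on three similarity features. This model is chosen only because it fits in a few lines and needs no external libraries. Production systems typically use far richer models and far more training data. All records and labels are invented.

```python
from difflib import SequenceMatcher

def rec(name, email, city):
    return {"name": name, "email": email, "city": city}

def features(a, b):
    """Turn a pair of records into numeric similarity features."""
    name_similarity = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    same_email = float(a["email"].lower() == b["email"].lower())
    same_city = float(a["city"].lower() == b["city"].lower())
    return [name_similarity, same_email, same_city]

def centroid(vectors):
    """Column-wise mean of a list of feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def train(labeled_pairs):
    """Learn one average feature vector for matches and one for non-matches."""
    matches = [features(a, b) for a, b, label in labeled_pairs if label]
    non_matches = [features(a, b) for a, b, label in labeled_pairs if not label]
    return centroid(matches), centroid(non_matches)

def predict(model, a, b):
    """A pair is a match if it lies closer to the match centroid."""
    match_center, non_match_center = model
    x = features(a, b)
    return squared_distance(x, match_center) < squared_distance(x, non_match_center)

# Hand-labeled training pairs (invented): True = same entity.
labeled_pairs = [
    (rec("Jon Smith", "jon@example.com", "Boston"),
     rec("John Smith", "jon@example.com", "Boston"), True),
    (rec("Anna Lee", "anna@example.com", "Denver"),
     rec("Anna Lee", "a.lee@example.net", "Denver"), True),
    (rec("William Turner", "wt@example.com", "Miami"),
     rec("Bill Turner", "wt@example.com", "Miami"), True),
    (rec("Maria Garcia", "mg@example.com", "Miami"),
     rec("Mario Gomez", "mario@example.net", "Miami"), False),
    (rec("Tom Brown", "tom@example.com", "Austin"),
     rec("Sara White", "sara@example.org", "Seattle"), False),
]
model = train(labeled_pairs)

richard = rec("Richard Dickinson", "r.dickinson@example.com", "Austin")
dick = rec("Dick Dickinson", "r.dickinson@example.com", "Austin")
stranger = rec("Laura Chen", "laura.chen@example.org", "Seattle")

print(predict(model, richard, dick))      # True
print(predict(model, richard, stranger))  # False
```

Nobody wrote a rule saying that a nickname plus a shared email means the same person. The model picked that up from the William and Bill example, which is the essence of the approach.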

5. Hybrid matching

As the name implies, this technique takes the best of the other four. Different methods can be applied sequentially or in parallel to maximize the chance of finding all matches. Of course, that doesn’t mean you have to use all four. A good combination can be running the machine learning model first and then checking with fuzzy matching to ensure the algorithm didn’t miss some less common instances.
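A sequential hybrid can be sketched as a list of stages tried in order, cheapest first. The Python example below chains an exact stage and a fuzzy stage rather than a machine learning model, purely to stay self-contained. The 0.7 similarity threshold, the field names, and the records are assumptions.

```python
from difflib import SequenceMatcher

def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(value.lower().split())

def exact_stage(a, b):
    """Stage 1: name and email must be identical after normalization."""
    return all(normalize(a[f]) == normalize(b[f]) for f in ("name", "email"))

def fuzzy_stage(a, b, threshold=0.7):
    """Stage 2: same email plus a sufficiently similar name."""
    same_email = normalize(a["email"]) == normalize(b["email"])
    similarity = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return same_email and similarity >= threshold

STAGES = [("exact", exact_stage), ("fuzzy", fuzzy_stage)]

def hybrid_match(a, b):
    """Return the name of the first stage that accepts the pair, else None."""
    for label, stage in STAGES:
        if stage(a, b):
            return label
    return None

richard = {"name": "Richard Dickinson", "email": "r.dickinson@example.com"}
shouting = {"name": "RICHARD  DICKINSON", "email": "R.Dickinson@example.com"}
dick = {"name": "Dick Dickinson", "email": "r.dickinson@example.com"}
amy = {"name": "Amy Reed", "email": "r.dickinson@example.com"}

print(hybrid_match(richard, shouting))  # exact
print(hybrid_match(richard, dick))      # fuzzy
print(hybrid_match(richard, amy))       # None
```

Because each stage is just a function, a trained model could be added to STAGES as another entry without changing the rest of the code.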

Most popular data matching use cases in the B2B sector

Data matching is widely used in a number of industries, such as education, healthcare, and the public sector. Government institutions also use it to detect fraud by spotting suspicious activities in banking and finance departments. That being said, we’d like to focus on the five most popular use cases of matching data in the B2B sector.

1. Email marketing

Probably the simplest data matching use case, email marketing benefits greatly from deduplication. After all, receiving the same offer twice might lead to instant unsubscribing. At the same time, the data matching process maximizes the audience by fixing emails with spelling errors and other minor issues.
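Both steps can be sketched in a few lines of Python with the standard library. The example assumes a hypothetical list of known domains and a 0.8 similarity cutoff. It also lowercases the whole address, which is a simplification. Any automatic correction like this should be reviewed before sending, since a rare but valid domain could resemble a popular one.

```python
from difflib import get_close_matches

# Hypothetical allow-list of domains you expect to see often.
KNOWN_DOMAINS = ["gmail.com", "outlook.com", "yahoo.com"]

def fix_domain(email, cutoff=0.8):
    """Lowercase the address and snap a near-miss domain to a known one."""
    local, _, domain = email.strip().lower().rpartition("@")
    if not local:
        return email  # no usable "@" part; leave the value untouched
    candidates = get_close_matches(domain, KNOWN_DOMAINS, n=1, cutoff=cutoff)
    return f"{local}@{candidates[0]}" if candidates else f"{local}@{domain}"

def dedupe_emails(emails):
    """Correct each address, then drop duplicates."""
    return sorted({fix_domain(e) for e in emails})

emails = [
    "Jane.Doe@gmail.com",
    "jane.doe@gmial.com",   # typo in the domain
    "sam@outlook.com",
    "sam@corp.example",     # unknown domain, kept as is
]
print(dedupe_emails(emails))
# ['jane.doe@gmail.com', 'sam@corp.example', 'sam@outlook.com']
```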

2. E-commerce

Machine learning-based matching is widely used to compare prices. A good example is aggregator websites, such as Google Shopping. Furthermore, with record linkage, you can detect the same products across all the stores and see if you can offer your goods to some of them.

3. Business intelligence

There is no knowledge that is not power unless it’s based on erroneous data. While data-driven decision-making has become a norm in business, so have the missteps caused by a simple comma taking zero’s seat. Also, it always helps to work with a leaner dataset sans duplicate and incomplete records.

4. Sales

The benefit of matching data for sales is twofold. Firstly, it identifies leads who may have registered more than once using different email addresses. Then, it enriches each record by merging information from two or more databases. This way, a salesperson contacting the lead can switch from “Dear Sir/Madam” to a proper name.

5. Service improvement

Checking the top reviews of your product or service might not tell the whole story. With data attribute matching, you can find similar reviews, remove duplicates, and identify negative review spamming. Heck, you can even find those five-star recommendations you wrote.

Data matching vs Data mapping

While sometimes used interchangeably, data matching and data mapping have slightly different meanings, which can become important in certain situations. The main difference, however, is that data matching is the first step, which involves identifying related data fields and records. Then comes data mapping – modifying entities to form golden records.

As you can see from the table above, both data mapping and matching processes are highly automated and can be done using different techniques or methods discussed earlier.

That being said, human intervention is still required to deal with ambiguous cases and refine the algorithms of data matching. Data mapping also needs a human touch for the initial setup and defining rules.

Conclusion

Data matching is an essential practice for improving data quality. At the same time, it’s a tool for better understanding of information and making its analysis easier. Removing duplicate, incomplete, and unintelligible records makes your dataset lighter and more actionable, increasing its value to business.

The demand for data analysis follows the ever-growing demand for data. Therefore, having just one semi-outdated dataset of B2B contacts may no longer be enough to keep up with the competition. And even if you have multiple, without proper dataset matching, there’s simply no way to rule them all.
