Text Match and Fuzzy Lookup

I would bet that every one of us has seen Customer lists from separate sources that can’t be matched up because of spelling mistakes and inconsistent naming conventions and abbreviations. In fact, there are consulting firms whose entire business revolves around helping organizations reconcile their “Master Data”. I can’t count how many times I’ve seen entire projects launched to create a “gold source for master data”. Yet here we are; manually matching up addresses, customer names, or people names from different data sources.

.

Well today, I’d like to introduce you to a free tool from Microsoft that can help ease your pain a little. This tool is a free Excel Add-in called Fuzzy Lookup. The Fuzzy Lookup Add-in enables you to perform similarity analysis to match textual strings that mean the same thing but are not spelled exactly the same. With this tool, you can match things like customer addresses from two separate data sources, or even find imperfect duplicates in a single dataset.

.

Understanding the Jaccard Index of Similarity

The magic behind the Fuzzy Lookup Add-in comes from the Jaccard Index of Similarity.

The Jaccard index, also known as the Jaccard similarity coefficient, was developed by Paul Jaccard as a way to document the distribution of different types of flora (yawn). Jaccard’s index gave him a statistical way to measure similarities between sample sets. The gist of Jaccard’s index is this:

.

You take two sample sets. You count the attributes shared by both samples (call this Z). You count the attributes observed in only one sample (call this X). You then count the attributes observed in only the other sample (call this Y). Finally, you can calculate the similarity index by using the three counts in this operation: Z / (Z+X+Y).

.

Confused? Don’t worry. I have an example.

Let’s say we want to compare the similarities between these two Excel nerds.

We define some attributes, then we tag each nerd with a 1 when each attribute evaluates to TRUE.

.

We can now get the count of the intersections and then perform the math.

As you can see, the similarity index is .57 or 57%.

The more similar two sample sets are, the closer Jaccard’s index will be to 1.

 

In its simplest form, the same kind of analysis can be done on Textual Strings using the Jaccard Index. You can see in this example, the words hair and hare result in an index of .60.

By the way, in terms of textual matches, 60% is not a great score. In my experience, most textual matches below 70% index is suspect.

 

Understanding the Fuzzy Lookup Transformations

Ok. Now you know that the Fuzzy Lookup algorithm is based on the Jaccard Index of Similarity. But it’s important to note that Fuzzy Lookup does not run on your raw data. Before running the Jaccard Index, Fuzzy Lookup applies some built-in transformation algorithms designed to improve its matching results.

 

The first thing to understand is that Fuzzy Lookup converts data using something called a Tokenizer. For example, the record {“DataPig”, “Bacon Boulevard”} might be tokenized into the set, {” DataPig”, “Bacon”, “Boulevard”}. Each word is considered to be a token. Tokens are assigned weights based on frequency of use. High weights are applied to tokens that are uncommon. Low weights are applied to tokens that come up frequently. For example, frequent words such as “Boulevard” are given a lower weight than less frequent words such as “DataPig”. Fuzzy Lookup gives you the opportunity to override the default token weights by supplying your own weights via an Excel Table.

.

Fuzzy Lookup also has algorithms and built-in dictionaries that allow for the automatic correction of spelling mistakes, word merge scenarios, string split scenarios, and string prefix corrections. All of these automatic transformation can be turned on and off by via the Fuzzy Lookup Configure dialog box. You can add custom Transformation rules by pointing the TransformationRowsetName property to your own Excel Table of rules.

Note: The sample workbook installed with the Fuzzy Lookup Add-in includes a Customization tab that shows you how to set up your own custom transformation table.

.

Using the Fuzzy Lookup Add-In

To use the Fuzzy Lookup Add-in, you’ll need to first convert your datasets into Excel Tables (click anywhere inside your data and press Ctrl+T).

In this example, we have two Excel Tables (IndProfile and CustInfo). We want to match up the customer names.

.

Click the Fuzzy Lookup command button to launch the tool.

.

In the Fuzzy Lookup dialog box, do the following:

1. Select the two tables you want matched.

2. Select the columns that hold the data you want matched.

3. Click the join button to drop your two columns down to the Match Columns section.

4. Select all the columns you want included in your output. Be sure to include the FuzzyLookupSimilarity column.

5. Define how many matches you want per record. I recommend setting this to 1.

6. Define the Similarity Threshold. This is the Jaccard Index number. As mentioned before, in my experience, any match with an index of less than .70 is typically not that accurate. I usually set this to .60 just to see the results.

.

Once you have defined your selections, be sure to place your cursor in a new blank worksheet. Be warned, Fuzzy Lookup WILL OVERWRITE DATA.

Press the Go button to output the results.

You will see a new table that includes the Output Columns you selected (in step 4 above). The Similarity column will show you the Jaccard index for each match. Any row with no matches will get an index of 0.

.

It’s important to review any matches with a similarity index less than 1.

Don’t take for granted that Fuzzy Lookup will get everything right.

For example, you can see in Row 7 below, Fuzzy Lookup gave us a .86 index score on the match for NVISION Inc. and WORLD VISION INC. An index score of .86 is typically a good match. However, we can see that in this case, the two names are not the same.

.

If you want to perform a Fuzzy search for duplicates within the same table, simply reference only that table when setting up the search parameters. For instance, in this example we are looking for Fuzzy duplicates in the Org Name column of the IndProfile table.

.

I guess that’s about it.

Fuzzy Lookup is one of those tools you really need to experiment with to get a feel for how helpful it can be.

Although it’s not perfect, it has saved me hours of scripting and manual matching.

So before you burn out your eyeballs staring at two data sets again, take some time to play with Fuzzy Lookup. It may become your new best friend.

14 thoughts on “Text Match and Fuzzy Lookup

  1. Jan Karel Pieterse

    Nice work Mike. I’ve used a different (simpler) algorithm to try to catch typos in my websites’ urls. The method is called the Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance) and boils down to counting the number of characters you have to change to get from string A to string B.
    So my error landing page gets the url the user used and computed the minimum Levenshtein for all available pages on my website.

  2. Jeff Weir

    Dick: you should see my bumper sticker:
    Economists do it with models.

    Mike: That TransformationRowSet thing is so awesome it deserves an image of its own in your otherwise very-non-fuzzy article.

  3. Jeff Weir

    It’s a real pity that the transformations table doesn’t accept wildcards though. I’d like to be able to substitute out anything from (1800) to (2015) to handle things like this:
    Evil Genius (2015) Inc.

    I have a formula-based transformations table in the book where I use (18??) and (19??) and (20??) to do this, which works a treat. But this tool don’t like it. Bummer.

  4. Ali

    I have multiple variables that I am matching on. Some of them are alphabetic words/phrases but one is numeric which can and needs to be matched exactly (It is not unique to the other variables) . I set configuration to ‘ExactMatch’ instead of default for that variable, however, fuzzy matching still ends up matching the numeric values less than accurately. Any advice on this?

  5. KB

    Have you experienced a bug with this add-in since installing Excel 2016? I find I have to set the threshold to less than .5 to see .8 matches. The scores are also being calculated differently. I reran some older data, and the scores were different.

  6. greg

    I tried using Excel fuzzy lookup on some jobs after reading this article and another one on another site and was only able to do “stemming” matches, not real fuzzy logic.

    For example Excel fuzzy lookup can match “Company ABC” with “Company ABC Inc.” But even notepad can do that with “Find/Replace”, (e.g., Find “Company ABC” and Replace with “Company ABC Inc.” 🙂

    My jobs which I assume are not uncommon are needing to match things like “122 Main Street #22” with “122 Mane Unit 22” along with typos/errors where a few random characters are incorrect.

    In any event I ended up going with a cloud based app: http://sullivansoftwaresystems.com/cgi-bin/fuzzy-lookup

  7. George Carlisle

    Can you publish your datasets? I’ve installed fuzzy and want to practice on your set used.

  8. Bikash Kumar

    This add in does not seem to be available anymore on the Microsoft site. Are there alternate links / add-ins / tools that someone could recommend?

    Thanks!

    Bikash

  9. Sabrina

    I confirm Ali experience, I’m not able to work with multiple columns match and configurations.
    In particular to enforce “Exact match” to some column and “default” to other columns .
    I’m working with geographical data, and I need to apply the fuzzy lookup for Street names and exact matching for city names, since I’d like to search streat name corrispondence only among the same city.
    The output instead show also similiar street name also referred to different city.

Leave a Reply

Your email address will not be published. Required fields are marked *