Quantifying Subjective Text with an Excel-Based Sentiment Analysis Tool

July 24th, 2014 by datapig

Sentiment Analysis is a growing field in the world of BI and Data Science. With social media serving up millions of data points by way of comments, tweets, and status updates, it's becoming increasingly important to be able to take all that subjective text and turn it into quantitative data that can be integrated with other metrics. That is essentially what Sentiment Analysis is - a term that describes the exercise of quantifying subjective text by analyzing certain keywords deemed to be either positively or negatively charged.


How Sentiment Analysis Works

In its simplest form, a Sentiment Analysis algorithm can be nothing more than determining whether a text string has more positive keywords than negative keywords. Text with more positive keywords is classified as positive content, while text with more negative keywords is classified as negative content. Some Sentiment Analysis algorithms go far beyond this simple keyword-based methodology, delving into idioms, colloquialisms, language patterns, sentiment shifter words, and so on.


Regardless of complexity, all Sentiment Analysis tools apply some sort of scoring system to findings. A typical scoring system looks something like this:

  • (+2) very positive
  • (+1) somewhat positive
  • (0) neutral
  • (-1) somewhat negative
  • (-2) very negative

If a given comment or tweet is evaluated and considered to be very positive in sentiment, that text string is given a +2. A text string considered neutral gets a 0. The idea is that if 5,000 comments or tweets are classified, a picture starts to form of the general attitude toward a particular topic.


You can do a search for "on-line sentiment analysis" and you'll get a dozen websites that let you test drive their model. Most are trying to sell their wares (like I said, it's becoming big business). But one educational site will let you play around with their Text2Sentiment API for free and without an API key. If you're interested, go to www.datasciencetoolkit.org and scroll down to their Text to Sentiment tool.


An Excel-Based Sentiment Analysis Tool

After playing around with this for a while, I decided to try and roll my own Sentiment Analysis tool with Excel and VBA.

You can download it here and take a look as you follow along.


When you open the workbook, you'll see some sample comments I pulled from a hotel chain's Facebook feed. The score next to each comment is generated by a function called GetSentimentValue, which returns a score between -5 and 5. A low score indicates that there were words associated with dissatisfaction; a high score indicates one or more 'positive' words in the comment. A score of zero means either that no charged words were found, or that the positive and negative words balanced each other out. I applied the Conditional Formatting manually. You can imagine that if you had hundreds or thousands of comments, you could easily apply this function and then identify the negative comments, apply pivot tables, create histograms, build charts, etc. The point is that the text in your comments just went from being highly unusable to being a relatively useful data point.


Change any comment and the GetSentimentValue function will adjust the score to reflect the new sentiment of that comment.


The GetSentimentValue function applies a very basic algorithm (the same one used by the folks at datasciencetoolkit.org). It takes a word-for-word keyword approach: parsing each word in the text, determining whether each word is a positive or negative keyword, and assigning a score to it. It then averages the scores of all the found keywords to get a final score. For example, "Great trip. Thanks for the warm welcome" gets a score of +2 because "great" has a weight of +3 and "warm" has a weight of +1, so the average is +2.
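For illustration, here's roughly how that word-for-word averaging works, sketched in Python (the tiny lexicon is just a hypothetical AFINN-style subset, not the full list in the workbook):

```python
import re

# A hypothetical AFINN-style subset; weights for "great" and "warm" match the example above
LEXICON = {"great": 3, "warm": 1, "dirty": -2, "terrible": -3}

def sentiment_score(text):
    """Average the weights of any charged keywords found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0  # 0 = neutral / no charged words

print(sentiment_score("Great trip. Thanks for the warm welcome"))  # → 2.0
```

The VBA in the workbook follows the same shape, just with a Collection instead of a dictionary.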


Understanding the Keyword List

The keyword list and scoring system used in the function come from the AFINN-111 affective lexicon developed by Finn Årup Nielsen. That list is stored in the VBA module and exposed through a Collection. For ease of editing, I split the keyword list across several Private Subs. This way, if you want to add, edit, or remove a word, you can do so directly in the code.


The list of keywords and terms used in a Sentiment Analysis is quite important. No list will cover all needs, so most folks using Sentiment Analysis find that they frequently need to adjust their keywords to account for the types of comments they encounter. This is called "training the model", and the inability to do it is frankly the primary drawback to using a Sentiment Analysis API service instead of a local tool. With an API service, you have no way to adjust the model. For example, imagine you worked at the electronics superstore Best Buy. In most Sentiment Analysis models, the word "best" is positively charged, so you can imagine how many false positives you would get from people including your company's name in their comments. You would definitely want a way to train the model to handle the word "best" in a particular way.
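As a sketch of what that training might look like with a local keyword list (again a hypothetical Python subset for illustration; in the workbook you'd edit the Private Subs instead), note that "best" carries a +3 weight in AFINN-111:

```python
# Hypothetical AFINN-style subset
lexicon = {"best": 3, "broken": -2, "helpful": 2}

def score(text, lex):
    """Average the weights of any charged keywords found in the text."""
    hits = [lex[w] for w in text.lower().split() if w in lex]
    return sum(hits) / len(hits) if hits else 0

comment = "best buy shipped me a broken tv"
print(score(comment, lexicon))    # "best" skews this positive: 0.5

trained = {**lexicon, "best": 0}  # neutralize the brand word
print(score(comment, trained))    # now scores -1.0
```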


Scenarios to Be Aware of When Relying on This and Other Sentiment Analysis Tools

  • Keep in mind that the algorithm I chose is the simplest (laziest) one I could get away with. Each word is evaluated separately from the others. Many Sentiment Analysis tools found on-line use the same kind of algorithm. This means key sentiments made up of two or more words are missed. For example, these sentiments aren't caught: "white wash", "checked out", "no go", and "won't work". There is also no accounting for common phrases that represent sentiment, such as "I'm over the moon" or "customer for life".
  • Most Sentiment Analysis tools have no way to capture context. A positive or negative sentiment word can have the opposite connotation depending on context (e.g. "This is a great hotel, if you like the smell of urine")
  • Sentiment ambiguity is a real problem for most tools. Text with sentiment ambiguity is often classified as neutral. For example, "The rooms at this hotel are priced higher than others in the area" doesn't contain any charged words, although it's clearly a negative comment.
  • Sarcastic comments are often misclassified in most Sentiment Analysis tools. (e.g. "I called to move my prepaid reservation because my wife went into labor; I was told that is our problem. Wonderful. Such lovely customer service.").
  • Slang needs to be taken into account when "training the model". For example, "sick", or "bananas" can have a positive or negative sentiment depending on the context.


Enjoy playing with the workbook. I'd love to share an enhanced version should any gurus out there want to take a shot at improving this thing.

Generating Normally Distributed Random Numbers – Sample Data that Makes Sense

July 16th, 2014 by datapig

Like many of you, I'm often asked to develop mockups (working demos of dashboards and reports).

In my zeal to make my mockup as realistic as possible, I spend a good bit of time generating sample data.


For example, if I'm mocking up an HR dashboard, I'll dedicate some brain power to generating salary data. In order to make salary data realistic, there has to be a believable mix of high-salary and low-salary records. I can't just use the RAND and RANDBETWEEN functions Excel provides. These functions generate random data in a "Uniform Distribution" - that is to say, every number in the range has an equal chance of occurring.
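You can see the uniform behavior with a quick Python sketch, where random.randint plays the role of RANDBETWEEN:

```python
import random

random.seed(0)  # fixed seed so the demo is repeatable
salaries = [random.randint(1000, 200000) for _ in range(100_000)]

# Count records in two 10k-wide salary bands
low_band = sum(1 for s in salaries if 51_000 <= s <= 60_000)
high_band = sum(1 for s in salaries if 191_000 <= s <= 200_000)

# Both bands end up with roughly the same count (~5,000 each),
# which is not how real salary data behaves
print(low_band, high_band)
```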


Let's say I use this formula and copy it down to 500 rows.

=RANDBETWEEN(1000,200000)

I instantly create 500 salary numbers, but there's a problem.

Plotting my newly created data in a histogram reveals the issue.

In this chart, the columns represent the count of records that fall within each salary range. Notice that there are just as many records in the 191k-200k range as in the 51k-60k range. This is definitely not something that holds true in the real world.


In reality, salary data has a distribution closer to the one shown in the histogram below.

There are lots of people with salaries between 40k and 80k, and far fewer with salaries above 150k.

So if the RAND and RANDBETWEEN functions fail, how does an aspiring Excel nerd easily create sample data that makes sense for its intended purpose?


Well, after years of going through the pain of creating sample data myself, I finally had enough and pulled together this function called RandNormalDist. This function, based on code by Donald Knuth, allows you to automatically generate random data that falls within a normal distribution. You can copy the code shown here and paste it into a standard module.


Function RandNormalDist(Optional sMean As Double = 1, Optional sDeviation As Double = 1, Optional sPosSkew As Single = 0)
'Mike Alexander: www.datapigtechnologies.com
'Based on code originally published by Donald Knuth
'***************************************************************************
'This function will generate a random number that falls within a normal
'distribution based on a starting Mean and Standard Deviation.

'You can call this function from an Excel spreadsheet by entering
'=RandNormalDist(sMean, sDeviation, sPosSkew)

'Example Usage 1:
'This generates a random number that falls within a distribution where the Mean is 20
'=RandNormalDist(20)

'Example Usage 2:
'This generates a random number that falls within a distribution where the Mean is 20
'and the Standard Deviation is 3
'=RandNormalDist(20, 3)

'Example Usage 3:
'This generates only a POSITIVE random number that falls in a distribution where
'the Mean is 20 and the Standard Deviation is 3
'=RandNormalDist(20, 3, 1)
'***************************************************************************
Dim x As Double
Dim y As Double
Dim z As Double

'Calculate initial coordinate pairs (z must land inside the unit circle,
'and must be greater than zero so Log(z) is defined)
Start:
    z = 2
    Do Until z <= 1 And z > 0
        x = 2 * Rnd - 1
        y = 2 * Rnd - 1
        z = x ^ 2 + y ^ 2
    Loop

'Do the math to get the base multiplier
    z = Sqr((-2 * Log(z)) / z)

'Account for the sPosSkew Positive request and factor in the other user inputs
    If sPosSkew = 1 And (sDeviation * x * z + 1) < 0 Then
        GoTo Start
    End If

    RandNormalDist = Int((sDeviation * x * z + 1) * sMean)

End Function


Once you have the code in your workbook, you can call the function by entering it into your spreadsheet.

The idea is to enter the desired formula and copy it down to as many samples as you need.


This function accepts three optional arguments:

Mean: The average value you would like your sample data to have. If omitted, the function uses 1 as the Mean.

Standard Deviation: The number of standard deviations you would like your sample data values to be from the Mean. If omitted, the function uses 1 as the standard deviation.

Skew Positive: A tag that tells the function to only return positive numbers. If omitted, the function includes negative numbers in the resulting random numbers.
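If you want to sanity-check the logic outside of Excel, here is a rough Python translation of the function (my port, not the workbook's code; Python's random.random stands in for VBA's Rnd, and a boolean stands in for the 0/1 skew flag):

```python
import math
import random

def rand_normal_dist(mean=1.0, deviation=1.0, pos_skew=False):
    """Marsaglia polar method, mirroring the VBA RandNormalDist sketch."""
    while True:
        # Draw points in the unit square until one lands inside the unit circle
        z = 2.0
        while not (0.0 < z <= 1.0):
            x = 2.0 * random.random() - 1.0
            y = 2.0 * random.random() - 1.0
            z = x * x + y * y
        z = math.sqrt(-2.0 * math.log(z) / z)  # base multiplier
        value = (deviation * x * z + 1.0) * mean
        if pos_skew and value < 0:
            continue                           # redraw until positive
        return math.floor(value)               # VBA's Int() floors

random.seed(1)
sample = [rand_normal_dist(20, 1) for _ in range(5000)]
print(sum(sample) / len(sample))  # hovers near 20
```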


Here are a few examples of using the RandNormalDist function.


Entering RandNormalDist(20) will give you a tight group of numbers that fall around the number 20. As you can see, both negative and positive numbers are returned in a normal distribution.


Entering RandNormalDist(20,3) will give you a more dispersed (spread out) set of numbers. Again, you will see both negative and positive numbers returned in a relatively normal distribution.


Entering RandNormalDist(20,2,1) will give you a distributed group of ONLY POSITIVE numbers that taper away from the Mean. This kind of distribution is ideal for creating sample data where the numbers need to start around a Mean and move into a handful of higher positive values. This is ideal for generating random samples such as Salary data, Commission data, Help Desk Ticket data, etc.


This function has saved me tons of time and made my mock data so much better.

Hopefully it will do the same for you!

Bonus: "Yo momma" joke for Statisticians:

Lazy Summer Update

July 9th, 2014 by datapig

Hey there… remember me? I've been having a lazy summer, cooking my obscene body in the hot South Carolina sun. I'll be back with regular posts next week. I just wanted to get this update out.


New Book in August

Dick Kusleika and I wrote an Excel Formulas book. This book is for anyone who wants to jump head first into formulas. Here, we offer step-by-step instructions for creating and using 101 of the most commonly requested formulas. We include tips, tricks and detailed explanations along with actual working formulas. This is targeted at beginning to intermediate Excel users. Feel free to click on the book to see more details.


The Next Excel Business Intelligence Boot Camp

I've been getting a ton of emails asking when the next Excel Business Intelligence Boot Camp will be held. I have an answer. April 21 - April 23 in Dallas, Texas. At this event, we'll cover dashboards, visualizations, Excel's Power BI tools, and much more. I'll have more info on that in a future post.


If You Can't Wait for the Boot Camp

If you can't wait for the next Excel BI boot camp, then do the next best thing. Pick up the book at Amazon. It has an excellent plot and it's based on a true story.



A Random Summer Song to Finish out the Week

This song is about riding on the back of a Jet-Ski with an orangutan.

8 VBE Tips Every Programmer Should Know

June 19th, 2014 by datapig

I've covered these time-saving VBE (Visual Basic Editor) tricks across several posts over the years. I thought it would be helpful to get them all into a single post.


Whether you're a fresh-faced analyst new to programming, or a jaded veteran living on sunflower seeds and Mountain Dew, these tips will make programming with Excel and Access so much easier.


Tip 1: Block Comments

Are you still commenting code one line at a time?

Well you can save time by using the Comment Block command on the Edit toolbar.

This command lets you highlight entire sections of code and comment them at the same time.


To get to the Edit toolbar, go up to the VBE menu and click View->Toolbars->Edit.

Once you've activated the Edit toolbar, you can dock it to the top of your screen.


Tip 2: Quick Copy Blocks of Code

Copy whole blocks of code by holding down the Ctrl key on your keyboard while dragging the block where you need it.

This is an old Windows trick that works even when you drag across modules.


Tip 3: Jumping between Modules and Procedures

Press Ctrl+Tab to quickly move between modules.

Pressing Ctrl+Page Up and Ctrl+Page Down moves you between procedures within a module.


Tip 4: Jumping directly to your Functions/Variables

If you encounter a Function or a Variable that you would like to explore, simply click on the Function/Variable name and press Shift+F2.

This will take you right to it. Pressing Ctrl+Shift+F2 will take you back.


Tip 5: Staying in the Right Procedure

Have you ever tried to scroll through your code only to find that you've scrolled into the next procedure?

You try to remedy this by scrolling up, but you overshoot again into the wrong procedure.

To avoid this nonsense, use the Procedure View button in the lower-left corner of the VBE.

Clicking this button will limit scrolling to only the function or procedure you're in.

The button next to it will take you back to full module view.


Tip 6: Dragging the Yellow Arrow

You know that yellow arrow you see when you debug your code?

Well, you can drag it up or down to the particular line of code you're interested in running.

You can use this technique to easily skip bits of code while debugging.


Tip 7: See the Beginning and End of Variable Values

Hovering over your variables (while in debug mode) activates a tooltip that shows you the values that are being passed in and out of them - very useful.

By default, these tooltips will show around 70 or so characters (including the variable name).

If the value in your variable is too long, it gets cut off.


To see the last 70 or so characters, simply hold down the CTRL key while you hover.



Tip 8: Turning off Auto Syntax Check

Oftentimes while working on some code, you find that you need to go to another line to copy something.

You're not done with the line – you just need to leave it for a second.

But VBE immediately stops you in your tracks to tell you something you already know.

After a half-day of this, you're ready to clean your ears out with a gun.

Well, you can save your bullets and your sanity by turning off Auto Syntax Check.

Go up to the VBE menu and select Tools -> Options.

The options dialog box will activate, exposing the Auto Syntax Check option under the Editor tab.

Simply uncheck the property and the VBE will keep quiet.

The neat thing is that your code still turns red if you goof up, providing a visual indication that something is wrong.


Feel free to comment on any other tip I may have missed.

Enjoy your Mountain Dew and Sunflower seeds.

Rounding to Significant Digits

June 16th, 2014 by datapig

I was recently asked to show some financial figures in significant digits. I'll be honest and say I've never had to do this in my many years of reporting, so I didn't exactly know what that meant. Apparently, the idea is that when you're dealing with numbers in the millions, there is no need to inundate a report with superfluous digits for the sake of showing precision down to the tens, hundreds, and thousands places.


As an example, instead of showing the number 883,788, you could choose to round the number to one significant digit. This would mean displaying the same number as 900,000. Rounding 883,788 to two significant digits would show the number as 880,000.


In essence, you're deeming that a particular number's place is significant enough to show; the rest of the number can be replaced with zeros. This may feel like it could introduce problems, but when dealing with large enough numbers, anything below a certain significance is inconsequential.


The screenshot below demonstrates the formula I used to round my values to a given number of significant digits.


How it Works

We all know that Excel's ROUND function is used to round a given number to a specified number of digits (the ROUND function takes two arguments: the original value and the number of digits to round to). What you may not know is that passing a negative number as the second argument tells Excel to round based on significant digits to the left of the decimal point. For instance, the following formula returns 9500.

=ROUND(9489,-2)

Changing the significant digits argument to -3 returns a value of 9000.

=ROUND(9489,-3)

This works great, but what if you have numbers on differing scales? That is to say, what if some of your numbers are in the millions while others are in the hundreds of thousands? If you wanted to show them all with 1 significant digit, you would need to build a different ROUND function for each number to account for the differing scales.


To help solve this issue, I replaced the hard-coded significant digits argument with a formula that calculates what that number should be. Imagine that your number is -2330.45. You can use this syntax as the significant digits argument in your ROUND function.

LEN(INT(ABS(-2330.45)))*-1+2

This first wraps your number in the ABS function, removing any negative symbol. It then wraps that result in the INT function, stripping out any decimals, and wraps that in the LEN function to measure how many characters are in the number without decimals or negation symbols. In this example, that part of the formula results in the number 4: take -2330.45, strip away the decimals and the negative symbol, and you have 4 characters left. This number is then multiplied by -1 to make it negative, and then added to the number of significant digits you are looking for. In this example, that calculation looks like this: 4*-1+2 = -2.


I used this as the second argument for the ROUND function.

Test it out yourself. Enter this formula into Excel, and you'll round -2330.45 to -2300 (2 significant digits).

=ROUND(-2330.45, LEN(INT(ABS(-2330.45)))*-1+2)
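The same logic ports to a small Python helper if you want to test values outside of Excel (one caveat: at exact .5 ties, Python's round uses banker's rounding, where Excel's ROUND rounds away from zero):

```python
def round_significant(value, digits):
    """Round to `digits` significant digits, mirroring
    =ROUND(value, LEN(INT(ABS(value)))*-1 + digits)."""
    int_digits = len(str(int(abs(value))))  # characters left of the decimal
    return round(value, -int_digits + digits)

print(round_significant(883788, 1))    # → 900000
print(round_significant(883788, 2))    # → 880000
print(round_significant(-2330.45, 2))  # → -2300.0
```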


As you can see in my screenshot, I simply replaced the hard-coded numbers with a cell reference that points to the actual value to be rounded and a cell reference that holds the number of desired significant digits.

=ROUND(B5,LEN(INT(ABS(B5)))*-1+$E$3)


Again, I'm not sure I'll ever need to use this significant digits nonsense in the future, but at least I now know how.

5 Years of Bacon Bits

June 9th, 2014 by datapig

On June 10th 2014 the Bacon Bits blog officially hits the 5-year mark.

Yes, this humble blog of mine has been changing the lives of countless nerds for 5 years.

So today, you get to watch me stroke my own ego with this self-congratulatory navel-gazing post. Enjoy!

» Read more: 5 Years of Bacon Bits

Using Pictures to Create Info-Graph Charts

June 6th, 2014 by datapig

I've been playing around with using pictures in my Excel info-graphics. I rarely use pictures in charts, but with an info-graphic, it seems like pictures could be appropriate for this type of visualization.

Here are a couple of examples I pulled together to illustrate the kinds of things you could do.


In this example, I created a waffle chart and then filled the two data series with their own pictures. » Read more: Using Pictures to Create Info-Graph Charts

Aligning Custom Formatting Icons

June 3rd, 2014 by datapig

Hello there Excel nerds.

Well it's been a few weeks since my last post. I've been wrapping up my last book of the year and doing some other projects. I'm ready to get back into the blog saddle.


Today, I'd like to show you a quick tip for fixing the alignment of Custom Formatting Icons. If you don't know what those are, see the post I wrote on Improving the Default Percent Formatting.


In that post, I show » Read more: Aligning Custom Formatting Icons

Creating a Slide-Out Menu in Excel

May 15th, 2014 by datapig

I've been toying with the idea of an animated Slide-Out Menu in Excel. As you can see, a Slide-Out Menu is simply a macro-driven way to hide your slicers away until you need them.

» Read more: Creating a Slide-Out Menu in Excel

Getting Fancy with your Excel Slicers

April 30th, 2014 by datapig

During last week's BI Boot Camp, I demonstrated how to customize Excel Slicers to better fit into the color scheme and theme of a dashboard. The process of fully customizing a slicer is a bit of a drag. But once you have the hang of it, you'll never use the default slicers again. In this post, I'll give you a few ideas for your slicers and show you how the customization works.


First, let's go through a few examples of some fancy slicers.


Tabs: With a little customization and a few tricks, I made this slicer look like tabs behind my pivot table.


Clean Dashboard Look: I customized this slicer so that any values with no data are kept subdued with no buttons around them. The modern color scheme gives it a clean dashboard feel.

» Read more: Getting Fancy with your Excel Slicers