Sentiment Analysis is a growing field in the world of BI and Data Science. With social media serving up millions of data points by way of comments, tweets, and status updates, it's becoming increasingly important to be able to take all that subjective text and turn it into quantitative data that can be integrated with other metrics. That is essentially what Sentiment Analysis is: the exercise of quantifying subjective text by analyzing keywords deemed to be either positively or negatively charged.
How Sentiment Analysis Works
In its simplest form, a Sentiment Analysis algorithm does nothing more than determine whether a text string has more positive keywords than negative keywords. Text with more positive keywords is classified as positive content, while text with more negative keywords is classified as negative content. Some Sentiment Analysis algorithms go far beyond the simple keyword-based methodology, delving into idioms, colloquialisms, language patterns, sentiment shifter words, and more.
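To make the simplest form concrete, here is a minimal Python sketch of the keyword-counting idea. The two keyword sets and the function name `classify` are made up for illustration; a real tool would use a much larger list.

```python
import re

# Hypothetical, tiny keyword sets chosen only for illustration.
POSITIVE = {"great", "good", "love", "clean", "friendly"}
NEGATIVE = {"bad", "dirty", "rude", "terrible", "noisy"}


def classify(text: str) -> str:
    """Label text by comparing counts of positive and negative keywords."""
    # Lowercase the text and pull out runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(1 for word in words if word in POSITIVE)
    negative_hits = sum(1 for word in words if word in NEGATIVE)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    # Ties and texts with no keywords at all fall through to neutral.
    return "neutral"
```

With these sets, "Great staff and clean rooms" comes back as positive, while a comment with no listed keywords comes back as neutral.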
Regardless of complexity, all Sentiment Analysis tools apply some sort of scoring system to findings. A typical scoring system looks something like this:
- (+2) very positive
- (+1) somewhat positive
- (0) neutral
- (-1) somewhat negative
- (-2) very negative
If a given comment or tweet is evaluated and considered very positive in sentiment, that text string is given a +2. A text string considered neutral gets a 0. The idea is that once 5,000 comments or tweets have been classified, a picture starts to form of the general attitude toward a particular topic.
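As a rough illustration of how that picture forms, the Python sketch below tallies a batch of scores on the five-point scale and reports the average. The function name `summarize` and the sample scores are assumptions for the example, not output from any real tool.

```python
from collections import Counter

# The five-point scale described above.
LABELS = {
    2: "very positive",
    1: "somewhat positive",
    0: "neutral",
    -1: "somewhat negative",
    -2: "very negative",
}


def summarize(scores):
    """Return (count per label, mean score) for a list of -2..+2 scores."""
    counts = Counter(scores)
    # Walk the scale from +2 down to -2 so the result reads top to bottom.
    distribution = {LABELS[s]: counts.get(s, 0) for s in sorted(LABELS, reverse=True)}
    mean = sum(scores) / len(scores) if scores else 0.0
    return distribution, mean


# Made-up scores standing in for a batch of classified comments.
distribution, mean = summarize([2, 1, 1, 0, -1, -2, 2, 1])
```

Here the mean works out to 0.5, which suggests a mildly positive attitude overall, and the distribution shows how many comments landed in each bucket.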
You can do a search for "on-line sentiment analysis" and you'll get a dozen websites that let you test drive their model. Most are trying to sell their wares (like I said, it's becoming big business). But one educational site will let you play around with their Text2Sentiment API for free and without an API key. If you're interested, go to www.datasciencetoolkit.org and scroll down to their Text to Sentiment tool.
An Excel-Based Sentiment Analysis Tool
After playing around with this for a while, I decided to try and roll my own Sentiment Analysis tool with Excel and VBA.
You can download it here and take a look as you follow along.
When you open the workbook, you'll see some sample comments I pulled from a hotel chain's Facebook feed. The score next to each comment is generated by a function called GetSentimentValue. This function returns a score between -5 and 5. A low score indicates that the comment contains words associated with dissatisfaction. A high score shows that one or more 'positive' words occur in the comment. A score of zero means either that no charged words were found, or that the positive and negative words balanced each other out. I applied Conditional Formatting manually. You can imagine that if you had hundreds or thousands of comments, you could easily apply this function and do things like identify the negative comments, apply pivot tables, create histograms, build charts, etc. The point is that the text in your comments just went from being nearly unusable to being a relatively useful data point.
Change any comment and the GetSentimentValue function will adjust the score to reflect the new sentiment of that comment.
The GetSentimentValue function applies a very basic algorithm (the same one used by the folks at datasciencetoolkit.org). It uses a word-by-word keyword approach, parsing each word in the text, determining if each word is a positive or negative keyword, and assigning a score to each word. It then averages the scores of all the found keywords to arrive at a final score. For example, "Great trip. Thanks for the warm welcome" gets a score of +2 because "great" has a weight of +3, and "warm" gives +1, so the average is +2.
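The averaging step is easy to sketch in Python. This is not the workbook's VBA, just an illustration of the same idea; the name `sentiment_value` is hypothetical, and the weights dictionary holds only the two words from the example plus two negative entries with assumed weights.

```python
import re

# "great" and "warm" use the weights from the example above;
# the two negative entries are assumed values for illustration.
WEIGHTS = {"great": 3, "warm": 1, "terrible": -3, "dirty": -2}


def sentiment_value(text, weights=WEIGHTS):
    """Average the weights of every keyword found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    found = [weights[word] for word in words if word in weights]
    if not found:
        # No charged words at all: treat the text as neutral.
        return 0.0
    return sum(found) / len(found)


score = sentiment_value("Great trip. Thanks for the warm welcome")
```

With this small dictionary, the example comment averages (3 + 1) / 2 = 2.0. Note that words not in the list don't dilute the average; only the keywords that were found count toward the denominator.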
Understanding the Keyword List
The keyword list and scoring system used in the function come from the AFINN-111 affective lexicon developed by Finn Årup Nielsen. That list is stored in the VBA module and exposed through a Collection. For ease of editing, I split the keyword list into several Private Subs. This way, if you want to add, edit, or remove a word, you can do so directly in the code.
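If you'd rather load the list from text than hard-code it, here is a small Python sketch. It assumes the tab-separated "term, tab, score" layout in which the AFINN-111 list is distributed; the function name `load_lexicon` and the handful of sample lines are only for illustration, and a dictionary plays the role the Collection plays in the workbook.

```python
# A few sample lines in the assumed "term<TAB>score" layout.
SAMPLE = "abandon\t-2\ngreat\t3\nwarm\t1\nnot good\t-2\n"


def load_lexicon(raw: str) -> dict:
    """Parse tab-separated term/score lines into a {term: score} dict."""
    lexicon = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        # Split on the last tab so multi-word terms stay intact.
        term, value = line.rsplit("\t", 1)
        lexicon[term] = int(value)
    return lexicon


lexicon = load_lexicon(SAMPLE)
```

Once loaded, adding, editing, or removing a word is just a dictionary update, for example `lexicon["warm"] = 2`.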
The list of keywords and terms used in a Sentiment Analysis is quite important. No list will cover all needs, so most folks using Sentiment Analysis find that they frequently need to adjust their keywords to account for the types of comments they encounter. This is called "training the model". It's frankly the primary drawback to using a Sentiment Analysis API service instead of a local tool. With an API service, you have no way to adjust the model. For example, imagine you worked at the electronics superstore Best Buy. In most Sentiment Analysis models, the word "best" is positively charged. So you can imagine how many false positives you would get as a result of people including your company's name in their comments. You would definitely want a way to train the model to handle the word "best" in a particular way.
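One possible way to handle the Best Buy case, sketched in Python: strip the company name out of the text before scoring, so that "best" still counts everywhere else. The `ignore_phrases` parameter, the function name `score`, and the three-word weights dictionary are all assumptions made up for this example.

```python
import re

# Illustrative weights only; a real list would be far longer.
BASE_WEIGHTS = {"best": 3, "great": 3, "bad": -3}


def score(text, weights, ignore_phrases=()):
    """Average keyword weights after blanking out phrases to ignore."""
    lowered = text.lower()
    for phrase in ignore_phrases:
        # Remove brand names and similar phrases before looking for keywords.
        lowered = lowered.replace(phrase.lower(), " ")
    words = re.findall(r"[a-z']+", lowered)
    found = [weights[word] for word in words if word in weights]
    return sum(found) / len(found) if found else 0.0


comment = "Best Buy lost my order, bad service"
untrained = score(comment, BASE_WEIGHTS)
trained = score(comment, BASE_WEIGHTS, ignore_phrases=("Best Buy",))
```

Untrained, the "best" in the company name cancels out "bad" and the comment looks neutral (0.0). With the name ignored, the same comment scores -3.0, which is much closer to what the customer meant.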
Scenarios to Be Aware of When Relying on This and Other Sentiment Analysis Tools
- Keep in mind that the algorithm I chose is the simplest (laziest) one I could get away with. Each word is evaluated separately from the other words. Many Sentiment Analysis tools found on-line use the same kind of algorithm. This means key sentiments made up of two or more words are missed. For example, these sentiments aren't caught: "white wash", "checked out", "no go", and "won't work". Also, there is no accounting for common phrases that represent sentiment, such as "I'm over the moon" or "customer for life".
- Most Sentiment Analysis tools have no way to capture context. A positive or negative sentiment word can have the opposite connotation depending on context (e.g., "This is a great hotel, if you like the smell of urine").
- Sentiment ambiguity is a real problem for most tools. Text with sentiment ambiguity is often classified as neutral. For example, "The rooms at this hotel are priced higher than others in the area" doesn't contain any charged words, although it's clearly a negative comment.
- Sarcastic comments are often misclassified by Sentiment Analysis tools (e.g., "I called to move my prepaid reservation because my wife went into labor; I was told that is our problem. Wonderful. Such lovely customer service.").
- Slang needs to be taken into account when "training the model". For example, "sick" or "bananas" can have a positive or negative sentiment depending on the context.
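Of the scenarios above, the multi-word one is the easiest to chip away at. Here is a minimal Python sketch that matches phrases first and only then falls back to single words. The phrase weights are values I made up for illustration, and the function name `score_with_phrases` is hypothetical. It does nothing for context, ambiguity, or sarcasm.

```python
import re

# Hypothetical phrase weights, invented for this example.
PHRASE_WEIGHTS = {
    "won't work": -2,
    "no go": -2,
    "over the moon": 3,
    "customer for life": 3,
}
WORD_WEIGHTS = {"great": 3, "warm": 1}


def score_with_phrases(text):
    """Score multi-word phrases first, then any remaining single keywords."""
    # Normalize to lowercase words separated by single spaces.
    remaining = " ".join(re.findall(r"[a-z']+", text.lower()))
    found = []
    # Longest phrases first, so a longer phrase wins over a shorter overlap.
    for phrase in sorted(PHRASE_WEIGHTS, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        hits = re.findall(pattern, remaining)
        found.extend([PHRASE_WEIGHTS[phrase]] * len(hits))
        # Blank out matched phrases so their words aren't scored twice.
        remaining = re.sub(pattern, " ", remaining)
    found.extend(WORD_WEIGHTS[w] for w in remaining.split() if w in WORD_WEIGHTS)
    return sum(found) / len(found) if found else 0.0
```

With these weights, "I'm over the moon" scores 3.0 and "The key card won't work" scores -2.0, where the word-by-word approach would have called both neutral.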
Enjoy playing with the workbook. I'd love to share an enhanced version should any gurus out there want to take a shot at improving this thing.