Brewing a Coffee Recommender (Part 1)
For many, the day really begins with the first sip of coffee in the morning. This magical beverage has been with us for centuries, but the proliferation of roasters and offerings in the market has accelerated in the last couple of decades. As the market shifted from First Wave (think Folgers), through Second Wave (think Starbucks), and into Third Wave (think your local roaster), shelves and online points of sale have left consumers inundated with options. What’s a coffee lover to do? Moreover, what about the consumer who is too much of a novice to differentiate, or the one who does not particularly want to do the work of sifting through the options?
To help in this process, I wanted to analyze descriptions of coffee using Natural Language Processing (NLP) to ascertain how the descriptions cluster into topics and to create a metric for comparing coffees to one another. This comparison could then be the basis for a content-based recommender, a tool for the coffee enthusiast to find new roasters and the coffee novice to gain guidance. My goal in this post is to outline the steps needed to create a recommendation system from coffee reviews, from data scraping and cleaning to the final output.
To begin, it is important to consider what will be needed to accomplish the task. I wanted to find a dataset of coffees and descriptions of the experience of drinking those coffees. Looking around online, I found a site called Coffee Review that had been regularly posting reviews and scores of coffees from roasters around the world for years. I was most excited to realize that each coffee had a dedicated “Blind Assessment” section. That meant I had a few sentences of adjective-heavy description about each coffee without any mention of the roaster, the origin, or the roast level. These formed the basis of my corpus, each document being a window into the experience of drinking that particular coffee. My hope was that these descriptions might end up providing a breakdown of each coffee into components that line up with something like the Coffee Wheel (from SCA or Counter Culture) as seen below:
I set about gathering the data by scraping with Beautiful Soup, pulling in the text mentioned above along with other details about each coffee that would be useful for secondary analysis. All in, the site provided just under six thousand reviews of coffee, spanning the last two decades. I knew that this trove of descriptors would be interesting to dive into, but also that some of the reviews were quite old and might not give actionable results for purchasing. Still, if the recommendations were valid, the approach could function as a useful model for any input set of descriptions that a store or roaster had in stock.
With my text acquired, I was able to turn my attention to cleaning and preprocessing the reviews. While there are many tools and directions available here, I focused on a few key steps for cleaning. First, I needed to remove all of the numbers, take out punctuation, and convert characters to lower case in each document. This can be done on the data frame of text using Pandas and regex, such as in the code below (numbers, punctuation, lowercase):
coffee['Review'] = coffee.Review.str.replace(r'\d+', '', regex=True)  # remove numbers
coffee['Review'] = coffee.Review.str.replace(r'[^\w\s]+', '', regex=True)  # remove punctuation
coffee['Review'] = coffee.Review.str.lower()  # convert to lower case
At this point, I looked into stemming (reducing words to their roots by trimming endings) and lemmatizing (reducing words to a common dictionary form). A stemmer might cut words like “driving” and “driven” down to a truncated root such as “driv”, while a lemmatizer would aim to turn both of those words into “drive.” While I tried both methods through the Python package NLTK, I did not find that they ultimately provided the most interpretable results later on, so I ended up moving on without them.
Following that, each review was a Python string of lower case words, devoid of punctuation and numbers, but still carrying flaws such as misspellings and extremely common words. My goal was to turn each of these reviews into a numerical representation, so that reviews could be more easily compared to each other. I explored two methods for this process (both from sklearn.feature_extraction.text): CountVectorizer and TfidfVectorizer.
In CountVectorizer, the tool identifies all of the distinct words used throughout all of the documents and builds a vector space with one dimension per word. Each document is then given an entry in each dimension for how many times that particular word appears in it, or its term frequency.
TF-IDF takes the same initial approach, but instead of giving a straight count (Term Frequency), it also accounts for how rare each word is in the entire corpus (Inverse Document Frequency). That way, if two documents happen to share the same uncommon word, they will appear more similar than two documents that share the same very common word. Each TF-IDF entry is the value found with CountVectorizer multiplied by a weight built from the log of the ratio of the number of documents plus one to the document frequency (the number of documents containing that word) plus one. With its default settings, scikit-learn adds one to that logarithm and then normalizes each document's vector to unit length.
In both cases, unlike in the images shown here, I removed “stop words” and required words to appear in multiple documents to be counted, to avoid misspellings. Stop words in English are words like “and” or “the”, which occur frequently and won’t aid in the comparison between documents. I also added a few coffee-specific stop words that either appeared too frequently (coffee, cup, etc.) or were not about the specific experience of drinking the coffee (Keurig, espresso).
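A minimal sketch of both filters follows. The stop-word additions are the four examples named above, not the full list used in the project, and min_df=2 is an assumed threshold chosen only to suit this tiny toy corpus.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Built-in English stop words plus a few coffee-specific ones (illustrative subset).
custom_stop_words = list(
    ENGLISH_STOP_WORDS.union({"coffee", "cup", "keurig", "espresso"})
)

# Toy reviews invented for illustration; "cofee" is a deliberate misspelling.
reviews = [
    "the coffee is bright and floral in the cup",
    "a floral and sweet espresso",
    "bright sweet cofee with a chocolaty finish",
]

# min_df=2 keeps only words that appear in at least two documents,
# which drops one-off misspellings such as "cofee".
vectorizer = TfidfVectorizer(stop_words=custom_stop_words, min_df=2)
vectorizer.fit(reviews)

kept_words = sorted(vectorizer.vocabulary_)
print(kept_words)  # ['bright', 'floral', 'sweet']
```

On a corpus of several thousand reviews, a larger min_df may be more appropriate; the right value is a judgment call.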
With two possible embeddings in hand, I set about better understanding the documents. These several-thousand-dimensional vectors were unwieldy, both in computational cost and in interpretability. My next step was to reduce the number of dimensions to a more manageable number that would still carry meaning in the context of my work. To do this, I used Non-negative Matrix Factorization (NMF) to cluster coffees into groups. By looking at the most heavily weighted words in each group, I could see what the topic of that group might be. After trying out a variety of numbers of topics on both the CountVectorizer and TfidfVectorizer embeddings, I ultimately found the most meaning in an NMF model with nine topics from my TF-IDF embedding.
Considering the top words in each topic, I gave them the following titles:
‘Bright, Floral, Citrus’, ‘Chocolate, Dark, Woody’, ‘Tart, Sweet, Smooth’, ‘Cacao, Nutty, Clean’, ‘Sweet, Hazelnut, Pine’, ‘Juicy, Honey, Cacao’, ‘Red Berries’, ‘Nutty, Caramel, Woody’, ‘Cherry, Vinous, Chocolate’
Most importantly, while coffees could be assigned to one topic that they scored most highly in, each review was actually given a nine-dimensional vector of scores across the topics. This key result, the dimension reduction I was seeking, allowed for a much more manageable and meaningful comparison across the coffees. Thus, every coffee’s review had been converted into a nine-dimensional “flavor vector,” as I saw it. Below, you can see a visual representation of the average flavor vector for coffees assigned to each topic. This can give an idea of how each coffee scored more highly in one area, but had smaller contributions across the flavor spectrum.
The final step was to take these flavor vectors and place them into a dataframe, with each vector taking a row. By employing pairwise_distances from scikit-learn (sklearn.metrics), I was then able to compute the distance between all possible pairs of coffees. This comparison was done using cosine distance, as I was more interested in the direction of the flavor vector (that is, the relative contributions of each topic) than in the Euclidean distance (which would compare closeness in space without regard to similarity in direction). Using the numpy argsort() method, I could then find the indices of the most and least similar coffees, store those values, and use them to slice the dataframe to return the most and least similar coffees to a given review!
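Here is a minimal sketch of that lookup, assuming a small dataframe of made-up three-dimensional flavor vectors. The coffee names, topic columns, and the helper most_and_least_similar are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical flavor vectors; one row per coffee, one column per topic.
flavor = pd.DataFrame(
    [
        [0.90, 0.10, 0.00],
        [0.45, 0.05, 0.00],  # same direction as Coffee A, half the magnitude
        [0.00, 0.20, 0.80],
        [0.10, 0.80, 0.10],
    ],
    index=["Coffee A", "Coffee B", "Coffee C", "Coffee D"],
    columns=["Bright", "Chocolate", "Berries"],
)

# Square matrix of cosine distances between every pair of coffees.
dist = pairwise_distances(flavor.values, metric="cosine")


def most_and_least_similar(name, n=1):
    """Return the n closest and n farthest coffees by cosine distance."""
    row = flavor.index.get_loc(name)
    order = np.argsort(dist[row])    # ascending: closest first
    order = order[order != row]      # drop the coffee itself
    closest = flavor.index[order[:n]].tolist()
    farthest = flavor.index[order[-n:]].tolist()
    return closest, farthest


closest, farthest = most_and_least_similar("Coffee A")
print(closest, farthest)
```

Coffee B comes back as the closest match to Coffee A even though its scores are half as large, because cosine distance only looks at the direction of the vector.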
And that, in sum, produced the recommendation engine. Every description was cleaned, turned into a numerical embedding (TF-IDF), reduced to a nine-dimensional flavor vector (NMF), and then compared pairwise for a content-based filter. As an example, the plot below shows two coffees in their nine-dimensional flavor space. The Santa Barbara, Honduras was a coffee that came from the original corpus, while the Costa Rica Cloza was reviewed afterward. When the review of the Costa Rica Cloza was entered, the Santa Barbara was returned as the most similar.
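Scoring a review that was not part of the original corpus could look roughly like the sketch below. Chaining the two fitted steps with scikit-learn's make_pipeline is my own convenience here rather than a detail from the project, and the names and review texts are invented.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances
from sklearn.pipeline import make_pipeline

# Hypothetical corpus: names and review texts invented for illustration.
names = ["Coffee A", "Coffee B", "Coffee C", "Coffee D"]
corpus = [
    "tart sweet smooth cherry",
    "sweet tart cherry honey",
    "dark chocolate woody cedar",
    "woody dark chocolate smoky",
]

# TF-IDF embedding followed by NMF; two topics here instead of nine.
pipeline = make_pipeline(
    TfidfVectorizer(),
    NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500),
)
corpus_vectors = pipeline.fit_transform(corpus)

# A new review is pushed through the already-fitted steps, not refitted.
new_review = "smooth sweet cherry with tart finish"
new_vector = pipeline.transform([new_review])

# Cosine distance from the new review to every coffee in the corpus.
distances = pairwise_distances(new_vector, corpus_vectors, metric="cosine")[0]
best = int(distances.argmin())
print("most similar:", names[best])
```

Words in the new review that never appeared in the corpus are simply ignored by the fitted vectorizer, so a review written in very different vocabulary may produce a weak or empty flavor vector.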
In the figure, you can see that not only do these two coffees share a common topic as their strongest contribution (Tart, Sweet, Smooth), they also score very similarly across some of the smaller contributing topics. I would expect this to mean that these coffees would produce a very similar drinking experience, even though they are from different origins and roasters.
In my next post, check out how I took the models created here and deployed them into an app using the Streamlit Python package!