Recipes Rating Analysis
Name(s): Grace Gao & Faith Jones
Website Link: https://gracegyq.github.io/recipes_rating_analysis/
Introduction
This data analysis centers around recipes and ratings posted on food.com since 2008. There were two relevant datasets for this problem: the recipes dataset, with 83782 rows and 12 columns, and the interactions dataset, with 73197 rows and 5 columns, which contains the ratings and reviews for the recipes. Together, the datasets include many columns about each recipe, such as its name, ingredients, cooking time, ratings, and more.
To examine this dataset we came up with an overarching question: How do different recipe types and factors affect the rating?
To investigate this question, we first cleaned the data and conducted exploratory data analysis to identify potential trends and factors important to our question. Next, we built a baseline model to start predicting recipes' ratings. We then went through many iterations and variations of the model as we attempted to improve our prediction accuracy, until we finally landed on a final model for predicting recipe ratings.
The original datasets contained 16 columns; however, as part of our cleaning we both created new relevant columns and selected the pertinent columns for our analysis. The table below lists the pertinent columns along with a description of each:
Column | Description |
---|---|
'calories' | This column was extracted from the 'nutrition' column from the recipes dataset. It describes the number of calories in the recipe |
'num_tags' | This column was created from the 'tags' column in the recipes dataset. It describes the number of tags per recipe |
'minutes' | This column is from the recipes dataset and describes the prep/cooking time per recipe |
'n_steps' | This column is from the recipes dataset and describes the number of steps it takes to make the recipe |
'num_ingredients' | This column was created from the 'ingredients' column in the recipes dataset. It describes the number of ingredients in the recipe |
'rating' | This column is from the interactions dataset and describes the rating or stars that a reviewer gave a specific recipe |
'rating_category' | This column was created by taking the average rating per recipe and rounding it to the nearest integer. It describes the 'average' rating of a recipe as one of a few distinct categories, which makes it easier to use classes for our model |
Data Cleaning and Exploratory Data Analysis
Before we started any analysis we first needed to clean the data.
Data Cleaning
To begin cleaning the data, we first performed a left merge of the recipes and interactions datasets on the recipe ID, with recipes on the left. Merging the data allows us to use the factors in the recipes dataframe to predict the ratings from the interactions dataframe.
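A minimal sketch of this merge, using toy stand-ins for the two datasets (the real column sets are much larger, and the join keys `id`/`recipe_id` are our reading of the data):

```python
import pandas as pd

# Toy stand-ins for the two datasets; real column sets are larger.
recipes = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["soup", "salad", "cake"],
    "minutes": [30, 10, 90],
})
interactions = pd.DataFrame({
    "recipe_id": [1, 1, 2],
    "rating": [5, 4, 3],
})

# Left merge keeps every recipe, even those with no reviews (rating -> NaN),
# and yields one row per (recipe, review) pair for reviewed recipes.
merged = recipes.merge(
    interactions, left_on="id", right_on="recipe_id", how="left"
)
print(merged.shape)
```

Here recipe 3 has no interactions, so its single row carries a NaN rating rather than being dropped, which is exactly why a left merge fits this step.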
Once merged, we dealt with any missingness in our data. We found that the only columns in the dataframe with missing values were the 'name' and 'description' columns. For the name column, we decided to leave the values as NaN: we are inferring that these are recipes whose names consist entirely of special characters, so even though such a recipe lacks a name, it can still help our attempts to predict ratings from the other columns. We also did nothing for the description column because, as stated above, we did not include it among our pertinent columns. Additionally, when analyzing the data we noticed that some of the ratings were 0, which is not possible on food.com. We filled these values with NaN because a 0-star rating occurs when someone leaves a review/comment on a recipe without providing a rating; a rating of NaN therefore makes more sense than a rating of 0.
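The 0-to-NaN replacement can be sketched like this (toy data; the real dataframe is the merged one from the previous step):

```python
import numpy as np
import pandas as pd

merged = pd.DataFrame({"recipe_id": [1, 1, 2], "rating": [5, 0, 4]})

# A 0 "rating" really means the reviewer left no rating at all,
# so treat it as missing rather than as the lowest possible score.
merged["rating"] = merged["rating"].replace(0, np.nan)

print(merged["rating"].tolist())
```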
Next, we added a new column for the average rating per recipe, since in this dataset the same recipe could have multiple ratings. We did this by grouping by recipe ID and taking the mean of the ratings, then converting this series to a dataframe and merging it with the dataframe we were working with. This affected our analysis by allowing us to predict a recipe's overall rating instead of focusing on individual reviews, since we want a more holistic prediction for each recipe. Building on the ratings, we then added a rating_category column, which rounds the average rating to an integer so that these distinct categories can serve as classes for our models. We dropped the rows whose rating_category was NaN because, without knowing which category a recipe is in, we cannot test the accuracy of a prediction for it.
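A sketch of the groupby-then-merge pattern described above, on toy data (column names are assumptions based on this writeup):

```python
import pandas as pd

merged = pd.DataFrame({
    "recipe_id": [1, 1, 2, 2, 3],
    "rating": [5, 4, 3, 4, None],
})

# Mean rating per recipe, merged back so every row carries the average.
avg = merged.groupby("recipe_id")["rating"].mean().rename("avg_rating")
merged = merged.merge(avg.to_frame(), on="recipe_id")

# Round the average to the nearest integer to get discrete classes.
merged["rating_category"] = merged["avg_rating"].round()

# Drop rows whose category is unknown -- they cannot be scored.
merged = merged.dropna(subset=["rating_category"])
print(merged[["recipe_id", "rating_category"]])
```

Recipe 3, whose only interaction has no rating, ends up with a NaN category and is dropped, matching the cleaning decision above.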
Then we considered the data types of the columns, specifically the 'nutrition' column from the recipes dataset. We first converted the data in this column into a list. Then we extracted the information from nutrition into its relevant columns: 'calories', 'total fat (PDV)', 'sugar (PDV)', 'sodium (PDV)', 'protein (PDV)', 'saturated fat (PDV)', and 'carbohydrates (PDV)'. This is helpful because it allows us to use these factors, like calories, to build our model and predict ratings.
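One way to sketch this extraction, assuming (as is common for this dataset) that the raw 'nutrition' column stores each list as a string:

```python
import ast
import pandas as pd

df = pd.DataFrame({
    "nutrition": ["[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]",
                  "[47.2, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0]"],
})

nutrition_cols = [
    "calories", "total fat (PDV)", "sugar (PDV)", "sodium (PDV)",
    "protein (PDV)", "saturated fat (PDV)", "carbohydrates (PDV)",
]

# The raw column stores a list as a string, so parse it first...
parsed = df["nutrition"].apply(ast.literal_eval)
# ...then spread each element into its own named column.
df[nutrition_cols] = pd.DataFrame(parsed.tolist(), index=df.index)

print(df["calories"].tolist())
```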
Lastly, we considered outliers in our data that were significantly changing our mean and standard deviation. We first dropped recipes with only one review because, intuitively, a recipe with a single review might be too biased. We also dropped any recipes containing the word “homemade” in the name. This makes sense because making something homemade generally takes a long time, often because the listed time accounts for steps like fermenting. Without the homemade recipes, the standard deviation decreased by over 6%, which we found quite significant given that we only removed 280 samples from the dataset. Additionally, we dropped recipe 447963, “how to preserve a husband”, because it is not a serious recipe, which means it would not be a good predictor of the ratings we are interested in. Below is the head of the cleaned dataset:
name | id | minutes | contributor_id | submitted | tags | n_steps | steps | description | ingredients | n_ingredients | rating | recipe_id | num_ratings | calories | total fat (PDV) | sugar (PDV) | sodium (PDV) | protein (PDV) | saturated fat (PDV) | carbohydrates (PDV) | num_tags | rating_category | num_ingredients | rating_bins | calories_quartile |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
412 broccoli casserole | 306168 | 40 | 50969 | 2008-05-30 | [‘60-minutes-or-less’, ‘time-to-make’, ‘course’, ‘main-ingredient’, ‘preparation’, ‘side-dishes’, ‘vegetables’, ‘easy’, ‘beginner-cook’, ‘broccoli’] | 6 | [‘preheat oven to 350 degrees’, ‘spray a 2 quart baking dish with cooking spray , set aside’, ‘in a large bowl mix together broccoli , soup , one cup of cheese , garlic powder , pepper , salt , milk , 1 cup of french onions , and soy sauce’, ‘pour into baking dish , sprinkle remaining cheese over top’, ‘bake for 25 minutes or until cheese is lightly browned’, ‘sprinkle with rest of french fried onions and bake until onions are browned and cheese is bubbly , about 10 more minutes’] | since there are already 411 recipes for broccoli casserole posted to “zaar” ,i decided to call this one #412 broccoli casserole.i don’t think there are any like this one in the database. i based this one on the famous “green bean casserole” from campbell’s soup. but i think mine is better since i don’t like cream of mushroom soup.submitted to “zaar” on may 28th,2008 | [‘frozen broccoli cuts’, ‘cream of chicken soup’, ‘sharp cheddar cheese’, ‘garlic powder’, ‘ground black pepper’, ‘salt’, ‘milk’, ‘soy sauce’, ‘french-fried onions’] | 9 | 5 | 306168 | 4 | 194.8 | 20 | 6 | 32 | 22 | 36 | 3 | 10 | 5 | 165 | nan | Q2 |
2000 meatloaf | 475785 | 90 | 2202916 | 2012-03-06 | [‘time-to-make’, ‘course’, ‘main-ingredient’, ‘preparation’, ‘main-dish’, ‘potatoes’, ‘vegetables’, ‘4-hours-or-less’, ‘meatloaf’, ‘simply-potatoes2’] | 17 | [‘pan fry bacon , and set aside on a paper towel to absorb excess grease’, ‘mince yellow onion , red bell pepper , and add to your mixing bowl’, ‘chop garlic and set aside’, ‘put 1tbsp olive oil into a saut pan , along with chopped garlic , teaspoons white pepper and a pinch of kosher salt’, ‘bring to a medium heat to sweat your garlic’, ‘preheat oven to 350f’, ‘coarsely chop your baby spinach add to your heated pan , stir frequently for approximately 5 min to wilt’, ‘add your spinach to the mixing bowl’, ‘chop your now cooled bacon , and add it to the mixing bowl’, ‘add your meatloaf mix to the bowl , with one egg and mix till thoroughly combined’, ‘add your goat cheese , one egg , 1 / 8 tsp white pepper and 1 / 8 tsp of kosher salt and mix till thoroughly combined’, ‘transfer to a 9x5 meatloaf pan , and cook for 60 min or until the internal temperature is at least 160f’, ‘let stand for 5min’, ‘melt 1tbsp unsalted butter into a frying pan , and cook up to three eggs at a time’, ‘crack each egg into a separate dish , in order to prevent egg shells from reaching the pan , then add salt and pepper to taste’, ‘wait until the egg whites are firm looking , but slightly runny on top before flipping your eggs’, ‘after flipping , wait 10~20 seconds before removing each egg and placing it over your slices of meatloaf’] | ready, set, cook! special edition contest entry: a mediterranean flavor inspired meatloaf dish. featuring: simply potatoes - shredded hash browns, egg, bacon, spinach, red bell pepper, and goat cheese. | [‘meatloaf mixture’, ‘unsmoked bacon’, ‘goat cheese’, ‘unsalted butter’, ‘eggs’, ‘baby spinach’, ‘yellow onion’, ‘red bell pepper’, ‘simply potatoes shredded hash browns’, ‘fresh garlic’, ‘kosher salt’, ‘white pepper’, ‘olive oil’] | 13 | 5 | 475785 | 2 | 267 | 30 | 12 | 12 | 29 | 48 | 2 | 10 | 5 | 231 | nan | Q2 |
50 chili for the crockpot | 501028 | 345 | 2628680 | 2013-05-28 | [‘course’, ‘main-ingredient’, ‘cuisine’, ‘preparation’, ‘occasion’, ‘main-dish’, ‘soups-stews’, ‘beans’, ‘beef’, ‘pork’, ‘mexican’, ‘easy’, ‘stews’, ‘crock-pot-slow-cooker’, ‘spicy’, ‘lentils’, ‘meat’, ‘taste-mood’, ‘equipment’, ‘presentation’, ‘served-hot’, ‘3-steps-or-less’] | 4 | [‘combine all ingredients in a 7-quart crockpot’, ‘it might be easier to combine in a larger vessel , mix , and then transfer to your crockpot’, ‘cook on “high” for 5 hours , or “low” for 8 hours’, ‘i typically serve the chili over a bed of whole tostito chips , with a layer of shredded cheese in between’] | first, thank you to parsley - chef # 199848 - for inspiring me with her “thick and chunky crock pot game day chili,” recipe #251837. i had tried about 20 different recipes for chili, and parsley’s was the best. but i’ve been tweaking it for a year and a half now, and i think i did it! this chili is medium to medium-hot, not very sweet, hearty and stew-like, with depth and a delicious aftertaste. i make in a 7-quart crockpot, for my large family. if you like your chili sweeter, use a red onion instead of a white onion; add more brown sugar; and/or add more chocolate chips. be careful: a little chocolate - just like cumin and cinnamon - goes a looonnnngggg way! | [‘stewing beef’, ‘stewing pork’, ‘white onion’, ‘bell peppers’, ‘habanero pepper’, ‘garlic’, ‘beans’, ‘chunky salsa’, ‘tomato paste’, ‘beef broth’, ‘tortilla chips’, ‘chicken bouillon cube’, ‘beef bouillon cube’, ‘sazon goya’, ‘cinnamon’, ‘mexican chili powder’, ‘cumin’, ‘ground coriander’, ‘black pepper’, ‘salt’, ‘light brown sugar’, ‘dark chocolate chips’] | 22 | 5 | 501028 | 2 | 270.2 | 19 | 26 | 48 | 52 | 21 | 4 | 22 | 5 | 360 | nan | Q2 |
blepandekager danish apple pancakes | 503475 | 50 | 128473 | 2013-07-08 | [‘danish’, ‘60-minutes-or-less’, ‘time-to-make’, ‘course’, ‘cuisine’, ‘preparation’, ‘pancakes-and-waffles’, ‘breakfast’, ‘scandinavian’, ‘european’] | 10 | [‘beat the eggs lightly and add the milk’, ‘combine the flour , sugar and salt’, ‘stir the flour mixture into the egg mixture , stirring in the cup of cream as you mix’, ‘fry the apple slices in butter in a skillet’, ‘preheat oven to 500 degree’, ‘cover the bottom of an oven-proof baking dish , or heavy skillet , with apples’, ‘pour the batter over slices and bake in a preheated 500 oven’, ‘when nearly done , remove from oven and sprinkle here and there with a mixture of sugar and cinnamon to taste’, ‘place dabs of butter on the pancake and return to oven until browned’, ‘just before serving , sprinkle with lemon juice , and cut into triangles’] | this recipe has been posted here for play in zwt9 - scandinavia. this recipe was found at website: mindspring.com - christian’s danish recipes. | [‘eggs’, ‘milk’, ‘flour’, ‘sugar’, ‘salt’, ‘cream’, ‘apples’, ‘butter’, ‘cinnamon’, ‘lemon, juice of’] | 10 | 5 | 503475 | 2 | 358.2 | 30 | 62 | 14 | 19 | 54 | 12 | 10 | 5 | 102 | nan | Q3 |
bbq spray recipe it really works | 327356 | 5 | 398160 | 2008-09-26 | [‘15-minutes-or-less’, ‘time-to-make’, ‘course’, ‘preparation’, ‘low-protein’, ‘healthy’, ‘5-ingredients-or-less’, ‘very-low-carbs’, ‘condiments-etc’, ‘easy’, ‘low-fat’, ‘dietary’, ‘low-sodium’, ‘low-cholesterol’, ‘low-saturated-fat’, ‘low-calorie’, ‘low-carb’, ‘low-in-something’] | 5 | [‘mix ingredients together and add to a clean spray bottle’, ‘spray and “wet” your bbq meat’, ‘seal the meat in a plastic bag and refrigerate at least 4 hours or overnight’, ‘grill the meat over indirect heat according to your recipe , spraying every 15minutes during the cooking time’, ‘remove meat from grill and serve’] | using this marinade/spray will insure that your grilled meats are tender and tasty every time. try it on beef or pork ribs, beef brisket, pork butt or chicken. you’ll love the results! | [‘red wine vinegar’, ‘lemon juice’, ‘water’] | 3 | 4.75 | 327356 | 4 | 47.2 | 0 | 2 | 0 | 0 | 0 | 0 | 18 | 5 | 44 | [4, 5) | Q1 |
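The outlier handling described in the cleaning steps above can be sketched as follows, on toy rows (the `num_ratings` column name comes from the table header; the exact filter order is an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["broccoli casserole", "homemade yogurt",
             "how to preserve a husband"],
    "id": [306168, 111111, 447963],
    "num_ratings": [4, 3, 2],
    "minutes": [40, 800, 45],
})

# Drop recipes reviewed only once -- a single review is too biased.
df = df[df["num_ratings"] > 1]

# Drop "homemade" recipes, whose long fermenting/resting times inflate
# the minutes column (na=False also handles the NaN recipe names we kept).
df = df[~df["name"].str.contains("homemade", case=False, na=False)]

# Drop the one joke recipe identified by hand.
df = df[df["id"] != 447963]

print(df["name"].tolist())
```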
Exploratory Data Analysis
Univariate Analysis
To start understanding our dataset, we made many different graphs to visualize trends in our data. One of the most relevant distributions for our analysis is the distribution of the recipe ratings, which you can find below:
From the plot we can see that most of the ratings are either 4 or 5. For our analysis this might mean that the dataset is biased towards the higher ratings, especially 5, because the higher ratings make up a significant portion of the dataset.
Bivariate Analysis
Below is one of the bivariate analysis plots we created:
While the plot does not seem to show a highly significant difference among the rating categories, looking at the specific metrics of each rating, rating 3 is associated with a slightly higher median for calories, and the median goes down as the rating increases to 4 and 5. This is meaningful in helping us answer the initial question because it shows that calories might be a potential feature for predicting with our model, although we anticipate some challenges when using calories because of the lack of separation among the different ratings. Additionally, the 5-rating class seems to have significantly more outlier recipes with extremely high calories, which might also help our models predict that recipes with extremely high calories are more likely to have 5-star ratings.
Interesting Aggregates
rating_category \ calories_quartile | Q1 | Q2 | Q3 | Q4 |
---|---|---|---|---|
1.0 | 159 | 166 | 117 | 144 |
2.0 | 182 | 198 | 208 | 180 |
3.0 | 646 | 690 | 713 | 689 |
4.0 | 4995 | 5182 | 5352 | 527 |
5.0 | 14199 | 14033 | 13803 | 13765 |
This pivot table provides more insight into the relationship between calories and rating: the highest rating class, the 5.0 ratings, has more entries in the lower calorie quartiles, which might suggest that many recipes with lower calories have high ratings. Additionally, looking at each column, every calories quartile has significantly more 5.0 ratings than any other rating category, suggesting that our prediction model might encounter class imbalance issues that we should try to work out.
Imputation
We did not end up imputing any values into the dataset; instead, we removed the data that was missing key values we needed, because the majority of the missing values existed either in columns that were not used in our specific analysis or in the column that we are trying to predict.
Framing a Prediction Problem
Prediction Problem: We are attempting to use the columns with information about calories, number of steps, number of ingredients, and cooking time to predict the rating of a recipe, and we performed multiclass classification for our prediction models.
Evaluation Metric: We used a combination of overall accuracy and per-class precision. The overall accuracy evaluated how well the model predicted the correct class given information from the four columns described above. Additionally, we looked at the confusion matrix, particularly the precision for each class, as an additional metric because it provides insight into whether the accuracy comes from actually predicting all the classes correctly or just from predicting every data point to be of the class with the highest frequency, which would still result in a decent overall accuracy if that class has significantly more data points than the rest.
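Both metrics are available in scikit-learn; a toy sketch with made-up labels and predictions (not our actual model output):

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy true labels and predictions over the five rating classes.
y_true = [5, 5, 5, 4, 4, 3, 2, 1, 5, 4]
y_pred = [5, 5, 4, 4, 5, 3, 5, 5, 5, 4]

acc = accuracy_score(y_true, y_pred)
print(f"accuracy = {acc:.2f}")

# Per-class precision exposes whether the model is just guessing
# the majority class to inflate its overall accuracy.
print(classification_report(y_true, y_pred, zero_division=0))
```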
Baseline Model
For our baseline model, we chose a RandomForestClassifier as our training model, which uses a collection of decision trees to make the classification prediction. The features we used to train this model are calories, which represents the number of calories in each recipe, and minutes, which represents the prep/cooking time of the recipe, both of which we thought had the potential to make reasonable predictions for the rating of a recipe. Both features are quantitative data columns. We considered using the categorical columns, such as ingredients and tags; however, after thorough exploration, we decided that one-hot encoding or other transformations of the categorical variables would not be very effective, given that there are too many possible ingredients and tags. We did not apply StandardScaler() to the two chosen features because RandomForestClassifier is not very sensitive to the scale of the data, since it is a threshold-based approach.
We believe our model is decent, but it revealed a significant problem in our dataset that makes classification much more challenging: class imbalance. Our model's accuracy is approximately 0.56362 when used on the unseen test set, which is not a bad prediction accuracy; however, looking at the precision for each rating class in the classification report, the precision for ratings of 1, 2, and 3 is 0.0 or near 0, while the precision for ratings of 4 and especially 5 is significantly higher. Looking more closely at the distribution of predictions, the model mostly predicts a 5-rating for every data point. The decently high accuracy simply comes from the fact that the entire dataset, and by extension the test set as well, has a significantly higher proportion of 5-ratings than all other ratings. This suggests that our current model is not the best predictor because it takes advantage of the class imbalance rather than fairly predicting results for all five distinct classes.
Final Model
After going through several iterations of the model, we landed on a final model: a random forest classifier with some added features, for an improved accuracy of 0.64. To improve our model, we oversampled the underrepresented classes by duplicating their data, and we cross-validated some of the model's hyperparameters.
To oversample the underrepresented classes, we duplicated the samples in classes 1, 2, 3, and 4 until each was around the same size as class 5, which dominates the data. This improved our model by encouraging it to predict ratings other than 5; when running previous versions of our model, we noticed the models would mostly predict 5s. We tried other methods, such as class-weight adjustment and SMOTE, paired with logistic regression models; however, those did not significantly improve the accuracy and precision of our model.
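A sketch of duplication-based oversampling on a toy training split (column names are assumptions from this writeup):

```python
import pandas as pd

# Toy training split; the real one carries all model features.
train = pd.DataFrame({
    "calories": [100, 120, 300, 310, 320, 330, 340, 350],
    "rating_category": [1, 2, 5, 5, 5, 5, 5, 5],
})

# Resample every class (with replacement) up to the size of the
# largest class so the classifier sees a balanced training set.
target = train["rating_category"].value_counts().max()
balanced = pd.concat(
    [g.sample(n=target, replace=True, random_state=0)
     for _, g in train.groupby("rating_category")],
    ignore_index=True,
)
print(balanced["rating_category"].value_counts().to_dict())
```

Note that this resampling should only ever touch the training split, so duplicated rows cannot leak into the test set.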
Additionally, we cross-validated the n_estimators and max_depth hyperparameters of the random forest classifier. To do this, we picked a variety of values for each hyperparameter and ran a GridSearchCV to find which combination produces the best accuracy.
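The grid search can be sketched like this on synthetic data (the candidate values shown are illustrative, not the exact grid we searched):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in: two features, balanced five-class labels.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 2))
y = np.tile([1, 2, 3, 4, 5], 24)

# Small grid over the two hyperparameters we tuned; each combination
# is scored by 3-fold cross-validated accuracy.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10, None]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_)
```

`grid.best_estimator_` is then the refit forest using the winning combination.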
After many iterations and models, our final model's accuracy improved by 0.08 over the baseline, which validates that our final model is an improvement on our baseline model.