Starbucks Challenge — Capstone Project

Kaique Colognesi de Oliveira
9 min read · Sep 14, 2021

The objective of this project is to complete the Starbucks Challenge provided by Udacity (Capstone Project).

Project Definition

These datasets contain information simulating the behavior of Starbucks customers and the types of promotions sent to them.

Promotional events range from a simple advertisement for a drink to discounts and BOGO (buy one, get one free) offers.

Problem Statement

The 3 main datasets are as follows:

Portfolio.json -- contains offer identifiers and metadata about all offer types.

Profile.json -- contains demographic information about simulated users.

Transcript.json -- contains information about the offers provided, viewed, and completed.

A quick note: JSON is a file format that is widely used for storing and exchanging data.

The main objective of this project is to merge the 3 datasets, combining all the customers' demographic information with the promotions, so that, in the end, it is possible to train a machine learning model to predict which groups will respond best to which offer. (For this validation, I am considering only viewed and completed offers.)


1) Perform data mining to better understand the type of information I'm dealing with.

2) Instantiate different machine learning models and verify their performance to determine which would be best for our case.

3) Metrics selected to validate performance: F1-score, Accuracy and Recall.

Reasons for selecting these metrics for validation

The metrics above were selected because they are relatively insensitive to unbalanced classes and work well with data that requires a large amount of preprocessing beforehand.

Although accuracy is not well suited to unbalanced data, I decided to keep it as a validation metric to check how much our model got right or wrong.

Since we are dealing with a classification model, we want it to have a good hit rate, and part of its job is to assign a class to each item in our test data. That was one of the reasons for selecting accuracy as a metric; its formula can be described as follows:

Accuracy = correct predictions / total predictions

The opposite of accuracy is the error rate, which we can calculate as follows:

Error Rate = incorrect predictions / total predictions

These values can be obtained separately using certain libraries, or through a confusion matrix, which acts as a summary of how our model behaved by comparing its predictions against the real results.
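In scikit-learn, all three values can be computed in a few lines; the labels below are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and model predictions
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

accuracy = accuracy_score(y_true, y_pred)  # correct / total
error_rate = 1 - accuracy                  # incorrect / total
cm = confusion_matrix(y_true, y_pred)      # summary of predictions vs. reality

print(accuracy)    # 0.75
print(error_rate)  # 0.25
print(cm)          # [[3 1]
                   #  [1 3]]
```

The confusion matrix rows are the real classes and the columns are the predicted ones, so the diagonal holds the hits and everything off-diagonal is a miss.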

Data Exploration


The first file I decided to explore was portfolio.json. There is not much to say about it, other than that it contains the types of promotions offered, the promotional ID, the duration of each promotion, and some other information.

The file came complete (i.e., no null data), so it required no preprocessing, which reduced part of the work needed.

Some information that is worth highlighting is:

There are 3 types of promotional events: BOGO, informational, and discount.

Promotional events are delivered through 4 communication channels: email, mobile, web, and social media.


During the first part of the analysis, it was easy to see that we have 2,175 null entries for both gender and income. So I generated 2 graphs to compare them and validate whether discarding the null data would be a valid option.

Before performing this removal, one of the proposed challenges was figuring out how to deal with records where age is effectively null: instead of a null value, these were encoded as 118 years. So I ran a test based on removing the age-118 records.

Graph used to verify the frequency distribution of ages.

As we have seen, most of the distribution sits before the bar representing age 118, so I removed the age-118 records and checked the frequency again. (To choose the number of histogram bins, I used Sturges' formula, bins = 1 + log2(n), which is "responsible" for bringing the histogram closer to a normal-distribution shape for inspection.)

The results came much closer to a normal distribution.
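A minimal sketch of this step, assuming ages live in a pandas Series with 118 as the missing-value placeholder (the data below is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical ages; in profile.json, missing ages are encoded as 118
ages = pd.Series([25, 34, 44, 52, 61, 118, 118, 47, 58, 39, 70, 29])

# Drop the placeholder value used for missing ages
ages_clean = ages[ages != 118]

# Sturges' formula: number of bins = 1 + log2(n)
n_bins = int(np.ceil(1 + np.log2(len(ages_clean))))

# Recompute the frequency distribution with the Sturges bin count
counts, edges = np.histogram(ages_clean, bins=n_bins)
print(n_bins)  # 5
```

With the 118s removed, the histogram no longer has the artificial spike at the far right.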

One last change was fixing the "became member on" column, which had the wrong type. Besides that, we're good to go!
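The type fix can be sketched like this, assuming the column arrives as integers in YYYYMMDD form, as it does in the original profile.json:

```python
import pandas as pd

# Hypothetical slice of profile.json; membership dates arrive as integers
profile = pd.DataFrame({'id': ['a1', 'b2'],
                        'became_member_on': [20170712, 20180326]})

# Convert the integer column to a proper datetime type
profile['became_member_on'] = pd.to_datetime(profile['became_member_on'],
                                             format='%Y%m%d')

print(profile['became_member_on'].dtype)  # datetime64[ns]
```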


In this dataset, all columns (except the "time" column) are categorical. To further enrich it, I extracted the values from the nested columns and transformed them into a matrix.

Some information that is worth highlighting is:

1) Number of events in the transcript:

transaction: 138,953
offer received: 76,277
offer viewed: 57,725
offer completed: 33,579

Data Visualization

In this view, I found it interesting to disregard the "offer received" events and check the behavior of completed and viewed events. With this information, we were able to validate that the channels that generated the greatest public engagement were email and mobile.

Another thing I wanted to validate was the average income by age as a time series. This ended up helping me create certain groups, which I separated by age percentile.

Data preprocessing

After generating these views and fixing the incorrectly typed columns, I merged the 3 dataframes using the ID column as the key.
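A minimal sketch of the merge; the tiny frames and the exact key names (offer_id, person_id) are stand-ins for the real columns:

```python
import pandas as pd

# Hypothetical minimal frames standing in for the three datasets
portfolio = pd.DataFrame({'offer_id': ['o1', 'o2'],
                          'offer_type': ['bogo', 'discount']})
profile = pd.DataFrame({'person_id': ['p1', 'p2'],
                        'income': [55000, 82000]})
transcript = pd.DataFrame({'person_id': ['p1', 'p2'],
                           'offer_id': ['o1', 'o2'],
                           'event': ['offer viewed', 'offer completed']})

# Chain the merges on the relevant ID columns
merged = (transcript
          .merge(portfolio, on='offer_id')
          .merge(profile, on='person_id'))
print(merged.shape)  # (2, 5)
```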

I will summarize in a few steps the processes that were taken in each dataframe.


1 — One-hot encoding of the channels column using the MultiLabelBinarizer.

This is the matrix that was generated.
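A sketch of that encoding, applying scikit-learn's MultiLabelBinarizer to a hypothetical channels column:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical channels column from portfolio.json
portfolio = pd.DataFrame({'channels': [['email', 'mobile', 'web'],
                                       ['email', 'web'],
                                       ['email', 'mobile', 'social']]})

# Expand each list of channels into one binary column per channel
mlb = MultiLabelBinarizer()
channel_matrix = pd.DataFrame(mlb.fit_transform(portfolio['channels']),
                              columns=mlb.classes_)
print(channel_matrix)
```

Each row of the resulting matrix flags which channels carried that offer, ready to be joined back onto the portfolio dataframe.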

2 — Converted the promotion names into categorical values.


1 — Formatted the "became member on" column in the Profile dataset.

2 — Dropped the rows where the age column equals 118.

3 — Mapped the reported gender to numerical values (e.g., M = 0, F = 1, O = 2).

4 — Created a column with the customer's age group (e.g., a 44-year-old customer is placed in the 40s group).

5 — Created 3 groups to delimit customer income: customers earning 30 to 60 thousand annually go in group 1, 60 to 90 thousand in group 2, and above 90 thousand in group 3.
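Steps 3 to 5 can be sketched as follows. The column names (gender_group, percentill_group_code, income_group_number) match the feature list used later, but the sample data and the exact binning choices are my own illustration:

```python
import pandas as pd

# Hypothetical customer rows illustrating the three transformations
profile = pd.DataFrame({'gender': ['M', 'F', 'O', 'F'],
                        'age': [44, 27, 61, 35],
                        'income': [55000, 72000, 95000, 38000]})

# Step 3 — map gender to numeric codes
profile['gender_group'] = profile['gender'].map({'M': 0, 'F': 1, 'O': 2})

# Step 4 — age group: a 44-year-old falls into the 40s group
profile['percentill_group_code'] = (profile['age'] // 10) * 10

# Step 5 — income groups: 30-60k -> 1, 60-90k -> 2, above 90k -> 3
profile['income_group_number'] = pd.cut(profile['income'],
                                        bins=[30000, 60000, 90000,
                                              float('inf')],
                                        labels=[1, 2, 3])
print(profile)
```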


1 — Acquired a matrix with the type of event that was sent to the client.

2 — Extracted the values from the "value" column, since it holds nested values, and placed them in separate columns.
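A sketch of that extraction, assuming the "value" column holds dictionaries as it does in the original transcript.json:

```python
import pandas as pd

# Hypothetical transcript rows; the 'value' column holds nested dictionaries
transcript = pd.DataFrame({'event': ['offer received', 'transaction'],
                           'value': [{'offer id': 'abc123'},
                                     {'amount': 9.5}]})

# Expand each dictionary into its own set of columns
expanded = transcript['value'].apply(pd.Series)
transcript = pd.concat([transcript.drop(columns='value'), expanded], axis=1)
print(transcript.columns.tolist())  # ['event', 'offer id', 'amount']
```

Rows that lack a given key simply get NaN in the new column, which is handled later when nulls are removed.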


After performing the above transformations, removing null values, and creating new features, all 3 datasets were merged, ready for splitting into training and testing data and validating our models.


In this part, I will comment on the definition of our training and testing variables.

As mentioned at the beginning, the objective is to predict which event a customer is most likely to respond to by viewing or buying a product, so our target variable (Y) will be the "event" column.

As for our variable X, which is the data we want to use to make the predictions, the following columns will be used: ‘time’,’difficulty’,’duration’,’reward’,’offer_type’,’percentill_group_code’,’income_group_number’,’gender_group’.

Training and testing data were split with 33% held out for testing and a random state of 356.

After this split, the MinMaxScaler was also used, which normalizes our training data to improve the predictive capacity of our models.
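A sketch of the split and scaling, with random feature data standing in for the 8 merged columns listed above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix (8 columns, as in the feature list) and labels
rng = np.random.default_rng(356)
X = rng.random((100, 8))
y = rng.integers(0, 3, size=100)

# 33% held out for testing, random_state fixed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=356)

# Fit the scaler on the training data only, then apply it to both splits
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler only on the training split avoids leaking test-set statistics into the model.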

Model Selection

When selecting the models, the only one I know of that handles unbalanced data well is the GradientBoostingClassifier, but I selected others to test as well: AdaBoostClassifier, ExtraTreesClassifier, and RandomForestClassifier.

I decided not to change their hyperparameters, just testing the models in a basic way. I created a function to test the models and return the metrics decided on earlier.

With this function, we only need to instantiate a model to run a validation and return the results; some of them are shown below.
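A sketch of such a helper function, using synthetic data so it runs on its own. The weighted averaging for the multiclass F1 and recall is my assumption, not necessarily what the original notebook used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged dataset (3 event classes)
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=356)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=356)

def evaluate_model(model):
    """Fit the model and return the three chosen metrics."""
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return {'accuracy': accuracy_score(y_test, preds),
            'f1': f1_score(y_test, preds, average='weighted'),
            'recall': recall_score(y_test, preds, average='weighted')}

# Test each candidate model with its default hyperparameters
for clf in [GradientBoostingClassifier(), AdaBoostClassifier(),
            ExtraTreesClassifier(), RandomForestClassifier()]:
    print(type(clf).__name__, evaluate_model(clf))
```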

From the test results above, we can see that the models performed very similarly, but the one that came out a little better was the GradientBoostingClassifier, since it deals with unbalanced data a little more easily.

Tinkering with its hyperparameters could further improve the performance of each model.


To further refine the Gradient model, I decided to make a slight change to its hyperparameters.

The parameters were as follows:

param_list = {'loss': ['deviance', 'exponential'],
              'n_estimators': [100, 150],
              'min_samples_split': [2, 3, 4]}

For the loss parameter, I decided to vary it a bit to see how the model would behave: with the "exponential" value it operates similarly to the AdaBoost model, while "deviance" makes it operate more like logistic regression.

For the learning rate, I decided to test slightly higher values, because our gradient boosting model performs better the more data it receives and has a very small chance of overfitting.

For the estimators and sample splits, I selected some other numbers just to see whether there would be any drastic change in performance (there wasn't).

To search the parameters, I used GridSearchCV. In short, it fits the chosen model with every combination of the specified parameters, performing several fits and returning the cross-validated score of each one. The version of GridSearchCV I used performs 3 folds per candidate by default (newer scikit-learn releases default to 5), and I kept that default for these validations.
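A sketch of the search. The "loss" options from the grid above are omitted here because their names changed across scikit-learn releases, and synthetic binary data stands in for the real features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch runs on its own
X, y = make_classification(n_samples=200, n_features=8, random_state=356)

# Subset of the grid discussed in the text
param_list = {'n_estimators': [100, 150],
              'min_samples_split': [2, 3, 4]}

# Exhaustively fit every parameter combination with 3-fold cross-validation
search = GridSearchCV(GradientBoostingClassifier(random_state=356),
                      param_list, cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 2))
```

After fitting, `best_params_` and `best_score_` hold the winning combination and its mean cross-validated score.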

Best results we got for the selected parameters.

With this slight adjustment, we were able to increase the performance of our model a little more, improving its precision and F1-score and leaving the average scores at around 69%, which is a great number!

Model Evaluation and Validation

The justification for using the gradient boosting model is that it performs better as more data is added; as said earlier, the chances of overfitting with this model are very low. This is the main reason for choosing it as the main model for parameter tuning.

The changes from tuning the hyperparameters might not seem like a big deal, but they improved the metrics we used overall, such as F1-score, accuracy, and recall. This further reinforces that we can use it to make our predictions.

The good thing about using GridSearchCV to search the hyperparameters is that it performs cross-validation folds for each candidate fit, which further increases the robustness of our model.


In the end, in my view, the GradientBoostingClassifier is the best classifier to use on the provided dataset for predicting whether a customer will respond to an offer made in the Starbucks app by viewing or completing it.


At the beginning of the project, I was very undecided about how to proceed, what to look for, what questions to ask, and what problem to solve, especially since I had recently gotten a job and had to divide my attention between work, college, and this course, so it was very complicated to find a direction to follow.

But doing it little by little, one step at a time, the project took shape and became one of the most beautiful projects I've worked on. I applied a lot of what I knew to it, and went back to study what I didn't know. I evolved along with the progression of this project, and I am proud of its results.

Possible Improvements

One improvement I consider valid would be a user-user recommendation approach, in which customers could select filters for the promotions they find most attractive, or even a recommendation model built from user-to-user information.

As for our instantiated models, tuning more of their hyperparameters could improve performance even further.

Referral Links:


All the data was provided by Udacity in partnership with Starbucks for the Data Science Nanodegree.