Online Marketing Ads Click Prediction: End-to-End Workflow for a Machine Learning Solution

Machine learning Python

This post demonstrates the process of predicting the number of clicks, with a focus on the procedures and logic behind the modeling. It details the exploratory analysis undertaken to inform the modeling, the implementation of non-parametric machine learning models such as decision tree ensembles, and hyperparameter optimization, bagging and boosting techniques used to improve model performance. The post also provides insights into the influence of various features on the model and, more essentially, whether the modeling exercise was worth it by comparing the results to a benchmark model. It is duly recognized that not all techniques capable of further reducing the error were implemented; nonetheless, a non-exhaustive outline of how to improve the model is provided.

Linus Agbleze https://agbleze.github.io/Portfolio/
2022-09-10

Introduction

The online space is now a ubiquitous marketplace for almost every company, not just eCommerce. Just as with office spaces and sales centers, there is high competition for who gets the place to display and sell goods and services. Ad campaigns and online marketing in general are often seen as the silver bullet by marketing teams, so much so that budgets for traditional word-of-mouth marketing or signposts are increasingly being redistributed among several online marketing channels. This highlights the demand for online marketing by advertisers. Taking note of this trend, a number of online businesses have sprung up as online space providers for marketers, with the promise of linking marketers to end consumers to increase sales and providing end users with the best offers. This promise has itself become a business problem, with online marketing platforms introducing bidding systems that not only identify which ads get displayed on the platform but also select ads that are relevant and offer the best value to users, enabling the platform to deliver on its promise.

While this problem could be tackled from both ends, that is, either the marketing team designing ads that increase conversion or the platform provider selecting ads that are likely to drive engagement and hence satisfy both the advertiser and the platform visitor, the latter is addressed in this post.

Thus, the analysis is undertaken for a business that provides its platform for displaying ads, with the aim of selecting ads that are likely to receive more clicks in order to fulfill the promise of optimizing online marketing for clients and offering the best deal to platform visitors. Platform visitors, in turn, aim to click ads and patronize services that offer value for money.

Problem Statement

For an anonymous online advertisement platform provider whose business clients advertise their accommodation services, the business problem of deciding which ads get advertised on the platform to online visitors requires strategic decision-making that seeks to balance the satisfaction of end users on opposite sides of the divide: advertisers who run campaigns and users (consumers) who consume (click) those campaign ads. Identified among others is the number of clicks as a Key Performance Indicator (KPI) capturing satisfaction for both users and advertisers. From this, the higher the number of clicks, the higher the perceived campaign success for advertisers and, to some extent, the higher the user utility. With this understanding, a higher number of clicks is likely to result from displaying ads with accommodation characteristics that satisfy users’ interests. This suggests that predicting the number of clicks can be done by identifying patterns in accommodation ads’ characteristics that compel users to click more often.

Thus, the number of clicks is the target variable, and the variables that capture various properties of the accommodation ads are the predictor variables.

In order to translate this understanding into a technical solution that can later be communicated to a non-technical audience, there is a need to capture the prevailing challenge as a snapshot with a problem design framework.

The Situation Complication Question (SCQ) framework serves as a great tool for that purpose. A key element is identifying the stakeholder and end user of the solution to be provided. Hypothetically, the bidding team is tasked with the problem of ads bidding, and the number of clicks is one of the key components of their bidding system; hence, the team is identified as the stakeholder and end user of the analysis.

The SCQ is depicted below;

Identifying variables in the data

In developing an algorithm for prediction, identifying the variables to use and how to categorize them is important. The following is deduced;

Target / Response / Outcome variable

n_clicks: continuous variable

Predictor / feature variables

• city_id: categorical

• content_score: ordinal

• n_images: discrete

• distance_to_center: continuous

• avg_rating: discrete

• stars: ordinal

• n_reviews: continuous

• avg_rank: ordinal

• avg_price: continuous

• avg_saving_percent: discrete

By identifying the type of variable, appropriate visualization can be undertaken for different variables during exploratory data analysis.

Exploratory data analysis

Given that a target variable is present, a supervised machine learning algorithm will be used for the prediction exercise. Generally, algorithms can be seen as belonging to two categories, namely parametric and non-parametric. Deciding on the category from which an algorithm is chosen for modeling depends on whether the dataset exhibits characteristics that satisfy certain assumptions. This needs to be verified by undertaking exploratory analysis and visualization, hence the focus of this section. The insights gained from the exploratory analysis serve as a basis on which to narrow down the group of algorithms that will produce good results.

The implementation of the exploratory analysis is highlighted below, beginning with importing the packages and the dataset.

# import packages
import pandas as pd
import seaborn as sns
from sklearn.model_selection import GridSearchCV, train_test_split
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from plotnine import ggplot, geom_point, geom_smooth, ggtitle, aes
from sklearn.ensemble import  HistGradientBoostingRegressor, BaggingRegressor
from xgboost import XGBRFRegressor
from dmba import regressionSummary
from sklearn.metrics import make_scorer
# %% load data
data = pd.read_csv(r"Data/train_set.csv")

print(f"Data types \n {data.dtypes}")
Data types 
 hotel_id              float64
city_id               float64
content_score         float64
n_images              float64
distance_to_center    float64
avg_rating            float64
stars                 float64
n_reviews             float64
avg_rank              float64
avg_price             float64
avg_saving_percent    float64
n_clicks                int64
dtype: object
print(data.head())
       hotel_id   city_id  ...  avg_saving_percent  n_clicks
0  9.767406e+10  134520.0  ...                18.0         0
1  9.768889e+10  133876.0  ...                28.0         4
2  9.811544e+10  133732.0  ...                27.0        44
3  9.824279e+10   43772.0  ...                 2.0         4
4  9.833438e+10   50532.0  ...                 0.0        10

[5 rows x 12 columns]

Basic descriptive statistics

One of the first checks to do during exploratory analysis is to obtain a description of the data by identifying the mean, median and quantiles, among others, to enable an understanding of the data distribution. This is done as follows:

data.describe()
           hotel_id        city_id  ...  avg_saving_percent       n_clicks
count  3.964870e+05  395979.000000  ...       396317.000000  396487.000000
mean   1.326304e+11  149193.465376  ...            7.179601      13.781980
std    1.033722e+11  219189.285044  ...           13.081529     123.572896
min    1.557962e+08       2.000000  ...            0.000000       0.000000
25%    4.062255e+10   32014.000000  ...            0.000000       0.000000
50%    1.087280e+11   55122.000000  ...            0.000000       0.000000
75%    2.281935e+11  137464.000000  ...           10.000000       2.000000
max    3.237114e+11  878736.000000  ...           99.000000   13742.000000

[8 rows x 12 columns]

From the above, the maximum avg_price is 8000 while the 75th percentile is 120, which suggests there could be outliers. This large difference between the maximum and the 75th percentile is also noted for n_reviews and n_clicks, among others. Outliers are better visualized with a boxplot and are a key consideration in determining which model to choose, as some models are more robust to outliers than others.

Missing values

An important consideration for analysis and modeling is the proportion of missing values. This usually defines the type of preprocessing tasks, which could include imputing missing values or choosing a model that handles missing values well. Hence, the number of missing values is analyzed below to inform that decision.

# number of missing values per variable
data.isnull().sum()
hotel_id                   0
city_id                  508
content_score            508
n_images                 509
distance_to_center       529
avg_rating            110398
stars                    562
n_reviews                529
avg_rank                   0
avg_price                170
avg_saving_percent       170
n_clicks                   0
dtype: int64

# number of data points (observations)
n_obs = data.shape[0]

# number of observation after removing all missing values
n_obs_drop = data.dropna().shape[0]


narm_percent = (n_obs_drop/n_obs) * 100

print(f"Percentage of data left if all missing values are removed: {narm_percent: .2f}%")
Percentage of data left if all missing values are removed:  72.14%

72% (286,026 observations) of the data will remain after removing all missing values. While a sizeable amount of data will still be left for modeling, a loss of 28% of the data could have a considerable impact on the model to be developed. Dropping all rows with missing values leads to a loss of information, while imputation could introduce a sizable amount of noise or “fake” data that deviates from reality. In this regard, a decision is made in favour of choosing an algorithm that natively handles, and is less impacted by, missing values.
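As a complementary check (a quick sketch using the same dataframe), the proportion of missing values per variable can be expressed as a percentage; from the counts above, avg_rating accounts for most of the missingness (about 28% of rows).

# sketch: share of missing values per variable, as a percentage of all rows
missing_percent = (data.isnull().mean() * 100).round(2)
print(missing_percent.sort_values(ascending=False))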

Data visualization to ascertain certain assumptions required by some models

Generally, parametric models like linear regression require the data to be normally distributed and a linear relationship to exist between the predictor and target variable(s). Data visualization is one of the techniques used to verify such assumptions. Various plots are used below to visualize characteristics of the data and ascertain whether the assumptions required by certain models are met.

To investigate this, histograms are used to visualize the distribution of the continuous variables. This is implemented below.


variables=['content_score', 'n_images',
            'distance_to_center', 'avg_rating', 
            'stars', 'n_reviews', 'avg_rank',
            'avg_price', 'avg_saving_percent',
            'n_clicks'
            ]
            
# function to create histogram 
def plot_histogram(data: pd.DataFrame, colname: str):
    """Plot the distribution of variable using a histogram

    Args:
        data (pd.DataFrame): Data to use for plotting
        
        colname (str): column name or name of variable to plot 
    """
    img = data[colname].plot(kind='hist', 
                       title=f'Distribution of {colname}'
                       )
    plt.show()
    
# plot histogram of all variables  
for colname in variables:
    plot_histogram(data=data, colname=colname)

The result shows that most of the variables are right-skewed. The variable content_score is slightly left-skewed, while n_images, distance_to_center, avg_rating, n_reviews, avg_rank, avg_price, avg_saving_percent and n_clicks are right-skewed.
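To complement the visual inspection, the skewness of each variable can also be quantified; a short sketch using pandas (positive values indicate right skew) is shown below.

# sketch: numeric skewness per variable; positive values indicate right skew
print(data[variables].skew().sort_values(ascending=False))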

city_id is a categorical variable, and a number of ways exist to treat such variables, including One-Hot encoding. This method has the disadvantage of significantly increasing the dimensionality of the dataset when several categorical variables are present, or even when a single categorical variable with numerous classes exists. This could lead to a significant increase in model computation time. Hence, the decision on how to handle the city_id variable is made by first checking the number of classes it has, as below.

data['city_id'].nunique()
33213

city_id is a high-cardinality categorical variable with 33,213 unique values. This makes One-Hot encoding an expensive exercise for modeling. Worth noting is that city_id has been hashed as numeric values, which makes it acceptable to the Scikit-learn API.
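As a quick illustration of why One-Hot encoding is ruled out here, the sketch below counts the columns it would create while keeping the result sparse; filling missing city_id values with -1 is done purely for this check and is an assumption, not part of the modeling.

from sklearn.preprocessing import OneHotEncoder

# sketch only: dimensionality that One-Hot encoding city_id would introduce
ohe = OneHotEncoder(handle_unknown='ignore')  # sparse output by default
city_onehot = ohe.fit_transform(data[['city_id']].fillna(-1))
print(city_onehot.shape)  # roughly (n_rows, ~33k columns)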

Boxplot to visualize outliers

As indicated earlier, a key factor in deciding which algorithm to use is the distribution of the target and predictor variables. Some algorithms are influenced by the presence of outliers, hence outliers are analyzed to make an informed decision on which class of algorithm to choose.

# function to create boxplot
def make_boxplot(data: pd.DataFrame, variable_name: str):
    """This function accepts a data and variable name and returns a boxplot

    Args:
        data (pd.DataFrame): Data to visualize
        variable_name (str): variable to visualize with boxplot
    """
    fig = px.box(data_frame=data, y = variable_name,
                 template='plotly_dark', 
                 title = f'Boxplot to visualize outliers in {variable_name}'
                 )
    fig.show()
for var in variables:
    make_boxplot(data=data, variable_name=var)

From the boxplot of content_score, the lowest score of 7 is quite distant from the lower fence and could be seen as an outlier. The boxplots of other variables such as n_images, distance_to_center, avg_rating, n_reviews, avg_price and n_clicks suggest, upon visual inspection, that outliers are present. This information informs the decision about which model to choose. From the visualization, a decision is made in favour of choosing a model that is fairly robust against outliers instead of removing or imputing the outliers altogether.

Scatterplot to visualize relationships between predictors and the target

In order to investigate whether a linear relationship exists between the target variable and the predictor variables, scatterplots are used to visualize them. Thus, all the predictor variables are plotted against n_clicks on the y-axis. This is implemented below:

def plot_scatterplot(data: pd.DataFrame,
                 x_colname: str,
                 y_colname: str = 'n_clicks'):
    """ Scatterplot to visualize relationship between two variables. 
    Args:
        data (pd.DataFrame): Data which contains variables to plot
        
        y_colname (str): column name (variable) to plot on y-axis
        x_colname (str): column name (variable) to plot on x-axis
    """
    print(
        (ggplot(data=data, 
                mapping=aes(y=y_colname, x=x_colname)
                ) 
                + geom_point() + geom_smooth(method='lm')
                + ggtitle(f'Scatter plot to visualize relationship between {y_colname} and {x_colname}'
                    )
        )
    )
#
for colname in variables:
    plot_scatterplot(data=data, x_colname=colname)

The scatterplots show that none of the predictors has a linear relationship with the response variable. Hence, the insight gained is that parametric algorithms such as linear regression, which assume a linear relationship, are less likely to produce convincing predictions with low error.

Correlation analysis to check for multicollinearity

Another exploration to undertake is checking for multicollinearity among the predictor variables, given that some algorithms assume the variables are not strongly correlated with each other. Strong correlation between variables implies they supply similar information to the algorithm; hence, a dimension reduction technique could be used to keep only the variables that enable the algorithm to gain new insights from the data and improve predictive power.

Correlation analysis is undertaken on the predictor variables to check for multicollinearity as follows:

# 'n_clicks' is the target variable
## all others with exception of 'hotel_id' are predictors
y = data['n_clicks']

hotel_id is not included as a predictor because it is just an identifier and offers no real insight into a hotel ad’s characteristics for modeling.


## create predictor variables data
X = data.drop(columns=['hotel_id', 'n_clicks'])
X.head()
    city_id  content_score  n_images  ...  avg_rank  avg_price  avg_saving_percent
0  134520.0           70.0       2.0  ...    17.550      81.64                18.0
1  133876.0           67.0       3.0  ...    17.383     189.38                28.0
2  133732.0           39.0       3.0  ...    16.438      57.63                27.0
3   43772.0           59.0       8.0  ...     7.000      72.16                 2.0
4   50532.0           66.0       1.0  ...    12.564     173.25                 0.0

[5 rows x 10 columns]
#%% create correlation
corr = X.corr()
corr
                     city_id  content_score  ...  avg_price  avg_saving_percent
city_id             1.000000      -0.038639  ...  -0.052988           -0.057525
content_score      -0.038639       1.000000  ...  -0.074976            0.388301
n_images           -0.003005       0.012113  ...  -0.001429            0.004675
distance_to_center  0.015587      -0.113261  ...   0.007168           -0.014881
avg_rating          0.043550      -0.135439  ...   0.211038           -0.153566
stars              -0.043517       0.535937  ...  -0.010134            0.468344
n_reviews          -0.091224       0.289233  ...  -0.002831            0.372850
avg_rank           -0.029956      -0.142255  ...   0.016033           -0.121451
avg_price          -0.052988      -0.074976  ...   1.000000           -0.028133
avg_saving_percent -0.057525       0.388301  ...  -0.028133            1.000000

[10 rows x 10 columns]
# Create a mask to hide the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# visualize correlation matrix
sns.heatmap(corr, mask=mask, cmap=sns.color_palette("GnBu_d"), vmax=.3, center=0, 
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

The correlation analysis shows mostly weak correlations between the predictor variables (the strongest, about 0.54, is between stars and content_score); hence multicollinearity is not a concern.
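As an optional cross-check (a sketch assuming statsmodels is installed), variance inflation factors (VIF) can be computed on the complete cases; VIF values well below 5 would further support the conclusion that multicollinearity is not a concern.

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# sketch: VIF per predictor, computed on rows without missing values
X_complete = sm.add_constant(X.dropna())
vif = pd.Series([variance_inflation_factor(X_complete.values, i)
                 for i in range(X_complete.shape[1])],
                index=X_complete.columns)
print(vif.drop('const').round(2))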

Using insights gained from exploratory analysis to inform modeling approach

The findings of non-linear relationships between the predictors and the outcome variable, the presence of outliers, and sizable missing values suggest that a non-parametric model that handles non-linear relationships, outliers and missing values well should be used. Moreover, the business goal is to achieve good predictive accuracy rather than interpretability of the model.

On the basis of the findings from the exploratory analysis, a decision tree based model will be used. A critical aspect of the findings is that missing data is present and has to be handled; for this, a model that natively handles missing data as part of the modeling process is explored. Hence, HistGradientBoostingRegressor from the sklearn library is implemented because it is a decision tree based model with built-in handling of missing data.

The implementation is as follows;

Splitting into Training and validation dataset

The decision on how the data is proportioned for learning and evaluation is sometimes subjective. For this exercise, 70% of the data is used for training and 30% for validation. Given that there is a relatively large amount of data, 70% will likely provide enough data points to learn from and derive insights, and 30% will probably be enough to test the model on data points with varying characteristics that are unknown to the model. The implementation is as follows:

# split the data into train and validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.3, random_state=2022)

Define baseline model

Modeling can be an iterative process in an attempt to optimize and arrive at the best generalizable model.

A key question that needs to be answered before beginning the modeling process is: at what level of performance do we accept the model as doing a better job than random guesses, hence adding value to the business? There is a need to ascertain that the effort and resources put into developing a model are more worthwhile than doing nothing and making random guesses (in which case those resources are also saved).

Providing an objective answer to this question requires defining a baseline model that the model to be developed needs to outperform in order to be accepted for deployment. This implies that all algorithms used to develop the model, including those for which hyperparameters have been tuned, need to do better than the baseline model. Once this is satisfied, a comparison between the models is made to select the best model using the defined evaluation metric. In this case, the evaluation metric is the Root Mean Squared Error (RMSE).

In defining the baseline model, an assumption is made that, for each observation in the validation data, the average of all clicks in the training data is predicted as the number of clicks. With this, the error of the baseline model, which the actual model needs to beat, is estimated as follows:

#%% mean clicks in the training data
mean_clicks = y_train.mean()
print(f"Average click is {mean_clicks: .3f}")
Average click is  14.078

With the assumption that the prediction for all observations in the validation dataset is the average number of clicks in the training data, the RMSE is estimated as follows:

# Baseline model RMSE, Mean Absolute Error (MAE)

# function to calculate RMSE
def rmse(y_true, y_pred) -> float:
    return np.sqrt(np.mean((y_true - y_pred)**2))
    
baseline_rmse = rmse(y_val, mean_clicks)
print(f"Validation data RMSE for baseline model: {baseline_rmse: .3f}")
# MAE for baseline model
Validation data RMSE for baseline model:  114.057
print(f"Validation data MAE for baseline model: {abs(y_val - mean_clicks).mean(): .3f}")
Validation data MAE for baseline model:  22.300

From the above analysis, the baseline model performance is 114.057 for RMSE and 22.3 for MAE. Thus, for the actual model to be accepted as adding value to the business, it has to achieve an error better (lower) than the RMSE estimated for the baseline model.

Developing model for click prediction – HistGradientBoostingRegressor

After defining the baseline model performance, the actual model development is done as follows;

#%% Histogram-based Gradient Boosting Regression Tree.
## chosen because it has inbuilt handling of missing data

# define the HistGradientBoostingRegressor model instance
# random state set to achieve reproducible results  
histgb = HistGradientBoostingRegressor(random_state=2022)

# fit the model on the training dataset
histgb.fit(X_train, y_train)

Determining extent to which model captures pattern in the data

A common metric used to understand how well the model explains the data and captures the variance present is the coefficient of determination. For regression models, the score function estimates it and is implemented below:

# Co-efficient of determination
histgb_r2score = histgb.score(X_train, y_train)

print(f"HistGradientBoosting explains {(histgb_r2score * 100): .2f}% of variance in the training dataset")
HistGradientBoosting explains  57.68% of variance in the training dataset

A coefficient of 0.577 indicates that only about 57.7% of the variance in the training data is captured by the model, leaving about 42.3% unaccounted for. Hence, many patterns in the data have not been captured by the model.

Evaluation of Model

RMSE and MAE (Mean Absolute Error) are used to evaluate the model as follows:

# model evaluation on training data
print("Metrics for HistGradientBoosting on Training dataset")
Metrics for HistGradientBoosting on Training dataset
regressionSummary(y_train, histgb.predict(X_train))

Regression statistics

               Mean Error (ME) : 0.0206
Root Mean Squared Error (RMSE) : 82.9041
     Mean Absolute Error (MAE) : 13.3683
# model evaluation on validation data
print(f"Metrics for HistGradientBoosting on Validation dataset")
Metrics for HistGradientBoosting on Validation dataset
regressionSummary(y_val, histgb.predict(X_val))

Regression statistics

               Mean Error (ME) : -0.8717
Root Mean Squared Error (RMSE) : 93.0712
     Mean Absolute Error (MAE) : 14.2191

The results show an RMSE of 82.9 and 93.07 on the training and validation datasets respectively. This suggests that the HistGradientBoosting model is overfitting the training data to some extent. The 82.9 RMSE on the training data implies that, on average, the model’s predictions deviate from the actual number of clicks by about 83 clicks. The evaluation indicates that the model performs better than the baseline model. Nonetheless, there is a need to improve the model by tuning hyperparameters and trying alternative models.

Hyperparameter tuning

While the model with default parameters is doing a better job than a random guess (represented by the baseline model), the aim is to reduce the error as much as possible to increase prediction accuracy. While it is possible to achieve this by using a different model, a common approach is to tune the hyperparameters of the model already developed to verify whether improvement is attainable. Hyperparameter tuning can be undertaken using grid search, where a set of parameter values is specified and every combination is tested to find which one best reduces the error. Another approach is randomized search, where a range or distribution is specified for numeric parameters and the algorithm randomly selects hyperparameter values within the range or condition specified to optimize the model. Bayesian optimization is yet another approach.

For grid search, all possible combinations are tried, while randomized search randomly samples a number of combinations equal to the value of the n_iter argument specified.
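For illustration, a randomized search on the same model could be sketched as below; the parameter distributions and n_iter here are assumptions for demonstration, not tuned values. The grid search actually used in this post follows.

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# sketch: randomized search over illustrative parameter distributions
param_distributions = {'max_depth': randint(5, 15),
                       'learning_rate': uniform(0.01, 0.3),
                       'min_samples_leaf': randint(5, 30),
                       'l2_regularization': uniform(0.0, 1.0)
                       }
random_search = RandomizedSearchCV(HistGradientBoostingRegressor(random_state=2022),
                                   param_distributions=param_distributions,
                                   n_iter=20,  # number of random combinations sampled
                                   scoring='neg_root_mean_squared_error',
                                   cv=5, random_state=2022
                                   )
# random_search.fit(X_train, y_train)
# print('Best parameters: ', random_search.best_params_)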

Grid search is demonstrated below. Given time and computational constraints, only a limited set of parameter combinations is specified to tune the HistGradientBoostingRegressor.

# specify values for hyperparameters for the grid search
param_grid = {'max_depth': [10,11,12,],
              'learning_rate': [0.1, 0.8, 0.9],
              'min_samples_leaf': [10,12,14],
              'l2_regularization': [0.1, 0.8, 1]
            }

# create an instance of the grid search
gridSearch = GridSearchCV(HistGradientBoostingRegressor(random_state=2022,
                                                        verbose=0
                                                        ),
                          param_grid, cv=5
                          )
# fit the model for grid search
gridSearch.fit(X_train, y_train)

# show parameters that produce the least error from the possible combinations
print('Best parameters: ', gridSearch.best_params_)
Best parameters:  {'l2_regularization': 0.8, 'learning_rate': 0.1, 'max_depth': 10, 'min_samples_leaf': 10}

The hyperparameter tuning was done as a quick search for demonstration purposes only. A well-designed parameter set will take longer computational time to achieve a good result. That noted, the best model from the tuning is evaluated as follows:

improved_histboostgb = gridSearch.best_estimator_
print(f"Metrics for best tuned HistGradientBoosting on Training dataset")
Metrics for best tuned HistGradientBoosting on Training dataset
regressionSummary(y_train, improved_histboostgb.predict(X_train))

Regression statistics

               Mean Error (ME) : 0.0253
Root Mean Squared Error (RMSE) : 79.0735
     Mean Absolute Error (MAE) : 13.1996

print("Metrics for best tuned HistGradientBoosting on Validation dataset")
Metrics for best tuned HistGradientBoosting on Validation dataset
regressionSummary(y_val, improved_histboostgb.predict(X_val))

Regression statistics

               Mean Error (ME) : -0.8557
Root Mean Squared Error (RMSE) : 92.1132
     Mean Absolute Error (MAE) : 14.0488

Using a grid search over the hyperparameter combinations specified, the RMSE reduced from 82.9 to 79.07 on the training data and from 93.07 to 92.11 on the validation data. Generally, some reduction in error is attainable with hyperparameter tuning, particularly when time and computational resources are dedicated to it. Overfitting is still suggested by the results and could be addressed by tuning the regularization parameter, as sketched below.
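A minimal sketch of that follow-up is shown below, searching a wider l2_regularization range while keeping the other best parameters fixed; the candidate values are assumptions for illustration.

# sketch: widen the regularization search to curb overfitting (illustrative values)
reg_grid = {'l2_regularization': [0.5, 1, 5, 10, 50]}
reg_search = GridSearchCV(HistGradientBoostingRegressor(random_state=2022,
                                                         learning_rate=0.1,
                                                         max_depth=10,
                                                         min_samples_leaf=10
                                                         ),
                          reg_grid, cv=5
                          )
# reg_search.fit(X_train, y_train)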

Bagging as an approach to improving model performance and reducing overfitting

In order to further reduce the error, the tuned model is bagged. This approach draws multiple random samples with replacement, fits a HistGradientBoostingRegressor on each sample, and aggregates their predictions. As expected, this helps reduce overfitting, since many resampled fits are averaged, making the model more stable on unseen data.

Bagging is implemented for the HistGradientBoostingRegressor as follows

# create an instance for model
bagging = BaggingRegressor(HistGradientBoostingRegressor(random_state=2022,
                                                        l2_regularization=0.8, 
                                                        learning_rate=0.1, 
                                                        max_depth=10, 
                                                        min_samples_leaf=10
                                                        ),
                           n_estimators=100,
                           random_state=2022
                           )
                           
# fit the model 
bagging.fit(X_train, y_train)
print("Metrics for bagged HistGradientBoosting on Validation dataset")
Metrics for bagged HistGradientBoosting on Validation dataset
regressionSummary(y_val, bagging.predict(X_val))

Regression statistics

               Mean Error (ME) : -1.0130
Root Mean Squared Error (RMSE) : 90.2835
     Mean Absolute Error (MAE) : 14.0052
print("Metrics for bagged HistGradientBoosting on Training dataset")
Metrics for bagged HistGradientBoosting on Training dataset
regressionSummary(y_train, bagging.predict(X_train))

Regression statistics

               Mean Error (ME) : -0.0685
Root Mean Squared Error (RMSE) : 80.8758
     Mean Absolute Error (MAE) : 13.3553

The result for the bagged HistGradientBoosting shows further improvement, with the RMSE on the validation data reducing to 90.28 compared to 92.11 for the tuned HistGradientBoosting without bagging.

In addition, bagging also reduced the amount of overfitting, given that it decreased the gap between the validation and training RMSE (90.28 vs. 80.88) compared to the tuned HistGradientBoosting, which produced 92.11 and 79.07 for the validation and training RMSE respectively. Generally, the smaller the difference in error between the training and validation datasets, the lower the overfitting. The goal is to produce a model that scores a low error with minimal difference between the validation and training datasets.

Alternative models – XGBRFRegressor

XGBRFRegressor is the Extreme Gradient Boosting (XGBoost) random forest regression model and can be seen as an advanced variation of the decision tree model in which multiple trees are grown with random sampling of features and data, hence Random Forest. This API also inherently handles missing data, making it another option among decision tree models that cater for missing data, as undertaken with HistGradientBoostingRegressor.

Its implementation is as follows:

## create an instance of XGBRFRegressor with default parameters 
## and set random state for reproducible results
xgb = XGBRFRegressor(random_state=2022)
# fit model on training data
xgb.fit(X_train, y_train)
# XGBRFRegressor model evaluation on training data 

print(f"Metrics for XGBRFRegressor on Training dataset")
Metrics for XGBRFRegressor on Training dataset
regressionSummary(y_train, xgb.predict(X_train))

Regression statistics

               Mean Error (ME) : -0.0010
Root Mean Squared Error (RMSE) : 95.5276
     Mean Absolute Error (MAE) : 15.1109
# XGBRFRegressor model evaluation on validation data
print(f"Metrics for XGBRFRegressor on Validation dataset")
Metrics for XGBRFRegressor on Validation dataset
regressionSummary(y_val, xgb.predict(X_val))

Regression statistics

               Mean Error (ME) : -0.9871
Root Mean Squared Error (RMSE) : 99.1090
     Mean Absolute Error (MAE) : 15.2338

The RMSE of 99.11 on the validation dataset indicates that XGBRFRegressor performs better than the baseline model but with a higher error than HistGradientBoosting. Thus, HistGradientBoosting is chosen over XGBRFRegressor on the basis of a lower RMSE on the validation dataset.

The importance of the various features in the model, otherwise known as the predictive power of the variables, is estimated below.

#%% estimate feature importance in the model
## index the importance scores by the feature names before sorting
## so that each score stays paired with the correct feature
important_features = pd.Series(data=xgb.feature_importances_, index=X_train.columns)
important_features.sort_values(ascending=True, inplace=True)

## create a DataFrame with features and their importance score in the model
feature_importance_df = pd.DataFrame({'Predictors': important_features.index,
                                      'importance': important_features.values,
                                      }
                                     )
feature_importance_df
           Predictors  importance
0               stars    0.054296
1  avg_saving_percent    0.055762
2          avg_rating    0.065593
3            avg_rank    0.071565
4       content_score    0.080671
5            n_images    0.090017
6           avg_price    0.092771
7  distance_to_center    0.138946
8           n_reviews    0.152653
9             city_id    0.197725

Visualize feature importance

fig = px.bar(data_frame=feature_importance_df, y="Predictors", x="importance", orientation="h",
             color="Predictors", title="Feature Importance in XGBRFRegressor",template="plotly_dark"
             )
fig.show()

From the visualization above, city_id has the most predictive power in the XGBRFRegressor model fitted on the data, followed by n_reviews and distance_to_center.

Potential measures to improve the model

Modeling can be a very iterative exercise constrained by time and resources such as computational power, and this is duly recognized here. It is therefore concluded that there is a good chance the model developed here can be further improved, given those resources, using appropriate measures. Instead of aiming to exhaust every implementation to produce the lowest possible error, some of the measures to improve the model are highlighted here as a further course of action.

  1. Feature selection and engineering:

Feature selection could be explored to select only the most influential features and probably reduce overfitting. Some categorical features could also be transformed by encoding, if computational power allows for the increase in data dimensionality, to explore the potential of improving the model; One-Hot encoding was ruled out for city_id due to its high cardinality. While normalization or scaling alone is less likely to significantly improve the performance of the decision tree based models explored here, the technique should be explored for other types of models. Closely related is the option of augmenting the data with new variables. Potential variables to collect include the availability of special services at hotels, such as excursions, guided tours and cultural performances. These could be influential variables that appeal to users to click and book, captured either as categorical variables or combined into an ordinal rating of a hotel’s special services. A sketch of feature selection with permutation importance is given after this list.

  2. Handling missing data: Experiment with different techniques of handling missing data to discover which approach reliably produces a stable and optimized model. This could be iterative, but it is worth dedicating resources to establish a domain-relevant method of handling missing data, since missingness appears to be a challenge for this dataset.

  3. Tuning hyperparameters:

The models developed can be further improved by experimenting with hyperparameters to select the combinations that best reduce the error.
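A minimal sketch of point 1 using permutation importance on the validation data is shown below; the number of repeats and the selection threshold are assumptions for illustration.

from sklearn.inspection import permutation_importance

# sketch: rank features by permutation importance and keep the more influential ones
perm_result = permutation_importance(improved_histboostgb, X_val, y_val,
                                     n_repeats=5, random_state=2022
                                     )
perm_scores = pd.Series(perm_result.importances_mean, index=X_val.columns).sort_values()
selected_features = perm_scores[perm_scores > 0].index.tolist()  # illustrative threshold
print(perm_scores)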

Making prediction on test data

The test data is not used in training the model but for prediction. This is one of the approaches used to gain an objective view of the model’s generalizable performance, as was done with the validation dataset. To make predictions on the test data, the bagged HistGradientBoostingRegressor model is used.

Generally, the predictions made are float values, some of which do not reconcile with reality in terms of the unit of the target variable, which is a count of clicks. For instance, a prediction of 8.387 deviates from logic, as clicks can only come in whole numbers and not decimals. Thus, the predictions are rounded off to the nearest whole numbers, which for this example is 8. It should be noted that this rounding could itself introduce some margin of error. More importantly, the number of clicks is a count, hence a Poisson regression could be more appropriate for the modeling exercise and should be explored.
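A minimal sketch of that alternative, assuming the same features and split, switches the HistGradientBoostingRegressor loss to 'poisson' so that the model treats the target as a non-negative count.

# sketch: count-aware variant of the same tree-based model using a Poisson loss
poisson_model = HistGradientBoostingRegressor(loss='poisson', random_state=2022)
poisson_model.fit(X_train, y_train)
print(f"Validation RMSE (Poisson loss): {rmse(y_val, poisson_model.predict(X_val)): .3f}")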

## read test dataset
test_data = pd.read_csv(r"Data/test_set.csv")

test_data.head()
       hotel_id   city_id  ...  avg_price  avg_saving_percent
0   14942256073  122750.0  ...      90.19                32.0
1   16036037903   28134.0  ...      98.27                19.0
2  288585940112   30578.0  ...      48.77                 0.0
3  129041645070   54398.0  ...      72.32                 0.0
4   12460296563   63890.0  ...      24.54                19.0

[5 rows x 11 columns]
# making prediction on test dataset -- select features and use model to make prediction
test_data["predictions_made"] = bagging.predict(test_data.iloc[:, 1:])
test_data
            hotel_id   city_id  ...  avg_saving_percent  predictions_made
0        14942256073  122750.0  ...                32.0          8.387677
1        16036037903   28134.0  ...                19.0         25.187326
2       288585940112   30578.0  ...                 0.0          0.142417
3       129041645070   54398.0  ...                 0.0          1.738487
4        12460296563   63890.0  ...                19.0          7.663731
...              ...       ...  ...                 ...               ...
132157  146542092044   22896.0  ...                 0.0         10.893020
132158  191459858176   28030.0  ...                 4.0          6.136969
132159   71241988826  137926.0  ...                19.0          2.265503
132160  204878950620   54574.0  ...                 0.0          0.580113
132161   99788791152  774630.0  ...                 0.0          2.434706

[132162 rows x 12 columns]
# round off predictions to nearest whole to reflect actual counts of clicks


test_data['n_clicks'] = round(test_data['predictions_made'])
test_data
            hotel_id   city_id  ...  predictions_made  n_clicks
0        14942256073  122750.0  ...          8.387677       8.0
1        16036037903   28134.0  ...         25.187326      25.0
2       288585940112   30578.0  ...          0.142417       0.0
3       129041645070   54398.0  ...          1.738487       2.0
4        12460296563   63890.0  ...          7.663731       8.0
...              ...       ...  ...               ...       ...
132157  146542092044   22896.0  ...         10.893020      11.0
132158  191459858176   28030.0  ...          6.136969       6.0
132159   71241988826  137926.0  ...          2.265503       2.0
132160  204878950620   54574.0  ...          0.580113       1.0
132161   99788791152  774630.0  ...          2.434706       2.0

[132162 rows x 13 columns]

Summary

This project demonstrates a real-world approach to using machine learning to solve a business problem. The problem was that the bidding team lacked a mechanism supporting their goal of enabling advertisers to run successful ad campaigns. The solution offered is a prediction tool powered by a decision tree based model that enables the bidding team to predict ad clicks better than random guesses. The project demonstrates how exploratory analysis serves as a signpost pointing to which machine learning algorithm to use. The model was trained and evaluated, and hyperparameter tuning was undertaken. Bagging was also applied to reduce overfitting. Further suggestions to explore for improving the model are provided.