Analyzing the impact of platform features on booking sales

Linear regression revenue impact assessment

This post analyzes how various strategies employed by a business for growth actually influences their sales. The data driven approach aims to assess how a firm will progress forward. To gain such actionable insights, this post demonstartes how to undertake a linear regression with focus on result interpretability.

Linus Agbleze https://agbleze.github.io/Portfolio/
2022-09-05

Background

Businesses exist to make money and this simple assertion also has a simple rule. Do more of what grows the business and less of what makes losses. This golden rule is known by all but practice by few. In fact, it is not knowing it that matters but identifying the impact of various business drivers that is the silver bullet. Luckily, many businesses have what it takes to take the perfect shot. It is data analysis and that is what will be undertaken in this post for an anonymized firm to make an inform decision to increase sales. The product team is preoccupied with the question of how user interaction is influencing sales on their booking platform. For such an analysis, the aim is not to develop a well optimized model that predict sales but to understand how the various features on their platform influence sales. How much of an insights is derive to drive the business is dependent on the dataset features. With the understanding of what the business is and what the product team wants, the first step for the analysis is to formulate business questions that the analysis should provide answers to. This analysis proceeds to answer the following questions

Business questions

  1. What is the impact of user verification, instant booking, and faster page loading time on booking sales

  2. To what extent does these variables explain booking sales?

Data analysis – Overview of the variables

Two main categories of variables are required for the analysis; outcome variable and predictor variable. The analysis to be undertaken is regression given that the outcome variable is numeric and continuous. It is extremely important to group the variables in the dataset rightly as this informs the modeling process. The outcome variable is what is being predicted or explained while the predictors are the variables that are used in predicting or explaining the outcome variable. For this analysis, the following variables are available for use

Outcome variable

Total booking sales: This variable depicts the total price that a customer pays for booking an apartment

Predictors

Test status: This is a binary predictor where customers who were offered the feature of a faster page loading time are classified as test and those without named as control. For this analysis, the reference group is designated to be test users. The reference group is exempted from the model and serve as the benchmark for comparison and interpretation of that variable.

User verification is a binary predictor indicating whether a user has been verified or not. Users who are verified are labeled ‘Verified’ and those not verified are named as ‘Not_verified’ for the variable “user_verified”. The reference group for the regression analysis for this variable was users who are not verified.

Instant booking: is a binary predictor indicating whether or not a customer used the instant booking feature. Customers who used the instant booking feature were designated as the reference group for the analysis

Worthy of notice is that all the predictors are binary categorical variables but numeric and continuous variables can also be use when available. The various libraries for the analysis are loaded and the dataset read using the code below.

library(readr)
library(tidyverse) # data manipulation and visualization
library(ggplot2)
library(GGally)
library(ggstatsplot)
library(modelr)     # provides easy pipeline modeling functions
library(broom)      # helps to tidy up model outputs
library(car)  ## for regression
library(haven)
library(caret)
library(rsample)
all_conversions_variables = read_csv("all_conversions_variables.csv")

Visualization of the data

It is always a good idea to visualize the variables to gain some understanding before the modeling process. First, the average sales made by verified and non-verified users is estimated and visualized using the code below.

user_verified_sales <-  all_conversions_variables %>% 
  select(user_verified, total_paid.EUR) %>%
  na.omit() %>%
  dplyr::group_by(user_verified) %>%
  dplyr::summarise(avg_sales = mean(total_paid.EUR)) 
user_verified_sales
# A tibble: 2 × 2
  user_verified avg_sales
  <chr>             <dbl>
1 Not_verified      2475.
2 Verified          4965.
ggplot(data=user_verified_sales, 
       mapping=aes(x=user_verified, y=avg_sales, fill=user_verified)) +   geom_col()+ 
  ggtitle('Average sales (verification status)') + 
  xlab("user verification status") + ggthemes::theme_wsj() + 
  ylab("Average booking sales")

The result shows that verified users booked more rooms, generating almost twice as much sales on the average compared to non-verified users.

Average sales based on whether customer used instant booking feature

In order to compute average sales based on instant booking feature usage, the data is grouped based on instant_booking variable and the mean sales for each group is estimated. This is undertaken as follows;

# estimate avg booking sales by instant booking
avg_instant_booking = all_conversions_variables %>%
  dplyr::select(instant_booking, total_paid.EUR) %>%
  dplyr::group_by(instant_booking) %>%
  dplyr::summarise(avg_sales = mean(total_paid.EUR))

avg_instant_booking
# A tibble: 2 × 2
  instant_booking avg_sales
  <chr>               <dbl>
1 Instant             2983.
2 Not_instant         5424.

The result is visualized as follows;

ggplot(data = avg_instant_booking, 
       mapping = aes(x = instant_booking, y = avg_sales, 
                     fill = instant_booking)) + geom_col() + 
  ggtitle("Average sales (instant booking)") + 
  xlab("Instant booking types") + ylab("Average booking sales") + 
  ggthemes::theme_wsj()

Avergare sales based on test status

Average sales made by test group can be compared to the control group. Users labeled as test experienced a faster page loading time. Average sales can be estimated using the code below.

avg_sales_test_status <- all_conversions_variables %>%
  select(test_status, total_paid.EUR) %>%
  na.omit() %>%
  dplyr::group_by(test_status) %>%
  dplyr::summarise(avg_sales = mean(total_paid.EUR))

avg_sales_test_status
# A tibble: 2 × 2
  test_status avg_sales
  <chr>           <dbl>
1 control         4345.
2 test            5482.
ggplot(data = avg_sales_test_status, 
       mapping = aes(x = test_status, y = avg_sales, fill = test_status)) +
  geom_col() + ylab("Average booking sales") + 
  ggtitle("Average sales (test status)") + 
  ylab("Average booking sales") +
  xlab("Test status") + 
  ggthemes::theme_wsj()

Regression analysis

After visualization, it is time to build a regression model. Given that the focus is on interpretability, linear regression will be built using all the variables identified. The procedure is highlighted as follows;

  1. Set seed to make splitting of dataset reproducible
  2. Randomly split the dataset into training and test set. This process is a standard for all machine learning processes. Nonetheless, it can be avoided here given that the focus is not on estimating the performance of the model on unseen data.
  3. Build the model, identifying the outcome and predictor variables
  4. Estimate model co-efficient for the various predictor variables and their significance
  5. Estimate the co-efficient of determination to explain the amount of variability in the data explained by the model
  6. Interpret the results

The steps identified above are implemented as follows

# set seed for reproducibility
set.seed(123)

# Spliting dataset into training and testing samples
sample_all_conversions <- sample(c(TRUE, FALSE), nrow(all_conversions_variables), replace = T, prob = c(0.6,0.4))

# retrieve training dataset
train_all_convern.lmvar <- all_conversions_variables[sample_all_conversions, ]

## retrieve testing dataset
test_all_convern.lmvar <- all_conversions_variables[!sample_all_conversions, ]

## regression for qualitative predictors (categorical dataset -- factors) 
options('contrasts' = c('contr.treatment','contr.treatment') )
model4_all_convern_factors <- lm(total_paid.EUR ~ test_status + instant_booking +
                                    user_verified, data = train_all_convern.lmvar)


# add model diagnostics to our training data
model4_results <- augment(model4_all_convern_factors, train_all_convern.lmvar)

Plot results of regression analysis

booking_sales.predictioplot <- ggcoefstats(
  x = stats::lm(formula = total_paid.EUR ~ test_status + instant_booking +
                  user_verified, data = train_all_convern.lmvar),
  ggtheme = ggplot2::theme_gray(),
  title = "Regression analysis: Predicting the influence of various factors on booking sales",
) 

booking_sales.predictioplot

Determining extent to which various variables explain booking sales

The second question for this analysis will be answered by estimating the co-efficient of determination as follows

rsquare(model4_all_convern_factors, data = train_all_convern.lmvar)
[1] 0.0416999
rsquare(model4_all_convern_factors, data = test_all_convern.lmvar)
[1] 0.009705932

Results and interpretation

The answers to the first question for this analysis is answered by the coefficient of the variables which explains their influence on booking sales.

The results shows that based on test status, test visitors, that is those who experienced faster page loading time are less likely to make higher booking sales as they made bookings worth €452 less than what is ordered by the control group. Also, visitors who did not use instant booking made purchase worth €2,319 more than those who used instant booking. Hence instant booking is less likely to result in higher booking sales. A possible explanation is customers may be using instant booking for cheap offers that will not last long and need to booked immediately. Thus, on the average generate lower revenue compared to those who book more expensive offers that are price stable hence do no use instant booking.

Moreover, users who are verified made bookings worth €1,085 more than those who are not verified. A possible explanation is getting verified means investing one’s trust in the platform to be legitimate and worthwhile hence willing to book more expensive apartments.

The results however suggests that there is no statistically significant difference in purchases among visitors based on the variables analyzed given that the p-value found was greater than 0.05 in all cases.

In response to the second question for this analysis, the predictors analyzed explain only 4% of variations in booking sales (rsquare = 0.0416) as highlighted by the co-efficient of determination for training set. The predictive power of the model on unseen data is very low as it explains only 0.9% of variability for test dataset but this is not the focus so no optimization technique will be employed for that purpose.

Recommendations

Now it is time to restate specific golden rules as follows,

  1. Implement more strategies to entice users to get verified as it gives them the confidence to book more high value accommodations hence higher sales

  2. Resources dedicated to faster page loading time are not contributing to high sales. In case, this is the main objective for implementing faster page loading time feature, then it needs to be reconsidered. It is however noted that, such resources may not be entirely wasted given that faster page loading time reduces bounce rate as assessed in a related analysis. Thus, the final decision should be entirely based on the strategic business goal.

Summary

This blog post discussed taking a data-driven approach to running a business. The business problem was absence of a clear understanding of how features implemented by an anonymized online accommodation booking firm are being used by their clients to impact sales. The solution provided was a linear regression explaining the extent to which the various variables analyze account for sales. With focus on gaining insight on how the various factors influence sales, the model developed suffices. However, there is the need to consider other important variables if we are to focus on developing precise models for predicting booking sales.