This post discusses a detailed analysis for a start-up that aims to make a data driven decision on optimal marketing strategy that maximizes sales and conversion as well as understanding factors driving them.
Businesses are well aware of the various strategies by which to reach their ends which could selling products, increasing sales, reducing cost, enlisting more clients among others. In many cases, the goal is not to know the various strategies that exist impractically execute all of them. Rather the goal is to utilize the most rewarding and optimal strategy that maximizes the business goal. This understanding directly feeds into how products are marketed, customers are reached among others.
Of the many strategies that exists, how do we know if the best method is being used to achieve the best results? For instance, how do marketing teams know an ad will achieve a better sales than a different one? How do sales team determine the best demo session approach to convert a prospective client? Well, nothing is known by gut feeling, at least when money is at stake. The best evidence that can be made for a strategy is one that is testable with verifiable principles. What are a way to make a case for statistical analysis for businesses! This post aims to demonstrate such a case.
In essence, this post undertakes statistical analysis for a firm that is challenged with the problem of determining the best sales script to use during a demo session with a prospective client in order to increase number of conversions. The firm provide services for which prospective clients sign-up. In order to stay in business, it is not enough to access the market by landing demo request but more essentially turn those requests into conversions and increase conversion rate. To get prospective clients sign-up for their services, the firm provides a demo session after a client has requested for one. The sales team then decide the strategy to use in convincing the prospective client to sign-up during the demo session. The strategy is one of 3 sales scripts that is used during the demo session. For such a strategy, there is the possibility that one of the scripts is optimal in attaining higher conversion rate by design and not by chance. The aim is to identify this script if existing and recommend it to the sales team in order to stop incurring the opportunity cost of not using the optimal script.
A key component for data analysis in business setting is that the business goal informs the kind of analysis to be done and business understanding determines how the analysis should be undertaken. In this case, the goal of the business, sales team to be specific, is stated as follows;
To increase conversion from demo request to contract signing
The sales team aims at undertaking a data-based decision making to achieve their goal. To achieve that, data has been collected to be analyzed to gain insights. The challenge is to discern insights from the data that enable them attain their goal by asking the right business questions. Data analysis is undertaken in response to business problems hence it is critical that all the problems that the analysis aims to provide solutions or rather recommendations on how to solve them, are identified in an analytical framework or statement. This is identified as problem statement as highlighted as follows;
Given the fact that sales script was the main tool used during the demo session, it is hypothesized to be associated with conversion. This understanding translates into the following analytical framework for further assessment.
Null Hypothesis (H0): There is no relationship between type of sales script and conversion
Alternative Hypothesis (H1): There is a relationship between type of sales script and conversion
While the total conversion for each script can be easily assessed, it is not definitive of a script’s better performance without an element of chance. There is the need to assess whether difference in conversion is significant to suggest a preference for one of them to increase the chances of achieving higher conversion. This need not be a guesswork but data driven hence translated into the following hypothesis for analysis;
Null hypothesis (H0): Difference in mean conversion among all sales scripts is equal
Alternative hypothesis (H1): There is difference in mean conversion for at least one of the sales scripts
The result for the hypothesis stated above will determine whether or not there is the need for further enquiry in the form of post-hoc test. If the analysis results in rejection of the null hypothesis, then there will be the need to assess how the different sales scripts compare to each other in order to identify which of the sales scripts produce significantly higher mean conversion.
For business success, the higher the amount of conversions, the better. The business problem at hand is to understand the drivers of conversion. The decision by a client to sign-up or not during or after a demo session is one that needs to be better understood in order to reinforce what produces conversion and eliminates factors that prevents conversion. This problem requires assessing how various factors for which data is available are influencing conversion.
Based on the the business problem statements conceptualized, this analysis is undertaken to provide answers to the following business questions
Is there a significant relationship between type of sales script and conversion?
Are all sales scripts achieving a similar amount of conversions?
How do sales scripts compare in terms of conversions and which sales scripts can be used to achiever higher conversions?
How do various factors influence the probability of a client to sign-up?
To define the data analysis tasks, the business questions are translated into objectives as follows
To test the hypothesis that there is a statistically significant relationship between type of sales script and conversion
To test the hypothesis that difference in mean conversion among sales script is equal
To assess and identify which sales script produces higher average conversions if any
To assess potential drivers of conversion and understand their influence
This section details the procedure used to analyzed the data to derive insights and draw recommendations. First of all it is important to highlight some of the formula used.
Conversion rate = \(frac{Number of contracts signed}{Number of Demo requests}\100\)
Conversion rate for script = \(frac{Total conversions when script was used}{Total number of demo sessions when script was used}\)
In order to make results reproducible and understandable for contextualized interpretation, much effort is made to lay bare assumptions that may influence how insights is drawn. Some of these assumptions are required for the statistical analysis undertaken to hold true. These are highlighted in this section as follows.
Scripts were used independently, that is a single script was applied for a client. All scripts were used for approximately the same time period. It is however realized from the data that the first demo appointment session for which script C was used was on 2020-01-08 16:50:35 and the last date of use was 2021-03-01 19:40:01. For script A, it was first used for a demo session on 2019-12-28 03:57:38 and its last usage was on 2021-03-29 12:44:07. Script B was first used for a demo session on 2019-12-28 11:38:55 and last used was on 2021-03-06 16:57:21. Thus, some differences in date of first usage and last usage of script exist but this is assumed to be negligible. The script were indeed used independent of each other.
## Import packages
import pandas as pd
from plotnine import ggplot, ggtitle, aes, geom_col, theme_dark, theme_light, geom_boxplot
import numpy as np
import matplotlib.pyplot as plt
import datetime
import seaborn as sns
import scipy.stats as stats
from bioinfokit.analys import stat
import statsmodels.api as sm
import pingouin as pg
import scikit_posthocs as sp
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
The data analysis begins with importing the data and inspecting the variables. In this section, the data is explored.
## import data and read top 5
= pd.read_csv(data_path)
df
## View first 3 rows of the data
3) df.head(
Unnamed: 0 ... first_call_to_1streach_timelength_minutes_
0 1 ... 327.465776
1 2 ... 138.429704
2 3 ... 357.238224
[3 rows x 26 columns]
## Identify the categories of sales scripts
df.sales_script_variant.unique()
array(['a170e8b5b0085420fa52f9f9e1d546f9',
'e908f62885515872936a2bf07e5960a0',
'21790c97eeb6336e5f0fdb9ef4de636f'], dtype=object)
## get description of the variables in the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 10000 non-null int64
1 request_id 10000 non-null int64
2 source 10000 non-null object
3 request_created_at 10000 non-null object
4 company_type 5084 non-null float64
5 first_call 10000 non-null object
6 first_reach 9418 non-null object
7 sales_script_variant 10000 non-null object
8 sales_group_name 10000 non-null object
9 demo_appointment_datetime 10000 non-null object
10 appointment_added_by_id 10000 non-null int64
11 is_signed 10000 non-null int64
12 conversion 10000 non-null object
13 sales_script_type 10000 non-null object
14 company_group 10000 non-null object
15 request_created_date 10000 non-null object
16 first_call_date 10000 non-null object
17 first_reach_date 9418 non-null object
18 request_created_year 10000 non-null int64
19 request_created_month 10000 non-null object
20 request_to_1stcall_interval 10000 non-null object
21 request_to_1streach_interval 10000 non-null object
22 first_call_to_1streach_interval 10000 non-null object
23 request_to_1stcall_timelength_minutes_ 10000 non-null float64
24 request_to_1streach_timelength_minutes_ 9418 non-null float64
25 first_call_to_1streach_timelength_minutes_ 9418 non-null float64
dtypes: float64(4), int64(5), object(17)
memory usage: 2.0+ MB
## shows the shape of the data -- number of rows and columns
df.shape
(10000, 26)
## Cast the data into DplyFrame in order to use dplython functions on it.
= df.copy()
df_dataframe df_dataframe.columns
Index(['Unnamed: 0', 'request_id', 'source', 'request_created_at',
'company_type', 'first_call', 'first_reach', 'sales_script_variant',
'sales_group_name', 'demo_appointment_datetime',
'appointment_added_by_id', 'is_signed', 'conversion',
'sales_script_type', 'company_group', 'request_created_date',
'first_call_date', 'first_reach_date', 'request_created_year',
'request_created_month', 'request_to_1stcall_interval',
'request_to_1streach_interval', 'first_call_to_1streach_interval',
'request_to_1stcall_timelength_minutes_',
'request_to_1streach_timelength_minutes_',
'first_call_to_1streach_timelength_minutes_'],
dtype='object')
Non-conversion are cases where the client did not sign up during and after the demo session Given that our goal is to increase conversion, the analysis will be centered around that. First, the total number of conversion is estimated and compared to non-conversion
From the analysis below, total number of conversion was 5,018 which is slightly higher than non-conversion of 4,982. The sum of both conversion and non-conversion equates to the total number of requests made.
## select some columns needed to estimating conversion
= df_dataframe[['is_signed', 'conversion', 'sales_script_type', 'request_to_1streach_timelength_minutes_', 'company_group']]
df_dplyselect
## Group data based on conversion column and count number of conversion
= df_dplyselect['conversion'].value_counts().reset_index()
conversion_total
={'index': 'conversion', 'conversion': 'total_conversion'}, inplace=True)
conversion_total.rename(columns conversion_total
conversion total_conversion
0 convert 5018
1 not_convert 4982
='conversion', y='total_conversion'))
(ggplot(conversion_total, aes(x+ geom_col(stat = 'identity') + ggtitle('Bar chart of Total Conversion and non-conversion') + theme_light()
)
<ggplot: (8794146572797)>
After estimating conversion and non-conversion, conversion needs to be estimated. This can be considered as one of the major KPI of interest to the sales team.
Сonversion rate = \(frac{Number of contracts signed}{Number of Demo requests}\)
## The number of Demo requests] is estimated below
= conversion_total['total_conversion'].sum()
request_total request_total
10000
## Conversion rate
= conversion_total.iloc[[0],[1]] / request_total
conversion_rate
=str, columns={"total_conversion": "C1"}) conversion_rate.rename(index
C1
0 0.5018
Thus, conversion rate is estimated to be 0.5018. It is concluded that conversion rate in percentage terms is 50.2%
In order to better understand conversion, there is the need to explore the data based on certain dimensions.
= df_dplyselect.groupby(['company_group', 'conversion'])['conversion'].count().to_frame().rename(columns={'conversion': 'total_count'}).reset_index()
company_conversion
company_conversion
company_group conversion total_count
0 Others convert 2446
1 Others not_convert 2470
2 company_A convert 1027
3 company_A not_convert 1039
4 company_B convert 1545
5 company_B not_convert 1473
='company_group', y='total_count', fill='conversion'))
(ggplot(company_conversion, aes(x+ geom_col(stat='identity', position='dodge')) + theme_dark() + ggtitle('Conversion based on type of company')
<ggplot: (8794153999857)>
From our research questions and hypothesis, conversion based on sales script type is will offer valuable insight hence the data is grouped and conversion is estimated and visualized for the different sales group before testing their hypothesis further.
## group data based on type of sales script and count total conversion
= df_dplyselect.groupby(['sales_script_type', 'conversion'])['conversion'].count().to_frame().rename(columns={'conversion': 'total_conversion'}).reset_index()
script_conversion
script_conversion
sales_script_type conversion total_conversion
0 script_A convert 3320
1 script_A not_convert 3237
2 script_B convert 1638
3 script_B not_convert 1692
4 script_C convert 60
5 script_C not_convert 53
### Plot the total conversion based on script type
='sales_script_type', y='total_conversion', fill='conversion'))
(ggplot(script_conversion, aes(x+ geom_col(stat='identity', position='dodge')) + theme_dark() + ggtitle('Conversion based on type of sales script')
<ggplot: (8794154038872)>
Estimates of conversion based on script type shows that both script A and script C made a higher conversion compared to non-conversion. Script A made 83 more conversions than non-conversion while script C made 7 more conversion than non-conversion. On the contrary, script B made 54 less conversions compared to non-conversion.
While this difference gives a clue about performance of the various script, it does not enables us to make decisive conclusion but guesses of what the difference could result in. Forinstance, the total conversion is a summation and hence the difference could be the result of number of demo sessions that a script has been used for. Clarity needs to be brought to such guessestimates by computing their mean.
In order to make a data driven decision, hypothesis need to be tested to derive better understanding base on statistical significance.
To test the hypothesis that there is statistically significant relationship between type of sales script and conversion.
Proceeding from the insights gained, this section tests the hypothesis for the first objective
H0: Conversion is independent of type of sales script used
H1: Conversion is dependent on type of sales script used
Chi squared test of independence is appropriate for assessing whether there is a relationship between two categorical variables hence used. The procedure involves computing a contingency table and using that for the analysis. This is demonstrated below.
Contingency table
## contingency table between sales script type and conversion
= pd.crosstab(df_dplyselect['sales_script_type'], df_dplyselect['conversion'])
conversion_script_type_contengency
print(conversion_script_type_contengency)
conversion convert not_convert
sales_script_type
script_A 3320 3237
script_B 1638 1692
script_C 60 53
### chi-square test of independence
= stat()
chi_square_result
=conversion_script_type_contengency)
chi_square_result.chisq(df
print(chi_square_result.summary)
Chi-squared test for independence
Test Df Chi-square P-value
-------------- ---- ------------ ---------
Pearson 2 2.23037 0.327855
Log-likelihood 2 2.23068 0.327804
With a p-value of 0.3279 being greater than 0.05 (5% significance level), it is suggested that there is no statistically significant relationship between type of sales script and conversion. Thus, we fail to reject the null hypothesis of independence based on available evidence.
This result has an added twist to it, which is that, there is the possibility of a change in result when more and better data is acquired.
A major research gap that remains is whether the differences in conversion between scripts as deduced from the bar plots is significant. This requires the need to test another hypothesis hence research objective 2.
In order to test this hypothesis, continuous variable is needed. For this, the total number of conversions on daily basis can analyzed and used as a continuous variable. The rational for estimating daily total conversions instead of monthly or yearly is to ensure that there are enough data points to make statistical inference.
The available data covers only a year and a couple of days, hence daily conversions is a logical time frame for estimating total conversion. By this, the dataset needs to be grouped based on days and the sum of conversions estimated for each sales script type. The question that arise is how are conversions to be counted? For this, ‘is_signed’ variable is used; where its boolean data type (true or false) are treated as integers with 1 counted as a single conversion and 0 as no conversion for each demo session. This allows for the total daily conversion to be estimated.
Another question is, which of the dates should be used as a basis for counting the total daily conversion? For this, the demo_appointment_datetime variable is a better option based on the context that attempt to convert the potential client is done during and after a demo session hence the assumption that the impact of the script used on conversion came into effect on the demo appointment day.
It is assume that the date used will be of less relevance for testing the hypothesis despite there could be changes in the total number of daily conversion for scripts when the date is change. But the impact is assumed to be negligible.
An assumption was made that all sales scripts had equal chance of being used during a demo session.
It is assumed that there were no cases of repeated sessions where scripts were used again as repetition of campaign could be a contributor to conversion and not the script used.
### select various columns and split the demo appointment column into date and time
= df[['demo_appointment_datetime', 'request_to_1streach_timelength_minutes_',
df_grp 'conversion', 'company_group', 'sales_script_type', 'is_signed']]
'demo_appointment_yyyymmdd', 'demo_appointment_hhmmss']] = df_grp.demo_appointment_datetime.str.split(' ', expand=True)
df_grp[[
## sum the total conversion for each script on each day
<string>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
<string>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
= (df_grp.groupby(['demo_appointment_yyyymmdd','sales_script_type'])['is_signed']
df_grp_sum sum().to_frame()
.={'is_signed': 'total_conversion'}).reset_index()
.rename(columns
)
df_grp_sum.head()
demo_appointment_yyyymmdd sales_script_type total_conversion
0 2019-12-28 script_A 0
1 2019-12-28 script_B 2
2 2019-12-29 script_A 0
3 2019-12-29 script_B 1
4 2019-12-30 script_B 1
First, the distribution of total conversion based on sales script is visualized using boxplot which shows the mean conversion for sales scripts as well as the minimum, second quartile, third quartile and maximum.
= df_grp_sum, mapping = aes(x = 'sales_script_type', y = 'total_conversion'))
(ggplot(data + geom_boxplot() +
'Total conversion by script type (based on conversion counts on demo day)')) ggtitle(
<ggplot: (8794144187769)>
From the boxplot, some points which are outside the normal range of the plot can be regarded as outliers. However, they are not remove for now. Moreover the difference in conversion between scripts is clearer with the boxplot. It is deduced that script C has the lowest average conversion while script A has the highest.
Before deciding on the statistical method to use to answer the business questions, there is the need to verify that statistical assumptions hold true for the data. For valid statistical inference to be made about our target market based on our current clients data, it is very important that appropriate methods are used. The appropriateness of the method is not a guesswork but one informed by both visual and statistical test.
Generally, there are two family of statistical techniques that can be employed in analyzing the data. Namely, Parametric and non-parametric methods. The parametric methods usually have a greater statistical power but this comes at the cost of requiring the data to meet a number of assumptions for proper conclusions to be reach. Given that, our aim is to present the possible best solution, the assumptions for using a parametric method are tested first before considering their use or otherwise.
To use a parametric method, normality of data distribution is assumed. A simple approach to verifying this is with the aid of histogram as shown below.
'total_conversion'], bins=20, kde=True, color='g') sns.histplot(df_grp_sum[
The histogram plotted above does not necessarily answer the question on normality assumption. Nonetheless, it is one of several visualization techniques that gives clues.
From visual inspection, the distribution is right-tailed hence a right-skewed or positive-skewed distribution is noted for the data.
In order to gain a better view on the normality of the distribution, a Q-Q plot is use to compare actual total conversion to an expected value in a normal distribution. The yardstick for detecting normality will be to verify that the actual data distribution are linearly arranged along a straight 45 degrees diagonal line. The result is illustrated below.
### QQ-PLOT
'total_conversion'], line = '45') sm.qqplot(df_grp_sum[
'Q-Q plot for total conversion') plt.title(
plt.show()
From the Q-Q plot, it is deduced that the dataset deviates from the line of expected normal distribution hence heavily skewed. While the visualization so far points to the direction of a non-normal distribution, there is still the opportunity to cross-check these suggestions with some statistical test for normality.
The visual inspections are supported with Shapiro-Wilk test to test the hypothesis that the distribution of data is not different from an expected normal distribution.
#### Shapiro-Wilk test
= stats.shapiro(df_grp_sum['total_conversion'])
w, pvalue print(w, pvalue)
0.9258283376693726 2.6865092770727672e-20
This is another approach is to assessing normality using the pingouin module.
='sales_script_type', dv='total_conversion') pg.normality(df_grp_sum, group
W pval normal
script_A 0.977833 8.850418e-06 False
script_B 0.960671 1.574002e-08 False
script_C 0.745857 1.854477e-11 False
The p-value of the test is less than 0.05 (5% significance level) which suggests a statistically significant difference from a normal distribution. Thus, the null hypothesis is rejected. This is not one of the relevant hypothesis to be tested for the business goal but its provides an important clue as to the right statistical method to adopt for assessing business relevant hypothesis.
With the data failing to meet the assumption of normal distribution required for the adoption of a parametric method, the compass of the analysis is gearing towards a non-parametric method. But before ascertaining that, there is the need to test for other assumptions required such as homogeneity which stipulates that variance between categories to be analyzed should be equal across the data distribution for methods such ANOVA to be properly use.
The first approach being used to test for equal variance in the distribution is the Bartlett’s test. Given that the distribution is non-normal, Barlett’s test is supported with Levene’s test which is a much more robust test when the data is not a normal distribution.
### bartlett's test for homogeneity
= stat()
script_bartlett = df_grp_sum, res_var = 'total_conversion', xfac_var = 'sales_script_type')
script_bartlett.bartlett(df script_bartlett.bartlett_summary
Parameter Value
0 Test statistics (T) 425.8573
1 Degrees of freedom (Df) 2.0000
2 p value 0.0000
The bartlett test result shows a p-value that is statistically significant hence the hypothesis that variation in the distribution is equal is rejected. This points in the direction of a non-parametric approach for the analysis but before then the result needs to be verified with a Levene’s test.
Levene’s test is employed as a final approach in this context to verify whether not the variance in the data is equal.
## levene test
= stat()
script_lev = df_grp_sum, res_var='total_conversion', xfac_var = 'sales_script_type')
script_lev.levene(df script_lev.levene_summary
Parameter Value
0 Test statistics (W) 119.2498
1 Degrees of freedom (Df) 2.0000
2 p value 0.0000
### Usingthe pingouin module for the analysis
='sales_script_type', dv='total_conversion') pg.homoscedasticity(df_grp_sum, group
W pval equal_var
levene 119.249752 1.772745e-46 False
The Levene’s test depict an unequal variance with a statistically significant p-value < 0.05 (at 5% significant level) hence confirming Bartlett’s test.
With visual inspections and all statistical tests indicating the data has a non-normal distribution and variance of values are unequal, the statistical approach adopted for testing the various business relevant hypothesis devised is a non-parametric one.
After finally deciding on the type of statistical approach to adopt, a selection is made among several non-parametric methods. For this, consideration is given to the number of categories or groupings in the independent variable. Three types of sales scripts were identified in the datset hence the decision to use Kruskal-Wallis test for the analysis of the second objective.
=df_grp_sum, dv='total_conversion', between='sales_script_type') pg.kruskal(data
Source ddof1 H p-unc
Kruskal sales_script_type 2 338.713092 2.814406e-74
The p-value from Kruskal-Wallis test is statistically significant at 5% significant level which suggests the null hypothesis that conversion is equal among the different scripts is rejected.
The objective 2 seeks to test the hypothesis that all sales scripts produce equal conversions without statistically significant difference. The conclusion drawn is to reject this hypothesis. This can be seen as a breakthrough from the exploratory data analysis yet this result equally raise another inquiry. Though, it is now clear that at least one sales script seems to do better than another for conversion, the analysis does not indicate which sales scripts has a higher conversion and how they differ. This calls for further analysis in the form of post-hoc test.
With the suggestion that different sales script do not produce equal conversions, it is recommended that, a deliberate effort is made to choose the script with higher rate of conversion. While this recommendation indicates there is opportunity to prioritize sales scripts for higher conversion, it has not specifically suggested which sales script in particular should be used. To provide such a data driven recommendation, further analysis is required to determine which script should be used to maximize conversion.
Post-hoc test is used to satisfy this objective. It provides a better insight through a pairwise comparison of the significant level of differences in conversion between scripts and further identify which scripts are associated with higher conversion.
### Post-hoc test with Dunn
= sp.posthoc_dunn(a = df_grp_sum, val_col = 'total_conversion',
dunnposthoc_result ='sales_script_type', p_adjust='bonferroni'
group_col
)
dunnposthoc_result
script_A script_B script_C
script_A 1.000000e+00 9.195689e-30 5.152029e-66
script_B 9.195689e-30 1.000000e+00 3.801421e-23
script_C 5.152029e-66 3.801421e-23 1.000000e+00
The Post-hoc test based on Dunn test indicates that there are significant differences in conversion between all the sales scripts and not just at least one. The p-value for the pairwise difference between all the scripts are statistically significant. The result shown above only captures the p-values and in order to discern the difference better, another module is used for the estimation.
## Post-hoc test with means of script among other feature output
= pg.pairwise_ttests(dv='total_conversion', between='sales_script_type',
posthoc_result =df_grp_sum, parametric = 'false', alpha = 0.05,
data= 'bonf', return_desc = 'true').round(3) padjust
/Users/lin/Documents/python_venvs/pytorch_deepL/lib/python3.9/site-packages/pingouin/pairwise.py:27: UserWarning: pairwise_ttests is deprecated, use pairwise_tests instead.
posthoc_result
Contrast A B ... p-adjust BF10 hedges
0 sales_script_type script_A script_B ... bonf 1.127e+41 1.055
1 sales_script_type script_A script_C ... bonf 9.194e+113 1.784
2 sales_script_type script_B script_C ... bonf 6.315e+94 1.745
[3 rows x 17 columns]
From the Post-hoc test results, the mean of script B is 4.333, mean of script A is 8.342 and the mean of script c is 0.638. The results suggests sales script A has a significantly higher conversion than sales script B and C. In a similar manner, sales script B has a significantly higher conversion than script C in statistical terms.
The third objective seeks to assess whether difference in conversion is statistically significant and identify which sales script has a significantly higher conversion. The results based on available data suggests that on a given demo day, the number of conversion that sales script A will produce is on the average, significantly higher than the other script types hence should be prioritize to increase conversion.
The sales team should consider prioritizing script A as the best bet to attain a higher conversion rate.
The fourth objective of the analysis is to move a step further with focus on predicting how various factors influence probability of a client to sign up for the product. In order to achieve the goal of higher amount of conversion, as assessment is undertaken to model whether or not a client will sign-up. Knowing how these factors influence a client’s decision to purchase our service will enable us better target clients by managing such predictors well. For this, a review of the various variables available has been undertaken to identify potential predictors.
The business problem at hand can be translated into an analytical one that requires a classification solution. In this case, the aim is to classify clients into two groups of convert and non-convert. This is the outcome we seek to predict hence the response or dependent variable. The response variable has two categories or classes as indicated, conversion and non-conversion, hence a binary classification method is chosen.
Logistic regression will be used for assessing how various factors influence the probability of a client to convert.
Based on the available limited data and understanding of the business operations, the potential predictors identified in the data for the analysis are the following
Type of sales script
Company type
The relationship between the type of sales script and conversion has been our main focus for earlier analysis and yet it is still important to understand how they influence the actual decision to convert and not the total number of conversion. It can be presume, as it is the case to hypothesize, that there is a logical understanding of type of sales script being a potential influential determinant.
Company type as a potential predictor is based on the assumption that the characteristics of the company will determine its decision to finally purchase our product. Given that, these features are not disclosed, the difference in labeling of company type is taken to be directly related to difference in company characteristics.
Modeling requires data preprocessing and the type of preprocessing undertaken is dependent on the data hence highlighted for transparency and reproducibility of analysis.
First of all, the company type variable has two categories being 0 and 1, and several missing data. It is possible the missing data is an instance of data that was not provided due to some privacy concerns. Moreover, as much as 4,916 missing data is identified which constitutes a sizable amount to eliminate from the analysis and very likely to reduce the possibility of capturing as much variations as the reality is on the ground. Supported by this, is the fact that, such a large amount of missing data is enough to constitute a category in its own right for analysis and insight. Based on these available information and context, the decision was made to replace all instances of missing data for company type with ‘Others’ as another company type. This led to reclassification of company type into company group A, B and Others corresponding to 0, 1 and missing data respectively. The act of reclassifying all missing data into one group could in itself undermine the performance of the model given that companies without similar characteristics could end up in same group and not providing the needed discrimination in predicting the outcome classes. Given that the focus is not on model performance but interpretability and insights, taking extra steps for model performance is of least importance
### make dummy variables for categorical variables with more than 2 classes
= pd.get_dummies(df_dplyselect, prefix_sep='_', drop_first=True)
logdata_dummy
logdata_dummy
is_signed ... company_group_company_B
0 0 ... 1
1 0 ... 0
2 0 ... 0
3 0 ... 0
4 0 ... 0
... ... ... ...
9995 0 ... 1
9996 1 ... 1
9997 1 ... 0
9998 0 ... 1
9999 1 ... 0
[10000 rows x 7 columns]
## Select columns for dependent and independent varible
= logdata_dummy['is_signed']
y
= logdata_dummy.drop(columns=['conversion_not_convert', 'is_signed', 'request_to_1streach_timelength_minutes_'])
X ## Split data into training and validation dataset
= train_test_split(X, y, test_size=0.4, random_state=1)
train_X, valid_X, train_y, valid_y ## Fit logistic regression on training dataset
= pg.logistic_regression(train_X, train_y)
reg_result
reg_result
names coef ... CI[2.5%] CI[97.5%]
0 Intercept -0.026989 ... -0.107403 0.053425
1 sales_script_type_script_B -0.049030 ... -0.157495 0.059436
2 sales_script_type_script_C 0.243650 ... -0.223056 0.710357
3 company_group_company_A 0.045234 ... -0.087648 0.178115
4 company_group_company_B 0.103170 ... -0.013828 0.220169
[5 rows x 7 columns]
The logistic regression for predicting the probability that a client will sign-up produced a non-significant p-value for all cases. Script A was used as reference point for script B and C. Thus, with a co-efficient of -0.049 for script B, it is suggested that using script B results in a decrease in the log odds of signing a client by 0.049 in comparison to script A. This is however statistically insignificant. Script C has a co-efficient of 0.24 which suggests that when a sales manager uses script C, the log odds of signing a client increases by 0.24 compared to script A which is statistically insignificant.
With regards to prioritizing companies to reach out to in order to increase conversion, when sales managers take a session with company A or company B, the log odds of landing a conversion increases compared to when they target companies categorized as ‘Others’ in this analysis. The result is however statistically insignificant.
From the analysis undertaken, it is recommended that the sales team consider not prioritizing demo sessions with company type ‘Others’. This should however not be treated as a de facto discrimination favouring company A and B to increase conversion given that the results are statistically insignificant.
Some potential predictors that could be included in the model are available in the dataset but the data format or how it is organized is not suitable for further transformation and inclusion in the model. This is highlighted with the suggestion of how to organize such data to lay foundation for a cross-functional and collaborative teamwork with other departments that generate data to be analyze.
source* - marketing source of the lead. This variable is very important in order to gain insight into the effectiveness of marketing strategy in producing conversion. Currently, the variable has 28 unique categories which requires that they are classified into logical groups of similarity, for instance organic source, referral, direct or indirect source, in order to be used as a categorical predictor. Given that actual names of the source have been hashed, such exercise is currently not possible hence the exclusion of this potential predictor. Thus, it is recommended that the actual names of the marketing source be provided to consider the possibility of using them as potential predictors.
sales_group_name* The type of sales team is likely to be a potential predictor influencing conversion. This variable has 22 categories which needs to be grouped in order to be used a categorical predictor. To do this, there is the need to provide further description about the various groups in order to classify them into logical groups for predicting conversion.
This analysis aims to demonstrate how to provide data-driven evidence to support decision making by the product team with the aim of increasing conversion.
A major tool that the sales team use to aid conversion is sales script of which there are different variant. Thus, the main focus of the analysis is to assess the role of sales script as a driver of conversion among others. To achieve this, the business problem was conceptualized and translated into several hypothesis.
Among others, the analysis sought to test the hypothesis that there is statistically significant relationship between type of sales script and conversion and that there is statistically significant difference in the total conversion made using different script. Further, the analysis sought to assess how sales script and company type influences the likelihood of a company signing up during and after a demo session.
It is recommended that sales script A be prioritized when deciding on the type of sales script to use for a demo. This is informed by the fact that sales script A had higher average statistically significant conversion compared to the other scripts. It should be noted that even though the multiple logistic regression suggested that the odds of signing up slightly increases in favour of script C, this result was not statistically significant. Moreover, other variables such as company type may have contributed as a confounding factor
The analysis has underscored that while it is possible to derive insights from the available data, there is the need to include other potential predictors in the data collected. This is informed by the fact that, the model developed does not adequately explain variation in conversion.