Research on E-commerce User Churn Prediction Based on Logistic Regression

Qiu Yanfang
School of Information Management
Beijing Information Science and Technology University
Beijing, China
366043744@qq.com

Li Chen
School of Information Management
Beijing Information Science and Technology University
Beijing, China
lichen@bistu.edu.com
978-1-5090-6414-4/17/$31.00 ©2017 IEEE

Abstract: With the development and popularization of Internet technology, e-commerce platforms have provided satisfying products for customers and cultivated customer loyalty. Nevertheless, user loss remains a prominent issue in both business and academia. Based on the logistic regression model, this paper establishes an e-commerce user churn prediction model through preliminary research on e-commerce customer churn behavior. Using the factor analysis method, the user's online duration, number of logins, attention to shops, and other user behavior factors are analyzed to identify the factors affecting user loss. Finally, an empirical study shows that the proposed EBURM model can predict user churn behavior with a high confidence level.

Keywords: e-commerce; logistic regression model; user behavior; user retention rates

I. INTRODUCTION
According to research data released by CNNIC, the number of e-commerce users in China had reached 467 million as of December 2016. With the rapid growth of e-commerce users, predicting the possibility of user churn in advance has become an urgent problem for e-commerce platforms. The factors affecting user retention include the user's attention to shops, the browsing rate of recommended information, the demand for the sharing function, the number of times online per day, and the daily online duration. These are important factors in e-commerce platform user churn and can influence prediction accuracy to a large extent.

The problem of e-commerce customer churn prediction has its own particularity. The e-commerce platform cannot accurately judge whether a user is really lost, which greatly increases the difficulty and complexity of prediction. Currently, the algorithms applied to user churn prediction include decision trees, artificial neural networks, the logistic regression model, the K-Means algorithm, naive Bayes, and so on. In 2011, Yucheng Zhang proposed a Markov model to predict user churn, whose disadvantages were low accuracy, low prediction coverage, and high storage complexity [1]. In 2016, Yang Tao Liu used an embedded vector and recurrent neural network method to conduct the research; however, they failed to keep the model at a stable level [2]. Sun et al. chose the SVM model when establishing a bank credit card user churn prediction model [3].

The prediction of user churn on an e-commerce platform is a classical binary classification problem [4]. The prediction result is the probability of a user being retained or lost, rather than a direct classification of user behavior. As a common statistical analysis method used for classification, logistic regression produces probabilistic prediction results and is therefore applicable to predicting user churn behavior on e-commerce platforms.

II. ESTABLISHMENT OF EBURM MODEL BASED ON LOGISTIC REGRESSION

A. Logistic regression
The prediction of user churn in e-commerce is clearly a two-class classification problem. Logistic regression is a commonly used statistical analysis method that can be used for classification and yields prediction results in the form of probabilities; it belongs to the family of probabilistic nonlinear regression [5]. Let the conditional probability P(z) = p be the probability that an event occurs for a given observation; then the logistic regression model can be expressed as:

$$p(z) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} \tag{1}$$

Since the predicted outcome is one of two possibilities, yes or no, and the value of p(z) lies in the range [0,1], this value can be used to estimate the probability that the outcome variable equals 1.

The parameters are estimated by maximum likelihood estimation. The basic idea of this method is: when n groups of sample observations are randomly drawn from the population, the most reasonable parameter estimates are those that maximize the probability of drawing these n groups of sample observations. This is an iterative algorithm: it takes an estimate as the initial value of the parameters, and the algorithm then determines the direction and size of the change of the
parameters that can increase the log-likelihood value. After the initial function is estimated, the residuals are tested and the parameters are re-estimated with the updated function, until the log-likelihood value no longer changes significantly [6]. Because the solution procedure is complex, it is not detailed here; in practice it is usually computed with the SPSS software. Finally, the logistic regression prediction model is established by substituting the obtained parameters into equation (1).

B. EBURM model building
Because e-commerce users fall into two categories, active users and churn users, define $y_n$ as the category of an e-commerce user in the sample data: $y_n = 1$ represents an active user and $y_n = 0$ represents a churn user. The retention rate R is defined as a real number in the range [0,1] and is used to indicate whether the user $y_n$ of the e-commerce platform is likely to be lost: the greater the value, the greater the likelihood that the user will remain on the e-commerce platform [6]. Taking $x = (x_1, x_2, x_3, \ldots, x_n)$ as the independent variables, that is, the behavior indices of user $y_n$, the logistic regression model can be used to calculate the retention rate of e-commerce users. The formula of R is:

$$R = P(y_n = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n)}} \tag{2}$$

The model above is trained on the sample data. The estimated value of each parameter can be obtained with the maximum likelihood method or with the SPSS software, which gives the final e-commerce user retention model, EBURM (Electronic Business User Retention Model).

C. Extraction of Characteristic Factors
In this paper, we analyze the user behavior of e-commerce, combine and transform the original features of e-commerce users through reasonable logical induction, and extract the following characteristics as the factors of the model. A specific formula is given for the value of each variable.

• The user's interest rate for e-shops
The user's interest rate for shops is denoted in this paper by $\beta_1$. It refers to the degree of attention that the user pays to e-commerce shops, and can be measured by the number of times the user clicks into the shops of interest and by the user's payments to the particular shops concerned. That is, the more often the user clicks into a shop and successfully places and pays for orders, and the more comments the user leaves on the shops concerned, the higher the user's interest rate for shops.

$$\beta_1 = \frac{\text{number of times a user has successfully paid a shop}}{\text{number of users paid}} \times \frac{\text{number of shops purchased by user}}{\text{number of users concerned}} \times \alpha \tag{3}$$

Lost users have the following characteristics: they follow fewer shops and place fewer orders, or basically no orders. Retained users, by contrast, place more orders. A retained user is loyal to certain shops on the e-commerce platform, so the user's focus on shops can be represented by $\alpha = \frac{1}{1 + e^{-\sigma}}$, with $\alpha$ in the range [0,1], where $\sigma$ is the dispersion between the shops and the number of orders. According to the standard deviation formula,

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

where $x_i$ represents the number of orders placed in shop $i$, and $i$ runs over the shops at which the user has paid. Therefore, the meaning of $\alpha$ can be stated as: the more attention the user pays to shops and the greater the dispersion of orders, the more likely the user is to be a retained user and the closer $\alpha$ is to 1.

• Recommended CTR
In this article, the user's attention to recommendation information is denoted by $\beta_2$. The e-commerce platform now recommends a variety of information to users, and the recommended information generally matches the user's personalized needs. The more times a user clicks on recommended information, the better the e-commerce platform understands the user and the user's preferences. Therefore, the higher the attention rate for recommendations, the more likely the user is to be retained.

$$\beta_2 = \frac{\text{times people view recommendations}}{\text{user views}} \tag{4}$$

• Share rate
$\beta_3$ denotes the sharing rate. Each time a user shares from the e-commerce platform to a third-party platform, it indicates that the platform's products or activities are liked by the user and that the user is promoting our e-commerce platform, so the possibility of retention is higher. The formula is as follows:

$$\beta_3 = \frac{\text{click to share button times}}{\text{users per click button times}} \tag{5}$$

• Number of daily logins
Here, the daily number of logins is denoted by $\beta_4$. One factor in whether users may be lost is how often the platform is used: if the number of uses is low, the long-term retention rate is low. Therefore, the user's daily login count is also an important characteristic factor to analyze. The formula is as follows:

$$\beta_4 = \frac{\text{login times}}{\text{all users' times}} \tag{6}$$

• User churn time
$\beta_5$ is used to indicate the length of churn. The standard for determining whether a user is really lost is
the user's login frequency on this e-commerce platform within 3 months. If the frequency of reuse shows a linear downward trend from one week after registration, and there are basically no logins after three months, the platform regards this user as lost. According to data from one e-commerce channel, shown in Figure 1 below, it takes at least three months for a significant loss of users to appear in a channel. The data in the figure also show the loss situation within one week: user churn follows a 7-day weekly cycle.

Figure 1. E-commerce Platform 3-Month User Data Volume

• Constant
The data selected in this paper consist of 6000 records extracted from an e-commerce platform, of which 3211 are retained users and 2789 are lost users. The formula for the constant $\beta_0$ is thus given as follows:

$$\beta_0 = \log(p) = \ln\frac{2789/6000}{1 - 2789/6000} = -0.1409 \tag{7}$$

III. EMPIRICAL RESEARCH

A. Data collection
With the sample data, the model can be trained and the maximum likelihood method used to obtain the estimated value of each parameter of the model, yielding the final EBURM for e-commerce users. The concrete process is as follows:

• The Classification Process
Step 1 quantifies the behavior of the test user to form the model's independent variable values.
Step 2 uses the EBURM model to calculate the probability that the user is a retained user, that is, the user's retention rate R.
Step 3 determines the e-commerce user category according to the set classification standard value (threshold value).

• Data Processing
The data used in this study are real user data from an e-commerce platform: 3 months of user data were collected from one channel of the platform, and the information of 6000 users was extracted. In the 3 months of data, 3211 active retained users and 2789 lost users were identified.

B. Parameter estimation and significance test
In this paper, the binary logistic regression analysis module in the SPSS software is used to train the model, and the parameters of the model are estimated and tested. The dependent variable and covariates are set, the classification standard value is set to 0.5, and the other settings keep their default values. The calculation results are shown in Table I below.

TABLE I. THE FACTOR COEFFICIENTS IN THE EQUATION

       B        S.E.    Wald     df   Sig.    Exp(B)
β1     40.095   4.892   40.262   1    0.000   70.971
β2     15.125   2.337   12.236   1    0.002   20.112
β3     3.142    0.969   26.107   1    0.006   8.861
β4     3.425    1.847   27.136   1    0.000   8.326
β5     6.21     0.137   4.326    1    0.001   0.129
β0     -0.141   0.347   9.763    1    0.001   0.310

From Table I, we can see that the sig. values of the user's attention to shops, the attention rate for recommended information, the number of logins, and the online duration are less than the critical value of 0.05, while the sig. value of the user share rate is more than 0.05 and is therefore not related to the model, so the following model no longer uses the share rate as an independent variable. On the whole, the model is feasible and four indicators are usable, so the final model uses 4 variables. The B value is the coefficient of each factor, and the parameter estimates of the model can be obtained from Table I. The resulting formula (8) is:

$$R = \frac{1}{1 + e^{-(-0.141 + 42.095 x_1 + 7.125 x_2 + 11.425 x_4 + 6.21 x_5)}} \tag{8}$$

The estimated parameters of the factors are all positive, indicating that the degree of attention to shops, the attention to recommendation information, the number of logins, and the online duration are positively related to the retention rate: the greater their values, the higher the retention rate and the less likely the user is to be lost. Because their sig. values are all within the critical value, the variables are significant, so the usability of the model is very high.

C. Result analysis
Predicting the retention rate of e-commerce users amounts to building a classifier that assigns each e-commerce user to a category. To evaluate the performance of a classifier, the following evaluation metrics are commonly used:
• Accuracy: the ratio of the number of correctly predicted samples to the total number of samples
• Precision: the ratio of the number of correctly predicted samples to the total number of samples predicted to be in that state
• Recall (the full rate): the ratio of the number of correctly predicted samples to the actual number of samples in that state
• Rate of omission: the ratio of the number of wrongly predicted samples to the total number of samples

Ten user records were extracted from the test set, including five retained users and five lost users, and predicted according to the formula. The result is that two of the predictions for lost users are wrong. One user made purchases, but the online duration and the number of times online are relatively small, which shows that some occasional users are also lost. The other has no purchase behavior, and the values for online duration and browsing of recommended information are relatively low, indicating that the activity may have been an accidental click and that this is not a user who really needs the platform; this user is also lost.

TABLE II. CALCULATION RESULTS OF RETENTION RATE R

user category   x   x   x   x   R   prediction
Retained
Retained
Retained
Retained
Retained
churn
churn
churn
churn
churn

User data from one channel of an e-commerce platform were selected, and the behavior information of 6000 users was obtained after data processing. The classification standard value was set to 0.5: a retention rate R of 0.5 or more indicates a retained user, whereas R less than 0.5 indicates a lost user. With this setting, the statistics from the model are given in Table III and Table IV.

TABLE III. EBURM MODEL PREDICTION RESULTS

                         Predicted retained users   Predicted churn users   Predictive accuracy
Actual retained users    3084                       127                     96.03%
Actual churn users       257                        2532                    90.78%
Total accuracy           -                          -                       93.6%

TABLE IV. EVALUATION INDICATORS

(T)      (P)       (R)       (M)
93.6%    96.03%    95.22%    6.4%

a. (Accuracy: T; Precision: P; Recall: R; Rate of omission: M)

From the results predicted in Table III, the test set has a total of 3211 retained users, of whom 3084 were judged to be retained, a prediction accuracy rate of 93.6%; a total of 2789 users were lost, of whom 2532 were judged to be lost, a prediction accuracy rate of 95.22%. This indicates that the accuracy of the model is very high in predicting the retention and loss of users. The accuracy of the whole model is 93.6%, and the accuracy of the model's prediction of user churn is the highest, indicating that the model is reasonably usable.

Figure 2. Comparison of the actual data and the predicted data obtained from the model

Figure 2 is obtained by comparing the data predicted by the model with the original data. The predicted data are basically close to the actual data, with an error of about 7%. The AUC values also show that the correct rate of the EBURM model established by logistic regression is relatively high on these data, so we can conclude that the EBURM model predicts user loss behavior in e-commerce with relatively high accuracy and that its usability is relatively high.

IV. SUMMARY
The EBURM model is evaluated by the AUC test method. The results show that the EBURM model is consistent with actual expectations for active and churn users. Based on the different factors influencing the user retention rate, the EBURM model provides a personalized operational recommendation strategy. Compared with the method of user type prediction, this model can predict user behavior more accurately and so reduce user churn. Constructing the EBURM model to predict e-commerce user churn behavior helps e-commerce platforms formulate operational strategy more precisely, provide users with personalized recommendations, increase user activity, retain users, and improve the economic results of e-commerce platforms.

ACKNOWLEDGMENT
This paper is supported by the National Natural Science Foundation of China (No. 61272513), the study of collaborative supply chain data integration under emergency status based on Multi-Agent.
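As a rough illustration of Sections II and III, the following Python sketch evaluates the logistic function of equation (1), the shop-focus term α, the log-odds form used in equation (7), the retention rate of formula (8) with the 0.5 classification standard value, and the evaluation indicators recomputed from the counts in Table III. It is a minimal sketch and not the SPSS procedure used in the paper. The function names, the feature values of the example user, the choice of "retained" as the positive class, and the use of the population standard deviation for σ are all assumptions made here for illustration.

```python
import math
import statistics

# Coefficients exactly as printed in formula (8); the share rate x3 is excluded there.
INTERCEPT = -0.141
COEFFS = {"x1": 42.095, "x2": 7.125, "x4": 11.425, "x5": 6.21}
THRESHOLD = 0.5  # classification standard value from Section III


def sigmoid(z):
    """Logistic function of equation (1), arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def shop_focus_alpha(orders_per_shop):
    """alpha = 1 / (1 + e^(-sigma)); sigma is assumed to be the population
    standard deviation of the number of orders per shop."""
    sigma = statistics.pstdev(orders_per_shop)
    return sigmoid(sigma)


def log_odds(count, total):
    """ln(p / (1 - p)) with p = count / total, the form used in equation (7)."""
    p = count / total
    return math.log(p / (1.0 - p))


def retention_rate(features):
    """Retention rate R of formula (8) for a dict of feature values."""
    z = INTERCEPT + sum(c * features[name] for name, c in COEFFS.items())
    return sigmoid(z)


def classify(features):
    """Steps 2 and 3 of the classification process: compute R, then threshold it."""
    return "retained" if retention_rate(features) >= THRESHOLD else "churn"


def indicators(tp, fn, fp, tn):
    """Evaluation indicators with 'retained' assumed to be the positive class."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "rate_of_omission": (fn + fp) / total,  # wrongly predicted / total
    }


# Hypothetical feature values, made up for illustration only.
user = {"x1": 0.02, "x2": 0.01, "x4": 0.005, "x5": 0.0}
print(classify(user), round(retention_rate(user), 3))

# Counts taken from Table III.
for name, value in indicators(tp=3084, fn=127, fp=257, tn=2532).items():
    print(f"{name}: {value:.4f}")
```

With the Table III counts, this sketch gives an accuracy of 0.936 and a rate of omission of 0.064, which agree with the T and M entries of Table IV. The precision and recall it prints depend on which class is treated as positive, so they need not coincide with the P and R entries reported there.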
REFERENCES
[1] Zhang Yucheng, xu big grain, Wang Xiaojuan. Active user behavior based on weighted markov chain prediction model [J]. Computer engineering and design, 2011, (10): 3334-3337 + 3418.
[2] Liu Yangtao, south slope, Yang Xinfeng. Based on embedded vector and circulation of the neural network user behavior prediction method [J]. Journal of modern electronic technology, 2016, (23): 165-169.
[3] Li Shibo, Sun Baohong, Wilcox R T. Cross-selling sequentially ordered products: An application to consumer banking [J]. Journal of Marketing Research, 2005, 42(2): 233-239.
[4] Tang Xing Quan Yi ning, Song Jianfeng, Michael Dunn e, Zhu Hai, MiaoQi widely. Weibo forward personalized prediction of the new algorithm [J]. Journal of xi 'an university of electronic science and technology, 2016, (4): 62-56 + 51.
[5] Musa A B. Comparative study on classification performance between support vector machine and logistic regression [J]. International Journal of Machine Learning and Cybernetics, 2013, 4(1): 13-24.
[6] Chang Zhenhai, Liu Wei. Logistic regression model and its application [J]. Journal of Yanbian University (Natural Science Edition), 2012, (01): 28-32.
[7] Gupta A, Kumaraguru P. Credibility ranking of tweets during high impact events [C] // Proceedings of the 1st Workshop on Privacy and Security in Online Social Media. New York: ACM, 2012.