Big Data in Telecommunication Oper
2017
DOI: 10.1007/s41650-017-0010-1
© Posts & Telecom Press and Springer Singapore 2017
Research paper
Abstract: In the age of information explosion, big data has brought challenges but also great opportunities
that support a wide range of applications for people in all walks of life. Faced with the continuous and intense
competition from OTT service providers, traditional telecommunications service providers have been forced
to undergo enterprise transformation. Fortunately, these providers have natural and unique advantages in
terms of both data sources and data scale, all of which give them a competitive advantage. Multiple foreign
mainstream telecom operators have already applied big data for their own growth, from internal business to
external applications. Armed with big data, domestic telecom companies are also innovating business models.
This paper will introduce three aspects of big data in the telecommunications industry. First, the unique
characteristics and advantages of communications industry big data are discussed. Second, the development
of the big data platform architecture is introduced in detail, which incorporates five crucial sub-systems.
We highlight the data collection and data processing systems. Finally, three internal or external application
areas based on big data analysis are discussed, namely basic business, network construction, and intelligent
tracing. Our work sheds light on how to deal with big data for telecommunications enterprise development.
Keywords: telecommunication operator, enterprise transformation, big data, platform architecture, practical applications
-----------------------------------------------------------------------------------------------------
Citation: Z. Wang, G. F. Wei, Y. L. Zhan, et al. Big data in telecommunication operators: data, platform
and practices [J]. Journal of communications and information networks, 2017, 2(3): 78-91.
-----------------------------------------------------------------------------------------------------
carry out precision marketing, which has yielded profitable results. In 2014, Alibaba launched the “DMP”, which enabled businesses to implement different marketing strategies for different people based on the analysis of user information obtained through this product. Applications of big data in other fields include tracking movie box office receipts[6], healthcare systems[7,8], customer surveys[9], and user characteristics analysis[10]. All these big data applications are gradually transforming the way we live, work, study, etc.
Faced with continuous and intense competition from OTT service providers, traditional telecommunications service providers must undergo enterprise transformation. Fortunately, these operators have access to rich data sources and huge datasets, which other industries do not have. Large numbers of customers generate loads of behavioral data every second of the day, including calling, messaging, networking, and other kinds of information. Even when the customer is inactive, location-based data will be generated. Moreover, combined with registration and business information, customer billing data can be obtained.
Consequently, the vast amounts of data that operators have can potentially outpace the ability of existing CDR-based processing to improve our daily lives[11]. Telecommunications data can be used to optimize operations and drive operational business intelligence to realize immediate business opportunities[12]. Multiple foreign mainstream telecom operators have already applied big data for their own development. Orange Business Services, for instance, used big data to enhance the accuracy of its churn detection. Spain’s Telefonica Dynamic Insights obtained reliable predictions of user behavior by packaging and analyzing data. In 2014, Verizon built data centers in California to implement precision marketing[13]. Domestic operators are also innovating their business models by exploring the use of big data.
This paper provides detailed discussions of three aspects of big data in the telecommunications industry. Section 2 discusses the sources and unique advantages of communications industry big data compared to other industries. Section 3 introduces the framework of the industry’s big data platforms in detail, from collection systems to storage systems to application systems. Section 4 details three internal and external applications based on big data. Finally, this paper ends with a summary and directions for future research.
2 Data sources and advantages
In this section, the major sources of wireless big data and the advantages of operators are introduced.
2.1 Data sources
As providers of basic network services, telecom operators aim to provide an information channel between people and equipment, and between different types of equipment[14]. Operators themselves are the producers of big data. Data generated in a communications network is the main source of Internet big data.
Communications data is mainly derived from the following three sources[15].
• Data in IT system: user attributes, business consumption information, and terminal information data collected from CRM, billing systems, and terminal self-registration platforms, respectively. Basic user profiles and characteristics can be described in accordance with these data.
• Data in access network and core network: mobile signaling, DPI, M2M data, etc. These data accumulate in wired/wireless networks whenever clients use voice, SMS, or networking services. The underlying structure of the data is complex, hence targeted analysis and processes are needed for different types of data to achieve scenario-based descriptions of user locations and preferences.
• Data in operators’ own Internet applications: online business hall data, palm business office data, wing payment data, etc. All data, including user access modes, addresses, times, business preferences, and investment and consumption habits, are completely
80 Journal of Communications and Information Networks
stored in the background of the application, which can be obtained directly.
In terms of “Volume”, hundreds of millions of users’ behavioral data are already in the terabyte or even petabyte range. In terms of “Variety”, communications data covers all businesses, customers, and channels, as well as Internet data, human attributes, position trajectories, and terminal information. In terms of “Velocity”, the quality of communication services should meet the real-time requirements of various applications.
2.2 Advantages
In China, the three operators have the largest number of users of any industry, i.e., approximately 1.3 billion mobile users and 300 million fixed broadband users[16]. The massive number of users combined with their own industry policies provides the following advantages to operators.
• Authentic user information. Owing to the existing real name system, non-real name users have limited services and are required by law to register. This not only ensures the authenticity of user information, but also guarantees that the data has one-to-one correspondence with a real person.
• Comprehensive and intact information. Unlike Internet companies that can only interact with users through their own App business, operators can access all behavioral data on users in the network all the time, such as when and where they used the service, terminal type, website accessed, products searched, hot topic interests, etc. With enough storage and computing power, we can efficiently and completely uncover all these behaviors. Moreover, with the availability of authentic user information, complete and accurate descriptions of user profiles and features can be obtained.
• Identifiable and relatable data. User identifications in the operator system include the mobile phone number, ID card number, terminal ID, cookies, and many other types of information. These data can be related to financial, Internet, hotel, transportation usage, and other business-related data to break down the isolated data and develop a real sense of the big data cloud, under the premise that user privacy is guaranteed.
• Continuous and real-time data. Compared to Internet services providers, telecom operators can obtain position tracking data through the cellular network protocol even when users only power on their devices and have no data connection (Wi-Fi/3G/4G). This mechanism guarantees real-time and continuous data collection, which will be more powerful in real-time applications such as issuing early traffic warnings.
3 Big data platform
In this section, we will introduce the telecommunications operator’s internal big data platform in detail. The overall design of the big data platform is based on the principles of data concentration, degree of openness, and cloud computing. It aims to provide secure access, storage, sharing, analysis, applications, and management. It helps construct an enterprise-level and future-oriented data center. Moreover, the platform will create an open and shared public data environment. The above attributes can guarantee the application implementation in all internal departments.
3.1 Overall framework
As shown in Fig. 1, the platform mainly consists of five parts: the data collection system, data storage system, data processing system, open mobile system, and management system.
The overall framework has a distinct hierarchy and arrangement. The selected technologies and components are mature and stable. On one hand, it can satisfy the data processing requirements in the current data environment. On the other hand, all the included technologies are supposed to be in line with the future direction of big data.
3.2 Data collection system
The data collection system is the basic part of the platform. It provides a variety of data access tools
Big data in telecommunication operators: data, platform and practices 81
and aggregates the critical structured and unstructured system data from all enterprise management departments, front-end and back-end. By combining data from the offline acquisition and real-time acquisition phases, the system can break down the isolated information and aggregate all original data into the unified platform.
3.2.1 Collection interfaces
There are many different systems in the communications enterprise. Thus, interfaces are needed to connect the collection system to other kinds of systems. Some data are stored in files while others are real-time data. There are two kinds of interfaces. The data collection interface collects data from various interior source systems. The service interface manages data sharing and transfer among different intermediate systems. Tab. 1 introduces the different data interfaces.
3.2.2 Collection technologies
In this part, we introduce three common collection technologies.
Synchronization technology based on relational database: Both dblink and OGG are synchronization technologies for Oracle databases. OGG is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments. The product enables high availability solutions, real-time data integration, transactional change data capture, and data replication, transformation, and verification between operational and analytical enterprise systems.
Applied scenarios: Dblink is mainly used for data synchronization between Oracle databases. It is often used in full-scale synchronization. OGG uses the database file synchronization mode. Because of its high efficiency and small influence on the source system, it is currently used in production systems and other time-sensitive applications, such as the synchronization of attributes tables, orders, and lists.
Interaction technology between HDFS and tables based on Sqoop: Apache Sqoop (SQL-to-Hadoop) was designed to help the RDBMS and Hadoop achieve efficient big data exchange. With the help of Sqoop, users can transfer relational database data
Table 1 Data interfaces
interface type | source system | data | mode
data collection | national platform of internet log | mobile network DPI, mobile network AAA data | file
data collection | DPI platform of network operation | fixed network DPI, source IP, AD subscriber ID, timestamp, request URL, user agent, referrer URL, destination IP, cookie user port, destination port, etc. | file
data collection | DPI platform of network operation | fixed network AAA data, including WLAN authentication and broadband user authentication | quasi real-time
data collection | OIDD platform | OIDD system signaling data | file, real-time
data collection | ODMS platform, TSM platform | UDB, ISMP, business pilot, WLAN hotspot management and other value-added business data | file
data collection | billing system | mobile network billing details (calling and called), SMS billing, flow billing | file
data collection | billing system | fixed network billing details (calling and called) | file
data service | ability product and application platform | all types of original list of external business | file, real-time
data service | provincial IT system | provincial roaming data issued | file
to related systems in Hadoop, such as HBase and Hive. Sqoop can also extract data from the Hadoop system and then export it to the relational database.
Applied scenarios: The development of businesses and applications, especially the impact of big data, has led to the exponential growth of enterprise data. Data formats are becoming increasingly diverse, such as text, video, Web crawler data, and many other structured and unstructured data. The traditional dblink and OGG synchronization technologies have failed to meet the demands of the industry. Hence, the Hadoop open source framework for data processing was introduced. Because of the use of the HDFS file storage mode, Sqoop is a good solution to the synchronization problem between the relational database and the distributed file system. Currently, the data stored on the Hadoop platform includes all user information, subsidies, sales, orders, DPI, signaling, and other structured or unstructured data. These data are collected by Sqoop components and will be able to meet the subsequent processing requirements of big data SQL engines such as Impala, Spark, and Hive.
Incremental document collection technology based on Flume: Flume NG is a distributed, reliable, and available system provided by Cloudera. It can efficiently collect, aggregate, and move massive amounts of log data from different sources and store them in a centralized data storage system. It is a lightweight and simple tool which can easily adapt to various collection methods and balance loads.
Applied scenarios: Flume technology is mainly used for the log collection of each system. The development of cloud application systems, distributed architectures, and increasing node numbers makes daily operations and maintenance processes increasingly difficult, with problems such as dispersion, storage pressure, non-standardized log formats, non-unified query channels, and non-automatic push of abnormal information. These problems spurred us to build a log database. The business applications cover the track analysis of operation and maintenance personnel, operations staff, business processes in the business hall, and user Web page access. For example, a clerk reports on the part of the business that is inefficient and provides a specific order number. Then, operations and maintenance personnel, according to the analysis of customer tracks, can identify the time-consuming link, customer waiting time, and pure system operation time. Based on the above steps, we can determine the real reasons for the inefficiency and provide recommendations for the optimization
Table 2 Current collection system of one company
business type | frequency | capacity/TB | increment/TB | processing memory | total storage/TB | duration/month
customer, account and user information | day | 1.00 | 0.90 | 2.00 | 31.00 | 1
inventory data integration | day/month | 1.20 | 1.20 | 2.40 | 37.20 | 1
mobile network DPI | day | 1.90 | 1.90 | 3.80 | 68.40 | 36
fixed network DPI, ITV | day | 5.00 | 5.00 | 10.00 | 155.00 | 1
wing payment | month | 0.01 | 0.01 | — | 0.24 | 24
port A signaling | day | 0.60 | 0.50 | 2.00 | 37.20 | 2
OSS data | day | 0.20 | 0.10 | 0.60 | 74.40 | 12
income, bill | month | 3.00 | 0.80 | 2.60 | 75.00 | 25
statements | day/month | 1.70 | 0.20 | 3.40 | 40.80 | 24
group data | month | 0.30 | 0.01 | — | 7.20 | 24
account | — | 14.91 | 10.62 | 26.8 | 526.44 | —
and management of the IT system.
One company’s current collection system is shown in Tab. 2.
3.3 Data processing system
The data processing system is the core of the platform, providing deep mining and analysis services. Using the distributed storage and parallel computing framework combined with many kinds of computing engines, this system can accomplish fast and distributed computing for structured, semi-structured, and unstructured information resources.
3.3.1 Processing architecture
In order to achieve efficient collaboration in data processing and meet the requirements of different applications, we divided the system into a real-time module and an offline module as shown in Fig. 2.
Real-time scenario: Real-time data, including mobile broadband/product development, terminal sales, package development, 4G flow, and gross income, are all displayed by instrument panels, progress bars, trend charts, regional hotspot maps, and other forms. The development status and progress are self-explanatory. Personnel can make timely adjustments to marketing decisions by utilizing the screen display and rolling update.
Off-line scenario: Business development and application complexities put heavy pressure on storage and processing. Moreover, the ODS and EDW hardware platforms basically use minicomputers or integrated machines, which makes management difficult. Fortunately, open source technologies can integrate both structured data (e.g., BSS, OSS, MSS) and unstructured data (e.g., mobile DPI and fixed-network DPI). After the construction of the offline analysis platform, we can observe the daily critical quota. The specific steps are as follows:
1. check the external table data according to certain rules, such as volatility and consistency;
2. check and insert the data into internal tables in the interface layer, then timestamp and partition them;
3. store the lightly summarized and detailed data generated by the model calculation in HDFS format;
4. process based on business logic.
3.3.2 Processing level
Tab. 3 shows the data processing level of one provincial telecommunication company.
3.4 Other systems
This section introduces the other three systems, namely the data storage system, open mobile platforms, and management systems.
Functioning as the support of data analysis and sharing, data storage systems can store and query
structured, semi-structured, and unstructured data. In order to achieve efficient data transfer, there are four layers in the storage system. Interface layer: this layer aims at peripheral data sources and is responsible for data collection and preprocessing. It can manage external data sources, interface types, format requirements, scheduling methods, and supervision of data acquisition and exchange. Integration layer: this layer integrates the isolated business model to establish a set of theme-oriented enterprise data models. Intermediary layer: this layer refines the integration layer information for the purpose of application. It can reduce the degree of coupling between models through the fragmented way of processing and storage, which supports fast and agile data processing and assembly. Summary layer: this layer can provide data analysis, data mining, and ad hoc queries for cross-domain data.
The open mobile platform supports both internal data applications and external business. First, it is a platform for external businesses using the multitenant mode. Second, the operator is the platform operator as well as one of the tenants. The platform needs to assign users and permissions to tenants, and provide user-level independent storage space, well allocated computing resources, secure data protection, etc. The multitenant mode needs to make full use of the data analysis capacity and help tenants apply for resources. It can also perform the intelligent management of tenant resources by recycling those with high idle rates and expanding limited resources.
The management system has two parts: data management and security management. The data management module is responsible for process scheduling
and monitoring, generation of the main data and index database, and data resource management. The security management module is responsible for user rights, data access, access control, data desensitization, data encryption, watermarks, and other system management functions.
4 Practice and applications
This section introduces big data analysis-based practices from three perspectives. The first one is their application to normal business. Then, it shows their effects on network optimization. Finally, we will introduce a business in which the telecommunications operator collaborates with the government.
4.1 3G/4G upgrading
4G has become a key business for telecom operators since the release of TDD/FDD LTE licenses. It has a strong influence on future user profiles. By now, the terminal-SIM matching rate is relatively low. The number of matching users for the Anhui province is 3 800 000, which accounts for 37.1% of the total as of July 2016. Thus, using big data to enhance 4G terminal sales is an effective way to validate the present study. The first step is using data mining to identify potential users. The ARPU can then be considered, followed by target marketing. This way, both the user scale and value can be enhanced.
4.1.1 Dataset
The sample consists of 4 600 000 customers of a provincial company as of April 2016. They used neither 4G terminals nor LTE flow. Over the next three months, the number of 4G terminal upgrades was 334 000 (i.e., the number of positive samples). Because of the large difference between positive and negative samples, we applied some balancing measures. Meanwhile, 70% of the sample was designated as the training set and the remainder as the test set.
In order to ensure the purity of the data, we need to check the data on user information, self-registration, billing, terminal type, etc. to handle abnormal values, outliers, and missing values.
Then we generate derivative variables by combining business rules. We cluster the ARPU and flow into three categories and generate an ARPU-rank field (1, 2, 3) and a flow-rank field (1, 2, 3), respectively. We also calculate other derived fields, including the overflow consumption and the ARPU and terminal price matching degree.
4.1.2 Model and algorithm
First, filter the valid input variables. The number of input variables follows a short and refined principle. Too many input variables are likely to cause problems, such as interference and over-fitting, which can lead to a decline in the stability of the model. There are two methods to select variables: choosing by business analysis and choosing in accordance with the correlation coefficient. When the correlation coefficient between two variables is equal to or greater than 0.6, this indicates a moderate or above linear relationship between the two variables. Here, we only need to keep one variable.
In order to ensure the universality of the model, we need to divide the data into the training set and test set. The model is constructed on the training set. The hit rate, coverage, and applicability are verified on the test set. The availability is guaranteed by cross validation.
By comparing the indicators obtained by the automatic classifier node in the SPSS Modeler, we find that the decision tree algorithm has the best overall performance among the different algorithms shown in Tab. 4.
Table 4 Performance of different classifiers
algorithm | precision | recall
decision tree | 70% | 80%
neural network | 71% | 50%
logistic regression | 69% | 55%
Fig. 3 shows the key factors chosen by the feature selection module of SPSS.
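Two steps of Section 4.1.2 can be sketched in a few lines of Python: dropping one variable of every pair whose correlation coefficient reaches 0.6, and comparing the classifiers of Tab. 4. This is a minimal sketch under stated assumptions, not the SPSS Modeler procedure used in the study. Using the absolute value of the Pearson coefficient, keeping the first variable of a correlated pair, and summarizing "overall performance" with the F1 score are all assumptions; the function names and the example columns are hypothetical.

```python
import math

# Precision and recall as reported in Tab. 4, written as fractions.
TABLE_4 = {
    "decision tree": (0.70, 0.80),
    "neural network": (0.71, 0.50),
    "logistic regression": (0.69, 0.55),
}

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / (norm_x * norm_y)

def prune_correlated(columns, threshold=0.6):
    """Keep a variable only if |r| < threshold against every variable kept so far.

    Assumption: of two correlated variables, the one listed first is kept.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (assumed summary metric)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def rank_classifiers(table):
    """Return (name, F1) pairs sorted from the highest to the lowest F1."""
    scores = {name: f1_score(p, r) for name, (p, r) in table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical input variables; "bill" is almost a rescaled copy of "arpu".
    columns = {
        "arpu": [30, 45, 60, 80, 120],
        "bill": [33, 50, 66, 85, 130],
        "flow": [5, 1, 4, 2, 3],
    }
    print(prune_correlated(columns))
    for name, score in rank_classifiers(TABLE_4):
        print(f"{name}: F1 = {score:.3f}")
```

Under the assumed F1 metric the decision tree ranks first, which agrees with the conclusion drawn from Tab. 4, even though the neural network has a marginally higher precision.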
Figure 4 Practical effect of proposed model (categories: previous, current, cluster 1, cluster 2, cluster 3, cluster 4 samples)
From August to September 2016, we divided 1 500 000 users into four clusters and performed tar-
coverage and report these conclusions to the wireless network optimization center or construction center.
4.2.1 Model and algorithm
Resolving the frequent cutting-downs in core business districts will enhance the user experience and balance the LTE network load. We can identify the cutting-down station by implementing range determination, data integration, and thermodynamic diagram analysis.
1. Automatic classification of base station coverage area based on grid holography. Based on holog-
continuous monitoring. This section explores the application of the intelligent tracing system based on the mining of spatial and temporal mobile data.
We aim to build a support environment and interactive interface for the intelligent footprint system. We will also establish a spatial-temporal analysis standard for big data and industry. Moreover, we will promote multidimensional interconnections in order to achieve efficient organization, orderly management, reasonable use, and high value when analyzing spatial-temporal big data.
4.3.1 Model and algorithm
1. Real time data acquisition technology based on mobile signaling. Based on real-time monitoring and using specialized signaling acquisition software and hardware, operators can filter and analyze specific signaling processes and obtain information about base stations and signaling. This technology can locate signals from small cells to large regions, leading to personalized services in road monitoring applications. The data recorded include the user IDs, time stamps, positions, and other location information. It updates every 5 min to ensure the accuracy and continuity of user location information.
2. Moving sequence detection. There are many uncertainties and disturbances in the signaling time sequence. For example, a user’s signal may suddenly move far away from the trace, which we call “flying-points”. Another case is shown in Fig. 6, where the user does not move at all but handovers occur frequently. The reason is the overlapping region, as represented by the red areas. Hence, we need to conduct preprocessing to filter out the abnormal data and obtain the real moving sequence by real-time flow computing technology.
Figure 6 Signaling sequence (cells C1 to C9; real trace and inaccurate trace)
3. Key location mining algorithm. In location mining, we chose the density-based DBSCAN algorithm instead of the normal K-means algorithm. This is because 1) the number of clusters K in K-means is difficult to determine in advance; 2) K-means tends to form round clusters and is not suitable for squares, rivers, or places with other shapes; 3) K-means is very sensitive to noise data and the trace data is not clear enough. On the other hand, DBSCAN is adaptable and can avoid the above mentioned problems.
4. Spatial-temporal trajectory real-time road matching algorithm. The base station coverage radius is about a few hundred meters and the station is often far away from the road. Information is updated frequently, generally in less than 1 min. Faced with high frequency and high error rates, a map-matching algorithm is essential to map the base station location to road-level positions accurately and in real time. Thus, we based the algorithm on the commonly used probability graph model to conduct road network matching. Because of the characteristics of signaling, some optimization is needed to ensure the accuracy and efficiency of the algorithm. A road test can provide powerful support for this algorithm. First, install the technical analytic device in the car. Then, record the vehicle trajectory and collect the corresponding handover sequence. Finally, use the marked data, such as time, latitude, and longitude, to adjust the parameters.
4.3.2 Results and discussions
Based on the above models, we can construct the intelligent traffic analysis platform in cities, which will provide dynamic crowd analysis, real-time data on traffic conditions, traffic behavior analysis, urban planning support, etc. Some specific applications are described below.
The first is traffic demand analysis and road planning. Based on the analysis of round-the-clock crowd movement, we established the planning OD (Origin Destination) matrix. Then, using tracking modeling, we analyzed the main trajectory. According to the established traffic grid and the movement coordinates, we can obtain the real road load demand, which is based on the crowd flow. The OD trajectory at different time periods can generate a full time OD trajectory diagram, as shown in Tabs. 5 and 6.
Table 5 OD matrix of 8:00 am∼9:00 am
O \ D | J01 | J02 | J03 | J04 | S01 | S02 | S03 | S04 | S05 | SUM
J01 | 50 | 3 | 11 | 5 | 19 | 15 | 19 | 38 | 36 | 196
J02 | 5 | 38 | 69 | 1 | 14 | 5 | 19 | 21 | 67 | 239
J03 | 8 | 53 | 82 | 3 | 13 | 10 | 8 | 17 | 59 | 253
J04 | 20 | 1 | 1 | 0 | 9 | 22 | 11 | 44 | 8 | 116
S01 | 3 | 8 | 22 | 1 | 53 | 6 | 7 | 21 | 76 | 197
S02 | 7 | 4 | 6 | 23 | 7 | 45 | 6 | 45 | 19 | 162
S03 | 7 | 7 | 8 | 2 | 14 | 6 | 0 | 17 | 34 | 95
S04 | 36 | 18 | 29 | 32 | 36 | 42 | 7 | 140 | 74 | 414
by a spatial-temporal real-time road matching algorithm. Testing of the area proved that the results of our algorithm accurately represented the actual conditions. Fig. 7 shows that traffic jams usually occur at the time people go to work or when they go home, at 8:00 and 18:00, respectively. We also observed that the rush hour on weekdays starts approximately 2 h earlier than on weekends. Moreover, the overall condition of the road is slightly better on weekends.
[Fig. 7: speed/km·h−1 on weekdays and weekends]
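As a minimal sketch of how an OD matrix such as Table 5 can be assembled, the following Python snippet counts trips per (origin, destination) zone pair and appends the row sum as the last column. The trip list, the reduced zone set, and the function name `build_od_matrix` are hypothetical; the paper does not describe its own implementation, and in practice the trips would come from the cleaned signaling sequences of Section 4.3.1.

```python
from collections import Counter

def build_od_matrix(trips, zones):
    """Count trips per (origin, destination) pair.

    Returns a dict mapping each origin zone to a list of counts, one per
    destination zone in the order of `zones`, followed by the row sum.
    """
    counts = Counter(trips)
    matrix = {}
    for origin in zones:
        row = [counts[(origin, dest)] for dest in zones]
        matrix[origin] = row + [sum(row)]
    return matrix

if __name__ == "__main__":
    # Hypothetical zones and trips for one time window (e.g. 8:00 to 9:00).
    zones = ["J01", "J02", "S01"]
    trips = [
        ("J01", "J01"),
        ("J01", "S01"),
        ("J02", "J01"),
        ("J01", "S01"),
        ("S01", "J02"),
    ]
    print("O \\ D", zones + ["SUM"])
    for origin, row in build_od_matrix(trips, zones).items():
        print(origin, row)
```

The same row-sum convention can be checked against Table 5: the nine J01 entries (50, 3, 11, 5, 19, 15, 19, 38, 36) add up to the reported SUM of 196.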
vice providers, the ways in which to exploit big data to achieve enterprise transformation is an important topic. This paper analyzed three aspects of big data: the big data characteristics of the communications industry, big data platform architectures, and big
OIDD Open Information Dynamic Data
ODMS Operation Data Management System
AAA Authentication, Authorization and Accounting
ISMP Integrated Services Management Platform
UDB User Database