Taming Big Data Analytics

The document provides an overview of data analytics, including its types, processes, and applications in various industries. It emphasizes the importance of aligning data and analytics strategies to enhance business performance and outlines the steps involved in data preparation and analysis. Additionally, it discusses the role of big data analytics in uncovering insights that drive informed business decisions and competitive advantages.

Data Analytics

Big Data Analytics


Logical Architectures for Big Data Analytics
Advanced Analytics
Data Analytics and its types
Predictive Analytics
Descriptive Analytics
Prescriptive Analytics
Diagnostic Analytics
Data Science Process
Data Preparation
Tools to Mine Big Data Analytics
Top Data Analytics Programming Languages
R programming language
Python
Scala
Apache Spark
SQL
Apache Hive
Analytical modeling is both science and art
Algorithm
Business Analytics
Edge Analytics
Inductive Reasoning
Supply Chain Analytics
Statistical Analysis
Analytic Database
Real-Time Analytics
Data Analytics Visualization Tools
Differences between Data Analytics, AI, Machine & Deep Learning
Artificial Intelligence
Artificial Neural Network
Machine Learning
Deep Learning
Data Lakes vs. Data Warehouses
Data Lake
Data Warehouse
Advanced Analytics techniques fuel data-driven organization
Must-have features for Big Data Analytics Tools
Data-driven storytelling opens analytics to all
Use Cases of Big Data Analytics in Real World
Key Skills That Data Scientists Need
Data analytics and career opportunities
Data Analytics
Data analytics (DA) is the process of examining data sets in order to find trends
and draw conclusions about the information they contain. Increasingly, data
analytics is done with the aid of specialized systems and software. Data analytics
technologies and techniques are widely used in commercial industries to enable
organizations to make more-informed business decisions. It is also used by
scientists and researchers to verify or disprove scientific models, theories and
hypotheses.

As a term, data analytics predominantly refers to an assortment of applications,
from basic business intelligence (BI), reporting and online analytical processing
(OLAP) to various forms of advanced analytics. In that sense, it's similar in
nature to business analytics, another umbrella term for approaches to analyzing
data. The difference is that the latter is oriented to business uses, while data
analytics has a broader focus. The expansive view of the term isn't universal,
though: In some cases, people use data analytics specifically to mean advanced
analytics, treating BI as a separate category.

Data analytics initiatives can help businesses increase revenues, improve
operational efficiency, and optimize marketing campaigns and customer service
efforts. They can also be used to respond quickly to emerging market trends and
gain a competitive edge over rivals. The ultimate goal of data analytics,
however, is boosting business performance. Depending on the particular
application, the data that's analyzed can consist of either historical records or
new information that has been processed for real-time analytics. In addition, it
can come from a mix of internal systems and external data sources.

Types of data analytics applications


At a high level, data analytics methodologies include exploratory data analysis
(EDA) and confirmatory data analysis (CDA). EDA aims to find patterns and
relationships in data, while CDA applies statistical techniques to determine
whether hypotheses about a data set are true or false. EDA is often compared to
detective work, while CDA is akin to the work of a judge or jury during a court
trial, a distinction first drawn by statistician John W. Tukey in 1977.
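
To make the distinction concrete, the following is a minimal Python sketch (pandas and
SciPy, on a small invented sales table): the exploratory half simply summarizes and
groups the data to look for candidate patterns, while the confirmatory half tests one
specific hypothesis with a two-sample t-test.

# Minimal EDA vs. CDA sketch on a small, invented sales table.
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "region":  ["north", "north", "north", "south", "south", "south"],
    "revenue": [120.0, 135.0, 128.0, 98.0, 102.0, 95.0],
})

# Exploratory data analysis: look for patterns without a fixed hypothesis.
print(df.describe())                           # summary statistics
print(df.groupby("region")["revenue"].mean())  # a candidate relationship

# Confirmatory data analysis: test a specific hypothesis, e.g. "mean revenue
# differs between the two regions", with a two-sample t-test.
north = df.loc[df["region"] == "north", "revenue"]
south = df.loc[df["region"] == "south", "revenue"]
t_stat, p_value = stats.ttest_ind(north, south, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")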

Data analytics can also be separated into quantitative data analysis and
qualitative data analysis. The former involves the analysis of numerical data with
quantifiable variables that can be compared or measured statistically. The
qualitative approach is more interpretive: it focuses on understanding the content
of non-numerical data like text, images, audio and video, along with common
phrases, themes and points of view.

At the application level, BI and reporting provide business executives and
corporate workers with actionable information about key performance
indicators, business operations, customers and more. In the past, data queries and
reports typically were created for end users by BI developers who worked in IT.
Now, more organizations use self-service BI tools that let executives,
business analysts and operational workers run their own ad hoc queries and build
reports themselves.

More advanced types of data analytics include data mining, which involves sorting
through large data sets to identify trends, patterns and relationships. Another
type is predictive analytics, which seeks to predict customer behavior,
equipment failures and other future events. Machine learning can also be used
for data analytics, using automated algorithms to churn through data sets more
quickly than data scientists can via conventional analytical modeling. Big
data analytics applies data mining, predictive analytics and machine learning
tools. Text mining provides a means of analyzing documents, emails and other
text-based content.
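
As an illustration of the text mining piece, here is a minimal sketch using
scikit-learn's TF-IDF vectorizer on a few invented customer emails; the
highest-weighted terms give a rough sense of what each message is about.

# Minimal text mining sketch: weight terms in a few invented customer emails.
from sklearn.feature_extraction.text import TfidfVectorizer

emails = [
    "The delivery was late and the package arrived damaged.",
    "Great service, fast delivery and friendly support staff.",
    "I was charged twice and support has not replied to my email.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(emails)

# Show the three highest-weighted terms for each email.
terms = vectorizer.get_feature_names_out()
for i, row in enumerate(tfidf.toarray()):
    top = sorted(zip(terms, row), key=lambda pair: pair[1], reverse=True)[:3]
    print(f"email {i}:", [term for term, weight in top])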

Data analytics initiatives support a wide variety of business uses. For example,
banks and credit card companies analyze withdrawal and spending patterns to
prevent fraud and identity theft. E-commerce companies and marketing services
providers use clickstream analysis to identify website visitors who are likely
to buy a particular product or service based on navigation and page-viewing
patterns. Healthcare organizations mine patient data to evaluate the effectiveness
of treatments for cancer and other diseases. Mobile network operators examine
customer data to forecast churn so they can take steps to prevent defections to
business rivals. To boost customer relationship management efforts, companies
also engage in CRM analytics to segment customers for marketing campaigns
and equip call center workers with up-to-date information about callers.

Inside the data analytics process


Data analytics applications involve more than just analyzing data, particularly on
advanced analytics projects. Much of the required work takes place upfront, in
collecting, integrating and preparing data and then developing, testing and
revising analytical models to ensure that they produce accurate results. In
addition to data scientists and other data analysts, analytics teams often
include data engineers, whose job is to help get data sets ready for analysis.

The analytics process starts with data collection. Data scientists identify the
information they need for a particular analytics application, and then work on
their own or with data engineers and IT staff to assemble it for use. Data from
different source systems may need to be combined via data integration routines,
transformed into a common format and loaded into an analytics system, such as
a Hadoop cluster, NoSQL database or data warehouse.

In other cases, the collection process may consist of pulling a relevant subset out
of a stream of data that flows into, for example, Hadoop. This data is then moved
to a separate partition in the system so it can be analyzed without affecting the
overall data set.

Once the data that's needed is in place, the next step is to find and fix data
quality problems that could affect the accuracy of analytics applications. That
includes running data profiling and data cleansing tasks to ensure the
information in a data set is consistent and that errors and duplicate entries are
eliminated. Additional data preparation work is then done to manipulate and
organize the data for the planned analytics use. Data governance policies are
then applied to ensure that the data follows corporate standards and is being used
properly.
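
A minimal pandas sketch of this profiling and cleansing step might look like the
following; the customers.csv file and its column names are hypothetical.

# Minimal data profiling and cleansing sketch (hypothetical customers.csv).
import pandas as pd

df = pd.read_csv("customers.csv")

# Profiling: understand what is actually in the data set.
print(df.dtypes)               # column types
print(df.isna().sum())         # missing values per column
print(df.duplicated().sum())   # count of exact duplicate rows

# Cleansing: remove duplicates, standardize formats, drop unusable records.
df = df.drop_duplicates()
df["email"] = df["email"].str.strip().str.lower()
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df = df.dropna(subset=["customer_id"])   # records without an ID can't be matched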

From here, a data scientist builds an analytical model, using predictive
modeling tools or other analytics software and languages such as Python,
Scala, R and SQL. The model is initially run against a partial data set to test its
accuracy, and typically it's then revised and tested again. This process, known as
"training" the model, continues until it functions as intended. Finally, the model
is run in production mode against the full data set, something that can be done
once to address a specific information need or on an ongoing basis as the data is
updated.
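
The train-test-deploy cycle described above can be sketched with scikit-learn
roughly as follows; the churn_history.csv file and its feature names are invented
for illustration.

# Minimal sketch of training a model on a partial data set, checking its
# accuracy on held-out records, then scoring the full data set.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn_history.csv")
X = df[["tenure_months", "monthly_spend", "support_tickets"]]
y = df["churned"]

# "Training" phase: fit on part of the data, test on the rest, revise as needed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))

# "Production" phase: score the full, current data set (repeated as data updates).
df["churn_probability"] = model.predict_proba(X)[:, 1]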

In some cases, analytics applications can be set to automatically trigger business
actions, such as stock trades by a financial services firm. Otherwise, the
last step in the data analytics process is communicating the results generated by
analytical models to business executives and other end users. Charts and other
infographics can be designed to make findings easier to understand. Data
visualizations often are incorporated into BI dashboard applications that display
data on a single screen and can be updated in real time as new information
becomes available.
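
As a simple example of turning results into something end users can read, the
following matplotlib sketch charts invented predicted-revenue figures by customer
segment; a BI dashboard tool would consume the same aggregates.

# Minimal sketch of charting model output for business users (figures invented).
import pandas as pd
import matplotlib.pyplot as plt

results = pd.DataFrame({
    "segment": ["new", "loyal", "at_risk"],
    "predicted_revenue": [120000, 340000, 80000],
})

ax = results.plot.bar(x="segment", y="predicted_revenue", legend=False)
ax.set_ylabel("Predicted quarterly revenue ($)")
ax.set_title("Revenue forecast by customer segment")
plt.tight_layout()
plt.savefig("revenue_forecast.png")   # the image can be embedded in a report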

Data analytics vs. data science


As automation grows, data scientists will focus more on business needs, strategic
oversight and deep learning. Data analysts who work in business intelligence
will focus more on model creation and other routine tasks. In general, data
scientists concentrate efforts on producing broad insights, while data analysts
focus on answering specific questions. In terms of technical skills, future data
scientists will need to focus more on the machine learning operations process,
also called MLOps.

Speaking at Information Builders' Summit, IDC group vice president Dan
Vesset estimated that knowledge workers spend less than 20% of their time on
data analysis. The rest of their time is taken up with finding, preparing and
managing data. "An organisation plagued by the lack of relevant data,
technology and processes, employing 1000 knowledge workers, wastes over
$5.7 million annually searching for, but not finding information," warned Vesset.

Vesset’s comments underline the fact that data must be business-ready before it
can generate value through advanced analytics, predictive analytics, IoT, or
artificial intelligence (AI).

As we’ve seen from numerous enterprise case studies, co-ordination of data and
analytics strategies and resources is the key to generating return on analytics
investments.

Building the case for aligning data and analytics strategies


As data sources become more abundant, it’s important for organisations
to develop a clear data strategy, which lays out how data will be acquired, stored,
cleansed, managed, secured, used and analysed, and the business impact of each
stage in the data lifecycle.

Equally, organisations need a clear analytics strategy which clarifies the desired
business outcomes.

Analytics strategy often follows four clear stages: starting with descriptive
analytics; moving to diagnostic analytics; advancing to predictive analytics and
ultimately to prescriptive analytics.
These two strategies must be aligned because the type of analytics required by
the organisation will have a direct impact on data management aspects such as
storage and latency requirements. For example, operational analytics and
decision support will place a different load on the infrastructure to customer
portal analytics, which must be able to scale to meet sudden spikes in demand.

If operational analytics and IoT are central to your analytics strategy, then
integration of new data formats and real-time streaming and integration will
need to be covered in your data strategy.

Similarly, if your organisation's analytics strategy is to deliver insights directly
to customers, then data quality will be a critical factor in your data strategy.

When the analytics workload is considered, the impact on the data strategy
becomes clear. While a data lake project will serve your data scientists and back
office analysts, your customers and supply chain managers may be left in the
dark.

Putting business outcomes first

Over the past four decades, we have seen the majority of enterprise efforts
devoted to back-office analytics and data science in order to deliver data-based
insights to management teams.

However, the most effective analytics strategy is to deliver insights to the people
who can use them to generate the biggest business benefits.

We typically observe faster time to value where the analytics strategy focuses on
delivering insights directly to operational workers to support their decision-
making; or to add value to the services provided to partners and customers.

How to align data and analytics strategies

One proven approach is to look at business use cases for each stage in the
analytics strategy. This might include descriptive management scorecards and
dashboards; diagnostic back-office analytics and data science; operational
analytics and decision support; M2M and IoT; AI; or portal analytics created to
enhance the customer experience.

Identify all the goals and policies that must be included in your strategies. Create
a framework to avoid gaps in data management so that the right data will be
captured, harmonised and stored to allow it to be used effectively within the
analytics strategy.

Look at how your organisation enables access to and integration of diverse data
sources. Consider how it uses software, batch or real-time processing and data
streams from all internal systems.

By looking at goals and policies, the organisation can accommodate any changes
to support a strong combined data and analytics strategy.

Focus on data quality

Once you have defined your data and analytics strategies, it’s critical to address
data quality. Mastering data ensures that your people can trust the analytic
insights derived from it. Taking this first step will greatly simplify your
organisation’s subsequent analytics initiatives.

As data is the fuel of the analytics engine, performance will depend on data
refinement.

The reality for many data professionals is that they struggle to gain organisation-
wide support for a data strategy. Business managers are more inclined to invest
in tangibles, such as dashboards. Identifying the financial benefits of investing in
a data quality programme or a master data management initiative is a challenge,
unless something has previously gone wrong that has convinced the
management team that valuable analytics outputs are directly tied to quality data
inputs.

To gain their support for a data strategy, consider involving line-of-business
managers by asking them what the overall goals and outputs are for their
analytics initiatives. Understanding the desired data outputs will then guide
the design of the data infrastructure.

Pulling together

Often we see data management, analytics and business intelligence being
handled by different teams, using different approaches, within the same
organisation. This can create a disconnect between what the business wants to
achieve from data assets and what is possible. Data and analytics strategies need
to be aligned so that there is a clear link between the way the organisation
manages its data and how it gains business insights.
Include people from different departments who possess a cross
section of skills: business, finance, marketing, customer service, IT,
business intelligence, data science and statistics. Understand how
these colleagues interact and what is important to them in terms of
data outputs.
Take into account how data interconnects with your organisation’s
daily business processes. This will help answer questions about the
required data sources, connections, latency and inputs to your
analytics strategy. Ensuring that they work together connects data
to business value.
Finally, consider the technology components that are required. This
entails looking at different platforms that deliver the required data
access, data integration, data cleansing, storage and latency, to
support your required business outcomes.

Measuring the benefits

The following organisations aligned their data and analytics strategies to deliver
clear business outcomes:

Food for the Poor used high quality data and analytics to reach its
fund raising target more quickly: reducing the time taken to raise
$10 million from six months to six days, so that it could more
quickly help people in dire need.
Lipari Foods integrated IoT, logistics and geo location data,
enabling it to analyse supply chain operations so that it uses
warehouse space more efficiently, allowing it to run an agile
operation with a small team of people.
St Luke’s University Health Network mastered its data as part of its
strategy to target specific households to make them aware of
specialised medications, reaching 98 per cent uptake in one of its
campaigns focused on thirty households. “Rather than getting
mired in lengthy data integration and master data management
(MDM) processes without any short-term benefits, stakeholders
decided to focus on time-to-value by letting business priorities
drive program deliverables,” explains Dan Foltz, program manager
for the EDW and analytics implementation at St. Luke’s. “We
simultaneously proceeded with data integration, data governance,
and BI development to achieve our business objectives as part of a
continuous flow. The business had new BI assets to meet their
needs in a timely fashion, while the MDM initiative improved those
assets and enabled progressively better analysis,” he adds. This
approach allowed the St. Luke’s team to deliver value throughout
the implementation.

These are just a few examples of organisations having a cohesive data strategy
and analytics strategy which has enabled them to generate better value from
diverse and complex data sets.

Gaining better value from data

While analytics initiatives often begin with one or two clear business cases, it’s
important to ensure that the overall data analytics strategy is bigger than any
single initiative. Organisations that focus on individual projects may find that
they have overlooked key data infrastructure requirements once they try to scale.
As Grace Auh, Business Intelligence and Decision Support manager at Markham
Stouffville Hospital, observed during Information Builders’ Summit, “Are you
connecting the dots? Or are you just collecting them?”

Capturing data in silos to serve tactical requirements diminishes the visibility
and value that it can deliver to the whole organisation. The ultimate path to
creating value is to align your data and analytics strategies to each other and,
most importantly, to the overall strategy and execution of your organisation.
Big Data Analytics
Big data analytics is the often complex process of examining big data to uncover
information such as hidden patterns, correlations, market trends and customer
preferences that can help organizations make informed business decisions.

On a broad scale, data analytics technologies and techniques provide a means to
analyze data sets and extract new information that can help organizations
make informed business decisions. Business intelligence (BI) queries, by contrast,
answer more basic questions about business operations and performance.

Big data analytics is a form of advanced analytics, which involves complex
applications with elements such as predictive models, statistical algorithms and
what-if analysis powered by analytics systems.

The importance of big data analytics


Big data analytics, through specialized systems and software, can lead to positive
business-related outcomes, including:

New revenue opportunities
More effective marketing
Better customer service
Improved operational efficiency
Competitive advantages over rivals

Big data analytics applications allow data analysts, data scientists, predictive
modelers, statisticians and other analytics professionals to analyze growing
volumes of structured transaction data, plus other forms of data that are often left
untapped by conventional BI and analytics programs. This includes a mix
of semi-structured and unstructured data, such as internet data, web
server logs, social media content, text from customer emails and survey
responses, mobile phone records, and machine data captured
by sensors connected to the internet of things (IoT).
Big data analytics is a form of advanced analytics, which has marked differences
compared to traditional BI.
How big data analytics works
In some cases, Hadoop clusters and NoSQL systems are used primarily as
landing pads and staging areas for data before it gets loaded into a data
warehouse or analytical database for analysis, usually in a summarized form that
is more conducive to relational structures.

More frequently, however, big data analytics users are adopting the concept of a
Hadoop data lake that serves as the primary repository for incoming streams
of raw data. In such architectures, data can be analyzed directly in a Hadoop
cluster or run through a processing engine like Spark. As in data warehousing,
sound data management is a crucial first step in the big data analytics process.
Data stored in the Hadoop Distributed File System (HDFS) must be organized,
configured and partitioned properly to get good performance out of both extract,
transform and load (ETL) integration jobs and analytical queries.
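
A minimal PySpark sketch of this staging pattern follows: raw events land in a
data-lake path, get filtered and summarized, and the relational-friendly summary
is written out for analytical queries. The paths and column names are hypothetical.

# Minimal PySpark sketch: summarize raw data-lake events into a warehouse table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl_sketch").getOrCreate()

raw = spark.read.json("hdfs:///landing/web_events/")   # raw, unmodeled events

daily_spend = (raw
               .filter(F.col("event_type") == "purchase")
               .groupBy("customer_id", F.to_date("event_time").alias("day"))
               .agg(F.sum("amount").alias("daily_spend")))

# Load the summarized form, which is more conducive to relational queries.
daily_spend.write.mode("overwrite").parquet("hdfs:///warehouse/daily_spend/")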

Once the data is ready, it can be analyzed with the software commonly used
for advanced analytics processes. That includes tools for:

data mining, which sifts through data sets in search of patterns and
relationships;
predictive analytics, which builds models to forecast customer behavior
and other future developments;
machine learning, which taps algorithms to analyze large data sets; and
deep learning, a more advanced offshoot of machine learning.

Text mining and statistical analysis software can also play a role in the big data
analytics process, as can mainstream business intelligence software and data
visualization tools. For both ETL and analytics applications, queries can be
written in MapReduce or in programming languages such as R, Python and Scala,
as well as in SQL, the standard language for relational databases, which is
supported via SQL-on-Hadoop technologies.
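
For the SQL side, a query over the same hypothetical summary table could be run
through Spark SQL, one of the SQL-on-Hadoop options, roughly like this:

# Minimal SQL-on-Hadoop style query via Spark SQL (table and paths hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_sketch").getOrCreate()
spark.read.parquet("hdfs:///warehouse/daily_spend/") \
     .createOrReplaceTempView("daily_spend")

top_customers = spark.sql("""
    SELECT customer_id, SUM(daily_spend) AS total_spend
    FROM daily_spend
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
top_customers.show()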

Big data analytics uses and challenges


Big data analytics applications often include data from both internal systems and
external sources, such as weather data or demographic data on consumers
compiled by third-party information services providers. In addition, streaming
analytics applications are becoming common in big data environments as users
look to perform real-time analytics on data fed into Hadoop systems through
stream processing engines, such as Spark, Flink and Storm.

Early big data systems were mostly deployed on premises, particularly in large
organizations that collected, organized and analyzed massive amounts of data.
But cloud platform vendors, such as Amazon Web Services (AWS)
and Microsoft, have made it easier to set up and manage Hadoop clusters in the
cloud. The same goes for Hadoop suppliers such as Cloudera-Hortonworks,
which supports the distribution of the big data framework on the AWS
and Microsoft Azure clouds. Users can now spin up clusters in the cloud, run
them for as long as they need and then take them offline with usage-based
pricing that doesn't require ongoing software licenses.

Big data has become increasingly beneficial in supply chain analytics. Big
supply chain analytics utilizes big data and quantitative methods to enhance
decision making processes across the supply chain. Specifically, big supply
chain analytics expands datasets for increased analysis that goes beyond the
traditional internal data found on enterprise resource planning (ERP) and supply
chain management (SCM) systems. Also, big supply chain analytics implements
highly effective statistical methods on new and existing data sources. The
insights gathered facilitate better informed and more effective decisions that
benefit and improve the supply chain.

Potential pitfalls of big data analytics initiatives include a lack of internal
analytics skills and the high cost of hiring experienced data scientists and data
engineers to fill the gaps.
Big data analytics involves analyzing structured and unstructured data.
Emergence and growth of big data analytics
The term big data was first used to refer to increasing data volumes in the mid-
1990s. In 2001, Doug Laney, then an analyst at consultancy Meta Group Inc.,
expanded the notion of big data to encompass increases in the variety of
data being generated by organizations and the velocity at which that data was
being created and updated. Those three factors, volume, velocity and variety,
became known as the 3Vs of big data, a concept Gartner popularized after
acquiring Meta Group and hiring Laney in 2005.

Separately, the Hadoop distributed processing framework was launched as
an Apache open source project in 2006. This planted the seeds for a clustered
platform built on top of commodity hardware and geared to run big data
applications. By 2011, big data analytics began to take a firm hold in
organizations and the public eye, along with Hadoop and various related big data
technologies that had sprung up around it.

Initially, as the Hadoop ecosystem took shape and started to mature, big data
applications were primarily the province of large internet and e-
commerce companies such as Yahoo, Google and Facebook, as well as analytics
and marketing services providers. In the ensuing years, though, big data
analytics has increasingly been embraced by retailers, financial services firms,
insurers, healthcare organizations, manufacturers, energy companies and other
enterprises.
Logical Architectures for Big Data Analytics
If you check the reference architectures for big data analytics proposed
by Forrester and Gartner, modern analytics needs a plurality of systems: one or
several Hadoop clusters, in-memory processing systems, streaming tools,
NoSQL databases, analytical appliances and operational data stores, among
others.
This is not surprising, since different data processing tasks need different tools.
For instance, real-time queries have different requirements than batch jobs, and
the optimal way to execute queries for reporting is very different from the way to
execute a machine learning process. Therefore, all these ongoing big data
analytics initiatives are actually building logical architectures, where data is
distributed across several systems.

The Architecture of an Enterprise Big Data Analytics Platform


This will not change anytime soon. As Gartner's Ted Friedmann said in a recent
tweet, 'the world is getting more distributed and it is never going back the other
way'. The 'all the data in the same place' mantra of the big 'data warehouse'
projects of the '90s and '00s never happened: even in those simpler times, fully
replicating all relevant data for a large company in a single system proved
unfeasible. The analytics projects of today will not succeed at such a task in the
much more complex world of big data and cloud.
That is why the aforementioned reference architectures for big data analytics
include a 'unifying' component to act as the interface between the consuming
applications and the different systems. This component should provide data
combination capabilities and a single entry point for applying security and data
governance policies, and it should isolate applications from changes in the
underlying infrastructure (which, in the case of big data analytics, is constantly
evolving).
Advanced Analytics
Advanced analytics is a broad category of inquiry that can be used to help drive
changes and improvements in business practices.

While the traditional analytical tools that comprise basic business intelligence
(BI) examine historical data, tools for advanced analytics focus on forecasting
future events and behaviors, enabling businesses to conduct what-if analyses to
predict the effects of potential changes in business strategies.

Predictive analytics, data mining, big data analytics and machine learning are
just some of the analytical categories that fall under the heading of advanced
analytics. These technologies are widely used in industries including marketing,
healthcare, risk management and economics.

Uses of advanced analytics


Advanced data analytics is being used across industries to predict future events.
Marketing teams use it to predict the likelihood that certain web users will click
on a link; healthcare providers use prescriptive analytics to identify patients who
might benefit from a specific treatment; and cellular network providers use
diagnostic analytics to predict potential network failures, enabling them to
do preventative maintenance.

Advanced analytics practices are becoming more widespread as enterprises
continue to create new data at a rapid rate. Now that many organizations have
access to large stores of data, or big data, they can apply predictive analytics
techniques to understand their operations at a deeper level.

Advanced analytics techniques


The advanced analytics process involves mathematical approaches to
interpreting data. Classical statistical methods, as well as newer, more machine-
driven techniques such as deep learning, are used to identify patterns,
correlations and groupings in data sets. Based on these, users can make
predictions about future behavior, whether that is which group of web users is
most likely to engage with an online ad or how profits will grow over the next
quarter.

In many cases, these complex predictive and prescriptive analyses require a
highly skilled data scientist. These professionals have extensive training in
mathematics; computer coding languages, like Python and the R language; and
experience in a particular line of business.
Advanced analytics has become more common during the era of big data.
Predictive analytics models and, in particular, machine learning models require
large amounts of training to identify patterns and correlations before they can
make a prediction. The growing amount of data managed by enterprises today
opens the door to these advanced analytics techniques.

Advanced analytics tools


There are a variety of advanced analytics tools to choose from that offer different
advantages based on the use case. They generally break down into two
categories: open source and proprietary.

Open source tools have become a go-to option for many data scientists doing
machine learning and prescriptive analytics. They include programming
languages, as well as computing environments, including Hadoop and Spark.
Users typically say they like open source advanced analytics tools because they
are generally inexpensive to operate, offer strong functionality and are backed by
a user community that continually innovates the tools.

On the proprietary side, vendors including Microsoft, IBM and the SAS Institute
all offer advanced analytics tools. Most require a deep technical background and
understanding of mathematical techniques.

In recent years, however, a crop of self-service analytics tools has matured to
make functionality more accessible to business users. Tableau, in particular, has
become a common tool. While its functionality is more limited than deeper
technical tools, it does enable users to conduct cluster analyses and other
advanced analyses.
Data Analytics and its types
Analytics is the discovery and communication of meaningful patterns in data.
Especially valuable in areas rich with recorded information, analytics relies on
the simultaneous application of statistics, computer programming and operations
research to quantify performance. Analytics often favors data visualization to
communicate insight.
Firms commonly apply analytics to business data to describe, predict and
improve business performance. Areas within analytics include predictive
analytics, enterprise decision management and others. Since analytics can require
extensive computation (because of big data), the algorithms and software used for
analytics harness the most current methods in computer science.
In a nutshell, analytics is the scientific process of transforming data into insight
for making better decisions. The goal of data analytics is to get actionable
insights resulting in smarter decisions and better business outcomes.
It is critical to design and build a data warehouse or business intelligence (BI)
architecture that provides a flexible, multi-faceted analytical ecosystem,
optimized for efficient ingestion and analysis of large and diverse data sets.
There are four types of data analytics:

1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics

Predictive Analytics
Predictive analytics is a form of advanced analytics that uses both new and
historical data to forecast activity, behavior and trends. It involves
applying statistical analysis techniques, analytical queries and
automated machine learning algorithms to data sets to create predictive
models that place a numerical value or score on the likelihood of a particular
event happening.

Predictive analytics software applications use variables that can be measured and
analyzed to predict the likely behavior of individuals, machinery or other
entities. Predictive analytics can be used for a variety of use cases. For example,
an insurance company is likely to take into account potential driving safety
variables, such as age, gender, location, type of vehicle and driving record, when
pricing and issuing auto insurance policies.

Multiple variables are combined into a predictive model capable of assessing
future probabilities with an acceptable level of reliability. The software relies
heavily on advanced algorithms and methodologies, such as logistic regression
models, time series analysis and decision trees.
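
To ground this, here is a minimal scikit-learn sketch that fits one of those
methodologies, a decision tree, to hypothetical auto-policy history and scores new
applicants with a claim likelihood; the files and column names are invented.

# Minimal predictive modeling sketch: score claim likelihood with a decision tree.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

policies = pd.read_csv("policy_history.csv")
features = ["driver_age", "vehicle_age", "annual_mileage", "prior_claims"]

model = DecisionTreeClassifier(max_depth=4)
model.fit(policies[features], policies["filed_claim"])

# Each new applicant gets a score between 0 and 1 for the likelihood of a claim.
applicants = pd.read_csv("new_applicants.csv")
applicants["claim_score"] = model.predict_proba(applicants[features])[:, 1]
print(applicants[["applicant_id", "claim_score"]].head())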

Predictive analytics has grown alongside the emergence of big data systems. As
enterprises have amassed larger and broader pools of data in Hadoop clusters
and other big data platforms, they have created increased data mining
opportunities to gain predictive insights. Heightened development and
commercialization of machine learning tools by IT vendors have also helped
expand predictive analytics capabilities.

Marketing, financial services and insurance companies have been notable
adopters of predictive analytics, as have large search engine and online services
providers. Predictive analytics is also commonly used in industries such as
healthcare, retail and manufacturing.

Business applications for predictive analytics include targeting online
advertisements, analyzing customer behavior to determine buying patterns,
flagging potentially fraudulent financial transactions, identifying patients at risk
of developing particular medical conditions and detecting impending parts
failures in industrial equipment before they occur.

The predictive analytics process and techniques


Predictive analytics requires a high level of expertise with statistical methods
and the ability to build predictive data models. As a result, it's typically in the
domain of data scientists, statisticians and other skilled data analysts. They're
supported by data engineers, who help to gather relevant data and prepare it for
analysis, and by software developers and business analysts, who help with data
visualization, dashboards and reports.
Data scientists use predictive models to look for correlations between different
data elements in website clickstream data, patient health records and other types
of data sets. Once the data collection has occurred, a statistical model is
formulated, trained and modified as needed to produce accurate results. The
model is then run against the selected data to generate predictions. Full data sets
are analyzed in some applications, but in others, analytics teams use data
sampling to streamline the process. The data modeling is validated or revised as
additional information becomes available.

The predictive analytics process begins by understanding the business and
preparing the data. A statistical model is then created, evaluated and deployed to
handle the data and derive predictions.
The predictive analytics process isn't always linear, and correlations often
present themselves where data scientists aren't looking. For that reason, some
enterprises are filling data scientist positions by hiring people who have
academic backgrounds in physics and other hard science disciplines. In keeping
with the scientific method, these workers are comfortable going where the data
leads them. Even if companies follow the more conventional path of hiring data
scientists trained in math, statistics and computer science, having an open mind
about data exploration is a key attribute for effective predictive analytics.

Once predictive modeling produces actionable results, the analytics team can
share them with business executives, usually with the aid of dashboards and
reports that present the information and highlight future business opportunities
based on the findings. Functional models can also be built into operational
applications and data products to provide real-time analytics capabilities, such as
a recommendation engine on an online retail website that points customers to
particular products based on their browsing activity and purchase choices.

Beyond data modeling, other techniques used by data scientists and experts
engaging in predictive analytics may include:
text analytics software to mine text-based content, such as Microsoft
Word documents, email and social media posts;
classification models that organize data into preset categories to make it
easier to find and retrieve; and
deep neural networking, which can emulate human learning and
automate predictive analytics.

Applications of predictive analytics


Online marketing is one area in which predictive analytics has had a significant
business impact. Retailers, marketing services providers and other organizations
use predictive analytics tools to identify trends in the browsing history of a
website visitor to personalize advertisements. Retailers also use customer
analytics to drive more informed decisions about what types of products the
retailer should stock.

Predictive maintenance is also emerging as a valuable application for
manufacturers looking to monitor a piece of equipment for signs that it may be
about to break down. As the internet of things (IoT) develops, manufacturers are
attaching sensors to machinery on the factory floor and to mechatronic products,
such as automobiles. Data from the sensors is used to forecast when maintenance
and repair work should be done in order to prevent problems.

IoT also enables similar predictive analytics uses for monitoring oil and gas
pipelines, drilling rigs, windmill farms and various other industrial
IoT installations. Localized weather forecasts for farmers, based partly on data
collected from sensor-equipped weather stations installed in farm fields, are
another IoT-driven predictive modeling application.

Analytics tools
A wide range of tools is used in predictive modeling and analytics. IBM,
Microsoft, SAS Institute and many other software vendors offer predictive
analytics tools and related technologies supporting machine learning and deep
learning applications.

In addition, open source software plays a big role in the predictive analytics
market. The open source R analytics language is commonly used in predictive
analytics applications, as are the Python and Scala programming languages.
Several open source predictive analytics and machine learning platforms are also
available, including a library of algorithms built into the Spark processing
engine.

Analytics teams can use the base open source editions of R and other analytics
languages or pay for the commercial versions offered by vendors such as
Microsoft. The commercial tools can be expensive, but they come with technical
support from the vendor, while users of pure open source releases must
troubleshoot on their own or seek help through open source community support
sites.

Predictive Analytics Primer


No one has the ability to capture and analyze data from the future. However,
there is a way to predict the future using data from the past. It’s called predictive
analytics, and organizations do it every day.
Has your company, for example, developed a customer lifetime value (CLTV)
measure? That’s using predictive analytics to determine how much a customer
will buy from the company over time. Do you have a “next best offer” or
product recommendation capability? That’s an analytical prediction of the
product or service that your customer is most likely to buy next. Have you made
a forecast of next quarter’s sales? Used digital marketing models to determine
what ad to place on what publisher’s site? All of these are forms of predictive
analytics.
Predictive analytics are gaining in popularity, but what do you—a manager, not
an analyst—really need to know in order to interpret results and make better
decisions? How do your data scientists do what they do? By understanding a
few basics, you will feel more comfortable working with and communicating
with others in your organization about the results and recommendations from
predictive analytics. The quantitative analysis isn’t magic—but it is normally
done with a lot of past data, a little statistical wizardry, and some important
assumptions. Let’s talk about each of these.
The Data: Lack of good data is the most common barrier to organizations
seeking to employ predictive analytics. To make predictions about what
customers will buy in the future, for example, you need to have good data on
who is buying (which may require a loyalty program, or at least a lot of
analysis of their credit cards), what they have bought in the past, the attributes of
those products (attribute-based predictions are often more accurate than the
"people who buy this also buy that" type of model), and perhaps some
demographic attributes of the customer (age, gender, residential location,
socioeconomic status, etc.). If you have multiple channels or customer
touchpoints, you need to make sure that they capture data on customer purchases
in the same way your previous channels did.
All in all, it’s a fairly tough job to create a single customer data warehouse with
unique customer IDs on everyone, and all past purchases customers have made
through all channels. If you’ve already done that, you’ve got an incredible asset
for predictive customer analytics.
The Statistics: Regression analysis in its various forms is the primary tool that
organizations use for predictive analytics. It works like this in general: An
analyst hypothesizes that a set of independent variables (say, gender, income,
visits to a website) are statistically correlated with the purchase of a product for a
sample of customers. The analyst performs a regression analysis to see just how
correlated each variable is; this usually requires some iteration to find the right
combination of variables and the best model. Let's say that the analyst succeeds
and finds that each variable in the model is important in explaining the product
purchase, and together the variables explain a lot of variation in the product's
sales. Using that regression equation, the analyst can then use the regression
coefficients, the degree to which each variable affects the purchase behavior, to
create a score predicting the likelihood of the purchase.
You have now created a predictive model for other customers who weren't in the
sample. All you have to do is compute their score and offer the product to them
if their score exceeds a certain level. It's quite likely that the high-scoring
customers will want to buy the product, assuming the analyst did the statistical
work well and that the data were of good quality.
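
A minimal sketch of that workflow, using the statsmodels library so the fitted
coefficients are visible, might look like this. The variable names follow the
example above (gender, income, site visits), but the data files themselves are
hypothetical, with gender coded 0/1 and purchased as a 1/0 outcome.

# Minimal regression-scoring sketch (sample and scoring files are hypothetical).
import pandas as pd
import statsmodels.api as sm

sample = pd.read_csv("customer_sample.csv")
X = sm.add_constant(sample[["gender", "income", "site_visits"]])
y = sample["purchased"]

model = sm.Logit(y, X).fit()
print(model.summary())   # shows how strongly each variable relates to purchase

# Score customers who weren't in the sample and keep the high scorers.
others = pd.read_csv("other_customers.csv")
X_new = sm.add_constant(others[["gender", "income", "site_visits"]])
others["purchase_score"] = model.predict(X_new)
offers = others[others["purchase_score"] > 0.7]
print(len(offers), "customers selected for the offer")

The 0.7 cutoff here is arbitrary; in practice the threshold would be chosen to
balance the cost of making an offer against the expected response.
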
The Assumptions: That brings us to the other key factor in any predictive
model—the assumptions that underlie it. Every model has them, and it's
important to know what they are and monitor whether they are still true. The big
assumption in predictive analytics is that the future will continue to be like the
past. As Charles Duhigg describes in his book The Power of Habit, people
establish strong patterns of behavior that they usually keep up over time.
Sometimes, however, they change those behaviors, and the models that were
used to predict them may no longer be valid.
What makes assumptions invalid? The most common reason is time. If your
model was created several years ago, it may no longer accurately predict current
behavior. The greater the elapsed time, the more likely customer behavior has
changed. Some Netflix predictive models, for example, that were created on
early Internet users had to be retired because later Internet users were
substantially different. The pioneers were more technically-focused and
relatively young; later users were essentially everyone.
Another reason a predictive model’s assumptions may no longer be valid is if the
analyst didn’t include a key variable in the model, and that variable has changed
substantially over time. The great—and scary—example here is the financial
crisis of 2008-9, caused largely by invalid models predicting how likely
mortgage customers were to repay their loans. The models didn’t include the
possibility that housing prices might stop rising, and even that they might fall.
When they did start falling, it turned out that the models became poor predictors
of mortgage repayment. In essence, the fact that housing prices would always
rise was a hidden assumption in the models.
Since faulty or obsolete assumptions can clearly bring down whole banks and
even (nearly!) whole economies, it’s pretty important that they be carefully
examined. Managers should always ask analysts what the key assumptions are,
and what would have to happen for them to no longer be valid. And both
managers and analysts should continually monitor the world to see if key factors
involved in assumptions might have changed over time.
With these fundamentals in mind, here are a few good questions to ask your
analysts:

Can you tell me something about the source of data you used in your
analysis?
Are you sure the sample data are representative of the population?
Are there any outliers in your data distribution? How did they affect
the results?
What assumptions are behind your analysis?
Are there any conditions that would make your assumptions invalid?

Even with those cautions, it’s still pretty amazing that we can use analytics to
predict the future. All we have to do is gather the right data, do the right type of
statistical model, and be careful of our assumptions. Analytical predictions may
be harder to generate than those by the late-night television soothsayer Carnac
the Magnificent, but they are usually considerably more accurate.
Big data analytics projects raise stakes for predictive models

One of the keys to success in big data analytics projects is building strong ties
between data analysts and business units. But there are also technical and skills
issues that can boost or waylay efforts to create effective analytical models for
running predictive analytics and data mining applications against sets of big
data.

A fundamental question is how much data to incorporate into predictive models.


The last few years have seen an explosion in the availability of big data
technologies, such as Hadoop and NoSQL databases, offering relatively
inexpensive data storage. Companies are now collecting information from more
sources and hanging on to scraps of data that in the past they would have
considered superfluous. The promise of being able to analyze all that data has
increased its perceived value as a corporate asset. The more data, the better
seemingly.

But analytics teams need to weigh the benefits of using the full assortment of
data at their disposal. That might be necessary for some applications -- for
example, fraud detection, which depends on identifying outliers in a data set that
point toward fraudulent activity, or uplift modeling efforts that aim to segment
potential customers so marketing programs can be targeted at people who might
be positively influenced by them. In other cases, predictive modeling in big data
environments can be done effectively and more quickly with smaller data sets
through the use of data sampling techniques.
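
A minimal sketch of that sampling approach: draw a representative fraction of a
large, hypothetical campaign-history table and evaluate a model on it rather than
on the full set.

# Minimal data sampling sketch (file and column names are hypothetical).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

full = pd.read_parquet("campaign_history.parquet")   # could be millions of rows
sample = full.sample(frac=0.05, random_state=7)      # 5% representative sample

X = sample[["age", "prior_purchases", "days_since_last_order"]]
y = sample["responded"]

scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=3)
print("cross-validated accuracy on the sample:", scores.mean())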

Tess Nesbitt, director of analytics at DataSong, a marketing analytics services
company in San Francisco, said statistical theorems show that, after a certain
point, feeding more data into an analytical model doesn't provide more accurate
results. She also said sampling, analyzing representative portions of the available
information, can help speed development time on models, enabling them to be
deployed more quickly.

Predictive models benefit from surplus data


Still, there's an argument to be made for retaining all the data an organization can
collect. DataSong helps businesses optimize their online ad campaigns by doing
predictive analytics on what sites would be best to advertise on and what types
of ads to run on different sites; for sales attribution purposes, it also analyzes
customer clickstream data to determine which ads induce people to buy
products. To fuel its analytics applications, the company ingests massive
amounts of Web data into a Hadoop cluster.
Much of that data doesn't necessarily get fed directly into model development,
but it's available for use if needed -- and even if it isn't, Nesbitt said having all
the information can be useful. For example, a large data set gives modelers a
greater number of records held out of the development process to use in testing a
model and tweaking it for improved accuracy. "The more data you have for
testing and validating your models, it's only a good thing," she said.

Data quality is another issue that needs to be taken into account in building
models for big data analytics applications, said Michael Berry, analytics director
at travel website operator TripAdvisor LLC's TripAdvisor for Business division
in Newton, Mass. "There's a hope that because data is big now, you don't have to
worry about it being accurate," Berry said during a session at the 2013 Predictive
Analytics World conference in Boston. "You just press the button, and you'll
learn something. But that may not stand up to reality."

Staffing also gets a spot on the list of predictive modeling and big data analytics
challenges. Skilled data scientists are in short supply, particularly ones with a
combination of big data and predictive analytics experience. That can make it
difficult to find qualified data analysts and modelers to lead big data analytics
projects.

Analytics skills shortage requires hiring flexibility


Mark Pitts, vice president of enterprise informatics, data and analytics at
Highmark Inc., said it's uncommon for data analysts to come out of college with
all the skills that the Pittsburgh-based medical insurer and healthcare services
provider wants them to have. Pitts looks for people who understand the technical
aspects of managing data, have quantitative analysis skills and know how to use
predictive analytics software; it also helps if they understand business concepts.
But the full package is hard to find. "All of those things are very rare in
combination," he said. "You need that right personality and aptitude, and we can
build the rest."

Along those lines, a computer engineer on Pitts' staff had a master's degree in
business administration but didn't really know anything about statistical analysis.
Highmark paid for the engineer to go back to school to get a master's degree in
statistics as well. Pitts said he identified the worker for continuing education
support not only because the engineer had some of the necessary qualifications
but also because he had a personality trait that Pitts is particularly interested in:
curiosity.

At DataSong, Nesbitt typically looks for someone with a Ph.D. in statistics and
experience using the R programming language, which the company uses to build
its predictive models with R-based software from Revolution Analytics. "To
work on our team, where we're building models all the time and we're knee-deep
in data, you have to have technical skills," she said.

Ultimately, though, those skills must be put to use to pull business value out of
an organization's big data vaults. "The key to remain focused on is that this isn't
really a technical problem -- it's a business problem," said Tony Rathburn, a
senior consultant and training director at The Modeling Agency, an analytics
consultancy in Pittsburgh. "That's the real issue for the analyst: setting up the
problem in a way that actually provides value to a business unit. That point
hasn't changed, regardless of the amount of data."

Faster modeling techniques in predictive analytics pay off

Enova International Inc., a Chicago-based online financial services firm, has
been investing heavily in Ph.D.-level data scientists. But that
approach to building an analytics team raises a question: How do you adapt
academic predictive modeling techniques to business processes?

Joe DeCosmo, Enova's chief analytics officer, said a member of his team
recently told him that when the analyst first started working at the company, he
had to get over his academic instincts to detail every theory-based aspect of
the predictive models he builds in order to focus more on the business impact the
models can have.

"They have to realize they don't have to build the perfect model," DeCosmo said.
"It's about building something that's better than what we're doing currently."

This issue is heating up as more businesses look for workers with data science
skills. Often, the people who have the skills organizations need, which include
statistical analysis, machine learning, and R and Python programming, come
from academic backgrounds. But businesses don't have the kind of time that
Ph.D. programs give students to build analytical models. In the real world,
models need to be built and deployed quickly to help drive timely business
strategies and decisions.

Focus on perfection doesn't pay


About 20% of the people on the analytics team at Enova have doctorates.
DeCosmo said most of the analysts come around to a more business-focused
way of doing things once they see how the end product of their work can
improve a specific business process. For example, Enova recently applied
predictive modeling techniques to identify suitable recipients for a direct mail
marketing campaign, to better target the mailing. That helped improve response
rates by about 25%, according to DeCosmo. The model may not have been
perfect, he added, but the kind of rapid improvement it led to helps data
scientists understand and appreciate the value of their work.

"At our scale, if we can get a model into production that's 10% better, that adds
material impact to our business," DeCosmo said.

There's always a tradeoff between time and predictive power when developing
analytical models. Spending more time on development to make a model better
could allow a data scientist to discover new correlations that boost the strength
of its predictions. But DeCosmo said he sees more business value in speedy
development.

"We're very focused on driving down the time [it takes to develop models]," he
said. "There's no such thing as a perfect model, so don't waste your time trying to
build one. We'd rather get that model out into production."

Simplicity drives predictive modeling speed


For Tom Sturgeon, director of business analytics at Schneider Electric's U.S.
operations in Andover, Mass., the top priority is empowering business
analysts to do some straightforward reporting themselves and free up his team of
data scientists to focus on more strategic analysis work.
Schneider Electric is an energy management company that sells products and
services aimed at making energy distribution and usage by corporate clients
more efficient. In the past, for every new report or analysis a business unit
wanted, Sturgeon and his team would have to pull data out of a complex
architecture of ERP, CRM and business intelligence systems, all of which were
themselves pulling data from back-end data stores. Sturgeon described these
systems as middlemen because they hold a lot of useful data, but on their own
don't make data easily accessible. His team had to manually pull data out, work
that in itself delivers less value than the actual analysis.

But since 2013, they've been using a "data blending" tool from Alteryx Inc. to
bring all the data into an analytics sandbox that business analysts can access with
Tableau's data discovery and visualization software. Sturgeon said that allows
the business analysts to skip the "middleman" reporting systems and build their
own reports, while his team does deeper analyses.

"We take the data and bring it together," he said. "Then we say, 'Here's the
sandbox, here are some tools, what questions do you want to ask?'"

Even when doing more data science work, though, the focus is on simplicity.
The analytics team is still working to develop its predictive capabilities, so for
now it's starting small. For example, it recently looked to see if there was a
correlation between macroeconomic data published by the Federal Reserve and
Schneider Electric's sales. The goal was to improve sales forecasting and set
more reasonable goals for the company's salespeople. The analysts could have
brought in additional economic data from outside sources to try to strengthen the
correlation, but they instead prioritized a basic approach.

"We aren't looking to build the best predictive model," Sturgeon said. "We're
starting simple and trying to gain traction."

Predictive modeling isn't BI as usual


In looking to unleash effective and speedy predictive modeling techniques in an
organization, bringing a standard business intelligence mindset to the process
won't cut it, said Mike Lampa, managing partner at consultancy Archipelago
Information Strategies.
Speaking at the TDWI Executive Summit in Las Vegas, Lampa said workers
involved in predictive analytics projects need to have much more freedom than
traditional BI teams, which typically spend a lot of time initially gathering
project requirements. That would be a waste of time in a predictive project, he
added. Meaningful correlations are often found in unexpected data sets and may
lead to recommendations that business managers weren't necessarily looking for.

Setting project requirements at the outset could slow down the analytics process
and limit the insights that get generated, Lampa cautioned, adding that data
scientists have to be able to go where the data takes them. "You can't create
effective models when you're always tied down to predetermined specifications,"
he said.

Business focus is key when applying predictive analytics models

At the oil and gas drilling company Halliburton, traditional BI is still important,
but there is a growing emphasis on predictive analytics models. One company
official said this trend is going to be the key to differentiating the Houston-based
firm from its competitors and making it more successful.

"You can do as much business intelligence as you want but it's not going to help
you win against your competitors in the long run," said Satyam Priyadarshy,
chief data scientist at Halliburton, in a presentation at the Predictive Analytics
World conference in Boston. He added that predictive modeling is going to be a
"game changer."

But simply doing predictive analytics modeling isn't enough. For Priyadarshy
and other conference presenters, predictive initiatives are only successful when
they are business-oriented and narrowly tailored to address specific problems.

Predictive modeling is a stat-heavy, technically intensive exercise. But when
implementing a predictive modeling program within a company, it's important
not to get bogged down in these areas, so that projects deliver true business
value.

For Priyadarshy, this approach means breaking down some of the data silos that
inevitably spring up. During the process of exploring and drilling a new gas or
oil well, tremendous volumes of data are generated. But they come from several
different departments. For example, data from seismic surveys of sites have
traditionally not been shared with the drilling operations teams, Priyadarshy said.
But there's an obvious need for the crews manning the drills to know what kind
of material they're likely to hit at certain depths.

Priyadarshy said he and his team are working on a homegrown data platform
that would make this data more accessible. The platform is a combination
of Hadoop, SQL, and in-memory database tools. It also includes a data
virtualization tool that allows different teams to access data wherever it is stored.
Doing so allows drilling teams to build predictive analytics models based on data
coming off of drilling sensors and from seismic surveys. These models allow the
drilling teams to predict in real time how fast they should run the drill bit and
how much pressure to apply.

Having such knowledge separates predictive modeling from traditional BI,
Priyadarshy said. Rather than producing a static BI report that retrospectively
explains certain events during the drilling process, the predictive models allow
teams to make adjustments in real time and address specific problems.
"With predictive models, you want to build actionable things rather than just
dashboards," Priyadarshy said.

Keep predictive modeling projects business-focused


Predictive modeling is most effective when it's used to tackle known business
problems, rather than looking to predict correlations that don't necessarily have
specific business value.

"You want to be clear about what types of problems you're trying to solve," said
Alfred Essa, vice president of analytics at McGraw-Hill Education in Columbus,
Ohio, during a presentation at Predictive Analytics World. "This helps you ask
deeper questions."

McGraw-Hill works with clients -- primarily local school districts and colleges --
to look at their data to predict student performance. McGraw-Hill and the
schools have been able to reliably predict how students are likely to perform in
classes, including which students could fail or drop out, Essa said. But simply
giving this information to schools isn't necessarily helpful. He talks to clients to
make sure they have a plan for how they intend to use the information. Just
telling students they're likely to fail and they need to work harder might actually
backfire, causing them to give up. Schools need to develop curriculums to help
failing students before they do anything with the predictions, he said.

For Essa, the answer to this kind of question often comes during exploratory data
analysis. This early stage of modeling typically involves just looking at the data,
graphing various elements and trying to get a feel for what's in the data. This
stage can help modelers see variables that may point to trends, Essa said. In the
case of predicting student failure, they may be able to see factors that lead
students to fail, enabling schools to address those underlying issues -- action
that goes beyond just a predictive model.

"Before you start to do modeling, it's really helpful to pose questions and
interactively get answers back," Essa said.

Simplify outputs of predictive analytics models


There are always statistics underpinning any predictive model, and they're useful
to the modelers. But for the lines of business that interact with the results of
predictive models, these stats are nothing but a distraction.
Instead, predictions need to be clear and concise, said Patrick Surry, chief data
scientist at airfare prediction mobile app Hopper, based in Cambridge, Mass. He
talked about how one of Hopper's competitors gives customers purchasing
recommendations as a confidence interval. The problem is that few people
understand what the site means when it says, for example, it's 70% confident a
given price is the lowest that can be expected. Similarly, when Hopper was
testing its service it used the word "forecast" to talk about changes customers
should expect in prices. Surry said this just made people think Hopper was
talking about the weather.

"When you watch people try to interact with predictions, there are things you
don't even think about," he said. "As soon as you put the word 'confidence' in
there you've lost 90% of the audience."

Today, the Hopper app simply tells users to buy now because prices are as low as
they're likely to get or to wait because a better deal is likely to pop up. There are
some complicated predictive models running behind the scenes analyzing things
like historic price data, prices for given days of the week and month, destinations
and past sales. But Surry said customers don't need to know all these
calculations; they just need to know if they should buy an airline ticket or wait.

Predictive analytics tools point to better business actions

From recommending additional purchases based on the items that customers
place in online shopping carts to pinpointing hospital patients who have a greater
risk of readmission, the use of predictive analytics tools and techniques is
enabling organizations to tap their collections of data to predict future business
outcomes -- if the process is managed properly.

Predictive analytics has become an increasingly hot topic in analytics circles as
more people realize that predictive modeling of customer behavior and business
scenarios is "the big way to get big value out of data," said Mike Gualtieri, an
analyst at Forrester Research Inc. As a result, predictive analytics deployments
are gaining momentum, according to Gualtieri, who said that he has seen an
increase in adoption levels from about 20% in 2012 to "the mid- to high-30%
range" now.

That's still relatively low -- which creates even bigger potential business benefits
for organizations that have invested in predictive analytics software. If a
company's competitors aren't doing predictive analytics, it has "a great
opportunity to get ahead," Gualtieri said.
Predictive analytics projects can also provide those benefits across various
industries, said Eric King, president and founder of The Modeling Agency LLC,
an analytics consulting and training services firm based in Pittsburgh. "Everyone
is overwhelmed with data and starving for information," King noted.

But that doesn't mean it's just a matter of rolling out the technology and
letting analytics teams play around with data. When predictive analytics is done
well, the business benefits can be substantial -- but there are "some mainly
strategic pitfalls" to watch out for, King said. "Many companies are doing
analytics to do analytics, and they aren't pursuing analytics that are measurable,
purposeful, accountable and understandable by leadership."

Data scientists don't know it all


One common mistake is putting too much emphasis on the role of data scientists.
"Businesses think the data scientists have to understand the business," Gualtieri
said. With that in mind, they end up looking for experienced data analysts who
have all the required technical skills and also understand their business practices,
a combination that he warned can be nearly impossible to find. "That's why they
say, 'A data scientist is a unicorn.' But it doesn't have to work that way."

Instead, he recommended, business managers should be the ones who walk
through customer experience management operations or other business processes
and identify the kinds of behaviors and trends they'd like to predict, "then go to
the data scientists and ask if they can predict them."

King agreed that organizations often give data scientists too much responsibility
and leeway in analytics applications.
"They're really not analytics leaders in a lot of cases," he said, adding that data
scientists often aren't very effective at interviewing people from the business side
about their needs or defining analytics project plans. Echoing Gualtieri, King
said a variety of other people, from the business and IT, should also play roles in
predictive analytics initiatives. "When you have the right balance with your
team, you'll end up with a purposeful and thriving analytics process that will
produce results."

Plan ahead on predictive analytics


Companies looking to take advantage of predictive analytics tools also shouldn't
just jump into projects without a plan.
"You can't approach predictive analytics like you do a lot of other IT projects,"
King said. It's important, he advised, to think strategically about an
implementation upfront, plotting out a formal process that starts with a
comprehensive assessment of analytics needs and internal resources and skills.
"That's where we're seeing not only a greater adoption of predictive analytics,
but far greater results," he said.

In addition, companies need to understand the data they have at their disposal
and make it easily accessible for analysis, which is "no small task," according to
Gualtieri. Without an effective data management strategy, analytics efforts can
grind to a halt: "Data scientists consistently report that a large percentage of their
time is spent in the data preparation stage," he said. "If they can't effectively get
that data together or it takes too much time, opportunity is wasted."

Another mistake that some companies make is turning to inexperienced workers
to get the job done, said Karl Rexer, president of consultancy Rexer Analytics in
Winchester, Mass.

"Predictive analytics requires knowledge of statistics, sample sizes, regression


and other sorts of analytics tools and techniques that isn't commonly found
inside the current staffs that businesses have," he said. If hiring experienced
workers isn't an option, he suggested outsourcing initial pilot programs to
external experts who can help produce some early successes while also working
to transfer the needed skills to existing staffers.

Once those skills are in place and projects are under way, Rexer said a key to
getting good results from predictive analytics techniques is focusing on one
business initiative at a time -- for example, customer retention or getting online
shoppers to add more items to their carts. In some cases, companies think "they
can take all the data, throw it in [predictive models] and magically insights are
going to come out," he said. "Predictive analytics can be very helpful, but it's not
magic. You need to be tightly focused."

Descriptive Analytics
Descriptive modeling is a mathematical process that describes real-world events
and the relationships between factors responsible for them. The process is used
by consumer-driven organizations to help them target their marketing and
advertising efforts.

In descriptive modeling, customer groups are clustered according to
demographics, purchasing behavior, expressed interests and other descriptive
factors. Statistics can identify where the customer groups share similarities and
where they differ. The most active customers get special attention because they
offer the greatest ROI (return on investment).

The main aspects of descriptive modeling include:

Customer segmentation: Partitions a customer base into groups with
various impacts on marketing and service (a brief clustering sketch follows this list).
Value-based segmentation: Identifies and quantifies the value of a
customer to the organization.
Behavior-based segmentation: Analyzes customer product usage and
purchasing patterns.
Needs-based segmentation: Identifies ways to capitalize on motives that
drive customer behavior.
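
As a rough illustration of how such segmentation is often done in practice, the
sketch below clusters customers with k-means using scikit-learn. It is only a
minimal sketch: the file name and the columns (age, annual_spend,
orders_per_year) are hypothetical stand-ins for whatever descriptive factors an
organization actually tracks.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customer extract with descriptive factors
customers = pd.read_csv("customers.csv")
features = customers[["age", "annual_spend", "orders_per_year"]]

# Put the factors on a common scale, then group customers into four segments
scaled = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled)
customers["segment"] = kmeans.labels_

# Summarize each segment so marketing can see where the groups differ
print(customers.groupby("segment")[["annual_spend", "orders_per_year"]].mean())

The resulting per-segment averages are the kind of summary a marketing team
would use to decide which group merits special attention.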

Descriptive modeling can help an organization to understand its customers,
but predictive modeling is necessary to facilitate the desired outcomes. Both
descriptive and predictive modeling constitute key elements of data
mining and Web mining.

Mastering descriptive data analysis yields better predictions

Big data and business analytics may not be considered ubiquitous quite yet, but
they are getting there.

Total big data revenues over the past several years have grown exponentially and
will approach $50 billion by 2017, according to business technology consultancy
Wikibon. Forbes cited a 2015 Capgemini global study predicting a 56%
increase in big data investments over three years. And Computer Science Corp.
estimates data production overall by 2020 will be 44 times what it was in 2009.

The analytics needed to work with all of that data is growing just as fast. But
analytics comes in many flavors, with the descriptive and predictive varieties
being the biggest and most useful. Yet, of the two, descriptive is embraced far
more by businesses than predictive.

Today, 90% of organizations use some form of descriptive analytics, which
includes methods of mining historical data as well as real-time streams to extract
useful facts to explain the data. The functions employed by descriptive data
analysis include social analytics, production and distribution metrics, and
correlations between operational results and changes in process.

Peering into the crystal ball


Predictive analytics involves processing big data to forecast future
outcomes beyond the simple trending produced by business intelligence. It
allows the enterprise to game out complex what-if scenarios, create accurate
models for future performance, identify correlations that aren't intuitively
obvious and perform more thorough root-cause analysis. With these capabilities,
an organization can forecast customer/client behaviors, predict logistical failures,
anticipate changes in purchasing patterns and make more accurate
credit/procurement decisions.

Descriptive data analysis is considered fairly easy since it can be implemented
with the standard aggregate functions built into most databases and knowledge
of basic high school math. In contrast, predictive analytics calls for a strong
command of statistics, college-level math (linear regression and so on) and often
specialized software. Most organizations have the in-house resources to do
descriptive analytics, while predictive analytics requires recruiting
specialists and very often the purchase of new systems.
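
The contrast can be seen in a few lines of Python. The descriptive half needs
nothing beyond built-in aggregates, while even the simplest predictive step
requires fitting a statistical model; the file and column names here are
hypothetical.

import pandas as pd
from sklearn.linear_model import LinearRegression

sales = pd.read_csv("monthly_sales.csv")    # hypothetical columns: month_index, revenue

# Descriptive: standard aggregates answer "what has happened?"
print(sales["revenue"].describe())          # count, mean, min, max, quartiles

# Predictive: even a basic forecast means fitting a regression model
model = LinearRegression().fit(sales[["month_index"]].values, sales["revenue"])
next_month = [[sales["month_index"].max() + 1]]
print("Forecast for next month:", model.predict(next_month)[0])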

Yet the gap between descriptive and predictive ostensibly isn't as great as it
seems. The resulting data from both methodologies is gathered for the sole
purpose of answering questions, albeit different questions. For descriptive data,
it's "What has happened?" and for predictive data, "What might happen next?"

An organization with a solid grasp of descriptive analytics is well on its way to
embracing predictive. The reason is simple: Predictions are generally granular,
focusing on the behavior or performance of one unit or individual among many
-- for example, "What will this person buy?" "Is this customer a credit risk?"
Descriptive data analysis creates the rules and conditions that lead to accurate
predictive analytics. You can't have the latter without the former. With well-
constructed descriptive models, predictive models are much, much easier.
Descriptive models take large amounts of data and employ well-crafted rules of
classification for that data to organize many units or individuals into useful
subgroupings. They then condense that data into the factors -- the characteristics
of the people or processes being assessed -- that predictive models require.
Descriptive analytics is generally context-free -- no correlation to other data is
usually involved -- while predictive analytics is all context. The accurate
assessment of what a customer might want or need is greatly enhanced by
knowing what the customer is doing. Customers with similar profiles and
shopping patterns may vary wildly if the context of their shopping changes.
Teenagers, for instance, buy different clothes in different quantities if they're
about to leave home for college. When context is factored in via predictive
analytics, the result is highly refined forecasting.

Decisions times two


Descriptive data analysis, therefore, is the first step on a journey that leads to
predictive modeling power, an enterprise-transforming combination. The
outcome of combining these two methodologies rests with the ability to
create decision models.

A decision model incorporates all the information necessary to generate an
actionable decision from the output of the descriptive model and the predictions,
and past decisions, it generates. That increases the accuracy and efficiency of
decision making by enabling optimization -- the ability to fine-tune processes and
institutional behaviors based on the successful implementation of analytics.

This decision model leads to the next level, prescriptive analytics: a methodology
for choosing effective courses of action from available options, and a very
different beast from descriptive and predictive analytics. The point is that no
branch of analytics exists in isolation; each methodology feeds into the next and
adds a new layer of sophistication and functionality to the process.

The ultimate goal is to view analytics not simply as implementing a new process
or tool, but more as important steps in an enterprise's evolution on the way
to perpetual growth and change.

Prescriptive Analytics
Prescriptive analytics is the area of business analytics (BA) dedicated to finding
the best course of action for a given situation.

Prescriptive analytics is related to both descriptive and predictive analytics.


While descriptive analytics aims to provide insight into what has happened and
predictive analytics helps model and forecast what might happen, prescriptive
analytics seeks to determine the best solution or outcome among various choices,
given the known parameters.

Prescriptive analytics can also suggest decision options for how to take
advantage of a future opportunity or mitigate a future risk, and illustrate the
implications of each decision option. In practice, prescriptive analytics can
continually and automatically process new data to improve the accuracy of
predictions and provide better decision options.

A process-intensive task, the prescriptive approach analyzes potential decisions,
the interactions between decisions, the influences that bear upon these decisions
and the bearing all of the above has on an outcome to ultimately prescribe an
optimal course of action in real time. Prescriptive analytics is not failproof,
however; it is subject to the same distortions that can upend descriptive and
predictive analytics, including data limitations and unaccounted-for external
forces. The effectiveness of prescriptive analytics also depends on how well the
decision model captures the impact of the decisions being analyzed.

Advancements in the speed of computing and the development of complex
mathematical algorithms applied to the data sets have made prescriptive analysis
possible. Specific techniques used in prescriptive analytics include optimization,
simulation, game theory and decision-analysis methods.
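
As a hedged illustration of the optimization technique, the toy linear program
below recommends a production mix with SciPy. All of the profit and capacity
figures are invented; a real prescriptive model would draw them from business
rules and operational data.

from scipy.optimize import linprog

# Maximize 40*x1 + 30*x2 in profit; linprog minimizes, so negate the objective
c = [-40, -30]                 # profit per unit of products 1 and 2
A = [[2, 1],                   # machine hours consumed per unit
     [1, 3]]                   # labor hours consumed per unit
b = [100, 90]                  # available machine and labor hours

result = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)], method="highs")
print("Recommended production plan:", result.x)
print("Expected profit:", -result.fun)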

Prescriptive models take analytics one step beyond

In the beginning, there was descriptive analytics: data-parsing methodologies that
cleverly analyzed large amounts of data about customers, products, financials or
most anything else and yielded insightful new categories for those
items. Predictive analytics then followed as an even more dazzling practice that
could fine-tune our understanding of "what comes next" with great accuracy and
granularity so we could maximize our time and investment in planning for the
outcome we want.

Next in line is prescriptive analytics: the science of outcomes. It's less intuitive
and much harder to embrace, partly because it feeds the enterprise the kind of
news we don't necessarily want to hear. Descriptive and predictive results simply
provide better data for making decisions -- always a good thing and an important
refinement of what is already happening. But prescriptive results take it a step further: They tell
us what to do. That makes prescriptive at least as important as its siblings in
moving the enterprise forward.
Prescriptive models don't just inform those involved in the decision-making
process, they are the decision-making process. They articulate the best outcome,
which can create friction among those who aren't comfortable relinquishing their
decision-making responsibilities to a machine.

Playing by the (changing) rules


Prescriptive models also require careful framing, or rules, to produce outcomes
according to the best interests of the business. When prescriptive analytics is
applied, the process itself needs to include as much information as possible
about the enterprise by creating a framework for interpreting the prescriptive
results. That framework is built on business rules.

Business rules defining the enterprise's operations serve to gauge the impact of
prescriptive recommendations on operations, efficiency and the bottom
line. Projected outcomes are brought in line with institutional priorities, values
and goals. The rules are based on established policy, best practices, internal and
external constraints, local and global objectives, and so on. They determine to
what degree prescriptive recommendations and anticipated outcomes truly work.

Constructing those rules may be an exhaustive, time-consuming and meticulous
undertaking, requiring participation from all areas of an organization. Yet, the
toughest work is still to come.

The rules must be dynamic; organic; and, to some degree, fluid. The entire point
of an analytics-based institutional culture is acquiescence to the objective reality
of real-world data. A corporate self-image based on that data will necessarily
evolve. It follows that the business rules driving prescriptive analytics must also
evolve. Therefore, the prescriptive process and the successful outcomes it
delivers will feed back into the rules and steadily refine them.
An electronics manufacturer in southern Indiana put this idea to work in
selecting its optimum long-term customer contracts. Though its headquarters are
in the U.S., most of its actual manufacturing facilities are located on other
continents. Capacity to manufacture and deliver in those other countries is
governed by a number of risk factors involving fluctuating availability of raw
materials, economic conditions affecting logistics and employee turnover. So the
business rules applied to the company's contract evaluation process are critical to
the accuracy of analysis and must be adjusted frequently.

Data inside and out


Another daunting challenge is hybridization of inputs into the prescriptive
process. Descriptive and predictive processes use data that's carefully
preformatted and well-thought-out. Prescriptive processes must model diverse
facts, features and events from inside and outside the enterprise. That's
called environmental data, and it can be messy because it's composed of
unstructured and multi-sourced data that potentially includes everything from
internet posts and video to free-form text based on speeches and white papers.

Codifying and classifying this diverse mass of data is cumbersome and
expensive -- perhaps the most off-putting of all the components of prescriptive
analytics. Building processes to capture and format this kind of data can be
viewed as a serious impediment to implementation. Yet, the task is essential. It
can mean the difference between adequate and perfect modeling solutions.

The healthcare industry has been a leader in modeling prescriptive solutions with
the environment. The service provider's need for efficiency is greater than ever
because of the massive changes in healthcare economics in recent years.
Capacity planning is a key factor in optimizing logistics and resources for
service delivery. The models incorporate vast amounts of environmental data,
including highly granular demographics, trends in health by region, and
economic conditions at both national and regional levels. By using these models,
many healthcare providers are adjusting near-term and long-term investment
plans for optimal service delivery.

Prescriptive analytics closes the big data loop. It's a natural endpoint for the
descriptive and predictive processes that precede it. Whatever the hype and
hoopla surrounding prescriptive models, their success depends on a combination of
mathematical innovation, mastery of data and old-fashioned hard work.

Diagnostic Analytics
Diagnostic Analytics: In this type of analysis, we generally rely on historical data
to answer a question or solve a problem, looking for dependencies and patterns
in the historical data surrounding that particular problem. Companies go for this
analysis because it gives great insight into a problem, and because keeping
detailed historical data at their disposal means data collection doesn't have to
start from scratch for every new problem, which would be very time-consuming.
Common techniques used for Diagnostic Analytics are:
Data discovery
Data mining
Correlations

Data Discovery
“We are drowning in information but starved for knowledge,” according to
best-selling author John Naisbitt. Today’s businesses can collect piles of
information on everything from customer buying patterns and feedback to
supplier lead times and marketing efforts. Yet it is nearly impossible to
draw value and truth from the massive amount of data your business
collects without a data discovery system in place.
Data discovery is a term related to business intelligence technology. It is the
process of collecting data from your various databases and silos, and
consolidating it into a single source that can be easily and instantly evaluated.
Once your raw data is converted, you can follow your train of thought by drilling
down into the data with just a few clicks. Once a trend is identified, the software
empowers you to unearth the contributing factors.
For instance, BI enables you to explore the data by region, different employees,
product type, and more. In a matter of seconds, you have access to actionable
insights to make rapid, fact-based decisions in response to your discoveries.
Without BI, discovering a trend is usually a case of coincidence.
With data discovery, the user searches for specific items or patterns in a data
set. Visual tools make the process fun, easy-to-use, swift, and intuitive.
Visualization of data now goes beyond traditional static reports. BI
visualizations have expanded to include geographical maps, pivot-tables, heat
maps, and more, giving you the ability to create high-fidelity presentations of
your discoveries.
Discover trends you did not know were there
With data discovery, executives are often shocked to discover trends they didn’t
know were there. Michael Smith of the Johnston Corporation had this to say
after implementing Phocas:
"Five minutes into the demo, I had found items that didn't have the margin I was
expecting, customers that didn't have the profitability I was expecting and
vendors that weren't performing the way I expected. I realised that we were onto
something that would be very impactful to our business."
Such discoveries allow companies to spot unfavourable trends before they
become a problem and take action to avoid losses.
Data Mining
Data mining is the process of sorting through large data sets to identify patterns
and establish relationships to solve problems through data analysis. Data mining
tools allow enterprises to predict future trends.

In data mining, association rules are created by analyzing data for frequent
if/then patterns, then using the support and confidence criteria to locate the most
important relationships within the data. Support is how frequently the items
appear in the database, while confidence is the number of times if/then
statements are accurate.
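
Support and confidence are simple ratios, as the small sketch below shows for a
hypothetical rule "if bread, then butter" over a handful of made-up transactions.

# Toy transactions; in practice these would come from a sales database
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(transactions)
has_bread = [t for t in transactions if "bread" in t]
has_both = [t for t in has_bread if "butter" in t]

support = len(has_both) / n                  # how often bread and butter appear together
confidence = len(has_both) / len(has_bread)  # how often the if/then rule holds when bread appears
print(f"support={support:.2f}, confidence={confidence:.2f}")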
Other data mining parameters include Sequence or Path
Analysis, Classification, Clustering and Forecasting. Sequence or Path Analysis
parameters look for patterns where one event leads to another later event. A
Sequence is an ordered list of sets of items, and it is a common type of data
structure found in many databases. A Classification parameter looks for new
patterns, and might result in a change in the way the data is organized.
Classification algorithms predict variables based on other factors within the
database.

Clustering parameters find and visually document groups of facts that were
previously unknown. Clustering groups a set of objects and aggregates them
based on how similar they are to each other.

There are different ways a user can implement clustering, which differentiate the
various clustering models. Forecasting parameters within data mining can
discover patterns in data that lead to reasonable predictions about the future, an
approach also known as predictive analysis.

Data mining tools and techniques


Data mining techniques are used in many research areas, including
mathematics, cybernetics, genetics and marketing. While data mining techniques
are a means to drive efficiencies and predict customer behavior, if used correctly,
a business can set itself apart from its competition through the use of predictive
analysis.
Web mining, a type of data mining used in customer relationship management,
integrates information gathered by traditional data mining methods and
techniques over the web. Web mining aims to understand customer behavior and
to evaluate how effective a particular website is.
Other data mining techniques include network approaches based on multitask
learning for classifying patterns, ensuring parallel and scalable execution of data
mining algorithms, the mining of large databases, the handling of relational and
complex data types, and machine learning. Machine learning is a type of data
mining tool that designs specific algorithms from which to learn and predict.

Benefits of data mining


In general, the benefits of data mining come from the ability to uncover hidden
patterns and relationships in data that can be used to make predictions
that impact businesses.

Specific data mining benefits vary depending on the goal and the industry. Sales
and marketing departments can mine customer data to improve lead conversion
rates or to create one-to-one marketing campaigns. Data mining information on
historical sales patterns and customer behaviors can be used to build prediction
models for future sales, new products and services.

Companies in the financial industry use data mining tools to build risk models
and detect fraud. The manufacturing industry uses data mining tools to improve
product safety, identify quality issues, manage the supply chain and improve
operations.

Text Mining
Text mining (text analytics) is the process of exploring and analyzing large
amounts of unstructured text data aided by software that can identify concepts,
patterns, topics, keywords and other attributes in the data. It's also known as text
analytics, although some people draw a distinction between the two terms; in
that view, text analytics refers to the application that uses text mining techniques
to sort through data sets.

Text mining has become more practical for data scientists and other users due to
the development of big data platforms and deep learning algorithms that
can analyze massive sets of unstructured data.
Mining and analyzing text helps organizations find potentially valuable business
insights in corporate documents, customer emails, call center logs, verbatim
survey comments, social network posts, medical records and other sources of
text-based data. Increasingly, text mining capabilities are also being incorporated
into AI chatbots and virtual agents that companies deploy to provide automated
responses to customers as part of their marketing, sales and customer service
operations.

How text mining works


Text mining is similar in nature to data mining, but with a focus on text instead
of more structured forms of data. However, one of the first steps in the text
mining process is to organize and structure the data in some fashion so it can be
subjected to both qualitative and quantitative analysis.

Doing so typically involves the use of natural language processing (NLP)
technology, which applies computational linguistics principles to parse and
interpret data sets.

The upfront work includes categorizing, clustering and tagging text;
summarizing data sets; creating taxonomies; and extracting information about
things like word frequencies and relationships between data entities. Analytical
models are then run to generate findings that can help drive business strategies
and operational actions.

In the past, NLP algorithms were primarily based on statistical or rules-based
models that provided direction on what to look for in data sets. In the mid-2010s,
though, deep learning models that work in a less supervised way emerged as an
alternative approach for text analysis and other advanced analytics applications
involving large data sets. Deep learning uses neural networks to analyze data
using an iterative method that's more flexible and intuitive than what
conventional machine learning supports.

As a result, text mining tools are now better equipped to uncover underlying
similarities and associations in text data, even if data scientists don't have a good
understanding of what they're likely to find at the start of a project. For example,
an unsupervised model could organize data from text documents or emails into a
group of topics without any guidance from an analyst.
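
A minimal sketch of that kind of unsupervised grouping, using classic TF-IDF
features and non-negative matrix factorization from scikit-learn rather than a
deep learning model, might look like this; the documents and the choice of two
topics are purely illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "battery life on this phone is excellent",
    "the screen cracked after one week",
    "delivery was late and the box was damaged",
    "great battery and a bright screen",
    "shipping took too long for such a small package",
]

# Turn the raw text into weighted term features, then factor it into two topics
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
nmf = NMF(n_components=2, random_state=0).fit(X)

terms = tfidf.get_feature_names_out()
for i, topic in enumerate(nmf.components_):
    top_terms = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}:", top_terms)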
Applications of text mining
Sentiment analysis is a widely used text mining application that can track
customer sentiment about a company. Also known as opinion mining, sentiment
analysis mines text from online reviews, social networks, emails, call center
interactions and other data sources to identify common threads that point to
positive or negative feelings on the part of customers. Such information can be
used to fix product issues, improve customer service and plan new marketing
campaigns, among other things.
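
A much simpler, hedged stand-in for the sentiment systems described above is
NLTK's lexicon-based VADER analyzer; the review texts here are invented.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

reviews = [
    "The checkout process was fast and the support team was wonderful.",
    "My order arrived broken and nobody answered my emails.",
]
for review in reviews:
    scores = analyzer.polarity_scores(review)   # neg/neu/pos components plus a compound score
    print(round(scores["compound"], 2), review)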
Other common text mining uses include screening job candidates based on the
wording in their resumes, blocking spam emails, classifying website content,
flagging insurance claims that may be fraudulent, analyzing descriptions of
medical symptoms to aid in diagnoses, and examining corporate documents as
part of electronic discovery processes. Text mining software also offers
information retrieval capabilities akin to what search engines and enterprise
search platforms provide, but that's usually just an element of higher level text
mining applications, and not a use in and of itself.

Chatbots answer questions about products and handle basic customer service
tasks; they do so by using natural language understanding (NLU) technology, a
subcategory of NLP that helps the bots understand human speech and written
text so they can respond appropriately.

Natural language generation (NLG) is another related technology that mines
documents, images and other data, and then creates text on its own. For example,
NLG algorithms are used to write descriptions of neighborhoods for real estate
listings and explanations of key performance indicators tracked by business
intelligence systems.

Benefits of text mining


Using text mining and analytics to gain insight into customer sentiment can help
companies detect product and business problems and then address them before
they become big issues that affect sales. Mining the text in customer reviews and
communications can also identify desired new features to help strengthen
product offerings. In each case, the technology provides an opportunity to
improve the overall customer experience, which will hopefully result in
increased revenue and profits.

Text mining can also help predict customer churn, enabling companies to take
action to head off potential defections to business rivals as part of their
marketing and customer relationship management programs. Fraud detection,
risk management, online advertising and web content management are other
functions that can benefit from the use of text mining tools.

In healthcare, the technology may be able to help diagnose illnesses and medical
conditions in patients based on the symptoms they report.
Text mining challenges and issues
Text mining can be challenging because the data is often vague, inconsistent and
contradictory. Efforts to analyze it are further complicated by ambiguities that
result from differences in syntax and semantics, as well as the use of slang,
sarcasm, regional dialects and technical language specific to individual vertical
industries. As a result, text mining algorithms must be trained to parse such
ambiguities and inconsistencies when they categorize, tag and summarize sets of
text data.

In addition, the deep learning models used in many text mining applications
require large amounts of training data and processing power, which can make
them expensive to run. Inherent bias in data sets is another issue that can lead
deep learning tools to produce flawed results if data scientists don't recognize the
biases during the model development process.

There's also a lot of text mining software to choose from. Dozens of commercial
and open source technologies are available, including tools from major software
vendors such as IBM, Oracle, SAS, SAP and Tibco.

Web Mining
In customer relationship management (CRM), Web mining is the integration of
information gathered by traditional data mining methodologies and techniques
with information gathered over the World Wide Web. (Mining means extracting
something useful or valuable from a baser substance, such as mining gold from
the earth.) Web mining is used to understand customer behavior, evaluate the
effectiveness of a particular Web site, and help quantify the success of a
marketing campaign.

Web mining allows you to look for patterns in data through content mining,
structure mining, and usage mining. Content mining is used to examine data
collected by search engines and Web spiders. Structure mining is used to
examine data related to the structure of a particular Web site and usage mining is
used to examine data related to a particular user's browser as well as data
gathered by forms the user may have submitted during Web transactions.

The information gathered through Web mining is evaluated (sometimes with the
aid of software graphing applications) by using traditional data
mining parameters such as clustering and classification, association, and
examination of sequential patterns.

In summary: Both descriptive analytics and diagnostic analytics look to the
past to explain what happened and why it happened. Predictive analytics and
prescriptive analytics use historical data to forecast what will happen in the
future and what actions you can take to affect those outcomes. Forward-thinking
organizations use a variety of analytics together to make smart decisions that
help their businesses.
Data Science Process

Data is the lifeblood of modern businesses. Increasingly, getting the most out of
our organization's data with accurate insight and understanding makes a real
difference to business success. As a result, the data scientist has become a
critical hire for companies of all sizes, whether the job is a specialized position
in IT or embedded in a business unit.
Nevertheless, it isn't always clear what we mean by the term data scientist.
A highly qualified data analyst? Someone with a scientific background who
happens to work with data?

Certainly, data scientists typically are experienced in statistics and scripting, and
they often have a technical background, rather than a scientific, liberal arts or
business one. But the critical element of data science -- which does make it a
science rather than just a business practice -- is the importance of process and
experiments.

You likely remember learning about the scientific method in high school.
Scientists come up with theories and hypotheses. They design experiments to
test those hypotheses and then either confirm, reject or, more often, refine the
theory.

Basic business intelligence and reporting typically doesn't follow this process.
Instead, BI and business analysts sift, sort, tabulate and visualize data in order to
support a business case. For example, they may show graphically that sales of
our company's products in the western region are falling and also that this region
includes younger customers compared to other areas. From there, they could
make the case that we need to change our product, marketing or sales strategy
for that region. In BI, the most persuasive data visualization often carries the
argument.

The data scientist takes a different approach. Let's continue to use this sales
example to show how the data science process works, in the following six steps.

The data science process includes these six steps.

Identify a hypothesis of value to the business


In our case, the data scientist can formulate a simple hypothesis based on
questions raised by the sales, marketing and product teams: We think younger
people are less likely to buy our products, so that's driving down sales in the
relatively youthful western region.

In addition, we could come up with a few related hypotheses, such as: It's not
simply that customers in the western region are younger, but also that younger
people typically earn less money and average income is lower there than it is in
other regions.

You can see already that the data scientist must be able to think through different
implications of related hypotheses in order to design the right data science
experiments. Just asking one direct question when analyzing data generally
proves less helpful than asking several. And to get the best results, data scientists
should work with business experts to tease out edge cases and counterexamples
that can help refine their hypotheses.

Gather and prepare the required data


With a hypothesis or a set of them in hand, it's time for the data scientist to get
the right data and prepare it for analysis.

The BI team commonly works with data from a data warehouse that has been
cleaned up, transformed and modeled to reflect business rules and how analysts
have looked at the data in the past. The data scientist, on the other hand,
generally wants to look at data in its raw state, before any rules are applied to it.
Also, data science applications often require more data than what is stored in a
warehouse.

In our example, the company's data warehouse likely includes various details
about customers but perhaps not how they paid for products: by credit card,
cash, online payment, etc. Or we may find that, because data warehouse models
can be cumbersome to modify, the putative system of record is a little out of date
and doesn't yet include newer forms of payment -- exactly the kinds that are
attractive to younger people.

So, the data scientist needs to work with the IT team to get access to the most
detailed data sources that are available and pull together the required data. This
may be business data sourced from ERP, CRM or other operational systems, but
it increasingly also includes web logs, streaming data from IoT devices and
many other types of data. The raw data usually will be extracted and loaded -- or
ingested, as the jargon has it -- into a data lake. For simplicity and convenience,
though, the data scientist most often only works against a sample data set at this
early stage.

And this isn't to say that data scientists do no data preparation work at all. For
sure, they typically don't apply business models or predefined business rules to
the raw data in the manner of a data warehouse developer. But they do spend a
lot of time profiling and cleansing data -- for example, deciding how to handle
missing or outlier values -- and transforming it into structures that are
appropriate for specific machine learning algorithms and statistical models.
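
A small, hypothetical example of that hands-on preparation work in pandas --
filling missing values, capping outliers and reshaping a categorical column for a
machine learning algorithm -- might look like the following; the file and column
names are placeholders.

import pandas as pd

raw = pd.read_csv("raw_customer_orders.csv")   # hypothetical raw extract

# Fill missing ages with the median rather than dropping the rows
raw["age"] = raw["age"].fillna(raw["age"].median())

# Treat order amounts beyond the 99th percentile as outliers and cap them
cap = raw["order_amount"].quantile(0.99)
raw["order_amount"] = raw["order_amount"].clip(upper=cap)

# One-hot encode the payment method so it fits typical ML algorithms
prepared = pd.get_dummies(raw, columns=["payment_method"])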

Experiment with and tune analytical models


Designing the experiment is a critical step in the data science process. Indeed,
some would say it's more of an art than a science. Certainly, it helps if the data
scientist has a good understanding of the business and some insight into what
constitutes interesting variables to consider, in addition to a sense of which
algorithms are likely to give more useful results.

Today, there are numerous data science and machine learning tools that can try
different algorithms and approaches and select the best ones for analytics
applications, without much human intervention. You more or less point the tool
at the data, specify the variables you're interested in and leave it to run. Often
described as automated machine learning platforms, these systems are largely
marketed to business users who function as citizen data scientists, but they're just
as popular with skilled data scientists, who use them to investigate more models
than they could do manually.
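
The core of what such tools automate can be approximated by hand with
cross-validation. The sketch below, which uses a synthetic data set purely for
illustration, compares three common classifiers and reports how each performs.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Synthetic stand-in for a prepared feature matrix and target variable
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation accuracy
    print(f"{name}: mean accuracy {scores.mean():.3f}")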

Even the best model can be improved with some tuning and tweaking of
variables. Sometimes, the data scientist may even want to go back and shape the
data a little differently -- perhaps removing outliers that were left during the
initial data preparation stage. For example, I've seen many cases where the
original data was collected with default values that were convenient but wrong
and potentially misleading.

Select a model and run the data analysis


Once the data scientist has found the best algorithm running against the test data
set, it's time to run the analytics experiment against all of the data.

The results? Well, I can't tell you what they'll be. But, with an interesting
hypothesis, good data and a carefully built model, a data scientist should be able
to find something useful to the business. You may surprise yourself, even at this
stage, with an unexpected discovery. Most often, you'll either confirm or reject
your original hypothesis -- which, of course, is what you set out to do in the first
place.

Going back to our sales example, let's assume the model we decide to run proves
that, yes, younger people are less likely to buy our products -- but with some
important twists, which leads us to the next step.

Present and explain the results to business stakeholders


Remember that the whole point of our experiment was to test some ideas that we
can then take to marketing, sales and product design to give them new insights
into our customers.

What we have, however, is a mass of statistics from our model that the business
users may not understand. Perhaps in general, younger people are indeed less
likely to buy our product -- also, their average purchase is lower than that of
older customers. But some young people buy a lot, resulting in a high median
sale level.

To help business stakeholders understand such complications, data scientists
need another skill -- not an additional technical ability, but one of a set of soft
skills they should have. They must be able to explain the analytics work and tell the
story of the data science experiment and its results. Some businesses even have
data interpreters or analytics translators who specialize in this important task,
describing the implications of analytical models and their findings in business
terms. They and data scientists alike often use data storytelling techniques to
clarify the analytics results and proposed actions.

Prepare and deploy the model for production use


We now have our data in place, our model working and a good business
understanding of what we have discovered. In fact, the business teams have even
thought about how to make some offers on our website to appeal to those elusive
younger customers in the west. Now we need to take the data science work from
the lab and put it into production on operational data as the business is running.

This final step isn't always straightforward. First, updating the analytical model
with fresh data on an ongoing basis may require a different approach to data
loading. What we did manually as an experiment may not be efficient in
practice. Partly for this reason, another role has emerged in many businesses: the
data engineer, whose responsibilities include working closely with data scientists
to make models production-ready.

We should also recognize that, in our example, buying habits change over time,
perhaps with the economy or changes in taste. So, we have to keep the model up
to date and perhaps tune it again in the future. That may also be one of a data
engineer's tasks, although the data scientist must rework a model if it drifts too
much from its original accuracy.

Finally, the model that works best as an experiment may prove expensive to run
in practice. With data analysis increasingly done in the cloud, where we pay for
the use of computing and storage, we may find that some changes make the
model slightly less accurate but cheaper to run. A data engineer can also help
with that, but the trade-off between accuracy and cost can be a tricky choice.

The business side of data science


I've described a basic outline of the data science process. As you can see, there
are some elements that we can call engineering, or even an art. We also need to
bear in mind that in the commercial world, data science is a business. That is to
say, the purpose of our experiments and the success of the process will always be
most usefully focused on straightforward commercial realities.

At its best, data science involves widespread collaboration across business and
IT domains and adds new value to many different facets of an organization's
work.

Data Preparation
The specifics of the data preparation process vary by industry, organization and
need, but the framework remains largely the same.

• Gather data
The data preparation process begins with finding the right data. This can come
from an existing data catalog or can be added ad-hoc.
• Discover and assess data
After collecting the data, it is important to discover each dataset. This step is
about getting to know the data and understanding what has to be done before the
data becomes useful in a particular context.

• Cleanse and validate data


Cleaning up the data is traditionally the most time-consuming part of the data
preparation process, but it’s crucial for removing faulty data and filling in gaps.
Important tasks here include:
Removing extraneous data and outliers.
Filling in missing values.
Conforming data to a standardized pattern.
Masking private or sensitive data entries.
Once data has been cleansed, it must be validated by testing for errors in the data
preparation process up to this point. Oftentimes, an error in the system will
become apparent during this step and will need to be resolved before moving
forward.

• Transform and enrich data


Transforming data is the process of updating the format or value entries in order
to reach a well-defined outcome, or to make the data more easily understood by
a wider audience. Enriching data refers to adding and connecting data with
other related information to provide deeper insights (a brief sketch of both steps
follows this framework).
• Store data
Once prepared, the data can be stored or channeled into a third party application
such as a business intelligence tool, clearing the way for processing and analysis
to take place.
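
To make the transform and enrich steps above a little more concrete, here is a
minimal pandas sketch under the assumption of hypothetical order and
region-lookup files; the column names are invented.

import pandas as pd

orders = pd.read_csv("orders.csv")            # raw order records
regions = pd.read_csv("region_lookup.csv")    # reference data mapping zip_code to region

# Transform: conform free-form dates to a standardized ISO pattern
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders["order_date"] = orders["order_date"].dt.strftime("%Y-%m-%d")

# Enrich: connect orders with related regional information for deeper insight
enriched = orders.merge(regions, on="zip_code", how="left")

# Store: write the prepared data where a BI tool or analysis job can pick it up
enriched.to_csv("prepared_orders.csv", index=False)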
Let’s understand some of the concepts in detail:
Raw Data
Raw data (sometimes called source data or atomic data) is data that has not been
processed for use. A distinction is sometimes made between data
and information to the effect that information is the end product of data
processing. Raw data that has undergone processing is sometimes referred to
as cooked data.

Although raw data has the potential to become "information," it requires
selective extraction, organization, and sometimes analysis and formatting for
presentation. For example, a point-of-sale terminal (POS terminal) in a busy
supermarket collects huge volumes of raw data each day, but that data doesn't
yield much information until it is processed. Once processed, the data may
indicate the particular items that each customer buys, when they buy them, and
at what price. Such information can be further subjected to predictive
technology analysis to help the owner plan future marketing campaigns.

As a result of processing, raw data sometimes ends up in a database, which
enables the data to become accessible for further processing and analysis in a
number of different ways.

Data Profiling
Data profiling is the process of examining, analyzing and reviewing data to
collect statistics surrounding the quality and hygiene of the dataset. Data quality
refers to the accuracy, consistency, validity and completeness of data. Data
profiling may also be known as data archeology, data assessment, data discovery
or data quality analysis.

The first step of data profiling is gathering one or multiple data sources and
its metadata for analysis. The data is then cleaned to unify structure, eliminate
duplications, identify interrelationships and find any anomalies. Once the data is
clean, different data profiling tools will return various statistics to describe the
dataset. This could include the mean, minimum/maximum value, frequency,
recurring patterns, dependencies or data quality risks.
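As a small illustration, the base R sketch below computes a few of these column-level statistics (type, missing count, distinct values, minimum, maximum and mean) for every column of a data frame. The built-in mtcars data is used purely as a stand-in for a real dataset; dedicated profiling tools go much further, but the idea is the same.

# Profile each column of a data frame with a handful of basic statistics
profile_column <- function(x) {
  data.frame(
    type       = class(x)[1],
    n_missing  = sum(is.na(x)),
    n_distinct = length(unique(x)),
    min        = if (is.numeric(x)) min(x, na.rm = TRUE) else NA,
    max        = if (is.numeric(x)) max(x, na.rm = TRUE) else NA,
    mean       = if (is.numeric(x)) mean(x, na.rm = TRUE) else NA
  )
}

profile <- do.call(rbind, lapply(mtcars, profile_column))
profile$column <- names(mtcars)
profile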

For example, by examining the frequency distribution of different values for


each column in a table, an analyst could gain insight into the type and use of
each column. Cross-column analysis can be used to expose embedded value
dependencies and inter-table analysis allows the analyst to discover overlapping
value sets that represent foreign key relationships between entities.

Organizations can use data profiling at the beginning of a project to determine if


enough data has been gathered, if any data can be reused or if the project is
worth pursuing. The process of data profiling itself can be based on specific
business rules that will uncover how the dataset aligns with business standards
and goals.

Profiling tools evaluate the actual content, structure and quality of the data by
exploring relationships that exist between value collections both within and
across data sets. Vendors that offer software and tools that can automate the data
profiling process include Informatica, Oracle and SAS.

Types of data profiling

While all applications of data profiling involve organizing and collecting


information about a database, there are also three specific types of data profiling.

1. Structure discovery: This focuses on the formatting of the data, making
sure everything is uniform and consistent. It also uses basic statistical
analysis to return information about the validity of the data.
2. Content discovery: This process assesses the quality of individual pieces
of data. For example, ambiguous, incomplete and null values are
identified.
3. Relationship discovery: This detects connections, similarities,
differences and associations between data sources.

Benefits of data profiling

Data profiling returns a high-level overview of data that can result in the
following benefits:

Leads to higher quality, more credible data.


Helps with more accurate predictive analytics and decision making.
Makes better sense of the relationships between different datasets and
sources.
Keeps company information centralized and organized.
Eliminates errors associated with high costs, such as missing values or
outliers.
Highlights areas within a system that experience the most data quality
issues, such as data corruption or user input errors.
Produces insights surrounding risks, opportunities and trends.

Examples of data profiling applications

Data profiling can be implemented in a variety of use cases where data quality is
important. For example, projects that involve data warehousing or business
intelligence may require gathering data from multiple disparate systems or
databases for one report or analysis. Applying the data profiling process to these
projects can help identify potential issues and corrections that need to be made
in ETL processing before moving forward.

Additionally, data profiling is crucial in data conversion or data


migration initiatives that involve moving data from one system to another. Data
profiling can help identify data quality issues that may get lost in translation or
adaptions that must be made to the new system prior to migration.

Data Scrubbing
Data scrubbing, also called data cleansing, is the process of amending or
removing data in a database that is incorrect, incomplete, improperly formatted,
or duplicated. An organization in a data-intensive field like banking, insurance,
retailing, telecommunications, or transportation might use a data scrubbing tool
to systematically examine data for flaws by using rules, algorithms, and look-up
tables. Typically, a database scrubbing tool includes programs that are capable of
correcting a number of specific type of mistakes, such as adding missing zip
codes or finding duplicate records. Using a data scrubbing tool can save
a database administrator a significant amount of time and can be less costly than
fixing errors manually.
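A minimal sketch of this kind of rule- and look-up-table-driven scrubbing is shown below in R with dplyr. The records and the city-to-ZIP look-up table are invented for illustration; a real scrubbing tool would draw on maintained reference data and far richer rules.

library(dplyr)

# Hypothetical customer records with missing ZIP codes and a duplicate entry
records <- tibble::tibble(
  name = c("A. Smith", "B. Jones", "B. Jones", "C. Lee"),
  city = c("Springfield", "Riverton", "Riverton", "Fairview"),
  zip  = c(NA, "84065", "84065", NA)
)

# Look-up table mapping cities to ZIP codes (a stand-in for a real reference file)
zip_lookup <- tibble::tibble(
  city    = c("Springfield", "Fairview"),
  zip_ref = c("62704", "97024")
)

scrubbed <- records %>%
  left_join(zip_lookup, by = "city") %>%
  mutate(zip = coalesce(zip, zip_ref)) %>%  # add missing ZIP codes from the look-up table
  select(-zip_ref) %>%
  distinct()                                # drop duplicate records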

Extract, Load, Transform


Extract, Load, Transform (ELT) is a data integration process for transferring raw
data from a source server to a data system (such as a data warehouse or data
lake) on a target server and then preparing the information for downstream uses.

ELT consists of a data pipeline in which three different operations are performed
on the data:

The first step is to Extract the data. Extracting data is the process of identifying
and reading data from one or more source systems, which may be databases,
files, archives, ERP, CRM or any other viable source of useful data.

The second step is to Load the extracted data. Loading is the process of adding
the extracted data to the target database.

The third step is to Transform the data. Data transformation is the process of
converting data from its source format to the format required for analysis.
Transformation is typically based on rules that define how the data should be
converted for usage and analysis in the target data store. Although transforming
data can take many different forms, it frequently involves converting coded data
into usable data using code and lookup tables.
Examples of transformations include:

Replacing codes with values


Aggregating numerical sums
Applying mathematical functions
Converting data types
Modifying text strings
Combining data from different tables and databases
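In practice these transformations are usually written in SQL and executed inside the target data warehouse, but the logic is easy to sketch in R with dplyr (whose verbs the dbplyr package can translate into in-database SQL). The orders and status-code tables below are hypothetical; the sketch converts data types, replaces codes with values via a look-up table and aggregates numerical sums.

library(dplyr)

# Hypothetical raw extract loaded "as is": coded statuses and amounts stored as text
orders <- tibble::tibble(
  order_id    = 1:5,
  status_code = c("S", "C", "S", "R", "C"),
  amount      = c("10.50", "22.00", "5.25", "7.75", "13.00")
)

# Look-up table that decodes status codes into usable values
status_lookup <- tibble::tibble(
  status_code = c("S", "C", "R"),
  status      = c("Shipped", "Cancelled", "Returned")
)

transformed <- orders %>%
  mutate(amount = as.numeric(amount)) %>%            # convert data types
  left_join(status_lookup, by = "status_code") %>%   # replace codes with values
  group_by(status) %>%
  summarise(total_amount = sum(amount))              # aggregate numerical sums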

How ELT works


ELT is a variation of the Extract, Transform, Load (ETL), a data integration
process in which transformation takes place on an intermediate server before it is
loaded into the target. In contrast, ELT allows raw data to be loaded directly into
the target and transformed there.

With an ELT approach, a data extraction tool is used to obtain data from a source
or sources, and the extracted data is stored in a staging area or database. Any
required business rules and data integrity checks can be run on the data in the
staging area before it is loaded into the data warehouse. All data transformations
occur in the data warehouse after the data is loaded.

ELT vs. ETL


The differences between ELT and a traditional ETL process are more significant
than just switching the L and the T. The biggest determinant is how, when and
where the data transformations are performed.

With ETL, the raw data is not available in the data warehouse because it is
transformed before it is loaded. With ELT, the raw data is loaded into the data
warehouse (or data lake) and transformations occur on the stored data.

Staging areas are used for both ELT and ETL, but with ETL the staging areas are
built into the ETL tool being used. With ELT, the staging area is in a database
used for the data warehouse.

ELT is most useful for processing the large data sets required for business
intelligence (BI) and big data analytics. Non-relational and unstructured data is
more conducive for an ELT approach because the data is copied "as is" from the
source. Applying analytics to unstructured data typically uses a "schema on
read" approach as opposed to the traditional "schema on write" used
by relational databases.

Loading data without first transforming it can be problematic if you are moving
data from a non-relational source to a relational target because the data will have
to match a relational schema. This means it will be necessary to identify and
massage data to support the data types available in the target database.

Data type conversion may need to be performed as part of the load process if the
source and target data stores do not support all the same data types. Such
problems can also occur when moving data from one relational database
management system (DBMS) to another, such as, say, Oracle to Db2, because the
data types supported differ from one DBMS to another.
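As a small illustration, the R sketch below coerces a hypothetical extract to the data types the target table expects before loading it; an in-memory SQLite database, reached through the DBI and RSQLite packages, stands in for the real target system.

library(DBI)

# Hypothetical extract in which every field arrives as text
extract <- data.frame(
  id       = c("1", "2", "3"),
  order_dt = c("2016-01-05", "2016-02-11", "2016-03-02"),
  amount   = c("19.99", "5.00", "42.10"),
  stringsAsFactors = FALSE
)

# Convert to the types the target schema expects before loading
extract$id       <- as.integer(extract$id)
extract$order_dt <- as.Date(extract$order_dt)
extract$amount   <- as.numeric(extract$amount)

# Load into the target (an in-memory SQLite database used as a stand-in)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "orders", extract)
dbDisconnect(con)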

ETL should be considered as a preferred approach over ELT when there is a need
for extensive data cleansing before loading the data to the target system, when
there are numerous complex computations required on numeric data and when
all the source data comes from relational systems.

The following chart compares different facets of ETL or ELT:

Order of Processes. ELT: Extract, Load, Transform. ETL: Extract, Transform, Load.
Flexibility. ELT: Because transformation is not dependent on extraction, ELT is more flexible than ETL for adding more extracted data in the future. ETL: More upfront planning should be conducted to ensure that all relevant data is being integrated.
Administration. ELT: More administration may be required, as multiple tools may need to be adopted. ETL: Typically, a single tool is used for all three stages, perhaps simplifying administration effort.
Development Time. ELT: With a more flexible approach, development time may expand depending upon requirements and approach. ETL: ETL requires upfront design planning, which can result in less overhead and development time because only relevant data is processed.
End Users. ELT: Data scientists and advanced analysts. ETL: Users reading reports and SQL coders.
Complexity of Transformation. ELT: Transformations are coded by programmers (e.g., using Java) and must be maintained like any other program. ETL: Transformations are coded in the ETL tool by data integration professionals experienced with the tool.
Hardware Requirements. ELT: Typically, ELT tools do not require additional hardware, instead using existing compute power for transformations. ETL: It is common for ETL tools to require specific hardware with their own engines to perform transformations.
Skills. ELT: ELT relies mostly on native DBMS functionality, so existing skills can be used in most cases. ETL: ETL requires additional training and skills to learn the tool set that drives the extraction, transformation and loading.
Maturity. ELT: ELT is a relatively new practice, and as such there is less expertise and fewer best practices available. ETL: ETL is a mature practice that has existed since the 1990s; there are many skilled technicians, best practices exist, and there are many useful ETL tools on the market.
Data Stores. ELT: Mostly Hadoop, perhaps NoSQL databases; rarely relational databases. ETL: Almost exclusively relational databases.
Use Cases. ELT: Best for unstructured and nonrelational data; ideal for data lakes; can work for homogeneous relational data, too; well-suited for very large amounts of data. ETL: Best for relational and structured data; better for small to medium amounts of data.
Benefits of ELT
One of the main attractions of ELT is the reduction in load times relative to the
ETL model. Taking advantage of the processing capability built into a data
warehousing infrastructure reduces the time that data spends in transit and is
usually more cost-effective. ELT can also be more efficient because it uses the
compute power of modern data storage systems.

When you use ELT, you move the entire data set as it exists in the source
systems to the target. This means that you have the raw data at your disposal in
the data warehouse, in contrast to the ETL approach where the raw data is
transformed before it is loaded to the data warehouse. This flexibility can
improve data analysis, enabling more analytics to be performed directly within
the data warehouse without having to reach out to the source systems for the
untransformed data.

Using ELT can make sense when adopting a big data initiative for analytics. Big
data projects typically involve large volumes and a wide variety of data, both of
which are better suited to an ELT approach.

Uses of ELT
ELT is often used in the following cases:

when the data is structured and the source and target databases are the
same type (e.g., an Oracle source and target);
when the data is unstructured and massive, such as processing and
correlating data from log files and sensors;
when the data is relatively simple, but there are large amounts of it;
when there is a plan to use machine learning tools to process the data
instead of traditional SQL queries; and
when a schema-on-read approach is being used.

ELT tools and software


Although ELT can be performed using separate tools for extracting, loading and
transforming the data, tools exist that integrate all ELT processes. When seeking
an ELT tool, users should look for the ability to read data from multiple sources,
specifically the sources that their organization uses and intends to use. Most
tools support a wide variety of source and target data stores and database
systems.

Users may want tools that can perform both ETL and ELT, as most organizations
are likely to need both data integration techniques.

A data store can be useful for managing a target data mart, data warehouse
and/or data lake. For an ELT approach, NoSQL database management
systems and Hadoop are viable candidates, as are purpose-built data warehouse
appliances. In some cases, a traditional relational DBMS may be appropriate.
Tools to Mine Big Data Analytics

Before it deployed a Hadoop cluster five years ago, retailer Macy's Inc. had big
problems analyzing all of the sales and marketing data its systems were
generating. And the problems were only getting bigger as Macy's pushed
aggressively to increase its online business, further ratcheting up the data
volumes it was looking to explore.
The company's traditional data warehouse architecture had severe processing
limitations and couldn't handle unstructured information, such as text. Historical
data was also largely inaccessible, typically having been archived on tapes that
were shipped to off-site storage facilities. Data scientists and other analysts
"could only run so many queries at particular times of the day," said Seetha
Chakrapany, director of marketing analytics and customer relationship
management (CRM) systems at Macy's. "They were pretty much shackled. They
couldn't do their jobs."

The Hadoop system has alleviated the situation, providing a big data analytics
architecture that also supports basic business intelligence (BI) and reporting
processes. Going forward, the cluster "could truly be an enterprise data analytics
platform" for Macy's, Chakrapany said. Already, along with the analytics teams
using it, thousands of business users in marketing, merchandising, product
management and other departments are accessing hundreds of BI
dashboards that are fed to them by the system.

But there's a lot more to the Macy's big data environment than the Hadoop
cluster alone. At the front end, for example, Macy's has deployed a variety of
analytics tools to meet different application needs. For statistical analysis, the
Cincinnati-based retailer uses SAS and Microsoft's R Server, which is based on
the R open source statistical programming language.

Several other tools provide predictive analytics, data mining and machine
learning capabilities. That includes H2O, Salford Predictive Modeler, the
Apache Mahout open source machine learning platform and KXEN -- the latter
an analytics technology that SAP bought three years ago and has since folded
into its SAP BusinessObjects Predictive Analytics software. Also in the picture
at Macy's are Tableau Software's data visualization tools and AtScale's BI on
Hadoop technology.
A better way to analyze big data
All the different tools are key elements in making effective use of the big data
analytics architecture, Chakrapany said in a presentation and follow-up interview
at Hadoop Summit 2016 in San Jose, Calif. Automating the advanced analytics
process through statistical routines and machine learning is a must, he noted.

"We're constantly in a state of experimentation. And because of the volume of


data, there's just no humanly possible way to analyze it manually," Chakrapany
said. "So, we apply all the statistical algorithms to help us see what's happening
with the business." That includes analysis of customer, order, product and
marketing data, plus clickstream activity records captured from the Macys.com
website.

Similar scenarios are increasingly playing out at other organizations, too. As big
data platforms such as Hadoop, NoSQL databases and the Spark processing
engine become more widely adopted, the number of companies deploying
advanced analytics tools that can help them take advantage of the data flowing
into those systems is also on the rise.

In an ongoing survey on the use of BI and analytics software, 26.7% of some
7,000 respondents, as of November 2016, said their organizations had installed
predictive analytics tools. And, looking forward, predictive analytics topped the
list of technologies for planned investments. It was cited by 39.5% of the
respondents, putting it above data visualization, self-service BI and enterprise
reporting, all more mainstream BI technologies.

A TDWI survey conducted in the second half of 2015 also found increasing
plans to use predictive analytics software to bolster business operations. In that
case, 87% of 309 BI, analytics and data management professionals said their
organizations were already active users of the technology or expected to
implement it within three years. Other forms of advanced analytics, such as
what-if simulations and prescriptive analytics, are similarly in line for increased
usage, according to a report on the survey, which was published last December
(see the "Predicting High Growth" chart).
Predictive analytics use is on the rise.

Algorithms find meaning in data sets


Machine learning tools and other types of artificial intelligence technologies,
deep learning and cognitive computing among them, are also getting increased
attention from technology users and vendors as analytics teams look to
automated algorithms to help them make sense of data sets that are getting larger
and larger.

Progressive Casualty Insurance Co. is another company that's already there. The
Mayfield Village, Ohio-based insurer uses a Hadoop cluster partly to power its
Snapshot program, which awards policy discounts to safe drivers based on
operational data collected from their vehicles through a device that plugs into the
on-board diagnostics port.

The cluster is based on the Hortonworks distribution of Hadoop, as is the one at


Macy's. About 60 compute nodes are dedicated to the Snapshot initiative, and
Progressive's big data analytics architecture includes tools such as SAS, R and
H2O, which the company's data scientists use in analyzing the driving data
processed in the Hadoop system.

The data scientists run predictive algorithms backed up by heavy-duty data


visualizations to help score customers participating in the program on their
driving safety. They also look for bad driving habits and possible mechanical
problems in vehicles, such as alternator issues signaled by abnormal voltage
fluctuations captured as part of the incoming data.

The predictive analytics and machine learning capabilities are "huge," said
Pawan Divakarla, Progressive's data and analytics business leader. "You have so
much data, and you have fancier and fancier models for analyzing it. You need
something to assist you, to see what works."

Going deeper on big data analytics


Yahoo was the first production user of Hadoop in 2006, when the
technology's co-creator, Doug Cutting, was working at the web search and
internet services company, and it claims to be the largest Hadoop user today.
Yahoo's big data analytics architecture includes more than 40,000 nodes running
300-plus applications across 40 clusters that mix Hadoop with its companion
Apache HBase database, the Apache Storm real-time processing engine and
other big data technologies. But the Sunnyvale, Calif., company's use of those
technologies continues to expand into new areas.

"Even after 10 years, we're still uncovering benefits," said Andy Feng, vice
president in charge of Yahoo's big data and machine learning architecture. Feng
estimated that, over the past three years, he has spent about 95% of his time at
work focusing on machine learning tools and applications. In the past, the
automated algorithms that could be built and run with existing machine learning
technologies "weren't capable of leveraging huge data sets on Hadoop clusters,"
Feng said. "The accuracy wasn't that good."

"We always did machine learning, but we did it in a constrained fashion, so the
results were limited," added Sumeet Singh, senior director of product
development for cloud and big data platforms at Yahoo. However, he and Feng
said things have changed for the better in recent years, and in a big way. "We've
seen an amazing resurgence in artificial intelligence and machine learning, and
one of the reasons is all the data," Singh noted.

For example, Yahoo is now running a machine learning algorithm that uses
a semantic analysis process to better match paid ads on search results pages to
the search terms entered by web users; it has led to a 9% increase in revenue per
search, according to Feng. Another machine learning application lets users of
Yahoo's Flickr online photo and video service organize images based on their
visual content instead of the date on which they were taken. The algorithm can
also flag photos as not suitable for viewing at work to help users avoid
potentially embarrassing situations in the office, Feng said.

These new applications were made possible partly through the addition of
graphics processing units to Hadoop cluster nodes, Feng said; the GPUs handle
image processing that conventional CPUs can't. Yahoo also added Spark to the
big data analytics architecture to take over some of the processing work.

In addition, it deployed MLlib, Spark's built-in library of machine learning


algorithms. However, those algorithms turned out to be too basic, Singh said.
That prompted the big data team to develop CaffeOnSpark, a library of deep
learning algorithms that Yahoo has made available as an open source technology
on the GitHub website.
Top Data Analytics Programming Languages
A programming language is a formal language comprising a set of instructions
that produce various kinds of output. These languages are used in computer
programs to implement algorithms and have multiple applications. There are
several programming languages for data science as well. Data scientists should
learn and master at least one language as it is an essential tool to realize various
data science functions.
Data science brings together statistics, data analysis and related methods to
understand and analyze real-world phenomena with data. It draws on theories and
techniques from many fields within the broad areas of statistics, mathematics,
computer science and information science.
Before becoming an expert in data science, learning a programming language is
a crucial requirement. Data scientists should weigh the pros and cons of the
different types of programming languages for data science before making a
decision.
Data science is an exciting field to work in, combining advanced statistical and
quantitative skills with real-world programming ability. There are several
programming languages in which an aspiring data scientist should consider
building expertise. Now, let’s take a look at the top programming languages a
data scientist should master.
R programming language
The R programming language is an open source scripting language for predictive
analytics and data visualization.
The initial version of R was released in 1995 to allow academic statisticians and
others with sophisticated programming skills to perform complex data statistical
analysis and display the results in any of a multitude of visual graphics. The "R"
name is derived from the first letter of the names of its two developers, Ross
Ihaka and Robert Gentleman, who were associated with the University of
Auckland at the time.

The R programming language includes functions that support linear modeling,


non-linear modeling, classical statistics, classifications, clustering and more. It
has remained popular in academic settings due to its robust features and the fact
that it is free to download in source code form under the terms of the Free
Software Foundation's GNU general public license. It compiles and runs
on UNIX platforms and other systems including Linux, Windows and MacOS.

The appeal of the R language has gradually spread out of academia into business
settings, as many data analysts who trained on R in college prefer to continue
using it rather than pick up a new tool with which they are inexperienced.

The R software environment


The R language programming environment is built around a standard command-
line interface. Users leverage this to read data and load it to the workspace,
specify commands and receive results. Commands can be anything from simple
mathematical operators, including +, -, * and /, to more complicated functions
that perform linear regressions and other advanced calculations.

Users can also write their own functions. The environment allows users to
combine individual operations, such as joining separate data files into a single
document, pulling out a single variable and running a regression on the resulting
data set, into a single function that can be used over and over.

Looping functions are also popular in the R programming environment. These


functions allow users to repeatedly perform some action, such as pulling out
samples from a larger data set, as many times as the user wants to specify.
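The short sketch below, using R's built-in mtcars data, combines both ideas: a user-defined function bundles several operations (drawing a sample and fitting a regression), and replicate() repeats the whole operation as many times as specified.

# A user-defined function that combines several operations into one step
sample_and_fit <- function(data, n) {
  rows <- sample(nrow(data), n)              # pull a sample from the larger data set
  fit  <- lm(mpg ~ wt, data = data[rows, ])  # run a regression on the sample
  coef(fit)["wt"]                            # return the estimated slope
}

# Repeat the operation 1,000 times to see how the estimate varies
slopes <- replicate(1000, sample_and_fit(mtcars, 20))
summary(slopes)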

R language pros and cons


Many users of the R programming language like the fact that it is free to
download, offers sophisticated data analytics capabilities and has an active
online community of users they can turn to for support.

Because it's been around for many years and has been popular throughout its
existence, the language is fairly mature. Users can download add-on packages
that enhance the basic functionality of the language. These packages enable
users to visualize data, connect to external databases, map data geographically
and perform advanced statistical functions. There is also a popular user
interface called RStudio, which simplifies coding in the R language.

The R language has been criticized for delivering slow analyses when applied to
large data sets. This is because the language utilizes single-threaded processing,
which means the basic open source version can only utilize one CPU at a time.
By comparison, modern big data analytics thrives on parallel data processing,
simultaneously leveraging dozens of CPUs across a cluster of servers to process
large data volumes quickly.
In addition to its single-threaded processing limitations, the R programming
environment is an in-memory application. All data objects are stored in a
machine's RAM during a given session. This can limit the amount of data R is
able to work on at one time.

R and big data


These limitations have reduced the applicability of the R language in big data
applications. Instead of putting R to work in production, many enterprise users
leverage R as an exploratory and investigative tool. Data scientists will use R to
run complicated analyses on sample data and then, after identifying a meaningful
correlation or cluster in the data, put the finding into production through
enterprise-scale tools.

Several software vendors have added support for the R programming language
to their offerings, allowing R to gain a stronger footing in the modern big data
realm. Vendors including IBM, Microsoft, Oracle, SAS Institute, TIBCO and
Tableau, among others, include some level of integration between their analytics
software and the R language. There are also R packages for popular open source
big data platforms, including Hadoop and Spark.

R language well-suited to analytical data sampling and manipulations


What factors should organizations take into account when evaluating if
the R language is right for their analytics needs?

R is a gloriously specialized language. R's core strength is data sampling and


data manipulation. Suppose you want, for example, to take a random sample of
100 values from a set of data that is normally distributed with a mean of 65.342
and a standard deviation of 2.1. All you need is a single line:

rnorm(100,65.342,2.1)

And from that, R will generate the data you're looking for.
Now, for many people, that might sound unbelievably boring. But the power of
R analytics lies in the application of the language's abilities: It's a perfect tool for
numerical simulations. For example, I recently wanted to perform a Monte Carlo
simulation of a scoring system called the Net Promoter Score (NPS). Monte
Carlo simulations are a vital part of analytics; they allow you to model the
behavior of complex systems in order to be able to understand them. Used by
analytics professionals for many years, they involve random sampling of sets of
numbers thousands or even millions of times.

R excels at creating and running Monte Carlo simulations, and the NPS
simulation described above took a mere nine lines of code. I would love to tell
you that I'm a hero because I managed to do it in nine lines, but that really isn't
the case. The R programming language is simply exceptionally good at
generating huge sets of numbers and then manipulating them. It's also good for
prototyping big data manipulations.
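The author's nine-line simulation itself isn't reproduced here, but a rough sketch of what a Monte Carlo NPS simulation might look like follows; the survey size and the response probabilities are arbitrary assumptions chosen purely for illustration.

# Rough Monte Carlo sketch of the Net Promoter Score (NPS).
# Respondents score 0-10; NPS = % promoters (9-10) minus % detractors (0-6).
set.seed(42)
simulate_nps <- function(n_respondents, probs) {
  scores     <- sample(0:10, n_respondents, replace = TRUE, prob = probs)
  promoters  <- mean(scores >= 9)
  detractors <- mean(scores <= 6)
  100 * (promoters - detractors)
}

# Assumed (illustrative) probabilities for scores 0 through 10
probs <- c(rep(0.03, 7), 0.15, 0.20, 0.22, 0.22)
nps_runs <- replicate(10000, simulate_nps(200, probs))
quantile(nps_runs, c(0.025, 0.5, 0.975))   # plausible range of the score
hist(nps_runs, main = "Simulated NPS distribution", xlab = "NPS")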

How does R manage to be so good at these kinds of tasks? The answer is that it
has a whole raft of functions that are designed specifically for this kind of work.
Where do they come from? R is free and open source. If people want a function
and can't find it, they can write one and add it to the function "bank" that is R.
They have been doing that for about 15 years, which means that most of the
functions you will ever need are already there.

Finally, R is a very easy language to learn -- you can just download the language
and a front-end environment (such as RStudio) and start typing.

So, if you have numerical manipulations you want to perform, particularly


simulations such as Monte Carlos, I really recommend taking a look to
see whether the R language fits your needs. If you don't need to manipulate
numbers in these kinds of ways, R is probably not for you.
Exploratory Data Analysis
The primary aim with exploratory analysis is to examine the data for
distribution, outliers and anomalies to direct specific testing of your hypothesis.
It also provides tools for hypothesis generation by visualizing and understanding
the data usually through graphical representation, a task that statisticians call
exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:

1. Generate questions about your data.


2. Search for answers by visualising, transforming, and modelling your
data.
3. Use what you learn to refine your questions and/or generate new
questions.
EDA is not a formal process with a strict set of rules. More than anything, EDA
is a state of mind. During the initial phases of EDA you should feel free to
investigate every idea that occurs to you. Some of these ideas will pan out, and
some will be dead ends. As your exploration continues, you will home in on a
few particularly productive areas that you’ll eventually write up and
communicate to others.
EDA is an important part of any data analysis, even if the questions are handed
to you on a platter, because you always need to investigate the quality of your
data. Data cleaning is just one application of EDA: you ask questions about
whether your data meets your expectations or not. To do data cleaning, you’ll
need to deploy all the tools of EDA: visualisation, transformation, and
modelling.
Prerequisites
In this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to
interactively ask questions, answer them with data, and then ask new questions.
library(tidyverse)
Questions
“There are no routine statistical questions, only questionable statistical routines.”
— Sir David Cox
“Far better an approximate answer to the right question, which is often vague,
than an exact answer to the wrong question, which can always be made
precise.” — John Tukey
Your goal during EDA is to develop an understanding of your data. The easiest
way to do this is to use questions as tools to guide your investigation. When you
ask a question, the question focuses your attention on a specific part of your
dataset and helps you decide which graphs, models, or transformations to make.
EDA is fundamentally a creative process. And like most creative processes, the
key to asking quality questions is to generate a large quantity of questions. It is
difficult to ask revealing questions at the start of your analysis because you do
not know what insights are contained in your dataset. On the other hand, each
new question that you ask will expose you to a new aspect of your data and
increase your chance of making a discovery. You can quickly drill down into the
most interesting parts of your data—and develop a set of thought-provoking
questions—if you follow up each question with a new question based on what
you find.
There is no rule about which questions you should ask to guide your research.
However, two types of questions will always be useful for making discoveries
within your data. You can loosely word these questions as:

1. What type of variation occurs within my variables?


2. What type of covariation occurs between my variables?
The rest of this chapter will look at these two questions. I’ll explain what
variation and covariation are, and I’ll show you several ways to answer each
question. To make the discussion easier, let’s define some terms:

A variable is a quantity, quality, or property that you can measure.


A value is the state of a variable when you measure it. The value of
a variable may change from measurement to measurement.
An observation is a set of measurements made under similar
conditions (you usually make all of the measurements in an
observation at the same time and on the same object). An observation
will contain several values, each associated with a different variable.
I’ll sometimes refer to an observation as a data point.
Tabular data is a set of values, each associated with a variable and
an observation. Tabular data is tidy if each value is placed in its own
“cell”, each variable in its own column, and each observation in its
own row.
Variation
Variation is the tendency of the values of a variable to change from
measurement to measurement. You can see variation easily in real life; if you
measure any continuous variable twice, you will get two different results. This is
true even if you measure quantities that are constant, like the speed of light. Each
of your measurements will include a small amount of error that varies from
measurement to measurement. Categorical variables can also vary if you
measure across different subjects (e.g. the eye colors of different people), or
different times (e.g. the energy levels of an electron at different moments). Every
variable has its own pattern of variation, which can reveal interesting
information. The best way to understand that pattern is to visualise the
distribution of the variable’s values.

Visualising distributions
How you visualise the distribution of a variable will depend on whether the
variable is categorical or continuous. A variable is categorical if it can only take
one of a small set of values. In R, categorical variables are usually saved as
factors or character vectors. To examine the distribution of a categorical variable,
use a bar chart:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
The height of the bars displays how many observations occurred with each x
value. You can compute these values manually with dplyr::count():
diamonds %>%
count(cut)
#> # A tibble: 5 x 2
#> cut n
#> <ord> <int>
#> 1 Fair 1610
#> 2 Good 4906
#> 3 Very Good 12082
#> 4 Premium 13791
#> 5 Ideal 21551
A variable is continuous if it can take any of an infinite set of ordered values.
Numbers and date-times are two examples of continuous variables. To examine
the distribution of a continuous variable, use a histogram:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
You can compute this by hand by
combining dplyr::count() and ggplot2::cut_width():
diamonds %>%
count(cut_width(carat, 0.5))
#> # A tibble: 11 x 2
#> `cut_width(carat, 0.5)` n
#> <fct> <int>
#> 1 [-0.25,0.25] 785
#> 2 (0.25,0.75] 29498
#> 3 (0.75,1.25] 15977
#> 4 (1.25,1.75] 5313
#> 5 (1.75,2.25] 2002
#> 6 (2.25,2.75] 322
#> # … with 5 more rows
A histogram divides the x-axis into equally spaced bins and then uses the height
of a bar to display the number of observations that fall in each bin. In the graph
above, the tallest bar shows that almost 30,000 observations have a carat value
between 0.25 and 0.75, which are the left and right edges of the bar.
You can set the width of the intervals in a histogram with the binwidth
argument, which is measured in the units of the x variable. You should always
explore a variety of binwidths when working with histograms, as different
binwidths can reveal different patterns. For example, here is how the graph
above looks when we zoom into just the diamonds with a size of less than three
carats and choose a smaller binwidth.
smaller <- diamonds %>%
filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)

If you wish to overlay multiple histograms in the same plot, I recommend


using geom_freqpoly() instead of geom_histogram() . geom_freqpoly()
performs the same calculation as geom_histogram() , but instead of displaying
the counts with bars, uses lines instead. It’s much easier to understand
overlapping lines than bars.
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
geom_freqpoly(binwidth = 0.1)
There are a few challenges with this type of plot, which we will come back to
in visualising a categorical and a continuous variable.
Now that you can visualise variation, what should you look for in your plots?
And what type of follow-up questions should you ask? I’ve put together a list
below of the most useful types of information that you will find in your graphs,
along with some follow-up questions for each type of information. The key to
asking good follow-up questions will be to rely on your curiosity (What do you
want to learn more about?) as well as your skepticism (How could this be
misleading?).

Typical values
In both bar charts and histograms, tall bars show the common values of a
variable, and shorter bars show less-common values. Places that do not have bars
reveal values that were not seen in your data. To turn this information into useful
questions, look for anything unexpected:

Which values are the most common? Why?


Which values are rare? Why? Does that match your expectations?
Can you see any unusual patterns? What might explain them?
As an example, the histogram below suggests several interesting questions:
Why are there more diamonds at whole carats and common fractions
of carats?
Why are there more diamonds slightly to the right of each peak than
there are slightly to the left of each peak?
Why are there no diamonds bigger than 3 carats?
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.01)

Clusters of similar values suggest that subgroups exist in your data. To


understand the subgroups, ask:

How are the observations within each cluster similar to each other?
How are the observations in separate clusters different from each
other?
How can you explain or describe the clusters?
Why might the appearance of clusters be misleading?
The histogram below shows the length (in minutes) of 272 eruptions of the Old
Faithful Geyser in Yellowstone National Park. Eruption times appear to be
clustered into two groups: there are short eruptions (of around 2 minutes) and
long eruptions (4-5 minutes), but little in between.
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_histogram(binwidth = 0.25)

Many of the questions above will prompt you to explore a relationship between
variables, for example, to see if the values of one variable can explain the
behavior of another variable. We’ll get to that shortly.

Unusual values
Outliers are observations that are unusual; data points that don’t seem to fit the
pattern. Sometimes outliers are data entry errors; other times outliers suggest
important new science. When you have a lot of data, outliers are sometimes
difficult to see in a histogram. For example, take the distribution of the y
variable from the diamonds dataset. The only evidence of outliers is the
unusually wide limits on the x-axis.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
There are so many observations in the common bins that the rare bins are so
short that you can’t see them (although maybe if you stare intently at 0 you’ll
spot something). To make it easy to see the unusual values, we need to zoom to
small values of the y-axis with coord_cartesian() :
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
(coord_cartesian() also has an xlim() argument for when you need to zoom into
the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly
differently: they throw away the data outside the limits.)
This allows us to see that there are three unusual values: 0, ~30, and ~60. We
pluck them out with dplyr:
unusual <- diamonds %>%
filter(y < 3 | y > 20) %>%
select(price, x, y, z) %>%
arrange(y)
unusual
#> # A tibble: 9 x 4
#> price x y z
#> <int> <dbl> <dbl> <dbl>
#> 1 5139 0 0 0
#> 2 6381 0 0 0
#> 3 12800 0 0 0
#> 4 15686 0 0 0
#> 5 18034 0 0 0
#> 6 2130 0 0 0
#> 7 2130 0 0 0
#> 8 2075 5.15 31.8 5.12
#> 9 12210 8.09 58.9 8.06
The y variable measures one of the three dimensions of these diamonds, in mm.
We know that diamonds can’t have a width of 0mm, so these values must be
incorrect. We might also suspect that measurements of 32mm and 59mm are
implausible: those diamonds are over an inch long, but don’t cost hundreds of
thousands of dollars!
It’s good practice to repeat your analysis with and without the outliers. If they
have minimal effect on the results, and you can’t figure out why they’re there,
it’s reasonable to replace them with missing values, and move on. However, if
they have a substantial effect on your results, you shouldn’t drop them without
justification. You’ll need to figure out what caused them (e.g. a data entry error)
and disclose that you removed them in your write-up.

Exercises

1. Explore the distribution of each of the x , y , and z variables


in diamonds . What do you learn? Think about a diamond and how
you might decide which dimension is the length, width, and depth.
2. Explore the distribution of price . Do you discover anything unusual
or surprising? (Hint: Carefully think about the binwidth and make
sure you try a wide range of values.)
3. How many diamonds are 0.99 carat? How many are 1 carat? What do
you think is the cause of the difference?
4. Compare and contrast coord_cartesian() vs xlim() or ylim() when
zooming in on a histogram. What happens if you leave binwidth
unset? What happens if you try and zoom so only half a bar shows?
Missing values
If you’ve encountered unusual values in your dataset, and simply want to move
on to the rest of your analysis, you have two options.

1. Drop the entire row with the strange values:


diamonds2 <- diamonds %>%
filter(between(y, 3, 20))
I don’t recommend this option: just because one measurement is invalid
doesn’t mean all the measurements are. Additionally, if you have
low quality data, by the time you’ve applied this approach to every
variable you might find that you don’t have any data left!

2. Instead, I recommend replacing the unusual values with missing


values. The easiest way to do this is to use mutate() to replace the
variable with a modified copy. You can use the ifelse() function to
replace unusual values with NA :
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
ifelse() has three arguments. The first argument test should be a logical vector.
The result will contain the value of the second argument, yes , when test
is TRUE , and the value of the third argument, no , when it is false.
As an alternative to ifelse(), you can use dplyr::case_when(). case_when() is
particularly useful inside mutate() when you want to create a new variable that
relies on a complex combination of existing variables.
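For example, the ifelse() call above could equally be written with case_when(), which stays readable as the number of conditions grows:

diamonds2 <- diamonds %>%
  mutate(y = case_when(
    y < 3  ~ NA_real_,   # implausibly small widths become missing
    y > 20 ~ NA_real_,   # implausibly large widths become missing
    TRUE   ~ y
  ))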
Like R, ggplot2 subscribes to the philosophy that missing values should never
silently go missing. It’s not obvious where you should plot missing values, so
ggplot2 doesn’t include them in the plot, but it does warn that they’ve been
removed:
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point()
#> Warning: Removed 9 rows containing missing values (geom_point).
To suppress that warning, set na.rm = TRUE :
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point(na.rm = TRUE)
Other times you want to understand what makes observations with missing
values different to observations with recorded values. For example,
in nycflights13::flights, missing values in the dep_time variable indicate that the
flight was cancelled. So you might want to compare the scheduled departure
times for cancelled and non-cancelled times. You can do this by making a new
variable with is.na().
nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
However this plot isn’t great because there are many more non-cancelled flights
than cancelled flights. In the next section we’ll explore some techniques for
improving this comparison.

Exercises

1. What happens to missing values in a histogram? What happens to


missing values in a bar chart? Why is there a difference?
2. What does na.rm = TRUE do in mean() and sum()?
Covariation
If variation describes the behavior within a variable, covariation describes the
behavior between variables. Covariation is the tendency for the values of two
or more variables to vary together in a related way. The best way to spot
covariation is to visualise the relationship between two or more variables. How
you do that should again depend on the type of variables involved.

A categorical and continuous variable


It’s common to want to explore the distribution of a continuous variable broken
down by a categorical variable, as in the previous frequency polygon. The
default appearance of geom_freqpoly() is not that useful for that sort of
comparison because the height is given by the count. That means if one of the
groups is much smaller than the others, it’s hard to see the differences in shape.
For example, let’s explore how the price of a diamond varies with its quality:
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

It’s hard to see the difference in distribution because the overall counts differ so
much:
ggplot(diamonds) +
geom_bar(mapping = aes(x = cut))
To make the comparison easier we need to swap what is displayed on the y-axis.
Instead of displaying count, we’ll display density , which is the count
standardised so that the area under each frequency polygon is one.
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
There’s something rather surprising about this plot - it appears that fair diamonds
(the lowest quality) have the highest average price! But maybe that’s because
frequency polygons are a little hard to interpret - there’s a lot going on in this
plot.
Another alternative to display the distribution of a continuous variable broken
down by a categorical variable is the boxplot. A boxplot is a type of visual
shorthand for a distribution of values that is popular among statisticians. Each
boxplot consists of:

A box that stretches from the 25th percentile of the distribution to the
75th percentile, a distance known as the interquartile range (IQR). In
the middle of the box is a line that displays the median, i.e. 50th
percentile, of the distribution. These three lines give you a sense of
the spread of the distribution and whether or not the distribution is
symmetric about the median or skewed to one side.
Visual points that display observations that fall more than 1.5 times
the IQR from either edge of the box. These outlying points are
unusual so are plotted individually.
A line (or whisker) that extends from each end of the box and goes to
the farthest non-outlier point in the distribution.
Let’s take a look at the distribution of price by cut using geom_boxplot() :
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()

We see much less information about the distribution, but the boxplots are much
more compact so we can more easily compare them (and fit more on one plot). It
supports the counterintuitive finding that better quality diamonds are cheaper on
average! In the exercises, you’ll be challenged to figure out why.
cut is an ordered factor: fair is worse than good, which is worse than very good
and so on. Many categorical variables don’t have such an intrinsic order, so you
might want to reorder them to make a more informative display. One way to do
that is with the reorder() function.
For example, take the class variable in the mpg dataset. You might be interested
to know how highway mileage varies across classes:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()

To make the trend easier to see, we can reorder class based on the median value
of hwy :
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y =
hwy))
If you have long variable names, geom_boxplot() will work better if you flip it
90°. You can do that with coord_flip() .
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y =
hwy)) +
coord_flip()
Exercises

1. Use what you’ve learned to improve the visualisation of the departure


times of cancelled vs. non-cancelled flights.
2. What variable in the diamonds dataset is most important for
predicting the price of a diamond? How is that variable correlated
with cut? Why does the combination of those two relationships lead
to lower quality diamonds being more expensive?
3. Install the ggstance package, and create a horizontal boxplot. How
does this compare to using coord_flip() ?
4. One problem with boxplots is that they were developed in an era of
much smaller datasets and tend to display a prohibitively large
number of “outlying values”. One approach to remedy this problem is
the letter value plot. Install the lvplot package, and try
using geom_lv() to display the distribution of price vs cut. What do
you learn? How do you interpret the plots?
5. Compare and contrast geom_violin() with a
facetted geom_histogram() , or a coloured geom_freqpoly() . What
are the pros and cons of each method?
6. If you have a small dataset, it’s sometimes useful to use geom_jitter()
to see the relationship between a continuous and categorical variable.
The ggbeeswarm package provides a number of methods similar
to geom_jitter() . List them and briefly describe what each one does.

Two categorical variables


To visualise the covariation between categorical variables, you’ll need to count
the number of observations for each combination. One way to do that is to rely
on the built-in geom_count() :
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))

The size of each circle in the plot displays how many observations occurred at
each combination of values. Covariation will appear as a strong correlation
between specific x values and specific y values.
Another approach is to compute the count with dplyr:
diamonds %>%
count(color, cut)
#> # A tibble: 35 x 3
#> color cut n
#> <ord> <ord> <int>
#> 1 D Fair 163
#> 2 D Good 662
#> 3 D Very Good 1513
#> 4 D Premium 1603
#> 5 D Ideal 2834
#> 6 E Fair 224
#> # … with 29 more rows
Then visualise with geom_tile() and the fill aesthetic:
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))

If the categorical variables are unordered, you might want to use the seriation
package to simultaneously reorder the rows and columns in order to more clearly
reveal interesting patterns. For larger plots, you might want to try the d3heatmap
or heatmaply packages, which create interactive plots.

Exercises

1. How could you rescale the count dataset above to more clearly show
the distribution of cut within colour, or colour within cut?
2. Use geom_tile() together with dplyr to explore how average flight
delays vary by destination and month of year. What makes the plot
difficult to read? How could you improve it?
3. Why is it slightly better to use aes(x = color, y = cut) rather
than aes(x = cut, y = color) in the example above?

Two continuous variables


You’ve already seen one great way to visualise the covariation between two
continuous variables: draw a scatterplot with geom_point() . You can see
covariation as a pattern in the points. For example, you can see an exponential
relationship between the carat size and price of a diamond.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))

Scatterplots become less useful as the size of your dataset grows, because points
begin to overplot, and pile up into areas of uniform black (as above). You’ve
already seen one way to fix the problem: using the alpha aesthetic to add
transparency.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
But using transparency can be challenging for very large datasets. Another
solution is to use bin. Previously you used geom_histogram()
and geom_freqpoly() to bin in one dimension. Now you’ll learn how to
use geom_bin2d() and geom_hex() to bin in two dimensions.
geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and
then use a fill color to display how many points fall into each bin. geom_bin2d()
creates rectangular bins. geom_hex() creates hexagonal bins. You will need to
install the hexbin package to use geom_hex() .
ggplot(data = smaller) +
geom_bin2d(mapping = aes(x = carat, y = price))
# install.packages("hexbin")
ggplot(data = smaller) +
geom_hex(mapping = aes(x = carat, y = price))
Another option is to bin one continuous variable so it acts like a categorical
variable. Then you can use one of the techniques for visualising the combination
of a categorical and a continuous variable that you learned about. For example,
you could bin carat and then for each group, display a boxplot:
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
cut_width(x, width) , as used above, divides x into bins of width width . By
default, boxplots look roughly the same (apart from number of outliers)
regardless of how many observations there are, so it’s difficult to tell that each
boxplot summarises a different number of points. One way to show that is to
make the width of the boxplot proportional to the number of points
with varwidth = TRUE .
Another approach is to display approximately the same number of points in each
bin. That’s the job of cut_number() :
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
Exercises

1. Instead of summarising the conditional distribution with a boxplot,


you could use a frequency polygon. What do you need to consider
when using cut_width() vs cut_number() ? How does that impact a
visualisation of the 2d distribution of carat and price ?
2. Visualise the distribution of carat, partitioned by price.
3. How does the price distribution of very large diamonds compare to
small diamonds? Is it as you expect, or does it surprise you?
4. Combine two of the techniques you’ve learned to visualise the
combined distribution of cut, carat, and price.
5. Two dimensional plots reveal outliers that are not visible in one
dimensional plots. For example, some points in the plot below have
an unusual combination of x and y values, which makes the points
outliers even though their x and y values appear normal when
examined separately.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
Why is a scatterplot a better display than a binned plot for this case?
Patterns and models
Patterns in your data provide clues about relationships. If a systematic
relationship exists between two variables it will appear as a pattern in the data. If
you spot a pattern, ask yourself:

Could this pattern be due to coincidence (i.e. random chance)?


How can you describe the relationship implied by the pattern?
How strong is the relationship implied by the pattern?
What other variables might affect the relationship?
Does the relationship change if you look at individual subgroups of
the data?
A scatterplot of Old Faithful eruption lengths versus the wait time between
eruptions shows a pattern: longer wait times are associated with longer
eruptions. The scatterplot also displays the two clusters that we noticed above.
ggplot(data = faithful) +
geom_point(mapping = aes(x = eruptions, y = waiting))
Patterns provide one of the most useful tools for data scientists because they
reveal covariation. If you think of variation as a phenomenon that creates
uncertainty, covariation is a phenomenon that reduces it. If two variables covary,
you can use the values of one variable to make better predictions about the
values of the second. If the covariation is due to a causal relationship (a special
case), then you can use the value of one variable to control the value of the
second.
Models are a tool for extracting patterns out of data. For example, consider the
diamonds data. It’s hard to understand the relationship between cut and price,
because cut and carat, and carat and price are tightly related. It’s possible to use
a model to remove the very strong relationship between price and carat so we
can explore the subtleties that remain. The following code fits a model that
predicts price from carat and then computes the residuals (the difference
between the predicted value and the actual value). The residuals give us a view
of the price of the diamond, once the effect of carat has been removed.
library(modelr)
mod <- lm(log(price) ~ log(carat), data = diamonds)
diamonds2 <- diamonds %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
ggplot(data = diamonds2) +
geom_point(mapping = aes(x = carat, y = resid))

Once you’ve removed the strong relationship between carat and price, you can
see what you expect in the relationship between cut and price: relative to their
size, better quality diamonds are more expensive.
ggplot(data = diamonds2) +
geom_boxplot(mapping = aes(x = cut, y = resid))
You’ll learn how models, and the modelr package, work in the final part of the
book, model. We’re saving modelling for later because understanding what
models are and how they work is easiest once you have tools of data wrangling
and programming in hand.
ggplot2 calls
As we move on from these introductory chapters, we’ll transition to a more
concise expression of ggplot2 code. So far we’ve been very explicit, which is
helpful when you are learning:
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_freqpoly(binwidth = 0.25)
Typically, the first one or two arguments to a function are so important that you
should know them by heart. The first two arguments to ggplot() are data
and mapping , and the first two arguments to aes() are x and y . In the
remainder of the book, we won’t supply those names. That saves typing, and, by
reducing the amount of boilerplate, makes it easier to see what’s different
between plots. That’s a really important programming concern that we’ll come
back to in functions.
Rewriting the previous plot more concisely yields:
ggplot(faithful, aes(eruptions)) +
geom_freqpoly(binwidth = 0.25)
Sometimes we’ll turn the end of a pipeline of data transformation into a plot.
Watch for the transition from %>% to +. I wish this transition wasn’t necessary
but unfortunately ggplot2 was created before the pipe was discovered.
diamonds %>%
count(cut, clarity) %>%
ggplot(aes(clarity, cut, fill = n)) +
geom_tile()

Predictive Analysis in R Programming


Predictive analysis in the R language is a branch of analytics that applies statistical
operations to historical data in order to predict future events. It is a
common term in data mining and machine learning. Methods like time
series analysis, non-linear least squares, etc. are used in predictive analysis. Using
predictive analytics can help many businesses, as it finds relationships
in the collected data and, based on those relationships, predicts patterns,
allowing businesses to create predictive intelligence.
We’ll discuss the process, need and applications of predictive analysis with
example codes.

Process of Predictive Analysis

Predictive analysis consists of 7 processes as follows:

Define project: Defining the project, scope, objectives and result.


Data collection: Data is collected through data mining providing a
complete view of customer interactions.
Data Analysis: It is the process of cleaning, inspecting, transforming
and modelling the data.
Statistics: This process enables validating the assumptions and testing
the statistical models.
Modelling: Predictive models are generated using statistics and the
most optimized model is used for the deployment.
Deployment: The predictive model is deployed to automate the
production of everyday decision-making results.
Model monitoring: Keep monitoring the model to review
performance which ensures expected results.

Need of Predictive Analysis


Understanding customer behavior: Predictive analysis uses data
mining to extract customer attributes and behavior. It
also identifies customers' interests, so a business can learn
to present the products that are most likely to be bought.
Gain a competitive edge in the market: With predictive analysis, businesses
can grow faster and stand out from
competitors by identifying their own strengths and
weaknesses.
Learn new opportunities to increase revenue: Companies can create
new offers or discounts based on customer patterns, providing
an increase in revenue.
Find areas of weakness: Using these methods, companies can win
back lost customers by identifying past actions taken by the
company that customers didn't like.

Applications of Predictive Analysis

Health care: Predictive analysis can be used to examine a patient's history
and thus assess health risks.
Financial modelling: Financial modelling is another area where
predictive analysis plays a major role, identifying trending stocks and
helping the business in its decision-making process.
Customer Relationship Management: Predictive analysis helps
firms in creating marketing campaigns and customer services based on
the analysis produced by the predictive algorithms.
Risk Analysis: While forecasting the campaigns, predictive analysis
can show an estimation of profit and helps in evaluating the risks too.

Example:
Let us take an example of time series analysis, which is a method of predictive
analysis, in R programming:
x <- c(580, 7813, 28266, 59287, 75700,
87820, 95314, 126214, 218843, 471497,
936851, 1508725, 2072113)
# library required for decimal_date() function
library(lubridate)
# output to be created as png file
png(file ="predictiveAnalysis.png")
# creating time series object
# from date 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")),
frequency = 365.25 / 7)
# plotting the graph
plot(mts, xlab ="Weekly Data of sales",
ylab ="Total Revenue",
main ="Sales vs Revenue",
col.main ="darkgreen")
# saving the file
dev.off()
Output:

Forecasting Data:
Now, forecasting sales and revenue based on historical data.
x <- c(580, 7813, 28266, 59287, 75700,
87820, 95314, 126214, 218843,
471497, 936851, 1508725, 2072113)
# library required for decimal_date() function
library(lubridate)
# library required for forecasting
library(forecast)
# output to be created as png file
png(file ="forecastSalesRevenue.png")
# creating time series object
# from date 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")),
frequency = 365.25 / 7)
# forecasting model using arima model
fit <- auto.arima(mts)
# Next 5 forecasted values
forecast(fit, 5)
# plotting the graph with next
# 5 weekly forecasted values
plot(forecast(fit, 5), xlab ="Weekly Data of Sales",
ylab ="Total Revenue",
main ="Sales vs Revenue", col.main ="darkgreen")
# saving the file
dev.off()
Output:
Performing Hierarchical Cluster Analysis using R
Cluster analysis or clustering is a technique to find subgroups of data points
within a data set. The data points belonging to the same subgroup have similar
features or properties. Clustering is an unsupervised machine learning approach
and has a wide variety of applications such as market research, pattern
recognition, recommendation systems, and so on. The most common algorithms
used for clustering are K-means clustering and Hierarchical cluster analysis. In
this article, we will learn about hierarchical cluster analysis and its
implementation in R programming.
Hierarchical cluster analysis (also known as hierarchical clustering) is a
clustering technique where clusters have a hierarchy or a predetermined order.
Hierarchical clustering can be represented by a tree-like structure called
a Dendrogram . There are two types of hierarchical clustering:

Agglomerative hierarchical clustering: This is a bottom-up approach
where each data point starts in its own cluster and as one moves up the
hierarchy, similar pairs of clusters are merged.
Divisive hierarchical clustering: This is a top-down approach where
all data points start in one cluster and as one moves down the
hierarchy, clusters are split recursively.
To measure the similarity or dissimilarity between a pair of data points, we use
distance measures (Euclidean distance, Manhattan distance, etc.). However, to
find the dissimilarity between two clusters of observations, we use
agglomeration methods. The most common agglomeration methods are:

Complete linkage clustering: It computes all pairwise dissimilarities
between the observations in two clusters, and considers the longest
(maximum) distance between two points as the distance between two
clusters.
Single linkage clustering : It computes all pairwise dissimilarities
between the observations in two clusters, and considers the shortest
(minimum) distance as the distance between two clusters.
Average linkage clustering : It computes all pairwise dissimilarities
between the observations in two clusters, and considers the average
distance as the distance between two clusters.

Performing Hierarchical Cluster Analysis using R


For computing hierarchical clustering in R, the commonly used functions are as
follows:

hclust in the stats package and agnes in the cluster package for
agglomerative hierarchical clustering.
diana in the cluster package for divisive hierarchical clustering.

We will use the Iris flower data set from the datasets package in our
implementation. We will use the sepal width, sepal length, petal width, and petal
length columns as our data points. First, we load and normalize the data. Then the
dissimilarity values are computed with the dist function and these values are fed to
the clustering functions for performing hierarchical clustering.

# Load required packages


library(datasets) # contains iris dataset
library(cluster) # clustering algorithms
library(factoextra) # visualization
library(purrr) # to use map_dbl() function
# Load and preprocess the dataset
df <- iris[, 1:4]
df <- na.omit(df)
df <- scale(df)
# Dissimilarity matrix
d <- dist(df, method = "euclidean")

Agglomerative hierarchical clustering implementation


The dissimilarity matrix obtained is fed to hclust . The method parameter
of hclust specifies the agglomeration method to be used (i.e. complete, average,
single). We can then plot the dendrogram.
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1)
Output:

Observe that in the above dendrogram, a leaf corresponds to one observation and
as we move up the tree, similar observations are fused at a higher height. The
height of the dendrogram determines the clusters. In order to identify the
clusters, we can cut the dendrogram with cutree . Then visualize the result in a
scatter plot using fviz_cluster function from the factoextra package.
# Cut tree into 3 groups
sub_grps <- cutree(hc1, k = 3)
# Visualize the result in a scatter plot
fviz_cluster(list(data = df, cluster = sub_grps))
Output:

We can also provide a border to the dendrogram around the 3 clusters as shown
below.
# Plot the obtained dendrogram with
# rectangle borders for k clusters
plot(hc1, cex = 0.6, hang = -1)
rect.hclust(hc1, k = 3, border = 2:4)
Output:
Alternatively, we can use the agnes function to perform the hierarchical
clustering. Unlike hclust , the agnes function gives the agglomerative
coefficient, which measures the amount of clustering structure found (values
closer to 1 suggest strong clustering structure).
# agglomeration methods to assess
m <- c("average", "single", "complete")
names(m) <- c("average", "single", "complete")
# function to compute hierarchical
# clustering coefficient
ac <- function(x) {
agnes(df, method = x)$ac
}
map_dbl(m, ac)
Output:
average single complete
0.9035705 0.8023794 0.9438858
Complete linkage gives a stronger clustering structure. So, we use this
agglomeration method to perform hierarchical clustering with agnes function as
shown below.
# Hierarchical clustering
hc2 <- agnes(df, method = "complete")
# Plot the obtained dendrogram
pltree(hc2, cex = 0.6, hang = -1,
main = "Dendrogram of agnes")
Output:

Divisive clustering implementation


The function diana , which works similarly to agnes , allows us to perform divisive
hierarchical clustering. However, there is no method argument to provide.
# Compute divisive hierarchical clustering
hc3 <- diana(df)
# Divise coefficient
hc3$dc
# Plot obtained dendrogram
pltree(hc3, cex = 0.6, hang = -1,
main = "Dendrogram of diana")
Output:
[1] 0.9397208
Python
Python is an interpreted, object-oriented programming language, similar
to Perl, that has gained popularity because of its clear syntax and readability.
Python is said to be relatively easy to learn and portable, meaning its statements
can be interpreted on a number of operating systems, including UNIX-based
systems, Mac OS, MS-DOS, OS/2, and various versions of Microsoft
Windows. Python was created by Guido van Rossum, a former resident of the
Netherlands, whose favorite comedy group at the time was Monty Python's
Flying Circus. The source code is freely available and open for modification and
reuse. Python has a significant number of users.

A notable feature of Python is its indenting of source statements to make the
code easier to read. Python offers dynamic data types, ready-made classes, and
interfaces to many system calls and libraries. It can be extended, using
the C or C++ language.

Python can be used as the script in Microsoft's Active Server Page (ASP)
technology. The scoreboard system for the Melbourne (Australia) Cricket
Ground is written in Python. Z Object Publishing Environment, a popular
Web application server, is also written in the Python language.

Python is everywhere!
With the widespread use of Python across major industry verticals, Python has
become a hot topic of discussion. Python has been acknowledged as
the fastest-growing programming language, as per Stack Overflow Trends.
According to the Stack Overflow Developers' Survey 2019, Python is the second
"most loved" language, with 73% of developers choosing it above other
languages prevailing in the market.

Python is a general-purpose, open-source programming language used by big
names such as Reddit, Instagram, and Venmo, according to a press release.
Why choose Python for Big Data?
Python and big data are a combination that is rapidly taking over the market.
Python is in great demand among big data companies. Here, we will
discuss the major benefits of using Python and why Python for big data has
become a preferred choice among businesses these days.
Simple Coding
Python programming involves fewer lines of code compared to other
programming languages, so tasks can be expressed in very few
lines. Moreover, Python automatically identifies and associates data types.
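To make this concrete, here is a small, hedged illustration (not from the original article) of how a few lines of Python can summarise data, with types inferred at runtime:

# Types are inferred at runtime; no declarations are needed.
values = [3, 7, 1, 9, 4, 7]
total, biggest = sum(values), max(values)
average = total / len(values)
print(f"n={len(values)} total={total} max={biggest} mean={average:.2f}")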

“Python is a truly wonderful language. When somebody comes up with a good
idea it takes about 1 minute and five lines to program something that almost
does what you want.” — Jack Jansen

Python programming follows an indentation-based nesting structure. The
language can process lengthy tasks within a short span of time. As there is no
limitation on data processing, you can compute data on commodity machines, a
laptop, the cloud, or a desktop.

Earlier, Python was considered a slower language in comparison to some of
its counterparts like Java and Scala, but the scenario has changed now.

The advent of the Anaconda platform has given the language a significant speed
boost. This is why Python for big data has become one of the most popular options in
the industry. You can also hire a Python developer who can implement these
Python benefits in your business.
Open-Source
Developed with the help of a community-based model, Python is an open-source
programming language . Being an open-source language, Python supports
multiple platforms. Also, it can be run in various environments such as Windows
and Linux .

“My favorite language for maintainability is Python. It has simple, clean
syntax, object encapsulation, good library support, and optional named
parameters”, said Bram Cohen.
Library Support
Python programming offers the use of multiple libraries. This makes it a famous
programming language in fields like scientific computing. As Big Data involves
a lot of data analysis and scientific computing, Python and Big Data serve as
great companions.

Python offers a number of well-tested analytics libraries. These libraries consist
of packages for areas such as the following (a small combined sketch appears after
the list):

Numerical computing
Data analysis
Statistical analysis
Visualization
Machine learning
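As a rough illustration of how these library families fit together, the sketch below uses NumPy, pandas, scikit-learn and matplotlib as representative packages; the article does not name specific libraries, so these choices and the generated data are assumptions.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# numerical computing: generate a noisy linear signal
x = np.linspace(0, 10, 50)
y = 3 * x + np.random.normal(0, 2, 50)

# data analysis / statistical analysis: summarise it with pandas
df = pd.DataFrame({"x": x, "y": y})
print(df.describe())

# machine learning: fit a simple linear model
model = LinearRegression().fit(df[["x"]], df["y"])
print("estimated slope:", model.coef_[0])

# visualization: plot the data and the fitted line
plt.scatter(df["x"], df["y"])
plt.plot(df["x"], model.predict(df[["x"]]), color="red")
plt.show()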

Python’s Compatibility with Hadoop


Both Python and Hadoop are open-source big data platforms. This is the reason
why Python is more compatible with Hadoop than other programming
languages. You can incorporate these Python features in your business. To do
this, you need to hire Python developers from a reputed Python development
company.

What are the benefits of using the Pydoop Package?


1. Access to HDFS API

The Pydoop package (Python and Hadoop) provides access to the HDFS
API for Hadoop, which allows you to write Hadoop MapReduce programs and
applications.
How is the HDFS API beneficial for you? The HDFS API lets
you read and write files and directories and query global file system
properties without facing any hurdles.
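The fragment below is a minimal sketch only: it assumes a reachable HDFS cluster, the pydoop package installed via pip, and hypothetical paths; the ls/load/dump helpers are pydoop's documented HDFS utilities.

import pydoop.hdfs as hdfs

# list a (hypothetical) directory on HDFS
for path in hdfs.ls("/user/data"):
    print(path)

# read a whole file from HDFS and write a small summary back
content = hdfs.load("/user/data/sales.csv")      # file content
line_count = len(content.splitlines())
hdfs.dump("lines: %d\n" % line_count, "/user/data/summary.txt")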

2. Offers MapReduce API

Pydoop offers a MapReduce API for solving complex problems with minimal
programming effort. This API can be used to implement advanced data science
concepts like 'Counters' and 'Record Readers', which makes Python
programming a strong choice for Big Data.


Speed
Python is considered to be one of the most popular languages for software
development because of its speed and performance. As it runs
code efficiently, Python is an apt choice for big data.

Python programming supports rapid prototyping of ideas, which helps make the code
run fast. Moreover, while doing so, Python also sustains transparency
between the code and the process.

Python programming contributes to making code readable and transparent, thus
greatly assisting in the maintenance of the code.

Scope
Python allows users to simplify data operations. As Python is an object-
oriented language, it supports advanced data structures. Some of the data
structures that Python manages include lists, sets, tuples, dictionaries, and many
more.

Besides this, Python supports scientific computing operations such as
matrix operations, data frames, etc. These features help to
enhance the scope of the language, enabling it to speed up data operations.
This is what makes Python and Big Data such a powerful combination.
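A brief, hedged sketch of those structures and operations side by side (the values are illustrative only):

import numpy as np
import pandas as pd

# built-in data structures
readings = [4.2, 5.1, 3.8, 5.1]            # list
unique_readings = set(readings)            # set
location = (51.5, -0.12)                   # tuple
meta = {"sensor": "s1", "unit": "C"}       # dictionary

# scientific computing: matrix operations and data frames
m = np.array([[1, 2], [3, 4]])
print(m @ m)                                # matrix multiplication
print(pd.DataFrame({"reading": readings}).mean())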

Data Processing Support


Python has built-in support for data processing, and you can use this
to process unstructured and unconventional data as well. This
is one reason why big data companies prefer Python, since handling such data is considered
one of the most important requirements in big data. So, hiring Python
programmers lets a business take advantage of these capabilities.
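For instance, here is a hedged illustration of handling semi-structured records with only the standard library; the records and field names are made up:

import json
from collections import Counter

raw_records = [
    '{"user": "a1", "action": "view", "device": "mobile"}',
    '{"user": "b2", "action": "click", "device": "desktop"}',
    '{"user": "a1", "action": "click"}',   # irregular record, missing a field
]

parsed = [json.loads(r) for r in raw_records]
actions = Counter(rec["action"] for rec in parsed)
devices = Counter(rec.get("device", "unknown") for rec in parsed)
print(actions)    # Counter({'click': 2, 'view': 1})
print(devices)    # Counter({'mobile': 1, 'desktop': 1, 'unknown': 1})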

Final words
These were some of the benefits of using Python. By now, you should have
a clear idea of why Python for big data is considered the best fit. Python is a
simple, open-source language possessing high speed and robust library
support.

“Big data is at the foundation of all the megatrends that are happening.” –
Chris Lynch

With the use of big data technology spreading across the globe, meeting the
requirements of this industry is surely a daunting task. But, with its incredible
benefits, Python has become a suitable choice for Big Data . You can also
leverage Python in your business for availing its advantages.

Exploratory Data Analysis in Python


EDA (Exploratory Data Analysis) is an approach within data analysis used for gaining a better
understanding of data aspects like:

main features of data


variables and relationships that hold between them
identifying which variables are important for our problem

We shall look at various exploratory data analysis methods like:

Descriptive Statistics, which is a way of giving a brief overview of the
dataset we are dealing with, including some measures and features of
the sample
Grouping data [Basic grouping with group by ]
ANOVA, Analysis Of Variance, which is a computational method to
divide variations in an observations set into different components.
Correlation and correlation methods

The dataset we’ll be using is the Chile voting dataset, which you can import in
Python as:
import pandas as pd
Df = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/car/Chile.csv")
Descriptive Statistics
Descriptive statistics is a helpful way to understand the characteristics of your data
and to get a quick summary of it. Pandas in Python provides an interesting
method, describe() . The describe function applies basic statistical computations
on the dataset, like extreme values, count of data points, standard deviation, etc.
Any missing value or NaN value is automatically skipped. The describe() function
gives a good picture of the distribution of the data.
Df.describe()
Here’s the output you’ll get on running above code:

Another useful method is value_counts() , which gets the count of each category in
a categorical series of values. For instance, suppose you are dealing
with a dataset of customers who are divided into youth, medium and old categories
under a column named age, and your dataframe is “Df”. You can run such a statement
to know how many people fall into the respective categories. In our example data set,
the education column can be used:

Df["education"].value_counts()
The output of the above code will be:

One more useful tool is the boxplot, which you can use through the matplotlib module.
A boxplot is a pictorial representation of the distribution of data which shows extreme
values, the median and the quartiles. We can easily figure out outliers by using boxplots.
Now consider the dataset we’ve been dealing with again and let’s draw a boxplot
on the attribute population:
import pandas as pd
import matplotlib.pyplot as plt
Df = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/car/Chile.csv")
y = list(Df.population)
plt.boxplot(y)
plt.show()

The output plot would look like this with spotting out outliers:
Grouping data
Group by is an interesting measure available in pandas which can help us figure
out the effect of different categorical attributes on other data variables. Let’s see an
example on the same dataset, where we want to figure out the effect of people’s age
and education on the vote.
Df.groupby(['education', 'vote']).mean()
The output would be somewhat like this:
If this group-by output table is hard to read, analysts often go further and use pivot
tables and heat maps to visualize it.
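One possible way to do that, sketched here as an illustration rather than part of the original walkthrough, is pandas' pivot_table combined with seaborn's heatmap; it assumes the Chile data's numeric statusquo column as the value to aggregate.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Df = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/car/Chile.csv")
# average status-quo support for each education / vote combination
table = pd.pivot_table(Df, values="statusquo", index="education",
                       columns="vote", aggfunc="mean")
sns.heatmap(table, annot=True, cmap="coolwarm")
plt.title("Mean statusquo by education and vote")
plt.show()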
ANOVA
ANOVA stands for Analysis of Variance. It is performed to figure out the
relation between different groups of categorical data.
Under ANOVA we have two measures as the result:
– F-test score: the ratio of the variation between group means to the variation within the groups
– p-value: it shows the significance of the result

This can be performed using the SciPy module's f_oneway() method.
Syntax:
import scipy.stats as st
st.f_oneway(sample1, sample2, ..)

These samples are sample measurements for each group.


As a conclusion, we can say that there is a strong correlation between other
variables and a categorical variable if the ANOVA test gives us a large F-test
value and a small p-value.
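As a small self-contained illustration (the three groups below are made-up numbers, not taken from the voting dataset):

import scipy.stats as st

group_a = [23, 25, 21, 24, 26]
group_b = [30, 31, 29, 32, 33]
group_c = [22, 24, 23, 25, 21]

# f_oneway returns the F statistic and the p-value for the groups
f_score, p_value = st.f_oneway(group_a, group_b, group_c)
print("F-test score:", f_score)
print("p-value:", p_value)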
Correlation and Correlation computation
Correlation is a simple relationship between two variables in a context such that
one variable changes along with the other. Correlation is different from causation. One
way to calculate correlation among variables is to find the Pearson correlation. Here
we find two parameters, namely the Pearson coefficient and the p-value. We can say
there is a strong correlation between two variables when the Pearson correlation
coefficient is close to either 1 or -1 and the p-value is less than 0.0001.
The SciPy module also provides a method to perform Pearson correlation analysis.
Syntax:
import scipy.stats as st
st.pearsonr(sample1, sample2)
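For example, a quick illustrative run on two small made-up samples:

import scipy.stats as st

hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 55, 61, 64, 70, 74]

# pearsonr returns the correlation coefficient and the p-value
coefficient, p_value = st.pearsonr(hours_studied, exam_score)
print("Pearson coefficient:", coefficient)
print("p-value:", p_value)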
Loading Libraries:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import trim_mean

Loading Data:
data = pd.read_csv("state.csv")
# Check the type of data
print ("Type : ", type(data), "\n\n")
# Printing Top 10 Records
print ("Head -- \n", data.head(10))
# Printing last 10 Records
print ("\n\n Tail -- \n", data.tail(10))
Output :

Type : <class 'pandas.core.frame.DataFrame'>

Head --
State Population Murder.Rate Abbreviation
0 Alabama 4779736 5.7 AL
1 Alaska 710231 5.6 AK
2 Arizona 6392017 4.7 AZ
3 Arkansas 2915918 5.6 AR
4 California 37253956 4.4 CA
5 Colorado 5029196 2.8 CO
6 Connecticut 3574097 2.4 CT
7 Delaware 897934 5.8 DE
8 Florida 18801310 5.8 FL
9 Georgia 9687653 5.7 GA

Tail --
State Population Murder.Rate Abbreviation
40 South Dakota 814180 2.3 SD
41 Tennessee 6346105 5.7 TN
42 Texas 25145561 4.4 TX
43 Utah 2763885 2.3 UT
44 Vermont 625741 1.6 VT
45 Virginia 8001024 4.1 VA
46 Washington 6724540 2.5 WA
47 West Virginia 1852994 4.0 WV
48 Wisconsin 5686986 2.9 WI
49 Wyoming 563626 2.7 WY
Code #1 : Adding Column to the dataframe
# Adding a new column with derived data
data['PopulationInMillions'] = data['Population']/1000000
# Changed data
print (data.head(5))
Output :
State Population Murder.Rate Abbreviation PopulationInMillions
0 Alabama 4779736 5.7 AL 4.779736
1 Alaska 710231 5.6 AK 0.710231
2 Arizona 6392017 4.7 AZ 6.392017
3 Arkansas 2915918 5.6 AR 2.915918
4 California 37253956 4.4 CA 37.253956
Code #2 : Data Description
data.describe()
Output :

Code #3 : Data Info


data.info()
Output :

RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
State 50 non-null object
Population 50 non-null int64
Murder.Rate 50 non-null float64
Abbreviation 50 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 1.6+ KB
Code #4 : Renaming a column heading
# Rename column heading as it
# has '.' in it which will create
# problems when dealing functions
data.rename(columns ={'Murder.Rate': 'MurderRate'}, inplace = True)
# Lets check the column headings
list(data)
Output :

['State', 'Population', 'MurderRate', 'Abbreviation']


Code #5 : Calculating Mean
Population_mean = data.Population.mean()
print ("Population Mean : ", Population_mean)
MurderRate_mean = data.MurderRate.mean()
print ("\nMurderRate Mean : ", MurderRate_mean)
Output:
Population Mean : 6162876.3

MurderRate Mean : 4.066


Code #6 : Trimmed mean
# Mean after discarding top and
# bottom 10 % values eliminating outliers
population_TM = trim_mean(data.Population, 0.1)
print ("Population trimmed mean: ", population_TM)
murder_TM = trim_mean(data.MurderRate, 0.1)
print ("\nMurderRate trimmed mean: ", murder_TM)
Output :
Population trimmed mean: 4783697.125

MurderRate trimmed mean: 3.9450000000000003


Code #7 : Weighted Mean
# here murder rate is weighed as per
# the state population
murderRate_WM = np.average(data.MurderRate, weights = data.Population)
print ("Weighted MurderRate Mean: ", murderRate_WM)
Output :
Weighted MurderRate Mean: 4.445833981123393
Code #8 : Median
Population_median = data.Population.median()
print ("Population median : ", Population_median)
MurderRate_median = data.MurderRate.median()
print ("\nMurderRate median : ", MurderRate_median)
Output :
Population median : 4436369.5

MurderRate median : 4.0


We have discussed some basic techniques to analyze the data; now let’s see the
visual techniques.
Let’s first repeat the basic setup –
# Loading Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import trim_mean
# Loading Data
data = pd.read_csv("state.csv")
# Check the type of data
print ("Type : ", type(data), "\n\n")
# Printing Top 10 Records
print ("Head -- \n", data.head(10))
# Printing last 10 Records
print ("\n\n Tail -- \n", data.tail(10))
# Adding a new column with derived data
data['PopulationInMillions'] = data['Population']/1000000
# Changed data
print (data.head(5))
# Rename column heading as it
# has '.' in it which will create
# problems when dealing functions
data.rename(columns ={'Murder.Rate': 'MurderRate'},
inplace = True)
# Lets check the column headings
list(data)
Output :
Type : <class 'pandas.core.frame.DataFrame'>

Head --
State Population Murder.Rate Abbreviation
0 Alabama 4779736 5.7 AL
1 Alaska 710231 5.6 AK
2 Arizona 6392017 4.7 AZ
3 Arkansas 2915918 5.6 AR
4 California 37253956 4.4 CA
5 Colorado 5029196 2.8 CO
6 Connecticut 3574097 2.4 CT
7 Delaware 897934 5.8 DE
8 Florida 18801310 5.8 FL
9 Georgia 9687653 5.7 GA

Tail --
State Population Murder.Rate Abbreviation
40 South Dakota 814180 2.3 SD
41 Tennessee 6346105 5.7 TN
42 Texas 25145561 4.4 TX
43 Utah 2763885 2.3 UT
44 Vermont 625741 1.6 VT
45 Virginia 8001024 4.1 VA
46 Washington 6724540 2.5 WA
47 West Virginia 1852994 4.0 WV
48 Wisconsin 5686986 2.9 WI
49 Wyoming 563626 2.7 WY

State Population Murder.Rate Abbreviation PopulationInMillions


0 Alabama 4779736 5.7 AL 4.779736
1 Alaska 710231 5.6 AK 0.710231
2 Arizona 6392017 4.7 AZ 6.392017
3 Arkansas 2915918 5.6 AR 2.915918
4 California 37253956 4.4 CA 37.253956

['State', 'Population', 'MurderRate', 'Abbreviation']


Visualizing Population per Million
# Plot Population In Millions
fig, ax1 = plt.subplots()
fig.set_size_inches(15, 9)

ax1 = sns.barplot(x ="State", y ="Population",


data = data.sort_values('MurderRate'),
palette ="Set2")
ax1.set(xlabel ='States', ylabel ='Population In Millions')
ax1.set_title('Population in Millions by State', size = 20)
plt.xticks(rotation =-90)
Output:

(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,


17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
a list of 50 Text xticklabel objects)
Visualizing Murder Rate per Lakh
# Plot Murder Rate per 1,00,000
fig, ax2 = plt.subplots()
fig.set_size_inches(15, 9)

ax2 = sns.barplot(
x ="State", y ="MurderRate",
data = data.sort_values('MurderRate', ascending = 1),
palette ="husl")
ax2.set(xlabel ='States', ylabel ='Murder Rate per 100000')
ax2.set_title('Murder Rate by State', size = 20)
plt.xticks(rotation =-90)
Output :
(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
a list of 50 Text xticklabel objects)

Although Louisiana is ranked 17 by population (about 4.53M), it has the highest
murder rate, 10.3 per 100,000 people.
Code #1 : Standard Deviation
Population_std = data.Population.std()
print ("Population std : ", Population_std)
MurderRate_std = data.MurderRate.std()
print ("\nMurderRate std : ", MurderRate_std)
Output :
Population std : 6848235.347401142

MurderRate std : 1.915736124302923


Code #2 : Variance
Population_var = data.Population.var()
print ("Population var : ", Population_var)
MurderRate_var = data.MurderRate.var()
print ("\nMurderRate var : ", MurderRate_var)
Output :
Population var : 46898327373394.445

MurderRate var : 3.670044897959184


Code #3 : Inter Quartile Range
# Inter Quartile Range of Population
population_IQR = data.Population.describe()['75%'] - \
    data.Population.describe()['25%']
print ("Population IQR : ", population_IQR)
# Inter Quartile Range of Murder Rate
MurderRate_IQR = data.MurderRate.describe()['75%'] - \
    data.MurderRate.describe()['25%']
print ("\nMurderRate IQR : ", MurderRate_IQR)
Output :
Population IQR : 4847308.0

MurderRate IQR : 3.124999999999999


Code #4 : Median Absolute Deviation (MAD)
Population_mad = data.Population.mad()
print ("Population mad : ", Population_mad)
MurderRate_mad = data.MurderRate.mad()
print ("\nMurderRate mad : ", MurderRate_mad)
Output :
Population mad : 4450933.356000001
MurderRate mad : 1.5526400000000005
Data analysis and Visualization with Python
Python is a great language for doing data analysis, primarily because of the
fantastic ecosystem of data-centric Python packages. Pandas is one of those
packages, and it makes importing and analyzing data much easier. In this article, I
have used Pandas to analyze the Country Data.csv file from the UN public data
sets on the ‘statweb.stanford.edu’ website.
Installation
The easiest way to install pandas is to use pip:
pip install pandas
Creating A DataFrame in Pandas
A DataFrame can be created by passing multiple Series (built with the pd.Series
method) into the DataFrame class. Here, two Series objects are passed in: s1 as
the first row, and s2 as the second row.
Example:
# import the pandas library
import pandas as pd
# assigning two series to s1 and s2
s1 = pd.Series([1,2])
s2 = pd.Series(["Ashish", "Sid"])
# framing series objects into data
df = pd.DataFrame([s1,s2])
# show the data frame
df
# data framing in another way
# taking index and column values
dframe = pd.DataFrame([[1,2],["Ashish", "Sid"]],
index=["r1", "r2"],
columns=["c1", "c2"])
dframe
# framing in another way
# dict-like container
dframe = pd.DataFrame({
"c1": [1, "Ashish"],
"c2": [2, "Sid"]})
dframe
Output:

Importing Data with Pandas


The first step is to read the data. The data is stored as a comma-separated values,
or csv, file, where each row is separated by a new line, and each column by a
comma (,). In order to be able to work with the data in Python, it is needed to
read the csv file into a Pandas DataFrame. A DataFrame is a way to represent
and work with tabular data. Tabular data has rows and columns, just like this csv
file.
Example:
# Import the pandas library, renamed as pd
import pandas as pd
# Read IND_data.csv into a DataFrame, assigned to df
df = pd.read_csv("IND_data.csv")
# Prints the first 5 rows of a DataFrame as default
df.head()
# Prints no. of rows and columns of a DataFrame
df.shape

Output:

(29, 10)
Indexing DataFrames with Pandas
Indexing is possible using the pandas.DataFrame.iloc method. The iloc
method allows you to retrieve rows and columns by position.

Examples:
# prints first 5 rows and every column which replicates df.head()
df.iloc[0:5,:]
# prints entire rows and columns
df.iloc[:,:]
# prints from 5th rows and first 5 columns
df.iloc[5:,:5]
Indexing Using Labels in Pandas
Indexing can also be done with labels using the pandas.DataFrame.loc method,
which allows you to index using labels instead of positions.

Examples:

# prints first five rows including 5th index and every columns of df
df.loc[0:5,:]
# prints from 5th rows onwards and entire columns
df = df.loc[5:,:]

The above doesn’t actually look much different from df.iloc[0:5,:]. This is
because while row labels can take on any values, our row labels match the
positions exactly. But column labels can make things much easier when working
with data. Example:
# Prints the first 5 rows of Time period
# value
df.loc[:5,"Time period"]

DataFrame Math with Pandas


Computations on data frames can be done using the statistical functions of
pandas.
Examples:
# computes various summary statistics, excluding NaN values
df.describe()
# for computing correlations
df.corr()
# computes numerical data ranks
df.rank()

Pandas Plotting
Plots in these examples are made using the standard convention for referencing the
matplotlib API, which pandas builds on to easily create decent-
looking plots.
Examples:
# import the required module
import matplotlib.pyplot as plt
# plot a histogram
df['Observation Value'].hist(bins=10)
# shows presence of a lot of outliers/extreme values
df.boxplot(column='Observation Value', by = 'Time period')
# plotting points as a scatter plot
x = df["Observation Value"]
y = df["Time period"]
plt.scatter(x, y, label= "stars", color= "m",
marker= "*", s=30)
# x-axis label
plt.xlabel('Observation Value')
# frequency label
plt.ylabel('Time period')
# function to show the plot
plt.show()
Storing DataFrame in CSV Format :
Pandas provides the to_csv('filename', index = False|True) function to write a
DataFrame into a CSV file. Here, filename is the name of the CSV file that you
want to create and index tells whether the index of the DataFrame should be
written to the file or not. If we set index = False then the index is not written; by
default index is True and the index is written as the first column.
Example :

import pandas as pd
# assigning three series to s1, s2, s3
s1 = pd.Series([0, 4, 8])
s2 = pd.Series([1, 5, 9])
s3 = pd.Series([2, 6, 10])
# taking index and column values
dframe = pd.DataFrame([s1, s2, s3])
# assign column name
dframe.columns =['Geeks', 'For', 'Geeks']
# write data to csv file
dframe.to_csv('geeksforgeeks.csv', index = False)
dframe.to_csv('geeksforgeeks1.csv', index = True)

Output :

geeksforgeeks.csv

geeksforgeeks1.csv
Handling Missing Data
The data analysis phase also comprises the ability to handle missing data
in our dataset, and not so surprisingly Pandas lives up to that expectation as
well. This is where the dropna and/or fillna methods come into play. While
dealing with missing data, you as a data analyst can either drop
the rows or columns containing NaN values (dropna method) or fill in the missing
data with the mean or mode of the whole column (fillna method); this decision
is of great significance and depends upon the data and the effect it would create on
our results.

Drop the missing Data :


Consider the DataFrame generated by the code below:

import numpy as np
import pandas as pd
# Create a DataFrame
dframe = pd.DataFrame({'Geeks': [23, 24, 22],
'For': [10, 12, np.nan],
'geeks': [0, np.nan, np.nan]},
columns =['Geeks', 'For', 'geeks'])
# This will remove all the
# rows with NAN values
# If axis is not defined then
# it is along rows i.e. axis = 0
dframe.dropna(inplace = True)
print(dframe)
# if axis is equal to 1
dframe.dropna(axis = 1, inplace = True)
print(dframe)
Output :

axis=0

axis=1

Fill the missing values :


Now, to replace any NaN value with mean or mode of the
data, fillna is used, which could replace all the NaN values from a
particular column or even in whole DataFrame as per the
requirement.
import numpy as np
import pandas as pd
# Create a DataFrame
dframe = pd.DataFrame({'Geeks': [23, 24, 22],
'For': [10, 12, np.nan],
'geeks': [0, np.nan, np.nan]},
columns = ['Geeks', 'For', 'geeks'])
# Use fillna of complete Dataframe
# value function will be applied on every column
dframe.fillna(value = dframe.mean(), inplace = True)
print(dframe)
# filling value of one column
dframe['For'].fillna(value = dframe['For'].mean(),
inplace = True)
print(dframe)

Output :
Groupby Method (Aggregation) :
The groupby method allows us to group together the data based off any row or
column, thus we can further apply the aggregate functions to analyze our data.
Group series using mapper (dict or key function, apply given function to group,
return result as series) or by a series of columns.
Consider the DataFrame generated by the code below:

import pandas as pd
import numpy as np
# create DataFrame
dframe = pd.DataFrame({'Geeks': [23, 24, 22, 22, 23, 24],
'For': [10, 12, 13, 14, 15, 16],
'geeks': [122, 142, 112, 122, 114, 112]},
columns = ['Geeks', 'For', 'geeks'])
# Apply groupby and aggregate function
# max to find max value of column
# "For" and column "geeks" for every
# different value of column "Geeks".
print(dframe.groupby(['Geeks']).max())

Output :

Analysis of test data using K-Means Clustering in Python


This demonstrates K-means clustering on sample random data
using the OpenCV library.
Pre-requisites: NumPy, OpenCV, matplotlib
Let's first visualize the test data with multiple features using matplotlib.
# importing required tools
import numpy as np
from matplotlib import pyplot as plt
# creating two test data
X = np.random.randint(10,35,(25,2))
Y = np.random.randint(55,70,(25,2))
Z = np.vstack((X,Y))
Z = Z.reshape((50,2))
# convert to np.float32
Z = np.float32(Z)
plt.xlabel('Test Data')
plt.ylabel('Z samples')
plt.hist(Z,256,[0,256])
plt.show()

Here ‘Z’ stacks the two test samples into a single 50 x 2 array, so each row is one
observation with two features. The data is then converted to the np.float32 type,
which is the input format expected by OpenCV's kmeans().
Output:

Now, apply the k-Means clustering algorithm to the same example as in the
above test data and see its behavior.
Steps Involved:
1) First we need to set a test data.
2) Define criteria and apply kmeans().
3) Now separate the data.
4) Finally Plot the data.

import numpy as np
import cv2
from matplotlib import pyplot as plt
X = np.random.randint(10,45,(25,2))
Y = np.random.randint(55,70,(25,2))
Z = np.vstack((X,Y))
# convert to np.float32
Z = np.float32(Z)
# define criteria and apply kmeans()
criteria = (cv2.TERM_CRITERIA_EPS +
cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
ret, label, center = cv2.kmeans(Z, 2, None, criteria, 10,
    cv2.KMEANS_RANDOM_CENTERS)
# Now separate the data
A = Z[label.ravel()==0]
B = Z[label.ravel()==1]
# Plot the data
plt.scatter(A[:,0],A[:,1])
plt.scatter(B[:,0],B[:,1],c = 'r')
plt.scatter(center[:,0],center[:,1],s = 80,c = 'y', marker = 's')
plt.xlabel('Test Data'),plt.ylabel('Z samples')
plt.show()

Output:

This example is meant to illustrate where k-means will produce intuitively
plausible clusters.
Applications :
1) Identifying Cancerous Data.
2) Prediction of Students’ Academic Performance.
3) Drug Activity Prediction.

Scala
Scala (Scalable Language) is a software programming language that mixes
object-oriented methods with functional programming capabilities that support a
more concise style of programming than other general-purpose languages like
Java, reducing the amount of code developers have to write. Another benefit of
the combined object-functional approach is that features that work well in small
programs tend to scale up efficiently when run in larger environments.

First released publicly in 2004, Scala also incorporates some imperative,
statement-oriented programming capabilities. In addition, it supports static
typing, in which the types of variables and expressions are checked at compile
time, an approach that can provide improved runtime
efficiencies. It is typically implemented on a Java virtual machine (JVM), which
opens up the language for mixed use with Java objects, classes and methods, as
well as JVM runtime optimizations.

Scala also includes its own interpreter, which can be used to execute instructions
directly, without previous compiling. Another key feature in Scala is a "parallel
collections" library designed to help developers address parallel programming
problems. Pattern matching is among the application areas in which such parallel
capabilities have proved to be especially useful.

Scala was originally written by Martin Odersky, a professor at the Ecole


Polytechnique Federale de Lausanne, in Switzerland. His previous work
included creation of the Funnel language, which shared some traits with Scala
but didn't employ JVMs as an execution engine. Odersky began work on Scala in
2001 and continues to play a lead role in its development; he also co-founded
Scala development tools maker Typesafe Inc. in 2011 and is the San Francisco
company's chairman and chief architect.

Updates to the Java language have added functional programming traits


somewhat akin to Scala's. One prominent Scala user, LinkedIn Corp., indicated
in early 2015 that it planned to reduce its reliance on the language and focus
more on Java 8 and other languages. But Scala continues to be one of the major
tools for building software infrastructure at a number of other high-profile
companies, including Twitter Inc. and local-search app developer Foursquare
Labs Inc.

Apache Spark, an open source data processing engine for batch processing,
machine learning, data streaming and other types of analytics applications, is
very significant example of Scala usage. Spark is written in Scala, and the
language is central to its support for distributed data sets that are handled as
collective software objects to help boost resiliency. However, Spark applications
can be programmed in Java and the Python language in addition to Scala.

Scala Lightweight functional programming for Java

Languages based in Java often involve verbose syntax and domain-specific


languages for testing, parsing and numerical compute processes. These things
can be the bane of developers, because the piles of repetitive code require
developers to spend extra time combing through it to find errors.

As a general-purpose programming language, Scala can help alleviate these


issues by combining both object-oriented and functional styles. To mitigate
syntax complexities, Scala also fuses imperative programming with functional
programming and can advantageously use its access to a huge ecosystem of Java
libraries.

This article examines Scala's Java versatility and interoperability, the Scala
tooling and runtime features that help ensure reliable performance, and some of
the challenges developers should watch out for when they use this language.

Scala attracted wide attention from developers in 2015 due to its effectiveness
with general-purpose cluster computing. Today, it's found in many Java virtual
machine (JVM) systems, where developers use Scala to eliminate the need for
redundant type information. Because programmers don't have to specify a type,
they also don't have to repeat it.
Scala shares a common runtime platform with Java, so it can execute Java code.
Using the JVM and JavaScript runtimes, developers can build high-performance
systems with easy access to the rest of the Java library ecosystem. Because the
JVM is deeply embedded in enterprise code, Scala offers a concise shortcut that
guarantees diverse functionality and granular control.
Developers can also rely on Scala to more effectively express general
programming patterns. By reducing the number of lines, programmers can write
type-safe code in an immutable manner, making it easy to apply concurrency and
to synchronize processing.

The power of objects


In pure object-oriented programming (OOP) environments, every value is an
object. As a result, types and behaviors of objects are described by classes,
subclasses and traits to designate inheritance. These concepts enable
programmers to eliminate redundant code and extend the use of existing classes.

Scala treats functions like first-class objects. Programmers can compose with
relatively guaranteed type safety. Scala's lightweight syntax is perfect for
defining anonymous functions and nesting. Scala's pattern-matching ability also
makes it possible to incorporate functions within class definitions.

Java developers can quickly become productive in Scala if they have an existing
knowledge of OOP, and they can achieve greater flexibility because they can
define data types that have either functional or OOP-based attributes.

Challenges of working with Scala


Some of the difficulties associated with Scala include complex build tools, a lack
of support for advanced integrated development environment language features
and project publishing issues. Other criticisms aim at Scala's generally limited
tooling and difficulties working with complex language features in the codebase.

Managing dependency versions can also be a challenge in Scala. It's not unusual
for a language to cause headaches for developers when it comes to dependency
management, but that challenge is particularly prevalent in Scala due to the sheer
number of Scala versions and upgrades. New Scala releases often mark a
significant shift that requires massive developer retraining and codebase
migrations.

Developers new to Scala should seek out the support of experienced contributors
to help minimize the learning curve. While Scala still exists in a relatively
fragmented, tribal ecosystem, it's hard to say where Scala is heading in terms of
adoption. However, with the right support, Scala functional programming can be
a major asset.
Python vs Scala
Python is a high-level, interpreted and general-purpose dynamic programming
language that focuses on code readability. Python requires less typing, provides
many libraries, fast prototyping, and several other features.
Scala is a high-level, purely object-oriented programming
language. The source code of Scala is designed in such a way that its
compiler can interpret Java classes.
Below are some major differences between Python and Scala:

Python: Python is a dynamically typed language.
Scala: Scala is a statically typed language.

Python: We don't need to specify objects in Python because it is a dynamically typed Object Oriented Programming language.
Scala: We need to specify the type of variables and objects in Scala because Scala is a statically typed Object Oriented Programming language.

Python: Python is easy to learn and use.
Scala: Scala is less difficult to learn than Python.

Python: Extra work is created for the interpreter at runtime.
Scala: No extra work is created in Scala and thus it is 10 times faster than Python.

Python: The data types are decided during runtime.
Scala: This is not the case in Scala; that is why, when dealing with large data processing, Scala should be considered instead of Python.

Python: Python's community is huge compared to Scala's.
Scala: Scala also has good community support, but it is still smaller than Python's.

Python: Python supports heavyweight process forking and doesn't support proper multithreading.
Scala: Scala has reactive cores and a list of asynchronous libraries, and hence Scala is a better choice for implementing concurrency.

Python: Testing methodologies are much more complex in Python as it is a dynamic programming language.
Scala: Testing is much better in Scala because it is a statically typed language.

Python: It is popular because of its English-like syntax.
Scala: For scalable and concurrent systems, Scala plays a much bigger role.

Python: Python is easy for developers to write code in.
Scala: Scala is less difficult to learn than Python, but it is difficult to write code in Scala.

Python: There is an interface in Python to many OS system calls and libraries. It has many interpreters.
Scala: It is basically a compiled language and all source code is compiled before execution.

Python: The Python language is highly prone to bugs whenever there is any change to the existing code.
Scala: No such problem is seen in Scala.

Python: Python has libraries for machine learning, proper data science tools and Natural Language Processing (NLP).
Scala: Scala has no such tools.

Python: Python can be used for small-scale projects.
Scala: Scala can be used for large-scale projects.

Python: It doesn't provide scalable feature support.
Scala: It provides scalable feature support.
Apache Spark with Scala – Resilient Distributed Dataset
Data is growing even faster than processing speeds. To perform computations on
such large data is often achieved by using distributed systems. A distributed
system consists of clusters (nodes/networked computers) that run processes in
parallel and communicate with each other if needed.
Apache Spark is a unified analytics engine for large-scale data processing. It
provides high-level APIs in Java, Scala, Python, and R, and an optimized engine
that supports general execution graphs. This rich set of functionalities and
libraries supports higher-level tools like Spark SQL for SQL and structured
data processing, MLlib for machine learning, GraphX for graph processing, and
Structured Streaming for incremental computation and stream processing. In this
article, we will be learning Apache Spark (version 2.x) using Scala.
Some basic concepts :

1. RDD (Resilient Distributed Dataset) – It is an immutable distributed
collection of objects. In the case of an RDD, the dataset is the main part
and it is divided into logical partitions.
2. SparkSession – The entry point to programming Spark with the
Dataset and DataFrame API.

We will be using the Scala IDE only for demonstration purposes. A dedicated Spark
environment is required to run the code below.
Let’s create our first data frame in spark.
Scala
// Importing SparkSession
import org.apache.spark.sql.SparkSession
// Creating SparkSession object
val sparkSession = SparkSession.builder()
.appName("My First Spark Application")
.master("local").getOrCreate()
// Loading sparkContext
val sparkContext = sparkSession.sparkContext
// Creating an RDD
val intArray = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
// parallelize method creates partitions; it additionally
// takes an integer argument to specify the number of partitions.
// Here we are using 3 partitions.
val intRDD = sparkContext.parallelize(intArray, 3)
// Printing number of partitions
println(s"Number of partitons in intRDD : ${intRDD.partitions.size}")
// Printing first element of RDD
println(s"First element in intRDD : ${intRDD.first}")
// Creating string from RDD
// take(n) function is used to fetch n elements from
// RDD and returns an Array.
// Then we will convert the Array to string using
// mkString function in scala.
val strFromRDD = intRDD.take(intRDD.count.toInt).mkString(", ")
println(s"String from intRDD : ${strFromRDD}")
// Printing contents of RDD
// collect function is used to retrieve all the data in an RDD.
println("Printing intRDD: ")
intRDD.collect().foreach(println)

Output:

Number of partitons in intRDD : 3


First element in intRDD : 1
String from intRDD : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Printing intRDD:
1
2
3
4
5
6
7
8
9
10
Scala is a programming language that is an extension of Java as it was originally
built on the Java Virtual Machine (JVM). So it can easily integrate with Java.
However, the real reason that Scala is so useful for Data Science is that it can be
used along with Apache Spark to manage large amounts of data. So when it
comes to big data, Scala is the go-to language. Many of the data science
frameworks that are created on top of Hadoop actually use Scala or Java or are
written in these languages. However, one downside of Scala is that it is difficult
to learn and there are not as many online community support groups as it is a
niche language.
Apache Spark
According to Databricks' definition, “Apache Spark is a lightning-fast unified
analytics engine for big data and machine learning. It was originally developed
at UC Berkeley in 2009.”
Databricks is one of the major contributors to Spark; others include Yahoo!, Intel, etc.
Apache Spark is one of the largest open-source projects for data processing. It is
a fast, in-memory data processing engine.
History of Spark :
Spark started in 2009 in the UC Berkeley R&D Lab, which is known as AMPLab
now. Then in 2010 Spark became open source under a BSD license. After that,
Spark was transferred to the ASF (Apache Software Foundation) in June 2013. Spark's
researchers had previously been working on Hadoop MapReduce, and in the UC Berkeley
R&D Lab they observed that it was inefficient for iterative and interactive computing
jobs. Spark was therefore designed to support in-memory storage and efficient fault
recovery, making it fast for interactive queries and iterative algorithms.
Features of Spark :

Apache Spark can be used to perform batch processing.
Apache Spark can also be used to perform stream processing; previously, Apache Storm / S4 were used for stream processing.
It can be used for interactive processing; previously, Apache Impala or Apache Tez were used for interactive processing.
Spark is also useful for graph processing; Neo4j / Apache Giraph were previously used for graph processing.
Spark can process data in both real-time and batch mode.

So, we can say that Spark is a powerful open-source engine for data processing.
Components of Apache Spark
Spark is a cluster computing system. It is faster as compared to other cluster
computing systems (such as Hadoop). It provides high-level APIs in Python,
Scala, and Java. Parallel jobs are easy to write in Spark. In this article, we will
discuss the different components of Apache Spark.
Spark processes huge datasets and is one of the most active Apache projects of the current time. Spark is written in Scala and
provides APIs in Python, Scala, Java, and R. The most vital feature of Apache Spark is its in-memory cluster computing, which increases the speed of data processing. Spark is a more general and faster processing platform: it helps us run programs up to a hundred times faster than Hadoop in memory and ten times faster on disk. The main features of Spark are:

1. Multiple Language Support: Apache Spark supports multiple languages; it provides APIs written in Scala, Java, Python and R. It permits users to write applications in several languages.
2. Quick Speed: The most vital feature of Apache Spark is its processing speed. It permits an application to run on a Hadoop cluster up to one hundred times faster in memory and ten times faster on disk.
3. Runs Everywhere: Spark can run on multiple platforms without affecting processing speed. It can run on Hadoop, Kubernetes, Mesos, Standalone, and even in the cloud.
4. General Purpose: It is powered by a plethora of libraries for machine learning (MLlib), DataFrames and SQL, along with Spark Streaming and GraphX. An application is allowed to use a mix of these libraries, which are coherently associated with it. The ability to combine streaming, SQL, and complex analytics within the same application makes Spark a general-purpose framework.
5. Advanced Analytics: Apache Spark also supports "Map" and "Reduce", as mentioned earlier. However, alongside MapReduce, it supports streaming data, SQL queries, graph algorithms, and machine learning. Thus, Apache Spark may be used to perform advanced analytics.

Components of Spark:
Spark is made up of several core components. Let's understand each of the components in detail:

1. Spark Core: All the functionality provided by Apache Spark is built on top of Spark Core. It delivers speed by providing in-memory computation capability. Spark Core is the foundation of parallel and distributed processing of giant datasets. It is the backbone of the essential I/O functionality and is significant for programming and monitoring the Spark cluster. It holds all the components related to scheduling, distributing and monitoring jobs on a cluster, task dispatching and fault recovery. The functionalities of this component are:
1. It contains the basic functionality of Spark (task scheduling, memory management, fault recovery, interacting with storage systems).
2. It is home to the API that defines RDDs.
2. Spark SQL Structured data: The Spark SQL component is built on top of Spark Core and is used to provide structured processing on the data. It provides standard access to a range of data sources, including Hive, JSON, and JDBC. It supports querying data either via SQL or via the Hive query language, and it works with both structured and semi-structured information. It also powers interactive, analytical applications across both streaming and historical data. Spark SQL is a module in Spark that integrates relational processing with Spark's programming API. The main functionality of this module is:
1. It is a Spark package for working with structured data.
2. It supports many sources of data, including Hive tables, Parquet and JSON.
3. It allows developers to intermix SQL with programmatic data manipulation supported by RDDs in Python, Scala and Java.
3. Spark Streaming: Spark Streaming permits scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark can access data from sources such as Flume or a TCP socket and can run different algorithms on data delivered to file systems, databases and live dashboards. Spark uses micro-batching for real-time streaming: micro-batching is a technique that allows a process or task to treat a stream as a sequence of small batches of data. Hence Spark Streaming groups the live data into small batches and delivers them to the batch system for processing (a short PySpark streaming sketch follows this list). The functionality of this module is:
1. It enables processing of live streams of data, such as log files generated by production web servers.
2. The APIs defined in this module are quite similar to the Spark Core RDD APIs.
4. MLlib Machine Learning: MLlib in Spark is a scalable machine learning library that contains various machine learning algorithms. The motive behind MLlib's creation is to make the implementation of machine learning simple. It contains machine learning libraries and implementations of various algorithms, for example clustering, regression, classification and collaborative filtering.
5. GraphX graph processing: It is an API for graphs and graph-parallel execution, used for network analytics on stored data. Clustering, classification, traversal, searching, and pathfinding are also possible on graphs. GraphX optimizes how vertices and edges are represented, including when they are primitive data types. To support graph computation, it provides fundamental operators like subgraph, joinVertices, and aggregateMessages, as well as an optimized variant of the Pregel API.
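As a small illustration of the micro-batching idea mentioned in the Spark Streaming component above, the following PySpark sketch counts words arriving on a local TCP socket. The host, port and application name are assumptions made for the example; the socket can be fed with a tool such as netcat.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Read a stream of text lines from a socket; Spark processes them as micro-batches.
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Split each line into words and maintain a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the updated counts for every micro-batch to the console.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()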

Uses of Apache Spark: The main applications of the spark framework are:

1. The data generated by different systems is often not consistent enough to combine for analysis. To fetch consistent information from such systems we use processes like extract, transform and load (ETL), and Spark reduces the time and cost involved because these processes are implemented very efficiently in it.
2. It is tough to handle time-generated data such as log files. Spark is capable of working well with streams of information and reusing operations.
3. As Spark can store information in memory and run repeated queries quickly, it makes it straightforward to work out which machine learning algorithms can be used for a particular kind of data.

Introduction to PySpark Distributed Computing with Apache Spark


Datasets are becoming huge. In fact, data is growing faster than processing speeds. Therefore, algorithms involving large amounts of data and heavy computation are often run on a distributed computing system. A distributed computing system involves nodes (networked computers) that run processes in parallel and communicate with each other if necessary.
MapReduce – The programming model that is used for Distributed computing
is known as MapReduce. The MapReduce model involves two stages, Map and
Reduce.

1. Map – The mapper processes each line of the input data (it is in the
form of a file), and produces key – value pairs.

Input data → Mapper → list([key, value])

2. Reduce – The reducer processes the list of key – value pairs (after the
Mapper’s function). It outputs a new set of key – value pairs.

list([key, value]) → Reducer → list([key, list(values)])
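To make the two stages concrete, here is a minimal, plain-Python sketch of the MapReduce model for counting words. The input lines and the mapper/reducer helpers are illustrative only and not part of any framework.

from collections import defaultdict

def mapper(line):
    # Map stage: each input line produces a list of (key, value) pairs.
    return [(word, 1) for word in line.split()]

def reducer(pairs):
    # Reduce stage: group values by key and combine them (here, by summing).
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return list(grouped.items())

lines = ["big data analytics", "big data tools"]   # stand-in for file input
mapped = [pair for line in lines for pair in mapper(line)]
print(reducer(mapped))   # [('big', 2), ('data', 2), ('analytics', 1), ('tools', 1)]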


Spark – Spark (the open source big data processing engine by Apache) is a cluster computing system. It is faster compared to other cluster computing systems (such as Hadoop). It provides high-level APIs in Python, Scala, and Java. Parallel jobs are easy to write in Spark. We will cover PySpark (Python + Apache Spark), because this will make the learning curve flatter. We will see how to create RDDs (the fundamental data structure of Spark).
RDDs (Resilient Distributed Datasets) – RDDs are immutable collections of objects. Since we are using PySpark, these objects can be of multiple types. These will become clearer further on.
SparkContext – For creating a standalone application in Spark, we first define
a SparkContext –

from pyspark import SparkConf, SparkContext


conf = SparkConf().setMaster("local").setAppName("Test")
# setMaster(local) - we are doing tasks on a single machine
sc = SparkContext(conf = conf)
RDD transformations – Now that a SparkContext object is created, we will create RDDs and see some transformations on them.

# create an RDD called lines from 'file_name.txt'
# the second argument is the minimum number of partitions
lines = sc.textFile("file_name.txt", 2)
# collect() returns the entire contents of the RDD
print(lines.collect())

One major advantage of using Spark is that it does not eagerly load the dataset into memory; lines is a pointer to the 'file_name.txt' file, which is only read when an action such as collect() is called.
A simple PySpark app to count the degree of each vertex for a given graph

from pyspark import SparkConf, SparkContext


conf = SparkConf().setMaster("local").setAppName("Test")
# setMaster(local) - we are doing tasks on a single machine
sc = SparkContext(conf = conf)
def conv(line):
    # each line "u v" becomes a (vertex, [neighbour]) pair
    line = line.split()
    return (int(line[0]), [int(line[1])])

def numNeighbours(x, y):
    # merge two partial neighbour lists of the same vertex
    return x + y

lines = sc.textFile('graph.txt')
edges = lines.map(lambda line: conv(line))
# merge the neighbour lists per vertex, then take their lengths to get degrees
Adj_list = edges.reduceByKey(lambda x, y: numNeighbours(x, y)).mapValues(len)
print(Adj_list.collect())
Understanding the above code –

1. Our text file is in the following format – (each line represents an edge
of a directed graph)
1 2
1 3
2 3
3 4
. .
. .
. .
2. Large Datasets may contain millions of nodes, and edges.
3. First few lines set up the SparkContext. We create an RDD lines from
it.
4. Then, we transform the lines RDD to the edges RDD. The function conv acts on each line, and key-value pairs of the form (1, [2]), (1, [3]), (2, [3]), (3, [4]), … are stored in the edges RDD.
5. After this, reduceByKey aggregates all the pairs corresponding to a particular key using the numNeighbours function, and taking the length of each merged neighbour list gives each vertex's degree in a separate RDD Adj_list, which has the form (1, 2), (2, 1), (3, 1), …

Running the code –

1. The above code can be run by the following commands –

$ cd /home/arik/Downloads/spark-1.6.0/
$ ./bin/spark-submit degree.py

2. You can use your Spark installation path in the first line.

Pyspark Linear regression using Apache MLlib


Problem Statement: Build a predictive Model for the shipping company, to find
an estimate of how many Crew members a ship requires.
The dataset contains 159 instances with 9 features.
The Description of dataset is as below:

Let’s make the Linear Regression Model, predicting Crew members


Attached dataset: cruise_ship_info
import pyspark
from pyspark.sql import SparkSession
#SparkSession is now the entry point of Spark
#SparkSession can also be construed as gateway to spark libraries
#create instance of spark class
spark=SparkSession.builder.appName('housing_price_model').getOrCreate()
#create spark dataframe of input csv file
df=spark.read.csv(r'D:\python coding\pyspark_tutorial\Linear regression\cruise_ship_info.csv',
                  inferSchema=True, header=True)
df.show(10)
Output :
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23|            70.367|     20.56|  8.55| 10.22|            34.23| 9.2|
|Fascination|   Carnival| 19|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Freedom|   Carnival|  6|110.23899999999999|      37.0|  9.51| 14.87|            29.79|11.5|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
#prints structure of dataframe along with datatype
df.printSchema()
Output :

#In our predictive model, below are the columns


df.columns
Output :

#columns identified as features are as below:


#['Cruise_line','Age','Tonnage','passengers','length','cabins','passenger_density']
#to work on the features, spark MLlib expects every value to be in numeric form
#feature 'Cruise_line' is of string datatype
#using StringIndexer, the string type will be typecast to a numeric datatype
#import StringIndexer for the typecasting
from pyspark.ml.feature import StringIndexer
indexer=StringIndexer(inputCol='Cruise_line',outputCol='cruise_cat')
indexed=indexer.fit(df).transform(df)
#the above code converts the string column to a numeric feature and creates a new dataframe
#the new dataframe contains a new feature 'cruise_cat' which can be fed to the model
for item in indexed.head(5):
    print(item)
    print('\n')

Output :
Row(Ship_name='Journey', Cruise_line='Azamara', Age=6,
Tonnage=30.276999999999997, passengers=6.94, length=5.94,
cabins=3.55, passenger_density=42.64, crew=3.55, cruise_cat=16.0)

Row(Ship_name='Quest', Cruise_line='Azamara', Age=6,


Tonnage=30.276999999999997, passengers=6.94, length=5.94,
cabins=3.55, passenger_density=42.64, crew=3.55, cruise_cat=16.0)

Row(Ship_name='Celebration', Cruise_line='Carnival', Age=26,


Tonnage=47.262, passengers=14.86, length=7.22,
cabins=7.43, passenger_density=31.8, crew=6.7, cruise_cat=1.0)

Row(Ship_name='Conquest', Cruise_line='Carnival', Age=11,


Tonnage=110.0, passengers=29.74, length=9.53,
cabins=14.88, passenger_density=36.99, crew=19.1, cruise_cat=1.0)

Row(Ship_name='Destiny', Cruise_line='Carnival', Age=17,


Tonnage=101.353, passengers=26.42, length=8.92,
cabins=13.21, passenger_density=38.36, crew=10.0, cruise_cat=1.0)
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
#creating vectors from features
#Apache MLlib takes input in vector form
assembler=VectorAssembler(inputCols=['Age',
'Tonnage',
'passengers',
'length',
'cabins',
'passenger_density',
'cruise_cat'],outputCol='features')
output=assembler.transform(indexed)
output.select('features','crew').show(5)
#output as below

Output :

#final data consist of features and label which is crew.


final_data=output.select('features','crew')
#splitting data into train and test
train_data,test_data=final_data.randomSplit([0.7,0.3])
train_data.describe().show()
Output :

test_data.describe().show()
Output :
#import LinearRegression library
from pyspark.ml.regression import LinearRegression
#creating an object of class LinearRegression
#object takes features and label as input arguments
ship_lr=LinearRegression(featuresCol='features',labelCol='crew')
#pass train_data to train model
trained_ship_model=ship_lr.fit(train_data)
#evaluating model trained for Rsquared error
ship_results=trained_ship_model.evaluate(train_data)
print('Rsquared Error :',ship_results.r2)
#R2 value shows accuracy of model is 92%
#model accuracy is very good and can be used for predictive analysis
Output :

#testing Model on unlabeled data


#create unlabeled data from test_data
#testing model on unlabeled data
unlabeled_data=test_data.select('features')
unlabeled_data.show(5)
Output :
predictions=trained_ship_model.transform(unlabeled_data)
predictions.show()
#below are the results of output from test data

Output :

Pyspark Linear regression with Advanced Feature Dataset using Apache


MLlib
Ames Housing Data: The Ames Housing dataset was compiled by Dean De Cock for use in data science education and is an expanded version of the often-cited Boston Housing dataset. The dataset provided has 80 features and 1459 instances.
Dataset description is as below:
For the demo only a few columns are displayed, but there are many more columns in the dataset.
Examples:
Input Attached Dataset: Ames_housing_dataset
Code :
# SparkSession is now the entry point of Spark
# SparkSession can also be construed as gateway to spark libraries
import pyspark
from pyspark.sql import SparkSession
# create instance of spark class
spark =
SparkSession.builder.appName('ames_housing_price_model').getOrCreate()
df_train = spark.read.csv(r'D:\python coding\pyspark_tutorial\Linear regression'
                          r'\housing price multiple features'
                          r'\house-prices-advanced-regression-techniques'
                          r'\train.csv', inferSchema = True, header = True)

Code :
# identifying the columns having less meaningful data on the basis of datatypes
l_int = []
for item in df_train.dtypes:
    if item[1] == 'int':
        l_int.append(item[0])
print(l_int)

l_str = []
for item in df_train.dtypes:
    if item[1] == 'string':
        l_str.append(item[0])
print(l_str)

Output

Integer Datatypes:
['Id', 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
'TotalBsmtSF',
'1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'GarageArea',
'WoodDeckSF',
'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
'MiscVal', 'MoSold', 'YrSold', 'SalePrice']

String Datatypes:
['MSZoning', 'LotFrontage', 'Street', 'Alley', 'LotShape', 'LandContour',
'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
'Condition2',
'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
'Exterior2nd',
'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation',
'BsmtQual',
'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating',
'HeatingQC',
'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu',
'GarageType',
'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive',
'PoolQC', 'Fence',
'MiscFeature', 'SaleType', 'SaleCondition']
Code :
# identifying integer column records having less meaningful data
from pyspark.sql.functions import col
for i in df_train.columns:
    if i in l_int:
        ct_total = df_train.select(i).count()
        ct_zeros = df_train.filter((col(i) == 0)).count()
        per_zeros = (ct_zeros / ct_total) * 100
        print('total count / zeros count / zeros percent '
              + i + ' ' + str(ct_total) + ' / ' + str(ct_zeros) + ' / ' + str(per_zeros))

Output of zeros percentage:


total count/zeros count/zeros_percent OpenPorchSF 1460 / 656 /
44.93150684931507
total count/zeros count/zeros_percent EnclosedPorch 1460 / 1252 /
85.75342465753425
total count/zeros count/zeros_percent 3SsnPorch 1460 / 1436 /
98.35616438356163
total count/zeros count/zeros_percent ScreenPorch 1460 / 1344 /
92.05479452054794
total count/zeros count/zeros_percent PoolArea 1460 / 1453 /
99.52054794520548
total count/zeros count/zeros_percent PoolQC 1460 / 1453 / 99.52054794520548
total count/zeros count/zeros_percent Fence 1460 / 1453 / 99.52054794520548
total count/zeros count/zeros_percent MiscFeature 1460 / 1453 /
99.52054794520548
total count/zeros count/zeros_percent MiscVal 1460 / 1408 /
96.43835616438356
total count/zeros count/zeros_percent MoSold 1460 / 0 / 0.0
total count/zeros count/zeros_percent YrSold 1460 / 0 / 0.0
Code :
# above calculation gives us an insight about the useful features
# now drop the columns having zeros or NA % more than 75 %
df_new = df_train.drop(*['BsmtFinSF2', 'LowQualFinSF', 'BsmtHalfBath',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch' ,
'PoolArea', 'PoolQC', 'Fence', 'MiscFeature',
'MiscVal', 'Alley'])
df_new = df_new.drop(*['Id'])
# now we have the clean data to work

Code :
# converting string to numeric feature
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
feat_list =['MSZoning', 'LotFrontage', 'Street', 'LotShape', 'LandContour',
'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle',
'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation',
'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
'BsmtFinType2',
'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
'Functional', 'FireplaceQu', 'GarageType',
'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond',
'PavedDrive', 'SaleType', 'SaleCondition']
print('indexed list created')
# there are multiple features to work
# using pipeline we can convert multiple features to indexers
indexers = [StringIndexer(inputCol = column, outputCol =
column+"_index").fit(df_new) for column in feat_list]
type(indexers)
# Combines a given list of columns into a single vector column.
# input_cols: Columns to be assembled.
# returns Dataframe with assembled column.
pipeline = Pipeline(stages = indexers)
df_feat = pipeline.fit(df_new).transform(df_new)
df_feat.columns
# using above code we have converted list of features into indexes
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
# we will convert below columns into features to work with
assembler = VectorAssembler(inputCols =['MSSubClass', 'LotArea',
'OverallQual',
'OverallCond', 'YearBuilt', 'YearRemodAdd',
'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF',
'1stFlrSF', '2ndFlrSF', 'GrLivArea',
'BsmtFullBath', 'FullBath', 'HalfBath',
'GarageArea', 'MoSold', 'YrSold',
'MSZoning_index', 'LotFrontage_index',
'Street_index', 'LotShape_index',
'LandContour_index', 'Utilities_index',
'LotConfig_index', 'LandSlope_index',
'Neighborhood_index', 'Condition1_index',
'Condition2_index', 'BldgType_index',
'HouseStyle_index', 'RoofStyle_index',
'RoofMatl_index', 'Exterior1st_index',
'Exterior2nd_index', 'MasVnrType_index',
'MasVnrArea_index', 'ExterQual_index',
'ExterCond_index', 'Foundation_index',
'BsmtQual_index', 'BsmtCond_index',
'BsmtExposure_index', 'BsmtFinType1_index',
'BsmtFinType2_index', 'Heating_index',
'HeatingQC_index', 'CentralAir_index',
'Electrical_index', 'KitchenQual_index',
'Functional_index', 'FireplaceQu_index',
'GarageType_index', 'GarageYrBlt_index',
'GarageFinish_index', 'GarageQual_index',
'GarageCond_index', 'PavedDrive_index',
'SaleType_index', 'SaleCondition_index'],
outputCol ='features')
output = assembler.transform(df_feat)
final_data = output.select('features', 'SalePrice')
# splitting data for test and validation
train_data, test_data = final_data.randomSplit([0.7, 0.3])
Code :
train_data.describe().show()

test_data.describe().show()
Code :
from pyspark.ml.regression import LinearRegression
house_lr = LinearRegression(featuresCol ='features', labelCol ='SalePrice')
trained_house_model = house_lr.fit(train_data)
house_results = trained_house_model.evaluate(train_data)
print('Rsquared Error :', house_results.r2)
# Rsquared Error : 0.8279155904297449
# model accuracy is 82 % with train data
# evaluate model on test_data
test_results = trained_house_model.evaluate(test_data)
print('Rsquared error :', test_results.r2)
# Rsquared error : 0.8431420382408793
# result is quite good with 84% accuracy
# create unlabelled data from test_data
# test_data.show()
unlabeled_data = test_data.select('features')
unlabeled_data.show()
Code :
predictions = trained_house_model.transform(unlabeled_data)
predictions.show()
SQL
SQL (Structured Query Language) is a standardized programming language
that's used to manage relational databases and perform various operations on the
data in them. Initially created in the 1970s, SQL is regularly used not only by
database administrators, but also by developers writing data integration scripts
and data analysts looking to set up and run analytical queries.

The uses of SQL include modifying database table and index structures; adding,
updating and deleting rows of data; and retrieving subsets of information from
within a database for transaction processing and analytics applications. Queries
and other SQL operations take the form of commands written as statements --
commonly used SQL statements include select, insert, update, delete, create,
alter and truncate.
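As a small illustration of these statements, the following sketch uses Python's built-in sqlite3 module against a throwaway in-memory database; the customers table and its columns are hypothetical.

import sqlite3

# Hypothetical in-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: create a table.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# DML: insert and update rows.
cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Asha", "Pune"))
cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Mumbai", "Asha"))

# Query: retrieve a subset of the data.
cur.execute("SELECT id, name, city FROM customers WHERE city = ?", ("Mumbai",))
print(cur.fetchall())

# DML: delete a row, then commit and close.
cur.execute("DELETE FROM customers WHERE id = 1")
conn.commit()
conn.close()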

SQL became the de facto standard programming language for relational


databases after they emerged in the late 1970s and early 1980s. Also known as
SQL databases, relational systems comprise a set of tables containing data in
rows and columns. Each column in a table corresponds to a category of data --
for example, customer name or address -- while each row contains a data value
for the intersecting column.

SQL standard and proprietary extensions


An official SQL standard was adopted by the American National Standards
Institute (ANSI) in 1986 and then by the International Organization for
Standardization, known as ISO, in 1987. More than a half-dozen joint updates to
the standard have been released by the two standards development bodies since
then; as of this writing, the most recent version is SQL:2011, approved that year.

Both proprietary and open source relational database management systems built
around SQL are available for use by organizations. They include Microsoft SQL
Server, Oracle Database, IBM DB2, SAP HANA, SAP Adaptive
Server, MySQL (now owned by Oracle) and PostgreSQL. However, many of
these database products support SQL with proprietary extensions to the standard
language for procedural programming and other functions. For example,
Microsoft offers a set of extensions called Transact-SQL (T-SQL), while Oracle's
extended version of the standard is PL/SQL. As a result, the different variants of
SQL offered by vendors aren't fully compatible with one another.

SQL commands and syntax


SQL commands are divided into several different types, among them data
manipulation language (DML) and data definition language (DDL) statements,
transaction controls and security measures. The DML vocabulary is used to
retrieve and manipulate data, while DDL statements are for defining and
modifying database structures. The transaction controls help manage transaction
processing, ensuring that transactions are either completed or rolled back if
errors or problems occur. The security statements are used to control database
access as well as to create user roles and permissions.

SQL syntax is the coding format used in writing statements. One example of a DDL statement written in Microsoft's T-SQL to modify a database table in SQL Server 2016 is the ALTER TABLE WITH (ONLINE = ON | OFF) option.
SQL-on-Hadoop tools
SQL-on-Hadoop query engines are a newer offshoot of SQL that enable
organizations with big data architectures built around Hadoop systems to take
advantage of it instead of having to use more complex and less familiar
languages -- in particular, the MapReduce programming environment for
developing batch processing applications.

More than a dozen SQL-on-Hadoop tools have become available through


Hadoop distribution providers and other vendors; many of them are open source
software or commercial versions of such technologies. In addition, the Apache
Spark processing engine, which is often used in conjunction with Hadoop,
includes a Spark SQL module that similarly supports SQL-based programming.
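For example, the Spark SQL module lets a DataFrame be registered as a temporary view and queried with ordinary SQL; the sales data below is invented purely for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLSketch").getOrCreate()

# Hypothetical sales data created in-line for illustration.
sales = spark.createDataFrame(
    [("north", 100.0), ("south", 250.0), ("north", 75.0)],
    ["region", "amount"])

# Register the DataFrame as a temporary view and query it with plain SQL.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()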

In general, SQL-on-Hadoop is still an emerging technology, and most of the


available tools don't support all of the functionality offered in relational
implementations of SQL. But they're becoming a regular component of Hadoop
deployments as companies look to get developers and data analysts with SQL
skills involved in programming big data applications.

SQL-on-Hadoop
SQL-on-Hadoop is a class of analytical application tools that combine
established SQL-style querying with newer Hadoop data framework elements.

By supporting familiar SQL queries, SQL-on-Hadoop lets a wider group of


enterprise developers and business analysts work with Hadoop on commodity
computing clusters. Because SQL was originally developed for relational
databases, it has to be modified for the Hadoop 1 model, which uses the Hadoop
Distributed File System and Map-Reduce or the Hadoop 2 model, which can
work without either HDFS or Map-Reduce.

The different means for executing SQL in Hadoop environments can be divided
into (1) connectors that translate SQL into a MapReduce format; (2) "push
down" systems that forgo batch-oriented MapReduce and execute SQL within
Hadoop clusters; and (3) systems that apportion SQL work between MapReduce-
HDFS clusters or raw HDFS clusters, depending on the workload.

One of the earliest efforts to combine SQL and Hadoop resulted in the Hive data
warehouse, which featured HiveQL software for translating SQL-like queries
into MapReduce jobs. Other tools that help support SQL-on-Hadoop include
BigSQL, Drill, Hadapt, Hawq, H-SQL, Impala, JethroData, Polybase, Presto,
Shark (Hive on Spark), Spark, Splice Machine, Stinger, and Tez (Hive on Tez).

Selecting the right SQL-on-Hadoop engine to access big data

In the world of Hadoop and NoSQL, the spotlight is now on SQL-on-


Hadoop engines. Today, many different engines are available, making it hard for
organizations to choose. This article presents some important requirements to
consider when selecting one of these engines.

With SQL-on-Hadoop technologies, it's possible to access big data stored in


Hadoop by using the familiar SQL language. Users can plug in almost any
reporting or analytical tool to analyze and study the data. Before SQL-on-
Hadoop, accessing big data was restricted to the happy few. You had to have in-
depth knowledge of technical application programming interfaces, such as the
ones for the Hadoop Distributed File System, MapReduce or HBase, to work
with the data. Now, thanks to SQL-on-Hadoop, everyone can use his favorite
tool. For an organization, that opens up big data to a much larger audience,
which can increase the return on its big data investment.

The first SQL-on-Hadoop engine was Apache Hive, but during the last 12
months, many new ones have been released. These include CitusDB, Cloudera
Impala, Concurrent Lingual, Hadapt, InfiniDB, JethroData, MammothDB,
Apache Drill, MemSQL, Pivotal HawQ, Progress DataDirect, ScleraDB, Simba
and Splice Machine.

In addition to these implementations, all the data virtualization servers should be


included because they also offer SQL access to Hadoop data. In fact, they are
designed to access all kinds of data sources, including Hadoop, and they allow
different data sources to be integrated. Examples of data virtualization servers
are Cirro Data Hub, Cisco/Composite Information Server, Denodo Platform,
Informatica Data Services, Red Hat JBoss Data Virtualization and Stone Bond
Enterprise Enabler Virtuoso.

And, of course, there are a few SQL database management systems that support
polyglot persistence. This means that they can store data in their own native SQL
database or in Hadoop; by doing so, they also offer SQL access to Hadoop data.
Examples are EMC/Greenplum UAP, HP Vertica (on MapR), Microsoft
PolyBase, Actian ParAccel and Teradata Aster Database (via SQL-H).

SQL equality on Hadoop?


In other words, organizations can choose from a wide range of SQL-on-Hadoop
engines. But which one should be selected? Or are they so alike that it doesn't
matter which one is picked?

The answer is that it does matter, because not all of these technologies are
created equal. On the outside, they all look the same, but internally they are very
different. For example, CitusDB knows where all the data is stored and uses that
knowledge to access the data as efficiently as possible. JethroData stores indexes
to get direct access to data, and Splice Machine offers a transactional SQL
interface.

Selecting the right SQL-on-Hadoop technology requires a detailed study. To get


started, you should evaluate the following requirements before selecting one of
the available engines.

SQL dialect. The richer the SQL dialect supported, the wider the range of
applications that can benefit from it. In addition, the richer the dialect, the more
query processing can be pushed to Hadoop and the less the applications and
reporting tools have to do.

Joins. Executing joins on big tables fast and efficiently is not always easy,
especially if the SQL-on-Hadoop engine has no idea where the data is stored. An
inefficient style of join processing can lead to massive amounts of I/O and can
cause colossal data transport between nodes. Both can result in really poor
performance.

Non-traditional data. Initially, SQL was designed to process highly structured


data: Each record in a table has the same set of columns, and each column holds
one atomic value. Not all big data in Hadoop has this traditional
structure. Hadoop files may contain nested data, variable data (with hierarchical
structures), schema-less data and self-describing data. A SQL-on-Hadoop engine
must be able to translate all these forms of data to flat relational data, and must
be able to optimize queries on these forms of data as well.

Storage format. Hadoop supports some "standard" storage formats of the data,
such as Parquet, Avro and ORCFile. The more SQL-on-Hadoop technologies use
such formats, the more tools and other SQL-on-Hadoop engines can read that
same data. This drastically minimizes the need to replicate data. Thus, it's
important to verify whether a proprietary storage format is used.
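As a brief sketch of why standard formats matter, the PySpark snippet below writes a small DataFrame to Parquet and reads it back; the output path is an assumption, and any engine that understands Parquet could read the same files.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetSketch").getOrCreate()

# Hypothetical DataFrame written in a standard columnar format.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Writing to Parquet keeps the data readable by other SQL-on-Hadoop engines
# (Hive, Impala, Presto, etc.) without replicating it.
df.write.mode("overwrite").parquet("/tmp/example_parquet")

# Any Spark (or Hive/Impala) job can read the same files back.
spark.read.parquet("/tmp/example_parquet").show()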

User-defined functions. To use SQL to execute complex analytical functions,


such as Gaussian discriminative analysis and market basket analysis, it's
important that they're supported by SQL or can be developed. Such functions are
called user-defined functions (UDFs). It's also important that the SQL-on-
Hadoop engine can distribute the execution of UDFs over as many nodes and
disks as possible.
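As an illustration, the sketch below registers a toy user-defined function with Spark SQL so it can be applied both through the DataFrame API and from SQL; the scoring logic and column names are invented for the example.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("UDFSketch").getOrCreate()

# Hypothetical UDF: a toy scoring function that Spark distributes across the cluster.
def score(amount):
    return amount * 1.18  # e.g. add a fixed tax rate

score_udf = udf(score, DoubleType())
spark.udf.register("score_sql", score, DoubleType())  # usable from SQL as well

orders = spark.createDataFrame([(1, 100.0), (2, 40.0)], ["order_id", "amount"])
orders.withColumn("scored", score_udf("amount")).show()
spark.sql("SELECT score_sql(10.0) AS scored").show()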

Multi-user workloads. It must be possible to set parameters that determine how


the engine should divide its resources among different queries and different
types of queries. For example, queries from different applications may have
different processing priorities; long-running queries should get less priority than
simple queries being processed concurrently; and unplanned and resource-
intensive queries may have to be cancelled or temporarily interrupted if they use
too many resources. SQL-on-Hadoop engines require smart and advanced
workload managers.

Data federation. Not all data is stored in Hadoop. Most enterprise data is still
stored in other data sources, such as SQL databases. A SQL-on-Hadoop engine
must support distributed joins on data stored in all kinds of data sources. In other
words, it must support data federation.

It would not surprise me if every organization that uses Hadoop eventually


deploys a SQL-on-Hadoop engine (or maybe even a few). As organizations
compare and evaluate the available technologies, assessing the engine's
capabilities for the requirements listed in this article is a great starting point.

Apache Hive
Apache Hive is an open source data warehouse system for querying and
analyzing large data sets that are principally stored in Hadoop files. It is
commonly a part of compatible tools deployed as part of the software ecosystem
based on the Hadoop framework for handling large data sets in a distributed
computing environment.

Like Hadoop, Hive has roots in batch processing techniques. It originated in 2007 with developers at Facebook who sought to provide SQL access to Hadoop data for analytics users. Like Hadoop, Hive was developed to address the need to handle petabytes of data accumulating via web activity. Release 1.0 became available in February 2015.

How Apache Hive works


Initially, Hadoop processing relied solely on the MapReduce framework, and
this required users to understand advanced styles of Java programming in order
to successfully query data. The motivation behind Apache Hive was to simplify
query development, and to, in turn, open up Hadoop unstructured data to a wider
group of users in organizations.
Hive has three main functions: data summarization, query and analysis. It
supports queries expressed in a language called HiveQL, or HQL, a declarative
SQL-like language that, in its first incarnation, automatically translated SQL-
style queries into MapReduce jobs executed on the Hadoop platform. In
addition, HiveQL supported custom MapReduce scripts to plug into queries.
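A minimal way to try HiveQL-style statements is through a SparkSession with Hive support enabled, as sketched below; the web_logs table, its columns and values are assumptions made for the example, and a working Hive metastore is presumed.

from pyspark.sql import SparkSession

# Assumes a Spark build with Hive support and access to a Hive metastore.
spark = SparkSession.builder \
    .appName("HiveQLSketch") \
    .enableHiveSupport() \
    .getOrCreate()

# HiveQL-style DDL and a simple aggregate query; names are hypothetical.
spark.sql("CREATE TABLE IF NOT EXISTS web_logs (url STRING, hits INT)")
spark.sql("INSERT INTO web_logs VALUES ('/home', 10), ('/cart', 3)")
spark.sql("SELECT url, SUM(hits) AS total_hits FROM web_logs GROUP BY url").show()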

When SQL queries are submitted via Hive, they are initially received by a driver component that creates session handles and forwards requests to a compiler via Java Database Connectivity/Open Database Connectivity interfaces; the compiled jobs are subsequently forwarded for execution. Hive enables data serialization/deserialization and increases flexibility in schema design by including a system catalog called the Hive metastore.

Apache Hive brings SQL capabilities to Hadoop analytics. A driver component


creates session handles and links to a compiler, and work is forwarded for
execution. Live Long and Process daemons can separately handle I/O, caching
and query fragment execution, boosting performance.
How Hive has evolved
Like Hadoop, Hive has evolved to encompass more than just MapReduce.
Inclusion of the YARN resource manager in Hadoop 2.0 helped developers'
ability to expand use of Hive, as it did other Hadoop ecosystem components.
Over time, HiveQL has gained support for the Apache Spark SQL engine as well
as the Hive engine, and both HiveQL and the Hive Engine have added support
for distributed process execution via Apache Tez and Spark.

Early Hive file support comprised text files (also called flat files), SequenceFiles
(flat files consisting of binary key/value pairs) and Record Columnar Files
(RCFiles), which store columns of a table in the manner of a columnar database. Hive
columnar storage support has come to include Optimized Row Columnar (ORC)
files and Parquet files.

Hive execution and interactivity were a topic of attention nearly from its
inception. That is because query performance lagged that of more familiar SQL
engines. In 2013, to boost performance, Apache Hive committers began work on
the Stinger project, which brought Apache Tez and directed acyclic graph
processing to the warehouse system.

Also accompanying Stinger were new approaches that improved performance by


adding a cost-based optimizer, in-memory hash joins, a vector query engine and
other enhancements. Query performance reaching 100,000 queries per hour and
analytics processing of 100 million rows per second, per node have been
reported for recent versions of Hive.

Additions accompanying releases 2.3 in 2017 and release 3.0 in 2018 furthered
Apache Hive's development. Among highlights were support for Live Long and
Process (LLAP) functionality that allows prefetching and caching of columnar
data and support for atomicity, consistency, isolation and durability (ACID)
operations including INSERT, UPDATE and DELETE. Work also began on
materialized views and automatic query rewriting capabilities familiar to
traditional data warehouse users.

Hive supporters and alternatives


Committers to the Apache Hive community project have included individuals
from Cloudera, Hortonworks, Facebook, Intel, LinkedIn, Databricks and others.
Hive is supported in Hadoop distributions. As with the Hbase NoSQL database,
it is very commonly implemented as part of Hadoop distributed data processing
applications. Hive is available by download from the Apache Foundation, as
well as from Hadoop distribution providers Cloudera, MapR and Hortonworks,
and as a part of AWS Elastic MapReduce. The latter implementation supports
analysis of data sets residing in Simple Storage Service object storage.

Apache Hive was among the very first attempts to bring SQL querying
capabilities to the Hadoop ecosystem. Among a host of other SQL-on-
Hadoop alternatives that have arisen are BigSQL, Drill, Hadapt, Impala and
Presto. Also, Apache Pig has emerged as an alternative language to HiveQL for
Hadoop-oriented data warehousing.

Architecture and Working of Hive


The major components of Hive and their interaction with Hadoop are described below:

User Interface (UI)

As the name suggests, the user interface provides an interface between the user and Hive. It enables the user to submit queries and other operations to the system. The Hive web UI, the Hive command line, and Hive HDInsight (on Windows Server) are supported by the user interface.
Driver
The user's queries are received by the driver within Hive once they pass through the interface. The driver implements the concept of session handles and provides execute and fetch APIs modelled on JDBC/ODBC interfaces.
Compiler
The compiler parses the queries and performs semantic analysis on the different query blocks and query expressions. It eventually generates an execution plan with the help of the table and partition metadata obtained from the metastore.
Metastore
All the structural information about the different tables and partitions in the warehouse, including column and column-type details, is stored in the metastore, together with the serializers and deserializers necessary to read and write data and the corresponding HDFS files where the data is stored. Hive uses a database server to store this schema or metadata of databases, tables, table attributes, data types, and HDFS mappings.

Execution Engine
The execution engine carries out the execution plan created by the compiler. The plan is a DAG of stages; the execution engine manages the dependencies between the various stages of the plan and executes these stages on the suitable system components.
Diagram – Architecture of Hive built on top of Hadoop. Along with the architecture, the job execution flow in Hive with Hadoop is demonstrated step by step below.

Step-1: Execute Query

The interface of Hive, such as the command line or the web UI, delivers the query to the driver to execute. Here, the UI calls the execute interface of the driver over ODBC or JDBC.
Step-2: Get Plan
The driver designs a session handle for the query and transfers the query to the compiler to make an execution plan. In other words, the driver interacts with the compiler.
Step-3: Get Metadata
The compiler sends a metadata request to the metastore and obtains the necessary metadata from it.
Step-4: Send Metadata
The metastore transfers the metadata as an acknowledgement to the compiler.
Step-5: Send Plan
The compiler communicates the execution plan it has prepared back to the driver for executing the query.
Step-6: Execute Plan
The execution plan is sent to the execution engine by the driver, which then executes the job on Hadoop, including any DFS (metadata) operations.
Step-7: Fetch Results
When the job is done, the execution engine retrieves the results from the data nodes.
Step-8: Send Results
The result is returned from the execution engine to the driver and then to the user interface (UI).

Difference between Apache Hive and Apache Spark SQL


1. Apache Hive :
Apache Hive is a data warehouse system built on top of Apache Hadoop that enables convenient data summarization, ad-hoc queries, and the analysis of massive datasets stored in a number of databases and file systems that integrate with Hadoop, including the MapR Data Platform with MapR XD and MapR Database. Hive gives an easy way to apply structure to large quantities of unstructured data and then run batch SQL-like queries on that data.
2. Apache Spark SQL :
Spark SQL brings native support for SQL to Spark and streamlines the process of querying data stored both in RDDs (Spark's distributed datasets) and in external sources. Spark SQL effortlessly blurs the lines between RDDs and relational tables. Unifying these powerful abstractions makes it convenient for developers to intermix SQL commands querying external data with complex analytics, all within a single application.

Difference Between Apache Hive and Apache Spark SQL :

1. Apache Hive is an open source data warehouse system constructed on top of Apache Hadoop, whereas Apache Spark SQL is a structured data processing system that processes information using SQL.
2. Hive stores large data sets in Hadoop files for analyzing and querying purposes, whereas Spark SQL computes heavy functions with appropriate optimization techniques for processing a task.
3. Hive was released in the year 2012, whereas Spark SQL first came into the picture in 2014.
4. Hive mainly uses Java for its implementation, whereas Spark SQL can be used from various languages such as R, Python and Scala.
5. Hive's latest version (2.3.2) was released in 2017, whereas Spark SQL's latest version (2.3.0) was released in 2018.
6. Hive mainly uses an RDBMS as its database model, whereas Spark SQL can be integrated with any NoSQL database.
7. Hive can run on any OS, provided a JVM environment is available, whereas Spark SQL supports various OSes such as Linux, Windows, etc.
8. Access methods for Hive include JDBC, ODBC and Thrift, whereas Spark SQL can be accessed only by ODBC and JDBC.

Apache Hive is a data warehouse and an ETL tool which provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS). It is built on top of Hadoop. It is a software project that provides data query and analysis, and it facilitates reading, writing and handling wide datasets stored in distributed storage and queried using Structured Query Language (SQL) syntax. It is not built for Online Transactional Processing (OLTP) workloads. It is frequently used for data warehousing tasks like data encapsulation, ad-hoc queries, and analysis of huge datasets. It is designed to enhance scalability, extensibility, performance, fault tolerance and loose coupling with its input formats.
Hive was initially developed by Facebook and is used by companies such as Amazon and Netflix; it delivers standard SQL functionality for analytics. Without Hive, traditional SQL-style queries have to be written in the MapReduce Java API to run over distributed data. Hive provides portability, as most data warehousing applications work with SQL-based query languages.
Components of Hive:

1. HCatalog
It is a Hive component and is a table and storage management layer for Hadoop. It enables users of various data processing tools, such as Pig and MapReduce, to easily read and write data on the grid.

2. WebHCat
It provides a service which can be utilized by the user to run Hadoop MapReduce (or YARN), Pig, and Hive jobs, or to perform Hive metadata operations, via an HTTP interface.

Modes of Hive:
Hive functions in two major modes, described below. The choice of mode depends on the number of data nodes in Hadoop.

1. Local Mode
It is used when Hadoop is installed in pseudo mode with only one data node, when the data is small enough to be restricted to a single local machine, and when processing of such smaller datasets on the local machine will be faster.
2. Map Reduce Mode
It is used when Hadoop is built with multiple data nodes and the data is divided across the various nodes; it works on huge datasets, queries are executed in parallel, and it achieves enhanced performance in processing large datasets.

Characteristics of Hive:

1. Databases and tables are built before loading the data.
2. Hive as a data warehouse is built to manage and query only structured data residing in tables.
3. When handling structured data, MapReduce lacks optimization and usability functions such as UDFs, whereas the Hive framework provides both optimization and usability.
4. Programming in Hadoop deals directly with files, so Hive can partition the data with directory structures to improve performance on certain queries (a small partitioning sketch follows this list).
5. Hive is compatible with various file formats such as TEXTFILE, SEQUENCEFILE, ORC, RCFILE, etc.
6. Hive uses the Derby database for single-user metadata storage and MySQL for multi-user or shared metadata.
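To illustrate the directory-style partitioning mentioned in point 4, the sketch below creates a partitioned table through a Hive-enabled SparkSession; the table name, columns and dates are hypothetical.

from pyspark.sql import SparkSession

# Assumes Hive support is enabled and a metastore is available.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Partitioning by date stores each day's rows in its own HDFS directory,
# so queries that filter on sale_date only scan the matching directories.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_part (item STRING, amount DOUBLE)
    PARTITIONED BY (sale_date STRING)
""")
spark.sql("""
    INSERT INTO sales_part PARTITION (sale_date='2024-01-01')
    VALUES ('book', 12.5)
""")
spark.sql("SELECT SUM(amount) FROM sales_part WHERE sale_date = '2024-01-01'").show()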

Features of Hive:

1. It provides indexes, including bitmap indexes, to accelerate queries (index types including compaction and bitmap indexes are available as of version 0.10).
2. Metadata storage in an RDBMS reduces the time needed to perform semantic checks during query execution.
3. It has built-in user-defined functions (UDFs) to manipulate strings, dates, and other data-mining values, and Hive can be extended with new UDFs to deal with use cases not covered by the predefined functions.
4. Algorithms such as DEFLATE, BWT and Snappy can operate on compressed data stored in the Hadoop ecosystem.
5. It stores schemas in a database and processes the data in the Hadoop Distributed File System (HDFS).
6. It is built for Online Analytical Processing (OLAP).
7. It delivers a querying language frequently known as the Hive Query Language (HQL or HiveQL).
Analytical modeling is both science and art

Advanced analytics won't produce an ounce of business insight without models,


the statistical and machine learning algorithms that tease patterns and
relationships from data and express them as mathematical equations. Because the algorithms tend to be immensely complex, mathematicians and statisticians (think data scientists) are needed to create them and then tweak the models to better fit changing business needs and conditions.
But analytical modeling is not a wholly quantitative, left-brain endeavor. It's a
science, certainly, but it's an art, too.

The art of modeling involves selecting the right data sets, algorithms and
variables and the right techniques to format data for a particular business
problem. But there's more to it than model-building mechanics. No model will
do any good if the business doesn't understand its results. Communicating the results to executives so they understand what the model discovered and how it can benefit the business is critical but challenging; it's the "last mile" in the whole analytical modeling process and often the most treacherous. Without that understanding, business managers might be loath to use the analytical findings to make critical business decisions.

An analytical model estimates or classifies data values by essentially drawing a


line through data points. When applied to new data or records, a model can
predict outcomes based on historical patterns. But not all models are transparent,
and some are downright opaque. That's a problem for execs, who often don't trust models until they see a positive result from decisions based on modeling-generated insights -- for example, operating costs going down or revenues going up. For analytics to work, modelers need to build models that reflect business managers' perceptions of business realities, and they need to make those connections clear.

They should also be realistic about the likely fruits of their scientific and artistic
labors. Though some models make fresh observations about business data, most
don't; they extract relationships or patterns that people already know about but
might overlook or ignore otherwise.

For example, a crime model predicts that the number of criminal incidents will
increase in a particular neighborhood on a particular summer night. A grizzled
police sergeant might cavalierly dismiss the model's output, saying he was aware
that would happen because an auto race takes place that day at the local
speedway, which always spawns spillover crime in the adjacent neighborhood.
"Tell me something I don't already know," he grumbles. But that doesn't mean
the modeling work was for naught: In this case, the model reinforces the
policeman’s implicit knowledge, bringing it to the forefront of his consciousness,
so he can act on it.

Occasionally, models do uncover breakthrough insights. An example comes


from the credit card industry, which uses analytical models to detect and prevent
fraud. Several years ago, analysts using a combination of algorithms and data
sets uncovered a new racket: Perpetrators were using automatic number
generators to guess credit card numbers on e-commerce websites. The models
found that nearly identical credit card numbers were spitting out a huge number
of transactions and card declines, thus uncovering a new pattern. The companies
quickly figured out what was going on and implemented safeguards, avoiding
millions of dollars in fraudulent transactions.

Many people think that to excel at analytics their companies need only hire a
bunch of statisticians who understand the nuances of sophisticated algorithms
and give them high-powered tools to crunch data. But that only gets you so far.
The art of analytical modeling is a skill that requires intimate knowledge of an
organization's processes and data as well as the ability to communicate with
business executives in business terms. Like fine furniture makers, analytics
professionals who master these skills can build high-quality models with lasting
value and reap huge rewards for their organizations in the process.

Algorithm
An algorithm (pronounced AL-go-rith-um) is a procedure or formula for solving
a problem, based on conducting a sequence of specified actions. A
computer program can be viewed as an elaborate algorithm. In mathematics and
computer science, an algorithm usually means a small procedure that solves a
recurrent problem.

Algorithms are widely used throughout all areas of IT (information technology).


A search engine algorithm, for example, takes search strings of keywords
and operators as input, searches its associated database for relevant
web pages, and returns results.

An encryption algorithm transforms data according to specified actions to


protect it. A secret key algorithm such as the U.S. Department of Defense's Data
Encryption Standard (DES), for example, uses the same key to encrypt and
decrypt data. As long as the algorithm is sufficiently sophisticated, no one
lacking the key can decrypt the data.
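As a present-day illustration of a secret key (symmetric) algorithm, the sketch below uses the Fernet recipe from the third-party cryptography package rather than the dated DES cipher; the point is simply that the same key encrypts and decrypts.

from cryptography.fernet import Fernet  # third-party 'cryptography' package

# Illustrative only: Fernet is a modern symmetric scheme used here in place of DES.
key = Fernet.generate_key()
cipher = Fernet(key)

token = cipher.encrypt(b"confidential data")
print(cipher.decrypt(token))  # b'confidential data' -- same key, both directions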

The word algorithm derives from the name of the mathematician, Mohammed
ibn-Musa al-Khwarizmi, who was part of the royal court in Baghdad and who
lived from about 780 to 850. Al-Khwarizmi's work is the likely source for the
word algebra as well.

Business Analytics
Business analytics (BA) is the iterative, methodical exploration of an
organization's data, with an emphasis on statistical analysis. Business analytics is
used by companies that are committed to making data-driven decisions. Data-
driven companies treat their data as a corporate asset and actively look for ways
to turn it into a competitive advantage. Successful business analytics depends
on data quality, skilled analysts who understand the technologies and the
business, and an organizational commitment to using data to gain insights that
inform business decisions.

Specific types of business analytics include:

Descriptive analytics, which tracks key performance indicators (KPIs) to


understand the present state of a business;
Predictive analytics, which analyzes trend data to assess the likelihood
of future outcomes; and
Prescriptive analytics, which uses past performance to generate
recommendations about how to handle similar situations in the future.
How business analytics works
Once the business goal of the analysis is determined, an analysis methodology is
selected and data is acquired to support the analysis. Data acquisition often
involves extraction from one or more business systems, cleansing and
integration into a single repository such as a data warehouse or data mart.

Initial analysis is typically performed against a smaller sample set of


data. Analytic tools range from spreadsheets with statistical functions to
complex data mining and predictive modeling applications. As patterns and
relationships in the data are uncovered, new questions are asked and the analytic
process iterates until the business goal is met.
Deployment of predictive models involves scoring data records -- typically in a
database -- and using the scores to optimize real-time decisions within
applications and business processes. BA also supports tactical decision-making
in response to unforeseen events. And, in many cases, the decision-making is
automated to support real-time responses.

Business analytics vs. business intelligence


While the terms business intelligence and business analytics are often used interchangeably, there are some key differences: business intelligence concentrates on descriptive reporting of what has happened, while business analytics puts more emphasis on statistical analysis and on predicting what is likely to happen next.

Business analytics vs. data science

The more advanced areas of business analytics can start to resemble data
science, but there is also a distinction between these two terms. Even when
advanced statistical algorithms are applied to data sets, it doesn't necessarily
mean data science is involved. That's because true data science involves more
custom coding and exploring answers to open-ended questions.

Data scientists generally don't set out to solve a specific question, as most
business analysts do. Rather, they will explore data using advanced statistical
methods and allow the features in the data to guide their analysis. There are a
host of business analytics tools that can perform these kinds of functions
automatically, requiring few of the special skills involved in data science.

Business analytics applications

Business analytics tools come in several different varieties:

Data visualization tools
Business intelligence reporting software
Self-service analytics platforms
Statistical analysis tools
Big data platforms

Self-service has become a major trend among business analytics tools. Users
now demand software that is easy to use and doesn't require specialized training.
This has led to the rise of simple-to-use tools from companies such
as Tableau and Qlik, among others. These tools can be installed on a single
computer for small applications or in server environments for enterprise-wide
deployments. Once they are up and running, business analysts and others with
less specialized training can use them to generate reports, charts and web portals
that track specific metrics in data sets.

Edge Analytics
Edge analytics is an approach to data collection and analysis in which an
automated analytical computation is performed on data at a sensor, network
switch or other device instead of waiting for the data to be sent back to a
centralized data store.

Edge analytics has gained attention as the internet of things (IoT) model of
connected devices has become more prevalent. In many organizations, streaming
data from manufacturing machines, industrial equipment, pipelines and other
remote devices connected to the IoT creates a massive glut of operational data,
which can be difficult -- and expensive -- to manage. By running the data
through an analytics algorithm as it's created, at the edge of a corporate network,
companies can set parameters on what information is worth sending to a cloud or
on-premises data store for later use -- and what isn't.

Analyzing data as it's generated can also decrease latency in the decision-making
process on connected devices. For example, if sensor data from a manufacturing
system points to the likely failure of a specific part, business rules built into the
analytics algorithm interpreting the data at the network edge can automatically
shut down the machine and send an alert to plant managers so the part can be
replaced. That can save time compared to transmitting the data to a central
location for processing and analysis, potentially enabling organizations to reduce
or avoid unplanned equipment downtime.
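
The toy Python sketch below illustrates that pattern; the sensor readings, threshold and helper functions are hypothetical, but it shows how a rule evaluated at the edge can act locally and forward only the events that matter:

# Hypothetical edge-side rule: analyze vibration readings locally and only
# forward alerts (not raw data) to the central store.
VIBRATION_LIMIT = 7.5  # illustrative threshold indicating likely part failure

def on_sensor_reading(machine_id, vibration_mm_s):
    """Called for each new reading on the edge device itself."""
    if vibration_mm_s > VIBRATION_LIMIT:
        shut_down_machine(machine_id)                  # act locally, no round trip
        send_alert_to_plant_managers(machine_id, vibration_mm_s)
        return {"machine": machine_id, "event": "shutdown", "value": vibration_mm_s}
    return None  # normal readings are summarized or discarded at the edge

def shut_down_machine(machine_id):
    print(f"[edge] stopping machine {machine_id}")

def send_alert_to_plant_managers(machine_id, value):
    print(f"[edge] alert: machine {machine_id} vibration {value} mm/s")

print(on_sensor_reading("press-14", 8.2))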

Another primary benefit of edge analytics is scalability. Pushing analytics algorithms to sensors and network devices alleviates the processing strain on enterprise data management and analytics systems, even as the number of connected devices being deployed by organizations -- and the amount of data being generated and collected -- increases.

How is edge analytics used?


One of the most common use cases for edge analytics is monitoring edge
devices. This is particularly true for IoT devices. A data analytics platform might
be deployed for the purpose of monitoring a large collection of devices for the
purpose of making sure that the devices are functioning normally. If a problem
does occur, an edge analytics platform might be able to take corrective action
automatically. If automatic remediation isn't possible, then the platform might
instead provide the IT staff with actionable insights that will help them to fix the
problem.

Benefits of edge analytics


Edge analytics delivers several compelling benefits:

Near real-time analysis of data. Because analysis is performed near the data -- often on board the device itself -- the data can be analyzed in near real time. This would simply not be the case if the device had to transmit the data to a back-end server in the cloud or in a remote data center for processing.
Scalability. Edge analytics is by its very nature scalable. Because each device analyzes its own data, the computational workload is distributed across devices.
Possible reduction of costs. Significant costs are associated with traditional big data analytics. Regardless of whether the data is processed in a public cloud or in an organization's own data center, there are costs tied to data storage, data processing and bandwidth consumption. Some of the edge analytics platforms for IoT devices use the IoT device's hardware to perform the data analytics, thereby eliminating the need for back-end processing.
Improved security. If data is analyzed on board the device that created it, then it's not necessary to transmit the full data set across the wire. This can help improve security because the raw data never leaves the device that created it.
Limitations of edge analytics
Like any other technology, edge analytics has its limits. Those limitations
include:

Not all hardware supports it. Simply put, not every IoT device has the
memory, CPU and storage hardware required to perform deep analytics
onboard the device.
You might have to develop your own edge analytics platform. Edge
analytics is still a relatively new technology. Although off-the-shelf
analytical platforms do exist, it's entirely possible that an organization
might have to develop its own edge analytics platform based on the
devices that it wants to analyze.

Applications of edge analytics


Edge analytics tends to be most useful in industrial environments that use many IoT sensors. In such environments, edge analytics can deliver benefits such as:
Improved uptime. If an edge analytics platform can monitor a sensor array, it might be able to take corrective action when problems occur. Even if the resolution isn't automated, simply alerting an operator to a problem can help improve overall uptime.
Lower maintenance costs. By performing in-depth analysis of IoT devices, it might be possible to gain deep insight into device health and longevity. Depending on the environment, this might help the organization to reduce its maintenance costs by performing maintenance when it's necessary rather than blindly following a maintenance schedule.
Predict failures. An in-depth analysis of IoT hardware might make it possible to accurately predict hardware failures in advance. This can enable organizations to take proactive steps to head off a failure.

Edge analytics vs. edge computing


Edge computing is based on the idea that data collection and data processing can
be performed near the location where the data is either being created or
consumed. Edge analytics uses these same devices and the data that they have
already produced. An analytics model performs a deeper analysis of the data
than what was initially performed. These analytics capabilities enable the
creation of actionable insights, often directly on the device.

Cloud analytics vs. edge analytics


Both cloud analytics and edge analytics are techniques for gathering relevant
data and then using that data to perform data analysis. The key difference
between the two is that cloud analytics requires raw data to be transmitted to the
cloud for analysis.
Although cloud analytics has its place, edge analytics has two main advantages.
First, edge analytics incurs far lower latency than cloud analytics because data is
analyzed on site -- often within the device itself, in real time, as the data is
created. The second advantage is that edge analytics doesn't require network
connectivity to the cloud. This means that edge analytics can be used in
bandwidth-constrained environments, or in locations where cloud connectivity
simply isn't available.

Inductive Reasoning
Inductive reasoning is a logical process in which multiple premises, all believed
true or found true most of the time, are combined to obtain a specific conclusion.

Inductive reasoning is often used in applications that involve prediction, forecasting, or behavior. Here is an example:

Every tornado I have ever seen in the United States rotated counterclockwise, and I have seen dozens of them.
We see a tornado in the distance, and we are in the United States.
I conclude that the tornado we see right now must be rotating counterclockwise.

A meteorologist will tell you that in the United States (which lies in the northern
hemisphere), most tornadoes rotate counterclockwise, but not all of them do.
Therefore, the conclusion is probably true, but not necessarily true. Inductive
reasoning is, unlike deductive reasoning, not logically rigorous. Imperfection can
exist and inaccurate conclusions can occur, however rare; in deductive reasoning
the conclusions are mathematically certain.

Inductive reasoning is sometimes confused with mathematical induction, an entirely different process. Mathematical induction is a form of deductive reasoning, in which logical certainties are "daisy chained" to derive a general conclusion about an infinite number of objects or situations.

Supply Chain Analytics


Supply chain analytics refers to the processes organizations use to gain insight
and extract value from the large amounts of data associated with the
procurement, processing and distribution of goods. Supply chain analytics is an
essential element of supply chain management (SCM).

The discipline of supply chain analytics has existed for over 100 years, but the
mathematical models, data infrastructure, and applications underpinning these
analytics have evolved significantly. Mathematical models have improved with
better statistical techniques, predictive modeling and machine learning. Data
infrastructure has changed with cloud infrastructure, complex event processing
(CEP) and the internet of things. Applications have grown to provide insight
across traditional application silos such as ERP, warehouse management,
logistics and enterprise asset management.

An important goal of choosing supply chain analytics software is to improve forecasting and efficiency and be more responsive to customer needs. For example, predictive analytics on point-of-sale terminal data stored in a demand signal repository can help a business anticipate consumer demand, which in turn can lead to cost-saving adjustments to inventory and faster delivery.
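
As a minimal sketch of that idea (the sales figures are invented, and a simple moving average stands in for the more sophisticated predictive models commercial platforms use), demand can be forecast from recent point-of-sale history and translated into a reorder quantity:

weekly_unit_sales = [120, 135, 128, 150, 160, 155, 170, 165]

def moving_average_forecast(history, window=4):
    """Forecast next period's demand as the mean of the last `window` periods."""
    recent = history[-window:]
    return sum(recent) / len(recent)

forecast = moving_average_forecast(weekly_unit_sales)
safety_stock = 0.1 * forecast              # illustrative 10% buffer
reorder_quantity = round(forecast + safety_stock)
print(f"forecast demand: {forecast:.1f} units, reorder: {reorder_quantity} units")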

Achieving end-to-end supply chain analytics requires bringing together information that starts with the procurement of raw materials and extends through production, distribution and aftermarket services. This depends on effective integration between the many SCM and supply chain execution platforms that make up a typical company's supply chain. The goal of such integration is supply chain visibility: the ability to view data on goods at every step in the supply chain.

Supply chain analytics software


Supply chain analytics software is generally available in two forms: embedded
in supply chain software, or in a separate, dedicated business intelligence and
analytics tool that has access to supply chain data. Most ERP vendors offer
supply chain analytics features, as do vendors of specialized SCM software.
Some IT consultancies develop software models that can be customized and
integrated into a company's business processes.

Some ERP and SCM vendors have begun applying CEP to their platforms for
real-time supply chain analytics. Most ERP and SCM vendors have one-to-one
integrations, but there is no standard. However, the Supply Chain Operations
Reference (SCOR) model provides standard metrics for comparing supply chain
performance to industry benchmarks.

Ideally, supply chain analytics software would be applied to the entire chain, but
in practice it is often focused on key operational subcomponents, such as
demand planning, manufacturing production, inventory management or
transportation management. For example, supply chain finance analytics can
help identify increased capital costs or opportunities to boost working
capital; procure-to-pay analytics can help identify the best suppliers and provide
early warning of budget overruns in certain expense categories; and
transportation analytics software can predict the impact of weather on shipments.

How supply chain analytics works


Supply chain analytics brings together data from across different applications,
infrastructure, third-party sources and emerging technologies such as IoT to
improve decision-making across the strategic, tactical and operational processes
that make up supply chain management. Supply chain analytics helps
synchronize supply chain planning and execution by improving real-time
visibility into these processes and their impact on customers and the bottom line.
Increased visibility can also increase flexibility in the supply chain network by
helping decision-makers to better evaluate tradeoffs between cost and customer
service.

The process of creating supply chain analytics typically starts with data scientists
who understand a particular aspect of the business, such as the factors that relate
to cash flow, inventory, waste and service levels. These experts look for potential
correlations between different data elements to build a predictive model that
optimizes the output of the supply chain. They test out variations until they have
a robust model.

Supply chain analytics models that reach a certain threshold of success are
deployed into production by data engineers with an eye toward scalability and
performance. Data scientists, data engineers and business users work together to
refine the way these data analytics are presented and operationalized in practice.
Supply chain models are improved over time by correlating the performance of
data analysis models in production with the business value they deliver.

Features of supply chain analytics


Supply chain analytics software usually includes most of the following features:

Data visualization. The ability to slice and dice data from different
angles to improve insight and understanding.
Stream processing. Deriving insight from multiple data streams
generated by, for example, the IoT, applications, weather reports and
third-party data.
Social media integration. Using sentiment data from social feeds to
improve demand planning.
Natural language processing. Extracting and organizing unstructured
data buried in documents, news sources and data feeds.
Location intelligence. Deriving insight from location data to
understand and optimize distribution.
Digital twin of the supply chain. Organizing data into a
comprehensive model of the supply chain that is shared across different
types of users to improve predictive and prescriptive analytics.
Graph databases. Organizing information into linked elements that
make it easier to find connections, identify patterns and improve
traceability of products, suppliers and facilities.
Types of supply chain analytics
A common lens used to delineate the main types of supply chain analytics is
based on Gartner's model of the four capabilities of analytics: descriptive,
diagnostic, predictive and prescriptive.

Descriptive supply chain analytics uses dashboards and reports to help interpret what has happened. It often involves using a variety of statistical methods to search through, summarize and organize information about operations in the supply chain. This can be useful in answering questions like, "How have inventory levels changed over the last month?" or "What is the return on invested capital?"
Diagnostic supply chain analytics are used to figure out why something
happened or is not working as well as it should. For example, "Why are
shipments being delayed or lost?" or "Why is our company not
achieving the same number of inventory turns as a competitor?"
Predictive supply chain analytics helps to foresee what is likely to
happen in the future based on current data. For example, "How will new
trade regulations or a pandemic lockdown affect the availability and cost
of raw materials or goods?"
Prescriptive supply chain analytics helps prescribe or automate the best
course of action using optimization or embedded decision logic. This
can help improve decisions about when to launch a product, whether or
not to build a factory or the best shipment strategy for each retail
location.

Another way of breaking down types of supply chain analytics is by their form
and function. Advisory firm Supply Chain Insights, for example, breaks down
the types of supply chain analytics into the following functions:
Workflow
Decision support
Collaboration
Unstructured text mining
Structured data management

In this model, the different types of analytics feed into each other as part of an
end-to-end ongoing process for improving supply chain management.

For example, a company could use unstructured text mining to turn raw data
from contracts, social media feeds and news reports into structured data that is
relevant to the supply chain. This improved, more structured data could then
help automate and improve workflows, such as procure-to-pay processes. The
data in digitized workflows is much easier to capture than data from manual
workflows, thus increasing the data available for decision support systems.
Better decision support could in turn enhance collaboration across different
departments like procurement and warehouse management or between supply
chain partners.

Other technologies are emerging as ways to improve the predictive models generated by supply chain analytics. For example, organizations are starting to
use process mining to analyze how they execute business processes. This type of
process analytics can be used to create a digital twin of the organization that can
help identify supply chain opportunities for automation across procurement,
production, logistics and finance. Augmented analytics can help business users
ask questions about the business in plain language, with responses delivered in
brief summaries. Graph analytics can shed light on the relationships between
entities in the supply chain, such as how changes in a tier 3 supplier might affect
tier 1 suppliers.

Supply chain analytics uses


Sales and operations planning uses supply chain analytics to match a
manufacturer's supply with demand by generating plans that align daily
operations with corporate strategy. Supply chain analytics is also used to do the
following:

improve risk management by identifying known risks and predicting future risks based on patterns and trends throughout the supply chain;
increase planning accuracy by analyzing customer data to identify factors that increase or decrease demand;
improve order management by consolidating data sources to assess inventory levels, predict demand and identify fulfillment issues;
streamline procurement by organizing and analyzing spending across departments to improve contract negotiations and identify opportunities for discounts or alternative sources; and
increase working capital by improving models for determining the inventory levels required to ensure service goals with minimal capital investment.

History of supply chain analytics


Supply chain analytics has its roots in the work of Frederick Taylor, whose 1911
publication, The Principles of Scientific Management, laid the groundwork for
the modern fields of industrial engineering and supply chain management.
Henry Ford adopted Taylor's techniques in the creation of the modern assembly
line and a supply chain that supported more efficient means of production.

The advent of mainframe computers gave rise to the data processing work done
by IBM researcher Hans Peter Luhn, who some credit for coining the
term business intelligence in his 1958 paper, "A Business Intelligence System."
His work helped build the foundation for the different types of data analytics
used in supply chain analytics.

In 1963, Bud Lalonde, a professor at Ohio State University, proposed that physical distribution management should be combined with materials management, procurement and manufacturing into what he called business
logistics. Around this time, management consultant Stafford Beer and others
began exploring new ideas like the viable systems model for organizing business
information into a structured hierarchy to improve business planning and
execution. By the early 1980s, the burgeoning field was known as supply chain
management.

As the internet became a force in the 1990s, people looked at how it could be
applied in supply chain management. A pioneer in this area was the British
technologist Kevin Ashton. As a young product manager tasked with solving the
problem of keeping a popular lipstick on store shelves, Ashton hit upon radio
frequency identification sensors as a way to automatically capture data about the
movement of products across the supply chain. Ashton, who would go on to co-
found the Massachusetts Institute of Technology's Auto-ID Center that perfected
RFID technology and sensors, coined the term internet of things to explain this
revolutionary new feature of supply chain management.

The 1990s also saw the development of CEP by researchers such as the team
headed by Stanford University's David Luckham and others. CEP's ability to
capture incoming data from real-time events helped supply chain managers
correlate low-level data related to factory operations, the physical movements of
products, and weather into events that could then be analyzed by supply chain
analytics tools. For example, data about production processes could be
abstracted to factory performance, which in turn could be abstracted into
business events related to things like inventory levels.

Another turning point in the field of supply chain analytics was the advent of
cloud computing, a new vehicle for delivering IT infrastructure, software and
platforms as service. By providing a foundation for orchestrating data across
multiple sources, the cloud has driven improvements in many types of analytics,
including supply chain analytics. The emergence of data lakes
like Hadoop allowed enterprises to capture data from different sources on a
common platform, further refining supply chain analytics by enabling companies
to correlate more types of data. Data lakes also made it easier to implement
advanced analytics that operated on a variety of structured and unstructured data
from different applications, event streams and the IoT.

In recent years, robotic process automation -- software that automates rote computer tasks previously performed by humans -- has become a powerful tool in improving business automation and the ability to integrate data into analytics.

In addition, the artificial intelligence technique known as deep learning is increasingly being used to improve supply chain analytics. Deep learning techniques are driving advances in machine vision (used to improve inventory tracking), natural language understanding (used to automate contract management), and improvements in routing models.

Future trends of supply chain analytics


Supply chain analytics will continue to evolve in tandem with the evolution of
analytics models, data structures and infrastructure, and the ability to integrate
data across application silos. In the long run, advanced analytics will lead to
more autonomous supply chains that can manage and respond to changes, much
like self-driving cars are starting to do today. In addition, improvements in IoT,
CEP and streaming architectures will enable enterprises to derive insight more
quickly from a larger variety of data sources. AI techniques will continue to
improve people's ability to generate more accurate and useful predictive insights
that can be embedded into workflows.

Other technologies expected to play a big role in supply chain analytics and
management include the following:

Blockchain. Blockchain infrastructure and technologies promise to improve visibility and traceability across more layers of the supply chain. These same building blocks could drive companies to use smart contracts to automate, control and execute transactions.

Graph analytics. Predicted to power more than half of all enterprise applications within a decade, graph analytics will help supply chain managers better analyze the links of various entities in the supply chain.

Hyperautomation. The technologies underpinning hyperautomation will accelerate supply chain automation by using process mining analytics to identify automation candidates, generate the automations and manage these automated processes.

Statistical Analysis
Statistical analysis is the collection and interpretation of data in order to uncover
patterns and trends. It is a component of data analytics. Statistical analysis can
be used in situations like gathering research interpretations, statistical modeling
or designing surveys and studies. It can also be useful for business intelligence
organizations that have to work with large data volumes.

In the context of business intelligence (BI), statistical analysis involves collecting and scrutinizing every data sample in a set of items from which samples can be drawn. A sample, in statistics, is a representative selection drawn from a total population.

The goal of statistical analysis is to identify trends. A retail business, for example, might use statistical analysis to find patterns in unstructured and semi-structured customer data that can be used to create a more positive customer experience and increase sales.
Steps of statistical analysis
Statistical analysis can be broken down into five discrete steps, as follows:

Describe the nature of the data to be analyzed.
Explore the relation of the data to the underlying population.
Create a model to summarize an understanding of how the data relates to the underlying population.
Prove (or disprove) the validity of the model.
Employ predictive analytics to run scenarios that will help guide future actions.
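
The compact Python sketch below walks through these five steps on an invented sample using only the standard library (Python 3.10 or later); a real analysis would use richer data and tooling:

import statistics

ad_spend = [10, 12, 15, 18, 20, 22, 25]        # sample drawn from the population
weekly_sales = [105, 118, 131, 150, 161, 172, 190]

# 1-2. Describe the data and explore its relation to the underlying population.
print("mean sales:", statistics.mean(weekly_sales))
print("correlation:", round(statistics.correlation(ad_spend, weekly_sales), 3))

# 3. Create a model summarizing the relationship (simple linear regression).
slope, intercept = statistics.linear_regression(ad_spend, weekly_sales)

# 4. Check the model's validity against the observed data.
residuals = [y - (slope * x + intercept) for x, y in zip(ad_spend, weekly_sales)]
print("largest residual:", round(max(abs(r) for r in residuals), 1))

# 5. Use the model predictively to guide future actions.
print("predicted sales at spend 30:", round(slope * 30 + intercept, 1))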
Statistical analysis software
Software for statistical analysis will typically allow users to do more complex analyses by including additional tools for the organization and interpretation of data sets, as well as for the presentation of that data. IBM SPSS Statistics, RMP and Stata are some examples of statistical analysis software. IBM SPSS Statistics, for example, covers much of the analytical process, from data preparation and data management to analysis and reporting. The software includes a customizable interface, and even though it may be hard for a newcomer to use, it is relatively easy for those experienced in how it works.

Analytic Database
An analytic database is a read-only system that stores historical data on business
metrics such as sales performance and inventory levels. Business analysts,
corporate executives and other workers can run queries and reports against an
analytic database. The information is updated on a regular basis to incorporate
recent transaction data from an organization’s operational systems.

An analytic database is specifically designed to support business intelligence


(BI) and analytic applications, typically as part of a data warehouse or data mart.
This differentiates it from an operational, transactional or OLTP database, which
is used for transaction processing i.e., order entry and other “run the business”
applications. Databases that do transaction processing can also be used to
support data warehouses and BI applications, but analytic database vendors
claim that their products offer performance and scalability advantages over
conventional relational database software.
There currently are five main types of analytic databases on the market:
Columnar databases, which organize data by columns instead of rows, thus reducing the number of data elements that typically have to be read by the database engine while processing queries.
Data warehouse appliances, which combine the database with hardware and BI tools in an integrated platform that’s tuned for analytical workloads and designed to be easy to install and operate.
In-memory databases, which load the source data into system memory in a compressed, non-relational format in an attempt to streamline the work involved in processing queries.
Massively parallel processing (MPP) databases, which spread data across a cluster of servers, enabling the systems to share the query processing workload.
Online analytical processing (OLAP) databases, which store multidimensional “cubes” of aggregated data for analyzing information based on multiple data attributes.
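
The following sketch shows the kind of read-only, aggregate query an analytic database is built to serve; SQLite stands in for a real analytic database here, and the table and figures are invented:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "2023-01", 120.0), ("East", "2023-02", 135.5),
     ("West", "2023-01", 98.0), ("West", "2023-02", 110.25)],
)

# Business analysts typically run aggregations like this against historical data.
for row in conn.execute(
    "SELECT region, SUM(revenue) AS total FROM sales GROUP BY region ORDER BY total DESC"
):
    print(row)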

Real-Time Analytics
Real-time analytics is the use of data and related resources for analysis as soon
as it enters the system. The adjective real-time refers to a level of computer
responsiveness that a user senses as immediate or nearly immediate. The term is
often associated with streaming data architectures and real-time operational
decisions that can be made automatically through robotic process
automation and policy enforcement.

Whereas historical data analysis uses a set of historical data for batch analysis, real-time analytics instead visualizes and analyzes the data as it appears in the computer system. This enables data scientists to use real-time analytics for purposes such as:

Forming operational decisions and applying them to production activities including business processes and transactions on an ongoing basis.
Viewing dashboard displays in real time with constantly updated transactional data sets.
Utilizing existing prescriptive and predictive analytics.
Reporting historical and current data simultaneously.
Real-time analytics software has three basic components:

an aggregator that gathers data event streams (and perhaps batch files)
from a variety of data sources;
a broker that makes data available for consumption; and
an analytics engine that analyzes the data, correlates values and blends
streams together.

The system that receives and sends data streams and executes the application and
real-time analytics logic is called the stream processor.
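
A toy, in-process Python sketch of these three components is shown below; in practice the aggregator, broker and analytics engine would be separate systems connected by a message broker and a stream-processing engine:

import random
from collections import deque

def aggregator(n_events=20):
    """Gathers an event stream from a (simulated) data source."""
    for i in range(n_events):
        yield {"event_id": i, "value": random.gauss(100, 10)}

class Broker:
    """Makes events available for consumption."""
    def __init__(self):
        self.queue = deque()
    def publish(self, event):
        self.queue.append(event)
    def consume(self):
        while self.queue:
            yield self.queue.popleft()

def analytics_engine(events, window=5, alert_above=115):
    """Correlates values over a sliding window and flags anomalies."""
    recent = deque(maxlen=window)
    for event in events:
        recent.append(event["value"])
        rolling_avg = sum(recent) / len(recent)
        if event["value"] > alert_above:
            print(f"alert: event {event['event_id']} value {event['value']:.1f}")
        yield rolling_avg

broker = Broker()
for event in aggregator():
    broker.publish(event)
for avg in analytics_engine(broker.consume()):
    pass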

How real-time analytics works


Real-time analytics often takes place at the edge of the network to ensure that
data analysis is done as close to the data's origin as possible. In addition to edge
computing, other technologies that support real-time analytics include:

Processing in memory -- a chip architecture in which the processor is integrated into a memory chip to reduce latency.
In-database analytics -- a technology that allows data processing to be
conducted within the database by building analytic logic into the
database itself.
Data warehouse appliances -- a combination of hardware and software
products designed specifically for analytical processing. An appliance
allows the purchaser to deploy a high-performance data warehouse right
out of the box.
In-memory analytics -- an approach to querying data when it resides
in random access memory, as opposed to querying data that is stored on
physical disks.
Massively parallel programming -- the coordinated processing of a
program by multiple processors that work on different parts of the
program, with each processor using its own operating system and
memory.

In order for the real-time data to be useful, the real-time analytics applications
being used should have high availability and low response times. These
applications should also feasibly manage large amounts of data, up to terabytes.
This should all be done while returning answers to queries within seconds.
The term real-time also includes managing changing data sources -- something
that may arise as market and business factors change within a company. As a
result, the real-time analytics applications should be able to handle big data. The
adoption of real-time big data analytics can maximize business returns, reduce
operational costs and introduce an era where machines can interact over
the internet of things using real-time information to make decisions on their
own.

Different technologies exist that have been designed to meet these demands,
including the growing quantities and diversity of data. Some of these new
technologies are based on specialized appliances -- such as hardware and
software systems. Other technologies utilize a special processor and memory
chip combination, or a database with analytics capabilities embedded in its
design.

Benefits of real-time analytics


Real-time analytics enables businesses to react without delay, quickly detect and
respond to patterns in user behavior, take advantage of opportunities that could
otherwise be missed and prevent problems before they arise.

Businesses that utilize real-time analytics greatly reduce risk throughout their
company since the system uses data to predict outcomes and suggest alternatives
rather than relying on the collection of speculations based on past events or
recent scans -- as is the case with historical data analytics. Real-time analytics
provides insights into what is going on in the moment.

Other benefits of real-time analytics include:

Data visualization. Real-time data can be visualized and reflects occurrences throughout the company as they occur, whereas historical data can only be placed into a chart in order to communicate an overall idea.
Improved competitiveness. Businesses that use real-time analytics can
identify trends and benchmarks faster than their competitors who are
still using historical data. Real-time analytics also allows businesses to
evaluate their partners' and competitors' performance reports
instantaneously.
Precise information. Real-time analytics focuses on instant analyses
that are consistently useful in the creation of focused outcomes, helping
ensure time is not wasted on the collection of useless data.
Lower costs. While real-time technologies can be expensive, their
multiple and constant benefits make them more profitable when used
long term. Furthermore, the technologies help avoid delays in using
resources or receiving information.
Faster results. The ability to instantly classify raw data allows queries
to more efficiently collect the appropriate data and sort through it
quickly. This, in turn, allows for faster and more efficient trend
prediction and decision making.
Challenges
One major challenge faced in real-time analytics is the vague definition of real
time and the inconsistent requirements that result from the various
interpretations of the term. As a result, businesses must invest a significant
amount of time and effort to collect specific and detailed requirements from all
stakeholders in order to agree on a specific definition of real time, what is
needed for it and what data sources should be used.

Once the company has unanimously decided on what real time means, it faces
the challenge of creating an architecture with the ability to process data at high
speeds. Unfortunately, data sources and applications can cause processing-speed
requirements to vary from milliseconds to minutes, making creation of a capable
architecture difficult. Furthermore, the architecture must also be capable of
handling quick changes in data volume and should be able to scale up as the data
grows.

The implementation of a real-time analytics system can also present a challenge


to a business's internal processes. The technical tasks required to set up real-time
analytics -- such as creation of the architecture -- often cause businesses to
ignore changes that should be made to internal processes. Enterprises should
view real-time analytics as a tool and starting point for improving internal
processes rather than as the ultimate goal of the business.

Finally, companies may find that their employees are resistant to the change
when implementing real-time analytics. Therefore, businesses should focus on
preparing their staff by providing appropriate training and fully communicating
the reasons for the change to real-time analytics.

Use cases for real-time analytics in customer experience management


In customer relations management and customer experience management, real-
time analytics can provide up-to-the-minute information about an enterprise's
customers and present it so that better and quicker business decisions can be
made -- perhaps even within the time span of a customer interaction.

Here are some examples of how enterprises are tapping into real-time analytics:

Fine-tuning features for customer-facing apps. Real-time analytics adds a level of sophistication to software rollouts and supports data-driven decisions for core feature management.
Managing location data. Real-time analytics can be used to determine
what data sets are relevant to a particular geographic location and signal
the appropriate updates.
Detecting anomalies and frauds. Real-time analytics can be used to
identify statistical outliers caused by security breaches, network outages
or machine failures.
Empowering advertising and marketing campaigns. Data gathered
from ad inventory, web visits, demographics and customer behavior can
be analyzed in real time to uncover insights that hopefully will improve
audience targeting, pricing strategies and conversion rates.
Examples
Examples of real-time analytics include:

Real-time credit scoring. Instant updates of individuals' credit scores allow financial institutions to immediately decide whether or not to extend the customer's credit.
Financial trading. Real-time big data analytics is being used to support
decision-making in financial trading. Institutions use financial databases,
satellite weather stations and social media to instantaneously inform
buying and selling decisions.
Targeting promotions. Businesses can use real-time analytics to
deliver promotions and incentives to customers while they are in the
store and surrounded by the merchandise to increase the chances of a
sale.
Healthcare services. Real-time analytics is used in wearable devices --
such as smartwatches -- and has already proven to save lives through the
ability to monitor statistics, such as heart rate, in real time.
Emergency and humanitarian services. By attaching real-time
analytical engines to edge devices -- such as drones -- incident
responders can combine powerful information, including traffic, weather
and geospatial data, to make better informed and more efficient
decisions that can improve their abilities to respond to emergencies and
other events.
Future
The future of pharmaceutical marketing and sales is being greatly impacted by
the use of real-time analytics. It is expected that more pharmaceutical companies
will begin using emerging technologies and implementing real-time analytics
instead of relying on traditional methods to gain deeper insights into customer
behavior and the market landscape. This has the potential to reduce costs
through accurate predictions while also increasing sales and profit by optimizing
marketing.

Higher education is also changing with the use of real-time analytics.


Organizations can start marketing to prospective students who are best fit for
their institution based on factors such as test scores, academic records and
financial standing. Real-time, predictive analytics can help educational
organizations gauge the probability of the student graduating and using their
degree for gainful employment as well as predict a class' debt load and earnings
after graduation.

Unfortunately, the consistently increasing number of machines and technical devices in the world and the expanding amount of information they capture make it harder and harder to gain valuable insights from the data. One solution to this is the open source Elastic Stack, a collection of products that centralizes, stores, analyzes and displays any desired log and machine data in real time. Open source is believed to be the future of computer programs, especially in data-driven fields like business intelligence.
Data Analytics Visualization Tools
Data visualization is the practice of presenting analyzed data in visual form, i.e., graphs and images. These visualizations make it easy for humans to understand analyzed trends at a glance.
Data visualization is very important when it comes to analyzing big datasets. When data scientists analyze complex datasets, they also need to understand the insights collected, and data visualization makes it easier for them to do so through graphs and charts.
Tableau
Tableau is often regarded as the grand master of data visualization software and
for good reason. Tableau has a very large customer base across many industries
due to its simplicity of use and ability to produce interactive visualizations far
beyond those provided by general BI solutions. It is particularly well suited to
handling the huge and very fast-changing datasets which are used in Big Data
operations, including artificial intelligence and machine learning applications,
thanks to integration with a large number of advanced database solutions
including Hadoop, Amazon AWS, MySQL, SAP and Teradata. Extensive
research and testing has gone into enabling Tableau to create graphics and
visualizations as efficiently as possible, and to make them easy for humans to
understand.
Qlikview
Qlik with their Qlikview tool is the other major player in this space and
Tableau’s biggest competitor. The vendor has over 40,000 customer accounts
across over 100 countries, and those that use it frequently cite its highly
customizable setup and wide feature range as a key advantage. This however can
mean that it takes more time to get to grips with and use it to its full potential. In
addition to its data visualization capabilities Qlikview offers powerful business
intelligence, analytics and enterprise reporting capabilities and I particularly like
the clean and clutter-free user interface. Qlikview is commonly used alongside
its sister package, Qliksense, which handles data exploration and discovery.
There is also a strong community and there are plenty of third-party resources
available online to help new users understand how to integrate it in their
projects.
FusionCharts
This is a very widely-used, JavaScript-based charting and visualization package
that has established itself as one of the leaders in the paid-for market. It can
produce 90 different chart types and integrates with a large number of platforms
and frameworks giving a great deal of flexibility. One feature that has helped
make FusionCharts very popular is that rather than having to start each new
visualization from scratch, users can pick from a range of “live” example
templates, simply plugging in their own data sources as needed.
Highcharts
Like FusionCharts this also requires a licence for commercial use, although it
can be used freely as a trial, non-commercial or for personal use. Its website
claims that it is used by 72 of the world’s 100 largest companies and it is often
chosen when a fast and flexible solution must be rolled out, with a minimum
need for specialist data visualization training before it can be put to work. A key
to its success has been its focus on cross-browser support, meaning anyone can
view and run its interactive visualizations, which is not always true with newer
platforms.
Datawrapper
Datawrapper is increasingly becoming a popular choice, particularly among
media organizations which frequently use it to create charts and present
statistics. It has a simple, clear interface that makes it very easy to upload csv
data and create straightforward charts, and also maps, that can quickly be
embedded into reports.
Plotly
Plotly enables more complex and sophisticated visualizations, thanks to its
integration with analytics-oriented programming languages such as Python, R
and Matlab. It is built on top of the open source d3.js visualization libraries for
JavaScript, but this commercial package (with a free non-commercial licence
available) adds layers of user-friendliness and support as well as inbuilt support
for APIs such as Salesforce.
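A minimal example of Plotly's Python API is shown below (it assumes the plotly package is installed, and the figures are invented):

import plotly.express as px

months = ["Jan", "Feb", "Mar", "Apr", "May"]
revenue = [120, 135, 128, 150, 162]

fig = px.line(x=months, y=revenue,
              labels={"x": "Month", "y": "Revenue (k$)"},
              title="Monthly revenue")
fig.show()  # opens an interactive chart in the browser or notebook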
Sisense
Sisense provides a full stack analytics platform but its visualization capabilities
provide a simple-to-use drag and drop interface which allow charts and more
complex graphics, as well as interactive visualizations, to be created with a
minimum of hassle. It enables multiple sources of data to be gathered into one
easily accessed repository where it can be queried through dashboards
instantaneously, even across Big Data-sized sets. Dashboards can then be shared
across organizations ensuring even non technically-minded staff can find the
answers they need to their problems.
Differences between Data Analytics, AI, Machine & Deep Learning
Data Analytics – Data Science, Artificial Intelligence (AI), Machine Learning
(ML), and Deep Learning (DL) are closely interconnected. The Venn-diagram
shown below visualizes overlapping AI-related terminology.

We will explore in detail on each one of the following terms one by one:

Artificial Intelligence
Artificial intelligence, or AI for short, has been around since the mid-1950s. It's not necessarily new. But it became super popular recently because of the advancements in processing capabilities. In those early decades, there just wasn't the necessary computing power to realize AI. Today, we have some of the fastest computers the world has ever seen. And the algorithm implementations have improved so much that we can run them on commodity hardware, even the laptop or smartphone that you're using to read this right now. And given the seemingly endless possibilities of AI, everybody wants a piece of it.

But what exactly is artificial intelligence? Artificial intelligence is the ability that
can be imparted to computers which enables these machines to understand data,
learn from the data, and make decisions based on patterns hidden in the data, or
inferences that could otherwise be very difficult (to almost impossible) for
humans to make manually. AI also enables machines to adjust their “knowledge”
based on new inputs that were not part of the data used for training these
machines.

Another way of defining AI is that it's a collection of mathematical algorithms that make computers understand relationships between different types and pieces of data such that this knowledge of connections could be utilized to come to conclusions or make decisions that could be accurate to a very high degree.

But there's one thing you need to make sure of: that you have enough data for AI to learn from. If you have a very small data lake that you're using to train your AI model, the accuracy of the prediction or decision could be low. So the more data you have, the better the training of the AI model, and the more accurate the outcome will be. Depending on the size of your training data, you can choose various algorithms for your model. This is where machine learning and deep learning start to show up.

In the early days of AI, neural networks were all the rage, with multiple groups of people across the globe working on bettering their neural networks. But as mentioned earlier, the limitations of the computing hardware hindered the advancement of AI. From the late 1980s all the way up to the 2010s, machine learning dominated. Every major tech company was investing heavily in machine learning, and companies such as Google, Amazon, IBM and Facebook were virtually dragging AI and ML PhDs straight from universities. But these days, even machine learning has taken a back seat. It's all about deep learning now. There's definitely been an evolution of AI in the last few decades, and it's getting better with every passing year. You can visualize this evolution from the image below.

Artificial Neural Network


In information technology (IT), an artificial neural network (ANN) is a system of hardware and/or software patterned after the operation of neurons in the human brain. ANNs, also called simply neural networks, are a variety of deep learning technology, which also falls under the umbrella of artificial intelligence, or AI.

Commercial applications of these technologies generally focus on solving complex signal processing or pattern recognition problems. Examples of significant commercial applications since 2000 include handwriting recognition for check processing, speech-to-text transcription, oil-exploration data analysis, weather prediction and facial recognition.

How artificial neural networks work


An ANN usually involves a large number of processors operating in parallel and arranged in tiers. The first tier receives the raw input information, analogous to optic nerves in human visual processing. Each successive tier receives the output from the tier preceding it rather than from the raw input, in the same way neurons further from the optic nerve receive signals from those closer to it. The last tier produces the output of the system.

Each processing node has its own small sphere of knowledge, including what it
has seen and any rules it was originally programmed with or developed for itself.
The tiers are highly interconnected, which means each node in tier n will be connected to many nodes in tier n-1, which supply its inputs, and to many nodes in tier n+1, for which it provides input data. There may be one or multiple nodes in the output layer, from which the answer the network produces can be read.

Artificial neural networks are notable for being adaptive, which means they
modify themselves as they learn from initial training and subsequent runs
provide more information about the world. The most basic learning model is
centered on weighting the input streams, which is how each node weights the
importance of input data from each of its predecessors. Inputs that contribute to
getting right answers are weighted higher.

How neural networks learn


Typically, an ANN is initially trained or fed large amounts of data. Training
consists of providing input and telling the network what the output should be.
For example, to build a network that identifies the faces of actors, the initial
training might be a series of pictures, including actors, non-actors, masks,
statuary and animal faces. Each input is accompanied by the matching
identification, such as actors' names, "not actor" or "not human" information.
Providing the answers allows the model to adjust its internal weightings to learn
how to do its job better.

For example, if nodes David, Dianne and Dakota tell node Ernie the current
input image is a picture of Brad Pitt, but node Durango says it is Betty White,
and the training program confirms it is Pitt, Ernie will decrease the weight it
assigns to Durango's input and increase the weight it gives to that of David,
Dianne and Dakota.
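
The toy Python sketch below puts rough numbers on that example; the node names, starting weights and learning rate are illustrative, but it shows the weighted voting and the weight adjustment that follows the confirmed answer:

weights = {"David": 1.0, "Dianne": 1.0, "Dakota": 1.0, "Durango": 1.0}
votes = {"David": "Brad Pitt", "Dianne": "Brad Pitt",
         "Dakota": "Brad Pitt", "Durango": "Betty White"}
confirmed_label = "Brad Pitt"

# Ernie's decision: the label with the highest total incoming weight.
totals = {}
for node, vote in votes.items():
    totals[vote] = totals.get(vote, 0.0) + weights[node]
decision = max(totals, key=totals.get)
print("Ernie's decision:", decision)

# Learning step: reward nodes that voted correctly, penalize the others.
LEARNING_RATE = 0.1
for node, vote in votes.items():
    if vote == confirmed_label:
        weights[node] += LEARNING_RATE
    else:
        weights[node] -= LEARNING_RATE
print("updated weights:", weights)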

In defining the rules and making determinations, that is, the decision of each node on what to send to the next tier based on inputs from the previous tier, neural networks use several principles. These include gradient-based training, fuzzy logic, genetic algorithms and Bayesian methods. They may be given some basic rules about object relationships in the space being modeled.

For example, a facial recognition system might be instructed, "Eyebrows are found above eyes," or, "Moustaches are below a nose. Moustaches are above and/or beside a mouth." Preloading rules can make training faster and make the model more powerful sooner. But it also builds in assumptions about the nature of the problem space, which may prove to be either irrelevant and unhelpful or incorrect and counterproductive, making the decision about what, if any, rules to build in very important.

Further, the assumptions people make when training algorithms cause neural networks to amplify cultural biases. Biased data sets are an ongoing challenge in training systems that find answers on their own by recognizing patterns in data. If the data feeding the algorithm isn't neutral, and almost no data is, the machine propagates bias.

Types of neural networks


Neural networks are sometimes described in terms of their depth, including how
many layers they have between input and output, or the model's so-called hidden
layers. This is why the term neural network is used almost synonymously with
deep learning. They can also be described by the number of hidden nodes the
model has or in terms of how many inputs and outputs each node has. Variations
on the classic neural network design allow various forms of forward and
backward propagation of information among tiers.

Specific types of artificial neural networks include:

Feed-forward neural networks
Recurrent neural networks
Convolutional neural networks
Deconvolutional neural networks
Modular neural networks

Feed-forward neural networks are one of the simplest variants of neural networks. They pass information in one direction, through various input nodes, until it makes it to the output node. The network may or may not have hidden node layers, making their functioning more interpretable. It is prepared to process large amounts of noise. This type of ANN computational model is used in technologies such as facial recognition and computer vision.

Recurrent neural networks (RNN) are more complex. They save the output of
processing nodes and feed the result back into the model. This is how the model
is said to learn to predict the outcome of a layer. Each node in the RNN model
acts as a memory cell, continuing the computation and implementation of
operations. This neural network starts with the same front propagation as a feed-
forward network, but then goes on to remember all processed information in
order to reuse it in the future. If the network's prediction is incorrect, then the
system self-learns and continues working towards the correct prediction during
backpropagation. This type of ANN is frequently used in text-to-speech
conversions.

Convolutional neural networks (CNN) are one of the most popular models
used today. This neural network computational model uses a variation of
multilayer perceptrons and contains one or more convolutional layers that can be
either entirely connected or pooled. These convolutional layers create feature
maps that record a region of image which is ultimately broken into rectangles
and sent out for nonlinear processing. The CNN model is particularly popular in
the realm of image recognition; it has been used in many of the most advanced
applications of AI, including facial recognition, text digitization and natural
language processing. Other uses include paraphrase detection, signal processing
and image classification.
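
A minimal sketch of such a model, assuming TensorFlow/Keras is installed, is shown below; the layer sizes and the 28x28 grayscale input are illustrative rather than prescriptive:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),                       # 28x28 grayscale images
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),   # convolutional feature maps
    tf.keras.layers.MaxPooling2D((2, 2)),                    # pooled layer
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),         # one output per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()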
Deconvolutional neural networks utilize a reversed CNN model process. They
aim to find lost features or signals that may have originally been considered
unimportant to the CNN system's task. This network model can be used in image
synthesis and analysis.

Modular neural networks contain multiple neural networks working separately from one another. The networks do not communicate or interfere with each other's activities during the computation process. Consequently, complex or big computational processes can be performed more efficiently.

Advantages of artificial neural networks


Advantages of artificial neural networks include:

Parallel processing abilities mean the network can perform more than
one job at a time.
Information is stored on an entire network, not just a database.
The ability to learn and model nonlinear, complex relationships helps
model the real life relationships between input and output.
Fault tolerance means the corruption of one or more cells of the ANN
will not stop the generation of output.
Gradual corruption means the network will slowly degrade over time,
instead of a problem destroying the network instantly.
The ability to produce output with incomplete knowledge with the loss
of performance being based on how important the missing information
is.
No restrictions are placed on the input variables, such as how they
should be distributed.
Machine learning means the ANN can learn from events and make
decisions based on the observations.
The ability to learn hidden relationships in the data without imposing any fixed relationship means an ANN can better model highly volatile data and non-constant variance.
The ability to generalize and infer unseen relationships on unseen data
means ANNs can predict the output of unseen data.

Disadvantages of artificial neural networks


The disadvantages of ANNs include:

The lack of rules for determining the proper network structure means the
appropriate artificial neural network architecture can only be found
through trial and error and experience.
The requirement of processors with parallel processing abilities makes
neural networks hardware dependent.
The network works with numerical information, therefore all problems
must be translated into numerical values before they can be presented to
the ANN.
The lack of explanation behind the solutions the network produces is one
of the biggest disadvantages of ANNs. The inability to explain the why or
how behind the solution generates a lack of trust in the network.

Applications of artificial neural networks


Image recognition was one of the first areas to which neural networks were
successfully applied, but the technology uses have expanded to many more
areas, including:

Chatbots
Natural language processing, translation and language generation
Stock market prediction
Delivery driver route planning and optimization
Drug discovery and development

These are just a few specific areas to which neural networks are being applied
today. Prime uses involve any process that operates according to strict rules or
patterns and has large amounts of data. If the data involved is too large for a
human to make sense of in a reasonable amount of time, the process is likely a
prime candidate for automation through artificial neural networks.

History of neural networks


The history of artificial neural networks goes back to the early days of
computing. In 1943, mathematicians Warren McCulloch and Walter Pitts built a
circuitry system intended to approximate the functioning of the human brain that
ran simple algorithms.

In 1957, Cornell University researcher Frank Rosenblatt developed the
perceptron, an algorithm designed to perform advanced pattern recognition,
ultimately building toward the ability for machines to recognize objects in
images. But the perceptron failed to deliver on its promise, and during the 1960s,
artificial neural network research fell off.
In 1969, MIT researchers Marvin Minsky and Seymour Papert published the
book Perceptrons, which spelled out several issues with neural networks,
including the fact that computers of the day were too limited in their computing
power to process the data needed for neural networks to operate as intended.
Many feel this book led to a prolonged "AI winter" in which research into neural
networks stopped.

It wasn't until around 2010 that research picked up again. The big data trend,
where companies amass vast troves of data, and parallel computing gave data
scientists the training data and computing resources needed to run complex
artificial neural networks. In 2012, a neural network was able to beat human
performance at an image recognition task as part of the ImageNet competition.
Since then, interest in artificial neural networks has soared, and the technology
continues to improve.

Machine Learning
Machine learning (ML) is a type of artificial intelligence (AI) that allows
software applications to become more accurate at predicting outcomes without
being explicitly programmed to do so. Machine learning algorithms use
historical data as input to predict new output values.

Recommendation engines are a common use case for machine learning. Other
popular uses include fraud detection, spam filtering, malware threat
detection, business process automation (BPA) and predictive maintenance.

Types of machine learning


Classical machine learning is often categorized by how an algorithm learns to
become more accurate in its predictions. There are four basic
approaches: supervised learning, unsupervised learning, semi-supervised
learning and reinforcement learning. The type of algorithm a data scientist
chooses to use depends on what type of data they want to predict.

Supervised learning. In this type of machine learning, data
scientists supply algorithms with labeled training data and define the
variables they want the algorithm to assess for correlations. Both the
input and the output of the algorithm are specified.
Unsupervised learning. This type of machine learning involves
algorithms that train on unlabeled data. The algorithm scans through
data sets looking for any meaningful connection. Neither the data the
algorithms train on nor the predictions or recommendations they output
are predetermined.
Semi-supervised learning. This approach to machine learning involves
a mix of the two preceding types. Data scientists may feed an algorithm
mostly labeled training data, but the model is free to explore the data on
its own and develop its own understanding of the data set.
Reinforcement learning. Reinforcement learning is typically used to
teach a machine to complete a multi-step process for which there are
clearly defined rules. Data scientists program an algorithm to complete a
task and give it positive or negative cues as it works out how to
complete a task. But for the most part, the algorithm decides on its own
what steps to take along the way.
How supervised machine learning works
Supervised machine learning requires the data scientist to train the algorithm
with both labeled inputs and desired outputs. Supervised learning algorithms are
good for the following tasks:

Binary classification. Dividing data into two categories.
Multi-class classification. Choosing between more than two types of
answers.
Regression modeling. Predicting continuous values.
Ensembling. Combining the predictions of multiple machine learning
models to produce an accurate prediction.
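As a minimal illustration of the binary classification task above, the following
scikit-learn sketch supplies both labeled inputs and outputs; the tiny data set is
invented.

# Supervised learning sketch: both inputs and labels are supplied up front.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = [[25, 0], [47, 1], [35, 0], [52, 1], [23, 0], [58, 1], [40, 1], [30, 0]]
y = [0, 1, 0, 1, 0, 1, 1, 0]          # labels defined by the data scientist

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)  # learn the input-output mapping
print(clf.predict(X_test))                        # predictions for held-out rows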
How unsupervised machine learning works
Unsupervised machine learning algorithms do not require data to be labeled.
They sift through unlabeled data to look for patterns that can be used to group
data points into subsets. Some types of deep learning, including certain neural
networks, can be used as unsupervised algorithms. Unsupervised learning
algorithms are good for the
following tasks:

Clustering. Splitting the data set into groups based on similarity.
Anomaly detection. Identifying unusual data points in a data set.
Association mining. Identifying sets of items in a data set that
frequently occur together.
Dimensionality Reduction. Reducing the number of variables in a data
set.
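As one example, the anomaly detection task can be sketched with scikit-learn's
IsolationForest. No labels are supplied, and the readings below are invented.

# Unsupervised learning sketch: flag unusual points without any labels.
from sklearn.ensemble import IsolationForest

X = [[10.1], [9.8], [10.3], [10.0], [9.9], [55.0], [10.2], [9.7]]
detector = IsolationForest(contamination=0.15, random_state=0).fit(X)
print(detector.predict(X))   # 1 = looks normal, -1 = flagged as an anomaly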
How semi-supervised learning works
Semi-supervised learning works by data scientists feeding a small amount
of labeled training data to an algorithm. From this, the algorithm learns the
dimensions of the data set, which it can then apply to new, unlabeled data. The
performance of algorithms typically improves when they train on labeled data
sets. But labeling data can be time-consuming and expensive. Semi-supervised
learning strikes a middle ground between the performance of supervised learning
and the efficiency of unsupervised learning. Some areas where semi-supervised
learning is used include:

Machine translation. Teaching algorithms to translate language based
on less than a full dictionary of words.
Fraud detection. Identifying cases of fraud when you only have a few
positive examples.
Labeling data. Algorithms trained on small data sets can learn to apply
data labels to larger sets automatically.
How reinforcement learning works
Reinforcement learning works by programming an algorithm with a distinct goal
and a prescribed set of rules for accomplishing that goal. Data scientists also
program the algorithm to seek positive rewards -- which it receives when it
performs an action that is beneficial toward the ultimate goal -- and avoid
punishments -- which it receives when it performs an action that gets it farther
away from its ultimate goal. Reinforcement learning is often used in areas like:

Robotics. Robots can learn to perform tasks in the physical world using
this technique.
Video gameplay. Reinforcement learning has been used to teach bots to
play a number of video games.
Resource management. Given finite resources and a defined goal,
reinforcement learning can help enterprises plan how to allocate
resources.
Uses of machine learning
Today, machine learning is used in a wide range of applications. Perhaps one of
the most well-known examples of machine learning in action is
the recommendation engine that powers Facebook's News Feed.

Facebook uses machine learning to personalize how each member's feed is
delivered. If a member frequently stops to read a particular group's posts, the
recommendation engine will start to show more of that group's activity earlier in
the feed.

Behind the scenes, the engine is attempting to reinforce known patterns in the
member's online behavior. Should the member change patterns and fail to read
posts from that group in the coming weeks, the News Feed will adjust
accordingly.

In addition to recommendation engines, other uses for machine learning include
the following:
Customer relationship management -- CRM software can use machine learning
models to analyze email and prompt sales team members to respond to the most
important messages first. More advanced systems can even recommend
potentially effective responses.

Business intelligence -- BI and analytics vendors use machine learning in their
software to identify potentially important data points, patterns of data points and
anomalies.

Human resource information systems -- HRIS systems can use machine learning
models to filter through applications and identify the best candidates for an open
position.

Self-driving cars -- Machine learning algorithms can even make it possible for a
semi-autonomous car to recognize a partially visible object and alert the driver.

Virtual assistants -- Smart assistants typically combine supervised and
unsupervised machine learning models to interpret natural speech and supply
context.

Advantages and disadvantages


Machine learning has seen powerful use cases ranging from predicting customer
behavior to serving as the operating system for self-driving cars. But just because
some industries have seen benefits doesn't mean machine learning is without its
downsides.

When it comes to advantages, machine learning can help enterprises understand
their customers at a deeper level. By collecting customer data and correlating it
with behaviors over time, machine learning algorithms can learn associations
and help teams tailor product development and marketing initiatives to customer
demand.

Some internet companies use machine learning as a primary driver in their
business models. Uber, for example, uses algorithms to match drivers with
riders. Google uses machine learning to surface the right advertisements in
searches.

But machine learning comes with disadvantages. First and foremost, it can be
expensive. Machine learning projects are typically driven by data scientists, who
command high salaries. These projects also require software infrastructure that
can be high-cost.

There is also the problem of machine learning bias. Algorithms trained on
data sets that exclude certain populations or contain errors can lead to inaccurate
models of the world that, at best, fail and, at worst, are discriminatory. When an
enterprise bases core business processes on biased models, it can run into
regulatory and reputational harm.

Choosing the right machine learning model


The process of choosing the right machine learning model to solve a problem
can be time-consuming if not approached strategically.

Step 1: Align the problem with potential data inputs that should be considered
for the solution. This step requires help from data scientists and experts who
have a deep understanding of the problem.

Step 2: Collect data, format it and label the data if necessary. This step is
typically led by data scientists, with help from data wranglers.

Step 3: Choose which algorithm(s) to use and test to see how well they perform.
This step is usually carried out by data scientists.

Step 4: Continue to fine-tune outputs until they reach an acceptable level of
accuracy. This step is usually carried out by data scientists with feedback from
experts who have a deep understanding of the problem.

Importance of human-interpretable machine learning


Explaining how a specific ML model works can be challenging when the model
is complex. There are some vertical industries where data scientists have to use
simple machine learning models because it's important for the business to
explain how each and every decision was made. This is especially true in
industries with heavy compliance burdens like banking and insurance.

Complex models can produce accurate predictions, but explaining to a layperson
how an output was determined can be difficult.

The future of machine learning


While machine learning algorithms have been around for decades, they've
attained new popularity as artificial intelligence (AI) has grown in prominence.
Deep learning models, in particular, power today's most advanced AI
applications.

Machine learning platforms are among enterprise technology's most competitive
realms, with most major vendors, including Amazon, Google, Microsoft, IBM
and others, racing to sign customers up for platform services that cover the
spectrum of machine learning activities, including data collection, data
preparation, data classification, model building, training and application
deployment.

As machine learning continues to increase in importance to business operations
and AI becomes ever more practical in enterprise settings, the machine learning
platform wars will only intensify.

Continued research into deep learning and AI is increasingly focused on
developing more general applications. Today's AI models require extensive
training in order to produce an algorithm that is highly optimized to perform one
task. But some researchers are exploring ways to make models more flexible and
are seeking techniques that allow a machine to apply context learned from one
task to future, different tasks.

History of machine learning


1642 - Blaise Pascal invents a mechanical machine that can add, subtract,
multiply and divide.

1679 - Gottfried Wilhelm Leibniz devises the system of binary code.

1834 - Charles Babbage conceives the idea for a general all-purpose device that
could be programmed with punched cards.
1842 - Ada Lovelace describes a sequence of operations for solving
mathematical problems using Charles Babbage's theoretical punch- card
machine and becomes the first programmer.

1847 - George Boole creates Boolean logic, a form of algebra in which all values
can be reduced to the binary values of true or false.

1936 - English logician and cryptanalyst Alan Turing proposes
a universal machine that could decipher and execute a set of instructions. His
published proof is considered the basis of computer science.
1952 - Arthur Samuel creates a program to help an IBM computer get better at
checkers the more it plays.

1959 - MADALINE becomes the first artificial neural network applied to a real-
world problem: removing echoes from phone lines.

1985 - Terry Sejnowski and Charles Rosenberg's artificial neural network taught
itself how to correctly pronounce 20,000 words in one week.

1997 - IBM's Deep Blue beat chess grandmaster Garry Kasparov.

1999 - A CAD prototype intelligent workstation reviewed 22,000 mammograms
and detected cancer 52% more accurately than radiologists did.

2006 - Computer scientist Geoffrey Hinton invents the term deep learning to
describe neural net research.

2012 - An unsupervised neural network created by Google learned to recognize
cats in YouTube videos with 74.8% accuracy.

2014 - A chatbot passes the Turing Test by convincing 33% of human judges
that it was a Ukrainian teen named Eugene Goostman.

2016 - Google DeepMind's AlphaGo defeats world champion Lee Sedol at Go, one
of the most difficult board games in the world.
2016 - LipNet, DeepMind's artificial-intelligence system, identifies lip-read
words in video with an accuracy of 93.4%.

2019 - Amazon controls 70% of the market share for virtual assistants in the
U.S.
Types of Machine Learning Algorithms

Model development is not a one-size-fits-all affair -- there are different types of
machine learning algorithms for different business goals and data sets. For
example, the relatively straightforward linear regression algorithm is easier to
train and implement than other machine learning algorithms, but it may fail to
add value to a model requiring complex predictions.

The nine machine learning algorithms that follow are among the most popular
and commonly used to train enterprise models. The models each support
different goals, range in user friendliness and use one or more of the following
machine learning approaches: supervised learning, unsupervised learning, semi-
supervised learning or reinforcement learning.

Supervised machine learning algorithms


Supervised learning models require data scientists to provide the algorithm with
data sets for input and parameters for output, as well as feedback on accuracy
during the training process. They are task-based, and test on labeled data sets.

Linear regression

The most popular type of machine learning algorithm is arguably linear
regression. Linear regression algorithms map simple correlations between two
variables in a set of data. A set of inputs and their corresponding outputs are
examined and quantified to show a relationship, including how a change in one
variable affects the other. Linear regressions are plotted via a line on a graph.

Linear regression's popularity is due to its simplicity: The algorithm is easily
explainable, relatively transparent and requires little to no parameter tuning.
Linear regression is frequently used in sales forecasting and risk assessment for
enterprises that seek to make long-term business decisions.

Linear regression is best for when "you are looking at predicting your value or
predicting a class," said Shekhar Vemuri, CTO of technology service company
Clairvoyant, based in Chandler, Ariz.
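
A minimal linear regression sketch with scikit-learn, assuming a made-up relationship
between a single input (say, ad spend) and a continuous output (sales):

# Linear regression sketch: fit a line relating one variable to another.
from sklearn.linear_model import LinearRegression

X = [[1.0], [2.0], [3.0], [4.0], [5.0]]   # illustrative input values
y = [2.1, 4.3, 6.2, 8.1, 9.9]             # illustrative continuous outputs

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)      # slope and intercept of the fitted line
print(model.predict([[6.0]]))             # forecast for a new input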

Support vector machines


Support vector machines, or SVM, is a machine learning algorithm that
separates data into classes. During model training, SVM finds a line that
separates data in a given set into specific classes and maximizes the margins of
each class. After learning these classification lines, the model can then apply
them to future data.

This algorithm works best for training data that can clearly be separated by a
line, also referred to as a hyperplane. Nonlinear data can be handled by a
variant of the technique called the nonlinear SVM. But with training data that's highly
complex -- faces, personality traits, genomes and genetic material -- the class
boundaries become harder to identify and require a bit more human
assistance.

SVMs are used heavily in the financial sector, as they offer high accuracy on
both current and future data sets. The algorithms can be used to compare relative
financial performance, value and investment gains virtually.

Companies with nonlinear data and different kinds of data sets often use SVM,
Vemuri said.
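
A minimal SVM sketch with scikit-learn follows; the points are invented and linearly
separable, so a linear kernel is used, while an 'rbf' kernel would be a common choice
for nonlinear data.

# SVM sketch: learn a maximum-margin boundary between two classes.
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)   # finds the separating hyperplane
print(clf.predict([[2, 2], [7, 7]]))   # classify new points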

Decision tree

A decision tree algorithm takes data and graphs it out in branches to show the
possible outcomes of a variety of decisions. Decision trees classify or predict
response variables based on past decisions.

Decision trees are a visual method of mapping out decisions. Their results are
easy to explain and can be accessible to citizen data scientists. A decision tree
algorithm maps out various decisions and their likely impact on an end result
and can even be used with incomplete data sets.

Decision trees, due to their long-tail visuals, work best for small data sets, low-
stakes decisions and concrete variables. Because of this, common decision tree
use cases involve augmenting option pricing -- from mortgage lenders
classifying borrowers to product management teams quantifying the shift in
market that would occur if they changed a major ingredient.

Decision trees remain popular because they can outline multiple outcomes and
tests without requiring data scientists to deploy multiple algorithms, said Jeff
Fried, director of product management for InterSystems, a software company
based in Cambridge, Mass.
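
A minimal decision tree sketch with scikit-learn, using invented borrower features;
printing the learned rules with export_text keeps the result easy to explain.

# Decision tree sketch: branch on feature thresholds to classify borrowers.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[620, 0], [710, 1], [540, 0], [680, 1], [750, 1], [590, 0]]  # [score, employed]
y = [0, 1, 0, 1, 1, 0]                                            # 1 = approve

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["credit_score", "employed"]))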

Unsupervised machine learning algorithms


Unsupervised machine learning algorithms are not given labeled training data by
data scientists. Instead, they rely on techniques such as deep learning to identify
patterns in data by combing through sets of unlabeled training data and
observing correlations. Unsupervised learning
models receive no information about what to look for in the data or which data
features to examine.

Apriori

The Apriori algorithm, based on the Apriori principle, is most commonly used in
market basket analysis to mine item sets and generate association rules. The
algorithms check for a correlation between two items in a data set to determine if
there's a positive or negative correlation between them.

The Apriori algorithm is primed for sales teams that seek to notice which
products customers are more likely to buy in combination with other products. If
a high percentage of customers who purchase bread also purchase butter, the
algorithm can conclude that purchase of A (bread) will often lead to purchase of
B (butter). This can be cross-referenced in data sets, data points and purchase
ratios.

Apriori algorithms can also determine that purchase of A (bread) is only 10%
likely to lead to the purchase of C (corn). Marketing teams can use this
information to inform things like product placement strategies. Besides sales
functions, Apriori algorithms are favored by e-commerce giants, like Amazon
and Alibaba, but are also used to understand searcher intent by sites like Bing
and Google to predict searches by correlating associated words.
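
The support and confidence counting behind these associations can be sketched in
plain Python; the baskets are toy data, and a full Apriori implementation would also
prune candidate item sets level by level.

# Simplified market basket sketch: count how often item pairs occur together.
from collections import Counter
from itertools import combinations

baskets = [{"bread", "butter"}, {"bread", "butter", "jam"},
           {"bread", "corn"}, {"butter", "milk"}, {"bread", "butter"}]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for (a, b), count in pair_counts.items():
    support = count / len(baskets)                            # share of all baskets
    confidence = count / sum(1 for bk in baskets if a in bk)  # how often a leads to b
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")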

K-means clustering

The K-means algorithm is an iterative method of sorting data points into groups
based on similar characteristics. For example, a K-means cluster algorithm
would sort web results for the word civic into groups relating to Honda Civic
and civic as in municipal or civil.

K-means clustering has a reputation for accurate, streamlined groupings
processed in a relatively short period of time, compared to other algorithms. K-
means clustering is popular among search engines to produce relevant
information and enterprises looking to group user behaviors by connotative
meaning, or IT performance monitoring.
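A minimal K-means sketch with scikit-learn, assuming a toy two-dimensional data set
with two obvious groups:

# K-means sketch: assign points to k clusters based on similarity.
from sklearn.cluster import KMeans

X = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],   # one natural group
     [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]]   # another natural group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # centroid of each group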
Semi-supervised machine learning algorithms
Semi-supervised learning teaches an algorithm through a mix of labeled and
unlabeled data. This algorithm learns certain information through a set of
labelled categories, suggestions and examples. Semi-supervised algorithms then
create their own labels by exploring the data set or virtual world on their own,
following a rough outline or some data scientist feedback.

Generative Adversarial Networks

GANs are deep generative models that have gained popularity. GANs have the
ability to imitate data in order to model and predict. They work by essentially
pitting two models against each other in a competition to develop the best
solution to a problem. One neural network, a generator, creates new data while
another, the discriminator, works to improve on the generator's data. After many
iterations of this, data sets become more and more lifelike and realistic. Popular
media uses GANs to do things like face creation and audio manipulation. GANs
are also impactful for creating large data sets using limited training points,
optimizing models and improving manufacturing processes.

Self-trained Naïve Bayes classifier

Self-trained algorithms are examples of semi-supervised learning. Developers
can add to these models a Naïve Bayes classifier, which allows self-trained
algorithms to perform classification tasks simply and easily. When developing a
self-trained model, researchers train the algorithm to recognize object classes on
a labeled training set. Then the researchers have the model classify unlabeled
data. Once that cycle is finished, researchers upload the correct self-categorized
labels to the training data and retrain. Self-trained models are popular in natural
language processing (NLP) and among organizations with limited labeled data
sets.
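
A minimal self-training sketch with scikit-learn, where unlabeled points are marked
with -1 and a Naïve Bayes classifier is wrapped in a self-training loop; the
one-dimensional data is invented.

# Semi-supervised sketch: self-training around a Naive Bayes classifier.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8], [1.1], [5.1]])
y = np.array([0, 0, -1, 1, -1, 1, -1, -1])   # -1 marks unlabeled examples

model = SelfTrainingClassifier(GaussianNB()).fit(X, y)  # labels the rest itself
print(model.predict([[1.05], [4.9]]))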

Reinforcement learning
Reinforcement learning algorithms are based on a system of rewards and
punishments learned through trial and error. The model is given a goal and seeks
maximum reward for getting closer to that goal based on limited information and
learns from its previous actions. Reinforcement learning algorithms can be
model-free -- creating interpretations of data through constant trial and error -- or
model-based -- adhering more closely to a set of predefined steps with minimal
trial and error.
Q-learning

Q-learning algorithms are model-free, which means they seek the best way to
achieve a defined goal by trying out as many actions as possible and pursuing
the maximum reward. Q-learning is often paired with deep learning
models in research projects, including Google's DeepMind. Q-learning further
breaks down into various algorithms, including deep deterministic policy
gradient (DDPG) or hindsight experience replay (HER).
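
A minimal tabular Q-learning sketch on an invented five-state corridor shows the core
update rule; deep variants replace the table with a neural network.

# Tabular Q-learning sketch: states 0-4, with a reward for reaching state 4.
import numpy as np

n_states, n_actions = 5, 2               # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2    # learning rate, discount, exploration
rng = np.random.default_rng(0)

def greedy(q_row):
    # Break ties randomly so unexplored states do not default to one action.
    return int(rng.choice(np.flatnonzero(q_row == q_row.max())))

for episode in range(500):
    state = 0
    while state != 4:
        # Epsilon-greedy: usually exploit the best known action, sometimes explore.
        action = int(rng.integers(n_actions)) if rng.random() < epsilon else greedy(Q[state])
        next_state = min(state + 1, 4) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == 4 else 0.0
        # Core update: nudge Q toward reward plus discounted future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max()
                                     - Q[state, action])
        state = next_state

print(Q)   # right-moving actions end up with the highest values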

Model-based value estimation


Unlike model-free approaches like Q-learning, model-based algorithms have a
limited degree of freedom to create potential states and actions and are
statistically more efficient. Such algorithms, like the popular MBVE, are fitted
with a specific data set and base action using supervised learning. Designers of
MBVE note that "model-based methods can quickly arrive at near-optimal
control with learned models under fairly restricted dynamics classes." Model-
based methods are designed for specific use cases.

Automated Machine Learning tools pave the way to AI

Automated machine learning is one of the trendiest and most popular areas of
enterprise AI software right now. With vendors offering everything from
individual automated machine learning tools to cloud-based, full-service
programs, autoML is quickly helping enterprises streamline business processes and
dive into AI.

In light of the rise of autoML, analysts and experts are encouraging enterprises
to evaluate their specific needs alongside the intended purpose of the tools -- to
augment data scientists' work -- instead of trying to use autoML without a larger
AI framework.

Whether your enterprise has a flourishing data science team, citizen data science
team or relies heavily on outsourcing data science work, autoML can provide
value if you choose tools and use cases wisely.

AutoML and data scientists


Enterprises are applying automated machine learning in a diverse range of use
cases, from developing retail insights to training robots. Whatever the
environment or the business process being automated, experts said the real
promise of autoML is the ability to collaborate with data scientists.

"Make sure that you're using [autoML] for the right intended purpose, which is
automate the grunt work that a data scientist typically has to do," said Shekhar
Vemuri, CTO of technology service company Clairvoyant, based in Chandler,
Ariz.

AutoML tools are being used to augment and speed up the modeling process,
because data scientists spend most of their time on data engineering and data
washing, said Evan Schnidman, CEO of natural language processing company
Prattle, based in St. Louis.
"The first ranges of tools are all about how [to] streamline the data ingestion,
data washing process. The next ranges of the tools are how [to] then streamline
model development and model deployment. And then the third ranges are how
[to] streamline model testing and validation," he said.

Still, experts warned autoML users not to expect automated machine learning
tools to replace data scientists.

AutoML and augmented analytics do not fully replace expert data scientists, said
Carlie Idoine, senior director and analyst of data science and business analytics
at Gartner.

"This is an extension of data science and machine learning capability, not a
replacement," she said. "We can automate some of the capabilities, but it's still a
good idea to have experts involved in processes that may be evaluating or
validating the models."

Intention equals value


If an enterprise intends to automate or augment a part of the data science
process, it has a chance to succeed. If it intends to replace data science teams or
expects results overnight, autoML technology will disappoint. Choosing the tool
or program will depend heavily on the intention, goal and project for which the
enterprise is solving.

"A key realization should be that we're using autoML to essentially gain scale
and try out more things than we could do manually or hand code," Vemuri said.

Schnidman echoed the sentiment, calling autoML a support tool for data
scientists. Businesses that have a mature data science team are poised to get the
most net value, because the automated tools are an extension of data scientists'
capabilities.

"AutoML works for those who say, We've done this manually and taken it as far
as we can go. So, we want to use these augmented tools to do feature
engineering, maybe take out some bias we have and see what it finds that we
didn't consider,'" Idoine said.

If enterprises intend for autoML to replace their data science team, or be their
only point of AI development, the tools will give limited advantages. AutoML is
only one step of many in an overall AI strategy -- especially in enterprises that
are heavily regulated and those affected by recent data protection laws.

"Regulated industries and verticals have all these other legal concerns that they
need to keep in mind and stay on top of. Make sure that you're able to ensure that
your tool of choice is able to integrate into your overall AI workflow," Vemuri
said.

Limitations of tools
The biggest limitation of automated machine learning tools today is they work
best on known types of problems using algorithms like regression and
classification. Because autoML has to be programmed to follow steps, some
algorithms and forms of AI are not compatible with automation.
"Some of the newer types of algorithms like deep neural nets aren't really well
suited for autoML; that type of analysis is much more sophisticated, and it's not
something that can be easily explained," Idoine said.

AutoML is also wrapped up in the problem of black box algorithms and testing.
If a process can't be easily outlined -- even if the automated machine learning
tool can complete it -- the process will be hard to explain. Black box
functionality comes with a whole host of its own issues, including bias and
struggles with incomplete data sets.

"We don't want to encourage black boxes for people that aren't experts in this
type of work," Idoine said.

Getting to machine learning in production takes focus


Data scientists that build AI models and data engineers that deploy machine
learning in production work in two different realms. This makes it hard to
efficiently bring a new predictive model into production.

But some enterprises are finding ways to work around this problem. At the Flink
Forward conference in San Francisco, engineers at Comcast and Capital One
described how they are using Apache Flink to help bridge this gap to speed the
deployment of new AI algorithms.

Version everything
The tools used by data scientists and engineers can differ in subtle ways. That
leads to problems replicating good AI models in production.

Comcast is experimenting with versioning all the artifacts that go into
developing and deploying AI models. This includes machine learning data
models, model features and code running machine learning predictions. All the
components are stored in GitHub, which makes it possible to tie together models
developed by data scientists and code deployed by engineers.

"This ensures that what we put into production and feature engineering are
consistent," said Dave Torok, senior enterprise architect at Comcast.
At the moment, this process is not fully automated. However, the goal is to move
toward full lifecycle automation for Comcast's machine learning development
pipeline.

Bridging the language gap


Data scientists tend to like to use languages like Python, while production
systems run Java. To bridge this gap, Comcast has been building a set of Jython
components for its data scientists.
Jython is an implementation designed to enable data scientists to run Python
apps natively on Java infrastructure. It was first released in 1997 and has grown
in popularity among enterprises launching machine learning initiatives because Python is
commonly used by data scientists to build machine learning models. One
limitation of this approach is that it can't take advantage of many of the features
running on Flink. Jython compiles Python code to run as native Java code.

However, Java developers are required to implement bindings to take advantage
of new Java methods introduced with tools like Flink.
"At some point, we want to look at doing more generation of Flink-native
features," Torok said. "But on the other hand, it gives us flexibility of
deployment."

Capital One ran into similar problems trying to connect Python for its data
scientists and Java for its production environment to create better fraud detection
algorithms. They did some work to build up a Jython library that acts as an
adaptor.

"This lets us implement each feature as accessible in Python," said Jeff
Sharpe, senior software engineer at Capital One.
These applications run within Flink as if they were Java code. One of the
benefits of this approach is that the features can run in parallel, which is not
normally possible in Jython.

Need for fallback mechanisms


Comcast's machine learning models make predictions by correlating multiple
features. However, the data for some of these features is not always available at
runtime, so fallback mechanisms must be implemented.

For example, Comcast has developed a set of predictive models to prioritize
repair truck rolls based on a variety of features, including the number of prior
calls in the last month, a measurement of degraded internet speeds and the
behavior of consumer equipment. But some of this data may not be available to
predict the severity of a customer problem in a timely manner, which can cause a
time-out, triggering the use of a less accurate model that runs with the available
data.

The initial models are created based on an assessment of historical data.


However, Comcast's AI infrastructure enables engineers to feed information
about the performance of machine learning in production back into the model
training process to improve performance over time. The key lies in correlating
predictions of the models with factors like a technician's observations.

Historical data still a challenge


Capital One is using Flink and microservices to make historical and recent data
easier to use to both develop and deploy better fraud detection models.

Andrew Gao, software engineer at Capital One, said the bank's previous
algorithms did not have access to all of a customer's activities. On the production
side, these models needed to be able to return an answer in a reasonable amount
of time.

"We want to catch fraud, but not create a poor customer experience," Gao said.

The initial project started off as one monolithic Flink application. However,
Capital One ran into problems merging data from historical data sources and
current streaming data, so they broke this up into several smaller microservices
that helped address the problem.
This points to one of the current limitations of using stream processing for
building AI apps. Stephan Ewen, chief technology officer at Data Artisans and
lead developer of Flink, said that the development of Flink tooling has
traditionally focused on AI and machine learning in production.

"Engineers can do model training logic using Flink, but we have not pushed for
that. This is coming up more and more," he said.

Deep Learning
Deep learning is a type of machine learning (ML) and artificial intelligence (AI)
that imitates the way humans gain certain types of knowledge. Deep learning is
an important element of data science, which includes statistics and predictive
modeling. It is extremely beneficial to data scientists who are tasked with
collecting, analyzing and interpreting large amounts of data; deep learning
makes this process faster and easier.

At its simplest, deep learning can be thought of as a way to automate predictive
analytics. While traditional machine learning algorithms are linear, deep
learning algorithms are stacked in a hierarchy of increasing complexity and
abstraction.
To understand deep learning, imagine a toddler whose first word is dog. The
toddler learns what a dog is -- and is not -- by pointing to objects and saying the
word dog. The parent says, "Yes, that is a dog," or, "No, that is not a dog." As
the toddler continues to point to objects, he becomes more aware of the features
that all dogs possess. What the toddler does, without knowing it, is clarify a
complex abstraction -- the concept of dog -- by building a hierarchy in which
each level of abstraction is created with knowledge that was gained from the
preceding layer of the hierarchy.
How deep learning works
Computer programs that use deep learning go through much the same process as
the toddler learning to identify the dog. Each algorithm in the hierarchy applies a
nonlinear transformation to its input and uses what it learns to create a statistical
model as output. Iterations continue until the output has reached an acceptable
level of accuracy. The number of processing layers through which data must pass
is what inspired the label deep.

In traditional machine learning, the learning process is supervised, and the
programmer has to be extremely specific when telling the computer what types
of things it should be looking for to decide if an image contains a dog or does
not contain a dog. This is a laborious process called feature extraction, and the
computer's success rate depends entirely upon the programmer's ability to
accurately define a feature set for "dog." The advantage of deep learning is the
program builds the feature set by itself without supervision. Unsupervised
learning is not only faster, but it is usually more accurate.

Initially, the computer program might be provided with training data -- a set of
images for which a human has labeled each image "dog" or "not dog" with meta
tags. The program uses the information it receives from the training data to
create a feature set for "dog" and build a predictive model. In this case, the
model the computer first creates might predict that anything in an image that has
four legs and a tail should be labeled "dog." Of course, the program is not aware
of the labels "four legs" or "tail." It will simply look for patterns of pixels in the
digital data. With each iteration, the predictive model becomes more complex
and more accurate.

Unlike the toddler, who will take weeks or even months to understand the
concept of "dog," a computer program that uses deep learning algorithms can be
shown a training set and sort through millions of images, accurately identifying
which images have dogs in them within a few minutes.

To achieve an acceptable level of accuracy, deep learning programs require
access to immense amounts of training data and processing power, neither of
which were easily available to programmers until the era of big data and cloud
computing. Because deep learning programming can create complex statistical
models directly from its own iterative output, it is able to create accurate
predictive models from large quantities of unlabeled, unstructured data. This is
important as the internet of things (IoT) continues to become more pervasive,
because most of the data humans and machines create is unstructured and is not
labeled.

What are deep learning neural networks?


A type of advanced machine learning algorithm, known as artificial neural
networks, underpins most deep learning models. As a result, deep learning may
sometimes be referred to as deep neural learning or deep neural networking.

Neural networks come in several different forms, including recurrent neural
networks, convolutional neural networks, artificial neural networks and
feedforward neural networks -- and each has benefits for specific use cases.
However, they all function in somewhat similar ways, by feeding data in and
letting the model figure out for itself whether it has made the right interpretation
or decision about a given data element.

Neural networks involve a trial-and-error process, so they need massive amounts
of data on which to train. It's no coincidence neural networks became popular
only after most enterprises embraced big data analytics and accumulated large
stores of data. Because the model's first few iterations involve somewhat-
educated guesses on the contents of an image or parts of speech, the data used
during the training stage must be labeled so the model can see if its guess was
accurate. This means, though many enterprises that use big data have large
amounts of data, unstructured data is less helpful. Unstructured data can only be
analyzed by a deep learning model once it has been trained and reaches an
acceptable level of accuracy, but deep learning models can't train on unstructured
data.

Deep learning methods


Various different methods can be used to create strong deep learning models.
These techniques include learning rate decay, transfer learning, training from
scratch and dropout.
Learning rate decay. The learning rate is a hyperparameter -- a factor that
defines the system or sets conditions for its operation prior to the learning
process -- that controls how much change the model experiences in response to
the estimated error every time the model weights are altered. Learning rates that
are too high may result in unstable training processes or the learning of a
suboptimal set of weights. Learning rates that are too small may produce a
lengthy training process that has the potential to get stuck.
The learning rate decay method -- also called learning rate annealing or adaptive
learning rates -- is the process of adapting the learning rate to increase
performance and reduce training time. The easiest and most common adaptations
of learning rate during training include techniques to reduce the learning rate
over time.
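
A minimal sketch of a time-based decay schedule in Python; the starting rate and decay
factor are arbitrary, and deep learning frameworks ship built-in schedules that do the
same job.

# Learning rate decay sketch: shrink the step size as training progresses.
initial_lr = 0.1
decay = 0.05

for epoch in range(10):
    lr = initial_lr / (1.0 + decay * epoch)   # rate falls gradually over time
    print(f"epoch {epoch}: learning rate = {lr:.4f}")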

Transfer learning. This process involves perfecting a previously trained model;
it requires an interface to the internals of a preexisting network. First, users feed
the existing network new data containing previously unknown classifications.
Once adjustments are made to the network, new tasks can be performed with
more specific categorizing abilities. This method has the advantage of requiring
much less data than others, thus reducing computation time to minutes or hours.
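
A rough transfer learning sketch with Keras: a network pretrained on ImageNet is
frozen and only a small new classification head is trained. The image size, the number
of new categories and the commented-out training data are assumptions for illustration.

# Transfer learning sketch: reuse pretrained layers, retrain only a new head.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False                    # freeze the pretrained feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # 5 new categories (assumed)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(new_images, new_labels, epochs=3)   # hypothetical task-specific data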

Training from scratch. This method requires a developer to collect a large
labeled data set and configure a network architecture that can learn the features
and model. This technique is especially useful for new applications, as well as
applications with a large number of output categories. However, overall, it is a
less common approach, as it requires inordinate amounts of data, causing
training to take days or weeks.

Dropout. This method attempts to solve the problem of overfitting in networks
with large amounts of parameters by randomly dropping units and their
connections from the neural network during training. It has been proven that the
dropout method can improve the performance of neural networks on supervised
learning tasks in areas such as speech recognition, document classification and
computational biology.
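
A minimal dropout sketch with Keras, randomly silencing half of a hidden layer's units
during training; the layer sizes are arbitrary.

# Dropout sketch: randomly drop units during training to curb overfitting.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),                 # drop 50% of units each update
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")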

Examples of deep learning applications


Because deep learning models process information in ways similar to the human
brain, they can be applied to many tasks people do. Deep learning is currently
used in most common image recognition tools, natural language
processing and speech recognition software. These tools are starting to appear in
applications as diverse as self-driving cars and language translation services.

What is deep learning used for?


Use cases today for deep learning include all types of big data
analytics applications, especially those focused on natural language processing,
language translation, medical diagnosis, stock market trading signals, network
security and image recognition.
Specific fields in which deep learning is currently being used include the
following:

Customer experience. Deep learning models are already being used for
chatbots. And, as it continues to mature, deep learning is expected to be
implemented in various businesses to improve the customer experiences
and increase customer satisfaction.
Text generation. Machines are being taught the grammar and style of a
piece of text and are then using this model to automatically create a
completely new text matching the proper spelling, grammar and style of
the original text.
Aerospace and military. Deep learning is being used to detect objects
from satellites that identify areas of interest, as well as safe or unsafe
zones for troops.
Industrial automation. Deep learning is improving worker safety in
environments like factories and warehouses by providing services that
automatically detect when a worker or object is getting too close to a
machine.
Adding color. Color can be added to black and white photos and videos
using deep learning models. In the past, this was an extremely time-
consuming, manual process.
Medical research. Cancer researchers have started implementing deep
learning into their practice as a way to automatically detect cancer cells.
Computer vision. Deep learning has greatly enhanced computer vision,
providing computers with extreme accuracy for object detection and
image classification, restoration and segmentation.
Limitations and challenges
The biggest limitation of deep learning models is they learn through
observations. This means they only know what was in the data on which they
trained. If a user has a small amount of data or it comes from one specific source
that is not necessarily representative of the broader functional area, the models
will not learn in a way that is generalizable.
The issue of biases is also a major problem for deep learning models. If a model
trains on data that contains biases, the model will reproduce those biases in its
predictions. This has been a vexing problem for deep learning programmers,
because models learn to differentiate based on subtle variations in data elements.
Often, the factors it determines are important are not made explicitly clear to the
programmer. This means, for example, a facial recognition model might make
determinations about people's characteristics based on things like race or gender
without the programmer being aware.

The learning rate can also become a major challenge to deep learning models. If
the rate is too high, then the model will converge too quickly, producing a less-
than-optimal solution. If the rate is too low, then the process may get stuck, and
it will be even harder to reach a solution.

The hardware requirements for deep learning models can also create limitations.
Multicore high-performing graphics processing units (GPUs) and other similar
processing units are required to ensure improved efficiency and decreased time
consumption. However, these units are expensive and use large amounts of
energy. Other hardware requirements include random access memory (RAM)
and a hard drive or RAM-based solid-state drive (SSD).

Other limitations and challenges include the following:

Deep learning requires large amounts of data. Furthermore, the more


powerful and accurate models will need more parameters, which, in
turn, requires more data.
Once trained, deep learning models become inflexible and cannot handle
multitasking. They can deliver efficient and accurate solutions, but only
to one specific problem. Even solving a similar problem would require
retraining the system.
Any application that requires reasoning -- such as programming or
applying the scientific method -- long-term planning and algorithmic-
like data manipulation is completely beyond what current deep learning
techniques can do, even with large data.
Deep learning vs. machine learning
Deep learning is a subset of machine learning that differentiates itself through
the way it solves problems. Machine learning requires a domain expert to
identify most applied features. On the other hand, deep learning learns features
incrementally, thus eliminating the need for domain expertise. This makes deep
learning algorithms take much longer to train than machine learning algorithms,
which only need a few seconds to a few hours. However, the reverse is true
during testing. Deep learning algorithms take much less time to run tests than
machine learning algorithms, whose test time increases along with the size of the
data.

Furthermore, machine learning does not require the same costly, high-end
machines and high-performing GPUs that deep learning does.

In the end, many data scientists choose traditional machine learning over deep
learning due to its superior interpretability, or the ability to make sense of the
solutions. Machine learning algorithms are also preferred when the data is small.
Instances where deep learning becomes preferable include situations where there
is a large amount of data, a lack of domain understanding for feature
introspection or complex problems, such as speech recognition and natural
language processing.

History
Deep learning can trace its roots back to 1943 when Warren McCulloch and
Walter Pitts created a computational model for neural networks using
mathematics and algorithms. However, it was not until the mid-2000s that the
term deep learning started to appear. It gained popularity following the
publication of a paper by Geoffrey Hinton and Ruslan Salakhutdinov that
showed how a neural network with many layers could be trained one layer at a
time.

In 2012, Google made a huge impression on deep learning when its algorithm
revealed the ability to recognize cats. Two years later, in 2014, Google bought
DeepMind, an artificial intelligence startup from the U.K. Two years after that,
in 2016, Google DeepMind's algorithm, AlphaGo, mastered the complicated
board game Go, beating professional player Lee Sedol at a tournament in Seoul.

Recently, deep learning models have generated the majority of advances in the
field of artificial intelligence. Deep reinforcement learning has emerged as a way
to integrate AI with complex applications, such as robotics, video games and
self-driving cars. The primary difference between deep learning and
reinforcement learning is, while deep learning learns from a training set and then
applies what is learned to a new data set, deep reinforcement learning learns
dynamically by adjusting actions using continuous feedback in order to optimize
the reward.

A reinforcement learning agent has the ability to provide fast and strong control
of generative adversarial networks (GANs). The Adversarial Threshold Neural
Computer (ATNC) combines deep reinforcement learning with GANs in order to
design small organic molecules with a specific, desired set of pharmacological
properties.

GANs are also being used to generate artificial training data for machine
learning tasks, which can be used in situations with imbalanced data sets or
when data contains sensitive information.

Here is a very simple illustration of how a deep learning program works. This
video by the LuLu Art Group shows the output of a deep learning program after
its initial training with raw motion capture data. This is what the program
predicts the abstract concept of "dance" looks like.

With each iteration, the program's predictive model became more complex and
more accurate.

Deep learning powers a motion-tracking revolution


A surge in the development of artificial-intelligence technology is driving a new
wave of open-source tools for analysing animal behaviour and posture.
As a postdoc, physiologist Valentina Di Santo spent a lot of time scrutinizing
high-resolution films of fish.

Di Santo was investigating the motions involved when fish such as skates swim.
She filmed individual fish in a tank and manually annotated their body parts
frame by frame, an effort that required about a month of full-time work for 72
seconds of footage. Using an open-source application called DLTdv, developed
in the computer language MATLAB, she then extracted the coordinates of body
parts — the key information needed for her research. That analysis showed,
among other things, that when little skates (Leucoraja erinacea) need to swim
faster, they create an arch on their fin margin to stiffen its edge1.

But as the focus of Di Santo’s research shifted from individual animals to
schools of fish, it was clear a new approach would be required. “It would take
me forever to analyse [those data] with the same detail,” says Di Santo, who is
now at Stockholm University. So, she turned to DeepLabCut instead.

DeepLabCut is an open-source software package developed by Mackenzie
Mathis, a neuroscientist at Harvard University in Cambridge, Massachusetts, and
her colleagues, which allows users to train a computational model called a neural
network to track animal postures in videos. The publicly available version didn’t
have an easy way to track multiple animals over time, but Mathis’ team agreed to
run an updated version using the fish data, which Di Santo annotated using a
graphical user interface (GUI). The preliminary output looks promising, Di
Santo says, although she is waiting to see how the tool performs on the full data
set. But without DeepLabCut, she says, the study “would not be possible”.

NatureTech
Researchers have long been interested in tracking animal motion, Mathis says,
because motion is “a very good read-out of intention within the brain”. But
conventionally, that has involved spending hours recording behaviours by hand.
The previous generation of animal-tracking tools mainly determined centre of
mass and sometimes orientation, and the few tools that captured finer details
were highly specialized for specific animals or subject to other constraints, says
Talmo Pereira, a neuroscientist at Princeton University in New Jersey.

Over the past several years, deep learning — an artificial-intelligence method
that uses neural networks to recognize subtle patterns in data — has empowered
a new crop of tools. Open-source packages such as DeepLabCut, LEAP
Estimates Animal Pose (LEAP) and DeepFly3D use deep learning to determine
coordinates of animal body parts in videos. Complementary tools perform tasks
such as identifying specific animals. These packages have aided research on
everything from the study of motion in hunting cheetahs to collective zebrafish
behaviour.

Each tool has limitations; some require specific experimental set-ups, or don’t
work well when animals always crowd together. But methods will improve
alongside advances in image capture and machine learning, says Sandeep Robert
Datta, a neuroscientist at Harvard Medical School in Boston, Massachusetts.
“What you’re looking at now is just the very beginning of what is certain to be a
long-term transformation in the way neuroscientists study behaviour,” he says.

Strike a pose
DeepLabCut is based on software used to analyse human poses. Mathis’ team
adapted its underlying neural network to work for other animals with relatively
few training data. Between 50 and 200 manually annotated frames are generally
sufficient for standard lab studies, although the amount needed depends on
factors such as data quality and the consistency of the people doing the labelling,
Mathis says. In addition to annotating body parts with a GUI, users can issue
commands through a Jupyter Notebook, a computational document popular with
data scientists. Scientists have used DeepLabCut to study both lab and wild
animals, including mice, spiders, octopuses and cheetahs. Neuroscientist Wujie
Zhang at the University of California, Berkeley, and his colleague used it to
estimate the behavioural activity of Egyptian fruit bats (Rousettus aegyptiacus)
in the lab2.

The deep-learning-based posture tracking package LEAP, developed by Pereira
and his colleagues, requires 50–100 annotated frames for lab animals, says
Pereira. More training data would be needed for wildlife footage, although his
team has not yet conducted enough experiments to determine how much. The
researchers plan to release another package called Social LEAP (SLEAP) this
year to better handle footage of multiple, closely interacting animals.

Jake Graving, a behavioural scientist at the Max Planck Institute of Animal
Behavior in Konstanz, Germany, and his colleagues compared the performance
of a re-implementation of the DeepLabCut algorithm and LEAP on videos of
Grevy’s zebras (Equus grevyi )3. They report that LEAP processed images about
10% faster, but the DeepLabCut algorithm was about three times as accurate.

Graving’s team has developed an alternative tool called DeepPoseKit, which it has used to study behaviours of desert locusts (Schistocerca gregaria), such as
hitting and kicking. The researchers report that DeepPoseKit combines the
accuracy of DeepLabCut with a batch-processing speed that surpasses LEAP.
For instance, tracking one zebra in 1 hour of footage filmed at 60 frames per
second takes about 3.6 minutes with DeepPoseKit, 6.4 minutes with LEAP and
7.1 minutes with his team’s implementation of the DeepLabCut algorithm,
Graving says.

DeepPoseKit offers “very good innovations”, Pereira says. Mathis disputes the
validity of the performance comparisons, but Graving says that “our results offer
the most objective and fair comparison we could provide”. Mathis’ team
reported an accelerated version of DeepLabCut that can run on a mobile phone
in an article posted in September on the arXiv preprint repository4.

Biologists who want to test multiple software solutions can try Animal Part
Tracker, developed by Kristin Branson, a computer scientist at the Howard
Hughes Medical Institute’s Janelia Research Campus in Ashburn, Virginia, and
her colleagues. Users can select any of several posture-tracking algorithms,
including modified versions of those used in DeepLabCut and LEAP, as well as
another algorithm from Branson’s lab. DeepPoseKit also offers the option to use
alternative algorithms, as will SLEAP.

Other tools are designed for more specialized experimental set-ups. DeepFly3D,
for instance, tracks 3D postures of single tethered lab animals, such as mice with
implanted electrodes or fruit flies walking on a tiny ball that acts as a treadmill.
Pavan Ramdya, a neuroengineer at the Swiss Federal Institute of Technology in
Lausanne (EPFL), and his colleagues, who developed the software, are using
DeepFly3D to help identify which neurons in fruit flies are active when they
perform specific actions.

And DeepBehavior, developed by neuroscientist Ahmet Arac at the University of California, Los Angeles, and his colleagues, allows users to track 3D
movement trajectories and calculate parameters such as velocities and joint
angles in mice and humans. Arac’s team is using this package to assess the
recovery of people who have had a stroke and to study the links between brain-
network activity and behaviour in mice.

Making sense of movement


Scientists who want to study multiple animals often need to track which animal
is which. To address this challenge, Gonzalo de Polavieja, a neuroscientist at
Champalimaud Research, the research arm of the private Champalimaud
Foundation in Lisbon, and his colleagues developed idtracker.ai, a neural-
network-based tool that identifies individual animals without manually annotated
training data. The software can handle videos of up to about 100 fish and 80
flies, and its output can be fed into DeepLabCut or LEAP, de Polavieja says. His
team has used idtracker.ai to probe, among other things, how zebrafish decide
where to move in a group5. However, the tool is intended only for lab videos
rather than wildlife footage and requires animals to separate from one another, at
least briefly.

Other software packages can help biologists to make sense of animals’ motions.
For instance, researchers might want to translate posture coordinates into
behaviours such as grooming, Mathis says. If scientists know which behaviour
they’re interested in, they can use the Janelia Automatic Animal Behavior
Annotator (JAABA), a supervised machine-learning tool developed by
Branson’s team, to annotate examples and automatically identify more instances
in videos.

An alternative approach is unsupervised machine learning, which does not require behaviours to be defined beforehand. This strategy might suit researchers
who want to capture the full repertoire of an animal’s actions, says Gordon
Berman, a theoretical biophysicist at Emory University in Atlanta, Georgia. His
team developed the MATLAB tool MotionMapper to identify often repeated
movements. Motion Sequencing (MoSeq), a Python-based tool from Datta’s
team, finds actions such as walking, turning or rearing.
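
The unsupervised idea can be illustrated with a generic clustering sketch. This is not how MotionMapper or MoSeq actually work; it is only a toy example on synthetic pose data using scikit-learn:

# Illustrative only: cluster posture coordinates into recurring movement
# motifs with k-means. MotionMapper and MoSeq use more sophisticated
# methods; this just shows the unsupervised idea on synthetic data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for tracked poses: 10,000 frames x 8 body parts x (x, y).
poses = rng.normal(size=(10_000, 8, 2))
features = poses.reshape(len(poses), -1)   # flatten to one vector per frame

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
# labels[t] assigns frame t to one of five candidate behaviour motifs,
# which a researcher would then inspect and name (e.g., turning, rearing).
print(np.bincount(labels))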

By mixing and matching these tools, researchers can extract new meaning from
animal imagery. “It gives you the full kit of being able to do whatever you
want,” Pereira says.
Data Lakes vs. Data Warehouses
Understand the differences between the two most popular options for storing big
data.
When it comes to storing big data, the two most popular options are data lakes
and data warehouses. Data warehouses are used for analyzing archived
structured data, while data lakes are used to store big data of all structures.
Data Lake
A data lake is a storage repository that holds a vast amount of raw data in its
native format until it is needed. While a hierarchical data warehouse stores data
in files or folders, a data lake uses a flat architecture to store data. Each data
element in a lake is assigned a unique identifier and tagged with a set of
extended metadata tags. When a business question arises, the data lake can be
queried for relevant data, and that smaller set of data can then be analyzed to
help answer the question.
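
A hypothetical sketch of that ingest-and-tag pattern in Python follows; the lake path, catalog file and tag fields are illustrative rather than a standard:

# Hypothetical sketch: land a raw file in a data lake and record a catalog
# entry with a unique identifier and extended metadata tags.
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

LAKE = Path("/data/lake/raw")
CATALOG = Path("/data/lake/catalog.jsonl")

def ingest(source_file: str, tags: dict) -> str:
    """Copy a raw file into the lake untouched and register it."""
    LAKE.mkdir(parents=True, exist_ok=True)
    object_id = str(uuid.uuid4())
    dest = LAKE / object_id
    dest.write_bytes(Path(source_file).read_bytes())   # keep native format

    entry = {
        "id": object_id,
        "source": source_file,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "tags": tags,                                   # extended metadata
    }
    with CATALOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return object_id

# ingest("clickstream_2024-01-01.json", {"domain": "web", "pii": False})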

The term data lake is often associated with Hadoop-oriented object storage. In
such a scenario, an organization's data is first loaded into the Hadoop platform,
and then business analytics and data mining tools are applied to the data where it
resides on Hadoop's cluster nodes of commodity computers.

Like big data, the term data lake is sometimes disparaged as being simply a
marketing label for a product that supports Hadoop. Increasingly, however, the
term is being used to describe any large data pool in which the schema and data
requirements are not defined until the data is queried.

The term describes a data storage strategy, not a specific technology, although it
is frequently used in conjunction with a specific technology (Hadoop). The same
can be said of the term data warehouse , which despite often referring to a
specific technology (relational database), actually describes a broad data
management strategy.
Data lake vs. data warehouse
Data lakes and data warehouses are two different strategies for storing big data.
The most important distinction between them is that in a data warehouse, the
schema for the data is preset; that is, there is a plan for the data upon its entry
into the database. In a data lake, this is not necessarily the case. A data lake can
house both structured and unstructured data and does not have a predetermined
schema. A data warehouse handles primarily structured data and has a
predetermined schema for the data it houses.

To put it more simply, think of the concept of a warehouse versus the concept of
a lake. A lake is liquid, shifting, amorphous, largely unstructured and is fed from
rivers, streams, and other unfiltered sources of water. A warehouse, on the other
hand, is a man-made structure, with shelves and aisles and designated places for
the things inside of it. Warehouses store curated goods from specific sources.
Warehouses are prestructured, lakes are not.

This core conceptual difference manifests in several ways, including:


Technology typically used to host data -- A data warehouse is usually
a relational database housed on an enterprise mainframe server or the cloud,
whereas a data lake is usually housed in a Hadoop environment or similar big
data repository.

Source of the data -- The data stored in a warehouse is extracted from various online transaction processing applications to support business analytics
queries and data marts for specific internal business groups, such as sales or
inventory teams. Data lakes typically receive both relational and non-relational
data from IoT devices, social media, mobile apps and corporate applications.

Users -- Data warehouses are useful when there is a massive amount of data
from operational systems that need to be readily available for analysis. Data
lakes are more useful when an organization needs a large repository of data, but
does not have a purpose for all of it and can afford to apply a schema to it upon
access.

Because the data in a lake is often uncurated and can originate from sources
outside of the company's operational systems, lakes are not a good fit for the
average business analytics user. Instead, data lakes are better suited for use by
data scientists, because it takes a level of skill to be able to sort through the large
body of uncurated data and readily extract meaning from it.

Data quality -- In a data warehouse, the highly curated data is generally trusted as the central version of the truth because it contains already processed data. The data in a data lake is less reliable because it could be arriving from any source in any state. It may be curated, and it may not be, depending on the source.

Processing -- The schema for data warehouses is on-write, meaning it is pre-set for when the data is entered into the warehouse. The schema for a data lake is on-read, meaning it doesn't exist until the data has been accessed and someone chooses to use it for something (a brief sketch of this distinction appears after this list).

Performance/cost -- Data warehouses are usually more expensive for large data
volumes, but the trade-off is faster query results, reliability and higher
performance. Data lakes are designed with low cost in mind, but query results
are improving as the concept and surrounding technologies mature.

Agility -- Data lakes are highly agile; they can be configured and reconfigured
as needed. Data warehouses are less so.

Security -- Data warehouses are generally more secure than data lakes because
warehouses as a concept have existed for longer and therefore, security
methods have had the opportunity to mature.
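
Following up on the processing point above, here is a brief sketch of the schema-on-write versus schema-on-read distinction, shown with PySpark; the paths, column names and JSON fields are placeholders rather than a prescribed setup:

# Sketch of the schema-on-write / schema-on-read distinction with PySpark
# (paths and column names are placeholders).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Schema-on-write (warehouse style): the table structure is fixed up front,
# and data must conform to it when it is loaded.
orders_schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("customer_id", StringType(), False),
    StructField("amount", DoubleType(), True),
])
orders = spark.read.schema(orders_schema).csv("/warehouse/staging/orders.csv")
orders.write.mode("overwrite").parquet("/warehouse/orders")

# Schema-on-read (lake style): raw JSON sits in the lake untouched; a schema
# is only inferred or applied at query time, for this particular analysis.
events = spark.read.json("/lake/raw/clickstream/")
events.createOrReplaceTempView("events")
spark.sql("SELECT page, COUNT(*) AS views FROM events GROUP BY page").show()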

A side-by-side comparison of data lakes and data warehouses.

Because of their differences, and the fact that data lakes are a newer and still-
evolving concept, organizations might choose to use both a data warehouse and a
data lake in a hybrid deployment. This may be to accommodate the addition of
new data sources, or to create an archive repository to deal with data roll-off
from the main data warehouse. Frequently data lakes are an addition to, or
evolution of, an organization's current data management structure instead of a
replacement.

Data lake architecture


The physical architecture of a data lake may vary, as data lake is a strategy that
can be applied to multiple technologies. For example, the physical architecture
of a data lake using Hadoop might differ from that of data lake using Amazon
Simple Storage Service (Amazon S3).

However, there are three main principles that distinguish a data lake from other
big data storage methods and make up the basic architecture of a data lake. They
are:

No data is turned away. All data is loaded in from various source systems and retained.
Data is stored in an untransformed or nearly untransformed state, as it
was received from the source.
Data is transformed and fit into a schema based on analysis
requirements.

Although data is largely unstructured and not geared toward answering any specific question, it should still be organized in some manner so that it can be structured and analyzed later. Whatever technology ends up being used to deploy an
organization's data lake, a few features should be included to ensure that the data
lake is functional and healthy and that the large repository of unstructured data
doesn't go to waste. These include:

A taxonomy of data classifications, which can include data type, content, usage scenarios and groups of possible users.
A file hierarchy with naming conventions.
Data profiling tools to provide insight for classifying data objects and
addressing data quality issues.
Standardized data access process to keep track of what members of an
organization are accessing data.
A searchable data catalog.
Data protections including data masking, data encryption and automated
monitoring to generate alerts when data is accessed by unauthorized
parties.
Data awareness among employees, which includes an understanding of
proper data management and data governance, training on how to
navigate the data lake, and an understanding of strong data quality and
proper data usage.

Benefits of a data lake


The data lake offers several benefits, including:
The ability of developers and data scientists to easily configure a given
data model, application, or query on the fly. The data lake is highly
agile.
Data lakes are theoretically more accessible. Because there is no
inherent structure, any user can technically access the data in the data
lake, even though the prevalence of large amounts of unstructured data
might inhibit less skilled users.
The data lake supports users of varying levels of investment; users who
want to return to the source to retrieve more information, those who seek
to answer entirely new questions with the data and those who simply
require a daily report. Access is possible for each of these user types.
Data lakes are cheap to implement because most technologies used to manage them are open source (e.g., Hadoop) and can be installed on low-cost hardware.
Labor-intensive schema development and data cleanup are deferred
until after an organization has identified a clear business need for the
data.
Agility allows for a variety of different analytics methods to interpret
data, including big data analytics, real-time analytics, machine
learning and SQL queries.
Scalable because of a lack of structure.

Criticism
Despite the benefits of having a cheap, unstructured repository of data at an
organization's disposal, several legitimate criticisms have been levied against the
strategy.
One of the biggest potential follies of the data lake is that it might turn into a
data swamp, or data graveyard. If an organization practices poor data governance
and management, it may lose track of the data that exists in the lake, even as
more pours in. The result is a wasted body of potentially valuable data rotting
away unseen at the "bottom" of the data lake, so to speak, rendering it
deteriorated, unmanaged and inaccessible.

Data lakes, while providing theoretical accessibility to anyone in an organization, may not be as accessible in practical use, because business analysts
may have a difficult time readily parsing unstructured data from a variety of
sources. This practical accessibility challenge may also contribute to the lack of
proper data maintenance and result in the development of a data graveyard. It's
important to maximize investment in a data lake and reduce the risk of failed
deployment.

Another problem with the term data lake itself is that it is used in many contexts
in public discourse. Although it makes most sense to use it to describe a strategy
of data management, it has also commonly been used to describe specific
technologies and as a result, has a level of arbitrariness to it. This challenge may
cease to be once the term matures and finds a more concrete meaning in the
public discourse.

Vendors
Although a data lake isn't a specific technology, there are several technologies
that enable them. Some vendors that offer those technologies are:

Apache offers the open-source ecosystem Hadoop, one of the most common data lake services.
Amazon offers Amazon S3 with virtually unlimited scalability.
Google offers Google Cloud Storage and a collection of services to pair
with it for management.
Oracle offers the Oracle Big Data Cloud and a variety of PaaS services
to help manage it.
Microsoft offers the Azure Data Lake as a scalable data storage and
Azure Data Lake Analytics as a parallel analytics service. This is an
example of when the term data lake is used to refer to a specific
technology instead of a strategy.
HVR offers a scalable solution for organizations that need to move large
volumes of data and update it in real time.
Podium offers a solution with an easy-to-implement and use suite of
management features.
Snowflake offers a solution that specializes in processing diverse
datasets, including structured and semi-structured datasets such as
JSON, XML and Parquet.
Zaloni offers a solution that comes with Mica, a self-service data prep
tool and data catalog. Zaloni has been branded as the data lake company.

Hadoop Data Lake


A Hadoop data lake is a data management platform comprising one or
more Hadoop clusters. It is used principally to process and store nonrelational
data, such as log files, internet clickstream records, sensor data, JSON objects,
images and social media posts.

Such systems can also hold transactional data pulled from relational databases,
but they're designed to support analytics applications, not to handle transaction
processing. As public cloud platforms have become common sites for data
storage, many people build Hadoop data lakes in the cloud.

Hadoop data lake architecture


While the data lake concept can be applied more broadly to include other types
of systems, it most frequently involves storing data in the Hadoop Distributed
File System (HDFS) across a set of clustered compute nodes based on
commodity server hardware. The reliance on HDFS has, over time, been
supplemented with data stores using object storage technology, but non-HDFS
Hadoop ecosystem components typically are part of the enterprise data lake
implementation.

With the use of commodity hardware and Hadoop's standing as an open source
technology, proponents claim that Hadoop data lakes provide a less expensive
repository for analytics data than traditional data warehouses. In addition, their
ability to hold a diverse mix of structured, unstructured and semistructured data
can make them a more suitable platform for big data management and analytics
applications than data warehouses based on relational software.

However, a Hadoop enterprise data lake can be used to complement an enterprise data warehouse (EDW) rather than to supplant it entirely. A Hadoop
cluster can offload some data processing work from an EDW and, in effect,
stand in as an analytical data lake. In such cases, the data lake can host new
analytics applications. As a result, altered data sets or summarized results can be
sent to the established data warehouse for further analysis.
Hadoop data lake best practices
The contents of a Hadoop data lake need not be immediately incorporated into a
formal database schema or consistent data structure, which allows users to
store raw data as is; information can then either be analyzed in its raw form or
prepared for specific analytics uses as needed.

As a result, data lake systems tend to employ extract, load and transform (ELT)
methods for collecting and integrating data, instead of the extract, transform and
load (ETL) approaches typically used in data warehouses. Data can be extracted
and processed outside of HDFS using MapReduce, Spark and other data
processing frameworks.
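
A hedged sketch of that ELT pattern with PySpark follows; the source and lake paths, column names and the example transformation are placeholders:

# Sketch of the ELT pattern described above: data is extracted and loaded
# into the lake in raw form first, and transformation happens later with
# Spark, only for the data an analysis actually needs.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-demo").getOrCreate()

# Extract + Load: copy source records into the raw zone without reshaping them.
raw = spark.read.json("/source/exports/sensor_dump/")
raw.write.mode("append").parquet("/lake/raw/sensors/")

# Transform (later, on demand): refine only what the use case requires.
curated = (
    spark.read.parquet("/lake/raw/sensors/")
         .where(F.col("reading").isNotNull())
         .withColumn("reading_c", (F.col("reading") - 32) * 5.0 / 9.0)
)
curated.write.mode("overwrite").parquet("/lake/curated/sensor_readings/")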

Despite the common emphasis on retaining data in a raw state, data lake
architectures often strive to employ schema-on-the-fly techniques to begin to
refine and sort some data for enterprise uses. As a result, Hadoop data lakes have
come to hold both raw and curated data.

As big data applications become more prevalent in companies, the data lake
often is organized to support a variety of applications. While early Hadoop data
lakes were often the province of data scientists, increasingly, these lakes are
adding tools that allow analytics self-service for many types of users.

Hadoop data lake uses, challenges


Potential uses for Hadoop data lakes vary. For example, they can pool varied
legacy data sources, collect network data from multiple remote locations and
serve as a way station for data that is overloading another system.
Experimental analysis and archiving are among other Hadoop data lake uses.
They have also become an integral part of Amazon Web Services
(AWS) Lambda architectures that couple batch with real-time data processing.

The Hadoop data lake isn't without its critics or challenges for users. Spark, as
well as the Hadoop framework itself, can support file architectures other than
HDFS. Meanwhile, data warehouse advocates contend that similar architectures
-- for example, the data mart -- have a long lineage and that Hadoop and related
open source technologies still need to mature significantly in order to match the
functionality and reliability of data warehousing environments.

Experienced Hadoop data lake users say that a successful implementation requires a strong architecture and disciplined data governance policies; without
those things, they warn, data lake systems can become out-of-control dumping
grounds. Effective metadata management typically helps to drive successful
enterprise data lake implementations.

Hadoop vs. Azure Data Lakes


There are other versions of data lakes, which offer similar functionality to the
Hadoop data lake and also tie into HDFS.

Microsoft launched its Azure Data Lake for big data analytical workloads in the
cloud in 2016. It is compatible with Azure HDInsight, Microsoft's data
processing service based on Hadoop, Spark, R and other open source
frameworks. The main components of Azure Data Lake are Azure Data Lake
Analytics, which is built on Apache YARN, Azure Data Lake Store and U-SQL.
It uses Azure Active Directory for authentication and access control lists and
includes enterprise-level features for manageability, scalability, reliability and
availability.

Around the same time that Microsoft launched its data lake, AWS launched Data Lake Solutions, an automated reference data lake implementation that guides users through the creation of a data lake architecture on the AWS cloud, using AWS services such as Amazon Simple Storage Service (S3) for storage and AWS Glue, a managed data catalog and ETL service.

Building a strong data analytics platform architecture

Analytics platforms have made their way to the forefront of information-driven enterprises. Winning organizations know a core competency in analytics requires
a modern data analytics platform architecture that delivers insights at critical
junctures in their data pipelines while minimizing cost, redundancy and
complexity.

What is a data analytics platform?


A data analytics platform can be defined as everything it takes to draw
meaningful and useful insights from data. For a general concept of an analytics
platform, think in terms of data, analytics and insights.

Delivering an analytics platform requires a robust architecture that serves as a blueprint for delivering business unit and enterprise analytics, communicating
architectural decisions, reducing individual project risk and ensuring enterprise
consistency.
Modernization attributes
A traditional data analytics platform architecture is often not well positioned to
support today's data-driven organizations. New business demands, enabling
technologies and cost pressures are prompting organizations to modernize their
analytics platforms in order to realize the full potential of data as a corporate
asset.

Modernizing means rethinking a data analytics platform architecture, including these attributes:

Agility at the speed of business
Cost optimization
Highly qualified personnel
Process automation
Best-in-class technology
Handling of data at any speed, size and variety
Seamless data integrations
Timely insights throughout data pipelines
Full spectrum of business intelligence capabilities
Robust security architecture
High-speed direct connect data fabric
Loosely coupled technology ecosystem
High-efficiency computing
Strong governance controls and stewardship
Rapid development and deployment
Well-documented architecture and metadata
Data lake vs. data reservoir
A strong data analytics platform architecture will account for data lakes and data
reservoirs. This coexistence is complementary as each repository addresses
different data and analytical uses at different points in the pipeline.

The main differences between the two involve data latency and refinement. Both
store structured and unstructured data, leveraging various data stores from
simple object files to SQL and NoSQL database engines to big data stores.
Data lakes are raw data repositories located at the beginning of data pipelines,
optimized for getting data into the analytics platform. Landing zones and
sandboxes of independent data designed for ingestion and discovery, these native
format data stores are open to private consumers for selective use. Analytics are
generally limited to time-sensitive insights and exploratory inquiry by
consumers who can tolerate the murky waters.

Data reservoirs are refined data repositories located at operational and back-end
points of data pipelines, optimized for getting data out of the analytics platform.
As sources of unified, harmonized and wrangled data designed for querying and
analysis, data reservoirs are purpose-built data stores that are open to the public
for general consumption. Analytics span a wide range of past, present and future
insights for use by casual and sophisticated consumers, serving both tactical and
strategic insights that run the business.

Determining at what point in the pipeline data becomes meaningful for a particular use case is often tempered by time and quality.

On one hand, access to data early in the pipeline will favor time-sensitive
insights over the suitability of non-harmonized data, particularly for use cases
that require the most recent data. On the other hand, access to data later in the
pipeline will favor data accuracy over increased latency by virtue of curation,
particularly for use cases that require data that has been cleaned, conformed and
enriched, and that is of known quality.

Public cloud or on premises


Choosing where to run your analytics platform is not an easy decision.
Fortunately, public cloud and on-premises deployments aren't mutually
exclusive. Smaller organizations typically gravitate to an entirely public cloud
strategy, while midsize to large organizations often deploy a hybrid strategy or
assume complete control with an all on-premises strategy.
Any decision on where to host a data analytics platform should minimally
consider agility, scale, cost, security (particularly sensitive data protection),
network latency and analytic capabilities.

A big part of the hosting decision comes down to control. Organizations that are
comfortable sharing control are likely to lean more toward a cloud presence.
Organizations that feel comfortable owning the end-to-end platform will likely
lean more toward an on-premises option.
Regardless of where you run your analytics platform, modernization should not
simply be a lift-and-shift approach. You may not need a complete overhaul, but
take the opportunity to refresh select components and remove technical debt
across your platform.

Organizations that choose the public cloud for some or all their data analytics
platform architecture should take advantage of what the cloud does best. This
means moving from IaaS to SaaS and PaaS models. Look to maximize managed
services, migrate to native cloud services, automate elasticity, geo-disperse the
analytics platform and move to consumption-based pricing whenever possible by
using serverless technologies.

The importance of flexibility


Flexibility has become a necessary attribute of a modern data analytics platform
architecture. An expanding demand for analytics is forcing analytics platforms to
be more accessible, extensible and nimble while processing data at greater
velocity, volume and variety.

One thing is for sure: Your data analytics platform architecture will change. A
key measurement of a platform's flexibility is how well it adapts to business and
technology innovation. Expect the business to demand an accelerated analytics
lifecycle and greater autonomy via self-service capabilities. To keep pace with
the business, look for technology advancements in automation and artificial
intelligence as well as catalysts for augmented data management and analytics.

Data Warehouse
A data warehouse is a repository for data generated and collected by an
enterprise's various operational systems. Data warehousing is often part of a
broader data management strategy and emphasizes the capture of data from
different sources for access and analysis by business analysts, data scientists and
other end users.

Typically, a data warehouse is a relational database housed on a mainframe, another type of enterprise server or, increasingly, in the cloud. Data from various
online transaction processing (OLTP) applications and other sources is
selectively extracted and consolidated for business intelligence (BI) activities
that include decision support, enterprise reporting and ad hoc querying by users.
Data warehouses also support online analytical processing (OLAP) technologies,
which organize information into data cubes that are categorized by different
dimensions to help accelerate the analysis process.
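
As a rough illustration of the OLAP idea, a pandas pivot table can summarize a measure across several dimensions; the data below is synthetic, and real OLAP engines precompute and index such cubes at much larger scale:

# Rough illustration of the OLAP idea with pandas: summarise a measure
# (revenue) across several dimensions (region, product, quarter).
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "revenue": [100.0, 80.0, 120.0, 90.0, 110.0, 130.0],
})

cube = sales.pivot_table(
    values="revenue", index=["region", "product"],
    columns="quarter", aggfunc="sum", fill_value=0.0,
)
print(cube)   # slice and dice by any combination of the dimensions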

Basic components of a data warehouse


A data warehouse stores data that is extracted from internal data stores and, in
many cases, external data sources. The data records within the warehouse must
contain details to make it searchable and useful to business users. Taken
together, there are three main components of data warehousing:

1. A data integration layer that extracts data from operational systems, such
as Excel, ERP, CRM or financial applications.
2. A data staging area where data is cleansed and organized.
3. A presentation area where data is warehoused and made available for
use.

A data warehouse architecture can also be understood as a set of tiers, where the
bottom tier is the database server, the middle tier is the analytics engine and the
top tier is data warehouse software that presents information for reporting and
analysis.

Data analysis tools, such as BI software, enable users to access the data within
the warehouse. An enterprise data warehouse stores analytical data for all of an
organization's business operations; alternatively, individual business units may
have their own data warehouses, particularly in large companies. Data
warehouses can also feed data marts, which are smaller, decentralized systems in
which subsets of data from a warehouse are organized and made available to
specific groups of business users, such as sales or inventory management teams.

In addition, Hadoop has become an important extension of data warehouses for many enterprises because the distributed data processing platform can improve
components of a data warehouse architecture -- from data ingestion to analytics
processing to data archiving. In some cases, Hadoop clusters serve as the staging
area for traditional data warehouses. In others, systems that incorporate Hadoop
and other big data technologies are deployed as full-fledged data warehouses
themselves.

Data warehouse benefits and options


Data warehouses can benefit organizations from both an IT and a business
perspective. For example:
Separating analytical processes from operational ones can enhance the
performance of operational systems and enable data analysts and
business users to access and query relevant data faster from multiple
sources.
Data warehouses can offer enhanced data quality and consistency for
analytics uses, thereby improving the accuracy of BI applications.
Businesses can choose on-premises systems, conventional cloud
deployments or data-warehouse-as-a-service (DWaaS) offerings.
On-premises data warehouses offer flexibility and security so IT teams can maintain control over their data warehouse management and configuration; they're available from vendors such as IBM, Oracle and Teradata.
Cloud-based data warehouses such as Amazon Redshift, Google
BigQuery, Microsoft Azure SQL Data Warehouse and Snowflake enable
companies to quickly scale up their systems while eliminating the initial
infrastructure investments and ongoing system maintenance
requirements.
DWaaS, an offshoot of database as a service, provides a managed cloud
service that frees organizations from the need to deploy, configure and
administer their data warehouses. Such services are being offered by a
growing number of cloud vendors.
Types of data warehouses
There are three main approaches to implementing a data warehouse, which are
detailed below. Some organizations have also adopted federated data warehouses
that integrate separate analytical systems already put in place independently of
one another -- an approach proponents describe as a practical way to take
advantage of existing deployments.
Top-down approach: Created by data warehouse pioneer
William H. Inmon, this method calls for building the enterprise
data warehouse first. Data is extracted from operational systems
and possibly third-party external sources and may be validated in
a staging area before being integrated into a normalized data
model. Data marts are then created from the data stored in the data
warehouse.

Bottom-up method: Consultant Ralph Kimball developed an alternative data warehousing architecture that calls for dimensional data marts to be created first. Data is extracted from operational systems, moved to a staging area and modeled into a star schema design, with one or more fact tables connected to one or more dimensional tables (a minimal sketch follows this list). The data is then processed and loaded into data marts, each of which focuses on a specific business process. The data marts are integrated using a data warehouse bus architecture to form an enterprise data warehouse.
Hybrid method: Hybrid approaches to data warehouse design
include aspects from both the top-down and bottom-up methods.
Organizations often seek to combine the speed of the bottom-up
approach with the data integration capabilities achieved in a top-
down design.
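
As a minimal illustration of the dimensional approach referenced above, the sketch below builds a tiny star schema in SQLite from Python; the tables, columns and values are illustrative rather than a prescribed design:

# Minimal star schema in the spirit of the bottom-up approach: one fact
# table keyed to dimension tables, plus a typical dimensional query.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales  (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units       INTEGER,
    revenue     REAL
);
INSERT INTO dim_date    VALUES (1, '2024-01-01', '2024-01'), (2, '2024-01-02', '2024-01');
INSERT INTO dim_product VALUES (10, 'Widget', 'Hardware'), (11, 'Gadget', 'Hardware');
INSERT INTO fact_sales  VALUES (1, 10, 5, 50.0), (1, 11, 2, 40.0), (2, 10, 7, 70.0);
""")

# Revenue by product and month, joining the fact table to its dimensions.
for row in con.execute("""
    SELECT p.name, d.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d    ON d.date_key = f.date_key
    GROUP BY p.name, d.month
"""):
    print(row)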
Data warehouses vs. databases vs. data lakes
Databases and data lakes are often confused with data warehouses, but there are
important differences. While data warehouses typically store data from multiple
sources and utilize predefined schemas designed for data analytics, an
operational database is generally used to capture, process and store data from a
single source, such as a transactional system, and its schema is normalized. Such
databases typically aren't designed to run across very large data sets, as data
warehouses are.

By contrast, a data lake is a central repository for all types of raw data, whether
structured or unstructured, from multiple sources. Data lakes are most commonly
built on Hadoop or other big data platforms. A schema doesn't need to be defined
upfront in them, which allows for more types of analytics than data warehouses,
which have defined schemas. For example, data lakes can be used for text
searches, machine learning and real-time analytics.

Data warehouse innovations throughout history


The concept of data warehousing can be traced back to work conducted in the
mid-1980s by IBM researchers Barry Devlin and Paul Murphy. The duo coined
the term business data warehouse in their 1988 paper, "An architecture for a
business and information system," which stated:

"The [business information system] architecture is based on the assumption that


such a service runs against a repository of all required business information that
is known as the Business Data Warehouse (BDW). ... A necessary prerequisite
for the physical implementation of a business data warehouse service is
a business process and information architecture that defines (1) the reporting
flow between functions and (2) the data required."

Bill Inmon, as he is more familiarly known, furthered data warehouse development with his 1992 book Building the Data Warehouse, as well as by
writing some of the first columns about the topic. Inmon's top-down design
method for building a data warehouse describes the technology as a subject-
oriented, integrated, time-variant and nonvolatile collection of data that supports
an organization's decision-making process.

The technology's growth continued with the founding of The Data Warehousing
Institute, now known as TDWI, in 1995, and with the 1996 publication of Ralph
Kimball's book The Data Warehouse Toolkit , which introduced his dimensional
modeling approach to data warehouse design.

In 2008, Inmon introduced the concept of data warehouse 2.0, which focuses on
the inclusion of unstructured data and corporate metadata.

Operational data store vs. data warehouse


You could be forgiven for thinking that operational data stores and data
warehouses are synonymous. After all, a data warehouse is a place where
operational data is stored for analysis and reporting. Case closed -- two sides of
the same coin, right?

Well, no, not so fast. There's more to the question of operational data store vs.
data warehouse than that. Both do store operational data, but in different forms
and for different purposes. And in many cases, organizations incorporate both
into their analytics architectures.

The operational data store (ODS) is a bit harder to pin down because there are
diverging views on exactly what it is and for what it's used. But, at heart, an
ODS pulls together data from multiple transaction processing systems on a
short-term basis, with frequent updates as new data is generated by the source
systems. Operational data stores often serve as interim staging areas for data
that's ultimately headed to a data warehouse or a big data platform for long-term
storage.
Uses and benefits of an ODS
An ODS generally holds detailed transaction data that has yet to be consolidated,
aggregated and transformed into consistent data sets for loading into a data
warehouse. From a data integration standpoint, then, an ODS might only involve
the first and third elements of the extract, transform and load (ETL) process
typically used to pull data from operational systems and to harmonize it for
analysis.

In that sense, an operational data store can be thought of as a funnel that takes in
raw data from various source systems and helps facilitate the process of feeding
business intelligence and analytics systems with more refined versions of that
data. The full ETL process is handled downstream, which streamlines data
transformation workloads and minimizes the processing pipelines needed
between the ODS and the source systems to which it's connected.

However, some people also view the operational data store as a BI and analytics
platform in its own right. Under that scenario, an ODS can be used to do near-
real-time data analysis aimed at uncovering tactical insights that organizations
can quickly apply to ongoing business operations -- for example, to increase
retail inventories of popular products based on fresh sales data. By comparison,
data warehouses typically support historical analysis of data accumulated over a
longer period of time.

Depending on the specific application, an ODS that's used for data analysis
might be updated multiple times daily, if not hourly or even more frequently.
Real-time data integration tools, such as change data capture software, can be
tapped to help enable such updates. In addition, some level of data cleansing and
consistency checks might be applied in the ODS to help ensure that the analytics
results are accurate.

ODS and data warehouse design


In weighing operational data store vs. data warehouse deployments, an ODS can
potentially be built on a lighter data platform, especially if it's primarily being
used as a temporary way station for data.

For example, an operational data store architecture might be based on the MySQL open source database or the cloud-based Amazon Simple Storage
Service as an alternative to traditional data warehouse platforms such as Oracle,
Microsoft SQL Server, IBM DB2 and Teradata. In big data
environments, Hadoop clusters can provide an ODS staging area for feeding data
to either a data warehouse or another cluster built on top of the open source
distributed processing framework.

While data usually passes through an ODS relatively quickly to make room for
new data coming up behind it, things are different in a data warehouse. The
purpose there is to create an archive of data that can be analyzed to track
business performance and identify operational trends in order to guide strategic
decision-making by corporate and business executives.

A data warehouse might be updated frequently -- nightly, in some cases, weekly or monthly in others. But it's a more static environment than an ODS: Data is
typically added, but not deleted, especially in the case of an enterprise data
warehouse (EDW), which is designed to provide a single source of consolidated
and cleansed data from all of a company's operations. EDWs tend to be large and
complex platforms as a result -- a combination that can make deploying them a
challenge.

ODS vs. data mart
Another facet of the operational data store vs. data warehouse discussion is how
an ODS compares to a data mart. Data marts are purpose-built data warehouse
offshoots -- essentially, smaller warehouses that store data related to individual
business units or specific subject areas. A data mart and an ODS might be in the
same league on storage capacity, but otherwise, they differ in the same way that
EDWs and operational data stores do. Like their bigger brethren, data marts are a
repository for historical data that has been fully scrubbed and aggregated for
analysis.

Two other things to keep in mind about operational data stores: First, they aren't
the same thing as an operational database. The latter is the database built into a
transaction system -- it's the location from which the data flowing into an ODS
comes. Put another way, transaction data is initially processed in operational
databases and then moved to an ODS to begin its analytics journey.
Second, operational data stores are sometimes equated with master data
management (MDM) systems. MDM processes enable companies to create
common sets of master data on customers, products and suppliers. The master
data can then be fed back to transaction systems via an MDM hub, where the
data is managed and stored. Early on, some organizations built MDM
capabilities into ODS platforms, but that approach seems to have lessened in recent years, perhaps partly because the MDM market hasn't grown as proponents hoped it would, which is itself a result of MDM's inherent complexities.
Advanced Analytics techniques fuel data-driven organization

The 2015 Pacific Northwest BI Summit is taking place in Grants Pass, Ore., this
weekend. The annual event brings together a small group of consultants and
vendors to discuss key trends and issues related to business intelligence,
analytics and data management. One of the participants is Claudia Imhoff,
president of consultancy Intelligent Solutions Inc. and founder of the Boulder BI
Brain Trust. At this year's conference, Imhoff will lead a discussion on
increasing the adoption of BI and analytics applications in companies. That topic
is similar to one she spoke about in a video interview with
SearchBusinessAnalytics at the 2014 summit: creating a more data-driven
organization through the use of higher-level predictive and prescriptive
analytics techniques.

In the interview, Imhoff said that basic descriptive analytics -- for example,
straightforward reporting on revenue, profits and other key performance
indicators -- is the most prevalent form of BI. But it's also "the least valuable of
all the analytics that companies can perform," she noted. The next step up is
diagnostic analytics, which addresses why something has happened but is "still
reactive," according to Imhoff.

On the other hand, she said, companies can use predictive analytics tools to look
toward the future -- by, say, identifying prospective customers who are likely to
be receptive to particular marketing campaigns. And prescriptive analytics
software can be applied to answer what-if questions in order to help optimize
business strategies and assess whether predicted business outcomes are worth
pursuing.

There are a number of issues that hold companies back from adopting more
advanced analytics techniques, Imhoff said. One is a lack of internal education
about the potential business benefits of effective analytics processes: "We need
to start building this culture in our organizations that understands the need for
analytics." Another issue she cited is a lack of analytical prowess resulting from
the ongoing shortage of data scientists and other skilled analytics professionals.
And an age-old but still common problem, she said, is "putting the cart before
the horse" on technology purchases and ending up with analytics systems and
tools that aren't a good fit for an organization's business needs.

Imhoff said BI, analytics and IT managers also need to understand that data
warehouses aren't the only valid repositories of analytics data anymore,
especially for storing the massive amounts of data being captured from sensors,
social networks and other new data sources. To support big data analytics
applications, she espoused an extended data warehouse architecture that
combines a traditional enterprise data warehouse with technologies such
as Hadoop clusters and NoSQL database systems. She sees well-designed data
visualizations as another must for fostering a data-driven organization, especially
in big data environments: "We're talking about massive numbers of data points,
and you can't just 'blat' that out on a screen."

Must-have features for Big Data Analytics Tools

Big data analytics involves a complex process that can span business
management, data scientists, developers and production teams. Crafting a new
data analytics model is just one part of this elaborate process.

The following are 10 must-have features in big data analytics tools that can help
reduce the effort required by data scientists to improve business results:

1. Embeddable results

Big data analytics gain value when the insights gleaned from data models can
help support decisions made while using other applications.

"It is of utmost importance to be able to incorporate these insights into a real-


time decision-making process," said Dheeraj Remella, chief technologist at
VoltDB, an in-memory database provider.

These features should include the ability to create insights in a format that is
easily embeddable into a decision-making platform, which should be able to
apply these insights in a real-time stream of event data to make in-the-moment
decisions.
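
A hypothetical sketch of what embeddable results can look like in practice: a model's insight packaged as a scoring function that is called inline for each incoming event. The feature names, weights and threshold below are made up for illustration:

# Hypothetical sketch of "embeddable results": an analytics model's output
# is wrapped as a scoring function and applied to each event in a stream,
# so the decision happens in the moment rather than in a separate report.
from typing import Iterable

def churn_risk(event: dict) -> float:
    """Stand-in for an exported analytics model (coefficients are made up)."""
    return 0.4 * event["days_since_login"] / 30 + 0.6 * event["support_tickets"] / 5

def decide(events: Iterable[dict]):
    for event in events:
        score = churn_risk(event)
        action = "offer_retention_discount" if score > 0.7 else "no_action"
        yield {**event, "risk": round(score, 2), "action": action}

stream = [
    {"customer_id": "c1", "days_since_login": 28, "support_tickets": 4},
    {"customer_id": "c2", "days_since_login": 3, "support_tickets": 0},
]
for decision in decide(stream):
    print(decision)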

2. Data wrangling

Data scientists tend to spend a good deal of time cleaning, labeling and
organizing data for data analytics. This involves seamless integration across
disparate data sources and types, applications and APIs, cleansing data, and
providing granular, role-based, secure access to the data.
Big data analytics tools must support the full spectrum of data types, protocols
and integration scenarios to speed up and simplify these data wrangling steps,
said Joe Lichtenberg, director of marketing for data platforms at InterSystems, a
database provider.
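
A small, hypothetical data wrangling sketch with pandas; the file names, columns and quality rule are placeholders:

# Illustrative data wrangling with pandas: merge two disparate sources,
# normalise types and labels, and flag rows that fail a quality check.
import pandas as pd

crm = pd.read_csv("crm_accounts.csv")          # e.g. customer_id, segment
billing = pd.read_json("billing_export.json")  # e.g. customer_id, amount, ts

df = crm.merge(billing, on="customer_id", how="inner")
df["segment"] = df["segment"].str.strip().str.lower()        # clean labels
df["ts"] = pd.to_datetime(df["ts"], errors="coerce")          # unify types
df["amount_ok"] = df["amount"].between(0, 1_000_000)          # quality flag

clean = df.loc[df["amount_ok"]].dropna(subset=["ts"])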

3. Data exploration

Data analytics frequently involves an ad hoc discovery and exploration phase of the underlying data. This exploration helps organizations understand the
business context of a problem and formulate better analytic questions. Features
that help streamline this process can reduce the effort involved in testing new
hypotheses about the data to weed out bad ones faster and streamline the
discovery of useful connections buried in the data.

Strong visualization capabilities can also help this data exploration process.

4. Support for different analytics

There are a wide variety of approaches for putting data analytics results into
production, including business intelligence, predictive analytics, real-time
analytics and machine learning. Each approach provides a different kind of value
to the business. Good big data analytics tools should be functional and flexible enough to support these different use cases with minimal effort, avoiding the retraining that might be involved when adopting different tools.

5. Scalability

Data scientists typically have the luxury of developing and testing different data
models on small data sets for long durations. But the resulting analytics models
need to run economically and often must deliver results quickly. This requires
that these models support high levels of scale for ingesting data and working
with large data sets in production without exorbitant hardware or cloud service
costs.

"A tool that scales an algorithm from small data sets to large with minimal effort
is also critical," said Eduardo Franco, data science lead at Descartes Labs, a
predictive analytics company. "So much time and effort is spent in making this
transition, so automating this is a huge help."

6. Version control
In a large data analytics project, several individuals may be involved in adjusting
the data analytics model parameters. Some of these changes may initially look
promising, but they can create unexpected problems when pushed into
production.

Version control built into big data analytics tools can improve the ability to track
these changes. If problems emerge later, it can also make it easier to roll back an
analytics model to a previous version that worked better.

"Without version control, one change made by a single developer can result in a
breakdown of all that was already created," said Charles Amick, vice president
of data science at Devo USA, a data operations platform provider.
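
One common way to get this kind of versioning is to log every model change as a tracked run; the sketch below uses MLflow as an example, and the run name, parameters and metric are placeholders:

# Example of tracking model versions as runs (shown here with MLflow;
# parameters and the metric are placeholders). Each tweak is logged as a
# run that can be compared with, or rolled back to, earlier versions.
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

with mlflow.start_run(run_name="churn-model-v2"):
    params = {"C": 0.5, "max_iter": 200}
    model = LogisticRegression(**params).fit(X, y)

    mlflow.log_params(params)                        # what changed in this version
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")         # the versioned artifact itself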

7. Simple integration

The less time data scientists and developers spend customizing integrations to
process data sources and connect with applications, the more time they can
spend improving data analytic models and applications.

Simple integrations also make it easier to share results with other developers and
data scientists. Data analytics tools should support easy integration with existing
enterprise and cloud applications and data warehouses.

8. Data management

Big data analytics tools need a robust yet efficient data management platform to
ensure continuity and standardization across all deliverables, said Tim Lafferty,
director of analytics at Velocity Group Development, a data analytics
consultancy. As the magnitude of data increases, so does variability.

A robust data management platform can help an enterprise maintain a single source of truth, which is critical for a successful data initiative.
9. Data governance

Data governance features are important for big data analytics tools to help
enterprises stay compliant and secure. This includes being able to track the
source and characteristics of the data sets used to build analytic models and to
help secure and manage data used by data scientists and engineers. Data sets
used to build models may introduce hidden biases that could create
discrimination problems.
Data governance is especially crucial for sensitive data, such as protected health
information and personally identifiable information that needs to comply with
privacy regulations. Some tools now include the ability to pseudonymize data,
allowing data scientists to build models based on personal information in
compliance with regulations like GDPR.
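
A minimal pseudonymization sketch in Python: direct identifiers are replaced with a keyed hash so records can still be joined on a stable token without exposing the raw values. Key handling is simplified here for illustration:

# Minimal pseudonymisation sketch: replace direct identifiers with a keyed
# hash so models can still join on a stable token without seeing raw PII.
import hashlib
import hmac

SECRET_KEY = b"store-this-in-a-secrets-manager"   # simplified for illustration

def pseudonymise(value: str) -> str:
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"email": "jane.doe@example.com", "purchase_total": 129.95}
safe_record = {**record, "email": pseudonymise(record["email"])}
print(safe_record)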

10. Data processing frameworks

Many big data analytics tools focus on either analytics or data processing. Some
frameworks, like Apache Spark, support both. These enable developers and data
scientists to use the same tools for real-time processing; complex extract,
transform and load tasks; machine learning; reporting; and SQL. This is
important because data science is a highly iterative process. A data scientist
might create 100 models before arriving at one that is put into production. This
iterative process often involves enriching the data to improve the results of the
models.

Unified analytics tools help enterprises build data pipelines across a multitude of
siloed data storage systems while training and modeling their solution in an
iterative fashion, said Ali Ghodsi, CEO and co-founder of Databricks, a data
analytics platform provider.
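
A hedged sketch of such a unified pipeline with PySpark, where one session covers data processing, SQL reporting and model training; the paths and columns are placeholders:

# Sketch of a unified pipeline in Spark: the same session handles an
# ETL-style transformation, a SQL aggregation and model training.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("unified-analytics").getOrCreate()

# Data processing: read, clean and register the data set.
df = spark.read.parquet("/lake/curated/customers/").dropna()
df.createOrReplaceTempView("customers")

# SQL reporting from the same data.
spark.sql("SELECT region, AVG(spend) AS avg_spend FROM customers GROUP BY region").show()

# Machine learning, still in the same session.
features = VectorAssembler(inputCols=["spend", "tenure"], outputCol="features")
model = LogisticRegression(labelCol="churned").fit(features.transform(df))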
Data-driven storytelling opens analytics to all

Data storytelling, because it interprets and explains data, extends business intelligence to business users and not just those trained in data analysis.

Data-driven storytelling has the potential to revolutionize analytics.

One of the great challenges of analytics has been making it accessible to more
than just the trained experts within an organization, the data analysts who
understand how to interpret data and use it to make informed decisions.

And just as visualizations helped make data more digestible a decade or so ago
and augmented intelligence is making analytics platforms easier for untrained
users to navigate, data-driven storytelling can put business intelligence in the
hands of a wider audience.

But unlike data visualizations and AI, technologies that only marginally extend
the reach of analytics, data-driven storytelling can have a wider impact in
enterprises.

Data storytelling, simply, is an automatically generated explanation of data. It's the story of the data under analysis -- the interpretation that can be risky when left to someone without expertise in data analysis -- put in narrative form rather than presented as straight analysis of the data.

"Data storytelling is what you say when you're actually trying to understand
what's happening in the data and make a decision off of it," said Nate
Nichols, chief scientist at Narrative Science, a data storytelling software vendor.

For example, Nichols continued, if someone comes home and sees a spilled glass
of water on the kitchen counter and the wet footprint of a cat leading away from
the water, they have a data set.

"That's what you get from a spreadsheet or a dashboard," Nichols said. "But you
don't make a decision based off of that. You develop an interpretation of what
happened, you tell a story -- the cat came in, tried to drink, knocked over the
water and ran out. It's the story that helps you make the decision about how to
keep the cat away in the future."

In a business sense, data-driven storytelling, for example, can be the explanation of sales figures in a report or dashboard.

Rather than just present the numbers and leave the interpretation up to the user, data storytelling platforms break them down into a written narrative: total sales in a given week were $15 million, which was up 10% over the week before and up 20% over the weekly average. Meanwhile, the sales figures
include 100 deals with a certain employee leading the way with eight, and the
overall increase can be attributed to seasonal factors.
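
A deliberately simplified illustration of how such a narrative can be generated from the figures above with plain templating; real data storytelling products use far richer natural language generation, and the name and counts here are placeholders:

# Turn raw figures into a narrative sentence the way a data-storytelling
# layer might (template-based, for illustration only).
weekly_sales = 15_000_000
prior_week = weekly_sales / 1.10        # implied by "up 10% over the week before"
weekly_average = weekly_sales / 1.20    # implied by "up 20% over the weekly average"
deals, top_rep, top_rep_deals = 100, "A. Rivera", 8   # placeholder name and counts

story = (
    f"Total sales were ${weekly_sales/1e6:.0f}M this week, "
    f"up {weekly_sales/prior_week - 1:.0%} over last week and "
    f"{weekly_sales/weekly_average - 1:.0%} above the weekly average. "
    f"{top_rep} led the team with {top_rep_deals} of {deals} deals; "
    "the overall lift is consistent with seasonal factors."
)
print(story)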

A Narrative Science data story about the sales figures highlights the most
relevant numbers in bold, creates a simple bar graph, and situates a block
graphic below a bold headline over the narrative. A traditional spreadsheet
would leave it to the user to interpret the same information presented in rows of
numbers.

A sample data story from Narrative Science describes an organization's sales
bookings.

And while data-driven storytelling has the potential to open up analytics to the masses, it isn't merely for the benefit of those untrained in the
language of data. Even those with backgrounds in data science can struggle to
find the meaning within data that can lead to action.

"As a trained analyst myself, data was always a means to an end," said Lyndsee
Manna, senior vice president of business development at Arria NLG, a natural
language generation vendor. "But I, as a human, had to wrestle to extract
something that was meaningful and could communicate to another human. The
shift to data storytelling is that I don't have to wrestle with the data anymore. The
data is going to tell me. It's knowledge automation."

The Psychology
Human beings understand stories.

From the earliest cave dwellers telling stories with pictures through the present
day, people have used stories to convey information and give it context.
Analytics, however, has largely lacked that storytelling aspect and missed out on the power a story can have. Even data visualizations don't tell stories. They present data in easily understandable formats -- charts, graphs and other artful displays -- but they usually don't give the data meaning in a richer context.

And that leaves countless business users out of the analytics process. Data
storytelling changes that. "It gives information context, it gives it purpose, and
makes it more memorable and understandable," said Donald Farmer, principal at
TreeHive Strategy. "For that reason it's very fundamental psychologically.
Storytelling is essential. In a sense, data storytelling is nothing new because
whenever we exchange data we do it with implicit stories. But data storytelling
as a practice is emerging."

Similarly, Sharon Daniels, CEO of Arria NLG, said data-driven storytelling could
revolutionize analytics because of the way humans react to narratives. "If
you follow how we evolved as human beings we started in caves with drawings
and communicating with visuals, and then language came about and opened up
our world, and the technology world is mirroring that," she said. "It's a very
interesting thing to see the storytelling parallels. The language component and
the storytelling is universal."

Meanwhile, because of the way people use storytelling to give meaning to
information, and because of the technology now being developed by data
storytelling vendors that automatically generates a story to accompany data,
anyone in an organization can use data to inform decision-making.

According to Dave Menninger, research director of data and analytics research at
Ventana Research, only between 20% and 40% of employees within most
organizations use analytics in their jobs.

"Data-driven storytelling can expand the reach of analytics. The promise of


storytelling is that we've been stuck at this level -- pick a number -- of
penetration of BI into an organization, and we have the opportunity to achieve
close to 100% with data storytelling," Menninger said.

In an informal sense, data-driven storytelling already permeates entire
organizations. When a CEO speaks about earnings, for example, they start with
hard numbers and then contextualize those numbers with a story. New
technology, however, can extend the reach of analytics in a more structured way.

"What's happening now is these specific technologies that are being developed to
support data storytelling are coming out, and they will [eventually] reach 100%
of the organization," Farmer said. "That's why it's exciting in a technology sense.
We've finally got a technology that actually can genuinely reach everyone in
some way."

The Technology
Analytics platforms largely focus on every aspect of the analytics process
leading up to interpretation. They're about preparing the data for analysis rather
than the analysis itself.

Vendors such as Alteryx and Teradata specialize in data management, loading
the data and structuring it. Others such as Tableau and Qlik are specialists in the
business intelligence layer, the presentation of the data for analysis. And still
others, including software giants IBM and Oracle, enable each aspect of the
analytics process.
Now, a crop of vendors has emerged that specializes in data-driven storytelling,
taking the data that's gone through the entire pipeline and giving it meaning.

Narrative Science, though founded only 10 years ago, is one of the veterans.
Arria NLG, which offers a suite of natural language generation tools in addition
to its data storytelling capabilities, is another that's been around for a while,
having been founded in 2009. And now startups like Paris-based Toucan
Toco are emerging as data storytelling gains momentum.

Meanwhile, longstanding BI vendors are also starting to offer data storytelling
tools. Tableau introduced Explain Data in 2019, and Yellowfin developed
Yellowfin Stories in 2018.

"Everyone wants everyone to be able to make data-driven decisions and not have
to have an analytical background or have their own analysts," Nichols said. "But
for people that aren't analysts and that are just trying to understand the story and
use that to guide their decision-making, that last mile is the hurdle."

According to Nichols, Narrative Science's data stories are generally short and to
the point, often only a paragraph or two, though they have the potential to be
longer. Arria NLG's stories can similarly be of varying length, depending on the
wants of the user.

"Whether you know about data and excel at BI or whether you don't, data can
feel very overwhelming," Manna said. "The biggest thing [data storytelling]
gives to humanity is to lift that feeling of being overwhelmed and give them
something -- in language -- they can comprehend quickly. The gift is
understanding something that either would have taken a very long time or never
to understand."

Data stories are generally the final phase of the analytics process rather
than embedded throughout. When BI vendors offer their own data-driven
storytelling tools, they generally provide the opportunity to embed stories at
points along the data pipeline, but that can be tricky, according to Farmer. If the
tools are introduced too early in the process, they can influence the outcome
rather than interpret the outcome.

"You have to be very careful with data storytelling," Farmer said. "For me, data
storytelling has to be focused on a single subject."

In addition, he said, it's important to understand that data storytelling doesn't
completely replace analysis. The stories produced by data storytelling platforms
are linear, and the real world is far more meandering.

The Future
Unlike most new technologies that start off in rudimentary forms and develop
over long periods of time, data storytelling platforms already deliver on the
promise of providing narratives that contextualize data and help the decision-
making process. However, they have the potential to do more.

Data-driven storytelling platforms don't yet know their users. They can analyze
data and craft a narrative based on it, but they don't yet have the machine
learning capabilities that will lead to personalized narratives.

"It really should be personalized explanation of the analysis and personalized


instructions on what to do based on the observations," Menninger said. "Many
vendors are at the point of explaining, and those explanations may be somewhat
personalized for the region or product you're responsible for, but few vendors
have gotten to the point where they're offering instructions."

He added that with machine learning, the tools eventually will recognize that a
person might look at a certain monthly report or dashboard and then follow up
by doing the same thing each time. But people with similar roles within the
organization might do something different after they look at that same report or
dashboard, so the software will recommend that perhaps the first person ought to
be doing something different after looking at the data.

Daniels, likewise, said personalization is an important part of the future of data-
driven storytelling. "I would say the ultimate data storytelling would be
hyperpersonalized, predictive analytics that is telling me not only what happened
and trends but is also telling me what to look out for and what could happen in
the future," she said. "It's bringing more predictive analytics into the data
storytelling, and we're not far off from that."

Beyond personalization, data storytelling platforms will likely evolve to be more
proactive, according to Nichols. Now, the platforms require users to open their
reports and dashboards and request the narrative. "Part of data storytelling is
understanding when a story needs to be told," Nichols said. "And when I think of
a perfect vision for data storytelling, it's the system being proactive. It's telling
you when there's a story you need to hear."

And the same is true conversely, he added, when there's nothing new of note and
there's no reason to generate a new story. No matter what the future holds,
however, data-driven storytelling tools will always be about extending the reach
of analytics to a broader audience, and for the first time, potentially everyone.

Use Cases of Big Data Analytics in Real World


The most valuable item for any company in modern times is data! Companies
can work much more efficiently by analyzing large amounts of data and making
business decisions on that basis. This means that Big Data Analytics is the
current path to profit! So is it any surprise that more and more companies are
gradually turning towards a data-based business model?

Big Data Analytics is much more objective than the older methods and
companies can make the correct business decisions using data insights. There
was a time when companies could only interact with their customers one-on-one
in stores. And there was no way to know what individual customers wanted on a
large scale. But that has all changed with the coming of Big Data Analytics.
Now companies can directly engage with each customer online personally and
know what they want.
So let’s see the different ways companies can use Big Data Analytics in the real
world to improve their performance and become even more successful (and
rich!) with time.

1. Companies use Big Data Analytics to Increase Customer Retention

No company can exist without customers! And so attracting customers and even
more importantly, retaining those customers is necessary for a company. And
Big Data Analytics can certainly help with that! Big Data Analytics allows a
company to observe customer trends and then market their products specifically
keeping their customers in mind. And the more data that a company has about its
customer base, the more accurately they can observe customer trends and
patterns which will ensure that the company can deliver exactly what its
customers want. And this is the best way to increase customer retention. After
all, happy customers mean loyal customers!
An example of a company that uses Big Data Analytics to Increase Customer
Retention is Amazon. Amazon collects all the data about its customers such as
their names, addresses, search history, payments, etc. so that it can provide a
truly personalized experience. This means that Amazon knows who you are as
soon as you log in! It also provides you with product recommendations based on your
history so you are more likely to buy things. And if you buy lots of things on
Amazon, you are less likely to leave Amazon!

2. Companies use Big Data Analytics to create Marketing Campaigns

How can a company reach new customers? Marketing campaigns! However, while a
great marketing campaign can get customers for a company, a poor marketing
campaign can make a company lose even its existing customers. And so Big
Data Analytics is necessary to analyze the customer base and understand what
people want so that the marketing campaign is successful in converting more
people. This can be done by monitoring current online trends, understanding
customer behavior in the market and then cashing in on that to create a successful
marketing campaign.
An example of a company that uses Big Data Analytics to create Marketing
Campaigns is Netflix. Have you noticed that as soon as you open Netflix, they
have movies and series marketed specifically for you? They do this by collecting
data on your watching habits and search history and then providing targeted
adverts. So if you have been watching mystery movies recently, that’s what you
will be recommended in the future as well!

3. Companies use Big Data Analytics for Risk Management

A company cannot sustain itself if it doesn't have a successful risk management
plan. After all, how is a big company supposed to function if it cannot even
identify risks ahead of time and then work to minimize them as much as possible?
And this is where Big Data Analytics comes in! It can be used to collect and
analyze the vast internal data available in the company archives that can help in
developing both short term and long term risk management models. Using these,
the company can identify future risks and make much more strategic business
decisions. That means much more money in the future!!!
An example of a company that uses Big Data Analytics for Risk Management
is Starbucks. Did you know that Starbucks can have multiple stores on a single
street and all of them are successful? This is because Starbucks does great risk
analysis in addition to serving great coffee! They collect data like location data,
demographic data, customer preferences, traffic levels, etc. for any location where
they plan to open a shop, and they only do so if the chances of success are high and the
associated risk is minimal. So they can even choose locations that are close
together as long as there is more profit and less risk.

4. Companies use Big Data Analytics for Supply Chain Handling

The supply chain begins with the creation of raw materials and ends at the
finished products in the hands of the customers. And for large companies, it is
very difficult to handle this supply chain. After all, it can contain thousands of
people and products that are moving from the point of manufacture to the point
of consumption! So companies can use Big Data Analytics to analyze their raw
materials, products in their warehouse inventories and their retailer details to
understand their production and shipment needs. This will make Supply Chain
Handling much easier which will lead to fewer errors and consequently fewer
losses for the company.
An example of a company that uses Big Data Analytics for Supply Chain
Handling is PepsiCo. While the most popular thing sold by PepsiCo is Pepsi of
course, did you know they sell many other things like Mountain Dew, Lays,
7Up, Doritos, etc. all over the world? And it is very difficult to manage the
Supply Chain Handling of so many things without using Big Data Analytics. So
PepsiCo uses data to calculate the amount and type of products that retailers
want without any wastage occurring.

5. Companies use Big Data Analytics for Product Creation

All companies are trying to create products that their customers want. Well, what
if companies were able to first understand what their customers want and then
create products? They would surely be successful! That’s what Big Data
Analytics aims to do for Product Creation. Companies can use data like previous
product response, customer feedback forms, competitor product successes, etc.
to understand what types of products customers want and then work on that. In
this way, companies can create new products as well as improve their previous
products according to market demand and become much more successful and
popular.
An example of a company that uses Big Data Analytics for Product Creation
is Burberry, a British luxury fashion house. They provide luxury with
technology! This is done by targeting customers on an individual level to find
out the products that they want and focusing on those. Burberry store employees
can also see your online purchase history and preferences and recommend
matching accessories with your clothes. And this makes a truly personalized
product experience which is only possible with Big Data Analytics.
Key Skills That Data Scientists Need
Data scientists have a deceptively straightforward job to do: make sense of the
torrent of data that enters an organization as unstructured hash. Somewhere in
that confusion (hopefully) lies vital insight.
But is skill with algorithms and datasets enough for data scientists to succeed?
What else do they need to know to advance their careers?
While many tech pros might think that pushing data from query to conclusion is
enough to get by, they also need to know how the overall business works, and
how their data work will ultimately impact strategies and revenue. The current
hunger for data analytics means that companies always want more from their
data scientists.
Hard and Soft Skills
“There is a shortage, a skills gap in data science. It is enormous and it is
growing,” said Crystal Valentine, vice president of technology strategy at MapR,
a Big Data firm.
As proof of this, Valentine cited a report from consulting firm McKinsey &
Co. that suggests a national shortage of as many as 190,000 people with “deep
analytical skills” by 2018. That’s in addition to a gap of roughly 1.5 million “Big
Data” analysts and managers during the same time period.
Modern data science evolved from three fields: applied mathematics, statistics,
and computer science. In recent years, however, the term “data scientist” has
broadened to include anyone with “a background in the quantitative field,”
Valentine added. Other fields—including physics and linguistics—are
developing more of a symbiotic relationship with data science, thanks in large
part to the evolution of artificial intelligence, machine learning, and natural
language processing.
In addition to aptitude with math and algorithms, successful data scientists have
also mastered soft skills. “They need to know more than what is happening in
the cubicle,” said Mansour Raad, senior software architect at ESRI, which
produces mapping software. “You have to be a people person.”
In order to effectively crunch numbers, in other words, data scientists need to
work with the people who know the larger business. They must interact with
managers who can frame the company’s larger strategy, as well as colleagues
who will turn data insights into real action. With more input from those other
stakeholders, data scientists can better formulate the right questions to drive their
analysis.
“Soft skills” also means a healthy curiosity, said Thomas Redman, a.k.a. the
“Data Doc,” who consults and speaks extensively about data science. Ideally, the
applicant “likes to understand data, to understand what is going on in the world.”
When applying for data-science jobs, he added, applicants are often judged on
their intellectual curiosity in addition to their other skills—employers fear “they
will stay in front of a computer screen,” Redman observed. That can create an
issue for some data scientists who are used to keeping their nose in the data, and
not interacting with other business units.
When Redman was a statistician at Bell Labs (long before the term “data
scientist” was even coined), managers made a point of telling those employees
who worked with data that the ultimate mission was to make the telephone
network run better. That meant more than understanding statistics; it meant
understanding the broader problems facing the company.
Faith vs. Skepticism
There’s an old saying in business: If you want to manage a problem, put a
number on it. Data does that, to a certain extent. While the data scientist will
wrangle the data, it’s up to the manager to make sense of it.
Data can be taken on faith or questioned. Doing the former risks “GIGO”—
Garbage In, Garbage Out. The latter requires “data skepticism”—a good skill for
anyone who works with data on a daily basis.
Sometimes Raad spends about 80 percent of his time just cleaning data: “The
data you get is just garbage.” In this respect, a data scientist is really a “data
janitor.”
In the real world, “data is messy,” MapR’s Valentine concurred. “You have to
have a real healthy skepticism when looking at data collected from a real-life
effort.” One can’t assume a uniform distribution: “Data is the side-effect of real-
world processes.”
A good data scientist keeps in mind that collected data is not unbiased. “You are
trying to leverage the data to answer a question. You are not trying to stretch it
too far,” Valentine added. “As a rule of thumb, gathering as much data as
possible is a good strategy.”
Even if you’re not a data scientist, taking the results of an analysis simply on
faith is rarely a good idea. “We’re uncomfortable when someone else knows
more than you do,” Redman said. Whenever you’re studying the results of an
analysis, have a list of questions handy—where did the data come from? What’s
the worst thing that can happen? What has to be true for the recommendation to
be correct?
“People who don’t question things are fair victims.” Redman said.
Bias vs. Objectivity
“Getting something right in the beginning is not a sign of victory.” Raad said. Be
skeptical—do you have all the data? Is the data too good to be true? “The trick is
to remove the human from the equation… Let the math speak for itself.” The
data skeptic can then take the next step, showing how much of a conclusion
is not random.
Don’t try to be perfect. The solution you craft must only be sufficient, getting the
user from Point A to B. "You build a good, working Volkswagen [rather] than a
Cadillac," Raad said. "You have to be able to settle for the Volkswagen
sometimes.”
Teams’ preconceptions are often built into algorithms. For example, take a credit
algorithm that rates applicants for loans; while you might think the underlying
math is neutral, the programmer may have fed their biases into the code.
Bias is not a new problem, Valentine said. Engineers often have to make a
"subjective decision" when trying to meet goals, crafting portions of solutions
that are sufficient to meet immediate needs. But it isn't as if the underlying
algorithms are black boxes: data scientists will need to determine for themselves
if the software is producing a good outcome.
When it comes to data scientists, both hard and soft skills, along with a healthy
skepticism, are necessary to do the job. And when it comes to advancing a data-science
career, not taking things on faith seems like a solid course of action.
Data analytics and career opportunities
From our childhood, we have heard that we are nothing without water and water
is our life. But in this modern technical era, the same thing can be said about
data. Data means information that is basically created from a source and flows to
a receiver. Every object, living or non-living, is surrounded by various kinds of
data. We work only with the data which we can understand and the rest of the
data remains a mystery. It is impossible to work with lots of data simultaneously.
This is the part where data analytics plays an important role.
A data analyst is a person who is in charge of collecting and analyzing the data.
The development and testing of the analytical models based on the collected and
analyzed data are also done by the data analyst.
Now let’s talk about the basic requirements and the process of data analytics.
First of all, the raw or unstructured data from various sources is collected and
then combined into a common format. This data is then loaded into a data
analytics system such as a data warehouse, a Hadoop cluster, etc. Then data
cleansing and profiling are done to make sure that the data is error-free and
consistent overall. After that, the main operation in data analytics is performed
i.e. building an analytical model of the data using various programming
languages such as SQL, Python, Scala, etc. Finally, the analytical model results
are used, with the help of data visualization, to make decisions and obtain the
desired results.
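
As a rough illustration of this workflow, the sketch below walks through a scaled-down, single-machine version in Python using pandas and matplotlib. The file name, column names and the simple monthly-revenue-by-region "model" are hypothetical stand-ins; a real pipeline would load the data into a warehouse or Hadoop cluster and apply far more thorough cleansing, profiling and modeling.

```python
# Minimal single-machine sketch of the analytics process described above.
import pandas as pd
import matplotlib.pyplot as plt

# 1. Collect raw data and bring it into a common format (hypothetical file/columns).
raw = pd.read_csv("sales_raw.csv", parse_dates=["order_date"])

# 2. Cleanse and profile: remove duplicates and missing values, inspect basic stats.
clean = raw.drop_duplicates().dropna(subset=["region", "amount"])
print(clean.describe(include="all"))

# 3. Build a simple analytical model: monthly revenue per region.
monthly = (
    clean.set_index("order_date")
         .groupby("region")["amount"]
         .resample("M")
         .sum()
         .reset_index()
)

# 4. Visualize the results to support decision-making.
for region, grp in monthly.groupby("region"):
    plt.plot(grp["order_date"], grp["amount"], label=region)
plt.legend()
plt.title("Monthly revenue by region")
plt.show()
```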
Career Opportunities in Data Analytics
In this digital age, data analytics is more important than ever. There are multiple
job opportunities in various industries, with the demand for data analytics
professionals increasing day by day. Some of the career opportunities that
require data analytics professionals are given as follows:

1. Data Scientist
A data scientist collects and analyses data so that relevant
decisions can be made using data visualization. A holistic view of data,
good knowledge of data analytics and data visualization skills, as well
as knowledge of programming languages such as SQL, Python, Scala,
etc., are the basic requirements for a data scientist.

2. Data Engineer
A data engineer helps in the design, implementation, and optimization
of the data infrastructure that supports the various data analytics
processes. In general, a data engineer handles quite large data sets and
often helps in making this data readable for data scientists by data
cleansing and profiling.

3. Business Analyst
A business analyst helps in solving the business problems an
organization is facing by using data analytics to understand the
business models, company reports, technology integration documents,
etc., and to provide various business strategies.

4. Statistician
A statistician collects, analyses and interprets statistical data to obtain
coherent and useful information. Some of the common jobs of
statisticians are to provide statistical simulations, mathematical
modeling, analysis and interpretation of various survey results,
business forecasting on the basis of data analytics, etc.

5. Machine Learning Engineer
A machine learning engineer analyses and interprets algorithms and
statistical models for machine learning using data analytics. Mainly
knowledge of both programming and statistics is required for a
machine learning engineer.

6. Quantitative Analyst
A quantitative analyst helps in solving the various financial problems
by using data analytics to analyze large amounts of data to understand
financial risk management, investment patterns, exchange rate trends,
the stock market, etc.

These are just some of the career opportunities that require data analytics.
However, data analytics, in general, is a vast field and the opportunities it
provides are endless. There are many more opportunities in the data analytics
field, with even more growth predicted in the future. So a
career in data analytics is a lucrative prospect with enormous scope and growth
in the future.
