Taming Big Data Analytics
Data analytics can also be separated into quantitative data analysis and
qualitative data analysis. The former involves the analysis of numerical data with
quantifiable variables. These variables can be compared or measured
statistically. The qualitative approach is more interpretive; it focuses on
understanding the content of non-numerical data such as text, images, audio and
video, including common phrases, themes and points of view.
An advanced type of data analytics includes data mining, which involves sorting
through large data sets to identify trends, patterns and relationships. Another
type is called predictive analytics, which seeks to predict customer behavior,
equipment failures and other future events. Machine learning can also be used
for data analytics, using automated algorithms to churn through data sets more
quickly than data scientists can do via conventional analytical modeling. Big
data analytics applies data mining, predictive analytics and machine learning
tools. Text mining provides a means of analyzing documents, emails and other
text-based content.
Data analytics initiatives support a wide variety of business uses. For example,
banks and credit card companies analyze withdrawal and spending patterns to
prevent fraud and identity theft. E-commerce companies and marketing services
providers will use clickstream analysis to identify website visitors who are likely
to buy a particular product or service based on navigation and page-viewing
patterns. Healthcare organizations mine patient data to evaluate the effectiveness
of treatments for cancer and other diseases. Mobile network operators also
examine customer data to forecast churn. This allows mobile companies to take
steps to prevent defections to business rivals. To boost customer relationship
management efforts, other companies can also engage in CRM analytics to
segment customers for marketing campaigns and equip call center workers with
up-to-date information about callers.
The analytics process starts with data collection. Data scientists identify the
information they need for a particular analytics application, and then work on
their own or with data engineers and IT staff to assemble it for use. Data from
different source systems may need to be combined via data integration routines,
transformed into a common format and loaded into an analytics system, such as
a Hadoop cluster, NoSQL database or data warehouse.
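As a minimal sketch of what such an integration routine can look like, the following Python snippet uses the pandas library with hypothetical file and column names (crm_customers.csv, billing_accounts.csv and so on) to combine two source extracts, transform them into a common format and write out an analysis-ready file; a real pipeline would more likely load the result into Hadoop, a NoSQL database or a data warehouse.

    import pandas as pd

    # Hypothetical extracts from two source systems
    crm = pd.read_csv("crm_customers.csv")         # e.g. columns: cust_id, full_name, signup_date
    billing = pd.read_csv("billing_accounts.csv")  # e.g. columns: customer_id, plan, monthly_spend

    # Transform to a common format: consistent key name and date type
    crm = crm.rename(columns={"cust_id": "customer_id"})
    crm["signup_date"] = pd.to_datetime(crm["signup_date"])

    # Integrate the two sources on the shared customer key
    combined = crm.merge(billing, on="customer_id", how="left")

    # "Load" step: write a single, analysis-ready file for the analytics system
    combined.to_csv("analytics_staging/customers.csv", index=False)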
In other cases, the collection process may consist of pulling a relevant subset out
of a stream of data that flows into, for example, Hadoop. This data is then moved
to a separate partition in the system so it can be analyzed without affecting the
overall data set.
Once the data that's needed is in place, the next step is to find and fix data
quality problems that could affect the accuracy of analytics applications. That
includes running data profiling and data cleansing tasks to ensure the
information in a data set is consistent and that errors and duplicate entries are
eliminated. Additional data preparation work is then done to manipulate and
organize the data for the planned analytics use. Data governance policies are
then applied to ensure that the data follows corporate standards and is being used
properly.
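The sketch below illustrates, using the same hypothetical file and column names as above, what basic cleansing can look like in pandas: duplicate records are removed, inconsistent values are standardised and obvious errors are corrected before the data moves on to analysis.

    import pandas as pd

    df = pd.read_csv("analytics_staging/customers.csv")  # hypothetical prepared extract

    # Cleansing: drop exact duplicate records
    df = df.drop_duplicates()

    # Make inconsistent values consistent (e.g. "Basic ", "basic" -> "basic")
    df["plan"] = df["plan"].str.strip().str.lower()

    # Fix obvious errors: treat negative spend as missing, then impute it
    df.loc[df["monthly_spend"] < 0, "monthly_spend"] = float("nan")
    df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

    df.to_csv("analytics_staging/customers_clean.csv", index=False)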
Speaking at the Information Builders’ Summit, IDC group vice president Dan
Vesset estimated that knowledge workers spend less than 20% of their time on
data analysis; the rest of their time is taken up with finding, preparing and
managing data. “An organisation plagued by the lack of relevant data,
technology and processes, employing 1,000 knowledge workers, wastes over
$5.7 million annually searching for, but not finding, information,” warned Vesset.
Vesset’s comments underline the fact that data must be business-ready before it
can generate value through advanced analytics, predictive analytics, IoT, or
artificial intelligence (AI).
As we’ve seen from numerous enterprise case studies, co-ordination of data and
analytics strategies and resources is the key to generating return on analytics
investments.
Equally, organisations need a clear analytics strategy which clarifies the desired
business outcomes.
Analytics strategy often follows four clear stages: starting with descriptive
analytics; moving to diagnostic analytics; advancing to predictive analytics and
ultimately to prescriptive analytics.
These two strategies must be aligned because the type of analytics required by
the organisation will have a direct impact on data management aspects such as
storage and latency requirements. For example, operational analytics and
decision support place a different load on the infrastructure than customer
portal analytics, which must be able to scale to meet sudden spikes in demand.
If operational analytics and IoT are central to your analytics strategy, then
integration of new data formats and real-time streaming will need to be covered
in your data strategy.
When the analytics workload is considered, the impact on the data strategy
becomes clear. While a data lake project will serve your data scientists and back
office analysts, your customers and supply chain managers may be left in the
dark.
Over the past four decades, we have seen the majority of enterprise efforts
devoted to back-office analytics and data science in order to deliver data-based
insights to management teams.
However, the most effective analytics strategy is to deliver insights to the people
who can use them to generate the biggest business benefits.
We typically observe faster time to value where the analytics strategy focuses on
delivering insights directly to operational workers to support their decision-
making; or to add value to the services provided to partners and customers.
How to align data and analytics strategies
One proven approach is to look at business use cases for each stage in the
analytics strategy. This might include
descriptive management scorecards and dashboards; diagnostic back-office
analytics and data science; operational analytics and decision support; M2M and
IoT; AI; or portal analytics created to enhance the customer experience.
Identify all the goals and policies that must be included in your strategies. Create
a framework to avoid gaps in data management so that the right data will be
captured, harmonised and stored to allow it to be used effectively within the
analytics strategy.
Look at how your organisation enables access to and integration of diverse data
sources. Consider how it uses software, batch or real-time processing and data
streams from all internal systems.
By looking at goals and policies, the organisation can accommodate any changes
to support a strong combined data and analytics strategy.
Once you have defined your data and analytics strategies, it’s critical to address
data quality. Mastering data ensures that your people can trust the analytic
insights derived from it. Taking this first step will greatly simplify your
organisation’s subsequent analytics initiatives.
As data is the fuel of the analytics engine, performance will depend on data
refinement.
The reality for many data professionals is that they struggle to gain organisation-
wide support for a data strategy. Business managers are more inclined to invest
in tangibles, such as dashboards. Identifying the financial benefits of investing in
a data quality programme or a master data management initiative is a challenge,
unless something has previously gone wrong that has convinced the
management team that valuable analytics outputs are directly tied to quality data
inputs.
To gain their support for a data strategy, consider involving line-of-business
managers by asking them what the overall goals and outputs are for their
analytics initiatives. An understanding of the desired outputs will then guide
the design of the data infrastructure.
Pulling together
The following organisations aligned their data and analytics strategies to deliver
clear business outcomes:
Food for the Poor used high quality data and analytics to reach its
fund raising target more quickly: reducing the time taken to raise
$10 million from six months to six days, so that it could more
quickly help people in dire need.
Lipari Foods integrated IoT, logistics and geo location data,
enabling it to analyse supply chain operations so that it uses
warehouse space more efficiently, allowing it to run an agile
operation with a small team of people.
St Luke’s University Health Network mastered its data as part of its
strategy to target specific households to make them aware of
specialised medications, reaching 98 per cent uptake in one of its
campaigns focused on thirty households. “Rather than getting
mired in lengthy data integration and master data management
(MDM) processes without any short-term benefits, stakeholders
decided to focus on time-to-value by letting business priorities
drive program deliverables,” explains Dan Foltz, program manager
for the EDW and analytics implementation at St. Luke’s. “We
simultaneously proceeded with data integration, data governance,
and BI development to achieve our business objectives as part of a
continuous flow. The business had new BI assets to meet their
needs in a timely fashion, while the MDM initiative improved those
assets and enabled progressively better analysis,” he adds. This
approach allowed the St. Luke’s team to deliver value throughout
the implementation.
These are just a few examples of organisations whose cohesive data and
analytics strategies have enabled them to generate better value from
diverse and complex data sets.
While analytics initiatives often begin with one or two clear business cases, it’s
important to ensure that the overall data analytics strategy is bigger than any
single initiative. Organisations that focus on individual projects may find that
they have overlooked key data infrastructure requirements once they try to scale.
As Grace Auh, Business Intelligence and Decision Support manager at Markham
Stouffville Hospital, observed during Information Builders’ Summit, “Are you
connecting the dots? Or are you just collecting them?”
Big data analytics applications allow data analysts, data scientists, predictive
modelers, statisticians and other analytics professionals to analyze growing
volumes of structured transaction data, plus other forms of data that are often left
untapped by conventional BI and analytics programs. This includes a mix
of semi-structured and unstructured data: for example, internet data, web
server logs, social media content, text from customer emails and survey
responses, mobile phone records, and machine data captured
by sensors connected to the internet of things (IoT).
Big data analytics is a form of advanced analytics, which has marked differences
compared to traditional BI.
How big data analytics works
In some cases, Hadoop clusters and NoSQL systems are used primarily as
landing pads and staging areas for data before it gets loaded into a data
warehouse or analytical database for analysis, usually in a summarized form that
is more conducive to relational structures.
More frequently, however, big data analytics users are adopting the concept of a
Hadoop data lake that serves as the primary repository for incoming streams
of raw data. In such architectures, data can be analyzed directly in a Hadoop
cluster or run through a processing engine like Spark. As in data warehousing,
sound data management is a crucial first step in the big data analytics process.
Data being stored in the HDFS must be organized, configured and partitioned
properly to get good performance out of both extract, transform and load (ETL)
integration jobs and analytical queries.
Once the data is ready, it can be analyzed with the software commonly used
for advanced analytics processes. That includes tools for:
data mining, which sift through data sets in search of patterns and
relationships;
predictive analytics, which build models to forecast customer behavior
and other future developments;
machine learning, which taps algorithms to analyze large data sets; and
deep learning, a more advanced offshoot of machine learning.
Text mining and statistical analysis software can also play a role in the big data
analytics process, as can mainstream business intelligence software and data
visualization tools. For both ETL and analytics applications, queries can be
written in MapReduce, with programming languages such as R, Python and
Scala, or in SQL, the standard language for relational databases, which is
supported via SQL-on-Hadoop technologies.
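To make the SQL-on-Hadoop idea concrete, here is a hedged PySpark sketch; the HDFS path and the "page" field are assumptions chosen for illustration, not a reference to any specific deployment. It reads raw JSON logs from the data lake and answers a reporting-style question with a SQL query run directly on the cluster.

    from pyspark.sql import SparkSession

    # Assumes a Spark installation with access to the cluster's storage;
    # the path and schema below are hypothetical.
    spark = SparkSession.builder.appName("weblog-analytics").getOrCreate()

    logs = spark.read.json("hdfs:///data/raw/web_logs/")   # raw, semi-structured data
    logs.createOrReplaceTempView("web_logs")

    # A SQL query run directly against the data lake, no relational database required
    top_pages = spark.sql("""
        SELECT page, COUNT(*) AS visits
        FROM web_logs
        GROUP BY page
        ORDER BY visits DESC
        LIMIT 10
    """)
    top_pages.show()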
Early big data systems were mostly deployed on premises, particularly in large
organizations that collected, organized and analyzed massive amounts of data.
But cloud platform vendors, such as Amazon Web Services (AWS)
and Microsoft, have made it easier to set up and manage Hadoop clusters in the
cloud. The same goes for Hadoop suppliers such as Cloudera-Hortonworks,
which supports the distribution of the big data framework on the AWS
and Microsoft Azure clouds. Users can now spin up clusters in the cloud, run
them for as long as they need and then take them offline with usage-based
pricing that doesn't require ongoing software licenses.
Big data has become increasingly beneficial in supply chain analytics. Big
supply chain analytics utilizes big data and quantitative methods to enhance
decision making processes across the supply chain. Specifically, big supply
chain analytics expands datasets for increased analysis that goes beyond the
traditional internal data found on enterprise resource planning (ERP) and supply
chain management (SCM) systems. Also, big supply chain analytics implements
highly effective statistical methods on new and existing data sources. The
insights gathered facilitate better informed and more effective decisions that
benefit and improve the supply chain.
Initially, as the Hadoop ecosystem took shape and started to mature, big data
applications were primarily the province of large internet and e-
commerce companies such as Yahoo, Google and Facebook, as well as analytics
and marketing services providers. In the ensuing years, though, big data
analytics has increasingly been embraced by retailers, financial services firms,
insurers, healthcare organizations, manufacturers, energy companies and other
enterprises.
Logical Architectures for Big Data Analytics
If you check the reference architectures for big data analytics proposed
by Forrester and Gartner, modern analytics need a plurality of systems: one or
several Hadoop clusters, in-memory processing systems, streaming tools,
NoSQL databases, analytical appliances and operational data stores, among
others.
This is not surprising, since different data processing tasks need different tools.
For instance: real-time queries have different requirements than batch jobs, and
the optimal way to execute queries for reporting is very different from the way to
execute a machine learning process. Therefore, all these on-going big data
analytics initiatives are actually building logical architectures, where data is
distributed across several systems.
While the traditional analytical tools that comprise basic business intelligence
(BI) examine historical data, tools for advanced analytics focus on forecasting
future events and behaviors, enabling businesses to conduct what-if analyses to
predict the effects of potential changes in business strategies.
Predictive analytics, data mining, big data analytics and machine learning are
just some of the analytical categories that fall under the heading of advanced
analytics. These technologies are widely used in industries including marketing,
healthcare, risk management and economics.
Open source tools have become a go-to option for many data scientists doing
machine learning and prescriptive analytics. They include programming
languages, as well as computing environments, including Hadoop and Spark.
Users typically say they like open source advanced analytics tools because they
are generally inexpensive to operate, offer strong functionality and are backed by
a user community that continually innovates the tools.
On the proprietary side, vendors including Microsoft, IBM and the SAS Institute
all offer advanced analytics tools. Most require a deep technical background and
understanding of mathematical techniques.
Advanced analytics is commonly broken down into four categories:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
Predictive Analytics
Predictive analytics is a form of advanced analytics that uses both new and
historical data to forecast activity, behavior and trends. It involves
applying statistical analysis techniques, analytical queries and
automated machine learning algorithms to data sets to create predictive
models that place a numerical value or score on the likelihood of a particular
event happening.
Predictive analytics software applications use variables that can be measured and
analyzed to predict the likely behavior of individuals, machinery or other
entities. Predictive analytics can be used for a variety of use cases. For example,
an insurance company is likely to take into account potential driving safety
variables, such as age, gender, location, type of vehicle and driving record, when
pricing and issuing auto insurance policies.
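A minimal illustration of such a scoring model, using scikit-learn and a tiny synthetic data set (the driver attributes and claim label are invented for the example), might look like the sketch below; the predicted probability is the numerical score placed on the likelihood of a claim.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Hypothetical historical policy data: driver attributes plus a claim outcome label
    data = pd.DataFrame({
        "age":           [22, 45, 31, 60, 27, 52, 38, 19],
        "vehicle_power": [180, 110, 140, 95, 200, 120, 130, 210],
        "prior_claims":  [1, 0, 0, 0, 2, 0, 1, 1],
        "made_claim":    [1, 0, 0, 0, 1, 0, 0, 1],
    })

    X = data[["age", "vehicle_power", "prior_claims"]]
    y = data["made_claim"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)

    # The model places a numerical score (a probability) on the likelihood of a claim
    print(model.predict_proba(X_test)[:, 1])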
Predictive analytics has grown alongside the emergence of big data systems. As
enterprises have amassed larger and broader pools of data in Hadoop clusters
and other big data platforms, they have created increased data mining
opportunities to gain predictive insights. Heightened development and
commercialization of machine learning tools by IT vendors have also helped
expand predictive analytics capabilities.
Once predictive modeling produces actionable results, the analytics team can
share them with business executives, usually with the aid of dashboards and
reports that present the information and highlight future business opportunities
based on the findings. Functional models can also be built into operational
applications and data products to provide real-time analytics capabilities, such as
a recommendation engine on an online retail website that points customers to
particular products based on their browsing activity and purchase choices.
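As a rough sketch of the recommendation-engine idea, the following item-to-item co-occurrence example uses a tiny, invented purchase history; production recommenders are far more sophisticated, but the principle of scoring products by how often they are bought together is the same.

    import pandas as pd

    # Hypothetical purchase history: one row per (customer, product) purchase
    purchases = pd.DataFrame({
        "customer": ["a", "a", "b", "b", "c", "c", "d"],
        "product":  ["laptop", "mouse", "laptop", "mouse", "laptop", "monitor", "mouse"],
    })

    # Build a customer x product matrix of purchase indicators
    matrix = pd.crosstab(purchases["customer"], purchases["product"])

    # Item-to-item co-occurrence: how often two products are bought by the same customer
    co_occurrence = matrix.T.dot(matrix)

    def recommend(product, n=2):
        # Rank other products by how often they co-occur with the given one
        scores = co_occurrence[product].drop(product)
        return scores.sort_values(ascending=False).head(n)

    print(recommend("laptop"))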
Beyond data modeling, other techniques used by data scientists and experts
engaging in predictive analytics may include:
text analytics software to mine text-based content, such as Microsoft
Word documents, email and social media posts;
classification models that organize data into preset categories to make it
easier to find and retrieve; and
deep neural networking, which can emulate human learning and
automate predictive analytics.
IoT also enables similar predictive analytics uses for monitoring oil and gas
pipelines, drilling rigs, windmill farms and various other industrial
IoT installations. Localized weather forecasting for farmers, based partly on data
collected from sensor-equipped weather stations installed in farm fields, is
another IoT-driven predictive modeling application.
Analytics tools
A wide range of tools is used in predictive modeling and analytics. IBM,
Microsoft, SAS Institute and many other software vendors offer predictive
analytics tools and related technologies supporting machine learning and deep
learning applications.
In addition, open source software plays a big role in the predictive analytics
market. The open source R analytics language is commonly used in predictive
analytics applications, as are the Python and Scala programming languages.
Several open source predictive analytics and machine learning platforms are also
available, including a library of algorithms built into the Spark processing
engine.
Analytics teams can use the base open source editions of R and other analytics
languages or pay for the commercial versions offered by vendors such as
Microsoft. The commercial tools can be expensive, but they come with technical
support from the vendor, while users of pure open source releases must
troubleshoot on their own or seek help through open source community support
sites.
Before acting on the output of a predictive model, it is worth asking questions
such as:
Can you tell me something about the source of data you used in your
analysis?
Are you sure the sample data are representative of the population?
Are there any outliers in your data distribution? How did they affect
the results?
What assumptions are behind your analysis?
Are there any conditions that would make your assumptions invalid?
Even with those cautions, it’s still pretty amazing that we can use analytics to
predict the future. All we have to do is gather the right data, do the right type of
statistical model, and be careful of our assumptions. Analytical predictions may
be harder to generate than those by the late-night television soothsayer Carnac
the Magnificent, but they are usually considerably more accurate.
Big data analytics projects raise stakes for predictive models
One of the keys to success in big data analytics projects is building strong ties
between data analysts and business units. But there are also technical and skills
issues that can boost or waylay efforts to create effective analytical models for
running predictive analytics and data mining applications against sets of big
data.
But analytics teams need to weigh the benefits of using the full assortment of
data at their disposal. That might be necessary for some applications -- for
example, fraud detection, which depends on identifying outliers in a data set that
point toward fraudulent activity, or uplift modeling efforts that aim to segment
potential customers so marketing programs can be targeted at people who might
be positively influenced by them. In other cases, predictive modeling in big data
environments can be done effectively and more quickly with smaller data sets
through the use of data sampling techniques.
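A simple way to apply sampling without distorting the target variable is a stratified sample, sketched below with pandas; the file and the "churned" column are hypothetical, and reading the parquet file assumes a library such as pyarrow is installed.

    import pandas as pd

    # Hypothetical large data set loaded from the big data environment
    big = pd.read_parquet("analytics_staging/transactions.parquet")

    # A stratified 5% sample: keep the class balance of the 'churned' label intact
    sample = big.groupby("churned").sample(frac=0.05, random_state=42)

    print(len(big), "rows reduced to", len(sample), "for faster model development")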
Data quality is another issue that needs to be taken into account in building
models for big data analytics applications, said Michael Berry, analytics director
at travel website operator TripAdvisor LLC's TripAdvisor for Business division
in Newton, Mass. "There's a hope that because data is big now, you don't have to
worry about it being accurate," Berry said during a session at the 2013 Predictive
Analytics World conference in Boston. "You just press the button, and you'll
learn something. But that may not stand up to reality."
Staffing also gets a spot on the list of predictive modeling and big data analytics
challenges. Skilled data scientists are in short supply, particularly ones with a
combination of big data and predictive analytics experience. That can make it
difficult to find qualified data analysts and modelers to lead big data analytics
projects.
Along those lines, a computer engineer on Pitts' staff had a master's degree in
business administration but didn't really know anything about statistical analysis.
Highmark paid for the engineer to go back to school to get a master's degree in
statistics as well. Pitts said he identified the worker for continuing education
support not only because the engineer had some of the necessary qualifications
but also because he had a personality trait that Pitts is particularly interested in:
curiosity.
At DataSong, Nesbitt typically looks for someone with a Ph.D. in statistics and
experience using the R programming language, which the company uses to build
its predictive models with R-based software from Revolution Analytics. "To
work on our team, where we're building models all the time and we're knee-deep
in data, you have to have technical skills," she said.
Ultimately, though, those skills must be put to use to pull business value out of
an organization's big data vaults. "The key to remain focused on is that this isn't
really a technical problem -- it's a business problem," said Tony Rathburn, a
senior consultant and training director at The Modeling Agency, an analytics
consultancy in Pittsburgh. "That's the real issue for the analyst: setting up the
problem in a way that actually provides value to a business unit. That point
hasn't changed, regardless of the amount of data."
Joe DeCosmo, Enova's chief analytics officer, said a member of his team
recently told him that when the analyst first started working at the company, he
had to get over his academic instincts to detail every theory-based aspect of
the predictive models he builds in order to focus more on the business impact the
models can have.
"They have to realize they don't have to build the perfect model," DeCosmo said.
"It's about building something that's better than what we're doing currently."
This issue is heating up as more businesses look for workers with data science
skills. Often, the people who have the skills organizations need, which include
statistical analysis, machine learning, and R and Python programming, come
from academic backgrounds. But businesses don't have the kind of time that
Ph.D. programs give students to build analytical models. In the real world,
models need to be built and deployed quickly to help drive timely business
strategies and decisions.
"At our scale, if we can get a model into production that's 10% better, that adds
material impact to our business," DeCosmo said.
There's always a tradeoff between time and predictive power when developing
analytical models. Spending more time on development to make a model better
could allow a data scientist to discover new correlations that boost the strength
of its predictions. But DeCosmo said he sees more business value in speedy
development.
"We're very focused on driving down the time [it takes to develop models]," he
said. "There's no such thing as a perfect model, so don't waste your time trying to
build one. We'd rather get that model out into production."
But since 2013, they've been using a "data blending" tool from Alteryx Inc. to
bring all the data into an analytics sandbox that business analysts can access with
Tableau's data discovery and visualization software. Sturgeon said that allows
the business analysts to skip the "middleman" reporting systems and build their
own reports, while his team does deeper analyses.
"We take the data and bring it together," he said. "Then we say, 'Here's the
sandbox, here are some tools, what questions do you want to ask?'"
Even when doing more data science work, though, the focus is on simplicity.
The analytics team is still working to develop its predictive capabilities, so for
now it's starting small. For example, it recently looked to see if there was a
correlation between macroeconomic data published by the Federal Reserve and
Schneider Electric's sales. The goal was to improve sales forecasting and set
more reasonable goals for the company's salespeople. The analysts could have
brought in additional economic data from outside sources to try to strengthen the
correlation, but they instead prioritized a basic approach.
"We aren't looking to build the best predictive model," Sturgeon said. "We're
starting simple and trying to gain traction."
Setting project requirements at the outset could slow down the analytics process
and limit the insights that get generated, Lampa cautioned, adding that data
scientists have to be able to go where the data takes them. "You can't create
effective models when you're always tied down to predetermined specifications,"
he said.
At the oil and gas drilling company Halliburton, traditional BI is still important,
but there is a growing emphasis on predictive analytics models. One company
official said this trend is going to be the key to differentiating the Houston-based
firm from its competitors and making it more successful.
"You can do as much business intelligence as you want but it's not going to help
you win against your competitors in the long run," said Satyam Priyadarshy,
chief data scientist at Halliburton, in a presentation at the Predictive Analytics
World conference in Boston. He added that predictive modeling is going to be a
"game changer."
But simply doing predictive analytics modeling isn't enough. For Priyadarshy
and other conference presenters, predictive initiatives are only successful when
they are business-oriented and narrowly tailored to address specific problems.
For Priyadarshy, this approach means breaking down some of the data silos that
inevitably spring up. During the process of exploring and drilling a new gas or
oil well, tremendous volumes of data are generated. But they come from several
different departments. For example, data from seismic surveys of sites have
traditionally not been shared with the drilling operations teams, Priyadarshy said.
But there's an obvious need for the crews manning the drills to know what kind
of material they're likely to hit at certain depths.
Priyadarshy said he and his team are working on a homegrown data platform
that would make this data more accessible. The platform is a combination
of Hadoop, SQL, and in-memory database tools. It also includes a data
virtualization tool that allows different teams to access data wherever it is stored.
Doing so allows drilling teams to build predictive analytics models based on data
coming off of drilling sensors and from seismic surveys. These models allow the
drilling teams to predict in real time how fast they should run the drill bit and
how much pressure to apply.
"You want to be clear about what types of problems you're trying to solve," said
Alfred Essa, vice president of analytics at McGraw-Hill Education in Columbus,
Ohio, during a presentation at Predictive Analytics World. "This helps you ask
deeper questions."
McGraw-Hill works with clients -- primarily local school districts and colleges --
to look at their data to predict student performance. McGraw-Hill and the
schools have been able to reliably predict how students are likely to perform in
classes, including which students could fail or drop out, Essa said. But simply
giving this information to schools isn't necessarily helpful. He talks to clients to
make sure they have a plan for how they intend to use the information. Just
telling students they're likely to fail and they need to work harder might actually
backfire, causing them to give up. Schools need to develop curriculums to help
failing students before they do anything with the predictions, he said.
For Essa, the answer to this kind of question often comes during exploratory data
analysis. This early stage of modeling typically involves just looking at the data,
graphing various elements and trying to get a feel for what's in the data. This
stage can help modelers see variables that may point to trends, Essa said. In the
case of predicting student failure, they may be able to see factors that lead
students to fail, enabling schools to address these worries. This action goes
beyond just a predictive model.
"Before you start to do modeling, it's really helpful to pose questions and
interactively get answers back," Essa said.
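A first pass at that kind of exploratory look can be as simple as the pandas sketch below; the file and column names are hypothetical, and the point is only to summarise the data and surface candidate relationships before any model is built.

    import pandas as pd

    # Hypothetical extract of course activity data, examined before any modeling
    df = pd.read_csv("course_activity.csv")  # e.g. columns: logins, assignments_done, final_grade

    # Look at the data first: summary statistics and missing values
    print(df.describe())
    print(df.isna().sum())

    # Simple pairwise correlations can point to variables worth modeling later
    numeric = df.select_dtypes("number")
    print(numeric.corr()["final_grade"].sort_values(ascending=False))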
"When you watch people try to interact with predictions, there are things you
don't even think about," he said. "As soon as you put the word 'confidence' in
there you've lost 90% of the audience."
Today, the Hopper app simply tells users to buy now because prices are as low as
they're likely to get or to wait because a better deal is likely to pop up. There are
some complicated predictive models running behind the scenes analyzing things
like historic price data, prices for given days of the week and month, destinations
and past sales. But Surry said customers don't need to know all these
calculations; they just need to know if they should buy an airline ticket or wait.
Adoption of predictive analytics is still relatively low -- which creates even bigger potential business benefits
for organizations that have invested in predictive analytics software. If a
company's competitors aren't doing predictive analytics, it has "a great
opportunity to get ahead," Gualtieri said.
Predictive analytics projects can also provide those benefits across various
industries, said Eric King, president and founder of The Modeling Agency LLC,
an analytics consulting and training services firm based in Pittsburgh. "Everyone
is overwhelmed with data and starving for information," King noted.
But that doesn't mean it's just a matter of rolling out the technology and
letting analytics teams play around with data. When predictive analytics is done
well, the business benefits can be substantial -- but there are "some mainly
strategic pitfalls" to watch out for, King said. "Many companies are doing
analytics to do analytics, and they aren't pursuing analytics that are measurable,
purposeful, accountable and understandable by leadership."
King agreed that organizations often give data scientists too much responsibility
and leeway in analytics applications.
"They're really not analytics leaders in a lot of cases," he said, adding that data
scientists often aren't very effective at interviewing people from the business side
about their needs or defining analytics project plans. Echoing Gualtieri, King
said a variety of other people, from the business and IT, should also play roles in
predictive analytics initiatives. "When you have the right balance with your
team, you'll end up with a purposeful and thriving analytics process that will
produce results."
In addition, companies need to understand the data they have at their disposal
and make it easily accessible for analysis, which is "no small task," according to
Gualtieri. Without an effective data management strategy, analytics efforts can
grind to a halt: "Data scientists consistently report that a large percentage of their
time is spent in the data preparation stage," he said. "If they can't effectively get
that data together or it takes too much time, opportunity is wasted."
Once those skills are in place and projects are under way, Rexer said a key to
getting good results from predictive analytics techniques is focusing on one
business initiative at a time -- for example, customer retention or getting online
shoppers to add more items to their carts. In some cases, companies think "they
can take all the data, throw it in [predictive models] and magically insights are
going to come out," he said. "Predictive analytics can be very helpful, but it's not
magic. You need to be tightly focused."
Descriptive Analytics
Descriptive modeling is a mathematical process that describes real-world events
and the relationships between factors responsible for them. The process is used
by consumer-driven organizations to help them target their marketing and
advertising efforts.
Big data and business analytics may not be considered ubiquitous quite yet, but
they are getting there.
Total big data revenues over the past several years have grown exponentially and
will approach $50 billion by 2017, according to business technology consultancy
Wikibon. Forbes cited a 2015 Capgemini global study predicting a 56%
increase in big data investments over three years. And Computer Sciences Corp.
estimates that overall data production by 2020 will be 44 times what it was in 2009.
The analytics needed to work with all of that data is growing just as fast. But
analytics comes in many flavors, with the descriptive and predictive varieties
being the biggest and most useful. Yet, of the two, descriptive is embraced far
more by businesses than predictive.
Yet the gap between descriptive and predictive ostensibly isn't as great as it
seems. The resulting data from both methodologies is gathered for the sole
purpose of answering questions, albeit different questions. For descriptive data,
it's "What has happened?" and for predictive data, "What might happen next?"
This decision model leads to the next level, prescriptive analytics: a methodology
for choosing effective courses of action from available options, and a very
different beast from descriptive and predictive analytics. The point is that no
branch of analytics exists in isolation; each methodology feeds into the next and
adds a new layer of sophistication and functionality to the process.
The ultimate goal is to view analytics not simply as a new process or tool to be
implemented, but as an important step in an enterprise's evolution on the way
to perpetual growth and change.
Prescriptive Analytics
Prescriptive analytics is the area of business analytics (BA) dedicated to finding
the best course of action for a given situation.
Prescriptive analytics can also suggest decision options for how to take
advantage of a future opportunity or mitigate a future risk, and illustrate the
implications of each decision option. In practice, prescriptive analytics can
continually and automatically process new data to improve the accuracy of
predictions and provide better decision options.
Next in line is prescriptive analytics: the science of outcomes. It's less intuitive
and much harder to embrace, yet it feeds the enterprise the kind of news we don't
necessarily want to hear. Descriptive and predictive results simply provide better
data for making decisions, which is always a good thing and an important refinement of
what is already happening. But prescriptive results take it a step further: They tell
us what to do. That makes prescriptive at least as important as its siblings in
moving the enterprise forward.
Prescriptive models don't just inform those involved in the decision-making
process, they are the decision-making process. They articulate the best outcome,
which can create friction among those who aren't comfortable relinquishing their
decision-making responsibilities to a machine.
Business rules defining the enterprise's operations serve to gauge the impact of
prescriptive recommendations on operations, efficiency and the bottom
line. Projected outcomes are brought in line with institutional priorities, values
and goals. The rules are based on established policy, best practices, internal and
external constraints, local and global objectives, and so on. They determine to
what degree prescriptive recommendations and anticipated outcomes truly work.
The rules must be dynamic; organic; and, to some degree, fluid. The entire point
of an analytics-based institutional culture is acquiescence to the objective reality
of real-world data. A corporate self-image based on that data will necessarily
evolve. It follows that the business rules driving prescriptive analytics must also
evolve. Therefore, the prescriptive process and the successful outcomes it
delivers will feed back into the rules and steadily refine them.
An electronics manufacturer in southern Indiana put this idea to work in
selecting its optimum long-term customer contracts. Though its headquarters are
in the U.S., most of its actual manufacturing facilities are located on other
continents. Capacity to manufacture and deliver in those other countries is
governed by a number of risk factors involving fluctuating availability of raw
materials, economic conditions affecting logistics and employee turnover. So the
business rules applied to the company's contract evaluation process are critical to
the accuracy of analysis and must be adjusted frequently.
The healthcare industry has been a leader in building prescriptive models that
take account of the wider environment. Service providers' need for efficiency is
greater than ever because of the massive changes in healthcare economics in recent years.
Capacity planning is a key factor in optimizing logistics and resources for
service delivery. The models incorporate vast amounts of environmental data,
including highly granular demographics, trends in health by region, and
economic conditions at both national and regional levels. By using these models,
many healthcare providers are adjusting near-term and long-term investment
plans for optimal service delivery.
Prescriptive analytics closes the big data loop. It's a natural endpoint for the
descriptive and predictive processes that precede it. Whatever the hype and
hoopla surrounding prescriptive models, their success depends on a combination of
mathematical innovation, mastery of data and old-fashioned hard work.
Diagnostic Analytics
Diagnostic analytics uses historical data to answer the question of why
something happened, looking for dependencies and patterns in the data
associated with a particular problem. Companies favour this kind of analysis
because it gives good insight into a problem, provided they keep detailed
historical information at their disposal; otherwise, data would have to be
collected separately for every problem, which would be very time-consuming.
Common techniques used for Diagnostic Analytics are:
Data discovery
Data mining
Correlations
Data Discovery
“We are drowning in information but starved for knowledge,” according to
best-selling author John Naisbitt. Today’s businesses can collect piles of
information on everything from customer buying patterns and feedback to
supplier lead times and marketing efforts. Yet it is nearly impossible to
draw value and truth from the massive amount of data your business
collects without a data discovery system in place.
Data discovery is a term related to business intelligence technology. It is the
process of collecting data from your various databases and silos, and
consolidating it into a single source that can be easily and instantly evaluated.
Once your raw data is converted, you can follow your train of thought by drilling
down into the data with just a few clicks. Once a trend is identified, the software
empowers you to unearth the contributing factors.
For instance, BI enables you to explore the data by region, different employees,
product type, and more. In a matter of seconds, you have access to actionable
insights to make rapid, fact-based decisions in response to your discoveries.
Without BI, discovering a trend is usually a case of coincidence.
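A stripped-down version of that drill-down, assuming a hypothetical sales extract with region, product type and rep columns, could be expressed in pandas as follows:

    import pandas as pd

    sales = pd.read_csv("sales.csv")  # hypothetical columns: region, product_type, rep, revenue

    # Top-level view: revenue by region
    by_region = sales.groupby("region")["revenue"].sum().sort_values(ascending=False)
    print(by_region)

    # Drill down into one region by product type and sales rep
    west = sales[sales["region"] == "West"]
    print(west.groupby(["product_type", "rep"])["revenue"].sum())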
With data discovery, the user searches for specific items or patterns in a data
set. Visual tools make the process fun, easy-to-use, swift, and intuitive.
Visualization of data now goes beyond traditional static reports. BI
visualizations have expanded to include geographical maps, pivot-tables, heat
maps, and more, giving you the ability to create high-fidelity presentations of
your discoveries.
Discover trends you did not know were there
With data discovery, executives are often shocked to discover trends they didn’t
know were there. Michael Smith of the Johnston Corporation had this to say
after implementing Phocas:
"Five minutes into the demo, I had found items that didn't have the margin I was
expecting, customers that didn't have the profitability I was expecting and
vendors that weren't performing the way I expected. I realised that we were onto
something that would be very impactful to our business."
These capabilities allow companies to spot unfavourable trends before they
become a problem and to take action to avoid losses.
Data Mining
Data mining is the process of sorting through large data sets to identify patterns
and establish relationships to solve problems through data analysis. Data mining
tools allow enterprises to predict future trends.
In data mining, association rules are created by analyzing data for frequent
if/then patterns, then using the support and confidence criteria to locate the most
important relationships within the data. Support is how frequently the items
appear together in the database, while confidence is how often the if/then
statements prove to be true.
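The calculation behind those two criteria is straightforward; the short Python sketch below computes support and confidence for a single hypothetical if/then rule over a toy set of transactions.

    # Hypothetical market-basket transactions
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"butter", "milk"},
        {"bread", "butter", "jam"},
    ]

    def support(itemset):
        # How frequently the items appear together across all transactions
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        # Of the transactions containing the antecedent, how many also contain the consequent
        return support(antecedent | consequent) / support(antecedent)

    # Rule: "if bread then butter"
    print(support({"bread", "butter"}))       # 0.6
    print(confidence({"bread"}, {"butter"}))  # 0.75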
Other data mining parameters include Sequence or Path
Analysis, Classification, Clustering and Forecasting. Sequence or Path Analysis
parameters look for patterns where one event leads to another later event. A
Sequence is an ordered list of sets of items, and it is a common type of data
structure found in many databases. A Classification parameter looks for new
patterns, and might result in a change in the way the data is organized.
Classification algorithms predict variables based on other factors within the
database.
Clustering parameters find and visually document groups of facts that were
previously unknown. Clustering groups a set of objects and aggregates them
based on how similar they are to each other.
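For illustration, the scikit-learn sketch below clusters a handful of invented customer records into groups based on how similar their spend and visit patterns are; the features and cluster count are assumptions chosen for the example.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer features: [annual_spend, visits_per_month]
    customers = np.array([
        [200,  1], [220,  2], [250,  1],   # low-spend, infrequent
        [900, 10], [950, 12], [870,  9],   # high-spend, frequent
        [500,  5], [520,  6],              # mid-range
    ])

    # Group the customers into three clusters based on similarity
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)

    print(kmeans.labels_)           # cluster assignment for each customer
    print(kmeans.cluster_centers_)  # the "typical" customer in each group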
There are different ways a user can implement clustering, which differentiate
the various clustering models. Forecasting parameters within data mining can
discover patterns in data that can lead to reasonable predictions about the future,
also known as predictive analysis.
Specific data mining benefits vary depending on the goal and the industry. Sales
and marketing departments can mine customer data to improve lead conversion
rates or to create one-to-one marketing campaigns. Data mining information on
historical sales patterns and customer behaviors can be used to build prediction
models for future sales, new products and services.
Companies in the financial industry use data mining tools to build risk models
and detect fraud. The manufacturing industry uses data mining tools to improve
product safety, identify quality issues, manage the supply chain and improve
operations.
Text Mining
Text mining (text analytics) is the process of exploring and analyzing large
amounts of unstructured text data aided by software that can identify concepts,
patterns, topics, keywords and other attributes in the data. It's also known as text
analytics, although some people draw a distinction between the two terms; in
that view, text analytics refers to the application that uses text mining techniques
to sort through data sets.
Text mining has become more practical for data scientists and other users due to
the development of big data platforms and deep learning algorithms that
can analyze massive sets of unstructured data.
Mining and analyzing text helps organizations find potentially valuable business
insights in corporate documents, customer emails, call center logs, verbatim
survey comments, social network posts, medical records and other sources of
text-based data. Increasingly, text mining capabilities are also being incorporated
into AI chatbots and virtual agents that companies deploy to provide automated
responses to customers as part of their marketing, sales and customer service
operations.
As a result, text mining tools are now better equipped to uncover underlying
similarities and associations in text data, even if data scientists don't have a good
understanding of what they're likely to find at the start of a project. For example,
an unsupervised model could organize data from text documents or emails into a
group of topics without any guidance from an analyst.
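A minimal version of that kind of unsupervised grouping, using scikit-learn's topic-modeling tools on a few invented documents, might look like this; no labels are supplied, and the topics emerge from word co-occurrence alone.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # A few hypothetical documents; a real corpus would be much larger
    docs = [
        "invoice payment overdue account balance",
        "password reset login account locked",
        "payment failed card declined invoice",
        "cannot login password error account",
    ]

    # Convert the text to word counts, then fit an unsupervised topic model
    counts = CountVectorizer().fit(docs)
    X = counts.transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    # Show the top words for each discovered topic -- no guidance was provided
    words = counts.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [words[j] for j in topic.argsort()[-4:]]
        print("topic", i, top)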
Applications of text mining
Sentiment analysis is a widely used text mining application that can track
customer sentiment about a company. Also known as opinion mining, sentiment
analysis mines text from online reviews, social networks, emails, call center
interactions and other data sources to identify common threads that point to
positive or negative feelings on the part of customers. Such information can be
used to fix product issues, improve customer service and plan new marketing
campaigns, among other things.
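As a hedged illustration of the supervised flavour of sentiment analysis, the sketch below trains a simple TF-IDF plus logistic regression model on a few invented, labelled reviews and then scores new feedback; real systems use far larger training sets and more sophisticated models.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Tiny hypothetical set of labelled reviews (1 = positive, 0 = negative)
    reviews = [
        "great product, fast delivery",
        "terrible support, very disappointed",
        "love it, works perfectly",
        "broke after a week, waste of money",
    ]
    labels = [1, 0, 1, 0]

    # TF-IDF features feeding a simple classifier
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(reviews, labels)

    # Score new, unlabelled feedback
    print(model.predict(["really great, love the fast delivery",
                         "disappointed, support was terrible"]))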
Other common text mining uses include screening job candidates based on the
wording in their resumes, blocking spam emails, classifying website content,
flagging insurance claims that may be fraudulent, analyzing descriptions of
medical symptoms to aid in diagnoses, and examining corporate documents as
part of electronic discovery processes. Text mining software also offers
information retrieval capabilities akin to what search engines and enterprise
search platforms provide, but that's usually just an element of higher level text
mining applications, and not a use in and of itself.
Chatbots answer questions about products and handle basic customer service
tasks; they do so by using natural language understanding (NLU) technology, a
subcategory of NLP that helps the bots understand human speech and written
text so they can respond appropriately.
Text mining can also help predict customer churn, enabling companies to take
action to head off potential defections to business rivals as part of their
marketing and customer relationship management programs. Fraud detection,
risk management, online advertising and web content management are other
functions that can benefit from the use of text mining tools.
In healthcare, the technology may be able to help diagnose illnesses and medical
conditions in patients based on the symptoms they report.
Text mining challenges and issues
Text mining can be challenging because the data is often vague, inconsistent and
contradictory. Efforts to analyze it are further complicated by ambiguities that
result from differences in syntax and semantics, as well as the use of slang,
sarcasm, regional dialects and technical language specific to individual vertical
industries. As a result, text mining algorithms must be trained to parse such
ambiguities and inconsistencies when they categorize, tag and summarize sets of
text data.
In addition, the deep learning models used in many text mining applications
require large amounts of training data and processing power, which can make
them expensive to run. Inherent bias in data sets is another issue that can lead
deep learning tools to produce flawed results if data scientists don't recognize the
biases during the model development process.
There's also a lot of text mining software to choose from. Dozens of commercial
and open source technologies are available, including tools from major software
vendors such as IBM, Oracle, SAS, SAP and Tibco.
Web Mining
In customer relationship management (CRM), Web mining is the integration of
information gathered by traditional data mining methodologies and techniques
with information gathered over the World Wide Web. (Mining means extracting
something useful or valuable from a baser substance, such as mining gold from
the earth.) Web mining is used to understand customer behavior, evaluate the
effectiveness of a particular Web site, and help quantify the success of a
marketing campaign.
Web mining allows you to look for patterns in data through content mining,
structure mining, and usage mining. Content mining is used to examine data
collected by search engines and Web spiders. Structure mining is used to
examine data related to the structure of a particular Web site and usage mining is
used to examine data related to a particular user's browser as well as data
gathered by forms the user may have submitted during Web transactions.
The information gathered through Web mining is evaluated (sometimes with the
aid of software graphing applications) by using traditional data
mining parameters such as clustering and classification, association, and
examination of sequential patterns.
Data is the lifeblood of modern businesses. Increasingly, getting the most out of
our organization's data with accurate insight and understanding makes a real
difference to business success. As a result, the data scientist has become a
critical hire for companies of all sizes, whether the job is a specialized position
in IT or embedded in a business unit.
Nevertheless, it isn't always clear what we mean by the term data scientist.
A highly qualified data analyst? Someone with a scientific background who
happens to work with data?
Certainly, data scientists typically are experienced in statistics and scripting, and
they often have a technical background, rather than a scientific, liberal arts or
business one. But the critical element of data science -- which does make it a
science rather than just a business practice -- is the importance of process and
experiments.
You likely remember learning about the scientific method in high school.
Scientists come up with theories and hypotheses. They design experiments to
test those hypotheses and then either confirm, reject or, more often, refine the
theory.
Basic business intelligence and reporting typically doesn't follow this process.
Instead, BI and business analysts sift, sort, tabulate and visualize data in order to
support a business case. For example, they may show graphically that sales of
our company's products in the western region are falling and also that this region
includes younger customers compared to other areas. From there, they could
make the case that we need to change our product, marketing or sales strategy
for that region. In BI, the most persuasive data visualization often carries the
argument.
The data scientist takes a different approach. Let's continue to use this sales
example to show how the data science process works, in the following six steps.
In addition, we could come up with a few related hypotheses, such as: It's not
simply that customers in the western region are younger, but also that younger
people typically earn less money and average income is lower there than it is in
other regions.
You can see already that the data scientist must be able to think through different
implications of related hypotheses in order to design the right data science
experiments. Just asking one direct question when analyzing data generally
proves less helpful than asking several. And to get the best results, data scientists
should work with business experts to tease out edge cases and counterexamples
that can help refine their hypotheses.
The BI team commonly works with data from a data warehouse that has been
cleaned up, transformed and modeled to reflect business rules and how analysts
have looked at the data in the past. The data scientist, on the other hand,
generally wants to look at data in its raw state, before any rules are applied to it.
Also, data science applications often require more data than what is stored in a
warehouse.
In our example, the company's data warehouse likely includes various details
about customers but perhaps not how they paid for products: by credit card,
cash, online payment, etc. Or we may find that, because data warehouse models
can be cumbersome to modify, the putative system of record is a little out of date
and doesn't yet include newer forms of payment -- exactly the kinds that are
attractive to younger people.
So, the data scientist needs to work with the IT team to get access to the most
detailed data sources that are available and pull together the required data. This
may be business data sourced from ERP, CRM or other operational systems, but
it increasingly also includes web logs, streaming data from IoT devices and
many other types of data. The raw data usually will be extracted and loaded -- or
ingested, as the jargon has it -- into a data lake. For simplicity and convenience,
though, the data scientist most often only works against a sample data set at this
early stage.
And this isn't to say that data scientists do no data preparation work at all. For
sure, they typically don't apply business models or predefined business rules to
the raw data in the manner of a data warehouse developer. But they do spend a
lot of time profiling and cleansing data -- for example, deciding how to handle
missing or outlier values -- and transforming it into structures that are
appropriate for specific machine learning algorithms and statistical models.
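The sketch below shows, with hypothetical file and column names, two typical decisions from that stage: imputing missing values and flagging (rather than silently dropping) outliers using the interquartile range.

    import pandas as pd

    df = pd.read_csv("raw_orders.csv")  # hypothetical raw extract

    # Handle missing values: impute a numeric column, drop rows missing the key
    df["order_value"] = df["order_value"].fillna(df["order_value"].median())
    df = df.dropna(subset=["customer_id"])

    # Flag outliers using the interquartile range rather than silently dropping them
    q1, q3 = df["order_value"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["is_outlier"] = (df["order_value"] < q1 - 1.5 * iqr) | (df["order_value"] > q3 + 1.5 * iqr)

    print(df["is_outlier"].value_counts())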
Today, there are numerous data science and machine learning tools that can try
different algorithms and approaches and select the best ones for analytics
applications, without much human intervention. You more or less point the tool
at the data, specify the variables you're interested in and leave it to run. Often
described as automated machine learning platforms, these systems are largely
marketed to business users who function as citizen data scientists, but they're just
as popular with skilled data scientists, who use them to investigate more models
than they could do manually.
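The core idea is easy to sketch by hand: try several candidate algorithms, score each with cross-validation and keep the best. The example below does exactly that with scikit-learn on synthetic data; it is a simplification of what automated machine learning platforms do, not a description of any particular product.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic data standing in for a prepared analytics data set
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=0),
        "random_forest": RandomForestClassifier(random_state=0),
    }

    # Evaluate each candidate with cross-validation and keep the best performer
    scores = {name: cross_val_score(model, X, y, cv=5).mean()
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    print(scores)
    print("best model:", best)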
Even the best model can be improved with some tuning and tweaking of
variables. Sometimes, the data scientist may even want to go back and shape the
data a little differently -- perhaps removing outliers that were left during the
initial data preparation stage. For example, I've seen many cases where the
original data was collected with default values that were convenient but wrong
and potentially misleading.
The results? Well, I can't tell you what they'll be. But, with an interesting
hypothesis, good data and a carefully built model, a data scientist should be able
to find something useful to the business. You may surprise yourself, even at this
stage, with an unexpected discovery. Most often, you'll either confirm or reject
your original hypothesis -- which, of course, is what you set out to do in the first
place.
Going back to our sales example, let's assume the model we decide to run proves
that, yes, younger people are less likely to buy our products -- but with some
important twists, which leads us to the next step.
What we have, however, is a mass of statistics from our model that the business
users may not understand. Perhaps, in general, younger people are indeed less
likely to buy our product, and their average purchase is lower than that of older
customers. But some young people buy a lot, so within that group a handful of
heavy buyers pulls the mean sale value well above the median.
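A tiny, hypothetical illustration in R of why those two numbers can tell different stories (the purchase amounts are invented):
# Most younger customers buy small amounts, a couple buy a lot
young_sales <- c(12, 15, 9, 14, 11, 10, 250, 300)
mean(young_sales)    # pulled up sharply by the two heavy buyers
median(young_sales)  # still reflects the typical small purchase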
This final step isn't always straightforward. First, updating the analytical model
with fresh data on an ongoing basis may require a different approach to data
loading. What we did manually as an experiment may not be efficient in
practice. Partly for this reason, another role has emerged in many businesses: the
data engineer, whose responsibilities include working closely with data scientists
to make models production-ready.
We should also recognize that, in our example, buying habits change over time,
perhaps with the economy or changes in taste. So, we have to keep the model up
to date and perhaps tune it again in the future. That may also be one of a data
engineer's tasks, although the data scientist must rework a model if it drifts too
much from its original accuracy.
Finally, the model that works best as an experiment may prove expensive to run
in practice. With data analysis increasingly done in the cloud, where we pay for
the use of computing and storage, we may find that some changes make the
model slightly less accurate but cheaper to run. A data engineer can also help
with that, but the trade-off between accuracy and cost can be a tricky choice.
At its best, data science involves widespread collaboration across business and
IT domains and adds new value to many different facets of an organization's
work.
Data Preparation
The specifics of the data preparation process vary by industry, organization and
need, but the framework remains largely the same.
• Gather data
The data preparation process begins with finding the right data. This can come
from an existing data catalog or can be added ad-hoc.
• Discover and assess data
After collecting the data, it is important to discover each dataset. This step is
about getting to know the data and understanding what has to be done before the
data becomes useful in a particular context.
Data Profiling
Data profiling is the process of examining, analyzing and reviewing data to
collect statistics surrounding the quality and hygiene of the dataset. Data quality
refers to the accuracy, consistency, validity and completeness of data. Data
profiling may also be known as data archeology, data assessment, data discovery
or data quality analysis.
The first step of data profiling is gathering one or more data sources and
their metadata for analysis. The data is then cleaned to unify structure, eliminate
duplications, identify interrelationships and find any anomalies. Once the data is
clean, different data profiling tools will return various statistics to describe the
dataset. This could include the mean, minimum/maximum value, frequency,
recurring patterns, dependencies or data quality risks.
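A minimal sketch of that kind of profiling in R (the small data frame and its column names are hypothetical; real profiling tools go much further):
# Profile a data frame: basic statistics, missing values and duplicates
profile <- function(df) {
  list(
    summary    = summary(df),                               # min/max, quartiles, mean per column
    missing    = colSums(is.na(df)),                        # missing values per column
    duplicates = sum(duplicated(df)),                       # fully duplicated rows
    distinct   = sapply(df, function(x) length(unique(x)))  # cardinality per column
  )
}
profile(data.frame(age = c(25, 31, NA, 31), zip = c("45202", "45202", "10001", NA)))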
Profiling tools evaluate the actual content, structure and quality of the data by
exploring relationships that exist between value collections both within and
across data sets. Vendors that offer software and tools that can automate the data
profiling process include Informatica, Oracle and SAS.
Data profiling returns a high-level overview of the data and delivers several
benefits.
Data profiling can be implemented in a variety of use cases where data quality is
important. For example, projects that involve data warehousing or business
intelligence may require gathering data from multiple disparate systems or
databases for one report or analysis. Applying the data profiling process to these
projects can help identify potential issues and corrections that need to be made
in ETL processing before moving forward.
Data Scrubbing
Data scrubbing, also called data cleansing, is the process of amending or
removing data in a database that is incorrect, incomplete, improperly formatted,
or duplicated. An organization in a data-intensive field like banking, insurance,
retailing, telecommunications, or transportation might use a data scrubbing tool
to systematically examine data for flaws by using rules, algorithms, and look-up
tables. Typically, a database scrubbing tool includes programs that are capable of
correcting a number of specific types of mistakes, such as adding missing zip
codes or finding duplicate records. Using a data scrubbing tool can save
a database administrator a significant amount of time and can be less costly than
fixing errors manually.
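A minimal R sketch of the kind of rule-based fixes such a tool applies (the customer records and the ZIP lookup table below are hypothetical):
customers <- data.frame(
  name = c("Ann Lee", "ann lee", "Bob Roy"),
  city = c("Cincinnati", "Cincinnati", "Dayton"),
  zip  = c("45202", "45202", NA),
  stringsAsFactors = FALSE
)
# Rule 1: standardize case and whitespace so near-duplicates can be detected
customers$name <- tolower(trimws(customers$name))
# Rule 2: fill missing ZIP codes from a city lookup table
zip_lookup  <- c(Cincinnati = "45202", Dayton = "45402")
missing_zip <- is.na(customers$zip)
customers$zip[missing_zip] <- zip_lookup[customers$city[missing_zip]]
# Rule 3: drop exact duplicate records
customers <- customers[!duplicated(customers), ]
customers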
The first step is to Extract the data. Extracting data is the process of identifying
and reading data from one or more source systems, which may be databases,
files, archives, ERP, CRM or any other viable source of useful data.
The second step for ELT is to Load the extracted data. Loading is the process
of adding the extracted data to the target database.
The third step is to Transform the data. Data transformation is the process of
converting data from its source format to the format required for analysis.
Transformation is typically based on rules that define how the data should be
converted for usage and analysis in the target data store. Although transforming
data can take many different forms, it frequently involves converting coded data
into usable data using code and lookup tables.
Examples of transformations include converting data types, standardizing date
formats and replacing coded values with descriptive ones via lookup tables, as
sketched below.
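For instance, a coded field might be decoded against a lookup table during transformation. A hedged R sketch (the payment codes and column names are invented for illustration):
# Source extract with a coded payment-method column
orders <- data.frame(order_id = 1:4, pay_code = c("CC", "CA", "AP", "CC"))
# Lookup table that defines what each code means
pay_lookup <- data.frame(pay_code   = c("CC", "CA", "AP"),
                         pay_method = c("credit card", "cash", "online payment"))
# Transform: replace the code with its descriptive value
transformed <- merge(orders, pay_lookup, by = "pay_code")[, c("order_id", "pay_method")]
transformed[order(transformed$order_id), ]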
With an ELT approach, a data extraction tool is used to obtain data from a source
or sources, and the extracted data is stored in a staging area or database. Any
required business rules and data integrity checks can be run on the data in the
staging area before it is loaded into the data warehouse. All data transformations
occur in the data warehouse after the data is loaded.
With ETL, the raw data is not available in the data warehouse because it is
transformed before it is loaded. With ELT, the raw data is loaded into the data
warehouse (or data lake) and transformations occur on the stored data.
Staging areas are used for both ELT and ETL, but with ETL the staging areas are
built into the ETL tool being used. With ELT, the staging area is in a database
used for the data warehouse.
ELT is most useful for processing the large data sets required for business
intelligence (BI) and big data analytics. Non-relational and unstructured data is
more conducive to an ELT approach because the data is copied "as is" from the
source. Applying analytics to unstructured data typically uses a "schema on
read" approach as opposed to the traditional "schema on write" used
by relational databases.
Loading data without first transforming it can be problematic if you are moving
data from a non-relational source to a relational target because the data will have
to match a relational schema. This means it will be necessary to identify and
massage data to support the data types available in the target database.
Data type conversion may need to be performed as part of the load process if the
source and target data stores do not support all the same data types. Such
problems can also occur when moving data from one relational database
management system (DBMS) to another, such as from Oracle to Db2, because the
data types supported differ from DBMS to DBMS.
ETL should be considered as a preferred approach over ELT when there is a need
for extensive data cleansing before loading the data to the target system, when
there are numerous complex computations required on numeric data and when
all the source data comes from relational systems.
The following comparison summarizes the differences between ELT and ETL.
Order of processes -- ELT: Extract, Load, Transform. ETL: Extract, Transform, Load.
Flexibility -- ELT: Because transformation is not dependent on extraction, ELT is more flexible than ETL for adding more extracted data in the future. ETL: More upfront planning should be conducted to ensure that all relevant data is being integrated.
Administration -- ELT: More administration may be required, as multiple tools may need to be adopted. ETL: Typically, a single tool is used for all three stages, perhaps simplifying administration effort.
Development time -- ELT: With a more flexible approach, development time may expand depending upon requirements and approach. ETL: ETL requires upfront design planning, which can result in less overhead and development time because only relevant data is processed.
End users -- ELT: Data scientists and advanced analysts. ETL: Users reading reports and SQL coders.
Complexity of transformation -- ELT: Transformations are coded by programmers (e.g., using Java) and must be maintained like any other program. ETL: Transformations are coded in the ETL tool by data integration professionals experienced with the tool.
Hardware requirements -- ELT: ELT tools typically do not require additional hardware, instead using existing compute power for transformations. ETL: It is common for ETL tools to require specific hardware with their own engines to perform transformations.
Skills -- ELT: ELT relies mostly on native DBMS functionality, so existing skills can be used in most cases. ETL: ETL requires additional training and skills to learn the tool set that drives the extraction, transformation and loading.
Maturity -- ELT: ELT is a relatively new practice, and as such there is less expertise and fewer best practices available. ETL: ETL is a mature practice that has existed since the 1990s; there are many skilled technicians, best practices exist, and there are many useful ETL tools on the market.
Data stores -- ELT: Mostly Hadoop, perhaps a NoSQL database; rarely a relational database. ETL: Almost exclusively relational databases.
Use cases -- ELT: Best for unstructured and nonrelational data; ideal for data lakes; can also work for homogeneous relational data; well-suited for very large amounts of data. ETL: Best for relational and structured data; better for small to medium amounts of data.
Benefits of ELT
One of the main attractions of ELT is the reduction in load times relative to the
ETL model. Taking advantage of the processing capability built into a data
warehousing infrastructure reduces the time that data spends in transit and is
usually more cost-effective. ELT can be more efficient by utilizing the compute
power of modern data storage systems.
When you use ELT, you move the entire data set as it exists in the source
systems to the target. This means that you have the raw data at your disposal in
the data warehouse, in contrast to the ETL approach where the raw data is
transformed before it is loaded to the data warehouse. This flexibility can
improve data analysis, enabling more analytics to be performed directly within
the data warehouse without having to reach out to the source systems for the
untransformed data.
Using ELT can make sense when adopting a big data initiative for analytics.
Big data often involves a large amount of data, as well as a wide variety of data,
which is more suitable for ELT.
Uses of ELT
ELT is often used in the following cases:
when the data is structured and the source and target databases are the
same type (e.g., Oracle source and target);
when the data is unstructured and massive, such as when processing and
correlating data from log files and sensors;
when the data is relatively simple, but there are large amounts of it;
when there is a plan to use machine learning tools to process the data
instead of traditional SQL queries; and
when a schema-on-read approach is being applied.
Users can look for tools that perform both ETL and ELT, as most organizations
are likely to need both data integration techniques.
A data store can be useful for managing a target data mart, data warehouse
and/or data lake. For an ELT approach, NoSQL database management
systems and Hadoop are viable candidates, as are purpose-built data warehouse
appliances. In some cases, a traditional relational DBMS may be appropriate.
Tools to Mine Big Data Analytics
Before it deployed a Hadoop cluster five years ago, retailer Macy's Inc. had big
problems analyzing all of the sales and marketing data its systems were
generating. And the problems were only getting bigger as Macy's pushed
aggressively to increase its online business, further ratcheting up the data
volumes it was looking to explore.
The company's traditional data warehouse architecture had severe processing
limitations and couldn't handle unstructured information, such as text. Historical
data was also largely inaccessible, typically having been archived on tapes that
were shipped to off-site storage facilities. Data scientists and other analysts
"could only run so many queries at particular times of the day," said Seetha
Chakrapany, director of marketing analytics and customer relationship
management (CRM) systems at Macy's. "They were pretty much shackled. They
couldn't do their jobs."
The Hadoop system has alleviated the situation, providing a big data analytics
architecture that also supports basic business intelligence (BI) and reporting
processes. Going forward, the cluster "could truly be an enterprise data analytics
platform" for Macy's, Chakrapany said. Already, along with the analytics teams
using it, thousands of business users in marketing, merchandising, product
management and other departments are accessing hundreds of BI
dashboards that are fed to them by the system.
But there's a lot more to the Macy's big data environment than the Hadoop
cluster alone. At the front end, for example, Macy's has deployed a variety of
analytics tools to meet different application needs. For statistical analysis, the
Cincinnati-based retailer uses SAS and Microsoft's R Server, which is based on
the R open source statistical programming language.
Several other tools provide predictive analytics, data mining and machine
learning capabilities. That includes H2O, Salford Predictive Modeler, the
Apache Mahout open source machine learning platform and KXEN -- the latter
an analytics technology that SAP bought three years ago and has since folded
into its SAP BusinessObjects Predictive Analytics software. Also in the picture
at Macy's are Tableau Software's data visualization tools and AtScale's BI on
Hadoop technology.
A better way to analyze big data
All the different tools are key elements in making effective use of the big data
analytics architecture, Chakrapany said in a presentation and follow-up interview
at Hadoop Summit 2016 in San Jose, Calif. Automating the advanced analytics
process through statistical routines and machine learning is a must, he noted.
Similar scenarios are increasingly playing out at other organizations, too. As big
data platforms such as Hadoop, NoSQL databases and the Spark processing
engine become more widely adopted, the number of companies deploying
advanced analytics tools that can help them take advantage of the data flowing
into those systems is also on the rise.
A TDWI survey conducted in the second half of 2015 also found increasing
plans to use predictive analytics software to bolster business operations. In that
case, 87% of 309 BI, analytics and data management professionals said their
organizations were already active users of the technology or expected to
implement it within three years. Other forms of advanced analytics, such as
what-if simulations and prescriptive analytics, are similarly in line for
increased usage, according to a report on the survey, which was published last
December (see "Predicting High Growth" chart).
Predictive analytics use is on the rise.
Progressive Casualty Insurance Co. is another company that's already there. The
Mayfield Village, Ohio-based insurer uses a Hadoop cluster partly to power its
Snapshot program, which awards policy discounts to safe drivers based on
operational data collected from their vehicles through a device that plugs into the
on-board diagnostics port.
The predictive analytics and machine learning capabilities are "huge," said
Pawan Divakarla, Progressive's data and analytics business leader. "You have so
much data, and you have fancier and fancier models for analyzing it. You need
something to assist you, to see what works."
"Even after 10 years, we're still uncovering benefits," said Andy Feng, vice
president in charge of Yahoo's big data and machine learning architecture. Feng
estimated that, over the past three years, he has spent about 95% of his time at
work focusing on machine learning tools and applications. In the past, the
automated algorithms that could be built and run with existing machine learning
technologies "weren't capable of leveraging huge data sets on Hadoop clusters,"
Feng said. "The accuracy wasn't that good."
"We always did machine learning, but we did it in a constrained fashion, so the
results were limited," added Sumeet Singh, senior director of product
development for cloud and big data platforms at Yahoo. However, he and Feng
said things have changed for the better in recent years, and in a big way. "We've
seen an amazing resurgence in artificial intelligence and machine learning, and
one of the reasons is all the data," Singh noted.
For example, Yahoo is now running a machine learning algorithm that uses
a semantic analysis process to better match paid ads on search results pages to
the search terms entered by web users; it has led to a 9% increase in revenue per
search, according to Feng. Another machine learning application lets users of
Yahoo's Flickr online photo and video service organize images based on their
visual content instead of the date on which they were taken. The algorithm can
also flag photos as not suitable for viewing at work to help users avoid
potentially embarrassing situations in the office, Feng said.
These new applications were made possible partly through the addition of
graphics processing units to Hadoop cluster nodes, Feng said; the GPUs do image
processing that conventional CPUs can't handle. Yahoo also added Spark to the
big data analytics architecture to take over some of the processing work.
The appeal of the R language has gradually spread out of academia into business
settings, as many data analysts who trained on R in college prefer to continue
using it rather than pick up a new tool with which they are inexperienced.
Users can also write their own functions. The environment allows users to
combine individual operations, such as joining separate data files into a single
document, pulling out a single variable and running a regression on the resulting
data set, into a single function that can be used over and over.
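For example (a hedged sketch; the file names, the customer_id key and the column names are hypothetical), a user might wrap a join, a variable selection and a regression into one reusable function:
# Combine three steps -- join two files, pull out a variable, run a regression --
# into a single reusable function
sales_vs_age <- function(customers_file, orders_file) {
  customers <- read.csv(customers_file)      # assumed to contain customer_id and age
  orders    <- read.csv(orders_file)         # assumed to contain customer_id and amount
  combined  <- merge(customers, orders, by = "customer_id")
  lm(amount ~ age, data = combined)          # regress purchase amount on age
}
# fit <- sales_vs_age("customers.csv", "orders.csv")
# summary(fit)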
Because it's been around for many years and has been popular throughout its
existence, the language is fairly mature. Users can download add-on packages
that enhance the basic functionality of the language. These packages enable
users to visualize data, connect to external databases, map data geographically
and perform advanced statistical functions. There is also a popular user
interface called RStudio, which simplifies coding in the R language.
The R language has been criticized for delivering slow analyses when applied to
large data sets. This is because the language utilizes single-threaded processing,
which means the basic open source version can only utilize one CPU at a time.
By comparison, modern big data analytics thrives on parallel data processing,
simultaneously leveraging dozens of CPUs across a cluster of servers to process
large data volumes quickly.
In addition to its single-threaded processing limitations, the R programming
environment is an in-memory application. All data objects are stored in a
machine's RAM during a given session. This can limit the amount of data R is
able to work on at one time.
Several software vendors have added support for the R programming language
to their offerings, allowing R to gain a stronger footing in the modern big data
realm. Vendors including IBM, Microsoft, Oracle, SAS Institute, TIBCO and
Tableau, among others, include some level of integration between their analytics
software and the R language. There are also R packages for popular open source
big data platforms, including Hadoop and Spark.
For example, to simulate 100 values drawn from a normal distribution with a
mean of 65.342 and a standard deviation of 2.1, you need just one line:
rnorm(100, 65.342, 2.1)
And from that, R will generate the data you're looking for.
Now, for many people, that might sound unbelievably boring. But the power of
R analytics lies in the application of the language's abilities: It's a perfect tool for
numerical simulations. For example, I recently wanted to perform a Monte Carlo
simulation of a scoring system called the Net Promoter Score (NPS). Monte
Carlo simulations are a vital part of analytics; they allow you to model the
behavior of complex systems in order to be able to understand them. Used by
analytics professionals for many years, they involve random sampling of sets of
numbers thousands or even millions of times.
R excels at creating and running Monte Carlo simulations, and the NPS
simulation described above took a mere nine lines of code. I would love to tell
you that I'm a hero because I managed to do it in nine lines, but that really isn't
the case. The R programming language is simply exceptionally good at
generating huge sets of numbers and then manipulating them. It's also good for
prototyping big data manipulations.
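The original nine lines aren't reproduced here; the sketch below is a rough, hypothetical version of the same idea, with invented response probabilities and sample size, just to show the shape of a Monte Carlo NPS simulation in R:
# Assumed probabilities of each score 0-10 for a hypothetical product
probs <- c(0.02, 0.02, 0.03, 0.03, 0.05, 0.08, 0.10, 0.15, 0.20, 0.17, 0.15)
# Simulate one survey of n respondents and compute its Net Promoter Score
nps_once <- function(n) {
  scores <- sample(0:10, n, replace = TRUE, prob = probs)
  100 * (mean(scores >= 9) - mean(scores <= 6))   # % promoters minus % detractors
}
# Repeat the survey 10,000 times to see how much NPS varies by chance
nps_sims <- replicate(10000, nps_once(250))
hist(nps_sims, main = "Simulated NPS for surveys of 250 respondents")
quantile(nps_sims, c(0.025, 0.5, 0.975))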
How does R manage to be so good at these kinds of tasks? The answer is that it
has a whole raft of functions that are designed specifically for this kind of work.
Where do they come from? R is free and open source. If people want a function
and can't find it, they can write one and add it to the function "bank" that is R.
They have been doing that for about 15 years, which means that most of the
functions you will ever need are already there.
Finally, R is a very easy language to learn -- you can just download the language
and a front-end environment (such as RStudio, which I used to create the image
embedded here) and start typing.
Visualising distributions
How you visualise the distribution of a variable will depend on whether the
variable is categorical or continuous. A variable is categorical if it can only take
one of a small set of values. In R, categorical variables are usually saved as
factors or character vectors. To examine the distribution of a categorical variable,
use a bar chart:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
The height of the bars displays how many observations occurred with each x
value. You can compute these values manually with dplyr::count():
diamonds %>%
count(cut)
#> # A tibble: 5 x 2
#> cut n
#> <ord> <int>
#> 1 Fair 1610
#> 2 Good 4906
#> 3 Very Good 12082
#> 4 Premium 13791
#> 5 Ideal 21551
A variable is continuous if it can take any of an infinite set of ordered values.
Numbers and date-times are two examples of continuous variables. To examine
the distribution of a continuous variable, use a histogram:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
You can compute this by hand by
combining dplyr::count() and ggplot2::cut_width():
diamonds %>%
count(cut_width(carat, 0.5))
#> # A tibble: 11 x 2
#> `cut_width(carat, 0.5)` n
#> <fct> <int>
#> 1 [-0.25,0.25] 785
#> 2 (0.25,0.75] 29498
#> 3 (0.75,1.25] 15977
#> 4 (1.25,1.75] 5313
#> 5 (1.75,2.25] 2002
#> 6 (2.25,2.75] 322
#> # … with 5 more rows
A histogram divides the x-axis into equally spaced bins and then uses the height
of a bar to display the number of observations that fall in each bin. In the graph
above, the tallest bar shows that almost 30,000 observations have a carat value
between 0.25 and 0.75, which are the left and right edges of the bar.
You can set the width of the intervals in a histogram with the binwidth
argument, which is measured in the units of the x variable. You should always
explore a variety of binwidths when working with histograms, as different
binwidths can reveal different patterns. For example, here is how the graph
above looks when we zoom into just the diamonds with a size of less than three
carats and choose a smaller binwidth.
smaller <- diamonds %>%
filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
Typical values
In both bar charts and histograms, tall bars show the common values of a
variable, and shorter bars show less-common values. Places that do not have bars
reveal values that were not seen in your data. To turn this information into useful
questions, look for anything unexpected:
How are the observations within each cluster similar to each other?
How are the observations in separate clusters different from each
other?
How can you explain or describe the clusters?
Why might the appearance of clusters be misleading?
The histogram below shows the length (in minutes) of 272 eruptions of the Old
Faithful Geyser in Yellowstone National Park. Eruption times appear to be
clustered into two groups: there are short eruptions (of around 2 minutes) and
long eruptions (4-5 minutes), but little in between.
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_histogram(binwidth = 0.25)
Many of the questions above will prompt you to explore a relationship between
variables, for example, to see if the values of one variable can explain the
behavior of another variable. We’ll get to that shortly.
Unusual values
Outliers are observations that are unusual; data points that don’t seem to fit the
pattern. Sometimes outliers are data entry errors; other times outliers suggest
important new science. When you have a lot of data, outliers are sometimes
difficult to see in a histogram. For example, take the distribution of the y
variable from the diamonds dataset. The only evidence of outliers is the
unusually wide limits on the x-axis.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
There are so many observations in the common bins that the rare bins are so
short that you can’t see them (although maybe if you stare intently at 0 you’ll
spot something). To make it easy to see the unusual values, we need to zoom to
small values of the y-axis with coord_cartesian() :
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
(coord_cartesian() also has an xlim() argument for when you need to zoom into
the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly
differently: they throw away the data outside the limits.)
This allows us to see that there are three unusual values: 0, ~30, and ~60. We
pluck them out with dplyr:
unusual <- diamonds %>%
filter(y < 3 | y > 20) %>%
select(price, x, y, z) %>%
arrange(y)
unusual
#> # A tibble: 9 x 4
#> price x y z
#> <int> <dbl> <dbl> <dbl>
#> 1 5139 0 0 0
#> 2 6381 0 0 0
#> 3 12800 0 0 0
#> 4 15686 0 0 0
#> 5 18034 0 0 0
#> 6 2130 0 0 0
#> 7 2130 0 0 0
#> 8 2075 5.15 31.8 5.12
#> 9 12210 8.09 58.9 8.06
The y variable measures one of the three dimensions of these diamonds, in mm.
We know that diamonds can’t have a width of 0mm, so these values must be
incorrect. We might also suspect that measurements of 32mm and 59mm are
implausible: those diamonds are over an inch long, but don’t cost hundreds of
thousands of dollars!
It’s good practice to repeat your analysis with and without the outliers. If they
have minimal effect on the results, and you can’t figure out why they’re there,
it’s reasonable to replace them with missing values, and move on. However, if
they have a substantial effect on your results, you shouldn’t drop them without
justification. You’ll need to figure out what caused them (e.g. a data entry error)
and disclose that you removed them in your write-up.
Exercises
It’s hard to see the difference in distribution because the overall counts differ so
much:
ggplot(diamonds) +
geom_bar(mapping = aes(x = cut))
To make the comparison easier we need to swap what is displayed on the y-axis.
Instead of displaying count, we’ll display density , which is the count
standardised so that the area under each frequency polygon is one.
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
There’s something rather surprising about this plot - it appears that fair diamonds
(the lowest quality) have the highest average price! But maybe that’s because
frequency polygons are a little hard to interpret - there’s a lot going on in this
plot.
Another alternative to display the distribution of a continuous variable broken
down by a categorical variable is the boxplot. A boxplot is a type of visual
shorthand for a distribution of values that is popular among statisticians. Each
boxplot consists of:
A box that stretches from the 25th percentile of the distribution to the
75th percentile, a distance known as the interquartile range (IQR). In
the middle of the box is a line that displays the median, i.e. 50th
percentile, of the distribution. These three lines give you a sense of
the spread of the distribution and whether or not the distribution is
symmetric about the median or skewed to one side.
Visual points that display observations that fall more than 1.5 times
the IQR from either edge of the box. These outlying points are
unusual so are plotted individually.
A line (or whisker) that extends from each end of the box and goes to
the
farthest non-outlier point in the distribution.
Let’s take a look at the distribution of price by cut using geom_boxplot() :
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()
We see much less information about the distribution, but the boxplots are much
more compact so we can more easily compare them (and fit more on one plot). It
supports the counterintuitive finding that better quality diamonds are cheaper on
average! In the exercises, you’ll be challenged to figure out why.
cut is an ordered factor: fair is worse than good, which is worse than very good
and so on. Many categorical variables don’t have such an intrinsic order, so you
might want to reorder them to make a more informative display. One way to do
that is with the reorder() function.
For example, take the class variable in the mpg dataset. You might be interested
to know how highway mileage varies across classes:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
To make the trend easier to see, we can reorder class based on the median value
of hwy :
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y =
hwy))
If you have long variable names, geom_boxplot() will work better if you flip it
90°. You can do that with coord_flip() .
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y =
hwy)) +
coord_flip()
Exercises
The size of each circle in the plot displays how many observations occurred at
each combination of values. Covariation will appear as a strong correlation
between specific x values and specific y values.
Another approach is to compute the count with dplyr:
diamonds %>%
count(color, cut)
#> # A tibble: 35 x 3
#> color cut n
#> <ord> <ord> <int>
#> 1 D Fair 163
#> 2 D Good 662
#> 3 D Very Good 1513
#> 4 D Premium 1603
#> 5 D Ideal 2834
#> 6 E Fair 224
#> # … with 29 more rows
Then visualise with geom_tile() and the fill aesthetic:
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
If the categorical variables are unordered, you might want to use the seriation
package to simultaneously reorder the rows and columns in order to more clearly
reveal interesting patterns. For larger plots, you might want to try the d3heatmap
or heatmaply packages, which create interactive plots.
Exercises
1. How could you rescale the count dataset above to more clearly show
the distribution of cut within colour, or colour within cut?
2. Use geom_tile() together with dplyr to explore how average flight
delays vary by destination and month of year. What makes the plot
difficult to read? How could you improve it?
3. Why is it slightly better to use aes(x = color, y = cut) rather
than aes(x = cut, y = color) in the example above?
Scatterplots become less useful as the size of your dataset grows, because points
begin to overplot, and pile up into areas of uniform black (as above). You’ve
already seen one way to fix the problem: using the alpha aesthetic to add
transparency.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
But using transparency can be challenging for very large datasets. Another
solution is to use bin. Previously you used geom_histogram()
and geom_freqpoly() to bin in one dimension. Now you’ll learn how to
use geom_bin2d() and geom_hex() to bin in two dimensions.
geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and
then use a fill color to display how many points fall into each bin. geom_bin2d()
creates rectangular bins. geom_hex() creates hexagonal bins. You will need to
install the hexbin package to use geom_hex() .
ggplot(data = smaller) +
geom_bin2d(mapping = aes(x = carat, y = price))
# install.packages("hexbin")
ggplot(data = smaller) +
geom_hex(mapping = aes(x = carat, y = price))
Another option is to bin one continuous variable so it acts like a categorical
variable. Then you can use one of the techniques for visualising the combination
of a categorical and a continuous variable that you learned about. For example,
you could bin carat and then for each group, display a boxplot:
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
cut_width(x, width) , as used above, divides x into bins of width width . By
default, boxplots look roughly the same (apart from number of outliers)
regardless of how many observations there are, so it’s difficult to tell that each
boxplot summarises a different number of points. One way to show that is to
make the width of the boxplot proportional to the number of points
with varwidth = TRUE .
Another approach is to display approximately the same number of points in each
bin. That’s the job of cut_number() :
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
Exercises
Once you’ve removed the strong relationship between carat and price, you can
see what you expect in the relationship between cut and price: relative to their
size, better quality diamonds are more expensive.
ggplot(data = diamonds2) +
geom_boxplot(mapping = aes(x = cut, y = resid))
You’ll learn how models, and the modelr package, work in the final part of the
book, model. We’re saving modelling for later because understanding what
models are and how they work is easiest once you have tools of data wrangling
and programming in hand.
ggplot2 calls
As we move on from these introductory chapters, we’ll transition to a more
concise expression of ggplot2 code. So far we’ve been very explicit, which is
helpful when you are learning:
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_freqpoly(binwidth = 0.25)
Typically, the first one or two arguments to a function are so important that you
should know them by heart. The first two arguments to ggplot() are data
and mapping , and the first two arguments to aes() are x and y . In the
remainder of the book, we won’t supply those names. That saves typing, and, by
reducing the amount of boilerplate, makes it easier to see what’s different
between plots. That’s a really important programming concern that we’ll come
back in functions.
Rewriting the previous plot more concisely yields:
ggplot(faithful, aes(eruptions)) +
geom_freqpoly(binwidth = 0.25)
Sometimes we’ll turn the end of a pipeline of data transformation into a plot.
Watch for the transition from %>% to +. I wish this transition wasn’t necessary
but unfortunately ggplot2 was created before the pipe was discovered.
diamonds %>%
count(cut, clarity) %>%
ggplot(aes(clarity, cut, fill = n)) +
geom_tile()
Example:
Let us take an example of time series analysis, which is a method of predictive
analysis, in R programming:
x <- c(580, 7813, 28266, 59287, 75700,
87820, 95314, 126214, 218843, 471497,
936851, 1508725, 2072113)
# library required for decimal_date() function
library(lubridate)
# output to be created as png file
png(file ="predictiveAnalysis.png")
# creating time series object
# from date 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")),
frequency = 365.25 / 7)
# plotting the graph
plot(mts, xlab ="Weekly Data of sales",
ylab ="Total Revenue",
main ="Sales vs Revenue",
col.main ="darkgreen")
# saving the file
dev.off()
Output:
Forecasting Data:
Now, forecasting sales and revenue based on historical data.
x <- c(580, 7813, 28266, 59287, 75700,
87820, 95314, 126214, 218843,
471497, 936851, 1508725, 2072113)
# library required for decimal_date() function
library(lubridate)
# library required for forecasting
library(forecast)
# output to be created as png file
png(file ="forecastSalesRevenue.png")
# creating time series object
# from date 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")),
frequency = 365.25 / 7)
# forecasting model using arima model
fit <- auto.arima(mts)
# Next 5 forecasted values
forecast(fit, 5)
# plotting the graph with next
# 5 weekly forecasted values
plot(forecast(fit, 5), xlab ="Weekly Data of Sales",
ylab ="Total Revenue",
main ="Sales vs Revenue", col.main ="darkgreen")
# saving the file
dev.off()
Output:
Performing Hierarchical Cluster Analysis using R
Cluster analysis or clustering is a technique to find subgroups of data points
within a data set. The data points belonging to the same subgroup have similar
features or properties. Clustering is an unsupervised machine learning approach
and has a wide variety of applications such as market research, pattern
recognition, recommendation systems, and so on. The most common algorithms
used for clustering are K-means clustering and Hierarchical cluster analysis. In
this article, we will learn about hierarchical cluster analysis and its
implementation in R programming.
Hierarchical cluster analysis (also known as hierarchical clustering) is a
clustering technique where clusters have a hierarchy or a predetermined order.
Hierarchical clustering can be represented by a tree-like structure called
a Dendrogram. There are two types of hierarchical clustering: agglomerative
(bottom-up) and divisive (top-down). The commonly used R functions are:
hclust in the stats package and agnes in the cluster package for
agglomerative hierarchical clustering; and
diana in the cluster package for divisive hierarchical clustering.
We will use the Iris flower data set from the datasets package in our
implementation. We will use sepal width, sepal length, petal width, and petal
length column as our data points. First, we load and normalize the data. Then the
dissimilarity values are computed with dist function and these values are fed to
clustering functions for performing hierarchical clustering.
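The loading and normalization step itself isn't shown in the excerpt that follows, so here is a minimal sketch that produces the df and hc1 objects used below; it assumes the cluster, factoextra and purrr packages are installed:
# Packages used in this example
library(cluster)     # agnes, diana
library(factoextra)  # fviz_cluster
library(purrr)       # map_dbl
# Load the numeric columns of the iris data set and normalize them
df <- scale(iris[, 1:4])
# Compute the dissimilarity matrix and run agglomerative clustering
d   <- dist(df, method = "euclidean")
hc1 <- hclust(d, method = "complete")
# Plot the dendrogram
plot(hc1, cex = 0.6, hang = -1, main = "Dendrogram of hclust")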
Observe that in the above dendrogram, each leaf corresponds to one observation
and, as we move up the tree, similar observations are fused at greater heights.
The height at which we cut the dendrogram determines the clusters. In order to
identify the clusters, we can cut the dendrogram with cutree and then visualize
the result in a scatter plot using the fviz_cluster function from the factoextra
package.
# Cut tree into 3 groups
sub_grps <- cutree(hc1, k = 3)
# Visualize the result in a scatter plot
fviz_cluster(list(data = df, cluster = sub_grps))
Output:
We can also provide a border to the dendrogram around the 3 clusters as shown
below.
# Plot the obtained dendrogram with
# rectangle borders for k clusters
plot(hc1, cex = 0.6, hang = -1)
rect.hclust(hc1, k = 3, border = 2:4)
Output:
Alternatively, we can use the agnes function to perform the hierarchical
clustering. Unlike hclust , the agnes function gives the agglomerative
coefficient, which measures the amount of clustering structure found (values
closer to 1 suggest strong clustering structure).
# agglomeration methods to assess
m <- c("average", "single", "complete")
names(m) <- c("average", "single", "complete")
# function to compute hierarchical
# clustering coefficient
ac <- function(x) {
agnes(df, method = x)$ac
}
map_dbl(m, ac)
Output:
average single complete
0.9035705 0.8023794 0.9438858
Complete linkage gives a stronger clustering structure. So, we use this
agglomeration method to perform hierarchical clustering with agnes function as
shown below.
# Hierarchical clustering
hc2 <- agnes(df, method = "complete")
# Plot the obtained dendrogram
pltree(hc2, cex = 0.6, hang = -1,
main = "Dendrogram of agnes")
Output:
Python can be used as a scripting language in Microsoft's Active Server Page (ASP)
technology. The scoreboard system for the Melbourne (Australia) Cricket
Ground is written in Python. Z Object Publishing Environment, a popular
Web application server, is also written in the Python language.
Python is everywhere!
With the widespread use of Python across major industry verticals, Python has
become a hot topic of discussion. Python has been acknowledged as the
fastest-growing programming language, as per Stack Overflow Trends.
According to the Stack Overflow Developers' Survey 2019, Python is the second
"most loved" language, with 73% of developers choosing it above other
languages prevailing in the market.
The advent of the Anaconda platform has also given the language a great boost.
This is why Python has become one of the most popular options for big data in
the industry, and businesses can hire Python developers to put these benefits to
work.
Open-Source
Developed with the help of a community-based model, Python is an open-source
programming language. Being open source, Python supports multiple platforms
and can be run in various environments, such as Windows and Linux.
Python's library ecosystem supports a wide range of analytics tasks, including:
• Numerical computing
• Data analysis
• Statistical analysis
• Visualization
• Machine learning
The Pydoop package (Python and Hadoop) provides access to the HDFS API for
Hadoop, which allows you to write Hadoop MapReduce programs and
applications.
How is the HDFS API beneficial? It lets you read and write files and
directories and access global file system properties without facing any hurdles.
Pydoop also offers a MapReduce API for solving complex problems with minimal
programming effort. This API can be used to implement advanced data science
concepts like 'Counters' and 'Record Readers', which makes Python
programming the best choice for Big Data.
Speed
Python is considered to be one of the most popular languages for software
development because of its high speed and performance. As it accelerates code
well, Python is an apt choice for big data.
Python supports rapid prototyping of ideas, which helps make the code run fast
while sustaining transparency between the code and the process.
Scope
Python allows users to simplify data operations. As an object-oriented language,
Python supports advanced data structures, including lists, sets, tuples,
dictionaries and many more.
Final words
These are some of the benefits of using Python. By now, you should have a clear
idea of why Python is considered a good fit for big data: it is a simple,
open-source language with high speed and robust library support.
“Big data is at the foundation of all the megatrends that are happening.” –
Chris Lynch
With the use of big data technology spreading across the globe, meeting the
requirements of this industry is surely a daunting task. But, with its benefits,
Python has become a suitable choice for big data, and businesses can leverage it
to their advantage.
The dataset we'll be using is the Chile voting dataset, which you can import in
Python as:
import pandas as pd
DF = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/car/Chile.csv")
Descriptive Statistics
Descriptive statistics is a helpful way to understand the characteristics of your
data and to get a quick summary of it. Pandas in Python provides an interesting
method, describe(). The describe function applies basic statistical computations
to the dataset, such as extreme values, the count of data points, the standard
deviation and so on. Any missing or NaN values are automatically skipped. The
describe() function gives a good picture of the distribution of the data.
DF.describe()
Here’s the output you’ll get on running above code:
Another useful method is value_counts(), which counts each category in a
categorical series of values. For instance, suppose you are dealing with a dataset
of customers who are divided into youth, medium and old categories under a
column named age, and your dataframe is "DF". You can run this statement
to know how many people fall into the respective categories. In our example
data set, the education column can be used:
DF["education"].value_counts()
The output of the above code will be:
One more useful tool is the boxplot, which you can use through the matplotlib
module. A boxplot is a pictorial representation of the distribution of data that
shows extreme values, the median and the quartiles. We can easily spot outliers
by using boxplots. Now consider the dataset we've been dealing with again and
let's draw a boxplot on the attribute population:
import pandas as pd
import matplotlib.pyplot as plt
DF = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/car/Chile.csv")
y = list(DF.population)
plt.boxplot(y)
plt.show()
The output plot would look like this, with the outliers clearly visible:
Grouping data
Group by is an interesting measure available in pandas that can help us figure
out the effect of different categorical attributes on other data variables. Let's see
an example on the same dataset, where we want to figure out the effect of
people's age and education on the voting dataset.
DF.groupby(['education', 'vote']).mean()
The output would be somewhat like this:
If this groupby output table is hard to read, analysts often go further and use
pivot tables and heat maps to visualize it.
ANOVA
ANOVA stands for Analysis of Variance. It is performed to figure out the
relation between different groups of categorical data.
ANOVA returns two measures as its result:
– F-test score: the ratio of the variation between the group means to the variation within the groups
– p-value: the statistical significance of the result
This can be performed using the Python module scipy and its f_oneway() method.
Syntax:
import scipy.stats as st
st.f_oneway(sample1, sample2, ..)
Loading Data:
data = pd.read_csv("state.csv")
# Check the type of data
print ("Type : ", type(data), "\n\n")
# Printing Top 10 Records
print ("Head -- \n", data.head(10))
# Printing last 10 Records
print ("\n\n Tail -- \n", data.tail(10))
Output :
Head --
State Population Murder.Rate Abbreviation
0 Alabama 4779736 5.7 AL
1 Alaska 710231 5.6 AK
2 Arizona 6392017 4.7 AZ
3 Arkansas 2915918 5.6 AR
4 California 37253956 4.4 CA
5 Colorado 5029196 2.8 CO
6 Connecticut 3574097 2.4 CT
7 Delaware 897934 5.8 DE
8 Florida 18801310 5.8 FL
9 Georgia 9687653 5.7 GA
Tail --
State Population Murder.Rate Abbreviation
40 South Dakota 814180 2.3 SD
41 Tennessee 6346105 5.7 TN
42 Texas 25145561 4.4 TX
43 Utah 2763885 2.3 UT
44 Vermont 625741 1.6 VT
45 Virginia 8001024 4.1 VA
46 Washington 6724540 2.5 WA
47 West Virginia 1852994 4.0 WV
48 Wisconsin 5686986 2.9 WI
49 Wyoming 563626 2.7 WY
Code #1 : Adding Column to the dataframe
# Adding a new column with derived data
data['PopulationInMillions'] = data['Population']/1000000
# Changed data
print (data.head(5))
Output :
State Population Murder.Rate Abbreviation PopulationInMillions
0 Alabama 4779736 5.7 AL 4.779736
1 Alaska 710231 5.6 AK 0.710231
2 Arizona 6392017 4.7 AZ 6.392017
3 Arkansas 2915918 5.6 AR 2.915918
4 California 37253956 4.4 CA 37.253956
Code #2 : Data Description
data.describe()
Output :
(summary statistics -- count, mean, standard deviation, minimum, quartiles and maximum -- for the numeric columns)
Code #3 : Data Info
data.info()
Output :
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
State 50 non-null object
Population 50 non-null int64
Murder.Rate 50 non-null float64
Abbreviation 50 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 1.6+ KB
Code #4 : Renaming a column heading
# Rename column heading as it
# has '.' in it which will create
# problems when calling functions
data.rename(columns ={'Murder.Rate': 'MurderRate'}, inplace = True)
# Lets check the column headings
list(data)
Output :
['State', 'Population', 'MurderRate', 'Abbreviation', 'PopulationInMillions']
# import the plotting libraries used below
import seaborn as sns
import matplotlib.pyplot as plt
ax2 = sns.barplot(
x ="State", y ="MurderRate",
data = data.sort_values('MurderRate', ascending = 1),
palette ="husl")
ax2.set(xlabel ='States', ylabel ='Murder Rate per 100000')
ax2.set_title('Murder Rate by State', size = 20)
plt.xticks(rotation =-90)
Output :
(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
a list of 50 Text xticklabel objects)
Indexing DataFrames with Pandas
Indexing is possible using the pandas.DataFrame.iloc method. The iloc
method allows you to retrieve rows and columns by position.
Examples:
# prints the first 5 rows and every column, which replicates df.head()
df.iloc[0:5,:]
# prints all rows and all columns
df.iloc[:,:]
# prints rows from position 5 onwards and the first 5 columns
df.iloc[5:,:5]
Indexing Using Labels in Pandas
Indexing can also be done with labels using the pandas.DataFrame.loc method,
which allows you to index using labels instead of positions.
Examples:
# prints rows with labels 0 through 5 (loc is inclusive) and every column of df
df.loc[0:5,:]
# keeps rows from label 5 onwards and every column
df = df.loc[5:,:]
The above doesn't actually look much different from df.iloc[0:5,:]. This is
because, while row labels can take on any values, our row labels match the
positions exactly. But column labels can make things much easier when working
with data. Example:
# Prints the rows of the Time period column with labels up to 5
df.loc[:5,"Time period"]
Pandas Plotting
Plots in these examples are made using the standard convention for referencing
the matplotlib API, which pandas builds on to easily create decent-looking
plots.
Examples:
# import the required module
import matplotlib.pyplot as plt
# plot a histogram
df['Observation Value'].hist(bins=10)
# shows presence of a lot of outliers/extreme values
df.boxplot(column='Observation Value', by = 'Time period')
# plotting points as a scatter plot
x = df["Observation Value"]
y = df["Time period"]
plt.scatter(x, y, label= "stars", color= "m",
marker= "*", s=30)
# x-axis label
plt.xlabel('Observation Value')
# y-axis label
plt.ylabel('Time period')
# function to show the plot
plt.show()
Storing DataFrame in CSV Format :
Pandas provides the to_csv('filename', index = False|True) method to write a
DataFrame into a CSV file. Here, filename is the name of the CSV file that you
want to create, and index tells pandas whether the DataFrame's index should be
written to the file. The default value of index is True, so the index is written; if
we set index = False, the index is omitted.
Example :
import pandas as pd
# assigning three series to s1, s2, s3
s1 = pd.Series([0, 4, 8])
s2 = pd.Series([1, 5, 9])
s3 = pd.Series([2, 6, 10])
# taking index and column values
dframe = pd.DataFrame([s1, s2, s3])
# assign column name
dframe.columns =['Geeks', 'For', 'Geeks']
# write data to csv file
dframe.to_csv('geeksforgeeks.csv', index = False)
dframe.to_csv('geeksforgeeks1.csv', index = True)
Output :
geeksforgeeks.csv
geeksforgeeks1.csv
Handling Missing Data
The data analysis phase also requires the ability to handle missing data in our
dataset, and not so surprisingly, Pandas lives up to that expectation as well. This
is where the dropna and/or fillna methods come into play. While dealing with
missing data, you as a data analyst can either drop the rows or columns
containing NaN values (the dropna method) or fill in the missing data with the
mean or mode of the whole column (the fillna method). This decision is of great
significance and depends upon the data and the effect it would create on our
results.
import pandas as pd
import numpy as np
# Create a DataFrame with some missing values
dframe = pd.DataFrame({'Geeks': [23, 24, 22],
'For': [10, 12, np.nan],
'geeks': [0, np.nan, np.nan]},
columns =['Geeks', 'For', 'geeks'])
# Drop all rows with NaN values
# (if axis is not defined, it defaults to rows, i.e. axis = 0)
print(dframe.dropna())
# Drop all columns with NaN values instead (axis = 1)
print(dframe.dropna(axis = 1))
Output :
(results shown for axis = 0 and for axis = 1)
Groupby Method (Aggregation) :
The groupby method allows us to group the data based on any column, and we
can then apply aggregate functions to analyze our data. It groups a series using a
mapper (a dict or key function, applying the given function to each group and
returning the result as a series) or by a series of columns.
Consider this is the DataFrame generated by below code :
import pandas as pd
import numpy as np
# create DataFrame
dframe = pd.DataFrame({'Geeks': [23, 24, 22, 22, 23, 24],
'For': [10, 12, 13, 14, 15, 16],
'geeks': [122, 142, 112, 122, 114, 112]},
columns = ['Geeks', 'For', 'geeks'])
# Apply groupby and aggregate function
# max to find max value of column
# "For" and column "geeks" for every
# different value of column "Geeks".
print(dframe.groupby(['Geeks']).max())
Output :
Here 'Z' is an array of size 100, with values ranging from 0 to 255. Now,
reshape 'Z' to a column vector; this will be more useful when more than one
feature is present. Then change the data to the np.float32 type.
Output:
Now, apply the k-Means clustering algorithm to the same example as in the
above test data and see its behavior.
Steps Involved:
1) First we need to set a test data.
2) Define criteria and apply kmeans().
3) Now separate the data.
4) Finally Plot the data.
import numpy as np
import cv2
from matplotlib import pyplot as plt
X = np.random.randint(10,45,(25,2))
Y = np.random.randint(55,70,(25,2))
Z = np.vstack((X,Y))
# convert to np.float32
Z = np.float32(Z)
# define criteria and apply kmeans()
criteria = (cv2.TERM_CRITERIA_EPS +
cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
ret, label, center = cv2.kmeans(Z, 2, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
# Now separate the data
A = Z[label.ravel()==0]
B = Z[label.ravel()==1]
# Plot the data
plt.scatter(A[:,0],A[:,1])
plt.scatter(B[:,0],B[:,1],c = 'r')
plt.scatter(center[:,0],center[:,1],s = 80,c = 'y', marker = 's')
plt.xlabel('Test Data'),plt.ylabel('Z samples')
plt.show()
Output:
Scala
Scala (Scalable Language) is a software programming language that mixes
object-oriented methods with functional programming capabilities that support a
more concise style of programming than other general-purpose languages like
Java, reducing the amount of code developers have to write. Another benefit of
the combined object-functional approach is that features that work well in small
programs tend to scale up efficiently when run in larger environments.
Scala also includes its own interpreter, which can be used to execute instructions
directly, without previous compiling. Another key feature in Scala is a "parallel
collections" library designed to help developers address parallel programming
problems. Pattern matching is among the application areas in which such parallel
capabilities have proved to be especially useful.
Apache Spark, an open source data processing engine for batch processing,
machine learning, data streaming and other types of analytics applications, is a
very significant example of Scala usage. Spark is written in Scala, and the
language is central to its support for distributed data sets that are handled as
collective software objects to help boost resiliency. However, Spark applications
can be programmed in Java and the Python language in addition to Scala.
This article examines Scala's Java versatility and interoperability, the Scala
tooling and runtime features that help ensure reliable performance, and some of
the challenges developers should watch out for when they use this language.
Scala attracted wide attention from developers in 2015 due to its effectiveness
with general-purpose cluster computing. Today, it's found in many Java virtual
machine (JVM) systems, where developers use Scala to eliminate the need for
redundant type information. Because programmers don't have to specify a type,
they also don't have to repeat it.
Scala shares a common runtime platform with Java, so it can execute Java code.
Using the JVM and JavaScript runtimes, developers can build high-performance
systems with easy access to the rest of the Java library ecosystem. Because the
JVM is deeply embedded in enterprise code, Scala offers a concise path to
diverse functionality and granular control.
Developers can also rely on Scala to more effectively express general
programming patterns. By reducing the number of lines, programmers can write
type-safe code in an immutable manner, making it easy to apply concurrency and
to synchronize processing.
Scala treats functions as first-class objects, and programmers can compose
them with a high degree of type safety. Scala's lightweight syntax is perfect for
defining anonymous functions and nesting. Scala's pattern-matching ability also
makes it possible to incorporate functions within class definitions.
Java developers can quickly become productive in Scala if they have an existing
knowledge of OOP, and they can achieve greater flexibility because they can
define data types that have either functional or OOP-based attributes.
Managing dependency versions can also be a challenge in Scala. It's not unusual
for a language to cause headaches for developers when it comes to dependency
management, but that challenge is particularly prevalent in Scala due to the sheer
number of Scala versions and upgrades. New Scala releases often mark a
significant shift that requires massive developer retraining and codebase
migrations.
Developers new to Scala should seek out the support of experienced contributors
to help minimize the learning curve. While Scala still exists in a relatively
fragmented, tribal ecosystem, it's hard to say where Scala is heading in terms of
adoption. However, with the right support, Scala functional programming can be
a major asset.
Python vs Scala
Python is a high-level, interpreted and general-purpose dynamic programming
language that focuses on code readability. Python requires less typing and offers
a large ecosystem of libraries, fast prototyping and several other features.
Scala is a high-level language. It is a purely object-oriented programming
language whose source code is designed so that its compiler can interpret
Java classes.
Below are some major differences between Python and Scala:
PYTHON: Extra work is created for the interpreter at runtime.
SCALA: No extra work is created, and thus it is 10 times faster than Python.
PYTHON: Python can be used for small-scale projects.
SCALA: Scala can be used for large-scale projects.
We will be using the Scala IDE for demonstration purposes only. A working Spark
environment is required to run the code below.
Let’s create our first data frame in spark.
Scala
// Importing SparkSession
import org.apache.spark.sql.SparkSession
// Creating SparkSession object
val sparkSession = SparkSession.builder()
.appName("My First Spark Application")
.master("local").getOrCreate()
// Loading sparkContext
val sparkContext = sparkSession.sparkContext
// Creating an RDD
val intArray = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
// The parallelize method creates partitions; it additionally
// takes an integer argument to specify the number of partitions.
// Here we are using 3 partitions.
val intRDD = sparkContext.parallelize(intArray, 3)
// Printing number of partitions
println(s"Number of partitons in intRDD : ${intRDD.partitions.size}")
// Printing first element of RDD
println(s"First element in intRDD : ${intRDD.first}")
// Creating string from RDD
// take(n) function is used to fetch n elements from
// RDD and returns an Array.
// Then we will convert the Array to string using
// mkString function in scala.
val strFromRDD = intRDD.take(intRDD.count.toInt).mkString(", ")
println(s"String from intRDD : ${strFromRDD}")
// Printing contents of RDD
// collect function is used to retrieve all the data in an RDD.
println("Printing intRDD: ")
intRDD.collect().foreach(println)
Output:
So, we can say that Spark is a powerful open-source engine for data processing.
Components of Apache Spark
Spark is a cluster computing system. It is faster as compared to other cluster
computing systems (such as Hadoop). It provides high-level APIs in Python,
Scala, and Java. Parallel jobs are easy to write in Spark. In this article, we will
discuss the different components of Apache Spark.
Spark processes huge datasets and is currently the most active Apache
project. Spark is written in Scala and provides APIs in Python, Scala, Java and R.
The most vital feature of Apache Spark is its in-memory cluster computing, which
increases the speed of data processing. Spark is a more general and faster
processing platform than Hadoop; it can run programs up to a hundred times faster
in memory and ten times faster on disk. The main features of Spark are:
Components of Spark:
(Figure: the components of Spark.)
Uses of Apache Spark: The main applications of the Spark framework build on the
two phases of the classic MapReduce programming model:
1. Map – The mapper processes each line of the input data (which is in the
form of a file) and produces key-value pairs.
2. Reduce – The reducer processes the list of key-value pairs (after the
mapper's function) and outputs a new set of key-value pairs.
One major advantage of using Spark is its lazy evaluation: it does not immediately
load the dataset into memory; in the example below, lines is just a pointer to the
'file_name.txt' file.
A simple PySpark app to count the degree of each vertex for a given graph
–
1. Our text file is in the following format – (each line represents an edge
of a directed graph)
1 2
1 3
2 3
3 4
. .
. .
. .
2. Large Datasets may contain millions of nodes, and edges.
3. First few lines set up the SparkContext. We create an RDD lines from
it.
4. Then, we transform the lines RDD into an edges RDD. The function conv
acts on each line, and key-value pairs of the form (1, 2), (1, 3), (2, 3),
(3, 4), … are stored in the edges RDD.
5. After this, reduceByKey aggregates all the pairs corresponding to a
particular key, and the numNeighbours function is used to generate each
vertex's degree in a separate RDD, Adj_list, which has the form
(1, 2), (2, 1), (3, 1), … A minimal sketch of such a script is shown below.
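The original degree.py script is not reproduced in this article, so the following is
only a minimal sketch of what such a job could look like in PySpark; the file name
'file_name.txt', the conv helper and the use of a simple lambda in place of the
numNeighbours function mentioned above are assumptions made for illustration.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("VertexDegree")
sc = SparkContext(conf=conf)

# lines is only a pointer to the file; nothing is read until an action runs
lines = sc.textFile("file_name.txt")

def conv(line):
    # turn a line such as "1 2" into the key-value pair (1, 2)
    src, dst = line.split()
    return (int(src), int(dst))

edges = lines.map(conv)

# count outgoing edges per source vertex: (1, 2), (1, 3), (2, 3), (3, 4)
# becomes (1, 2), (2, 1), (3, 1), ...
Adj_list = edges.map(lambda edge: (edge[0], 1)).reduceByKey(lambda a, b: a + b)

for vertex, degree in Adj_list.collect():
    print(vertex, degree)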
$ cd /home/arik/Downloads/spark-1.6.0/
$ ./bin/spark-submit degree.py
You can substitute your own Spark installation path in the first command.
Output :
Row(Ship_name='Journey', Cruise_line='Azamara', Age=6,
Tonnage=30.276999999999997, passengers=6.94, length=5.94,
cabins=3.55, passenger_density=42.64, crew=3.55, cruise_cat=16.0)
Output :
test_data.describe().show()
Output :
#import LinearRegression library
from pyspark.ml.regression import LinearRegression
#creating an object of class LinearRegression
#object takes features and label as input arguments
ship_lr=LinearRegression(featuresCol='features',labelCol='crew')
#pass train_data to train model
trained_ship_model=ship_lr.fit(train_data)
#evaluating model trained for Rsquared error
ship_results=trained_ship_model.evaluate(train_data)
print('Rsquared Error :',ship_results.r2)
# The R2 value shows the model accuracy is about 92 %
# model accuracy is very good, so the model can be used for predictive analysis
Output :
Output :
Code :
# identifying the columns having less meaningful data on the basis of datatypes
l_int = []
for item in df_train.dtypes:
    if item[1] == 'int':
        l_int.append(item[0])
print(l_int)

l_str = []
for item in df_train.dtypes:
    if item[1] == 'string':
        l_str.append(item[0])
print(l_str)
Output
Integer Datatypes:
['Id', 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
'TotalBsmtSF',
'1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'GarageArea',
'WoodDeckSF',
'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
'MiscVal', 'MoSold', 'YrSold', 'SalePrice']
String Datatypes:
['MSZoning', 'LotFrontage', 'Street', 'Alley', 'LotShape', 'LandContour',
'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
'Condition2',
'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
'Exterior2nd',
'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation',
'BsmtQual',
'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating',
'HeatingQC',
'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu',
'GarageType',
'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive',
'PoolQC', 'Fence',
'MiscFeature', 'SaleType', 'SaleCondition']
Code :
# identifying integer column records having less meaningful data
from pyspark.sql.functions import col

for i in df_train.columns:
    if i in l_int:
        ct_total = df_train.select(i).count()
        ct_zeros = df_train.filter(col(i) == 0).count()
        per_zeros = (ct_zeros / ct_total) * 100
        print('total count / zeros count / percent zeros '
              + i + ' ' + str(ct_total) + ' / ' + str(ct_zeros) + ' / ' + str(per_zeros))
Code :
# converting string to numeric feature
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
feat_list =['MSZoning', 'LotFrontage', 'Street', 'LotShape', 'LandContour',
'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle',
'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation',
'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
'BsmtFinType2',
'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
'Functional', 'FireplaceQu', 'GarageType',
'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond',
'PavedDrive', 'SaleType', 'SaleCondition']
print('indexed list created')
# there are multiple features to work
# using pipeline we can convert multiple features to indexers
indexers = [StringIndexer(inputCol = column, outputCol =
column+"_index").fit(df_new) for column in feat_list]
type(indexers)
# Pipeline chains the indexers so that all the listed string
# columns are converted to numeric index columns in one pass.
pipeline = Pipeline(stages = indexers)
df_feat = pipeline.fit(df_new).transform(df_new)
df_feat.columns
# using above code we have converted list of features into indexes
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
# VectorAssembler combines the columns below into a single 'features' vector column
assembler = VectorAssembler(inputCols =['MSSubClass', 'LotArea',
'OverallQual',
'OverallCond', 'YearBuilt', 'YearRemodAdd',
'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF',
'1stFlrSF', '2ndFlrSF', 'GrLivArea',
'BsmtFullBath', 'FullBath', 'HalfBath',
'GarageArea', 'MoSold', 'YrSold',
'MSZoning_index', 'LotFrontage_index',
'Street_index', 'LotShape_index',
'LandContour_index', 'Utilities_index',
'LotConfig_index', 'LandSlope_index',
'Neighborhood_index', 'Condition1_index',
'Condition2_index', 'BldgType_index',
'HouseStyle_index', 'RoofStyle_index',
'RoofMatl_index', 'Exterior1st_index',
'Exterior2nd_index', 'MasVnrType_index',
'MasVnrArea_index', 'ExterQual_index',
'ExterCond_index', 'Foundation_index',
'BsmtQual_index', 'BsmtCond_index',
'BsmtExposure_index', 'BsmtFinType1_index',
'BsmtFinType2_index', 'Heating_index',
'HeatingQC_index', 'CentralAir_index',
'Electrical_index', 'KitchenQual_index',
'Functional_index', 'FireplaceQu_index',
'GarageType_index', 'GarageYrBlt_index',
'GarageFinish_index', 'GarageQual_index',
'GarageCond_index', 'PavedDrive_index',
'SaleType_index', 'SaleCondition_index'],
outputCol ='features')
output = assembler.transform(df_feat)
final_data = output.select('features', 'SalePrice')
# splitting data for test and validation
train_data, test_data = final_data.randomSplit([0.7, 0.3])
Code :
train_data.describe().show()
test_data.describe().show()
Code :
from pyspark.ml.regression import LinearRegression
house_lr = LinearRegression(featuresCol ='features', labelCol ='SalePrice')
trained_house_model = house_lr.fit(train_data)
house_results = trained_house_model.evaluate(train_data)
print('Rsquared Error :', house_results.r2)
# Rsquared Error : 0.8279155904297449
# model accuracy is 82 % with train data
# evaluate model on test_data
test_results = trained_house_model.evaluate(test_data)
print('Rsquared error :', test_results.r2)
# Rsquared error : 0.8431420382408793
# result is slightly better, with 84 % accuracy on test data
# create unlabelled data from test_data
# test_data.show()
unlabeled_data = test_data.select('features')
unlabeled_data.show()
Code :
predictions = trained_house_model.transform(unlabeled_data)
predictions.show()
SQL
SQL (Structured Query Language) is a standardized programming language
that's used to manage relational databases and perform various operations on the
data in them. Initially created in the 1970s, SQL is regularly used not only by
database administrators, but also by developers writing data integration scripts
and data analysts looking to set up and run analytical queries.
The uses of SQL include modifying database table and index structures; adding,
updating and deleting rows of data; and retrieving subsets of information from
within a database for transaction processing and analytics applications. Queries
and other SQL operations take the form of commands written as statements --
commonly used SQL statements include select, insert, update, delete, create,
alter, drop and truncate.
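To make these statement types concrete, here is a small, self-contained Python
example using the standard library's sqlite3 module; the customers table and its
rows are invented purely for illustration and are not part of any product
discussed here.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE defines a table structure
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, spend REAL)")

# INSERT adds rows of data
cur.execute("INSERT INTO customers (name, spend) VALUES (?, ?)", ("Alice", 120.50))
cur.execute("INSERT INTO customers (name, spend) VALUES (?, ?)", ("Bob", 30.00))

# UPDATE and DELETE modify or remove existing rows
cur.execute("UPDATE customers SET spend = spend + 10 WHERE name = ?", ("Bob",))
cur.execute("DELETE FROM customers WHERE spend < 45")

# SELECT retrieves a subset of the data
for row in cur.execute("SELECT name, spend FROM customers ORDER BY spend DESC"):
    print(row)

conn.close()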
Both proprietary and open source relational database management systems built
around SQL are available for use by organizations. They include Microsoft SQL
Server, Oracle Database, IBM DB2, SAP HANA, SAP Adaptive
Server, MySQL (now owned by Oracle) and PostgreSQL. However, many of
these database products support SQL with proprietary extensions to the standard
language for procedural programming and other functions. For example,
Microsoft offers a set of extensions called Transact-SQL (T-SQL), while Oracle's
extended version of the standard is PL/SQL. As a result, the different variants of
SQL offered by vendors aren't fully compatible with one another.
SQL syntax is the coding format used in writing statements. Figure 1 shows an
example of a DDL statement written in Microsoft's T-SQL to modify a database
table in SQL Server 2016:
Figure 1. An example of T-SQL code in SQL Server 2016, showing the ALTER
TABLE statement with the WITH (ONLINE = ON | OFF) option.
SQL-on-Hadoop tools
SQL-on-Hadoop query engines are a newer offshoot of SQL that enable
organizations with big data architectures built around Hadoop systems to take
advantage of it instead of having to use more complex and less familiar
languages -- in particular, the MapReduce programming environment for
developing batch processing applications.
SQL-on-Hadoop
SQL-on-Hadoop is a class of analytical application tools that combine
established SQL-style querying with newer Hadoop data framework elements.
The different means for executing SQL in Hadoop environments can be divided
into (1) connectors that translate SQL into a MapReduce format; (2) "push
down" systems that forgo batch-oriented MapReduce and execute SQL within
Hadoop clusters; and (3) systems that apportion SQL work between MapReduce-
HDFS clusters or raw HDFS clusters, depending on the workload.
One of the earliest efforts to combine SQL and Hadoop resulted in the Hive data
warehouse, which featured HiveQL software for translating SQL-like queries
into MapReduce jobs. Other tools that help support SQL-on-Hadoop include
BigSQL, Drill, Hadapt, Hawq, H-SQL, Impala, JethroData, Polybase, Presto,
Shark (Hive on Spark), Spark, Splice Machine, Stinger, and Tez (Hive on Tez).
The first SQL-on-Hadoop engine was Apache Hive, but since then many new
ones have been released. These include CitusDB, Cloudera
Impala, Concurrent Lingual, Hadapt, InfiniDB, JethroData, MammothDB,
Apache Drill, MemSQL, Pivotal HawQ, Progress DataDirect, ScleraDB, Simba
and Splice Machine.
And, of course, there are a few SQL database management systems that support
polyglot persistence. This means that they can store data in their own native SQL
database or in Hadoop; by doing so, they also offer SQL access to Hadoop data.
Examples are EMC/Greenplum UAP, HP Vertica (on MapR), Microsoft
PolyBase, Actian ParAccel and Teradata Aster Database (via SQL-H).
Does it matter which of these engines an organization chooses? It does, because
not all of these technologies are created equal. On the outside, they all look the
same, but internally they are very different. For example, CitusDB knows where all
the data is stored and uses that knowledge to access the data as efficiently as
possible. JethroData stores indexes to get direct access to data, and Splice
Machine offers a transactional SQL interface. Key aspects to compare include the
following:
SQL dialect. The richer the SQL dialect supported, the wider the range of
applications that can benefit from it. In addition, the richer the dialect, the more
query processing can be pushed to Hadoop and the less the applications and
reporting tools have to do.
Joins. Executing joins on big tables fast and efficiently is not always easy,
especially if the SQL-on-Hadoop engine has no idea where the data is stored. An
inefficient style of join processing can lead to massive amounts of I/O and can
cause colossal data transport between nodes. Both can result in really poor
performance.
Storage format. Hadoop supports some "standard" storage formats of the data,
such as Parquet, Avro and ORCFile. The more SQL-on-Hadoop technologies use
such formats, the more tools and other SQL-on-Hadoop engines can read that
same data. This drastically minimizes the need to replicate data. Thus, it's
important to verify whether a proprietary storage format is used.
Data federation. Not all data is stored in Hadoop. Most enterprise data is still
stored in other data sources, such as SQL databases. A SQL-on-Hadoop engine
must support distributed joins on data stored in all kinds of data sources. In other
words, it must support data federation.
Apache Hive
Apache Hive is an open source data warehouse system for querying and
analyzing large data sets that are principally stored in Hadoop files. It is
commonly a part of compatible tools deployed as part of the software ecosystem
based on the Hadoop framework for handling large data sets in a distributed
computing environment.
Like Hadoop, Hive has roots in batch processing techniques. It originated in
2007 with developers at Facebook who sought to provide SQL access to Hadoop
data for analytics users. Like Hadoop, Hive was developed to address the need to
handle petabytes of data accumulating via web activity. Release 1.0 became
available in February 2015.
When SQL queries are submitted via Hive, they are initially received by a driver
component that creates session handles and forwards requests to a compiler via
Java Database Connectivity/Open Database Connectivity interfaces; the compiler
subsequently forwards jobs for execution. Hive enables
data serialization/deserialization and increases flexibility in schema design by
including a system catalog called Hive-Metastore.
Early Hive file support comprised text files (also called flat files), SequenceFiles
(flat files consisting of binary key/value pairs) and Record Columnar Files
(RCFiles), which store the columns of a table in a columnar fashion. Hive
columnar storage support has since grown to include Optimized Row Columnar
(ORC) files and Parquet files.
Hive execution and interactivity were a topic of attention nearly from its
inception. That is because query performance lagged that of more familiar SQL
engines. In 2013, to boost performance, Apache Hive committers began work on
the Stinger project, which brought Apache Tez and directed acyclic graph
processing to the warehouse system.
Additions accompanying releases 2.3 in 2017 and release 3.0 in 2018 furthered
Apache Hive's development. Among highlights were support for Live Long and
Process (LLAP) functionality that allows prefetching and caching of columnar
data and support for atomicity, consistency, isolation and durability (ACID)
operations including INSERT, UPDATE and DELETE. Work also began on
materialized views and automatic query rewriting capabilities familiar to
traditional data warehouse users.
Apache Hive was among the very first attempts to bring SQL querying
capabilities to the Hadoop ecosystem. Among a host of other SQL-on-
Hadoop alternatives that have arisen are BigSQL, Drill, Hadapt, Impala and
Presto. Also, Apache Pig has emerged as an alternative language to HiveQL for
Hadoop-oriented data warehousing.
Execution Engine
The execution plan produced by the compiler is carried out by the execution
engine. The plan is a DAG of stages. The execution engine manages the
dependencies between the various stages of the plan and executes these stages
on the appropriate system components.
(Diagram: the architecture of Hive, built on top of Hadoop, showing the
step-by-step job execution flow of Hive with Hadoop.)
Apache Hive is a data warehouse and ETL tool that provides an SQL-like
interface between the user and the Hadoop Distributed File System (HDFS). It is
built on top of Hadoop and provides data query and analysis. It facilitates
reading, writing and handling large datasets stored in distributed storage and
queried using Structured Query Language (SQL) syntax. It is not built for Online
Transaction Processing (OLTP) workloads. It is frequently used for data
warehousing tasks such as data encapsulation, ad hoc queries and analysis of
huge datasets. It is designed to enhance scalability, extensibility, performance,
fault tolerance and loose coupling with its input formats.
Hive was initially developed by Facebook and was later adopted by companies
such as Amazon and Netflix. It delivers standard SQL functionality for analytics.
Without Hive, traditional SQL queries would have to be implemented in the
MapReduce Java API to run over distributed data. Hive provides portability, as
most data warehousing applications work with SQL-based query languages.
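As a rough illustration of this SQL-style access, the sketch below submits
HiveQL-like statements from PySpark; it assumes a Spark installation built with
Hive support, and the sales table is a made-up example.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("HiveQueryExample")
         .enableHiveSupport()
         .getOrCreate())

# HiveQL-style statements are passed as ordinary SQL strings
spark.sql("CREATE TABLE IF NOT EXISTS sales (item STRING, amount DOUBLE)")
spark.sql("SELECT item, SUM(amount) AS total FROM sales GROUP BY item").show()

spark.stop()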
Components of Hive:
1. HCatalog
HCatalog is a Hive component that serves as a table and storage
management layer for Hadoop. It enables users of different data
processing tools, such as Pig and MapReduce, to easily read and write
data on the grid.
2. WebHCat
WebHCat provides a service that the user can call to run Hadoop
MapReduce (or YARN), Pig or Hive jobs, or to perform Hive metadata
operations, through an HTTP interface.
Modes of Hive:
Hive operates in two major modes, described below. The choice of mode depends
on the number of data nodes in Hadoop.
1. Local Mode
It is used when Hadoop is installed in pseudo-distributed mode with
only one data node, when the data is small enough to be restricted to a
single local machine, and when processing of smaller datasets on the
local machine will be faster.
2. Map Reduce Mode
It is used when Hadoop is built with multiple data nodes and the data is
divided across the various nodes; it works on huge datasets, executes
queries in parallel and achieves better performance when processing
large datasets.
Characteristics of Hive:
Features of Hive:
The art of modeling involves selecting the right data sets, algorithms and
variables and the right techniques to format data for a particular business
problem. But there's more to it than model-building mechanics. No model will
do any good if the business doesn't understand its results. Communicating the
results to executives so they understand what the model discovered and how
it can benefit the business is critical but challenging; it's the "last mile" in the
whole analytical modeling process and often the most treacherous. Without that
understanding, though, business managers might be loath to use the analytical
findings to make critical business decisions.
They should also be realistic about the likely fruits of their scientific and artistic
labors. Though some models make fresh observations about business data, most
don't; they extract relationships or patterns that people already know about but
might overlook or ignore otherwise.
For example, a crime model predicts that the number of criminal incidents will
increase in a particular neighborhood on a particular summer night. A grizzled
police sergeant might cavalierly dismiss the model's output, saying he was aware
that would happen because an auto race takes place that day at the local
speedway, which always spawns spillover crime in the adjacent neighborhood.
"Tell me something I don't already know," he grumbles. But that doesn't mean
the modeling work was for naught: In this case, the model reinforces the
policeman’s implicit knowledge, bringing it to the forefront of his consciousness,
so he can act on it.
Many people think that to excel at analytics their companies need only hire a
bunch of statisticians who understand the nuances of sophisticated algorithms
and give them high-powered tools to crunch data. But that only gets you so far.
The art of analytical modeling is a skill that requires intimate knowledge of an
organization's processes and data as well as the ability to communicate with
business executives in business terms. Like fine furniture makers, analytics
professionals who master these skills can build high-quality models with lasting
value and reap huge rewards for their organizations in the process.
Algorithm
An algorithm (pronounced AL-go-rith-um) is a procedure or formula for solving
a problem, based on conducting a sequence of specified actions. A
computer program can be viewed as an elaborate algorithm. In mathematics and
computer science, an algorithm usually means a small procedure that solves a
recurrent problem.
The word algorithm derives from the name of the mathematician, Mohammed
ibn-Musa al-Khwarizmi, who was part of the royal court in Baghdad and who
lived from about 780 to 850. Al-Khwarizmi's work is the likely source for the
word algebra as well.
Business Analytics
Business analytics (BA) is the iterative, methodical exploration of an
organization's data, with an emphasis on statistical analysis. Business analytics is
used by companies that are committed to making data-driven decisions. Data-
driven companies treat their data as a corporate asset and actively look for ways
to turn it into a competitive advantage. Successful business analytics depends
on data quality, skilled analysts who understand the technologies and the
business, and an organizational commitment to using data to gain insights that
inform business decisions.
The more advanced areas of business analytics can start to resemble data
science, but there is also a distinction between these two terms. Even when
advanced statistical algorithms are applied to data sets, it doesn't necessarily
mean data science is involved. That's because true data science involves more
custom coding and exploring answers to open-ended questions.
Data scientists generally don't set out to solve a specific question, as most
business analysts do. Rather, they will explore data using advanced statistical
methods and allow the features in the data to guide their analysis. There are a
host of business analytics tools that can perform these kinds of functions
automatically, requiring few of the special skills involved in data science.
Self-service has become a major trend among business analytics tools. Users
now demand software that is easy to use and doesn't require specialized training.
This has led to the rise of simple-to-use tools from companies such
as Tableau and Qlik, among others. These tools can be installed on a single
computer for small applications or in server environments for enterprise-wide
deployments. Once they are up and running, business analysts and others with
less specialized training can use them to generate reports, charts and web portals
that track specific metrics in data sets.
Edge Analytics
Edge analytics is an approach to data collection and analysis in which an
automated analytical computation is performed on data at a sensor, network
switch or other device instead of waiting for the data to be sent back to a
centralized data store.
Edge analytics has gained attention as the internet of things (IoT) model of
connected devices has become more prevalent. In many organizations, streaming
data from manufacturing machines, industrial equipment, pipelines and other
remote devices connected to the IoT creates a massive glut of operational data,
which can be difficult -- and expensive -- to manage. By running the data
through an analytics algorithm as it's created, at the edge of a corporate network,
companies can set parameters on what information is worth sending to a cloud or
on-premises data store for later use -- and what isn't.
Analyzing data as it's generated can also decrease latency in the decision-making
process on connected devices. For example, if sensor data from a manufacturing
system points to the likely failure of a specific part, business rules built into the
analytics algorithm interpreting the data at the network edge can automatically
shut down the machine and send an alert to plant managers so the part can be
replaced. That can save time compared to transmitting the data to a central
location for processing and analysis, potentially enabling organizations to reduce
or avoid unplanned equipment downtime.
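As a rough sketch of such a rule running at the edge, consider the Python snippet
below; the temperature threshold and the shutdown, alert and upload callbacks are
hypothetical placeholders, not part of any particular platform.
FAILURE_TEMP_THRESHOLD = 90.0  # assumed limit for this illustration


def handle_reading(temperature_c, shut_down_machine, alert_manager, send_to_cloud):
    """Decide locally, at the edge, what to do with a single sensor reading."""
    if temperature_c >= FAILURE_TEMP_THRESHOLD:
        # act immediately instead of waiting for a round trip to the data center
        shut_down_machine()
        alert_manager("Overheat detected: %.1f C" % temperature_c)
        send_to_cloud({"temp": temperature_c, "event": "shutdown"})
    elif temperature_c >= 0.9 * FAILURE_TEMP_THRESHOLD:
        # only interesting readings are forwarded for central storage
        send_to_cloud({"temp": temperature_c, "event": "warning"})
    # routine readings are discarded locally to save bandwidth and storage


# example usage with print-based stand-ins for the real callbacks
handle_reading(92.0,
               shut_down_machine=lambda: print("machine stopped"),
               alert_manager=print,
               send_to_cloud=print)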
Edge analytics also has drawbacks:
Not all hardware supports it. Simply put, not every IoT device has the
memory, CPU and storage hardware required to perform deep analytics
onboard the device.
You might have to develop your own edge analytics platform. Edge
analytics is still a relatively new technology. Although off-the-shelf
analytical platforms do exist, it's entirely possible that an organization
might have to develop its own edge analytics platform based on the
devices that it wants to analyze.
Inductive Reasoning
Inductive reasoning is a logical process in which multiple premises, all believed
true or found true most of the time, are combined to obtain a specific conclusion.
For example, a meteorologist will tell you that in the United States (which lies in
the northern hemisphere) most tornadoes rotate counterclockwise, but not all of
them do. From many observations of counterclockwise rotation, one might
conclude that the next tornado will also rotate counterclockwise; the conclusion
is probably true, but not necessarily true. Inductive reasoning is, unlike deductive
reasoning, not logically rigorous. Imperfection can exist and inaccurate
conclusions can occur, however rare; in deductive reasoning, the conclusions are
mathematically certain.
The discipline of supply chain analytics has existed for over 100 years, but the
mathematical models, data infrastructure, and applications underpinning these
analytics have evolved significantly. Mathematical models have improved with
better statistical techniques, predictive modeling and machine learning. Data
infrastructure has changed with cloud infrastructure, complex event processing
(CEP) and the internet of things. Applications have grown to provide insight
across traditional application silos such as ERP, warehouse management,
logistics and enterprise asset management.
Some ERP and SCM vendors have begun applying CEP to their platforms for
real-time supply chain analytics. Most ERP and SCM vendors have one-to-one
integrations, but there is no standard. However, the Supply Chain Operations
Reference (SCOR) model provides standard metrics for comparing supply chain
performance to industry benchmarks.
Ideally, supply chain analytics software would be applied to the entire chain, but
in practice it is often focused on key operational subcomponents, such as
demand planning, manufacturing production, inventory management or
transportation management. For example, supply chain finance analytics can
help identify increased capital costs or opportunities to boost working
capital; procure-to-pay analytics can help identify the best suppliers and provide
early warning of budget overruns in certain expense categories; and
transportation analytics software can predict the impact of weather on shipments.
The process of creating supply chain analytics typically starts with data scientists
who understand a particular aspect of the business, such as the factors that relate
to cash flow, inventory, waste and service levels. These experts look for potential
correlations between different data elements to build a predictive model that
optimizes the output of the supply chain. They test out variations until they have
a robust model.
Supply chain analytics models that reach a certain threshold of success are
deployed into production by data engineers with an eye toward scalability and
performance. Data scientists, data engineers and business users work together to
refine the way these data analytics are presented and operationalized in practice.
Supply chain models are improved over time by correlating the performance of
data analysis models in production with the business value they deliver.
Data visualization. The ability to slice and dice data from different
angles to improve insight and understanding.
Stream processing. Deriving insight from multiple data streams
generated by, for example, the IoT, applications, weather reports and
third-party data.
Social media integration. Using sentiment data from social feeds to
improve demand planning.
Natural language processing. Extracting and organizing unstructured
data buried in documents, news sources and data feeds.
Location intelligence. Deriving insight from location data to
understand and optimize distribution.
Digital twin of the supply chain. Organizing data into a
comprehensive model of the supply chain that is shared across different
types of users to improve predictive and prescriptive analytics.
Graph databases. Organizing information into linked elements that
make it easier to find connections, identify patterns and improve
traceability of products, suppliers and facilities.
Types of supply chain analytics
A common lens used to delineate the main types of supply chain analytics is
based on Gartner's model of the four capabilities of analytics: descriptive,
diagnostic, predictive and prescriptive.
Another way of breaking down types of supply chain analytics is by their form
and function. Advisory firm Supply Chain Insights, for example, breaks down
the types of supply chain analytics into the following functions:
Workflow
Decision support
Collaboration
Unstructured text mining
Structured data management
In this model, the different types of analytics feed into each other as part of an
end-to-end ongoing process for improving supply chain management.
For example, a company could use unstructured text mining to turn raw data
from contracts, social media feeds and news reports into structured data that is
relevant to the supply chain. This improved, more structured data could then
help automate and improve workflows, such as procure-to-pay processes. The
data in digitized workflows is much easier to capture than data from manual
workflows, thus increasing the data available for decision support systems.
Better decision support could in turn enhance collaboration across different
departments like procurement and warehouse management or between supply
chain partners.
The advent of mainframe computers gave rise to the data processing work done
by IBM researcher Hans Peter Luhn, who some credit for coining the
term business intelligence in his 1958 paper, "A Business Intelligence System."
His work helped build the foundation for the different types of data analytics
used in supply chain analytics.
As the internet became a force in the 1990s, people looked at how it could be
applied in supply chain management. A pioneer in this area was the British
technologist Kevin Ashton. As a young product manager tasked with solving the
problem of keeping a popular lipstick on store shelves, Ashton hit upon radio
frequency identification sensors as a way to automatically capture data about the
movement of products across the supply chain. Ashton, who would go on to co-
found the Massachusetts Institute of Technology's Auto-ID Center that perfected
RFID technology and sensors, coined the term internet of things to explain this
revolutionary new feature of supply chain management.
The 1990s also saw the development of CEP by researchers such as the team
headed by Stanford University's David Luckham and others. CEP's ability to
capture incoming data from real-time events helped supply chain managers
correlate low-level data related to factory operations, the physical movements of
products, and weather into events that could then be analyzed by supply chain
analytics tools. For example, data about production processes could be
abstracted to factory performance, which in turn could be abstracted into
business events related to things like inventory levels.
Another turning point in the field of supply chain analytics was the advent of
cloud computing, a new vehicle for delivering IT infrastructure, software and
platforms as service. By providing a foundation for orchestrating data across
multiple sources, the cloud has driven improvements in many types of analytics,
including supply chain analytics. The emergence of data lakes
like Hadoop allowed enterprises to capture data from different sources on a
common platform, further refining supply chain analytics by enabling companies
to correlate more types of data. Data lakes also made it easier to implement
advanced analytics that operated on a variety of structured and unstructured data
from different applications, event streams and the IoT.
Other technologies expected to play a big role in supply chain analytics and
management include those listed above, from stream processing and location
intelligence to digital twins and graph databases.
Statistical Analysis
Statistical analysis is the collection and interpretation of data in order to uncover
patterns and trends. It is a component of data analytics. Statistical analysis can
be used in situations like gathering research interpretations, statistical modeling
or designing surveys and studies. It can also be useful for business intelligence
organizations that have to work with large data volumes.
Analytic Database
An analytic database is a read-only system that stores historical data on business
metrics such as sales performance and inventory levels. Business analysts,
corporate executives and other workers can run queries and reports against an
analytic database. The information is updated on a regular basis to incorporate
recent transaction data from an organization’s operational systems.
Types of analytic databases include the following:
Data warehouse appliances, which combine the database with hardware and
BI tools in an integrated platform that's tuned for analytical workloads and
designed to be easy to install and operate.
In-memory databases, which load the source data into system memory in a
compressed, non-relational format in an attempt to streamline the work involved
in processing queries.
Massively parallel processing (MPP) databases, which spread data across a
cluster of servers, enabling the systems to share the query processing workload.
Real-Time Analytics
Real-time analytics is the use of data and related resources for analysis as soon
as it enters the system. The adjective real-time refers to a level of computer
responsiveness that a user senses as immediate or nearly immediate. The term is
often associated with streaming data architectures and real-time operational
decisions that can be made automatically through robotic process
automation and policy enforcement.
Whereas historical data analysis uses a set of historical data for batch analysis,
real-time analytics visualizes and analyzes the data as it appears in the
computer system. This enables data scientists to act on data the moment it
arrives. A typical real-time analytics pipeline includes:
an aggregator that gathers data event streams (and perhaps batch files)
from a variety of data sources;
a broker that makes data available for consumption; and
an analytics engine that analyzes the data, correlates values and blends
streams together.
The system that receives and sends data streams and executes the application and
real-time analytics logic is called the stream processor.
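A minimal stream processor along these lines can be sketched with PySpark
Structured Streaming; the socket source on localhost:9999 and the word-count
analytic are assumptions chosen only to keep the example self-contained.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamProcessorSketch").getOrCreate()

# aggregator/broker side: read an unbounded stream of text lines
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# analytics engine side: correlate values as they arrive (here, word counts)
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# results are pushed out continuously rather than in a single batch
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()   # blocks; stop with Ctrl+C in this sketch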
In order for the real-time data to be useful, the real-time analytics applications
being used should have high availability and low response times. These
applications should also feasibly manage large amounts of data, up to terabytes.
This should all be done while returning answers to queries within seconds.
The term real-time also includes managing changing data sources -- something
that may arise as market and business factors change within a company. As a
result, the real-time analytics applications should be able to handle big data. The
adoption of real-time big data analytics can maximize business returns, reduce
operational costs and introduce an era where machines can interact over
the internet of things using real-time information to make decisions on their
own.
Different technologies exist that have been designed to meet these demands,
including the growing quantities and diversity of data. Some of these new
technologies are based on specialized appliances -- such as hardware and
software systems. Other technologies utilize a special processor and memory
chip combination, or a database with analytics capabilities embedded in its
design.
Businesses that utilize real-time analytics greatly reduce risk throughout their
company since the system uses data to predict outcomes and suggest alternatives
rather than relying on the collection of speculations based on past events or
recent scans -- as is the case with historical data analytics. Real-time analytics
provides insights into what is going on in the moment.
Once the company has agreed on what real time means, it faces
the challenge of creating an architecture with the ability to process data at high
speeds. Unfortunately, data sources and applications can cause processing-speed
requirements to vary from milliseconds to minutes, making creation of a capable
architecture difficult. Furthermore, the architecture must also be capable of
handling quick changes in data volume and should be able to scale up as the data
grows.
Finally, companies may find that their employees are resistant to the change
when implementing real-time analytics. Therefore, businesses should focus on
preparing their staff by providing appropriate training and fully communicating
the reasons for the change to real-time analytics.
Here are some examples of how enterprises are tapping into real-time analytics:
We will explore each of the following terms in detail, one by one:
Artificial Intelligence
Artificial intelligence, or AI for short, has been around since the mid 1950s. It’s
not necessarily new. But it became super popular recently because of the
advancements in processing capabilities. Back in the 1900s, there just wasn’t the
necessary computing power to realize AI. Today, we have some of the fastest
computers the world has ever seen. And the algorithm implementations have
improved so much that we can run them on commodity hardware, even your
laptop or smartphone that you’re using to read this right now. And given the
seemingly endless possibilities of AI, everybody wants a piece of it.
But what exactly is artificial intelligence? Artificial intelligence is the ability that
can be imparted to computers which enables these machines to understand data,
learn from the data, and make decisions based on patterns hidden in the data, or
inferences that could otherwise be very difficult (to almost impossible) for
humans to make manually. AI also enables machines to adjust their “knowledge”
based on new inputs that were not part of the data used for training these
machines.
But there's one thing you need to make sure of: that you have enough data for the
AI to learn from. If you have a very small data lake to train your AI model, the
accuracy of the prediction or decision could be low. So the more data, the better
the training of the AI model and the more accurate the outcome.
Depending on the size of your training data, you can choose various algorithms
for your model. This is where machine learning and deep learning start to show
up.
In the early days of AI, neural networks were all the rage. There were multiple
groups of people across the globe working on bettering their neural networks.
But as mentioned earlier, the limitations of computing hardware hindered the
advancement of AI. From the late 1980s all the way up to the 2010s, machine
learning dominated. Every major tech company was investing heavily in machine
learning; companies such as Google, Amazon, IBM and Facebook were virtually
pulling AI and ML Ph.D. graduates straight out of universities. These days, even
machine learning has taken a back seat; it's all about deep learning now. There
has definitely been an evolution of AI over the last few decades, and it gets better
with every passing year.
Each processing node has its own small sphere of knowledge, including what it
has seen and any rules it was originally programmed with or developed for itself.
The tiers are highly interconnected, which means each node in tier n will be
connected to many nodes in tier n-1, which supply its inputs, and to many nodes
in tier n+1, to which it provides input data. There may be one or multiple nodes
in the output layer, from which the answer the network produces can be read.
Artificial neural networks are notable for being adaptive, which means they
modify themselves as they learn from initial training and subsequent runs
provide more information about the world. The most basic learning model is
centered on weighting the input streams, which is how each node weights the
importance of input data from each of its predecessors. Inputs that contribute to
getting right answers are weighted higher.
For example, if nodes David, Dianne and Dakota tell node Ernie the current
input image is a picture of Brad Pitt, but node Durango says it is Betty White,
and the training program confirms it is Pitt, Ernie will decrease the weight it
assigns to Durango's input and increase the weight it gives to that of David,
Dianne and Dakota.
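The following NumPy snippet is a toy numerical sketch of that weighting idea,
reusing the node names from the example above; the signed votes and the simple
delta-rule update are illustrative assumptions, not a full training algorithm.
import numpy as np

# Ernie's current trust in David, Dianne, Dakota and Durango
weights = np.array([0.25, 0.25, 0.25, 0.25])
# signed votes: +1 means "Brad Pitt", -1 means "Betty White"
votes = np.array([1.0, 1.0, 1.0, -1.0])
target = 1.0           # the training program confirms the image is Pitt
learning_rate = 0.1

prediction = np.dot(weights, votes)   # Ernie's combined opinion
error = target - prediction           # how far off Ernie was

# inputs that argued for the right answer gain weight,
# the input that argued for the wrong answer loses weight
weights += learning_rate * error * votes
print(weights)   # Durango's weight drops; the other three rise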
In defining the rules and making determinations -- that is, the decision of each
node on what to send to the next tier based on inputs from the previous tier --
neural networks use several principles. These include gradient-based
training, fuzzy logic, genetic algorithms and Bayesian methods. They may be
given some basic rules about object relationships in the space being modeled.
Further, the assumptions people make when training algorithms cause neural
networks to amplify cultural biases. Biased data sets are an ongoing challenge in
training systems that find answers on their own by recognizing patterns in data.
If the data feeding the algorithm isn't neutral -- and almost no data is -- the
machine propagates bias.
Recurrent neural networks (RNN) are more complex. They save the output of
processing nodes and feed the result back into the model. This is how the model
is said to learn to predict the outcome of a layer. Each node in the RNN model
acts as a memory cell, continuing the computation and implementation of
operations. This neural network starts with the same front propagation as a feed-
forward network, but then goes on to remember all processed information in
order to reuse it in the future. If the network's prediction is incorrect, then the
system self-learns and continues working towards the correct prediction during
backpropagation. This type of ANN is frequently used in text-to-speech
conversions.
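The core recurrence of such a network can be sketched in a few lines of NumPy;
the sizes and random weights below are placeholders rather than a trained
text-to-speech model.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input weights
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent weights
b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)                      # the "memory": carries past context
sequence = rng.normal(size=(5, input_size))    # five time steps of dummy input

for x_t in sequence:
    # the previous hidden state is fed back in, so earlier inputs
    # influence how later inputs are processed
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print(h)   # the final state summarizes the whole sequence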
Convolutional neural networks (CNN) are one of the most popular models
used today. This neural network computational model uses a variation of
multilayer perceptrons and contains one or more convolutional layers that can be
either entirely connected or pooled. These convolutional layers create feature
maps that record regions of the image, which are ultimately broken into rectangles
and sent out for nonlinear processing. The CNN model is particularly popular in
the realm of image recognition; it has been used in many of the most advanced
applications of AI, including facial recognition, text digitization and natural
language processing. Other uses include paraphrase detection, signal processing
and image classification.
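For illustration, a small CNN of this shape can be declared with Keras (assuming
TensorFlow is installed); the 28x28 grayscale input and the layer sizes are
arbitrary choices for the sketch, not a recommended architecture.
import tensorflow as tf

model = tf.keras.Sequential([
    # convolutional layer producing feature maps over local image regions
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    # pooling layer shrinking each feature map
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    # fully connected layers doing the nonlinear processing and classification
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()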
Deconvolutional neural networks utilize a reversed CNN model process. They
aim to find lost features or signals that may have originally been considered
unimportant to the CNN system's task. This network model can be used in image
synthesis and analysis.
Parallel processing abilities mean the network can perform more than
one job at a time.
Information is stored on an entire network, not just a database.
The ability to learn and model nonlinear, complex relationships helps
model the real life relationships between input and output.
Fault tolerance means the corruption of one or more cells of the ANN
will not stop the generation of output.
Gradual corruption means the network will slowly degrade over time,
instead of a problem destroying the network instantly.
The ability to produce output even with incomplete knowledge, with the loss
of performance depending on how important the missing information
is.
No restrictions are placed on the input variables, such as how they
should be distributed.
Machine learning means the ANN can learn from events and make
decisions based on the observations.
The ability to learn hidden relationships in the data, without imposing
any fixed relationship, means an ANN can better model highly volatile
data and non-constant variance.
The ability to generalize and infer unseen relationships on unseen data
means ANNs can predict the output of unseen data.
The lack of rules for determining the proper network structure means the
appropriate artificial neural network architecture can only be found
through trial and error and experience.
The requirement of processors with parallel processing abilities makes
neural networks hardware dependent.
The network works with numerical information, therefore all problems
must be translated into numerical values before they can be presented to
the ANN.
The lack of explanation behind probing solutions is one of the biggest
disadvantages in ANNs. The inability to explain the why or how behind
the solution generates a lack of trust in the network.
Chatbots
Natural language processing, translation and language generation
Stock market prediction
Delivery driver route planning and optimization
Drug discovery and development
These are just a few specific areas to which neural networks are being applied
today. Prime uses involve any process that operates according to strict rules or
patterns and has large amounts of data. If the data involved is too large for a
human to make sense of in a reasonable amount of time, the process is likely a
prime candidate for automation through artificial neural networks.
It wasn't until around 2010 that research picked up again. The big data trend,
where companies amass vast troves of data, and parallel computing gave data
scientists the training data and computing resources needed to run complex
artificial neural networks. In 2012, a neural network dramatically outperformed
earlier approaches at an image recognition task in the ImageNet competition.
Since then, interest in artificial neural networks has soared and the technology
continues to improve.
Machine Learning
Machine learning (ML) is a type of artificial intelligence (AI) that allows
software applications to become more accurate at predicting outcomes without
being explicitly programmed to do so. Machine learning algorithms use
historical data as input to predict new output values.
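A tiny scikit-learn example makes the "historical data in, predicted values out"
idea concrete; the maintenance figures below are invented for the illustration.
from sklearn.linear_model import LinearRegression

# historical observations: hours of machine use -> maintenance cost
hours = [[100], [200], [300], [400]]
cost = [25.0, 45.0, 70.0, 90.0]

# the model learns the relationship from the historical data
model = LinearRegression().fit(hours, cost)

# and predicts the output value for a new, unseen input
print(model.predict([[250]]))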
Recommendation engines are a common use case for machine learning. Other
popular uses include fraud detection, spam filtering, malware threat
detection, business process automation (BPA) and predictive maintenance.
Robotics. Robots can learn to perform tasks in the physical world using
this technique.
Video gameplay. Reinforcement learning has been used to teach bots to
play a number of video games.
Resource management. Given finite resources and a defined goal,
reinforcement learning can help enterprises plan how to allocate
resources.
Uses of machine learning
Today, machine learning is used in a wide range of applications. Perhaps one of
the most well-known examples of machine learning in action is
the recommendation engine that powers Facebook's News Feed.
Behind the scenes, the engine is attempting to reinforce known patterns in the
member's online behavior. Should the member change patterns and fail to read
posts from a particular group in the coming weeks, the News Feed will adjust
accordingly.
Human resource information systems -- HRIS systems can use machine learning
models to filter through applications and identify the best candidates for an open
position.
Self-driving cars -- Machine learning algorithms can even make it possible for a
semi-autonomous car to recognize a partially visible object and alert the driver.
But machine learning comes with disadvantages. First and foremost, it can be
expensive. Machine learning projects are typically driven by data scientists, who
command high salaries. These projects also require software infrastructure that
can be high-cost.
There is also the problem of machine learning bias. Algorithms that trained on
data sets that exclude certain populations or contain errors can lead to inaccurate
models of the world that, at best, fail and, at worst, are discriminatory. When an
enterprise bases core business processes on biased models, it can run into
regulatory and reputational harm.
Step 1: Align the problem with potential data inputs that should be considered
for the solution. This step requires help from data scientists and experts who
have a deep understanding of the problem.
Step 2: Collect data, format it and label the data if necessary. This step is
typically led by data scientists, with help from data wranglers.
Step 3: Choose which algorithm(s) to use and test to see how well they perform.
This step is usually carried out by data scientists.
1834 - Charles Babbage conceives the idea for a general all-purpose device that
could be programmed with punched cards.
1842 - Ada Lovelace describes a sequence of operations for solving
mathematical problems using Charles Babbage's theoretical punch-card
machine and becomes the first programmer.
1847 - George Boole creates Boolean logic, a form of algebra in which all values
can be reduced to the binary values of true or false.
1959 - MADALINE becomes the first artificial neural network applied to a real-
world problem: removing echoes from phone lines.
1985 - Terry Sejnowski and Charles Rosenberg's artificial neural network taught
itself how to correctly pronounce 20,000 words in one week.
2006 - Computer scientist Geoffrey Hinton invents the term deep learning to
describe neural net research.
2014 - A chatbot passes the Turing Test by convincing 33% of human judges
that it was a Ukrainian teen named Eugene Goostman.
2016 - Google DeepMind's AlphaGo defeats a human champion in Go, one of the
most difficult board games in the world.
2016 - LipNet, DeepMind's artificial-intelligence system, identifies lip-read
words in video with an accuracy of 93.4%.
2019 - Amazon controls 70% of the market share for virtual assistants in the
U.S.
Types of Machine Learning Algorithms
The nine machine learning algorithms that follow are among the most popular
and commonly used to train enterprise models. The models each support
different goals, range in user friendliness and use one or more of the following
machine learning approaches: supervised learning, unsupervised learning, semi-
supervised learning or reinforcement learning.
Linear regression
Linear regression is best for when "you are looking at predicting your value or
predicting a class," said Shekhar Vemuri, CTO of technology service company
Clairvoyant, based in Chandler, Ariz.
Support vector machine
The support vector machine (SVM) algorithm works best for training data that can clearly be separated by a
line, also referred to as a hyperplane. Nonlinear data can be programmed into a
facet of SVM called nonlinear SVMs. But, with training data that's hyper-
complex -- faces, personality traits, genomes and genetic material -- the class
systems become smaller and harder to identify and require a bit more human
assistance.
SVMs are used heavily in the financial sector, as they offer high accuracy on
both current and future data sets. The algorithms can be used to compare relative
financial performance, value and investment gains virtually.
Companies with nonlinear data and different kinds of data sets often use SVM,
Vemuri said.
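A minimal scikit-learn sketch of a support vector machine is shown below; the
four toy points stand in for the financial features mentioned above, and the RBF
kernel illustrates the nonlinear SVM case.
from sklearn.svm import SVC

# two classes that cannot be separated by a single straight line
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 0, 1, 1]

# an RBF kernel lets the SVM draw a nonlinear boundary
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X, y)

print(clf.predict([[0.9, 0.1]]))   # classify a new point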
Decision tree
A decision tree algorithm takes data and graphs it out in branches to show the
possible outcomes of a variety of decisions. Decision trees classify and predict
response variables based on past decisions.
Decision trees are a visual method of mapping out decisions. Their results are
easy to explain and can be accessible to citizen data scientists. A decision tree
algorithm maps out various decisions and their likely impact on an end result
and can even be used with incomplete data sets.
Decision trees, due to their long-tail visuals, work best for small data sets, low-
stakes decisions and concrete variables. Because of this, common decision tree
use cases involve augmenting option pricing -- from mortgage lenders
classifying borrowers to product management teams quantifying the shift in
the market that would occur if they changed a major ingredient.
Decision trees remain popular because they can outline multiple outcomes and
tests without requiring data scientists to deploy multiple algorithms, said Jeff
Fried, director of product management for InterSystems, a software company
based in Cambridge, Mass.
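To show why decision tree results are easy to explain, here is a minimal sketch,
assuming scikit-learn and its bundled Iris data as a stand-in: the fitted tree's
branches can be printed as plain if/else rules that a citizen data scientist can
follow.

# Sketch: a small decision tree whose branching rules can be printed and explained.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree))          # human-readable branches, useful for explaining outcomes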
Apriori
The Apriori algorithm, based on the Apriori principle, is most commonly used in
market basket analysis to mine item sets and generate association rules. The
algorithm checks for a correlation between two items in a data set to determine
whether the correlation between them is positive or negative.
The Apriori algorithm is primed for sales teams that seek to notice which
products customers are more likely to buy in combination with other products. If
a high percentage of customers who purchase bread also purchase butter, the
algorithm can conclude that purchase of A (bread) will often lead to purchase of
B (butter). This can be cross-referenced in data sets, data points and purchase
ratios.
Apriori algorithms can also determine that purchase of A (bread) is only 10%
likely to lead to the purchase of C (corn). Marketing teams can use this
information to inform things like product placement strategies. Besides sales
functions, Apriori algorithms are favored by e-commerce giants, like Amazon
and Alibaba, but are also used to understand searcher intent by sites like Bing
and Google to predict searches by correlating associated words.
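The bread-and-butter example above boils down to simple support and confidence
arithmetic. The following sketch, using a handful of hypothetical baskets rather
than any real transaction data, computes the support of the pair and the
confidence of the bread-to-butter rule.

# Sketch: the support/confidence arithmetic behind association rules like bread -> butter.
from itertools import combinations
from collections import Counter

baskets = [                       # hypothetical transactions
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "corn"},
    {"milk", "corn"},
    {"bread", "butter", "corn"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(baskets)
support_bread_butter = pair_counts[("bread", "butter")] / n
confidence = pair_counts[("bread", "butter")] / item_counts["bread"]
print(f"support(bread, butter) = {support_bread_butter:.2f}")
print(f"confidence(bread -> butter) = {confidence:.2f}")

A full Apriori implementation would also prune candidate item sets against a
minimum-support threshold; this sketch only shows the counting behind a single
rule.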
K-means clustering
The K-means algorithm is an iterative method of sorting data points into groups
based on similar characteristics. For example, a K-means cluster algorithm
would sort web results for the word civic into groups relating to Honda Civic
and civic as in municipal or civil.
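A minimal sketch of the "civic" example, assuming scikit-learn is available: the
short search snippets below are made up, and TF-IDF features stand in for
whatever representation a real system would use before k-means groups the
results into two clusters.

# Sketch: clustering toy "civic" search results into two groups with k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

results = [                                   # hypothetical search snippets
    "2021 Honda Civic sedan review and pricing",
    "Used Honda Civic hatchback for sale",
    "Civic duty: how to register to vote",
    "City council meeting on civic engagement",
]
X = TfidfVectorizer().fit_transform(results)  # turn text into numeric features
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for text, label in zip(results, labels):
    print(label, text)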
Generative adversarial network
GANs are deep generative models that have gained popularity. GANs have the
ability to imitate data in order to model and predict. They work by essentially
pitting two models against each other in a competition to develop the best
solution to a problem. One neural network, a generator, creates new data while
another, the discriminator, works to improve on the generator's data. After many
iterations of this, data sets become more and more lifelike and realistic. Popular
media uses GANs to do things like face creation and audio manipulation. GANs
are also impactful for creating large data sets using limited training points,
optimizing models and improving manufacturing processes.
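The generator-versus-discriminator loop can be sketched compactly. The toy
example below, assuming PyTorch is installed, trains a generator to imitate
samples from a simple one-dimensional Gaussian rather than images or audio; it
is a conceptual sketch of the competition described above, not a production GAN.

# Sketch: the generator-vs-discriminator loop of a GAN, fitting a 1-D Gaussian.
import torch
import torch.nn as nn

real_data = lambda n: torch.randn(n, 1) * 1.5 + 4.0      # "real" samples: mean 4, spread 1.5
noise = lambda n: torch.randn(n, 8)                       # latent input to the generator

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))       # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))       # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    # 1) Train the discriminator to tell real samples from generated ones.
    real, fake = real_data(64), G(noise(64)).detach()
    d_loss = loss_fn(D(real), torch.ones(64, 1)) + loss_fn(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train the generator to fool the discriminator.
    fake = G(noise(64))
    g_loss = loss_fn(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print("generated mean:", G(noise(1000)).mean().item())    # should drift toward ~4.0

After enough iterations the generator's outputs drift toward the mean and spread
of the "real" distribution, which is the same dynamic that makes image-generating
GANs increasingly lifelike.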
Reinforcement learning
Reinforcement learning algorithms are based on a system of rewards and
punishments learned through trial and error. The model is given a goal, seeks
the maximum reward for getting closer to that goal based on limited information,
and learns from its previous actions. Reinforcement learning algorithms can be
model-free -- creating interpretations of data through constant trial and error -- or
model-based -- adhering more closely to a set of predefined steps with minimal
trial and error.
Q-learning
Q-learning algorithms are model-free, which means they seek the best method of
achieving a defined goal by trying as many actions as possible in pursuit of
the maximum reward. Q-learning is often paired with deep learning
models in research projects, including Google's DeepMind. Q-learning further
breaks down into various algorithms, including deep deterministic policy
gradient (DDPG) or hindsight experience replay (HER).
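A tabular sketch of the model-free idea, using a made-up six-state corridor
environment: the agent tries actions, receives a reward only at the final state
and repeatedly nudges its Q-values toward the reward plus the discounted best
future value.

# Sketch: tabular, model-free Q-learning on a 6-state corridor with a reward at the end.
import random

n_states, actions = 6, [-1, +1]          # move left or right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2    # learning rate, discount, exploration rate

for episode in range(500):
    s = 0
    while s != n_states - 1:
        a = random.randrange(2) if random.random() < epsilon else Q[s].index(max(Q[s]))
        s_next = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move Q(s, a) toward reward plus discounted best future value.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print("learned Q-values:", [[round(q, 2) for q in row] for row in Q])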
Automated machine learning is one of the trendiest and most popular areas of
enterprise AI software right now. With vendors offering everything from
individual automated machine learning tools to cloud-based, full-service
programs, autoML is quickly helping enterprises streamline business processes and
dive into AI.
In light of the rise of autoML, analysts and experts are encouraging enterprises
to evaluate their specific needs alongside the intended purpose of the tools -- to
augment data scientists' work -- instead of trying to use autoML without a larger
AI framework.
Whether your enterprise has a flourishing data science team, citizen data science
team or relies heavily on outsourcing data science work, autoML can provide
value if you choose tools and use cases wisely.
"Make sure that you're using [autoML] for the right intended purpose, which is
automate the grunt work that a data scientist typically has to do," said Shekhar
Vemuri, CTO of technology service company Clairvoyant, based in Chandler,
Ariz.
AutoML tools are being used to augment and speed up the modeling process,
because data scientists spend most of their time on data engineering and data
washing, said Evan Schnidman, CEO of natural language processing company
Prattle, based in St. Louis.
"The first ranges of tools are all about how [to] streamline the data ingestion,
data washing process. The next ranges of the tools are how [to] then streamline
model development and model deployment. And then the third ranges are how
[to] streamline model testing and validation," he said.
Still, experts warned autoML users not to expect automated machine learning
tools to replace data scientists.
AutoML and augmented analytics do not fully replace expert data scientists, said
Carlie Idoine, senior director and analyst of data science and business analytics
at Gartner.
"A key realization should be that we're using autoML to essentially gain scale
and try out more things than we could do manually or hand code," Vemuri said.
Schnidman echoed the sentiment, calling autoML a support tool for data
scientists. Businesses that have a mature data science team are poised to get the
most net value, because the automated tools are an extension of data scientists'
capabilities.
"AutoML works for those who say, We've done this manually and taken it as far
as we can go. So, we want to use these augmented tools to do feature
engineering, maybe take out some bias we have and see what it finds that we
didn't consider,'" Idoine said.
If enterprises intend for autoML to replace their data science team, or be their
only point of AI development, the tools will give limited advantages. AutoML is
only one step of many in an overall AI strategy -- especially in enterprises that
are heavily regulated and those affected by recent data protection laws.
"Regulated industries and verticals have all these other legal concerns that they
need to keep in mind and stay on top of. Make sure that you're able to ensure that
your tool of choice is able to integrate into your overall AI workflow," Vemuri
said.
Limitations of tools
The biggest limitation of automated machine learning tools today is they work
best on known types of problems using algorithms like regression and
classification. Because autoML has to be programmed to follow steps, some
algorithms and forms of AI are not compatible with automation.
"Some of the newer types of algorithms like deep neural nets aren't really well
suited for autoML; that type of analysis is much more sophisticated, and it's not
something that can be easily explained," Idoine said.
AutoML is also wrapped up in the problem of black box algorithms and testing.
If a process can't be easily outlined -- even if the automated machine learning
tool can complete it -- the process will be hard to explain. Black box
functionality comes with a whole host of its own issues, including bias and
struggles with incomplete data sets.
"We don't want to encourage black boxes for people that aren't experts in this
type of work," Idoine said.
But some enterprises are finding ways to work around this problem. At the Flink
Forward conference in San Francisco, engineers at Comcast and Capital One
described how they are using Apache Flink to help bridge this gap to speed the
deployment of new AI algorithms.
Version everything
The tools used by data scientists and engineers can differ in subtle ways. That
leads to problems replicating good AI models in production.
"This ensures that what we put into production and feature engineering are
consistent," said Dave Torok, senior enterprise architect at Comcast.
At the moment, this process is not fully automated. However, the goal is to move
toward full lifecycle automation for Comcast's machine learning development
pipeline.
Capital One ran into similar problems trying to connect Python for its data
scientists and Java for its production environment to create better fraud detection
algorithms. They did some work to build up a Jython library that acts as an
adaptor.
Andrew Gao, software engineer at Capital One, said the bank's previous
algorithms did not have access to all of a customer's activities. On the production
side, these models needed to be able to return an answer in a reasonable amount
of time.
"We want to catch fraud, but not create a poor customer experience," Gao said.
The initial project started off as one monolithic Flink application. However,
Capital One ran into problems merging data from historical data sources and
current streaming data, so they broke this up into several smaller microservices
that helped address the problem.
This points to one of the current limitations of using stream processing for
building AI apps. Stephan Ewen, chief technology officer at Data Artisans and
lead developer of Flink, said that the development of Flink tooling has
traditionally focused on AI and machine learning in production.
"Engineers can do model training logic using Flink, but we have not pushed for
that. This is coming up more and more," he said.
Deep Learning
Deep learning is a type of machine learning (ML) and artificial intelligence (AI)
that imitates the way humans gain certain types of knowledge. Deep learning is
an important element of data science, which includes statistics and predictive
modeling. It is extremely beneficial to data scientists who are tasked with
collecting, analyzing and interpreting large amounts of data; deep learning
makes this process faster and easier.
Initially, the computer program might be provided with training data -- a set of
images for which a human has labeled each image "dog" or "not dog" with meta
tags. The program uses the information it receives from the training data to
create a feature set for "dog" and build a predictive model. In this case, the
model the computer first creates might predict that anything in an image that has
four legs and a tail should be labeled "dog." Of course, the program is not aware
of the labels "four legs" or "tail." It will simply look for patterns of pixels in the
digital data. With each iteration, the predictive model becomes more complex
and more accurate.
Unlike the toddler, who will take weeks or even months to understand the
concept of "dog," a computer program that uses deep learning algorithms can be
shown a training set and sort through millions of images, accurately identifying
which images have dogs in them within a few minutes.
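As a hedged sketch of the "dog"/"not dog" example, the code below (assuming
PyTorch) defines a tiny convolutional network and runs a few training passes.
The random tensors merely stand in for a labeled image data set; with real
photos, each pass would refine the learned "dog" feature set in the same way.

# Sketch: a tiny "dog" vs. "not dog" convolutional network; the random tensors stand in
# for labeled training images, which in practice would come from a real data set.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 2),          # two outputs: "dog" and "not dog"
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(32, 3, 64, 64)      # placeholder batch of 64x64 RGB images
labels = torch.randint(0, 2, (32,))      # placeholder "dog"/"not dog" labels

for epoch in range(5):                   # each pass refines the predictive model
    logits = model(images)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    print("epoch", epoch, "loss", round(loss.item(), 3))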
Customer experience. Deep learning models are already being used for
chatbots. And, as it continues to mature, deep learning is expected to be
implemented in various businesses to improve the customer experiences
and increase customer satisfaction.
Text generation. Machines are being taught the grammar and style of a
piece of text and are then using this model to automatically create a
completely new text matching the proper spelling, grammar and style of
the original text.
Aerospace and military. Deep learning is being used to detect objects
from satellites that identify areas of interest, as well as safe or unsafe
zones for troops.
Industrial automation. Deep learning is improving worker safety in
environments like factories and warehouses by providing services that
automatically detect when a worker or object is getting too close to a
machine.
Adding color. Color can be added to black and white photos and videos
using deep learning models. In the past, this was an extremely time-
consuming, manual process.
Medical research. Cancer researchers have started implementing deep
learning into their practice as a way to automatically detect cancer cells.
Computer vision. Deep learning has greatly enhanced computer vision,
providing computers with extreme accuracy for object detection and
image classification, restoration and segmentation.
Limitations and challenges
The biggest limitation of deep learning models is they learn through
observations. This means they only know what was in the data on which they
trained. If a user has a small amount of data or it comes from one specific source
that is not necessarily representative of the broader functional area, the models
will not learn in a way that is generalizable.
The issue of biases is also a major problem for deep learning models. If a model
trains on data that contains biases, the model will reproduce those biases in its
predictions. This has been a vexing problem for deep learning programmers,
because models learn to differentiate based on subtle variations in data elements.
Often, the factors it determines are important are not made explicitly clear to the
programmer. This means, for example, a facial recognition model might make
determinations about people's characteristics based on things like race or gender
without the programmer being aware.
The learning rate can also become a major challenge to deep learning models. If
the rate is too high, then the model will converge too quickly, producing a less-
than-optimal solution. If the rate is too low, then the process may get stuck, and
it will be even harder to reach a solution.
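The effect of the learning rate is easy to see on a toy objective. The sketch
below runs plain gradient descent on f(x) = x**2 with three illustrative rates;
the specific numbers are arbitrary, but a rate that is too small barely moves
toward the minimum, while one that is too large overshoots and never settles.

# Sketch: how the learning rate changes gradient descent on the simple function f(x) = x**2.
def descend(lr, steps=25):
    x = 5.0
    for _ in range(steps):
        x -= lr * 2 * x           # gradient of x**2 is 2x
    return x

for lr in (0.01, 0.1, 1.1):       # too low, reasonable, too high
    print(f"learning rate {lr}: x ends at {descend(lr):.4f}")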
The hardware requirements for deep learning models can also create limitations.
Multicore high-performing graphics processing units (GPUs) and other similar
processing units are required to ensure improved efficiency and decreased time
consumption. However, these units are expensive and use large amounts of
energy. Other hardware requirements include random access memory (RAM)
and a hard drive or RAM-based solid-state drive (SSD).
Furthermore, machine learning does not require the same costly, high-end
machines and high-performing GPUs that deep learning does.
In the end, many data scientists choose traditional machine learning over deep
learning due to its superior interpretability, or the ability to make sense of the
solutions. Machine learning algorithms are also preferred when the data is small.
Instances where deep learning becomes preferable include situations where there
is a large amount of data, a lack of domain understanding for feature
introspection or complex problems, such as speech recognition and natural
language processing.
History
Deep learning can trace its roots back to 1943 when Warren McCulloch and
Walter Pitts created a computational model for neural networks using
mathematics and algorithms. However, it was not until the mid-2000s that the
term deep learning started to appear. It gained popularity following the
publication of a paper by Geoffrey Hinton and Ruslan Salakhutdinov that
showed how a neural network with many layers could be trained one layer at a
time.
In 2012, Google made a huge impression on deep learning when its algorithm
demonstrated the ability to recognize cats. Two years later, in 2014, Google bought
DeepMind, an artificial intelligence startup from the U.K. Two years after that,
in 2016, Google DeepMind's algorithm, AlphaGo, mastered the complicated
board game Go, beating professional player Lee Sedol at a tournament in Seoul.
Recently, deep learning models have generated the majority of advances in the
field of artificial intelligence. Deep reinforcement learning has emerged as a way
to integrate AI with complex applications, such as robotics, video games and
self-driving cars. The primary difference between deep learning and
reinforcement learning is, while deep learning learns from a training set and then
applies what is learned to a new data set, deep reinforcement learning learns
dynamically by adjusting actions using continuous feedback in order to optimize
the reward.
A reinforcement learning agent has the ability to provide fast and strong control
of generative adversarial networks (GANs). The Adversarial Threshold Neural
Computer (ATNC) combines deep reinforcement learning with GANs in order to
design small organic molecules with a specific, desired set of pharmacological
properties.
GANs are also being used to generate artificial training data for machine
learning tasks, which can be used in situations with imbalanced data sets or
when data contains sensitive information.
Here is a very simple illustration of how a deep learning program works. This
video by the LuLu Art Group shows the output of a deep learning program after
its initial training with raw motion capture data. This is what the program
predicts the abstract concept of "dance" looks like.
With each iteration, the program's predictive model became more complex and
more accurate.
Di Santo was investigating the motions involved when fish such as skates swim.
She filmed individual fish in a tank and manually annotated their body parts
frame by frame, an effort that required about a month of full-time work for 72
seconds of footage. Using an open-source application called DLTdv, developed
in the computer language MATLAB, she then extracted the coordinates of body
parts — the key information needed for her research. That analysis showed,
among other things, that when little skates (Leucoraja erinacea) need to swim
faster, they create an arch on their fin margin to stiffen its edge [1].
Researchers have long been interested in tracking animal motion, Mathis says,
because motion is “a very good read-out of intention within the brain”. But
conventionally, that has involved spending hours recording behaviours by hand.
The previous generation of animal-tracking tools mainly determined centre of
mass and sometimes orientation, and the few tools that captured finer details
were highly specialized for specific animals or subject to other constraints, says
Talmo Pereira, a neuroscientist at Princeton University in New Jersey.
Each tool has limitations; some require specific experimental set-ups, or don’t
work well when animals always crowd together. But methods will improve
alongside advances in image capture and machine learning, says Sandeep Robert
Datta, a neuroscientist at Harvard Medical School in Boston, Massachusetts.
“What you’re looking at now is just the very beginning of what is certain to be a
long-term transformation in the way neuroscientists study behaviour,” he says.
Strike a pose
DeepLabCut is based on software used to analyse human poses. Mathis’ team
adapted its underlying neural network to work for other animals with relatively
few training data. Between 50 and 200 manually annotated frames are generally
sufficient for standard lab studies, although the amount needed depends on
factors such as data quality and the consistency of the people doing the labelling,
Mathis says. In addition to annotating body parts with a GUI, users can issue
commands through a Jupyter Notebook, a computational document popular with
data scientists. Scientists have used DeepLabCut to study both lab and wild
animals, including mice, spiders, octopuses and cheetahs. Neuroscientist Wujie
Zhang at the University of California, Berkeley, and his colleague used it to
estimate the behavioural activity of Egyptian fruit bats (Rousettus aegyptiacus)
in the lab [2].
DeepPoseKit offers “very good innovations”, Pereira says. Mathis disputes the
validity of the performance comparisons, but Graving says that “our results offer
the most objective and fair comparison we could provide”. Mathis’ team
reported an accelerated version of DeepLabCut that can run on a mobile phone
in an article posted in September on the arXiv preprint repository [4].
Biologists who want to test multiple software solutions can try Animal Part
Tracker, developed by Kristin Branson, a computer scientist at the Howard
Hughes Medical Institute’s Janelia Research Campus in Ashburn, Virginia, and
her colleagues. Users can select any of several posture-tracking algorithms,
including modified versions of those used in DeepLabCut and LEAP, as well as
another algorithm from Branson’s lab. DeepPoseKit also offers the option to use
alternative algorithms, as will SLEAP.
Other tools are designed for more specialized experimental set-ups. DeepFly3D,
for instance, tracks 3D postures of single tethered lab animals, such as mice with
implanted electrodes or fruit flies walking on a tiny ball that acts as a treadmill.
Pavan Ramdya, a neuroengineer at the Swiss Federal Institute of Technology in
Lausanne (EPFL), and his colleagues, who developed the software, are using
DeepFly3D to help identify which neurons in fruit flies are active when they
perform specific actions.
Other software packages can help biologists to make sense of animals’ motions.
For instance, researchers might want to translate posture coordinates into
behaviours such as grooming, Mathis says. If scientists know which behaviour
they’re interested in, they can use the Janelia Automatic Animal Behavior
Annotator (JAABA), a supervised machine-learning tool developed by
Branson’s team, to annotate examples and automatically identify more instances
in videos.
By mixing and matching these tools, researchers can extract new meaning from
animal imagery. “It gives you the full kit of being able to do whatever you
want,” Pereira says.
Data Lakes vs. Data Warehouses
Understand the differences between the two most popular options for storing big
data.
When it comes to storing big data, the two most popular options are data lakes
and data warehouses. Data warehouses are used for analyzing archived
structured data, while data lakes are used to store big data of all structures.
Data Lake
A data lake is a storage repository that holds a vast amount of raw data in its
native format until it is needed. While a hierarchical data warehouse stores data
in files or folders, a data lake uses a flat architecture to store data. Each data
element in a lake is assigned a unique identifier and tagged with a set of
extended metadata tags. When a business question arises, the data lake can be
queried for relevant data, and that smaller set of data can then be analyzed to
help answer the question.
The term data lake is often associated with Hadoop-oriented object storage. In
such a scenario, an organization's data is first loaded into the Hadoop platform,
and then business analytics and data mining tools are applied to the data where it
resides on Hadoop's cluster nodes of commodity computers.
Like big data, the term data lake is sometimes disparaged as being simply a
marketing label for a product that supports Hadoop. Increasingly, however, the
term is being used to describe any large data pool in which the schema and data
requirements are not defined until the data is queried.
The term describes a data storage strategy, not a specific technology, although it
is frequently used in conjunction with a specific technology (Hadoop). The same
can be said of the term data warehouse, which, despite often referring to a
specific technology (relational database), actually describes a broad data
management strategy.
Data lake vs. data warehouse
Data lakes and data warehouses are two different strategies for storing big data.
The most important distinction between them is that in a data warehouse, the
schema for the data is preset; that is, there is a plan for the data upon its entry
into the database. In a data lake, this is not necessarily the case. A data lake can
house both structured and unstructured data and does not have a predetermined
schema. A data warehouse handles primarily structured data and has a
predetermined schema for the data it houses.
To put it more simply, think of the concept of a warehouse versus the concept of
a lake. A lake is liquid, shifting, amorphous, largely unstructured and is fed from
rivers, streams, and other unfiltered sources of water. A warehouse, on the other
hand, is a man-made structure, with shelves and aisles and designated places for
the things inside of it. Warehouses store curated goods from specific sources.
Warehouses are prestructured, lakes are not.
Users -- Data warehouses are useful when there is a massive amount of data
from operational systems that need to be readily available for analysis. Data
lakes are more useful when an organization needs a large repository of data, but
does not have a purpose for all of it and can afford to apply a schema to it upon
access.
Because the data in a lake is often uncurated and can originate from sources
outside of the company's operational systems, lakes are not a good fit for the
average business analytics user. Instead, data lakes are better suited for use by
data scientists, because it takes a level of skill to be able to sort through the large
body of uncurated data and readily extract meaning from it.
Data quality -- In a data warehouse, the highly curated data is generally trusted
as the central version of the truth because it contains already processed data. The data
in a data lake is less reliable because it could be arriving from any source in any
state. It may be curated, and it may not be, depending on the source.
Processing -- The schema for data warehouses is on-write, meaning it is set
before the data is entered into the warehouse. The schema for a data lake is
on-read, meaning it doesn't exist until the data has been accessed and someone
chooses to use it for something (a brief sketch of this difference follows this
comparison).
Performance/cost -- Data warehouses are usually more expensive for large data
volumes, but the trade-off is faster query results, reliability and higher
performance. Data lakes are designed with low cost in mind, but query results
are improving as the concept and surrounding technologies mature.
Agility -- Data lakes are highly agile; they can be configured and reconfigured
as needed. Data warehouses are less so.
Security -- Data warehouses are generally more secure than data lakes because
warehouses as a concept have existed for longer and therefore, security
methods have had the opportunity to mature.
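To make the schema-on-write versus schema-on-read distinction concrete, here is
a small Python sketch. The table, fields and raw records are hypothetical: the
warehouse-style path defines typed columns before any data is loaded, while the
lake-style path stores raw records as-is and applies types only when someone
reads them.

# Sketch: schema-on-write (table defined before loading) vs. schema-on-read (structure
# applied only when the raw records are finally queried). Fields are hypothetical.
import json
import sqlite3

# Schema-on-write: the warehouse-style table must exist, with types, before data loads.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (sale_id INTEGER, amount REAL, region TEXT)")
db.execute("INSERT INTO sales VALUES (1, 250.0, 'west')")

# Schema-on-read: raw, untyped records sit in the "lake" as-is...
raw_records = ['{"sale_id": "2", "amount": "99.5", "region": "east", "extra": "ignored"}']

# ...and a schema is imposed only at query time, by whoever reads them.
parsed = [json.loads(line) for line in raw_records]
typed = [(int(r["sale_id"]), float(r["amount"]), r["region"]) for r in parsed]
print(typed)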
Because of their differences, and the fact that data lakes are a newer and still-
evolving concept, organizations might choose to use both a data warehouse and a
data lake in a hybrid deployment. This may be to accommodate the addition of
new data sources, or to create an archive repository to deal with data roll-off
from the main data warehouse. Frequently data lakes are an addition to, or
evolution of, an organization's current data management structure instead of a
replacement.
However, there are three main principles that distinguish a data lake from other
big data storage methods and make up the basic architecture of a data lake. They
are:
Although data is largely unstructured and not geared toward answering any
specific question, it should still be organized in some manner so that doing this
in the future is possible. Whatever technology ends up being used to deploy an
organization's data lake, a few features should be included to ensure that the data
lake is functional and healthy and that the large repository of unstructured data
doesn't go to waste. These include:
Criticism
Despite the benefits of having a cheap, unstructured repository of data at an
organization's disposal, several legitimate criticisms have been levied against the
strategy.
One of the biggest potential follies of the data lake is that it might turn into a
data swamp, or data graveyard. If an organization practices poor data governance
and management, it may lose track of the data that exists in the lake, even as
more pours in. The result is a wasted body of potentially valuable data rotting
away unseen at the "bottom" of the data lake, so to speak, rendering it
deteriorated, unmanaged and inaccessible.
Another problem with the term data lake itself is that it is used in many contexts
in public discourse. Although it makes most sense to use it to describe a strategy
of data management, it has also commonly been used to describe specific
technologies and as a result, has a level of arbitrariness to it. This challenge may
cease to be once the term matures and finds a more concrete meaning in the
public discourse.
Vendors
Although a data lake isn't a specific technology, there are several technologies
that enable them. Some vendors that offer those technologies are:
Such systems can also hold transactional data pulled from relational databases,
but they're designed to support analytics applications, not to handle transaction
processing. As public cloud platforms have become common sites for data
storage, many people build Hadoop data lakes in the cloud.
With the use of commodity hardware and Hadoop's standing as an open source
technology, proponents claim that Hadoop data lakes provide a less expensive
repository for analytics data than traditional data warehouses. In addition, their
ability to hold a diverse mix of structured, unstructured and semistructured data
can make them a more suitable platform for big data management and analytics
applications than data warehouses based on relational software.
As a result, data lake systems tend to employ extract, load and transform (ELT)
methods for collecting and integrating data, instead of the extract, transform and
load (ETL) approaches typically used in data warehouses. Data can be extracted
and processed outside of HDFS using MapReduce, Spark and other data
processing frameworks.
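A minimal ELT sketch in PySpark, assuming a working Spark environment; the
paths, event names and columns are hypothetical. Raw files are landed in the
lake first and only transformed afterward into a curated, analytics-ready table,
in contrast to the transform-before-load order that ETL follows.

# Sketch of an ELT flow: land raw files in the lake first, transform later with Spark.
# The paths and column names are hypothetical; assumes a working PySpark installation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt_sketch").getOrCreate()

# Extract + Load: raw JSON is copied into the lake untouched (schema inferred on read).
raw = spark.read.json("/datalake/raw/clickstream/")

# Transform (afterward, inside the lake): refine a curated view for analytics users.
curated = (raw
           .filter(F.col("event_type") == "purchase")
           .groupBy("product_id")
           .agg(F.count("*").alias("purchases")))
curated.write.mode("overwrite").parquet("/datalake/curated/purchases_by_product/")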
Despite the common emphasis on retaining data in a raw state, data lake
architectures often strive to employ schema-on-the-fly techniques to begin to
refine and sort some data for enterprise uses. As a result, Hadoop data lakes have
come to hold both raw and curated data.
As big data applications become more prevalent in companies, the data lake
often is organized to support a variety of applications. While early Hadoop data
lakes were often the province of data scientists, increasingly, these lakes are
adding tools that allow analytics self-service for many types of users.
The Hadoop data lake isn't without its critics or challenges for users. Spark, as
well as the Hadoop framework itself, can support file architectures other than
HDFS. Meanwhile, data warehouse advocates contend that similar architectures
-- for example, the data mart -- have a long lineage and that Hadoop and related
open source technologies still need to mature significantly in order to match the
functionality and reliability of data warehousing environments.
Microsoft launched its Azure Data Lake for big data analytical workloads in the
cloud in 2016. It is compatible with Azure HDInsight, Microsoft's data
processing service based on Hadoop, Spark, R and other open source
frameworks. The main components of Azure Data Lake are Azure Data Lake
Analytics, which is built on Apache YARN, Azure Data Lake Store and U-SQL.
It uses Azure Active Directory for authentication and access control lists and
includes enterprise-level features for manageability, scalability, reliability and
availability.
Around the same time that Microsoft launched its data lake, AWS launched Data
Lake Solutions, an automated reference data lake implementation that guides
users through creation of a data lake architecture on the AWS cloud, using AWS
services, such as Amazon Simple Storage Service (S3) for storage and AWS
Glue, a managed data catalog and ETL service.
The main differences between the two involve data latency and refinement. Both
store structured and unstructured data, leveraging various data stores from
simple object files to SQL and NoSQL database engines to big data stores.
Data lakes are raw data repositories located at the beginning of data pipelines,
optimized for getting data into the analytics platform. Landing zones and
sandboxes of independent data designed for ingestion and discovery, these native
format data stores are open to private consumers for selective use. Analytics are
generally limited to time-sensitive insights and exploratory inquiry by
consumers who can tolerate the murky waters.
Data reservoirs are refined data repositories located at operational and back-end
points of data pipelines, optimized for getting data out of the analytics platform.
As sources of unified, harmonized and wrangled data designed for querying and
analysis, data reservoirs are purpose-built data stores that are open to the public
for general consumption. Analytics span a wide range of past, present and future
insights for use by casual and sophisticated consumers, serving both tactical and
strategic insights that run the business.
On one hand, access to data early in the pipeline will favor time-sensitive
insights over the suitability of non-harmonized data, particularly for use cases
that require the most recent data. On the other hand, access to data later in the
pipeline will favor data accuracy over increased latency by virtue of curation,
particularly for use cases that require data that has been cleaned, conformed and
enriched, and that is of known quality.
A big part of the hosting decision comes down to control. Organizations that are
comfortable sharing control are likely to lean more toward a cloud presence.
Organizations that feel comfortable owning the end-to-end platform will likely
lean more toward an on-premises option.
Regardless of where you run your analytics platform, modernization should not
simply be a lift-and-shift approach. You may not need a complete overhaul, but
take the opportunity to refresh select components and remove technical debt
across your platform.
Organizations that choose the public cloud for some or all their data analytics
platform architecture should take advantage of what the cloud does best. This
means moving from IaaS to SaaS and PaaS models. Look to maximize managed
services, migrate to native cloud services, automate elasticity, geo-disperse the
analytics platform and move to consumption-based pricing whenever possible by
using serverless technologies.
One thing is for sure: Your data analytics platform architecture will change. A
key measurement of a platform's flexibility is how well it adapts to business and
technology innovation. Expect the business to demand an accelerated analytics
lifecycle and greater autonomy via self-service capabilities. To keep pace with
the business, look for technology advancements in automation and artificial
intelligence as well as catalysts for augmented data management and analytics.
Data Warehouse
A data warehouse is a repository for data generated and collected by an
enterprise's various operational systems. Data warehousing is often part of a
broader data management strategy and emphasizes the capture of data from
different sources for access and analysis by business analysts, data scientists and
other end users.
1. A data integration layer that extracts data from operational systems, such
as Excel, ERP, CRM or financial applications.
2. A data staging area where data is cleansed and organized.
3. A presentation area where data is warehoused and made available for
use.
A data warehouse architecture can also be understood as a set of tiers, where the
bottom tier is the database server, the middle tier is the analytics engine and the
top tier is data warehouse software that presents information for reporting and
analysis.
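A compact sketch of the integration, staging and presentation layers listed
above, using pandas and SQLite as stand-ins for an operational export and a
warehouse; the column names and the in-memory CSV are hypothetical.

# Sketch: extract from an operational export, cleanse in staging, load to a warehouse table.
import io
import sqlite3
import pandas as pd

# 1. Data integration layer: extract from an operational system (a CSV dump stands in here).
crm_export = io.StringIO("customer_id,customer_name\n1, alice smith \n1, alice smith \n2,BOB JONES\n")
staging = pd.read_csv(crm_export)

# 2. Staging area: cleanse and organize -- standardize names, drop duplicates.
staging["customer_name"] = staging["customer_name"].str.strip().str.title()
staging = staging.drop_duplicates(subset="customer_id")

# 3. Presentation area: load the cleansed data where reporting tools can use it.
warehouse = sqlite3.connect("warehouse.db")
staging.to_sql("dim_customer", warehouse, if_exists="replace", index=False)
print(pd.read_sql("SELECT * FROM dim_customer", warehouse))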
Data analysis tools, such as BI software, enable users to access the data within
the warehouse. An enterprise data warehouse stores analytical data for all of an
organization's business operations; alternatively, individual business units may
have their own data warehouses, particularly in large companies. Data
warehouses can also feed data marts, which are smaller, decentralized systems in
which subsets of data from a warehouse are organized and made available to
specific groups of business users, such as sales or inventory management teams.
By contrast, a data lake is a central repository for all types of raw data, whether
structured or unstructured, from multiple sources. Data lakes are most commonly
built on Hadoop or other big data platforms. A schema doesn't need to be defined
upfront in them, which allows for more types of analytics than data warehouses,
which have defined schemas. For example, data lakes can be used for text
searches, machine learning and real-time analytics.
The technology's growth continued with the founding of The Data Warehousing
Institute, now known as TDWI, in 1995, and with the 1996 publication of Ralph
Kimball's book The Data Warehouse Toolkit , which introduced his dimensional
modeling approach to data warehouse design.
In 2008, Bill Inmon introduced the concept of data warehouse 2.0, which focuses on
the inclusion of unstructured data and corporate metadata.
Well, no, not so fast. There's more to the question of operational data store vs.
data warehouse than that. Both do store operational data, but in different forms
and for different purposes. And in many cases, organizations incorporate both
into their analytics architectures.
The operational data store (ODS) is a bit harder to pin down because there are
diverging views on exactly what it is and for what it's used. But, at heart, an
ODS pulls together data from multiple transaction processing systems on a
short-term basis, with frequent updates as new data is generated by the source
systems. Operational data stores often serve as interim staging areas for data
that's ultimately headed to a data warehouse or a big data platform for long-term
storage.
Uses and benefits of an ODS
An ODS generally holds detailed transaction data that has yet to be consolidated,
aggregated and transformed into consistent data sets for loading into a data
warehouse. From a data integration standpoint, then, an ODS might only involve
the first and third elements of the extract, transform and load (ETL) process
typically used to pull data from operational systems and to harmonize it for
analysis.
In that sense, an operational data store can be thought of as a funnel that takes in
raw data from various source systems and helps facilitate the process of feeding
business intelligence and analytics systems with more refined versions of that
data. The full ETL process is handled downstream, which streamlines data
transformation workloads and minimizes the processing pipelines needed
between the ODS and the source systems to which it's connected.
However, some people also view the operational data store as a BI and analytics
platform in its own right. Under that scenario, an ODS can be used to do near-
real-time data analysis aimed at uncovering tactical insights that organizations
can quickly apply to ongoing business operations -- for example, to increase
retail inventories of popular products based on fresh sales data. By comparison,
data warehouses typically support historical analysis of data accumulated over a
longer period of time.
Depending on the specific application, an ODS that's used for data analysis
might be updated multiple times daily, if not hourly or even more frequently.
Real-time data integration tools, such as change data capture software, can be
tapped to help enable such updates. In addition, some level of data cleansing and
consistency checks might be applied in the ODS to help ensure that the analytics
results are accurate.
While data usually passes through an ODS relatively quickly to make room for
new data coming up behind it, things are different in a data warehouse. The
purpose there is to create an archive of data that can be analyzed to track
business performance and identify operational trends in order to guide strategic
decision-making by corporate and business executives.
Two other things to keep in mind about operational data stores: First, they aren't
the same thing as an operational database. The latter is the database built into a
transaction system -- it's the location from which the data flowing into an ODS
comes. Put another way, transaction data is initially processed in operational
databases and then moved to an ODS to begin its analytics journey.
Second, operational data stores are sometimes equated with master data
management (MDM) systems. MDM processes enable companies to create
common sets of master data on customers, products and suppliers. The master
data can then be fed back to transaction systems via an MDM hub, where the
data is managed and stored. Early on, some organizations built MDM
capabilities into ODS platforms, but that approach seems to have lessened in
recent years, perhaps partly due to the MDM market not growing as proponents
hoped it would, itself a result of MDM's inherent complexities.
Advanced Analytics techniques fuel data-driven organization
The 2015 Pacific Northwest BI Summit is taking place in Grants Pass, Ore., this
weekend. The annual event brings together a small group of consultants and
vendors to discuss key trends and issues related to business intelligence,
analytics and data management. One of the participants is Claudia Imhoff,
president of consultancy Intelligent Solutions Inc. and founder of the Boulder BI
Brain Trust. At this year's conference, Imhoff will lead a discussion on
increasing the adoption of BI and analytics applications in companies. That topic
is similar to one she spoke about in a video interview with
SearchBusinessAnalytics at the 2014 summit: creating a more data-driven
organization through the use of higher-level predictive and prescriptive
analytics techniques.
In the interview, Imhoff said that basic descriptive analytics -- for example,
straightforward reporting on revenue, profits and other key performance
indicators -- is the most prevalent form of BI. But it's also "the least valuable of
all the analytics that companies can perform," she noted. The next step up is
diagnostic analytics, which addresses why something has happened but is "still
reactive," according to Imhoff.
On the other hand, she said, companies can use predictive analytics tools to look
toward the future -- by, say, identifying prospective customers who are likely to
be receptive to particular marketing campaigns. And prescriptive analytics
software can be applied to answer what-if questions in order to help optimize
business strategies and assess whether predicted business outcomes are worth
pursuing.
There are a number of issues that hold companies back from adopting more
advanced analytics techniques, Imhoff said. One is a lack of internal education
about the potential business benefits of effective analytics processes: "We need
to start building this culture in our organizations that understands the need for
analytics." Another issue she cited is a lack of analytical prowess resulting from
the ongoing shortage of data scientists and other skilled analytics professionals.
And an age-old but still common problem, she said, is "putting the cart before
the horse" on technology purchases and ending up with analytics systems and
tools that aren't a good fit for an organization's business needs.
Imhoff said BI, analytics and IT managers also need to understand that data
warehouses aren't the only valid repositories of analytics data anymore,
especially for storing the massive amounts of data being captured from sensors,
social networks and other new data sources. To support big data analytics
applications, she espoused an extended data warehouse architecture that
combines a traditional enterprise data warehouse with technologies such
as Hadoop clusters and NoSQL database systems. She sees well-designed data
visualizations as another must for fostering a data-driven organization, especially
in big data environments: "We're talking about massive numbers of data points,
and you can't just 'blat' that out on a screen."
Big data analytics involves a complex process that can span business
management, data scientists, developers and production teams. Crafting a new
data analytics model is just one part of this elaborate process.
The following are 10 must-have features in big data analytics tools that can help
reduce the effort required by data scientists to improve business results:
1. Embeddable results
Big data analytics gain value when the insights gleaned from data models can
help support decisions made while using other applications.
These features should include the ability to create insights in a format that is
easily embeddable into a decision-making platform, which should be able to
apply these insights in a real-time stream of event data to make in-the-moment
decisions.
2. Data wrangling
Data scientists tend to spend a good deal of time cleaning, labeling and
organizing data for data analytics. This involves seamless integration across
disparate data sources and types, applications and APIs, cleansing data, and
providing granular, role-based, secure access to the data.
Big data analytics tools must support the full spectrum of data types, protocols
and integration scenarios to speed up and simplify these data wrangling steps,
said Joe Lichtenberg, director of marketing for data platforms at InterSystems, a
database provider.
3. Data exploration
Strong visualization capabilities can also help this data exploration process.
There are a wide variety of approaches for putting data analytics results into
production, including business intelligence, predictive analytics, real-time
analytics and machine learning. Each approach provides a different kind of value
to the business. Good big data analytics tools should be functional and flexible
enough to support these different use cases with minimal effort or the retraining
that might be involved when adopting different tools.
5. Scalability
Data scientists typically have the luxury of developing and testing different data
models on small data sets for long durations. But the resulting analytics models
need to run economically and often must deliver results quickly. This requires
that these models support high levels of scale for ingesting data and working
with large data sets in production without exorbitant hardware or cloud service
costs.
"A tool that scales an algorithm from small data sets to large with minimal effort
is also critical," said Eduardo Franco, data science lead at Descartes Labs, a
predictive analytics company. "So much time and effort is spent in making this
transition, so automating this is a huge help."
6. Version control
In a large data analytics project, several individuals may be involved in adjusting
the data analytics model parameters. Some of these changes may initially look
promising, but they can create unexpected problems when pushed into
production.
Version control built into big data analytics tools can improve the ability to track
these changes. If problems emerge later, it can also make it easier to roll back an
analytics model to a previous version that worked better.
"Without version control, one change made by a single developer can result in a
breakdown of all that was already created," said Charles Amick, vice president
of data science at Devo USA, a data operations platform provider.
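One lightweight way to get this behavior is sketched below, assuming
scikit-learn is available: every saved model gets a version derived from its
parameters plus a metadata file, so an earlier version can be located and
restored if a later change misbehaves. The registry layout and fields are
hypothetical and not a substitute for a real experiment-tracking tool.

# Sketch: recording each model version with its parameters so a change can be rolled back.
import json, pickle, hashlib, time
from pathlib import Path
from sklearn.linear_model import LogisticRegression

def save_version(model, params, registry=Path("model_registry")):
    registry.mkdir(exist_ok=True)
    version = hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()[:8]
    with open(registry / f"model_{version}.pkl", "wb") as f:
        pickle.dump(model, f)                      # the model artifact itself
    with open(registry / f"model_{version}.json", "w") as f:
        json.dump({"version": version, "params": params, "saved_at": time.time()}, f)
    return version

params = {"C": 0.5, "max_iter": 200}
model = LogisticRegression(**params)               # would normally be fit on training data
print("saved version:", save_version(model, params))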
7. Simple integration
The less time data scientists and developers spend customizing integrations to
process data sources and connect with applications, the more time they can
spend improving data analytic models and applications.
Simple integrations also make it easier to share results with other developers and
data scientists. Data analytics tools should support easy integration with existing
enterprise and cloud applications and data warehouses.
8. Data management
Big data analytics tools need a robust yet efficient data management platform to
ensure continuity and standardization across all deliverables, said Tim Lafferty,
director of analytics at Velocity Group Development, a data analytics
consultancy. As the magnitude of data increases, so does variability.
9. Data governance
Data governance features are important for big data analytics tools to help
enterprises stay compliant and secure. This includes being able to track the
source and characteristics of the data sets used to build analytic models and to
help secure and manage data used by data scientists and engineers. Data sets
used to build models may introduce hidden biases that could create
discrimination problems.
Data governance is especially crucial for sensitive data, such as protected health
information and personally identifiable information that needs to comply with
privacy regulations. Some tools now include the ability to pseudonymize data,
allowing data scientists to build models based on personal information in
compliance with regulations like GDPR.
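As an illustrative (not compliance-grade) sketch of pseudonymization: a salted
hash replaces the direct identifier before modeling, so data scientists work
with a stable key rather than the raw personal data. The field names and salt
handling here are hypothetical.

# Sketch: pseudonymizing a personal identifier with a salted hash before modeling.
import hashlib
import pandas as pd

SALT = "replace-with-a-secret-salt"            # in practice, stored in a secrets manager

def pseudonymize(value: str) -> str:
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

patients = pd.DataFrame({"email": ["ann@example.com", "raj@example.com"],
                         "blood_pressure": [118, 135]})
patients["patient_key"] = patients["email"].apply(pseudonymize)
patients = patients.drop(columns=["email"])    # model on the pseudonym, not the raw PII
print(patients)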
Many big data analytics tools focus on either analytics or data processing. Some
frameworks, like Apache Spark, support both. These enable developers and data
scientists to use the same tools for real-time processing; complex extract,
transform and load tasks; machine learning; reporting; and SQL. This is
important because data science is a highly iterative process. A data scientist
might create 100 models before arriving at one that is put into production. This
iterative process often involves enriching the data to improve the results of the
models.
Unified analytics tools help enterprises build data pipelines across a multitude of
siloed data storage systems while training and modeling their solution in an
iterative fashion, said Ali Ghodsi, CEO and co-founder of Databricks, a data
analytics platform provider.
Data-driven storytelling opens analytics to all
One of the great challenges of analytics has been making it accessible to more
than just the trained experts within an organization, the data analysts who
understand how to interpret data and use it to make informed decisions.
And just as visualizations helped make data more digestible a decade or so ago
and augmented intelligence is making analytics platforms easier for untrained
users to navigate, data-driven storytelling can put business intelligence in the
hands of a wider audience.
But unlike data visualizations and AI, technologies that only marginally extend
the reach of analytics, data-driven storytelling can have a wider impact in
enterprises.
"Data storytelling is what you say when you're actually trying to understand
what's happening in the data and make a decision off of it," said Nate
Nichols, chief scientist at Narrative Science, a data storytelling software vendor.
For example, Nichols continued, if someone comes home and sees a spilled glass
of water on the kitchen counter and the wet footprint of a cat leading away from
the water, they have a data set.
"That's what you get from a spreadsheet or a dashboard," Nichols said. "But you
don't make a decision based off of that. You develop an interpretation of what
happened, you tell a story -- the cat came in, tried to drink, knocked over the
water and ran out. It's the story that helps you make the decision about how to
keep the cat away in the future."
Rather than just present the numbers and leave the interpretation up to the user,
data storytelling platforms break it down and put it into a written narrative that
total sales in a given week were $15 million, which was up 10% over the week
before and up 20% over the weekly average. Meanwhile, the sales figures
include 100 deals with a certain employee leading the way with eight, and the
overall increase can be attributed to seasonal factors.
A Narrative Science data story about the sales figures highlights the most
relevant numbers in bold, creates a simple bar graph, and situates a block
graphic below a bold headline over the narrative. A traditional spreadsheet
would leave it to the user to interpret the same information presented in rows of
numbers.
A sample data story from Narrative Science describes an organization's sales
bookings.
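Under the hood, the simplest data stories are template-driven. The sketch below
turns the hypothetical weekly sales figures described above into a short
narrative; it is a toy illustration, not how any particular vendor's product
works.

# Sketch: a template-based "data story" like the sales narrative described above.
# The figures are the hypothetical ones from the example, not real output of any vendor tool.
def sales_story(total, prior_week, weekly_avg, deals, top_rep, top_rep_deals):
    wow = (total - prior_week) / prior_week * 100        # week-over-week change
    vs_avg = (total - weekly_avg) / weekly_avg * 100     # change vs. the weekly average
    return (f"Total sales reached ${total / 1e6:.1f}M this week, up {wow:.0f}% over last "
            f"week and {vs_avg:.0f}% above the weekly average. The team closed {deals} "
            f"deals, led by {top_rep} with {top_rep_deals}.")

print(sales_story(total=15_000_000, prior_week=13_636_364, weekly_avg=12_500_000,
                  deals=100, top_rep="the top-performing rep", top_rep_deals=8))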
"As a trained analyst myself, data was always a means to an end," said Lyndsee
Manna, senior vice president of business development at Arria NLG, a natural
language generation vendor. "But I, as a human, had to wrestle to extract
something that was meaningful and could communicate to another human. The
shift to data storytelling is that I don't have to wrestle with the data anymore. The
data is going to tell me. It's knowledge automation."
The Psychology
Human beings understand stories.
From the earliest cave dwellers telling stories with pictures through the present
day, people have used stories to convey information and give it context.
Analytics, however, has largely lacked that storytelling aspect and missed out
on the power a story can have. Even data visualizations don't tell stories. They
present data in easily understandable formats, such as charts and graphs, and in
artful ways, but they usually don't give the data meaning in a richer context.
And that leaves countless business users out of the analytics process. Data
storytelling changes that. "It gives information context, it gives it purpose, and
makes it more memorable and understandable," said Donald Farmer, principal at
TreeHive Strategy. "For that reason it's very fundamental psychologically.
Storytelling is essential. In a sense, data storytelling is nothing new because
whenever we exchange data we do it with implicit stories. But data storytelling
as a practice is emerging."
"What's happening now is these specific technologies that are being developed to
support data storytelling are coming out, and they will [eventually] reach 100%
of the organization," Farmer said. "That's why it's exciting in a technology sense.
We've finally got a technology that actually can genuinely reach everyone in
some way."
The Technology
Analytics platforms largely focus on every aspect of the analytics process
leading up to interpretation. They're about preparing the data for analysis rather
than the analysis itself.
Narrative Science, though founded only 10 years ago, is one of the veterans.
Arria NLG, which offers a suite of natural language generation tools in addition
to its data storytelling capabilities, is another that's been around for a while,
having been founded in 2009. And now startups like Paris-based Toucan
Toco are emerging as data storytelling gains momentum.
"Everyone wants everyone to be able to make data-driven decisions and not have
to have an analytical background or have their own analysts," Nichols said. "But
for people that aren't analysts and that are just trying to understand the story and
use that to guide their decision-making, that last mile is the hurdle."
According to Nichols, Narrative Science's data stories are generally short and to
the point, often only a paragraph or two, though they have the potential to be
longer. Arria NLG's stories can similarly be of varying length, depending on the
wants of the user.
"Whether you know about data and excel at BI or whether you don't, data can
feel very overwhelming," Manna said. "The biggest thing [data storytelling]
gives to humanity is to lift that feeling of being overwhelmed and give them
something -- in language -- they can comprehend quickly. The gift is
understanding something that either would have taken a very long time or never
to understand."
Data stories are generally the final phase of the analytics process rather
than embedded throughout. When BI vendors offer their own data-driven
storytelling tools, they generally provide the opportunity to embed stories at
points along the data pipeline, but that can be tricky, according to Farmer. If the
tools are introduced too early in the process, they can influence the outcome
rather than interpret the outcome.
"You have to be very careful with data storytelling," Farmer said. "For me, data
storytelling has to be focused on a single subject."
The Future
Unlike most new technologies that start off in rudimentary forms and develop
over long periods of time, data storytelling platforms already deliver on the
promise of providing narratives that contextualize data and help the decision-
making process. However, they have the potential to do more.
Data-driven storytelling platforms don't yet know their users. They can analyze
data and craft a narrative based on it, but they don't yet have the machine
learning capabilities that will lead to personalized narratives.
He added that with machine learning, the tools eventually will recognize that a
person might look at a certain monthly report or dashboard and then follow up
by doing the same thing each time. But people with similar roles within the
organization might do something different after they look at that same report or
dashboard, so the software will recommend that perhaps the first person ought to
be doing something different after looking at the data.
And the same is true conversely, he added, when there's nothing new of note and
there's no reason to generate a new story. No matter what the future holds,
however, data-driven storytelling tools will always be about extending the reach
of analytics to a broader audience, and for the first time, potentially everyone.
How Companies Use Big Data Analytics
Big Data Analytics is far more objective than older methods, and it lets
companies make better business decisions based on data insights. There was a
time when companies could only interact with their customers one-on-one in
stores, and there was no way to know what individual customers wanted on a
large scale. But that has all changed with the arrival of Big Data Analytics.
Now companies can engage with each customer directly and personally online and
learn what they want.
So let’s see the different ways companies can use Big Data Analytics in the real
world to improve their performance and become even more successful (and
rich!) with time.
No company can exist without customers! Attracting customers and, even more
importantly, retaining them is essential for any company, and Big Data
Analytics can certainly help with that. Big Data Analytics allows a company to
observe customer trends and then market its products with those customers
specifically in mind. The more data a company has about its customer base, the
more accurately it can spot customer trends and patterns, which ensures that
the company can deliver exactly what its customers want. And this is the best
way to increase customer retention. After all, happy customers mean loyal
customers!
An example of a company that uses Big Data Analytics to Increase Customer
Retention is Amazon. Amazon collects data about its customers such as their
names, addresses, search history, payments, etc. so that it can provide a truly
personalized experience. This means that Amazon knows who you are as soon as
you log in. It also provides product recommendations based on your history, so
you are more likely to buy things. And if you buy lots of things on Amazon, you
are less likely to leave!
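To make the idea concrete, here is a toy sketch, in Python, of history-based recommendations. This is not Amazon's actual system; the purchase data, product names and the simple co-purchase counting are assumptions for illustration only.

```python
# Toy item-to-item recommendation based on co-purchase counts (illustrative only;
# real recommender systems are far more sophisticated).
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: customer -> set of products bought.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob":   {"laptop", "mouse"},
    "carol": {"mouse", "keyboard", "headset"},
}

# Count how often each pair of products is bought together.
co_bought = Counter()
for items in purchases.values():
    for a, b in combinations(sorted(items), 2):
        co_bought[(a, b)] += 1
        co_bought[(b, a)] += 1

def recommend(product, top_n=2):
    """Return products most often bought alongside the given product."""
    scores = {b: n for (a, b), n in co_bought.items() if a == product}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("laptop"))   # e.g. ['mouse', 'keyboard']
```

Even this tiny example captures the basic intuition: the more purchase history a retailer has, the better it can guess which products go together.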
A company cannot sustain itself without a successful risk management plan.
After all, how is a big company supposed to function if it cannot identify
risks ahead of time and then work to minimize them as much as possible? This is
where Big Data Analytics comes in. It can be used to collect and analyze the
vast internal data available in the company archives, which helps in developing
both short-term and long-term risk management models. Using these, the company
can identify future risks and make much more strategic business decisions. That
means much more money in the future!
An example of a company that uses Big Data Analytics for Risk Management
is Starbucks. Did you know that Starbucks can have multiple stores on a single
street and all of them are successful? This is because Starbucks does great
risk analysis in addition to serving great coffee. It collects location data,
demographic data, customer preferences, traffic levels, etc. for any location
where it plans to open a store, and only goes ahead if the chances of success
are high and the associated risk is minimal. That is why it can even choose
locations that are close together, as long as there is more profit and less
risk.
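As a rough illustration of how such inputs might be combined, here is a hypothetical location-scoring sketch in Python. The factors, weights and the 0.5 threshold are invented for this example and do not describe Starbucks' actual model.

```python
# Hypothetical scoring of candidate store locations from the kinds of data
# mentioned above (foot traffic, demographics, competition). All numbers,
# weights and the decision threshold are illustrative assumptions.
candidate_locations = [
    {"name": "Main St", "foot_traffic": 0.9, "income_index": 0.8, "nearby_competition": 0.6},
    {"name": "Elm Ave", "foot_traffic": 0.4, "income_index": 0.5, "nearby_competition": 0.2},
]

def success_score(loc):
    # Higher traffic and income help; nearby competition (including our own stores) hurts.
    return 0.5 * loc["foot_traffic"] + 0.3 * loc["income_index"] - 0.2 * loc["nearby_competition"]

for loc in candidate_locations:
    score = success_score(loc)
    decision = "open" if score >= 0.5 else "skip"
    print(f'{loc["name"]}: score={score:.2f} -> {decision}')
```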
The supply chain begins with the sourcing of raw materials and ends with
finished products in the hands of customers. For large companies, this supply
chain is very difficult to handle: it can involve thousands of people and
products moving from the point of manufacture to the point of consumption.
Companies can use Big Data Analytics to analyze their raw materials, warehouse
inventories and retailer details to understand their production and shipment
needs. This makes Supply Chain Handling much easier, which leads to fewer
errors and consequently fewer losses for the company.
An example of a company that uses Big Data Analytics for Supply Chain
Handling is PepsiCo. While the most popular thing sold by PepsiCo is of course
Pepsi, did you know it sells many other products, like Mountain Dew, Lays, 7Up
and Doritos, all over the world? It would be very difficult to manage the
supply chains for so many products without Big Data Analytics, so PepsiCo uses
data to calculate the amount and type of products that retailers need while
keeping wastage to a minimum.
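A minimal sketch of the kind of demand estimate that helps match shipments to what retailers need might look like the following. The sales figures and the simple moving-average method are illustrative assumptions, not PepsiCo's actual approach.

```python
# Illustrative demand forecast for one product at one retailer, using a simple
# moving average of recent weekly sales. All numbers are hypothetical; real
# supply chain analytics uses far richer models.
weekly_sales = [120, 135, 128, 150, 142, 138]   # units sold in recent weeks

def forecast_next_week(history, window=4):
    """Average of the most recent `window` weeks as a naive demand forecast."""
    recent = history[-window:]
    return sum(recent) / len(recent)

on_hand = 90                                    # units currently in the retailer's stock
expected_demand = forecast_next_week(weekly_sales)
shipment = max(0, round(expected_demand - on_hand))
print(f"Forecast demand: {expected_demand:.0f} units; ship {shipment} units")
```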
All companies are trying to create products that their customers want. Well,
what if companies were able to first understand what their customers want and
then create products? They would surely be successful! That's what Big Data
Analytics aims to do for Product Creation. Companies can use data such as
responses to previous products, customer feedback forms, competitor product
successes, etc. to understand what types of products customers want and then
build them. In this way, companies can create new products, improve existing
ones according to market demand, and become much more successful and popular.
An example of a company that uses Big Data Analytics for Product Creation
is Burberry, a British luxury fashion house that combines luxury with
technology. It does this by targeting customers at an individual level to find
out which products they want and focusing on those. Burberry store employees
can also see your online purchase history and preferences and recommend
accessories that match your clothes. This creates a truly personalized product
experience that is only possible with Big Data Analytics.
Key Skills That Data Scientists Need
Data scientists have a deceptively straightforward job to do: make sense of the
torrent of data that enters an organization as unstructured hash. Somewhere in
that confusion (hopefully) lies vital insight.
But is skill with algorithms and datasets enough for data scientists to succeed?
What else do they need to know to advance their careers?
While many tech pros might think that pushing data from query to conclusion is
enough to get by, they also need to know how the overall business works, and
how their data work will ultimately impact strategies and revenue. The current
hunger for data analytics means that companies always want more from their
data scientists.
Hard and Soft Skills
“There is a shortage, a skills gap in data science. It is enormous and it is
growing,” said Crystal Valentine, vice president of technology strategy at MapR,
a Big Data firm.
As proof of this, Valentine cited a report from consulting firm McKinsey &
Co. that suggests a national shortage of as many as 190,000 people with “deep
analytical skills” by 2018. That’s in addition to a gap of roughly 1.5 million “Big
Data” analysts and managers during the same time period.
Modern data science evolved from three fields: applied mathematics, statistics,
and computer science. In recent years, however, the term “data scientist” has
broadened to include anyone with “a background in the quantitative field,”
Valentine added. Other fields—including physics and linguistics—are
developing more of a symbiotic relationship with data science, thanks in large
part to the evolution of artificial intelligence, machine learning, and natural
language processing.
In addition to aptitude with math and algorithms, successful data scientists have
also mastered soft skills. “They need to know more than what is happening in
the cubicle,” said Mansour Raad, senior software architect at ESRI, which
produces mapping software. “You have to be a people person.”
In order to effectively crunch numbers, in other words, data scientists need to
work with the people who know the larger business. They must interact with
managers who can frame the company’s larger strategy, as well as colleagues
who will turn data insights into real action. With more input from those other
stakeholders, data scientists can better formulate the right questions to drive their
analysis.
“Soft skills” also means a healthy curiosity, said Thomas Redman, a.k.a. the
“Data Doc,” who consults and speaks extensively about data science. Ideally, the
applicant “likes to understand data, to understand what is going on in the world.”
When applying for data-science jobs, he added, applicants are often judged on
their intellectual curiosity in addition to their other skills—employers fear “they
will stay in front of a computer screen,” Redman observed. That can create an
issue for some data scientists who are used to keeping their nose in the data, and
not interacting with other business units.
When Redman was a statistician at Bell Labs (long before the term “data
scientist” was even coined), managers made a point of telling those employees
who worked with data that the ultimate mission was to make the telephone
network run better. That meant more than understanding statistics; it meant
understanding the broader problems facing the company.
Faith vs. Skepticism
There’s an old saying in business: If you want to manage a problem, put a
number on it. Data does that, to a certain extent. While the data scientist will
wrangle the data, it’s up to the manager to make sense of it.
Data can be taken on faith or questioned. Doing the former risks “GIGO”—
Garbage In, Garbage Out. The latter requires “data skepticism”—a good skill for
anyone who works with data on a daily basis.
Sometimes Raad spends about 80 percent of his time just cleaning data: “The
data you get is just garbage.” In this respect, a data scientist is really a “data
janitor.”
In the real world, “data is messy,” MapR’s Valentine concurred. “You have to
have a real healthy skepticism when looking at data collected from a real-life
effort.” One can’t assume a uniform distribution: “Data is the side-effect of real-
world processes.”
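As a small illustration of this "data janitor" work, the following sketch cleans a messy, hypothetical table of customer records with pandas; the column names and cleaning rules are invented for the example.

```python
# Illustrative data-cleaning pass on messy customer records (hypothetical columns).
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, None],
    "email": ["a@x.com", "a@x.com", "B@X.COM ", None, "c@x.com"],
    "spend": ["100", "100", "250", "abc", "75"],
})

clean = (
    raw.drop_duplicates()                       # remove duplicate entries
       .dropna(subset=["customer_id"])          # drop rows missing a key field
       .assign(
           email=lambda d: d["email"].str.strip().str.lower(),            # normalize formatting
           spend=lambda d: pd.to_numeric(d["spend"], errors="coerce"),    # flag garbage values as NaN
       )
)
print(clean)
```

Even in this toy case, most of the work is deciding which rows and values to trust, which is exactly the kind of judgment that consumes so much of a data scientist's time.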
A good data scientist keeps in mind that collected data is not unbiased. “You are
trying to leverage the data to answer a question. You are not trying to stretch it
too far,” Valentine added. “As a rule of thumb, gathering as much data as
possible is a good strategy.”
Even if you’re not a data scientist, taking the results of an analysis simply on
faith is rarely a good idea. “We’re uncomfortable when someone else knows
more than you do,” Redman said. Whenever you’re studying the results of an
analysis, have a list of questions handy—where did the data come from? What’s
the worst thing that can happen? What has to be true for the recommendation to
be correct?
"People who don't question things are fair victims," Redman said.
Bias vs. Objectivity
"Getting something right in the beginning is not a sign of victory," Raad said. Be
skeptical—do you have all the data? Is the data too good to be true? “The trick is
to remove the human from the equation… Let the math speak for itself.” The
data skeptic can then take the next step, showing how much of a conclusion
is not random.
Don’t try to be perfect. The solution you craft must only be sufficient, getting the
user from Point A to B. “You build a good, working Volkswagen [rather] than a
Cadillac," Raad said. "You have to be able to settle for the Volkswagen
sometimes.”
Teams’ preconceptions are often built into algorithms. For example, take a credit
algorithm that rates applicants for loans; while you might think the underlying
math is neutral, the programmer may have fed their biases into the code.
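A hypothetical sketch makes the point concrete: a hand-written scoring rule whose seemingly neutral math quietly encodes a human judgment. The fields, weights and postcode penalty below are invented for illustration and do not describe any real lender's model.

```python
# Hypothetical credit-scoring rule showing how a programmer's assumption can hide
# inside seemingly neutral math. Penalizing certain postcodes looks like a "risk
# adjustment" but can act as a proxy for protected attributes.
HIGH_RISK_POSTCODES = {"12345", "67890"}        # subjective judgment baked into the code

def credit_score(income, existing_debt, postcode):
    score = 600 + 0.01 * income - 0.05 * existing_debt
    if postcode in HIGH_RISK_POSTCODES:         # the hidden bias: a human-chosen penalty
        score -= 80
    return score

# Two applicants with identical finances get different outcomes purely by postcode.
print(credit_score(50_000, 10_000, "11111"))    # 600.0
print(credit_score(50_000, 10_000, "12345"))    # 520.0
```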
Bias is not a new problem, Valentine said. Engineers often have to make a
"subjective decision" when trying to meet goals, crafting portions of solutions
that are sufficient to meet immediate needs. But it isn’t as if the underlying
algorithms are black boxes: data scientists will need to determine for themselves
if the software is producing a good outcome.
When it comes to data scientists, both hard and soft skills, along with a
healthy skepticism, are necessary to do the job. When it comes to advancing a
data-science career, not taking things on faith seems like a solid course of action.
Data analytics and career opportunities
From childhood, we have heard that we are nothing without water and that water
is our life. In this modern technical era, much the same can be said about
data. Data is information that is created at a source and flows to a receiver,
and every object, living or non-living, is surrounded by various kinds of data.
We can work only with the data we understand; the rest remains a mystery, and
it is impossible to work with large amounts of data simultaneously. This is
where data analytics plays an important role.
A data analyst is the person in charge of collecting and analyzing data. The
data analyst also develops and tests analytical models based on the collected
and analyzed data.
Now let's talk about the basic requirements and the process of data analytics.
First, the raw or unstructured data from various sources is collected and
combined into a common format. This data is then loaded into a data analytics
system such as a data warehouse or a Hadoop cluster. Next, data cleansing and
profiling are done to make sure that the data is error-free and consistent
overall. After that, the main operation in data analytics is performed, i.e.
building an analytical model of the data using programming languages such as
SQL, Python, Scala, etc. Finally, the results of the analytical model are used,
with the help of data visualization, to make decisions and obtain the desired
results.
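As a minimal sketch of this process in Python, the following pipeline loads a hypothetical CSV file, profiles and cleanses it, fits a simple analytical model and summarizes the results for visualization. The file name, column names and the choice of pandas and scikit-learn are assumptions for illustration only; a real pipeline would pull from a warehouse or Hadoop cluster and use a model suited to the business question.

```python
# Minimal sketch of the data analytics process described above (illustrative only).
# Assumes a hypothetical CSV file "sales.csv" with columns: region, ad_spend, units_sold.
import pandas as pd
from sklearn.linear_model import LinearRegression

# 1. Collect: load raw data into the analytics environment.
df = pd.read_csv("sales.csv")

# 2. Cleanse and profile: check consistency, remove duplicates and missing values.
print(df.describe(include="all"))                                  # quick profile of each column
df = df.drop_duplicates().dropna(subset=["ad_spend", "units_sold"])

# 3. Model: a simple analytical model relating ad spend to units sold.
model = LinearRegression()
model.fit(df[["ad_spend"]], df["units_sold"])
print("Estimated units sold per unit of ad spend:", model.coef_[0])

# 4. Report: summarize results so they can be visualized for decision-makers.
summary = df.groupby("region")["units_sold"].sum().sort_values(ascending=False)
print(summary.head())
```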
Career Opportunities in Data Analytics
In this digital age, data analytics is more important than ever. There are
multiple job opportunities across various industries, with demand for data
analytics professionals increasing day by day. Some of the career opportunities
that require data analytics professionals are given as follows:
1. Data Scientist
A data scientist collects and analyzes data so that relevant
decisions can be made using data visualization. A holistic view of data,
good knowledge of data analytics and data visualization skills, as well
as knowledge of programming languages such as SQL, Python, Scala,
etc., are basic requirements for a data scientist.
2. Data Engineer
A data engineer helps design, implement and optimize the data
infrastructure that supports the various data analytics processes. In
general, a data engineer handles quite large data sets and often makes
this data usable for data scientists through data cleansing and
profiling.
3. Business Analyst
A business analyst helps solve the business problems an organization is
facing by using data analytics to understand business models, company
reports, technology integration documents, etc. and to propose various
business strategies.
4. Statistician
A statistician collects, analyses and interprets statistical data to obtain
coherent and useful information. Some of the common jobs of
statisticians are to provide statistical simulations, mathematical
modeling, analysis and interpretation of various survey results,
business forecasting on the basis of data analytics, etc.
5. Quantitative Analyst
A quantitative analyst helps solve various financial problems by using
data analytics on large amounts of data to understand financial risk
management, investment patterns, exchange rate trends, the stock
market, etc.
These are just some of the career opportunities that require data analytics.
Data analytics is a vast field, and the opportunities it provides are
extensive, with even more growth predicted. A career in data analytics is
therefore a lucrative prospect with enormous scope.