
Department of Computer Science and Business Systems

(An Autonomous Institution)

CCS334 BIG DATA ANALYTICS

UNIT I UNDERSTANDING BIG DATA

Introduction to big data – convergence of key trends – unstructured data – industry examples of big data – web analytics – big data applications – big data technologies – introduction to Hadoop – open source technologies – cloud and big data – mobile business intelligence – crowdsourcing analytics – inter and trans firewall analytics.

INTRODUCTION TO BIG DATA

What is Big Data

Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time. These datasets are so huge and complex in volume, velocity, and variety that traditional data management systems cannot store, process, and analyze them.

The amount and availability of data are growing rapidly, spurred on by digital technology advancements such as connectivity, mobility, the Internet of Things (IoT), and artificial intelligence (AI). As data continues to expand and proliferate, new big data tools are emerging to help companies collect, process, and analyze data at the speed needed to gain the most value from it.

Big data describes large and diverse datasets that are huge in volume and also grow rapidly in size over time. Big data is used in machine learning, predictive modeling, and other advanced analytics to solve business problems and make informed decisions.

The Vs of big data

Big data definitions may vary slightly, but big data is always described in terms of volume, velocity, and variety. These characteristics are often referred to as the “3 Vs of big data” and were first defined by Gartner in 2001.
 Volume
As its name suggests, the most common characteristic associated with big data
is its high volume. This describes the enormous amount of data that is available
for collection and produced from a variety of sources and devices on a
continuous basis.
 Velocity
Big data velocity refers to the speed at which data is generated. Today, data is
often produced in real time or near real time, and therefore, it must also be

processed, accessed, and analyzed at the same rate to have any meaningful
impact.
 Variety
Data is heterogeneous, meaning it can come from many different sources and can be structured, unstructured, or semi-structured. More traditional structured data (such as data in spreadsheets or relational databases) is now supplemented by unstructured text, images, audio, and video files, or by semi-structured formats like sensor data that can’t be organized in a fixed data schema.

In addition to these three original Vs, three others are often mentioned in relation to harnessing the power of big data: veracity, variability, and value.

 Veracity:

Big data can be messy, noisy, and error-prone, which makes it difficult to control
the quality and accuracy of the data. Large datasets can be unwieldy and
confusing, while smaller datasets could present an incomplete picture. The
higher the veracity of the data, the more trustworthy it is.

 Variability:

The meaning of collected data is constantly changing, which can lead to inconsistency over time. These shifts include not only changes in context and interpretation but also changes in data collection methods, based on the information that companies want to capture and analyze.

 Value:

It’s essential to determine the business value of the data you collect. Big data
must contain the right data and then be effectively analyzed in order to yield
insights that can help drive decision-making.

Sources of Big Data

These data come from many sources, such as:

o Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge numbers of logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites give very large volumes of data, which are stored and processed to forecast the weather.


o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly; for this, they store the data of millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.

How does big data work?

The central concept of big data is that the more visibility you have into anything,
the more effectively you can gain insights to make better decisions, uncover growth
opportunities, and improve your business model.

Making big data work requires three main actions:

1. Integration:

Big data collects terabytes, and sometimes even petabytes, of raw data from many
sources that must be received, processed, and transformed into the format that
business users and analysts need to start analyzing it.

2. Management:

Big data needs big storage, whether in the cloud, on-premises, or both. Data must
also be stored in whatever form required. It also needs to be processed and made
available in real time. Increasingly, companies are turning to cloud solutions to take
advantage of the unlimited compute and scalability.

3. Analysis:

The final step is analyzing and acting on big data—otherwise, the investment won’t
be worth it. Beyond exploring the data itself, it’s also critical to communicate and
share insights across the business in a way that everyone can understand. This
includes using tools to create data visualizations like charts, graphs, and
dashboards.
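A minimal sketch of these three actions in Python using pandas is given below. The file names, column names, and output format are hypothetical placeholders; a real big data pipeline would use distributed tools rather than a single machine, but the integrate, manage, and analyze steps look the same in outline.

# Toy integrate -> manage -> analyze pipeline (hypothetical file and column names).
import pandas as pd

# 1. Integration: collect raw data from several sources into one frame.
sources = ["orders_2023.csv", "orders_2024.csv"]
raw = pd.concat([pd.read_csv(path) for path in sources], ignore_index=True)

# 2. Management: clean the data and store it in an efficient columnar format
#    (to_parquet needs a parquet engine such as pyarrow installed).
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw = raw.dropna(subset=["customer_id", "amount"])
raw.to_parquet("orders_clean.parquet", index=False)

# 3. Analysis: aggregate and report so insights can be shared.
monthly_revenue = raw.groupby(raw["order_date"].dt.to_period("M"))["amount"].sum()
print(monthly_revenue.head())
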
What is big data analytics?

Big data analytics is the process of collecting, examining, and analysing large amounts
of data to discover market trends, insights, and patterns that can help companies make
better business decisions. This information is available quickly and efficiently so that
companies can be agile in crafting plans to maintain their competitive advantage.

Big data analytics is important because it helps companies leverage their data to
identify opportunities for improvement and optimisation. Across different business
segments, increasing efficiency leads to overall more intelligent operations, higher
profits, and satisfied customers. Big data analytics helps companies reduce costs and
develop better, customer-centric products and services. Technologies such as business intelligence (BI) tools and systems help organisations take unstructured and
structured data from multiple sources. Users (typically employees) input queries into
these tools to understand business operations and performance. Big data analytics
uses the four data analysis methods to uncover meaningful insights and derive
solutions.
Types of big data analytics

Four main types of big data analytics support and inform different business decisions.

1. Descriptive analytics

Descriptive analytics refers to data that can be easily read and interpreted. This data
helps create reports and visualise information that can detail company profits and
sales.

Example: During the pandemic, a leading pharmaceutical company conducted data analysis on its offices and research labs. Descriptive analytics helped it identify and consolidate unutilised spaces and departments, saving the company millions of pounds.

2. Diagnostics analytics

Diagnostics analytics helps companies understand why a problem occurred. Big data
technologies and tools allow users to mine and recover data that helps dissect an issue
and prevent it from happening in the future.

Example: An online retailer’s sales have decreased even though customers continue
to add items to their shopping carts. Diagnostics analytics helped to understand that
the payment page was not working correctly for a few weeks.

3. Predictive analytics

Predictive analytics looks at past and present data to make predictions. With artificial
intelligence (AI), machine learning, and data mining, users can analyse the data to
predict market trends.

Example: In the manufacturing sector, companies can use algorithms based on historical data to predict if or when a piece of equipment will malfunction or break down.
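As a small, hedged illustration of that predictive idea, the sketch below fits a simple classifier on invented sensor readings to estimate failure risk. The variables, thresholds, and data are all synthetic; real predictive-maintenance models are considerably more involved.

# Toy predictive-maintenance example: predict equipment failure from sensor data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
temperature = rng.normal(70, 10, n)      # synthetic temperature readings
vibration = rng.normal(0.3, 0.1, n)      # synthetic vibration readings
# Failures become more likely as temperature and vibration rise.
risk = 0.05 * (temperature - 70) + 8 * (vibration - 0.3)
failed = (risk + rng.normal(0, 0.5, n) > 0.5).astype(int)

X = np.column_stack([temperature, vibration])
X_train, X_test, y_train, y_test = train_test_split(X, failed, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
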

4. Prescriptive analytics

Prescriptive analytics solves a problem, relying on AI and machine learning to gather and use data for risk management.

Example: Within the energy sector, utility companies, gas producers, and pipeline
owners identify factors that affect the price of oil and gas to hedge risks.


Benefits of big data analytics

Incorporating big data analytics into a business or organisation has several advantages. These include:

Cost reduction: Big data can reduce the costs of storing all business data in one place. Tracking analytics also helps companies find ways to work more efficiently to cut costs wherever possible.

Product development: Developing and marketing new products, services, or brands is much easier when based on data collected from customers’ needs and wants. Big data analytics also helps businesses understand product viability and keep up with trends.

Strategic business decisions: The ability to constantly analyse data helps businesses make better and faster decisions, such as cost and supply chain optimisation.

Customer experience: Data-driven algorithms help marketing efforts (targeted ads, for example) and increase customer satisfaction by delivering an enhanced customer experience.

Risk management: Businesses can identify risks by analysing data patterns and developing solutions for managing those risks.

UNSTRUCTURED DATA

Types of Big Data

All data cannot be stored in the same way. The methods for data storage can be accurately evaluated only after the type of data has been identified.

1. Structured data

Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository, typically a database, and concerns all data that can be stored in a table with rows and columns. Structured data has relational keys and can easily be mapped into pre-designed fields. Today it is the most processed form of data and the simplest to manage.
Example: Relational data.

2. Semi-Structured data

Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database (although this can be very hard for some kinds of semi-structured data), but semi-structured formats exist to ease storage. Example: XML data.
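To make the XML example concrete, the short sketch below parses a small, invented XML snippet with Python's standard library and flattens it into rows; real semi-structured records are usually larger, nested, and less regular, which is exactly why they resist a fixed schema.

# Parse a small semi-structured XML document into tabular rows (hypothetical data).
import xml.etree.ElementTree as ET

xml_text = """
<customers>
  <customer id="c1"><name>Asha</name><city>Chennai</city></customer>
  <customer id="c2"><name>Ravi</name></customer>  <!-- city is missing -->
</customers>
"""

root = ET.fromstring(xml_text)
rows = []
for cust in root.findall("customer"):
    rows.append({
        "id": cust.get("id"),
        "name": cust.findtext("name"),
        # Optional fields are common in semi-structured data.
        "city": cust.findtext("city", default="unknown"),
    })

print(rows)
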

3. Unstructured data

Unstructured data is data that is not organized in a predefined manner and does not have a predefined data model, so it is not a good fit for a mainstream relational database. Alternative platforms exist for storing and managing unstructured data; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word, PDF, text, and media logs.

Unstructured data is data which does not conform to a data model and has no easily identifiable structure, such that it cannot easily be used by a computer program. Unstructured data is not organised in a pre-defined manner and does not have a pre-defined data model, thus it is not a good fit for a mainstream relational database.

From 80% to 90% of the data generated and collected by organizations is unstructured, and its volume is growing rapidly, many times faster than the rate of growth for structured databases.

Unstructured data stores contain a wealth of information that can be used to guide business decisions. However, unstructured data has historically been very difficult to analyze. With the help of AI and machine learning, new software tools are emerging that can search through vast quantities of it to uncover beneficial and actionable business intelligence.

Unstructured data vs. structured data

Let’s take structured data first: it’s usually stored in a relational database or
RDBMS, and is sometimes referred to as relational data. It can be easily mapped into
designated fields — for example, fields for zip codes, phone numbers, and credit cards.
Data that conforms to RDBMS structure is easy to search, both with human-defined
queries and with software.


Unstructured data, in contrast, doesn’t fit into these sorts of pre-defined data
models. It can’t be stored in an RDBMS. And because it comes in so many formats, it’s
a real challenge for conventional software to ingest, process, and analyze. Simple
content searches can be undertaken across textual unstructured data with the right
tools.

Beyond that, the lack of consistent internal structure doesn’t conform to what
typical data mining systems can work with. As a result, companies have largely been
unable to tap into value-laden data like customer interactions, rich media, and social
network conversations. Robust tools for doing so are only now being developed and
commercialized.

What are some examples of unstructured data?

Unstructured data can be created by people or generated by machines.

Here are some examples of the human-generated variety:

 Email: Email message fields are unstructured and cannot be parsed by traditional analytics tools. That said, email metadata affords it some structure, which explains why email is sometimes considered semi-structured data.

 Text files: This category includes word processing documents, spreadsheets, presentations, email, and log files.

 Social media and websites: Data from social networks like Twitter, LinkedIn, and Facebook, and websites such as Instagram, photo-sharing sites, and YouTube.

 Mobile and communications data: For this category, look no further than text messages, phone recordings, collaboration software, chat, and instant messaging.

 Media: This data includes digital photos, audio, and video files.

Here are some examples of unstructured data generated by machines:

 Scientific data: This includes oil and gas surveys, space exploration, seismic
imagery, and atmospheric data.

 Digital surveillance: This category features data like reconnaissance photos and
videos.

 Satellite imagery: This data includes weather data, land forms, and military
movements.


Characteristics of Unstructured Data:



 Data neither conforms to a data model nor has any structure.
 Data cannot be stored in the form of rows and columns as in databases.
 Data does not follow any semantics or rules.
 Data lacks any particular format or sequence.
 Data has no easily identifiable structure.
 Due to the lack of identifiable structure, it cannot be used by computer programs easily.

Sources of Unstructured Data:

 Web pages

 Images (JPEG, GIF, PNG, etc.)

 Videos

 Memos

 Reports

 Word documents and PowerPoint presentations

 Surveys

Advantages of Unstructured Data:

 It supports data that lacks a proper format or sequence.
 The data is not constrained by a fixed schema.
 It is very flexible due to the absence of a schema.
 Data is portable.
 It is very scalable.
 It can deal easily with heterogeneous sources.
 These types of data have a variety of business intelligence and analytics applications.

Disadvantages of Unstructured data:

 It is difficult to store and manage unstructured data due to the lack of schema and structure.
 Indexing the data is difficult and error-prone due to the unclear structure and the absence of pre-defined attributes, so search results are not very accurate.
 Ensuring data security is a difficult task.


Problems faced in storing unstructured data:

 It requires a lot of storage space to store unstructured data.
 It is difficult to store videos, images, audio, etc.
 Due to the unclear structure, operations like update, delete, and search are very difficult.
 Storage cost is high as compared to structured data.
 Indexing the unstructured data is difficult.

Possible solution for storing Unstructured data:

 Unstructured data can be converted to easily manageable formats.

 A content-addressable storage (CAS) system can be used to store unstructured data. It stores data based on metadata, and a unique name is assigned to every object stored in it; an object is retrieved based on its content, not its location.

 Unstructured data can be stored in XML format.

 Unstructured data can be stored in an RDBMS that supports BLOBs (a small sketch of this follows the list).
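As a minimal sketch of the BLOB idea above, the code below uses Python's built-in sqlite3 module to keep an unstructured object next to a little structured metadata. The table layout, file name, and payload are hypothetical; production systems would more likely use a dedicated object store or CAS product.

# Store an unstructured file (e.g. an image) as a BLOB in a relational table.
import sqlite3

conn = sqlite3.connect("unstructured_store.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS documents ("
    "  name TEXT PRIMARY KEY,"
    "  mime_type TEXT,"
    "  content BLOB)"
)

payload = b"\x89PNG...placeholder binary bytes..."   # stand-in for real file bytes
conn.execute(
    "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
    ("logo.png", "image/png", payload),
)
conn.commit()

# Retrieve the object by name and inspect its size.
blob, = conn.execute(
    "SELECT content FROM documents WHERE name = ?", ("logo.png",)
).fetchone()
print(len(blob), "bytes retrieved")
conn.close()
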

Extracting information from unstructured Data:

Unstructured data does not have any structure, so it cannot easily be interpreted by conventional algorithms. It is also difficult to tag and index unstructured data, so extracting information from it is a tough job. Here are some possible solutions:

 Taxonomies, or classification of data, help in organising data in a hierarchical structure, which makes the search process easier (a small classification sketch follows this list).

 Data can be stored in a virtual repository and automatically tagged, for example with Documentum.

 Use of application platforms like XOLAP, which helps in extracting information from e-mails and XML-based documents.

 Use of various data mining tools.
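As a hedged illustration of the classification idea in the first bullet, the sketch below trains a tiny text classifier with scikit-learn to sort short documents into two invented categories. Real taxonomies are far larger, often hierarchical, and usually curated by domain experts; this only shows the mechanical idea.

# Tiny example of organising unstructured text into categories (a flat taxonomy).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "invoice payment overdue account balance",
    "refund request for damaged shipment",
    "server outage reported in data center",
    "password reset and login failure",
]
labels = ["finance", "finance", "it-support", "it-support"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)

# Classify a new, unseen document into the taxonomy.
print(model.predict(["cannot login to my account after reset"]))
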

BIG DATA INDUSTRY APPLICATIONS

Here are some of the sectors where Big Data is actively used:

Ecommerce - Predicting customer trends and optimizing prices are a few of the
ways e-commerce uses Big Data analytics

Marketing - Big Data analytics helps to drive high-ROI marketing operations, which result in improved sales

Education - Used to develop new and improve existing courses based on market
requirements
Healthcare - With the help of a patient’s medical history, Big Data analytics is
used to predict how likely they are to have health issues

Media and entertainment - Used to understand the demand for shows, movies, songs, and more in order to deliver a personalized recommendation list to users

Banking - Customer income and spending patterns help to predict the likelihood of choosing various banking offers, like loans and credit cards

Telecommunications - Used to forecast network capacity and improve customer experience

Government - Big Data analytics helps governments in law enforcement, among other things

APPLICATIONS OF BIG DATA

In today’s world, there is a lot of data, and big companies utilize it for their business growth. By analyzing this data, useful decisions can be made in various cases, as discussed below:

1. Tracking Customer Spending Habits and Shopping Behavior:

In big retail stores (like Amazon, Walmart, Big Bazaar, etc.) the management team has to keep data on customers’ spending habits (which products customers spend on, which brands they prefer, how frequently they spend), shopping behavior, and customers’ most-liked products (so that they can keep those products in the store). Based on data about which products are searched for or sold most, the production or stocking rate of those products is set.

The banking sector uses customers’ spending-behavior data so that it can offer a particular customer a discount or cashback on a product they like when it is bought using the bank’s credit or debit card. In this way, banks can send the right offer to the right person at the right time.

2. Recommendation:

By tracking customer spending habits and shopping behavior, big retail stores provide recommendations to the customer. E-commerce sites like Amazon, Walmart, and Flipkart do product recommendation: they track what products a customer is searching for and, based on that data, recommend that type of product to that customer.

As an example, suppose a customer searches for a bed cover on Amazon. Amazon now has data suggesting that this customer may be interested in buying a bed cover, so the next time the customer visits a Google page, advertisements for various bed covers are shown. In this way, an advertisement for the right product can be sent to the right customer. YouTube similarly recommends videos based on the types of videos a user has previously liked and watched, and shows relevant advertisements based on the content of the video currently being watched. For example, if someone is watching a tutorial video on big data, an advertisement for another big data course may be shown during that video.

3. Smart Traffic System:

Data about traffic conditions on different roads is collected through cameras kept beside the road and at the entry and exit points of the city, and from GPS devices placed in vehicles (Ola, Uber cabs, etc.). All such data are analyzed, and jam-free or less congested, less time-consuming routes are recommended. In this way a smart traffic system can be built in the city using big data analysis. An additional benefit is that fuel consumption can be reduced.

4. Secure Air Traffic System:

Sensors are present at various places in a flight (such as the propellers). These sensors capture data like the speed of the flight, moisture, temperature, and other environmental conditions. Based on analysis of such data, environmental parameters within the flight are set up and varied.

By analyzing the flight’s machine-generated data, it can be estimated how long the machine can operate flawlessly and when it needs to be replaced or repaired.

5. Auto Driving Car:

Big data analysis helps drive a car without human intervention. Cameras and sensors are placed at various spots on the car to gather data such as the size of surrounding vehicles, obstacles, and the distance from them. These data are analyzed, and various calculations, such as how many degrees to turn, what the speed should be, and when to stop, are carried out. These calculations help the car take action automatically.

6. Virtual Personal Assistant Tool:

Big data analysis helps virtual personal assistant tools (like Siri on Apple devices, Cortana on Windows, and Google Assistant on Android) to answer the various questions asked by users. The tool tracks the location of the user, their local time, the season, and other data related to the question asked; analyzing all such data, it provides an answer.

As an example, suppose a user asks, “Do I need to take an umbrella?” The tool collects data such as the user’s location, the season, and the weather conditions at that location, analyzes these data to determine whether there is a chance of rain, and then provides the answer.
7. IoT:

Manufacturing companies install IoT sensors in machines to collect operational data. By analyzing such data, it can be predicted how long a machine will work without any problem and when it will require repair, so that the company can take action before the machine develops a lot of issues or goes completely down. Thus, the cost of replacing the whole machine can be saved.

In the healthcare field, big data is making a significant contribution. Using big data tools, data regarding patient experience is collected and used by doctors to give better treatment. IoT devices can sense the symptoms of a probable upcoming disease in the human body and help prevent it by giving treatment in advance. IoT sensors placed near a patient or a new-born baby constantly keep track of various health conditions like heart rate, blood pressure, etc. Whenever any parameter crosses the safe limit, an alarm is sent to a doctor so that they can take steps remotely very soon.

8. Education Sector:

Organizations conducting online educational courses utilize big data to search for candidates interested in those courses. If someone searches for a YouTube tutorial video on a subject, then online or offline course providers for that subject send online ads about their courses to that person.

9. Energy Sector:

Smart electric meters read consumed power every 15 minutes and send the readings to a server, where the data is analyzed and the times of day when the power load is lowest throughout the city can be estimated. With this system, manufacturing units or households are advised to run their heavy machines at night, when the power load is low, to enjoy a lower electricity bill.

10. Media and Entertainment Sector:

Media and entertainment service providers like Netflix, Amazon Prime, and Spotify analyze data collected from their users. Data such as what type of video or music users watch or listen to most and how long users spend on the site are collected and analyzed to set the next business strategy.


BIG DATA TECHNOLOGIES

Big data technologies can be categorized into four main types: data storage, data
mining, data analytics, and data visualization [2]. Each of these is associated with
certain tools, and you’ll want to choose the right tool for your business needs
depending on the type of big data technology required.

1. Data storage

Big data technology that deals with data storage has the capability to fetch, store, and
manage big data. It is made up of infrastructure that allows users to store the data so
that it is convenient to access. Most data storage platforms are compatible with other
programs. Two commonly used tools are Apache Hadoop and MongoDB.

 Apache Hadoop: Hadoop is the most widely used big data tool. It is an open-source software platform that stores and processes big data in a distributed computing environment across hardware clusters. This distribution allows for faster data processing. The framework is designed to reduce bugs or faults, be scalable, and process all data formats.

 MongoDB: MongoDB is a NoSQL database that can be used to store large volumes of data. Using key-value pairs (a basic unit of data), MongoDB categorizes documents into collections. It is written in C, C++, and JavaScript, and is one of the most popular big data databases because it can manage and store unstructured data with ease.
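A brief sketch of storing and querying documents from Python is shown below. It assumes the pymongo driver is installed and a MongoDB server is listening on the default localhost port; the database, collection, and field names are made up for illustration.

# Insert and query schema-less documents in MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
reviews = client["shop"]["reviews"]          # database "shop", collection "reviews"

# Documents in one collection do not need identical fields.
reviews.insert_many([
    {"product": "kettle", "rating": 5, "text": "boils fast"},
    {"product": "kettle", "rating": 2},       # no free-text review
])

# Find well-rated kettles and print them.
for doc in reviews.find({"product": "kettle", "rating": {"$gte": 4}}):
    print(doc["rating"], doc.get("text", ""))

client.close()
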

2. Data mining

Data mining extracts the useful patterns and trends from the raw data. Big data
technologies such as Rapidminer and Presto can turn unstructured and structured data
into usable information.

 Rapidminer: Rapidminer is a data mining tool that can be used to build predictive models. Its two strengths are processing and preparing data and building machine learning and deep learning models, and its end-to-end design allows both functions to drive impact across the organization [3].

 Presto: Presto is an open-source query engine that was originally developed by Facebook to run analytic queries against its large datasets, and it is now widely available. One query on Presto can combine data from multiple sources within an organization and perform analytics on them in a matter of minutes.
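For a sense of how such a query might be issued from Python, here is a hedged sketch. It assumes the presto-python-client package (imported as prestodb), a reachable Presto coordinator, and a hypothetical page_views table in a Hive catalog; the host, schema, and table names are illustrative only.

# Run an analytic query against a Presto coordinator from Python.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",   # hypothetical coordinator host
    port=8080,
    user="analyst",
    catalog="hive",
    schema="web",
)
cur = conn.cursor()
cur.execute(
    "SELECT country, count(*) AS views "
    "FROM page_views GROUP BY country ORDER BY views DESC LIMIT 10"
)
for country, views in cur.fetchall():
    print(country, views)
conn.close()
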

3. Data analytics

In big data analytics, technologies are used to clean and transform data into
information that can be used to drive business decisions. This next step (after data 14

SubCode:CCS334 Subject Name:Big Data Analytics


Department of Computer Science and Business Systems

(An Autonomous Institutions)

mining) is where users perform algorithms, models, and predictive analytics using tools
such as Apache Spark and Splunk.

 Apache Spark: Spark is a popular big data tool for data analysis because it is fast and efficient at running applications. It is faster than Hadoop because it processes data in random access memory (RAM) rather than storing and processing it in batches via MapReduce. Spark supports a wide variety of data analytics tasks and queries (see the PySpark sketch after this list).

 Splunk: Splunk is another popular big data analytics tool for deriving insights
from large datasets. It has the ability to generate graphs, charts, reports, and
dashboards. Splunk also enables users to incorporate artificial intelligence (AI)
into data outcomes.
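The short PySpark sketch below, referenced in the Spark bullet above, shows a simple in-memory aggregation with the DataFrame API. It assumes the pyspark package is installed and uses an invented sales dataset.

# Aggregate a small, in-memory dataset with PySpark DataFrames.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-demo").getOrCreate()

sales = spark.createDataFrame(
    [("north", 120.0), ("south", 80.0), ("north", 200.0)],
    ["region", "amount"],
)

# Group by region and compute total sales per region.
totals = sales.groupBy("region").agg(F.sum("amount").alias("total_amount"))
totals.show()

spark.stop()
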

4. Data visualization

Finally, big data technologies can be used to create stunning visualizations from the
data. In data-oriented roles, data visualization is a skill that is beneficial for presenting
recommendations to stakeholders for business profitability and operations—to tell an
impactful story with a simple graph.

 Tableau: Tableau is a very popular tool in data visualization because its drag-
and-drop interface makes it easy to create pie charts, bar charts, box
plots, Gantt charts, and more. It is a secure platform that allows users to share
visualizations and dashboards in real time.

 Looker: Looker is a business intelligence (BI) tool used to make sense of big
data analytics and then share those insights with other teams. Charts, graphs,
and dashboards can be configured with a query, such as monitoring weekly
brand engagement through social media analytics.

OPEN SOURCE TECHNOLOGIES / BIG DATA ANALYTICS TOOLS

There are hundreds of data analytics tools out there in the market today but the
selection of the right tool will depend upon your business NEED, GOALS, and VARIETY
to get business in the right direction. Now, let’s check out the top 10 analytics tools in
big data.

1. APACHE Hadoop

It’s a Java-based open-source platform that is used to store and process big data. It is built on a cluster system that allows the system to process data efficiently and lets the data run in parallel. It can process both structured and unstructured data from one server across multiple computers. Hadoop also offers cross-platform support for its users. Today, it is the best big data analytics tool and is popularly used by many tech giants such as Amazon, Microsoft, IBM, etc.


Features of Apache Hadoop:

 Free to use and offers an efficient storage solution for businesses.
 Offers quick access via HDFS (Hadoop Distributed File System).
 Highly flexible and can be easily integrated with MySQL and JSON.
 Highly scalable as it can distribute a large amount of data in small segments.
 It works on small commodity hardware like JBOD (just a bunch of disks).
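To ground the Hadoop discussion, below is the classic word-count pair of scripts written in Python for Hadoop Streaming, which lets MapReduce jobs be written in any language. The scripts themselves are standard, but the input and output HDFS paths mentioned afterwards are hypothetical.

# mapper.py - emit (word, 1) pairs, one per line, for Hadoop Streaming.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - sum counts per word; Hadoop sorts mapper output by key first.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Such a job would typically be submitted through the Hadoop Streaming jar, passing mapper.py and reducer.py along with hypothetical HDFS paths such as /data/books for input and /data/wordcount for output.
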

2. Cassandra

APACHE Cassandra is an open-source NoSQL distributed database that is used to fetch and manage large amounts of data. It’s one of the most popular tools for data analytics and has been praised by many tech companies due to its high scalability and availability without compromising speed and performance. It is capable of delivering thousands of operations every second and can handle petabytes of data with almost zero downtime. It was created by Facebook back in 2008 and later released publicly.

Features of APACHE Cassandra:

 Data Storage Flexibility: It supports all forms of data, i.e. structured, unstructured, and semi-structured, and allows users to make changes as per their needs.
 Data Distribution System: Data is easy to distribute with the help of replication across multiple data centers.
 Fast Processing: Cassandra has been designed to run on efficient commodity hardware and also offers fast storage and data processing.
 Fault Tolerance: The moment any node fails, it is replaced without any delay.
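A hedged sketch of talking to Cassandra from Python follows, assuming the DataStax cassandra-driver package and a node on localhost; the keyspace, table, and sensor values are hypothetical.

# Create a table and insert/read rows with the Cassandra Python driver.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.set_keyspace("demo")
session.execute(
    "CREATE TABLE IF NOT EXISTS sensor_readings ("
    "  sensor_id text, ts timestamp, value double, "
    "  PRIMARY KEY (sensor_id, ts))"
)

# Write one reading and read it back, partitioned by sensor_id.
session.execute(
    "INSERT INTO sensor_readings (sensor_id, ts, value) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("s-101", 23.7),
)
for row in session.execute(
    "SELECT * FROM sensor_readings WHERE sensor_id = %s", ("s-101",)
):
    print(row.sensor_id, row.ts, row.value)

cluster.shutdown()
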

3. Qubole

It’s an open-source big data tool that helps in fetching data in a value chain using ad-hoc analysis and machine learning. Qubole is a data lake platform that offers end-to-end service with reduced time and effort required in moving data pipelines. It is capable of configuring multi-cloud services such as AWS, Azure, and Google Cloud. Besides, it also helps in lowering the cost of cloud computing by 50%.

Features of Qubole:

 Supports ETL process: It allows companies to migrate data from multiple sources into one place.
 Real-time Insight: It monitors users’ systems and allows them to view real-time insights.
 Predictive Analysis: Qubole offers predictive analysis so that companies can take action accordingly to target more acquisitions.
 Advanced Security System: To protect users’ data in the cloud, Qubole uses an advanced security system and also ensures protection against any future breaches. Besides, it also allows encrypting cloud data against any potential threat.

4. Xplenty

It is a data analytics tool for building a data pipeline using minimal code. It offers a wide range of solutions for sales, marketing, and support. With the help of its interactive graphical interface, it provides solutions for ETL, ELT, etc. The best part of using Xplenty is its low investment in hardware and software, and it offers support via email, chat, telephone, and virtual meetings. Xplenty is a platform to process data for analytics over the cloud and brings all the data together.

Features of Xplenty:

 Rest API: A user can do almost anything by using the Rest API.
 Flexibility: Data can be sent and pulled to databases, warehouses, and Salesforce.
 Data Security: It offers SSL/TLS encryption, and the platform is capable of verifying algorithms and certificates regularly.
 Deployment: It offers integration apps for both cloud and in-house use and supports deployment of integration apps over the cloud.

5. Spark

APACHE Spark is another framework used to process data and perform numerous tasks on a large scale. It processes data across multiple computers with the help of distribution tools. It is widely used among data analysts as it offers easy-to-use APIs that provide easy data-pulling methods, and it is capable of handling multiple petabytes of data as well. Spark famously set a record by processing 100 terabytes of data in just 23 minutes, breaking Hadoop's previous world record (71 minutes). This is why big tech giants are moving towards Spark, and it is highly suitable for ML and AI today.

Features of APACHE Spark:

 Ease of use: It allows users to run applications in their preferred language (Java, Python, etc.).
 Real-time Processing: Spark can handle real-time streaming via Spark Streaming (a small streaming sketch follows this list).
 Flexible: It can run on Mesos, Kubernetes, or the cloud.
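Since the feature list above mentions Spark Streaming, here is a hedged Structured Streaming sketch that counts words arriving on a local socket. It assumes pyspark is installed and that something is writing text lines to localhost:9999 (for example a netcat session); both the source and the port are illustrative.

# Structured Streaming word count over a socket source.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

lines = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print continuously updated counts to the console until stopped.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
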

6. Mongo DB

MongoDB, which came into the limelight in 2010, is a free, open-source, document-oriented (NoSQL) database that is used to store high volumes of data. It uses collections and documents for storage, and its documents consist of key-value pairs, which are considered the basic unit of MongoDB. It is popular among developers due to its availability for multiple programming languages such as Python, JavaScript, and Ruby.

Features of Mongo DB:

 Written in C++: It’s a schema-less DB and can hold a variety of documents inside.
 Simplifies the Stack: With the help of Mongo, a user can easily store files without any disturbance in the stack.
 Master-Slave Replication: It can write/read data from the master, and the slave can be called back for backup.

7. Apache Storm

Storm is a robust, user-friendly tool used for data analytics, especially in small companies. The best part about Storm is that it has no programming language barrier and can support any of them. It was designed to handle pools of large data in fault-tolerant and horizontally scalable ways. When we talk about real-time data processing, Storm leads the chart because of its distributed real-time big data processing system, which is why many tech giants use APACHE Storm in their systems today. Some of the most notable names are Twitter, Zendesk, NaviSite, etc.

Features of Storm:

 Data Processing: Storm processes the data even if a node gets disconnected.
 Highly Scalable: It keeps up its performance even as the load increases.
 Fast: The speed of APACHE Storm is impeccable; it can process up to 1 million messages of 100 bytes each on a single node.

8. SAS

Today it is one of the best tools for statistical modeling used by data analysts. Using SAS, a data scientist can mine, manage, extract, or update data in different variants from different sources. Statistical Analysis System, or SAS, allows a user to access data in any format (SAS tables or Excel worksheets). Besides that, it also offers a cloud platform for business analytics called SAS Viya, and to build a strong grip on AI and ML, new tools and products have been introduced.

Features of SAS:

 Flexible Programming Language: It offers easy-to-learn syntax and vast libraries, which make it suitable for non-programmers.
 Vast Data Format Support: It provides support for many programming languages, including SQL, and carries the ability to read data from any format.
 Encryption: It provides end-to-end security with a feature called SAS/SECURE.

9. Data Pine

Datapine is an analytical tool used for BI and was founded back in 2012 in Berlin, Germany. In a short period of time, it has gained much popularity in a number of countries, and it is mainly used for data extraction (for small and medium companies fetching data for close monitoring). With the help of its enhanced UI design, anyone can visit and check the data as per their requirements. It is offered in four different price brackets, starting from $249 per month, and dashboards are provided by function, industry, and platform.

Features of Datapine:

 Automation: To cut down on manual work, datapine offers a wide array of AI assistant and BI tools.
 Predictive Tool: datapine provides forecasting/predictive analytics; using historical and current data, it derives future outcomes.
 Add-ons: It also offers intuitive widgets, visual analytics and discovery, ad hoc reporting, etc.

10. Rapid Miner

It’s a fully automated visual workflow design tool used for data analytics. It’s a no-code platform, and users aren’t required to write code to segregate data. Today, it is heavily used in many industries such as ed-tech, training, and research. Though it is an open-source platform, it has a limitation of 10,000 data rows and a single logical processor. With the help of RapidMiner, one can easily deploy ML models to the web or mobile (once the user interface is ready to collect real-time figures).

Features of Rapid Miner:

 Accessibility: It allows users to access 40+ types of files (SAS, ARFF, etc.) via
URL
 Storage: Users can access cloud storage facilities such as AWS and dropbox
 Data validation: Rapid miner enables the visual display of multiple results in
history for better evaluation.
CLOUD AND BIG DATA

1. Big Data:
Big data refers to data that is huge in size and increasing rapidly with respect to time. Big data includes structured data, unstructured data, as well as semi-structured data. Big data cannot be stored and processed with traditional data management tools; it needs specialized big data management tools. It refers to complex and large data sets having the 5 Vs (volume, velocity, veracity, value, and variety) as information assets. It involves data storage, data analysis, data mining, and data visualization.


Examples of the sources where big data is generated include social media data, e-commerce data, weather station data, IoT sensor data, etc.

Characteristics of Big Data :

 Variety of Big data – Structured, unstructured, and semi structured data

 Velocity of Big data – Speed of data generation

 Volume of Big data – Huge volumes of data that is being generated

 Value of Big data – Extracting useful information and making it valuable

 Variability of Big data – Inconsistency which can be shown by the data at times.

Advantages of Big Data :

 Cost Savings

 Better decision-making

 Better Sales insights

 Increased Productivity

 Improved customer service.

Disadvantages of Big Data :

 Incompatible tools

 Security and Privacy Concerns

 Need for cultural change

 Rapid change in technology

 Specific hardware needs.

2. Cloud Computing :

Cloud computing refers to the on-demand availability of computing resources over the internet. These resources include servers, storage, databases, software, analytics, networking, and intelligence delivered over the Internet, and all of these resources can be used as per the customer's requirements. In cloud computing, customers pay as per use. It is very flexible, and resources can be scaled easily depending on the requirement. Instead of buying any IT resources physically, all resources can be availed from the cloud vendors as needed. Cloud computing has three service models, i.e. Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).

Examples of cloud computing vendors who provide cloud computing services are Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, IBM Cloud Services, etc.
Characteristics of Cloud Computing :

 On-Demand availability

 Accessible through a network

 Elastic Scalability

 Pay as you go model

 Multi-tenancy and resource pooling.

Advantages of Cloud Computing :

 Back-up and restore data

 Improved collaboration

 Excellent accessibility

 Low maintenance cost

 On-Demand Self-service.

Disadvantages of Cloud Computing:

 Vendor lock-in

 Limited Control

 Security Concern

 Downtime due to various reason

 Requires good Internet connectivity.

Difference between Big Data and Cloud Computing:

01. Big Data: Big data refers to data that is huge in size and also increasing rapidly with respect to time.
    Cloud Computing: Cloud computing refers to the on-demand availability of computing resources over the internet.

02. Big Data: Big data includes structured data, unstructured data, as well as semi-structured data.
    Cloud Computing: Cloud computing services include Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).

03. Big Data: Volume, velocity, variety, veracity, and value of data are considered the five most important characteristics of big data.
    Cloud Computing: On-demand availability of IT resources, broad network access, resource pooling, elasticity, and measured service are considered the main characteristics of cloud computing.

04. Big Data: The purpose of big data is to organize large volumes of data, extract useful information from it, and use that information for the improvement of the business.
    Cloud Computing: The purpose of cloud computing is to store and process data in the cloud, or to avail remote IT services without physically installing any IT resources.

05. Big Data: Distributed computing is used for analyzing the data and extracting useful information.
    Cloud Computing: The internet is used to get cloud-based services from different cloud vendors.

06. Big Data: Big data management allows a centralized platform, provision for backup and recovery, and low maintenance cost.
    Cloud Computing: Cloud computing services are cost-effective, scalable, and robust.

07. Big Data: Some of the challenges of big data are variety of data, data storage and integration, data processing, and resource management.
    Cloud Computing: Some of the challenges of cloud computing are availability, transformation, security concerns, and the charging model.

08. Big Data: Big data refers to huge volumes of data, its management, and useful information extraction.
    Cloud Computing: Cloud computing refers to remote IT resources and different internet service models.

09. Big Data: Big data is used to describe huge volumes of data and information.
    Cloud Computing: Cloud computing is used to store data and information on remote servers and also to process the data using remote infrastructure.

10. Big Data: Some of the sources where big data is generated include social media data, e-commerce data, weather station data, IoT sensor data, etc.
    Cloud Computing: Some of the cloud computing vendors who provide cloud computing services are Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, IBM Cloud Services, etc.


WEB ANALYTICS

Web Analytics or Online Analytics refers to the analysis of quantifiable and measurable
data of your website with the aim of understanding and optimizing the web usage.

Web Analytics is the methodological study of online/offline patterns and trends. It is a technique that you can employ to collect, measure, report, and analyze your website data. It is normally carried out to analyze the performance of a website and optimize its web usage.

Web analytics is used to track key metrics and analyze visitors’ activity and traffic flow. It is a tactical approach to collecting data and generating reports. It is an ongoing process that helps in attracting more traffic to a site and thereby increasing the return on investment.

Web analytics focuses on various issues. For example,

 Detailed comparison of visitor data, and Affiliate or referral data.

 Website navigation patterns.

 The amount of traffic your website received over a specified period of time.

 Search engine data.

Web analytics improves online experience for your customers and elevates your
business prospects. There are various Web Analytics tools available in the market. For
example, Google Analytics, Kissmetrics, Optimizely, etc.

Importance of Web Analytics

Web Analytics is needed to assess the success rate of a website and its associated business. Using Web Analytics, we can −


 Assess web content problems so that they can be rectified

 Have a clear perspective of website trends

 Monitor web traffic and user flow

 Demonstrate goals acquisition

 Figure out potential keywords

 Identify segments for improvement

 Find out referring sources

Web Analytics Process

The primary objective of carrying out Web Analytics is to optimize the website in order
to provide better user experience. It provides a data-driven report to measure visitors’
flow throughout the website.

The process of web analytics involves the following steps −

 Set the business goals.

 To track goal achievement, set the Key Performance Indicators (KPIs).

 Collect correct and suitable data.

 Analyze the data to extract insights.

 Test alternatives based on assumptions learned from the data analysis.

 Implement insights based on either the data analysis or website testing.

Types of Web Analytics

There are two types of web analytics −

 On-site − It measures users’ behaviour once they are on the website, for example, measurement of your website’s performance.

 Off-site − It is the measurement and analysis irrespective of whether you own or maintain a website, for example, measurement of visibility, comments, potential audience, etc.

Metrics of Web Analytics

There are three basic metrics of web analytics −

Count

It is the most basic metric of measurement. It is represented as a whole number or a fraction. For example,

 Number of visitors = 12,999, number of likes = 3,060, etc.



 Total sales of merchandise = $54,396.18.

Ratio

It is typically a count divided by some other count. For example, Page views per
visit.

Key Performance Indicator (KPI)

It depends upon the business type and strategy. KPI varies from one business to
another.
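A small sketch computing the count and ratio metrics described above from a hypothetical visit log is given below; a KPI would simply be whichever of these numbers (or a derived target) the business decides to track.

# Compute simple web-analytics metrics (counts and ratios) from a visit log.
# The visit records are invented for illustration.
visits = [
    {"visitor": "v1", "pages": 5, "seconds_on_site": 310},
    {"visitor": "v2", "pages": 1, "seconds_on_site": 12},   # a bounce
    {"visitor": "v3", "pages": 3, "seconds_on_site": 95},
]

total_visits = len(visits)                                   # a count
total_page_views = sum(v["pages"] for v in visits)           # a count
pages_per_visit = total_page_views / total_visits            # a ratio
bounce_rate = sum(v["pages"] == 1 for v in visits) / total_visits  # a ratio

print(f"visits={total_visits}, page views={total_page_views}")
print(f"pages/visit={pages_per_visit:.2f}, bounce rate={bounce_rate:.0%}")
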

Micro and macro Level Data Insights

Google Analytics gives you more accurate insight into your data. You can understand the data at two levels: the micro level and the macro level.

Micro Level Analysis

It pertains to an individual or a small group of individuals. For example, the number of times a job application was submitted, the number of times “print this page” was clicked, etc.

Macro Level Analysis

It is concerned with the primary business objectives and huge groups of people such as communities, nations, etc. For example, the number of conversions in a particular demographic.

Web Analysis - What to Measure?

These are the few measurements conducted in web analytics −

 Engagement Rate
It shows how long a person stays on your web page and which pages they browse. To make your web pages more engaging, include informative content, visuals, fonts, and bullets.
 Bounce Rate
If a person leaves your website within a span of 30 seconds, it is considered a bounce. The rate at which users bounce away is called the bounce rate. To minimize the bounce rate, include related posts, clear calls-to-action, and backlinks in your web pages.
 Dashboards
A dashboard is a single-page view of the information important to a user. You can create your own dashboards keeping in mind your requirements, and you may keep only frequently viewed data on a dashboard.
 Event Tracking
Event tracking allows you to track other activities on your website. For example, you can track downloads and sign-ups through event tracking.


 Traffic Source
You can get an overview of traffic sources and filter them further. Figuring out the key sources can help you learn about areas of improvement.
 Annotations
Annotations allow you to view a traffic report for a past period. You can click on the graph and type in a note to save it for future study.
 Visitor Flow
It gives you a clear picture of the pages visited and the sequence in which they were visited. Understanding users’ paths may help you redesign navigation in order to give customers a hassle-free experience.
 Content
It gives you insight into the website’s content section. You can see how each page is doing, website loading speed, etc.
 Conversions
Analytics lets you track goals and the paths used to achieve these goals. You can get details regarding product performance, purchase amounts, and modes of billing. Web analytics offers you more than this; all you need is to analyze things minutely and be patient.
 Page Load Time
The higher the load time, the higher the bounce rate, so tracking page load time is equally important.
 Behavior
Behavior lets you know page views and the time spent on the website. You can find out how a customer behaves once they are on your website.

MOBILE BUSINESS INTELLIGENCE


Business Intelligence

“Business Intelligence is not just about turning data into information, rather
organizations need that data to impact how their business operates and responds to
the changing marketplace.”

So, it is not all about transforming data into information, though Business Intelligence
significantly involves this process. Business Intelligence is transforming data into
meaningful, actionable insights that enable organizations to make informed business
strategies and tactical decisions.

Mobile Business Intelligence

Business Intelligence delivers relevant and trustworthy information to the right person
at the right time. Mobile business intelligence is the transfer of business
intelligence from the desktop to mobile devices such as the BlackBerry, iPad, and
iPhone.
The ability to access analytics and data on mobile devices or tablets rather than
desktop computers is referred to as mobile business intelligence. The business metric
dashboard and key performance indicators (KPIs) are more clearly displayed.

With the rising use of mobile devices, the technologies that we all utilise in our daily lives to make our lives easier, including in business, have grown as well. Many businesses have benefited from mobile business intelligence. Essentially, this section is a guide for business owners and others to educate them on the benefits and pitfalls of mobile BI.

Need for mobile BI?

Mobile phones' data storage capacity has grown in tandem with their use. You are
expected to make decisions and act quickly in this fast-paced environment. The
number of businesses receiving assistance in such a situation is growing by the day.

To expand your business or boost your business productivity, mobile BI can help, and
it works with both small and large businesses. Mobile BI can help you whether you are
a salesperson or a CEO. There is a high demand for mobile BI in order to reduce
information time and use that time for quick decision making.

As a result, timely decision-making can boost customer satisfaction and improve an


enterprise's reputation among its customers. It also aids in making quick decisions in
the face of emerging risks.

Data analytics and visualisation techniques are essential skills for any team that wants
to organise work, develop new project proposals, or wow clients with impressive
presentations.

Advantages of mobile BI

1. Simple access

Mobile BI is not restricted to a single mobile device or a certain place. You can view
your data at any time and from any location. Having real-time visibility into a firm
improves production and the daily efficiency of the business. Obtaining a company's
perspective with a single click simplifies the process.

2. Competitive advantage

Many firms are seeking better and more responsive methods to do business in order
to stay ahead of the competition. Easy access to real-time data improves company
opportunities and raises sales and capital. This also aids in making the necessary
decisions as market conditions change.

3. Simple decision-making

As previously stated, mobile BI provides access to real-time data at any time and
from any location. Mobile BI delivers information on demand, which assists
consumers in obtaining what they require at the time. As a result, decisions are
made quickly.
4. Increase Productivity

By extending BI to mobile, the organization's teams can access critical company


data when they need it. Obtaining all of the corporate data with a single click frees
up a significant amount of time to focus on the smooth and efficient operation of
the firm. Increased productivity results in a smooth and quick-running firm.

Disadvantages of Mobile BI

1. Stack of data

The primary function of Mobile BI is to store data in a systematic manner and then
present it to the user as required. As a result, Mobile BI retains all of the information
and ends up with heaps of older data. The corporation may only need a small
portion of that historical data, yet the entire stack must still be stored.

2. Expensive

Mobile BI can be quite costly at times. Large corporations can afford its expensive
services, but small businesses often cannot. Beyond the cost of the Mobile BI software
itself, we must also consider the cost of the IT staff needed to keep it running smoothly,
as well as the hardware involved. Moreover, larger corporations rarely settle for just one
Mobile BI provider; they require several. Even basic commercial transactions through
Mobile BI can be costly.

3. Time consuming

Businesses prefer Mobile BI because it promises a quick procedure; companies are not
patient enough to wait long for data before acting on it. In today's fast-paced environment,
anything that can produce results quickly is valuable. However, because the system has to
be built from the data warehouse, implementing BI in an enterprise can take more than
18 months.

4. Data breach

The biggest issue of the user when providing data to Mobile BI is data leakage. If
you handle sensitive data through Mobile BI, a single error can destroy your data as
well as make it public, which can be detrimental to your business.

Many Mobile BI providers are working to make it 100 percent secure to protect their
potential users' data. It is not only something that mobile BI carriers must consider,
but it is also something that we, as users, must consider when granting data access
authorization.

5. Poor quality data

Because we work online in every aspect, a great deal of data accumulates in Mobile BI,
which can become a significant problem: a large portion of the data analysed by Mobile BI
is irrelevant or completely useless. This can slow down the entire procedure, so you need
to select the data that is important and may be required in the future.

Best Mobile BI tools

1. Sisense

Sisense is a flexible business intelligence (BI) solution that includes powerful
analytics, visualisations, and reporting capabilities for managing and supporting
corporate data. Businesses can use the solution to evaluate large, diverse databases
and generate relevant business insights. You can easily visualise enormous volumes of
complex data with Sisense's code-first, low-code, and no-code technologies.
Sisense was established in 2004 with its headquarters in New York.

Since then, the team has steadily advanced its research; once the company received
$4 million in funding from investors, it picked up the pace.

2. SAP Roambi Analytics

Roambi analytics is a BI tool that offers a solution that allows you to fundamentally
rethink your data analysis, making it easier and faster while also increasing your data
interaction.

You can consolidate all of your company's data in a single tool using SAP Roambi
Analytics, which integrates all ongoing systems and data. Use of SAP Roambi
analysis is a simple three-step technique. Upload your html or spreadsheet files first.
The information is subsequently transformed into informative data or graphs, as
well as data that may be visualised.

After the data is collected, you may easily share it with your preferred device.
Roambi Analytics was founded in 2008 by a team based in California.

3. Microsoft Power BI Pro

Microsoft Power BI is an easy-to-use tool for non-technical business owners who are
unfamiliar with BI tools but wish to aggregate, analyse, visualise, and share data. You
only need a basic understanding of Excel and other Microsoft tools, and if you are
familiar with these, Power BI can be used as a self-service tool. Microsoft Power BI
has a unique feature that allows users to create subsets of data and then automatically
apply analytics to that information.

4. IBM Cognos Analytics
Cognos Analytics is an IBM-registered web-based business intelligence tool.


Cognos Analytics is now integrating with IBM Watson, and the benefits for users are
extremely exciting. Cognos Analytics with Watson helps connect and clean users' data,
resulting in properly visualised data.
That way, the business owner will know where they stand in comparison to their
competitors and where they can grow in the future. It combines reporting,
modelling, analysis, and dashboards to help you understand your organization's data
and make sound business decisions.

5. Amazon QuickSight

Amazon QuickSight assists in the creation and distribution of interactive BI
dashboards to users and can return answers to natural-language queries in seconds.
QuickSight can be accessed from any device and embedded in any website, portal, or app.

Amazon QuickSight allows you to quickly and easily create interactive
dashboards and reports for your users. Anyone in your organisation can
securely access those dashboards via browsers or mobile devices.

QuickSight's eye-catching feature is its pay-per-session model, which allows users
to use a dashboard created by someone else without paying much. The user
pays according to the length of the session, with prices ranging from $0.30 for a
30-minute session to $5 for unlimited use per month per user.

CROWD SOURCING ANALYTICS


Crowdsourcing is a sourcing model in which an individual or an organization
gets support from a large, open-minded, and rapidly evolving group of people in the
form of ideas, micro-tasks, finances, etc. Crowdsourcing typically involves the use of
the internet to attract a large group of people to divide tasks or to achieve a target.
The term was coined in 2005 by Jeff Howe and Mark Robinson. Crowdsourcing can
help different types of organizations get new ideas and solutions, deeper consumer
engagement, optimization of tasks, and several other things.

Let us understand this term deeply with the help of an example. Like
GeeksforGeeks is giving young minds an opportunity to share their knowledge with
the world by contributing articles, videos of their respective domain. Here
GeeksforGeeks is using the crowd as a source not only to expand their community but
also to include ideas of several young minds improving the quality of the content.

Where Can We Use Crowdsourcing?

Crowdsourcing is touching almost all sectors from education to health. It is not


only accelerating innovation but democratizing problem-solving methods. Some
fields where crowdsourcing can be used:

1. Enterprise

2. IT
3. Marketing

4. Education

5. Finance

6. Science and Health

How to Crowdsource?

1. For scientific problem solving, a broadcast search is used, where an organization
mobilizes a crowd to come up with a solution to a problem.

2. For information management problems, knowledge discovery and
management is used to find and assemble information.

3. For processing large datasets, distributed human intelligence is used: the
organization mobilizes a crowd to process and analyze the information.

Examples of Crowdsourcing

1. Doritos: It is one of the companies which is taking advantage of crowdsourcing


for a long time for an advertising initiative. They use consumer-created ads for
one of their 30-Second Super Bowl Spots(Championship Game of Football).

2. Starbucks: Another big venture which used crowdsourcing as a medium for


idea generation. Their white cup contest is a famous contest in which customers
need to decorate their Starbucks cup with an original design and then take a
photo and submit it on social media.

3. Lays: The "Do Us a Flavor" contest by Lays used crowdsourcing as an idea-generating


medium. They asked the customers to submit their opinion about the next chip
flavor they want.

4. Airbnb: A very famous travel website that offers people to rent their houses or
apartments by listing them on the website. All the listings are crowdsourced by
people.

Crowdsourced Marketing

As discussed already, crowdsourcing helps businesses grow a lot. Whether it is a
business idea or just a logo design, crowdsourcing engages people directly and, in
turn, saves money and energy. In the coming years, crowdsourced marketing
will surely get a boost as the world adopts technology faster.

Main Types of Crowdsourcing

Crowdsourcing involves obtaining information or resources from a wide swath of


people. In general, we can break this up into four main categories:
 Wisdom - Wisdom of crowds is the idea that large groups of people are
collectively smarter than individual experts when it comes to problem-solving
or identifying values (like the weight of a cow or number of jelly beans in a jar).

 Creation - Crowd creation is a collaborative effort to design or build something.


Wikipedia and other wikis are examples of this. Open-source software is
another good example.

 Voting - Crowd voting uses the democratic principle to choose a particular


policy or course of action by "polling the audience."

 Funding - Crowdfunding involves raising money for various purposes by
soliciting relatively small amounts from a large number of funders.

Crowdsourcing Sites

Here is the list of some famous crowdsourcing and crowdfunding sites.

1. Kickstarter

2. GoFundMe

3. Patreon

4. RocketHub

Advantages of Crowdsourcing

1. Evolving Innovation: Innovation is required everywhere and in this advancing


world innovation has a big role to play. Crowdsourcing helps in getting
innovative ideas from people belonging to different fields and thus helping
businesses grow in every field.

2. Save costs: It eliminates the time wasted in meeting people and convincing them.
The business idea only has to be posted on the internet, and you will be
flooded with suggestions from the crowd.

3. Increased Efficiency: Crowdsourcing has increased the efficiency of business


models as several expertise ideas are also funded.

Disadvantages of Crowdsourcing

1. Lack of confidentiality: Asking for suggestions from a large group of people can
bring the threat of idea stealing by other organizations.

2. Repeated ideas: Contestants in crowdsourcing competitions often submit
repeated or plagiarized ideas, which wastes time because reviewing the same
ideas again is not worthwhile.
INTER AND TRANS FIREWALL ANALYTICS

Inter-firewall analytics

 Focus: Analyzes traffic flows between different firewalls within a network.

 Methodology: Utilizes data collected from multiple firewalls to identify
anomalies and potential breaches.

 Benefits: Provides a comprehensive view of network traffic flow and helps
identify lateral movement across different security zones.

 Limitations: Requires deployment of multiple firewalls within the network and
efficient data exchange mechanisms between them.

Trans-firewall analytics

 Focus: Analyzes encrypted traffic that traverses firewalls, which traditional
security solutions may not be able to decrypt and inspect.

 Methodology: Uses deep packet inspection (DPI) and other advanced
techniques to analyze the content of encrypted traffic without compromising
its security.

 Benefits: Provides insight into previously hidden threats within encrypted traffic
and helps detect sophisticated attacks.

 Limitations: Requires specialized hardware and software solutions for DPI, and
raises concerns regarding potential data privacy violations.

Choosing the right approach

The choice between inter-firewall and trans-firewall analytics depends on several
factors, including:

 Network size and complexity: Larger and more complex networks benefit
more from inter-firewall analytics for comprehensive monitoring.

 Security needs and threats: Trans-firewall analytics is crucial for networks
handling sensitive data and facing advanced threats.

 Budget and resources: Trans-firewall analytics requires additional investment
in specialized hardware and software.
UNIT - III
MapReduce

• Job – A Job in the context of Hadoop MapReduce is the unit of work to be


performed as requested by the client / user. The information associated with the Job
includes the data to be processed (input data), MapReduce logic / program /
algorithm, and any other relevant configuration information necessary to execute the
Job.

• Task – Hadoop MapReduce divides a Job into multiple sub-jobs known as Tasks.


These tasks can be run independent of each other on various nodes across the cluster.
There are primarily two types of Tasks – Map Tasks and Reduce Tasks.

• Job Tracker– Just like the storage (HDFS), the computation (MapReduce) also
works in a master-slave / master-worker fashion. A Job Tracker node acts as the
Master and is responsible for scheduling / executing Tasks on appropriate nodes,
coordinating the execution of tasks, sending the information for the execution of
tasks, getting the results back after the execution of each task, re-executing the failed
Tasks, and monitors / maintains the overall progress of the Job. Since a Job consists
of multiple Tasks, a Job’s progress depends on the status / progress of Tasks
associated with it. There is only one Job Tracker node per Hadoop Cluster.

• TaskTracker – A TaskTracker node acts as the Slave and is responsible for


executing a Task assigned to it by the JobTracker. There is no restriction on the
number of TaskTracker nodes that can exist in a Hadoop Cluster. TaskTracker
receives the information necessary for execution of a Task from JobTracker,
Executes the Task, and Sends the Results back to Job Tracker.

• Map() – Map Task in MapReduce is performed using the Map() function. This part
of the MapReduce is responsible for processing one or more chunks of data and
producing the output results.
• Reduce() – The next part / component / stage of the MapReduce programming
model is the Reduce() function. This part of the MapReduce is responsible for
consolidating the results produced by each of the Map() functions/tasks.


• Data Locality – MapReduce tries to place the data and the compute as close as
possible. First, it tries to put the compute on the same node where data resides, if that
cannot be done (due to reasons like compute on that node is down, compute on that
node is performing some other computation, etc.), then it tries to put the compute on the node nearest to
the respective data node(s) which contains the data to be processed. This feature of MapReduce is
“Data Locality”.
The following diagram shows the logical flow of a MapReduce programming model.

MapReduce Work Flow

The stages depicted above are

• Input: This is the input data / file to be processed.


• Split: Hadoop splits the incoming data into smaller pieces called “splits”.
• Map: In this step, MapReduce processes each split according to the logic defined in
map() function. Each mapper works on each split at a time. Each mapper is treated
as a task and multiple tasks are executed across different TaskTrackers and
coordinated by the JobTracker.
• Combine: This is an optional step and is used to improve the performance by
reducing the amount of data transferred across the network. Combiner is the same as
the reduce step and is used for aggregating the output of the map() function before it
is passed to the subsequent steps.
• Shuffle & Sort: In this step, outputs from all the mappers is shuffled, sorted to put
them in order, and grouped before sending them to the next step.
• Reduce: This step is used to aggregate the outputs of mappers using the reduce()
function. Output of reducer is sent to the next and final step. Each reducer is treated
as a task and multiple tasks are executed across different TaskTrackers and
coordinated by the JobTracker.
• Output: Finally the output of reduce step is written to a file in HDFS.
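
To make the Map and Reduce stages above concrete, the following is a minimal word-count sketch written against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce package). The class names WordCountMapper and WordCountReducer are illustrative, not part of Hadoop itself.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word found in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);              // intermediate (key, value) pair
        }
    }
}

// Reducer: sums the counts for each word after the shuffle & sort step.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final (word, total) pair
    }
}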


Game Example

Say you are processing a large amount of data and trying to find out what percentage of your user
base where talking about games. First, we will identify the keywords which we are going to map
from the data to conclude that it’s something related to games. Next, we will write a mapping function
to identify such patterns in our data. For example, the keywords can be Gold medals, Bronze medals,
Silver medals, Olympic football, basketball, cricket, etc.

Let us take the following chunks in a big data set and see how to process it.

“Hi, how are you” “We love

football”

“He is an awesome football player” “Merry

Christmas”

“Olympics will be held in China” “Records broken

today in Olympics” “Yes, we won 2 Gold medals”

“He qualified for Olympics”

Mapping Phase – So our map phase of our algorithm will be as

1. Declare a function “Map”


2. Loop: For each words equal to “football”
3. Increment counter
4. Return key value “football”=>counter

In the same way, we can define n number of mapping functions for mapping various words:
“Olympics”, “Gold Medals”, “cricket”, etc.

Reducing Phase – The reducing function will accept the input from all these mappers in form of key
value pair and then processing it. So, input to the reduce function will look like the following:


reduce (“football”=>2)
reduce (“Olympics”=>3)

Our algorithm will continue with the following steps

5. Declare a function reduce to accept the values from map function.


6. Where for each key-value pair, add value to counter.
7. Return “games”=> counter.

At the end, we will get the output like “games”=>5.

Now, getting into a big picture we can write n number of mapper functions here. Let us say that you
want to know who all where wishing each other. In this case you will write a mapping function to map
the words like “Wishing”, “Wish”, “Happy”, “Merry” and then will write a corresponding reducer
function.

Here you will need one function for shuffling which will distinguish between the “games” and
“wishing” keys returned by mappers and will send it to the respective reducer function. Similarly you
may need a function for splitting initially to give inputs to the mapper functions in form of chunks. The
following diagram summarizes the flow of Map reduce algorithm:


In the above map reduce flow

• The input data can be divided into n number of chunks depending upon the amount
of data and processing capacity of individual unit.
• Next, it is passed to the mapper functions. Please note that all the chunks are
processed simultaneously at the same time, which embraces the parallel processing
of data.
• After that, shuffling happens which leads to aggregation of similar patterns.
• Finally, reducers combine them all to get a consolidated output as per the logic.
• This algorithm embraces scalability as depending on the size of the input data, we
can keep increasing the number of the parallel processing units.

Unit Tests with MRUnit

Hadoop MapReduce jobs have a unique code architecture that follows a specific template with
specific constructs.
This architecture raises interesting issues when doing test-driven development (TDD) and writing unit
tests.
With MRUnit, you can craft test input, push it through your mapper and/or reducer, and verify its
output, all in a JUnit test.
As with other JUnit tests, this allows you to debug your code using the JUnit test as a driver. A
map/reduce pair can be tested using MRUnit's MapReduceDriver, and a combiner can be tested using
MapReduceDriver as well.
A PipelineMapReduceDriver allows you to test a workflow of map/reduce jobs. Currently, partitioners
do not have a test driver under MRUnit.

MRUnit allows you to do TDD(Test Driven Development) and write lightweight unit tests which
accommodate Hadoop’s specific architecture and constructs.

Example: We’re processing road surface data used to create maps. The input contains both linear
surfaces and intersections. The mapper takes a collection of these mixed surfaces as input, discards
anything that isn’t a linear road surface, i.e., intersections, and then processes each road surface and
writes it out to HDFS. We can keep count and eventually print out how many non-road surfaces are
inputs. For debugging purposes, we can additionally print out how many road surfaces were processed.
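
As an illustration, here is a minimal MRUnit test sketch. It assumes a simple word-count mapper (the illustrative WordCountMapper sketched earlier in these notes) rather than the road-surface mapper described above, purely to show the withInput / withOutput / runTest pattern:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // Wrap the mapper under test in an MRUnit driver.
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void mapperEmitsOneCountPerWord() throws Exception {
        mapDriver.withInput(new LongWritable(1), new Text("we love football"))
                 .withOutput(new Text("we"), new IntWritable(1))
                 .withOutput(new Text("love"), new IntWritable(1))
                 .withOutput(new Text("football"), new IntWritable(1))
                 .runTest();   // fails the JUnit test if the actual output differs
    }
}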

Anatomy of a MapReduce Job Run


You can run a MapReduce job with a single method call: submit () on a Job object (you can also call
waitForCompletion(), which submits the job if it hasn’t been submitted already, then waits for it to
finish). This method call conceals a great deal of processing behind the scenes. This section uncovers
the steps Hadoop takes to run a job.
The whole process is illustrated in Figure 7-1. At the highest level, there are five independent entities:
• The client, which submits the MapReduce job.
• The YARN resource manager, which coordinates the allocation of compute resources
On the cluster.
• The YARN node managers, which launch and monitor the compute containers on
Machines in the cluster.

• The MapReduce application master, which coordinates the tasks running the
MapReduce job. The application master and the MapReduce tasks run in containers
That are scheduled by the resource manager and managed by the node managers.
• The distributed filesystem, which is used for sharing job files between the other entities.


Figure 7-1. How Hadoop runs a MapReduce job
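
The following driver sketch shows the waitForCompletion() call described above. It reuses the illustrative WordCountMapper and WordCountReducer classes sketched earlier in these notes and takes the input and output paths from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");    // the unit of work submitted by the client
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);         // illustrative mapper
        job.setCombinerClass(WordCountReducer.class);      // optional combiner (same logic as the reducer)
        job.setReducerClass(WordCountReducer.class);       // illustrative reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion() submits the job (if not already submitted) and blocks until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}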

Classic MapReduce

A job run in classic MapReduce is illustrated in Figure 6-1. At the highest level, there are four
independent entities:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose

main class is JobTracker.


• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java
applications whose main class is TaskTracker.
• The distributed filesystem, which is used for sharing job files between the other entities.

Job Initialization:
When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue from
where the job scheduler will pick it up and initialize it. Initialization involves creating an object to
represent the job being run.
To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the client
from the shared filesystem. It then creates one map task for each split.

Task Assignment:
Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker.
Heartbeats tell the jobtracker that a tasktracker is alive. As part of the heartbeat, a tasktracker
indicates whether it is ready to run a new task, and if it is, the jobtracker allocates it a task, which
it communicates to the tasktracker using the heartbeat return value.

Task Execution:
Now that the tasktracker has been assigned a task, the next step is for it to run the task. First, it localizes
the job JAR by copying it from the shared filesystem to the tasktracker’s filesystem. It also copies any
files needed from the distributed cache by the application to the local disk. TaskRunner launches a new
Java Virtual Machine to run each task in.

Progress and Status Updates:


MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run. Because
this is a significant length of time, it’s important for the user to get feedback on how the job is
progressing. A job and each of its tasks have a status.When a task is running, it keeps track of its
progress, that is, the proportion of the task completed.

Job Completion:


When the jobtracker receives a notification that the last task for a job is complete (this will be the
special job cleanup task), it changes the status for the job to “successful.”

YARN

YARN (Yet Another Resource Negotiator) takes Hadoop beyond Java-only MapReduce and lets other
applications such as HBase and Spark work on the cluster. Different YARN applications can co-exist
on the same cluster, so MapReduce, HBase, and Spark can all run at the same time, bringing great
benefits for manageability and cluster utilization.

Components Of YARN
o Client: For submitting MapReduce jobs.

o Resource Manager: To manage the use of resources across the cluster


o Node Manager:For launching and monitoring the computer containers on machines
in the cluster.

o Map Reduce Application Master: Checks tasks running the MapReduce job. The
application master and the MapReduce tasks run in containers that are scheduled by
the resource manager, and managed by the node managers.

JobTracker and TaskTracker were used in previous versions of Hadoop and were responsible for
handling resources and tracking progress. However, Hadoop
2.0 has the ResourceManager and NodeManager to overcome the shortfalls of
JobTracker and TaskTracker.

Benefits of YARN

o Scalability: MapReduce 1 hits a scalability bottleneck at around 4,000 nodes and 40,000
tasks, but YARN is designed for 10,000 nodes and 100,000 (1 lakh) tasks.
o Utilization: The NodeManager manages a pool of resources, rather than a fixed
number of designated slots, thus increasing utilization.
o Multitenancy: Different version of MapReduce can run on YARN, which makes the
process of upgrading MapReduce more manageable.

Sort and Shuffle


The sort and shuffle occur on the output of Mapper and before the reducer. When the Mapper task is
complete, the results are sorted by key, partitioned if there are multiple reducers, and then written to
disk. Using the input from each Mapper <k2,v2>, we collect all the values for each unique key k2. This
output from the shuffle phase in the form of <k2, list(v2)> is sent as input to reducer phase.

MapReduce Types
Mapping is the core technique of processing a list of data elements that come in pairs of keys and
values. The map function applies to individual elements defined as key-value pairs of a list and
produces a new list. The general idea of map and reduce function of Hadoop can be illustrated as
follows:
map: (K1, V1) -> list (K2, V2)
reduce: (K2, list(V2)) -> list (K3, V3)
The input parameters of the key and value pair, represented by K1 and V1 respectively, are different
from the output pair type: K2 and V2. The reduce function accepts the same format output by the map,
but the type of output again of the reduce operation is different: K3 and V3. The Java API for this is as
follows:

public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable
{
void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
throws IOException;
}

public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable
{
void reduce(K2 key, Iterator<V2> values,
OutputCollector<K3, V3> output, Reporter reporter)
throws IOException;
}

The OutputCollector is the generalized interface of the Map-Reduce framework to facilitate collection
of data output either by the Mapper or the Reducer. These outputs are nothing but

intermediate output of the job. Therefore, they must be parameterized with their types. The Reporter
facilitates the Map-Reduce application in reporting progress and updating counters and status information. If,
however, the combine function is used, it has
the same form as the reduce function and its output is fed to the reduce function. This may be
illustrated as follows:

map: (K1, V1) -> list (K2, V2)


combine: (K2, list(V2)) -> list (K2, V2)
reduce: (K2, list(V2)) -> list (K3, V3)
Note that the combine and reduce functions use the same type, except in the variable names where K3
is K2 and V3 is V2.
The partition function operates on the intermediate key-value types. It controls the partitioning of the
keys of the intermediate map outputs. The key derives the partition using a typical hash function. The
total number of partitions is the same as the number of reduce tasks for the job. The partition is
determined only by the key ignoring the value.

public interface Partitioner<K2, V2> extends JobConfigurable
{
int getPartition(K2 key, V2 value, int numberOfPartitions);
}
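
As an illustration only, the sketch below implements this old-API Partitioner interface; it mirrors what Hadoop's default HashPartitioner does by partitioning on the hash of the key.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sends each record to a reduce partition chosen from the key's hash code.
public class KeyHashPartitioner<K2, V2> implements Partitioner<K2, V2> {

    @Override
    public void configure(JobConf job) {
        // No configuration is needed for this simple example (required by JobConfigurable).
    }

    @Override
    public int getPartition(K2 key, V2 value, int numberOfPartitions) {
        // Mask the sign bit so the partition number is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numberOfPartitions;
    }
}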


UNIT IV

BASICS OF HADOOP 6

Data format – analyzing data with Hadoop – scaling out – Hadoop streaming – Hadoop
pipes – design of Hadoop distributed file system (HDFS) – HDFS concepts – Java
interface – data flow – Hadoop I/O – data integrity – compression – serialization – Avro
– file-based data structures - Cassandra – Hadoop integration.

4.1 BASICS OF HADOOP - DATA FORMAT

Hadoop is an open-source framework for processing, storing, and analyzing large


volumes of data in a distributed computing environment. It provides a reliable, scalable,
and distributed computing system for big data.

Key Components:

• Hadoop Distributed File System (HDFS): HDFS is the storage system of


Hadoop, designed to store very large files across multiple machines.
• MapReduce: MapReduce is a programming model for processing and
generating large datasets that can be parallelized across a distributed cluster of
computers.
• YARN (Yet Another Resource Negotiator): YARN is the resource management
layer of Hadoop, responsible for managing and monitoring resources in a cluster.

Advantages of Hadoop:

• Scalability: Hadoop can handle and process vast amounts of data by distributing
it across a cluster of machines.
• Fault Tolerance: Hadoop is fault-tolerant, meaning it can recover from failures,
ensuring that data processing is not disrupted.
• Cost-Effective: It allows businesses to store and process large datasets cost-
effectively, as it can run on commodity hardware.

Here are some basics:

1. Data Storage in Hadoop:


o Hadoop uses the Hadoop Distributed File System (HDFS) to store data
across multiple machines in a distributed fashion. Data is divided into


blocks (typically 128 MB or 256 MB in size), and each block is replicated
across several nodes in the cluster for fault tolerance.
2. Data Formats:
o Hadoop can work with various data formats, but some common ones include:
▪ Text: Data is stored in plain text files, such as CSV or TSV.
▪ SequenceFile: A binary file format optimized for Hadoop, suitable
for storing key-value pairs.
▪ Avro: A data serialization system that supports schema evolution.
It's often used for complex data structures.
▪ Parquet: A columnar storage format that is highly optimized for
analytics workloads. It's efficient for both reading and writing.
3. Data Ingestion:
o Before analyzing data, you need to ingest it into Hadoop. You can use
tools like Apache Flume, Apache Sqoop, or simply copy data into HDFS
using Hadoop commands.
4. Data Processing:
o Hadoop primarily processes data using a batch processing model. It uses a
programming model called MapReduce to distribute the processing tasks
across the cluster. You write MapReduce jobs to specify how data should
be processed.
o In addition to MapReduce, Hadoop ecosystem also includes higher-level
processing frameworks like Apache Spark, Apache Hive, and Apache Pig,
which provide more user-friendly abstractions for data analysis.
5. Data Analysis:
o Once data is processed, you can analyze it to gain insights. This may
involve running SQL-like queries (with Hive), machine learning algorithms
(with Mahout or Spark MLlib), or custom data processing logic.
6. Data Output:
o After analysis, you can store the results back into Hadoop, or you can
export them to other systems for reporting or further analysis.
7. Data Compression:
o Hadoop allows data compression to reduce storage requirements and
improve processing speed. Common compression formats include Gzip,
Snappy, and LZO.
8. Data Schema:
o When working with structured data, it's important to define a schema.
Some formats like Avro and Parquet have built-in schema support. In
other cases, you may need to maintain the schema separately.
9. Data Partitioning and Shuffling:
o During data processing, Hadoop can partition data into smaller chunks and
shuffle it across nodes to optimize the processing pipeline.


10. Data Security and Access Control:
o Hadoop provides security mechanisms to control access to data and
cluster resources. This includes authentication, authorization, and
encryption.
4.1.1. Step-by-step installation of Hadoop on a single Ubuntu machine.

Installing Hadoop on a single-node cluster is a common way to set up Hadoop for


learning and development purposes. In this guide, I'll walk you through the step-by-step
installation of Hadoop on a single Ubuntu machine.

Prerequisites:

• A clean installation of Ubuntu.


• Java installed on your system.
Let's proceed with the installation:
Step 1: Download Hadoop

1. Visit the Apache Hadoop website (https://hadoop.apache.org) and choose the


Hadoop version you want to install. Replace X.Y.Z with the version number you
choose.
2. Download the Hadoop distribution using wget or your web browser. For example:

bash
wget https://archive.apache.org/dist/hadoop/common/hadoop-X.Y.Z/hadoop-
X.Y.Z.tar.gz

Step 2: Extract Hadoop 3. Extract the downloaded Hadoop tarball to your desired
directory (e.g., /usr/local/):

bash
sudo tar -xzvf hadoop-X.Y.Z.tar.gz -C /usr/local/

Step 3: Configure Environment Variables 4. Edit your ~/.bashrc file to set up


environment variables. Replace X.Y.Z with your Hadoop version:

bash
export HADOOP_HOME=/usr/local/hadoop-X.Y.Z
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Apply these changes to your current shell:


bash
source ~/.bashrc

Step 4: Edit Hadoop Configuration Files 5. Navigate to the Hadoop configuration directory:

bash
cd $HADOOP_HOME/etc/hadoop

6. Edit the hadoop-env.sh file to specify the Java home directory. Add the following
line to the file, pointing to your Java installation:

bash
export JAVA_HOME=/usr/lib/jvm/default-java

7. Configure Hadoop's core-site.xml by editing it and adding the following XML


snippet. This sets the Hadoop Distributed File System (HDFS) data directory:

xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
8. Configure Hadoop's hdfs-site.xml by editing it and adding the following XML
snippet. This sets the HDFS data and metadata directories:

xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/hadoop-X.Y.Z/data/namenode</value>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/local/hadoop-X.Y.Z/data/datanode</value>
</property>

Step 5: Format the HDFS Filesystem 9. Before starting Hadoop services, you need to
format the HDFS filesystem. Run the following command:

bash
hdfs namenode -format


Step 6: Start Hadoop Services 10. Start the Hadoop services using the following command:

bash
start-all.sh

Step 7: Verify Hadoop Installation 11. Check the running Hadoop processes using the jps
command:

bash
jps

You should see a list of Java processes running, including NameNode, DataNode,
ResourceManager, and NodeManager.

Step 8: Access Hadoop Web UI 12. Open a web browser and access the Hadoop Web UI at
http://localhost:50070/ (for HDFS) and http://localhost:8088/ (for YARN ResourceManager).

You have successfully installed Hadoop on a single-node cluster. You can now use it for
learning and experimenting with Hadoop and MapReduce.

4.2 DATA FORMAT - ANALYZING DATA WITH HADOOP

Data Formats in Hadoop:

• Text Files: Simple plain text files, where each line represents a record.
• Sequence Files: Binary files containing serialized key/value pairs.
• Avro: A data serialization system that provides rich data structures in a compact
binary format.
• Parquet: A columnar storage file format optimized for use with Hadoop.
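
As a small illustration of one of these formats, the sketch below writes a few key-value records to a SequenceFile using Hadoop's Java API; the path /user/demo/pages.seq and the sample records are only examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();              // picks up core-site.xml settings
        Path path = new Path("/user/demo/pages.seq");          // example output path in HDFS
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("home"), new IntWritable(120));   // one key-value record
            writer.append(new Text("cart"), new IntWritable(45));
        }
    }
}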

Analyzing Data with Hadoop:

• MapReduce Programming Model: Data analysis tasks in Hadoop are


accomplished using the MapReduce programming model, where data is
processed in two stages: the Map stage processes and sorts the data, and
the Reduce stage performs summary
operations.
• Hive: Hive is a data warehousing and SQL-like query language for Hadoop. It
allows users to query and manage large datasets stored in Hadoop HDFS.
• Pig: Pig is a high-level platform and scripting language built on top of Hadoop,
used for creating MapReduce programs for data analysis.


Analyzing data with Hadoop involves understanding the data format and structure, as
well as using appropriate tools and techniques for processing and deriving insights from
the data. Here are some key considerations when it comes to data format and analysis
with Hadoop:

1. Data Format:

• Structured Data: If your data is structured, meaning it follows a fixed schema,


you can use formats like Avro, Parquet, or ORC. These columnar storage formats
are efficient for large-scale data analysis and support schema evolution.
• Semi-Structured Data: Data in JSON or XML format falls into this category.
Hadoop can handle semi-structured data, and tools like Hive and Pig can help
you query and process it effectively.
• Unstructured Data: Text data, log files, and other unstructured data can be
processed using Hadoop as well. However, processing unstructured data often
requires more complex parsing and natural language processing (NLP)
techniques.

2. Data Ingestion:

• Before you can analyze data with Hadoop, you need to ingest it into the Hadoop
Distributed File System (HDFS) or another storage system compatible with
Hadoop. Tools like Apache Flume or Apache Sqoop can help with data ingestion.

3. Data Processing:

• Hadoop primarily uses the MapReduce framework for batch data processing. You
write MapReduce jobs to specify how data should be processed. However, there
are also high- level processing frameworks like Apache Spark and Apache Flink
that provide more user-
friendly abstractions and real-time processing capabilities.

4. Data Analysis:
• For SQL-like querying of structured data, you can use Apache Hive, which
provides a SQL interface to Hadoop. Hive queries get translated into MapReduce
or Tez jobs.
• Apache Pig is a scripting language specifically designed for data processing in
Hadoop. It's useful for ETL (Extract, Transform, Load) tasks.
• For advanced analytics and machine learning, you can use Apache Spark, which
provides MLlib for machine learning tasks, and GraphX for graph processing.

5. Data Storage and Compression:


• Hadoop provides various storage formats optimized for analytics (e.g., Parquet,
ORC) and supports data compression to reduce storage requirements and
improve processing speed.

6. Data Partitioning and Shuffling:

• Hadoop can automatically partition data into smaller chunks and shuffle it across
nodes to optimize the processing pipeline.

7. Data Security and Access Control:

• Hadoop offers mechanisms for securing data and controlling access through
authentication, authorization, and encryption.

8. Data Visualization:

• To make sense of the analyzed data, you can use data visualization tools like
Apache Zeppelin or integrate Hadoop with business intelligence tools like
Tableau or Power BI.

9. Performance Tuning:

• Hadoop cluster performance can be optimized through configuration settings


and resource allocation. Understanding how to fine-tune these parameters is
essential for efficient data analysis.

10. Monitoring and Maintenance:

• Regularly monitor the health and performance of your Hadoop cluster using
tools like Ambari or Cloudera Manager. Perform routine maintenance tasks to
ensure smooth operation.

Analyzing data with Hadoop involves a combination of selecting the right data format,
processing tools, and techniques to derive meaningful insights from your data.
Depending on your specific use case, you may need to choose different formats and
tools to suit your needs.

4.3 SCALING OUT


Scaling out is a fundamental concept in distributed


computing and is one of the key benefits of using
Hadoop for big data analysis. Here are some important
points related to scaling out in Hadoop:

• Horizontal Scalability: Hadoop is designed for


horizontal scalability, which means that you can
expand the cluster by adding more commodity
hardware machines to it. This allows you to
accommodate larger datasets and perform more
extensive data processing.

• Data Distribution: Hadoop's HDFS distributes data


across multiple nodes in the cluster. When you
scale out by adding more nodes, data is
automatically distributed across these new
machines. This distributed data storage ensures
fault tolerance and high availability.

• Processing Power: Scaling out also means


increasing the processing power of the cluster. You
can run more MapReduce tasks and analyze data in
parallel across multiple nodes, which can
significantly speed up data processing.

• Elasticity: Hadoop clusters can be designed to be


elastic, meaning you can dynamically add or
remove nodes based on workload requirements.
This is particularly useful in cloud-based Hadoop
deployments where you pay for resources based
on actual usage.


• Balancing Resources: When scaling out, it's


important to consider resource management and
cluster balancing. Tools like Hadoop YARN (Yet
Another Resource Negotiator) help allocate and
manage cluster resources efficiently.

Scaling Hadoop:

• Horizontal Scaling: Hadoop clusters can scale


horizontally by adding more machines to the
existing cluster. This approach improves processing
power and storage capacity.
• Vertical Scaling: Vertical scaling involves adding
more resources (CPU, RAM) to existing nodes in
the cluster. However, there are limits to vertical
scaling, and horizontal scaling is preferred for
handling larger workloads.
• Cluster Management Tools: Tools like Apache
Ambari and Cloudera Manager help in managing
and scaling Hadoop clusters efficiently.
• Data Partitioning: Proper data partitioning
strategies ensure that data is distributed evenly
across the cluster, enabling efficient processing.


4.4. HADOOP STREAMING

What is Hadoop Streaming? Hadoop Streaming is a


utility that comes with Hadoop distribution. It allows you
to create and run MapReduce jobs with any executable or
script as the mapper and/or the reducer. This means you
can use any programming language that can read from
standard input and write to standard output for your
MapReduce tasks.

How Hadoop Streaming Works:


1. Input: Hadoop Streaming reads input from HDFS


or any other file system and provides it to the
mapper as lines of text.

2. Mapper: You can use any script or executable as a


mapper. Hadoop Streaming feeds the input lines to
the mapper's standard input.

3. Shuffling and Sorting: The output from the


mapper is sorted and partitioned by the Hadoop
framework.
4. Reducer: Similarly, you can use any script or
executable as a reducer. The reducer reads sorted
input lines from its standard input and produces
output, which is written to HDFS or any other file
system.

5. Output: The final output is stored in HDFS or the specified


output directory.

Advantages of Hadoop Streaming:

• Language Flexibility: It allows developers to use


languages like Python, Perl, Ruby, etc., for writing
MapReduce programs, extending Hadoop's
usability beyond Java developers.

• Rapid Prototyping: Developers can quickly


prototype and test algorithms without the need to
compile and package Java code.

4.5HADOOP PIPES

What is Hadoop Pipes? Hadoop Pipes is a C++ API to


implement Hadoop MapReduce applications. It enables
the use of C++ to write MapReduce programs, allowing


developers proficient in C++ to leverage Hadoop's
capabilities.

How Hadoop Pipes Works:

1. Mapper and Reducer: Developers write the mapper and


reducer functions in C++.

2. Input: Hadoop Pipes reads input from HDFS or


other file systems and provides it to the mapper as
key-value pairs.

3. Map and Reduce Operations: The developer


specifies the map and reduce functions, defining
the logic for processing the input key-value pairs.

4. Output: The final output is written to HDFS or another file


system.

Advantages of Hadoop Pipes:

• Performance: Programs written in C++ can


sometimes be more performant due to the lower-
level memory management and execution speed of
C++.

• C++ Libraries: Developers can leverage existing


C++ libraries and codebases, making it easier to
integrate with other systems and tools.

Both Hadoop Streaming and Hadoop Pipes provide


flexibility in terms of programming languages, enabling a
broader range of developers to work with Hadoop and
leverage its powerful data processing capabilities.

4.6 DESIGN OF HADOOP DISTRIBUTED FILE SYSTEM (HDFS):


Architecture: Hadoop Distributed File System (HDFS) is
designed to store very large files across multiple machines
in a reliable and fault-tolerant manner. Its architecture
consists of the following components:

1. NameNode: The NameNode is the master server


that manages the namespace and regulates access
to files by clients. It stores metadata about the files
and directories, such as the file structure tree and
the mapping of file blocks to DataNodes.

2. DataNode: DataNodes are responsible for storing


the actual data. They store data in the form of
blocks and send periodic heartbeats and block
reports to the NameNode to confirm that they are
functioning correctly.

3. Block: HDFS divides large files into fixed-size


blocks (typically 128 MB or 256 MB). These blocks
are stored across the DataNodes in the cluster.

4. Secondary NameNode (Deprecated Term): The


Secondary NameNode is not a backup or failover
NameNode. Its primary function is to periodically
merge the namespace image and edit log files,
reducing the load on the primary NameNode.

Replication: HDFS replicates each block multiple times


(usually three) and places these replicas on different
DataNodes across the cluster. Replication ensures fault
tolerance. If a DataNode or block becomes unavailable,
the system can continue to function using the remaining
replicas.

4.7 HDFS CONCEPTS:

1. Block: Blocks are the fundamental units of data storage in HDFS. Each file is broken down into blocks, and these blocks are distributed across the cluster's DataNodes.

2. Namespace: HDFS organizes files and directories in a hierarchical namespace. The NameNode manages this namespace and regulates access to files by clients.

3. Replication: As mentioned earlier, HDFS replicates blocks for fault tolerance. The default replication factor is 3, but it can be configured based on the cluster's requirements.

4. Fault Tolerance: HDFS achieves fault tolerance by replicating data blocks across multiple nodes. If a DataNode or block becomes unavailable due to hardware failure or other issues, the system can continue to operate using the replicated blocks.

5. High Throughput: HDFS is optimized for high data throughput, making it suitable for applications with large datasets. It achieves this through the parallelism of writing and reading data across multiple nodes.

6. Scalability: HDFS is designed to scale horizontally by adding more nodes to the cluster. This scalability allows Hadoop clusters to handle large and growing amounts of data.

7. Data Integrity: HDFS ensures data integrity by storing checksums of data with each block. These checksums are verified by clients and DataNodes to ensure that data is not corrupted during storage or transmission.
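The block and replication concepts can also be observed from a client program. The sketch below is illustrative (the path is hypothetical and the file must already exist): it asks HDFS for the block locations of a file, and each BlockLocation lists the DataNodes holding a replica of that block.

java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/example/notes.txt")); // hypothetical file

        // One BlockLocation per block; each lists the hosts storing a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}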

4.8 JAVA INTERFACE IN HADOOP:

Hadoop provides Java APIs that developers can use to interact with the Hadoop ecosystem. The Java interface in Hadoop includes various classes and interfaces that allow developers to create MapReduce jobs, configure Hadoop clusters, and manipulate data stored in HDFS. Here's a brief overview of key components in the Java interface:

1. org.apache.hadoop.mapreduce Package:
o Mapper: Base class whose map() method is overridden to implement the mapper task of a MapReduce job.
o Reducer: Base class whose reduce() method is overridden to implement the reducer task of a MapReduce job.
o Job: Represents a MapReduce job's configuration and submission.
o InputFormat: Specifies the input format of the job.
o OutputFormat: Specifies the output format of the job.
2. org.apache.hadoop.conf Package:
o Configuration: Represents Hadoop configuration properties.
3. org.apache.hadoop.fs Package:
o FileSystem: Abstract class representing a file system in Hadoop (HDFS, local file system, etc.).
o Path: Represents a file or directory path in Hadoop.
4. org.apache.hadoop.io Package:
o Writable: Interface for custom Hadoop data types.
o WritableComparable: Interface for custom data types that are both comparable and writable.

Developers use these classes and interfaces to create custom MapReduce jobs, configure input and output formats, and interact with HDFS. They extend the Mapper and Reducer base classes to define their own map and reduce logic for processing data, as in the sketch below.
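A minimal word-count style sketch using these classes is shown below. It is illustrative only: the class names are invented here, and the input/output paths are supplied as command-line arguments.

java

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same Reducer class is reused as a combiner, which is safe here because summing counts is associative and commutative.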

4.9 DATA FLOW IN HADOOP:

Data flow in Hadoop refers to the movement of data between different stages of a MapReduce job or between Hadoop components. Here's how data flows in a typical Hadoop MapReduce job:

1. Input Phase:
o Input data is read from one or more
sources, such as HDFS files, HBase tables, or
other data storage systems.
o Input data is divided into input splits, which
are processed by individual mapper tasks.
2. Map Phase:
o Mapper tasks process the input splits and produce
intermediate key-value pairs.
o The intermediate data is partitioned, sorted,
and grouped by key before being sent to
the reducers.
3. Shuffle and Sort Phase:
o Intermediate data from all mappers is shuffled and
sorted based on keys.
o Data with the same key is grouped
together, and each group of data is sent to
a specific reducer.
4. Reduce Phase:
o Reducer tasks receive sorted and grouped
intermediate data.
o Reducers process the data and produce the
final output key-value pairs, which are
typically written to HDFS or another storage
system.
5. Output Phase:
o The final output is stored in HDFS or another output location specified by the user.
o Users can access the output data for further analysis or processing.
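The routing performed in the Shuffle and Sort phase can be customized with a Partitioner. The sketch below is illustrative (the class name and the "region:recordId" key layout are assumptions, not a standard Hadoop convention): it sends all keys sharing the same region prefix to the same reducer.

java

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys of the form "region:recordId" so that all records for the
// same region land on the same reducer (an assumed key layout, for illustration).
public class RegionPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String region = key.toString().split(":", 2)[0];
        // Mask off the sign bit so the result is non-negative, then bucket by reducer count.
        return (region.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered in the driver with job.setPartitionerClass(RegionPartitioner.class); the default is HashPartitioner, which hashes the entire key.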

4.10 HADOOP I/O - DATA INTEGRITY:

Ensuring data integrity is crucial in any distributed storage and processing system like Hadoop. Hadoop provides several mechanisms to maintain data integrity:

1. Replication:
o HDFS stores multiple replicas of each block
across different nodes. If a replica is
corrupted, Hadoop can use one of the other
replicas to recover the lost data.
2. Checksums:
o HDFS uses checksums to validate the
integrity of data blocks. Each block is
associated with a checksum, which is verified
by both the client reading the data and the
DataNode storing the data. If a block's
checksum doesn't match the expected value,
Hadoop knows the data is corrupted and
can request it from another node.
3. Write Pipelining:
o HDFS pipelines the data through several
nodes during the writing process. Each node
in the pipeline verifies the checksums before
passing the data to the next node. If a node
detects corruption, it can request the block
from another replica.
4. Error Detection and Self-healing:
o Hadoop can detect corrupted blocks and automatically replace them with healthy replicas from other nodes, ensuring the integrity of the stored data.
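From the client side, checksum verification happens transparently when a file is read through the FileSystem API. The minimal sketch below (the path is hypothetical) reads a file with verification enabled; the commented-out call shows how verification could be disabled, which is rarely advisable.

java

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Checksums are verified by default; on corruption the client reports a
        // checksum error and HDFS retries the read against another replica.
        // fs.setVerifyChecksum(false);  // would disable verification (not recommended)

        try (FSDataInputStream in = fs.open(new Path("/user/example/notes.txt")); // hypothetical path
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}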

4.11 COMPRESSION AND SERIALIZATION IN HADOOP:

1. Compression:
o Hadoop supports various compression
algorithms like Gzip, Snappy, and LZO.
Compressing data before storing it in HDFS
can significantly reduce storage
requirements and improve the efficiency of
data processing. You can specify the
compression codec when writing data to
HDFS or when configuring MapReduce jobs.

Example of specifying a compression codec in a MapReduce job:

java

conf.set("mapreduce.map.output.compress", "true");
conf.set("mapreduce.map.output.compress.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");

2. Serialization:
o Hadoop uses its own serialization framework, called Writable, to serialize data efficiently. Writable data types are Java objects optimized for Hadoop's data transfer. You can also use Avro or Protocol Buffers for serialization. These serialization formats are more efficient than Java's default serialization mechanism, especially in the context of large-scale data processing.
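To make the Writable idea concrete, here is a minimal sketch of a custom composite key; the class and field names are invented for illustration. Implementing WritableComparable lets the type be serialized by Hadoop and sorted during the shuffle, so it can be used as a MapReduce key.

java

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: (userId, timestamp).
public class UserEventKey implements WritableComparable<UserEventKey> {
    private long userId;
    private long timestamp;

    public UserEventKey() { }  // no-arg constructor required for deserialization

    public UserEventKey(long userId, long timestamp) {
        this.userId = userId;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(userId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        userId = in.readLong();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(UserEventKey other) {
        int cmp = Long.compare(userId, other.userId);
        return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
    }
}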

Example of using Avro for serialization:


java

import java.io.File;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;

// Writing an Avro record to a file (YourAvroRecord is a generated Avro class)
DatumWriter<YourAvroRecord> datumWriter = new SpecificDatumWriter<>(YourAvroRecord.class);
DataFileWriter<YourAvroRecord> dataFileWriter = new DataFileWriter<>(datumWriter);
dataFileWriter.create(yourAvroRecord.getSchema(), new File("output.avro"));
dataFileWriter.append(yourAvroRecord);
dataFileWriter.close();

By utilizing these mechanisms, Hadoop ensures that data integrity is maintained during storage and processing. Additionally, compression and efficient serialization techniques optimize storage and data transfer, contributing to the overall performance of Hadoop applications.

4.12 AVRO - FILE-BASED DATA STRUCTURES:

Apache Avro is a data serialization framework that provides efficient data interchange in Hadoop. It enables the serialization of data structures in a language-independent way, making it ideal for data stored in files. Avro uses JSON for defining data types and protocols, allowing data to be self-describing and supporting complex data structures.

Key Concepts:

1. Schema Definition: Avro uses JSON to define schemas. Schemas define the data structure, including types and their relationships. For example, you can define records, enums, arrays, and more in Avro schemas.

json

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "address", "type": "string"}
  ]
}

2. Serialization: Avro encodes data using the defined
schema, producing compact binary files. Avro data
is self-describing, meaning that the schema is
embedded in the data itself.
3. Deserialization: Avro can deserialize the data back into its original format using the schema information contained within the data.
4. Code Generation: Avro can generate code in various programming languages from a schema. This generated code helps in working with Avro data in a type-safe manner.

Avro is widely used in the Hadoop ecosystem due to its efficiency, schema evolution capabilities, and language independence, making it a popular choice for serializing data in Hadoop applications.
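As a counterpart to the writer snippet in Section 4.11, the illustrative sketch below reads the same output.avro file back with Avro's generic API, relying only on the schema embedded in the file (no generated classes are required):

java

import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroReadExample {
    public static void main(String[] args) throws Exception {
        File file = new File("output.avro");

        // The reader picks up the writer's schema stored in the file header.
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            System.out.println("Schema: " + reader.getSchema());
            for (GenericRecord record : reader) {
                System.out.println(record.get("name") + " / " + record.get("age"));
            }
        }
    }
}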

4.13 CASSANDRA - HADOOP INTEGRATION:

Apache Cassandra is a highly scalable, distributed NoSQL database that can handle large amounts of data across many commodity servers. Integrating Cassandra with Hadoop provides the ability to combine the advantages of a powerful database system with the extensive data processing capabilities of the Hadoop ecosystem.

Integration Strategies:

1. Cassandra Hadoop Connector: Cassandra


provides a Hadoop integration tool called the
Cassandra Hadoop Connector. It allows
MapReduce jobs to read and write data to and
from Cassandra.
2. Cassandra as a Source or Sink: Cassandra can act
as a data source or sink for Apache Hadoop and
Apache Spark jobs. You can configure Hadoop or
Spark to read data from Cassandra tables or write
results back to Cassandra.
3. Cassandra Input/Output Formats: Cassandra supports Hadoop Input/Output formats, allowing MapReduce jobs to directly read from and write to Cassandra tables.
Benefits of Integration:

• Data Processing: You can perform complex data processing tasks on data stored in Cassandra using Hadoop's distributed processing capabilities.
• Data Aggregation: Aggregate data from multiple Cassandra nodes using Hadoop's parallel processing, enabling large-scale data analysis.
• Data Export and Import: Use Hadoop to export data from Cassandra for backup or analytical purposes. Similarly, you can import data into Cassandra after processing it using Hadoop.

Integrating Cassandra and Hadoop allows businesses to leverage the best of both worlds: Cassandra's real-time, high-performance database capabilities and Hadoop's extensive data processing and analytics features. This integration enables robust, large-scale data applications for a variety of use cases.
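As a rough illustration of strategy 3, the fragment below configures a MapReduce job to read a Cassandra table through the classic Cassandra Hadoop connector. It is a sketch under strong assumptions: the ConfigHelper and CqlInputFormat classes and their method names come from older Cassandra releases and vary (or are absent) across versions, and the keyspace, table, and address values are placeholders.

java

import org.apache.cassandra.hadoop.ConfigHelper;        // availability varies by Cassandra version
import org.apache.cassandra.hadoop.cql3.CqlInputFormat; // availability varies by Cassandra version
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Illustrative driver fragment only: reading a Cassandra table as MapReduce input.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "cassandra-input-example");

job.setInputFormatClass(CqlInputFormat.class);
ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");             // placeholder contact point
ConfigHelper.setInputColumnFamily(job.getConfiguration(), "my_keyspace", "my_table"); // placeholder names
ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");

// The mapper would then receive rows of my_table as its input records.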
