Digital Notes IDBA Final Original
Digital Notes IDBA Final Original
ON
INTRODUCTION TO BIG DATA ANALYTICS
Prepared by
DEPARTMENT OF CSE-AI&ML
MALLAREDDY ENGINEERING COLLEGE FOR WOMEN
(Autonomous Institution-UGC, Govt. of India)
Accredited by NBA & NAAC with ‘A’ Grade, UGC, Govt. of IndiaNIRF
Indian Ranking–2018, Accepted by MHRD, Govt. of India
Permanently Affiliated to JNTUH, Approved by AICTE, ISO 9001:2015 Certified Institution
AAAA+ Rated by Digital Learning Magazine, AAA+ Rated by Careers 360 Magazine, 6thRank CSR
Platinum Rated by AICTE-CII Survey, Top 100 Rank band by ARIIA, MHRD, Govt. of India
National Ranking-Top 100 Rank band by Outlook, National Ranking-Top 100 Rank band by Times News
Magazine
Maisammaguda, Dhulapally, Secunderabad, Kompally-500100
2023 – 2024
Course Name: Introduction to Big Data Analytics
Course Code: 2012PE05
Course Objectives:
● To understand about big data
● To learn the analytics of Big Data
● To Understand the Map, Reduce fundamentals
Course Outcomes:
● Preparing for data summarization, query, and analysis.
● Applying data modeling techniques to large data sets
● Creating applications for Big Data analytics
● Building a complete business data analytic solution
UNIT-I
INTRODUCTION TO BIG DATA
Introduction to Big data: Overview, Characteristics of Data, Evolution of Big Data, Definition of
Big Data, Challenges with Big Data.
Big data analytics: Classification of Analytics, Importance and challenges of big data,
Terminologies, Data storage and analysis.
UNIT-II
HADOOP TECHNOLOGY
Introduction to Hadoop: A brief history of Hadoop, Convolution approach versus Hadoop,
Introduction to Hadoop Ecosystem, Processing data with Hadoop, Hadoop distributors, Use case,
Challenge in Hadoop.
UNIT-III
HADOOP FILE SYSTEM
Introduction to Hadoop distributed file system (HDFS): Overview, Design of HDFS, Concepts,
Basic File systems vs. Hadoop File systems, Local File System, File-Based Data Structures,
Sequential File, Map File.
The Java Interface: Library Classes, inserting data from a Hadoop URL,inserting data using the
file system API, Storing Data.
UNIT-IV
FUNDAMENTALS OF MAP REDUCE
Introduction to Map reduce: Its framework, Features of Map reduce, Its working, Analyze Map
reduce functions, Map reduce techniques to optimize job, Uses, Controlling input formats in map
reduce execution, Different phases in map reduce, Applications.
UNIT-V
BIG DATA PLATFORMS
Sqoop, Cassandra, Mongo DB, Hive, PIG, Storm, Flink, Apache.
TEXT BOOK:
1. Seema Acharya, SubhashiniChellappan, “Big Data and Analytics”, WileyPublications, First
Edition,2015
REFERENCE BOOKS:
1. Judith Huruwitz, Alan Nugent, Fern Halper, Marcia Kaufman, “Big datafor
dummies”, John Wiley & Sons, Inc.(2013)
2. Tom White, “Hadoop The Definitive Guide”, O’Reilly Publications, FourthEdition,
2015
3. Dirk Deroos, Paul C.Zikopoulos, Roman B.Melnky, Bruce Brown, RafaelCoss,
“Hadoop For Dummies”, Wiley Publications,2014
4. Robert D.Schneider, “Hadoop For Dummies”, John Wiley & Sons,Inc.(2012)
5. Paul Zikopoulos, “Understanding Big Data: Analytics for Enterprise ClassHadoop and
Streaming Data, McGraw Hill, 2012 Chuck Lam, “Hadoop In Action”,Dreamtech
Publications,201
INDEX
Data Information
Meaning: Data is raw, unorganized facts that When data is processed, organized,
need to be processed. Data can be structured or presented in a given context
something simple and seemingly so as to make it useful, it is called
random and useless until it is Information.
organized.
Example: Each student's test score is one piece The class' average score or the school's
of data average score is the information that can
be concluded from the given data.
Definition: Latin 'datum' meaning "that which is Information is interpreted data.
given". Data was the plural form of
datum singular.
Characteristics of Data:
Let us start with the characteristics of data, As depicted in below figure, data has three key
characteristics:
Composition: The composition of data deals with the structure of data, that is, the sources or at
different granularity, the types, and the nature of data as to whether it is static or real-time streaming
Condition: The condition of data deals with the state of data, that is, "Can one use this data as is
for analysis? or "Does it require cleansing for further enhancement and enrichment:Context: The
context of data deals with Where has this data been generated? Why was this data generated?"How
sensitive is this data?" What are the events associated with this data? and so on.Small data (data as
it existed prior to the big data revolution) is about certainty. It is about fairly known data sources;
it is about no major changes to the composition or context of data.
Most often we have answers to queries like why this data was generated, where and when it was
generated exactly how we would like to use it, what questions will this data be able to answer, and
so on. Big data is about complexity in terms of multiple and unknown datasets, in terms of
exploding volume, in terms of the speed at which the data is being generated and the speed at which
the data is being generated and the speed at which it needs to be processed, and in terms of the
variety of data(internal or external, behavioral or social) that is being generated.
Evolution of Big Data:
1970c and before was the era of mainframes. The data was essentially primitive and structured.
Relational databases evolved in 1980s and 1990s. The era was of data intensive applications. The
World Wide Web WWW) and the Internet of Things (loT) have led to an onslaught of structured,
unstructured, and multimedia data.
Data Generation and Data Utilization Data Driven
Storage
Complex and Structured data,
Unstructured unstructured data,
multimedia data
Complex and Relational databases:
Relational Data-intensive
applications
Primitive and Mainframes: Basic
Structured data storage
Relational
1970s and before (1980s and 1990s) 2000s and beyond
Figure: The evolution
of
Figure: Evolution of Big Data
Definition of Big Data:
If we were to ask you the simple question: "Define Big Data", what would your answer be? Well,
we will give you a few responses that we have heard over time:
1. Anything beyond the human and technical infrastructure needed to support storage,
processıng, and analysis.
2. Today's BIG may be tomorrow's NORMAL.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 2
Introduction to Big Data Analytics Dept.of IT
High-volume
High-velocity
High-variety
Cost-effective,
innovative forms
of information
processing
Enhanced insight
&decision making
Capture
Storage
Curation
Analysis
Transfer
Visualizatio
Privacy
Bits 0 or 1
Bytes 8bits
Kilobytes 1024 bytes
Megabytes 10242 bytes
Gigabytes 10243 bytes
Terabytes 10244 bytes
Petabytes 10245 bytes
Exabytes 10246 bytes
Zettabytes 10247 bytes
Yottabytes 10248 bytes
Data storage
Archives
Media
Sensor data
Sources of Big Data
Business apps
Public web
Social media
Figure: Sources of Big Data
Velocity:
We have moved from the days of batch processing to real time processing.
Batch periodic Near real time Real time processing
Variety:
Variety deals with a Wide range of data types and sources of data. We will study this under three
categories:
Structured data, semi-structured data and unstructured data.
1. Structured data: From traditional transaction processing systems and RDBMS, etc.
2. Semi-structured data: For example Hyper Text Markup Language (HTML), eXensible Markup
Language (XML).
3. Unstructured data: For example unstructured text documents, audios, videos, emails, photos,
PDFs, social media, etc.
Evolution of Big Data, Definition of Big Data, Challenges with Big Data, Traditional Business
Intelligence (BI) versus Big Data.
Big data analytics: Classification of Analytics, Importance and challenges facing big data,
Terminologies Used in Big Data Environments, The Big Data Technology Landscape.
Big Data is becoming one of the most talked about technology trends nowadays. The real challenge
with the big organization is to get maximum out of the data already available and predict what kind
of data to collect in the future. How to take the existing data and make it meaningful that it provides
us accurate insight in the past data is one of the key discussion points in many of the executive
meetings in organizations.
With the explosion of the data the challenge has gone to the next level and now a Big Data is
becoming the reality in many organizations. The goal of every organization and expert is same to
get maximum out of the data, the route and the starting point are different for each organization
and expert. As organizations are evaluating and architecting big data solution they are also learning
the ways and opportunities which are related to Big Data.
There is not a single solution to big data as well there is not a single vendor which can claim to
know all about Big Data. Big Data is too big a concept and there are many players – different
architectures, different vendors and different technology.
Classifications Of Analytics
What is Data Analytics?
In this new digital world, data is being generated in an enormous amount which opens
new paradigms. As we have high computing power as well as a large amount of data we
can make use of this data to help us make data-driven decision making. The main
benefits of data-driven decisions are that they are made up by observing past trends
which have resulted in beneficial results.
In short, we can say that data analytics is the process of manipulating data to extract useful
trends and hidden patterns which can help us derive valuable insights to make business
predictions.
Predictive Analytics
Predictive analytics turn the data into valuable, actionable information. predictive analytics
uses data to determine the probable outcome of an event or a likelihood of a situation
occurring. Predictive analytics holds a variety of statistical techniques from modeling,
machine learning, data mining, and game theory that analyze current and historical facts to
make predictions about a future event. Techniques that are used for predictive analytics are:
• Linear Regression
• Time Series Analysis and Forecasting
• Data Mining
Basic Corner Stones of Predictive Analytics
• Predictive modeling
• Decision Analysis and optimization
• Transaction profiling
Descriptive Analytics
Descriptive analytics looks at data and analyze past event for insight as to how to approach
future events. It looks at past performance and understands the performance by mining
historical data to understand the cause of success or failure in the past. Almost all
management reporting such as sales, marketing, operations, and finance uses this type of
analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify
customers or prospects into groups. Unlike a predictive model that focuses on predicting the
behavior of a single customer, Descriptive analytics identifies many different relationships
between customer and product.
Common examples of Descriptive analytics are company reports that provide
historic reviews like:
• Data Queries
• Reports
• Descriptive Statistics
• Data dashboard
Prescriptive Analytics
Prescriptive Analytics automatically synthesize big data, mathematical science, business
rule, and machine learning to make a prediction and then suggests a decision option to
take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting action
benefits from the predictions and showing the decision maker the implication of each
decision option. Prescriptive Analytics not only anticipates what will happen and when to
happen but also why it will happen. Further, Prescriptive Analytics can suggest decision
options on how to take advantage of a future opportunity or mitigate a future risk and
illustrate the implication of each decision option.
For example, Prescriptive Analytics can benefit healthcare strategic planning by using
analytics to leverage operational and usage data combined with data of external factors
such as economic data, population demography, etc.
Diagnostic Analytics
In this analysis, we generally use historical data over other data to answer any
question or for the solution of any problem. We try to find any dependency and
pattern in the historical data of the particular problem.
For example, companies go for this analysis because it gives a great insight into a
problem, and they also keep detailed information about their disposal otherwise data
collection may turn out individual for every problem and it will be very time- consuming.
Common techniques used for Diagnostic Analytics are:
• Data discovery
• Data mining
• Correlations
Future Scope of Data Analytics
1. Retail: To study sales patterns, consumer behavior, and inventory
management, data analytics can be applied in the retail sector. Data analytics
can be used by retailers to make data-driven decisions regarding what products
to stock, how to price them, and how to best organize their stores.
2. Healthcare: Data analytics can be used to evaluate patient data, spot trends
in patient health, and create individualized treatment regimens. Data
analytics can be used by healthcare companies to enhance patient outcomes
and lower healthcare expenditures.
3. Finance: In the field of finance, data analytics can be used to evaluate
investment data, spot trends in the financial markets, and make wise investment
decisions. Data analytics can be used by financial institutions to lower risk and
boost the performance of investment portfolios.
VOLUME
The exponential growth in the data storage as the data is now more than text data. The data can be
found in the format of videos, music’s and large images on our social media channels. It is very
common to have Terabytes and Petabytes of the storage system for enterprises. As the database
grows the applications and architecture built to support the data needs to be re-evaluatedquite often.
Sometimes the same data is re-evaluated with multiple angles and even though the original data
is the same the new intelligence creates explosion of the data. The big volume indeed represents
Big Data.
VELOCITY
The data growth and social media explosion have changed how we look at the data. There was a
time when we used to believe that data of yesterday is recent. The matter of the fact newspapers
is still following that logic. However, news channels and radios have changed how fast we receive
the news.
Today, people reply on social media to update them with the latest happening. On social media
sometimes a few seconds old messages (a tweet, status updates etc.) is not something interests
users.
They often discard old messages and pay attention to recent updates. The data movement is now
almost real time and the update window has reduced to fractions of the seconds. This high velocity
data represent Big Data.
VARIETY
Data can be stored in multiple format. For example database, excel, csv, access or for the matter of
the fact, it can be stored in a simple text file. Sometimes the data is not even in the traditional
format as we assume, it may be in the form of video, SMS, pdf or something we might have not
thought about it. It is the need of the organization to arrange it and make it meaningful.
It will be easy to do so if we have data in the same format, however it is not the case most of the
time. The real world have data in many different formats and that is the challenge we need to
overcome with the Big Data. This variety of the data represent Big Data.
In December 2012 apache releases Hadoop 1.0.0, more information and installation guide can be
found at Apache Hadoop Documentation. Hadoop is not a single project but includes a numberof
other technologies in it.
2. MAPREDUCE
MapReduce was introduced by google to create large amount of web search indexes.It is basically
a framework to write applications that processes a large amount of structured orunstructured data
over the web. MapReduce takes the query and breaks it into parts to run it on multiple nodes. By
distributed query processing it makes it easy to maintain large amount of databy dividing the data
into several different machines.Hadoop MapReduce is a software frameworkfor easily writing
applications to manage large amount of data sets with a highly fault tolerant manner. More tutorials
and getting started guide can be found at Apache Documentation.
3. HDFS (Hadoop distributed file system)
HDFS is a java based file system that is used to store structured or unstructured data over large
clusters of distributed servers. The data stored in HDFS has no restriction or rule to be applied, the
data can be either fully unstructured of purely structured.In HDFS the work to make data senseful
is done by developer's code only. Hadoop distributed file system provides a highly fault tolerant
atmosphere with a deployment on low cost hardware machines. HDFS is now a part of Apache
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 13
Introduction to Big Data Analytics Dept.of IT
Hadoop project, more information and installation guide can be found at Apache HDFS
documentation.
4. HIVE
Hive was originally developed by Facebook, now it is made open source for some time. Hive works
something like a bridge in between sql and Hadoop, it is basically used to make Sql queries on
Hadoop clusters. Apache Hive is basically a data warehouse that provides ad-hoc queries, data
summarization and analysis of huge data sets stored in Hadoop compatible file systems.
Hive provides a SQL like called HiveQL query-based implementation of huge amount of data
stored in Hadoop clusters. In January 2013 apache releases Hive 0.10.0, more information and
installation guide can be found at Apache Hive Documentation.
5. PIG
Pig was introduced by yahoo and later on it was made fully open source. It also provides a bridge
to query data over Hadoop clusters but unlike hive, it implements a script implementation to make
Hadoop data access able by developers and businesspersons. Apache pig provides a high- level
programming platform for developers to process and analyses Big Data using user defined
functions and programming efforts. In January 2013 Apache released Pig 0.10.1 which is defined
for use with Hadoop 0.10.1 or later releases. More information and installation guide can be found
at Apache Pig Getting Started Documentation.
platforms.
• With the help of big data technologies IT companies are able to process third-
party data fast, which is often hard to understand at once by having inherently high
horsepower and parallelized working of platforms.
Big data challenges include the storing, analyzing the extremely large and fast- growing
data.
Some of the Big Data challenges are:
1. Sharing and Accessing Data:
• Perhaps the most frequent challenge in big data efforts is the
inaccessibility of data sets from external sources.
• Sharing data can cause substantial challenges.
• It includes the need for inter and intra- institutional
legal documents.
•
Accessing data from public repositories leads to multiple
difficulties.
• It is necessary for the data to be available in an accurate, complete and
timely manner because if data in the companies information system is
to be used to make accurate decisions in time then it becomes
necessary for data to be available in this manner.
2. Privacy and Security:
• It is another most important challenge with Big Data. This challenge
includes sensitive, conceptual, technical as well as legal significance.
• Most of the organizations are unable to maintain regular checks due
to large amounts of data generation. However, it should be necessary
to perform security checks and observation in real time because it is
most beneficial.
• There is some information of a person which when combined with
external large data may lead to some facts of a person which may be
secretive and he might not want the owner to know this information
about that person.
• Some of the organization collects information of the people in
order to add value to their business. This is done by making
insights into their lives that they’re unaware of.
3. Analytical Challenges:
• There are some huge analytical challenges in big data which arise
some main challenges questions like how to deal with a problem if
data volume gets too large?
• Or how to find out the important data points?
• Or how to use data to the best advantage?
• These large amount of data on which these type of analysis is to be
done can be structured (organized data), semi-structured (Semi- organized
data) or unstructured (unorganized data). There are two techniques
through which decision making can be done:
• Either incorporate massive data volumes in the analysis.
• Or determine upfront which Big data is relevant.
4. Technical challenges:
• Quality of data:
• When there is a collection of a large amount of data and
storage of this data, it comes at a cost. Big companies,
business leaders and IT leaders always want large data
storage.
• For better results and conclusions, Big data rather than
having irrelevant data, focuses on quality data storage.
• This further arise a question that how it can be ensured that
data is relevant, how much data would be enough for
decision making and whether the stored data is accurate or
not.
• Fault tolerance:
• Fault tolerance is another technical challenge and fault
tolerance computing is extremely hard, involving intricate
algorithms.
• Nowadays some of the new technologies like cloud computing and big
data always intended that whenever the failure occurs the damage
done should be within the acceptable threshold that is the whole task
should not begin from the scratch.
• Scalability:
• Big data projects can grow and evolve rapidly. The
scalability issue of Big Data has lead towards cloud
computing.
• It leads to various challenges like how to run and execute various
jobs so that goal of each workload can be achieved cost-
effectively.
• It also requires dealing with the system failures in an efficient
manner. This leads to a big question again that what kinds of
storage devices are to be used.
TRADITIONAL VS BIG DATA BUSINESS APPROACH
1. Schema less and Column oriented Databases (NoSql)
We are using table and row based relational databases over the years, these databases are just fine with online
transactions and quick updates. When unstructured and large amount of data comes into the picture we needs
some databases without having a hard code schema attachment. There are a number of databases to fit into
this category, these databases can store unstructured, semi structured or even fully structured data.
Apart from other benefits the finest thing with schema less databases is that it makes data migration very
easy. MongoDB is a very popular and widely used NoSQL database these days.NoSQL and schema less
databases are used when the primary concern is to store a huge amount of data and notto maintain
relationship between elements. "NoSQL (not only Sql) is a type of databases that does not primarily rely
upon schema based structure and does not use Sql for data processing."
Figur1e.:Big Data
The traditional approach work on the structured data that has a basic layout and the structure provided.
Topics:
✓ Introducing Hadoop
✓ History of Hadoop
✓ Convolution approach versus Hadoop
✓ Introduction to Hadoop Eco System
✓ Processing Data with Hadoop
✓ Hadoop Distributors
✓ Use Cases
✓ Challenges in Hadoop.
3. Adaptive and Flexible – Hadoop is built keeping in mind that it will handle structured and
unstructured data.
4. Highly Available and Fault Tolerant – When a node fails, the Hadoop framework automatically
fails over to another node.
CONVOLUTION APPROACH VERSUS HADOOP
There are two major components of the Hadoop framework and both of them does two of the important
keys and values for MapReduce are not an intrinsic property of the data, but they are chosen by the
person analyzing the data.
Relational data is often normalized to retain its integrity and remove redundancy. Normalization
poses problems for MapReduce, since it makes reading a record anon- local operation, and one of the
central assumptions that MapReduce makes is that its possible to perform (high-speed) streaming reads
and writes.
A web server log is a good example of a set of records that is not normalized (for ex- ample, the client
host names are specified in full each time, even though the same client may appear many times), and
this is one reason that log files of all kinds are particularly well-suited to analysis with MapReduce.
MapReduce is a linearly scalable programming model. The programmer writes two functions—a map
function and a reduce function—each of which defines a mapping from one set of key-value pairs to
another. These functions are oblivious to the sizeof the data or the cluster that they are operating on,
so they can be used unchanged for a small data set and for a massive one. More important, if you
double the size of the input data, a job will run twice as slow. But if you also double the size of the
cluster, a job will run as fast as the original one. This is not generally true of SQL queries.
Overtime,however,thedifferencesbetweenrelationaldatabasesandMapReduce systems are likely to
blur—both as relational databases start incorporating some of the ideas from MapReduce (such as
Aster Data’s and Greenplum’s databases) and, from the other direction, as higher-level query
languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable
to traditional database programmers.
A BRIEF HISTORY OF HADOOP
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text
search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a
part of the Lucene project.
Building a web search engine from scratch was an ambitious goal, for not only is the software
required to crawl and index websites complex to write, but it is also a challenge to run without a
dedicated operations team, since there are so many moving parts. It’s expensive, too: Mike Cafarella
and Doug Cutting estimated a system supporting a 1-billion-page index would cost around half a
million dollars in hardware, with a monthly running cost of $30,000. Nevertheless,
they believed it was a worthy goal, as it would open up and ultimately democratize search engine
algorithms.
Nutch was started in2002, and a working crawler and search system quickly emerged. However,
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 23
Introduction to Big Data Analytics Dept.of IT
Google reported that its MapReduce implementation sorted one terabyte in 68 seconds. As the first
edition of this book was going to press (May2009), it was announced that a team at Yahoo! Used
Hadoop to sort one tera byte in 62 seconds.
provides its own set of basic types that are optimized for network serialization. These are found
in the org.apache.hadoop.iopackage.
Here we use LongWritable, which corresponds to a JavaLong, Text (like JavaString), and
IntWritable (like JavaInteger).
The map()method is passed a key and a value. We convert the Text value containing the line of
input into a Java String, then use its substring() method to extract the columns we are interested
in.
The map() method also provides an instance of Context to write the output to. In this case, we write
the year as a Text object (since we are just using it as a key), and the temperature is wrapped in an
IntWritable. We write an output record only if the temperature is present and the quality code
indicates the temperature reading is OK.
Again, four formal type parameters are used to specify the input and out put types, this time for
the reduce function. The input types of the reduce function must match the output types of the
map function: Text and IntWritable. And in this case, the output types of the reduce function are
Text and IntWritable, for a year and its maximum temperature, which we find by iterating
through the temperatures and comparing each with a record of the highest found so far.
Having constructed a Job object, we specify the input and output paths. An input path is specified by
calling the static addInputPath() method on FileInputFormat, and it can be a single file,a
directory(in which case,the input forms all the files in that directory), or a file pattern. As the name
suggests, addInputPath() can be called more than once to use input from multiple paths.
The output path (of which there is only one) is specified by the static setOutputPath()methodon
FileOutputFormat. It specifies a directory where the output files from the reducer functions are
written. The directory shouldn’t exist before running the job, as Hadoop will complain and
not run the job. This precaution is to prevent data loss (it can be very annoying to accidentally
overwrite the output of a long job with another).
Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods.The setOutputKeyClass() and setOutputValueClass() methods
control the output types for the map and the reduce functions,which are of ten the same,as they
are in our case. If they are different, then the map output types can be set using the methods
setMapOutputKeyClass() andsetMapOutputValueClass().The input types are controlled via the
inputformat, which we have not explicitly sets using the default TextInputFormat.
After setting the classes that define the map and reduce functions, we are ready to run the job. The
wait ForCompletion() method on Job submits the job and waits for it to finish. The method’s boolean
argument is a verbose flag, so in this case the job writes information about its progress to the console.
The return value of the waitForCompletion() method is a boolean indicating success (true) or failure
(false), which we translate into the program’s exit code of 0 or 1.
A TEST RUN
After writing a MapReduce job, it’s normal to try it out on a small dataset to flushout any immediate
problems with the code. First install Hadoop in standalone mode— there are instructions for how to
do this in Appendix A. This is the mode in which
Hadooprunsusingthelocalfilesystemwithalocaljobrunner.Then install and compile the examples
using the instructions on the book’swebsite.
When the hadoop command is invoked with a classname as the first argument, it launches a JVM to
run the class. It is more convenient to use hadoop than straight java since the former adds the Hadoop
libraries (and their dependencies) to the class- path and picks up the Hadoop configuration,too. To
add the application classes to the classpath, we’ve defined an environment variable called
HADOOP_CLASSPATH, which the hadoop script picks up.
The last section of the output, titled “Counters,” shows the statistics that Hadoop generates for
each job it runs. These are very useful for checking whether the amount of data processed is
what you expected. For example, we can follow the number of records that went through the
system: five map inputs produced five map outputs, then five reduce inputs in two groups
produced two reduce outputs.
The output was written to the output directory, which contains one output file per reducer. The job
had a single reducer, so we find a single file, named part-r-00000:
% cat output/part-r-00000
1949 111
1950 22
This result is the same as when we went through it by hand earlier. We interpret this as saying that
the maximum temperature recorded in 1949 was 11.1°C, and in 1950 it was 2.2°C.
THE OLD AND THE NEW JAVA MAPREDUCE APIS
The Java Map Reduce API used in the previous section was first released in Hadoop
0.20.0. This newAPI, sometimes referred to as “ContextObjects,”was designed to make the API
easier to evolve in the future. It is type-incompatible with the old, how- ever, so applications need
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 28
Introduction to Big Data Analytics Dept.of IT
the MapReduce system. The new Context, for example, essentially unifies the role of the JobConf,
the Output Collector, and the Reporter from the old API.
In both APIs, key-value record pairs are pushed to the mapper and reducer, butin addition, the new
API allows both mappers and reducers to control the execution flow by overriding the run() method.
For example, records can be processed in batches, or the execution can be terminated before all the
records have been processed. In the old API this is possible form appears by writing aMapRunnable,
but no equivalent exists for reducers.
Configuration has been unified. The old API has a special JobConf object for job configuration,
which is an extension of Hadoop’s vanilla Configuration object (used for configuring daemons. In
the new API, this distinction is dropped, so job configuration is done through a Configuration.
Job control is performed through the Job class in the new API, rather than the old JobClient, which no
longer exists in the new API.
Output files are named slightly differently: in the old API both map and reduce outputs are named
part-nnnnn, while in the new API map outputs are named part- m-nnnnn, and reduce outputs are
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 29
Introduction to Big Data Analytics Dept.of IT
named part-r-nnnnn (where nnnnn is an integer designating the part number, starting from zero).
User-overridable methods in the new API are declared to throw java.lang.InterruptedException.
What this means is that you can write your code to be reponsive to interupts so that the framework
can gracefully cancel long-running operations if it needsto.
In the new APIthereduce()method passes values as a java.lang.Iterable, rather than a
java.lang.Iterator (as the oldAPIdoes). This change make site as over the values using
Java’s for-each loop construct: for (VALUEIN value : values) { ...}
HADOOP ECOSYSTEM
Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from
NDFS), the term is also used for a family of related projects that fall under the umbrella of
infrastructure for distributed computing and large-scale data processing.
All of the core projects covered in this book are hosted by the Apache Software Foundation, which
provides support for a community of open source software projects, including the original HTTP
Server from which it gets its name. As the Hadoop eco- system grows, more projects are appearing,
not necessarily hosted at Apache, which provide complementary services to Hadoop, or build on the
core to add higher-level abstractions.
The Hadoop projects that are covered in this book are described briefly here:
Common
A set of components and interfaces for distributed filesystems and general I/O (serialization, Java
RPC, persistent data structures).
Avro
A serialization system for efficient, cross-language RPC, and persistent data storage.
MapReduce
A distributed data processing model and execution environment that runs on large clusters of
commodity machines.
HDFS
A distributed filesystem that runs on large clusters of commodity machines.
Pig
Adataflowlanguageandexecutionenvironmentforexploringverylargedatasets. Pig runs on HDFS and
MapReduceclusters.
Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a query language
based on SQL (and which is translated by the runtime engine to MapReduce jobs) for querying the
data.
HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports
both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives such as
distributed locks that can be used for building distributed applications.
Sqoop
A tool for efficiently moving data between relational databases and HDFS.
PHYSICALARCHITECTURE
ii. Hadoop clusters are often referred to as "shared nothing" systems because the only thing that is
shared between nodes is the network that connects them.
80ports
➢ There are global 8 core switches
➢ The rack switch has uplinks connected to core switches and hence connecting all other racks with
uniform bandwidth, forming the Cluster In the cluster, you have few machines to act as Namenode
and as JobTracker. They are referred as Masters. These masters have different configuration favoring
more DRAM and CPU and less localstorage.
Hadoop cluster has 3 components:
1. Client
2. Master
3. Slave
The role of each components are shown in the below image.
• Name node keeps track of all the file system related information such as to
• Which section of file is saved in which part of the cluster
• Last access time for the files
• User permissions like which user have access to the file
JobTracker:
JobTracker coordinates the parallel processing of data using MapReduce.
To know more about JobTracker, please read the article All You Want to Know about MapReduce (The
Heart of Hadoop)
Secondary NameNode
Slaves:
Slave nodes are the majority of machines in Hadoop Cluster and are responsible to
• Store the data
• Process the computation
• Each slave runs both a DataNode and Task Tracker daemon which communicates to their masters.
• The Task Tracker daemon is a slave to the Job Tracker and the DataNode daemon a slave to the
NameNode
• Client machine does this step and loads the Sample.txt into cluster.
• It breaks the sample.txt into smaller chunks which are known as "Blocks" in Hadoop context.
• Client put these blocks on different machines(datanodes)throughout the cluster.
2. Next, how does the Client knows that to which data nodes load theblocks?
• Now NameNode comes into picture.
• The NameNode used its Rack Awareness intelligence to decide on which DataNode to provide.
• For each of the data block (in this case Block-A, Block-B and Block-C), Client contacts NameNode
and in response NameNode sends an ordered list of 3 DataNodes.
3. How does the Client knows that to which data nodes load the blocks?
For example in response to Block-A request, Node Name may send DataNode-2, DataNode-3
andDataNode-4.
Block-B DataNodes list DataNode-1, DataNode-3, DataNode-4 and for Block C data node list
DataNode-1, DataNode-2, DataNode-3.Hence
• Block A gets stored in DataNode-2, DataNode-3, DataNode-4
• Block B gets stored in DataNode-1, DataNode-3, DataNode-4
• Block C gets stored in DataNode-1, DataNode-2, DataNode-3
• Every block is replicated to more than 1 data nodes to ensure the data recovery on the time of
machine failures. That's why NameNode send 3 DataNodes list for each individual block
4. Who does the block replication?
• Client write the data block directly to one DataNode.
• DataNodes then replicate the block to other Datanodes.
• When one block gets written in all 3 DataNode then only cycle repeats for next block.
5. Who does the block replication?
• In Hadoop Gen1 there is only one Name Node where in Gen2 there is active passive model in Name
Node where one more node "Passive Node" comes in picture.
• The default setting for Hadoop is to have 3 copies of each block in the cluster. This setting can be
configured with "dfs.replication" parameter of hdfs-site.xml file.
• Keep note that Client directly writes the block to the DataNode without any intervention of
NameNode in this process.
Challenges in Hadoop
• Network File system is the oldest and the most commonly used distributed file system and was
designed for the general class of applications, Hadoop only specific kind of applications can make
use of it.
• It is known that Hadoop has been created to address the limitations of the distributed file system,
where it can store the large amount of data, offers failure protection and provides fast access, but it
should be known that the benefits that come with Hadoop come at some cost.
• Hadoop is designed for applications that require random reads; So if a file has four parts the file
would like to read all the parts one-by-one going from 1 to 4 till the end. Random seek is where you
want to go to a specific location in the file; this is something that isn’t possible with Hadoop. Hence,
Hadoop is designed for non- real-time batch processing of data.
• Hadoop is designed for streaming reads caching of data isn’t provided. Caching of data is provided
which means that when you want to read data another time, it can be read very fast from the cache.
This caching isn’t possible because you get faster access to the data directly by doing the sequential
read; hence caching is n’t available through Hadoop.
• It will write the data and then it will read the data several times. It will not be updating the data that
it has written; hence updating data written to closed files is not available. However,
youhavetoknowthatinupdate0.19appendingwillbe supported for those files that aren’t closed. But for
those files that have been closed, updating is not possible.
• In case of Hadoop we aren’t talking about one computer; in this scenario we usually have a large
number of computers and hardware failures are unavoidable; sometime one computer will fail and
sometimes the entire rack can fail too. Hadoop gives excellent protection against hardware failure;
however the performance will go down proportionate to the number of computers that are down. In
the big picture, it doesn’t really matter and it is not generally noticeable since if you have 100
computers and in them if 3 fail then 97 are still working. So the proportionate loss of performance isn’t
that noticeable. However, the way Hadoop works there is the loss in performance. Now this loss of
performance through hardware failures is something that is managed through replication strategy.
UNIT-III
HADOOP FILE SYSTEM
When a data set out grows the storage capacity of a single physical machine, it becomes necessary to
partition it a cross a number of separate machines. File systems that manage the storage across a
network of machines are called distributed filesystems. Since they are network-based, all the
complications of network programming kick in, thus making distributed filesystems more complex
than regular disk filesystems. For example, one of the biggest challenges is making the filesystem
tolerate node failure with out suffering data loss.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed
Filesystem. (You may some times see references to “DFS”—informally or in older documentation
or configurations—which is the same thing.) HDFS is Hadoop’s flag ship file system and is the focus
of this chapter, but Hadoop actually has a general- purpose file system abstraction, so we’ll see along
the way how Hadoop integrates with other storage systems(such as the local file system and
AmazonS3).
THE DESIGN OF HDFS
HDFS is a filesystem designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware. Let’s examine this statement in more detail:
VERY LARGEFILES
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in
size. There are Hadoop clusters running today that store petabytes of data.
STREAMING DATA ACCESS
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-
many-times pattern. A dataset is typically generated or copied from source, then various analyses are
performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the
dataset, so the time to read the whole dataset is more important than the latency in reading the first
record.
COMMODITY HARDWARE
Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on clusters
of commodity hardware (commonly available hardware available from multiple vendors) for which
the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to
carry on working without a noticeable interruption to the user in the face of such failure.
It is also worth examining the applications for which using HDFS does not work so well. While this
may change in the future, these areas where HDFS is not a good fit today:
LOW-LATENCY DATA ACCESS
Applications that require low-latency access to data, in the tens of milliseconds range, will notwork
well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this
may be at the expense of latency. HBase is currently a better choice for low-latency access.
Since the namenode holds filesystem meta data in memory, the limit to the number of files in a
filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file,
directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking
one block, you would need at least 300MB of memory. While storing millions of files is feasible,
billions is beyond the capability of current hardware.
MULTIPLE WRITERS, ARBITRARY FILE MODIFICATIONS
Files in HDFS may be written to by a single writer.Writes are always made at the end of the file.
There is no support for multiple writers, or for modifications at arbitrary offsets in the file. (These
might be supported in the future, but they are likely to be relatively inefficient.)
HDFS CONCEPTS
BLOCKS
A disk has a blocksize, which is the minimum amount of data that it can read or write. File systems
for a single disk build on this by dealing with data in blocks, which are an integral multiple of the
disk blocksize. Filesystem blocks are typically a few kilobytes in size, while disk blocks are
normally 512 bytes. This is generally transparent to the filesystem user who is simply reading or
writing a file—of whatever length. However, there are tools to perform filesystem maintenance, such
as df and fsck, that operate on the filesystem block level.
HDFS, too, has the concept of a block, but it is a much larger unit—64MB by default. Like in a file
system for a single disk, files in HDFS are broken into block-sized chunks, which are stored as
independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single
block does not occupy a full block’s worth of underlying storage. When unqualified, the term “block”
in this book refers to a block in HDFS.
Having a block abstraction for a distributed filesystem brings several benefits. The first benefit is the most
obvious: a file can be larger than any single disk in the network.
There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take
advantage of any of the disks in the cluster. In fact, it would be possible, if unusual, to store a single file
on an HDFS cluster whose blocks filled all the disks in the cluster.
Second, making the unit of abstraction a block rather than a file simplifies the storage subsystem.
Simplicity is something to strive for all in all systems, but is especially important for a distributed
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 42
Introduction to Big Data Analytics Dept.of IT
system in which the failure modes are so varied. The storage subsystem deals with blocks,
simplifying storage management(since blocks are a fixed size, it is easy to calculate how many can
be stored on a given disk) and eliminating metadata concerns (blocks are just a chunk of data to be
stored—file metadata such as permissions in formation does not need to be stored with the blocks,
so another system can handle meta data separately).
Furthermore, blocks fit well with replication for providing fault tolerance and availability. To insure
against corrupted blocks and disk and machine failure, each block is replicated to a small number of
physically separate machines(typically three). If a block becomes unavailable, a copy can be read
from another location in a way that is trans- parent to the client. A block that is no longer available
due to corruption or machine failure can be replicated from its alternative locations to other live
machines to bring the replication factor back to the normal level. Similarly, some applications may
choose to set a high replication factor for the blocks in a popular file to spread the read load on
the cluster.
Like its disk filesystem cousin, HDFS’s fsck command understands blocks.
For example, running:
% Hadoop fsck / -files -blocks will list the blocks that make up each file in the filesystem.
how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make
the namenode resilient to failure, and Hadoop provides two mechanisms for this.The first way is to
back up the files that make up the persistent state of the filesystem metadata. Hadoop can be
configured so that the namenode writes its persistent state to multiple filesystems. These writes are
synchronous and atomic. The usual configuration choice is to write to local disk aswell as a remote
NFSmount.
It is also possible to run a secondary namenode, which despiteits name does not act as a namenode.
Its main role is to periodically merge the namespace image with the edit log to prevent the edit log
from becoming too large. The secondary namenode usually runs on a separate physical machine,
since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps
a copy of the merged name- space image, which can be used in the event of the namenode failing.
However, the state of the secondary namenode lags that of the primary, so in the event of total failure
of the primary, data loss is almost certain. The usual course of action in this case is to copy the
namenode’s metadata files that are on NFS to the secondary and run it as the new primary.
HDFS FEDERATION
The namenode keeps a reference to every file and block in the filesystem in memory, which means
that on very large clusters with many files, memory becomes the limiting factor for scaling. HDFS
Federation, introduced in the 0.23 release series, allows a cluster to scale by adding namenodes, each
of which manages a portion of the filesystem namespace. For example, one namenode might manage
all the filesrooted under/user, say, and a second namenode might handle files under/share.
Under federation, each namenode manages a namespace volume, which is made up of the metadata
for the name space, and a block pool containing all the blocks for the files in the namespace.
Namespace volumes are independent of each other, which means namenodes do not communicate
with one another, and furthermore the failure of one namenode does not affect the availability of the
namespace managed by other namenodes. Block pool storage is not partitioned, however, so
datanodes register with each namenode in the cluster and store blocks from multiple blockpools.
To access a federated HDFS cluster, clients use client-side mount tables to map file paths to
namenodes. This is managed in configuration using the ViewFileSystem, and viewfs:// URIs.
HDFS HIGH-AVAILABILITY
Thecombinationofreplicatingnamenodemetadataonmultiplefilesystems,and using the secondary
namenode to create checkpoints protects against data loss, but does not provide high- availability of
the filesystem.The namenode is still a single point of failure (SPOF), since if it did fail, all clients—
including MapReduce jobs—would be unable to read, write, or list files, because the namenode is
the sole repository of the metadata and the file-to-block mapping. In such an event the whole
Hadoop system would effectively be out of service until a new namenode could be brought on line.
To recover from a failed namenode in this situation, an administrator starts a new primary namenode
with one of the filesystem metadata replicas, and configures datanodes and clients to use this new
namenode. The new namenode is not able to serve requests until it has i) loaded its namespace image
into memory, ii) replayed its edit log, and iii) received enough block reports from the datanodes to
leave safe mode. On large clusters with many files and blocks, the time it takes for a namenode to
start from cold can be 30 minutes or more.
The long recovery time is a problem for routine maintenance too. In fact, since unexpected failure
of the namenode is so rare, the case for planned downtime is actually more important in practice.
The 0.23 release series of Hadoop remedies this situation by adding support for HDFS high-
availability(HA). In this implementation there is a pair of namenodes in an active- standby
configuration. In the event of the failure of the active namenode, the standby takes over its duties to
continue servicing client requests without a significant interruption. A few architectural changes are
needed to allow this to happen:
The namenodes must use highly-available shared storage to share the edit log. (In the initial
implementation of HA this will require an NFSfiler, but in future releases more options will be
provided, such as a BookKeeper-based system built on Zoo- Keeper.) When a standby namenode
comes up it reads up to the end of the shared edit log to synchronize its state with the active namenode,
and then continues to read new entries as they are written by the active namenode.
Datanodes must send block reports to both namenodes since the block mappings are stored in a
namenode’s memory, and not ondisk. Clients must be configured to handle namenode fail over, which
uses a mechanism that is transparent to users. If the active namenode fails, then the standby can take
over very quickly (in a few tens of seconds) since it has the latest state available in memory: both the
latest edit log entries, and an up-to-date block mapping. The actual observed fail over time will be
longer in practice (around a minute or so),since the system needs to be conservative in deciding that
the active namenode has failed.
In the unlikely even to best and by being down when the active fails, the administrator can still start
the standby from cold. This is no worse than the non-HA case, and from an operational point of
view it’s an improvement, since the process is a standard operational procedure built into Hadoop.
Scalability: These file systems are not inherently designed for massive scalability across multiple
machines. If you need more storage or processing power, you typically upgrade the hardware of the
single machine.
Data Processing: Basic file systems are not optimized for distributed data processing. They rely on
the processing power of a single machine to manage and manipulate data.
Redundancy: Basic file systems might offer features like RAID (Redundant Array of Independent
Disks) for data redundancy, but the primary copy of data remains on a single machine.
Hadoop File System (HDFS):
Definition: HDFS (Hadoop Distributed File System) is the primary storage system of the Hadoop
ecosystem. It's designed to store vast amounts of data across multiple machines in a distributed
manner.
Distributed Storage: HDFS divides large files into blocks (typically 128 MB or 256 MB) and
distributes these blocks across multiple machines in a cluster. This distribution ensures that data is
replicated across different nodes for fault tolerance.
Scalability: HDFS is highly scalable. As your data grows, you can add more machines to the
Hadoop cluster, and HDFS will distribute the data across these new nodes automatically.
Fault Tolerance: HDFS provides fault tolerance by replicating data blocks across multiple nodes. If
a node fails, the data blocks stored on that node are still available on other nodes, ensuring data
availability.
We could also have used a relative path and copied the file to our home directory in HDFS, which in
this case is /user/tom:
% hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt
Let’s copy the file back to the local filesystem and check whether it’s the same:
% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = a16f231da6b05e2ba7a339320e7dacd9 MD5 (quangle.copy.txt) =
a16f231da6b05e2ba7a339320e7dacd9
The MD5 digests are the same, showing that the file survived its trip to HDFS and is back intact.
Finally, let’s look at an HDFS file listing. We create a directory first just to see how it is displayed
in the listing:
% hadoopfs-mkdir books
% hadoopfs -ls .
Found 2 items
drwxr-xr-x-tomsupergroup 0 2009-04-02 22:41/user/tom/books-rw-r--r--
1tomsupergroup2009-04-02 22:29/user/tom/quangle.txt
The information returned is very similar to the Unix command ls -l, with a few min or differences.
The first column shows the filemode.
The second column is the replication factor of the file (something a traditional Unix filesystem does
not have).
The entry in this column is empty for directories in the concept of replication does not apply to
them—directories are treated as metadata and stored by the namenode, not the datanodes. The third
and fourth columns show the file owner and group. The fifth column is the size of the file in bytes,
or zero for directories.
The sixth and seventh columns are the last modified date and time. Finally, the eighth column is the
absolute name of the file ordirectory
HADOOP FILESYSTEMS
Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The Java
abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there are
several concrete implementations
Hadoop provides many interfaces to its filesystems, and it generally uses the URI scheme to pick the
correct filesystem instance to communicate with. For example, the filesystem shell that we met in the
previous section operates with all Hadoop filesys- tems. To list the files in the root directory of the
local filesystem, type:
% hadoopfs -ls file:///
Although it is possible (and sometimes very convenient) to run MapReduce programs that access any
of these filesystems, when you are processing large volumes of data, you should choose a distributed
filesystem that has the data locality optimization, notably HDFS.
1. Sequencefile
2. MapFile
Sequencefile
1. sequencefile files are <key,value>flat files (Flat file) designed by Hadoop to store binary forms
of pairs.
2, can sequencefile as a container, all the files packaged into the Sequencefile class can be
efficiently stored and processed small files .
3. sequencefile files are not sorted by their stored key, Sequencefile's internal class writer**
provides append functionality * *.
4. The key and value in Sequencefile can be any type writable or a custom writable type.
Sequencefile Compression
1. The internal format of the sequencefile depends on whether compression is enabled, or, if it is,
either a record compression or a block compression.
A. No compression type : If compression is not enabled (the default setting), then each record
consists of its record length (number of bytes), the length of the key, the key and the value. The
Length field is four bytes.
B. Record compression type : The record compression format is basically the same as the
uncompressed format, and the difference is that the value byte is compressed with the encoder
defined in the header. Note that the key is not compressed.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 51
Introduction to Big Data Analytics Dept.of IT
C. Block compression type : Block compression compresses multiple records at once , so it is more
compact than record compression and generally preferred . When the number of bytes recorded
reaches the minimum size, it is added to the block. The default value is 1000000 bytes. The format
is record count, key length, key, value length, value.
A sequence file is a flat file format that is used to store key-value pairs. It is a binary file format that
is optimized for storing large amounts of data in a serialized format. Here are some key points
about sequence files in Hadoop:
Purpose: Sequence files are primarily used to store intermediate data during MapReduce jobs in
Hadoop. They are designed to be efficient for both reading and writing large datasets.
Structure: A sequence file consists of a header followed by a series of key-value pairs. Each key-
value pair is stored as a record within the file. Both keys and values can be of any Hadoop-
supported data type, including custom writable types.
Compression: Sequence files support various compression codecs such as Gzip, Snappy, and LZO.
This allows for efficient storage and processing of data by reducing the file size and improving
performance.
Input/Output Formats: Hadoop provides input and output formats for sequence files, which allow
them to be easily integrated into MapReduce jobs. These formats handle the serialization and
deserialization of key-value pairs as needed.
Usage: Sequence files are commonly used for storing intermediate data between Map and Reduce
phases in a MapReduce job. They can also be used for storing data in HDFS (Hadoop Distributed
File System) when a structured binary format is preferred.
Tooling: Hadoop provides various tools and utilities for working with sequence files, such as the
SequenceFile.Reader and SequenceFile.Writer classes in the Hadoop API. These classes allow
developers to read from and write to sequence files programmatically.
C. Simple to modify: The main responsibility is to modify the corresponding business logic,
regardless of the specific storage format.
The downside is the need for a merge file, and the merged file will be inconvenient to view.
because it is a binary file.
read/write Sequencefile
Write Process:
1) Create a configuration
2) Get filesystem
Read process:
1) Create a configuration
2) Get filesystem
MapFile in Hadoop
Definition:
A MapFile in Hadoop is a storage format optimized for key-value pairs. It provides a mechanism to
store sorted key-value pairs in a sequence file, which allows for efficient lookups and data retrieval.
Components:
Index: The mapfile maintains an index that enables faster lookup operations. This index helps in
locating the position of a key within the file, facilitating efficient retrieval.
Usage:
MapFile is particularly useful in scenarios where you need fast key-based lookups, like in certain
Hadoop applications, especially when dealing with large-scale datasets.
API:
Hadoop provides Java APIs to create, read, and write to MapFile formats. This means you can
integrate MapFile operations within your MapReduce jobs or other Hadoop applications using the
Hadoop API.
Benefits:
Efficient Retrieval: Due to its sorted nature and index, MapFile allows for efficient retrieval of
values based on keys.
Scalability: It can handle large-scale datasets, making it suitable for big data applications in the
Hadoop ecosystem.
How to Use:
If you're developing Hadoop applications or MapReduce jobs and need to perform frequent key-
value lookups, you might consider using MapFile. You can create, populate, and query MapFiles
using Hadoop's APIs.
read/write Mapfile
Write Process:
1) Create a configuration
2) Get filesystem
Read process:
1) Create a configuration
2) Get filesystem
Figure 3-1. Accessing HDFS over HTTP directly, and via a bank of HDFS proxies
The second way of accessing HDFS over HTTP relies on one or more standalone proxy servers.(The
proxies are state less so they can run behind a standard load balancer.)All traffic to the cluster passes
through the proxy. This allows for stricter firewall and bandwidth limiting policies tobe put in place.
It’s common to use a proxy for transfers between Hadoop clusters located in different datacenters.
The original HDFS proxy (in src/contrib/hdfsproxy) was read-only, and could be accessed by clients
using the HSFTP FileSystem implementation (hsftp URIs). From re- lease0.23, there is a new proxy
called HttpFS that has read and write capabilities, and which exposes the same HTTP interfaceas
WebHDFS, so clients can access either rusing web hdfs URIs.
The HTTP REST API that Web HDFS exposes is formally defined in a specification, so it is likely
that over time clients in languages other than Java will be written that use it directly.
FUSE
Filesystem in Userspace (FUSE) allows files systems that are implemented in userspace to be
integrated as a Unix filesystem. Hadoop’s Fuse-DFS contrib module allows any Hadoop filesystem
(but typically HDFS) to be mounted as a standard filesystem. You can then use Unix utilities (such
as ls and cat) to interact with the filesystem, as well as POSIX libraries to access the filesystem
from any programming language.
Fuse-DFSisimplementedinCusinglibhdfsastheinterfacetoHDFS.Documentation for compiling and
running Fuse-DFS is located in the src/contrib/fuse-dfs directory of the Hadoop distribution.
THE JAVAINTERFACE
In this section, we dig into the Hadoop’s FileSystem class: the API for interacting with one of
Hadoop’s filesystems.5 While we focus mainly on the HDFS implementation,
DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 56
Introduction to Big Data Analytics Dept.of IT
class, to retain portability across filesystems. This is very useful when testing your program,for
example, since you can rapidly run tests using data stored on the local filesystem.
READING DATA FROM A HADOOP URL
One of the simplest ways to read a file from a Hadoop filesystem is by using a
java.net.URL object to open a stream to read the data from. The general idiom is:
InputStream in = null; try {
in = new URL("hdfs://host/path").openStream();
// process in} finally { IOUtils.closeStream(in);
}
There’s a little bit more work required to make Java recognize Hadoop’s hdfsURLscheme.
This is achieved by calling the setURLStreamHandlerFactory method on URL
1. Fromrelease0.21.0, there is a new file system interface called File Context with better handling
of multiple filesystems (so a single File Context can resolve multiple file system schemes, for
example) and a cleaner, more consistent interface.
2. With an instance of FsUrlStreamHandlerFactory.This method can only be called once
per JVM, so it is typically executed in a static block. This limitation means that if some other
part of your program – perhaps a third – party component outside your control – sets a URL
Stream Handler Factory, you won’t be able to use this approach for reading data from Hadoop.
Program for displaying files from Hadoop file systems on standard output, like the Unix cat
command.
Example 3-1. Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler
public class URLCat{
static {
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}
public static void main(String[] args) throws Exception { InputStream in = null;
try {
in = new URL(args[0]).openStream(); IOUtils.copyBytes(in, System.out, 4096, false);
} finally { IOUtils.closeStream(in);
}
}
Example 3-2. Displaying files from a Hadoop filesystem on standard output by using the FileSystem
directly
public class FileSystemCat{
public static void main(String[] args) throws Exception { String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf); InputStream in = null;
try {
in = fs.open(new Path(uri)); IOUtils.copyBytes(in, System.out, 4096,false);
} finally { IOUtils.closeStream(in);
}
}
}
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 58
Introduction to Big Data Analytics Dept.of IT
FSDataInputStream
The open() method on FileSystem actually returns a FSDataInputStream rather than a standard
java.io class. This class is a specialization of java.io.DataInputStream with support for random
access, so you can read from any part of the stream:
package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream implements Seekable,
PositionedReadable{
// implementation elided
}
The Seekable interface permits seeking to a position in the file and a query method for the current
offset from the start of the file (getPos()):
public interface Seekable {
void seek(long pos) throws IOException; long getPos() throws IOException;
}
Calling seek() with a position that is greater than the length of the file will result in an
IOException. Unlike the skip() method of java.io.InputStream that positions the stream at a point
later than the current position, seek() can move to an arbitrary, ab- solute position in the file.
Example 3-3 is a simple extension of Example 3-2 that writes a file to standard out twice: after writing
it once, it seeks to the start of the file and streams through it once again. Example 3-3. Displaying
files from a Hadoop filesystem on standard output twice, by using seek public class
FileSystemDoubleCat{
public static void main(String[] args) throws Exception { String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf); FSDataInputStream in = null;
try {
in = fs.open(new Path(uri)); IOUtils.copyBytes(in, System.out, 4096, false);
in.seek(0); // go back to the start of the file IOUtils.copyBytes(in, System.out, 4096,
false);
} finally { IOUtils.closeStream(in);
}
}
}
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 59
Introduction to Big Data Analytics Dept.of IT
}
As an alternative to creating a new file, you can append to an existing file using the
append() method (there are also some other overloaded versions):
public FSDataOutputStreamappend(Path f) throws IOException
Theappendoperationallowsasinglewritertomodifyanalreadywrittenfilebyopening it and writing data
from the final offset in the file. With this API, applications that produce unbounded files, such as
logfiles, can write to an existing file after a restart, for example. The append operation is optional
and not implemented by all Hadoop filesystems. For example, HDFS supports append, but S3
filesystemsdon’t.
To copy a localfile to a Hadoop filesystem.We illustrate progress by printing a period every time
the progress() method is called by Hadoop, which is after each 64 K packet of data is written to the
datanode pipeline. (Note that this particular behavior is not specified by the API, so it is subject to
change in later versions of Hadoop. The API merely allows you to infer that “something is
happening.”)
Example 3-4. Copying a local file to a Hadoop filesystem
public class FileCopyWithProgress{
public static void main(String[] args) throws Exception { String localSrc = args[0];
String dst = args[1];
InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
Configuration conf = new Configuration();
FileSystemfs= FileSystem.get(URI.create(dst), conf); OutputStream out =
fs.create(new Path(dst), new Progressable() {
public void progress() { System.out.print(".");
}
});
IOUtils.copyBytes(in, out, 4096, true);
}
}
Typical usage:
% hadoopFileCopyWithProgress input/docs/1400-8.txt hdfs://localhost/user/tom/ 1400-8.txt
Currently, none of the other Hadoop filesystems call progress() during writes. Progress is important
in MapReduce applications, as you will see in later chapters.
DATA FLOW
ANATOMY OF A FILE READ
To get an idea of how dataflows between the client interacting with HDFS,the name- node and the
datanodes,which shows them a insequence of events when reading a file.
Next we’ll look at how files are written to HDFS.Although quite detailed,it is instructive to
understand the data flow since it clarifies HDFS’s coherency model.The case we’re going to
consider is the case of creating a new file, writing data to it, then c l o s i n g t h e file.The client
creates the file by
UNIT IV
FUNDAMENTALS OF MAP REDUCE
➢ Introduction to Map Reduce: its Framework
➢ Features of Map Reduce
➢ Working of Map Reduce
➢ Analyze Map and Reduce Functions
➢ Map Reduce Techniques to optimize job
➢ Uses
➢ Controlling input formats in Map Reduce Execution
➢ Different Phases in map reduce
➢ Applications
MapReduce Features
This chapter looks at some of the more advanced features of MapReduce, including counters and sorting
and joining datasets.
Counters
There are often things you would like to know about the data you are analyzing but that are peripheral to
the analysis you are performing. For example, if you were counting invalid records and discovered that
the proportion of invalid records in the whole dataset was very high, you might be prompted to check why
so many records were being marked as invalid—perhaps there is a bug in the part of the program that
detects invalid records? Or if the data were of poor quality and genuinely did have very many invalid
records, after discovering this, you might decide to increase the size of the dataset so that the number of
good records was large enough for meaningful analysis.
Counters are a useful channel for gathering statistics about the job: for quality control or for application-
level statistics. They are also useful for problem diagnosis. If you are tempted to put a log message into
your map or reduce task, it is often better to see whether you can use a counter instead to record that a
particular condition occurred. In addition to counter values being much easier to retrieve than log output
for large distributed jobs, you get a record of the number of times that condition occurred, which is more
work to obtain from a set of log files.
Sorting
The ability to sort data is at the heart of MapReduce. Even if your application isn’t concerned with sorting
per se, it may be able to use the sorting stage that MapReduce provides to organize its data.Joins
MapReduce can perform joins between large datasets but writing the code to do joins from scratch is
fairly involved. Rather than writing MapReduce programs, you might consider using a higher-level
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 65
Introduction to Big Data Analytics Dept.of IT
framework such as Pig, Hive, or Cascading, in which join operations are a core part of the
implementation.Let’s briefly consider the problem we are trying to solve. We have two datasets—for
example, the weather stations database and the weather records—and we want to reconcile the two.
Let’s say we want to see each station’s history, with the station’s metadata inlined in each output row.
This is illustrated
in Figure 8-2.
How we implement the join depends on how large the datasets are and how they are partitioned. If one
dataset is large (the weather records) but the other one is small enough to be distributed to each node in
the cluster (as the station metadata is), the join can be effected by a MapReduce job that brings the
records for each station together (a partial sort on station ID, for example). The mapper or reducer uses
the smaller dataset to look up the station metadata for a station ID, so it can be written out with each
record. See Side Data Distribution for a discussion of this approach, where we focus on the mechanics
of distributing the data to task trackers.
If the join is performed by the mapper, it is called a map-side join, whereas if it is performed by the
reducer it is called a reduce-side join.
If both datasets are too large for either to be copied to each node in the cluster, we can still join them
using MapReduce with a map-side or reduce-side join, depending on how the data is structured. One
common example of this case is a user database and a log of some user activity (such as access logs).
For a popular service, it is not feasible to distribute the user database (or the logs) to all the MapReduce
nodes.
MapReduce Working
The whole process goes through four phases of execution namely, splitting, mapping, shuffling, and
reducing.
Consider you have following input data for your Map Reduce Program
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
MapReduce Architecture
Bad 1
Class 1
Good 1
Hadoop 3
Is 2
To 1
Welcome 1
The data goes through the following phases
Input Splits:
An input to a MapReduce job is divided into fixed-size pieces called input splits Input split is a chunk of
the input that is consumed by a single map
Mapping
This is the very first phase in the execution of map-reduce program. In this phase data in each split is
passed to a mapping function to produce output values. In our example, a job of mapping phase is to
count a number of occurrences of each word from input splits (more details about input-split is given
below) and prepare a list in the form of <word, frequency>
Shuffling
This phase consumes the output of Mapping phase. Its task is to consolidate the relevant records from
Mapping phase output. In our example, the same words are clubed together along with their respective
frequency.
Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase combines values from
Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.
In our example, this phase aggregates the values from Shuffling phase i.e., calculates total occurrences
of each word.
MapReduce
1. Traditional Enterprise Systems normally have a centralized server to store and process data.
2. The following illustration depicts a schematic view of a traditional enterprise system. Traditional
model is certainly not suitable to process huge volumes of scalable data and cannot be accommodated
by standard database servers.
3. Moreover, the centralized system creates too much of a bottleneck while processing multiple files
simultaneously.
4. Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a
task into small parts and assigns them to many computers.Later, the results are collected at one
place and integrated to form the result dataset.
iii. The master controller process knows how many Reduce tasks there will be, say r such tasks.
iv. The user typically tells the MapReduce system what r should be.
v. Then the master controller picks a hash function that applies to keys and produces a bucket
number from 0 to r − 1.
vi. Each key that is output by a Map task is hashed and its key-value pair is put in one of r local files.
Each file is destined for one of the Reduce tasks.1.
vii. To perform the grouping by key and distribution to the Reduce tasks, the master controller merges
the files from each Map task that are destined for a particular Reduce task and feeds the merged file to
that process as a sequence of key-list-of-value pairs.
viii. That is, for each key k, the input to the Reduce task that handles key k is a pair of the form (k,
[v1, v2, . . . , vn]), where (k, v1), (k, v2), . . . , (k, vn) are all the key-value pairs with key k coming from
all the Map tasks.
C. The Reduce Task
i. The Reduce function’s argument is a pair consisting of a key and its list of associated values.
ii. The output of the Reduce function is a sequence of zero or more key-value pairs.
iii. These key-value pairs can be of a type different from those sent from Map tasks to Reduce tasks,
but often they are the same type.
iv. We shall refer to the application of the Reduce function to a single key and its associated list of
values as a reducer. A Reduce task receives one or more keys and their associated value lists.
v. That is, a Reduce task executes one or more reducers. The outputs from all the Reduce tasks are
merged into a single file.
vi. Reducers may be partitioned among a smaller number of Reduce tasks is by hashing the keys and
associating each
vii. Reduce task with one of the buckets of the hash function.
The Reduce function simply adds up all the values. The output of a reducer consists of the word and the
sum. Thus, the output of all the Reduce tasks is a sequence of (w, m) pairs, where w is a word that appears
at least once among all the input documents and m is the total number of occurrences of w among all those
documents.
D. Combiners
i. A Reduce function is associative and commutative. That is, the values to be combined can be
combined in any order, with the same result.
ii. The addition performed in Example 1 is an example of an associative and commutative operation.
It doesn’t matter how we group a list of numbers v1, v2, . . . ,vn; the sum will be the same. iii.When the
Reduce function is associative and commutative, we can push some of what the reducers do to the Map
tasks
iv. These key-value pairs would thus be replaced by one pair with key w and value equal to the sum
of all the 1’s in all those pairs.
v. That is, the pairs with key w generated by a single Map task would be replaced by a pair (w, m),
where m is the number of times that w appears among the documents handled by this Map task.
E.Details of MapReduce task
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
i. The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).
F. MapReduce-Example
Twitter receives around 500 million tweets per day, which is nearly 3000 tweets per second. The
following illustration shows how Tweeter manages its tweets with the help of MapReduce.
Figure4.7: Example
Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
G. MapReduce – Algorithm
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The map task is done by means of Mapper Class
Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is used as input
by Reducer class, which in turn searches matching pairs and reduces them.
How the input files are split up and read in Hadoop is defined by the InputFormat.
An Hadoop InputFormat is the first component in Map-Reduce, it is responsible for creating the input
splits and dividing them into records.
Initially, the data for a MapReduce task is stored in input files, and input files typically reside in HDFS.
Although these files format is arbitrary, line-based log files and binary format can be used. Using
InputFormat we define how these input files are split and read.
The InputFormat class is one of the fundamental classes in the Hadoop MapReduce framework which
provides the following functionality:
• The files or other objects that should be used for input is selected by the InputFormat.
• InputFormat defines the Data splits, which defines both the size of individual Map tasks and its
potential execution server.
3. InputFormat defines the RecordReader, which is responsible for reading actual records from the
input files.
4. How we get the data to mapper?
We have 2 methods to get the data to mapper in MapReduce: getsplits() and createRecordReader() as
shown below:
1. public abstract class InputFormat<K, V>
2. {
3. public abstract List<InputSplit> getSplits(JobContext context)
4. throws IOException, InterruptedException;
5. public abstract RecordReader<K, V>
6. createRecordReader(InputSplit split,
7. TaskAttemptContext context) throws IOException,
8. InterruptedException;
9. }
4. TypesofInputFormat inMapReduce
It is the base class for all file-based Input Formats. Hadoop File Input Format specifies input directory
where data files are located. When we start a Hadoop job, File Input Format is provided with a path
containing files to read. File Input Format will read all files and divides these files into one or more
Input Splits.
TextInputFormat
It is the default Input Format of MapReduce. Text Input Format treats each line of each input file as a
separate record and performs no parsing. This is useful for unformatted data or line-based records like
log files.
• Key – It is the byte offset of the beginning of the line within the file (not whole file just one
split), so it will be unique if combined with the file name.
• Value – It is the contents of the line, excluding line terminators.
KeyValueTextInputFormat
It is similar to TextInputFormat as it also treats each line of input as a separate record. While
TextInputFormat treats entire line as the value, but the KeyValueTextInputFormat breaks the line itself
into key and value by a tab character (‘/t’). Here Key is everything up to the tab character while the
value is the remaining part of the line after tab character.
SequenceFileInputForma
Hadoop SequenceFileInputFormat is an InputFormat which reads sequence files. Sequence files are
binary files that stores sequences of binary key-value pairs. Sequence files block-compress and provide
direct serialization and deserialization of several arbitrary data types (not just text). Here Key & Value
both are user-defined.
Sequence File As Text Input Format
Hadoop SequenceFileAsTextInputFormat is another form of SequenceFileInputFormat which
converts the sequence file key values to Text objects. By calling ‘tostring()’ conversion is performed on
the keys and values. This InputFormat makes sequence files suitable input for streaming.
SequenceFileAsBinaryInputFormat
Hadoop SequenceFileAsBinaryInputFormat is a SequenceFileInputFormat using which we can
extract the sequence file’s keys and values as an opaque binary object.
NLineInputFormat
Hadoop NLineInputFormat is another form of TextInputFormat where the keys are byte offset of the
line and values are contents of the line. Each mapper receives a variable number of lines of input with
TextInputFormat and KeyValueTextInputFormat and the number depends on the size of the split and the
length of the lines. And if we want our mapper to receive a fixed number of lines of input, then we use N
Line Input Format.
N is the number of lines of input that each mapper receives. By default (N=1), each mapper receives
exactly one line of input. If N=2, then each split contains two lines. One mapper will receive the first two
Key-Value pairs and another mapper will receive the second two key-value pairs.
DB Input Format
Hadoop DBInputFormat is an InputFormat that reads data from a relational database, using JDBC. As it
doesn’t have portioning capabilities, so we need to careful not to swamp the database from which we are
reading too many mappers. So it is best for loading relatively small datasets, perhaps for joining with large
datasets from HDFS using Multiple Inputs. Here Key is LongWritables while Value is DB Writable.
UNIT V
BIG DATA PLATFORMS
✓ Sqoop
✓ Cassandra
✓ Mongo DB
✓ HIVE
✓ PIG
✓ Storm
✓ Flink
✓ Apache
Sqoop
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import
data from relational databases such as MySQL, Oracle to Hadoop HDFS, and export from Hadoop file system
to relational databases. This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem.
Audience
This tutorial is prepared for professionals aspiring to make a career in Big Data Analytics using Hadoop
Framework with Sqoop. ETL developers and professionals who are into analytics in general may as well use this
tutorial to good effect.
Prerequisites
Before proceeding with this tutorial, you need a basic knowledge of Core Java, Database concepts of SQL,
Hadoop File system, and any of Linux operating system flavors.
The traditional application management system, that is, the interaction of applications with relational database
using RDBMS, is one of the sources that generate Big Data. Such Big Data, generated by RDBMS, is stored in
Relational Database Servers in the relational database structure.
When Big Data storages and analyzers such as MapReduce, Hive, HBase, Cassandra, Pig, etc. of the Hadoop
ecosystem came into picture, they required a tool to interact with the relational database servers for importing
and exporting the Big Data residing in them. Here, Sqoop occupies a place in the Hadoop ecosystem to provide
feasible interaction between relational database server and Hadoop’s HDFS.
Sqoop − “SQL to Hadoop and Hadoop to SQL”
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import
data from relational databases such as MySQL, Oracle to Hadoop HDFS, and export from Hadoop file system to
relational databases. It is provided by the Apache Software Foundation.
How Sqoop Works?
Sqoop Import
The import tool imports individual tables from RDBMS to HDFS. Each row in a table is treated as a record in
HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence files.
Sqoop Export
The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain
records, which are called as rows in table. Those are read and parsed into a set of records and delimited with
user-specified delimiter.
Cassandra
Cassandra is an open-source NoSQL distributed database that manages large amounts of data across
commodity servers. It is a decentralized, scalable storage system designed to handle vast volumes of data
across multiple commodity servers, providing high availability without a single point of failure.
Cassandra is an open-source NoSQL distributed database that manages large amounts of data across
commodity servers. It is a decentralized, scalable storage system designed to handle vast volumes of data
across multiple commodity servers, providing high availability without a single point of failure.
Cassandra was created for Facebook but was open-sourced and released to become an Apache project
(maintained by the Americal non-profit, Apache Software Foundation) in 2008. After that, it found top priority
in 2010 and is now among the best NoSQL database systems in the world. Cassandra is trusted and used by
thousands of companies because of the ease of expansion and, better still, its lack of a single point of failure.
Currently, the solution has been deployed to handle databases for Netflix, Twitter, Reddit, etc.
Apache Cassandra, a distributed database management system, is built to manage a large amount of data over
several cloud data centers. Understanding how Cassandra works means understanding three basic processes of
the system. These are the architecture components it is built on, its partitioning system, and its replicability.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 80
Introduction to Big Data Analytics Dept.of IT
1. Architecture of Cassandra
The primary architecture of Cassandra is made up of a cluster of nodes. Apache Cassandra is structured as a
peer-to-peer system and closely resembles DynamoDB and Google Bigtable.
Every node in Cassandra is equal and carries the same level of importance, which is fundamental to the
structure of Cassandra. Each node is the exact point where specific data is stored. A group of nodes that are
related to each other makes up a data center. The complete set of data centers capable of storing data for
processing is what makes up a cluster.
The beautiful thing about Cassandra’s architecture is that it can easily be expanded to house more data. By
adding more nodes, you can double the amount of data the system carries without overwhelming it. This
dynamic scaling ability goes both ways. By reducing the number of nodes, developers can shrink the database
system if necessary. Compared to previous structured query language (SQL) databases and the complexity of
increasing their data-carrying capacity, Cassandra’s architecture gives it a considerable advantage.
Another way Cassandra’s architecture helps its functionality is that it increases data security and protects from
data loss.
2. The partitioning system
In Cassandra, data is stored and retrieved via a partitioning system. A partitioner is what determines where the
primary copy of a data set is stored. This works with nodal tokens in a direct format. Every node owns or is
responsible for a set of tokens based on a partition key. The partition key is responsible for determining where
data is stored.Immediately as data enters a cluster, a hash function is added to the partition key. The
coordinator node (thenode a client connects to with a request) is responsible for sending the data to the node
with the same token under that partition.
3. Cassandra’s replicability
Another way Cassandra works is by replicating data across nodes. These secondary nodes are called replica
nodes, and the number of replica nodes for a given data set is based on the replication factor (RF). A
replication factor of 3 means three nodes cover the same token range, storing the same data. Multiple replicas
are key to the reliability of Cassandra.
Even when one node stops functioning, temporarily or permanently, other nodes hold the same data, meaning
that data is hardly ever wholly lost. Better still, if a temporarily disrupted node is back on track, it receives an
update on the data actions it may have missed and then catches up to speed to continue functioning.
Key Features of Cassandra
Cassandra is a unique database system, and some of its key features include:
1. Open-source availability
Nothing is more exciting than getting a handy product for free. This is probably one of the significant factors
behind Cassandra’s far-reaching popularity and acceptance. Cassandra is among the open-source products
hosted by Apache and is free for anyone who wants to utilize it.
2. Distributed footprint
Another feature of Cassandra is that it is well distributed and meant to run over multiple nodes as opposed to a
central system. All the nodes are equal in significance, and without a master node, no bottleneck slows the
process down. This is very important because the companies that utilize Cassandra need to constantly run on
accurate data and can not tolerate data loss. The equal and wide distribution of Cassandra data across nodes
means that losing one node does not significantly affect the system’s general performance.
3. Scalability
Cassandra has elastic scalability. This means that it can be scaled up or down without much difficulty or
resistance. Cassandra’s scalability once again is due to the nodal architecture. It is intended to grow
horizontally as your needs as a developer or company grow. Scaling-up in Cassandra is very easy and not
limited to location. Adding or removing extra nodes can adjust your database system to suit your dynamic
needs.
Another exciting point about scaling in Cassandra is that there is no slow down, pause or hitch in the system
during the process. This means end-users would not feel the effect of whatever happened, ensuring smooth
service to all individuals connected to the network.
data has been replicated and stored across other nodes in the cluster. Data replication leads to a high level
of backup and recovery.
6. Schema free
SQL is a fixed schema database language making it rigid and fixed. However, Cassandra is a schema-optional
data model and allows the operator to create as many rows and columns as is deemed necessary.
7. Tunable consistency
Cassandra has two types of consistency – the eventual consistency and the setting consistency. The string
consistency is a type that broadcasts any update or information to every node where the concerned data is
located. In eventual consistency, the client has to approve immediately after a cluster receives a write.
Cassandra’s tunable consistency is a feature that allows the developer to decide to use any of the two types
depending on the function being carried out. The developer can use either or both kinds of consistency at any
time.
8. Fast writes
Cassandra is known to have a very high throughput, not hindered by its size. Its ability to write quickly is a
function of its data handling process. The initial step taken is to write to the commit log. This is for durability
to preserve data in case of damage or node downtime. Writing to the commit log is a speedy and efficient
process using this tool.
The next step is to write to the “Memtable” or memory. After writing to Memtable, a node acknowledges the
successful writing of data. The Memtable is found in the database memory, and writing to in-memory is much
faster than writing to a disk. All of these account for the speed Cassandra writes.
Peer-to-peer architecture
Cassandra is built on a peer-to-peer architectural model where all nodes are equal. This is unlike some database
models with a “slave to master” relationship. That is where one unit directs the functioning of the other units,
and the other unit only communicates with the central unit or master. In Cassandra, different units can
communicate with each other as peers in a process called gossiping. This peer-to-peer communication
eliminates a single point of failure and is a prominent defining feature of Cassandra.
1. E-commerce
E-commerce is an extremely sensitive field that cuts across every region and country. The nature of financial
markets means anticipated peak times as well as downtimes. For a finance operation, no customer would want
to experience downtime or lack of access when there is revenue to be earned and lots of opportunities to hold
on to. E-commerce companies can avoid these downtimes or potential blackouts by using a highly reliable
system like Cassandra. Its fault tolerance allows it to keep running even if a whole center is damaged with little
or no hitch in the system.
Due to its easy scalability, especially in peak seasons, E-commerce and inventory management is also a
significant application of Cassandra. When there is a market rush, the company has to increase the ability of
the database to carry and store more data. The seasonal, rapid E-commerce growth that is affordable and does
not cause system restart is simply a perfect fit for companies.
E-commerce websites also benefit from Cassandra as it stores and records visitors’ activities. It then allows
analytical tools to modulate the visitor’s action and, for instance, tempts them to stay on the website.
2. Entertainment websites
With the aid of Cassandra, websites for movies, games and music can keep track of customer behavior and
preferences. The database records for each visitor, including what was clicked, downloaded, time spent, etc.
This information is analyzed and used to recommend further entertainment options to the end-user.
This application of Cassandra falls under the personalization, recommendation and customer experience use
cases. It is not just limited to entertainment sites, but also online shopping platforms and social media
recommendations. This is why users would receive notifications of similar goods to what they spent time
browsing.
• Cassandra allows every individual node to carry out read and write operations.
• It allows for real-time analysis with machine learning and artificial intelligence (AI).
5. Messaging
There are currently several messaging applications in use, and an ever-growing number of individuals are using
them. This creates the need for a stable database system to store ever-flowing information volumes. Cassandra
provides both stability and storage capacity for companies that offer messaging services.
6. Logistics and asset management
Cassandra is used in logistics and asset management to track the movement of any item to get transported.
From the purchase to the final delivery, applications can rely on Cassandra to log each transaction. This is
especially applicable to large logistic companies regularly processing vast amounts of data. Cassandra had
found a robust use case in backend development for such applications. It stores and analyzes data flowing
through without impacting application performance.
Cassandra’s powerful features and unique distributed architecture make it a favorite database management tool
for independent developers and large enterprises. Some of the largest companies in the world that need high-
speed information relay, rely on Cassandra, including social media platforms like Facebook and Twitter, as
well as media platforms like Netflix.
Further, Cassandra has been constantly updated since it first became open-source in 2008. Apache Cassandra
version 4.1 is scheduled for release in July 2022, assuring technical professionals of continued support and
access to cutting-edge features at no added costs.
Programmers who are not so good at Java normally used to struggle working with Hadoop, especially
while performing any MapReduce tasks. Apache Pig is a boon for all such programmers.
• Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex
codes in Java.
• Apache Pig uses multi-query approach, thereby reducing the length of codes. For example, an
operation that would require you to type 200 lines of code (LoC) in Java can be easily done by
typing as less as just 10 LoC in Apache Pig. Ultimately Apache Pig reduces the development time by
almost 16 times.
• Pig Latin is SQL-like language and it is easy to learn Apache Pig when you are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like joins, filters, ordering,
etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from
MapReduce.
• Features of Pig
• Apache Pig comes with the following features −
• Rich set of operators − It provides many operators to perform operations like join, sort, filer, etc.
• Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you are
good at SQL.
• Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so
the programmers need to focus only on semantics of the language.
• Extensibility − Using the existing operators, users can develop their own functions to read, process,
and write data.
• UDF’s − Pig provides the facility to create User-defined Functions in other programming
languages such as Java and invoke or embed them in Pig Scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well as
unstructured. It stores the results in HDFS.
Apache Pig Vs MapReduce
Listed below are the major differences between Apache Pig and MapReduce.
Any novice programmer with a basic knowledge of Exposure to Java is must to work with
SQL can work conveniently with Apache Pig. MapReduce.
Apache Pig uses multi-query approach, thereby MapReduce will require almost 20 times
reducing the length of the codes to a great extent. more the number of lines to perform the
same task.
There is no need for compilation. On execution, every MapReduce jobs have a long compilation
Apache Pig operator is converted internally into a process.
MapReduce job.
Pig SQL
In Apache Pig, schema is optional. We can store data without Schema is mandatory in SQL.
designing a schema (values are stored as $01, $02 etc.)
The data model in Apache Pig is nested relational. The data model used in SQL is flat
relational.
Apache Pig provides limited opportunity for Query There is more opportunity for query
optimization. optimization in SQL.
Apache Pig uses a language called Pig Hive uses a language called
Latin. It was originally created at Yahoo. HiveQL. It was originally created at
Facebook.
Apache Pig can handle structured, Hive is mostly for structured data.
unstructured,and semi-structured data.
Applications of Apache Pig
Apache Pig is generally used by data scientists for performing tasks involving ad-hoc processing and
quick prototyping. Apache Pig is used −
• To process huge data sources such as web logs.
• To perform data processing for search platforms.
• To process time sensitive data loads.
Apache Pig – History
In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and execute
MapReduce jobs on every dataset. In 2007, Apache Pig was open sourced via Apache incubator.In 2008,
the first release of Apache Pig came out. In 2010, Apache Pig graduated as an Apache top-level project.
Apache Pig - Architecture
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high level data
processing language which provides a rich set of data types and operators to perform various operations
on the data.
To perform a particular task Programmers using Pig, programmers need to write a Pig script using the Pig
Latin language, and execute them using any of the execution mechanisms (Grunt Shell, UDFs,
Embedded). After execution, these scripts will go through a series of transformations applied by the Pig
Framework, to produce the desired output.Internally, Apache Pig converts these scripts into a series of
MapReduce jobs, and thus, it makes the programmer's job easy. The architecture of Apache Pig is shown
below.
The data model of Pig Latin is fully nested and it allows complex non-atomic d a t a t y p e s s u c h
as
map and tuple. Given below is the diagrammatical representation of Pig Latin’s data model.
Atom
Any single value in Pig Latin, irrespective of their data, type is known as an Atom. It is stored as string
and can be used as string and number. int, long, float, double, chararray, and bytearray are the atomic
values of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple, the fields can be of any type. A
tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag.
Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a
table in RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple contain the same
number of fields or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as inner bag.
Example − {Raja, 30, {9848022338, raja@gmail.com,}}
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be
unique. The value might be of any type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples
are processed in any particular order).
Apache Pig - Installation
This chapter explains the how to download, install, and set up Apache Pig in your system.
Prerequisites
It is essential that you have Hadoop and Java installed on your system before you go for Apache Pig.
Therefore, prior to installing Apache Pig, install Hadoop and Java by following the steps given in the
following link −
http://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
Download Apache Pig
First of all, download the latest version of Apache Pig from the following website
− https://pig.apache.org/
Step 1
Open the homepage of Apache Pig website. Under the section News, click on the link release page as
shown in the following snapshot.
Step 2
On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page,
under the Download section, you will have two links, namely, Pig 0.8 and later and Pig 0.7 and before.
Click on the link Pig 0.8 and later, then you will be redirected to the page having a set of mirrors.
Step 3
Choose and click any one of these mirrors as shown below.
Step 4
These mirrors will take you to the Pig Releases page. This page contains various versions of Apache
Pig. Click the latest version among them.
Step 5
Within these folders, you will have the source and binary files of Apache Pig in various distributions.
Download the tar files of the source and binary files of Apache Pig 0.15, pig0.15.0-src.tar.gz and pig-
0.15.0.tar.gz.
$ mkdirPig
Step 2
Extract the downloaded tar files as shown below.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz
Step 3
Move the content of pig-0.15.0-src.tar.gz file to the Pig directory created earlier as shown below.
$ mv pig-0.15.0-src.tar.gz/* /home/Hadoop/Pig/
Configure Apache Pig
After installing Apache Pig, we have to configure it. To configure, we need to edit two files − bashrc
and pig.properties.
.bashrc file
In the .bashrc file, set the following variables −
• PIG_HOME folder to the Apache Pig’s installation folder,
• PATH environment variable to the bin folder, and
• PIG_CLASSPATH environment variable to the etc (configuration) folder of your Hadoop
installations (the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).
export PIG_HOME = /home/Hadoop/Pig
export PATH = $PATH:/home/Hadoop/pig/bin
export PIG_CLASSPATH = $HADOOP_HOME/conf
pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set
various parameters as given below.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 99
Introduction to Big Data Analytics Dept.of IT
pig -h properties
Verifying the Installation
Verify the installation of Apache Pig by typing the version command. If the installation is successful,
you will get the version of Apache Pig as shown below.
$ pig –version
Mode
Execution Modes
Pig has six execution modes or exec types:
• Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed
and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
• Tez Local Mode - To run Pig in tez local mode. It is similar to local mode, except internally Pig
will invoke tez runtime engine. Specify Tez local mode using the -x flag (pig -x tez_local).
Note: Tez local mode is experimental. There are some queries which just error out on bigger data in local
mode.
• Spark Local Mode - To run Pig in spark local mode. It is similar to local mode, e x c e p t
i n t e r n a l l y Pig will invoke spark runtime engine. Specify Spark local mode using the -x flag (pig -x
spark_local). Note: Spark local mode is experimental. There are some queries which just error out on
bigger data in local mode.
• Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and
HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the
-x flag (pig OR pig -x mapreduce).
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 100
Introduction to Big Data Analytics Dept.of IT
• Tez Mode - To run Pig in Tez mode, you need access to a Hadoop cluster and HDFS installation.
Specify Tez mode using the -x flag (-x tez).
• Spark Mode - To run Pig in Spark mode, you need access to a Spark, Yarn or Mesos cluster and
HDFS installation. Specify Spark mode using the -x flag (-x spark). In Spark execution mode, it is
necessary to set env::SPARK_MASTER to an appropriate value (local - local mode, yarn-client - yarn-
client mode, mesos://host:port - spark on mesos or spark://host:port - spark cluster. For more information
refer to spark documentation on Master URLs, yarn-cluster mode is currently not supported). Pig scripts
run on Spark can take advantage of the dynamic allocation feature. The feature can be enabled by simply
enabling spark.dynamicAllocation.enabled. Refer to spark configuration for additional configuration
details. In general all properties in the pig script prefixed with spark. are copied to the Spark Application
Configuration. Please note that Yarn auxillary service need to be enabled onSpark for this to work. See Spark
documentation for additional details.
You can run Pig in either mode using the "pig" command (the bin/pig Perl script) or the "java"
command (java -cp pig.jar ...).
Examples
This example shows how to run Pig in local and mapreduce mode using the pig command.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 101
Introduction to Big Data Analytics Dept.of IT
/* local mode */
$ pig -x local ...
/* mapreduce mode */
$ pig ...
or
$ pig -x mapreduce ...
/* Tez mode */
$ pig -x tez ...
/* Spark mode */
$ pig -x spark ...
Interactive Mode
You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell using the "pig" command
(as shown below) and then enter your Pig Latin statements and Pig commands interactively at the
command line.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 102
Introduction to Big Data Analytics Dept.of IT
Example
These Pig Latin statements extract all user IDs from the /etc/passwd file. First, copy the /etc/passwd file
to your local working directory. Next, invoke the Grunt shell by typing the "pig" command (in local or
hadoop mode). Then, enter the Pig Latin statements interactively at the grunt prompt (be sure to include
the semicolon after each statement). The DUMP operator will display the results to your terminal screen.
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
Local Mode
$ pig -x local
... - Connecting to ...
grunt>
Tez Local Mode
$ pig -x tez_local
... - Connecting to ...
grunt>
Spark Local Mode
$ pig -x spark_local
... - Connecting to ...
grunt>
Mapreduce Mode
$ pig -x mapreduce
... - Connecting to ...
grunt>
or
$ pig
... - Connecting to ...
grunt>
Tez Mode
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 103
Introduction to Big Data Analytics Dept.of IT
$ pig -x tez
... - Connecting to ...
grunt>
Spark Mode
$ pig -x spark
... - Connecting to ...
grunt>
Batch Mode
You can run Pig in batch mode using Pig scripts and the "pig" command (in local or hadoop mode).
Example
The Pig Latin statements in the Pig script (id.pig) extract all user IDs from the /etc/passwd file. First, copy
the /etc/passwd file to your local working directory. Next, run the Pig script from the commandline (using
local or mapreduce mode). The STORE operator will write the results to a file (id.out).
/* id.pig */
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 104
Introduction to Big Data Analytics Dept.of IT
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 105
Introduction to Big Data Analytics Dept.of IT
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 106
Introduction to Big Data Analytics Dept.of IT
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 107
Introduction to Big Data Analytics Dept.of IT
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 108
Introduction to Big Data Analytics Dept.of IT
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 109
Introduction to Big Data Analytics Dept.of IT
Syntax
Given below is the syntax of the explain operator.
grunt> explain Relation_name;
Apache Pig - Illustrate Operator
The illustrate operator gives you the step-by-step execution of a sequence of statements.
Syntax
Given below is the syntax of the illustrate operator.
grunt> illustrate Relation_name;
Apache Pig - Group Operator
The GROUP operator is used to group the data in one or more relations. It collects the data having the
same key.
Syntax
Given below is the syntax of the group operator.
grunt>Group_data = GROUP Relation_name BY age;
Apache Pig - Cogroup Operator
The COGROUP operator works more or less in the same way as the GROUP operator. The onlydifference
between the two operators is that the group operator is normally used with one relation, whilethe cogroup
operator is used in statements involving two or more relations.
Grouping Two Relations using Cogroup
Assume that we have two files namely student_details.txt and employee_details.txt in the HDFS
directory /pig_data/
Apache Pig - Join Operator
The JOIN operator is used to combine records from two or more relations. While performing a join
operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these keys match, the
two particular tuples are matched, else the records are dropped. Joins can be of the following types −
• Self-join
• Inner-join
• Outer-join − left join, right join, and full join
This chapter explains with examples how to use the join operator in Pig Latin. Assume that we have two
files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS
Apache Pig - Cross Operator
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 110
Introduction to Big Data Analytics Dept.of IT
The CROSS operator computes the cross-product of two or more relations. This chapter explains with
example how to use the cross operator in Pig Latin.
Syntax
Given below is the syntax of the CROSS operator.
grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
Apache Pig - Union Operator
The UNION operator of Pig Latin is used to merge the content of two relations. To perform UNION
operation on two relations, their columns and domains must be identical.
Syntax
Given below is the syntax of the UNION operator.
grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
Apache Pig - Split Operator
The SPLIT operator is used to split a relation into two or more relations.
Syntax
Given below is the syntax of the SPLIT operator.
grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation2_name (condition2),
Apache Pig - Filter Operator
The FILTER operator is used to select the required tuples from a relation based on a condition.
Syntax
Given below is the syntax of the FILTER operator.
grunt> Relation2_name = FILTER Relation1_name BY (condition);
Apache Pig - Distinct Operator
The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.
Syntax
Given below is the syntax of the DISTINCT operator.
grunt> Relation_name2 = DISTINCT Relatin_name1;
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 111
Introduction to Big Data Analytics Dept.of IT
Syntax
Given below is the syntax of FOREACH operator.
grunt> Relation_name2 = FOREACH Relatin_name1 GENERATE (required data);
Working with functions in Pig
Apache Pig provides various built-in functions namely eval, load, store, math, string,
bag and tuple functions.
Eval Functions
Given below is the list of eval functions provided by Apache Pig.
1 AVG()
To compute the average of the numerical values within a bag.
2 BagToString()
To concatenate the elements of a bag into a string. While concatenating, we can place a
delimiter between these values (optional).
3 CONCAT()
To concatenate two or more expressions of same type.
4 COUNT()
To get the number of elements in a bag, while counting the number of tuples in a bag.
5 COUNT_STAR()
It is similar to the COUNT() function. It is used to get the number of elements in a bag.
6 DIFF()
To compare two bags (fields) in a tuple.
7 IsEmpty()
To check if a bag or map is empty.
8 MAX()
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 112
Introduction to Big Data Analytics Dept.of IT
To calculate the highest value for a column (numeric values or chararrays) in a single-
column bag.
9 MIN()
To get the minimum (lowest) value (numeric or chararray) for a certain column in a single-
column bag.
10 PluckTuple()
Using the Pig Latin PluckTuple() function, we can define a string Prefix and filter the
columns in a relation that begin with the given prefix.
11 SIZE()
To compute the number of elements based on any Pig data type.
12 SUBTRACT()
To subtract two bags. It takes two bags as inputs and returns a bag which contains the tuples
of the first bag that are not in the second bag.
13 SUM()
To get the total of the numeric values of a column in a single-column bag.
14 TOKENIZE()
To split a string (which contains a group of words) in a single tuple and return a bag which
contains the output of the split operation.
The Load and Store functions in Apache Pig are used to determine how the data goes ad comes out of
Pig. These functions are used with the load and store operators. Given below is the list of load and store
functions available in Pig.
1 PigStorage()
To load and store structured files.
2 TextLoader()
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 113
Introduction to Big Data Analytics Dept.of IT
3 BinStorage()
To load and store data into Pig using machine readable format.
4 Handling Compression
In Pig Latin, we can load and store compressed data.
1 TOBAG()
To convert two or more expressions into a bag.
2 TOP()
To get the top N tuples of a relation.
3 TOTUPLE()
To convert one or more expressions into a tuple.
4 TOMAP()
To convert the key-value pairs into a Map.
1 ENDSWITH(string, testAgainst)
To verify whether a given string ends with a particular substring.
2 STARTSWITH(string, substring)
Accepts two string parameters and verifies whether the first string starts with the second.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 114
Introduction to Big Data Analytics Dept.of IT
4 EqualsIgnoreCase(string1, string2)
To compare two stings ignoring the case.
6 LAST_INDEX_OF(expression)
Returns the index of the last occurrence of a character in a string, searching backward from a
start index.
7 LCFIRST(expression)
Converts the first character in a string to lower case.
8 UCFIRST(expression)
Returns a string with the first character converted to upper case.
9 UPPER(expression)
UPPER(expression) Returns a string converted to upper case.
10 LOWER(expression)
Converts all characters in a string to lower case.
14 TRIM(expression)
Returns a copy of a string with leading and trailing whitespaces removed.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 115
Introduction to Big Data Analytics Dept.of IT
15 LTRIM(expression)
Returns a copy of a string with leading whitespaces removed.
16 RTRIM(expression)
Returns a copy of a string with trailing whitespaces removed.
1 ToDate(milliseconds)
This function returns a date-time object according to the given parameters. The other
alternative for this function are ToDate(iosstring), ToDate(userstring, format),
ToDate(userstring, format, timezone)
2 CurrentTime()
returns the date-time object of the current time.
3 GetDay(datetime)
Returns the day of a month from the date-time object.
4 GetHour(datetime)
Returns the hour of a day from the date-time object.
5 GetMilliSecond(datetime)
Returns the millisecond of a second from the date-time object.
6 GetMinute(datetime)
Returns the minute of an hour from the date-time object.
7 GetMonth(datetime)
Returns the month of a year from the date-time object.
8 GetSecond(datetime)
Returns the second of a minute from the date-time object.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 116
Introduction to Big Data Analytics Dept.of IT
9 GetWeek(datetime)
Returns the week of a year from the date-time object.
10 GetWeekYear(datetime)
Returns the week year from the date-time object.
11 GetYear(datetime)
Returns the year from the date-time object.
12 AddDuration(datetime, duration)
Returns the result of a date-time object along with the duration object.
13 SubtractDuration(datetime, duration)
Subtracts the Duration object from the Date-Time object and returns the result.
14 DaysBetween(datetime1, datetime2)
Returns the number of days between the two date-time objects.
15 HoursBetween(datetime1, datetime2)
Returns the number of hours between two date-time objects.
16 MilliSecondsBetween(datetime1, datetime2)
Returns the number of milliseconds between two date-time objects.
17 MinutesBetween(datetime1, datetime2)
Returns the number of minutes between two date-time objects.
18 MonthsBetween(datetime1, datetime2)
Returns the number of months between two date-time objects.
19 SecondsBetween(datetime1, datetime2)
Returns the number of seconds between two date-time objects.
20 WeeksBetween(datetime1, datetime2)
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 117
Introduction to Big Data Analytics Dept.of IT
21 YearsBetween(datetime1, datetime2)
Returns the number of years between two date-time objects.
1 ABS(expression)
To get the absolute value of an expression.
2 ACOS(expression)
To get the arc cosine of an expression.
3 ASIN(expression)
To get the arc sine of an expression.
4 ATAN(expression)
This function is used to get the arc tangent of an expression.
5 CBRT(expression)
This function is used to get the cube root of an expression.
6 CEIL(expression)
This function is used to get the value of an expression rounded up to the nearest integer.
7 COS(expression)
This function is used to get the trigonometric cosine of an expression.
8 COSH(expression)
This function is used to get the hyperbolic cosine of an expression.
9 EXP(expression)
This function is used to get the Euler’s number e raised to the power of x.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 118
Introduction to Big Data Analytics Dept.of IT
10 FLOOR(expression)
To get the value of an expression rounded down to the nearest integer.
11 LOG(expression)
To get the natural logarithm (base e) of an expression.
12 LOG10(expression)
To get the base 10 logarithm of an expression.
13 RANDOM( )
To get a pseudo random number (type double) greater than or equal to 0.0 and less than 1.0.
14 ROUND(expression)
To get the value of an expression rounded to an integer (if the result type is float) or rounded
to a long (if the result type is double).
15 SIN(expression)
To get the sine of an expression.
16 SINH(expression)
To get the hyperbolic sine of an expression.
17 SQRT(expression)
To get the positive square root of an expression.
18 TAN(expression)
To get the trigonometric tangent of an angle.
19 TANH(expression)
To get the hyperbolic tangent of an expression.
Hive - Introduction
The term ‘Big Data’ is used for collections of large datasets that include huge volume, high velocity, and
a variety of data that is increasing day by day. Using traditional data management systems, it is difficult
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 119
Introduction to Big Data Analytics Dept.of IT
to process Big Data. Therefore, the Apache Software Foundation introduced a framework called Hadoop
to solve Big Data management and processing challenges.
Hadoop
Hadoop is an open-source framework to store and process Big Data in a distributed environment. It
contains two modules, one is MapReduce and another is Hadoop Distributed File System (HDFS).
✓ MapReduce: It is a parallel programming model for processing large amounts of structured, semi-
structured, and unstructured data on large clusters of commodity hardware.
✓ HDFS:Hadoop Distributed File System is a part of Hadoop framework, used to store and process the
datasets. It provides a fault-tolerant file system to run on commodity hardware.
✓ The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that are
used to help Hadoop modules.
✓ Sqoop: It is used to import and export data to and from between HDFS and RDBMS.
✓ Pig: It is a procedural language platform used to develop a script for MapReduce operations.
✓ Hive: It is a platform used to develop SQL type scripts to do MapReduce operations.
Note: There are various ways to execute MapReduce operations:
✓ The traditional approach using Java MapReduce program for structured, semi-structured, and
unstructured data.
✓ The scripting approach for MapReduce to process structured and semi structured data using Pig.
✓ The Hive Query Language (HiveQL or HQL) for MapReduce to process structured data using Hive.
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of
Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and developed
it further as an open source under the name Apache Hive. It is used by different companies. For example,
Amazon uses it in Amazon Elastic MapReduce.
Hive is not
✓ A relational database
✓ A design for OnLine Transaction Processing (OLTP)
✓ A language for real-time queries and row-level updates
Features of Hive
✓ It stores schema in a database and processed data into HDFS.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 120
Introduction to Big Data Analytics Dept.of IT
This component diagram contains different units. The following table describes each unit:
User Interface Hive is a data warehouse infrastructure software that can create interaction
between user and HDFS. The user interfaces that Hive supports are Hive
Web UI, Hive command line, and Hive HD Insight (In Windows server).
Meta Store Hive chooses respective database servers to store the schema or Metadata
of tables, databases, columns in a table, their data types, and HDFS
mapping.
HiveQL Process Engine HiveQL is similar to SQL for querying on schema info on the Metastore.
It is one of the replacements of traditional approach for MapReduce
program. Instead of writing MapReduce program in Java, we can write a
query for MapReduce job and process it.
Execution Engine The conjunction part of HiveQL process Engine and MapReduce is Hive
Execution Engine. Execution engine processes the query and generates
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 121
Introduction to Big Data Analytics Dept.of IT
HDFS or HBASE Hadoop distributed file system or HBASE are the data storage techniques
to store data into file system.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:
Step Operation
No.
1 Execute Query
The Hive interface such as Command Line or Web UI sends query to Driver (any
database driver such as JDBC, ODBC, etc.) to execute.
2 Get Plan
The driver takes the help of query compiler that parses the query to check the syntax and
query plan or the requirement of query.
3 Get Metadata
The compiler sends metadata request to Metastore (any database).
4 Send Metadata
Metastore sends metadata as a response to the compiler.
5 Send Plan
The compiler checks the requirement and resends the plan to the driver. Up to here, the
parsing and compiling of a query is complete.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 122
Introduction to Big Data Analytics Dept.of IT
6 Execute Plan
The driver sends the execute plan to the execution engine.
7 Execute Job
Internally, the process of execution job is a MapReduce job. The execution engine sends
the job to JobTracker, which is in Name node and it assigns this job to TaskTracker, which
is in Data node. Here, the query executes MapReduce job.
8 Fetch Result
The execution engine receives the results from Data nodes.
9 Send Results
The execution engine sends those resultant values to the driver.
10 Send Results
The driver sends the results to Hive Interfaces.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 123
Introduction to Big Data Analytics Dept.of IT
Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-
x.y.z (where x.y.z is the release number):
$ tar -xzvf hive-x.y.z.tar.gz
Set the environment variable HIVE_HOME to point to the installation directory:
$ cd hive-x.y.z
$ export HIVE_HOME={{pwd}}
Finally, add $HIVE_HOME/bin to your PATH:
$ export PATH=$HIVE_HOME/bin:$PATH
Building Hive from Source
The Hive GIT repository for the most recent Hive code is located here: git clone https://git-wip-
us.apache.org/repos/asf/hive.git (the master branch).
All release versions are in branches named "branch-0.#" or "branch-1.#" or the upcoming "branch-2.#",
with the exception of release 0.8.1 which is in "branch-0.8-r2". Any branches with other names are feature
branches for works-in-progress. See Understanding Hive Branches for details.
As of 0.13, Hive is built using Apache Maven.
Compile Hive on master
To build the current Hive code from the master branch:
$ git clone https://git-wip-us.apache.org/repos/asf/hive.git
$ cd hive
$ mvn clean package -Pdist [-DskipTests -Dmaven.javadoc.skip=true]
$ cd packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-
bin
$ ls
LICENSE
NOTICE
README.txt
RELEASE_NOTES.txt
bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)
examples/ (sample input and query files)
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 124
Introduction to Big Data Analytics Dept.of IT
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 125
Introduction to Big Data Analytics Dept.of IT
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 126
Introduction to Big Data Analytics Dept.of IT
$ $HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT
Beeline is started with the JDBC URL of the HiveServer2, which depends on the address and port where
HiveServer2 was started. By default, it will be (localhost:10000), so the address will look like
jdbc:hive2://localhost:10000.
Or to start Beeline and HiveServer2 in the same process for testing purpose, for a similar user experience
to HiveCLI:
$ $HIVE_HOME/bin/beeline -u jdbc:hive2://
Running HCatalog
To run the HCatalog server from the shell in Hive release 0.11.0 and later:
$ $HIVE_HOME/hcatalog/sbin/hcat_server.sh
To use the HCatalog command line interface (CLI) in Hive release 0.11.0 and later:
$ $HIVE_HOME/hcatalog/bin/hcat
For more information, see HCatalog Installation from Tarball and HCatalog CLI in the HCatalog
manual.
Running WebHCat (Templeton)
To run the WebHCat server from the shell in Hive release 0.11.0 and later:
$ $HIVE_HOME/hcatalog/sbin/webhcat_server.sh
For more information, see WebHCat Installation in the WebHCat manual.
Hive services
Hive services such as Meta store, File system, and Job Client in turn communicates with Hive storage
and performs the following actions
• Metadata information of tables created in Hive is stored in Hive "Meta storage database".
• Query results and data loaded in the tables are going to be stored in Hadoop cluster on HDFS.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 127
Introduction to Big Data Analytics Dept.of IT
From the above screenshot we can understand the Job execution flow in Hive with Hadoop
The data flow in Hive behaves in the following pattern;
1. Executing Query from the UI( User Interface)
2. The driver is interacting with Compiler for getting the plan. (Here plan refers to query execution)
process and its related metadata information gathering
3. The compiler creates the plan for a job to be executed. Compiler communicating with Meta store
for getting metadata request
4. Meta store sends metadata information back to compiler
5. Compiler communicating with Driver with the proposed plan to execute the query
6. Driver Sending execution plans to Execution engine
7. Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query. For DFS
operations.
• EE should first contacts Name Node and then to Data nodes to get the values stored in tables.
• EE is going to fetch desired records from Data Nodes. The actual data of tables resides in data
node only. While from Name Node it only fetches the metadata information for the query.
• It collects actual data from data nodes related to mentioned query
• Execution Engine (EE) communicates bi-directionally with Meta store present in Hive to perform
DDL (Data Definition Language) operations. Here DDL operations like CREATE, DROP and
ALTERING tables and databases are done. Meta store will store information about
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 128
Introduction to Big Data Analytics Dept.of IT
database name, table names and column names only. It will fetch data related to query
mentioned.
• Execution Engine (EE) in turn communicates with Hadoop daemons such as Name node, Data
nodes, and job tracker to execute the query on top of Hadoop file system
8. Fetching results from driver
9. Sending results to Execution engine. Once the results fetched from data nodes to the EE, it will
send results back to driver and to UI ( front end)
Hive Continuously in contact with Hadoop file system and its daemons via Execution engine. The dotted
arrow in the Job flow diagram shows the Execution engine communication with Hadoop daemons.
Hive - Data Types
This chapter takes you through the different data types in Hive, which are involved in the table creation.
All the data types in Hive are classified into four types, given as follows:
• Column Types
• Literals
• Null Values
• Complex Types
Column Types
Column type are used as column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data range exceeds the range
of INT, you need to use BIGINT and if the data range is smaller than the INT, you use SMALLINT.
TINYINT is smaller than SMALLINT.
The following table depicts various INT data types:
TINYINT Y 10Y
SMALLINT S 10S
INT - 10
BIGINT L 10L
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 129
Introduction to Big Data Analytics Dept.of IT
String Types
String type data types can be specified using single quotes (' ') or double quotes (" "). It contains two
data types: VARCHAR and CHAR. Hive follows C-types escape characters.
The following table depicts various CHAR data types:
VARCHAR 1 to 65355
CHAR 255
Time stamp
It supports traditional UNIX timestamp with optional nanosecond precision. It supports
java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format “yyyy-mm-dd
hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.
Decimals
The DECIMAL type in Hive is as same as Big Decimal format of Java. It is used for representing
immutable arbitrary precision. The syntax and example is as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an instance using create union. The
syntax and example is as follows:
UNIONTYPE<int,double, array<string>,struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 130
Introduction to Big Data Analytics Dept.of IT
{0:9}
{1:10.0}
Literals
The following literals are used in Hive:
Floating Point Types
Floating point types are nothing but numbers with decimal points. Generally, this type of data is
composed of DOUBLE data type.
Decimal Type
Decimal type data is nothing but floating point value with higher range than DOUBLE data type. The
range of decimal type is approximately -10-308 to 10308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type,data_type>
Structs
Structs in Hive is similar to using complex data with comment.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 131
Introduction to Big Data Analytics Dept.of IT
BIGINT floor(double a) It returns the maximum BIGINT value that is equal or less
than the double.
double rand(), rand(int seed) It returns a random number that changes from row to row.
string concat(string A, string B,...) It returns the string resulting from concatenating B after A.
string substr(string A, int start) It returns the substring of A starting from start position till
the end of string A.
string substr(string A, int start, int It returns the substring of A starting from start position
length) with the given length.
string upper(string A) It returns the string resulting from converting all characters
of A to upper case.
string lower(string A) It returns the string resulting from converting all characters
of B to lower case.
string trim(string A) It returns the string resulting from trimming spaces from
both ends of A.
string ltrim(string A) It returns the string resulting from trimming spaces from
the beginning (left hand side) of A.
string regexp_replace(string A, It returns the string resulting from replacing all substrings
string B, string C) in B that match the Java regular expression syntax with C.
value of <type> cast(<expr> as <type>) It converts the results of the expression expr to <type> e.g.
cast('1' as BIGINT) converts the string '1' to it integral
representation. A NULL is returned if the conversion does
not succeed.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 132
Introduction to Big Data Analytics Dept.of IT
string from_unixtime(int convert the number of seconds from Unix epoch (1970-01-
unixtime) 01 00:00:00 UTC) to a string representing the timestamp of
that moment in the current system time zone in the format
of "1970-01-01 00:00:00"
Int year(string date) It returns the year part of a date or a timestamp string:
year("1970-01-01 00:00:00") = 1970, year("1970-01-01")
= 1970
Int month(string date) It returns the month part of a date or a timestamp string:
month("1970-11-01 00:00:00") = 11, month("1970-11-01")
= 11
Int day(string date) It returns the day part of a date or a timestamp string:
day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1
string get_json_object(string It extracts json object from a json string based on json path
json_string, string path) specified, and returns json string of the extracted json object.
It returns NULL if the input json string is invalid.
Example
The following queries demonstrate some built-in functions:
round() function
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 133
Introduction to Big Data Analytics Dept.of IT
Aggregate Functions
Hive supports the following built-in aggregate functions. The usage of these functions is as same as the
SQL aggregate functions.
BIGINT count(*), count(expr), count(*) - Returns the total number of retrieved rows.
DOUBLE sum(col), It returns the sum of the elements in the group or the sum
sum(DISTINCT col) of the distinct values of the column in the group.
DOUBLE avg(col), It returns the average of the elements in the group or the
avg(DISTINCT col) average of the distinct values of the column in the group.
DOUBLE min(col) It returns the minimum value of the column in the group.
DOUBLE max(col) It returns the maximum value of the column in the group.
HiveDDLCommands–
TypesofDDLHiveCommands
BY DATAFLAIR TEAM · UPDATED · MARCH 4, 2020
Want to run Hive queries for creating, modifying, dropping, altering tables and databases?
In this article, we are going to learn Hive DDL commands. The article describes the Hive Data Definition
Language(DDL) commands for performing various operations like creating a table/databasein Hive,
dropping a table/database in Hive, altering a table/database in Hive, etc. There are many DDL commands.
This article will cover each DDL command individually, along with their syntax and examples.
For running Hive DDL commands, you must have Hive installed on your system.
Introductionto Hive DDLcommands
Hive DDL commands are the statements used for defining and changing the structure of a table or database
in Hive. It is used to build or modify the tables and other objects in the database.
The several types of Hive DDL commands are:
1. CREATE
2. SHOW
3. DESCRIBE
4. USE
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 134
Introduction to Big Data Analytics Dept.of IT
5. DROP
6. ALTER
7. TRUNCATE
Table-1 Hive DDL commands
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 135
Introduction to Big Data Analytics Dept.of IT
SHOW(DATABASES|SCHEMAS);
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 136
Introduction to Big Data Analytics Dept.of IT
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 137
Introduction to Big Data Analytics Dept.of IT
1. USE database_name;
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 138
Introduction to Big Data Analytics Dept.of IT
The default behavior is RESTRICT which means that the database is dropped only when it is empty. To
drop the database with tables, we can use CASCADE.
Syntax:
1. DROP(DATABASE|SCHEMA)[IF EXISTS]database_name[RESTRICT|CASCADE];
ALTER(DATABASE|SCHEMA)database_name
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 139
Introduction to Big Data Analytics Dept.of IT
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 140
Introduction to Big Data Analytics Dept.of IT
Note: The ALTER DATABASE … SET LOCATION statement does not move the database current
directory contents to the newly specified location. This statement does not change the locations associated
with any tables or partitions under the specified database. Instead, it changes the default parent-directory,
where new tables will be added for this database.
No other metadata associated with the database can be changed.
DDL Commands on Tables in Hive
CREATE TABLE
The CREATE TABLE statement in Hive is used to create a table with the given name. If a table or view
already exists with the same name, then the error is thrown. We can use IF NOT EXISTS to skip the
error.
Syntax:
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 141
Introduction to Big Data Analytics Dept.of IT
ROW FORMAT DELIMITED means we are telling the Hive that when it finds a new line character,
that means a new record.
FIELDS TERMINATED BY ‘,’ tells Hive what delimiter we are using in our files to separate each
column.
STORED AS TEXTFILE is to tell Hive what type of file to expect.
Don’t know about different Data Types supported by hive? Read Hive Data Types article.
2. SHOW TABLES in Hive
The SHOW TABLES statement in Hive lists all the base tables and views in the current database.
Syntax:
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 142
Introduction to Big Data Analytics Dept.of IT
1. DESCRIBE [EXTENDED|FORMATTED][db_name.]table_name[.col_name([.field_name])];
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 143
Introduction to Big Data Analytics Dept.of IT
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 144
Introduction to Big Data Analytics Dept.of IT
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 145
Introduction to Big Data Analytics Dept.of IT
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 146
Introduction to Big Data Analytics Dept.of IT
In this example, we are adding two columns ‘Emp_DOB’ and ‘Emp_Contact’ in the ‘Comp_Emp’ table
using the ALTER command.
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 147
Introduction to Big Data Analytics Dept.of IT
TRUNCATE TABLE
TRUNCATE TABLE statement in Hive removes all the rows from the table or partition.
Syntax:
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 148
Introduction to Big Data Analytics Dept.of IT
S
e
t 1. Discuss the following in detail
:
1 a. Conventional challenges in big data
b. Nature of Data
SET:2
3. Explain the stream model and Data stream management system architecture.
5. Write a short note on the following: (i) Counting distinct elements in a stream.
6. What are filters in Big Data? Explain Bloom Filter with example
Set: 3
b. Reducer class
c. Scaling out
4. Explain the map reduce data flow with single reduce and multiple reduce.
6. Define HDFS. Describe name node, data node and block. Explain HDFS operations in detail.
7. Write in detail the concept of developing the Map Reduce Application.
10. Discuss the various types of map reduce & its formats
Set: 4
(b) What are the different types of Hadoop configuration files? Discuss
6. How will you define commissioning new nodes and decommissioning old nodes?
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 150
Introduction to Big Data Analytics Dept.of IT
SET: 5
Mallareddy Engineering College For Women (Autonomous Institution, UGC-Govt. of India) 151