
UNIT V

Applications on Big Data Using Pig and Hive


Data processing operators in Pig
COGROUP/GROUP Groups the data in one or more relations.
The COGROUP operator groups together tuples that have the same group key (key field).
Example: A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = GROUP A BY age;

CROSS Computes the cross product of two or more relations


Example: Suppose A = {(1,2,3), (4,2,1)} and B = {(2,4), (8,9), (1,3)}.
X = CROSS A, B;
DUMP X;
(1,2,3,2,4)
(1,2,3,8,9)
(1,2,3,1,3)
(4,2,1,2,4)
(4,2,1,8,9)
(4,2,1,1,3)

DEFINE Assigns an alias to a UDF or streaming command.


Example: DEFINE CMD `perl PigStreaming.pl - nameMap` input(stdin using PigStreaming(','))
output(stdout using PigStreaming(','));
A = LOAD 'file';
B = STREAM A THROUGH CMD;

DISTINCT Removes duplicate tuples in a relation.


Example: Suppose A = {(8,3,4), (1,2,3), (4,3,3), (4,3,3), (1,2,3)}.
X = DISTINCT A;
DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)

FILTER Selects tuples from a relation based on some condition.


Example: Suppose A = {(1,2,3), (4,5,6), (7,8,9), (4,3,3), (8,4,3)}.
X = FILTER A BY f3 == 3;
DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)

FOREACH Generates transformation of data for each row as specified


Example: Suppose A = {(1,2,3), (4,2,5), (8,3,6)} with fields (a1, a2, a3).
X = FOREACH A GENERATE a1, a2;
DUMP X;
(1,2)
(4,2)
(8,3)

IMPORT Imports macros defined in a separate file.


Example: /* myscript.pig */
IMPORT 'my_macro.pig';

JOIN Performs an inner join of two or more relations based on common field values.
Example: Suppose A = {(1,2), (4,5)} and B = {(1,3), (1,2), (4,7)}.
X = JOIN A BY a1, B BY b1;
DUMP X;
(1,2,1,3)
(1,2,1,2)
(4,5,4,7)

LOAD Loads data from the file system.


Example: A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int);

MAPREDUCE Executes native MapReduce jobs inside a Pig script.


Example: A = LOAD 'WordcountInput.txt';
B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir'
AS (word:chararray, count:int) `org.myorg.WordCount inputDir outputDir`;

ORDER BY Sorts a relation based on one or more fields.


Example: A = LOAD 'mydata' AS (x: int, y: map[]);
B = ORDER A BY x;

SAMPLE Selects a random data sample from a relation, with the stated sample size. In the
example below, relation X will contain roughly 1% of the data in relation A.
Example: A = LOAD 'data' AS (f1:int,f2:int,f3:int);
X = SAMPLE A 0.01;

SPLIT Partitions a relation into two or more relations based on some expression.
Example: SPLIT input_var INTO output_var IF (field1 is not null), ignored_var IF (field1 is
null);

STORE Stores or saves results to the file system.


Example: STORE A INTO 'myoutput' USING PigStorage('*');
The stored output uses '*' as the field delimiter:
1*2*3
4*2*1

STREAM Sends data to an external script or program


Example: A = LOAD 'data';
B = STREAM A THROUGH `stream.pl -n 5`;
UNION Computes the union of two or more relations. (Does not preserve the order of tuples)
Example: Suppose A = {(1,2,3), (4,2,1)} and B = {(2,4), (8,9), (1,3)}.
X = UNION A, B;
DUMP X;
(1,2,3)
(4,2,1)
(2,4)
(8,9)
(1,3)

HIVE Services
The following are the services provided by Hive:
 Hive CLI: The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
 Hive Web User Interface: The Hive Web UI is just an alternative of Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
 Hive metastore: It is a central repository that stores all the structural information of
various tables and partitions in the warehouse. It also includes metadata for each column and its
type, the serializers and deserializers used to read and write data, and
the corresponding HDFS files where the data is stored.
 Hive Server: Also referred to as the Apache Thrift Server, it accepts requests from
different clients and forwards them to the Hive Driver.
 Hive Driver: It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
 Hive Compiler: The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements
into MapReduce jobs.

 Hive Execution Engine: The optimizer generates the logical plan in the form of a DAG of
MapReduce tasks and HDFS tasks. In the end, the execution engine executes the
incoming tasks in the order of their dependencies.
 MetaStore :
o Hive metastore (HMS) is a service that stores Apache Hive and other metadata in
a backend RDBMS, such as MySQL or PostgreSQL.
o Impala, Spark, Hive, and other services share the metastore.
o The connections to and from HMS include HiveServer, Ranger, and the
NameNode, which represents HDFS.
o Beeline, Hue, JDBC, and Impala shell clients make requests through thrift or
JDBC to HiveServer.
o The HiveServer instance reads/writes data to HMS.
o By default, redundant HMS operate in active/active mode.
o The physical data resides in a backend RDBMS, with one RDBMS serving HMS.
o All connections are routed to a single RDBMS service at any given time.
o HMS talks to the NameNode over thrift and functions as a client to HDFS.
o HMS connects directly to Ranger and the NameNode (HDFS), and so does
HiveServer.

HIVE QL

Hive Query Language (HiveQL) is a query language in Apache Hive for processing and
analyzing structured data. It shields users from the complexity of MapReduce
programming. It reuses common concepts from relational databases, such as tables, rows,
columns, and schema, to ease learning. Hive provides a CLI for writing Hive queries using Hive
Query Language (HiveQL).
Most interactions tend to take place over a command line interface (CLI). Generally, HiveQL
syntax is similar to the SQL syntax that most data analysts are familiar with. Hive supports four
file formats: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE (Record Columnar
File).
Hive uses the Derby database for single-user metadata storage; for multi-user or shared
metadata, Hive uses MySQL.
Features of Hive:
 Supported computing engines: Hive supports the MapReduce, Tez, and Spark computing
engines.
 Framework: Hive is a stable batch-processing framework built on top of the Hadoop
Distributed File System and can work as a data warehouse.
 Easy to code: Hive uses the Hive Query Language to query structured data, which is easy
to code; 100 lines of Java code used to query structured data can often be reduced to
about 4 lines of HQL.
 Declarative: HQL is a declarative language like SQL, meaning it is non-procedural.
 Table structure: The table structure is similar to that of an RDBMS; partitioning and
bucketing are also supported.
 Supported data structures: Partitions, buckets, and tables are the three data structures
that Hive supports.
 Supports ETL: Apache Hive supports ETL, i.e., Extract, Transform, and Load; before
Hive, Python was commonly used for ETL.
 Storage: Hive allows users to access files from HDFS, Apache HBase, Amazon S3, etc.
 Capable: Hive can process very large datasets, petabytes in size.
 Helps in processing unstructured data: Custom MapReduce code can easily be embedded
with Hive to process unstructured data.
 Drivers: JDBC/ODBC drivers are available for Hive.
 Fault tolerance: Since Hive data is stored on HDFS, fault tolerance is provided by
Hadoop.
 Areas of use: Hive can be used for data mining, predictive modeling, and document
indexing.

Querying Data in HIVE

Querying and analyzing data in Hive involves using Hive Query Language (HQL) to interact
with data stored in Hive tables. Hive is a data warehousing and SQL-like querying tool that
provides an SQL-like interface for querying and analyzing data stored in Hadoop Distributed
File System (HDFS) or other compatible storage systems. Here are the steps to query and
analyze data in Hive:

1. Data Ingestion:

 Data is typically ingested into Hive from various sources, including HDFS, external
databases, or data streams.

2. Data Definition:

 Define the schema of your data by creating Hive tables. You can specify the table name,
column names, data types, and storage format. Hive supports both structured and semi-
structured data.

Example:

CREATE TABLE employee (
  emp_id INT,
  emp_name STRING,
  emp_salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

3. Data Loading:

 Load data into Hive tables using the LOAD DATA command or by inserting data
directly.

Example:

LOAD DATA INPATH '/user/hadoop/employee_data.csv' INTO TABLE employee;

4. Querying Data:

 Use HQL to query data from Hive tables. You can write SQL-like queries to retrieve,
filter, and transform data.

Example:

SELECT emp_name, emp_salary FROM employee WHERE emp_salary > 50000;

5. Aggregations and Grouping:


 Hive supports aggregation functions (e.g., SUM, AVG, COUNT) and GROUP BY
clauses for summarizing data.

Example:

SELECT department, AVG(salary) AS avg_salary
FROM employee
GROUP BY department;

6. Joins:

 You can perform joins between Hive tables to combine data from multiple sources.

Example:

SELECT e.emp_name, d.department_name
FROM employee e
JOIN department d
  ON e.department_id = d.department_id;

7. Data Transformation:

 Hive allows you to transform and process data using user-defined functions (UDFs) and
built-in functions.

Example:

SELECT emp_name, UPPER(emp_name) AS uppercase_name
FROM employee;

8. Storing Results:

 You can store the results of queries in Hive tables for further analysis or reporting.

Example:

INSERT OVERWRITE TABLE high_salary_employees
SELECT emp_name, emp_salary FROM employee
WHERE emp_salary > 75000;

9. Running Queries:

 Submit Hive queries using the Hive command-line interface (CLI) or through Hive client
libraries and interfaces in programming languages like Python or Java.
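As an illustrative sketch of the client-library route in step 9 (not from the original text), the
Python snippet below uses the PyHive library to submit a HiveQL query to a HiveServer2
instance; the host, port, and database are assumed values:

from pyhive import hive

# Connect to a HiveServer2 instance (host and port are assumed values).
conn = hive.connect(host='localhost', port=10000, database='default')
cursor = conn.cursor()

# Run the same kind of HiveQL query shown in step 4.
cursor.execute('SELECT emp_name, emp_salary FROM employee WHERE emp_salary > 50000')

# Fetch and print the result rows.
for emp_name, emp_salary in cursor.fetchall():
    print(emp_name, emp_salary)

cursor.close()
conn.close()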
10. Monitoring and Optimization:

 Monitor query performance and optimize Hive queries by creating appropriate indexes,
partitions, and tuning configurations.

Fundamentals of HBase and ZooKeeper


HBase
HBase is a column-oriented non-relational database management system that runs on top
of Hadoop Distributed File System (HDFS), a main component of Apache Hadoop.

HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big
data use cases. It is well suited for real-time data processing or random read/write access to large
volumes of data.

Unlike relational database systems, HBase does not support a structured query language like
SQL; in fact, HBase isn't a relational data store at all. HBase applications are written in Java,
much like a typical Apache MapReduce application. HBase also supports writing applications
in Apache Avro, REST, and Thrift.

An HBase system is designed to scale linearly. It comprises a set of standard tables with rows
and columns, much like a traditional database. Each table must have an element defined as a
primary key, and all access attempts to HBase tables must use this primary key.

Features of HBASE

 Horizontally scalable: You can add any number of columns anytime.


 Automatic failover: Automatic failover is a capability that allows a system
administrator to automatically switch data handling to a standby system in the event
of system compromise.
 Integration with the MapReduce framework: All the commands and Java code
internally implement MapReduce to do the task, and it is built over the Hadoop
Distributed File System.
 It is a sparse, distributed, persistent, multidimensional sorted map, which is indexed by
row key, column key, and timestamp.
 Often referred to as a key-value store or column-family-oriented database, or as
storing versioned maps of maps.
 Fundamentally, it is a platform for storing and retrieving data with random access.
 It doesn't care about datatypes (you can store an integer in one row and a string in
another for the same column).
 It doesn't enforce relationships within your data.
 It is designed to run on a cluster of computers, built using commodity hardware.

 HBase is a data model similar to Google's Bigtable, designed to provide quick
random access to huge amounts of structured data. It leverages the fault tolerance
provided by the Hadoop File System (HDFS).
 It is a part of the Hadoop ecosystem that provides random real-time read/write access to
data in the Hadoop File System.
 One can store data in HDFS either directly or through HBase. Data consumers
read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop
File System and provides read and write access.

HBase is a column-oriented database, and the tables in it are sorted by row key. The table schema
defines only column families. A table can have multiple column
families, and each column family can have any number of columns. Subsequent column values
are stored contiguously on disk. Each cell value in the table has a timestamp. In short, in
HBase:
Table is a collection of rows.

Row is a collection of column families.


Column family is a collection of columns.
Column is a collection of key value pairs.
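To make the table/row/column-family model concrete, here is a small illustrative Python sketch
(not from the original text) using the happybase client library; it assumes an HBase Thrift
server on localhost and a hypothetical 'employee' table:

import happybase

# Connect to the HBase Thrift server (host is an assumed value).
connection = happybase.Connection('localhost')

# Create a table with a single column family named 'info'.
connection.create_table('employee', {'info': dict()})
table = connection.table('employee')

# A row is addressed by its row key; each cell is a
# 'columnfamily:qualifier' -> value pair, versioned by a timestamp.
table.put(b'emp-1001', {b'info:name': b'Alice', b'info:salary': b'75000'})

# Random read access by row key.
row = table.row(b'emp-1001')
print(row[b'info:name'])  # b'Alice'

connection.close()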
ZooKeeper
ZooKeeper is a distributed, open-source coordination service for distributed applications. It
exposes a simple set of primitives to implement higher-level services for synchronization,
configuration maintenance, and grouping and naming.
In a distributed system, there are multiple nodes or machines that need to communicate with each
other and coordinate their actions. ZooKeeper provides a way to ensure that these nodes are
aware of each other and can coordinate their actions. It does this by maintaining a hierarchical
tree of data nodes called "znodes", which can be used to store and retrieve data and maintain
state information. ZooKeeper provides a set of primitives, such as locks, barriers, and queues,
that can be used to coordinate the actions of nodes in a distributed system. It also provides
features such as leader election, failover, and recovery, which can help ensure that the system is
resilient to failures. ZooKeeper is widely used in distributed systems such as Hadoop, Kafka, and
HBase, and it has become an essential component of many distributed applications.

The ZooKeeper architecture consists of a hierarchy of nodes called znodes, organized in a tree-
like structure. Each znode can store data and has a set of permissions that control access to the
znode. The znodes are organized in a hierarchical namespace, similar to a file system. At the root
of the hierarchy is the root znode, and all other znodes are children of the root znode. The
hierarchy is similar to a file system hierarchy, where each znode can have children and
grandchildren, and so on.
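As an illustrative sketch (not from the original text), the Python snippet below uses the kazoo
client library to create a small tree of znodes and read them back; it assumes a ZooKeeper
server on 127.0.0.1:2181, and the paths are made up:

from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (the address is an assumed value).
zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# Create a hierarchy of znodes, much like directories in a file system.
zk.ensure_path('/app/config')
zk.create('/app/config/timeout', b'30')

# Read the data and version metadata stored in a znode.
data, stat = zk.get('/app/config/timeout')
print(data.decode(), stat.version)  # '30' 0

# List the children of a znode.
print(zk.get_children('/app/config'))  # ['timeout']

zk.stop()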

IBM InfoSphere BigInsights and Streams


At the end of 2011, IBM released InfoSphere BigInsights and InfoSphere Streams, software that
allows clients to quickly gain insight into the information streams relevant to their
business.
 BigInsights is a data-analysis platform that allows companies to turn complex, Internet-
scale data sets into knowledge. The platform includes an easy-to-install Apache Hadoop
distribution along with a set of related tools for application development, data transfer,
and cluster management. Thanks to its simplicity and
scalability, Hadoop, an open-source implementation of the MapReduce
infrastructure, has earned recognition in different industries and sciences. In addition to
Hadoop, the following open-source technologies are part of BigInsights (all of them,
except for Jaql, are Apache Software Foundation projects):
 Pig is a platform that includes a high-level language for describing programs that
analyze big data sets. Pig includes a compiler that translates Pig applications into
sequences of MapReduce tasks executed in the Hadoop environment.
 Hive is a data-warehousing solution built on top of the Hadoop
environment. It implements the familiar principles of relational databases (tables,
columns, partitions) and includes a set of SQL statements (HiveQL) for working in the
unstructured Hadoop environment. Hive queries are compiled into
MapReduce tasks executed in the Hadoop environment.
 Jaql is a query language with an SQL-like interface, developed by IBM and
intended for JavaScript Object Notation (JSON). Jaql handles nesting well, is
highly function-oriented, and is extremely flexible. The language is well suited for
working with loosely structured data; it also serves as an interface for HBase column
storage and is used for text analysis.
 HBase is a column-oriented, non-SQL data storage environment intended to
support large, sparsely populated tables in Hadoop.
 Flume is a distributed, reliable, and available service intended for efficiently moving
large volumes of generated data. Flume is well suited for collecting event logs
from several systems and moving them into the Hadoop Distributed
File System (HDFS) as they are generated.
 Lucene is a search-engine library providing high-performance, full-text
search.
 Avro is a data-serialization technology that uses JSON to define
data types and protocols, and stores data in a compact binary format.
 ZooKeeper is a centralized service intended for maintaining configuration
information and naming; it provides distributed synchronization and group services.
 Oozie is a workflow scheduling system intended for organizing and
managing Apache Hadoop job execution.

Visual Data Analysis Techniques


 Big data visualization is, as the name suggests, a visual representation of big data.
Visualization techniques vary depending on the goal of the illustration. It could be as
simple as line charts, histograms, and pie charts, or a bit more complex like scatter
plots, heat maps, tree maps, etc. Big data can also be visualized in 3-dimensional graphs,
depending on the use case.
 The increased popularity of big data and data analysis projects has made visualization
more important than ever. Companies are increasingly using machine learning to gather
massive amounts of data that can be difficult and slow to sort through, comprehend and
explain. Visualization offers a way to speed up the process and present information to
stakeholders in ways they can understand.
 Big data visualization often goes beyond the typical techniques used in normal
visualization, such as pie charts, histograms and graphs. Data visualization can provide
more complex representations, such as heat maps and fever charts.
Data Visualization Techniques
 Once you‘ve analyzed your data, you'll be ready to choose which data visualization
techniques to use.
 Charts and graphs are a basic technique for visualizing data. They‘re easy to use and
immediately recognizable to the people reading your reports, and they offer an effective
way to compare variables.
 Interactive dashboards display multiple data sets for a comprehensive picture of a
business. You could use them to evaluate multiple metrics in one location. Financial
data scientists might use an interactive dashboard to show total sales, the average sale per
customer, the return on investment for marketing spend, and the average value per lead.
 Geographic visualization lets you review data on a map. You may use it to compare
economic conditions across the country. This tool, for example, could help you
understand the hottest geographical customer bases for a given company.
 Network visualization displays the links between various data points. This is a good tool
for regression and correlation analysis, as it highlights the connections between your
data. Correlation examines the degree to which two pieces of data are related, while
regression typically examines the relationship between two variables in more detail. For
example, if you are evaluating seasonal sales volume, you might learn that sales rise
during a certain time of year with correlation analysis, and you could use regression
analysis to find out why.
Types of Big Data Visualization
 1. Line charts
A line chart, also called a line graph or line plot, is a common chart. It is used to represent
changes in one variable against another, typically time. The data points are connected
by lines. It is used for identifying trends and relationships between two variables, for
example, the sales numbers of three employees over time.
 2. Histograms
A histogram is used to represent the frequency distribution of data. It groups data into
logical ranges and depicts the count of how many data points fall into each of those
ranges. It allows one to understand the nature of frequency distributions. The distribution
may be categorized as symmetric, right-skewed, or left-skewed. For example, a histogram
could show how many people fall into each range of ages.
 3. Bar chart
A bar chart, also called a bar graph, is used for depicting categorical data with rectangular
strips/bars. The length of the bars shows the value or quantity of a variable. The bars
might be vertical or horizontal. For example, a bar chart could show how many people
like each kind of movie.
 4. Pie charts
A pie chart depicts information in the form of "pie slices". The slices are in
proportion to the relative sizes of the data. The bar-chart example above could also be
represented as a pie chart.
 5. Heat maps
A heat map uses a two-dimensional representation of data in which colors represent
values or ranges. It provides a quick visual summary of information, for example, a heat
map of temperature variation across a year in four US cities.
 6. Scatter plot
A scatter plot uses dots/points to show values for numeric variables. The position of the
dots against both axes indicates the value of that particular data point, for example, tree
height plotted against the girth of the stem.
 7. Tree map
This type of chart represents hierarchical data in the format of nested rectangles. The size
and color of a rectangle represent the value of that category or variable. It helps to depict
part-to-whole relationships in a complex data set.
 8. Word cloud
A word cloud or tag cloud is a representation of word frequency in a data set. The larger
a word appears, the higher the frequency of that word. This is used for textual data
analysis and summarization, for example, a word cloud of jargon commonly used in
the big data industry.
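As an illustrative sketch (the data is made up), the Python snippet below uses the matplotlib
library to produce two of the chart types described above, a line chart and a histogram:

import matplotlib.pyplot as plt

# Line chart: monthly sales for two employees (made-up data).
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
sales_alice = [12, 15, 11, 18, 20, 17]
sales_bob = [9, 10, 14, 13, 16, 19]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(months, sales_alice, marker='o', label='Alice')
ax1.plot(months, sales_bob, marker='o', label='Bob')
ax1.set_title('Monthly sales (line chart)')
ax1.legend()

# Histogram: frequency distribution of ages (made-up data).
ages = [22, 25, 25, 27, 31, 33, 35, 35, 36, 41, 44, 47, 52, 58, 61]
ax2.hist(ages, bins=5, edgecolor='black')
ax2.set_title('Age distribution (histogram)')
ax2.set_xlabel('Age')
ax2.set_ylabel('Count')

plt.tight_layout()
plt.show()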
One can also define big data visualization categories in the following manner.
 Temporal
It is a representation of data against a time period. For example, Gantt charts, timelines, etc.
 Hierarchical
It represents data in tree format. One root node at the top and branches originating from
the root. For example, tree map, flow charts.
 Network
It is used when one wants to show connections between various unrelated data sets. Word
cloud and matrix charts are examples of network type of visualization.
 Geospatial
Geospatial is a special category in which location data is one of the variables. The
variables are plotted against the location variable. Demographic charts, density maps are
examples of this category.

Interaction Techniques
Data interaction techniques refer to various methods and functionalities that allow users
to engage with and manipulate data effectively. They enable users to explore, analyze,
and understand data through visualization, filtering, sorting, searching, drill-down,
grouping, interactive dashboards, tooltips, linked data, collaboration features, interactive
documentation, natural language querying, and predictive analytics. These techniques are
crucial as they empower users to extract insights, identify patterns, make data-driven
decisions, and communicate findings more comprehensively. By providing intuitive and
interactive ways to interact with data, these techniques enhance data comprehension,
facilitate efficient analysis, and enable users to derive maximum value from their data
assets.

Visualization:

Presenting data through charts, graphs, maps, and other visual representations helps users
comprehend complex information quickly and identify patterns, trends, and outliers.

Filtering:

Allowing users to filter data based on specific criteria enables them to focus on relevant subsets,
reducing noise and facilitating analysis.
Sorting:

Providing options to sort data by various attributes (e.g., alphabetical order, numerical order,
date) helps users organize and explore data in a meaningful way.

Searching:

Implementing search functionality allows users to locate specific data points or patterns within a
larger dataset efficiently.

Drill-down:

Enabling users to drill down into aggregated data to access more detailed information helps them
understand the underlying factors and explore data at different levels of granularity.

Data grouping:

Grouping data based on specific attributes (e.g., categories, time periods) allows users to
examine data in meaningful clusters, facilitating analysis and comparison.

Interactive dashboards:

Designing interactive dashboards with configurable widgets and controls empowers users to
customize data views, explore different dimensions, and interact with data in real-time.

Tooltips and data labels:

Providing contextual information through tooltips or data labels on visualizations helps users
understand specific data points and their significance.

Linked data:

Establishing links between related datasets or integrating data from multiple sources helps users
gain comprehensive insights by exploring connections and correlations.

Collaboration features:

Enabling users to annotate, comment, and share data with others fosters collaboration, allowing
multiple perspectives and insights to contribute to a comprehensive understanding of the data.

Interactive documentation:

Creating interactive documentation or tutorials that guide users through data analysis techniques
and explain the underlying concepts helps users gain a deeper understanding of the data and its
interpretation.

Natural language querying:

Implementing natural language querying capabilities allows users to interact with data using
everyday language, making it more accessible and intuitive.
Predictive analytics:

Incorporating predictive models and machine learning algorithms enables users to leverage data
to generate forecasts, recommendations, or what-if scenarios, enhancing their understanding and
decision-making.

These techniques can be applied individually or in combination, depending on the nature of the
data and the specific requirements of the users.
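As a small combined sketch (with made-up data, not from the original text), the Python snippet
below uses the pandas library to apply three of the techniques above, filtering, sorting, and
grouping, to a toy sales table:

import pandas as pd

# A toy sales dataset (made-up data).
df = pd.DataFrame({
    'region': ['West', 'East', 'West', 'South', 'East', 'West'],
    'product': ['A', 'A', 'B', 'B', 'A', 'A'],
    'sales': [120, 95, 210, 80, 150, 60],
})

# Filtering: focus on a relevant subset of the data.
west = df[df['region'] == 'West']

# Sorting: organize rows by sales, highest first.
print(west.sort_values('sales', ascending=False))

# Grouping (a simple drill-down): average sales per region.
print(df.groupby('region')['sales'].mean())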

Systems and Applications

Tableau (and Tableau Public)


Tableau has a variety of options available, including a desktop app, server and hosted online
versions, and a free public option. There are hundreds of data import options available, from
CSV files to Google Ads and Analytics data to Salesforce data.
Output options include multiple chart formats as well as mapping capability. That means
designers can create color-coded maps that showcase geographically important data in a format
that‘s much easier to digest than a table or chart could ever be.
The public version of Tableau is free to use for anyone looking for a powerful way to create data
visualizations that can be used in a variety of settings. From journalists to political junkies to
those who just want to quantify the data of their own lives, there are tons of potential uses for
Tableau Public. They have an extensive gallery of infographics and visualizations that have been
created with the public version to serve as inspiration for those who are interested in creating
their own.
Pros
 Hundreds of data import options
 Mapping capability
 Free public version available
 Lots of video tutorials to walk you through how to use Tableau
Cons
 Non-free versions are expensive ($70/month/user for the Tableau Creator software)
 Public version doesn‘t allow you to keep data analyses private
Infogram
Infogram is a fully-featured drag-and-drop visualization tool that allows even non-designers to
create effective visualizations of data for marketing reports, infographics, social media posts,
maps, dashboards, and more.
Finished visualizations can be exported into a number of formats: .PNG, .JPG, .GIF, .PDF, and
.HTML. Interactive visualizations are also possible, perfect for embedding into websites or apps.
Infogram also offers a WordPress plugin that makes embedding visualizations even easier for
WordPress users.
Pros
 Tiered pricing, including a free plan with basic features
 Includes 35+ chart types and 550+ map types
 Drag and drop editor
 API for importing additional data sources
Cons
 Significantly fewer built-in data sources than some other apps
ChartBlocks
ChartBlocks claims that data can be imported from "anywhere" using their API, including from
live feeds. While they say that importing data from any source can be done in "just a few clicks,"
it's bound to be more complex than other apps that have automated modules or extensions for
specific data sources.
The app allows for extensive customization of the final visualization created, and the chart
building wizard helps users pick exactly the right data for their charts before importing the data.
Designers can create virtually any kind of chart, and the output is responsive—a big advantage
for data visualization designers who want to embed charts into websites that are likely to be
viewed on a variety of devices.
Pros
 Free and reasonably priced paid plans are available
 Easy to use wizard for importing the necessary data
Cons
 Unclear how robust their API is
 Doesn‘t appear to have any mapping capability
Datawrapper
Datawrapper was created specifically for adding charts and maps to news stories. The charts and
maps created are interactive and made for embedding on news websites. Their data sources are
limited, though, with the primary method being copying and pasting data into the tool.
Once data is imported, charts can be created with a single click. Their visualization types include
column, line, and bar charts, election donuts, area charts, scatter plots, choropleth and symbol
maps, and locator maps, among others. The finished visualizations are reminiscent of those seen
on sites like the New York Times or Boston Globe. In fact, their charts are used by publications
like Mother Jones, Fortune, and The Times.
The free plan is perfect for embedding graphics on smaller sites with limited traffic, but paid
plans are on the expensive side, starting at $39/month.
Pros
 Specifically designed for newsroom data visualization
 Free plan is a good fit for smaller sites
 Tool includes a built-in color blindness checker
Cons
 Limited data sources
 Paid plans are on the expensive side
D3.js
D3.js is a JavaScript library for manipulating documents using data. D3.js requires at least some
JS knowledge, though there are apps out there that allow non-programming users to utilize the
library.
Those apps include NVD3, which offers reusable charts for D3.js; Plotly‘s Chart Studio, which
also allows designers to create WebGL and other charts; and Ember Charts, which also uses the
Ember.js framework.
Pros
 Very powerful and customizable
 Huge number of chart types possible
 A focus on web standards
 Tools available to let non-programmers create visualizations
 Free and open source
Cons
 Requires programming knowledge to use alone
 Less support available than with paid tools
Google Charts
Google Charts is a powerful, free data visualization tool that is specifically for creating
interactive charts for embedding online. It works with dynamic data and the outputs are based
purely on HTML5 and SVG, so they work in browsers without the use of additional plugins.
Data sources include Google Spreadsheets, Google Fusion Tables, Salesforce, and other SQL
databases.
There are a variety of chart types, including maps, scatter charts, column and bar charts,
histograms, area charts, pie charts, treemaps, timelines, gauges, and many others. These charts
can be customized completely, via simple CSS editing.
Pros
 Free
 Wide variety of chart formats available
 Cross-browser compatible since it uses HTML5/SVG
 Works with dynamic data
Cons
 Beyond the tutorials and forum available, there‘s limited support
FusionCharts
FusionCharts is another JavaScript-based option for creating web and mobile dashboards. It
includes over 150 chart types and 1,000 map types. It can integrate with popular JS frameworks
(including React, jQuery, Ember, and Angular) as well as with server-side programming
languages (including PHP, Java, Django, and Ruby on Rails).
FusionCharts gives ready-to-use code for all of the chart and map variations, making it easier to
embed in websites even for those designers with limited programming knowledge. Because
FusionCharts is aimed at creating dashboards rather than just straightforward data visualizations,
it's one of the most expensive options included in this article. But it's also one of the most
powerful.
Pros
 Huge number of chart and map format options
 More features than most other visualization tools
 Integrates with a number of different frameworks and programming languages
Cons
 Expensive (starts at almost $500 for one developer license)
 Overkill for simple visualizations outside of a dashboard environment
Chart.js
Chart.js is a simple but flexible JavaScript charting library. It‘s open source, provides a good
variety of chart types (eight total), and allows for animation and interaction.
Chart.js uses HTML5 Canvas for output, so it renders charts well across all modern browsers.
Charts created are also responsive, so it‘s great for creating visualizations that are mobile-
friendly.
Pros
 Free and open source
 Responsive and cross-browser compatible output
Cons
 Very limited chart types compared to other tools
 Limited support outside of the official documentation
Grafana
Grafana is open-source visualization software that lets users create dynamic dashboards and
other visualizations. It supports mixed data sources, annotations, and customizable alert
functions, and it can be extended via hundreds of available plugins. That makes it one of the
most powerful visualization tools available.
Export functions allow designers to share snapshots of dashboards as well as invite other users to
collaborate. Grafana supports over 50 data sources via plugins. It‘s free to download, or there‘s a
cloud-hosted version for $49/month. (There‘s also a very limited free hosted version.) The
downloadable version also has support plans available, something a lot of other open-source
tools don‘t offer.
Pros
 Open source, with free and paid options available
 Large selection of data sources available
 Variety of chart types available
 Makes creating dynamic dashboards simple
 Can work with mixed data feeds
Cons
 Overkill for creating simple visualizations
 Doesn‘t offer as many visual customization options as some other tools
 Not the best option for creating visualization images
 Not able to embed dashboards in websites, though possible for individual panels
Chartist.js
Chartist.js is a free, open-source JavaScript library that allows for creating simple responsive
charts that are highly customizable and cross-browser compatible. The entire JavaScript library
is only 10KB when GZIPped. Charts created with Chartist.js can also be animated, and plugins
allow it to be extended.
Pros
 Free and open source
 Tiny file size
 Charts can be animated
Cons
 Not the widest selection of chart types available
 No mapping capabilities
 Limited support outside of developer community
Sigmajs
Sigmajs is a single-purpose visualization tool for creating network graphs. It‘s highly
customizable but does require some basic JavaScript knowledge in order to use. Graphs created
are embeddable, interactive, and responsive.
Pros
 Highly customizable and extensible
 Free and open source
 Easy to embed graphs in websites and apps
Cons
 Only creates one type of visualization: network graphs
 Requires JS knowledge to customize and implement
Polymaps
Polymaps is a dedicated JavaScript library for mapping. The outputs are dynamic, responsive
maps in a variety of styles, from image overlays to symbol maps to density maps. It uses SVG to
create the images, so designers can use CSS to customize the visuals of their maps.
Pros
 Free and open source
 Built specifically for mapping
 Easy to embed maps in websites and apps
Cons
 Only creates one type of visualization
 Requires some coding knowledge to customize and implement

Applications
1. Business Intelligence

Business intelligence utilizes data visualization to gather, analyze, and interpret data for
informed decision-making. It involves running various analyses such as sales performance,
market segmentation, and financial forecasting. For example, a company can use data
visualization to analyze sales data across different regions and product categories to identify
the best performing regions and products, enabling them to allocate resources effectively and
optimize their sales strategies.

2. Finance Industries

Data visualization in the finance industry helps professionals analyze financial data, detect
trends, and make informed decisions. It enables them to run analyses such as revenue and
expense tracking, cash flow analysis, and portfolio performance evaluation. For example,
financial analysts can use data visualization to track revenue growth over time, identify
seasonal patterns, and compare performance across different product lines, allowing them to
make strategic decisions and optimize financial strategies accordingly.
3. E-commerce

In the e-commerce industry, data visualization aids in understanding customer behavior,


optimizing marketing campaigns, and enhancing personalized recommendations. Analysis can
include customer segmentation, purchase patterns, and conversion rates. For instance, e-
commerce companies can use data visualization to analyze customer browsing and purchasing
data to identify customer segments and target them with tailored marketing campaigns,
resulting in improved conversion rates and customer satisfaction.

4. Education

In the education industry, data visualization facilitates tracking student performance,


identifying learning outcomes, and informing pedagogical decisions. Analysis can include
student achievement, learning progress, and assessment results. For example, educational
institutions can use data visualization to analyze student test scores over time, identify areas
where students may be struggling, and adjust teaching strategies accordingly to improve
learning outcomes and academic success.

5. Data Science

Data visualization is essential in the field of data science, enabling professionals to extract
insights from complex datasets and communicate findings effectively. Analyses can include
exploratory data analysis, pattern recognition, and model evaluation. For example, data
scientists can use visualizations to analyze customer behavior data, identify patterns in
purchasing habits, and build predictive models to recommend personalized products, leading
to increased customer satisfaction and sales revenue.

6. Military

In the military sector, data visualization plays a critical role in enhancing decision-making
capabilities and situational awareness. Analyses can include intelligence data visualization,
operational analytics, and real-time tracking. For example, military commanders can use data
visualization to track and analyze troop movements, monitor supply chains, and visualize
enemy positions on a map, enabling them to make strategic decisions and respond effectively
to changing circumstances in the battlefield.

7. Healthcare Industries

Here, data visualization supports analyzing patient data, identifying trends, and improving
healthcare outcomes. Analysis can include patient monitoring, disease tracking, and resource
allocation. For example, healthcare providers can use data visualization to track the spread of
infectious diseases, visualize patient vital signs over time, and identify high-risk areas or
populations, allowing for proactive interventions and effective allocation of healthcare
resources.

8. Marketing

In the marketing industry, data visualization enables professionals to analyze campaign


performance, customer segmentation, and market trends for effective decision-making.
Analysis can include campaign ROI, customer behavior, and market share. For example,
marketers can use data visualization to track and visualize the effectiveness of different
marketing channels, identify target audience segments, and analyze customer journey data to
optimize marketing strategies and improve overall campaign performance.

9. Real Estate Business

In the real estate industry, data visualization helps professionals analyze property data, market
trends, and investment opportunities. Analysis can include property prices, rental rates, and
market comparisons. For example, real estate agents can use data visualization to analyze
historical property prices in a specific neighborhood, visualize market trends over time, and
identify areas with high potential for investment, assisting clients in making informed
decisions and maximizing their returns on real estate investments.

10. Food Delivery Apps

Food delivery apps utilize data visualization to optimize logistics, reduce delivery times, and
enhance overall efficiency. Analysis can include order volumes, delivery routes, and service
metrics. For example, food delivery apps can use data visualization to analyze delivery data in
real-time, visualize order volumes during peak hours, and optimize delivery routes to ensure
timely and efficient delivery, resulting in improved customer satisfaction and operational
efficiency.
