Big Data Unit-5

UNIT-5

Big Data
APPLICATIONS ON BIG DATA USING PIG,
HIVE AND HBASE
• Pig:
• ETL (Extract, Transform, Load): Pig is often used for data preparation tasks
such as cleaning, transforming, and aggregating large datasets before they are
loaded into a data warehouse or processed further.
• Data Processing Pipelines: Pig enables the creation of complex data
processing pipelines using its scripting language, Pig Latin. These pipelines can
handle large volumes of data efficiently.
• Data Analysis: Pig can be used for exploratory data analysis tasks, allowing
analysts to quickly prototype and test data processing workflows.
• Hive:
• Data Warehousing: Hive provides a SQL-like interface (HiveQL) for querying
and analyzing data stored in Hadoop's distributed file system (HDFS). It is
commonly used for creating data warehouses and data lakes.
• Ad Hoc Queries: Analysts and data scientists can use Hive to run ad hoc
queries on large datasets stored in HDFS, without needing to know complex
MapReduce programming.
• Batch Processing: Hive supports batch processing of data, making it suitable
for tasks like log analysis, data mining, and reporting.
• HBase:
• Real-time Data Storage: HBase is a distributed, scalable, and column-oriented
database built on top of Hadoop. It is optimized for storing and retrieving
large volumes of data in real-time.
• NoSQL Database: HBase is commonly used as a NoSQL database for
applications that require low-latency data access and flexible schema design.
• Time-Series Data Storage: HBase is well-suited for storing time-series data
such as sensor readings, clickstream data, and social media interactions,
where data needs to be stored and retrieved based on timestamps.
• Applications that leverage all three technologies together might look
something like this:
• Data Ingestion: Use Pig to preprocess raw data, clean it, and transform it into
a structured format.
• Data Storage: Store the processed data in HDFS.
• Data Querying and Analysis: Use Hive to create tables and run SQL-like
queries on the data stored in HDFS.
• Real-Time Access: Store frequently accessed or real-time data in HBase for
fast retrieval.
• Analytics: Perform complex analytics and machine learning tasks on the data
using tools like Spark or MapReduce, with data sourced from HDFS or HBase.
PIG
• Pig is a high-level platform that allows developers to create complex data
transformations using a high-level language called Pig Latin, which is then converted into
a series of MapReduce jobs to be executed on Hadoop.
• Pig represents big data processing as data flows.
• Pig is used to process large datasets.
• First, to process data stored in HDFS, programmers write scripts in the Pig Latin
language.
• Internally, the Pig Engine (a component of Apache Pig) converts these scripts into
map and reduce tasks.
• These tasks are not visible to programmers, which provides a high level of
abstraction.
• In MapReduce mode, the result of a Pig job is stored in HDFS; in local mode it is
written to the local file system.
• Programmers can use Pig to write data transformations without knowing Java. Pig accepts
both structured and unstructured data as input for analytics.
FEATURES OF PIG
• Pig Latin: Pig Latin is a dataflow scripting language used to express data transformations.
It provides a rich set of operators and functions for manipulating structured and semi-
structured data.
• Optimization: Pig automatically optimizes Pig Latin scripts to improve performance by
optimizing the execution plan, minimizing data movement, and parallelizing
computations whenever possible.
• Extensibility: Pig is designed to be extensible, allowing developers to create custom user-
defined functions (UDFs) in Java, Python, or other languages to perform specialized
processing tasks.
• Integration with Hadoop: Pig seamlessly integrates with the Hadoop ecosystem,
allowing it to read and write data from and to Hadoop's distributed file system (HDFS)
and process data stored in HDFS using MapReduce.
• Ease of Use: Pig's scripting language is designed to be intuitive and easy to learn for
developers familiar with SQL, scripting languages, or data processing concepts.
• For performing several operations Apache Pig provides rich sets of
operators like the filtering, joining, sorting, aggregation etc.
• Join operation is easy in Apache Pig.
• Fewer lines of code.
• Apache Pig allows splits in the pipeline.
• By integrating with other components of the Apache Hadoop
ecosystem, such as Apache Hive, Apache Spark, and Apache
ZooKeeper, Apache Pig enables users to take advantage of these
components’ capabilities while transforming data.
• The data structure is multivalued, nested, and richer.
EXECUTION MODES OF PIG
• Local Mode:
• In Local Mode, Pig runs on a single machine without using Hadoop's
distributed processing capabilities.
• It is useful for testing and debugging Pig scripts on small datasets, as it
provides faster execution and a simpler development environment.
• Local Mode is not suitable for processing large datasets since it does not take
advantage of Hadoop's scalability.
• In the local mode, the Pig engine takes input from the Linux file system and
the output is stored in the same file system.
• MapReduce Mode:
• MapReduce Mode is the default execution mode for Pig.
• In this mode, Pig scripts are translated into MapReduce jobs, which are then
executed on a Hadoop cluster.
• MapReduce Mode leverages Hadoop's distributed processing framework to
process large volumes of data in parallel across multiple nodes in the cluster.
• It is suitable for processing large-scale datasets stored in Hadoop's distributed
file system (HDFS) and provides scalability and fault tolerance.
PIG V/S MAPREDUCE
Apache Pig | MapReduce
It is a scripting language (Pig Latin). | It is a programming model, usually coded in a compiled language such as Java.
Abstraction is at a higher level. | Abstraction is at a lower level.
It needs fewer lines of code than MapReduce. | It needs more lines of code.
Less development effort is needed. | More development effort is required.
Code efficiency is lower than that of hand-written MapReduce. | Code efficiency is higher than that of Pig.
It provides built-in operators for ordering, sorting, and union. | Such data operations must be coded by hand.
It allows nested data types such as map, tuple, and bag. | It has no built-in nested data types.
PIG V/S SQL
Difference | Pig | SQL
Definition | Pig is a scripting language used to interact with HDFS. | SQL is a query language used to interact with databases residing in the database engine.
Query Style | Pig offers a step-by-step execution style. | SQL offers a single-block execution style.
Evaluation | Pig does lazy evaluation, which means that data is processed only when the STORE or DUMP command is encountered. | SQL offers immediate evaluation of a query.
Pipeline Splits | Pipeline splits are supported in Pig. | In SQL, you need to run the "join" command twice for the result to be materialized as an intermediate result.
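The lazy evaluation described in the table can be imitated with Python generators. This is only an analogy under stated assumptions, not how the Pig engine is implemented: the function names and the sample numbers are hypothetical, and building the list at the end plays the role of DUMP.

```python
# Analogy for Pig's lazy evaluation using Python generators.
# All names and data here are illustrative, not part of Pig.
log = []  # records every row that is actually read


def load(rows):
    """Stands in for LOAD: yields rows one by one, logging each read."""
    for row in rows:
        log.append(f"read {row}")
        yield row


def keep_even(rows):
    """Stands in for FILTER: passes through only the even numbers."""
    for row in rows:
        if row % 2 == 0:
            yield row


# Defining the pipeline does no work yet, like entering LOAD and FILTER in Grunt.
pipeline = keep_even(load([1, 2, 3, 4]))
reads_before_dump = len(log)

# Consuming the pipeline triggers the work, like DUMP or STORE.
result = list(pipeline)
reads_after_dump = len(log)
```

No row is read while the pipeline is being defined; all four reads happen only when the result is requested.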
TYPES OF DATA MODELS IN APACHE PIG
The Pig Latin data model consists of four data types, all held inside an outer structure called a relation:
• Atom: An atomic (single) data value. It is stored as a string and can be used
either as a number or as a string.
• Tuple: An ordered sequence of fields (pieces of data), each of which can be of any data
type.
• Bag: A collection of tuples, which may have differing structures and may
contain duplicates.
• Map: A set of key/value pairs.
• Relation: The outermost structure of the Pig Latin data model, a bag of tuples. It is
analogous to a table in a database or a dataset in other data processing
frameworks. It represents a structured collection of data, organized into
rows and columns.
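As a rough analogy, the data types above can be mirrored with built-in Python types. This is a sketch for intuition only; Pig has its own type system, and every name and value below is hypothetical.

```python
# Pig's data model mirrored with Python built-ins (illustrative analogy only).

# Atoms: single values such as a chararray or an int.
name = "asha"
age = 30

# Tuple: an ordered sequence of fields.
person = (name, age)

# Bag: a collection of tuples; duplicates and differing lengths are allowed.
visits = [("home", 3), ("cart", 1), ("home", 3), ("search",)]

# Map: key/value pairs.
profile = {"city": "Pune", "plan": "free"}

# Relation: the outermost bag; here each tuple nests a bag and a map.
users = [
    ("asha", 30, visits, profile),
    ("ben", 41, [], {}),
]

duplicate_count = visits.count(("home", 3))  # bags may hold duplicates
first_user_city = users[0][3]["city"]        # reach into the nested map
```

The nesting in `users` is what the slides mean by a "multivalued, nested, and richer" data structure: a field of a tuple can itself be a bag or a map.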
PIG LATIN – RELATIONAL OPERATIONS
OPERATOR DESCRIPTION
Loading and Storing

LOAD To Load the data from the file system (local/HDFS) into a relation.

STORE To save a relation to the file system (local/HDFS).


Filtering
FILTER To remove unwanted rows from a relation.
DISTINCT To remove duplicate rows from a relation.
FOREACH, GENERATE To generate data transformations based on columns of data.
STREAM To transform a relation using an external program.
Grouping and Joining
JOIN To join two or more relations.
COGROUP To group the data in two or more relations.
GROUP To group the data in a single relation.
CROSS To create the cross product of two or more relations.
Sorting
ORDER To arrange a relation in a sorted order based on one or more fields (ascending or descending).
LIMIT To get a limited number of tuples from a relation.
Combining and Splitting
UNION To combine two or more relations into a single relation.
SPLIT To split a single relation into two or more relations.
Diagnostic Operators
DUMP To print the contents of a relation on the console.
DESCRIBE To describe the schema of a relation.
EXPLAIN To view the logical, physical, or MapReduce execution plans to compute a relation.
ILLUSTRATE To view the step-by-step execution of a series of statements.
PIG LATIN- STATEMENTS
• Statements refer to individual commands or instructions that you write to manipulate
data. These statements are written in Pig Latin script and are executed sequentially by
the Pig Latin interpreter.
• Every statement ends with a semicolon (;).
• Except LOAD and STORE, while performing all other operations, Pig Latin statements take
a relation as input and produce another relation as output.
• As soon as you enter a LOAD statement in the Grunt shell, only its semantic checking is
carried out. To see the contents of the relation, you need to use the DUMP operator. Only
when DUMP (or STORE) is executed does the MapReduce job that actually reads the data
run.
• There are various types of statements in Pig Latin, including:
• Load Statements: Used to load data from external sources into Pig relations.
• Store Statements: Used to store the results of Pig operations into external storage systems.
• Transformations: These include statements for data manipulation such as filtering, grouping,
joining, and sorting.
• Declarations: Statements for declaring variables and defining aliases.
• Comments: Lines beginning with "--" are comments and are ignored by the Pig interpreter. They
are used for documentation or annotating the script.
GRUNT
• "Grunt" refers to the interactive shell or command-line interface provided by Pig.
It allows users to interactively write and execute Pig Latin scripts, as well as
perform ad hoc data exploration and testing.
• When you start Pig in interactive mode by running the ‘pig’ command, you enter
the Grunt shell, indicated by the ‘grunt>’ prompt. From there, you can enter Pig
Latin commands directly, execute scripts, and view the results interactively.
• The Grunt shell provides various features and commands for managing Pig
sessions and interacting with the Hadoop cluster. Some common commands
include:
• run: Executes a Pig script file.
• explain: Displays the logical, physical, and MapReduce execution plans for a relation.
• describe: Shows the schema of a relation.
• dump: Outputs the contents of a relation to the console.
• illustrate: Shows the step-by-step execution of a series of statements on sample data.
• Additionally, Grunt supports various shell commands and shortcuts
for managing files, navigating directories, and interacting with the
underlying operating system.
• Grunt is particularly useful for interactive data exploration, debugging
Pig scripts, and quickly testing data transformations before
incorporating them into larger workflows or production pipelines. It
provides a convenient and flexible environment for working with Pig
and Hadoop without the need for writing and executing separate
scripts.
PIG EXAMPLE
• Use case: Using Pig to find the most occurred start letter.
• Solution:
• Case 1: Load the data into a bag named "lines". Each entire line is assigned
to the field line of type chararray (character array).
• grunt> lines = LOAD '/user/Desktop/data.txt' AS (line:chararray);
• Case 2: The text in the bag lines needs to be tokenized; this produces
one word per row.
• grunt> tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token:chararray;
• Case 3: To retain the first letter of each word, type the command
below. This command uses the SUBSTRING function to take the first
character of each token.
• grunt> letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter:chararray;
• Case 4: Group the letters so that each group's bag contains one entry
for each occurrence of that character.
• grunt> lettergrp = GROUP letters BY letter;
• Case 5: Count the number of occurrences in each group.
• grunt> countletter = FOREACH lettergrp GENERATE group, COUNT(letters);
• Case 6: Arrange the output by count in descending order
using the command below.
• grunt> OrderCnt = ORDER countletter BY $1 DESC;
• Case 7: Limit the output to one row to give the result.
• grunt> result = LIMIT OrderCnt 1;
• Case 8: Store the result in HDFS. The result is saved in the output
directory under the sonoo folder.
• grunt> STORE result INTO 'home/sonoo/output';
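To check the logic of the cases above on a small input without a Hadoop cluster, the same data flow can be sketched in plain Python. This is an approximation, not Pig itself: the function name and sample lines are hypothetical, and `str.split()` breaks only on whitespace, whereas Pig's TOKENIZE also splits on a few punctuation characters.

```python
from collections import Counter


def most_common_start_letter(lines):
    """Mirror the Pig pipeline: tokenize, take first letters, group, count, order, limit."""
    # TOKENIZE + FLATTEN: one word per element (whitespace split only).
    tokens = [token for line in lines for token in line.split()]
    # SUBSTRING(token, 0, 1): first character of each word.
    letters = [token[:1] for token in tokens]
    # GROUP BY letter + COUNT: occurrences per letter.
    counts = Counter(letters)
    # ORDER BY count DESC + LIMIT 1: the single most frequent letter.
    top = counts.most_common(1)
    return top[0] if top else None


# Hypothetical sample input standing in for data.txt.
sample_lines = ["big data with pig", "hive and hbase build on hadoop"]
result = most_common_start_letter(sample_lines)
```

For the sample input, "h" starts three words (hive, hbase, hadoop), so `result` is the pair `('h', 3)`.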
APACHE HIVE
• Apache Hive is an open-source data warehousing tool for performing distributed
processing and data analysis. It was developed by Facebook to reduce the work of
writing the Java MapReduce program.
• Apache Hive uses a Hive Query language, which is a declarative language similar to SQL.
Hive translates the hive queries into MapReduce programs.
• It supports developers to perform processing and analyses on structured and semi-
structured data by replacing complex java MapReduce programs with hive queries.
• One who is familiar with SQL commands can easily write the hive queries.
• Hive supports client applications written in languages such as Python, Java, C++, and Ruby
through its JDBC, ODBC, and Thrift drivers. Hence, one can write a Hive client application
in the language of one's choice.
• Hive makes the job easy for performing operations like
• Analysis of huge datasets
• Ad-hoc queries
• Data encapsulation
APACHE HIVE ARCHITECTURE

• The major components of Apache Hive are:


• Hive Client
• Hive Services
• Processing and Resource Management
• Distributed Storage
Hive Clients: They are categorized into three types:
a) Thrift Clients
• The Hive server is based on Apache Thrift so that it can serve the request
from a thrift client.
b) JDBC client
• Hive allows for the Java applications to connect to it using the JDBC driver.
JDBC driver uses Thrift to communicate with the Hive Server.
c) ODBC client
• Hive ODBC driver allows applications based on the ODBC protocol to
connect to Hive. Similar to the JDBC driver, the ODBC driver uses Thrift to
communicate with the Hive Server.
• $HIVE_HOME/bin/hive is a shell utility which can be used to run Hive queries in either
interactive or batch mode. HiveServer2 (introduced in Hive 0.11) has its own CLI called
Beeline, which is a JDBC client based on SQLLine.
• Hive Command Line Options
• To get help, run "hive -H" or "hive --help". Usage (as it is in Hive 0.9.0)
• usage: hive
Option | Explanation
-d,--define <key=value> | Variable substitution to apply to hive commands. e.g. -d A=B or --define A=B
-e <quoted-query-string> | SQL from command line
-f <filename> | SQL from files
-H,--help | Print help information
-h <hostname> | Connecting to Hive Server on remote host
--hiveconf <property=value> | Use value for given property
--hivevar <key=value> | Variable substitution to apply to hive commands. e.g. --hivevar A=B
-i <filename> | Initialization SQL file
-p <port> | Connecting to Hive Server on port number
-S,--silent | Silent mode in interactive shell
-v,--verbose | Verbose mode (echo executed SQL to the console)
Examples
• Example of running a query from the command line
• $HIVE_HOME/bin/hive -e 'select a.foo from pokes a'
• Example of setting Hive configuration variables
• $HIVE_HOME/bin/hive -e 'select a.foo from pokes a' --hiveconf
hive.exec.scratchdir=/opt/my/hive_scratch --hiveconf mapred.reduce.tasks=1
• Example of dumping data out from a query into a file using silent mode
• $HIVE_HOME/bin/hive -S -e 'select a.foo from pokes a' > a.txt
• Example of running a script non-interactively from local disk
• $HIVE_HOME/bin/hive -f /home/my/hive-script.sql
• Example of running a script non-interactively from a Hadoop supported
filesystem (starting in Hive 0.14)
• $HIVE_HOME/bin/hive -f hdfs://<namenode>:<port>/hive-script.sql
HIVE SERVICES:
• To perform all queries, Hive provides various services like the Hive server2, Beeline, etc. The various services
offered by Hive are:
1. Beeline
• Beeline is a command shell supported by HiveServer2, where users can submit their queries and
commands to the system. It is a JDBC client based on the SQLLine CLI (a pure Java console-based utility for
connecting to relational databases and executing SQL queries).
2. Hive Server 2
• HiveServer2 is the successor of HiveServer1. HiveServer2 enables clients to execute queries against the Hive.
It allows multiple clients to submit requests to Hive and retrieve the final results. It is basically designed to
provide the best support for open API clients like JDBC and ODBC.
• Note: Hive server1, also called a Thrift server, is built on Apache Thrift protocol to handle the cross-platform
communication with Hive. It allows different client applications to submit requests to Hive and retrieve the
final results.
• It does not handle concurrent requests from more than one client due to which it was replaced by
HiveServer2.
3. Hive Driver
• The Hive driver receives the HiveQL statements submitted by the user through the command shell. It creates
the session handles for the query and sends the query to the compiler.
4. Hive Compiler
• Hive compiler parses the query. It performs semantic analysis and type-checking on the different query
blocks and query expressions by using the metadata stored in metastore and generates an execution plan.
• The execution plan created by the compiler is a DAG (Directed Acyclic Graph), where each stage is a
map/reduce job, an operation on HDFS, or a metadata operation.
5. Optimizer
• Optimizer performs the transformation operations on the execution plan and splits the task to improve
efficiency and scalability.
6. Execution Engine
• Execution engine, after the compilation and optimization steps, executes the execution plan created by the
compiler in order of their dependencies using Hadoop.
7. Metastore
• Metastore is a central repository that stores the metadata information about the structure of tables and
partitions, including column and column type information.
• It also stores information of serializer and deserializer, required for the read/write operation, and HDFS files
where data is stored. This metastore is generally a relational database.
• Metastore provides a Thrift interface for querying and manipulating Hive metadata.
• We can configure metastore in any of the two modes:
• Remote: In remote mode, metastore is a Thrift service and is useful for non-Java applications.
• Embedded: In embedded mode, the client can directly interact with the metastore using JDBC.
8. HCatalog
• HCatalog is the table and storage management layer for Hadoop. It enables users with
different data processing tools such as Pig, MapReduce, etc. to easily read and write data
on the grid.
• It is built on the top of Hive metastore and exposes the tabular data of Hive metastore to
other data processing tools.
9. WebHCat
• WebHCat is the REST API for HCatalog. It is an HTTP interface to perform Hive metadata
operations. It provides a service to the user for running Hadoop MapReduce (or YARN),
Pig, Hive jobs.
Processing framework and resource management:
• Hive internally uses the MapReduce framework as its de facto engine for
executing queries.
• MapReduce is a software framework for writing applications
that process massive amounts of data in parallel on large clusters
of commodity hardware. A MapReduce job works by splitting the data into
chunks, which are processed by map tasks; the map output is then combined by reduce tasks.
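The chunk-based processing described above can be sketched as a single-process word count in Python. This is a toy illustration under stated assumptions, not Hadoop code: the function names, the explicit shuffle step, and the two sample chunks are all hypothetical, and a real job runs the map and reduce tasks on different machines.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle step: gather all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce task: sum the counts collected for one word."""
    return key, sum(values)


# The input is split into chunks; each chunk would go to its own map task.
chunks = ["pig hive", "hive hbase hive"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Here `counts` ends up as `{'pig': 1, 'hive': 3, 'hbase': 1}`. A Hive query is compiled into jobs of this general shape, so the user writes only HiveQL.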
Distributed Storage:
• Hive is built on top of Hadoop, so it uses the underlying Hadoop
Distributed File System for the distributed storage.
HIVE v/s RDBMS
RDBMS | Hive
It is used to maintain a database. | It is used to maintain a data warehouse.
It uses SQL (Structured Query Language). | It uses HQL (Hive Query Language).
Schema is fixed in RDBMS. | Schema varies in Hive.
Normalized data is stored. | Both normalized and de-normalized data are stored.
Tables in RDBMS are sparse. | Tables in Hive are dense.
It doesn’t support partitioning. | It supports automatic partitioning.
No partition method is used. | The sharding method is used for partitioning.


HiveQL
• HiveQL (Hive Query Language) is the SQL-like query language used to
communicate with Apache Hive, a data warehouse built on Hadoop. It is used
for querying and managing large volumes of distributed data held in
Hadoop's HDFS (Hadoop Distributed File System) or other comparable
storage systems.
• Through HiveQL, Hive provides a SQL-like environment for working with
databases, tables, and queries. HiveQL offers different types of clauses
for the various kinds of data processing and querying. Hive also has JDBC
connectivity for clients outside the Hadoop ecosystem.
Creating Databases and Tables
• In Hive, a table is a collection of data that is organized according to a schema,
that is, a specific set of named and typed columns.
Step 1: Create a Database:
CREATE DATABASE IF NOT EXISTS mydatabase;
This statement creates a database named "mydatabase". Because of the IF NOT EXISTS
clause, the database is created only if it does not already exist, and no error is raised
otherwise.
Step 2: Switching to a Database:
USE mydatabase;
By switching to the "mydatabase" database using this line, further activities can be carried
out in that database.
Step 3: Creating a Table:
CREATE TABLE IF NOT EXISTS employees (
id INT,
name STRING,
age INT
);
The "employees" table is created with this statement. It has three columns: "id"
(integer), "name" (string), and "age" (integer). The table is only generated if it
doesn't already exist with the help of IF NOT EXISTS condition.
• Step 4: Creating an External Table:
CREATE EXTERNAL TABLE IF NOT EXISTS ext_employees (
id INT,
name STRING,
age INT
) LOCATION '/path/to/data';
• This HiveQL statement creates a new external table called
"ext_employees". External tables point to data that is kept in a location
independent of Hive's warehouse directory, so the data stays in its original location. The
LOCATION clause specifies the HDFS path where the data is located.
Loading Data into Tables
• Load data from HDFS
LOAD DATA INPATH '/path/to/data' INTO TABLE employees;

• Insert data into the table


INSERT INTO TABLE employees VALUES (1, 'John Doe', 30);

• The LOAD DATA statement loads the files at an HDFS path into the designated
table; with INPATH and no LOCAL keyword, the files are moved into the table's
storage location rather than copied. The INSERT INTO TABLE statement adds one
specific row of data to the "employees" table.
Querying Data with HiveQL
• One of the core functions of using Apache Hive is data querying with
HiveQL. You may obtain, filter, transform, and analyse data stored in
Hive tables using HiveQL, which is a language comparable to SQL.
• Following are a few typical HiveQL querying operations:
1. Select All Records:
SELECT * FROM employees;
This Hadoop HiveQL command retrieves all records from the
"employees" table.
2. Filtering:
Example: Select employees older than 25
SELECT * FROM employees WHERE age > 25;
Only those records from the "employees" table that have a "age" greater than 25
are chosen by this.

3. Aggregation:
Example: Count the number of employees
SELECT COUNT(*) FROM employees;
Example: Calculate the average age
SELECT AVG(age) FROM employees;
These Hadoop HiveQL queries count the number of employees and determine the
average age using aggregation operations on the "employees" table.
4. Sorting:
Example: Sort by age in descending order
SELECT * FROM employees ORDER BY age DESC;
This query returns all records from the "employees" table sorted by age, from
oldest to youngest.

5. Joining Tables:
Example: Join employees and departments based on department_id
SELECT e.id, e.name, d.department
FROM employees e
JOIN departments d ON e.department_id = d.id;
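The "department_id" column is used to link the "employees" and "departments"
tables in order to retrieve employee names and their related departments. The query
assumes that "employees" also has a "department_id" column, which the CREATE TABLE
example earlier did not define.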
6. Grouping and Aggregation:
Example: Count employees in each department
SELECT department, COUNT(*) as employee_count
FROM employees
GROUP BY department;
This query counts the number of employees in each department and organizes employees
by department.

7. Limiting Results:
Example: Get the top 10 oldest employees
SELECT * FROM employees ORDER BY age DESC LIMIT 10;
This query returns the ten oldest employees, ordered from oldest to youngest.
Data Filtering and Sorting
1. Data Filtering: You can use the WHERE clause to filter rows based on specific conditions.
Example: Select marks which are more than 60.
SELECT * FROM marks WHERE marks > 60;
This query returns all rows of the "marks" table whose "marks" field is greater than 60.

2. Sorting Data: You can use the ORDER BY clause to order the result set according to one or more
columns.
Example: Rank the rows by marks in increasing order.
SELECT * FROM marks ORDER BY marks ASC;

3. Combining Filtering and Sorting: To obtain particular subsets of data in a specified order, you can
combine filtering and sorting.
Example: Select marks greater than 60 and sort them in increasing order.
SELECT * FROM marks WHERE marks > 60 ORDER BY marks ASC;
Data Transformations and Aggregations
1. Data Transformations: HiveQL provides a number of built-in functions for changing the
data in your query.
Example: Change the case of names
SELECT UPPER(name) as upper_case_name FROM employees;
This Hadoop HiveQL query pulls the "name" column from the "employees" table and uses
the UPPER function to change the names to uppercase.

2. Aggregations: Using functions like COUNT, SUM, AVG, and others, aggregates let you
condense data.
Example: Calculate the average age of the workforce.
SELECT AVG(age) as average_age FROM employees;
Using the AVG function, this query determines the average age of every employee in the
"employees" table.
3. Grouping and Aggregating: To group data into categories, the GROUP BY clause is used with
aggregate functions.
Example: Count the employees in each department.
SELECT department, COUNT(*) as employee_count
FROM employees
GROUP BY department;
The COUNT function is used in this query to count the number of employees in each department
and group the employees by the "department" column.

4. Filtering Before Aggregating: Data can be filtered and transformed before it is
aggregated.
Example: Calculate the average age of staff members who are over 35.
SELECT AVG(age) as average_age
FROM employees
WHERE age > 35;
This HiveQL query first keeps only the employees over the age of 35 and then determines the
average age of that filtered subset.
Joins and Subqueries

1. Joins: With the use of joins, you can merge rows from various tables based on a
shared column. The INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN are
examples of common join types.
Example: Retrieve the employees and their corresponding
departments with an inner join.
SELECT e.id, e.name, d.department
FROM employees e
JOIN departments d ON e.department_id = d.id;
Based on the "department_id" column, this query combines information from the
"employees" and "departments" tables to retrieve employee names and their
related departments.
2. Subqueries: A subquery is a query that is nested inside another query. The
SELECT, WHERE, and FROM clauses can all use them.
Example: Determine the average age of employees in each department using
a subquery in the SELECT clause.
SELECT department, (
SELECT AVG(age)
FROM employees e
WHERE e.department_id = d.id
) as avg_age
FROM departments d;
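This query uses a subquery to determine the average age of employees for each
department in the "departments" table. Subquery support depends on the Hive
version: through Hive 0.12 subqueries were accepted only in the FROM clause, and
WHERE-clause subqueries (IN and EXISTS) were added in Hive 0.13, so a subquery in
the SELECT list like this one may be rejected by older releases.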
3. Correlated Subqueries: An inner query that depends on results from the
outer query is referred to as a correlated subquery.
Example: Find employees whose ages are higher than their department's
average.
SELECT id, name
FROM employees e
WHERE age > (
SELECT AVG(age)
FROM employees
WHERE department_id = e.department_id
);
To locate employees whose ages are higher than the mean ages of
employees in the same department, this query uses a correlated subquery.
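The join and the correlated subquery above can be tried on a few rows without a Hive installation. The sketch below uses Python's built-in sqlite3 module as a stand-in engine; SQLite is not Hive and the two dialects differ, but these two queries use only standard SQL. The table contents and the "department_id" column are assumptions made for the illustration.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (id INTEGER, department TEXT);
CREATE TABLE employees (id INTEGER, name TEXT, age INTEGER, department_id INTEGER);
INSERT INTO departments VALUES (1, 'Sales'), (2, 'Engineering');
INSERT INTO employees VALUES
    (1, 'Asha', 30, 1), (2, 'Ben', 40, 1),
    (3, 'Chen', 25, 2), (4, 'Dina', 35, 2);
""")

# Inner join: each employee with the name of their department.
joined = conn.execute("""
SELECT e.id, e.name, d.department
FROM employees e
JOIN departments d ON e.department_id = d.id
ORDER BY e.id
""").fetchall()

# Correlated subquery: employees older than their own department's average age.
older_than_dept_avg = conn.execute("""
SELECT id, name
FROM employees e
WHERE age > (
    SELECT AVG(age)
    FROM employees
    WHERE department_id = e.department_id
)
ORDER BY id
""").fetchall()

conn.close()
```

With this sample data the department averages are 35 (Sales) and 30 (Engineering), so the correlated query returns Ben and Dina.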
HBase
• HBase is a column-oriented non-relational database management system
that runs on top of Hadoop Distributed File System (HDFS), a main
component of Apache Hadoop.
• HBase provides a fault-tolerant way of storing sparse data sets, which are
common in many big data use cases. It is well suited for real-time data
processing or random read/write access to large volumes of data.
• Unlike relational database systems, HBase does not support a structured
query language like SQL; in fact, HBase isn’t a relational data store at all.
• HBase applications are written in Java much like a typical Apache
MapReduce application. HBase does support writing applications in Apache
Avro, REST and Thrift.
• An HBase system is designed to scale linearly. It comprises a set of standard
tables with rows and columns, much like a traditional database. Each table must
have an element defined as a primary key (the row key), and all access attempts to HBase
tables must use this primary key.
• Avro, as a component, supports a rich set of primitive data types including:
numeric, binary data and strings; and a number of complex types including
arrays, maps, enumerations and records. A sort order can also be defined for the
data.
• HBase relies on ZooKeeper for high-performance coordination. ZooKeeper is built
into HBase, but if you’re running a production cluster, it’s suggested that you have
a dedicated ZooKeeper cluster that’s integrated with your HBase cluster.
• HBase works well with Hive, a query engine for batch processing of big data, to
enable fault-tolerant big data applications.
HBASE V/s RDBMS
S. No. | Parameters | RDBMS | HBase
1. | SQL | It requires SQL (Structured Query Language). | SQL is not required in HBase.
2. | Schema | It has a fixed schema. | It does not have a fixed schema and allows for the addition of columns on the fly.
3. | Database Type | It is a row-oriented database. | It is a column-oriented database.
4. | Scalability | RDBMS allows for scaling up. That is, rather than adding new servers, we upgrade the current server to a more capable server whenever there is a requirement for more memory, processing power, and disk space. | Scale-out is possible using HBase. That is, when we require extra memory and disk space, we add new servers to the cluster rather than upgrade the existing ones.
5. | Nature | It is static in nature. | It is dynamic in nature.
6. | Data retrieval | In RDBMS, retrieval of data is slower. | In HBase, retrieval of data is faster.
7. | Rule | It follows the ACID (Atomicity, Consistency, Isolation, and Durability) properties. | It follows the CAP (Consistency, Availability, Partition-tolerance) theorem.
HBASE DATA MODEL
• HBase is a column-oriented database
and the tables in it are sorted by row.
• The table schema defines only column
families, which are the key value pairs.
• Table is a collection of rows.
• Row is a collection of column families.
• Column family is a collection of
columns.
• Column is a collection of key value
pairs.
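The nesting described above (table, row, column family, column, value) can be sketched with plain Python dictionaries. This is only an in-memory illustration, not HBase client code; the helper names `put`, `get`, and `scan` and the sample rows are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical in-memory model: row key -> column family -> qualifier -> value.
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, qualifier, value):
    """Store one cell under row / column family / column qualifier."""
    table[row_key][family][qualifier] = value

def get(row_key, family, qualifier):
    """Return one cell, or None when the cell was never written (sparse data)."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)

def scan():
    """Yield rows in row-key order, mirroring how tables are sorted by row."""
    for row_key in sorted(table):
        yield row_key, table[row_key]

put("row2", "personal", "name", "Ravi")
put("row1", "personal", "name", "Asha")
put("row1", "professional", "role", "Engineer")

print([row_key for row_key, _ in scan()])
print(get("row2", "professional", "role"))
```

Rows come back in key order regardless of insertion order, and a cell that was never written simply has no entry, which is why sparse data costs nothing in this layout.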
HBASE V/s RDBMS
S. No. | Parameters | RDBMS | HBase
8. | Type of data | It can handle structured data. | It can handle structured, unstructured as well as semi-structured data.
9. | Sparse data | It cannot handle sparse data. | It can handle sparse data.
10. | Volume of data | The amount of data in RDBMS is determined by the server’s configuration. | In HBase, the amount of data depends on the number of machines deployed rather than on a single machine.
11. | Transaction Integrity | In RDBMS, mostly there is a guarantee associated with transaction integrity. | In HBase, there is no such guarantee associated with the transaction integrity.
12. | Referential Integrity | Referential integrity is supported by RDBMS. | When it comes to referential integrity, no built-in support is available.
13. | Normalize | In RDBMS, you can normalize the data. | The data in HBase is not normalized, which means there is no logical relationship or connection between distinct tables of data.
14. | Table size | It is designed to accommodate small tables. Scaling is difficult. | It is designed to accommodate large tables. HBase may scale horizontally.
HBASE CLIENTS
1) REST: REST server supports the complete client and administrative API. It also
provides support for different message formats, offering many choices for a
client application to communicate with the server.
• REST Java Client: The REST server also comes with a comprehensive Java client API. It is
located in the org.apache.hadoop.hbase.rest.client package.
2) Thrift: Apache Thrift is written in C++, but provides schema compilers for many
programming languages, including Java, C++, Perl, PHP, Python, Ruby, and
more. Once you’ve compiled a schema, you can exchange messages
transparently between systems implemented in one or more of those
languages.
3) Avro: Apache Avro, like Thrift, provides schema compilers for many
programming languages, including Java, C++, Perl, PHP, Python, Ruby, and
more. Once you’ve compiled a schema, you can exchange messages
transparently between systems implemented in one or more of those
languages.
HBASE CLIENTS
4) Other Clients:
• JRuby: The HBase shell is an example of using a JVM-based language to
access the Java-based API. It comes with the full source code, so you can
use it to add the same features to your own JRuby code.
• HBql: It adds an SQL-like syntax on top of HBase, while adding the
extensions needed where HBase has unique features.
• HBase-DSL: This project gives you dedicated classes that help when
formulating queries against an HBase cluster.
• AsyncHBase: It offers a completely asynchronous, nonblocking, and thread-
safe client to access HBase clusters. It uses the native RPC protocol to talk
directly to the various servers.
HBASE ARCHITECTURE
HMaster
• HMaster is the HBase architecture’s implementation of a
Master server. It serves as a monitoring agent for all Region Server
instances in the cluster as well as an interface for any metadata
updates. In a distributed cluster, HMaster typically runs on the same
machine as the NameNode.
• HMaster has the following critical responsibilities in HBase:
1. It is critical in terms of cluster performance and node
maintenance.
2. HMaster handles administrative operations, distributes services to
region servers, and assigns regions to region servers.
3. HMaster includes functions such as regulating load balancing
and failover to distribute demand among cluster nodes.
4. When a client requests that any schema or Metadata
operations be changed, HMaster assumes responsibility for
these changes.
HBASE ARCHITECTURE
Region Server
• The Region Server, also known as HRegionServer, is in charge of
managing and serving a set of regions in HBase.
• A region is a portion of a table's data that consists of numerous
contiguous rows ordered by the row key.
• Each Region Server is in charge of one or more regions that the
HMaster dynamically assigns to it.
• The Region Server serves as an intermediary for clients sending
write or read requests to HBase, directing them to the appropriate
region based on the requested row key.
• Clients can connect with the Region Server without requiring
HMaster authorization, allowing for efficient and direct access to
HBase data.
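Because a region holds contiguous rows ordered by row key, finding the region for a request amounts to a search over region start keys. The sketch below is a simplified illustration in Python; the start keys, server names, and the `locate_region` helper are hypothetical and do not reflect how HBase stores its region metadata.

```python
import bisect

# Hypothetical region map: each region starts at a row key and runs up to
# the next region's start key. The first region starts at the empty key.
region_start_keys = ["", "g", "n", "t"]
region_servers = ["rs1", "rs2", "rs1", "rs3"]

def locate_region(row_key):
    """Return (region index, server name) for the region containing row_key."""
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return index, region_servers[index]

print(locate_region("apple"))
print(locate_region("n"))
print(locate_region("zebra"))
```

A row key equal to a start key belongs to the region that begins there, which is why the lookup uses `bisect_right` and steps back by one.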
HBASE ARCHITECTURE
Advantages of Region Servers:
• Region Servers in HBase enable distributed data management,
allowing data partitioning across Hadoop cluster nodes, parallel
processing, fault tolerance, and scalability.
• Region Servers process read/write requests directly, which removes
centralized coordination from the data path and reduces network
overhead and latency.
• Region Servers allow HBase to automatically split regions into smaller
ones as data accumulates. This ensures uniform distribution and
improves query speed and load balancing.
HBASE ARCHITECTURE-ZOOKEEPER
• ZooKeeper is a distributed, open-source coordination service for distributed applications. It
exposes a simple set of primitives to implement higher-level services for synchronization,
configuration maintenance, and groups and naming.
• In a distributed system, there are multiple nodes or machines that need to communicate with
each other and coordinate their actions.
• ZooKeeper provides a way to ensure that these nodes are aware of each other and can coordinate
their actions.
• It does this by maintaining a hierarchical tree of data nodes called "znodes", which can be used to
store and retrieve data and maintain state information.
• ZooKeeper provides a set of primitives, such as locks, barriers, and queues, that can be used to
coordinate the actions of nodes in a distributed system.
• It also provides features such as leader election, failover, and recovery, which can help ensure
that the system is resilient to failures.
• ZooKeeper is widely used in distributed systems such as Hadoop, Kafka, and HBase, and it has
become an essential component of many distributed applications.
ZOOKEEPER
NEED OF ZOOKEEPER:
• Coordination services: The integration/communication of services in a
distributed environment.
• Coordination services are complex to get right. They are especially prone to
errors such as race conditions and deadlock.
• Race condition: two or more systems try to perform the same task at the same time.
• Deadlock: two or more operations wait for each other indefinitely.
• To make coordination in distributed environments easier,
developers came up with ZooKeeper, which relieves distributed
applications of the responsibility of implementing
coordination services from scratch.
ARCHITECTURE OF ZOOKEEPER
• The ZooKeeper architecture consists of a hierarchy of nodes called
znodes, organized in a tree-like structure.
• Each znode can store data and has a set of permissions that control
access to the znode.
• The znodes are organized in a hierarchical namespace, similar to a file
system.
• At the root of the hierarchy is the root znode, and all other znodes are
descendants of the root znode.
• As in a file system, each znode can have children, which can in turn
have children of their own, and so on.
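The znode hierarchy can be pictured as a small tree keyed by path segments. The following Python sketch is an in-memory illustration only; the `ZNode` class, the `create` and `get_children` helpers, and the sample paths are assumptions for the example rather than the ZooKeeper client API.

```python
class ZNode:
    """Hypothetical in-memory znode: a data payload plus named children."""
    def __init__(self, data=b""):
        self.data = data
        self.children = {}

root = ZNode()

def create(path, data=b""):
    """Create the znode at path; its parent must already exist."""
    *parents, name = path.strip("/").split("/")
    node = root
    for part in parents:
        node = node.children[part]  # KeyError if a parent is missing
    if name in node.children:
        raise ValueError(f"{path} already exists")
    node.children[name] = ZNode(data)

def get_children(path):
    """Return the sorted child names of the znode at path."""
    node = root
    for part in filter(None, path.strip("/").split("/")):
        node = node.children[part]
    return sorted(node.children)

create("/hbase")
create("/hbase/master", b"host-a:16000")
create("/hbase/rs")
print(get_children("/"))
print(get_children("/hbase"))
```

As with directories in a file system, a child can only be created under a parent that already exists.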
IMPORTANT COMPONENTS IN ZOOKEEPER
• Request Processor – Active in Leader Node and is responsible for
processing write requests. After processing, it sends changes to the
follower nodes
• Atomic Broadcast – Present in both Leader Node and Follower
Nodes. It is responsible for sending the changes to other Nodes.
• In-memory Databases (Replicated Databases)-It is responsible for
storing the data in the zookeeper. Every node contains its own
databases. Data is also written to the file system providing
recoverability in case of any problems with the cluster.
OTHER COMPONENTS OF ZOOKEEPER
• Client – One of the nodes in our distributed application cluster. It
accesses information from the server. Every client periodically sends a
message to the server to let the server know that the client is alive.
• Server – Provides all the services to the client. Gives acknowledgment
to the client.
• Ensemble – Group of ZooKeeper servers. The minimum number of
nodes required to form a fault-tolerant ensemble is 3.
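The figure of three servers follows from majority arithmetic: an ensemble keeps working only while more than half of its servers can reach each other. The short Python sketch below works this out for a few ensemble sizes; the function names are chosen for illustration.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)

for n in (1, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Three servers tolerate one failure, and a fourth server adds no extra tolerance, which is why ensembles are usually sized with an odd number of servers.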
ZOOKEEPER DATA MODEL
• In Zookeeper, data is stored in a hierarchical namespace, similar
to a file system.
• Each node in the namespace is called a Znode, and it can store
data and have children.
• Znodes are similar to files and directories in a file system.
• Zookeeper provides a simple API for creating, reading, writing,
and deleting Znodes.
• It also provides mechanisms for detecting changes to the data
stored in Znodes, such as watches and triggers.
• Znodes maintain a stat structure that includes: Version number,
ACL, Timestamp, Data Length
• Types of Znodes:
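• Persistent: Alive until they’re explicitly deleted.
• Ephemeral: Alive only as long as the session of the client that created them is active.
• Sequential: Either persistent or ephemeral; ZooKeeper appends a monotonically increasing counter to the znode name.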
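The three znode types can be contrasted with a small single-process simulation. This Python sketch is not the ZooKeeper client API: the `MiniZK` class, its method names, and the session identifiers are hypothetical, and the counter is kept per path prefix as a simplification.

```python
class MiniZK:
    """Hypothetical single-process stand-in for a ZooKeeper server."""
    def __init__(self):
        self.znodes = {}      # path -> owning session id (None if persistent)
        self.counters = {}    # path prefix -> next sequence number

    def create(self, path, session=None, ephemeral=False, sequential=False):
        if sequential:
            seq = self.counters.get(path, 0)
            self.counters[path] = seq + 1
            path = f"{path}{seq:010d}"   # zero-padded counter suffix
        self.znodes[path] = session if ephemeral else None
        return path

    def close_session(self, session):
        """Drop every ephemeral znode owned by the closed session."""
        self.znodes = {p: s for p, s in self.znodes.items() if s != session}

zk = MiniZK()
zk.create("/config")                                    # persistent
first = zk.create("/workers/w-", "s1", ephemeral=True, sequential=True)
second = zk.create("/workers/w-", "s2", ephemeral=True, sequential=True)
zk.close_session("s1")                                  # s1's znode disappears
print(first, second, sorted(zk.znodes))
```

The persistent znode survives the session closing, while the ephemeral znode created by session `s1` is removed with it.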
WORKING OF ZOOKEEPER
• ZooKeeper exposes a namespace that resembles a distributed file system, together with a simple set of
APIs that enable clients to read and write data in it.
• It stores its data in a tree of nodes called znodes, each of which can be
thought of as a file or a directory in a traditional file system.
• ZooKeeper uses a consensus algorithm to ensure that all of its servers have
a consistent view of the data stored in the Znodes. This means that if a
client writes data to a znode, that data will be replicated to all of the other
servers in the ZooKeeper ensemble.
• One important feature of ZooKeeper is its ability to support the notion of a
“watch.” A watch allows a client to register for notifications when the data
stored in a znode changes. This can be useful for monitoring changes to the
data stored in ZooKeeper and reacting to those changes in a distributed
system.
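A watch can be sketched as a callback registered at read time and fired on the next change. The Python below is an in-memory illustration with hypothetical names (`WatchedStore`, `get_data`, `set_data`); it models the watch as a one-time trigger, which is how ZooKeeper's standard watches behave.

```python
class WatchedStore:
    """Hypothetical znode store with one-shot watches on data changes."""
    def __init__(self):
        self.data = {}
        self.watches = {}   # path -> list of callbacks waiting for a change

    def get_data(self, path, watch=None):
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return self.data.get(path)

    def set_data(self, path, value):
        self.data[path] = value
        # Fire each registered watch once, then forget it.
        for callback in self.watches.pop(path, []):
            callback(path)

events = []
store = WatchedStore()
store.set_data("/config/db", "host-a")
store.get_data("/config/db", watch=events.append)
store.set_data("/config/db", "host-b")   # triggers the watch
store.set_data("/config/db", "host-c")   # no watch left, nothing fires
print(events)
```

A client that wants to keep following a znode re-registers the watch each time it reads the new value.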
WORKING OF ZOOKEEPER
• In Hadoop, ZooKeeper is used for a variety of purposes, including:
• Storing configuration information: ZooKeeper is used to store configuration
information that is shared by multiple Hadoop components. For example, it
might be used to store the locations of NameNodes in a Hadoop cluster or the
addresses of JobTracker nodes.
• Providing distributed synchronization: ZooKeeper is used to coordinate the
activities of various Hadoop components and ensure that they are working
together in a consistent manner. For example, it might be used to ensure that
only one NameNode is active at a time in a Hadoop cluster.
• Maintaining naming: ZooKeeper is used to maintain a centralized naming
service for Hadoop components. This can be useful for identifying and
locating resources in a distributed system.
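One common way to keep a single node active is an election in which each candidate receives a sequence number and the lowest number wins. The Python sketch below illustrates that general idea only; the candidate names and the `elect` helper are hypothetical, and the source does not state that Hadoop uses exactly this recipe.

```python
def elect(candidates):
    """Return the candidate holding the lowest sequence number, or None."""
    return min(candidates, key=candidates.get) if candidates else None

# Hypothetical election znodes: candidate name -> sequence number at creation.
candidates = {"namenode-a": 0, "namenode-b": 1}

active = elect(candidates)
del candidates[active]            # the active node's session expires
promoted = elect(candidates)
print(active, promoted)
```

When the active node's entry disappears, the remaining candidate with the lowest number takes over, so at most one node is active at any moment.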
Advanced Usage of HBASE:
It is important to have a good understanding of how to design tables, row keys, column names,
and so on, to take full advantage of the architecture.
Key Design
HBase has two fundamental key structures: the row key and the column key. Both can be used to
convey meaning, by either the data they store, or by exploiting their sorting order. In the
following sections, we will use these keys to solve commonly found problems when designing
storage solutions.
Concepts
The first concept to explain in more detail is the logical layout of a table, compared to on-disk
storage. HBase's main unit of separation within a table is the column family, not the actual
columns as expected from a column-oriented database in their traditional sense.
The figure shows that, although you store cells in a table format logically, in reality these
rows are stored as linear sets of the actual cells, which in turn contain all the vital information
inside them.
The top-left part of the figure shows the logical layout of your data—you have rows and
columns. The columns are the typical HBase combination of a column family name and a
column qualifier, forming the column key. The rows also have a row key so that you can address
all columns in one logical row.
HBASE Design Schema
With HBase, you have a "query-first" schema design; all possible queries should be identified
first, and the schema model designed accordingly. You should design your HBase schema to take
advantage of the strengths of HBase. Think about your access patterns, and design your schema
so that the data that is read together is stored together. Remember that HBase is designed for
clustering.

• Distributed data is stored and accessed together

• It is query-centric, so focus on how the data is read

• Design for the questions

Parent-Child Relationship–Nested Entity

• Here is an example of denormalization in HBase: if your tables exist in a one-to-many
relationship, it's possible to model it in HBase as a single row. In the example below, the
order and related line items are stored together and can be read together with a get on the
row key. This makes the reads a lot faster than joining tables together.

• The rowkey corresponds to the parent entity id, the OrderId. There is one column family
for the order data, and one column family for the order items. The Order Items are nested:
the Order Item IDs are put into the column names and any non-identifying attributes are
put into the value.

• This kind of schema design is appropriate when the only way you get at the child entities
is via the parent entity.
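The nested-entity layout can be sketched as one row holding two column families. This is a plain Python illustration, not HBase client code; the family names `data` and `items`, the sample order, and the `get_order_with_items` helper are assumptions for the example.

```python
# Hypothetical order row: row key = OrderId, one family for order data,
# one family whose column names are the order item IDs.
orders = {
    "order-1001": {
        "data": {"date": "2024-01-15", "customer": "carol"},
        "items": {"item-1": "2 x notebook", "item-2": "1 x pen"},
    }
}

def get_order_with_items(order_id):
    """One row lookup returns the parent and all of its children."""
    row = orders[order_id]
    return row["data"], sorted(row["items"].items())

data, items = get_order_with_items("order-1001")
print(data["customer"], items)
```

The parent and its children arrive from a single lookup on the row key, with no join step.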

Self-Join Relationship – HBase

• A self-join is a relationship in which both match fields are defined in the same table.

• Consider a schema for Twitter relationships, where the queries are: which users does
userX follow, and which users follow userX? Here's a possible solution: the user IDs are
put in a composite row key with the relationship type as a separator. For example, Carol
follows Steve Jobs and Carol is followed by BillyBob. This allows for row key scans for
everyone carol:follows or carol:followedby.
• Below is the example Twitter table:
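As a rough stand-in for such a table, the following Python sketch lays out a few hypothetical composite row keys and runs a prefix scan over them. The key format `<user>:<relationship>:<other user>` and the `prefix_scan` helper are assumptions for illustration, not HBase client code.

```python
import bisect

# Hypothetical rows; keys follow "<user>:<relationship>:<other user>".
row_keys = sorted([
    "billybob:follows:carol",
    "carol:followedby:billybob",
    "carol:follows:stevejobs",
    "stevejobs:followedby:carol",
])

def prefix_scan(prefix):
    """Return every row key that starts with prefix, using the sort order."""
    start = bisect.bisect_left(row_keys, prefix)
    end = bisect.bisect_left(row_keys, prefix + "\uffff")
    return row_keys[start:end]

print(prefix_scan("carol:follows:"))
print(prefix_scan("carol:followedby:"))
```

Because the keys are sorted, both questions (whom does Carol follow, and who follows Carol) become a contiguous range read rather than a full table scan.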

Schema Design Exploration:

• Raw data from HDFS or HBase

• MapReduce for data transformation and ETL from raw data.

• Use bulk import from MapReduce to HBase

• Serve data for online reads from HBase

Designing for reads means aggressively de-normalizing data so that the data that is read together
is stored together.

Data Access Pattern


The batch layer precomputes the batch views. In the batch view, you read the results from a
precomputed view. The precomputed view is indexed so that it can be accessed quickly with
random reads.
The serving layer indexes the batch view and loads it up so it can be efficiently queried to get
particular values out of the view. A serving layer database only requires batch updates and
random reads. The serving layer updates whenever the batch layer finishes precomputing a batch
view.
You can do stream-based processing with Storm and batch processing with Hadoop. The speed
layer only produces views on recent data, and is for functions computed on data in the few hours
not covered by the batch.
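A query against this layered design combines the precomputed batch view with the speed layer's view of recent data. The Python sketch below illustrates that merge with made-up page-view counts; the dictionaries and the `query` function are hypothetical.

```python
# Hypothetical page-view counts. The batch view covers data up to the last
# batch run; the speed view covers only events that arrived after it.
batch_view = {"/home": 1200, "/pricing": 300}
speed_view = {"/home": 15, "/signup": 4}

def query(page):
    """Answer a query by combining the precomputed and the recent counts."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query("/home"), query("/pricing"), query("/signup"))
```

Both lookups are random reads by key, which is the access pattern the serving layer is built for.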

Advanced Indexing in HBase

In HBase, the row key provides the same data retrieval benefits as a primary index. So, when you
create a secondary index, use elements that are different from the row key.

Secondary indexes allow you to have a secondary way to read an HBase table. They provide a
way to efficiently access records by means of some piece of information other than the primary
key.

Secondary indexes require additional cluster space and processing because the act of creating a
secondary index requires both space and processing cycles to update.
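The cost and benefit of a secondary index can be seen in a small sketch that keeps an index table next to the main table. This is plain Python with hypothetical names (`users`, `email_index`, `put_user`, `get_by_email`); it illustrates the general idea and is not the Diff-Index method.

```python
# Hypothetical main table keyed by user id, plus an index table keyed by email.
users = {}
email_index = {}   # email -> user id (the main table's row key)

def put_user(user_id, email, name):
    """Write the row and its index entry; every write now costs two puts."""
    old = users.get(user_id)
    if old is not None and old["email"] != email:
        del email_index[old["email"]]   # drop the stale index entry
    users[user_id] = {"email": email, "name": name}
    email_index[email] = user_id

def get_by_email(email):
    """Two lookups: index table first, then the main table by row key."""
    user_id = email_index.get(email)
    return None if user_id is None else users[user_id]

put_user("u1", "a@example.com", "Asha")
put_user("u1", "b@example.com", "Asha")   # email change updates the index
print(get_by_email("a@example.com"), get_by_email("b@example.com"))
```

The index avoids scanning the main table for a non-key attribute, at the price of extra storage and an extra write on every update.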

A method of index maintenance, called Diff-Index, can help IBM® Big SQL to create secondary
indexes for HBase, maintain those indexes, and use indexes to speed up queries.

IBM Big Data strategy


Big Data And Analytics Origins

According to the Oxford English Dictionary, the term
‘Big Data’ was first used in 1941 to quantify the growth
rate in the volume of data – alternatively known as the
information explosion. In his article, The Scholar and
the Future of the Research Library (1944), Fremont
Rider, a Wesleyan University librarian, estimated that
in every 16 years, the university libraries in US
doubled in size. He used the term Big Data to refer to
this increase. In 1967, in an article called Automatic
data compression published by BA Marron and PAD
de Maine, the term was used for the first time to
indicate the information explosion on external storage
devices through computers. Later, the term became
more and more related to information in the digital
format and the Information Technology sector. ...

IBM And Big Data

In the 1950s, John Hancock Mutual Life Insurance Co. collected 600 Megabytes of corporate data. This
was the largest amount of corporate data collected till then. The company was one of the pioneers of
digitization. It collected and stored information of two million policy holders on a Univac computing
system. During the 1960s, American Airlines developed a flight reservation system using IBM computing
systems and stored around 807 Megabytes of data. Federal Express, with its scanning and tracking,
collected 80 Gigabytes of data during the 1970s. In the 1980s, with its focus on analyzing ATM
transactions, CitiCorp., gathered 450 Gigabytes of data....

Big Data Strategy

Smarter Planet was a corporate initiative of IBM, which sought to highlight how government and business
leaders were capturing the potential of smarter systems to achieve economic and sustainable growth and
societal progress. In November 2008, in his speech at the Council on Foreign Relations, IBM’s Chairman,
CEO and President Sam Palmisano, outlined an agenda for building a ‘Smarter Planet’. He emphasized
how the world’s various systems – like traffic, water management, communication technology, smart
grids, healthcare solutions, and rail transportation – were struggling to function effectively....

IBM'S Big Data Platform

IBM committed itself to Big Data and Analytics through sustained investments and strategic acquisitions.
In 2011, it invested US$100 million in the research and development of services and solutions that
facilitated Big Data analytics. In addition, it had been bringing together as many Big Data technologies as
possible under its roof. The Big Data strategy of the company was to combine a wide array of the Big
Data analytic solutions and conquer the Big Data market. The company’s goal was to offer the broadest
portfolio of products and solutions with the depth and breadth that no other company could match.......

Addressing Emerging Business Requirements

In 2013, IBM was awarded the contract to support Thames Water Utilities Limited’s (Thames Water) Big
Data project. The UK government planned to install smart meters in every home by 2020. Using these
meters, the company would be able to collect a lot of data about the consumption patterns of its
customers. As a part of its next five-year plan, Thames Water planned to invest in Big Data analytics to
improve its operations, customer communication, services, and customer satisfaction using this data. It
chose IBM as an alliance partner for the project to support technology and innovation......

Big Data Challenges

Over the years, consumer attention had shifted from radio, print, and television to the digital media as it
facilitated real-time engagement of consumers. Brands competed for consumer attention through such
media and relied on them for data and analytics for customer acquisition and retention and to offer tailor-
made products and services to them. However, some analysts believed that relying on such data was a
mistake. They said the real challenge with such data was that it was mostly unstructured and the difficulty
lay in structuring it and filtering the genuine data....

Addressing Challenges

IBM had brought in new systems, software, and services to complement its Big Data platform. With these
products it helped its customers to access and analyze data and use it to make informed decisions for the
betterment of their businesses. The Big Data solutions were also meant to protect data and identify and
restrict suspicious activity and block access to company data....

Looking Ahead

In the past few years, Big Data had been the most hyped technology trend, and by 2013 it had started to
gain acceptance as it held promising opportunities for businesses. It was showing its impact on the
healthcare, industrial, retail, and financial sectors to name a few. It enabled companies to run live
simulations of trading strategies, geological and astronomical data, and stock brokers could analyze
public sentiment about a company from social media. Emerging technologies such as Hadoop, NoSQL,
and Storm made such analytics possible. According to a Gartner survey in 2013, 64% of organizations
had invested or planned to invest in the technology, but only 8% of them had actually begun deployment.
Many businesses were in the process of gathering information as to which business problems Big Data
could solve for them.

Introduction to InfoSphere Streams


IBM® InfoSphere® Streams is a software platform that enables the development and
execution of applications that process information in data streams. InfoSphere
Streams enables continuous and fast analysis of massive volumes of moving data to help
improve the speed of business insight and decision making.




InfoSphere Streams features and architecture


InfoSphere® Streams consists of a programming language, an API, and an integrated
development environment (IDE) for applications, and a runtime system that can run
the applications on a single or distributed set of resources. The Streams Studio IDE
includes tools for authoring and creating visual representations of streams processing
applications.

InfoSphere Streams is designed to address the following data processing platform
objectives:

• Parallel and high performance streams processing software platform that can scale
over a range of hardware environments
• Automated deployment of streams processing applications on configured hardware
• Incremental deployment without restarting to extend streams processing
applications
• Secure and auditable run time environment

The InfoSphere Streams architecture represents a significant change in computing
system organization and capability. InfoSphere Streams provides a runtime platform,
programming model, and tools for applications that are required to process continuous
data streams. The need for such applications arises in environments where information
from one to many data streams can be used to alert humans or other systems, or to
populate knowledge bases for later queries.

InfoSphere Streams is especially powerful in environments where traditional batch or
transactional systems might not be sufficient, for example:

• The environment must process many data streams at high rates.
• Complex processing of the data streams is required.
• Low latency is needed when processing the data streams.

InfoSphere Streams offers the IBM® Streams Processing Language (SPL) interface for
users to operate on data streams. SPL provides a language and runtime framework to
support streams processing applications. Users can create applications without needing
to understand the lower-level stream-specific operations. SPL provides numerous
operators, the ability to import data from outside InfoSphere Streams and export results
outside the system, and a facility to extend the underlying system with user-defined
operators. Many of the SPL built-in operators provide powerful relational functions
such as Join and Aggregate.
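To give a feel for what an aggregate over a stream does, the sketch below averages readings in fixed-size windows. It is written in Python rather than SPL, and the function name, window size, and sample readings are assumptions for illustration; it does not reproduce the SPL Aggregate operator's syntax or options.

```python
from itertools import islice

def tumbling_average(stream, window_size):
    """Yield the average of each consecutive group of window_size readings."""
    iterator = iter(stream)
    while True:
        window = list(islice(iterator, window_size))
        if len(window) < window_size:
            return            # drop the incomplete final window
        yield sum(window) / window_size

readings = [10, 20, 30, 40, 50, 60, 70]
print(list(tumbling_average(readings, 3)))
```

The generator consumes the input one tuple at a time and emits a result as soon as a window fills, so it works on an unbounded stream as well as on a list.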


Starting with InfoSphere Streams Version 4.1, users can also develop streams processing
applications in other supported languages, such as Java™ or Scala. The Java
Application API (Topology Toolkit) supports creating streaming applications
for InfoSphere Streams in these programming languages.

Deploying streams processing applications results in the creation of a dataflow graph,
which runs across the distributed run time environment. As new workloads are
submitted, InfoSphere Streams determines where to best deploy the operators to meet
the resource requirements of both newly submitted and already running
specifications. InfoSphere Streams continuously monitors the state and utilization of its
computing resources. When streams processing applications are running, they can be
dynamically monitored across a distributed collection of resources by using
the Streams Console, Streams Studio, and streamtool commands.

Results from the running applications can be made available to applications that are
running external to InfoSphere Streams by using Sink operators or edge adapters. For
example, an application might use a TCPSink operator to send its results to an
external application that visualizes the results on a map. Alternatively, it might alert an
administrator to unusual or interesting events. InfoSphere Streams also provides many
edge adapters that can connect to external data sources for consuming or storing data.

InfoSphere Streams components


InfoSphere® Streams consists of many components such as streams processing
applications, domains, instances, and resources.

• InfoSphere Streams domains
An InfoSphere Streams domain is a logical grouping of resources in a network for
common management and administration. To use InfoSphere Streams, you must
create at least one domain.
• InfoSphere Streams instances
An InfoSphere Streams instance is an InfoSphere Streams runtime environment. It is
composed of a set of interacting services running across one or more resources.
Before you can use InfoSphere Streams, you must create at least one instance.
• Domain and instance services
When you start an InfoSphere Streams domain, instance, or streams processing
application, services start on the appropriate resources.
• InfoSphere Streams interfaces
InfoSphere Streams provides a command line interface and the following graphical
user interfaces: Domain Manager; Streams Console; Streams Studio.
• Resource managers
InfoSphere Streams includes a default resource manager, which
allocates InfoSphere Streams resources. You can also use external resource
managers, which run separately from InfoSphere Streams and allocate externally
managed resources.
• Resources
Resources are physical and logical entities that InfoSphere Streams uses to run
services.
• Streams processing applications
The main components of streams processing applications are tuples, data streams,
operators, processing elements (PEs), and jobs.
• Views, charts, and tables
A view defines the set of attributes that can be displayed in a chart or table for a
specific viewable data stream

Overview of Big SQL

Last Updated: 2021-08-31

With Big SQL, your organization can derive significant value from your enterprise data.

What Big SQL is


IBM Big SQL is a high performance massively parallel processing (MPP) SQL engine for
Hadoop that makes querying enterprise data from across the organization an easy and
secure experience. A Big SQL query can quickly access a variety of data sources including
HDFS, RDBMS, NoSQL databases, object stores, and WebHDFS by using a single database
connection or single query for best-in-class analytic capabilities.

What Big SQL looks like


Big SQL provides tools to help you manage your system and your databases, and you can
use popular analytic tools to visualize your data.


How Big SQL works


Big SQL's robust engine executes complex queries for relational data and Hadoop data.
Big SQL provides an advanced SQL compiler and a cost-based optimizer for efficient
query execution. Combining these with a massive parallel processing (MPP) engine
helps distribute query execution across nodes in a cluster.
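The idea of distributing one query across nodes can be sketched as a scatter-gather aggregation: each node aggregates its own partition, and a coordinator merges the partial results. The Python below is a conceptual illustration with made-up data and function names; it is not how Big SQL plans or executes queries internally.

```python
# Hypothetical sales rows partitioned across three worker nodes.
partitions = [
    [("east", 100), ("west", 50)],
    [("east", 25), ("north", 70)],
    [("west", 30)],
]

def partial_sum(rows):
    """Work done on one node: aggregate only the local partition."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals

def merge(partials):
    """Work done on the coordinator: combine the per-node results."""
    result = {}
    for partial in partials:
        for region, amount in partial.items():
            result[region] = result.get(region, 0) + amount
    return result

totals = merge(partial_sum(p) for p in partitions)
print(totals)
```

Each partition is processed independently, so the per-node step can run in parallel and only the small partial results travel to the coordinator.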


Why Big SQL?


Big SQL provides world-class scalability and performance, and supports the following key use
cases:
Enterprise Data Warehouse (EDW) offloading
Big SQL understands commonly-used SQL syntax from other vendors and producers. You
can offload and consolidate old data more quickly and easily from existing Oracle, IBM®
Db2®, and IBM Netezza® enterprise data warehouses or data marts while preserving
most of the SQL from those platforms.
Federated access to relational data
For data that can't be moved to Hadoop, Big SQL provides federated access to many
relational database management system (RDBMS) sources outside of Hadoop with IBM
Fluid Query technology and NoSQL databases with the use of Spark connectors. You can
use a single database connection to access data across Hadoop and dozens of
relational/NoSQL database types, whether they are on the cloud, on local systems, or
both. Wherever the data resides, Big SQL offers data virtualization and enables querying
disparate sources in a single query.
Big SQL also boasts:

• Elastic boost technology to support more granular resource usage and increase
performance without increasing memory or CPU
• High-performance scans, inserts, updates, and deletes
• Deeper integration with Spark 2.1 than other SQL-on-Hadoop technologies
• Machine learning or graph analytics with Spark with a single security model
• Open Data Platform initiative (ODPi) compliance
• Advanced, ANSI-compliant SQL queries

• Big SQL architecture


Built on the world class IBM common SQL database technology, Big SQL is a
massively parallel processing (MPP) database engine that has all the standard
RDBMS features and is optimized to work with the Apache Hadoop ecosystem.
• What's new
IBM Big SQL v5.0.2 has improved performance, usability, serviceability, and
consumability capabilities.
• Big SQL features
Big SQL features include easy-to-use tools, flexible security options, strong
federation and performance capabilities, and a massively parallel processing (MPP)
SQL engine that provides powerful SQL processing features.
• Big SQL use case: EDW optimization
A key use case for Big SQL is to perform enterprise data warehouse (EDW)
optimization.


• Best practices
Best practices articles are available for a wide variety of use cases. Check this list to
find the best practices for your environment and tasks.
• System and software compatibility
The system and software compatibility report provides a complete list of supported
operating systems, system requirements, prerequisites, and optional supported
software for Big SQL v5.0.2.
