Hadoop

Uploaded by

goyaniharsh4
Copyright
© All Rights Reserved

1, Hive optimization techniques:-

1. Partitioning:- Divide tables into partitions based on column values (e.g.,
date, region). Helps Hive read only the required data instead of scanning
the whole table.
2. Bucketing:- Further divides the data in each partition into a fixed number
of buckets by hashing a column. Improves performance of joins and sampling.
3. Vectorization:- Executes queries in batches instead of row by row.
Increases query execution speed.
4. File Format Selection:- Use optimized file formats like ORC or Parquet
instead of plain text/CSV. They support compression and faster reads.
5. Column Pruning:- Select only required columns (SELECT col1, col2) instead
of SELECT *. Reduces unnecessary data reading.
6. Predicate Pushdown:- Push filters (WHERE conditions) closer to data
source. Reads only relevant rows, saving time.
7. MapReduce/Tez Execution Engine:- Use Apache Tez or Spark execution
engine instead of default MapReduce for faster performance.
8. Parallel Execution:- Enable parallel execution of independent tasks in
Hive.
9. Joins Optimization:- Use map-side joins or broadcast joins when one table
is small. Helps avoid expensive shuffling in reduce phase.
10. Compression:- Enable intermediate data compression to reduce I/O and
improve performance.
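The map-side join in point 9 can be illustrated with a small sketch. This is a plain-Python model of the idea, not Hive code: the small table is loaded into an in-memory hash table and the large table is streamed past it once, so the large table never has to be shuffled. The table contents and the function name `map_side_join` are made up for illustration.

```python
# Plain-Python model of a map-side (broadcast) join; this is not Hive code.
# Table contents and names are hypothetical.

def map_side_join(big_rows, small_rows, key):
    """Inner-join big_rows to small_rows on `key` without shuffling big_rows."""
    # Build an in-memory hash table from the small table (the broadcast side).
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Stream the big table once; each row is matched locally via the hash table.
    joined = []
    for row in big_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


regions = [  # small dimension table
    {"region_id": 1, "region_name": "North"},
    {"region_id": 2, "region_name": "South"},
]
sales = [  # large fact table (tiny here)
    {"sale_id": 100, "region_id": 1, "amount": 250},
    {"sale_id": 101, "region_id": 2, "amount": 75},
    {"sale_id": 102, "region_id": 3, "amount": 40},  # no matching region
]

result = map_side_join(sales, regions, "region_id")
for row in result:
    print(row)
```

The approach only pays off when the small table fits comfortably in memory on every worker, which is why the notes restrict it to the case where one table is small.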
2, Partitioning and bucketing:-
1. Partitioning in Hive:-
• Partitioning means dividing a large table into smaller parts based on the
values of one or more columns (called partition keys).
• Example: A sales table partitioned by year or region.
• When you query, Hive will only scan the required partitions instead of the
whole table → reduces query time.
Types:
1. Static Partitioning → You manually specify partition values while loading
data.
2. Dynamic Partitioning → Hive automatically creates partitions based on
column values.
Benefit: Reduces data scanning → faster queries.
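A rough way to picture partition pruning and dynamic partitioning is the sketch below. It is a plain-Python model under simplifying assumptions: each partition is just a list stored under its partition value, and the names `load_dynamic` and `query_partition` are hypothetical helpers, not Hive functions.

```python
from collections import defaultdict

# Plain-Python model of partitioning; this is not Hive code.

def load_dynamic(rows, partition_key):
    """Dynamic partitioning: each row picks its partition from its own column value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    return dict(partitions)


def query_partition(partitions, wanted_value):
    """Partition pruning: read only the matching partition, skip all others."""
    rows = partitions.get(wanted_value, [])
    return rows, len(rows)


sales = [
    {"year": 2022, "region": "North", "amount": 120},
    {"year": 2023, "region": "South", "amount": 300},
    {"year": 2023, "region": "North", "amount": 80},
    {"year": 2024, "region": "East", "amount": 45},
]

by_year = load_dynamic(sales, "year")
rows_2023, scanned = query_partition(by_year, 2023)
total_rows = sum(len(p) for p in by_year.values())
print(f"scanned {scanned} of {total_rows} rows")
```

With static partitioning the caller would name the target partition explicitly when loading, instead of letting `load_dynamic` derive it from the data.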
2. Bucketing in Hive:- Bucketing further divides data within each partition into
a fixed number of buckets based on the hash of a column.
• Example: A student table bucketed by student_id into 10 buckets.
• Rows with the same bucket column value always go to the same bucket.
• Improves performance of joins, sampling, and queries on large datasets.
Benefit: Balanced data distribution and efficient joins.
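The student example can be sketched as follows. This is a plain-Python model, and it assumes an integer `student_id` with a simple modulo standing in for Hive's hash function; the helper names are hypothetical.

```python
# Plain-Python model of bucketing; this is not Hive code.
NUM_BUCKETS = 10


def bucket_for(student_id, num_buckets=NUM_BUCKETS):
    # Assumption: modulo on the integer key stands in for Hive's hash function.
    return student_id % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Place every row in the bucket chosen by its student_id."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row["student_id"], num_buckets)].append(row)
    return buckets


students = [{"student_id": i, "name": f"student_{i}"} for i in range(1, 101)]
buckets = bucketize(students)
sizes = [len(b) for b in buckets]
print(sizes)

# Sampling: reading a single bucket touches about one tenth of the rows.
sample = buckets[3]
```

Because the same `student_id` always maps to the same bucket number, two tables bucketed the same way on that column can be joined bucket by bucket.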
3, RDBMS VS HIVEQL:-
Aspect           | RDBMS (Relational Database Management System)                    | HiveQL (Hive Query Language)
-----------------|------------------------------------------------------------------|-----------------------------------------------------------------------
Definition       | A database system to store and manage structured data using SQL. | A query language for Hive, built on top of Hadoop, to process big data.
Data Size        | Handles GBs to TBs of data.                                      | Designed for TBs to PBs of data.
Data Storage     | Stores data in tables (row-oriented) on disk.                    | Stores data in HDFS (Hadoop Distributed File System).
Processing       | Real-time transaction processing (OLTP).                         | Batch processing (OLAP, analytical queries).
Query Language   | Uses SQL.                                                        | Uses HiveQL (very similar to SQL).
Schema           | Schema is strict and must be defined before inserting data.      | Schema is flexible; supports schema-on-read.
Execution Engine | Queries run directly on the RDBMS engine.                        | Queries are converted into MapReduce / Tez / Spark jobs.
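The Schema row is easier to see with a small sketch of schema-on-read. This is a plain-Python model, not Hive code: the raw file is stored untouched, and a schema (a list of column names and parsers) is applied only when the data is read. The file contents and the helper `read_with_schema` are hypothetical; turning an unparsable value into `None` mirrors how Hive returns NULL for a text field that does not fit the declared column type.

```python
import csv
import io
from datetime import date

# Raw data is stored as-is; nothing is validated at load time.
RAW_FILE = "1,Asha,2023-01-05\n2,Ravi,not-a-date\n3,Mina,2023-02-11\n"


def read_with_schema(raw_text, schema):
    """Apply (column name, parser) pairs to the raw text at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, parse), value in zip(schema, fields):
            try:
                row[name] = parse(value)
            except ValueError:
                row[name] = None  # value does not fit the declared type
        rows.append(row)
    return rows


# Two different schemas over the same stored bytes.
schema_a = [("id", int), ("name", str), ("joined", date.fromisoformat)]
schema_b = [("id", str), ("name", str)]

rows_a = read_with_schema(RAW_FILE, schema_a)
rows_b = read_with_schema(RAW_FILE, schema_b)
print(rows_a[1])
print(rows_b[0])
```

A schema-on-write system would instead reject the second line when it is inserted, before it ever reaches storage.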
4, Hadoop ecosystem components:-
The Hadoop ecosystem is a collection of tools that helps store, process, and
analyze large volumes of data. Its components can be grouped into four layers:

1. Storage Layer
• HDFS (Hadoop Distributed File System) → Stores huge datasets across
clusters in a distributed manner.
• HBase → NoSQL database on top of HDFS; supports random read/write
access.
2. Processing Layer
• MapReduce → Original batch processing engine of Hadoop.
• Apache Tez → DAG-based framework for faster query execution.
• Apache Spark → In-memory processing engine, much faster than
MapReduce.
3. Resource Management Layer
• YARN (Yet Another Resource Negotiator) → Manages resources and job
scheduling in the Hadoop cluster.
4. Data Access & Analysis Layer
• Hive → SQL-like interface for querying big data (data warehousing).
• Pig → High-level scripting language for analyzing large datasets.
• Mahout → Machine learning library on Hadoop.
• ZooKeeper → Coordination service for managing distributed applications.
• Sqoop → Transfers data between RDBMS and Hadoop.
• Flume → Collects and loads large amounts of log/streaming data into
HDFS.
• Oozie → Workflow scheduler for managing Hadoop jobs.
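To make the MapReduce entry in the Processing Layer concrete, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) as a word count. It is a plain-Python model, not Hadoop code; a real job would run the map and reduce steps in parallel across the cluster. The input lines and function names are made up for illustration.

```python
from collections import defaultdict

# Single-process model of the MapReduce flow; this is not Hadoop code.

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = [
    "Hive runs on Hadoop",
    "Spark runs on YARN",
    "Hadoop stores data in HDFS",
]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```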
5, SQL VS NOSQL:-
Aspect         | SQL (Relational Databases)                                   | NoSQL (Non-Relational Databases)
---------------|--------------------------------------------------------------|-----------------------------------------------------------------------------
Full Form      | Structured Query Language                                    | Not Only SQL
Data Model     | Relational (tables with rows & columns)                      | Non-relational (document, key-value, graph, column-based)
Schema         | Fixed, predefined schema (strict)                            | Dynamic / flexible schema
Scalability    | Vertical scaling (scale-up by adding more CPU/RAM)           | Horizontal scaling (scale-out by adding more servers)
Transactions   | Supports ACID (Atomicity, Consistency, Isolation, Durability) | Often supports BASE (Basically Available, Soft state, Eventually consistent)
Best For       | Structured data, complex queries, transactions               | Unstructured or semi-structured data, high scalability, big data apps
Query Language | SQL                                                          | Database-specific APIs or query methods (e.g., MongoDB queries, Cassandra CQL)
Examples       | MySQL, Oracle, PostgreSQL, SQL Server                        | MongoDB, Cassandra, CouchDB, Redis, Neo4j
Use Cases      | Banking, ERP, CRM, traditional applications                  | Social networks, real-time analytics, IoT, big data apps
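The Schema row of the table can be illustrated with a short sketch. The relational side uses SQLite through Python's standard `sqlite3` module; the document side is only a plain dictionary standing in for a document store, not a real NoSQL client. Table, column, and field names are hypothetical.

```python
import sqlite3

# Relational side: the table's columns are fixed when it is created.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Asha"))

try:
    # 'nickname' is not part of the schema, so the statement is rejected.
    conn.execute(
        "INSERT INTO users (id, name, nickname) VALUES (?, ?, ?)",
        (2, "Ravi", "rv"),
    )
    sql_rejected = False
except sqlite3.OperationalError:
    sql_rejected = True

# Document side (a dict standing in for a document store):
# each record may carry its own set of fields.
documents = {}
documents[1] = {"name": "Asha"}
documents[2] = {"name": "Ravi", "nickname": "rv", "tags": ["iot"]}

row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(sql_rejected, row_count, sorted(documents[2]))
```

To store the nickname relationally, the schema would first have to be changed (for example with `ALTER TABLE`), whereas the document model accepts the new field without any prior declaration.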
----------------------------------------------------------------------------------------------------------
6, Hive Architecture Points:-
1, Hive CLI, 2, User Interface, 3, HiveQL, 4, Driver, 5, Compiler, 6, Metastore,
7, Execution Engine, 8, Hadoop Components
----------------------------------------------------------------------------------------------------------
7, Hadoop Architecture Points:-
1, HDFS, 2, YARN, 3, MapReduce/Processing Engine
