UNIT - 1
BIG DATA

SYLLABUS:
Introduction to Big Data: Types of digital data, history of Big Data innovation, introduction to
Big Data platform, drivers for Big Data, Big Data architecture and characteristics, 5 Vs of Big
Data, Big Data technology components, Big Data importance and applications, Big Data
features – security, compliance, auditing and protection, Big Data privacy and ethics, Big Data
Analytics, Challenges of conventional systems, intelligent data analysis, nature of data, analytic
processes and tools, analysis vs reporting, modern data analytic tools.
What is Big Data?
Big Data refers to extremely large and complex data sets that are difficult to store, manage,
and process using traditional data processing tools and methods.
In Simple Words:
• Big Data means “a huge amount of data” — so big and fast that traditional software (like
Excel or simple databases) can't handle it.
Key Points:
• It includes structured, semi-structured, and unstructured data.

• The size of Big Data is usually in terabytes, petabytes, or more.


• Big Data is not only about size but also speed, variety, and complexity.
Example:
• Imagine YouTube — every minute, people upload 500+ hours of videos. All that video
content, user data, likes/dislikes — this is Big Data. Analyzing it helps YouTube recommend
videos, detect spam, and show relevant ads.
1. Types of Digital Data
Digital data is the foundation of Big Data. It is generally categorized into three types:
1. Structured Data
• Definition: Data organized in a fixed schema (rows and columns).
• Storage: Stored in traditional databases (RDBMS).
• Examples:
• Employee records in Excel
• SQL databases like MySQL, Oracle
• Features:
• Easy to enter, store, query, and analyze.
• Machine-readable and well-formatted.
2. Semi-Structured Data
• Definition: Data that does not follow a strict structure but still contains tags or markers for easy separation of
elements.
• Storage: Stored in formats like XML, JSON, NoSQL databases.
• Examples:
• Emails (to, from, subject, body)
• Web pages with HTML tags
• Sensor data in JSON format
• Features:
• More flexible than structured data.
• Requires custom parsers or semi-structured data tools.
3. Unstructured Data
• Definition: Data that lacks any predefined format or organization.
• Storage: Data lakes, distributed file systems (e.g., HDFS).
• Examples:
• Social media posts, images, videos, PDFs
• Surveillance footage, audio files
• Features:
• Most abundant form of data today.
• Hard to analyze without specialized tools like Hadoop/Spark.
History of Big Data Innovation
The evolution of Big Data spans decades and has been driven by advances in storage,
processing power, and the internet.
Timeline Overview:

• 1960s–70s: Early development of databases and data storage systems like IBM's IMS. Data stored on magnetic tapes.
• 1980s: Relational databases (RDBMS) like Oracle and DB2 emerged. Structured data dominated.
• 1990s: Explosion of the internet. Web data starts growing. Data warehousing concepts introduced.
• 2001: Gartner analyst Doug Laney introduces the 3Vs model (Volume, Velocity, Variety) of Big Data.
• 2003–2004: Google publishes the MapReduce and Google File System (GFS) papers, which inspired Hadoop.
• 2005: Apache Hadoop is created by Doug Cutting. It handles distributed processing of massive data sets.
• 2010s: Growth of social media, mobile apps, and cloud computing drives a huge surge in unstructured data. Spark, NoSQL, and cloud-based big data platforms evolve.
• 2020 onwards: AI, ML, and IoT generate and consume massive real-time data. Focus shifts to data privacy, ethics, and advanced analytics tools.
Key Drivers of Big Data
• The term “drivers” refers to the factors or reasons behind the growth and importance
of Big Data. These are the main forces that have pushed Big Data to become essential
in today’s world.
1. Rapid Growth of Internet and Social Media
• Billions of users are active on platforms like Facebook, YouTube, Instagram, and
Twitter.
• Every second, people are uploading photos, videos, comments, likes, etc.
• This creates huge volumes of data every day.
Example: YouTube gets more than 500 hours of video uploads per minute.
2. Increased Use of Smartphones and IoT Devices
• Every smartphone, smartwatch, and smart device (like Alexa, fitness bands, smart
TVs) collects and sends data.
• These devices generate real-time data from various sensors.
• Example: A smart home system collects data about temperature, lights, and
energy usage.
3. Cheap Storage and Cloud Computing
• Earlier, storing large data was expensive.
• Now, cloud services like AWS, Google Cloud, and Microsoft Azure offer cheap and scalable storage.
• This allows companies to collect and store massive amounts of data easily.
• Point to remember: Cloud storage is flexible, cost-effective, and accessible from anywhere.
4. Advancements in Data Processing Technologies
• Tools like Hadoop, Spark, NoSQL databases allow fast and distributed processing of large data.
• These technologies help in handling structured and unstructured data with ease.
• Example: Hadoop breaks big data into smaller parts and processes it in parallel.
5. Need for Real-Time Decision Making
• Companies need to make quick decisions to stay competitive.
• Big Data helps in analyzing trends, customer behavior, and business performance in real-time.
• Example: E-commerce sites suggest products based on what users just searched.
6. Growth of AI and Machine Learning
• AI and ML need huge amounts of data to learn and make accurate predictions.
• Big Data provides the fuel for these smart systems.
• Example: Netflix uses ML and Big Data to recommend shows based on your watch history.
1. Introduction to Big Data Platform
Definition:
A Big Data Platform is an integrated system that combines various tools and
technologies to manage, store, and analyze massive volumes of data efficiently.
It provides the infrastructure and environment required for:
• Ingesting data (bringing data from sources)
• Storing data (on distributed systems)
• Processing data (in batch or real-time)
• Analyzing and visualizing data
Benefits of Big Data Platforms:
• Scalability: Easily handle growing data
• Flexibility: Supports all types of data (structured, semi-structured, unstructured)
• Real-Time Processing: Immediate insights and decisions
• Cost-Effective: Cloud and open-source tools reduce expenses
Main Components of a Big Data Platform:
1.Data Ingestion Tools
1. Used to collect and import data from different sources
2. Examples: Apache Kafka, Apache Flume, Sqoop
2.Data Storage Systems
1. Store large datasets reliably
2. Examples: HDFS (Hadoop Distributed File System), NoSQL (MongoDB, Cassandra)
3.Processing Engines
1. Perform computations and analytics on data
2. Examples: Hadoop MapReduce (batch), Apache Spark (real-time)
4.Data Management
1. Tools to organize, clean, and maintain data quality
2. Examples: Hive, HBase
5.Analytics & Visualization Tools
1. Help in generating reports and dashboards
2. Examples: Tableau, Power BI, Apache Pig, R, Python
Examples of Big Data Platforms:
• Apache Hadoop Ecosystem
• Apache Spark Framework
• Google Cloud BigQuery
• Amazon EMR (Elastic MapReduce)
• Microsoft Azure HDInsight
Big Data Architecture and Characteristics

What is Big Data Architecture?


• Big Data Architecture is the framework that defines how Big Data is collected,
stored, processed, and accessed.
It uses distributed computing — meaning the work is shared across many
machines to handle large-scale data.
Main Components of Big Data Architecture:
1. Data Sources
These are the origins where data is generated. Data can come from websites, social
media, IoT devices, or business applications.
2. Storage Layer
Stores large volumes of data. It uses distributed systems like HDFS (Hadoop) and
NoSQL databases (MongoDB, Cassandra) to store structured and unstructured data.
3. Batch Processing
Processes large sets of data in chunks at specific intervals. Used when real-time
processing isn’t necessary. Tools: Hadoop MapReduce, Hive.
4. Real-Time Message Ingestion
Collects data as it’s generated. Allows for immediate processing. Tools: Apache Kafka,
Flume.
5. Stream Processing
Processes real-time data continuously as it comes in. Ideal for immediate insights like
fraud detection. Tools: Apache Storm, Apache Flink.
6. Analytical Data Store
Stores processed data in a format optimized for analysis. Examples: Amazon
Redshift, Google BigQuery.
7. Analysis & Reporting
Tools used for generating reports and dashboards. Helps users make data-driven
decisions. Examples: Tableau, Power BI.
8. Orchestration
• Automates workflows and tasks across different components. Ensures data
processing runs smoothly. Tools: Apache Oozie, Apache Airflow.
Architecture Flow (Text Diagram):
Data Sources → Ingestion (Batch Storage or Real-Time Messages) → Processing (Batch or Stream) → Analytical Data Store → Analysis & Reporting, with Orchestration coordinating every stage.
5 Vs of Big Data
The 5 Vs represent the core characteristics that define Big Data:
1. Volume
• Refers to the massive amount of data generated every day.
• Data can range from terabytes (TB) to petabytes (PB).
• Example: Social media platforms generate billions of posts, tweets, and images daily.
2. Velocity
• The speed at which data is generated and processed.

• Data flows in real-time or near real-time, requiring quick analysis.


• Example: Stock market transactions or sensor data from IoT devices.
3. Variety
• Refers to the different types of data.
• Data can be structured (databases), semi-structured (XML, JSON), or unstructured
(videos, emails).
• Example: Social media posts, images, and structured data from databases.
4. Veracity
• The uncertainty of data; how reliable or accurate data is.
• Deals with data quality, consistency, and correctness.
• Example: Fake news or noisy sensor data.
5. Value
• The usefulness of the data in deriving insights.
• The value comes from processing and analyzing the data to extract meaningful patterns.
• Example: Business recommendations based on customer behavior analysis.
Big Data Technology Components
1. Cloud Computing
• What it is: Storing and processing data on remote servers via the internet.
• Why it's important: Offers scalability, flexibility, and cost-effectiveness for Big Data.
• Example: Amazon Web Services (AWS), Google Cloud, Microsoft Azure.
2. Machine Learning (ML)
• What it is: A type of AI where machines learn from data to make predictions without
explicit programming.
• Why it's important: ML helps in pattern recognition, predictions, and automating
decisions using Big Data.
• Example: Recommendation systems, fraud detection.
3. Natural Language Processing (NLP)
• What it is: AI technology for understanding and generating human language from text
data.
• Why it's important: NLP analyzes unstructured text data like customer reviews and
social media posts.
• Example: Sentiment analysis, chatbots.
4. Business Intelligence (BI)
• What it is: Tools that help businesses analyze data to make better decisions through
visualizations and reports.
• Why it's important: BI turns Big Data into actionable insights for businesses.
• Example: Tableau, Power BI, Qlik.
Big Data Importance and Applications
Importance of Big Data:
• Informed Decision Making: Big data allows businesses to base decisions on data-driven insights rather than
intuition or assumptions.
• Competitive Advantage: Organizations can analyze trends, customer behavior, and market shifts to stay ahead.
• Efficiency and Cost Reduction: Automating processes and analyzing data can reduce operational costs and
increase productivity.
Applications of Big Data
1. Business Intelligence
1. Helps businesses analyze customer behavior and trends to improve marketing strategies and make data-
driven decisions.
2. Healthcare
1. Analyzes patient data to predict diseases, personalize treatments, and improve diagnostics.
3. Retail & E-commerce
1. Optimizes inventory, personalizes shopping experiences, and improves supply chain management.
4. Finance
1. Detects fraud, assesses credit risk, and analyzes market trends for better financial decision-making.
5. Social Media Analytics
1. Analyzes user data to understand sentiments, trends, and influencer impact.
6. Smart Cities
1. Monitors and manages traffic, energy usage, and pollution to make cities more efficient and sustainable.
1. Security in Big Data
What is it?
• Security in Big Data refers to protecting data from unauthorized access, theft, or
damage.
Why is it important?
• As more data is collected, it becomes a target for hackers and malicious entities.
Protecting it ensures that sensitive information (like customer details, financial
records, etc.) stays safe.
How is it done?
• Encryption: Data is converted into a code that can only be unlocked with a key.
• Access control: Only authorized users are allowed to access sensitive data.
• Firewalls: These act like a shield, blocking unauthorized access to your data.
Example :
• Imagine you have a bank account. Encryption ensures that only you (with your
password) can access your account, while a firewall blocks hackers from trying to
steal your money.
2. Compliance in Big Data
What is it?
• Compliance means following laws and regulations regarding how data should be
collected, stored, and used.
Why is it important?
• Every country has laws to protect people’s data. Companies must follow these laws
to avoid fines and legal issues. For example, businesses must make sure they don’t
misuse customer data.
How is it done?
• Companies follow guidelines like GDPR (General Data Protection Regulation) in
Europe, which gives people control over their data.
• In healthcare, HIPAA (Health Insurance Portability and Accountability Act) ensures
that patient data is handled properly.
Example :
• If a company collects data about its customers, it must get their permission (consent)
to use it. For example, a website asking for your email address must tell you how they
will use it, and you must agree.
3. Auditing and Protection in Big Data
What is it?
• Auditing is tracking who accesses and uses the data, while protection refers to
preventing misuse or data loss.
Why is it important?
• Auditing helps to detect malicious behavior like unauthorized access or data
breaches.
• Protection makes sure that data cannot be tampered with or lost.
How is it done?
• Audit Trails: Companies keep a record of every time someone accesses the data.
• Intrusion Detection Systems: These tools watch out for unusual activities, like
unauthorized access, and alert security teams.
Example:
• If someone tries to break into your bank account, the bank can track the attack and
block it. Similarly, if you use a company's service, they can track who is accessing
your personal details to ensure no one is misusing it.
4. Big Data Privacy
What is it?
• Privacy in Big Data means making sure that personal data (like names, addresses, or
social security numbers) is protected from being shared without consent.
Why is it important?
• People have a right to keep their personal information private, and companies must
ensure they protect this data from exposure or misuse.
How is it done?
• Anonymization: Removing personal identifiers (like names or IDs) from data to keep
it private.
• Encryption: Protecting data by making it unreadable without a special key.
Example:
• When you sign up for a website, your name and email address are personal data. The
company must protect your information and not share it without your permission.
5. Big Data Ethics
What is it?
• Ethics refers to using Big Data in a way that is fair, responsible, and
transparent.
Why is it important?
• Ethical use of Big Data ensures that companies don’t take advantage of people
by using their data for unethical purposes like discrimination or exploitation.
How is it done?

• Companies must be transparent about how they use your data.


• Data should not be used to create biased algorithms that unfairly target certain
groups (e.g., biased hiring algorithms).
Example :
• Imagine a company using data to decide which applicants to hire. If the
algorithm is biased, it might unfairly reject people based on gender or
ethnicity. Ethical Big Data ensures that decisions are based on fairness.
Big Data Analytics
What is Big Data Analytics?
• Big Data Analytics refers to the process of examining large and complex datasets (often called Big
Data) to uncover hidden patterns, correlations, trends, and useful business insights.
Why is it important?
• In today’s world, businesses and organizations gather massive amounts of data. Big Data Analytics
helps them to make informed decisions by analyzing data to uncover valuable insights that weren’t
obvious at first glance.
• It’s used to improve efficiency, optimize operations, and predict future trends.
How is Big Data Analytics performed?
Big Data Analytics involves using a combination of advanced technologies, techniques, and tools to
process and analyze large volumes of data. Here’s how it works:
1.Data Collection: Collecting data from various sources like social media, customer interactions, IoT
devices, etc.
2.Data Processing: Using tools to process the data. Since Big Data can be structured, semi-structured,
or unstructured, special tools are used to organize it into usable formats.
3.Data Analysis: Applying various statistical and machine learning algorithms to analyze the data and
uncover patterns or trends.
4.Data Visualization: Presenting the findings in a visual format like graphs, dashboards, and reports so
businesses can understand the results and act on them.
Advantages of Big Data Analytics:
1.Informed Decision-Making:
1. Businesses can make data-driven decisions based on real-time insights.
2. Example: A retail company can analyze customer preferences and adjust its stock accordingly.
2.Improved Customer Insights:
1. Helps businesses understand customer behavior, preferences, and needs.
2. Example: E-commerce companies use Big Data to personalize product recommendations for
customers.
3.Cost Reduction:
1. Big Data can help businesses optimize operations
and reduce waste, leading to cost savings.
2. Example: Predicting maintenance needs of machinery can prevent breakdowns and lower repair
costs.
4.Better Operational Efficiency:
1. Identifies inefficiencies and helps streamline business processes.
2. Example: Manufacturing companies use Big Data to monitor production lines and improve
efficiency.
5.Competitive Advantage:
1. Companies using Big Data can stay ahead of competitors by leveraging insights for better
strategies.
2. Example: Financial institutions use Big Data to detect fraud faster than traditional methods.
Disadvantages of Big Data Analytics:
1.Data Privacy Concerns:
1. Collecting and analyzing large amounts of personal data can lead to privacy issues.
2. Example: If customer data is not handled securely, it can be exploited for malicious purposes.
2.High Costs:
1. The infrastructure and tools required for Big Data Analytics can be expensive.
2. Example: Setting up cloud storage, advanced software, and hiring specialized talent can be costly.
3.Data Complexity:
1. Analyzing unstructured and diverse data (like images, videos, and social media content) can be difficult.
2. Example: A company might struggle to extract
meaningful insights from text-heavy social media data.
4.Lack of Skilled Professionals:
1. The demand for professionals who can handle Big Data is high, and there is often a shortage of qualified
personnel.
2. Example: It may be challenging to find data scientists who are skilled in handling Big Data tools and
algorithms.
5.Data Security Risks:
1. The larger the dataset, the greater the risk of data breaches and hacking attacks.
2. Example: If a healthcare company experiences a data breach, it could compromise sensitive patient
information.
Challenges of Conventional Systems
• Conventional systems are traditional data processing systems that were designed for
small, structured datasets. As data grows and becomes more complex, these systems
face several key challenges:
1. Limited Data Handling Capacity
• Traditional systems struggle to handle massive volumes of data (Big Data).
• Problem: They may slow down, crash, or fail when handling large datasets.
2. Inability to Handle Unstructured Data
• Conventional systems are built to process structured data (e.g., tables in databases).
• Problem: They struggle with unstructured data like text, videos, and social media
posts.
3. Slower Data Processing
• These systems often rely on batch processing, which is slow.
• Problem: Real-time data analysis is difficult, making it hard to make quick decisions.
4. Scalability Issues
• Traditional systems can't easily scale up or expand to handle larger data.
• Problem: As data grows, these systems cannot adapt efficiently.
5. High Costs
• Maintaining conventional systems with physical servers is expensive.
• Problem: Storing and processing large amounts of data becomes costly.
6. Difficulty in Real-Time Decision Making

• These systems can't process data fast enough for real-time decisions.
• Problem: Businesses miss out on opportunities that require immediate action.
7. Limited Flexibility and Integration
• Traditional systems don’t integrate well with modern technologies like cloud or
machine learning.
• Problem: It's hard to use new tools alongside old systems.
8. Data Quality Issues
• Conventional systems struggle with ensuring clean and consistent data.
• Problem: Data errors or inconsistencies can affect decision-making.
Intelligent Data Analysis
Intelligent Data Analysis refers to using advanced techniques, algorithms, and tools to analyze large
datasets and extract meaningful patterns, insights, and predictions. It involves the application of
artificial intelligence (AI), machine learning (ML), and statistical models to make smarter decisions
based on data.
Key Points:
1.AI & Machine Learning: These technologies help in learning from data and predicting future trends
or behaviors without human intervention.
2.Pattern Recognition: Intelligent data analysis identifies hidden patterns in data that are not
immediately obvious.
3.Automation: It automates data analysis processes, making it faster and more efficient.
4.Predictive Analytics: It helps forecast future events, trends, or behaviors based on historical data.
5.Real-time Insights: Intelligent analysis can provide real-time insights, helping businesses to make
quicker, more informed decisions.
Example:
• In retail, intelligent data analysis can be used to predict which products will sell best in the future by
analyzing past sales data, customer preferences, and trends.
Nature of Data
Nature of Data refers to the different forms, types, and characteristics of data that affect how
it is stored, processed, and analyzed.
Types of Data:
1.Structured Data:
1. What it is: Data that is organized into tables, rows, and columns, typically in relational databases (e.g.,
customer records, sales data).
2. Example: A database of employee information where each row represents an employee with columns like
name, age, salary, etc.
2.Unstructured Data:
1. What it is: Data that doesn't have a predefined structure, making it difficult to analyze with traditional
methods (e.g., text, images, audio, video).
2. Example: Social media posts, customer reviews, or video files.
3.Semi-structured Data:
1. What it is: Data that doesn't have a rigid structure but contains tags or markers that make it easier to
analyze (e.g., XML, JSON).
2. Example: A log file that contains a mixture of structured data (timestamps) and unstructured data (event
descriptions).
4.Big Data:
1. What it is: Extremely large datasets that require advanced tools and techniques for storage, processing,
and analysis. Big Data is often characterized by the 5 Vs: Volume, Velocity, Variety, Veracity, and Value.
2. Example: Data from IoT sensors, social media platforms, and web logs.
Characteristics of Data:
1.Volume: The amount of data being generated. It can be terabytes or even
petabytes.
2.Velocity: The speed at which data is generated and needs to be processed
(e.g., real-time data).
3.Variety: The different types of data (structured, unstructured, semi-
structured).
4.Veracity: The quality and accuracy of the data.
5.Value: The usefulness of the data for decision-making or gaining insights.
Nature of Data in Big Data:
• Big Data contains data from multiple sources that vary in type, speed, and
structure. Processing this data requires advanced technologies like Hadoop,
Spark, and machine learning to handle its complexity.
Analytic Processes and Tools
Analytic Process
The analytic process in Big Data involves several steps to extract meaningful insights
from large datasets. These steps are essential for data analysis and decision-making.
1.Data Collection:
1. Collect data from various sources such as sensors, databases, social media, and logs.
2. Example: Collecting sales data from e-commerce websites.
2.Data Cleaning:
1. Remove errors, duplicates, and irrelevant
information to ensure high-quality data.
2. Example: Removing duplicate customer entries from a database.
3.Data Analysis:
1. Apply statistical methods, machine learning models, and algorithms to analyze the data and
uncover patterns.
2. Example: Analyzing customer behavior patterns using machine learning.
4.Interpretation of Results:
1. After analysis, interpret the results to make informed decisions.
2. Example: Predicting future sales trends based on past data.
Tools: Excel, R and Python, Hadoop and Spark, Tableau/Power BI, SQL
Analysis vs Reporting
While both analysis and reporting involve working with data, they serve different
purposes.
Analysis:
• Goal: To explore data, find patterns, and make predictions.
• Process: Involves using statistical models, machine learning, and algorithms.
• Outcome: Provides insights that can guide strategic decision-making.
• Example: Using customer data to predict future purchase behavior.

Reporting:
• Goal: To present data in a simple, understandable format.
• Process: Involves summarizing data in charts, graphs, and tables.
• Outcome: Provides an overview of performance or trends, typically for monitoring
purposes.
• Example: A monthly sales report showing the total revenue, top-selling products, and
key metrics.
Key Differences:
• Analysis is more about understanding and extracting insights from data, while reporting is about summarizing and
presenting data for easy consumption.
• Analysis typically involves advanced methods, while reporting is more about presenting results in an understandable way.
Modern Data Analytic Tools (Short Notes - AKTU Oriented)
1. Hadoop
1. Open-source framework for storing and processing large data sets in a distributed manner.
2. Handles structured and unstructured data.
2. Apache Spark
1. Fast in-memory data processing tool.
2. Suitable for real-time analytics.
3. Power BI
1. Microsoft’s tool for creating interactive dashboards and reports.
2. Easy to use and integrates with various data sources.
4. Tableau
1. Data visualization tool.
2. Helps in making graphs, charts, and dashboards for better understanding.
5. Python & R
1. Programming languages for data analysis, visualization, and machine learning.
2. Python is widely used due to its simplicity and libraries like Pandas, NumPy.
6. SQL
1. Language used to query and manage structured data in databases.
2. Essential for data extraction and manipulation.
7. Google Analytics
1. Used to track and report website traffic and user behavior.
UNIT – 3

BIG DATA

SYLLABUS: HDFS (Hadoop Distributed File System): Design of HDFS, HDFS concepts,
benefits and challenges, file sizes, block sizes and block abstraction in HDFS, data
replication, how does HDFS store, read, and write files, Java interfaces to HDFS,
command line interface, Hadoop file system interfaces, data flow, data ingest with
Flume and Sqoop, Hadoop archives, Hadoop I/O: Compression, serialization, Avro and
file-based data structures. Hadoop Environment: Setting up a Hadoop cluster, cluster
specification, cluster setup and installation, Hadoop configuration, security in Hadoop,
administering Hadoop, HDFS monitoring & maintenance, Hadoop benchmarks,
Hadoop in the cloud
1. Introduction to HDFS
• HDFS stands for Hadoop Distributed File System. It is the primary storage
system used by Hadoop applications to store large datasets reliably and
efficiently across multiple machines. HDFS is inspired by Google File System
(GFS) and is designed to run on low-cost commodity hardware.
2. Design of HDFS
• HDFS follows a Master-Slave Architecture.
• The NameNode is the master and manages the file system namespace (i.e., file
names, directories, block mappings).
• The DataNodes are the slaves that store the actual data blocks.
• Files are split into large blocks (e.g., 128 MB or 256 MB) and distributed across
DataNodes.
• It is designed for high fault tolerance, high throughput, and large-scale data
storage.
3. HDFS Concepts
• NameNode: Stores metadata of files like file permissions, block locations, and directory
structure. It doesn’t store the actual data.
• DataNode: Stores the actual data blocks. Each block is replicated for reliability.
• Block: The smallest unit of storage in HDFS. A large file is split into blocks which are
distributed.
• Replication: Blocks are replicated across multiple DataNodes to ensure fault tolerance.
• Rack Awareness: HDFS knows about rack locations to place replicas intelligently across
racks.
4. Benefits of HDFS
1.Fault Tolerance: HDFS replicates data blocks, so even if one or more nodes fail, data is still
accessible.
2.Scalability: HDFS clusters can scale by adding more nodes without downtime.
3.High Throughput: Optimized for streaming access to large datasets, ensuring fast
processing.
4.Cost-Effective: Runs on inexpensive commodity hardware.
5.Data Locality: Computation is moved closer to where the data is stored, improving
performance.
Challenges of HDFS
1.Not suitable for real-time processing: It is best for batch processing, not for
low-latency operations.
2.Small Files Issue: Too many small files can overwhelm the NameNode’s
memory, reducing efficiency.
3.Security Limitations: Requires integration with other tools like Kerberos or
Apache Ranger for security.
4.Single Point of Failure: If the NameNode fails, the entire system can stop
(solved using High Availability configuration).
File Sizes in HDFS
• HDFS is optimized for very large files (in gigabytes or terabytes).
• Small files should be avoided or combined because they increase the load on
the NameNode.
• Ideal use-case: applications that write data once and read many times (write-
once-read-many model).
Block Size in HDFS:
• In HDFS, files are split into large blocks (default: 128 MB, can be configured).
• This is much larger than traditional file systems (like 4 KB or 8 KB).
• Why large? It reduces the load on the NameNode, improves performance, and helps
in handling big data efficiently.
• Example: A 400 MB file will be split into 3 blocks of 128 MB and 1 block of 16 MB.
• Blocks are stored independently on different DataNodes.
• Users don’t manage blocks directly—HDFS handles it automatically (this is called
block abstraction)
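To make the block arithmetic above concrete, here is a minimal Java sketch (illustrative only; 128 MB is the default block size and is configurable through dfs.blocksize):

```java
public class BlockCount {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;    // default HDFS block size: 128 MB
        long fileSize  = 400L * 1024 * 1024;    // example file from the notes: 400 MB
        long fullBlocks = fileSize / blockSize; // 3 full 128 MB blocks
        long lastBlock  = fileSize % blockSize; // remaining 16 MB stored as a smaller block
        System.out.println(fullBlocks + " full blocks + "
                + (lastBlock / (1024 * 1024)) + " MB final block");
    }
}
```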
Data Replication in HDFS
• HDFS maintains multiple replicas (default is 3) of each data block.
• Replication strategy:
• First replica on the local node.
• Second on a different node in the same rack.
• Third on a node in a different rack.
• This ensures high availability, data durability, and fault tolerance.
• If one node fails, the system automatically reads from another replica.
What is Block Abstraction?
• In HDFS, files are divided into large blocks (default size: 128 MB or 256 MB).
• These blocks are treated independently and stored across different machines in the Hadoop cluster.
• Users or applications do not deal with the actual block management—this is handled internally by
the HDFS NameNode.
• This concept of separating file management into blocks is called block abstraction.
Benefits of Block Abstraction:
1.Scalability
1. Large files can be stored across many nodes.
2. Easy to add more storage by adding more nodes.
2.Fault Tolerance
1. If a block is lost due to machine failure, HDFS can retrieve it from a replicated copy.
3.Efficient Storage and Load Distribution
1. Files are divided into blocks and stored across multiple DataNodes.
2. This allows parallel processing and better resource utilization.
4.Simplified Data Management
1. HDFS doesn’t need to keep track of every byte or kilobyte—just blocks.
2. Makes the NameNode efficient and less overloaded.
5.Optimized Data Access
1. Processing can happen where the block is stored (data locality), which reduces network traffic.
How HDFS Stores Data
1.File is Split into Blocks
1. When a file is stored in HDFS, it is automatically divided into large blocks (default size: 128 MB
or 256 MB).
2.Metadata Managed by NameNode
1. The NameNode keeps track of which blocks belong to which file and where those blocks are
stored, but it doesn’t store the actual data.
3.Data Stored in DataNodes
1. The actual file data (blocks) is stored in multiple DataNodes, which are the worker machines in
the cluster.
4.Replication for Safety
1. Each block is replicated (default: 3 copies) across different DataNodes to prevent data loss in
case any node fails.
5.Write Operation
1. When writing, the client sends data directly to the first DataNode, which forwards it to the
second and third – this is called pipelined writing.
6.Acknowledgment
1. After all blocks are written and replicated, the system confirms the file is stored successfully.
HDFS Write Operation (Step-by-Step)
1.Client contacts NameNode to request writing a file.
2.NameNode checks metadata, such as permissions and assigns DataNodes for
each block.
3.File is split into blocks (e.g., 128 MB).
4.The client writes each block to the first DataNode, which then forwards it to
the second and third (pipelining).
5.Once all replicas are written, DataNodes send acknowledgments back to the
client.
6.NameNode updates the metadata once the write is successful.
Important Terms:
• Pipelined writing – data flows from client → DataNode1 → DataNode2 →
DataNode3.
• Replication – ensures fault tolerance (default is 3 copies per block).
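The write path above can be driven from client code through the Hadoop FileSystem API. The following is a minimal, hedged sketch; the NameNode address (hdfs://namenode:9000) and the file path are placeholders, not values from these notes:

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // normally loaded from core-site.xml / hdfs-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:9000");  // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/notes.txt");      // hypothetical target path
        try (FSDataOutputStream out = fs.create(file, true)) {   // true = overwrite if it exists
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }                                                   // closing the stream completes the replicated write
        fs.close();
    }
}
```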
HDFS Read Operation (Step-by-Step)
1.Client requests the file from the NameNode.
2.NameNode returns metadata – list of blocks and locations on DataNodes.
3.The client directly connects to the nearest DataNode to read each block (for
efficiency).
4.Data is read in parallel, block by block, and reassembled by the client.
5.The client doesn’t interact with the NameNode during data transfer, only at
the start.
Important Points:
• Reads are fast and parallel.
• If a DataNode is down, replica is read from another DataNode.
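For completeness, a matching read sketch using the same FileSystem API is shown below; the path is hypothetical and the cluster address is assumed to come from the configuration files on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);            // NameNode address taken from the config files

        Path file = new Path("/user/demo/notes.txt");    // hypothetical path written earlier
        try (FSDataInputStream in = fs.open(file)) {     // NameNode returns block locations here
            IOUtils.copyBytes(in, System.out, 4096, false); // data itself streams from the DataNodes
        }
        fs.close();
    }
}
```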
Java Interface to Hadoop for File Operations

In Hadoop, Java provides methods to interact with the Hadoop Distributed File System (HDFS). Here are the key
operations that can be performed using Java:

1. Hadoop's URL Scheme:
Hadoop uses its own URL scheme (hdfs://) to refer to files and directories in HDFS. These URLs let Hadoop identify and locate files within the distributed file system.

2. Creating Directories:
You can create directories in HDFS using the mkdirs() method. This method ensures that all necessary parent
directories are created. If the directory already exists, it will not be recreated.

3. Deleting Files or Directories:


To remove files or directories, you can use the delete() method. It allows you to delete files or entire directories,
including their contents if required. This method can be set to delete contents recursively or just the
directory/file itself.

4. Overloaded Methods:
Java provides overloaded versions of methods in Hadoop, offering flexibility in how files and directories are
handled, such as choosing whether to delete files recursively or not.

5. FileSystem Class:
The FileSystem class in Java provides various methods to manage files in HDFS, including creating files, reading
data, writing data, and deleting files or directories.
Hadoop Command Line Interface (CLI)
• The Hadoop Command Line Interface (CLI) lets users interact with the Hadoop Distributed File System (HDFS)
from a terminal or command prompt. It supports common file management operations such as listing (hdfs dfs -ls),
uploading (hdfs dfs -put), downloading (hdfs dfs -get), and deleting (hdfs dfs -rm) files, as well as creating and
managing directories (hdfs dfs -mkdir).
Hadoop FileSystem Interface
The Hadoop FileSystem Interface allows users to interact with different types of file systems,
including HDFS, local file systems, and cloud storage. It provides a common set of operations
such as reading, writing, deleting, and checking the existence of files and directories.
• FileSystem Class: This class provides methods for file operations, such as creating, reading,
and deleting files.
• Path Class: Represents the location of files or directories in the file system.
• Operations: Common operations include creating directories, checking file existence,
deleting files, and copying data between local and Hadoop file systems.
• Configuration: File system settings are usually configured through Hadoop's configuration
files, ensuring correct connection to HDFS or other systems.
• The interface abstracts the underlying storage systems, making it easier for users to work
with different storage backends in a consistent manner. It plays a key role in ensuring that
applications can read and write data efficiently across a variety of file systems.
Data Ingestion
Data Ingestion refers to the process of collecting and importing data from various
sources into a system, such as a database or data warehouse, for further analysis and
processing. In the context of Big Data, data ingestion involves transferring large
volumes of data into a platform like Hadoop for storage and analysis.
Challenges in Data Ingestion
1.Data Volume: Handling large amounts of data in a timely and efficient manner can
be difficult.
2.Data Variety: Different types of data (structured, semi-structured, unstructured)
need to be ingested properly.
3.Data Velocity: Data may come in at high speeds (e.g., streaming data), which
requires real-time processing and ingestion.
4.Data Quality: Ensuring data consistency, accuracy, and completeness during the
ingestion process.
5.Integration with Multiple Sources: Collecting data from various sources (databases,
social media, IoT devices) and integrating it into a unified format.
Data Ingestion with Flume
Apache Flume is a distributed and reliable data ingestion service designed to
collect, aggregate, and move large amounts of streaming data into Hadoop.
• How Data Ingestion Works in Flume:
• Source: Data is ingested from various sources, such as log files, websites, or streaming
services.
• Channel: The data is transferred through a channel (like memory or file channel) to
ensure reliable data flow.
• Sink: The data is finally sent to a destination, such as HDFS or other storage systems, for
further processing and analysis.
• Flow Configuration: Flume uses a configuration file to define the flow of data from
source to sink.
• Flume is typically used for real-time data ingestion, where data is continuously
being streamed and stored for later processing.
Data Ingestion with Sqoop
Apache Sqoop is a tool designed for transferring bulk data between Hadoop and
relational databases (e.g., MySQL, Oracle, SQL Server).
• How Data Ingestion Works in Sqoop:
• Importing Data: Sqoop can import data from relational databases into HDFS or Hive. It
performs this by reading data from tables and writing it to HDFS in a distributed fashion.
• Exporting Data: Sqoop can also export data from HDFS back to a relational database.
This is typically used to move processed data from Hadoop back to a database for
further business operations.
• Parallel Import/Export: Sqoop uses parallel processing to divide the import/export tasks
among multiple nodes, ensuring faster data ingestion and better scalability.
• Sqoop is particularly useful for batch ingestion of structured data from
relational databases to Hadoop.
Hadoop Archives (HAR)
Hadoop Archives (HAR) is a feature in HDFS used to store many small files efficiently
by bundling them into a single archive file. This reduces the overhead of managing
numerous small files in HDFS.
How It Works:
• Small files are grouped into a single archive file in HDFS.
• The HAR file is treated as a single file, improving storage efficiency.
• Accessing the data within HAR files is done using standard Hadoop tools.
Limitations:
1.Slower Access: Retrieving data from a HAR file can be slower than from individual
files.
2.Read-Only: Once created, HAR files cannot be modified.
3.No Compression: HAR does not support compression, which may limit its efficiency.
4.Management Complexity: Managing and updating large HAR files can be
cumbersome.
5.Limited Tool Support: Some Hadoop tools might not fully support HAR files.
Compression in Hadoop:
• Compression in Hadoop helps reduce the size of data stored in the system,
which saves space and makes data transfer faster. It improves the overall
performance by reducing the amount of data moved across the network or
stored in HDFS.
• Common Compression Formats: .gz, .bzip2, .lz4, .snappy
Advantages of Compression:
• Reduces disk space usage.
• Speeds up data transfer.
• Improves processing time, especially for large datasets.
• Challenges:
• Requires CPU power for compression and decompression.
• Can add some processing delays, especially with certain compression formats.
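As an illustration of how a codec is used programmatically, the sketch below compresses a small byte array with the built-in GzipCodec; Snappy, LZ4, or BZip2 codecs follow the same pattern (the output file name is arbitrary):

```java
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class GzipCompressExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Instantiate the gzip codec; other codecs can be swapped in the same way
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        try (CompressionOutputStream out = codec.createOutputStream(
                new FileOutputStream("sample" + codec.getDefaultExtension()))) {  // "sample.gz"
            out.write("some repetitive log data...".getBytes(StandardCharsets.UTF_8));
            out.finish();   // flush any remaining compressed data
        }
    }
}
```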
Serialization in Hadoop:
Serialization is the process of converting data into a format that can be stored or transmitted. In Hadoop,
serialization helps store and transfer data in a way that can be easily read from or written to the system.
Serialization Formats in Hadoop:
• Writable format (e.g., Text, IntWritable).
• Avro, a compact format for data serialization.
• Protocol Buffers and SequenceFile are also used.
Advantages of Serialization:
• Makes data storage and transmission more efficient.
• Ensures consistent and compatible data formats across different platforms.
Challenges:
• Writing custom serialization can be complex.
• The process of serialization and deserialization can affect performance due to CPU and memory usage.
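A small sketch of Writable-based serialization using Text and IntWritable, round-tripping two values through an in-memory byte stream (the key/value names are made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableExample {
    public static void main(String[] args) throws Exception {
        // Serialize a Text key and an IntWritable value to a byte stream
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        new Text("page_views").write(out);
        new IntWritable(42).write(out);

        // Deserialize them back in the same order
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        Text key = new Text();
        IntWritable value = new IntWritable();
        key.readFields(in);
        value.readFields(in);
        System.out.println(key + " = " + value.get());   // prints: page_views = 42
    }
}
```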
Avro in Hadoop:
• What is Avro?
Avro is a data serialization system used in Hadoop for efficient data exchange. It provides a
compact, fast, binary format and is used to serialize data for storage or transmission. Avro is
especially useful when working with Big Data and supports schema evolution (changing data
structure over time).
Features of Avro:
1.Compact and Fast:
Avro uses a binary format, which makes it faster and smaller in size compared to text-based
formats.
2.Schema-Based:
Data is always stored with its schema. This ensures that the data can be read without
needing an external schema.
3.Supports Schema Evolution:
Avro allows changes in schema over time like adding or removing fields without breaking
compatibility.
4.Interoperability:
Avro supports multiple programming languages like Java, Python, C, etc., making it easier to
work in a multi-language environment.
5.Integrates with Hadoop Ecosystem:
Avro works well with Hadoop tools like Hive, Pig, and MapReduce.
How Avro Works:
• Avro stores data along with its schema in a container file.
• When writing data, it uses a defined schema to serialize the data into binary
format.
• When reading, the system uses the schema (either from the file or provided
externally) to deserialize the data.
• Because both data and schema are stored together, Avro ensures data is
portable and self-describing.
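The following sketch shows Avro's write-then-read cycle with a hypothetical two-field User schema; the schema string, field values, and file name are invented for illustration:

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: a user record with a name and an age field
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Asha");
        user.put("age", 28);

        File file = new File("users.avro");
        // The schema is written into the container file alongside the data
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Reading back: the schema stored in the file is used to deserialize
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<>())) {
            for (GenericRecord r : reader) {
                System.out.println(r.get("name") + " is " + r.get("age"));
            }
        }
    }
}
```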
Hadoop Environment: Setting Up a Hadoop Cluster
• Setting up a Hadoop environment involves preparing hardware and software to
work in a distributed system for processing Big Data.
1. Cluster Specification:
Before setting up Hadoop, the hardware and software requirements must be defined:
• Hardware Requirements:
• One Master Node (for NameNode and ResourceManager).
• Multiple Slave Nodes (for DataNode and NodeManager).
• Each node should have sufficient RAM (at least 8GB), CPU, and storage capacity.
• Software Requirements:
• Linux-based OS (Ubuntu/CentOS preferred).
• Java (JDK 8 or later) – mandatory for running Hadoop.
• SSH Configuration – for password-less communication between nodes.
• Hadoop binary files – can be downloaded from Apache website.
2. Cluster Setup:
Setting up a cluster means connecting all the machines to work together. First, we
install Java and Hadoop on each machine. Then, we configure secure communication
using SSH so that the master can control the slaves without needing passwords.
• Next, we assign roles to each machine — which one will act as master and which
ones will be slaves. After that, we edit the configuration files to set up paths, data
directories, ports, and other necessary settings to allow the system to function in a
distributed way.
Cluster Installation:
After setup, we install and configure everything:
• Install Java and Hadoop on each machine.
• Set environment variables like Java home and Hadoop home.
• Configure the cluster by setting paths and data storage settings in the required
configuration files.
• Enable secure communication using SSH keys.
• Format the file system to prepare it for data storage.
• Start the necessary background services (called daemons) to begin using the
cluster.
• Finally, we check the system using the command line or web interface to
ensure everything is running properly.
1. Hadoop Configuration:
Hadoop configuration is essential for controlling the behavior and performance
of Hadoop components like HDFS, YARN, and MapReduce. Configuration
settings are written in XML files. The key configurations include:
• core-site.xml: Contains core Hadoop settings, such as the default file system address (fs.defaultFS).
• hdfs-site.xml: Sets parameters for HDFS like replication factor, block size, and
permission settings.
• mapred-site.xml: Configures MapReduce settings such as job tracker address
and number of reduce tasks.
• yarn-site.xml: Configures YARN parameters like resource manager and node
manager settings.
• Proper configuration ensures efficient cluster operation and resource
utilization.
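These XML settings are exposed to applications through the Configuration class. A minimal sketch follows; the property names are standard Hadoop keys, and the defaults printed here are only fallbacks used when no XML files are found on the classpath:

```java
import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // Loads core-site.xml, hdfs-site.xml, etc. from the classpath
        Configuration conf = new Configuration();

        // Read values defined in the configuration files
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS", "file:///"));
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
        System.out.println("dfs.blocksize   = " + conf.get("dfs.blocksize", "134217728"));

        // Settings can also be overridden programmatically for a single job
        conf.setInt("dfs.replication", 2);
    }
}
```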
Security in Hadoop:
Security is a major concern in Hadoop due to the large amount of data it
handles. Hadoop provides several mechanisms for ensuring data protection:
• Authentication: Verifies user identity using Kerberos. Only authenticated users
can access the cluster.
• Authorization: Controls what operations (read/write/execute) an
authenticated user can perform on files or jobs.
• Encryption: Protects sensitive data while being transmitted over the network
or stored on disk.
• File and Directory Permissions: Similar to Unix/Linux systems. Permissions can
be set for files and directories to restrict access.
• Advanced security features can also be added using tools like Apache Ranger
and Sentry.
Administering Hadoop:
Hadoop administration refers to the management and maintenance of the
Hadoop cluster. Key responsibilities of an administrator include:
• Managing cluster components: Starting/stopping Hadoop daemons like
NameNode, DataNode, ResourceManager, and NodeManager.
• User management: Creating user accounts and setting file permissions.
• Cluster health monitoring: Ensuring all nodes are working and data is properly
replicated.
• Job management: Monitoring and controlling job execution.
• Backup and recovery: Taking regular backups and preparing for failures.
• Tools like Ambari and Cloudera Manager help simplify administration tasks
through graphical dashboards.
HDFS Monitoring and Maintenance:
Maintaining the HDFS system is important to ensure data reliability and
availability. Monitoring and maintenance involve:
• Checking disk space usage.
• Monitoring DataNodes for availability and health status.
• Monitoring NameNode UI to view cluster status and file system details.
• Checking under-replicated or corrupted blocks.
• Running balancer tool to redistribute blocks evenly across DataNodes.
• Decommissioning and adding DataNodes as needed.
• Timely monitoring helps prevent data loss and performance issues.
Hadoop Benchmarks:
Benchmarks are used to evaluate Hadoop performance in terms of speed, reliability, and resource
usage. Common benchmarks include:
• TestDFSIO: Tests read/write throughput of HDFS.
• TeraSort: Measures the performance of MapReduce in sorting large datasets.
• MRBench: Tests the performance of MapReduce jobs.
• NNBench: Measures performance of NameNode.
• These benchmarks help in cluster tuning and identifying bottlenecks.
Hadoop in the Cloud:
Running Hadoop in the cloud offers flexibility, scalability, and reduced maintenance. Cloud platforms
like Amazon AWS (EMR), Microsoft Azure HDInsight, and Google Cloud Dataproc support Hadoop.
Advantages of Hadoop in the cloud:
• On-demand resource scaling.
• Pay-per-use pricing model.
• No need for physical infrastructure.
• Easy deployment and updates.
• Hadoop in the cloud is suitable for organizations looking to process large-scale data without investing
in heavy infrastructure.
UNIT – 4

BIG DATA

SYLLABUS: Hadoop Eco System and YARN: Hadoop ecosystem components, schedulers,
fair and capacity, Hadoop 2.0 New Features – Name Node high availability, HDFS
federation, MRv2, YARN, Running MRv1 in YARN. NoSQL Databases: Introduction to
NoSQL MongoDB: Introduction, data types, creating, updating and deleting documents,
querying, introduction to indexing, capped collections. Spark: Installing spark, spark
applications, jobs, stages and tasks, Resilient Distributed Datasets (RDDs), anatomy of a Spark
job run, Spark on YARN SCALA: Introduction, classes and objects, basic types and
operators, built-in control structures, functions and closures, inheritance.
Hadoop Ecosystem Components:
The Hadoop ecosystem includes several tools that work together to handle big data efficiently.
These tools are built around Hadoop's core components: HDFS (storage) and MapReduce
(processing).
Main components of the Hadoop ecosystem:
1.HDFS (Hadoop Distributed File System) – Used for storing large datasets in a distributed
manner.
2.MapReduce – Programming model for processing large data in parallel.
3.YARN (Yet Another Resource Negotiator) – Manages resources and schedules jobs.
4.Hive – SQL-like query language for data summarization and analysis.
5.Pig – High-level scripting language used with MapReduce.
6.HBase – A NoSQL database that runs on top of HDFS.
7.Sqoop – Used to transfer data between Hadoop and relational databases.
8.Flume – Collects and transports large amounts of streaming data into Hadoop.
9.Oozie – Workflow scheduler to manage Hadoop jobs.
10.Zookeeper – Coordinates and manages distributed applications.
11.Mahout – Machine learning library for building predictive models.
12.Avro – A data serialization system used for efficient data exchange.
Hadoop Schedulers:
Schedulers in Hadoop YARN decide how resources (CPU, memory) are allocated among various
jobs. They ensure multiple users can share the Hadoop cluster efficiently.
1. FIFO Scheduler (First-In-First-Out):
• Simple scheduler.
• Jobs are executed in the order they arrive.
• Can be unfair: a long-running job blocks all the jobs queued behind it.
2. Fair Scheduler:
• Developed by Facebook.
• Divides resources equally among all running jobs.
• Ensures that small jobs are not stuck behind large ones.
• Jobs are grouped into pools, and each pool gets a fair share of resources.
3. Capacity Scheduler:
• Developed by Yahoo.
• Designed for large organizations with multiple users.
• Cluster resources are divided into queues, and each queue gets a configured capacity.
• Unused capacity in one queue can be used by others.
Hadoop 2.0 – New Features
• Hadoop 2.0 brought major improvements over Hadoop 1.x. It solved scalability,
availability, and resource management issues. Below are the important new features:
1. NameNode High Availability (HA):
• In Hadoop 1.x, there was only one NameNode, so if it failed, the whole system would
stop (single point of failure).
• Hadoop 2.0 introduced two NameNodes: one active and one standby.
• If the active NameNode fails, the standby takes over automatically.
• This ensures that HDFS continues to work without downtime.
2. HDFS Federation:
• In earlier versions, there was only one NameNode, which could become a bottleneck
in large clusters.
• Federation allows multiple NameNodes, each managing part of the file system.
• It improves scalability and isolation, allowing different applications to use different
parts of the file system without conflict.
3. MRv2 (MapReduce Version 2):
• Also called YARN (Yet Another Resource Negotiator).
• In Hadoop 1.x, JobTracker handled both resource management and job
scheduling, which created performance issues.
• MRv2 separates these two responsibilities:
• ResourceManager handles resource allocation.
• ApplicationMaster manages the lifecycle of individual jobs.
4. YARN (Yet Another Resource Negotiator):
• YARN is the core of Hadoop 2.0.
• It allows running multiple applications (not just MapReduce), like Spark, Tez,
etc., on the same cluster.
• It improves resource utilization, scalability, and flexibility.
How to Run MapReduce Version 1 (MRv1) on YARN
1.YARN allows old MapReduce jobs to run – The jobs written for the older version of
MapReduce can still run on the new YARN system.
2.No need to change code – Existing MapReduce programs do not need to be
rewritten. They can work directly on YARN.
3.YARN handles job execution – YARN takes care of distributing the job and managing
the resources required to run it.
4.A special manager helps run old jobs – YARN includes a built-in manager that
supports running MRv1 jobs in the new environment.
5.Same way of job submission – You submit the job in the same way as before, and
YARN will run it in the background.
6.Backward compatibility – YARN supports older applications so that users can
continue using their previous work without problems.
7.Good for migration – This is helpful for companies or users who are moving from
older versions of Hadoop to newer ones.
8.Benefit of new features – Even when using old jobs, you still get advantages like
better resource sharing and job scheduling from YARN.
NoSQL Databases
Introduction:
• NoSQL stands for "Not Only SQL".
• It refers to a group of databases that do not use the traditional relational database
model.
• Designed to handle large volumes of structured, semi-structured, or unstructured
data.
• Useful in big data applications and real-time web apps.
Advantages of NoSQL:
1.Scalability – Easily handles large amounts of data and traffic by scaling horizontally.
2.Flexibility – No fixed schema; supports dynamic data types and structures.
3.High Performance – Faster read/write operations for large datasets.
4.Supports Big Data – Works well with distributed computing frameworks like Hadoop.
5.Easier for Developers – Matches modern programming paradigms (JSON, key-value).
Disadvantages of NoSQL:
1.Lack of Standardization – No uniform query language like SQL.
2.Limited Support for Complex Queries – Not ideal for multi-table joins.
3.Less Mature Tools – Compared to relational databases.
4.Consistency Issues – Often prefers availability and partition tolerance over
consistency (CAP Theorem).
5.Data Redundancy – Due to denormalization, same data may be repeated.
Types of NoSQL Databases (Explained in Detail)
1. Key-Value Stores
1. Data is stored as a pair of key and value, like a dictionary.
2. The key is unique, and the value can be anything (a string, number, JSON, etc.).
3. Very fast and efficient for lookups by key.
4. Best for: Caching, session management, simple data storage.
5. Examples: Redis, Riak, Amazon DynamoDB.
2. Document-Oriented Databases
1. Data is stored in documents (like JSON or XML), which are more flexible than rows and columns.
2. Each document is self-contained and can have different fields.
3. Easy to map to objects in code and update individual fields.
4. Best for: Content management, real-time analytics, product catalogs.
5. Examples: MongoDB, CouchDB.
3. Column-Oriented Databases
1. Stores data in columns instead of rows, making it efficient for reading specific fields across large datasets.
2. Great for analytical queries on big data.
3. Scales well across many machines.
4. Best for: Data warehousing, real-time analytics, logging.
5. Examples: Apache HBase, Cassandra.
4. Graph-Based Databases
1. Focuses on relationships between data using nodes and edges.
2. Very powerful for handling complex relationships like social networks, recommendation engines, etc.
3. Best for: Social networks, fraud detection, recommendation systems.
4. Examples: Neo4j, ArangoDB.
MongoDB –
MongoDB is a NoSQL, open-source, document-oriented database. It stores data in
JSON-like documents with dynamic schemas, meaning the structure of data can vary
across documents in a collection.
Features of MongoDB:
1.Schema-less – Collections do not require a predefined schema.
2.Document-Oriented Storage – Data is stored in BSON (Binary JSON) format, allowing
for embedded documents and arrays.
3.High Performance – Supports fast read and write operations.
4.Scalability – Supports horizontal scaling using sharding.
5.Replication – Ensures high availability with replica sets.
6.Indexing – Supports indexing on any field to improve query performance.
7.Aggregation – Provides a powerful aggregation framework for data processing and
analytics.
8.Flexibility – You can store structured, semi-structured, or unstructured data.
9.Cross-Platform – Works on Windows, Linux, and MacOS.
Common MongoDB Data Types:
1.String – Used for storing text.
2.Integer – Stores numeric values (32-bit or 64-bit).
3.Boolean – True or False values.
4.Double – Stores floating-point numbers.
5.Date – Stores date and time in UTC format.
6.Array – Stores multiple values in a single field.
7.Object/Embedded Document – Stores documents within documents.
8.Null – Represents a null or missing value.
9.ObjectId – A unique identifier for each document (auto-generated).
10.Binary Data – Used to store binary data such as images or files.
1. Creating Documents in MongoDB:
• Creating a document in MongoDB means adding new data to the database. A
document in MongoDB is a record, which is similar to a row in relational databases.
• Example: If you want to store information about a person, like their name, age, and
city, you create a document for that person. MongoDB will automatically store this
data in a collection (similar to a table in a relational database).
• Once created, this document is assigned an _id by MongoDB, which uniquely
identifies it in the collection.
2. Updating Documents in MongoDB:
• Updating means modifying the existing data in a document. You can update a specific
field in a document (e.g., change the person's age or city) without affecting other
fields.
• Example: Suppose you created a document for a person named "John" with age 29.
Later, if you need to change the age to 30, you can update just the age field in that
document. You can also update multiple documents at once if needed, such as
updating the status of everyone living in "New York."
• MongoDB provides flexibility to update documents based on conditions. For example,
you can choose to update only those documents that match certain criteria.
3. Deleting Documents in MongoDB:
• Deleting documents means removing data from the database. If a document is no longer needed or is
outdated, it can be deleted.
• Example: If you want to delete the document of a person named "John," you can remove that document from
the collection. MongoDB allows you to delete just one document or multiple documents at once. For example,
you can delete all people who live in "New York" if required.
Queries in MongoDB:
MongoDB allows you to retrieve data from the database using queries. A query is a way to search for documents
that match specific conditions. The basic idea is to find specific documents based on their field values.
1. Basic Queries: You can search for documents by specifying the field and value. For example, if you want to find
all users who are 25 years old, you would search for documents where the "age" field is equal to 25.
2. Conditional Queries: MongoDB lets you apply conditions to your queries. For example, if you want to find users
older than 30, you can use a condition that searches for documents where the "age" is greater than 30.
3. Logical Queries: You can combine different conditions using logical operators like "AND" and "OR". For
instance, if you want to find users who are older than 30 but live in "New York", you can combine these
conditions.
4. Sorting: MongoDB allows you to sort your query results. For example, if you want to sort users by their age, you
can choose to display the results either in ascending or descending order.
5. Limiting Results: You can limit the number of results returned by a query. For example, if you want to get only
the first 5 documents, you can apply a limit to the query.
6. Projection: You can specify which fields to display in the query result. For example, if you only want to display
the "name" and "age" fields, you can exclude all other fields from the results.
Indexing in MongoDB:
Indexing in MongoDB is used to improve the performance of queries. When you create an
index on a field, MongoDB creates a structure that makes it faster to find documents that
match a specific value for that field.
1.Single Field Index: This is the simplest type of index and is created on a single field. For
example, if you frequently search for users by their name, you can create an index on the
"name" field to speed up those queries.
2.Compound Index: A compound index is created on multiple fields. It allows queries that
filter by several fields to be executed faster. For instance, if you often search for users by
both their name and age, a compound index can improve performance.
3.Text Index: MongoDB allows you to create a text index for full-text search. This type of index
is useful when you need to search for documents that contain specific words or phrases
within a text field.
4.Geospatial Index: If your data involves geographical locations (latitude and longitude),
MongoDB provides special indexing options to efficiently handle these types of queries.
5.Multikey Index: When you store arrays in MongoDB documents, you can create a multikey
index. This type of index is useful for queries that need to search within array fields.
6.Hashed Index: This type of index is used for efficient equality queries. It is useful when you
need to search for documents based on exact matches to a field value.
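As a short sketch, the index types above can be created in the mongo shell roughly like this (the users collection and its fields are hypothetical):

  db.users.createIndex({ name: 1 })             // single field index (1 = ascending)
  db.users.createIndex({ name: 1, age: -1 })    // compound index on name and age
  db.users.createIndex({ bio: "text" })         // text index for full-text search
  db.users.createIndex({ tags: 1 })             // becomes a multikey index if tags is an array
  db.users.createIndex({ email: "hashed" })     // hashed index for exact-match lookups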
Benefits of Indexing:
• Faster Query Execution: Indexes make data retrieval quicker, as they allow
MongoDB to quickly locate relevant documents.
• Better Performance for Sorting: Sorting documents by a field with an index is
faster than sorting without one.
• Improved Read Efficiency: Indexes help MongoDB read data more efficiently,
especially with large datasets.
Limitations of Indexing:
• Space and Memory Usage: Indexes consume additional disk space and
memory. Having too many indexes can slow down performance.
• Impact on Write Operations: Every time a document is added, updated, or
deleted, MongoDB has to update the index, which can slow down write
operations.
• Maintenance: Indexes need to be maintained and updated regularly to ensure
optimal performance.
Capped Collections in MongoDB:
• A capped collection is a fixed-size collection.
• It automatically removes the oldest documents when the size limit is reached.
• Capped collections maintain the insertion order.
• They are ideal for use cases like logging or real-time data tracking.
• Capped collections provide high performance because they don’t allow
deletions or updates that would increase the document size.
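For example, a capped collection could be created in the mongo shell like this (the name and limits are illustrative):

  // keep at most about 1 MB or 1000 log documents; the oldest entries are dropped first
  db.createCollection("logs", { capped: true, size: 1048576, max: 1000 })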
Spark: Installing Spark, Spark Applications, Jobs, Stages, and Tasks
1. Installing Spark:
To begin using Apache Spark, you need to install it on your system or set it up on a cluster. Here’s a general
overview of how to install Spark:
• Pre-requisites:
• Java: Spark runs on the JVM, so you must have Java installed (Java 8 or later).
• Scala: Spark is written in Scala and provides Scala, Java, and Python APIs; the Spark download already bundles the Scala libraries it needs.
• Hadoop (optional): If you want to run Spark with Hadoop/HDFS, you need to install Hadoop as well. If not, Spark can also run in standalone mode.
• Installation Steps:
• Download Spark: Visit the official Apache Spark website and download the appropriate version (usually
the pre-built version for Hadoop).
• Extract the Spark Archive: After downloading, extract the archive to a desired location on your local
system.
• Configure Spark:
• Set up the environment variables (SPARK_HOME and PATH).
• You can configure Spark by editing the spark-defaults.conf file and setting options like the master URL, memory settings, and other parameters.
• Run Spark: After installation and configuration, you can start Spark in local mode or connect it to a cluster (e.g., Hadoop YARN or Mesos).
• Local Mode: You can run Spark on a single local machine, which is convenient for development and testing.
• Cluster Mode: You can run Spark on a cluster by connecting to YARN or Mesos for distributed computing.
Spark Applications:
A Spark application is a complete program that uses Spark to process data. Every
Spark application has a driver program that runs the main code. The driver
coordinates the execution of the program and sends tasks to worker nodes.
• Driver Program: This controls the execution of the Spark job. It communicates
with the cluster manager to allocate resources and send tasks to worker nodes.
• Cluster Manager: It manages the distribution of tasks across nodes. It can be
Hadoop YARN, Mesos, or Spark’s built-in manager.
• Executors: Executors are the worker processes that run on worker nodes and
perform the tasks assigned to them by the driver.
Jobs:
A Spark job is triggered when you perform an action, such as counting the number of
elements in a dataset or saving the data. A job represents a complete computation and
consists of multiple stages.
Triggering Jobs: Jobs are triggered by actions in Spark. For example, calling an action
like .collect() will trigger the execution of the job.
Stages in Jobs: When a job involves transformations that require data to be shuffled
across the cluster, Spark divides the job into multiple stages. Stages are separated by
operations that require a shuffle of data (e.g., groupBy or join).
Stages:
Stages are subsets of a job that can be executed independently. Spark divides jobs into
stages based on operations that involve shuffling data.
• Shuffling: Shuffling is the process of redistributing data across the cluster when a
stage involves wide dependencies (e.g., aggregating data from different nodes).
• Execution of Stages: Each stage runs tasks in parallel, and the results are passed to
the next stage. The execution is sequential, meaning Stage 2 will not start until Stage
1 is complete.
Tasks:
A task is the smallest unit of work in Spark, corresponding to a single partition of
the data.
• Parallelism: Tasks are executed in parallel across the different worker nodes in
the cluster. The number of tasks depends on how the data is partitioned.
• Task Execution: When a stage is ready to run, Spark creates tasks for each
partition of the data. For example, if you have 100 partitions, Spark will create
100 tasks to process them in parallel.
• Task Failures: If a task fails, Spark can retry the task on another node.
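A small Scala word-count sketch (for example in spark-shell, where a SparkContext named sc is already provided) shows how a single action triggers a job that Spark splits into stages and tasks; the input file name is hypothetical:

  val lines  = sc.textFile("input.txt")                     // the file is read as an RDD split into partitions
  val pairs  = lines.flatMap(_.split(" ")).map(w => (w, 1)) // transformations: nothing runs yet
  val counts = pairs.reduceByKey(_ + _)                     // needs a shuffle, so a new stage starts here
  counts.collect()                                          // action: this call actually triggers the job,
                                                            // with one task per partition in each stage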
Resilient Distributed Datasets (RDDs) in Spark:
Resilient Distributed Datasets (RDDs) are the fundamental data structure in Spark,
designed for distributed computing. They represent an immutable distributed
collection of objects that can be processed in parallel across a cluster. The key
features of RDDs are:
1.Fault Tolerance:
RDDs can recover from failures by keeping track of their lineage, which is a record of
operations performed on the data. If a partition of an RDD is lost, Spark can
recompute it using the lineage.
2.Parallel Processing:
RDDs allow Spark to process data in parallel across multiple machines in a cluster.
Each partition of an RDD can be processed independently by a task.
3.Immutability:
RDDs are immutable, meaning once created, they cannot be changed. Any
transformation on an RDD results in the creation of a new RDD.
4.Lazy Evaluation:
Spark does not compute RDDs immediately. Instead, it builds a Directed Acyclic Graph
(DAG) of transformations and computes RDDs only when an action is called.
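These properties can be seen in a short spark-shell sketch (Scala, using the predefined sc):

  val nums    = sc.parallelize(1 to 10)    // an RDD built from a local collection
  val doubled = nums.map(_ * 2)            // transformation: returns a new RDD, nums itself is unchanged
  // nothing has executed yet (lazy evaluation); an action forces the computation:
  doubled.count()                          // Spark now runs the lineage nums -> map
  println(doubled.toDebugString)           // prints that lineage, which is used for fault recovery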
Anatomy of a Spark Job Run:
When a Spark job is executed, it goes through several stages:
Job Submission: A user submits a job by invoking an action on an RDD, like
.collect() or .save(). The job is submitted to the SparkContext, which coordinates
the execution.
Job Division into Stages: Spark divides the job into stages based on operations
that require data shuffling. Each stage is further divided into tasks, and tasks are
assigned to worker nodes for execution.
Task Scheduling: The scheduler places tasks on available worker nodes. Spark
uses a task scheduling mechanism that distributes the tasks across the cluster for
parallel execution.
Execution: The tasks are executed on the worker nodes. Data may be shuffled between nodes if necessary (for operations like join or groupBy).
Result Collection: After all tasks are executed, the final results are collected and
returned to the driver program, or written to storage like HDFS.
Spark on YARN:
YARN (Yet Another Resource Negotiator) is a resource management layer for Hadoop that
allows Spark to run on top of Hadoop clusters. Here’s how Spark runs on YARN:
1.Resource Manager:
YARN’s ResourceManager manages cluster resources (CPU, memory) and schedules tasks for
Spark jobs. The ResourceManager ensures that Spark applications get the resources they
need for execution.
2.Application Master:
Spark runs in YARN by using an ApplicationMaster, which is responsible for negotiating
resources from the ResourceManager and tracking the execution of the application.
3.Execution on Worker Nodes:
Worker nodes in the Hadoop cluster run the tasks for Spark jobs. These nodes execute the
individual tasks, perform the computations, and send results back to the ApplicationMaster.
4.Data Locality:
YARN allows Spark to schedule tasks based on data locality, meaning that Spark tries to run
tasks on nodes that have the data already, reducing the need for network transfer.
5.Resource Allocation:
YARN dynamically allocates resources for Spark applications, adjusting resources based on
workload requirements, which improves resource utilization and job performance.
Introduction to Scala:
Scala is a high-level programming language that combines object-oriented and functional programming features.
It is designed to be concise, elegant, and expressive. Scala runs on the Java Virtual Machine (JVM), which
means it is compatible with Java and can make use of existing Java libraries. Scala is statically typed, meaning
that types are checked at compile-time, but it also supports type inference to reduce verbosity.
Classes and Objects:
• Classes: A class in Scala is a blueprint for creating objects. It defines the properties (variables) and behaviors
(methods) that the objects of that class will have.
• Objects: An object in Scala is a singleton: a single instance that Scala creates automatically the first time it is used. It defines methods and variables that do not belong to any specific instance of a class, so its functionality can be accessed without creating an instance yourself.
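A minimal Scala sketch of a class and a singleton object (the names are only illustrative):

  class Person(val name: String, var age: Int) {   // class: blueprint with properties and methods
    def greet(): String = s"Hello, I am $name"
  }

  object Greeter {                                 // object: a single shared instance, no 'new' needed
    def welcome(p: Person): Unit = println(p.greet())
  }

  val p = new Person("Asha", 21)                   // create an instance of the class
  Greeter.welcome(p)                               // call the object's method directly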
Basic Types and Operators:
• Basic Data Types: Scala supports a range of basic types such as integers, floating-point numbers, characters,
and boolean values. Examples of basic types include:
• Int (Integer numbers)
• Double (Floating-point numbers)
• Char (Single characters)
• Boolean (True/False)
• String (Text)
• Operators: Scala supports several types of operators like:
• Arithmetic operators (e.g., +, -, *, /)
• Comparison operators (e.g., ==, !=, >, <)
• Logical operators (e.g., &&, ||)
• Assignment operators (e.g., =, +=, -=)
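A few example declarations using these types and operators:

  val count: Int      = 10
  val price: Double   = 99.5
  val grade: Char     = 'A'
  val title: String   = "Scala"
  val passed: Boolean = (count > 5) && (price != 0)   // comparison and logical operators
  var total = 0
  total += count * 2                                  // arithmetic and assignment operators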
Built-in Control Structures:
• If-Else Statements: These are used to make decisions based on conditions. It
checks if a condition is true or false and executes the appropriate block of code.
• For Loop: The for loop is used to repeat a block of code a specific number of
times. It can be used with ranges or collections (like lists).
• While Loop: The while loop executes a block of code as long as a condition is
true.
• Match Expression: Similar to a switch statement in other languages, the match
expression in Scala is used to compare a value against different patterns and
execute corresponding code. It is a more powerful version of the switch
statement.
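A short Scala sketch of these constructs:

  val marks = 72

  if (marks >= 40) println("Pass") else println("Fail")     // if-else

  for (i <- 1 to 3) println(s"attempt $i")                   // for loop over a range

  var n = 3
  while (n > 0) { println(n); n -= 1 }                       // while loop

  val result = marks match {                                 // match expression
    case m if m >= 75 => "Distinction"
    case m if m >= 40 => "Pass"
    case _            => "Fail"
  }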
Functions and Closures in Scala:
• Functions: A function in Scala is a block of code that takes inputs (parameters),
performs a task, and returns a result. Functions can be defined with a specific
name and can be called anywhere in the program. Scala allows defining
functions with or without parameters. Scala also supports anonymous
functions, which are functions without a name, often used for short tasks.
• Closures: A closure is a function that can capture and carry its environment
with it. This means that the function can access variables from the scope in
which it was created, even after that scope has ended. Closures are useful
when you need to store a function along with the values it depends on.
• Example of Closures:
If a function is defined inside another function, the inner function can access
variables from the outer function, even if the outer function has finished
executing. This is what makes the inner function a closure.
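A minimal Scala sketch of a named function, an anonymous function, and a closure:

  def add(a: Int, b: Int): Int = a + b        // named function
  val double = (x: Int) => x * 2              // anonymous function

  def makeCounter(): () => Int = {
    var count = 0                             // variable from the outer function...
    () => { count += 1; count }               // ...captured by the inner function: a closure
  }

  val next = makeCounter()
  println(next())   // 1
  println(next())   // 2 -- the closure still remembers count after makeCounter has returned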
Inheritance in Scala:
Inheritance is a fundamental concept of object-oriented programming, where a class
can inherit properties and behaviors from another class. In Scala, one class can
extend another class using the extends keyword.
The class that is inherited from is called the superclass (or base class), and the class
that inherits is called the subclass (or derived class).
• Super class: The class whose properties and methods are inherited by another class.
• Sub class: The class that inherits the properties and methods from another class.
In Scala, a class can extend only one class, which is called single inheritance. However,
Scala supports multiple traits, which allows a class to mix in multiple behaviors.
• Traits: A trait is similar to an interface in other programming languages but can also
contain method implementations. Traits are used to add behavior to classes. A class
can extend multiple traits in Scala.
EXAMPLE
If you have a superclass called Animal with properties like name and methods like
makeSound(), a subclass like Dog can extend Animal, inheriting those properties and
methods, and then possibly adding new behavior specific to Dog.
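The Animal/Dog example above could look like this in Scala (a sketch, with a trait added to show mixing in behaviour):

  class Animal(val name: String) {
    def makeSound(): String = "some sound"
  }

  class Dog(name: String) extends Animal(name) {    // single inheritance with 'extends'
    override def makeSound(): String = "Woof"       // new behaviour specific to Dog
  }

  trait Swimmer {                                   // trait: reusable behaviour with an implementation
    def swim(): String = "I can swim"
  }

  class Duck(name: String) extends Animal(name) with Swimmer   // one superclass, plus a trait

  val d = new Dog("Bruno")
  println(d.name)          // inherited property
  println(d.makeSound())   // Woof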
THANK YOU FOR WATCHING

BEST OF LUCK

Our Website Link - CLICK


UNIT – 5

BIG DATA

SYLLABUS: Hadoop Eco System Frameworks: Applications on Big Data using Pig, Hive
and HBase Pig : Introduction to PIG, Execution Modes of Pig, Comparison of Pig with
Databases, Grunt, Pig Latin, User Defined Functions, Data Processing operators, Hive -
Apache Hive architecture and installation, Hive shell, Hive services, Hive metastore,
comparison with traditional databases, HiveQL, tables, querying data and user defined
functions, sorting and aggregating, Map Reduce scripts, joins & subqueries. HBase –
Hbase concepts, clients, example, Hbase vs RDBMS, advanced usage, schema design,
advance indexing, Zookeeper – how it helps in monitoring a cluster, how to build
applications with Zookeeper. IBM Big Data strategy, introduction to Infosphere,
BigInsights and Big Sheets, introduction to Big SQL.
PIG
Pig is a high-level platform developed by Apache for analyzing large data sets. It
uses a language called Pig Latin, which is similar to SQL but is designed for
handling large-scale data.
Types of Pig Execution Modes
1.Local Mode
1. In this mode, Pig runs on a single local machine.
2. It uses the local file system instead of HDFS.
3. It is mainly used for development and testing purposes with smaller datasets.
4. There is no need for Hadoop setup in local mode.
2.MapReduce Mode (Hadoop Mode)
1. This is the production mode where Pig scripts are converted into MapReduce jobs and
executed over a Hadoop cluster.
2. It supports large datasets that are stored in HDFS.
3. Requires proper Hadoop setup and configuration.
4. It provides scalability and fault tolerance.
Features of Pig
• Ease of Use: Pig Latin language is simple and similar to SQL, making it easier
for developers and analysts.
• Data Handling: It can work with both structured and semi-structured data (like
logs, JSON, XML).
• Extensibility: Users can write their own functions to handle special
requirements (called UDFs).
• Optimization: Pig automatically optimizes the execution of scripts, so users can
focus more on logic than performance tuning.
• Support for Large Datasets: It processes massive volumes of data efficiently by
converting scripts into multiple parallel tasks.
• Interoperability: It can work with other Hadoop tools like Hive, HDFS, and
HBase.
Difference Between Pig Latin and SQL

Pig Latin | SQL
Procedural language | Declarative language
Specifies how data is processed step-by-step | Specifies what result is needed
Better for ETL tasks and complex data processing | Best for querying structured data
Supports nested data types like tuples and maps | Mostly works with tabular data
Developed for large-scale data | Used in traditional RDBMS systems
Grunt Shell
Grunt is the interactive shell or command-line interface of Pig.
It allows users to write and execute Pig Latin commands line by line, similar to a SQL command line
or terminal.
• Useful for testing and debugging Pig Latin scripts.
• Helps to run small tasks and check output instantly.
• Automatically starts when you run Pig without any script file.
• Can load data, process it, and display results interactively.
• Example Use:
Analysts use the Grunt shell to experiment with data, apply filters, and view outputs before finalizing
their Pig script.
What are the various syntax and semantics of the Pig Latin programming
language?
• Pig Latin is a high-level language used with Apache Pig for data processing. It
has specific rules (syntax and semantics) that define how the language should
be written and how it behaves.
1. Statements:
• A Pig Latin program is made up of multiple statements.
• Each statement represents an operation or command and usually ends with a
semicolon.
• Comments can be written using double hyphens (--) or C-style comments (/*
*/).
• Pig Latin has reserved keywords that cannot be used for naming variables or
aliases.
• Operators and commands are not case-sensitive, but function names and
aliases are case-sensitive.
2. Expressions:
• Expressions are parts of statements that produce a value.
• They are used with relational operators in Pig.
• Pig supports a variety of expressions, including mathematical and string operations.
3. Types:
• Pig has several data types:
• Simple types: int, long, float, double, bytearray (binary), and chararray (text).
• Complex types:
• Tuple: an ordered set of fields.
• Bag: a collection of tuples.
• Map: a collection of key-value pairs.
4. Schemas:
• Schemas define the structure (field names and data types) of a relation.
• Unlike SQL, Pig allows partial or no schema at all; data types can be inferred later.
• This makes Pig flexible for handling plain files with no predefined structure
5. Functions:
• Pig has built-in functions of four types:
• Eval functions – for computations.
• Filter functions – to filter records.
• Load functions – to load data.
• Store functions – to save data.
• If needed, users can create their own custom functions called User Defined
Functions (UDFs).
• The Pig community also shares functions through a repository called Piggy
Bank.
6. Macros:
• Macros are reusable code blocks within Pig Latin.
• They make scripts cleaner and help avoid repetition.
• Macros can be defined inside the script or in separate files and imported when
needed.
User Defined Functions (UDFs) in Pig:
• UDFs are custom functions created by the user when built-in functions in Pig
are not sufficient.
• They are used to perform specific operations on data like filtering,
transformation, or formatting.
• UDFs are typically written in Java, Python, or other supported languages and
can be used in Pig scripts like any other function.
• Once written and registered in Pig, UDFs help make the script more powerful
and flexible.
• Example in simple words:
If Pig does not have a function to extract only the year from a date field, the
user can create a UDF to do that and use it in their script.
Data Processing Operators in Pig:
Pig provides several operators to process and transform data. Here are the most
common ones:
1.LOAD – Loads data from the file system (like HDFS) into Pig for processing.
2.DUMP – Displays the output of a relation on the screen.
3.STORE – Saves the final result to a file or directory.
4.FILTER – Removes unwanted rows based on a condition.
5.FOREACH...GENERATE – Applies a transformation to each row (like selecting specific
columns or applying functions).
6.GROUP – Groups data by a specified field (used for aggregation).
7.JOIN – Joins two or more datasets on a common field.
8.ORDER BY – Sorts the data in ascending or descending order.
9.DISTINCT – Removes duplicate records from the dataset.
10.LIMIT – Restricts the number of output rows.
11.UNION – Combines two datasets with the same structure.
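A small Pig Latin sketch that chains several of these operators; the input file and its fields are hypothetical:

  -- load a comma-separated file of (name, age, city)
  students = LOAD 'students.txt' USING PigStorage(',')
             AS (name:chararray, age:int, city:chararray);
  adults   = FILTER students BY age >= 18;                      -- remove unwanted rows
  by_city  = GROUP adults BY city;                              -- group for aggregation
  counts   = FOREACH by_city GENERATE group AS city,
                                      COUNT(adults) AS total;   -- transform each group
  sorted   = ORDER counts BY total DESC;                        -- sort the result
  top3     = LIMIT sorted 3;                                    -- keep only three rows
  DUMP top3;                                                    -- print to screen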
Apache Hive Architecture and Installation
Hive Architecture is designed to manage and query large datasets stored in
Hadoop’s HDFS using a SQL-like language called HiveQL. The key components
are:
• Metastore: Stores metadata (like table names, columns, data types, location)
in a relational database.
• Driver: Manages the lifecycle of a HiveQL statement (compilation to
execution).
• Compiler: Converts HiveQL queries into execution plans (usually MapReduce
jobs).
• Execution Engine: Runs the execution plan on Hadoop.
• User Interfaces: Includes Hive CLI, Beeline, Web UI, and HiveServer2.
Hive Installation
To install Hive:
1.First, install and configure Hadoop.
2.Download Hive from the Apache website.
3.Extract and configure Hive by setting environment variables.
4.Set up the Metastore (can use MySQL or Derby).
5.Initialize the schema using Hive tools.
6.Start Hive and begin executing queries.
Hive Shell
The Hive Shell is a command-line tool where users can:
• Run HiveQL queries
• Create and manage tables
• Load and query data
• Check outputs and errors
• It is the most basic way to interact with Hive and is useful for testing and learning.
Hive Services
Hive includes several important services:
• HiveServer2: Allows clients to send queries remotely.
• Metastore Service: Handles all metadata operations.
• CLI/Beeline: Command-line interfaces to interact with Hive.
• Web Interface: GUI to manage and run queries (optional).
Hive Metastore
• The Metastore stores metadata about databases, tables, partitions, and
columns. It helps the Hive engine understand the structure of the data. It can
be embedded (using Derby for testing) or remote (using MySQL/PostgreSQL for
production).
Comparison with Traditional Databases
Feature | Traditional Databases | Apache Hive
Storage Format | Row-based | Column-based or file-based (HDFS)
Schema Type | Schema-on-write | Schema-on-read
Processing | Real-time (OLTP) | Batch-processing (OLAP)
Language | SQL | HiveQL (similar to SQL)
Speed | Fast for small/moderate data | Efficient for large datasets
Use Case | Frequent small updates | Large data queries and analysis
ACID Support | Full ACID support | Limited ACID support (now improving)
HiveQL (Hive Query Language)
HiveQL is a query language similar to SQL used for querying and managing large
datasets in Hive. It allows users to write queries to create tables, load data, and
perform analysis using simple syntax.
Examples of what you can do with HiveQL:
• Create tables
• Load data into tables
• Query data using SELECT
• Perform joins, filtering, grouping, and aggregations
Tables in Hive
Hive supports two types of tables:
1.Managed Tables: Hive controls both the metadata and the data. If you drop the
table, data is also deleted.
2.External Tables: Only metadata is managed by Hive. The data stays in HDFS even if
the table is dropped.
• Tables have a schema (columns and data types) and can be partitioned (organized by
specific columns for faster queries).
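A HiveQL sketch of the two table types plus partitioning; table names, columns, and paths are hypothetical:

  -- managed table: Hive owns both the metadata and the data
  CREATE TABLE employees (id INT, name STRING, salary DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

  -- external table: Hive keeps only metadata, the files stay in HDFS if the table is dropped
  CREATE EXTERNAL TABLE logs (ts STRING, msg STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/logs';

  -- partitioned table: one subdirectory per value of the partition column
  CREATE TABLE sales (item STRING, amount DOUBLE)
  PARTITIONED BY (year INT);

  LOAD DATA INPATH '/data/employees.csv' INTO TABLE employees;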
Querying Data in Hive
You can query data using HiveQL. You can:
• Use SELECT to retrieve specific columns
• Use WHERE to filter records
• Use GROUP BY to aggregate data
• Use JOIN to combine tables
Hive supports basic querying operations similar to SQL but is designed for batch
processing, not real-time.
User Defined Functions (UDFs)
Hive provides built-in functions for operations like string manipulation, math, date
handling, etc.
If you need a function that is not available, you can create your own UDF. These are
custom functions that users write (usually in Java) and then register in Hive to use in
queries.
Example use cases for UDFs:
• Custom data transformations
• Special filtering conditions
• Advanced calculations
Sorting and Aggregating Data in Hive
Sorting: Hive supports sorting using the ORDER BY clause. It sorts the complete
dataset but is slow for big data.
Distributed Sorting: Use SORT BY (sorts within partitions) or CLUSTER BY (sort
and distribute across reducers).
Aggregating: Hive supports aggregation using functions like:
• COUNT() – Counts rows
• SUM() – Adds values
• AVG() – Averages values
• MAX() / MIN() – Gets maximum or minimum values
Often used with GROUP BY to get results grouped by a column
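For example, aggregation and sorting on a hypothetical employees table could be written as:

  -- staff count and average salary per department, highest average first
  SELECT dept, COUNT(*) AS staff, AVG(salary) AS avg_sal, MAX(salary) AS max_sal
  FROM employees
  GROUP BY dept
  ORDER BY avg_sal DESC;

  -- distributed sort: order rows within each reducer instead of one global sort
  SELECT name, salary FROM employees SORT BY salary DESC;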
MapReduce Scripts in Hive
• Hive automatically converts your HiveQL queries into MapReduce jobs.
• You don’t need to write MapReduce code manually to process data.
• Behind the scenes, when you run a query like SELECT, Hive translates it into a
series of MapReduce steps to execute the task in parallel.
• For advanced processing, Hive allows the use of custom MapReduce scripts
(written in Java, Python, etc.) using TRANSFORM clause in HiveQL.
• This feature is helpful when default HiveQL is not enough, and you need
specific processing logic.
Joins in Hive
Joins in Hive are used to combine rows from two or more tables based on a
related column.
Common types of joins in Hive:
1.INNER JOIN: Returns rows that match in both tables.
2.LEFT OUTER JOIN: Returns all rows from the left table and matching rows from
the right.
3.RIGHT OUTER JOIN: Returns all rows from the right table and matching rows
from the left.
4.FULL OUTER JOIN: Returns all rows when there is a match in one of the tables.
5.MapJoin: A special join where the smaller table is loaded into memory to
speed up the join process. Useful when one table is small.
• Hive joins are similar to SQL joins but work on large-scale datasets using
MapReduce.
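A sketch of an inner join and a map join hint, assuming hypothetical orders and customers tables:

  -- inner join on the customer id
  SELECT o.order_id, c.name, o.amount
  FROM orders o
  JOIN customers c ON (o.customer_id = c.id);

  -- map join hint: load the small customers table into memory to avoid the reduce phase
  SELECT /*+ MAPJOIN(c) */ o.order_id, c.name
  FROM orders o
  JOIN customers c ON (o.customer_id = c.id);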
Subqueries in Hive
Subqueries are queries nested inside another query.
Types of subqueries in Hive:
• Scalar Subquery: Returns a single value. Used in SELECT or WHERE clauses.
• IN/NOT IN Subqueries: Used to check if a value exists in the result of another
query.
• EXISTS Subquery: Checks if a subquery returns any rows.
• Derived Tables (Inline Views): A subquery used in the FROM clause. Acts like a
temporary table.
Hive supports limited subquery usage compared to standard SQL, but commonly
used ones like in SELECT, FROM, and WHERE clauses are available.
Partitioning vs Bucketing in Hive:
Aspect | Partitioning | Bucketing
Definition | Organizing data into subdirectories based on column values | Dividing data into a fixed number of files (buckets) using a hash function
How Data is Organized | Data is split based on distinct values of the partition column | Data is split into a fixed number of files using the hash value of the bucket column
Performance Benefits | Improves performance by scanning only the relevant partitions | Improves performance for joins by distributing data evenly into buckets
Storage | Creates separate directories for each partition value (e.g., year=2021) | Creates a specified number of buckets (files) within the partition directory
Data Querying | Efficient when querying based on the partition column (e.g., by year, region) | Efficient for operations like joins when tables are bucketed on the same column
Use Case | Best suited when data can be logically split into categories like dates, regions, etc. | Best suited for evenly distributing data for operations like joins and aggregations
Example | A sales table partitioned by year creates directories like /sales/year=2020/, /sales/year=2021/ | A user table bucketed by user_id creates 5 files (buckets) based on hash values of user_id
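As a rough HiveQL illustration of the two approaches (table and column names are hypothetical):

  -- bucketed table: user_id is hashed into 5 files (buckets)
  CREATE TABLE users (user_id INT, name STRING)
  CLUSTERED BY (user_id) INTO 5 BUCKETS;

  -- the two can be combined: partitions first, buckets inside each partition
  CREATE TABLE sales (item STRING, amount DOUBLE)
  PARTITIONED BY (year INT)
  CLUSTERED BY (item) INTO 4 BUCKETS;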
1. HBase Concepts:
• HBase is a NoSQL, distributed, and scalable database built on top of Hadoop HDFS (Hadoop
Distributed File System).
• It stores data in column families rather than rows, which makes it suitable for read/write
operations on large datasets.
• Data Model: HBase stores data in tables. Each table is divided into column families, and
each column family contains a number of rows with unique row keys.
• It is optimized for random, real-time read/write operations on large datasets.
2. HBase Clients:
• Java API: HBase provides a native Java API for interacting with HBase, which is the most
commonly used.
• REST API: A RESTful interface is available for interacting with HBase using HTTP requests.
• Thrift API: A language-agnostic API that allows applications in multiple languages (like
Python, C++, etc.) to interact with HBase.
• JDBC Driver: HBase provides a JDBC (Java Database Connectivity) driver for easier
integration with SQL-based applications.
• MapReduce Integration: HBase integrates seamlessly with Hadoop’s MapReduce framework
for processing large datasets.
3. Example:
• HBase tables are structured with row keys, column families, and columns. For
example, a table might represent information about students where each
student’s row key is their ID, and the columns might include "name",
"address", and "marks".
• For a table called student_data, there might be a row for each student, like:
Row Key: 123
Column Family: personal -> Name: John Doe
Column Family: academic -> Marks: 90
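The same student_data example could be created in the HBase shell roughly as follows:

  create 'student_data', 'personal', 'academic'            # table with two column families
  put 'student_data', '123', 'personal:Name', 'John Doe'   # row key 123, column personal:Name
  put 'student_data', '123', 'academic:Marks', '90'
  get 'student_data', '123'                                # read one row by its key
  scan 'student_data'                                      # scan the whole table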
HBase vs RDBMS:
Aspect | HBase | RDBMS
Data Model | Column-oriented, no predefined schema | Row-oriented, requires predefined schema
Scalability | Horizontally scalable (distributed) | Vertically scalable (requires more powerful hardware)
Query Language | No SQL support by default; accessed through custom APIs | Supports SQL for querying and management
Joins | Does not support joins natively | Supports joins between tables
Consistency | Strong consistency for single-row operations, but no multi-row ACID transactions | Full ACID transactions
Data Integrity | Limited data integrity enforcement | Enforces strict data integrity and constraints
Performance | Optimized for real-time, large-scale reads/writes | Optimized for complex queries and transactions
Features of HBase
1. Distributed and Scalable: HBase is a distributed NoSQL database designed to handle large volumes of data
across many machines. It is horizontally scalable, meaning you can add more nodes to the cluster to increase
capacity and throughput.
2. Real-Time Data Access: HBase provides real-time read and write access to data. It is designed for low-latency
access, making it suitable for real-time applications such as online analytics, recommendation engines, and
logging systems.
3. Column-Oriented Storage: Unlike traditional relational databases that store data in rows, HBase stores data in
column families. This makes it more efficient for reading and writing large amounts of data by accessing only
the required columns, reducing I/O operations.
4. Fault Tolerant: HBase is built on top of Hadoop’s HDFS, which provides fault tolerance through data
replication. If a node fails, data is still accessible from other nodes that have replicated copies of the data.
5. Automatic Sharding: HBase automatically splits tables into regions and distributes them across the cluster.
This automatic sharding allows for scalable storage and processing of large datasets without the need for
manual partitioning.
6. Flexible Schema: HBase provides a flexible schema where columns can be added to a column family at any
time, and the schema can evolve as the application grows, making it adaptable to changing requirements.
7. Strong Consistency: HBase provides strong consistency guarantees within a region. When a write is
acknowledged, it is immediately available for reads from any client that requests the data.
8. Integration with Hadoop Ecosystem: HBase integrates seamlessly with Hadoop, MapReduce, and other
Hadoop-based tools, enabling big data processing, analytics, and batch jobs to be run efficiently on the same
data stored in HBase.
Advanced Usage of HBase:
• Data Locality: HBase ensures that data is stored in a way that it can be processed by the
local node, reducing network overhead.
• MapReduce Integration: You can use MapReduce jobs to process data stored in HBase,
making it suitable for big data processing and analysis.
• Bulk Load: HBase supports bulk loading of data from HDFS into HBase, which is efficient for
loading large datasets into HBase tables.
• Real-time Analytics: HBase is commonly used for real-time data analytics due to its ability
to support random read/write operations.
Schema Design in HBase:
• Column Families: Choose the number of column families wisely, as each column family is
stored separately, and each one is served by a different set of HBase Region Servers.
• Row Keys: Design your row keys carefully to avoid hot spotting. Row keys should be unique
and evenly distributed.
• Avoid Wide Rows: Avoid using row keys that would result in very wide rows because they
can cause performance issues.
• Data Model Flexibility: HBase allows for flexible schema design. You can add columns to
column families without affecting existing data, which is a significant advantage for rapidly
changing data models.
Advanced Indexing in HBase:
• Row Key Indexing: The most basic form of indexing in HBase is the row key.
Querying by row key is efficient, but querying based on non-row-key columns may not be optimal.
• Secondary Indexes: HBase doesn’t support secondary indexing natively, but
you can implement it manually by creating additional tables or using libraries
like Phoenix or HBase Indexer.
• Apache Phoenix: It provides an SQL-like interface to HBase and supports
secondary indexes and other RDBMS features such as joins and aggregation.
• Global Indexing: You can create a global index by creating an additional table
that holds the indexed data, which maps to the row keys of the original table.
• Bloom Filters: HBase supports Bloom filters for column families to speed up
the lookup process by reducing disk access.
What is Zookeeper?
Zookeeper is a centralized service used in big data systems (like Hadoop and
HBase) to manage and coordinate distributed systems. It acts like a manager
that keeps all the nodes (computers) in a cluster connected, informed, and
synchronized.
How Zookeeper Helps in Monitoring a Cluster:
• Manages Nodes: It keeps track of which machines (nodes) are working and
which are not.
• Failure Detection: If a node fails or disconnects, Zookeeper quickly detects it
and informs the system.
• Leader Election: In systems where one node needs to act as a leader (like
NameNode in Hadoop), Zookeeper helps choose one automatically.
• Keeps Configuration Info: It stores settings and configuration that all machines
in the cluster can access.
How to Build Applications with Zookeeper:
• Use of znodes: Zookeeper stores all information in a hierarchical namespace
(like a file system) called znodes. Applications can read and write data to these
znodes.
• Watches and Notifications: Applications can set watches on znodes to get
notifications when data changes. This is useful for real-time configuration
updates.
• Locking and Synchronization: Zookeeper allows distributed locking, which
helps in resource sharing and synchronizing actions across distributed
applications.
• Group Membership: Applications can use Zookeeper to track group
membership, i.e., keeping a record of which services are online and available.
• Naming Service: Zookeeper can be used to manage names of services and
provide a lookup mechanism, like a phone directory for distributed services.
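A minimal Scala sketch using ZooKeeper's Java client API: it connects, creates a znode, and reads it back with a watch set. The connection string, znode path, and data are hypothetical, and a real application would wait for the session to be established before issuing requests.

  import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

  object ZkDemo {
    def main(args: Array[String]): Unit = {
      // connect to a ZooKeeper ensemble; the watcher just logs session and znode events
      val zk = new ZooKeeper("localhost:2181", 5000, new Watcher {
        override def process(event: WatchedEvent): Unit = println(s"event: $event")
      })

      // create a persistent znode holding a small piece of configuration data
      zk.create("/demo-config", "maxUsers=100".getBytes("UTF-8"),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)

      // read it back and set a watch so this client is notified when the data changes
      val data = zk.getData("/demo-config", true, null)
      println(new String(data, "UTF-8"))

      zk.close()
    }
  }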
IBM Big Data Strategy:
IBM's Big Data strategy focuses on helping organizations manage, analyze, and use
large volumes of data efficiently. The key elements of the strategy include:
• Volume, Variety, Velocity, and Veracity (4 V’s) of big data.
• Unified platform to integrate data from different sources.
• Real-time analytics and insights.
• Secure and scalable systems to handle enterprise data.
• Integration of AI and machine learning for smarter data processing.
IBM Infosphere:
IBM InfoSphere is a data integration platform that provides tools to collect, clean,
manage, and govern data. It helps in:
• Data warehousing and data quality management.
• Connecting structured and unstructured data.
• Ensuring data security, compliance, and integration across platforms.
• Supporting ETL (Extract, Transform, Load) processes.
IBM BigInsights:
IBM BigInsights is IBM’s Big Data platform built on Apache Hadoop. It is designed
for large-scale data analysis and includes:
• A user-friendly interface for non-technical users.
• Hadoop-based architecture for distributed data processing.
• Tools for data mining, machine learning, and analytics.
• Integration with IBM tools like Infosphere and Big SQL.
IBM Big Sheets:
Big Sheets is a spreadsheet-style web interface that allows users to work with
large datasets easily. It is used in BigInsights and is suitable for:
• Users who are not familiar with programming.
• Analyzing large data sets using a visual interface.
• Performing tasks like sorting, filtering, and charting big data.
Introduction to Big SQL:
Big SQL is an IBM technology that allows SQL queries on big data stored in
Hadoop. It helps in:
• Running SQL queries on Hive tables, HBase, and other data sources.
• Using existing SQL knowledge to analyze big data.
• Providing high performance, security, and compatibility with traditional
databases.
• Allowing integration with BI tools like IBM Cognos and others.
THANK YOU FOR WATCHING

BEST OF LUCK

Our Website Link - CLICK
