Big Data All Unit by Study4sub
SYLLABUS:
Introduction to Big Data: Types of digital data, history of Big Data innovation, introduction to
Big Data platform, drivers for Big Data, Big Data architecture and characteristics, 5 Vs of Big
Data, Big Data technology components, Big Data importance and applications, Big Data
features – security, compliance, auditing and protection, Big Data privacy and ethics, Big Data
Analytics, Challenges of conventional systems, intelligent data analysis, nature of data, analytic
processes and tools, analysis vs reporting, modern data analytic tools.
What is Big Data?
Big Data refers to extremely large and complex data sets that are difficult to store, manage,
and process using traditional data processing tools and methods.
In Simple Words:
• Big Data means “a huge amount of data” — so big and fast that traditional software (like
Excel or simple databases) can't handle it.
Key Points:
• It includes structured, semi-structured, and unstructured data.
History of Big Data Innovation:
• 1960s–70s: Early development of databases and data storage systems like IBM’s IMS; data stored on magnetic tapes.
• 2020 onwards: AI, ML, and IoT generate and consume massive real-time data; focus shifts to data privacy, ethics, and advanced analytics tools.
Key Drivers of Big Data
• The term “drivers” refers to the factors or reasons behind the growth and importance
of Big Data. These are the main forces that have pushed Big Data to become essential
in today’s world.
1. Rapid Growth of Internet and Social Media
• Billions of users are active on platforms like Facebook, YouTube, Instagram, and
Twitter.
• Every second, people are uploading photos, videos, comments, likes, etc.
• This creates huge volumes of data every
day.
Example: YouTube gets more than 500 hours of video uploads per minute.
2. Increased Use of Smartphones and IoT Devices
• Every smartphone, smartwatch, and smart device (like Alexa, fitness bands, smart
TVs) collects and sends data.
• These devices generate real-time data from various sensors.
• Example: A smart home system collects data about temperature, lights, and
energy usage.
3. Cheap Storage and Cloud Computing
• Earlier, storing large data was expensive.
• Now, cloud services like AWS, Google Cloud, and Microsoft Azure offer cheap and scalable storage.
• This allows companies to collect and store massive amounts of data easily.
• Point to remember: Cloud storage is flexible, cost-effective, and accessible from anywhere.
4. Advancements in Data Processing Technologies
• Tools like Hadoop, Spark, NoSQL databases allow fast and distributed processing of large data.
• These technologies help in handling structured and unstructured data with ease.
• Example: Hadoop breaks big data into smaller parts and processes it in parallel.
5. Need for Real-Time Decision Making
• Companies need to make quick decisions to stay competitive.
• Big Data helps in analyzing trends, customer behavior, and business performance in real-time.
• Example: E-commerce sites suggest products based on what users just searched.
6. Growth of AI and Machine Learning
• AI and ML need huge amounts of data to learn and make accurate predictions.
• Big Data provides the fuel for these smart systems.
• Example: Netflix uses ML and Big Data to recommend shows based on your watch history.
1. Introduction to Big Data Platform
Definition:
A Big Data Platform is an integrated system that combines various tools and
technologies to manage, store, and analyze massive volumes of data efficiently.
It provides the infrastructure and environment required for:
• Ingesting data (bringing data from sources)
• Storing data (on distributed systems)
• Processing data (in batch or real-time)
• Analyzing and visualizing data
Benefits of Big Data Platforms:
• Scalability: Easily handle growing data
• Flexibility: Supports all types of data (structured, semi-structured, unstructured)
• Real-Time Processing: Immediate insights and decisions
• Cost-Effective: Cloud and open-source tools reduce expenses
Main Components of a Big Data Platform:
1.Data Ingestion Tools
1. Used to collect and import data from different sources
2. Examples: Apache Kafka, Apache Flume, Sqoop
2.Data Storage Systems
1. Store large datasets reliably
2. Examples: HDFS (Hadoop Distributed File System), NoSQL (MongoDB, Cassandra)
3.Processing Engines
1. Perform computations and analytics on data
2. Examples: Hadoop MapReduce (batch), Apache Spark (real-time)
4.Data Management
1. Tools to organize, clean, and maintain data quality
2. Examples: Hive, HBase
5.Analytics & Visualization Tools
1. Help in generating reports and dashboards
2. Examples: Tableau, Power BI, Apache Pig, R, Python
Examples of Big Data Platforms:
• Apache Hadoop Ecosystem
• Apache Spark Framework
• Google Cloud BigQuery
• Amazon EMR (Elastic MapReduce)
• Microsoft Azure HDInsight
Big Data Architecture and Characteristics
Challenges of Conventional Systems
6. Lack of Real-Time Processing
• These systems can't process data fast enough for real-time decisions.
• Problem: Businesses miss out on opportunities that require immediate action.
7. Limited Flexibility and Integration
• Traditional systems don’t integrate well with modern technologies like cloud or
machine learning.
• Problem: It's hard to use new tools alongside old systems.
8. Data Quality Issues
• Conventional systems struggle with ensuring clean and consistent data.
• Problem: Data errors or inconsistencies can affect decision-making.
Intelligent Data Analysis
Intelligent Data Analysis refers to using advanced techniques, algorithms, and tools to analyze large
datasets and extract meaningful patterns, insights, and predictions. It involves the application of
artificial intelligence (AI), machine learning (ML), and statistical models to make smarter decisions
based on data.
Key Points:
1.AI & Machine Learning: These technologies help in learning from data and predicting future trends
or behaviors without human intervention.
2.Pattern Recognition: Intelligent data analysis identifies hidden patterns in data that are not
immediately obvious.
3.Automation: It automates data analysis processes, making it faster and more efficient.
4.Predictive Analytics: It helps forecast future events, trends, or behaviors based on historical data.
5.Real-time Insights: Intelligent analysis can provide real-time insights, helping businesses to make
quicker, more informed decisions.
Example:
• In retail, intelligent data analysis can be used to predict which products will sell best in the future by
analyzing past sales data, customer preferences, and trends.
Nature of Data
Nature of Data refers to the different forms, types, and characteristics of data that affect how
it is stored, processed, and analyzed.
Types of Data:
1.Structured Data:
1. What it is: Data that is organized into tables, rows, and columns, typically in relational databases (e.g.,
customer records, sales data).
2. Example: A database of employee information where each row represents an employee with columns like
name, age, salary, etc.
2.Unstructured Data:
1. What it is: Data that doesn't have a predefined structure, making it difficult to analyze with traditional
methods (e.g., text, images, audio, video).
2. Example: Social media posts, customer reviews, or video files.
3.Semi-structured Data:
1. What it is: Data that doesn't have a rigid structure but contains tags or markers that make it easier to
analyze (e.g., XML, JSON).
2. Example: A log file that contains a mixture of structured data (timestamps) and unstructured data (event
descriptions).
4.Big Data:
1. What it is: Extremely large datasets that require advanced tools and techniques for storage, processing,
and analysis. Big Data is often characterized by the 5 Vs: Volume, Velocity, Variety, Veracity, and Value.
2. Example: Data from IoT sensors, social media platforms, and web logs.
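To make the contrast concrete, here is a small Scala sketch (the review fields and values are made up for illustration) showing the same kind of information in structured, semi-structured, and unstructured form:

```scala
// Minimal sketch: the same customer feedback represented in three forms.
object NatureOfDataDemo extends App {

  // Structured: fixed fields, maps directly to a table row.
  case class Review(customerId: Int, rating: Int, createdAt: String)
  val structured = Review(101, 5, "2024-05-01")

  // Semi-structured: self-describing JSON text; fields can vary per record.
  val semiStructured =
    """{"customerId": 101, "rating": 5, "tags": ["fast delivery", "good packaging"]}"""

  // Unstructured: free text with no predefined fields at all.
  val unstructured = "Loved the product, it arrived two days early. Will buy again!"

  println(structured)
  println(semiStructured)
  println(unstructured)
}
```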
Characteristics of Data:
1.Volume: The amount of data being generated. It can be terabytes or even
petabytes.
2.Velocity: The speed at which data is generated and needs to be processed
(e.g., real-time data).
3.Variety: The different types of data (structured, unstructured, semi-
structured).
4.Veracity: The quality and accuracy of the data.
5.Value: The usefulness of the data for decision-making or gaining insights.
Nature of Data in Big Data:
• Big Data contains data from multiple sources that vary in type, speed, and
structure. Processing this data requires advanced technologies like Hadoop,
Spark, and machine learning to handle its complexity.
Analytic Processes and Tools
Analytic Process
The analytic process in Big Data involves several steps to extract meaningful insights
from large datasets. These steps are essential for data analysis and decision-making.
1.Data Collection:
1. Collect data from various sources such as sensors, databases, social media, and logs.
2. Example: Collecting sales data from e-commerce websites.
2.Data Cleaning:
1. Remove errors, duplicates, and irrelevant
information to ensure high-quality data.
2. Example: Removing duplicate customer entries from a database.
3.Data Analysis:
1. Apply statistical methods, machine learning models, and algorithms to analyze the data and
uncover patterns.
2. Example: Analyzing customer behavior patterns using machine learning.
4.Interpretation of Results:
1. After analysis, interpret the results to make informed decisions.
2. Example: Predicting future sales trends based on past data.
Tools: Excel, R and Python, Hadoop and Spark, Tableau/Power BI, SQL
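As a rough illustration of these four steps, here is a minimal sketch in plain Scala (the sales records and field names are hypothetical; no external libraries are assumed):

```scala
object AnalyticProcessDemo extends App {

  // 1. Data collection: in practice this comes from databases, logs, or APIs.
  case class Sale(customer: String, product: String, amount: Double)
  val collected = List(
    Sale("John", "Laptop", 55000), Sale("Asha", "Phone", 20000),
    Sale("John", "Laptop", 55000),      // duplicate entry
    Sale("Ravi", "Phone", -1)           // bad record
  )

  // 2. Data cleaning: drop duplicates and invalid amounts.
  val cleaned = collected.distinct.filter(_.amount > 0)

  // 3. Data analysis: total revenue per product.
  val revenueByProduct = cleaned.groupBy(_.product)
    .map { case (product, sales) => product -> sales.map(_.amount).sum }

  // 4. Interpretation: pick the top-selling product to guide a decision.
  val topProduct = revenueByProduct.maxBy(_._2)
  println(s"Revenue by product: $revenueByProduct")
  println(s"Top product: $topProduct")
}
```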
Analysis vs Reporting
While both analysis and reporting involve working with data, they serve different
purposes.
Analysis:
• Goal: To explore data, find patterns, and make predictions.
• Process: Involves using statistical models, machine learning, and algorithms.
• Outcome: Provides insights that can guide strategic decision-making.
• Example: Using customer data to predict future purchase behavior.
Reporting:
• Goal: To present data in a simple, understandable format.
• Process: Involves summarizing data in charts, graphs, and tables.
• Outcome: Provides an overview of performance or trends, typically for monitoring
purposes.
• Example: A monthly sales report showing the total revenue, top-selling products, and
key metrics.
Key Differences:
• Analysis is more about understanding and extracting insights from data, while reporting is about summarizing and
presenting data for easy consumption.
• Analysis typically involves advanced methods, while reporting is more about presenting results in an understandable way.
Modern Data Analytic Tools (Short Notes - AKTU Oriented)
1. Hadoop
1. Open-source framework for storing and processing large data sets in a distributed manner.
2. Handles structured and unstructured data.
2. Apache Spark
1. Fast in-memory data processing tool.
2. Suitable for real-time analytics.
3. Power BI
1. Microsoft’s tool for creating interactive dashboards and reports.
2. Easy to use and integrates with various data sources.
4. Tableau
1. Data visualization tool.
2. Helps in making graphs, charts, and dashboards for better understanding.
5. Python & R
1. Programming languages for data analysis, visualization, and machine learning.
2. Python is widely used due to its simplicity and libraries like Pandas, NumPy.
6. SQL
1. Language used to query and manage structured data in databases.
2. Essential for data extraction and manipulation.
7. Google Analytics
1. Used to track and report website traffic and user behavior.
BIG DATA
SYLLABUS: HDFS (Hadoop Distributed File System): Design of HDFS, HDFS concepts,
benefits and challenges, file sizes, block sizes and block abstraction in HDFS, data
replication, how does HDFS store, read, and write files, Java interfaces to HDFS,
command line interface, Hadoop file system interfaces, data flow, data ingest with
Flume and Sqoop, Hadoop archives, Hadoop I/O: Compression, serialization, Avro and
file-based data structures. Hadoop Environment: Setting up a Hadoop cluster, cluster
specification, cluster setup and installation, Hadoop configuration, security in Hadoop,
administering Hadoop, HDFS monitoring & maintenance, Hadoop benchmarks,
Hadoop in the cloud
1. Introduction to HDFS
• HDFS stands for Hadoop Distributed File System. It is the primary storage
system used by Hadoop applications to store large datasets reliably and
efficiently across multiple machines. HDFS is inspired by Google File System
(GFS) and is designed to run on low-cost commodity hardware.
2. Design of HDFS
• HDFS follows a Master-Slave Architecture.
• The NameNode is the master and manages the file system namespace (i.e., file
names, directories, block mappings).
• The DataNodes are the slaves that store the actual data blocks.
• Files are split into large blocks (e.g., 128 MB or 256 MB) and distributed across
DataNodes.
• It is designed for high fault tolerance, high throughput, and large-scale data
storage.
3. HDFS Concepts
• NameNode: Stores metadata of files like file permissions, block locations, and directory
structure. It doesn’t store the actual data.
• DataNode: Stores the actual data blocks. Each block is replicated for reliability.
• Block: The smallest unit of storage in HDFS. A large file is split into blocks which are
distributed.
• Replication: Blocks are replicated across multiple DataNodes to ensure fault tolerance.
• Rack Awareness: HDFS knows about rack locations to place replicas intelligently across
racks.
4. Benefits of HDFS
1.Fault Tolerance: HDFS replicates data blocks, so even if one or more nodes fail, data is still
accessible.
2.Scalability: HDFS clusters can scale by adding more nodes without downtime.
3.High Throughput: Optimized for streaming access to large datasets, ensuring fast
processing.
4.Cost-Effective: Runs on inexpensive commodity hardware.
5.Data Locality: Computation is moved closer to where the data is stored, improving
performance.
Challenges of HDFS
1.Not suitable for real-time processing: It is best for batch processing, not for
low-latency operations.
2.Small Files Issue: Too many small files can overwhelm the NameNode’s
memory, reducing efficiency.
3.Security Limitations: Requires integration with other tools like Kerberos or
Apache Ranger for security.
4.Single Point of Failure: If the NameNode fails, the entire system can stop
(solved using High Availability configuration).
File Sizes in HDFS
• HDFS is optimized for very large files (in gigabytes or terabytes).
• Small files should be avoided or combined because they increase the load on
the NameNode.
• Ideal use-case: applications that write data once and read many times (write-
once-read-many model).
Block Size in HDFS:
• In HDFS, files are split into large blocks (default: 128 MB, can be configured).
• This is much larger than traditional file systems (like 4 KB or 8 KB).
• Why large? It reduces the load on the NameNode, improves performance, and helps
in handling big data efficiently.
• Example: A 400 MB file will be split into 3 blocks of 128 MB and 1 block of 16 MB.
• Blocks are stored independently on different DataNodes.
• Users don’t manage blocks directly—HDFS handles it automatically (this is called
block abstraction).
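A quick sketch of the arithmetic behind the 400 MB example (block and file sizes are just the numbers from the text):

```scala
object BlockSplitDemo extends App {
  val blockSizeMb = 128          // default HDFS block size used in this example
  val fileSizeMb  = 400          // the 400 MB file from the text

  val fullBlocks  = fileSizeMb / blockSizeMb             // 3 full blocks of 128 MB
  val lastBlockMb = fileSizeMb % blockSizeMb             // 1 partial block of 16 MB
  val totalBlocks = fullBlocks + (if (lastBlockMb > 0) 1 else 0)

  println(s"$totalBlocks blocks: $fullBlocks x $blockSizeMb MB + $lastBlockMb MB")
  // Note: the last block only occupies 16 MB on disk; HDFS does not pad it to 128 MB.
}
```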
Data Replication in HDFS
• HDFS maintains multiple replicas (default is 3) of each data block.
• Replication strategy:
• First replica on the local node.
• Second on a different node in the same rack.
• Third on a node in a different rack.
• This ensures high availability, data durability, and fault tolerance.
• If one node fails, the system automatically reads from another replica.
What is Block Abstraction?
• In HDFS, files are divided into large blocks (default size: 128 MB or 256 MB).
• These blocks are treated independently and stored across different machines in the Hadoop cluster.
• Users or applications do not deal with the actual block management—this is handled internally by
the HDFS NameNode.
• This concept of separating file management into blocks is called block abstraction.
Benefits of Block Abstraction:
1.Scalability
1. Large files can be stored across many nodes.
2. Easy to add more storage by adding more nodes.
2.Fault Tolerance
1. If a block is lost due to machine failure, HDFS can retrieve it from a replicated copy.
3.Efficient Storage and Load Distribution
1. Files are divided into blocks and stored across multiple DataNodes.
2. This allows parallel processing and better resource utilization.
4.Simplified Data Management
1. HDFS doesn’t need to keep track of every byte or kilobyte—just blocks.
2. Makes the NameNode efficient and less overloaded.
5.Optimized Data Access
1. Processing can happen where the block is stored (data locality), which reduces network traffic.
How HDFS Stores Data
1.File is Split into Blocks
1. When a file is stored in HDFS, it is automatically divided into large blocks (default size: 128 MB
or 256 MB).
2.Metadata Managed by NameNode
1. The NameNode keeps track of which blocks belong to which file and where those blocks are
stored, but it doesn’t store the actual data.
3.Data Stored in DataNodes
1. The actual file data (blocks) is stored in multiple DataNodes, which are the worker machines in
the cluster.
4.Replication for Safety
1. Each block is replicated (default: 3 copies) across different DataNodes to prevent data loss in
case any node fails.
5.Write Operation
1. When writing, the client sends data directly to the first DataNode, which forwards it to the
second and third – this is called pipelined writing.
6.Acknowledgment
1. After all blocks are written and replicated, the system confirms the file is stored successfully.
HDFS Write Operation (Step-by-Step)
1.Client contacts NameNode to request writing a file.
2.NameNode checks metadata, such as permissions and assigns DataNodes for
each block.
3.File is split into blocks (e.g., 128 MB).
4.The client writes each block to the first DataNode, which then forwards it to
the second and third (pipelining).
5.Once all replicas are written, DataNodes send acknowledgments back to the
client.
6.NameNode updates the metadata once the write is successful.
Important Terms:
• Pipelined writing – data flows from client → DataNode1 → DataNode2 →
DataNode3.
• Replication – ensures fault tolerance (default is 3 copies per block).
HDFS Read Operation (Step-by-Step)
1.Client requests the file from the NameNode.
2.NameNode returns metadata – list of blocks and locations on DataNodes.
3.The client directly connects to the nearest DataNode to read each block (for
efficiency).
4.Data is read in parallel, block by block, and reassembled by the client.
5.The client doesn’t interact with the NameNode during data transfer, only at
the start.
Important Points:
• Reads are fast and parallel.
• If a DataNode is down, replica is read from another DataNode.
Java Interface to Hadoop for File Operations
In Hadoop, Java provides methods to interact with the Hadoop Distributed File System (HDFS). Here are the key
operations that can be performed using Java:
2. Creating Directories:
You can create directories in HDFS using the mkdirs() method. This method ensures that all necessary parent
directories are created. If the directory already exists, it will not be recreated.
4. Overloaded Methods:
Java provides overloaded versions of methods in Hadoop, offering flexibility in how files and directories are
handled, such as choosing whether to delete files recursively or not.
5. FileSystem Class:
The FileSystem class in Java provides various methods to manage files in HDFS, including creating files, reading
data, writing data, and deleting files or directories.
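A minimal sketch of these operations in Scala against the standard org.apache.hadoop.fs API (the NameNode address and paths are placeholders, and error handling is omitted):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsApiDemo extends App {
  // Point the client at the cluster; fs.defaultFS normally comes from core-site.xml.
  val conf = new Configuration()
  conf.set("fs.defaultFS", "hdfs://namenode:9000")   // placeholder address
  val fs = FileSystem.get(conf)

  // Create a directory (parent directories are created as needed, like mkdir -p).
  fs.mkdirs(new Path("/user/demo/input"))

  // Write a small file.
  val out = fs.create(new Path("/user/demo/input/hello.txt"))
  out.writeBytes("hello hdfs\n")
  out.close()

  // Read it back.
  val in = fs.open(new Path("/user/demo/input/hello.txt"))
  println(scala.io.Source.fromInputStream(in).mkString)
  in.close()

  // Delete; the boolean flag chooses recursive deletion.
  fs.delete(new Path("/user/demo"), true)

  fs.close()
}
```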
Hadoop Command Line Interface (CLI)
• The Hadoop Command Line Interface (CLI) provides an interface for interacting with the
Hadoop Distributed File System (HDFS) via terminal or command prompt. It enables users to
perform various file management operations like listing, uploading, downloading, and
deleting files, as well as managing directories.
Hadoop FileSystem Interface
The Hadoop FileSystem Interface allows users to interact with different types of file systems,
including HDFS, local file systems, and cloud storage. It provides a common set of operations
such as reading, writing, deleting, and checking the existence of files and directories.
• FileSystem Class: This class provides methods for file operations, such as creating, reading,
and deleting files.
• Path Class: Represents the location of files or directories in the file system.
• Operations: Common operations include creating directories, checking file existence,
deleting files, and copying data between local and Hadoop file systems.
• Configuration: File system settings are usually configured through Hadoop's configuration
files, ensuring correct connection to HDFS or other systems.
• The interface abstracts the underlying storage systems, making it easier for users to work
with different storage backends in a consistent manner. It plays a key role in ensuring that
applications can read and write data efficiently across a variety of file systems.
Data Ingestion
Data Ingestion refers to the process of collecting and importing data from various
sources into a system, such as a database or data warehouse, for further analysis and
processing. In the context of Big Data, data ingestion involves transferring large
volumes of data into a platform like Hadoop for storage and analysis.
Challenges in Data Ingestion
1.Data Volume: Handling large amounts of data in a timely and efficient manner can
be difficult.
2.Data Variety: Different types of data (structured, semi-structured, unstructured)
need to be ingested properly.
3.Data Velocity: Data may come in at high speeds (e.g., streaming data), which
requires real-time processing and ingestion.
4.Data Quality: Ensuring data consistency, accuracy, and completeness during the
ingestion process.
5.Integration with Multiple Sources: Collecting data from various sources (databases,
social media, IoT devices) and integrating it into a unified format.
Data Ingestion with Flume
Apache Flume is a distributed and reliable data ingestion service designed to
collect, aggregate, and move large amounts of streaming data into Hadoop.
• How Data Ingestion Works in Flume:
• Source: Data is ingested from various sources, such as log files, websites, or streaming
services.
• Channel: The data is transferred through a channel (like memory or file channel) to
ensure reliable data flow.
• Sink: The data is finally sent to a destination, such as HDFS or other storage systems, for
further processing and analysis.
• Flow Configuration: Flume uses a configuration file to define the flow of data from
source to sink.
• Flume is typically used for real-time data ingestion, where data is continuously
being streamed and stored for later processing.
Data Ingestion with Sqoop
Apache Sqoop is a tool designed for transferring bulk data between Hadoop and
relational databases (e.g., MySQL, Oracle, SQL Server).
• How Data Ingestion Works in Sqoop:
• Importing Data: Sqoop can import data from relational databases into HDFS or Hive. It
performs this by reading data from tables and writing it to HDFS in a distributed fashion.
• Exporting Data: Sqoop can also export data from HDFS back to a relational database.
This is typically used to move processed data from Hadoop back to a database for
further business operations.
• Parallel Import/Export: Sqoop uses parallel processing to divide the import/export tasks
among multiple nodes, ensuring faster data ingestion and better scalability.
• Sqoop is particularly useful for batch ingestion of structured data from
relational databases to Hadoop.
Hadoop Archives (HAR)
Hadoop Archives (HAR) is a feature in HDFS used to store many small files efficiently
by bundling them into a single archive file. This reduces the overhead of managing
numerous small files in HDFS.
How It Works:
• Small files are grouped into a single archive file in HDFS.
• The HAR file is treated as a single file, improving storage efficiency.
• Accessing the data within HAR files is done using standard Hadoop tools.
Limitations:
1.Slower Access: Retrieving data from a HAR file can be slower than from individual
files.
2.Read-Only: Once created, HAR files cannot be modified.
3.No Compression: HAR does not support compression, which may limit its efficiency.
4.Management Complexity: Managing and updating large HAR files can be
cumbersome.
5.Limited Tool Support: Some Hadoop tools might not fully support HAR files.
Compression in Hadoop:
• Compression in Hadoop helps reduce the size of data stored in the system,
which saves space and makes data transfer faster. It improves the overall
performance by reducing the amount of data moved across the network or
stored in HDFS.
• Common Compression Formats: .gz, .bzip2, .lz4, .snappy
Advantages of Compression:
• Reduces disk space usage.
• Speeds up data transfer.
• Improves processing time, especially for large datasets.
• Challenges:
• Requires CPU power for compression and decompression.
• Can add some processing delays, especially with certain compression formats.
Serialization in Hadoop:
Serialization is the process of converting data into a format that can be stored or transmitted. In Hadoop,
serialization helps store and transfer data in a way that can be easily read from or written to the system.
• Serialization Formats in Hadoop:
• Writable types (e.g., Text, IntWritable).
• Avro, a compact format for data serialization.
• Protocol Buffers and SequenceFile are also used.
• Performance Note: The process of serialization and deserialization can affect performance due to CPU and memory usage.
Avro in Hadoop:
• What is Avro?
Avro is a data serialization system used in Hadoop for efficient data exchange. It provides a
compact, fast, binary format and is used to serialize data for storage or transmission. Avro is
especially useful when working with Big Data and supports schema evolution (changing data
structure over time).
Features of Avro:
1.Compact and Fast:
Avro uses a binary format, which makes it faster and smaller in size compared to text-based
formats.
2.Schema-Based:
Data is always stored with its schema. This ensures that the data can be read without
needing an external schema.
3.Supports Schema Evolution:
Avro allows changes in schema over time like adding or removing fields without breaking
compatibility.
4.Interoperability:
Avro supports multiple programming languages like Java, Python, C, etc., making it easier to
work in a multi-language environment.
5.Integrates with Hadoop Ecosystem:
Avro works well with Hadoop tools like Hive, Pig, and MapReduce.
How Avro Works:
• Avro stores data along with its schema in a container file.
• When writing data, it uses a defined schema to serialize the data into binary
format.
• When reading, the system uses the schema (either from the file or provided
externally) to deserialize the data.
• Because both data and schema are stored together, Avro ensures data is
portable and self-describing.
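A small sketch using Avro's generic API, callable from Scala (the User schema and file name are invented for illustration), showing that the schema travels inside the container file:

```scala
import org.apache.avro.Schema
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import java.io.File

object AvroDemo extends App {
  // The schema is defined in JSON and is stored inside the container file itself.
  val schema = new Schema.Parser().parse(
    """{"type":"record","name":"User","fields":[
      |  {"name":"name","type":"string"},
      |  {"name":"age","type":"int"}]}""".stripMargin)

  // Serialize one record into users.avro (compact binary, schema embedded).
  val user = new GenericData.Record(schema)
  user.put("name", "Asha")
  user.put("age", 30)
  val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
  writer.create(schema, new File("users.avro"))
  writer.append(user)
  writer.close()

  // Deserialize: the reader picks the schema up from the file, so the data is self-describing.
  val reader = new DataFileReader[GenericRecord](new File("users.avro"),
    new GenericDatumReader[GenericRecord]())
  while (reader.hasNext) println(reader.next())
  reader.close()
}
```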
Hadoop Environment: Setting Up a Hadoop Cluster
• Setting up a Hadoop environment involves preparing hardware and software to
work in a distributed system for processing Big Data.
1. Cluster Specification:
Before setting up Hadoop, the hardware and software requirements must be defined:
• Hardware Requirements:
• One Master Node (for NameNode and ResourceManager).
• Multiple Slave Nodes (for DataNode and NodeManager).
• Each node should have sufficient RAM (at least 8GB), CPU, and storage capacity.
• Software Requirements:
• Linux-based OS (Ubuntu/CentOS preferred).
• Java (JDK 8 or later) – mandatory for running Hadoop.
• SSH Configuration – for password-less communication between nodes.
• Hadoop binary files – can be downloaded from Apache website.
2. Cluster Setup:
Setting up a cluster means connecting all the machines to work together. First, we
install Java and Hadoop on each machine. Then, we configure secure communication
using SSH so that the master can control the slaves without needing passwords.
• Next, we assign roles to each machine — which one will act as master and which
ones will be slaves. After that, we edit the configuration files to set up paths, data
directories, ports, and other necessary settings to allow the system to function in a
distributed way.
Cluster Installation:
After setup, we install and configure everything:
• Install Java and Hadoop on each machine.
• Set environment variables like Java home and Hadoop home.
• Configure the cluster by setting paths and data storage settings in the required
configuration files.
• Enable secure communication using SSH keys.
• Format the file system to prepare it for data storage.
• Start the necessary background services (called daemons) to begin using the
cluster.
• Finally, we check the system using the command line or web interface to
ensure everything is running properly.
1. Hadoop Configuration:
Hadoop configuration is essential for controlling the behavior and performance
of Hadoop components like HDFS, YARN, and MapReduce. Configuration
settings are written in XML files. The key configurations include:
• core-site.xml: Contains settings for Hadoop core like file system address
• hdfs-site.xml: Sets parameters for HDFS like replication factor, block size, and
permission settings.
• mapred-site.xml: Configures MapReduce settings such as job tracker address
and number of reduce tasks.
• yarn-site.xml: Configures YARN parameters like resource manager and node
manager settings.
• Proper configuration ensures efficient cluster operation and resource
utilization.
Security in Hadoop:
Security is a major concern in Hadoop due to the large amount of data it
handles. Hadoop provides several mechanisms for ensuring data protection:
• Authentication: Verifies user identity using Kerberos. Only authenticated users
can access the cluster.
• Authorization: Controls what operations (read/write/execute) an
authenticated user can perform on files or jobs.
• Encryption: Protects sensitive data while being transmitted over the network
or stored on disk.
• File and Directory Permissions: Similar to Unix/Linux systems. Permissions can
be set for files and directories to restrict access.
• Advanced security features can also be added using tools like Apache Ranger
and Sentry.
Administering Hadoop:
Hadoop administration refers to the management and maintenance of the
Hadoop cluster. Key responsibilities of an administrator include:
• Managing cluster components: Starting/stopping Hadoop daemons like
NameNode, DataNode, ResourceManager, and NodeManager.
• User management: Creating user accounts and setting file permissions.
• Cluster health monitoring: Ensuring all nodes are working and data is properly
replicated.
• Job management: Monitoring and controlling job execution.
• Backup and recovery: Taking regular backups and preparing for failures.
• Tools like Ambari and Cloudera Manager help simplify administration tasks
through graphical dashboards.
HDFS Monitoring and Maintenance:
Maintaining the HDFS system is important to ensure data reliability and
availability. Monitoring and maintenance involve:
• Checking disk space usage.
• Monitoring DataNodes for availability and health status.
• Monitoring NameNode UI to view cluster status and file system details.
• Checking under-replicated or corrupted blocks.
• Running balancer tool to redistribute blocks evenly across DataNodes.
• Decommissioning and adding DataNodes as needed.
• Timely monitoring helps prevent data loss and performance issues.
Hadoop Benchmarks:
Benchmarks are used to evaluate Hadoop performance in terms of speed, reliability, and resource
usage. Common benchmarks include:
• TestDFSIO: Tests read/write throughput of HDFS.
• TeraSort: Measures the performance of MapReduce in sorting large datasets.
• MRBench: Tests the performance of MapReduce jobs.
• NNBench: Measures performance of NameNode.
• These benchmarks help in cluster tuning and identifying bottlenecks.
Hadoop in the Cloud:
Running Hadoop in the cloud offers flexibility, scalability, and reduced maintenance. Cloud platforms
like Amazon AWS (EMR), Microsoft Azure HDInsight, and Google Cloud Dataproc support Hadoop.
Advantages of Hadoop in the cloud:
• On-demand resource scaling.
• Pay-per-use pricing model.
• No need for physical infrastructure.
• Easy deployment and updates.
• Hadoop in the cloud is suitable for organizations looking to process large-scale data without investing
in heavy infrastructure.
BIG DATA
SYLLABUS: Hadoop Eco System and YARN: Hadoop ecosystem components, schedulers,
fair and capacity, Hadoop 2.0 New Features – Name Node high availability, HDFS
federation, MRv2, YARN, Running MRv1 in YARN. NoSQL Databases: Introduction to
NoSQL MongoDB: Introduction, data types, creating, updating and deleting documents,
querying, introduction to indexing, capped collections Spark: Installing spark, spark
applications, jobs, stages and tasks, Resilient Distributed Datasets, anatomy of a Spark
job run, Spark on YARN SCALA: Introduction, classes and objects, basic types and
operators, built-in control structures, functions and closures, inheritance.
Hadoop Ecosystem Components:
The Hadoop ecosystem includes several tools that work together to handle big data efficiently.
These tools are built around Hadoop's core components: HDFS (storage) and MapReduce
(processing).
Main components of the Hadoop ecosystem:
1.HDFS (Hadoop Distributed File System) – Used for storing large datasets in a distributed
manner.
2.MapReduce – Programming model for processing large data in parallel.
3.YARN (Yet Another Resource Negotiator) – Manages resources and schedules jobs.
4.Hive – SQL-like query language for data summarization and analysis.
5.Pig – High-level scripting language used with MapReduce.
6.HBase – A NoSQL database that runs on top of HDFS.
7.Sqoop – Used to transfer data between Hadoop and relational databases.
8.Flume – Collects and transports large amounts of streaming data into Hadoop.
9.Oozie – Workflow scheduler to manage Hadoop jobs.
10.Zookeeper – Coordinates and manages distributed applications.
11.Mahout – Machine learning library for building predictive models.
12.Avro – A data serialization system used for efficient data exchange.
Hadoop Schedulers:
Schedulers in Hadoop YARN decide how resources (CPU, memory) are allocated among various
jobs. They ensure multiple users can share the Hadoop cluster efficiently.
1. FIFO Scheduler (First-In-First-Out):
• Simple scheduler.
• Jobs are executed in the order they arrive.
• Not fair when some jobs take too long.
2. Fair Scheduler:
• Developed by Facebook.
• Divides resources equally among all running jobs.
• Ensures that small jobs are not stuck behind large ones.
• Jobs are grouped into pools, and each pool gets a fair share of resources.
3. Capacity Scheduler:
• Developed by Yahoo.
• Designed for large organizations with multiple users.
• Cluster resources are divided into queues, and each queue gets a configured capacity.
• Unused capacity in one queue can be used by others.
Hadoop 2.0 – New Features
• Hadoop 2.0 brought major improvements over Hadoop 1.x. It solved scalability,
availability, and resource management issues. Below are the important new features:
1. NameNode High Availability (HA):
• In Hadoop 1.x, there was only one NameNode, so if it failed, the whole system would
stop (single point of failure).
• Hadoop 2.0 introduced two NameNodes: one active and one standby.
• If the active NameNode fails, the standby takes over automatically.
• This ensures that HDFS continues to work without downtime.
2. HDFS Federation:
• In earlier versions, there was only one NameNode, which could become a bottleneck
in large clusters.
• Federation allows multiple NameNodes, each managing part of the file system.
• It improves scalability and isolation, allowing different applications to use different
parts of the file system without conflict.
3. MRv2 (MapReduce Version 2):
• Also called YARN (Yet Another Resource Negotiator).
• In Hadoop 1.x, JobTracker handled both resource management and job
scheduling, which created performance issues.
• MRv2 separates these two responsibilities:
• ResourceManager handles resource allocation.
• ApplicationMaster manages the lifecycle of individual jobs.
4. YARN (Yet Another Resource Negotiator):
• YARN is the core of Hadoop 2.0.
• It allows running multiple applications (not just MapReduce), like Spark, Tez,
etc., on the same cluster.
• It improves resource utilization, scalability, and flexibility.
How to Run MapReduce Version 1 (MRv1) on YARN
1.YARN allows old MapReduce jobs to run – The jobs written for the older version of
MapReduce can still run on the new YARN system.
2.No need to change code – Existing MapReduce programs do not need to be
rewritten. They can work directly on YARN.
3.YARN handles job execution – YARN takes care of distributing the job and managing
the resources required to run it.
4.A special manager helps run old jobs – YARN includes a built-in manager that
supports running MRv1 jobs in the new environment.
5.Same way of job submission – You submit the job in the same way as before, and
YARN will run it in the background.
6.Backward compatibility – YARN supports older applications so that users can
continue using their previous work without problems.
7.Good for migration – This is helpful for companies or users who are moving from
older versions of Hadoop to newer ones.
8.Benefit of new features – Even when using old jobs, you still get advantages like
better resource sharing and job scheduling from YARN.
NoSQL Databases
Introduction:
• NoSQL stands for "Not Only SQL".
• It refers to a group of databases that do not use the traditional relational database
model.
• Designed to handle large volumes of structured, semi-structured, or unstructured
data.
• Useful in big data applications and real-time web apps.
Advantages of NoSQL:
1.Scalability – Easily handles large amounts of data and traffic by scaling horizontally.
2.Flexibility – No fixed schema; supports dynamic data types and structures.
3.High Performance – Faster read/write operations for large datasets.
4.Supports Big Data – Works well with distributed computing frameworks like Hadoop.
5.Easier for Developers – Matches modern programming paradigms (JSON, key-value).
Disadvantages of NoSQL:
1.Lack of Standardization – No uniform query language like SQL.
2.Limited Support for Complex Queries – Not ideal for multi-table joins.
3.Less Mature Tools – Compared to relational databases.
4.Consistency Issues – Often prefers availability and partition tolerance over
consistency (CAP Theorem).
5.Data Redundancy – Due to denormalization, same data may be repeated.
Types of NoSQL Databases (Explained in Detail)
1. Key-Value Stores
1. Data is stored as a pair of key and value, like a dictionary.
2. The key is unique, and the value can be anything (a string, number, JSON, etc.).
3. Very fast and efficient for lookups by key.
4. Best for: Caching, session management, simple data storage.
5. Examples: Redis, Riak, Amazon DynamoDB.
2. Document-Oriented Databases
1. Data is stored in documents (like JSON or XML), which are more flexible than rows and columns.
2. Each document is self-contained and can have different fields.
3. Easy to map to objects in code and update individual fields.
4. Best for: Content management, real-time analytics, product catalogs.
5. Examples: MongoDB, CouchDB.
3. Column-Oriented Databases
1. Stores data in columns instead of rows, making it efficient for reading specific fields across large datasets.
2. Great for analytical queries on big data.
3. Scales well across many machines.
4. Best for: Data warehousing, real-time analytics, logging.
5. Examples: Apache HBase, Cassandra.
4. Graph-Based Databases
1. Focuses on relationships between data using nodes and edges.
2. Very powerful for handling complex relationships like social networks, recommendation engines, etc.
3. Best for: Social networks, fraud detection, recommendation systems.
4. Examples: Neo4j, ArangoDB.
MongoDB
MongoDB is a NoSQL, open-source, document-oriented database. It stores data in
JSON-like documents with dynamic schemas, meaning the structure of data can vary
across documents in a collection.
Features of MongoDB:
1.Schema-less – Collections do not require a predefined schema.
2.Document-Oriented Storage – Data is stored in BSON (Binary JSON) format, allowing
for embedded documents and arrays.
3.High Performance – Supports fast read and write operations.
4.Scalability – Supports horizontal scaling using sharding.
5.Replication – Ensures high availability with replica sets.
6.Indexing – Supports indexing on any field to improve query performance.
7.Aggregation – Provides a powerful aggregation framework for data processing and
analytics.
8.Flexibility – You can store structured, semi-structured, or unstructured data.
9.Cross-Platform – Works on Windows, Linux, and MacOS.
Common MongoDB Data Types:
1.String – Used for storing text.
2.Integer – Stores numeric values (32-bit or 64-bit).
3.Boolean – True or False values.
4.Double – Stores floating-point numbers.
5.Date – Stores date and time in UTC format.
6.Array – Stores multiple values in a single field.
7.Object/Embedded Document – Stores documents within documents.
8.Null – Represents a null or missing value.
9.ObjectId – A unique identifier for each document (auto-generated).
10.Binary Data – Used to store binary data such as images or files.
1. Creating Documents in MongoDB:
• Creating a document in MongoDB means adding new data to the database. A
document in MongoDB is a record, which is similar to a row in relational databases.
• Example: If you want to store information about a person, like their name, age, and
city, you create a document for that person. MongoDB will automatically store this
data in a collection (similar to a table in a relational database).
• Once created, this document is assigned an _id by MongoDB, which uniquely
identifies it in the collection.
2. Updating Documents in MongoDB:
• Updating means modifying the existing data in a document. You can update a specific
field in a document (e.g., change the person's age or city) without affecting other
fields.
• Example: Suppose you created a document for a person named "John" with age 29.
Later, if you need to change the age to 30, you can update just the age field in that
document. You can also update multiple documents at once if needed, such as
updating the status of everyone living in "New York."
• MongoDB provides flexibility to update documents based on conditions. For example,
you can choose to update only those documents that match certain criteria.
3. Deleting Documents in MongoDB:
• Deleting documents means removing data from the database. If a document is no longer needed or is
outdated, it can be deleted.
• Example: If you want to delete the document of a person named "John," you can remove that document from
the collection. MongoDB allows you to delete just one document or multiple documents at once. For example,
you can delete all people who live in "New York" if required.
Queries in MongoDB:
MongoDB allows you to retrieve data from the database using queries. A query is a way to search for documents
that match specific conditions. The basic idea is to find specific documents based on their field values.
1. Basic Queries: You can search for documents by specifying the field and value. For example, if you want to find
all users who are 25 years old, you would search for documents where the "age" field is equal to 25.
2. Conditional Queries: MongoDB lets you apply conditions to your queries. For example, if you want to find users
older than 30, you can use a condition that searches for documents where the "age" is greater than 30.
3. Logical Queries: You can combine different conditions using logical operators like "AND" and "OR". For
instance, if you want to find users who are older than 30 but live in "New York", you can combine these
conditions.
4. Sorting: MongoDB allows you to sort your query results. For example, if you want to sort users by their age, you
can choose to display the results either in ascending or descending order.
5. Limiting Results: You can limit the number of results returned by a query. For example, if you want to get only
the first 5 documents, you can apply a limit to the query.
6. Projection: You can specify which fields to display in the query result. For example, if you only want to display
the "name" and "age" fields, you can exclude all other fields from the results.
Indexing in MongoDB:
Indexing in MongoDB is used to improve the performance of queries. When you create an
index on a field, MongoDB creates a structure that makes it faster to find documents that
match a specific value for that field.
1.Single Field Index: This is the simplest type of index and is created on a single field. For
example, if you frequently search for users by their name, you can create an index on the
"name" field to speed up those queries.
2.Compound Index: A compound index is created on multiple fields. It allows queries that
filter by several fields to be executed faster. For instance, if you often search for users by
both their name and age, a compound index can improve performance.
3.Text Index: MongoDB allows you to create a text index for full-text search. This type of index
is useful when you need to search for documents that contain specific words or phrases
within a text field.
4.Geospatial Index: If your data involves geographical locations (latitude and longitude),
MongoDB provides special indexing options to efficiently handle these types of queries.
5.Multikey Index: When you store arrays in MongoDB documents, you can create a multikey
index. This type of index is useful for queries that need to search within array fields.
6.Hashed Index: This type of index is used for efficient equality queries. It is useful when you
need to search for documents based on exact matches to a field value.
Benefits of Indexing:
• Faster Query Execution: Indexes make data retrieval quicker, as they allow
MongoDB to quickly locate relevant documents.
• Better Performance for Sorting: Sorting documents by a field with an index is
faster than sorting without one.
• Improved Read Efficiency: Indexes help MongoDB read data more efficiently,
especially with large datasets.
Limitations of Indexing:
• Space and Memory Usage: Indexes consume additional disk space and
memory. Having too many indexes can slow down performance.
• Impact on Write Operations: Every time a document is added, updated, or
deleted, MongoDB has to update the index, which can slow down write
operations.
• Maintenance: Indexes need to be maintained and updated regularly to ensure
optimal performance.
Capped Collections in MongoDB:
• A capped collection is a fixed-size collection.
• It automatically removes the oldest documents when the size limit is reached.
• Capped collections maintain the insertion order.
• They are ideal for use cases like logging or real-time data tracking.
• Capped collections provide high performance because they don’t allow
deletions or updates that would increase the document size.
Spark: Installing Spark, Spark Applications, Jobs, Stages, and Tasks
1. Installing Spark:
To begin using Apache Spark, you need to install it on your system or set it up on a cluster. Here’s a general
overview of how to install Spark:
• Pre-requisites:
• Java: Spark runs on Java, so you must have Java installed (Java 8 or later).
• Scala: Spark is written in Scala, but it also provides Java, Python, and R APIs; a separate Scala installation is not required for the pre-built packages.
• Hadoop (Optional): If you want to run Spark with Hadoop, you need to install Hadoop as well. If not, Spark
can also run in standalone mode.
• Installation Steps:
• Download Spark: Visit the official Apache Spark website and download the appropriate version (usually
the pre-built version for Hadoop).
• Extract the Spark Archive: After downloading, extract the archive to a desired location on your local
system.
• Configure Spark:
• Set up the environment variables (SPARK_HOME and PATH).
• You can configure Spark by editing the spark-defaults.conf file and setting options like the master URL,
memory settings, and other parameters.
• Run Spark: After installation and configuration, you can start Spark in local mode or connect it to a cluster (e.g., Hadoop YARN, Mesos).
• Standalone Mode: You can run Spark on your local machine (single-node mode).
• Cluster Mode: You can run Spark on a cluster by connecting to YARN or Mesos for distributed computing.
Spark Applications:
A Spark application is a complete program that uses Spark to process data. Every
Spark application has a driver program that runs the main code. The driver
coordinates the execution of the program and sends tasks to worker nodes.
• Driver Program: This controls the execution of the Spark job. It communicates
with the cluster manager to allocate resources and send tasks to worker nodes.
• Cluster Manager: It manages the distribution of tasks across nodes. It can be
Hadoop YARN, Mesos, or Spark’s built-in manager.
• Executors: Executors are the worker processes that run on worker nodes and
perform the tasks assigned to them by the driver.
Jobs:
A Spark job is triggered when you perform an action, such as counting the number of
elements in a dataset or saving the data. A job represents a complete computation and
consists of multiple stages.
Triggering Jobs: Jobs are triggered by actions in Spark. For example, calling an action
like .collect() will trigger the execution of the job.
Stages in Jobs: When a job involves transformations that require data to be shuffled
across the cluster, Spark divides the job into multiple stages. Stages are separated by
operations that require a shuffle of data (e.g., groupBy or join).
Stages:
Stages are subsets of a job that can be executed independently. Spark divides jobs into
stages based on operations that involve shuffling data.
• Shuffling: Shuffling is the process of redistributing data across the cluster when a
stage involves wide dependencies (e.g., aggregating data from different nodes).
• Execution of Stages: Each stage runs tasks in parallel, and the results are passed to
the next stage. The execution is sequential, meaning Stage 2 will not start until Stage
1 is complete.
Tasks:
A task is the smallest unit of work in Spark, corresponding to a single partition of
the data.
• Parallelism: Tasks are executed in parallel across the different worker nodes in
the cluster. The number of tasks depends on how the data is partitioned.
• Task Execution: When a stage is ready to run, Spark creates tasks for each
partition of the data. For example, if you have 100 partitions, Spark will create
100 tasks to process them in parallel.
• Task Failures: If a task fails, Spark can retry the task on another node.
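A short sketch in Scala (local mode, made-up numbers) showing how partitions become tasks, how a shuffle splits a job into stages, and how an action triggers the job:

```scala
import org.apache.spark.sql.SparkSession

object JobStageTaskDemo extends App {
  val spark = SparkSession.builder().appName("job-demo").master("local[*]").getOrCreate()
  val sc = spark.sparkContext

  // 8 partitions => 8 parallel tasks per stage.
  val nums = sc.parallelize(1 to 1000, numSlices = 8)

  // Transformations are lazy: nothing runs yet.
  val pairs = nums.map(n => (n % 10, n))

  // reduceByKey needs a shuffle, so the job that runs it has two stages.
  val sums = pairs.reduceByKey(_ + _)

  // The action triggers the job; the Spark UI (http://localhost:4040) shows
  // one job, two stages, and the tasks in each stage.
  println(sums.collect().toList)

  spark.stop()
}
```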
Resilient Distributed Datasets (RDDs) in Spark:
Resilient Distributed Datasets (RDDs) are the fundamental data structure in Spark,
designed for distributed computing. They represent an immutable distributed
collection of objects that can be processed in parallel across a cluster. The key
features of RDDs are:
1.Fault Tolerance:
RDDs can recover from failures by keeping track of their lineage, which is a record of
operations performed on the data. If a partition of an RDD is lost, Spark can
recompute it using the lineage.
2.Parallel Processing:
RDDs allow Spark to process data in parallel across multiple machines in a cluster.
Each partition of an RDD can be processed independently by a task.
3.Immutability:
RDDs are immutable, meaning once created, they cannot be changed. Any
transformation on an RDD results in the creation of a new RDD.
4.Lazy Evaluation:
Spark does not compute RDDs immediately. Instead, it builds a Directed Acyclic Graph
(DAG) of transformations and computes RDDs only when an action is called.
Anatomy of a Spark Job Run:
When a Spark job is executed, it goes through several stages:
Job Submission: A user submits a job by invoking an action on an RDD, like
.collect() or .save(). The job is submitted to the SparkContext, which coordinates
the execution.
Job Division into Stages: Spark divides the job into stages based on operations
that require data shuffling. Each stage is further divided into tasks, and tasks are
assigned to worker nodes for execution.
Task Scheduling: The scheduler places tasks on available worker nodes. Spark
uses a task scheduling mechanism that distributes the tasks across the cluster for
parallel execution.
Execution: The tasks are executed on the worker nodes. Data may be shuffled
between nodes if necessary (for operations like join or groupBy).
Result Collection: After all tasks are executed, the final results are collected and
returned to the driver program, or written to storage like HDFS.
Spark on YARN:
YARN (Yet Another Resource Negotiator) is a resource management layer for Hadoop that
allows Spark to run on top of Hadoop clusters. Here’s how Spark runs on YARN:
1.Resource Manager:
YARN’s ResourceManager manages cluster resources (CPU, memory) and schedules tasks for
Spark jobs. The ResourceManager ensures that Spark applications get the resources they
need for execution.
2.Application Master:
Spark runs in YARN by using an ApplicationMaster, which is responsible for negotiating
resources from the ResourceManager and tracking the execution of the application.
3.Execution on Worker Nodes:
Worker nodes in the Hadoop cluster run the tasks for Spark jobs. These nodes execute the
individual tasks, perform the computations, and send results back to the ApplicationMaster.
4.Data Locality:
YARN allows Spark to schedule tasks based on data locality, meaning that Spark tries to run
tasks on nodes that have the data already, reducing the need for network transfer.
5.Resource Allocation:
YARN dynamically allocates resources for Spark applications, adjusting resources based on
workload requirements, which improves resource utilization and job performance.
Introduction to Scala:
Scala is a high-level programming language that combines object-oriented and functional programming features.
It is designed to be concise, elegant, and expressive. Scala runs on the Java Virtual Machine (JVM), which
means it is compatible with Java and can make use of existing Java libraries. Scala is statically typed, meaning
that types are checked at compile-time, but it also supports type inference to reduce verbosity.
Classes and Objects:
• Classes: A class in Scala is a blueprint for creating objects. It defines the properties (variables) and behaviors
(methods) that the objects of that class will have.
• Objects: An object in Scala is a singleton instance of a class. It is used to define methods and variables that do
not belong to any specific instance of a class. An object is created when the program starts running and can be
used to access functionality without creating an instance.
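A minimal example of a class and a singleton object (names and values are illustrative):

```scala
// A class is a blueprint: each Student instance has its own name and marks.
class Student(val name: String, val marks: Int) {
  def grade: String = if (marks >= 60) "First division" else "Pass"
}

// An object is a single instance created automatically; handy for utility methods
// and for the program's entry point.
object School extends App {
  val s = new Student("Asha", 72)
  println(s"${s.name} scored ${s.marks}: ${s.grade}")
}
```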
Basic Types and Operators:
• Basic Data Types: Scala supports a range of basic types such as integers, floating-point numbers, characters,
and boolean values. Examples of basic types include:
• Int (Integer numbers)
• Double (Floating-point numbers)
• Char (Single characters)
• Boolean (True/False)
• String (Text)
• Operators: Scala supports several types of operators like:
• Arithmetic operators (e.g., +, -, *, /)
• Comparison operators (e.g., ==, !=, >, <)
• Logical operators (e.g., &&, ||)
• Assignment operators (e.g., =, +=, -=)
Built-in Control Structures:
• If-Else Statements: These are used to make decisions based on conditions. It
checks if a condition is true or false and executes the appropriate block of code.
• For Loop: The for loop is used to repeat a block of code a specific number of
times. It can be used with ranges or collections (like lists).
• While Loop: The while loop executes a block of code as long as a condition is
true.
• Match Expression: Similar to a switch statement in other languages, the match
expression in Scala is used to compare a value against different patterns and
execute corresponding code. It is a more powerful version of the switch
statement.
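The four control structures side by side in one small sketch (values are arbitrary):

```scala
object ControlStructuresDemo extends App {
  val marks = 75

  // if-else: decide between two branches.
  val result = if (marks >= 40) "Pass" else "Fail"

  // for loop: repeat over a range or collection.
  for (i <- 1 to 3) println(s"Attempt $i: $result")

  // while loop: repeat as long as a condition holds.
  var countdown = 3
  while (countdown > 0) { println(s"countdown $countdown"); countdown -= 1 }

  // match expression: pattern matching, a more powerful switch.
  val category = marks match {
    case m if m >= 60 => "First division"
    case m if m >= 40 => "Second division"
    case _            => "Fail"
  }
  println(category)
}
```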
Functions and Closures in Scala:
• Functions: A function in Scala is a block of code that takes inputs (parameters),
performs a task, and returns a result. Functions can be defined with a specific
name and can be called anywhere in the program. Scala allows defining
functions with or without parameters. Scala also supports anonymous
functions, which are functions without a name, often used for short tasks.
• Closures: A closure is a function that can capture and carry its environment
with it. This means that the function can access variables from the scope in
which it was created, even after that scope has ended. Closures are useful
when you need to store a function along with the values it depends on.
• Example of Closures:
If a function is defined inside another function, the inner function can access
variables from the outer function, even if the outer function has finished
executing. This is what makes the inner function a closure.
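A small sketch of this idea: the returned function keeps using a variable from the scope that created it, even after that scope has finished.

```scala
object ClosureDemo extends App {
  // makeMultiplier finishes executing, but the returned function still
  // "remembers" the value of factor from its enclosing scope: a closure.
  def makeMultiplier(factor: Int): Int => Int = (x: Int) => x * factor

  val double = makeMultiplier(2)
  val triple = makeMultiplier(3)

  println(double(10))  // 20
  println(triple(10))  // 30
}
```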
Inheritance in Scala:
Inheritance is a fundamental concept of object-oriented programming, where a class
can inherit properties and behaviors from another class. In Scala, one class can
extend another class using the extends keyword.
The class that is inherited from is called the superclass (or base class), and the class
that inherits is called the subclass (or derived class).
• Super class: The class whose properties and methods are inherited by another class.
• Sub class: The class that inherits the properties and methods from another class.
In Scala, a class can extend only one class, which is called single inheritance. However,
Scala supports multiple traits, which allows a class to mix in multiple behaviors.
• Traits: A trait is similar to an interface in other programming languages but can also
contain method implementations. Traits are used to add behavior to classes. A class
can extend multiple traits in Scala.
EXAMPLE
If you have a superclass called Animal with properties like name and methods like
makeSound(), a subclass like Dog can extend Animal, inheriting those properties and
methods, and then possibly adding new behavior specific to Dog.
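The Animal/Dog example written out in Scala, with a trait added to show how extra behavior can be mixed in:

```scala
// Superclass: common properties and behaviour.
class Animal(val name: String) {
  def makeSound(): String = "Some generic sound"
}

// A trait can carry method implementations and be mixed into many classes.
trait Pet {
  def play(): String = "Playing fetch"
}

// Subclass: single inheritance from Animal, plus the Pet trait.
class Dog(name: String) extends Animal(name) with Pet {
  override def makeSound(): String = "Woof!"   // behaviour specific to Dog
}

object InheritanceDemo extends App {
  val d = new Dog("Bruno")
  println(s"${d.name}: ${d.makeSound()} / ${d.play()}")
}
```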
BIG DATA
SYLLABUS: Hadoop Eco System Frameworks: Applications on Big Data using Pig, Hive
and HBase Pig : Introduction to PIG, Execution Modes of Pig, Comparison of Pig with
Databases, Grunt, Pig Latin, User Defined Functions, Data Processing operators, Hive -
Apache Hive architecture and installation, Hive shell, Hive services, Hive metastore,
comparison with traditional databases, HiveQL, tables, querying data and user defined
functions, sorting and aggregating, Map Reduce scripts, joins & subqueries. HBase –
Hbase concepts, clients, example, Hbase vs RDBMS, advanced usage, schema design,
advanced indexing, Zookeeper – how it helps in monitoring a cluster, how to build
applications with Zookeeper. IBM Big Data strategy, introduction to Infosphere,
BigInsights and Big Sheets, introduction to Big SQL.
PIG
Pig is a high-level platform developed by Apache for analyzing large data sets. It
uses a language called Pig Latin, which is similar to SQL but is designed for
handling large-scale data.
Types of Pig Execution Modes
1.Local Mode
1. In this mode, Pig runs on a single local machine.
2. It uses the local file system instead of HDFS.
3. It is mainly used for development and testing purposes with smaller datasets.
4. There is no need for Hadoop setup in local mode.
2.MapReduce Mode (Hadoop Mode)
1. This is the production mode where Pig scripts are converted into MapReduce jobs and
executed over a Hadoop cluster.
2. It supports large datasets that are stored in HDFS.
3. Requires proper Hadoop setup and configuration.
4. It provides scalability and fault tolerance.
Features of Pig
• Ease of Use: Pig Latin language is simple and similar to SQL, making it easier
for developers and analysts.
• Data Handling: It can work with both structured and semi-structured data (like
logs, JSON, XML).
• Extensibility: Users can write their own functions to handle special
requirements (called UDFs).
• Optimization: Pig automatically optimizes the execution of scripts, so users can
focus more on logic than performance tuning.
• Support for Large Datasets: It processes massive volumes of data efficiently by
converting scripts into multiple parallel tasks.
• Interoperability: It can work with other Hadoop tools like Hive, HDFS, and
HBase.
Difference Between Pig Latin and SQL
• Pig Latin is a procedural, data-flow language; SQL is declarative (you state what result you want, not how to get it).
• Pig Latin runs on top of Hadoop and works directly on files in HDFS; SQL runs on relational databases.
• A schema is optional in Pig Latin, while SQL requires a fixed schema defined in advance.
• Pig Latin handles structured and semi-structured data and suits ETL-style pipelines; SQL is designed for structured, tabular data.