
‭Big Data Analytics 2023 solution‬

‭(a) What are the characteristics of Big Data?‬

‭Big Data is commonly described by‬‭5 V's‬‭:‬

‭●‬ ‭Volume‬‭– Huge amounts of data.‬

‭●‬ ‭Velocity‬‭– Fast generation and processing of data.‬

‭●‬ ‭Variety‬‭– Different formats: structured, semi-structured, unstructured.‬

‭●‬ ‭Veracity‬‭– Uncertainty or trustworthiness of data.‬

‭●‬ ‭Value‬‭– Useful insights extracted from the data.‬

‭(b) What is HDFS?‬

HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It stores large files across multiple machines and ensures fault tolerance by replicating data blocks across different nodes in a cluster.

‭(c) List four popular Big Data platforms.‬

‭1.‬ ‭Apache Hadoop‬

‭2.‬ ‭Apache Spark‬

‭3.‬ ‭Apache Flink‬

‭4.‬ ‭Google BigQuery‬

‭(d) List the categories of clustering methods.‬

‭1.‬ ‭Partitioning Methods (e.g., K-Means)‬

‭2.‬ ‭Hierarchical Methods (e.g., Agglomerative)‬

‭3.‬ ‭Density-Based Methods (e.g., DBSCAN)‬

‭4.‬ ‭Grid-Based Methods (e.g., STING)‬

5. Model-Based Methods (e.g., EM algorithm)

‭(e) Define K-means clustering.‬

K-means clustering is a method of dividing data into K distinct clusters by minimizing the distance between data points and the centroid (center) of their assigned cluster.
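A minimal NumPy sketch of the K-means loop (illustrative only; the function, variable names, and sample points below are assumptions for demonstration, not part of the original answer):

import numpy as np

def kmeans(X, k, iterations=100, seed=0):
    # Pick k random data points as the initial centroids.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of 2-D points; expect one cluster near (1, 1) and one near (8, 8).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
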
(f) Write syntax for Bag in PIG.

data = LOAD 'file.txt' AS (name:chararray,
    marks:bag{t:tuple(subject:chararray, score:int)});

‭(g) List the challenges in data stream query processing.‬

‭1.‬ ‭High data arrival speed‬

‭2.‬ ‭Limited memory and computing resources‬

‭3.‬ ‭Real-time constraints‬

‭4.‬ ‭Handling out-of-order or incomplete data‬

‭5.‬ ‭Maintaining accuracy and consistency‬

‭(h) How many mappers run for a MapReduce job?‬

‭The number of mappers depends on the number of input splits, which in turn is determined by:‬

‭●‬ ‭The size of input data‬

● The HDFS block size

So, there is one mapper per input split. For example, a 1 GB file stored with a 128 MB block size produces 8 input splits, and hence 8 mappers run.

‭(i) What is Comparable Interface?‬

In Java, the Comparable<T> interface is used to define a natural ordering of objects. It has the method:

public int compareTo(T o);

It is used when sorting objects in collections such as lists or arrays (e.g., via Collections.sort() or Arrays.sort()).

‭(j) What are the features of Hive?‬

‭●‬ ‭SQL-like query language (HiveQL)‬

‭●‬ ‭Works on top of Hadoop using MapReduce‬

‭●‬ ‭Supports large-scale data summarization and analysis‬

‭●‬ ‭Schema flexibility and partitioning‬

‭●‬ ‭Integrates with HDFS for storage‬

Q.2 (a) With suitable diagrams, explain in detail about the read and write operations on Hadoop Distributed File System.
‭Write Operation in HDFS‬

‭Steps Involved:‬

1. Client Request: The client wants to write a file to HDFS.
2. Contact NameNode: The client requests the NameNode to get information about where to store the data blocks.
3. Block Allocation: The NameNode returns a list of DataNodes for each block (based on the replication factor, typically 3).
4. Data Streaming:
○ The client starts writing the first block to the first DataNode.
○ The first DataNode pipes the data to the second, then the second to the third (this is called a pipeline write).
5. Acknowledgement:
○ After writing is successful on all replicas, acknowledgments are sent back through the pipeline.
6. File Closure: Once all blocks are written and acknowledged, the client informs the NameNode to finalize the file.

‭Read Operation in HDFS‬

‭Steps Involved:‬

1. Client Request: The client requests to read a file.
2. Fetch Metadata: The client contacts the NameNode to get metadata (block IDs and locations).
3. Read from DataNodes:
○ The client connects directly to the closest DataNode containing the required block.
○ Reads data block by block.
4. Reconstruction: The client reassembles the file from individual blocks.

(b) Why is big data analytics so important in today's digital era? What are the 5 V's of big data?

Big Data Analytics is crucial today because it helps extract meaningful insights from massive, fast-growing, and diverse datasets. With data being generated constantly from social media, online transactions, sensors, and apps, analyzing this data enables smarter decision-making and drives innovation across industries.

‭Key Reasons for Importance‬

1. Better Decision Making: Organizations can make faster and more accurate decisions by analyzing patterns, trends, and predictions from data.
‭2.‬ ‭Customer Personalization:‬‭Companies like Amazon and Netflix use big data to‬
‭recommend products and content based on user behavior.‬
‭3.‬ ‭Business Optimization:‬‭Analytics helps in improving operations, reducing costs, and‬
‭increasing efficiency (e.g., supply chain, inventory).‬
‭4.‬ ‭Fraud Detection & Security:‬‭Banks and financial services use real-time data analysis‬
‭to detect fraud and protect customer data.‬
‭5.‬ ‭Healthcare Advancements:‬‭Patient data is analyzed to predict disease outbreaks,‬
‭personalize treatment, and improve hospital management.‬
‭6.‬ ‭Real-time Analytics:‬‭Businesses can monitor events as they happen (e.g., stock‬
‭market, traffic systems) and respond immediately.‬
‭7.‬ ‭Innovation & Product Development:‬‭Companies use analytics to identify market needs‬
‭and design new products faster.‬

‭Real-world Example:‬

● Google Maps uses traffic data to suggest faster routes.
● Spotify recommends songs using listening behavior analytics.
● Uber uses demand data to set dynamic pricing.

‭The 5 V's of big data are‬


1. Volume: The huge amount of data generated every day. This can be in terabytes, petabytes, or more. Example: billions of transactions on an e-commerce site daily.

2. Velocity: The speed at which data is created, processed, and analyzed. Data may need to be processed in real-time or near real-time. Example: real-time social media posts or stock market transactions happening every second.

3. Variety: The different types and sources of data, such as text, images, videos, sensor data, social media posts, etc. Example: customer reviews (text), product images (visual data), and sensor data from devices (structured data).

4. Veracity: The trustworthiness or quality of the data. With large datasets, it's important to ensure that the data is accurate and reliable. Example: ensuring the accuracy of financial transaction data.

5. Value: The insights or benefits that can be derived from analyzing the data. It's not just about having big data but making it useful; without value, data is just noise. Example: analyzing shopping patterns to improve sales.

‭Q.3 (b) With a neat sketch, describe the key components of Apache Hive architecture.‬

Apache Hive is a data warehouse tool built on top of Hadoop. It allows users to query large datasets stored in HDFS using a SQL-like language called HiveQL.

‭Key Components of Apache Hive Architecture‬

1. User Interface (UI): Provides various interfaces for users to submit queries and interact with Hive, including the Command Line Interface (CLI), Web UI, and JDBC/ODBC drivers.
‭2.‬ ‭Driver‬‭: Acts as the controller that receives the HiveQL statements. It manages the‬
‭lifecycle of a query, including session handling and monitoring the execution process.‬
‭3.‬ ‭Compiler‬‭: Parses the HiveQL query, performs semantic analysis, and generates an‬
‭execution plan in the form of a Directed Acyclic Graph (DAG) of stages.‬
‭4.‬ ‭Optimizer‬‭: Applies various transformations to the execution plan to optimize query‬
‭performance, such as predicate pushdown and join reordering.‬
‭5.‬ ‭Execution Engine‬‭: Executes the optimized execution plan by interacting with the‬
‭Hadoop cluster. It translates the plan into MapReduce, Tez, or Spark jobs and manages‬
‭their execution.‬
‭6.‬ ‭Metastore: Stores metadata about the Hive tables, such as schema information, table‬
‭locations, and partition details. It is typically backed by a relational database like MySQL‬
‭or PostgreSQL.‬
‭7.‬ ‭Hadoop Distributed File System (HDFS‬‭): Serves as the underlying storage layer‬
‭where the actual data resides. Hive interacts with HDFS to read and write data during‬
‭query execution.‬
Q.5 (a) Explain the following operators in Pig Latin:
 i) Grouping and joining
 ii) Combining and splitting
 iii) Filtering operators

Apache Pig Latin offers a suite of operators to process and transform large datasets efficiently. Here's an explanation of the following categories of operators: grouping and joining, combining and splitting, and filtering.

‭i) Grouping and Joining‬

GROUP Operator: The GROUP operator collects together records with the same key from a single relation, producing a new relation where each group is represented as a single record.

Syntax:

grouped_data = GROUP relation_name BY key;


JOIN Operator: The JOIN operator combines records from two or more relations based on a common field, similar to SQL joins.

Syntax:

joined_data = JOIN relation1 BY key1, relation2 BY key2;



‭ii) Combining and Splitting‬

UNION Operator: The UNION operator merges the contents of two or more relations with the same schema into a single relation.

Syntax:

combined_data = UNION relation1, relation2;


SPLIT Operator: The SPLIT operator divides a relation into two or more relations based on specified conditions.

Syntax:

SPLIT relation_name INTO relation1 IF condition1, relation2 IF condition2;

‭iii) Filtering Operators‬

FILTER Operator: The FILTER operator selects tuples from a relation that satisfy a given condition.

Syntax:

filtered_data = FILTER relation_name BY condition;


DISTINCT Operator: The DISTINCT operator removes duplicate tuples from a relation.

Syntax:

unique_data = DISTINCT relation_name;


LIMIT Operator: The LIMIT operator restricts the number of tuples in a relation to a specified number.

Syntax:

limited_data = LIMIT relation_name N;


‭(b) Explain in brief about Data manipulation in HIVE.‬

I‭n Apache Hive, Data Manipulation Language (DML) encompasses the operations used to‬
‭manage and manipulate data within Hive tables. These operations are essential for performing‬
‭tasks such as inserting, querying, updating, and deleting data stored in Hadoop's HDFS.‬

‭Key DML Operations in Hive‬


‭SELECT‬‭: Used to query and retrieve data from Hive tables.‬

SELECT * FROM employees WHERE department = 'Sales';‬


I‭NSERT‬‭: Adds new data into a table.‬


‭INSERT INTO‬‭: Appends new rows to the existing table.‬
INSERT INTO employees VALUES (1, 'John Doe', 'IT');‬

I‭NSERT OVERWRITE‬‭: Replaces the data in the table with new data.‬
INSERT OVERWRITE TABLE employees SELECT * FROM new_employees;‬

LOAD DATA: Loads data from local or HDFS files into a Hive table.
LOAD DATA LOCAL INPATH '/path/to/data.csv' INTO TABLE employees;

UPDATE and DELETE: Traditional row-level updates and deletes are not natively supported in Hive. However, starting from Hive 0.14, ACID (Atomicity, Consistency, Isolation, Durability) transactions are supported, allowing for updates and deletes on tables stored in ORC format. These operations require enabling ACID properties and are typically used in transactional tables.

‭EXPORT‬‭and‬‭IMPORT‬‭: Used for transferring data between Hive and external systems.‬

EXPORT: Exports data from a Hive table to a specified location.
EXPORT TABLE employees TO '/path/to/exported_data';

IMPORT: Imports data into a Hive table from a specified location.
IMPORT TABLE employees FROM '/path/to/imported_data';

Q.6 (a) Discuss Analysis of Variance (ANOVA) and correlation indicators of linear relationship.

‭Analysis of Variance (ANOVA)‬

Definition: Analysis of Variance (ANOVA) is a statistical method used to analyze the differences among group means in a dataset. It evaluates whether there is a significant difference between the means of two or more groups by examining the variation within and between groups.

‭Key Concepts‬‭:‬

‭1.‬ ‭Hypothesis Testing‬‭:‬

○ Null Hypothesis (H0): All group means are equal.
○ Alternative Hypothesis (Ha): At least one group mean is different.

2. F-Test:
○ ANOVA uses the F-statistic to compare the variances. The F-statistic is the ratio of the variance between group means to the variance within groups (see the formula after this list).

‭3.‬ ‭Types of ANOVA‬‭:‬

○ One-way ANOVA: Tests differences among means of one independent variable.
○ Two-way ANOVA: Tests differences among means of two independent variables and their interaction.

4. Assumptions:
○ Independence of observations.
○ Normality of the data within groups.
○ Homogeneity of variance (equal variances across groups).

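For reference, the F-statistic mentioned above takes the standard one-way ANOVA form, for k groups and N total observations:

F = \frac{\mathrm{SSB}/(k-1)}{\mathrm{SSW}/(N-k)} = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}}

where SSB is the sum of squares between groups and SSW is the sum of squares within groups.
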
‭Applications‬‭:‬

● Comparing treatment effects in experiments.
● Testing differences in average scores among different groups (e.g., age groups, income brackets).

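A quick one-way ANOVA check in Python with SciPy (the three groups of scores below are made-up values used only to illustrate the call):

from scipy.stats import f_oneway

# Hypothetical scores for three independent groups.
group_a = [85, 90, 88, 92, 87]
group_b = [78, 82, 80, 79, 81]
group_c = [90, 94, 91, 89, 93]

# Tests H0: all group means are equal; a small p-value suggests at least one mean differs.
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
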
‭Correlation Indicators of Linear Relationship‬

Definition: Correlation measures the strength and direction of the linear relationship between two variables. It is quantified using the correlation coefficient (r).

‭Key Concepts‬‭:‬

1. Correlation Coefficient (r):

○ Values range from −1 to 1.

■ r = 1: Perfect positive linear relationship.
■ r = −1: Perfect negative linear relationship.
■ r = 0: No linear relationship.

○ Calculated as:

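For paired observations (x_i, y_i), the Pearson correlation coefficient in standard form is:

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}
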

‭2.‬ ‭Types of Correlation‬‭:‬

○ Positive Correlation: As one variable increases, the other increases.
○ Negative Correlation: As one variable increases, the other decreases.
○ No Correlation: No discernible linear pattern.

‭3.‬ ‭Scatter Plots‬‭: Used to visually assess the linear relationship between variables.‬

‭4.‬ ‭Limitations‬‭:‬

○ Correlation does not imply causation.
○ Sensitive to outliers, which can distort the relationship.

‭Applications‬‭:‬

● Measuring the relationship between variables in finance (e.g., stock prices and market indices).
● Determining the association between health metrics (e.g., weight and blood pressure).

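A small illustrative computation with NumPy (the weight and blood-pressure values are fabricated for the example):

import numpy as np

# Hypothetical paired observations: weight (kg) and systolic blood pressure (mmHg).
weight = np.array([60, 68, 75, 82, 90, 97])
bp = np.array([110, 115, 118, 124, 130, 136])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(weight, bp)[0, 1]
print(f"Pearson r = {r:.3f}")  # close to +1, i.e. a strong positive linear relationship
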
‭(b) Explain in detail about Naïve Bayes Classification.‬

Naïve Bayes is a probabilistic classification algorithm based on Bayes' Theorem, with the "naïve" assumption that all features are independent of each other given the class label. Despite this simplifying assumption, it is highly effective and is widely used in machine learning, particularly for text classification, spam detection, sentiment analysis, and other Natural Language Processing (NLP) tasks.

‭Bayes' Theorem‬

‭Bayes’ Theorem calculates the‬‭posterior probability‬‭of a class given the observed features:‬

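P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}
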
‭Where:‬

● P(C∣X) = Posterior probability of class C given predictors X

‭●‬ ‭P(X∣C) = Likelihood of predictors given class‬

‭●‬ ‭P(C) = Prior probability of class‬


‭●‬ ‭P(X) = Prior probability of predictors (often treated as constant)‬

‭How Naïve Bayes Works‬

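In outline:

1. Estimate the prior probability P(C) of each class from the training data.
2. Estimate the likelihood P(xi | C) of each feature given the class; under the naïve independence assumption, P(X | C) is the product of the per-feature likelihoods.
3. For a new instance X, compute P(C)·P(X | C) for every class and predict the class with the largest value (P(X) can be ignored, since it is the same for all classes).

A minimal text-classification sketch with scikit-learn (the tiny spam/ham dataset below is invented purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: spam vs. ham messages (invented examples).
messages = [
    "win a free prize now", "limited offer click now",
    "meeting at 10 am tomorrow", "please review the project report",
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features; MultinomialNB estimates P(word | class) and P(class).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB().fit(X, labels)

test = vectorizer.transform(["free prize offer", "project meeting tomorrow"])
print(model.predict(test))  # expected: ['spam' 'ham']
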
‭Q.7 (a) Explain the use of Bloom’s Filter to mine data streams.‬

Bloom's Filter is a space-efficient probabilistic data structure used for checking whether an element is present in a set. It is particularly useful in data stream mining, where the data arrives continuously and memory usage must be minimal.

‭Key Features of Bloom’s Filter‬

● Fast and memory-efficient.
● Allows false positives (may say an item exists when it doesn't), but no false negatives (won't say an item is absent when it's present).
● Suitable for situations where exact matches are not critical.

‭How It Works‬

1. Initialize a bit array of size m with all bits set to 0.
2. Use k independent hash functions to map each element to k positions in the bit array.
3. To add an element: Set the bits at all k hash positions to 1.
4. To query an element: Check if all k corresponding bits are 1.
○ If yes → Possibly present.
○ If no → Definitely not present.

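A minimal Python sketch of these steps (the bit-array size m, the number of hash functions k, and the salted SHA-256 hashing are illustrative choices, not requirements):

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m = m               # number of bits in the array
        self.k = k               # number of hash functions
        self.bits = [False] * m  # bit array, initially all 0

    def _positions(self, item):
        # Derive k positions by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True  -> possibly present (false positives are possible)
        # False -> definitely not present (no false negatives)
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user1@example.com")
print(bf.might_contain("user1@example.com"))  # True: possibly present
print(bf.might_contain("user2@example.com"))  # False: definitely not present

The duplicate-detection example given below can be run directly against this class.
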
‭Use in Mining Data Streams‬

I‭n data streams, we often need to‬‭check duplicates or repeated items‬‭without storing the‬
‭entire stream. Bloom filters help:‬

● Duplicate detection: Identify if an element (like a user ID, IP address, etc.) has already been seen.
‭●‬ ‭Network packet inspection‬‭: Detect known spam or malicious URLs.‬
‭●‬ ‭Web crawlers‬‭: Check if a URL has already been visited.‬
‭●‬ ‭Caching‬‭: Check if a data item is in cache without accessing memory-intensive storage.‬

‭Example‬

I‭magine you're processing a stream of email addresses to detect if an address has already‬
‭subscribed to a newsletter:‬
Input stream: user1@example.com, user2@example.com, user1@example.com‬

● On seeing user1@example.com the first time → Add it to the Bloom filter.
● On seeing it again → Query the Bloom filter. Since the bits are already set, it returns possibly present → detected as a duplicate.

‭Advantages‬

● Extremely fast lookups and insertions.
● Minimal memory usage, even for very large datasets.

Q.8 (a) Discuss multiple regression with assumptions and the regression formula.

Multiple regression is a statistical technique used to predict the value of a dependent variable based on two or more independent variables. It is an extension of simple linear regression, which involves only one independent variable.

‭Purpose‬

● Understand the relationship between several predictors and a response variable.
● Predict outcomes using multiple factors.
● Evaluate the influence of each independent variable on the dependent variable.

‭Regression Formula‬

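In general form:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon

where Y is the dependent variable, X1, …, Xn are the independent variables, β0 is the intercept, β1, …, βn are the regression coefficients, and ε is the error term.
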
‭Example:‬‭Predicting a student’s exam score based on:‬

● X1: Hours studied
● X2: Number of practice tests taken
● X3: Attendance percentage

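A brief NumPy sketch fitting this example by ordinary least squares (the student records below are fabricated for illustration):

import numpy as np

# Fabricated records: hours studied, practice tests taken, attendance (%), and exam score.
X = np.array([
    [10, 2, 80], [15, 4, 90], [5, 1, 60],
    [20, 5, 95], [8, 2, 70], [12, 3, 85],
], dtype=float)
y = np.array([62, 78, 45, 90, 55, 70], dtype=float)

# Prepend a column of ones so the first fitted coefficient is the intercept (beta_0).
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("intercept:", round(coef[0], 2))
print("coefficients (hours, tests, attendance):", np.round(coef[1:], 2))
# Predicted score for a new student: 14 hours studied, 3 practice tests, 88% attendance.
print("prediction:", round(float(np.array([1, 14, 3, 88]) @ coef), 1))
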
‭Assumptions of Multiple Regression‬

1. Linearity: The relationship between the dependent variable and each predictor is linear.
2. Independence: Observations and errors should be independent of each other.
3. Homoscedasticity: Constant variance of error terms across all values of the independent variables.
4. Normality: The residuals (errors) should be normally distributed.
5. No Multicollinearity: Independent variables should not be highly correlated with each other.

‭Applications‬

● Business: Forecasting sales based on price, advertising, and promotions.
● Healthcare: Predicting disease risk using multiple patient attributes.
● Social science: Understanding the impact of socioeconomic factors on academic performance.

‭(b) Explain the architecture of Google File System with necessary diagrams.‬

The Google File System (GFS) is a scalable distributed file system developed by Google to handle large-scale data processing across clusters of commodity hardware. It is designed for fault tolerance, high throughput, and efficient storage of large files.

‭Key Features of GFS‬

● Optimized for large files (typically hundreds of MBs or GBs).
● Supports append-heavy workloads like logs and analytics.
● High fault tolerance through replication.
● Uses master-slave architecture.

‭GFS Components‬

‭1. GFS Master Server‬

● Manages metadata: file names, directory structure, chunk locations, access permissions.
● Handles file creation, deletion, and renaming.
● Coordinates chunk replication and garbage collection.
● Stores metadata in memory for fast access.

‭2. Chunk Servers‬

● Store actual file data in fixed-size chunks (default 64 MB).
● Each chunk has a unique 64-bit ID.
● Replicates chunks (default replication factor is 3).
● Periodically reports status to the master.

‭3. Clients‬

● Access files by first contacting the master to get metadata.
● Then read/write data directly from/to chunk servers to reduce load on the master.
‭Working Process‬

1. Client → Master: Request file metadata (chunk locations).
2. Master → Client: Sends chunk server addresses.
3. Client → Chunk Servers: Directly reads or writes data.
4. Write Process: Sent to all replicas of a chunk, then confirmed.

‭Advantages of GFS‬

● Fault Tolerance: Automatic chunk replication and recovery.
● High Throughput: Designed for batch processing and large files.
● Scalability: Easily scales to thousands of machines.
● Simplified Failure Handling: Commodity hardware failures are expected and handled automatically.

‭Q.9‬‭Write short notes on any two of the following:‬

‭(a) Hadoop Ecosystem‬

The Hadoop Ecosystem refers to a set of tools and frameworks built around the core Hadoop system, which enables the storage, processing, and analysis of big data in a distributed and fault-tolerant environment. The ecosystem is designed to handle large-scale data across clusters of machines and supports parallel processing.

‭Key Components of Hadoop Ecosystem:‬


1. HDFS (Hadoop Distributed File System): A distributed storage system that splits large files into blocks and stores them across multiple machines to ensure data redundancy and fault tolerance.
‭2.‬ ‭MapReduce‬‭: A programming model for processing large datasets in parallel across‬
‭distributed nodes.‬
‭3.‬ ‭YARN‬‭(Yet Another Resource Negotiator): Manages cluster resources and schedules‬
‭tasks across nodes.‬

‭Supporting Tools‬‭:‬

● Hive: A data warehouse system that provides a SQL-like interface for querying data stored in Hadoop.
‭●‬ ‭Pig‬‭: A high-level platform for creating MapReduce programs using a scripting language‬
‭called Pig Latin.‬
‭●‬ ‭HBase‬‭: A NoSQL database for storing structured data on top of HDFS, providing‬
‭real-time access to large datasets.‬
‭●‬ ‭Sqoop‬‭: A tool for transferring data between Hadoop and relational databases.‬
‭●‬ ‭Zookeeper‬‭: Coordinates distributed systems and ensures synchronization.‬
‭●‬ ‭Oozie‬‭: A scheduler for managing workflows in the Hadoop ecosystem.‬

‭Applications‬‭: Big data storage, processing, data mining, analytics, and machine learning.‬

‭(b) CURE Algorithm (Clustering Using Representatives)‬

The CURE algorithm is a hierarchical clustering method designed to handle non-spherical clusters and to be resistant to outliers. Unlike traditional clustering algorithms, which may fail on non-convex shapes or be sensitive to outliers, CURE uses representative points to define clusters and better capture the shape and structure of the data.

‭Key Features‬‭:‬

1. Representative Points: The algorithm selects a fixed number of well-scattered representative points within each cluster.
2. Shrinking: These points are shrunk towards the cluster's centroid to improve the cluster's shape representation.
3. Merging: Clusters are merged based on the distance between their representative points.

‭Advantages‬‭:‬

● Handles non-spherical clusters effectively.
● Robust to noise and outliers.
● Scales well with large datasets due to the use of representative points.

Use Cases: Customer segmentation, image clustering, gene expression data, and other applications that require handling complex cluster shapes.

‭(c) Grid Computing‬

Grid Computing is a distributed computing model that involves the sharing of computational resources across multiple organizations or locations to solve complex problems that require high processing power. It allows the pooling of resources such as CPU time, storage, and network bandwidth, enabling better performance, scalability, and flexibility for large-scale applications.

‭Key Features:‬

1. Distributed Resources: Combines resources from various machines located across different locations.
‭2.‬ ‭Resource Sharing: Allows users and organizations to share computational resources.‬
‭3.‬ ‭Scalability: Can scale to handle large computational tasks by leveraging the combined‬
‭power of multiple machines.‬

‭Advantages‬‭:‬

● Efficient use of underutilized resources.
● Facilitates the solution of large-scale scientific and engineering problems.
● Can be used for data-intensive applications such as simulations, research, and data analysis.

Applications: Scientific research, simulations, weather forecasting, and applications requiring large-scale computation.

‭(d) HDFS Block Replication‬

HDFS Block Replication is a fundamental feature of the Hadoop Distributed File System (HDFS) that ensures data reliability and fault tolerance. In HDFS, large files are divided into blocks (typically 128 MB or 256 MB in size), and these blocks are replicated across multiple machines in the cluster.

‭Key Aspects‬‭:‬

1. Replication Factor: Each block of data is replicated multiple times (the default replication factor is 3). These replicas are stored on different machines to ensure data availability in case of node failure.
‭2.‬ ‭Fault Tolerance: If a machine or block fails, the system can still access the replicated‬
‭copies of the block, ensuring no data loss.‬
‭3.‬ ‭Load Balancing: HDFS dynamically adjusts the number of replicas and the locations‬
‭where blocks are stored to ensure efficient use of resources.‬

‭Advantages‬‭:‬

● Fault tolerance: Ensures data is not lost even if some machines fail.
● High availability: Multiple replicas provide continuous access to data.
● Scalable: The replication factor can be adjusted based on the data importance and cluster resources.

Applications: HDFS Block Replication is essential in large-scale data storage systems, data warehousing, and distributed computing tasks in Hadoop.
