Cloud Computing Unit 2
Direct Attached Storage (DAS) System
• Direct Attached Storage (DAS) is a type of digital storage system directly connected to a
computer, server, or workstation without a network in between.
• It is the most basic and traditional form of storage, typically used for individual systems.
How Does DAS Work?
• Storage devices like Hard Disk Drives (HDD), Solid State Drives (SSD), or Optical Drives
(CD/DVD) are physically connected to the system via:
o USB (Universal Serial Bus)
o SATA (Serial Advanced Technology Attachment)
o SCSI (Small Computer System Interface)
o NVMe (Non-Volatile Memory Express) for high-speed SSDs
• The host computer fully controls the DAS and manages the data.
Types of DAS Devices
• Internal Storage:
o Hard drives (HDD/SSD) inside desktops/laptops.
• External Storage:
o External hard drives (USB drives, portable SSDs).
o External RAID arrays connected directly via USB or eSATA.
Advantages of DAS
1. High Performance — Direct connection means faster data access.
2. Low Cost — No need for network hardware or complex setups.
3. Simplicity — Easy to install and use.
4. Security — Data is isolated; accessible only to the connected system.
5. No Network Dependency — Works even if the network is down.
Disadvantages of DAS
1. Limited Scalability — Not suitable for large organizations.
2. No Centralized Access — Data accessible only to the connected system.
3. Difficult to Share — Not easy to share data with multiple users.
4. Management Overhead — Each system needs its own storage management.
5. No Redundancy (unless RAID used) — Risk of data loss if drive fails.
Storage Area Network (SAN)
• A Storage Area Network (SAN) is a high-speed, specialized network that provides block-
level storage access to multiple servers.
• SAN connects storage devices (like disk arrays, tape libraries) to servers, allowing them to
access storage as if it were locally attached.
• It is mainly used in data centers and large enterprises to handle huge volumes of data
efficiently.
Components of SAN:
How Does SAN Work?
• Servers send storage requests over the SAN.
• SAN uses switches to route the requests to the appropriate storage device.
• Storage devices respond and deliver data over the same dedicated network.
• SAN uses protocols like Fibre Channel (FC), iSCSI (Internet SCSI), or FCoE (Fibre Channel over
Ethernet).
Advantages of SAN
1. High Performance — Fast data access suitable for databases and enterprise apps.
2. Scalability — Easily expandable as storage needs grow.
3. Centralized Storage — Easier to manage and back up.
4. Improved Data Availability — Redundant connections reduce downtime.
5. Storage Consolidation — Pool storage resources efficiently.
Disadvantages of SAN
1. High Cost — Expensive setup and maintenance.
2. Complex Setup — Requires specialized skills to design and manage.
3. Dedicated Infrastructure — Needs separate network hardware and cabling.
4. Management Overhead — Requires continuous monitoring and optimization.
Network Attached Storage (NAS)
• Network Attached Storage (NAS) is a dedicated storage device connected to a network,
allowing multiple users and devices to access and share data simultaneously over the
network.
• NAS works at the file level, providing file-based storage services to other devices on the
network.
Components of NAS:
How Does NAS Work?
• NAS connects to a Local Area Network (LAN) via Ethernet.
• Users access NAS using standard network protocols like:
o NFS (Network File System) for UNIX/Linux.
o SMB/CIFS (Server Message Block/Common Internet File System) for Windows.
o FTP (File Transfer Protocol) for file transfers.
• Users see NAS as a shared folder or drive on their computer.
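Because a mounted NAS share looks like an ordinary folder, programs can use normal file APIs on it. A minimal sketch, assuming the share has already been mounted by the operating system; the function name and the mount path in the comment are hypothetical:

```python
from pathlib import Path

def list_shared_files(mount_point: str) -> list[str]:
    """Return the names of regular files at the top level of a mounted share."""
    return sorted(p.name for p in Path(mount_point).iterdir() if p.is_file())

# Example call (hypothetical mount point, adjust to the local setup):
# print(list_shared_files("/mnt/nas_share"))
```

The code contains nothing NAS-specific: the NFS or SMB protocol work is done by the operating system underneath the file API.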
Advantages of NAS
1. Easy File Sharing — Centralized storage accessible from any networked device.
2. Cost-Effective — Affordable solution for small to medium businesses.
3. Scalable — Add more drives or devices as needed.
4. Data Protection — Supports RAID for redundancy and backups.
5. Access Control — User permissions, authentication, and encryption.
Disadvantages of NAS
1. Network Dependent — Access limited to network connectivity and speed.
2. Performance Limitations — Slower than SAN for high-performance needs.
3. Not Suitable for Block-Level Storage — Only file-level access.
4. Limited by LAN Bandwidth — Heavy usage may affect overall network performance.
5. Security Risks — If network is compromised, NAS can be vulnerable.
File Systems
Google File System (GFS)
• Google File System (GFS) is a scalable distributed file system designed by Google to
manage large-scale data processing across thousands of commodity hardware
machines.
• It is optimized for big data applications like web search, indexing, and data mining.
GFS Architecture:
GFS Clients:
• Programs or applications that request files from GFS.
• Requests may read or modify existing files, or add new files to the system.
GFS Master Server:
• Stores metadata about the file system:
o File and directory names.
o Mapping from files to chunks.
o Chunk locations (which chunkserver has which chunk).
• Manages chunk leases for write operations.
• Coordinates system operations like chunk creation, deletion, and replication.
GFS Chunk Servers:
• Store file data in chunks (default 64 MB per chunk).
• Each chunk is replicated (typically 3 copies) across different chunkservers for
fault tolerance.
• Handle read/write requests from clients.
• Each copy of a chunk is referred to as a replica.
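The chunk size and replication factor determine how many chunks a file occupies and how much raw disk space it consumes. A minimal sketch of that arithmetic, assuming the 64 MB chunk size and three replicas mentioned in these notes; the function names are illustrative and not part of GFS:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB default chunk size (from the notes)
REPLICAS = 3                    # typical replication factor (from the notes)

def chunk_count(file_size_bytes: int) -> int:
    """Number of chunks needed for a file (the last chunk may be partly full)."""
    return (file_size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE

def raw_storage_bytes(file_size_bytes: int) -> int:
    """Approximate raw bytes consumed once every chunk has all its replicas."""
    return file_size_bytes * REPLICAS

size = 200 * 1024 * 1024                # a 200 MB file
chunks = chunk_count(size)              # 64 + 64 + 64 + 8 MB
raw_mb = raw_storage_bytes(size) // (1024 * 1024)
print(chunks, raw_mb)
```

For a 200 MB file this gives 4 chunks and about 600 MB of raw storage across the cluster.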
How Does GFS Work? (Workflow)
File Write Operation:
1. Client requests master for chunk locations.
2. Master returns primary and secondary chunkservers for replication.
3. Client pushes the data to all replicas; the data is pipelined from one chunkserver to the next rather than sent to all of them at once.
4. Primary chunkserver coordinates write among replicas.
5. Once all replicas acknowledge, client gets confirmation of successful write.
File Read Operation:
1. Client requests master for chunk location.
2. Master returns list of chunkservers holding that chunk.
3. Client reads data directly from chunkserver (nearest/fastest one).
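The read steps can be sketched with a toy in-memory master. This is an illustration only: `ToyMaster`, `locate`, the file name, and the chunkserver names are hypothetical, and the sketch assumes the client converts a byte offset into a chunk index before contacting the master.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in the notes

class ToyMaster:
    """Holds metadata only: (file name, chunk index) -> chunkserver names."""
    def __init__(self):
        self.chunk_locations = {}

    def add_chunk(self, filename, chunk_index, servers):
        self.chunk_locations[(filename, chunk_index)] = list(servers)

    def lookup(self, filename, chunk_index):
        return self.chunk_locations[(filename, chunk_index)]

def locate(master, filename, byte_offset):
    """Client side: map a byte offset to a chunk, then ask the master where it is."""
    chunk_index = byte_offset // CHUNK_SIZE
    offset_in_chunk = byte_offset % CHUNK_SIZE
    servers = master.lookup(filename, chunk_index)
    return chunk_index, offset_in_chunk, servers

master = ToyMaster()
master.add_chunk("/logs/web.log", 0, ["cs1", "cs2", "cs3"])
master.add_chunk("/logs/web.log", 1, ["cs2", "cs4", "cs5"])

# Byte 70 MB falls 6 MB into the second chunk (index 1).
result = locate(master, "/logs/web.log", 70 * 1024 * 1024)
print(result)
```

The master only answers the lookup; the file data itself would then be read from one of the returned chunkservers.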
Features of GFS
• Namespace management and locking.
• Fault tolerance.
• Reduced client and master interaction because of the large chunk size.
• High availability.
• Critical data replication.
• Automatic and efficient data recovery.
• High aggregate throughput.
Advantages of GFS
1. Highly Scalable — Supports petabytes of data.
2. Fault Tolerant — Automatic handling of failures.
3. High Performance — Efficient for large sequential reads/writes.
4. Cost Effective — Uses commodity hardware.
5. Supports Large Files — Optimized for big files used in data analysis.
Disadvantages of GFS
1. Not the best fit for small files.
2. Master may act as a bottleneck.
3. Poor support for random writes; best suited to data that is written once and then read or appended.
Hadoop Distributed File System (HDFS)
• Hadoop Distributed File System (HDFS) is an open-source distributed file
system designed to store and process large datasets across multiple machines.
• It is a core component of Apache Hadoop, built to handle big data storage
with fault tolerance and high throughput.
Architecture:
1. NameNode (Master Node)
• Stores metadata of HDFS (file names, directories, block locations).
• Manages file system namespace.
• Coordinates and controls DataNodes.
2. DataNode (Slave Node)
• Stores actual data blocks of files.
• Sends heartbeat signals to NameNode to report status.
• Handles read/write requests from clients.
3. Client
• User or application that interacts with HDFS.
• Requests file metadata from NameNode.
• Reads/Writes data directly from/to DataNodes.
How Does HDFS Work?
a) File Storage (Write Operation)
1. Client contacts NameNode to get information on where to store data.
2. Client splits the file into blocks (default block size: 128 MB in Hadoop 2.x and later, 64 MB in Hadoop 1.x).
3. NameNode assigns DataNodes to store replicas (default replication factor: 3).
4. Client writes data to first DataNode, which pipelines it to the next DataNode
until all replicas are stored.
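The write steps can be sketched as follows. This is a toy model, not Hadoop code: DataNodes are plain Python lists, the block size is 4 bytes so the example stays readable, and the function names are made up for illustration.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def pipeline_write(block: bytes, pipeline: list[list]) -> int:
    """Store the block on the first node, then forward it down the chain.

    Returns the number of nodes that stored (acknowledged) the block.
    """
    if not pipeline:
        return 0
    first, rest = pipeline[0], pipeline[1:]
    first.append(block)                      # first DataNode stores the block
    return 1 + pipeline_write(block, rest)   # and passes it to the next one

# Three toy DataNodes, matching the default replication factor of 3.
dn1, dn2, dn3 = [], [], []
blocks = split_into_blocks(b"abcdefghij", block_size=4)
acks = [pipeline_write(block, [dn1, dn2, dn3]) for block in blocks]
print(blocks, acks)
```

Each block ends up on all three DataNodes, and the client only ever sends it to the first one.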
b) File Reading (Read Operation)
1. Client requests file metadata from NameNode.
2. NameNode returns list of DataNodes holding blocks of the file.
3. Client directly connects to DataNodes to read data blocks in parallel.
4. Data is assembled back into the complete file on the client side.
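A minimal sketch of the read path, assuming a toy cluster held in dictionaries. The block IDs, node names, and helper functions are hypothetical; a thread pool stands in for the parallel connections to DataNodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster state: DataNode name -> {block id: block bytes}
DATANODES = {
    "dn1": {"blk_0": b"Hello, ", "blk_2": b"HDFS!"},
    "dn2": {"blk_1": b"toy "},
}

# What the NameNode would return: ordered (block id, DataNode) pairs.
BLOCK_MAP = [("blk_0", "dn1"), ("blk_1", "dn2"), ("blk_2", "dn1")]

def fetch_block(location):
    """Read one block directly from the DataNode that holds it."""
    block_id, node = location
    return DATANODES[node][block_id]

def read_file(block_map):
    """Fetch all blocks in parallel and join them in file order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        parts = list(pool.map(fetch_block, block_map))  # map keeps input order
    return b"".join(parts)

content = read_file(BLOCK_MAP)
print(content)
```

The blocks may arrive in any order, but they are joined in the order the NameNode listed them, so the file is reassembled correctly.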
Advantages of HDFS
1. Highly Scalable — Supports petabytes of data.
2. Fault Tolerant — Automatic handling of failures.
3. High Performance — Efficient for large sequential reads/writes.
4. Cost Effective — Uses commodity hardware.
5. Supports Large Files — Optimized for big files used in data analysis.
Disadvantages of HDFS
1. Not the best fit for small files.
2. NameNode may act as a bottleneck.
3. Poor support for random writes; files are written once and then read or appended.
Dynamo: Distributed Data Storage System
Dynamo is a highly available, scalable distributed key-value storage system developed by
Amazon to support large-scale e-commerce services such as Amazon's shopping cart
service.
Working of Dynamo
• Write Process:
1. Client sends a PUT (write) request to any node.
2. Node stores data and replicates to other nodes based on replication factor.
3. Acknowledgment is sent back to the client.
• Read Process:
1. Client sends a GET (read) request to any node.
2. Node fetches data from multiple replicas.
3. If versions differ, conflicts are resolved using vector clocks or client-side logic.
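The vector-clock comparison in the read process can be sketched as below. This is a simplified illustration, not Amazon's implementation: a clock is a dictionary from node name to update counter, and the function names are made up.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every update that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    """Classify two versions by their vector clocks."""
    a_ge_b, b_ge_a = descends(a, b), descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "a is newer"
    if b_ge_a:
        return "b is newer"
    return "conflict"   # concurrent updates: client-side logic must merge them

v1 = {"nodeA": 1}
v2 = {"nodeA": 2}                  # nodeA updated again, so v2 supersedes v1
v3 = {"nodeA": 1, "nodeB": 1}      # written via nodeB on top of v1
v4 = {"nodeA": 1, "nodeC": 1}      # written via nodeC on top of v1

print(compare(v2, v1))   # a is newer
print(compare(v3, v4))   # conflict
```

When one version descends from the other, the older one can be discarded. In the `conflict` case neither version has seen the other's update, so both are returned and the application decides how to merge them, for example by taking the union of two shopping carts.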
Write a short note on MapReduce
MapReduce is a programming model and processing technique developed by Google for
handling and processing large datasets in a distributed computing environment.
Key Concepts:
1. Map Phase:
o Takes input data and converts it into a set of key-value pairs.
o Each piece of data is processed independently in parallel.
2. Shuffle & Sort Phase:
o Intermediate key-value pairs are grouped and sorted based on keys.
o Prepares data for reduction.
3. Reduce Phase:
o Aggregates or summarizes the data with the same key.
o Produces the final output.
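The three phases can be illustrated with a single-machine word count. This is a sketch of the programming model only, with no distribution or fault tolerance; the function names are chosen for illustration.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle_and_sort(pairs):
    """Shuffle & sort: group all values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return (key, sum(values))

documents = ["the cat sat", "the dog sat down"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_and_sort(intermediate)
word_counts = dict(reduce_phase(key, values) for key, values in grouped)
print(word_counts)
```

In a real cluster each document (or input split) would be mapped on a different machine, and each key group would be reduced independently, which is what makes the model parallel.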
Features:
• Scalable and fault-tolerant.
• Parallel processing over distributed systems (like Hadoop).
• Suitable for big data analytics and batch processing.
Example Use Cases:
• Word count in large documents.
• Log analysis.
• Data aggregation.
Explain How Cloud Data Management Works
Cloud Data Management is the process of storing, organizing, and managing data on cloud
platforms instead of local servers or personal computers. It allows secure, scalable, and
flexible access to data from anywhere using the internet.
How Does It Work?
1. Data Collection & Ingestion
• Data is collected from various sources like apps, devices, databases, sensors, etc.
• The data is uploaded to cloud storage (e.g., AWS S3, Google Cloud Storage, Azure
Blob Storage).
2. Storage Management
• Data is stored in different formats: files, databases, data lakes, data warehouses.
• Storage systems provide automatic scaling, backup, and replication to protect
against data loss.
3. Data Organization & Classification
• Data is organized into folders, buckets, or tables.
• Metadata is used to describe data (e.g., type, owner, date).
• Classification helps in quick searching and access control.
4. Data Security & Access Control
• Encryption protects data during transfer and at rest.
• Authentication and Authorization ensure that only authorized users can access data.
• Role-based access (RBAC) and policies manage who can read, write, or delete data.
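A role-based access check can be sketched in a few lines. The roles, users, and permission names here are hypothetical examples and do not follow any particular cloud provider's policy format.

```python
# Each role maps to the set of actions it may perform.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "admin": {"read", "write", "delete"},
}

# Each user is assigned exactly one role in this simplified model.
USER_ROLES = {"asha": "viewer", "ravi": "admin"}

def is_allowed(user: str, action: str) -> bool:
    """Allow the action only if the user's role grants it; deny by default."""
    role = USER_ROLES.get(user)
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("asha", "read"))     # True
print(is_allowed("asha", "delete"))   # False
print(is_allowed("ravi", "delete"))   # True
print(is_allowed("guest", "read"))    # False (unknown user)
```

Permissions are attached to roles rather than to individual users, so changing what a role may do updates every user who holds that role.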
5. Backup, Replication & Recovery
• Data is backed up automatically and replicated across multiple regions for disaster
recovery.
• Enables quick restoration in case of failure or loss.
6. Data Sharing & Collaboration
• Data can be shared securely with other users or organizations.
• Enables real-time collaboration and data integration.
Cloud Data Management enables organizations to store, manage, secure, and analyze their
data efficiently using cloud platforms, offering scalability, security, and flexibility without
investing in physical infrastructure.
Describe Data-Intensive Technologies for Cloud Computing
Data-Intensive Computing deals with large-scale data processing, storage, and management. In
cloud computing, data-intensive technologies handle big data efficiently, ensuring high
performance, scalability, and fault tolerance.
Data Intensive Technologies for Cloud Computing:
1. Distributed File Systems
• Store and manage massive data across multiple machines.
• Ensure fault tolerance and high availability.
• Examples:
o HDFS (Hadoop Distributed File System)
o Google File System (GFS)
o Amazon S3 (object storage)
2. MapReduce Programming Model
• A framework for processing large data sets in parallel across distributed clusters.
• Map step: Converts input data into intermediate key-value pairs.
• Reduce step: Aggregates data.
• Examples: Hadoop MapReduce, Google MapReduce.
3. Distributed Databases & NoSQL Databases
• Manage unstructured and semi-structured data efficiently.
• Handle huge volumes of data with high scalability.
• Types & Examples:
o Key-Value Stores: DynamoDB, Riak
o Column-Oriented: Cassandra, HBase
o Document Stores: MongoDB, CouchDB
o Graph Databases: Neo4j
4. Data Warehousing and Big Data Platforms
• Store and analyze structured data for decision-making and analytics.
• Examples:
o Google BigQuery
o Amazon Redshift
o Snowflake
o Azure Synapse Analytics
5. Machine Learning and AI Frameworks
• Analyze big data for predictions, trends, and AI models.
• Examples:
o TensorFlow, PyTorch
o ML on AWS (SageMaker), Azure ML, Google AI Platform
List and explain cloud data storage challenges.
Cloud Data Storage Challenges
Cloud data storage offers scalability and flexibility, but it comes with several challenges that need to
be addressed to ensure security, reliability, and cost-efficiency.
1. Data Security and Privacy
• Challenge: Storing sensitive data on third-party cloud platforms raises concerns about
unauthorized access, data breaches, and privacy violations.
• Key Issues:
o Data leakage or hacking.
o Insider threats from cloud provider employees.
2. Data Availability and Downtime
• Challenge: Ensuring that data is always accessible even during system failures or outages.
• Key Issues:
o Cloud service downtime affects user access.
3. Data Integrity
• Challenge: Maintaining the accuracy and consistency of stored data over time.
• Key Issues:
o Data corruption due to hardware/software issues.
o Transmission errors during data transfer.
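One common way to detect corruption or transmission errors is to compare a checksum computed before upload with one computed after download. A minimal sketch using SHA-256 from the Python standard library; the sample data and function names are illustrative:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, expected_digest: str) -> bool:
    """True if the data still matches the digest recorded earlier."""
    return sha256_hex(data) == expected_digest

original = b"quarterly-report-v1"
digest = sha256_hex(original)          # recorded before upload

corrupted = b"quarterly-report-v2"     # a single changed byte
print(verify(original, digest))        # True
print(verify(corrupted, digest))       # False
```

Even a one-byte change produces a different digest, so a mismatch signals that the stored or transferred copy should not be trusted.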
4. Vendor Lock-in
• Challenge: Difficulty in migrating data between different cloud providers due to proprietary
formats and APIs.
• Key Issues:
o High switching costs.
o Limited flexibility to change providers.
5. Cost Management
• Challenge: Unpredictable and escalating costs of storage as data grows.
• Key Issues:
o Hidden fees for data transfer, retrieval, and redundancy.
o Cost of long-term data storage and backups.
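The effect of data growth and transfer fees on the bill can be estimated with simple arithmetic. The prices below are hypothetical values chosen only for illustration and are not any provider's actual rates; the function names are also made up.

```python
# Hypothetical prices, for illustration only.
STORAGE_PRICE_PER_GB = 0.02   # per GB stored per month
EGRESS_PRICE_PER_GB = 0.09    # per GB transferred out

def monthly_cost(stored_gb: float, egress_gb: float) -> float:
    """Storage charge plus data-transfer (egress) charge for one month."""
    return stored_gb * STORAGE_PRICE_PER_GB + egress_gb * EGRESS_PRICE_PER_GB

def projected_costs(start_gb: float, growth_rate: float,
                    months: int, egress_share: float) -> list[float]:
    """Month-by-month cost when stored data grows by growth_rate each month."""
    costs, stored = [], start_gb
    for _ in range(months):
        costs.append(monthly_cost(stored, stored * egress_share))
        stored *= 1 + growth_rate
    return costs

# 1,000 GB growing 10% per month, with 5% of it downloaded each month.
costs = projected_costs(start_gb=1000, growth_rate=0.10, months=12, egress_share=0.05)
for month, cost in enumerate(costs, start=1):
    print(f"month {month:2d}: {cost:8.2f}")
```

Because growth compounds, the final month costs nearly three times the first one under these assumptions, which shows how storage bills can escalate without any change in prices.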
6. Data Backup and Recovery
• Challenge: Ensuring proper backup strategies and quick data recovery in case of data loss or
attack.
• Key Issues:
o Delays in data restoration.
o Incomplete or outdated backups.