Parallel Databases - Detailed Q&A
Parallel Databases - Important Questions and Detailed Answers
1. Why are Parallel Database Systems Important?
1. Cheaper Hardware: Modern processors and disks are affordable, making it cost-effective to build parallel
systems.
2. Handling Big Data: Large-scale data from transactions, logs, and media requires powerful storage and
retrieval systems.
3. Speed for Complex Queries: Parallel systems split a heavy query into sub-tasks that run at the same time,
giving faster results for analytical and decision support queries.
4. Better User Support: Many users can access the system at the same time, handled efficiently using
multiple processors.
5. Scalability: We can grow the system easily by adding more hardware as data increases.
2. What are the Partitioning Techniques in I/O Parallelism? Explain with Examples.
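1. Round-Robin: Sends the i-th tuple to disk (i mod n), so every disk receives nearly the same number of
tuples. Best for full table scans; point and range queries must search every disk.
2. Hash Partitioning: Applies a hash function to the partitioning attribute(s) to choose the disk. Good for
point queries on that attribute, since only one disk has to be searched.
3. Range Partitioning: Distributes tuples based on value ranges of the partitioning attribute (e.g., values
1-10 go to Disk 1). Ideal for range queries on that attribute, since only the disks whose ranges overlap the
query are accessed.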
4. Round-Robin Example: With 3 disks, Tuple 1 goes to Disk 1, Tuple 2 to Disk 2, Tuple 3 to Disk 3, Tuple 4
back to Disk 1, and so on.
5. Range Example: Age < 20 on Disk 1, 20-40 on Disk 2, Age > 40 on Disk 3.
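
The three techniques above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, disks are modeled as in-memory lists numbered from 0, and Python's built-in hash() stands in for whatever hash function a real system would use.

```python
from bisect import bisect_right

def round_robin_partition(tuples, n_disks):
    """Send the i-th tuple (0-based) to disk i mod n_disks."""
    disks = [[] for _ in range(n_disks)]
    for i, t in enumerate(tuples):
        disks[i % n_disks].append(t)
    return disks

def hash_partition(tuples, n_disks, key):
    """Choose the disk by hashing the partitioning attribute.

    Assumption: the built-in hash() is used for illustration only.
    """
    disks = [[] for _ in range(n_disks)]
    for t in tuples:
        disks[hash(key(t)) % n_disks].append(t)
    return disks

def range_partition(tuples, boundaries, key):
    """Choose the disk by value range.

    boundaries is a sorted partition vector. With [20, 41]:
    key < 20 -> disk 0, 20 <= key < 41 -> disk 1, key >= 41 -> disk 2.
    """
    disks = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        disks[bisect_right(boundaries, key(t))].append(t)
    return disks

# The age example from the notes: < 20, 20-40, > 40.
ages = [15, 20, 40, 41, 19, 65]
by_range = range_partition(ages, [20, 41], key=lambda age: age)
by_turn = round_robin_partition([1, 2, 3, 4, 5, 6, 7], 3)
```

Here by_range groups the ages into the three ranges of the example, and by_turn shows the rotation wrapping around after the third disk.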
3. What is Skew and How Do We Handle It in Parallel Databases?
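1. Skew means uneven distribution of data across disks, so some disks hold far more tuples than others and
become bottlenecks.
2. Attribute-Value Skew: Some values of the partitioning attribute (e.g., status='active') appear very
frequently, so all tuples with that value end up in the same partition.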
3. Partition Skew: Poorly chosen ranges cause imbalance (e.g., many users in age 20-30 group).
4. Handling Methods: Use histograms or frequency tables to choose balanced partition ranges.
5. Good hash functions or dynamic rebalancing can help reduce skew.
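
As a rough sketch of the histogram idea, the code below picks range boundaries from a sorted sample so that each partition receives about the same number of tuples (an equi-depth split). The function names and the sample ages are hypothetical, and a real system would typically work from stored histograms rather than sorting a sample on the fly.

```python
from bisect import bisect_right
from collections import Counter

def balanced_partition_vector(sample, n_partitions):
    """Pick n_partitions - 1 cut points so that each range holds
    roughly the same number of sampled values."""
    ordered = sorted(sample)
    step = len(ordered) / n_partitions
    return [ordered[int(i * step)] for i in range(1, n_partitions)]

def partition_sizes(values, boundaries):
    """Count how many values fall into each range of the partition vector."""
    counts = Counter(bisect_right(boundaries, v) for v in values)
    return [counts.get(i, 0) for i in range(len(boundaries) + 1)]

# Hypothetical sample: most users are in their twenties.
ages = [18, 21, 22, 23, 24, 25, 26, 27, 28, 29, 35, 60]

naive_sizes = partition_sizes(ages, [20, 41])       # fixed ranges from the example
vector = balanced_partition_vector(ages, 3)         # cut points taken from the data
balanced_sizes = partition_sizes(ages, vector)
```

With this sample the fixed ranges put 10 of the 12 tuples on one disk, while the data-driven cut points give 4 tuples per disk. Note that this only addresses partition skew: if a single value is very frequent, all of its tuples still fall into one range.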
4. What are Interquery and Intraquery Parallelism?
1. Interquery: Multiple queries processed in parallel. Increases throughput for transactional workloads.
2. Intraquery: A single query broken into sub-tasks, processed in parallel to reduce response time.
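3. Intraquery Types: Intraoperation parallelism runs a single operation in parallel (e.g., parallel sorting);
interoperation parallelism runs different operations of the same query in parallel (e.g., pipelining joins).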
4. Interquery parallelism is the simplest form to support, but it does not speed up any single query;
intraquery parallelism is better for complex, long-running queries.
5. Both improve performance but are suited to different needs.
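
The difference between the two forms can be illustrated with a small Python sketch. It is a structural illustration only: the data, the function names, and the use of a thread pool as a stand-in for multiple processors are all assumptions, not how a database engine is actually built.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: one list of order amounts per disk partition.
partitions = [[5, 12, 7], [30, 1], [8, 8, 8, 8]]
all_rows = [amount for part in partitions for amount in part]

def intraquery_sum(parts, pool):
    """Intraquery: one query (a SUM) split into per-partition sub-tasks."""
    partial_sums = pool.map(sum, parts)   # each worker sums one partition
    return sum(partial_sums)              # combine the partial results

def interquery(queries, rows, pool):
    """Interquery: independent queries, each run as a whole by one worker."""
    futures = [pool.submit(query, rows) for query in queries]
    return [f.result() for f in futures]

with ThreadPoolExecutor(max_workers=3) as pool:
    total = intraquery_sum(partitions, pool)
    row_count, largest = interquery([len, max], all_rows, pool)
```

In the first function the workers cooperate on one answer, which is what shortens response time; in the second each worker finishes a different query, which is what raises throughput. In CPython, threads share one interpreter lock, so this shows the structure rather than a real speedup.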
5. How is Parallel Sorting Performed in Databases?
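1. Range Partitioning Sort: Tuples are redistributed by value range of the sort attribute, and each processor
sorts its own range locally.
2. Parallel External Merge Sort: Each processor first sorts the data already on its disk. The sorted runs are
then range-partitioned across processors, and each processor merges the sorted streams it receives.
3. Final Step: The sorted ranges are simply concatenated in range order. No further merge is needed, because
each covers a disjoint part of the value range.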
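4. Efficiency: Because the processors sort their partitions at the same time, total sorting time is lower than
sorting the whole dataset sequentially, provided the partitions are balanced.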
5. Example: Processor 1 sorts IDs 1-1000, Processor 2 sorts 1001-2000, and so on.
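
A minimal sketch of the range-partitioning sort described above, assuming in-memory lists and a thread pool standing in for the processors (the function name and the sample IDs are hypothetical):

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

def range_partition_sort(values, boundaries):
    """Sort values by range partitioning.

    boundaries is a sorted partition vector; with [1001, 2001] the
    first worker gets IDs up to 1000, the second 1001-2000, and the
    third everything from 2001 upward.
    """
    # Step 1: redistribute each value to the bucket for its range.
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        buckets[bisect_right(boundaries, v)].append(v)

    # Step 2: each worker sorts its own bucket independently.
    with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
        sorted_buckets = list(pool.map(sorted, buckets))

    # Step 3: concatenate in range order; no merge is needed because
    # the ranges do not overlap.
    result = []
    for bucket in sorted_buckets:
        result.extend(bucket)
    return result

ids = [1500, 3, 2500, 999, 1001, 2001, 1000]
sorted_ids = range_partition_sort(ids, [1001, 2001])
```

The final step is plain concatenation, which is the main attraction of this method. Its weakness is the one covered under skew: if most IDs fall into one range, that worker does most of the sorting.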