Data Science
with
Visualization
By
Prof. Madhusmita Behera
Department of
Computer Science & Engineering
Module-2: Handling large data
Handling large data on a single computer: The
problems faced, General techniques for handling large
volumes of data, General programming tips for
dealing with large datasets.
First steps in big data: Distributing data storage and
processing with frameworks - Hadoop, Spark
4.1 The problems you face when handling
large data
• Memory Overload
• Algorithm Runtime Issues
• CPU Starvation (I/O Bottlenecks)
4.1 The problems you face when handling large
data
• Memory Issues:
• Computers have limited RAM.
• If data doesn’t fit, the OS swaps to disk (slow).
• Many algorithms load all data into memory → causes out-of-memory errors.
• Some algorithms keep multiple copies or intermediate results → worsens memory
use.
• Time Issues:
• Some algorithms may run forever or take too long.
• Even small datasets can take unreasonable time if the algorithm is inefficient.
• CPU and I/O Bottlenecks:
• Processors may sit idle while waiting for data from storage.
• Hard drives (HDDs) are slow; SSDs are faster but more costly.
• Poor data feeding to CPU leads to CPU starvation.
4.2 General techniques for handling large volumes of
data
• No one-to-one mapping between problems and solutions – a single technique (such as compression) can address both memory and computation issues.
• Data compression:
• Makes datasets smaller → solves
memory problems.
• Shifts workload from slow hard disk to
fast CPU → improves speed.
• Without compression → many I/O
operations, CPU stays idle.
• With compression → CPU gets more
work, system is balanced.
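To make the compression idea concrete, here is a minimal sketch of reading a gzip-compressed CSV in chunks with pandas; the file name events.csv.gz and the amount column are hypothetical.
```python
import pandas as pd

# Reading the compressed file directly: far fewer bytes cross the slow
# disk-to-memory path, and the CPU does the decompression on the fly.
total = 0
for chunk in pd.read_csv("events.csv.gz", compression="gzip",
                         chunksize=100_000):   # hypothetical file and column
    total += chunk["amount"].sum()
print(total)
```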
4.2.1 Choosing the right algorithm
• Choosing the right algorithm is key to
handling big data efficiently.
• Algorithms matter more than
hardware: A good algorithm can solve
problems better than just adding faster
machines.
• Efficient algorithms for large data:
They avoid loading the entire dataset
into memory.
• Parallelization: The best algorithms
support running tasks in parallel
(multiple calculations at the same
time).
ONLINE LEARNING ALGORITHMS
• Online learning is memory-efficient and
suitable for large, continuous data streams.
• Online learning algorithms can be trained
one data point at a time instead of loading
all data at once.
• When a new data point arrives, the model
updates itself and then forgets the raw data
(only keeps updated parameters).
• Example: A weather prediction model
updates its parameters (like pressure,
temperature) for each region, then moves
on.
• This “use and forget” method prevents
memory overload because individual
observations are small.
• Refer to Listing 4.1 for an example
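This is not the book’s Listing 4.1, but a minimal pure-NumPy sketch of the “use and forget” idea: an online linear-regression model updated one observation at a time with stochastic gradient descent. The simulated data stream is purely illustrative.
```python
import numpy as np

rng = np.random.default_rng(0)
w, b, lr = np.zeros(3), 0.0, 0.01        # only the parameters stay in memory

def observations(n=10_000):
    """Simulate data points arriving one at a time."""
    for _ in range(n):
        x = rng.normal(size=3)
        y = 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[2] + rng.normal(scale=0.1)
        yield x, y

for x, y in observations():
    err = (w @ x + b) - y                # prediction error for this point
    w -= lr * err * x                    # stochastic gradient step
    b -= lr * err                        # the raw observation is then discarded

print(w, b)                              # approaches [2.0, -1.0, 0.5] and ~0
```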
ONLINE LEARNING ALGORITHMS
• Ways of Feeding Data to Algorithms
• Full batch learning – All data is fed to the algorithm at once (needs lots of
memory).
• Mini-batch learning – Data is fed in small batches (10–1,000 samples at a time).
• Online learning – Data is fed one observation at a time.
• Mini-batches & Sliding Window
• Online algorithms can also handle mini-batches.
• A sliding window can move over the data to train gradually.
• Online Learning vs Streaming Algorithms
• Both: Process data one point at a time.
• Difference:
• Streaming algorithms → See each observation once only (e.g., Twitter stream).
• Online learning algorithms → Can reuse the same data multiple times (works for both static
datasets and streaming data).
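A hedged sketch of mini-batch feeding, assuming scikit-learn is available: a generator yields small chunks that are passed to SGDClassifier.partial_fit, so the full data set never has to sit in RAM. The synthetic batches are illustrative.
```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def mini_batches(n_batches=100, batch_size=256, n_features=10):
    """Yield small (X, y) chunks instead of one huge array."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy target
        yield X, y

clf = SGDClassifier()
for i, (X, y) in enumerate(mini_batches()):
    if i == 0:
        clf.partial_fit(X, y, classes=np.array([0, 1]))  # classes needed on the first call
    else:
        clf.partial_fit(X, y)                            # incremental updates afterwards

X_test, y_test = next(mini_batches(n_batches=1))
print(clf.score(X_test, y_test))
```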
DIVIDING A LARGE MATRIX INTO MANY SMALL ONES
• Large data tables (matrices) may not fit in memory.
• Splitting them into smaller matrices allows algorithms (like linear regression) to
still work.
• Linear regression weights can be calculated using matrix calculus.
• Python tools for handling large data
• bcolz
• Stores arrays compactly.
• Uses hard drive storage when arrays don’t fit into main memory (RAM).
• Dask
• Optimizes and parallelizes calculations.
• Handles computations on large datasets.
• Needs to be installed separately (conda install dask).
• May cause issues with 64-bit Python, but dependencies (e.g., toolz) are usually installed
automatically.
• Refer to Listing 4.3 for an example
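This is not the book’s Listing 4.3; it is a minimal Dask sketch (assuming dask is installed) showing how a large array is split into chunks and reduced without materializing it all in memory.
```python
import dask.array as da

# Describe a large array as many NumPy-sized chunks.
x = da.random.random((1_000_000, 100), chunks=(100_000, 100))

# Nothing runs yet: Dask only builds a task graph of per-chunk operations.
col_means = x.mean(axis=0)

# compute() executes the chunked tasks (in parallel where possible)
# and returns an ordinary NumPy array.
print(col_means.compute().shape)   # (100,)
```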
MapReduce
• MapReduce = split a task into smaller chunks (map) → process the chunks in parallel → combine the results (reduce). It’s powerful for big data processing and forms the backbone of systems like Hadoop.
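A minimal single-machine sketch of the pattern, using the vote-counting analogy from the next slide; the ballot data is made up.
```python
from collections import Counter
from functools import reduce

# Hypothetical ballots, grouped per polling station.
stations = [
    ["party_a", "party_b", "party_a"],
    ["party_b", "party_b", "party_c"],
    ["party_a", "party_c", "party_c"],
]

# Map: every station tallies its own ballots independently (parallelizable).
partial_counts = map(Counter, stations)

# Reduce: merge the per-station tallies into a national total.
total = reduce(lambda a, b: a + b, partial_counts, Counter())
print(total)   # Counter({'party_a': 3, 'party_b': 3, 'party_c': 3})
```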
MapReduce
• Election-counting analogy:
• Centralized way: collect all ballots nationwide and count them in one place → slow and inefficient.
• Distributed way: each local office counts the votes for the 25 parties, then sends only its totals → the central office simply aggregates them.
• Advantages
• Easy to parallelize (run on many machines at once).
• Fits well in distributed environments like Hadoop.
• Can also run on a single computer.
• Python Libraries for MapReduce
• Hadoopy
• Octopy
• Disco
• Dumbo
4.2.2 Choosing the right data structure
• The choice of algorithm can
significantly affect your program's
efficiency.
• But how you store data is as crucial as
the algorithms you use.
• Various data structures have different
memory and storage needs.
• Data structures affect the
performance of CRUD operations
(Create, Read, Update, Delete) and
other data operations.
Choosing the right data structure : Sparse Data
• A sparse data set has many entries but relatively little actual information; most
values are zero.
• When converting textual data (e.g., tweets) to binary features, most entries are
zero since each text uses only a small fraction of possible words.
• The resulting large matrix consumes memory despite containing minimal
meaningful data.
• Sparse matrices can be stored efficiently by recording only the non-zero values together with their positions, e.g., (row 2, column 9, value 1) – see the sketch after this list.
• Many Python libraries and algorithms now support working with sparse
matrices
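A small sketch with SciPy (assuming scipy is installed) showing the position-plus-value storage described above; the tiny tweet-by-word matrix is made up.
```python
import numpy as np
from scipy.sparse import coo_matrix

# 4 tweets x 10 vocabulary words, but only 3 non-zero entries are stored.
rows = np.array([2, 0, 3])     # document index
cols = np.array([9, 1, 4])     # word index
vals = np.array([1, 1, 2])     # counts; (2, 9, 1) matches the example above
m = coo_matrix((vals, (rows, cols)), shape=(4, 10))

print(m.nnz)              # 3 stored values instead of 40 cells
print(m.toarray())        # dense view, only sensible for tiny examples
m_csr = m.tocsr()         # CSR format: efficient row slicing and arithmetic
```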
Choosing the right data structure : Tree Structures
• Trees organize data hierarchically, starting from a root and branching
into subtrees of children.
• Trees allow faster information retrieval compared to scanning entire
tables.
• In the example, the tree uses age-based decision rules to narrow
down the search for a person quickly.
• Start at the root → choose the most discriminating factor (e.g., age)
→ move down through branches until reaching the correct leaf (the
actual record).
• The leaf nodes contain the actual data records, such as names, ages, and IDs (e.g., Danil, 22, 6003).
• Similar to family trees or biological trees, where branches divide into
smaller subsets.
• Trees (and hash tables) are commonly used in indices to speed up
data lookup, avoiding full-table scans.
• Index-based searching drastically improves efficiency for large
datasets.
• Example: the Akinator (http://en.akinator.com/) is a djinn in a magical lamp that tries to guess the person you are thinking of by asking only a few questions – essentially walking down a decision tree.
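A hedged sketch of the index idea: records kept sorted by age and queried with binary search via Python’s bisect module, which is the spirit of tree-based database indices (the records besides Danil are invented).
```python
import bisect

# Records: (age, name, id), kept sorted by the search key.
records = sorted([
    (19, "Aisha", 6001),
    (22, "Danil", 6003),
    (27, "Meera", 6010),
    (34, "Ravi",  6042),
])
ages = [r[0] for r in records]

def find_by_age(age):
    """Binary search: O(log n) comparisons instead of scanning every row."""
    i = bisect.bisect_left(ages, age)
    if i < len(ages) and ages[i] == age:
        return records[i]
    return None

print(find_by_age(22))   # (22, 'Danil', 6003)
```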
Choosing the right data structure : Hash Tables
• Hash tables are data structures that map each value to a computed
key and store keys in buckets.
• Enable quick data retrieval by looking up the correct bucket using
the key.
• Python dictionaries (dict) are a built-in implementation of hash
tables. Hash tables are similar to key-value stores.
• Commonly used as indices to accelerate information retrieval.
• Often used in systems like recommender engines for fast lookups.
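A minimal illustration using Python’s built-in dict, which is itself a hash table; the user records are invented.
```python
# The key is hashed to locate its bucket, so lookups stay O(1) on average
# no matter how large the table grows.
users = {
    6003: {"name": "Danil", "age": 22},
    6010: {"name": "Meera", "age": 27},
}

print(users[6003]["name"])            # direct bucket lookup, no scanning
print(users.get(9999, "not found"))   # safe lookup for a missing key
```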
4.2.3 Selecting the right tools
• After selecting algorithms and data structures, choose the right tool.
Tools can be Python libraries or external tools controllable via Python.
Many tools exist, but only a few are listed.
• Python libraries for large data – Offer optimized data structures, code
optimization, and just-in-time (JIT) compilation.
1. Cython – Superset of Python; enforces data type declarations for faster
execution. http://cython.org/
2. Numexpr – Faster numerical expression evaluator than NumPy; uses
internal JIT compiler. https://github.com/pydata/numexpr
3. Numba – JIT compiler for Python; speeds up execution to near C-level
performance. http://numba.pydata.org/
4.2.3 Selecting the right tools
4. Bcolz – Stores arrays in compressed form; reduces memory usage;
integrates with Numexpr. http://bcolz.blosc.org/
5. Blaze – Translates Python code into queries for multiple data sources
(SQL, CSV, Spark). http://blaze.readthedocs.org/en/latest/index.html
6. Theano – Uses GPU, symbolic math, and tensors; includes JIT compiler
for speed. http://deeplearning.net/software/Theano/
7. Dask – Optimizes and distributes computation workflows for parallel
processing.
Purpose – Mainly for Python-based data processing; Blaze also integrates
with databases.
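As one concrete example of the JIT idea, a minimal Numba sketch (assuming numba is installed); the same speed-up strategy applies, with different syntax, to Cython or Numexpr.
```python
import numpy as np
from numba import njit

@njit                       # compiled to machine code on the first call
def row_sums(a):
    out = np.zeros(a.shape[0])
    for i in range(a.shape[0]):        # explicit loops are cheap once compiled
        for j in range(a.shape[1]):
            out[i] += a[i, j]
    return out

a = np.random.rand(10_000, 100)
print(row_sums(a)[:3])      # later calls run at near-C speed
```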
USE PYTHON AS A MASTER TO CONTROL OTHER
TOOLS
• Python as a controller – Python can control and integrate specialized
software via its interfaces. Most tool and software vendors provide
Python APIs.
• This flexibility makes Python stand out compared to R and SAS.
• Leverage Python to exploit the power of specialized tools fully.
• Connecting Python with NoSQL databases and graph data systems.
4.3 General programming tips for dealing with large data
sets
• The general tricks split into three
parts:
• Don’t reinvent the wheel. Use tools
and libraries developed by others.
• Get the most out of your hardware. Your machine is never used to its full potential; with simple adaptations you can make it work harder.
• Reduce the computing need. Slim
down your memory and processing
needs as much as possible.
4.3.1 Don’t reinvent the wheel
• Stop DRA – “Don’t Repeat Anyone”: avoid solving problems that have already been solved; focus on adding value instead.
• Use databases effectively – Prepare analytical tables in databases; for
complex tasks, leverage user-defined functions and procedures.
• Use optimized libraries – Rely on proven libraries like Mahout or
Weka instead of building from scratch, unless for learning purposes.
4.3.2 Get the most out of your hardware
• Avoid idle resources and shift workload from overused to underused
components.
• Feed the CPU compressed data to reduce disk I/O bottlenecks.
• Use the GPU for parallelizable computations to improve
performance.
• Leverage Python packages like Theano, NumbaPro, or PyCUDA for
GPU acceleration.
• Apply multi-threading in Python to parallelize CPU tasks.
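A minimal parallelization sketch. Note that for CPU-bound work in CPython, separate processes (multiprocessing) sidestep the global interpreter lock, while threads mainly help with I/O-bound tasks; the heavy() function is a stand-in for any expensive computation.
```python
import math
from multiprocessing import Pool

def heavy(n):
    """Stand-in for a CPU-bound task."""
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    inputs = [2_000_000] * 8
    with Pool() as pool:                    # one worker process per core by default
        results = pool.map(heavy, inputs)   # chunks of work spread over all cores
    print(len(results))
```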
4.3.3 Reduce your computing needs
• Reduce computing needs by working smart and minimizing unnecessary
processing.
• Profile code to detect and optimize slow sections.
• Use compiled and optimized library functions for numerical tasks instead of
custom loops.
• Apply JIT compilation or rewrite critical code in lower-level languages like C
or Fortran.
• Leverage optimized libraries such as LAPACK, BLAS, Intel MKL, and ATLAS.
• Avoid loading all data into memory; process in chunks or streams.
• Use generators to prevent storing intermediate data.
• Train on a sample of the data when full-scale algorithms aren’t available.
• Simplify calculations mathematically to reduce computational effort.
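A small sketch of chunked, generator-based processing: a mean over a large CSV computed one line at a time, so no intermediate copy of the data is ever stored (file name and column layout are hypothetical).
```python
def running_mean(path, column=0):
    """Average one column of a huge CSV without loading it into memory."""
    def values():                       # generator: yields one value at a time
        with open(path) as f:
            next(f)                     # skip the header line
            for line in f:
                yield float(line.split(",")[column])

    total, count = 0.0, 0
    for v in values():
        total += v
        count += 1
    return total / count if count else float("nan")

# print(running_mean("big_file.csv"))   # hypothetical file
```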
Case Studies.
• 4.4 Case study 1: Predicting malicious URLs
• 4.5 Case study 2: Building a recommender system inside a database
Chapter 5
First steps in big data: Distributing data
storage and processing with frameworks -
Hadoop, Spark.
5.1 Distributing data storage and processing
with frameworks
• New big data tools like Hadoop and Spark make managing computer
clusters easier.
• Hadoop can scale to thousands of machines with petabytes of storage,
helping businesses use the huge amounts of data available.
• 5.1.1 Hadoop: a framework for storing and processing large data sets
• 5.1.2 Spark: replacing MapReduce for better performance
5.1.1 Distributing data storage and processing with frameworks
Hadoop: a framework for storing and processing large data sets
• Apache Hadoop is a framework that makes it easier to manage and use clusters
of computers, offering multiple capabilities and features.
• Reliable—By automatically creating multiple copies of the data and redeploying processing
logic in case of failure.
• Fault tolerant —It detects faults and applies automatic recovery.
• Scalable—Data and its processing are distributed over clusters of computers (horizontal
scaling).
• Portable—Installable on all kinds of hardware and operating systems.
• Hadoop’s core includes a distributed file system, resource manager, and tools to
run programs, enabling easy access to data spread across many servers.
The Different Components Of Hadoop
• At the heart of Hadoop, we find
• A distributed file system (HDFS).
• A method to execute programs on a massive scale (MapReduce).
• A system to manage the cluster resources (YARN)
Hadoop Ecosystem Framework
• Core components:
• HDFS (Hadoop Distributed File System)
• Stores data across multiple nodes in a cluster.
• Provides fault tolerance and scalability.
• YARN (Yet Another Resource Negotiator)
• Manages and allocates cluster resources.
• Handles job scheduling and execution.
• MapReduce
• Distributed processing framework for analyzing large data sets in parallel.
• Supporting components:
• Ambari: For provisioning, managing, and monitoring Hadoop clusters.
• Ranger: Provides security, authentication, and authorization.
Hadoop Ecosystem Framework (continued)
• Supporting components (continued):
• Sqoop: Transfers data between Hadoop and relational databases.
• Flume: Collects and ingests log data into HDFS.
• Zookeeper: Coordinates distributed systems.
• Oozie: Workflow scheduler for Hadoop jobs.
• Pig: High-level scripting language for data analysis.
• Hive: SQL-like engine for querying big data.
• HCatalog: Metadata service to manage schema and table definitions.
• Mahout: Machine learning library for Hadoop.
• HBase: Column-oriented NoSQL database for real-time read/write.
MapReduce: How Hadoop Achieves Parallelism
• Hadoop uses MapReduce to process data in parallel.
• It splits the data, processes chunks simultaneously, then combines
and aggregates the results.
• However, it’s not ideal for interactive or iterative tasks because it
writes data to disk after each step, which is slow for large datasets.
MapReduce: How Hadoop Achieves Parallelism
As the name suggests, the process roughly boils down to two big
phases:
• Mapping phase—The documents are split up into key-value pairs.
Until we reduce, we can have many duplicates.
• Reduce phase—It’s not unlike a SQL “group by.” The different unique
occurrences are grouped together, and depending on the reducing
function, a different result can be created.
Figure 5.4 An example of a MapReduce flow for
counting the colors in input texts
• Explanation
1. Reading the input files.
2. Passing each line to a mapper job.
3. The mapper reads the file, extracts
colors as keys, and outputs each
color with its occurrence count as
the value. In short, it maps each
color (key) to how many times it
appears (value).
4. The keys are shuffled and sorted so
that all identical keys are grouped
together, making it easier to
aggregate their values.
5. The reduce phase combines values
for each key and outputs the total
occurrences per color.
6. The keys are collected in an output
file.
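A plain-Python walk-through of the same flow (map, shuffle/sort, reduce, collect) for the color-counting example; the input lines are invented.
```python
from itertools import groupby
from operator import itemgetter

lines = ["green red blue", "blue blue red", "green red red"]

# Steps 2-3) Map: every line independently emits (color, 1) pairs.
mapped = [(color, 1) for line in lines for color in line.split()]

# Step 4) Shuffle & sort: identical keys end up next to each other.
mapped.sort(key=itemgetter(0))

# Step 5) Reduce: sum the values per key.
reduced = {color: sum(v for _, v in group)
           for color, group in groupby(mapped, key=itemgetter(0))}

# Step 6) Collect the output.
print(reduced)   # {'blue': 3, 'green': 2, 'red': 4}
```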
NOTE:
• Hadoop simplifies big data processing, but setting up and maintaining
a cluster is complex.
• Tools like Apache Mesos help, yet many mid-sized companies struggle
with Hadoop management.
• To simplify learning, the Hortonworks Sandbox—a pre-configured
Hadoop environment—is used.
5.1.2 Spark: replacing MapReduce for better
performance
• Spark replaces MapReduce to provide faster performance, especially
for interactive and iterative data analysis, where MapReduce is less
efficient. It can improve such tasks by an order of magnitude.
• Spark:
• Spark is a cluster computing framework like MapReduce, but it relies on
systems such as HDFS, YARN, or Mesos for storage and resource management,
making it complementary to Hadoop.
• It can also run locally for testing and development.
How Does Spark Solve The Problems Of
MapReduce?
• Spark provides a kind of shared memory (RAM) across the cluster.
• Workers can share variables and state without writing intermediate
results to disk.
• Uses RDD (Resilient Distributed Datasets):
• A distributed memory abstraction.
• Enables in-memory computations on large clusters.
• Ensures fault tolerance.
• Avoids costly disk I/O operations, making it much faster.
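A minimal PySpark sketch of the same color count using an RDD, assuming pyspark is installed and running in local mode; colors.txt is a hypothetical input file. Intermediate results stay in memory instead of being written to disk between steps.
```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "ColorCount")        # local mode, all cores

rdd = sc.textFile("colors.txt")                    # hypothetical input file
counts = (rdd.flatMap(lambda line: line.split())   # map side: one color per record
             .map(lambda color: (color, 1))
             .reduceByKey(lambda a, b: a + b))     # reduce side, kept in memory

print(counts.collect())
sc.stop()
```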
The Different Components Of The Spark Ecosystem
• Spark Core:
• Provides a NoSQL environment for interactive, exploratory analysis.
• Can run in both batch mode and interactive mode.
• Supports Python (along with other languages).
The Different Components Of The Spark Ecosystem
• Spark has four other large components, as listed below:
1. Spark Streaming is a tool for real-time analysis.
2. Spark SQL provides a SQL interface to work with Spark.
3. MLlib is the machine-learning library inside the Spark framework.
4. GraphX is Spark’s graph-processing library (for graph-parallel computation).