Data Science
with
Visualization
By
Prof. Madhusmita Behera
Department of
Computer Science & Engineering
Module-2: Handling large data
Handling large data on a single computer: The
problems faced, General techniques for handling large
volumes of data, General programming tips for
dealing with large datasets.
First steps in big data: Distributing data storage and
processing with frameworks - Hadoop, Spark
4.1 The problems you face when handling
large data
• Memory Overload
• Algorithm Runtime Issues
• CPU Starvation (I/O Bottlenecks)
4.1 The problems you face when handling large
data
• Memory Issues:
• Computers have limited RAM.
• If data doesn’t fit, the OS swaps to disk (slow).
• Many algorithms load all data into memory → causes out-of-memory errors.
• Some algorithms keep multiple copies or intermediate results → worsens memory
use.
• Time Issues:
• Some algorithms may run forever or take too long.
• Even small datasets can take unreasonable time if the algorithm is inefficient.
• CPU and I/O Bottlenecks:
• Processors may sit idle while waiting for data from storage.
• Hard drives (HDDs) are slow; SSDs are faster but more costly.
• Poor data feeding to CPU leads to CPU starvation.
4.2 General techniques for handling large volumes of
data
• No one-to-one mapping between problems and solutions – a single technique (such as compression) can address both memory and computation issues.
• Data compression:
• Makes datasets smaller → solves
memory problems.
• Shifts workload from slow hard disk to
fast CPU → improves speed.
• Without compression → many I/O
operations, CPU stays idle.
• With compression → CPU gets more
work, system is balanced.
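To make the compression idea concrete, here is a minimal sketch of reading a gzip-compressed CSV in chunks with pandas; the file name events.csv.gz and the amount column are hypothetical.
```python
import pandas as pd

# Reading the compressed file directly: far fewer bytes cross the slow
# disk-to-memory path, and the CPU does the decompression on the fly.
total = 0
for chunk in pd.read_csv("events.csv.gz", compression="gzip",
                         chunksize=100_000):   # hypothetical file and column
    total += chunk["amount"].sum()
print(total)
```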
4.2.1 Choosing the right algorithm
• Choosing the right algorithm is key to
handling big data efficiently.
• Algorithms matter more than
hardware: A good algorithm can solve
problems better than just adding faster
machines.
• Efficient algorithms for large data:
They avoid loading the entire dataset
into memory.
• Parallelization: The best algorithms
support running tasks in parallel
(multiple calculations at the same
time).
ONLINE LEARNING ALGORITHMS
• Online learning is memory-efficient and
suitable for large, continuous data streams.
• Online learning algorithms can be trained
one data point at a time instead of loading
all data at once.
• When a new data point arrives, the model
updates itself and then forgets the raw data
(only keeps updated parameters).
• Example: A weather prediction model
updates its parameters (like pressure,
temperature) for each region, then moves
on.
• This “use and forget” method prevents
memory overload because individual
observations are small.
• Refer to Listing 4.1 for an example
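This is not the book’s Listing 4.1, but a minimal pure-NumPy sketch of the “use and forget” idea: an online linear-regression model updated one observation at a time with stochastic gradient descent. The simulated data stream is purely illustrative.
```python
import numpy as np

rng = np.random.default_rng(0)
w, b, lr = np.zeros(3), 0.0, 0.01        # only the parameters stay in memory

def observations(n=10_000):
    """Simulate data points arriving one at a time."""
    for _ in range(n):
        x = rng.normal(size=3)
        y = 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[2] + rng.normal(scale=0.1)
        yield x, y

for x, y in observations():
    err = (w @ x + b) - y                # prediction error for this point
    w -= lr * err * x                    # stochastic gradient step
    b -= lr * err                        # the raw observation is then discarded

print(w, b)                              # approaches [2.0, -1.0, 0.5] and ~0
```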
ONLINE LEARNING ALGORITHMS
• Ways of Feeding Data to Algorithms
• Full batch learning – All data is fed to the algorithm at once (needs lots of
memory).
• Mini-batch learning – Data is fed in small batches (10–1,000 samples at a time).
• Online learning – Data is fed one observation at a time.
• Mini-batches & Sliding Window
• Online algorithms can also handle mini-batches.
• A sliding window can move over the data to train gradually.
• Online Learning vs Streaming Algorithms
• Both: Process data one point at a time.
• Difference:
• Streaming algorithms → See each observation once only (e.g., Twitter stream).
• Online learning algorithms → Can reuse the same data multiple times (works for both static
datasets and streaming data).
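A hedged sketch of mini-batch feeding, assuming scikit-learn is available: a generator yields small chunks that are passed to SGDClassifier.partial_fit, so the full data set never has to sit in RAM. The synthetic batches are illustrative.
```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def mini_batches(n_batches=100, batch_size=256, n_features=10):
    """Yield small (X, y) chunks instead of one huge array."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy target
        yield X, y

clf = SGDClassifier()
for i, (X, y) in enumerate(mini_batches()):
    if i == 0:
        clf.partial_fit(X, y, classes=np.array([0, 1]))  # classes needed on the first call
    else:
        clf.partial_fit(X, y)                            # incremental updates afterwards

X_test, y_test = next(mini_batches(n_batches=1))
print(clf.score(X_test, y_test))
```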
DIVIDING A LARGE MATRIX INTO MANY SMALL ONES
• Large data tables (matrices) may not fit in memory.
• Splitting them into smaller matrices allows algorithms (like linear regression) to
still work.
• Linear regression weights can be calculated using matrix calculus.
• Python tools for handling large data
• bcolz
• Stores arrays compactly.
• Uses hard drive storage when arrays don’t fit into main memory (RAM).
• Dask
• Optimizes and parallelizes calculations.
• Handles computations on large datasets.
• Needs to be installed separately (conda install dask).
• May cause issues with 64-bit Python, but dependencies (e.g., toolz) are usually installed
automatically.
• Refer to Listing 4.3 for an example
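This is not the book’s Listing 4.3; it is a minimal Dask sketch (assuming dask is installed) showing how a large array is split into chunks and reduced without materializing it all in memory.
```python
import dask.array as da

# Describe a large array as many NumPy-sized chunks.
x = da.random.random((1_000_000, 100), chunks=(100_000, 100))

# Nothing runs yet: Dask only builds a task graph of per-chunk operations.
col_means = x.mean(axis=0)

# compute() executes the chunked tasks (in parallel where possible)
# and returns an ordinary NumPy array.
print(col_means.compute().shape)   # (100,)
```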
MapReduce
• MapReduce = split a task into smaller chunks (map) → process the chunks in parallel → combine the results (reduce). It’s powerful for big data processing and forms the backbone of systems like Hadoop.
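A minimal single-machine sketch of the pattern, using the vote-counting analogy from the next slide; the ballot data is made up.
```python
from collections import Counter
from functools import reduce

# Hypothetical ballots, grouped per polling station.
stations = [
    ["party_a", "party_b", "party_a"],
    ["party_b", "party_b", "party_c"],
    ["party_a", "party_c", "party_c"],
]

# Map: every station tallies its own ballots independently (parallelizable).
partial_counts = map(Counter, stations)

# Reduce: merge the per-station tallies into a national total.
total = reduce(lambda a, b: a + b, partial_counts, Counter())
print(total)   # Counter({'party_a': 3, 'party_b': 3, 'party_c': 3})
```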
MapReduce
• Election-counting analogy:
• Centralized way: collect all ballots nationwide and count them in one place → slow and inefficient.
• Distributed way: each local office counts the votes for the 25 parties, then sends only its totals → the central office simply aggregates them.
• Advantages
• Easy to parallelize (run on many machines at once).
• Fits well in distributed environments like Hadoop.
• Can also run on a single computer.
• Python Libraries for MapReduce
• Hadoopy
• Octopy
• Disco
• Dumbo
4.2.2 Choosing the right data structure
• The choice of algorithm can
significantly affect your program's
efficiency.
• But how you store data is as crucial as
the algorithms you use.
• Various data structures have different
memory and storage needs.
• Data structures affect the
performance of CRUD operations
(Create, Read, Update, Delete) and
other data operations.
Choosing the right data structure : Sparse Data
• A sparse data set has many entries but relatively little actual information; most
values are zero.
• When converting textual data (e.g., tweets) to binary features, most entries are
zero since each text uses only a small fraction of possible words.
• The resulting large matrix consumes memory despite containing minimal
meaningful data.
• Sparse matrices can be stored efficiently by recording only the non-zero values together with their positions, e.g., (row 2, column 9, value 1) – see the sketch after this list.
• Many Python libraries and algorithms now support working with sparse
matrices
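A small sketch with SciPy (assuming scipy is installed) showing the position-plus-value storage described above; the tiny tweet-by-word matrix is made up.
```python
import numpy as np
from scipy.sparse import coo_matrix

# 4 tweets x 10 vocabulary words, but only 3 non-zero entries are stored.
rows = np.array([2, 0, 3])     # document index
cols = np.array([9, 1, 4])     # word index
vals = np.array([1, 1, 2])     # counts; (2, 9, 1) matches the example above
m = coo_matrix((vals, (rows, cols)), shape=(4, 10))

print(m.nnz)              # 3 stored values instead of 40 cells
print(m.toarray())        # dense view, only sensible for tiny examples
m_csr = m.tocsr()         # CSR format: efficient row slicing and arithmetic
```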
Choosing the right data structure : Tree Structures
• Trees organize data hierarchically, starting from a root and branching
into subtrees of children.
• Trees allow faster information retrieval compared to scanning entire
tables.
• In the example, the tree uses age-based decision rules to narrow
down the search for a person quickly.
• Start at the root → choose the most discriminating factor (e.g., age)
→ move down through branches until reaching the correct leaf (the
actual record).
• The leaf nodes contain the actual data records, such as names, ages, and IDs (e.g., Danil, 22, 6003).
• Similar to family trees or biological trees, where branches divide into
smaller subsets.
• Trees (and hash tables) are commonly used in indices to speed up
data lookup, avoiding full-table scans.
• Index-based searching drastically improves efficiency for large
datasets.
• Example: the Akinator (http://en.akinator.com/) is a djinn in a magical lamp that tries to guess the person you are thinking of by asking only a few questions – essentially walking down a decision tree.
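A hedged sketch of the index idea: records kept sorted by age and queried with binary search via Python’s bisect module, which is the spirit of tree-based database indices (the records besides Danil are invented).
```python
import bisect

# Records: (age, name, id), kept sorted by the search key.
records = sorted([
    (19, "Aisha", 6001),
    (22, "Danil", 6003),
    (27, "Meera", 6010),
    (34, "Ravi",  6042),
])
ages = [r[0] for r in records]

def find_by_age(age):
    """Binary search: O(log n) comparisons instead of scanning every row."""
    i = bisect.bisect_left(ages, age)
    if i < len(ages) and ages[i] == age:
        return records[i]
    return None

print(find_by_age(22))   # (22, 'Danil', 6003)
```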
Choosing the right data structure : Hash Tables
• Hash tables are data structures that map each value to a computed
key and store keys in buckets.
• Enable quick data retrieval by looking up the correct bucket using
the key.
• Python dictionaries (dict) are a built-in implementation of hash
tables. Hash tables are similar to key-value stores.
• Commonly used as indices to accelerate information retrieval.
• Often used in systems like recommender engines for fast lookups.
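A minimal illustration using Python’s built-in dict, which is itself a hash table; the user records are invented.
```python
# The key is hashed to locate its bucket, so lookups stay O(1) on average
# no matter how large the table grows.
users = {
    6003: {"name": "Danil", "age": 22},
    6010: {"name": "Meera", "age": 27},
}

print(users[6003]["name"])            # direct bucket lookup, no scanning
print(users.get(9999, "not found"))   # safe lookup for a missing key
```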
4.2.3 Selecting the right tools
• After selecting algorithms and data structures, choose the right tool.
Tools can be Python libraries or external tools controllable via Python.
Many tools exist, but only a few are listed.
• Python libraries for large data – Offer optimized data structures, code
optimization, and just-in-time (JIT) compilation.
1. Cython – Superset of Python; enforces data type declarations for faster
execution. http://cython.org/
2. Numexpr – Faster numerical expression evaluator than NumPy; uses
internal JIT compiler. https://github.com/pydata/numexpr
3. Numba – JIT compiler for Python; speeds up execution to near C-level
performance. http://numba.pydata.org/
4.2.3 Selecting the right tools
4. Bcolz – Stores arrays in compressed form; reduces memory usage;
integrates with Numexpr. http://bcolz.blosc.org/
5. Blaze – Translates Python code into queries for multiple data sources
(SQL, CSV, Spark). http://blaze.readthedocs.org/en/latest/index.html
6. Theano – Uses GPU, symbolic math, and tensors; includes JIT compiler
for speed. http://deeplearning.net/software/Theano/
7. Dask – Optimizes and distributes computation workflows for parallel
processing.
Purpose – Mainly for Python-based data processing; Blaze also integrates
with databases.
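As one concrete example of the JIT idea, a minimal Numba sketch (assuming numba is installed); the same speed-up strategy applies, with different syntax, to Cython or Numexpr.
```python
import numpy as np
from numba import njit

@njit                       # compiled to machine code on the first call
def row_sums(a):
    out = np.zeros(a.shape[0])
    for i in range(a.shape[0]):        # explicit loops are cheap once compiled
        for j in range(a.shape[1]):
            out[i] += a[i, j]
    return out

a = np.random.rand(10_000, 100)
print(row_sums(a)[:3])      # later calls run at near-C speed
```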
USE PYTHON AS A MASTER TO CONTROL OTHER
TOOLS
• Python as a controller – Python can control and integrate specialized
software via its interfaces. Most tool and software vendors provide
Python APIs.
• This flexibility makes Python stand out compared to R and SAS.
• Leverage Python to exploit the power of specialized tools fully.
• Connecting Python with NoSQL databases and graph data systems.
4.3 General programming tips for dealing with large data
sets
• The general tricks split into three
parts:
• Don’t reinvent the wheel. Use tools
and libraries developed by others.
• Get the most out of your hardware. Your machine is never used to its full potential; with simple adaptations you can make it work harder.
• Reduce the computing need. Slim
down your memory and processing
needs as much as possible.
4.3.1 Don’t reinvent the wheel
• Stop DRA – “Don’t Repeat Anyone”: avoid solving problems that have already been solved; focus on adding value instead.
• Use databases effectively – Prepare analytical tables in databases; for
complex tasks, leverage user-defined functions and procedures.
• Use optimized libraries – Rely on proven libraries like Mahout or
Weka instead of building from scratch, unless for learning purposes.
4.3.2 Get the most out of your hardware
• Avoid idle resources and shift workload from overused to underused
components.
• Feed the CPU compressed data to reduce disk I/O bottlenecks.
• Use the GPU for parallelizable computations to improve
performance.
• Leverage Python packages like Theano, NumbaPro, or PyCUDA for
GPU acceleration.
• Apply multi-threading in Python to parallelize CPU tasks.
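A minimal parallelization sketch. Note that for CPU-bound work in CPython, separate processes (multiprocessing) sidestep the global interpreter lock, while threads mainly help with I/O-bound tasks; the heavy() function is a stand-in for any expensive computation.
```python
import math
from multiprocessing import Pool

def heavy(n):
    """Stand-in for a CPU-bound task."""
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    inputs = [2_000_000] * 8
    with Pool() as pool:                    # one worker process per core by default
        results = pool.map(heavy, inputs)   # chunks of work spread over all cores
    print(len(results))
```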
4.3.3 Reduce your computing needs
• Reduce computing needs by working smart and minimizing unnecessary
processing.
• Profile code to detect and optimize slow sections.
• Use compiled and optimized library functions for numerical tasks instead of
custom loops.
• Apply JIT compilation or rewrite critical code in lower-level languages like C
or Fortran.
• Leverage optimized libraries such as LAPACK, BLAS, Intel MKL, and ATLAS.
• Avoid loading all data into memory; process in chunks or streams.
• Use generators to prevent storing intermediate data.
• Train on a sample of the data when full-scale algorithms aren’t available.
• Simplify calculations mathematically to reduce computational effort.
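A small sketch of chunked, generator-based processing: a mean over a large CSV computed one line at a time, so no intermediate copy of the data is ever stored (file name and column layout are hypothetical).
```python
def running_mean(path, column=0):
    """Average one column of a huge CSV without loading it into memory."""
    def values():                       # generator: yields one value at a time
        with open(path) as f:
            next(f)                     # skip the header line
            for line in f:
                yield float(line.split(",")[column])

    total, count = 0.0, 0
    for v in values():
        total += v
        count += 1
    return total / count if count else float("nan")

# print(running_mean("big_file.csv"))   # hypothetical file
```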
Case Studies.
• 4.4 Case study 1: Predicting malicious URLs
• 4.5 Case study 2: Building a recommender system inside a database
Chapter 5
First steps in big data: Distributing data
storage and processing with frameworks -
Hadoop, Spark.
5.1 Distributing data storage and processing
with frameworks
• New big data tools like Hadoop and Spark make managing computer
clusters easier.
• Hadoop can scale to thousands of machines with petabytes of storage,
helping businesses use the huge amounts of data available.
• 5.1.1 Hadoop: a framework for storing and processing large data sets
• 5.1.2 Spark: replacing MapReduce for better performance
5.1.1 Distributing data storage and processing with frameworks
Hadoop: a framework for storing and processing large data sets
• Apache Hadoop is a framework that makes it easier to manage and use clusters
of computers, offering multiple capabilities and features.
• Reliable—By automatically creating multiple copies of the data and redeploying processing
logic in case of failure.
• Fault tolerant —It detects faults and applies automatic recovery.
• Scalable—Data and its processing are distributed over clusters of computers (horizontal
scaling).
• Portable—Installable on all kinds of hardware and operating systems.
• Hadoop’s core includes a distributed file system, resource manager, and tools to
run programs, enabling easy access to data spread across many servers.
The Different Components Of Hadoop
• At the heart of Hadoop, we find
• A distributed file system (HDFS).
• A method to execute programs on a massive scale (MapReduce).
• A system to manage the cluster resources (YARN)
Hadoop Ecosystem Framework
• Core components:
• HDFS (Hadoop Distributed File System)
• Stores data across multiple nodes in a cluster.
• Provides fault tolerance and scalability.
• YARN (Yet Another Resource Negotiator)
• Manages and allocates cluster resources.
• Handles job scheduling and execution.
• MapReduce
• Distributed processing framework for analyzing large data sets in parallel.
• Supporting components:
• Ambari: For provisioning, managing, and monitoring Hadoop clusters.
• Ranger: Provides security, authentication, and authorization.
Hadoop Ecosystem Framework (continued)
• Supporting components (continued):
• Sqoop: Transfers data between Hadoop and relational databases.
• Flume: Collects and ingests log data into HDFS.
• Zookeeper: Coordinates distributed systems.
• Oozie: Workflow scheduler for Hadoop jobs.
• Pig: High-level scripting language for data analysis.
• Hive: SQL-like engine for querying big data.
• HCatalog: Metadata service to manage schema and table definitions.
• Mahout: Machine learning library for Hadoop.
• HBase: Column-oriented NoSQL database for real-time read/write.
MapReduce: How Hadoop Achieves Parallelism
• Hadoop uses MapReduce to process data in parallel.
• It splits the data, processes chunks simultaneously, then combines
and aggregates the results.
• However, it’s not ideal for interactive or iterative tasks because it
writes data to disk after each step, which is slow for large datasets.
MapReduce: How Hadoop Achieves Parallelism
As the name suggests, the process roughly boils down to two big
phases:
• Mapping phase—The documents are split up into key-value pairs.
Until we reduce, we can have many duplicates.
• Reduce phase—It’s not unlike a SQL “group by.” The different unique
occurrences are grouped together, and depending on the reducing
function, a different result can be created.
Figure 5.4 An example of a MapReduce flow for
counting the colors in input texts
• Explanation
1. Reading the input files.
2. Passing each line to a mapper job.
3. The mapper reads the file, extracts
colors as keys, and outputs each
color with its occurrence count as
the value. In short, it maps each
color (key) to how many times it
appears (value).
4. The keys are shuffled and sorted so
that all identical keys are grouped
together, making it easier to
aggregate their values.
5. The reduce phase combines values
for each key and outputs the total
occurrences per color.
6. The keys are collected in an output
file.
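A plain-Python walk-through of the same flow (map, shuffle/sort, reduce, collect) for the color-counting example; the input lines are invented.
```python
from itertools import groupby
from operator import itemgetter

lines = ["green red blue", "blue blue red", "green red red"]

# Steps 2-3) Map: every line independently emits (color, 1) pairs.
mapped = [(color, 1) for line in lines for color in line.split()]

# Step 4) Shuffle & sort: identical keys end up next to each other.
mapped.sort(key=itemgetter(0))

# Step 5) Reduce: sum the values per key.
reduced = {color: sum(v for _, v in group)
           for color, group in groupby(mapped, key=itemgetter(0))}

# Step 6) Collect the output.
print(reduced)   # {'blue': 3, 'green': 2, 'red': 4}
```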
NOTE:
• Hadoop simplifies big data processing, but setting up and maintaining
a cluster is complex.
• Tools like Apache Mesos help, yet many mid-sized companies struggle
with Hadoop management.
• To simplify learning, the Hortonworks Sandbox—a pre-configured
Hadoop environment—is used.
5.1.2 Spark: replacing MapReduce for better
performance
• Spark replaces MapReduce to provide faster performance, especially
for interactive and iterative data analysis, where MapReduce is less
efficient. It can improve such tasks by an order of magnitude.
• Spark:
• Spark is a cluster computing framework like MapReduce, but it relies on
systems such as HDFS, YARN, or Mesos for storage and resource management,
making it complementary to Hadoop.
• It can also run locally for testing and development.
How Does Spark Solve The Problems Of
MapReduce?
• Spark provides a kind of shared memory (RAM) across the cluster.
• Workers can share variables and state without writing intermediate
results to disk.
• Uses RDD (Resilient Distributed Datasets):
• A distributed memory abstraction.
• Enables in-memory computations on large clusters.
• Ensures fault tolerance.
• Avoids costly disk I/O operations, making it much faster.
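A minimal PySpark sketch of the same color count using an RDD, assuming pyspark is installed and running in local mode; colors.txt is a hypothetical input file. Intermediate results stay in memory instead of being written to disk between steps.
```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "ColorCount")        # local mode, all cores

rdd = sc.textFile("colors.txt")                    # hypothetical input file
counts = (rdd.flatMap(lambda line: line.split())   # map side: one color per record
             .map(lambda color: (color, 1))
             .reduceByKey(lambda a, b: a + b))     # reduce side, kept in memory

print(counts.collect())
sc.stop()
```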
The Different Components Of The Spark Ecosystem
• Spark Core:
• Provides a NoSQL environment for interactive, exploratory analysis.
• Can run in both batch mode and interactive mode.
• Supports Python (along with other languages).
The Different Components Of The Spark Ecosystem
• Spark has four other large components, as listed below:
1. Spark Streaming is a tool for real-time analysis.
2. Spark SQL provides a SQL interface to work with Spark.
3. MLlib is the machine-learning library inside the Spark framework.
4. GraphX is Spark’s graph-processing library (for graph-parallel computation).