DATA228 Lecture Notes Week 1

The document outlines a course on Big Data Technologies and Applications for Fall 2024, taught by Sangjin Lee, who has expertise in various related fields. It covers topics such as Hadoop, Spark, and the future of Big Data, emphasizing the exponential growth of data and the need for efficient data handling and analysis. Class rules, instructor information, and the significance of Big Data in modern contexts are also discussed.
DATA 228

Big Data Technologies and Applications (Fall 2024)

Sangjin Lee
Overview

• Introduction to Big Data

• Hadoop

• HDFS

• YARN

• MapReduce

• Spark

• High-level APIs

• SparkSQL

• Future of Big Data

Instructor

• Sangjin Lee (sangjin.lee@sjsu.edu, https://linkedin.com/in/sjlee)

• Hadoop project PMC (project management committee) member

• Contributor to several open-source projects beyond Hadoop

• Expertise/experience: Big Data, container platforms, traffic management, developer experience, distributed systems, Java, GoLang, and more

Class rules

• Please refer to the syllabus

• Class will start on time

• Homework and quizzes/tests will use Canvas

• No late assignments are accepted

• Quizzes are in-class

• Exams are comprehensive

• TA: Sowmya Kuruba (sowmya.kuruba@sjsu.edu)

Why are you taking this class?

• Poll
Rise of Big Data

• Exponential growth of data everywhere

• Transactions

• Internet crawling

• Impressions and clicks

• Social media feeds

• IoT (cars, cameras, devices, etc.)

• Expect ~ 164 zettabytes of total data by 2025

Rise of Big Data

• Perfect “data” storm

Source: “Data Lakehouse in Action”, Menon

Rise of Big Data

• Not just size or volume

• Data are diverse

• Data are often produced by different teams, organizations, and companies

• Data are often unstructured

Rise of Big Data
Relational DBs

• “Good ol’ days” when data was neatly in relational DBs (Oracle, MySQL, …)

• But…

• Companies already had to find ways to scale beyond gigantic single-machine DBs

• How to locate your data when the data is “distributed”

• They were already trying to get to Big Data without the name

Rise of Big Data
Data warehouse

• How about data warehouse (DW)?

• Tableau, Teradata, etc.

• Highly curated walled gardens

• They don’t play well when they start mixing different sources of data

• It’s SUPER expensive ($$$$)

• Side note: OLTP vs. OLAP

What is Big Data?

• Store much larger volumes of data

• Compute/analyze much larger volumes of data

• Handle diverse and mostly unstructured data

• … And do it cheaply

What is Big Data?
Store much larger volumes of data

• Let’s talk about a typical storage server these days (~ 2024)

• ~ 10 TBs of HD or SSD storage

• Even small companies easily have tens of PBs of data

• How many storage servers would you need?

• Not just many machines…

• How do you find your data?

• How do you keep your data fault-tolerant and resilient?

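A rough back-of-envelope answer to "how many storage servers would you need?" can be sketched as below. The 20 PB figure is an assumed example of "tens of PBs", the helper name `servers_needed` is hypothetical, and the replication factor of 3 is used only as an illustration of what keeping data fault-tolerant might cost (it matches the HDFS default, but any value could be plugged in).

```python
TB = 10**12  # decimal terabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes

def servers_needed(data_bytes, per_server_bytes, replication=1):
    """Servers required to hold `replication` full copies of the data."""
    total = data_bytes * replication
    # Integer ceiling division: a partially filled server still counts.
    return -(-total // per_server_bytes)

data = 20 * PB        # assumption: one example of "tens of PBs"
per_server = 10 * TB  # from the slide: ~10 TB per storage server

single_copy = servers_needed(data, per_server)
three_copies = servers_needed(data, per_server, replication=3)
print(single_copy, three_copies)
```

Under these assumptions the count is already in the thousands of machines before any spare capacity is added, which is why locating data and surviving machine failures become first-class problems.
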
What is Big Data?
Compute/analyze much larger volumes of data

• Let’s talk about a typical processing server these days

• 32-core CPU (3.x GHz)

• 256 GB RAM

• 100 Gig ethernet

What is Big Data?
Compute/analyze much larger volumes of data

• You want to ingest and analyze 10 TB of data with a single server

• How long would it take to complete this task?

• How long does it take for a single server to stream 10 TB?

• How much data can a single server hold?

• Do multiple cores help?

• What is the most important bottleneck?

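One way to approach these questions is a quick estimate using the server specs from the previous slide (100 Gig ethernet, 256 GB RAM). The 200 MB/s sequential disk read rate below is an assumption for a single hard disk, not a figure from the lecture, and the helper name `transfer_seconds` is hypothetical.

```python
def transfer_seconds(data_bytes, bytes_per_second):
    """Time to move the data at a sustained rate, ignoring all overhead."""
    return data_bytes / bytes_per_second

TEN_TB = 10 * 10**12           # 10 TB of input, in bytes

network = 100 * 10**9 / 8      # 100 Gb/s ethernet is 12.5 GB/s at best
disk = 200 * 10**6             # assumption: ~200 MB/s from one hard disk
ram = 256 * 10**9              # 256 GB of RAM

net_s = transfer_seconds(TEN_TB, network)
disk_s = transfer_seconds(TEN_TB, disk)
ram_ratio = TEN_TB / ram       # how many times larger the data is than memory

print(f"network: {net_s:.0f} s, disk: {disk_s / 3600:.1f} h, data/RAM: {ram_ratio:.1f}x")
```

With these numbers, the data is far larger than memory, and merely reading it from one disk takes many hours. Adding cores does not change that, which suggests I/O rather than CPU is the bottleneck to look at first.
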
What is Big Data?
Compute/analyze much larger volumes of data

• It should become obvious massive parallelism is required

• Distribute compute to many machines and run them in parallel

• Sort and collect the results

• This points to some sort of data compute framework that can model this behavior and lets you program this in an easy and efficient way

• → MapReduce, Spark, etc.

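The distribute, sort, and collect pattern above can be illustrated with a toy word count. This is a minimal single-process sketch, not how Hadoop or Spark are implemented: worker threads stand in for machines, and the function names (`map_chunk`, `shuffle`, `reduce_group`, `word_count`) are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_chunk(lines):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in lines for word in line.split()]

def shuffle(mapped_chunks):
    """Shuffle step: group emitted values by key and sort by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())

def reduce_group(item):
    """Reduce step: combine all values collected for one key."""
    key, values = item
    return key, sum(values)

def word_count(lines, workers=2):
    # Split the input so that each worker gets its own chunk.
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
        return dict(pool.map(reduce_group, shuffle(mapped)))

lines = ["big data is big", "data is everywhere"]
counts = word_count(lines)
print(counts)
```

The user only writes the map and reduce functions. Splitting the input, scheduling the work, and grouping the intermediate results are the parts a framework would take over.
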
What is Big Data?
Handle diverse and mostly unstructured data

• Need ability to “discover” schema from data

• Need ability to handle many different data schemas and encodings

• Need good data governance that scales

• Need strong data security

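As a small illustration of what "discovering" a schema from data might look like, the sketch below scans JSON records and reports which fields appear and which value types were seen for each. The sample records and the helper name `infer_schema` are made up for this example. Real systems use more elaborate inference.

```python
import json

def infer_schema(json_lines):
    """Map each field name to the sorted list of type names seen for it."""
    seen = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            seen.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}

# Hypothetical records from two different producers of click data.
records = [
    '{"user": "a1", "clicks": 3}',
    '{"user": "b2", "clicks": "7", "device": "phone"}',
]
schema = infer_schema(records)
print(schema)
```

Here the second producer sends `clicks` as a string and adds a `device` field, so the discovered schema has to accommodate both shapes rather than reject one of them.
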
What is Big Data?
… And do it cheaply

• Big data is built on cheap “commodity” hardware

• Horizontal scaling

• Fault-tolerant and self-healing architecture

• Big data is now increasingly coupled with Cloud

• E.g. use of S3 or GCS

• Compute on demand

Your “dev” environment

• Poll
