DATA 228
Big Data Technologies and Applications (Fall 2024)
Sangjin Lee
Overview
• Introduction to Big Data
• Hadoop
• HDFS
• YARN
• MapReduce
• Spark
• High-level APIs
• SparkSQL
• Future of Big Data
Instructor
• Sangjin Lee (sangjin.lee@sjsu.edu, https://linkedin.com/in/sjlee)
• Hadoop project PMC (project management committee) member
• Contributor to several open-source projects beyond Hadoop
• Expertise/experience: Big Data, container platforms, traffic management, developer experience, distributed systems, Java, GoLang, and more
Class rules
• Please refer to the syllabus
• Class will start on time
• Homework and quizzes/tests will use Canvas
• No late assignments are accepted
• Quizzes are in-class
• Exams are comprehensive
• TA: Sowmya Kuruba (sowmya.kuruba@sjsu.edu)
Why are you taking this class?
• Poll
Rise of Big Data
• Exponential growth of data everywhere
• Transactions
• Internet crawling
• Impressions and clicks
• Social media feeds
• IoT (cars, cameras, devices, etc.)
• Expect ~164 zettabytes of total data by 2025
Rise of Big Data
• Perfect “data” storm
“Data Lakehouse in Action”, Menon
Rise of Big Data
• Not just size or volume
• Data are diverse
• Data are often produced by different teams, organizations, and companies
• Data are often unstructured
Rise of Big Data
Relational DBs
• “Good ol’ days” when data was neatly in relational DBs (Oracle, MySQL, …)
• But…
• Companies already had to find ways to scale beyond gigantic single-machine DBs
• How to locate your data when the data is “distributed”
• They were already trying to get to Big Data without the name
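One common way to answer the "how to locate your data" question is hash-based sharding: hash the record key and use the result to pick a database. The sketch below is only an illustration under assumptions; the shard names and the `locate` helper are hypothetical and do not come from the slides.

```python
import hashlib

# Hypothetical shard names, used only for illustration.
SHARDS = ["db-0", "db-1", "db-2", "db-3"]


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer, then wrap into the shard range.
    return int.from_bytes(digest[:8], "big") % num_shards


def locate(key: str) -> str:
    """Return the name of the shard that should hold this key."""
    return SHARDS[shard_for(key, len(SHARDS))]


print(locate("user:42"))
```

A known drawback of plain modulo sharding is that changing the number of shards remaps most keys, which is part of why scaling sharded relational DBs was painful.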
Rise of Big Data
Data warehouse
• How about a data warehouse (DW)?
• Tableau, Teradata, etc.
• Highly curated walled gardens
• They don’t play well when they start mixing different sources of data
• It’s SUPER expensive ($$$$)
• Side note: OLTP vs. OLAP
What is Big Data?
• Store much larger volumes of data
• Compute/analyze much larger volumes of data
• Handle diverse and mostly unstructured data
• … And do it cheaply
What is Big Data?
Store much larger volumes of data
• Let’s talk about a typical storage server these days (~2024)
• ~10 TBs of HD or SSD storage
• Even small companies easily have tens of PBs of data
• How many storage servers would you need?
• Not just many machines…
• How do you find your data?
• How do you keep your data fault-tolerant and resilient?
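The "how many storage servers" question can be worked as back-of-the-envelope arithmetic. In the sketch below, 30 PB is an assumed stand-in for "tens of PBs", and the 3x replication factor is an assumption for illustration (it happens to match HDFS's default replication factor).

```python
TB = 10**12
PB = 10**15


def servers_needed(data_bytes: int, per_server_bytes: int, replication: int = 1) -> int:
    """Servers required to hold the data, rounding up (integer ceiling division)."""
    total = data_bytes * replication
    return (total + per_server_bytes - 1) // per_server_bytes


data = 30 * PB          # assumption: "tens of PBs" taken as 30 PB
per_server = 10 * TB    # from the slide: ~10 TB per storage server

print(servers_needed(data, per_server))                  # 3000 servers, no redundancy
print(servers_needed(data, per_server, replication=3))   # 9000 servers with 3 copies
```

With thousands of machines, locating data and surviving routine hardware failures become design requirements rather than afterthoughts.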
What is Big Data?
Compute/analyze much larger volumes of data
• Let’s talk about a typical processing server these days
• 32-core CPU (3.x GHz)
• 256 GB RAM
• 100 Gigabit Ethernet
What is Big Data?
Compute/analyze much larger volumes of data
• You want to ingest and analyze 10 TB of data with a single server
• How long would it take to complete this task?
• How long does it take for a single server to stream 10 TB?
• How much data can a single server hold?
• Do multiple cores help?
• What is the most important bottleneck?
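A rough estimate for these questions, using the server specs from the previous slide (100 Gigabit Ethernet, 256 GB RAM). The 200 MB/s figure for a single hard disk is an assumption added for comparison, not a number from the slides, and the calculation ignores protocol overhead and all processing time.

```python
TB = 10**12
GB = 10**9
MB = 10**6

data_bytes = 10 * TB
nic_bytes_per_sec = 100 * 10**9 / 8   # 100 Gbit/s is 12.5 GB/s at line rate
disk_bytes_per_sec = 200 * MB         # assumption: one HDD reading sequentially
ram_bytes = 256 * GB


def stream_seconds(num_bytes: float, bytes_per_sec: float) -> float:
    """Lower bound on the time to move num_bytes through one channel."""
    return num_bytes / bytes_per_sec


network_s = stream_seconds(data_bytes, nic_bytes_per_sec)
disk_s = stream_seconds(data_bytes, disk_bytes_per_sec)
ram_fraction = ram_bytes / data_bytes

print(f"network: {network_s:.0f} s (~{network_s / 60:.1f} min)")
print(f"one disk: {disk_s:.0f} s (~{disk_s / 3600:.1f} h)")
print(f"share of the data that fits in RAM: {ram_fraction:.2%}")
```

Under these assumptions the network alone needs about 13 minutes, a single disk needs roughly 14 hours, and RAM holds under 3% of the data. That suggests I/O, not core count, is the first bottleneck to look at.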
What is Big Data?
Compute/analyze much larger volumes of data
• It should become obvious massive parallelism is required
• Distribute compute to many machines and run them in parallel
• Sort and collect the results
• This points to some sort of data compute framework that can model this behavior and lets you program this in an easy and efficient way
• → MapReduce, Spark, etc.
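The distribute, sort, and collect pattern above can be sketched in a few lines as a toy word count. This is a single-process illustration in which threads stand in for machines; it is not how Hadoop or Spark is implemented, and every function name here is hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group all emitted values by key across every mapper's output."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(item):
    """Combine all values for one key into a single result."""
    key, values = item
    return key, sum(values)


def word_count(chunks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))        # run mappers in parallel
        grouped = shuffle(mapped)                         # collect values per key
        reduced = list(pool.map(reduce_phase, sorted(grouped.items())))
    return dict(reduced)


chunks = ["big data big compute", "data moves to compute", "compute moves to data"]
counts = word_count(chunks)
print(counts)
```

The user writes only the map and reduce functions. Partitioning, scheduling, and moving data between phases are what a real framework provides.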
What is Big Data?
Handle diverse and mostly unstructured data
• Need ability to “discover” schema from data
• Need ability to handle many different data schemas and encodings
• Need good data governance that scales
• Need strong data security
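As a minimal sketch of what "discovering" a schema from data can mean, the snippet below scans JSON records and collects the value types seen for each field. The sample records and the `infer_schema` helper are made up for illustration; real systems handle nesting, sampling, and type conflicts far more carefully.

```python
import json
from collections import defaultdict


def infer_schema(json_lines):
    """Return {field: sorted list of Python type names observed for that field}."""
    seen = defaultdict(set)
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


# Made-up records: fields come and go, and "clicks" changes type.
records = [
    '{"user": "a1", "clicks": 3}',
    '{"user": "b2", "clicks": "7", "country": "US"}',
    '{"user": "c3", "clicks": null}',
]

schema = infer_schema(records)
print(schema)
```

Even this tiny sample shows the typical problems: an optional field (`country`) and one field (`clicks`) that arrives as an integer, a string, and a null.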
What is Big Data?
… And do it cheaply
• Big data is built on cheap “commodity” hardware
• Horizontal scaling
• Fault-tolerant and self-healing architecture
• Big data is now increasingly coupled with Cloud
• E.g., use of S3 or GCS
• Compute on demand
Your “dev” environment
• Poll