Data platform on Kubernetes
Jung Chang Un
Kubernetes as data platform infrastructure
Data platform in the big data/cloud era
Scalability
● As data and users grow
● Scaling should be fast, reasonably priced, and easy up to the desired performance
● For data storage, redistribution is an issue
Reliability
● HA
● Fault Tolerance
Real time
● Data river
● Monitoring, Fraud/Anomaly Detection
Kubernetes as data platform infrastructure
Building a data platform
● Services are easy to install
○ Helm (https://helm.sh/)
○ YAML (https://kubernetes.io/ko/docs/tutorials/stateful-application/zookeeper/)
● Providing data services
● Pods/services are easy to restart
EKS
● AWS Managed Kubernetes
● Managed masters, multi-AZ, upgrades, low price
Building the Data Platform
Data Processing(Scheduling, Data Mart, ODS, Realtime)
Data Analysis(Presto)
Data Processing(Scheduling - Airflow)
Development/operations concept
● Job Scheduler/Coordinator
● Airflow workers do not process data directly
● The Kubernetes executor is not used, but a Kubernetes operator is
○ Tasks that actually process data run as pods
Related link: https://www.slideshare.net/changunjung/data-platform-data-pipelineairflow-kubernetes
Kubernetes
● Webserver, Scheduler - Fault Tolerance
● Worker - Scalability
● Source - Git sync, EFS
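
The Airflow slide above says that workers only coordinate and that each data-processing task runs as its own pod. The following is a minimal sketch of what such a task could hand to Kubernetes, using only the standard library. The function name `build_task_pod`, the namespace, the image name, and the labels are all hypothetical and are not taken from the author's internally developed operator.

```python
import json


def build_task_pod(task_id, image, command, namespace="data-platform"):
    """Return a minimal Kubernetes Pod manifest for one data-processing task.

    All names here are illustrative assumptions, not the author's operator.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            # Pod names must be lowercase DNS-style names, so replace underscores.
            "name": task_id.replace("_", "-").lower(),
            "namespace": namespace,
            "labels": {"app": "airflow-task", "task-id": task_id},
        },
        "spec": {
            # The scheduler (Airflow) decides about retries, not the kubelet.
            "restartPolicy": "Never",
            "containers": [
                {"name": "task", "image": image, "command": list(command)}
            ],
        },
    }


pod = build_task_pod(
    "load_ods_orders",
    "example.com/elt:latest",  # hypothetical image
    ["python", "extract.py", "--table", "orders"],
)
print(json.dumps(pod, indent=2))
```

Keeping the worker limited to building and submitting a manifest like this is what lets the heavy work scale independently of the Airflow workers.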
Data Processing (ODS)
Legacy
● DMS - we want only the changed rows, but it has to fetch the full data set
● Glue(python) - no libraries can be installed beyond the preinstalled ones, so a variety of databases cannot be used
● Glue(spark)
○ Developed in the Glue ETL GUI
○ Reads through Spark, which puts load on the production DB
○ Source code is hard to manage
○ How it is used in the data platform
■ Jobs built in the Glue ETL GUI are not used as-is; the script is copied and submitted from Airflow, so Airflow environment variables can be used
■ Hangs with unknown cause; a job timeout variable is set so that a hung job is run again
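
The legacy slide above mentions setting a job timeout so that a job that hangs for an unknown reason is run again. A minimal sketch of that pattern follows. The callables `start_job` and `get_status` and the status strings are hypothetical stand-ins for whatever API actually launches and inspects the job, and a fake clock keeps the demo deterministic.

```python
import time


def run_with_timeout(start_job, get_status, timeout_s, max_attempts=3,
                     poll_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Start a job, poll it, and start it again if it fails or exceeds timeout_s.

    Returns the attempt number that succeeded. A real setup would also stop
    the hung run before resubmitting; that step is omitted here.
    """
    for attempt in range(1, max_attempts + 1):
        run_id = start_job()
        deadline = clock() + timeout_s
        while clock() < deadline:
            status = get_status(run_id)
            if status == "SUCCEEDED":
                return attempt
            if status == "FAILED":
                break  # give up on this run and move to the next attempt
            sleep(poll_s)
    raise RuntimeError(f"job did not succeed after {max_attempts} attempts")


class FakeClock:
    """Deterministic clock for the demo: sleeping just advances the time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = FakeClock()
runs = []


def start_job():
    runs.append(len(runs) + 1)
    return runs[-1]


def get_status(run_id):
    # The first run hangs forever; the second one succeeds.
    return "RUNNING" if run_id == 1 else "SUCCEEDED"


attempts = run_with_timeout(start_job, get_status, timeout_s=5, poll_s=1,
                            clock=clock.time, sleep=clock.sleep)
print(attempts)
```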
Data Processing (ODS)
Kubernetes ELT Pods
● Container for loading the ODS
○ Developed as an Airflow operator
○ Run with the K8s Spark operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
● Extract
○ Extract from source db
○ Python script with multi-processing (using query sharding)
○ Optimized queries and fetches reduce DB load and increase extract performance
○ Extract from MySQL, SQL Server; sqlalchemy is used so that other DBs are easy to add when needed
● Load
○ Load to target storage(S3, Hive)
○ Insert into Hive tables using pyspark (k8s as master)
○ Spark container
Data Processing (Data Mart)
● Glue(spark)
○ Expensive
○ Acquiring executor/driver resources takes about 10 minutes on the first start; later runs sometimes start immediately and sometimes wait another 10 minutes. Run time is hard to predict, and unless the jobs are put into a single script there is a lot of unnecessary waiting -> this may improve
○ Improvements to glueContext are expected
○ Monitoring of executor memory usage is intuitive
○ No advantage over EMR in run time, cost, or choice of instance type, so it is not used for mart processing
● EMR
○ Starting EMR takes about 3~5 minutes; AWS dependency
○ EC2 charge + EMR charge. Spot instances can cut the EC2 cost considerably
○ A hive, spark, hive-metastore(glue) environment can be set up quickly
○ How it is used in the data platform
■ Airflow starts and terminates EMR per job, which reduces the resources spent managing EMR and the EMR server stability issues
■ Livy(REST service for Apache Spark)
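
The EMR bullet above describes starting and terminating a cluster per job from Airflow. One way to sketch that lifecycle is a context manager that always terminates the cluster, even when the job fails. The `FakeClusterClient` class and its `start`/`terminate` methods are hypothetical stand-ins that only record calls; they are not the EMR API.

```python
from contextlib import contextmanager


class FakeClusterClient:
    """Hypothetical stand-in for a real cluster API; it only records calls."""

    def __init__(self):
        self.events = []

    def start(self, name):
        self.events.append(("start", name))
        return "cluster-1"

    def terminate(self, cluster_id):
        self.events.append(("terminate", cluster_id))


@contextmanager
def transient_cluster(client, name):
    """Start a cluster for one job and terminate it afterwards, success or failure."""
    cluster_id = client.start(name)
    try:
        yield cluster_id
    finally:
        client.terminate(cluster_id)


client = FakeClusterClient()
try:
    with transient_cluster(client, "mart-daily") as cluster_id:
        client.events.append(("submit", cluster_id))
        raise RuntimeError("job failed")  # simulate a failing job
except RuntimeError:
    pass
print(client.events)
```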
Data Processing (Data Mart)
● Kubernetes ELT(Extract-Load-Transform) Pods
○ Same container as Data Processing(ODS)
■ Runs pyspark
■ Run through the Airflow Kubernetes operator
○ Data processing without a dependency on EMR or its extra cost
○ Custom image built with the required libraries added
■ Glue metastore configuration / related libraries installed
○ Data Mart processing goes Hive -> Hive inside the DataLake by submitting pyspark, and does the Transform work only
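
The data mart slide above submits pyspark with Kubernetes as the Spark master. As a rough sketch, the command line for such a submission could be assembled as below. The configuration keys are standard Spark-on-Kubernetes settings, while the API server URL, image, namespace, script path, and the choice of cluster deploy mode are assumptions made for illustration.

```python
import shlex


def spark_submit_args(api_server, image, app_path, namespace="spark",
                      executors=2, app_args=()):
    """Build a spark-submit argument list that targets a Kubernetes master."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",  # assumption; client mode is also possible
        "--name", "mart-transform",
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        app_path,
        *app_args,
    ]


args = spark_submit_args(
    "https://k8s.example.com:443",       # hypothetical API server
    "example.com/spark-elt:latest",      # hypothetical custom image
    "local:///opt/app/transform.py",     # script baked into the image
    app_args=["--date", "2020-01-01"],
)
print(shlex.join(args))
```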
Data Processing (real time)
Log Monitoring
● Real-time log monitoring with Kinesis and Druid
● Kinesis
○ Log collection
○ AWS managed realtime stream service
● Druid
○ Can use Kinesis as a data source
○ Real-time monitoring dashboards built in Pivot, Superset, Tableau, etc.
Log data analysis(table)
● Log Hive tables provided with Spark Streaming
○ Usable from Presto and Spark
○ K8s spark-operator
○ Spark reads from Kinesis and inserts into a Hive table
Data Processing (real time - druid)
Role
● Log data monitoring, analysis cube
Comparison with ES
● Both Druid and ES offer fast analysis of large data sets and real-time data processing
● ES needs local storage, and Druid also needs local storage for its cache, but Druid has deep storage, so data is easy to reorganize when nodes are added or removed (Druid fits better with the DataLake concept of providing a single unified storage)
● Druid is more complex to configure and install than ES
○ Master/Data/Query + Zk vs node configure
● Queries are somewhat easier to apply with Druid
● Reloading or changing data is inconvenient with Druid (ES can do it through an ES query, but Druid requires a separate API)
Benefits of using Kubernetes
● A change in node scale is applied simply by changing the StatefulSet replica count
● Master HA configuration, Druid and Zk installation
Data Analysis(Presto)
Role
● EDA
● Tableau Report
● Table Summary ETL
PrestoSQL
● fork from prestodb
● CBO
● AWS Glue metastore
● https://prestosql.io/
● Starburstdata.com : presto with k8s, Apache Ranger, Apache Sentry
Data Analysis(Presto)
“If you were entering Hadoop ecosystem 8-10 years ago, there was this mantra: bring compute to your storage, tie them together; shipping data is so expensive. That is no longer true. All modern architectures right now separate storage from compute. Grow your data without limit, scale your compute power whenever you need.”
Kamil Bajda-Pawlikowski, Data Council NY, Nov 7-8, 2018
● S3 as DataLake
● Presto as Compute
Data Analysis(Presto)
Benefits of using Kubernetes
● Cheaper than the Presto provided by EMR, and restarts on service problems, version upgrades, etc. are easy
● It only handles data processing and the data stays in S3, so data synchronization and DQ are easier to manage than with other databases
● Worker Scaling
○ Applied immediately by adjusting the Deployment replicas
○ HPA(cpu,memory,custom metrics)/AutoScaler/Scheduled Scaling
● Multi Cluster
○ Multi cluster and load balancing through K8s Service configuration
■ https://github.com/lyft/presto-gateway
○ Clusters can be offered to user groups as sandboxes
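
The worker scaling bullet above lists HPA as one option for Presto workers. The Kubernetes documentation describes the HPA target as the current replica count times the ratio of the observed metric to the target metric, rounded up. The sketch below reproduces that calculation in a simplified form; the function name, the replica bounds, and the example numbers are illustrative assumptions, and the real controller applies further rules (stabilization windows, pod readiness) that are not modelled here.

```python
import math


def desired_replicas(current, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA calculation: ceil(current * current_metric / target_metric)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to the target, so do not scale
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 4 workers at 90% average CPU with a 60% target
scale_up = desired_replicas(4, 90, 60)
# 4 workers at 62% with a 60% target: inside the tolerance band
no_change = desired_replicas(4, 62, 60)
# 4 workers at 30% with a 60% target, never fewer than 3
scale_down = desired_replicas(4, 30, 60, min_replicas=3)
print(scale_up, no_change, scale_down)
```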
Benefits of a Kubernetes-based data platform
Benefits of a K8s-based data platform
Scalability
● HPA/VPA, AutoScaler
● Metrics-based, Scheduled Scaling
● Per-service setup
○ Airflow worker: when workers run short, adding worker pods resolves the resource shortage
○ Druid data server: adjust the Druid data server StatefulSet replicas as needed
○ Presto worker
■ Adjust the Presto worker Deployment replicas as needed
■ Presto multi cluster
Reliability
● When service pods are set up as StatefulSets/Deployments, pods restart automatically on failure (Fault Tolerance)
○ Services built on Spot instances
● Presto multi cluster(HA)
Benefits of a K8s-based data platform
Service delivery
● K8s Service and Ingress make it easy to offer data services to users
● With Docker images and Helm, data query/analysis solutions such as jupyter, superset, redash can be provided and removed quickly
Computing node management
● EC2 is managed from k8s
● Spot instance/on demand + Fargate
● Auto scaling group configured per node group
○ Auto scaling is possible, but instances take time to come up, so scaling can be scheduled according to the planned instance resource usage

Editor's Notes

  • #4 I think what a data platform needs in the big data and cloud era, more than a traditional data-warehouse-style platform did, is scalability, reliability, and real-time capability. Scalability is the ability to grow the system as data and users increase exponentially; the platform must scale quickly, at a reasonable price, and easily up to the performance you want. For data, scaling brings issues such as redistribution, and how a solution handles them is an important factor in choosing it. Reliability can be considered in two forms, HA (High Availability) and Fault Tolerance. HA keeps the service available when a failure occurs, for example by duplicating systems or eliminating single points of failure (SPOF). It is the level of reliability required by web servers, APIs, and DBs, but in a data platform many parts need only fault tolerance rather than full HA. Fault Tolerance means the system recovers quickly and automatically after a failure; in a data platform, batch jobs and the worker/executor nodes of analysis systems can keep serving without trouble as long as they are restarted on failure. Real time gives the data platform freshness and recency, and in contrast to the Data Lake concept it can be called or thought of as a data river. It can be used to monitor data or sites and for fraud/anomaly detection, and it helps improve the usability and trustworthiness of the data platform.
  • #5 From the standpoint of building a data platform as well, Kubernetes offers Helm as an installation tool, and some services can be installed from YAML files alone; we installed ZooKeeper from YAML files. When a product is developed or installed, Kubernetes Services, Ingress, and so on make it easy to offer it to users, and when a server has a problem it is convenient because deleting the service and starting it again carries little burden. Zigbang used EKS to run Kubernetes on AWS. EKS is AWS's managed Kubernetes: it manages the master servers, supports multi-AZ, and recently started supporting k8s upgrades. On price, it was cut by 50% in January, but we would probably have used it even without the cut; if we installed Kubernetes ourselves instead of using EKS, I expect the EC2 cost of running master nodes across multiple AZs and the resources spent managing them would be much higher.
  • #7 Airflow is a job coordination platform; for a detailed introduction and installation, please refer to the related link. The development/operations concept is that Airflow workers do not process data directly (apart from the meta DB, they do not access the DBs or storage such as S3 that hold analysis inputs or results). We do not use the Kubernetes executor, but we developed a Kubernetes operator internally and use it, so the tasks that actually process data run as pods on Kubernetes. The advantage of installing Airflow on Kubernetes is that the webserver and scheduler get fault tolerance, and workers can be set up so that more are added when the workload grows. The Airflow webserver could also be made HA, but since it is for development within the team we configured it for fault tolerance; HA is possible if requirements call for it. To synchronize Airflow DAG sources we sync periodically through git sync without a separate deployment process, and to share storage between pods we use AWS EFS, a network file system. The screen below is customized to show the git version in Airflow. Because Airflow is not a complete solution and needed features have to be developed, it may seem inconvenient or unstable, but for that very reason I think the features needed in a Kubernetes environment can be easily developed and applied.
  • #8, #9, #12 Hudi: it matters little for daily batches, but for more frequent data synchronization we are also loading in parallel into an updatable table store.
  • #18, #19 To address scalability and reliability we adopted Kubernetes. In Kubernetes, scalability is available through HPA/VPA, the Autoscaler, and so on, and can be controlled by metrics or by schedule. Reliability is provided through native Kubernetes features such as Services, Deployments, and StatefulSets; for example, several Presto clusters can be created, and HA can be set up through a Service and Deployments even for a data query tool such as Superset. For Presto workers, Druid data nodes, or the Airflow webserver/worker/scheduler, Deployments/StatefulSets also allow a fault-tolerant setup in which they are run again after a failure.