AWS Big Data Architecture Patterns and Best Practices - AWS Summit Seoul 2017

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
남궁영환, Big Data Consultant
Professional Services
AWS Big Data Architecture Patterns and Best Practices

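What this talk covers
Barriers to entry for big data
Basic architectural principles
Can big data processing be simplified?
Which technologies should you use?
• Why?
• How?
Reference architecture
Design patterns
Big data architecture best practices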

Continuous, explosive growth of data
Volume
Velocity
Variety

Evolution of big data technology
Batch processing → Real-time processing → Analytics/prediction (Machine Learning)

Evolution of cloud services
Virtual machine based → Managed service based → Serverless based

Far too many tools
• Amazon Kinesis
• Amazon Glacier
• S3
• DynamoDB
• RDS
• EMR
• Amazon Redshift
• Data Pipeline
• Amazon Kinesis Streams app
• Lambda
• Amazon ML
• SQS
• ElastiCache
• DynamoDB Streams
• Amazon Kinesis Analytics
• Amazon Elasticsearch Service

Barriers to entry for big data
Why?
How do I approach it?
Which tools should I use?
Is there an architecture I can use as a reference?

Basic architectural principles
Build decoupled systems, one per stage
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, Latency, Throughput, Access patterns
Adopt and leverage AWS managed services
• Scalable/elastic, Available, Reliable, Secure, No (or low) admin
Design patterns tailored to log data
• Immutable logs, Materialized views
Be cost-conscious
• Big data ≠ Big cost

Simplifying big data processing…
Data → Ingest/Collect → Store → Process/Analyze → Visualize/Share → Answers & Insights
Time to answer (latency), throughput, cost

Types of data
[Diagram: the Collect stage]
Sources: Applications (mobile apps, web apps, data centers via AWS Direct Connect), Logging (Amazon CloudWatch, AWS CloudTrail), Transport (AWS Import/Export Snowball), Messaging, IoT (devices, sensors & IoT platforms, AWS IoT)
Data types: RECORDS, DOCUMENTS, FILES, MESSAGES, STREAMS
• Transactions: in-memory data structures, database records
• Files: search documents, log files
• Events: message data, data streams

Types of data stores
[Diagram: the Collect stage (same sources and data types as the previous slide) feeding the Store stage]
• In-memory: caches, data structure servers
• Database: SQL & NoSQL databases
• Search: search engines
• File store: file systems
• Queue: message queues
• Stream storage: pub/sub message queues

#1: Message storage & stream storage
[Diagram: in the Store stage, the Message store is Amazon SQS and the Stream stores are Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon DynamoDB Streams]
Amazon SQS
• Managed message queue service
Apache Kafka
• High throughput distributed streaming platform
Amazon Kinesis Streams
• Managed stream storage + processing
Amazon Kinesis Firehose
• Managed data delivery
Amazon DynamoDB
• Managed NoSQL Database
• Tables can be stream-enabled

Why do we need stream storage?
Decouples producers and consumers
Persistent buffer
Collects multiple streams
Preserves message ordering
Streaming MapReduce
Parallel consumption
[Diagram: records in a DynamoDB stream, Kinesis stream, or Kafka topic are split across Shard #1 / Partition #1 and Shard #2 / Partition #2, each keeping its records in order (1, 2, 3, 4). Consumer 1 computes Count of Red = 4 and Count of Violet = 4; Consumer 2 computes Count of Blue = 4 and Count of Green = 4.]
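The shard diagram can be sketched in a few lines of Python. This is a minimal in-memory illustration and not the Kinesis, Kafka, or DynamoDB Streams API: `shard_for`, `partition`, and `count_by_key` are hypothetical helpers, and the CRC32-based shard assignment is an assumption made for the example.

```python
import zlib
from collections import Counter, defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Map a partition key to a shard (hypothetical hashing scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def partition(records, num_shards):
    """Append each (key, seq) record to its shard, preserving arrival order."""
    shards = defaultdict(list)
    for key, seq in records:
        shards[shard_for(key, num_shards)].append((key, seq))
    return shards


def count_by_key(shard_records):
    """Per-consumer reduce step: count records per key within one shard."""
    return Counter(key for key, _ in shard_records)


# Four records for each of four colors, interleaved as they arrive.
records = [(color, seq) for seq in range(1, 5)
           for color in ("red", "violet", "blue", "green")]
shards = partition(records, num_shards=2)

# One consumer per shard; each consumer could run in parallel.
per_consumer = {sid: count_by_key(recs) for sid, recs in shards.items()}
totals = sum(per_consumer.values(), Counter())
```

Because a key always hashes to the same shard, each consumer sees every record for its keys in order, which is what lets the per-key counts be computed independently.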

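Amazon SQS
• Decouples producers from consumers/subscribers
• Persistent buffer
• Collects multiple streams
• No message ordering (Standard)
• Message order can be preserved with FIFO queues
• No streaming MapReduce
• No parallel consumption
• Amazon SNS can deliver to multiple queues or Lambda functions
[Diagram: consumers of a Standard queue may receive messages out of order or duplicated, while a FIFO queue delivers 1, 2, 3, 4 in order. A publisher sends to an Amazon SNS topic, whose subscribers are AWS Lambda functions and Amazon SQS queues.]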

Which stream storage should I use?
 | Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS (Standard) | Amazon SQS (FIFO)
AWS managed | Yes | Yes | Yes | No | Yes | Yes
Guaranteed ordering | Yes | Yes | No | Yes | No | Yes
Delivery (deduping) | Exactly-once | At-least-once | At-least-once | At-least-once | At-least-once | Exactly-once
Data retention period | 24 hours | 7 days | N/A | Configurable | 14 days | 14 days
Availability | 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ | 3 AZ
Scale / Throughput | No limit / ~ table IOPS | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limits / automatic | 300 TPS / queue
Parallel consumption | Yes | Yes | No | Yes | No | No
Stream MapReduce | Yes | Yes | N/A | Yes | N/A | N/A
Row/Object size | 400 KB | 1 MB | Destination row/object size | Configurable | 256 KB | 256 KB
Cost | Higher (table cost) | Low | Low | Low (+admin) | Low-medium | Low-medium
Data temperature scale under the table: Hot to Warm
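One way to use the comparison table is to encode a few of its rows and filter on requirements. The sketch below transcribes three rows (AWS managed, guaranteed ordering, parallel consumption) as of this 2017 talk; the `StreamStore` class and `candidates` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamStore:
    name: str
    aws_managed: bool
    guaranteed_ordering: bool
    parallel_consumption: bool


# Values transcribed from the comparison table in this talk.
STORES = [
    StreamStore("Amazon DynamoDB Streams", True, True, True),
    StreamStore("Amazon Kinesis Streams", True, True, True),
    StreamStore("Amazon Kinesis Firehose", True, False, False),
    StreamStore("Apache Kafka", False, True, True),
    StreamStore("Amazon SQS (Standard)", True, False, False),
    StreamStore("Amazon SQS (FIFO)", True, True, False),
]


def candidates(*, managed=None, ordering=None, parallel=None):
    """Return store names matching every requirement that is not None."""
    def ok(actual, wanted):
        return wanted is None or actual == wanted

    return [s.name for s in STORES
            if ok(s.aws_managed, managed)
            and ok(s.guaranteed_ordering, ordering)
            and ok(s.parallel_consumption, parallel)]


shortlist = candidates(managed=True, ordering=True, parallel=True)
```

A filter like this only narrows the field; retention period, record size, and cost from the remaining rows still need to be weighed by hand.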

#2: File/object storage
[Diagram: in the Store stage, Amazon S3 is added as the File store (Hot), next to the message and stream stores]

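Why is Amazon S3 a good fit for big data?
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Hadoop clusters can run on Amazon EC2 Spot Instances
• Unlimited number of objects
• High availability: tolerates AZ failure
• No additional cost for data replication
• Tiered storage through lifecycle policies (Standard, IA, Amazon Glacier)
• Low cost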

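Why is Amazon S3 a good fit for big data? (continued)
• No compute cluster is needed for storage (unlike HDFS)
• Multiple kinds of clusters (Spark, Hive, Presto) can use the same data at the same time
• Very high bandwidth: no limit on aggregate throughput
• Designed for 99.999999999% durability
• Versioning is supported as a built-in feature
• Security: SSL, client/server-side encryption at rest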

Guide to choosing file/object storage
• (Hot data) Use HDFS for very frequently accessed data
• Use Amazon S3 Standard for frequently accessed data
• Use Amazon S3 Standard-IA for infrequently accessed data
• (Cold data) Archive rarely accessed data with Amazon Glacier
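The S3 tiers in this guide can be automated with a lifecycle rule. The sketch below only builds the rule as a plain dictionary, in the shape that boto3's `put_bucket_lifecycle_configuration` is expected to accept as `LifecycleConfiguration`; verify against the current documentation before use. The `tiering_rule` helper, the `logs/` prefix, and the 30-day and 90-day thresholds are assumptions for illustration and do not come from the talk.

```python
def tiering_rule(prefix: str, ia_after_days: int, glacier_after_days: int) -> dict:
    """Build one S3 lifecycle rule that moves objects to colder tiers over time."""
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            # Infrequently accessed data: S3 Standard-IA
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            # Rarely accessed (cold) data: Amazon Glacier
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


# Hypothetical thresholds: tune them to the real access pattern of the data.
lifecycle = {"Rules": [tiering_rule("logs/", ia_after_days=30, glacier_after_days=90)]}
```

Keeping the rule as data makes it easy to review and test before it is applied to a bucket.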

#3: Storage for transactional data
[Diagram: in the Store stage, the Hot stores Amazon ElastiCache (Cache), Amazon DynamoDB (NoSQL), Amazon RDS (SQL), and Amazon Elasticsearch Service (Search) are added next to the message, stream, and file stores]
Amazon ElastiCache
• Managed Memcached or Redis service
Amazon DynamoDB
• Managed NoSQL database service
Amazon RDS
• Managed relational database service
Amazon Elasticsearch Service
• Managed Elasticsearch service

Summary: guide to choosing a data store
• Data structure → fixed schema, JSON, key-value
• Access patterns → store data in the format in which it will be accessed
• Data characteristics (access frequency) → hot, warm, and cold
• Cost → reasonable cost
Data structure | What to use?
Fixed schema | SQL, NoSQL
Schema-free (JSON) | NoSQL, Search
(Key, value) | In-memory, NoSQL
Access pattern | What to use?
Put/Get (key, value) | In-memory, NoSQL
Simple relationships → 1:N, M:N | NoSQL
Multi-table joins, transaction, SQL | SQL
Faceting, search | Search
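The two lookup tables can be combined mechanically. In the sketch below the dictionaries are transcribed from the tables, while intersecting the two recommendations is an assumption of this example and not a rule stated in the talk; `suggest` is a hypothetical helper name.

```python
# Recommendations transcribed from the two tables in this slide.
BY_STRUCTURE = {
    "fixed schema": {"SQL", "NoSQL"},
    "schema-free (JSON)": {"NoSQL", "Search"},
    "key-value": {"In-memory", "NoSQL"},
}
BY_ACCESS_PATTERN = {
    "put/get": {"In-memory", "NoSQL"},
    "simple relationships": {"NoSQL"},
    "multi-table joins, transactions, SQL": {"SQL"},
    "faceting, search": {"Search"},
}


def suggest(structure: str, access_pattern: str) -> set:
    """Store types recommended for both the data structure and the access pattern."""
    return BY_STRUCTURE[structure] & BY_ACCESS_PATTERN[access_pattern]


relational = suggest("fixed schema", "multi-table joins, transactions, SQL")
documents = suggest("schema-free (JSON)", "faceting, search")
lookups = suggest("key-value", "put/get")
```

An empty result signals that the structure and the access pattern pull in different directions, which usually means storing the data in more than one store.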

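Data temperature vs. data store
[Chart: data stores (In-memory, NoSQL, SQL, Archive storage) placed by structure (low to high) and data temperature (hot, warm, cold). From hot data to cold data: request rate goes from high to low, cost/GB from high to low, latency from low to high, and data volume from low to high.]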

Types of processing/analytics and related frameworks
• Batch
  • Time taken: minutes to hours
  • Daily/weekly/monthly reports
  • Amazon EMR (MapReduce, Hive, Pig, Spark)
• Interactive
  • Time taken: seconds
  • Self-service dashboards
  • Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
• Message
  • Time taken: milliseconds to seconds
  • Message data processing
  • Amazon SQS applications
• Stream
  • Fraud alerts, 1-minute metrics
  • Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda
• Analytics (Machine Learning)
  • Fraud event tracking, predictive analytics modeling
  • Amazon ML, Amazon EMR (Spark ML)
[Diagram: the Process/Analyze stage on a Fast/Slow scale, grouped as Stream (Streaming on Amazon EMR, Amazon Kinesis Analytics, KCL apps on Amazon EC2, AWS Lambda), Message (Amazon SQS apps on Amazon EC2), Interactive (Amazon Redshift, Presto on Amazon EMR, Amazon Athena), ML (Amazon Machine Learning), and Batch (Amazon EMR)]

Which data processing technology should I use?
 | Amazon EMR (Spark Streaming) | Apache Storm | KCL Application | Amazon Kinesis Analytics | AWS Lambda | Amazon SQS Application
AWS managed | Yes (Amazon EMR) | No (Do it yourself) | No (EC2 + Auto Scaling) | Yes | Yes | No (EC2 + Auto Scaling)
Serverless | No | No | No | Yes | Yes | No
Scale / throughput | No limits / ~ nodes | No limits / ~ nodes | No limits / ~ nodes | Up to 8 KPU / automatic | No limits / automatic | No limits / ~ nodes
Availability | Single AZ | Configurable | Multi-AZ | Multi-AZ | Multi-AZ | Multi-AZ
Programming languages | Java, Python, Scala | Almost any language via Thrift | Java, others via MultiLangDaemon | ANSI SQL with extensions | Node.js, Java, Python | AWS SDK languages (Java, .NET, Python, …)
Uses | Multistage processing | Multistage processing | Single stage processing | Multistage processing | Simple event-based triggers | Simple event-based triggers
Reliability | KCL and Spark checkpoints | Framework managed | Managed by KCL | Managed by Amazon Kinesis Analytics | Managed by AWS Lambda | Managed by SQS Visibility Timeout

Which data analytics technology should I use?
 | Amazon Redshift | Amazon Athena | Amazon EMR (Presto, Spark, Hive)
Use case | Optimized for data warehousing | Ad-hoc interactive queries | Presto: interactive query; Spark: general purpose (iterative ML, RT, ..); Hive: batch
Scale/throughput | ~ Nodes | Automatic / No limits | ~ Nodes
AWS managed service | Yes | Yes, serverless | Yes
Storage | Local storage | Amazon S3 | Amazon S3, HDFS
Optimization | Columnar storage, data compression, and zone maps | CSV, TSV, JSON, Parquet, ORC, Apache Web log | Framework dependent
Metadata | Amazon Redshift managed | Athena Catalog Manager | Hive Meta-store
BI tools support | Yes (JDBC/ODBC) | Yes (JDBC) | Yes (JDBC/ODBC & Custom)
Access controls | Users, groups, and access controls | AWS IAM | Integration with LDAP
UDF support | Yes (Scalar) | No | Yes
Speed scale shown with the table: Fast to Slow

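How do we do ETL?
[Diagram: ETL sits between the Store and Process/Analyze stages]
Use partner solutions for data integration
https://aws.amazon.com/big-data/partner-solutions/
- They reduce the effort required across the whole process of migrating, cleansing, synchronizing, and managing data.
AWS Glue (coming soon)
• Fully managed ETL service
• Makes it easy to discover data sources, prepare data, and move data between data stores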

[Reference architecture diagram, stages: Collect → Store → ETL → Process/Analyze → Visualize/Share]
Collect: Applications (mobile apps, web apps, data centers, AWS Direct Connect), Logging (Amazon CloudWatch, AWS CloudTrail), Transport (AWS Import/Export Snowball), Messaging, IoT (devices, sensors & IoT platforms, AWS IoT); data types RECORDS, DOCUMENTS, FILES, MESSAGES, STREAMS
Store (labeled Hot to Warm): Stream (Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams), Message (Amazon SQS), File (Amazon S3), Cache (Amazon ElastiCache), NoSQL (Amazon DynamoDB), SQL (Amazon RDS), Search (Amazon Elasticsearch Service)
Process/Analyze (labeled Fast to Slow): Streaming on Amazon EMR, Amazon Kinesis Analytics, KCL apps on Amazon EC2, AWS Lambda, Amazon SQS apps on Amazon EC2, Amazon Redshift, Amazon Machine Learning, Presto on Amazon EMR, Amazon Athena

[Diagram: the Visualize/Share stage, fed by ETL, with Amazon QuickSight and Apps & Services]
• Applications & APIs
• Analysis and visualization
• Notebooks
• IDE
Audiences: business users; data scientists and developers

[Reference architecture diagram, same Collect, Store, ETL, and Process/Analyze components as before, now with the Visualize/Share stage added: Amazon QuickSight, Apps & Services]

Decoupled systems, one per stage
Separate data processing from storage
Applies across multiple stages
Store → Process → Store → Process
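The Store → Process → Store → Process chain can be sketched with stand-in stores. This is a minimal in-memory illustration under stated assumptions: `Store` and `run_stage` are hypothetical names, and a `deque` stands in for whatever durable store (queue, stream, bucket, table) sits between stages.

```python
from collections import deque


class Store:
    """Stand-in for any durable store (queue, stream, bucket, table)."""

    def __init__(self):
        self._items = deque()

    def put(self, item):
        self._items.append(item)

    def drain(self):
        """Yield and remove items in arrival order."""
        while self._items:
            yield self._items.popleft()


def run_stage(source: Store, sink: Store, transform):
    """Read everything currently in `source`, transform it, write it to `sink`."""
    for item in source.drain():
        sink.put(transform(item))


raw, cleaned, aggregated = Store(), Store(), Store()
for line in [" 3 ", "4", " 5"]:
    raw.put(line)

run_stage(raw, cleaned, lambda s: int(s.strip()))  # Store -> Process -> Store
run_stage(cleaned, aggregated, lambda n: n * n)    # Store -> Process -> Store
result = list(aggregated.drain())
```

Each stage only knows about the store before it and the store after it, so a stage can be replaced or scaled without touching the others.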

Pub/Sub
Process/consume stream data in parallel
[Diagram: Amazon Kinesis (store) is read in parallel by AWS Lambda, Apache Spark, and the Amazon Kinesis Connector Library (process)]
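Parallel consumption works because reading from a stream does not remove records, and each consumer tracks its own position. The sketch below is a minimal in-memory model and not the Kinesis API; `Stream`, `Consumer`, and the offset-based checkpoint are assumptions made for illustration.

```python
class Stream:
    """Append-only log; reading never removes records."""

    def __init__(self):
        self._records = []

    def publish(self, record):
        self._records.append(record)

    def read_from(self, offset):
        return self._records[offset:]


class Consumer:
    """Keeps its own checkpoint, so consumers do not affect each other."""

    def __init__(self, stream, handler):
        self.stream = stream
        self.handler = handler
        self.checkpoint = 0
        self.results = []

    def poll(self):
        batch = self.stream.read_from(self.checkpoint)
        self.results.extend(self.handler(r) for r in batch)
        self.checkpoint += len(batch)


stream = Stream()
upper = Consumer(stream, str.upper)
length = Consumer(stream, len)

stream.publish("click")
upper.poll()               # only `upper` has consumed so far
stream.publish("purchase")
upper.poll()
length.poll()              # `length` still sees both records
```

A slow or newly added consumer simply starts from its own checkpoint, which is the property a queue that deletes messages on receipt does not provide.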

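Materialized Views
Analytics frameworks that read from and write to multiple data stores
[Diagram: Amazon Kinesis (store) is processed by the Amazon Kinesis Connector Library, Amazon EMR (Spark SQL, Spark Streaming), and AWS Lambda (process), with results written to Amazon S3 and Amazon DynamoDB (store)]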
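The immutable-log and materialized-view pattern can be shown in miniature. In this sketch a Python list stands in for the immutable log and two dictionaries stand in for the read-optimized stores; the event fields (`user`, `item`, `amount`) and the function names are hypothetical and chosen only for the example.

```python
from collections import defaultdict

event_log = []  # append-only; events are never updated in place


def append(event: dict) -> None:
    """Record an event by copying it onto the end of the log."""
    event_log.append(dict(event))


def build_views(log):
    """Derive two read-optimized views by replaying the log from the start."""
    spend_by_user = defaultdict(int)
    orders_by_item = defaultdict(int)
    for e in log:
        spend_by_user[e["user"]] += e["amount"]
        orders_by_item[e["item"]] += 1
    return dict(spend_by_user), dict(orders_by_item)


append({"user": "u1", "item": "book", "amount": 12})
append({"user": "u2", "item": "book", "amount": 12})
append({"user": "u1", "item": "pen", "amount": 3})

spend_by_user, orders_by_item = build_views(event_log)
```

Since the views are derived entirely from the log, a new view can be added later, or a broken one rebuilt, by replaying the same events.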

Data temperature vs. processing/response time
[Chart: processing speed (fast to slow answers) against data temperature (hot to cold). Data stores shown: Amazon Kinesis, Amazon DynamoDB, Amazon S3. Processing technologies shown: Spark Streaming, Apache Storm, AWS Lambda, KCL apps, native apps, Amazon Redshift, Amazon Athena, Presto, Spark, Hive.]

Real-time analytics
[Diagram: a stream is stored in Amazon Kinesis and processed by a KCL app, AWS Lambda, Spark Streaming, and Amazon Kinesis Analytics. Labeled outputs: alerts and notifications (Amazon SNS), real-time prediction (Amazon ML), app state (Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS), KPI (Amazon ES), log (Amazon S3), and fan out (Amazon Kinesis).]

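Interactive analytics & batch analytics
[Diagram: streams arrive through Amazon Kinesis Firehose and Amazon Kinesis Analytics, and files land in Amazon S3. Components shown for interactive analytics: Amazon Redshift, Amazon Athena, Amazon EMR. Components shown for batch analytics: Amazon EMR and Amazon Machine Learning, with batch prediction and real-time prediction served to consumers.]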

Data Lake
[Diagram: Amazon S3 at the center. Interactive/batch analytics: Amazon Redshift, Amazon EMR, Amazon Athena, Amazon ML, feeding applications. Inputs: streams through Amazon Kinesis Firehose and Amazon Kinesis Analytics, files, and transactions from Amazon DynamoDB and Amazon RDS through change data capture. Real-time analytics: Amazon Kinesis processed by KCL, AWS Lambda, and Spark Streaming, with app state in Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, and Amazon ES.]

Big data architecture best practices

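Case 1:
Personalized recommendations delivered within seconds
Stylists' expertise delivered to customers at scale
Cost savings
…
[Diagram components: mobile users, desktop users, Amazon Kinesis, AWS Lambda, Amazon DynamoDB, Amazon S3 (data storage), Amazon Redshift, analytics tools, online stylist; mapped onto the Ingest/Collect → Store → Process/Analyze → Visualize/Share pipeline from Data to Answers & Insights]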

Case 2: (1 of 2)
[Diagram components: CDN, real-time bidding, retargeting platform, Amazon Kinesis Streams, Amazon S3 (all data), ETL, attribution, machine learning, reporting (third party), advanced analytics (third party), ecosystem of tools and services; mapped onto the Ingest/Collect → Store → Process/Analyze → Visualize/Share pipeline from Data to Answers & Insights]

Case 2: (2 of 2)
[Diagram components: CDN, real-time bidding, retargeting platform, Amazon Kinesis Streams, Amazon S3, a Spark pipeline covering ETL (Spark SQL) and attribution & ML, reporting (third party), advanced analytics (third party), ecosystem of tools and services; mapped onto the Ingest/Collect → Store → Process/Analyze → Visualize/Share pipeline from Data to Answers & Insights]

[Reference architecture diagram, repeated as a recap: Collect → Store → ETL → Process/Analyze (Batch, Message, Interactive, Stream, ML) → Visualize/Share]

Summary
Build decoupled systems, one per stage
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, Latency, Throughput, Access patterns
Adopt and leverage AWS managed services
• Scalable/elastic, Available, Reliable, Secure, No (or low) admin
Use design patterns tailored to log data
• Immutable logs, Materialized views
Be cost-conscious
• Big data ≠ Big cost

After this talk…
• Big data services on AWS:
https://aws.amazon.com/ko/big-data/
• AWS Big Data Blog:
https://aws.amazon.com/ko/blogs/big-data/
• AWS Korea Blog:
https://aws.amazon.com/ko/blogs/korea/category/korea-techtips/
• Big Data on AWS training:
https://aws.amazon.com/ko/training/course-descriptions/bigdata/

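Thank you for joining us!

https://www.awssummit.kr
Rate this session now in the AWS Summit mobile app to receive a gift after the event.
Share your thoughts on the event on social media with the #AWSSummitKR hashtag.
Slides and video recordings will be shared through the official AWS Korea social channels.
We look forward to your feedback!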
