Ad-Tech on AWS Seminar | Data Analytics on AWS

Data Analytics on AWS

Piljoong Kim (@PiljoongKim)
Solutions Architect
Amazon Web Services Korea
Session agenda
• Data analytics
• Big data and data analytics
• Related AWS services
• Data analytics on AWS

Explosive growth of data
Volume
Velocity
Variety

The evolution of big data
Batch: reports
Real-time: alerts
Prediction: forecasts

Too many tools
[Diagram: Amazon Glacier, S3, DynamoDB, RDS, EMR, Amazon Redshift, Data Pipeline, Amazon Kinesis, Cassandra, CloudSearch, Kinesis-enabled app, Lambda, ML, SQS, ElastiCache, DynamoDB Streams, Amazon Elasticsearch]

Amazon S3
Store anything
Object storage
Scalable
99.999999999% durability

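Amazon Redshift
Relational data warehouse
Massively parallel processing (MPP) at petabyte scale
Fully managed service
SSD and HDD platforms available
Starts at $0.25 per hour, or about $1,000 per TB per year
Reserved Node option available
Structured data processing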

Amazon EMR
Hadoop as a service
Hive, Impala, Spark, Presto, and more
Easy to use and fully managed
Spot Instances supported
HDFS and S3 file systems
Semi-structured and unstructured data processing

Amazon Kinesis
Real-time stream processing
High throughput and elasticity
Easy to use
Integrates with S3, Lambda, Redshift, and DynamoDB
Streaming

Amazon ML
Easy-to-use managed service built for developers
Robust technology based on Amazon's internal systems
Builds models from data already stored in AWS
Predictive analytics

AWS Lambda
Serverless compute service that runs code in response to events
Extends AWS services with custom logic
Billed only for the requests served and the compute time consumed
Event processing

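Back to data analytics…
Many people ask the following questions.

Is there a reference architecture I can follow?
There are so many tools. Which one should I use?
How should I use it?
Why choose that one over all the others?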

Architectural principles
• Decoupled "data bus"
– Data → Store → Process → Answers
• Use the right tool for the job
– Data structure, latency, throughput, access patterns
• Use the Lambda architecture
– Immutable (append-only) log, batch/speed/serving layer
• Leverage AWS managed services
– No/low admin
• Big data != Big cost

Simplify Big Data Processing
data → ingest/collect → store → process/analyze → consume/visualize → answers
Trade-offs: time to answer (latency), throughput, cost

Reference Architecture
[Diagram: Collect → Store → Analyze → Consume]
Collect: web apps, mobile apps (iOS, Android), logging (Logstash), IoT, applications
Data types: transactional data, file data, stream data, search data
Store: search, SQL, NoSQL, cache (Amazon ES, Amazon RDS, Amazon DynamoDB, Amazon ElastiCache); file storage (Amazon S3, Amazon Glacier); stream storage (Apache Kafka, Amazon Kinesis, Amazon DynamoDB); temperature labels: hot, warm, cold
Analyze: batch, interactive, stream processing, ML, ETL (Amazon Redshift, Impala, Pig, Streaming, Amazon Kinesis, AWS Lambda, Amazon Elastic MapReduce, Amazon ML); speed labels: fast, slow
Consume: analysis & visualization (Amazon QuickSight), notebooks, IDE, predictions, apps & APIs

[Diagram: the Collect and Store stages of the reference architecture, highlighting stream storage next to database, file storage, and search]

Stream storage options
AWS managed services
• Amazon Kinesis: Stream
• Amazon DynamoDB Streams: Table + Streams
• Amazon SQS: Queue
• Amazon SNS: Pub/Sub
Self-managed
• Apache Kafka: Stream

Which stream storage should I use?
                 | Amazon Kinesis | Amazon DynamoDB Streams | Amazon SQS / Amazon SNS | Kafka
Managed          | Yes            | Yes                     | Yes                     | No
Ordering         | Yes            | Yes                     | No                      | Yes
Delivery         | at-least-once  | exactly-once            | at-least-once           | at-least-once
Lifetime         | 7 days         | 24 hours                | 14 days                 | Configurable
Replication      | 3 AZ           | 3 AZ                    | 3 AZ                    | Configurable
Throughput       | No limit       | No limit                | No limit                | ~ Nodes
Parallel clients | Yes            | Yes                     | No (SQS)                | Yes
MapReduce        | Yes            | Yes                     | No                      | Yes
Record size      | 1 MB           | 400 KB                  | 256 KB                  | Configurable
Cost             | Low            | Higher (table cost)     | Low-Medium              | Low (+admin)
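
Most options in the comparison above deliver records at least once, so a consumer may see the same record twice. One common way to cope is an idempotent consumer that remembers which record ids it has already handled. The sketch below is a minimal in-memory illustration; the `id` field, the record layout, and the function name are assumptions made for the example, not part of any AWS API.

```python
from typing import Callable, Dict, Iterable, Set


def process_at_least_once(
    records: Iterable[Dict[str, str]],
    handler: Callable[[Dict[str, str]], None],
    seen_ids: Set[str],
) -> int:
    """Apply handler once per unique record id and return how many records were handled."""
    handled = 0
    for record in records:
        record_id = record["id"]  # hypothetical unique id assigned by the producer
        if record_id in seen_ids:
            continue  # redelivered duplicate: skip it
        handler(record)
        seen_ids.add(record_id)
        handled += 1
    return handled


clicks = []
seen: Set[str] = set()
batch = [
    {"id": "e1", "campaign": "8"},
    {"id": "e2", "campaign": "8"},
    {"id": "e1", "campaign": "8"},  # the same event delivered a second time
]
count = process_at_least_once(batch, clicks.append, seen)
```

In a real deployment the set of seen ids would have to live in a durable store so that it survives consumer restarts.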

[Diagram: the Collect and Store stages of the reference architecture, highlighting file storage next to database and search]

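Why is Amazon S3 a good fit for big data?
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No compute cluster is needed just for storage (unlike HDFS)
• Hadoop clusters can run on Amazon EC2 Spot Instances
• Multiple kinds of clusters (Spark, Hive, Presto) can use the same data at the same time
• Unlimited number of objects
• Designed for 99.999999999% durability
• High availability: survives the failure of an AZ
• Tiered storage through lifecycle policies (Standard, IA, Amazon Glacier)
• Security: SSL, client/server-side encryption at rest
• Low cost
• Very high bandwidth: no limit on aggregate throughput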

Using S3 together with HDFS and Amazon Glacier
• Use HDFS for very frequently accessed (hot) data
• Use Amazon S3 Standard for frequently accessed data
• Use Amazon S3 Standard-IA for infrequently accessed data
• Archive rarely accessed (cold) data in Amazon Glacier
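
The Standard, Standard-IA, and Glacier tiers are usually chained together with an S3 lifecycle rule. The sketch below builds such a rule as a plain dictionary in the shape S3 lifecycle configurations use. The prefix, the day thresholds, and the helper name are assumptions chosen for illustration, and nothing here calls AWS.

```python
from typing import Any, Dict


def build_lifecycle_configuration(prefix: str, ia_after_days: int, glacier_after_days: int) -> Dict[str, Any]:
    """Build a lifecycle configuration that moves objects under `prefix` to colder tiers."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("the Glacier transition must come after the Standard-IA transition")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Hypothetical thresholds: warm after 30 days, cold after 90 days.
config = build_lifecycle_configuration("logs/", ia_after_days=30, glacier_after_days=90)
```

With boto3 and valid credentials, a dictionary like `config` could be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`; the bucket itself is left out of the sketch.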

[Diagram: the Collect and Store stages of the reference architecture, highlighting the database + search tier]

Database + Search Tier
Best practice: use the right tool for each workload
[Diagram: applications on top of a data tier]
• Search: Amazon Elasticsearch Service, Amazon CloudSearch
• Cache: Redis, Memcached
• SQL: Amazon Aurora, MySQL, MariaDB, PostgreSQL, Oracle, SQL Server
• NoSQL: Cassandra, Amazon DynamoDB, HBase, MongoDB

Data structure and access patterns
Access pattern                        | What to use?
Put/Get (key, value)                  | Cache, NoSQL
Simple relationships (1:N, M:N)       | NoSQL
Cross-table joins, transactions, SQL  | SQL
Faceting, search                      | Search
Data structure                        | What to use?
Fixed schema                          | SQL, NoSQL
Schema-free (JSON)                    | NoSQL, Search
(Key, value)                          | Cache, NoSQL

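[Diagram: data stores (Cache, NoSQL, SQL, Search, Glacier) placed along a hot data / warm data / cold data axis. From hot to cold: request rate high → low, cost/GB high → low, latency low → high, data volume low → high. A second axis shows structure, low → high.]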

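Processing and analysis
Data analytics is the process of inspecting, cleaning, transforming, and modeling data in order to discover useful information, suggest conclusions, and support decision-making.
Examples:
Interactive dashboards → interactive analytics
Daily/weekly/monthly reports → batch analytics
Billing/fraud alerts, 1-minute metrics → real-time analytics
Sentiment analysis, prediction models → machine learning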

Interactive analytics
Works on large volumes of (warm/cold) data
Takes seconds to get an answer
Example: self-service dashboards

Batch analytics
Works on large volumes of (warm/cold) data
Takes minutes to hours to get an answer
Example: generating daily, weekly, and monthly reports

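Real-time analytics
Works on small amounts of hot data
Takes a short time (milliseconds to seconds) to get an answer
Real-time (event)
- Responds in real time to events in a data stream
- Example: billing/fraud alerts
Near real-time (micro-batch)
- Near real-time operations on micro-batches of events in a data stream
- Example: 1-minute metrics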
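
As a rough illustration of the micro-batch case, the sketch below groups events into 1-minute windows and counts them per event type. The event layout (epoch seconds plus an event type) and the function name are assumptions for the example; a production job would read from a stream rather than a list.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # size of the "1-minute metrics" window


def one_minute_counts(events: Iterable[Tuple[int, str]]) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, event type); each event is (epoch_seconds, event_type)."""
    counts: Counter = Counter()
    for timestamp, event_type in events:
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        counts[(window_start, event_type)] += 1
    return dict(counts)


events = [(120, "click"), (130, "click"), (150, "impression"), (185, "click")]
metrics = one_minute_counts(events)
```

Here the first three events fall into the window starting at second 120 and the last one into the window starting at second 180.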

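Prediction with machine learning
Machine learning (ML) gives computers the ability to learn without being explicitly programmed.
Machine learning algorithms:
Supervised learning ← "teach" the program
- Classification ← Is this transaction fraudulent? (Yes/No)
- Regression ← What is the customer's LTV?
Unsupervised learning ← let it learn by itself
- Clustering ← market segmentation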

Analytics tools and frameworks
Machine learning
- Mahout, Spark ML, Amazon ML
Interactive analytics
- Amazon Redshift, Presto, Impala, Spark
Batch processing
- MapReduce, Hive, Pig, Spark
Stream processing
- Micro-batch: Spark Streaming, KCL, Hive, Pig
- Real-time: Storm, AWS Lambda, KCL
[Diagram: the Analyze stage (stream processing, batch, interactive, ML) with Amazon Redshift, Impala, Pig, Amazon Machine Learning, Streaming, Amazon Kinesis, AWS Lambda, Amazon Elastic MapReduce]

Customer case study: Hearst
Hearst is one of the world's largest media and information companies, with more than 360 businesses.
"I don't know how we could have made our clickstream data pipeline work without Amazon Kinesis."
Peter Jaffe, Data Scientist, Hearst Corporation
• Needed to build a platform for analyzing real-time clickstream events and trending content
• Uses Amazon Kinesis Streams and Amazon Kinesis Firehose to deliver 30 TB of clickstream data every day
• Uses Amazon Redshift for complex data science work and analytical queries
• Processes data generated by more than 300 websites
• Delivers clickstream data to editors within minutes
• Recirculation of trending content increased by more than 25 percent
https://aws.amazon.com/solutions/case-studies/hearst/

[Diagram: Hearst clickstream pipeline. Components: users to Hearst properties, Node.JS app-proxy, Amazon Kinesis, S3 storage, ETL on EMR, Amazon Redshift, data science application, API-ready data, Buzzing API; flows labeled clickstream, aggregated data, and models. Latency/throughput labels: 100 seconds at 1 GB/day; 30 seconds at 5 GB/day; 5 seconds at 1 GB/day; milliseconds at 100 GB/day.]

[Diagram: the same Hearst pipeline, highlighting the data science toolbox that works with data and models on Amazon Redshift]
Data Science Toolbox
• IPython Notebook
• On Spark and Amazon Redshift
• Code sharing (and insights)
• User-friendly development environment for data scientists
• Auto-convert .ipynb → .py

Let's take a closer look at Redshift.

Redshift use cases in ad tech
• Attribution analysis
• Campaign performance
• Data management
• Real-time bidding
• Retargeting

Why Redshift?
• Huge volumes of data
– 160 GB to 2 TB
– Access to S3
– Single cluster vs. multiple clusters
• As cheap as possible!
– $1,000/TB/year
– You can't afford to lose data because of cost
– Data can be online or offline!
• Time is money!
– MPP columnar: run queries over billions of events and get results back
– SSD
– Approximate functions

Approximate COUNT DISTINCT
Exact: 692.8s
Approximate: 34.9s
Error: < 0.76%
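
The two timings on this slide can be turned into a speedup figure. The sketch below assumes that 692.8s is the exact COUNT(DISTINCT ...) and 34.9s the approximate variant, which is the natural reading but is not stated on the slide. The table and column names in the SQL strings are hypothetical; `APPROXIMATE COUNT(DISTINCT ...)` is the Redshift spelling of the approximate form.

```python
exact_seconds = 692.8       # first timing on the slide, assumed to be the exact query
approximate_seconds = 34.9  # second timing on the slide, assumed to be the approximate query

speedup = exact_seconds / approximate_seconds
time_saved_fraction = 1 - approximate_seconds / exact_seconds

# Hypothetical table and column names, shown only to contrast the two query forms.
exact_sql = "SELECT COUNT(DISTINCT user_id) FROM impressions;"
approximate_sql = "SELECT APPROXIMATE COUNT(DISTINCT user_id) FROM impressions;"

print(f"speedup: {speedup:.1f}x, time saved: {time_saved_fraction:.0%}")
```

Under that reading, the approximate query is roughly 20 times faster in exchange for an error below 0.76%.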

COPY from JSON
• Ingest JSON directly into Amazon Redshift
• If you have a 1:1 mapping between JSON elements and column names, use 'auto'
• Map elements to columns using a JSONPaths file
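
The two COPY variants above can be sketched as small string builders. The table name, bucket, IAM role ARN, and JSONPath expressions below are all hypothetical, and the statement is assembled with plain string formatting for readability only, so treat it as an illustration rather than production SQL handling.

```python
import json
from typing import List, Optional


def build_jsonpaths(paths: List[str]) -> str:
    """Serialize a JSONPaths file: one JSONPath expression per target column, in column order."""
    return json.dumps({"jsonpaths": paths}, indent=2)


def build_copy_statement(table: str, s3_uri: str, iam_role: str,
                         jsonpaths_uri: Optional[str] = None) -> str:
    """Build a Redshift COPY statement for JSON input.

    Without jsonpaths_uri, 'auto' matches JSON keys to column names.
    """
    json_option = jsonpaths_uri if jsonpaths_uri else "auto"
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' "
        f"JSON '{json_option}';"
    )


role = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"  # hypothetical role
auto_copy = build_copy_statement("clicks", "s3://example-bucket/clicks/", role)
mapped_copy = build_copy_statement(
    "clicks", "s3://example-bucket/clicks/", role,
    jsonpaths_uri="s3://example-bucket/clicks_jsonpaths.json",
)
jsonpaths_file = build_jsonpaths(["$.user.id", "$.campaign", "$.ts"])
```

The JSONPaths variant is useful when the JSON is nested or when key names differ from column names.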

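Data management
• Analytics results are typically delivered to end customers
• A central cluster works on all the data, and a separate cluster is launched per customer
• Each customer's cluster scales independently, without affecting the others
• Ten single-node clusters cost the same as one ten-node cluster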

Neustar's experience with Amazon Redshift
re:Invent 2014 (ADV403)
Slides: http://bit.ly/NeustarAWS
Video: http://bit.ly/AWSNeustarVideo

Frequency + Attribution + Overlap + Ad-hoc =
2.5 + 2 + 2.5 + 1.5 =
8.5 hours needed

Workload                                   | Node count | Node type   | Restore | Maint. | Exec.
Frequency & Attribution & Overlap & Ad hoc | 16         | dw2.8xlarge | 2h      | 1h     | 6h
= $691.20

Workload    | Node count | Node type   | Restore | Maint. | Exec.
Frequency   | 8          | dw2.8xlarge | 1.5h    | 0.5h   | 2.5h
Attribution | 8          | dw2.8xlarge | 1.5h    | 0.5h   | 2h
Overlap     | 8          | dw2.8xlarge | 1h      | 0.5h   | 2.5h
Ad-hoc      | 8          | dw2.8xlarge | 0h      | 0.5h   | 1.5h
= $556.80 (-19%)
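
The two totals can be reproduced with simple node-hour arithmetic. The sketch below assumes a rate of $4.80 per dw2.8xlarge node-hour; that figure is not on the slides and is inferred here because it makes both totals come out exactly.

```python
NODE_HOUR_USD = 4.80  # assumed rate per dw2.8xlarge node-hour, inferred from the slide totals


def cluster_cost(nodes: int, hours: float, rate: float = NODE_HOUR_USD) -> float:
    """Cost of running `nodes` nodes for `hours` hours."""
    return nodes * hours * rate


# One 16-node cluster: restore 2h + maintenance 1h + execution 6h.
single = cluster_cost(16, 2 + 1 + 6)

# Four 8-node clusters, one per workload: (restore, maintenance, execution) hours.
workloads = {
    "Frequency": (1.5, 0.5, 2.5),
    "Attribution": (1.5, 0.5, 2.0),
    "Overlap": (1.0, 0.5, 2.5),
    "Ad-hoc": (0.0, 0.5, 1.5),
}
split = sum(cluster_cost(8, sum(hours)) for hours in workloads.values())

savings = (single - split) / single
print(f"single: ${single:.2f}, split: ${split:.2f}, savings: {savings:.0%}")
```

The single cluster uses 16 × 9 = 144 node-hours, while the four smaller clusters use 8 × 14.5 = 116 node-hours in total, which is where the 19% saving comes from.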

Lesson Learned
Orchestrating Amazon Redshift clusters was really easy!
Don't scale up, scale out.

Ad tech built on AWS

• (Front-end) Beanstalk: Clickstream ingestion
• Kinesis: Real-time data stream
• (Back-end) Beanstalk: KCL apps (Kinesis → S3)
• Lambda: Event-driven processing (S3 → Redshift)
• Redshift: Business intelligence reporting with an in-house BI tool
• EMR: Data processing on Spark
[Diagram: mobile device (SDK integration); Elastic Beanstalk (clickstream data collection); Kinesis (data feeds); Elastic Beanstalk; S3, EMR, Lambda, Redshift (log storage, data processing & analysis); adbrix user and BI user (visualize & report); databases: ElastiCache, DynamoDB]
"By building our next-generation big data system on EMR with Spark, we cut costs by more than 60 percent."
…
"Using S3, Lambda, and Redshift, it took me about 10 business days to build the entire micro-batch analytics system on my own."
- 백정상, Development Team Lead at IGAWorks -
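
The S3 → Lambda → Redshift micro-batch path mentioned above can be sketched as a Lambda-style handler that reads an S3 event notification and prepares one COPY statement per new object. The table name, role ARN, bucket, and the choice of JSON input are assumptions for illustration; the source does not describe the actual implementation, and this sketch only builds the statements instead of executing them.

```python
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def s3_uris_from_event(event: Dict[str, Any]) -> List[str]:
    """Extract s3:// URIs from an S3 event notification payload."""
    uris = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        uris.append(f"s3://{bucket}/{key}")
    return uris


def handler(event: Dict[str, Any], context: Any) -> List[str]:
    """Lambda-style entry point: return one COPY statement per newly created object."""
    table = "activity_events"  # hypothetical target table
    role = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"  # hypothetical role
    return [
        f"COPY {table} FROM '{uri}' IAM_ROLE '{role}' JSON 'auto';"
        for uri in s3_uris_from_event(event)
    ]


sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-activity-logs"},
                "object": {"key": "2016/01/01/part+0001.json"}}}
    ]
}
statements = handler(sample_event, None)
```

A real function would run each statement against the cluster through a database connection and would need error handling and retries.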

[Diagram: adbrix architecture. Components: adbrix user, mobile device, Route 53, EC2 (adbrix analytics), Elastic Beanstalk (activity tracker), Amazon Kinesis, Elastic Beanstalk (activity process), Amazon S3 (activity storage), AWS Lambda (micro-batch loading), Amazon Redshift (BI analysis), EMR-Spark (daily batch analysis), databases (DynamoDB, Amazon RDS, ElastiCache); cross-region replication between the AWS Tokyo region (ap-northeast-1) and the AWS N. Virginia region (us-east-1)]

Using AWS Elastic Beanstalk
http://<elastic beanstalk app>/pixel.jpg?cID=10049&cdid=5961&campID=8&&ic_ch=&refVar=http%3A%2F%2F
www.cosmopolitan.com%2F&icxid=1415035174637-8824780787007880&ic_uq=1415035296585-3799348233235
675&ic_mid=&ic_js_ver=20140917&icctm_ht_athr=Tess%2520Koman&icctm_ht_aid=cosmo.article.32782&icctm_h
t_attl=Terminally%2520lll%252029-Year-Old%2520Brittany%2520Maynard%2520Ends%2520Her%25200wn%2520L
ife%2520as%2520Planned&icctm_ht_chnl=Lifestyle&icctm_ht_dspb=NaN&icctm_ht_gack=1047615795&icct_m_ht
_scck=&icctm_ht_q=&icctm_ht_kw=brittany%2520Ends%2520Her%2520wn%2520Life%2520as%2520Planned&icct
m_ht_pgtyp=news&icctm_ht_dtpub=2014-11-03%252002%3A00%3A00&icctm_ht_sthr=Lifestyle&icctm_ht_stnm=
cosmopolitan.com&icctm_ht_sfid=21422*FA0711DBFB-180E7D89E340EDB8&icctm_ht_cnocl=http%3A%2F%2Fww
w.cosmopolitan.com%2Flifestyle%2Fnews%2Fa32782%2Fbrittany-maynard-dies%2F
[Diagram: client browser → image request → AWS Elastic Beanstalk running node.js → post to Kinesis → Amazon Kinesis → Amazon Kinesis-enabled app]
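
The tracking-pixel flow above turns the query string of an image request into a stream record. The source runs this step in node.js; the sketch below uses Python purely for illustration and only builds the record. The host name, stream name, and the choice of `cID` as partition key (assumed here to identify the client) are hypothetical.

```python
import json
from typing import Any, Dict
from urllib.parse import parse_qs, urlsplit


def pixel_request_to_record(url: str, stream_name: str) -> Dict[str, Any]:
    """Turn a tracking-pixel URL into keyword arguments shaped like a Kinesis put_record call."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    # Keep the first value of each parameter; repeated parameters are not expected here.
    fields = {name: values[0] for name, values in query.items()}
    return {
        "StreamName": stream_name,
        "Data": json.dumps(fields, sort_keys=True).encode("utf-8"),
        # Partitioning by client id keeps one client's events in order within a shard.
        "PartitionKey": fields.get("cID", "unknown"),
    }


url = ("http://tracker.example.com/pixel.jpg"
       "?cID=10049&cdid=5961&campID=8&refVar=http%3A%2F%2Fwww.cosmopolitan.com%2F")
record = pixel_request_to_record(url, "clickstream")
```

With boto3 and valid credentials, a dictionary like `record` could be unpacked into the Kinesis client's `put_record` call.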

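Mobile retargeting
Goal: a machine-learning-based real-time analytics platform for predicting customer propensity
Collected data → data cleansing → exploratory data analysis → data enrichment → propensity modeling → algorithm execution
• Collected data: user campaign history, user/device profiles, user browsing history (websites visited, products viewed, actions taken)
• Data cleansing: handle missing values, remove duplicates, correct inaccurate values
• Exploratory data analysis: univariate analysis, bivariate analysis
• Data enrichment: create new variables, variable transformation
• Propensity modeling: compare models, select the best model
• Algorithm execution: adjust marketing campaigns, monitor feedback, tune the algorithm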

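Mobile retargeting
[Diagram components: customer, AWS Elastic Beanstalk, Amazon Kinesis, Amazon EMR, Amazon Redshift, Amazon DynamoDB, Amazon ML]
Data collection: user visit history, device profiles, customer demographics
Data processing: distributed data cluster
- Real-time processing + batch processing
- Relational + NoSQL
Computation: ad-serving algorithms
- Regression models
- Artificial neural networks
Bid price optimization
Business rule tuning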

Big data streaming
[Diagram: producers → aggregator → continuous processing → store → analytics. Components: CDN, real-time bidding, retargeting platform, Amazon Kinesis, KCL apps (real-time apps, archiver), event replay, Amazon S3, reporting, Qubole]
✓ DSP running its big data processing platform on AWS
✓ Evaluating 30 trillion ad opportunities monthly
✓ Processing 86 billion messages daily on Kinesis
✓ 72% monthly saving on operational costs
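
To get a feel for these volumes, the daily and monthly figures can be converted into average per-second rates. The sketch below assumes a 30-day month and perfectly even traffic, which real ad traffic is not, so the results are averages rather than peak rates.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: a 30-day month

messages_per_day = 86e9          # 86 billion Kinesis messages per day (from the slide)
opportunities_per_month = 30e12  # 30 trillion ad opportunities per month (from the slide)

messages_per_second = messages_per_day / SECONDS_PER_DAY
opportunities_per_second = opportunities_per_month / (DAYS_PER_MONTH * SECONDS_PER_DAY)

print(f"{messages_per_second:,.0f} messages/s, {opportunities_per_second:,.0f} opportunities/s")
```

That works out to roughly one million Kinesis messages and more than eleven million ad opportunities per second on average.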

Real-time Analytics
[Diagram: producer → store → process → outputs. Store: Apache Kafka, Amazon Kinesis, DynamoDB Streams. Process: KCL, AWS Lambda, Spark Streaming, Apache Storm. Outputs: Amazon SNS, Amazon ML, Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, Amazon ES; labeled notifications, alert, app state, real-time prediction, KPI]

Interactive & Batch Analytics
[Diagram: producer → store (Amazon S3) → process → consume. Batch: Amazon EMR (Hive, Pig, Spark). Interactive: Amazon Redshift, Amazon EMR (Presto, Impala, Spark). Amazon ML: batch prediction, real-time prediction]

Lambda Architecture
[Diagram: data → Amazon Kinesis (store) → process → answer → applications]
Batch layer: Amazon Kinesis S3 Connector, Amazon S3, Amazon Redshift, Amazon EMR (Presto, Hive, Pig, Spark)
Speed layer: KCL, AWS Lambda, Spark Streaming, Storm
Serving layer: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES, Amazon ML
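
The three layers of the Lambda architecture can be shown with a toy in-memory example: a batch view recomputed from the immutable log, a speed view covering only events since the last batch run, and a serving step that merges the two. Event layout, campaign ids, and function names are assumptions for illustration; in the architecture above each layer would be a separate service.

```python
from collections import Counter
from typing import Dict, List, Tuple

Event = Tuple[int, str]  # (sequence number in the append-only log, campaign id)


def batch_view(log: List[Event], up_to: int) -> Counter:
    """Batch layer: recompute click counts from the immutable log up to a cutoff."""
    return Counter(campaign for seq, campaign in log if seq <= up_to)


def speed_view(log: List[Event], after: int) -> Counter:
    """Speed layer: count only events newer than the last batch run."""
    return Counter(campaign for seq, campaign in log if seq > after)


def serve(batch: Counter, speed: Counter) -> Dict[str, int]:
    """Serving layer: merge both views to answer a query."""
    return dict(batch + speed)


log: List[Event] = [(1, "campaign-8"), (2, "campaign-8"), (3, "campaign-9"), (4, "campaign-8")]
last_batch_cutoff = 2
answer = serve(batch_view(log, last_batch_cutoff), speed_view(log, last_batch_cutoff))
```

Because the log is append-only, the batch view can always be rebuilt from scratch, and the speed view only has to cover the gap since the last batch run.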

Key takeaways
• Build a decoupled "data bus"!
– Data → Store → Process → Answers
• Use the right tool for the job!
– Data structure, latency, throughput, access patterns
• Seriously consider the Lambda architecture!
– Immutable (append-only) log, batch/speed/serving layer
• Leverage AWS managed services!
– No/low admin
• Always keep cost in mind!
– Big data != Big cost

Key takeaways
• Many small clusters are often better than one huge cluster!
– Take full advantage of the cloud: you can turn clusters on and off at any time
• Try using S3 as your data lake!
– It integrates freely with other services

Sacrificial Architecture
For many people throwing away a code
base is a sign of failure, perhaps
understandable given the inherent
exploratory nature of software
development, but still failure. But often
the best code you can write now is code
you'll discard in a couple of years time.
http://martinfowler.com/bliki/SacrificialArchitecture.html

Feedback is always welcome!
AWS official blog: http://aws.amazon.com/ko/blogs/korea
AWS official social media
@AWSKorea AWSKorea
AmazonWebServices AWSKorea
