[AWS Dev Day] AI / Machine Learning | Practical Techniques for Automating and Optimizing Machine Learning on AWS - 남궁영환 (AWS Solutions Architect), 김대근 (AWS Solutions Architect)
Practical Techniques for Automating and Optimizing Machine Learning on AWS
AI / ML
남궁영환
Data Scientist SA
Amazon Web Services
김대근
Data Scientist SA
Amazon Web Services
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
• AI/ML at AWS
• Large-scale cloud-based machine learning / deep learning training
o Part 1
§ Infrastructure for ML on AWS
§ Horovod & TensorFlow distributed training on EC2, EKS, and SageMaker
o Part 2
§ fast.ai on AWS
§ MnasNet on AWS
• Summary
AWS ML Stack: the deepest and broadest set of capabilities and technologies
• AI Services (app developers with little knowledge of ML)
o Vision: Rekognition Image, Rekognition Video, Textract
o Speech: Polly, Transcribe
o Language: Translate, Comprehend
o Chatbots: Lex
o Forecasting: Forecast
o Recommendations: Personalize
• ML Services (ML developers and data scientists)
o Amazon SageMaker: Build, Train, Deploy
§ Build: pre-built algorithms & notebooks; data labeling (Ground Truth)
§ Train: one-click model training & tuning; optimization (Neo)
§ Deploy: one-click deployment & hosting
§ Reinforcement learning
§ Algorithms & models (AWS Marketplace for Machine Learning)
• ML Frameworks & Infrastructure (ML researchers and academics)
o Frameworks
o Interfaces
o Infrastructure: EC2 P3 & P3dn, EC2 C5, FPGAs, Greengrass, Elastic Inference, Inferentia
Scaling TensorFlow near-linearly to 256 GPUs (2018)
Available in Amazon SageMaker and the AWS Deep Learning AMIs
• Stock TensorFlow: 65% scaling efficiency with 256 GPUs, 30 min training time
• AWS-Optimized TensorFlow: 90% scaling efficiency with 256 GPUs, 14 min training time
https://aws.amazon.com/about-aws/whats-new/2018/11/tensorflow-scalability-to-256-gpus/
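Scaling efficiency figures like the 65% and 90% above are commonly computed as the measured multi-GPU throughput divided by the ideal throughput (single-GPU throughput times the number of GPUs). The sketch below assumes that definition; the function name and the throughput numbers in the usage lines are hypothetical, not measurements from the slide.

```python
def scaling_efficiency(throughput_n, throughput_1, n_gpus):
    """Fraction of ideal linear speedup achieved on n_gpus GPUs.

    throughput_n: measured images/sec on n_gpus GPUs
    throughput_1: measured images/sec on a single GPU
    """
    if n_gpus < 1 or throughput_1 <= 0:
        raise ValueError("n_gpus must be >= 1 and throughput_1 must be > 0")
    return throughput_n / (throughput_1 * n_gpus)


if __name__ == "__main__":
    # Hypothetical numbers: one GPU processes 400 images/sec and
    # 256 GPUs together process 92,160 images/sec.
    eff = scaling_efficiency(92_160, 400, 256)
    print(f"scaling efficiency: {eff:.0%}")
```

An efficiency of 1.0 would mean perfectly linear scaling; communication overhead is what usually pulls the value below that.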
Why large-scale machine learning matters (1/3)
"How do data science techniques scale with amount of data?" - Andrew Ng
https://www.slideshare.net/ExtractConf
• Model performance keeps improving as more data is accumulated
• Deep learning is being applied in a steadily growing range of fields
• Training ML/DL models on large volumes of data takes a great deal of time and resources
• Hence "distributed training"
The "data parallel" approach to distributed training - Uber
https://eng.uber.com/horovod/
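The "data parallel" approach rests on one observation: if a batch is split into equal shards, the average of the per-shard gradients equals the gradient of the whole batch. The following is a minimal sketch of that idea in plain Python, assuming a one-parameter linear model with a mean-squared-error loss; the function names and sample data are illustrative only.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_gradient(w, xs, ys, n_workers):
    """Split the batch into equal shards, compute one gradient per worker,
    then average the per-worker gradients (the role played by all-reduce)."""
    if len(xs) % n_workers != 0:
        raise ValueError("batch size must be divisible by n_workers")
    shard = len(xs) // n_workers
    grads = [
        mse_gradient(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(n_workers)
    ]
    return sum(grads) / n_workers


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]
    print(mse_gradient(0.5, xs, ys))
    print(data_parallel_gradient(0.5, xs, ys, n_workers=2))
```

Both calls return the same value, which is why synchronous data-parallel training behaves like single-machine training with a larger batch.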
Why large-scale machine learning matters (2/3)
Scaling to Very Very Large Corpora for Natural Language Disambiguation, Banko and Brill, Microsoft Research (2001)
http://www.aclweb.org/anthology/P01-1005
“These results suggest that we may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development.”
Choosing the algorithm matters, but securing a large amount of training data matters most.
Why large-scale machine learning matters (3/3)
Large-scale machine learning can call for very different solutions depending on the problem and the approach.
• Common goals
✓ Compute, networking, containers, distributed training performance tuning, ...
✓ ML engineers use their preferred ML/DL framework and focus on developing models that contribute to business success
• Data Management
✓ Scale of the data ∝ complexity of the problem and of the algorithm
✓ Durability and availability of the data
• Distributed Computing Frameworks
✓ Data pipeline features (Dask, Ray, PyToolz, ipyparallel, etc.)
✓ CPU ➝ GPU ➝ Multi-GPUs ➝ Multi-nodes
✓ TensorFlow, PyTorch, MXNet, ...
• Build compute clusters to fit the workload!
Where to train and deploy deep learning models
• Amazon SageMaker
• Amazon Elastic Container Service for Kubernetes
• Amazon Elastic Container Service
• Amazon EC2
• AWS Deep Learning AMIs
• AWS Deep Learning Containers
"Choose the ML/DL model training and deployment environment that fits the workload you are trying to solve."
Large-scale cloud-based machine learning / deep learning training
Infrastructure for ML on AWS
P3 instance
Suited to workloads that need massively parallel processing
• Machine learning model training
• HPC (High Performance Computing) simulation
• 3D model rendering
• Video encoding
Up to 8 NVIDIA Tesla V100 GPUs
• 1 PetaFLOPS of compute performance (up to 14x that of P2 instances)
• 300 GB/s GPU-to-GPU communication over NVLink (9x that of P2 instances)
• Supports all ML frameworks and model types
• Available under several purchasing options (up to 70% cost savings with Spot Instances)

Instance       GPUs      vCPUs   Memory
P3.2xlarge     1 V100    8       61 GB
P3.8xlarge     4 V100    32      244 GB
P3.16xlarge    8 V100    64      488 GB

3 instance types, available in 14 regions
https://aws.amazon.com/ko/ec2/instance-types/p3/
P3dn.24xlarge instance

Description               P3.16xlarge         P3dn.24xlarge          Improvement
Number and type of GPUs   8 x NVIDIA V100     8 x NVIDIA V100        -
GPU memory                16 GB/GPU           32 GB/GPU              100%
GPU peer-to-peer          NVLink, 300 GB/s    NVLink, 300 GB/s       -
CPU family                Broadwell           Skylake with AVX-512
vCPUs                     64                  96                     50%
System memory             488 GB              768 GB                 57%
Networking throughput     25 Gbps             100 Gbps               300%
EBS throughput            14 Gbps             14 Gbps                -
Local instance storage    No                  2.0 TB NVMe SSD

• The most powerful GPU instance available in the cloud
• Efficient large-scale ML training and HPC simulation
(multi-node clusters of 32 or more instances can be built on 100 Gbps network bandwidth)
• Fast access to data for model training and simulation
(Amazon S3, network file systems, local instance storage)
• Large-scale ML model training and large-scale data processing
(latest NVIDIA V100 GPUs with 32 GB of GPU memory)
• Well suited to optimizing data preprocessing
(96 vCPUs using AWS custom Skylake CPUs and 768 GB of system memory)
https://aws.amazon.com/ko/ec2/instance-types/p3/#Amazon_EC2_P3dn.24xlarge_Instances
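The Improvement column of the comparison table is the relative increase from P3.16xlarge to P3dn.24xlarge, that is (new - old) / old. A short sketch of that arithmetic, using the values from the table; the function and dictionary names are illustrative.

```python
def improvement_pct(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


# (P3.16xlarge value, P3dn.24xlarge value) taken from the table
SPECS = {
    "GPU memory (GB/GPU)": (16, 32),
    "vCPUs": (64, 96),
    "System memory (GB)": (488, 768),
    "Networking throughput (Gbps)": (25, 100),
}

if __name__ == "__main__":
    for name, (old, new) in SPECS.items():
        print(f"{name}: {improvement_pct(old, new):.0f}%")
```

Note that a fourfold value (25 Gbps to 100 Gbps) is a 300% increase, since the baseline itself counts as 100%.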
Amazon FSx for Lustre
• High-performance file system for machine learning, HPC, video processing, financial modeling, and more
• Natively integrated with S3
• Lustre delivers sub-millisecond latencies, with throughput that scales to hundreds of gigabytes per second and millions of IOPS
• POSIX-compliant, so existing Linux-based applications work without any changes
• Pay only for the resources you use (no minimum commitment or upfront fees)
• No changes to client OS kernel modules required
(https://aws.amazon.com/ko/fsx/lustre/)
Infrastructure for ML on AWS (1/3)
[Figure: two cluster architectures side by side]
• Traditional HPC machine learning cluster
✓ Bastion host | BeeGFS management node | Cluster monitoring
✓ Auto Scaling worker nodes: multi-node parallel deep learning application stack in a deep learning placement group
✓ Auto Scaling BeeGFS RAM storage nodes: BeeGFS RAM-based storage array
✓ Amazon EFS: cluster-wide persistent storage
✓ Amazon S3: object store, model parameters
• Cloud-native machine learning cluster
✓ AWS Batch: multi-node TensorFlow on P3 / P3dn container instances (Lustre kernel driver)
✓ Amazon ECR: container registry
✓ Amazon FSx for Lustre: hydrate from and commit to Amazon S3
Infrastructure for ML on AWS (2/3)
Traditional AWS Deep Learning Cluster
https://aws.amazon.com/ko/blogs/compute/distributed-deep-learning-made-easy/
https://github.com/aws-samples/deep-learning-models/tree/master/hpc-cluster
https://github.com/awslabs/deeplearning-cfn
[Figure: architecture diagram. Components shown: a VPC (10.0.0.0/16) in the AWS Cloud with an Internet gateway, router, and NAT gateway; a public subnet (10.0.0.0/24) with the EC2 master instance; a private subnet with the EC2 workers; one Auto Scaling group each for master and workers; AWS Elastic File System; Amazon SQS master and worker queues; AWS Lambda; Amazon SNS (worker setup and "Auto Scaling setup complete" notifications); Amazon S3]
Infrastructure for ML on AWS (3/3)
Cloud-native AWS Deep Learning Cluster
https://aws.amazon.com/ko/blogs/compute/scalable-deep-learning-training-using-multi-node-parallel-jobs-with-aws-batch-and-amazon-fsx-for-lustre/
[Figure: architecture diagram. Components shown: TFRecord input bucket; event trigger; AWS Step Functions workflow; AWS Batch multi-node parallel job on NVIDIA GPU-backed running containers; TensorFlow container registry; FSx for Lustre; training output bucket; Amazon CloudWatch; Amazon Glacier]
Large-scale cloud-based machine learning / deep learning training
with Horovod & TensorFlow
Horovod (1/9)
• Open source framework for distributed deep learning
• Works with stock TensorFlow, Keras, PyTorch, and others
• Quick and simple installation: `pip install horovod`
• Advanced algorithms available
• Supports high-performance networks (RDMA, GPUDirect)
• Decouples ML engineers from infrastructure
✓ The infrastructure team provides the container and MPI environment
✓ ML engineers use their preferred deep learning framework
✓ Shared expectations for distributed training across frameworks
(infrastructure team & ML engineers)
horovod.ai
https://eng.uber.com/horovod/
Horovod (2/9)
• Ring-AllReduce
✓ Scale of the data ∝ number of cluster nodes
• Synchronous updates
• NVIDIA's NCCL library (for GPU-level communication)
• Configurations
✓ Single-ring NCCL vs. Hierarchical AllReduce
    HOROVOD_HIERARCHICAL_ALLREDUCE=1
✓ Tensor Fusion
    HOROVOD_FUSION_THRESHOLD=67108864
    HOROVOD_CYCLE_TIME=5
✓ FP16 all-reduce
    hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)
https://eng.uber.com/horovod/
[Figure: Ring-AllReduce across three workers; each row lists the six values held by each worker]

Stage           Worker A             Worker B             Worker C
Initial         5 13 8 19 42 1       8 11 4 2 7 7         9 27 3 15 8 4
After step 1    5 13 8 19 50 5       13 24 4 2 7 7        9 27 7 17 8 4
After step 2    5 13 15 36 50 5      13 24 4 2 57 12      22 51 7 17 8 4
After step 3    22 51 15 36 50 5     13 24 15 36 57 12    22 51 7 17 57 12
After step 4    22 51 15 36 57 12    22 51 15 36 57 12    22 51 15 36 57 12
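The worker values on this slide can be reproduced with a small simulation of ring all-reduce. The sketch below is plain Python, not Horovod or NCCL code: it assumes each worker's buffer is split into as many equal chunks as there are workers, runs a reduce-scatter phase (n - 1 steps of summing a received chunk) followed by an all-gather phase (n - 1 steps of copying a received chunk), and always sends to the next worker in the ring. The function name is illustrative.

```python
def ring_allreduce(buffers, chunk_size):
    """Simulate ring all-reduce (sum) over equal-length worker buffers.

    buffers: one list of numbers per worker, each of length n * chunk_size,
             where n is the number of workers.
    Returns the buffers held by each worker after the final step.
    """
    n = len(buffers)
    if any(len(b) != n * chunk_size for b in buffers):
        raise ValueError("each buffer must hold exactly n chunks")
    data = [list(b) for b in buffers]

    def chunk(idx):
        start = (idx % n) * chunk_size
        return range(start, start + chunk_size)

    # Phase 1, reduce-scatter: in step s, worker i sends chunk (i - s)
    # to worker i + 1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [[data[i][k] for k in chunk(i - step)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for k, v in zip(chunk(i - step), sends[i]):
                data[dst][k] += v

    # Phase 2, all-gather: worker i now owns the fully reduced chunk i + 1
    # and passes completed chunks around the ring, overwriting stale copies.
    for step in range(n - 1):
        sends = [[data[i][k] for k in chunk(i + 1 - step)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for k, v in zip(chunk(i + 1 - step), sends[i]):
                data[dst][k] = v
    return data


if __name__ == "__main__":
    worker_a = [5, 13, 8, 19, 42, 1]
    worker_b = [8, 11, 4, 2, 7, 7]
    worker_c = [9, 27, 3, 15, 8, 4]
    for row in ring_allreduce([worker_a, worker_b, worker_c], chunk_size=2):
        print(row)
```

Each worker sends and receives only one chunk per step, so the per-worker traffic stays roughly constant as workers are added; that property is what makes the ring layout attractive for large clusters.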
Horovod (3/9)
* Horovod for TensorFlow, Keras, and PyTorch
    import horovod.tensorflow as hvd
    import horovod.keras as hvd
    import horovod.tensorflow.keras as hvd
    import horovod.torch as hvd
    # more frameworks coming
1. Initialize the library
    import horovod.tensorflow as hvd
    hvd.init()
2. Set the GPU to use
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())
3. Adjust the learning rate and add the Horovod distributed optimizer
    # momentum=0.9 is an illustrative value, not from the slide
    opt = tf.train.MomentumOptimizer(
        learning_rate=0.01 * hvd.size(), momentum=0.9)
    opt = hvd.DistributedOptimizer(opt)
4. Synchronize initial state between workers
    hooks = [hvd.BroadcastGlobalVariablesHook(0)]
    with tf.train.MonitoredTrainingSession(hooks=hooks, ...) as mon_sess:
        ...
    # OR
    bcast_op = hvd.broadcast_global_variables(0)
    sess.run(bcast_op)
5. Use checkpoints only on the first worker
    ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None
    with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir, ...) as mon_sess:
        ...
( source code from https://github.com/horovod/horovod )
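Multiplying the base learning rate by hvd.size() follows the linear scaling rule: with N workers the effective batch is N times larger, so the learning rate is scaled by N. A minimal sketch of that rule in plain Python follows. The gradual warm-up is an assumption added here (it is a common companion to the rule, not something shown on the slide), and the function and parameter names are illustrative.

```python
def scaled_learning_rate(base_lr, num_workers, step=None, warmup_steps=0):
    """Learning rate under the linear scaling rule.

    Without warm-up arguments this returns base_lr * num_workers.
    With warm-up (an assumption, not from the slide), the rate ramps
    linearly from base_lr at step 0 to the scaled rate at warmup_steps.
    """
    target = base_lr * num_workers
    if step is None or warmup_steps <= 0 or step >= warmup_steps:
        return target
    return base_lr + (target - base_lr) * step / warmup_steps


if __name__ == "__main__":
    # 16 workers, matching a 2 x 8-GPU setup
    print(scaled_learning_rate(0.01, 16))
    for s in (0, 50, 100):
        print(s, scaled_learning_rate(0.01, 16, step=s, warmup_steps=100))
```

In a Horovod script, num_workers would be the value returned by hvd.size().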
Horovod (4/9)
Example run
# Use AWS Deep Learning AMI
laptop$ ssh ubuntu@<aws-ip-1>
aws-ip-1$ source activate tensorflow_p27
aws-ip-1$ ssh-keygen
aws-ip-1$ cat /home/ubuntu/.ssh/id_rsa.pub
[copy contents of the pubkey]
aws-ip-1$ exit
laptop$ ssh ubuntu@<aws-ip-2>
aws-ip-2$ source activate tensorflow_p27
aws-ip-2$ cat >> /home/ubuntu/.ssh/authorized_keys
[paste contents of the pubkey]
aws-ip-2$ exit
laptop$ ssh ubuntu@<aws-ip-1>
aws-ip-1$ ssh aws-ip-2
[will prompt to confirm the host key, say yes]
aws-ip-2$ exit
aws-ip-1$ mpirun -np 2 -H aws-ip-1,aws-ip-2 \
    wget https://raw.githubusercontent.com/uber/horovod/master/examples/tensorflow_mnist.py
aws-ip-1$ mpirun -bind-to none -map-by slot \
    -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
    -x LD_LIBRARY_PATH -x PATH \
    -mca btl_tcp_if_exclude lo,docker0 \
    -np 16 -H aws-ip-1:8,aws-ip-2:8 \
    python tensorflow_mnist.py
# Pro tip: hide mpirun args into mpirun.sh
aws-ip-1$ mpirun.sh \
    -np 16 -H aws-ip-1:8,aws-ip-2:8 \
    python tensorflow_mnist.py
( source code from https://github.com/horovod/horovod )
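The mpirun invocation above follows a pattern: the value of -np is the total number of processes (one per GPU), and -H lists each host with its process count. The sketch below assembles the same argument list from a host-to-GPU mapping so the two stay consistent. It only builds and prints the command and does not run it; the function name is hypothetical, while the flags are the ones shown on the slide.

```python
import shlex


def build_mpirun_command(hosts, script, env_vars=None):
    """Build an mpirun argument list for a Horovod job.

    hosts: dict mapping host name -> number of processes (GPUs) on that host
    script: training script to launch with python
    env_vars: environment variables to forward with -x
    """
    total = sum(hosts.values())
    host_arg = ",".join(f"{host}:{count}" for host, count in hosts.items())
    cmd = ["mpirun", "-bind-to", "none", "-map-by", "slot"]
    for var in env_vars or []:
        cmd += ["-x", var]
    cmd += ["-mca", "btl_tcp_if_exclude", "lo,docker0"]
    cmd += ["-np", str(total), "-H", host_arg, "python", script]
    return cmd


if __name__ == "__main__":
    cmd = build_mpirun_command(
        {"aws-ip-1": 8, "aws-ip-2": 8},
        "tensorflow_mnist.py",
        ["HOROVOD_HIERARCHICAL_ALLREDUCE=1", "LD_LIBRARY_PATH", "PATH"],
    )
    print(" ".join(shlex.quote(arg) for arg in cmd))
```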
Horovod (5/9)
[Reference] Example code: Horovod for TensorFlow
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
# Scale the learning rate by the number of workers
# (momentum=0.9 is an illustrative value, not from the slide)
opt = tf.train.MomentumOptimizer(
    learning_rate=0.01 * hvd.size(), momentum=0.9)

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to synchronize initial state
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Make training operation
train_op = opt.minimize(loss)

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing
# when done or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training
        mon_sess.run(train_op)
( source code from https://github.com/horovod/horovod )
Horovod (6/9)
[Reference] Example code: Estimator API
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
def model_fn(features, labels, mode):
    loss = ...
    # momentum=0.9 is an illustrative value, not from the slide
    opt = tf.train.MomentumOptimizer(
        learning_rate=0.01 * hvd.size(), momentum=0.9)
    # Add Horovod Distributed Optimizer
    opt = hvd.DistributedOptimizer(opt)
    return tf.estimator.EstimatorSpec(...)

# Broadcast initial variable state.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=ckpt_dir,
    config=tf.estimator.RunConfig(session_config=config))

mnist_classifier.train(
    input_fn=train_input_fn,
    steps=100,
    hooks=hooks)
( source code from https://github.com/horovod/horovod )
Horovod (7/9)
[Reference] Example code: Horovod for MXNet
import mxnet as mx
import horovod.mxnet as hvd
from mxnet import autograd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank
context = mx.gpu(hvd.local_rank())
num_workers = hvd.size()

# Build model
model = ...
model.hybridize()

# Create optimizer
optimizer_params = ...
opt = mx.optimizer.create('sgd', **optimizer_params)

# Initialize parameters
model.initialize(initializer, ctx=context)

# Fetch and broadcast parameters
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)

# Create DistributedTrainer, a subclass of gluon.Trainer
trainer = hvd.DistributedTrainer(params, opt)

# Create loss function
loss_fn = ...

# Train model
for epoch in range(num_epoch):
    train_data.reset()
    for nbatch, batch in enumerate(train_data, start=1):
        data = batch.data[0].as_in_context(context)
        label = batch.label[0].as_in_context(context)
        with autograd.record():
            output = model(data.astype(dtype, copy=False))
            loss = loss_fn(output, label)
        loss.backward()
        trainer.step(batch_size)
( source code from https://github.com/horovod/horovod )
Horovod (8/9)
[Reference] Example code: Horovod for Keras
import keras
from keras import backend as K
import tensorflow as tf
import horovod.keras as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

# Build model...
model = ...
opt = keras.optimizers.Adadelta(lr=1.0 * hvd.size())

# Add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)

model.compile(
    loss='categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy'])

# Broadcast initial variable state.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
...
model.fit(
    x_train,
    y_train,
    callbacks=callbacks,
    epochs=10,
    validation_data=(x_test, y_test))
( source code from https://github.com/horovod/horovod )
Horovod (9/9)
[Reference] Example code: Horovod for PyTorch
import torch
import torch.nn.functional as F
import torch.optim as optim
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Horovod: pin GPU to local rank
torch.cuda.set_device(hvd.local_rank())

# Build model...
model = Net()
model.cuda()
# The lr value is illustrative (not on the slide), scaled by the number of workers
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap optimizer with DistributedOptimizer
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters())

# Horovod: broadcast parameters
hvd.broadcast_parameters(
    model.state_dict(),
    root_rank=0)

for epoch in range(100):
    for batch_idx, (data, target) in ...:
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
( source code from https://github.com/horovod/horovod )
Large-scale cloud-based machine learning / deep learning training
Scalable multi-node training (EC2)
Scaling performance using distributed training
TensorFlow & Horovod on Amazon EC2
Example training run on ImageNet
• Preprocess on an instance dedicated to TFRecord conversion
✓ t2.large instance with a 1.0 TB EBS sc1 volume
✓ Download the ImageNet dataset
✓ Transform the raw dataset into TFRecord format
✓ Upload the transformed dataset to Amazon S3
    nohup aws s3 sync /data s3://YOUR_BUCKET_NAME >& upload.log &
• Set up all EC2 instances with the same instance type, AMI, data path, and model path
• Check GPU utilization on P3dn.24xlarge (and/or P3.16xlarge)
Scaling performance using distributed training
TensorFlow & Horovod on Amazon EC2
Example training run on ImageNet
• Time-to-train: around 45 min
• 8 * P3dn.24xlarge instances
• ML model: ResNet-50
• Top-1 validation accuracy: 75.59%
https://docs.aws.amazon.com/ko_kr/dlami/latest/devguide/tutorial-horovod-tensorflow.html
Scaling performance using distributed training
TensorFlow & Horovod on Amazon EC2
https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-deep-learning-training-using-gpus-in-the-aws-cloud/
Configuration
• 8 * P3.16xlarge instances
• DL framework: TensorFlow, MXNet
• ML model: ResNet-50
• Dataset: ImageNet (1.2 million images)
• Top-1 validation accuracy: 76%
• Time-to-train: 47 min ~ 50 min
[Figure: training using P3 instances (ResNet-50 & ImageNet); x-axis: number of GPUs (1, 2, 4, 8, 16, 32, 64); y-axis: images/second (0 to 50,000)]
Scaling performance using distributed training
TensorFlow & Horovod on Amazon EC2
https://aws.amazon.com/ko/blogs/machine-learning/scalable-multi-node-training-with-tensorflow/
Configuration
• 32 * P3.16xlarge instances
• DL framework: TensorFlow
• ML model: ResNet-50
• Dataset: ImageNet
• Top-1 validation accuracy: 75.4%
• Top-5 validation accuracy: 92.6%
• Time-to-train: 14.6 min
[Figure: training performance (images/sec) w.r.t. TensorFlow & CUDA (ResNet-50 & ImageNet)]
[Figure: time to train vs. number of GPUs vs. images/sec, efficiency, and communication overhead]
Large-scale cloud-based machine learning / deep learning training
Optimizing distributed deep learning performance on Amazon EKS
Optimizing distributed deep learning performance on Amazon EKS (1/11)
[Reference] Modular and Scalable Amazon EKS Architecture
https://aws.amazon.com/ko/quickstart/architecture/amazon-eks/
Optimizing distributed deep learning performance on Amazon EKS (2/11)
Using Horovod in Amazon EKS
• STEP 1. Install Kubeflow to set up a cluster for distributed training
• STEP 2. Set the app name and initialize it
• STEP 3. Install mpi-operator from Kubeflow
• STEP 4. Create an MPI Job template; define the number of nodes (replicas) and the number of GPUs each node has (gpusPerReplica)
• STEP 5. Apply the manifest to the default environment. The MPI Job will create a launch pod
https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-eks-tutorials-distributed-gpu-training.html
Optimizing distributed deep learning performance on Amazon EKS (3/11)
EKS Deep Learning Benchmark Utility
https://github.com/aws-samples/aws-eks-deep-learning-benchmark
• Automated benchmark workflow, from cluster creation through teardown
• Supports multiple backend storage systems
(e.g., Amazon EFS, Amazon FSx for Lustre)
• Stores configuration and results in S3
• Backed by Kubeflow operators and kubebench
• Supports multiple deep learning frameworks
(TF, TF + Horovod + OpenMPI, PyTorch, MXNet)
• Kubernetes cluster configuration tailored to user requirements
• Saves intermediate results and shuts the cluster down automatically
• Runs multiple experiments in parallel
Optimizing distributed deep learning performance on Amazon EKS (4/11)
EKS Deep Learning Benchmark Utility
https://github.com/aws-samples/aws-eks-deep-learning-benchmark
• Set up NFS
    kubectl create -f deploy/benchmark-nfs-svc.yaml
    kubectl get svc benchmark-nfs-svc -o=jsonpath={.spec.clusterIP}
    # Replace the IP in `deploy/benchmark-nfs-volume.yaml` before the following step
    kubectl create -f deploy/benchmark-nfs-volume.yaml
• Install Argo Workflow
    kubectl create ns argo
    kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/v2.2.1/manifests/install.yaml
    # You can forward the port to localhost and look at the Argo UI
    kubectl port-forward deployment/argo-ui 8001:8001 -n argo
• Configure AWS credentials
• Configure your GitHub token
• Set up S3 buckets for your benchmark results and your training data
• Configure your Kubernetes cluster
Optimizing distributed deep learning performance on Amazon EKS (5/11)
EKS Deep Learning Benchmark Utility
https://github.com/aws-samples/aws-eks-deep-learning-benchmark
• Run the benchmark jobs, using either of two options
1. Update your workflow settings using the ks command
2. Update the benchmark workflow manifest directly
    s3ResultPath: 's3://kubeflow-pipeline-data/benchmark/',
    s3DatasetPath: 's3://eks-dl-benchmark/imagenet/',
    clusterConfig: 's3://kubeflow-pipeline-data/benchmark/cluster_config.yaml',
    experiments: [{
      experiment: 'experiment-20190415-01',
      trainingJobConfig: 's3://kubeflow-pipeline-data/benchmark/mpi-job-imagenet.yaml',
      trainingJobPkg: 'mpi-job',
      trainingJobPrototype: 'mpi-job-custom',
      // Change to upstream once https://github.com/kubeflow/kubeflow/pull/3062 is merged
      trainingJobRegistry: 'github.com/jeffwan/kubeflow/tree/make_kubebench_reporter_optional/kubeflow',
    }],
    githubSecretName: 'github-token',
    githubSecretTokenKeyName: 'GITHUB_TOKEN',
    image: 'seedjeffwan/benchmark-runner:20190424',
    name: '20190424-00',
    namespace: 'default',
    nfsVolume: 'benchmark-pv',
    nfsVolumeClaim: 'benchmark-pvc',
    region: 'us-west-2',
    trainingDatasetVolume: 'dataset-claim',
    s3SecretName: 'aws-secret',
    s3SecretAccesskeyidKeyName: 'AWS_ACCESS_KEY_ID',
    s3SecretSecretaccesskeyKeyName: 'AWS_SECRET_ACCESS_KEY',
    storageBackend: 'fsx',
    kubeflowRegistry: 'github.com/jeffwan/kubeflow/tree/make_kubebench_reporter_optional/kubeflow'
Optimizing distributed deep learning performance on Amazon EKS (6/11)
EKS Deep Learning Benchmark Utility
https://github.com/aws-samples/aws-eks-deep-learning-benchmark
Optimizing distributed deep learning performance on Amazon EKS (7/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )
• Kubernetes
✓ Supports a wide range of container-based ML/DL frameworks
✓ Elastic and easy to scale
✓ Steadily gaining adoption as an environment for training deep neural networks
• Amazon EKS
✓ Fully managed Kubernetes service
✓ Makes it easy to run Kubernetes workloads on EC2 P2 and P3 instances
• Kubeflow
✓ Kubernetes-native platform for efficiently developing, managing, and deploying machine learning workloads
✓ Supports distributed training
(native TensorFlow architecture or MPI AllReduce (NVIDIA NCCL library or Horovod))
https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Optimizing distributed deep learning performance on Amazon EKS (8/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )
• Amazon FSx for Lustre
✓ High-performance file system
✓ Best suited to workloads that demand fast processing (e.g., machine learning, HPC)
✓ Integrates with Amazon S3
• AWS FSx CSI driver
✓ Lets containers use FSx for Lustre file systems in a Kubernetes-native way
✓ Static/dynamic volume provisioning
✓ Containers on multiple nodes within a cluster can connect to the same Lustre file system
✓ S3 can be used as the Lustre data repository
https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Optimizing distributed deep learning performance on Amazon EKS (9/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )
https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Machines
• 20 * p3.16xlarge (mixed precision)
Amazon EKS-optimized AMI with GPU support
• Kubernetes v1.11.8
• MPI Operator Alpha from Kubeflow 0.4.1
• CUDA 10 with NVIDIA Tesla 410.104 driver
• Docker 18.06.1-ce (incl. nvidia-docker2)
AWS FSx for Lustre filesystem
• FSx CSI Driver v0.1
• Hydrated from an S3 bucket (for ImageNet TFRecords)
TensorFlow (customized image)
• TENSORFLOW_VERSION: v1.13.1
• HOROVOD_VERSION: 0.16.0
• CUDNN_VERSION: 7.4.2.24-1+cuda10.0
• NCCL_VERSION: 2.4.2-1+cuda10.0
• OPENMPI 4.0.0
Dataset (ImageNet)
• 1.28 million images (1000 classes)
• 1024 training files & 128 validation files (TFRecords)
Relevant tools
• awscli, eksctl, ksonnet, and aws-iam-authenticator
"Near-linear scaling performance of 90%-100% was confirmed"
Optimizing distributed deep learning performance on Amazon EKS (10/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )
https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
• Performance optimization checklist (Part #1)
✓ Use the latest deep learning toolkits (e.g., AMI for EKS)
✓ Set the GPU clock speed to its maximum (see: bootstrap command)
✓ Launch instances within a placement group (low latency)
✓ Use the latest version of the AWS VPC CNI plugin (so that all NICs use jumbo frames by default on the EKS cluster)
Optimizing distributed deep learning performance on Amazon EKS (11/11)
( Amazon EKS + Kubeflow + AWS FSx CSI driver )
https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
• Performance optimization checklist (Part #2)
✓ Choose an appropriate storage backend (EBS, EFS, FSx for Lustre, etc.)
✓ Use the static Kubernetes CPU management policy
✓ MPI processor
✓ Build the TensorFlow environment with Intel MKL-DNN to optimize GPU performance
✓ Optimize TensorFlow to parallelize data transformation processes and threads
✓ Adjust thread pools and tune CPU performance
Large-scale cloud-based machine learning / deep learning training
TensorFlow distributed training on Amazon SageMaker
TensorFlow distributed training in Amazon SageMaker (1/6)
• Amazon SageMaker provides prebuilt TensorFlow containers (TensorFlow v1.11+)
• Configure hardware resources and hyperparameters for ML model training
• Training instances: cost-efficient, automated clusters for ML model training
• Approaches for distributed training
✓ TensorFlow's native parameter server (TF v1.11+)
✓ Horovod (TF v1.12+)
https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
TensorFlow distributed training in Amazon SageMaker (2/6)
Parameter servers
• Multiple dedicated processes to
✓ Collect gradients (computed by "worker" processes)
✓ Aggregate gradients
✓ Distribute the updated gradients back to the workers asynchronously
✓ All-to-all communication model
• In Amazon SageMaker
✓ No need to set up and manage the parameter server cluster manually
✓ A built-in script mode option
https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
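The collect, aggregate, and distribute cycle of a parameter server can be sketched in a few lines. This is a single-process toy in plain Python, not TensorFlow's parameter server implementation: the class and method names are hypothetical, and "asynchronous" is modeled simply as applying each worker's gradients the moment they arrive rather than waiting for all workers.

```python
class ParameterServer:
    """Toy parameter server holding a flat list of weights."""

    def __init__(self, weights, lr):
        self.weights = list(weights)
        self.lr = lr

    def push(self, gradients):
        """Apply one worker's gradients as soon as they arrive (async SGD)."""
        self.weights = [w - self.lr * g for w, g in zip(self.weights, gradients)]

    def pull(self):
        """Return a copy of the current weights for a worker to train on."""
        return list(self.weights)


if __name__ == "__main__":
    ps = ParameterServer([1.0, 2.0], lr=0.5)
    ps.push([2.0, 0.0])  # gradients from worker 0
    ps.push([0.0, 2.0])  # gradients from worker 1, applied without waiting
    print(ps.pull())
```

Because workers do not wait for each other, a worker may compute gradients against weights that are already stale; that trade-off is the main difference from the synchronous all-reduce approach used by Horovod.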
TensorFlow distributed training in Amazon SageMaker (3/6)
Parameter servers: example code
https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
from sagemaker.tensorflow import TensorFlow

ps_instance_type = 'ml.p3.2xlarge'
ps_instance_count = 2

distributions = {
    'parameter_server': {
        'enabled': True
    }
}

hyperparameters = {'epochs': 60, 'batch-size': 256}

estimator_ps = TensorFlow(
    base_job_name='hvd-imagenet-tf',
    source_dir='code',
    entry_point='train_ps.py',
    role=role,
    framework_version='1.13',
    py_version='py3',
    hyperparameters=hyperparameters,
    train_instance_count=ps_instance_count,
    train_instance_type=ps_instance_type,
    model_dir=model_dir,
    distributions=distributions)

# start training; inputs can be in
# Amazon S3, Amazon EFS, or Amazon FSx for Lustre
estimator_ps.fit(inputs)
TensorFlow distributed training in Amazon SageMaker (4/6)
Horovod
• Amazon SageMaker automates setting up and running a Horovod cluster
• SageMaker TensorFlow container
✓ Sets up the MPI environment
✓ Runs the mpirun command to start jobs on the cluster nodes
• Consider the values of the following fields in the Estimator's distributions parameter
✓ enabled (bool): set up for executing mpirun
✓ processes_per_host (int): number of processes MPI launches on each host
✓ custom_mpi_options (str): flags added to the mpirun command run on Amazon SageMaker (for Horovod training)
https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
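The three fields listed above can be assembled and type-checked before they are handed to the Estimator. The helper below is a hypothetical convenience function, not part of the SageMaker SDK; it only builds the dictionary, using the key names given on this slide, and treats custom_mpi_options as a string of mpirun flags.

```python
def mpi_distributions(processes_per_host, custom_mpi_options=""):
    """Build the value for the Estimator's `distributions` parameter.

    processes_per_host: number of MPI processes per host (typically one per GPU)
    custom_mpi_options: extra mpirun flags as a single string
    """
    if not isinstance(processes_per_host, int) or processes_per_host < 1:
        raise ValueError("processes_per_host must be a positive integer")
    if not isinstance(custom_mpi_options, str):
        raise TypeError("custom_mpi_options must be a string of mpirun flags")
    mpi = {"enabled": True, "processes_per_host": processes_per_host}
    if custom_mpi_options:
        mpi["custom_mpi_options"] = custom_mpi_options
    return {"mpi": mpi}


if __name__ == "__main__":
    print(mpi_distributions(1, "-verbose"))
```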
TensorFlow distributed training in Amazon SageMaker (5/6)
Horovod: example code
https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
from sagemaker.tensorflow import TensorFlow

hvd_instance_type = 'ml.p3.2xlarge'
hvd_processes_per_host = 1
hvd_instance_count = 2

distributions = {
    'mpi': {
        'enabled': True,
        'processes_per_host': hvd_processes_per_host,
        # adjacent string literals are joined into one option string
        'custom_mpi_options':
            '-verbose --NCCL_DEBUG=INFO '
            '-x OMPI_MCA_btl_vader_single_copy_mechanism=none'
    }
}

hyperparameters = {'epochs': 60, 'batch-size': 256}

estimator_hvd = TensorFlow(
    base_job_name='hvd-imagenet-tf',
    source_dir='code',
    entry_point='train_hvd.py',
    role=role,
    framework_version='1.13',
    py_version='py3',
    hyperparameters=hyperparameters,
    train_instance_count=hvd_instance_count,
    train_instance_type=hvd_instance_type,
    distributions=distributions)

# start training; inputs can be in
# Amazon S3, Amazon EFS, or Amazon FSx for Lustre
estimator_hvd.fit(inputs)
TensorFlow distributed training in Amazon SageMaker (6/6)
Choose the distributed training method that fits your needs
• Scaling up on a single machine with multiple GPUs ("data parallelism")
• Scaling out with either Parameter Server or Horovod ("cluster size")

Time to share gradients                                 For higher CPU performance   For higher GPU performance
Long (larger # of gradients, bigger model size)         Parameter Server             Parameter Server, or Horovod on a single instance with multiple GPUs
Short (smaller # of gradients, smaller model size)      Parameter Server             Horovod

https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
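The selection guidance on this slide is a two-by-two lookup: how long gradient sharing takes (long or short) against whether CPU or GPU performance is the priority. A small sketch of that lookup follows; the function name and the string keys are illustrative, and the recommendations are copied from the slide rather than derived from any measurement.

```python
# (time to share gradients, performance priority) -> recommendation from the slide
_RECOMMENDATIONS = {
    ("long", "cpu"): "Parameter Server",
    ("long", "gpu"): "Parameter Server, or Horovod on a single instance with multiple GPUs",
    ("short", "cpu"): "Parameter Server",
    ("short", "gpu"): "Horovod",
}


def suggest_strategy(gradient_share_time, priority):
    """Look up the suggested distributed training method.

    gradient_share_time: "long" (many gradients, big model) or "short"
    priority: "cpu" or "gpu"
    """
    key = (gradient_share_time.lower(), priority.lower())
    if key not in _RECOMMENDATIONS:
        raise ValueError(f"unsupported combination: {key}")
    return _RECOMMENDATIONS[key]


if __name__ == "__main__":
    print(suggest_strategy("short", "gpu"))
```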
Large-scale cloud-based machine learning / deep learning training
fast.ai: Now anyone can train ImageNet in 18 minutes
fast.ai: Now anyone can train ImageNet in 18 minutes (1/5)
Summary of results
• ImageNet training results
✓ Time: 18 minutes
✓ Machines: 16 * p3.16xlarge on AWS (EC2)
✓ Compute cost: $48.00
✓ Framework: PyTorch
• Collaborators
✓ Yaroslav Bulatov
✓ Jeremy Howard
✓ Andrew Shaw
fast.ai: Now anyone can train ImageNet in 18 minutes (2/5)
How to train fast?
• Step 1
✓ Find a good baseline for a single machine
• Step 2
✓ Scale to multiple machines
[Figure from "An Analysis of Deep Neural Network Models for Practical Applications" by Alfredo Canziani, Adam Paszke, Eugenio Culurciello]
fast.ai: Now anyone can train ImageNet in 18 minutes (3/5)
Single-machine training
Trained ImageNet in 30 epochs (instead of 90); a single p3.16xlarge instance trains to 93% in 1.5 hours.
• One-cycle learning rate (Leslie Smith)
✓ Start with a relatively high learning rate
✓ 20% faster convergence on a single machine
[Figure: learning rate vs. number of steps]
• Progressive resizing for classification (fast.ai)
✓ Faster initial epochs: 2x speedup when training at 128 vs. 224
✓ More accurate final epochs: 288 images increased accuracy by 0.8%
• Rectangular image validation (fast.ai)
✓ Validate on images close to their original aspect ratio (instead of center-cropping images to 224 x 224)
✓ 23% speedup in training time to reach the benchmark accuracy of 93%
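Two of the single-machine techniques on this slide can be sketched as plain schedule functions. This is a minimal illustration, not the fast.ai implementation: the one-cycle shape is simplified to a linear ramp up followed by a linear ramp down, and the step counts, `div_factor`, and the epoch boundaries in `PHASES` are hypothetical values chosen for the example. Only the image sizes 128, 224, and 288 and the 30-epoch budget come from the slide.

```python
def one_cycle_lr(step, total_steps, up_steps, lr_max=1.0, div_factor=10.0):
    """Simplified one-cycle schedule: ramp linearly up to lr_max, then back down."""
    lr_min = lr_max / div_factor
    if step < up_steps:
        frac = step / up_steps
        return lr_min + frac * (lr_max - lr_min)
    frac = (step - up_steps) / max(1, total_steps - up_steps)
    return lr_max - frac * (lr_max - lr_min)


# Hypothetical phase boundaries (start_epoch, image_size) for a 30-epoch run.
PHASES = ((0, 128), (14, 224), (26, 288))


def image_size_for_epoch(epoch, phases=PHASES):
    """Progressive resizing: small images early, larger images in later epochs."""
    size = phases[0][1]
    for start_epoch, phase_size in phases:
        if epoch >= start_epoch:
            size = phase_size
    return size


for epoch in (0, 13, 14, 25, 26, 29):
    print(epoch, image_size_for_epoch(epoch))
for step in (0, 30, 65, 100):
    print(step, round(one_cycle_lr(step, total_steps=100, up_steps=30), 3))
```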
fast.ai: Now anyone can train ImageNet in 18 minutes (4/5)
Distributed architecture
• All-Reduce (NVIDIA NCCL*)
✓ Sync gradients after backprop
• Distributed Data Parallel (PyTorch)
✓ Optimization: overlap sync with computation
* NCCL: NVIDIA Collective Communications Library
[Figure: timelines of the Data, Forward, Backprop, and Sync stages, without and with the sync overlapped; gradients exchanged among GPU0 to GPU5, which process batch0_0 to batch0_5]
[Reference] apex.parallel.DistributedDataParallel
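The gradient sync step can be illustrated with a pure-Python simulation of a ring all-reduce, the communication pattern commonly associated with NCCL. This is a toy sketch under simplifying assumptions: each worker's gradient vector has exactly one element per worker (one chunk each), and "sending" is just a list update. It does not use NCCL or PyTorch, and the function names are hypothetical.

```python
def ring_allreduce(worker_grads):
    """Simulate a ring all-reduce; every worker ends with the element-wise sum."""
    n = len(worker_grads)
    data = [list(g) for g in worker_grads]
    if any(len(g) != n for g in data):
        raise ValueError("this sketch needs one chunk (element) per worker")

    # Phase 1: reduce-scatter. In step s, worker r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][(r - s) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] += value
    # Worker r now holds the complete sum for chunk (r + 1) % n.

    # Phase 2: all-gather. In step s, worker r forwards chunk (r + 1 - s) % n,
    # and the right neighbour overwrites its copy with the complete sum.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][(r + 1 - s) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] = value
    return data


def average_gradients(worker_grads):
    """Data-parallel sync: all-reduce the gradients, then divide by worker count."""
    n = len(worker_grads)
    return [[value / n for value in grads] for grads in ring_allreduce(worker_grads)]


grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(average_gradients(grads))
```

After the call, every worker holds the same averaged gradient, which is what lets each replica apply an identical optimizer step.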
fast.ai: Now anyone can train ImageNet in 18 minutes (5/5)
Tips and Tricks
• Scaling techniques
✓ Batch Normalization tuning and Learning Rate scaling (Goyal et al.)
✓ Increase the batch size instead of decaying the Learning Rate (Google Brain)
[Figure: plot with axes "Number of steps" and "Images/sec"]
• Other considerations
✓ Spot instances: 70% cheaper
✓ AMI: ImageNet baked into the AWS Deep Learning AMI
✓ Latency: IOPS + Placement Groups
[Figure: S3, AMI, Io2 volume (10k IOPS), P3 instance]
• Run lots of experiments
✓ On AWS you can easily build a distributed environment and run experiments:
git clone git@github.com:diux-dev/imagenet18.git
pip install -r requirements.txt
aws configure
python train.py
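The two scaling techniques named on this slide can be sketched as follows. The first function assumes the linear scaling rule with gradual warmup described by Goyal et al. (base learning rate 0.1 per 256 images, 5 warmup epochs). The second assumes the idea of growing the batch size at the points where the learning rate would otherwise be decayed. The milestone epochs and the growth factor are hypothetical values for illustration only.

```python
def scaled_lr(global_batch_size, base_lr=0.1, base_batch_size=256):
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * global_batch_size / base_batch_size


def warmup_lr(epoch, target_lr, base_lr=0.1, warmup_epochs=5):
    """Gradual warmup: ramp linearly from base_lr to target_lr, then hold."""
    if epoch >= warmup_epochs:
        return target_lr
    return base_lr + (target_lr - base_lr) * epoch / warmup_epochs


def batch_size_for_epoch(epoch, base_batch_size, milestones=(30, 60), factor=5):
    """Grow the batch size at each milestone instead of decaying the learning rate."""
    passed = sum(1 for milestone in milestones if epoch >= milestone)
    return base_batch_size * factor ** passed


target = scaled_lr(global_batch_size=8192)
print("target lr:", target)
print("warmup:", [round(warmup_lr(e, target), 2) for e in range(7)])
print("batch sizes:", [batch_size_for_epoch(e, 256) for e in (0, 30, 60)])
```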
Large-scale cloud-based machine learning/deep learning training
Distributed Training of MnasNet on AWS
Distributed Training of MnasNet on AWS (1/4)
• An automated mobile NAS* approach
• Trade-off between Accuracy and Latency
• An example of MnasNet network architecture
MnasNet
https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html
https://arxiv.org/pdf/1807.11626
https://www.youtube.com/watch?v=4uDZxefPd-I
* NAS: Neural Architecture Search
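The accuracy/latency trade-off is expressed in the MnasNet paper linked above as a single reward, ACC(m) × [LAT(m)/T]^w, where T is the target latency and w is α when the latency is within the target and β otherwise. The sketch below assumes that formulation with the paper's soft-constraint setting α = β = −0.07. The function name and the sample numbers are hypothetical.

```python
def mnasnet_reward(accuracy, latency_ms, target_ms, alpha=-0.07, beta=-0.07):
    """Multi-objective reward: accuracy scaled by a latency penalty or bonus."""
    w = alpha if latency_ms <= target_ms else beta
    return accuracy * (latency_ms / target_ms) ** w


# Hypothetical candidates: (top-1 accuracy, measured latency in ms)
candidates = {"fast": (0.73, 60.0), "on_target": (0.75, 80.0), "slow": (0.76, 120.0)}
for name, (acc, lat) in candidates.items():
    print(name, round(mnasnet_reward(acc, lat, target_ms=80.0), 4))
```

A model that exactly meets the target keeps its accuracy as the reward, a slower model is penalized, and a faster one gets a small bonus.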
Distributed Training of MnasNet on AWS (2/4)
Example run (1/2)
$ pip install ec2-cluster==0.3.1
$ ec3 create ec2cluster_p3_mnasnet_example.yaml
$ ec3 setup-horovod ec2cluster_p3_mnasnet_example.yaml
$ ec3 ssh-cmd ec2cluster_p3_mnasnet_example.yaml
...
...
$ source activate tensorflow_p36
$ mpirun -np 16 -hostfile /home/ubuntu/hostfile \
  -bind-to socket -map-by slot -mca plm_rsh_no_tree_spawn 1 \
  -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x HOROVOD_FUSION_THRESHOLD=16777216 \
  -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib \
  -x NCCL_SOCKET_IFNAME=ens5 -mca btl_tcp_if_exclude lo,docker0 -x TF_CPP_MIN_LOG_LEVEL=0 \
  python /home/ubuntu/aws-ai-optimized-models/mnasnet/mnasnet_main_hvd.py --use_tpu=False \
  --data_dir=/home/ubuntu/data --model_dir=./results_hvd \
  --train_batch_size=256 --eval_batch_size=256 \
  --train_steps=109475 --skip_host_call=False --data_format='channels_first' \
  --transport_input=False --use_horovod=True --eval_on_single_gpu=True
...
[Figure: excerpt of the cluster YAML file with the comment headers "# Define the base params", "# Naming", "# Launch Location"]
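As the earlier Horovod slide suggests ("hide mpirun args"), the long launch line can be assembled programmatically. The helper below is a hypothetical sketch that only rebuilds the argument list from flags already shown in this example run; it does not execute anything, and the function name and its parameters are assumptions made for illustration.

```python
import shlex


def build_mpirun(hostfile, num_processes, script, script_args, env=None, iface="ens5"):
    """Assemble the mpirun argument list used in the example run (not executed)."""
    cmd = [
        "mpirun", "-np", str(num_processes), "-hostfile", hostfile,
        "-bind-to", "socket", "-map-by", "slot",
        "-mca", "plm_rsh_no_tree_spawn", "1",
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        "-mca", "btl_tcp_if_exclude", "lo,docker0",
        "-x", "LD_LIBRARY_PATH", "-x", "PATH",
        "-x", f"NCCL_SOCKET_IFNAME={iface}",
    ]
    # Forward extra environment variables to every rank.
    for key, value in (env or {}).items():
        cmd += ["-x", f"{key}={value}"]
    cmd += ["python", script]
    cmd += [f"--{name}={value}" for name, value in script_args.items()]
    return cmd


cmd = build_mpirun(
    hostfile="/home/ubuntu/hostfile",
    num_processes=16,
    script="/home/ubuntu/aws-ai-optimized-models/mnasnet/mnasnet_main_hvd.py",
    script_args={"use_tpu": False, "data_dir": "/home/ubuntu/data",
                 "model_dir": "./results_hvd", "train_batch_size": 256,
                 "use_horovod": True},
    env={"HOROVOD_HIERARCHICAL_ALLREDUCE": 1,
         "HOROVOD_FUSION_THRESHOLD": 16777216,
         "NCCL_MIN_NRINGS": 4},
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Returning a list rather than a string keeps the command safe to hand to `subprocess.run` without shell quoting problems.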
Distributed Training of MnasNet on AWS (3/4)
...
I0923 16:15:22.086650 140202663954176 saver.py:1276] Restoring parameters from ./results_hvd/model.ckpt-62560
I0923 16:15:22.418808 140202663954176 session_manager.py:491] Running local_init_op.
I0923 16:15:22.426828 140202663954176 session_manager.py:493] Done running local_init_op.
I0923 16:15:47.475176 140202663954176 evaluation.py:277] Finished evaluation at 2019-09-23-16:15:47
I0923 16:15:47.475430 140202663954176 estimator.py:1979] Saving dict for global step 62560: global_step = 62560, loss = 2.1191003, top_1_accuracy = 0.74759614, top_5_accuracy = 0.9215545
I0923 16:15:47.475846 140202663954176 estimator.py:2039] Saving 'checkpoint_path' summary for global step 62560: ./results_hvd/model.ckpt-62560
I0923 16:15:47.476232 140202663954176 error_handling.py:93] evaluation_loop marked as finished
I0923 16:15:47.476345 140202663954176 mnasnet_main_hvd.py:1041] Eval results at step 62560: {'loss': 2.1191003, 'top_1_accuracy': 0.74759614, 'top_5_accuracy': 0.9215545, 'global_step': 62560}. Hvd rank 0
I0923 16:15:47.476416 140202663954176 mnasnet_main_hvd.py:1051] Finished training up to step 62560. Elapsed seconds 40649.
Example run (2/2)
• time-to-train: ≈ 11.29 hrs
• Top-1 accuracy : 74.76%
• Top-5 accuracy : 92.16%
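The three summary figures can be recomputed from the final log lines. The sketch below assumes the log format shown on this slide, with the two relevant lines copied in as strings (timestamp prefixes shortened). The helper names are hypothetical.

```python
import ast
import re

EVAL_LINE = (
    "mnasnet_main_hvd.py:1041] Eval results at step 62560: {'loss': 2.1191003, "
    "'top_1_accuracy': 0.74759614, 'top_5_accuracy': 0.9215545, "
    "'global_step': 62560}. Hvd rank 0"
)
ELAPSED_LINE = (
    "mnasnet_main_hvd.py:1051] Finished training up to step 62560. "
    "Elapsed seconds 40649."
)


def parse_eval_results(line):
    """Extract the metrics dict printed after 'Eval results at step N:'."""
    match = re.search(r"Eval results at step \d+: (\{.*?\})", line)
    return ast.literal_eval(match.group(1))


def parse_elapsed_hours(line):
    """Convert the 'Elapsed seconds N' field into hours."""
    match = re.search(r"Elapsed seconds (\d+)", line)
    return int(match.group(1)) / 3600


metrics = parse_eval_results(EVAL_LINE)
hours = parse_elapsed_hours(ELAPSED_LINE)
print(f"time-to-train: {hours:.2f} hrs")
print(f"top-1: {metrics['top_1_accuracy'] * 100:.2f}%")
print(f"top-5: {metrics['top_5_accuracy'] * 100:.2f}%")
```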
Distributed Training of MnasNet on AWS (4/4)
Performance test results (example)
Num. of instances (p3dn.24xlarge) | Time-to-train (hours) | Top-1 validation accuracy (%)
1 | 29 | 75.2
2 | 24.3 | 74.5
4 | 9.0 | 74.67
8 | 4.6 | 74.16
16 | 1.8 ~ 2.6 | 73.9 ~ 74.6
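One way to read this table is through speedup and scaling efficiency relative to the single-instance run. The sketch below uses only the times listed above. Efficiency is defined here as speedup divided by the number of instances, which is an assumption about how to summarize the table rather than a metric reported on the slide. The 16-instance row is a range, so both ends are evaluated.

```python
# Time-to-train in hours, keyed by number of p3dn.24xlarge instances.
time_to_train = {1: 29.0, 2: 24.3, 4: 9.0, 8: 4.6}
range_16 = (1.8, 2.6)  # reported as a range for 16 instances

baseline = time_to_train[1]


def speedup(hours):
    """Speedup relative to the single-instance run."""
    return baseline / hours


def efficiency(instances, hours):
    """Fraction of ideal linear scaling achieved."""
    return speedup(hours) / instances


for instances, hours in time_to_train.items():
    print(f"{instances:>2} instances: speedup {speedup(hours):5.2f}x, "
          f"efficiency {efficiency(instances, hours):.0%}")
for hours in range_16:
    print(f"16 instances ({hours} h): speedup {speedup(hours):5.2f}x, "
          f"efficiency {efficiency(16, hours):.0%}")
```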
Machines
• p3dn.24xlarge
TensorFlow
• TENSORFLOW_VERSION: v1.13.1
• HOROVOD_VERSION: 0.16.1
• CUDNN_VERSION: 7.4.2.24-1+cuda10.0
• NCCL_VERSION: 2.4.2-1+cuda10.0
• OPENMPI 4.0.0
Dataset (ImageNet)
• 1.28 million images (1,000 classes)
• 1,024 training files & 128 validation files (TFRecords)
Optimizations
• Mixed channel ordering
• Mixed XLA (for all ops except depth-wise convolution)
• LARC optimizer (LARC: Layer-wise adaptive rate control)
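The idea behind LARC can be sketched in a few lines: each layer gets a local learning rate proportional to the ratio of its weight norm to its gradient norm, and in the clipping variant that local rate is capped by the global learning rate. The code below is a simplified illustration under those assumptions (no weight decay term, plain Python lists). It is not the implementation used for the results in this deck, and the default `trust_coefficient` is an assumed value.

```python
import math


def larc_effective_lr(weights, grads, global_lr, trust_coefficient=0.02, eps=1e-8):
    """Per-layer learning rate: min(global_lr, trust * ||w|| / ||g||)."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    if w_norm == 0.0 or g_norm == 0.0:
        return global_lr  # nothing to adapt for this layer
    local_lr = trust_coefficient * w_norm / (g_norm + eps)
    return min(global_lr, local_lr)


def larc_sgd_step(weights, grads, global_lr):
    """One plain SGD update for a single layer using the LARC-adjusted rate."""
    lr = larc_effective_lr(weights, grads, global_lr)
    return [w - lr * g for w, g in zip(weights, grads)]


layer_weights = [3.0, 4.0]   # norm 5
layer_grads = [0.6, 0.8]     # norm 1
print(larc_effective_lr(layer_weights, layer_grads, global_lr=1.0))
print(larc_sgd_step(layer_weights, layer_grads, global_lr=1.0))
```

Layers with large gradients relative to their weights take smaller steps, which is what helps keep very large-batch training stable.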
Summary
• Train smart with tools that make distributed training easy
• Experiment, experiment, and experiment
Amazon EC2 + AWS DL AMI
✓ Efficient linear scalability
Amazon ECS / Amazon EKS + AWS DL Container
✓ Efficient linear scalability
✓ Flexibility
Amazon SageMaker
✓ Efficient linear scalability
✓ Flexibility
✓ Fully-managed service
Thank you
We look forward to your feedback!
#AWSDEVDAYSEOUL

[AWS Dev Day] 인공지능 / 기계 학습 | AWS 기반 기계 학습 자동화 및 최적화를 위한 실전 기법 - 남궁영환 AWS 솔루션즈 아키텍트, 김대근 AWS 솔루션즈 아키텍트

  • 1.
    AWS 기반 기계학습 자동화 및 최적화를 위한 실전 기법 남궁영환 데이터 사이언티스트 SA 아마존웹서비스 A I / M L 김대근 데이터 사이언티스트 SA 아마존웹서비스
  • 2.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Agenda • AI/ML at AWS • 클라우드 기반 대규모 머신러닝/딥러닝 트레이닝 o Part 1 § Infrastructure for ML on AWS § Horovod & TensorFlow distributed training on EC2, EKS, and SageMaker o Part 2 § on AWS § MnasNet on AWS • Summary D E V D A Y © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. fast.ai
  • 3.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved.
  • 4.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. AWS ML Stack ML FRAMEWORKS & INFRASTRUCTURE A I S E R V I C E S REKOGNITION IMAGE POLLY TRANSCRIBE TRANSLATE COMPREHEND L E X REKOGNITION VIDEO Vision Speech Language Chatbots AMAZON SAGEMAKER BUILD TRAIN FORECAST Forecasting TEXTRACT PERSONALIZE Recommendations DEPLOY Pre-built algorithms & notebooks Data labeling (GROUND TRUTH) One-click model training & tuning Optimization (N E O ) One-click deployment & hosting M L S E R V I C E S Frameworks Interfaces Infrastructure EC2 P3 & P3DN EC2 C5 FPGAs GREENGRASS ELASTIC INFERENCE Reinforcement learningAlgorithms & models (AWS MARKETPLACE FOR MACHINE LEARNING) (App developers with little knowledge of ML) (ML developers and data scientists) (ML researchers and academics) INFERENTIA : 가장 깊고 폭넓은 역량과 기술의 집약
  • 5.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Scaling TensorFlow near-linearly 256 GPUs at Amazon SageMaker 및 AWS Deep Learning AMIs 에서 사용 가능 Stock TensorFlow 65% 30 min training time AWS-Optimized TensorFlow 90% scaling efficiency with 256 GPUs 14 min https://aws.amazon.com/about-aws/whats-new/2018/11/tensorflow-scalability-to-256-gpus/ 2018
  • 6.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. 대규모 머신러닝이 중요한 이유 (1/3) - Andrew Ng How do data science techniques scale with amount of data? • 데이터 축적에 따라 모델의 성능은 지속적으로 향상 • 딥러닝 적용 사례가 다양한 분야에서 꾸준히 증가하고 있음 • 대량의 데이터 기반 ML/DL 모델 트레이닝은 많은 시간과 자원들을 필요로 함 • “분산 트레이닝” https://www.slideshare.net/ExtractConf https://eng.uber.com/horovod/ The “data parallel” approach to distributed training - Uber
  • 7.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. 대규모 머신러닝이 중요한 이유 (2/3) Scaling to Very Very Large Corpora for Natural Language Disambiguation, Banko and Brill, Microsoft Research (2001) http://www.aclweb.org/anthology/P01-1005 “These results suggest that we may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development.” 알고리즘 선정도 중요하지만 많은 양의 트레이닝 데이터의 확보가 무엇보다 중요
  • 8.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. 대규모 머신러닝이 중요한 이유 (3/3) • 공통 목표 ü 컴퓨팅, 네트워킹, 컨테이너, 분산 트레이닝 성능 튜닝, . . . ü 머신러닝 엔지니어는 선호하는 ML/DL 프레임워크를 이용하여 비즈니스 성공에 기여할 수 있는 모델 개발에 집중 • Data Management ü 데이터의 규모 ∝ 해결 과제 및 알고리즘의 복잡도 ü 데이터의 견고성(durability) 및 가용성(availability) • Distributed Computing Frameworks ü Data pipelines feature (Dask, Ray, PyToolz, ipyparallel, etc.) ü CPU ➝ GPU ➝ Multi-GPUs ➝ Multi-nodes ü TensorFlow, PyTorch, MxNet, . . . • Build Compute Clusters to fit the workload! 대규모 머신러닝은 문제 및 접근 방식에 따라 해결 방안이 매우 다양할 수 있음
  • 9.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Where to train and deploy deep learning models Amazon SageMaker Amazon Elastic Container Service for Kubernetes Amazon Elastic Container Service Amazon EC2 AWS Deep Learning AMIs AWS Deep Learning Containers “해결하려는 워크로드를 고려하여 적절한 ML/DL 모델 트레이닝 및 배포 환경을 선택합니다”
  • 10.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved.
  • 11.
    D E VD A Y © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. 클라우드 기반 대규모 머신러닝/딥러닝 트레이닝 Infrastructure for ML on AWS
  • 12.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. P3 instance 대규모 병렬 처리가 필요한 워크로드에 적합 • 기계학습 모델 트레이닝 • HPC(High Performance Computing) 시뮬레이션 • 3D 모델 렌더링 • 비디오 인코딩 최대 8 개의 NVIDIA Tesla V100 GPU • 1 PetaFLOPs 컴퓨팅 성능 (P2 인스턴스 대비 최대 14배 ↑) • 300 GB/s 의 GPU 간 통신 속도 지원 (NVLink) (P2인스턴스 대비 9배 ↑) • 모든 ML 프레임워크 및 모델 타입 지원 • 다양한 형태의 인스턴스 사용 가능 (Spot instance 사용 시 최대 70% 비용 절감 가능) P3.2xlarge 1 V100 GPU 8 vCPU 61 GB Mem P3.8xlarge 4 V100 GPU 32 vCPU 244 GB Mem P3.16xlarge 8 V100 GPU 64 vCPU 488 GB Mem 3 가지 타입 중 14 리전 https://aws.amazon.com/ko/ec2/instance-types/p3/
  • 13.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. P3dn.24xlarge instance Description P3.16xlarge P3dn.24xlarge Improvements Number and Type of GPUs 8 x NVIDIA V100 8 x NVIDIA V100 - GPU Memory 16GB/GPU 32GB/GPU 100% GPU Peer to Peer NVLink - 300 GB/s NVLink - 300 GB/s - CPU Family Broadwell Skylake w AVX512 vCPU 64 96 50% System Memory 488 GB 768 GB 57% Networking Throughput 25Gbps 100Gbps 200% EBS Throughput 14Gbps 14Gbps - Local Instance Storage No 2.0TBs NVMe SSD • 클라우드에서 사용 가능한 가장 강력한 GPU 인스턴스 • 효율적인 대규모 ML 트레이닝 및 HPC 시뮬레이션 지원 (100Gbps 네트워크 대역폭을 이용한 멀티-노드 클러스터 (32대 이상) 구성 가능) • 모델 트레이닝 및 시뮬레이션을 위한 데이터에 빠른 액세스 지원 (Amazon S3, 네트워크 기반 파일 시스템, 로컬 인스턴스 스토리지) • 대규모 ML 모델 트레이닝 및 대규모 데이터 처리 (32GB GPU 메모리를 장착한 최신 NVIDA V100 GPU) • 데이터 전처리 최적화에 적합 (96 vCPUs using AWS Custom Skylake CPUs and 768GB of System Memory) https://aws.amazon.com/ko/ec2/instance-types/p3/#Amazon_EC2_P3dn.24xlarge_Instances
  • 14.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. AWS FSx for Lustre • 머신 러닝, HPC, 동영상 처리, 금융 모델링 등을 위한 고성능 파일 시스템 • S3와 기본적으로 연동됨 • Lustre는 1 millisecond 미만의 지연 시간과 초당 수백 Gigabytes, 수백만 IOPS로 확장되는 처리량을 지원 • POSIX와 호환되므로, 특별히 추가 변경 없이 기존 Linux 기반 애플리케이션 사용 가능 • 사용한 리소스에 대해서만 비용 지불 (최소약정/선수금 없음) • 클라이언트 OS 커널 모듈 변경 작업 필요없음 (https://aws.amazon.com/ko/fsx/lustre/) Amazon FSx for Lustre
  • 15.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Infrastructure for ML on AWS (1/3) 전통적 HPC 머신러닝 클러스터 Auto Scaling BeeGFS RAM storage nodes Auto Scaling worker nodes Bastion host | BeeGFS management node | Cluster monitoring Deep Learning Placement Group Amazon EFS Deep Learning Application Stack Cluster-wide persistent storage Model parameter Object store BeeGFS RAM-based storage array Multi-node parallel Deep Learning Placement Group Amazon S3 Cloud-native 머신러닝 클러스터 AWS Batch Amazon FSx for Lustre P3 / P3dn container instances commit hydrate Lustre kernel driver Amazon ECR Multi-node TensorFlow Container Registry
  • 16.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Infrastructure for ML on AWS (2/3) Traditional AWS Deep Learning Cluster https://aws.amazon.com/ko/blogs/compute/distributed-deep-learning-made-easy/ https://github.com/aws-samples/deep-learning-models/tree/master/hpc-cluster https://github.com/awslabs/deeplearning-cfn Amazon SQS Worker Queue Amazon SQS Master Queue Internet Gateway AWS Lambda Amazon SNS Auto Scaling Group Auto Scaling Group VPC Public subnet Private subnet AWS Elastic File System EC2 Master Instance EC2 Workers Public: 203.0.113.0 Private: 10.0.0.1 Workers 10.0.1.1 10.0.1.2 10.0.1.3 AWS Cloud Default VPC: 10.0.0.0/16 NAT Gateway Private Subnet 10.0.1.0/16 Worker setup Public Subnet 10.0.0.0/24 Auto Scaling Setup Complete Internet Router Amazon S3
  • 17.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Infrastructure for ML on AWS (3/3) Cloud-native AWS Deep Learning Cluster https://aws.amazon.com/ko/blogs/compute/scalable-deep-learning-training-using-multi-node-parallel-jobs-with-aws-batch-and-amazon-fsx-for-lustre/ Amazon CloudWatch Amazon Glacier AWS Cloud Training Output bucket AWS Step Functions workflow Event trigger TFRecord Input bucket TensorFlow Container Registry Multi-node Parallel Job NVIDIA GPU-backed running containers FSx for Lustre AWS Batch
  • 18.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. 클라우드 기반 대규모 머신러닝/딥러닝 트레이닝 with Horovod & TensorFlow D E V D A Y
  • 19.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Horovod (1/9) • 분산 딥러닝을 위한 오픈 소스 프레임워크 • Stock TensorFlow, Keras, PyTorch 등과 연동하여 동작 • 쉽고 간단한 설치 `pip install horovod` • 고급 알고리즘 사용 가능 • High-Performance 네트워크 (RDMA, GPUDirect) 지원 • ML 엔지니어와 인프라를 분리 ü 인프라팀은 컨테이너 및 MPI 환경을 제공 ü ML 엔지니어는 선호하는 딥러닝 프레임워크 사용 ü 프레임워크 상에서 분산 트레이닝에 대한 공통 기대치 (인프라팀 & ML 엔지니어) horovod.ai https://eng.uber.com/horovod/
  • 20.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Horovod (2/9) • Ring-AllReduce ü 데이터의 규모 ∝ 클러스터 노드의 개수 • Synchronous updates • NVIDIA’s NCCL library (for GPU-level communication) • Configurations ü Sing-ring NCCL vs. Hierarchical AllReduce HOROVOD_HIERARCHICAL_ALLREDUCE=1 ü Tensor Fusion HOROVOD_FUSION_THRESHOLD=67108864 HOROVOD_CYCLE_TIME=5 ü FP16 all-reduce hvd.DistributedOptimizer(...,compression=hvd.Compression.fp16) https://eng.uber.com/horovod/ Worker A 5 13 8 19 42 1 Worker C 9 27 3 15 8 4 Worker B 8 11 4 2 7 7 Worker A 5 13 8 19 50 5 Worker C 9 27 7 17 8 4 Worker B 13 24 4 2 7 7 Worker A 5 13 15 36 50 5 Worker C 22 51 7 17 8 4 Worker B 13 24 4 2 57 12 Worker A 22 51 15 36 50 5 Worker C 22 51 7 17 57 12 Worker B 13 24 15 36 57 12 Worker A 22 51 15 36 57 12 Worker C 22 51 15 36 57 12 Worker B 22 51 15 36 57 12
  • 21.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Horovod (3/9) 2. 사용할 GPU 세팅 config = tf.ConfigProto() config.gpu_options.visible_device_list = str(hvd.local_rank()) 3. Learning Rate 조정 및 Horovod 분산 Optimizer 추가 opt = tf.train.MomentumOptimizer( lr=0.01 * hvd.size()) opt = hvd.DistributedOptimizer(opt) 4. Synchronize initial state between workers hooks = [hvd.BroadcastGlobalVariablesHook(0)] with tf.train.MonitoredTrainingSession(hooks=hooks,...) as mon_sess: ... # OR bcast_op = hvd.broadcast_global_variables(0) sess.run(bcast_op) 5. Use checkpoints only on the first worker ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir, …) as mon_sess: ... 1. 라이브러리 초기화 import horovod.tensorflow as hvd hvd.init() * Horovod for TensorFlow, Keras, and PyTorch import horovod.tensorflow as hvd import horovod.keras as hvd import horovod.tensorflow.keras as hvd import horovod.torch as hvd # more frameworks coming ( source code from https://github.com/horovod/horovod )
  • 22.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Horovod (4/9) 실행 예 # Use AWS Deep Learning AMI laptop$ ssh ubuntu@<aws-ip-1> aws-ip-1$ source activate tensorflow_p27 aws-ip-1$ ssh-keygen aws-ip-1$ cat /home/ubuntu/.ssh/id_rsa.pub [copy contents of the pubkey] aws-ip-1$ exit laptop$ ssh ubuntu@<aws-ip-2> aws-ip-2$ source activate tensorflow_p27 aws-ip-2$ cat >> /home/ubuntu/.ssh/authorized_keys [paste contents of the pubkey] aws-ip-2$ exit laptop$ ssh ubuntu@<aws-ip-1> aws-ip-2$ ssh aws-ip-2 [will ask for prompt, say yes] aws-ip-2$ exit aws-ip-1$ mpirun -np 2 -H aws-ip-1,aws-ip-2 wget https://raw.githubusercontent.com/uber/horovod /master/examples/tensorflow_mnist.py aws-ip-1$ mpirun -bind-to none -map-by slot -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x LD_LIBRARY_PATH -x PATH -mca btl_tcp_if_exclude lo,docker0 –np 16 -H aws-ip-1:8,aws-ip-2:8 python tensorflow_mnist.py # Pro tip: hide mpirun args into mpirun.sh aws-ip-1$ mpirun.sh –np 16 –H aws-ip-1:8,aws-ip-2:8 python tensorflow_mnist.py ( source code from https://github.com/horovod/horovod )
  • 23.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Horovod (5/9) import tensorflow as tf import horovod.tensorflow as hvd # Initialize Horovod hvd.init() # Pin GPU to be used to # process local rank (one GPU per process) config = tf.ConfigProto() config.gpu_options.visible_device_list = str(hvd.local_rank()) # Build model... loss = ... opt = tf.train.MomentumOptimizer( lr=0.01 * hvd.size()) # Add Horovod Distributed Optimizer opt = hvd.DistributedOptimizer(opt) # Add hook to synchronize initial state hooks =[hvd.BroadcastGlobalVariablesHook(0)] # Only checkpoint on rank 0 ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None # Make training operation train_op = opt.minimize(loss) # The MonitoredTrainingSession takes care of # session initialization, restoring from a # checkpoint, saving to a checkpoint, and # closing when done or an error occurs. with tf.train.MonitoredTrainingSession(checkpoint _dir=ckpt_dir, config=config, hooks=hooks) as mon _sess: while not mon_sess.should_stop(): # Perform synchronous training mon_sess.run(train_op) [참고] 예제 코드 – Horovod for TensorFlow ( source code from https://github.com/horovod/horovod )
  • 24.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Horovod (6/9) [참고] 예제 코드 – Estimator API import tensorflow as tf import horovod.tensorflow as hvd # Initialize Horovod hvd.init() # Pin GPU to be used config = tf.ConfigProto() config.gpu_options.visible_device_list = str(hvd.local_rank()) # Build model... def model_fn(features, labels, mode): loss = ... opt = tf.train.MomentumOptimizer( lr=0.01 * hvd.size()) # Add Horovod Distributed Optimizer opt = hvd.DistributedOptimizer(opt) return tf.estimator.EstimatorSpec(...) # Broadcast initial variable state. hooks = [hvd.BroadcastGlobalVariablesHook(0)] # Only checkpoint on rank 0 ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None # Create the Estimator mnist_classifier = tf.estimator.Estimator( model_fn=cnn_model_fn, model_dir=ckpt_dir, config=tf.estimator.RunConfig( session_config=config)) mnist_classifier.train( input_fn=train_input_fn, steps=100, hooks=hooks) ( source code from https://github.com/horovod/horovod )
  • 25.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Horovod (7/9) import mxnet as mx import horovod.mxnet as hvd from mxnet import autograd # Initialize Horovod hvd.init() # Pin GPU to be used to process local rank context = mx.gpu(hvd.local_rank()) num_workers = hvd.size() # Build model model = ... model.hybridize() # Create optimizer optimizer_params = ... opt = mx.optimizer.create('sgd', **optimizer_params) # Initialize parameters model.initialize(initializer, ctx=context) # Fetch and broadcast parameters params = model.collect_params() if params is not None: hvd.broadcast_parameters(params, root_rank=0) # Create DistributedTrainer, a subclass of gluon.Trainer trainer = hvd.DistributedTrainer(params, opt) # Create loss function loss_fn = ... # Train model for epoch in range(num_epoch): train_data.reset() for nbatch, batch in enumerate(train_data, start=1): data = batch.data[0].as_in_context(context) label = batch.label[0].as_in_context(context) with autograd.record(): output = model(data.astype(dtype, copy=False)) loss = loss_fn(output, label) loss.backward() trainer.step(batch_size) [참고] 예제 코드 – Horovod for MxNet ( source code from https://github.com/horovod/horovod )
  • 26.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Horovod (8/9) [참고] 예제 코드 – Horovod for Keras import keras from keras import backend as K import tensorflow as tf import horovod.keras as hvd # Initialize Horovod hvd.init() # Pin GPU to be used config = tf.ConfigProto() config.gpu_options.visible_device_list = str(hvd.local_rank()) K.set_session(tf.Session(config=config)) # Build model... model = ... opt = keras.optimizers.Adadelta(lr=1.0 * hvd.size()) # Add Horovod Distributed Optimizer. opt = hvd.DistributedOptimizer(opt) model.compile( loss='categorical_crossentropy’, optimizer=opt, metrics=['accuracy']) # Broadcast initial variable state. callbacks = [hvd.callbacks.BroadcastGlobalVariabl esCallback(0)] ... model.fit( x_train, y_train, callbacks=callbacks, epochs=10, validation_data=(x_test, y_test)) ( source code from https://github.com/horovod/horovod )
  • 27.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Horovod (9/9) [참고] 예제 코드 – Horovod for PyTorch import torch import horovod.torch as hvd # Initialize Horovod hvd.init() # Horovod: pin GPU to local rank torch.cuda.set_device(hvd.local_rank()) # Build model... model = Net() model.cuda() optimizer = optim.SGD(model.parameters()) # Wrap optimizer with DistributedOptimizer optimizer = hvd.DistributedOptimizer( optimizer, named_parameters=model.named_parameters()) # Horovod: broadcast parameters hvd.broadcast_parameters( model.state_dict(), root_rank=0) for epoch in range(100): for batch_idx, (data, target) in ...: optimizer.zero_grad() output = model(data) loss = F.nll_loss(output, target) loss.backward() optimizer.step() ( source code from https://github.com/horovod/horovod )
  • 28.
    클라우드 기반 대규모머신러닝/딥러닝 트레이닝 Scalable multi-node training (EC2) © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. D E V D A Y
  • 29.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Scaling performance using distributed training TensorFlow & Horovod on Amazon EC2 • TFRecord 변환 전용 인스턴스로 전처리 수행 ü t2.large Instance with 1.0 TB EBS sc1 Volume ü Download ImageNet dataset ü Transform the raw dataset with TFRecord ü Upload the transformed dataset to the Amazon S3 nohup aws s3 sync /data s3://YOUR_BUCKET_NAME >& upload.log & • Setting up all the EC2 instances having the same type of instances, AMI, the path of data, and the path of models • Need to check the utilization of GPUs on P3dn.24xlarge (and/or P3.16xlarge) ImageNet을 이용한 트레이닝 실행 예시
  • 30.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Scaling performance using distributed training TensorFlow & Horovod on Amazon EC2 ImageNet을 이용한 트레이닝 실행 예시 • Time-to-train: around 45 mins • 8 * P3dn.24xlarge instances • ML Models: ResNet-50 • Top-1 Validation Accuracy : 75.59 % https://docs.aws.amazon.com/ko_kr/dlami/latest/devguide/tutorial-horovod-tensorflow.html
  • 31.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Scaling performance using distributed training TensorFlow & Horovod on Amazon EC2 https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-deep-learning-training-using-gpus-in-the-aws-cloud/ • 8 * P3.16xlarge instances • DL Framework: TensorFlow, MxNet • ML model: ResNet-50 • Dataset: ImageNet (1.2 millions of images) • Top-1 validation accuracy: 76% - 5,000 10,000 15,000 20,000 25,000 30,000 35,000 40,000 45,000 50,000 1 2 4 8 16 32 64 Images/Second Number of GPUs time-to-train: 47 min ~ 50 min Training using P3 instances (ResNet-50 & ImageNet) 구성 정보
  • 32.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Scaling performance using distributed training TensorFlow & Horovod on Amazon EC2 https://aws.amazon.com/ko/blogs/machine-learning/scalable-multi-node-training-with-tensorflow/ • 32 * P3.16xlarge instances • DL Framework: TensorFlow • ML model: ResNet-50 • Dataset: ImageNet • Top-1 validation accuracy 75.4% • Top-5 validation accuracy 92.6% time-to-train: 14.6 min Training performance w.r.t. TensorFlow & CUDA (ResNet-50 & ImageNet) (Images/sec) Time to train vs Number of GPUs vs Images/sec, efficiency, and communication overhead 구성 정보
  • 33.
    클라우드 기반 대규모머신러닝/딥러닝 트레이닝 Amazon EKS 기반 분산 딥러닝 성능 최적화 © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. D E V D A Y
  • 34.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Amazon EKS 기반 분산 딥러닝 성능 최적화 (1/11) [참고] Modular and Scalable Amazon EKS Architecture https://aws.amazon.com/ko/quickstart/architecture/amazon-eks/
  • 35.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Amazon EKS 기반 분산 딥러닝 성능 최적화 (2/11) • STEP 1. Install Kubeflow to setup a cluster for distributed training • STEP 2. Set the app name and initialize it. • STEP 3. Install mpi-operator from kubeflow • STEP 4. Create a MPI Job template, define the number of nodes (replicas), number of GPUs each node has (gpusPerReplica) • STEP 5. Apply the manifest to the default environment. The MPI Job will create a launch pod Using Horovod in Amazon EKS https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-eks-tutorials-distributed-gpu-training.html
  • 36.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Amazon EKS 기반 분산 딥러닝 성능 최적화 (3/11) EKS Deep Learning Benchmark Utility https://github.com/aws-samples/aws-eks-deep-learning-benchmark • 클러스터 생성부터 종료까지 자동화된 벤치마크 워크플로 제공 • 다양한 백엔드 스토리지 시스템 지원 (예: Amazon EFS, Amazon FSx for Lustre) • S3와 연동하여 환경설정 정보 및 결과 저장 • Backed by kubeflow operators and kubebench. • 다양한 딥러닝 프레임워크 지원 (TF, TF + Horovod + OpenMPI, PyTorch, MxNet) • 사용자의 요구사항에 맞는 Kubernetes 클러스터 환경 설정 지원 • 중간 결과 저장 및 자동 클러스터 종료 기능 • 동시에 여러 실험을 병렬로 진행 가능
  • 37.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Amazon EKS 기반 분산 딥러닝 성능 최적화 (4/11) EKS Deep Learning Benchmark Utility https://github.com/aws-samples/aws-eks-deep-learning-benchmark • Setup NFS • Install Argo Workflow • Configure AWS credentials • Conifgure your GitHub token • Setup S3 buckets for your benchmark results and your training data • Configure your Kubernetes cluster kubectl create -f deploy/benchmark-nfs-svc.yaml kubectl get svc benchmark-nfs-svc -o=jsonpath={.spec.clusterIP} # Replace ip in the `deploy/benchmark-nfs-volume.yaml` before following step kubectl create -f deploy/benchmark-nfs-volume.yaml kubectl create ns argo kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/v2.2.1/manifests/install.yaml # you can forward port to localhost and look at Argo UI kubectl port-forward deployment/argo-ui 8001:8001 -n argo
  • 38.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Amazon EKS 기반 분산 딥러닝 성능 최적화 (5/11) EKS Deep Learning Benchmark Utility https://github.com/aws-samples/aws-eks-deep-learning-benchmark • Run the benchmmark jobs s3ResultPath: 's3://kubeflow-pipeline-data/benchmark/', s3DatasetPath: 's3://eks-dl-benchmark/imagenet/', clusterConfig: 's3://kubeflow-pipeline-data/benchmark/cluster_config.yaml', experiments: [{ experiment: 'experiment-20190415-01', trainingJobConfig: 's3://kubeflow-pipeline-data/benchmark/mpi-job-imagenet.yaml', trainingJobPkg: 'mpi-job', trainingJobPrototype: 'mpi-job-custom', // Change to upstream once https://github.com/kubeflow/kubeflow/pull/3062 is merged trainingJobRegistry: 'github.com/jeffwan/kubeflow/tree/make_kubebench_reporter_optional/kubeflow', }], githubSecretName: 'github-token', githubSecretTokenKeyName: 'GITHUB_TOKEN', image: 'seedjeffwan/benchmark-runner:20190424', name: '20190424-00', namespace: 'default', nfsVolume: 'benchmark-pv', nfsVolumeClaim: 'benchmark-pvc', region: 'us-west-2', trainingDatasetVolume: 'dataset-claim', s3SecretName: 'aws-secret', s3SecretAccesskeyidKeyName: 'AWS_ACCESS_KEY_ID', s3SecretSecretaccesskeyKeyName: 'AWS_SECRET_ACCESS_KEY', storageBackend: 'fsx', kubeflowRegistry: 'github.com/jeffwan/kubeflow/tree/make_kubebench_reporter_optional/kubeflow' 1. Update your workflow setting using ks command 2. Update benchmark workflow manifest directly
  • 39.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Amazon EKS 기반 분산 딥러닝 성능 최적화 (6/11) EKS Deep Learning Benchmark Utility https://github.com/aws-samples/aws-eks-deep-learning-benchmark
  • 40.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Amazon EKS 기반 분산 딥러닝 성능 최적화 (7/11) • Kubernetes ü 컨테이너 기반의 다양한 ML/DL 프레임워크 지원 ü 탄력성 및 손쉬운 확장성 지원 ü Deep Neural Network 트레이닝 환경으로서 지속적으로 확산 중 • Amazon EKS ü 완전 관리형 Kubernetes 서비스 ü EC2 P2, P3 인스턴스 상에서 Kubernetes 워크로드의 손쉬운 실행 • Kubeflow ü 머신러닝 워크로드 효율적인 개발, 관리, 배포 등을 지원하는 Kubernetes-native 플랫폼 ü 분산 트레이닝 지원 (native TensorFlow architecture or MPI AllReduce (NVIDIA NCCL library or Horovod)) https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/ ( Amazon EKS + Kubeflow + AWS FSx CSI driver )
  • 41.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Amazon EKS 기반 분산 딥러닝 성능 최적화 (8/11) • Amazon FSx for Lustre ü High Performance 파일 시스템 ü 빠른 처리를 요구하는 워크로드에 최적 (예: 머신러닝, HPC ) ü Amazon S3와 연동, 통합 지원 • AWS FSx CSI driver ü Kubernetes-native 형태로 컨테이너에서 FSx for Lustre 파일시스템 이용 가능 ü Static/Dynamic volume provisioning ü Containers from multiple nodes within a cluster (connected to the same Lustre filesystem) ü Lustre 데이터 저장소로 S3 사용 가능 https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/ ( Amazon EKS + Kubeflow + AWS FSx CSI driver )
  • 42.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Amazon EKS 기반 분산 딥러닝 성능 최적화 (9/11) https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/ ( Amazon EKS + Kubeflow + AWS FSx CSI driver ) Machines • 20 * p3.16xlarge (mixed precision) Amazon EKS-optimized AMI with GPU support • Kubernetes v1.11.8 • MPI Operator Alpha from Kubeflow 0.4.1 • CUDA 10 with NVIDIA Tesla 410.104 driver • Docker 18.06.1-ce (incl. nvidia-docker2) AWS FSx for Lustre filesystem • FSx CSI Driver v0.1 • Hydrated from an S3 bucket (for ImageNet TFRecords) TensorFlow (customized image) • TENSORFLOW_VERSION: v1.13.1 • HOROVOD_VERSION: 0.16.0 • CUDNN_VERSION: 7.4.2.24-1+cuda10.0 • NCCL_VERSION: 2.4.2-1+cuda10.0 • OPENMPI 4.0.0 Dataset (ImageNet) • 1.28 millions of images (1000 classes) • 1024 training files & 128 validation files (TFRecords) Relevant tools • awscli, eksctl, ksonnet, and aws-iam-authenticator “90%-100%의 near-linear scaling performance를 확인”
  • 43.
    © 2019, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Amazon EKS 기반 분산 딥러닝 성능 최적화 (10/11) https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/ ( Amazon EKS + Kubeflow + AWS FSx CSI driver ) • 성능 최적화를 위한 체크리스트 (Part #1) ü 최신 딥러닝 툴킷 사용 (예: AMI for EKS) ü GPU clock speed를 최대값으로 설정 (참고: bootstrap command) ü Placement Group 내에 인스턴스 생성 (낮은 지연시간) ü AWS VPC CNI 플러그인 (최신버전)을 사용 (모든 NIC들이 EKS 클러스터 상에서 기본적으로 Jumbo Frame을 사용하도록)
  • 44.
    Optimizing distributed deep learning performance on Amazon EKS (11/11)
    • Checklist for performance optimization (Part #2)
      ✓ Choose an appropriate storage backend (EBS, EFS, FSx for Lustre, etc.)
      ✓ Use the static Kubernetes CPU management policy
      ✓ MPI processor
      ✓ Build TensorFlow with Intel MKL-DNN to optimize CPU performance
      ✓ Tune TensorFlow to parallelize the data transformation process and threads
      ✓ Adjust thread pools and tune CPU performance
    Source: https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/ (Amazon EKS + Kubeflow + AWS FSx CSI driver)
  • 45.
    Large-scale machine learning/deep learning training in the cloud
    TensorFlow distributed training on Amazon SageMaker
  • 46.
    TensorFlow distributed training in Amazon SageMaker (1/6)
    Amazon SageMaker
    • Amazon SageMaker provides prebuilt TensorFlow containers (TensorFlow v1.11+)
    • Configure the hardware resources and hyperparameters for ML model training
    • Training instances: cost-effective, automated clusters for ML model training
    • Approaches for distributed training
      ✓ TensorFlow's native parameter server (TF v1.11+)
      ✓ Horovod (TF v1.12+)
    Source: https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
  • 47.
    TensorFlow distributed training in Amazon SageMaker (2/6)
    Parameter servers
    • Multiple dedicated processes to
      ✓ Collect gradients (computed by "worker" processes)
      ✓ Aggregate gradients
      ✓ Distribute the updated gradients back to the workers asynchronously
      ✓ All-to-all communication model
    • In Amazon SageMaker
      ✓ No need to set up and manage the parameter server cluster manually
      ✓ A built-in script mode option
    Source: https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
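The collect, aggregate, and distribute steps listed on this slide can be sketched with a toy, single-process parameter server. This is an illustration only: it is synchronous for simplicity (the slide describes asynchronous distribution), it applies a plain SGD step and hands back updated parameters as one common formulation, and the class name, function name, and numbers are all hypothetical rather than taken from the talk or from SageMaker.

```python
def aggregate_gradients(worker_gradients):
    """Average per-parameter gradients collected from all workers."""
    num_workers = len(worker_gradients)
    if num_workers == 0:
        raise ValueError("need gradients from at least one worker")
    num_params = len(worker_gradients[0])
    return [
        sum(grads[i] for grads in worker_gradients) / num_workers
        for i in range(num_params)
    ]


class ToyParameterServer:
    """Holds the shared parameters and applies aggregated gradients (plain SGD)."""

    def __init__(self, initial_params, learning_rate):
        self.params = list(initial_params)
        self.learning_rate = learning_rate

    def step(self, worker_gradients):
        # 1) collect + 2) aggregate
        averaged = aggregate_gradients(worker_gradients)
        # 3) update the shared state and return what each worker would pull back
        self.params = [p - self.learning_rate * g for p, g in zip(self.params, averaged)]
        return list(self.params)


# Hypothetical values: two workers, two parameters.
server = ToyParameterServer(initial_params=[1.0, -2.0], learning_rate=0.5)
updated = server.step([[0.2, 0.4], [0.6, 0.0]])
print(updated)
```

In a real cluster the workers and servers are separate processes on separate hosts, which is the part SageMaker sets up when the parameter server option is enabled.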
  • 48.
    TensorFlow distributed training in Amazon SageMaker (3/6)
    Parameter servers: example code
        from sagemaker.tensorflow import TensorFlow

        # `role`, `model_dir`, and `inputs` are assumed to be defined earlier
        ps_instance_type = 'ml.p3.2xlarge'
        ps_instance_count = 2

        # enable TensorFlow's native parameter server
        distributions = {
            'parameter_server': {
                'enabled': True
            }
        }
        hyperparameters = {'epochs': 60, 'batch-size': 256}

        estimator_ps = TensorFlow(
            base_job_name='ps-imagenet-tf',
            source_dir='code',
            entry_point='train_ps.py',
            role=role,
            framework_version='1.13',
            py_version='py3',
            hyperparameters=hyperparameters,
            train_instance_count=ps_instance_count,
            train_instance_type=ps_instance_type,
            model_dir=model_dir,
            distributions=distributions)

        # start training; inputs can be in
        # Amazon S3, Amazon EFS, or Amazon FSx for Lustre
        estimator_ps.fit(inputs)
    Source: https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
  • 49.
    TensorFlow distributed training in Amazon SageMaker (4/6)
    Horovod
    • Amazon SageMaker automates setting up and running a Horovod cluster
    • SageMaker TensorFlow container
      ✓ Sets up the MPI environment
      ✓ Runs the mpirun command to start jobs on the cluster nodes
    • Consider the values of the following fields in the Estimator's distributions parameter
      ✓ enabled (bool): set up for executing mpirun
      ✓ processes_per_host (int): number of processes MPI launches on each host
      ✓ custom_mpi_options (str): flags added to the mpirun command that runs on Amazon SageMaker (for Horovod training)
    Source: https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
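As a rough sketch of how the three fields fit together, the helper below builds the dictionary passed as `distributions` and computes the resulting number of MPI processes. The helper names are hypothetical (they are not part of the SageMaker SDK), the type checks reflect that the deck's own Horovod example passes `custom_mpi_options` as a string, and the choice of 8 processes per host assumes one Horovod process per GPU on an 8-GPU instance.

```python
def build_mpi_distribution(processes_per_host, custom_mpi_options=""):
    """Build the dict passed as the estimator's `distributions` argument (hypothetical helper)."""
    if not isinstance(processes_per_host, int) or processes_per_host < 1:
        raise ValueError("processes_per_host must be a positive integer")
    if not isinstance(custom_mpi_options, str):
        raise TypeError("custom_mpi_options must be a string of mpirun flags")
    mpi = {"enabled": True, "processes_per_host": processes_per_host}
    if custom_mpi_options:
        mpi["custom_mpi_options"] = custom_mpi_options
    return {"mpi": mpi}


def total_mpi_processes(instance_count, processes_per_host):
    """Total number of MPI processes across the training cluster."""
    return instance_count * processes_per_host


# Assumption: 2 instances with 8 GPUs each, one process per GPU.
distributions = build_mpi_distribution(processes_per_host=8, custom_mpi_options="-verbose")
world_size = total_mpi_processes(instance_count=2, processes_per_host=8)
print(distributions, world_size)
```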
  • 50.
    TensorFlow distributed training in Amazon SageMaker (5/6)
    Horovod: example code
        from sagemaker.tensorflow import TensorFlow

        # `role` and `inputs` are assumed to be defined earlier
        hvd_instance_type = 'ml.p3.2xlarge'
        hvd_processes_per_host = 1
        hvd_instance_count = 2

        # enable MPI so the container launches Horovod via mpirun
        distributions = {
            'mpi': {
                'enabled': True,
                'processes_per_host': hvd_processes_per_host,
                'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none'
            }
        }
        hyperparameters = {'epochs': 60, 'batch-size': 256}

        estimator_hvd = TensorFlow(
            base_job_name='hvd-imagenet-tf',
            source_dir='code',
            entry_point='train_hvd.py',
            role=role,
            framework_version='1.13',
            py_version='py3',
            hyperparameters=hyperparameters,
            train_instance_count=hvd_instance_count,
            train_instance_type=hvd_instance_type,
            distributions=distributions)

        # start training; inputs can be in
        # Amazon S3, Amazon EFS, or Amazon FSx for Lustre
        estimator_hvd.fit(inputs)
    Source: https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
  • 51.
    TensorFlow distributed training in Amazon SageMaker (6/6)
    • Scaling up on a single machine with multiple GPUs ("data parallelism")
    • Scaling out with either parameter servers or Horovod ("cluster size")
    Choose the distributed training method that fits your needs:
    Time to share gradients                              | Higher CPU performance wanted | Higher GPU performance wanted
    Long (larger # of gradients, bigger model size)      | Parameter server              | Parameter server, or Horovod on a single instance with multiple GPUs
    Short (smaller # of gradients, lesser model size)    | Parameter server              | Horovod
    Source: https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/
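The selection matrix on this slide can be written down as a small lookup, which makes the four cases explicit. The function name and the string keys ("long"/"short", "cpu"/"gpu") are hypothetical labels chosen for this sketch; the recommendations themselves are the ones shown on the slide.

```python
# Recommendation by (time to share gradients, which processor performance matters most).
STRATEGY_TABLE = {
    ("long", "cpu"): "Parameter server",
    ("long", "gpu"): "Parameter server, or Horovod on a single instance with multiple GPUs",
    ("short", "cpu"): "Parameter server",
    ("short", "gpu"): "Horovod",
}


def choose_strategy(gradient_share_time, target):
    """Look up the slide's recommendation for one cell of the matrix."""
    key = (gradient_share_time.lower(), target.lower())
    if key not in STRATEGY_TABLE:
        raise ValueError(f"unknown combination: {key}")
    return STRATEGY_TABLE[key]


print(choose_strategy("short", "gpu"))
```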
  • 53.
    Large-scale machine learning/deep learning training in the cloud
    fast.ai: Now anyone can train ImageNet in 18 minutes
  • 54.
    fast.ai: Now anyone can train ImageNet in 18 minutes (1/5)
    The summary of results
    • ImageNet training results
      ✓ Time: 18 minutes
      ✓ Machines: 16 × p3.16xlarge on AWS (EC2)
      ✓ Compute cost: $48.00
      ✓ PyTorch
    • Collaborators
      ✓ Yaroslav Bulatov
      ✓ Jeremy Howard
      ✓ Andrew Shaw
  • 55.
    fast.ai: Now anyone can train ImageNet in 18 minutes (2/5)
    How to train fast?
    • Step 1
      ✓ Find a good baseline for a single machine
    • Step 2
      ✓ Scale to multiple machines
    [Figure from: "An analysis of Deep Neural Network models for practical applications" by Alfredo Canziani, Adam Paszke, Eugenio Culurciello]
  • 56.
    fast.ai: Now anyone can train ImageNet in 18 minutes (3/5)
    Single-machine training
    Trained ImageNet in 30 epochs (instead of 90)
    A single p3.16xlarge instance trains to 93% in 1.5 hours
    One-cycle learning rate (Leslie Smith)
    • Start with a relatively high learning rate
    • 20% faster convergence on a single machine
    [Figure: learning rate vs. number of steps]
    Progressive resizing for classification (fast.ai)
    • Faster initial epochs: 2x speedup when training at 128 vs. 224
    • More accurate final epochs: 288 images increased accuracy by 0.8%
    Rectangular image validation (fast.ai)
    • Validate on images close to their original aspect ratio (instead of center-cropping images to 224 x 224)
    • 23% speedup of the training time needed to reach the benchmark accuracy of 93%
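To make the one-cycle idea on this slide concrete, here is a minimal sketch of a piecewise-linear schedule that ramps the learning rate up to a peak and then decays it. It is a simplified variant for illustration, not the exact schedule used by fast.ai, and the function name, the 30% warm-up fraction, and all learning-rate values are hypothetical.

```python
def one_cycle_lr(step, total_steps, lr_start, lr_max, lr_end, warmup_fraction=0.3):
    """Piecewise-linear one-cycle schedule: ramp up to lr_max, then decay to lr_end."""
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    peak_step = max(1, int(total_steps * warmup_fraction))
    if step <= peak_step:
        # warm-up phase: linear increase from lr_start to lr_max
        return lr_start + (lr_max - lr_start) * step / peak_step
    # annealing phase: linear decrease from lr_max to lr_end at the last step
    remaining = total_steps - 1 - peak_step
    return lr_max + (lr_end - lr_max) * (step - peak_step) / remaining


# Hypothetical settings: 100 steps, peak reached at step 30.
schedule = [one_cycle_lr(s, 100, lr_start=0.1, lr_max=1.0, lr_end=0.01) for s in range(100)]
print(schedule[0], max(schedule), schedule[-1])
```

In a training loop the returned value would be assigned to the optimizer's learning rate before each step.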
  • 57.
    fast.ai: Now anyone can train ImageNet in 18 minutes (4/5)
    Distributed architecture
    • All-Reduce with NVIDIA NCCL*
      ✓ Sync gradients after backprop
    • Distributed Data Parallel in PyTorch
      ✓ Optimization: overlap sync with computation
    [Diagram: timelines of Data, Forward, Backprop, and Sync stages with gradients exchanged among GPU0 to GPU5, each holding one batch shard (batch0_0 to batch0_5)]
    [Reference] apex.parallel.DistributedDataParallel
    * NCCL: NVIDIA Collective Communications Library
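The gradient sync step can be illustrated with a toy ring all-reduce, one of the collective algorithms commonly used for this purpose. This is a single-process simulation, not NCCL or PyTorch code: the function name is hypothetical, and to keep it short each worker's gradient vector has exactly one element per worker (one "chunk" per worker).

```python
def ring_all_reduce_sum(worker_vectors):
    """Simulate ring all-reduce (sum) where vector length equals the number of workers."""
    n = len(worker_vectors)
    if n == 0 or any(len(v) != n for v in worker_vectors):
        raise ValueError("need n workers, each holding a vector of length n")
    data = [list(v) for v in worker_vectors]

    # Phase 1 (reduce-scatter): after n-1 steps, worker r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] += value

    # Phase 2 (all-gather): completed chunks travel around the ring and overwrite partial ones.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] = value
    return data


# Hypothetical gradients from three workers.
summed = ring_all_reduce_sum([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]])
averaged = [[g / 3 for g in vec] for vec in summed]
print(summed[0])
```

Every worker ends up with the same summed vector, which is then divided by the worker count to obtain the averaged gradient applied in the optimizer step.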
  • 58.
    fast.ai: Now anyone can train ImageNet in 18 minutes (5/5)
    Tips and Tricks
    Scaling techniques
    • Batch Normalization tuning and learning rate scaling (by Goyal)
    • Increase the batch size instead of decaying the learning rate (by Google Brain)
    Other considerations
    • Spot instances: 70% cheaper
    • AMI: ImageNet baked into the AWS Deep Learning AMI
    • Latency: IOPS + placement groups
    Run lots of experiments
    • On AWS, you can easily build a distributed environment and run experiments.
        git clone git@github.com:diux-dev/imagenet18.git
        cd imagenet18
        pip install -r requirements.txt
        aws configure
        python train.py
    [Figure: images/sec vs. number of steps]
    [Diagram: S3, AMI, Io2 volume, P3 instance, 10k IOPS]
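One widely cited form of learning rate scaling is the linear scaling rule from Goyal et al., where the learning rate grows in proportion to the global batch size relative to a reference of 0.1 at batch size 256. Whether the training run described in this talk used exactly this rule is an assumption; the function names and the per-GPU batch size below are hypothetical.

```python
def global_batch_size(num_machines, gpus_per_machine, per_gpu_batch_size):
    """Number of samples processed per optimizer step across the whole cluster."""
    return num_machines * gpus_per_machine * per_gpu_batch_size


def linearly_scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    """Linear scaling rule: scale the learning rate with the global batch size."""
    if batch_size <= 0 or base_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch_size / base_batch_size


# 16 machines with 8 GPUs each (p3.16xlarge); the per-GPU batch size of 32 is hypothetical.
batch = global_batch_size(num_machines=16, gpus_per_machine=8, per_gpu_batch_size=32)
lr = linearly_scaled_lr(batch)
print(batch, lr)
```

Goyal et al. pair this rule with a gradual warm-up over the first few epochs, since starting directly at the scaled rate can destabilize early training.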
  • 59.
    Large-scale machine learning/deep learning training in the cloud
    Distributed Training of MnasNet on AWS
  • 60.
    Distributed Training of MnasNet on AWS (1/4)
    MnasNet
    • An automated mobile NAS* approach
    • Trade-off between accuracy and latency
    • An example of the MnasNet network architecture
    References
    • https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html
    • https://arxiv.org/pdf/1807.11626
    • https://www.youtube.com/watch?v=4uDZxefPd-I
    * NAS: Neural Architecture Search
  • 61.
    Distributed Training of MnasNet on AWS (2/4)
    Example run (1/2)
        $ pip install ec2-cluster==0.3.1
        $ ec3 create ec2cluster_p3_mnasnet_example.yaml
        $ ec3 setup-horovod ec2cluster_p3_mnasnet_example.yaml
        $ ec3 ssh-cmd ec2cluster_p3_mnasnet_example.yaml
        ...
        ...
        $ source activate tensorflow_p36
        $ mpirun -np 16 -hostfile /home/ubuntu/hostfile -bind-to socket -map-by slot \
            -mca plm_rsh_no_tree_spawn 1 \
            -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x HOROVOD_FUSION_THRESHOLD=16777216 \
            -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH \
            -mca pml ob1 -mca btl ^openib \
            -x NCCL_SOCKET_IFNAME=ens5 -mca btl_tcp_if_exclude lo,docker0 \
            -x TF_CPP_MIN_LOG_LEVEL=0 \
            python /home/ubuntu/aws-ai-optimized-models/mnasnet/mnasnet_main_hvd.py \
              --use_tpu=False --data_dir=/home/ubuntu/data --model_dir=./results_hvd \
              --train_batch_size=256 --eval_batch_size=256 --train_steps=109475 \
              --skip_host_call=False --data_format='channels_first' \
              --transpose_input=False --use_horovod=True --eval_on_single_gpu=True
        ...
    [Config excerpt on slide; visible section comments: # Define the base params, # Naming, # Launch Location]
  • 62.
    Distributed Training of MnasNet on AWS (3/4)
    Example run (2/2)
        ...
        I0923 16:15:22.086650 140202663954176 saver.py:1276] Restoring parameters from ./results_hvd/model.ckpt-62560
        I0923 16:15:22.418808 140202663954176 session_manager.py:491] Running local_init_op.
        I0923 16:15:22.426828 140202663954176 session_manager.py:493] Done running local_init_op.
        I0923 16:15:47.475176 140202663954176 evaluation.py:277] Finished evaluation at 2019-09-23-16:15:47
        I0923 16:15:47.475430 140202663954176 estimator.py:1979] Saving dict for global step 62560: global_step = 62560, loss = 2.1191003, top_1_accuracy = 0.74759614, top_5_accuracy = 0.9215545
        I0923 16:15:47.475846 140202663954176 estimator.py:2039] Saving 'checkpoint_path' summary for global step 62560: ./results_hvd/model.ckpt-62560
        I0923 16:15:47.476232 140202663954176 error_handling.py:93] evaluation_loop marked as finished
        I0923 16:15:47.476345 140202663954176 mnasnet_main_hvd.py:1041] Eval results at step 62560: {'loss': 2.1191003, 'top_1_accuracy': 0.74759614, 'top_5_accuracy': 0.9215545, 'global_step': 62560}. Hvd rank 0
        I0923 16:15:47.476416 140202663954176 mnasnet_main_hvd.py:1051] Finished training up to step 62560. Elapsed seconds 40649.
    • Time-to-train: ≈ 11.29 hrs
    • Top-1 accuracy: 74.76%
    • Top-5 accuracy: 92.16%
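The three summary figures follow directly from the last two log lines. The sketch below shows one way to extract them with regular expressions; the two strings are copied from the log on this slide, while the function names and patterns are hypothetical helpers written for this illustration.

```python
import re

EVAL_LINE = (
    "I0923 16:15:47.476345 140202663954176 mnasnet_main_hvd.py:1041] "
    "Eval results at step 62560: {'loss': 2.1191003, 'top_1_accuracy': 0.74759614, "
    "'top_5_accuracy': 0.9215545, 'global_step': 62560}. Hvd rank 0"
)
ELAPSED_LINE = (
    "I0923 16:15:47.476416 140202663954176 mnasnet_main_hvd.py:1051] "
    "Finished training up to step 62560. Elapsed seconds 40649."
)


def parse_metric(line, name):
    """Return the float value logged for a metric such as 'top_1_accuracy'."""
    match = re.search(r"'%s': ([0-9.]+)" % re.escape(name), line)
    if match is None:
        raise ValueError(f"metric {name!r} not found")
    return float(match.group(1))


def parse_elapsed_hours(line):
    """Convert the logged 'Elapsed seconds N' value to hours."""
    match = re.search(r"Elapsed seconds (\d+)", line)
    if match is None:
        raise ValueError("elapsed time not found")
    return int(match.group(1)) / 3600.0


hours = parse_elapsed_hours(ELAPSED_LINE)
top1 = parse_metric(EVAL_LINE, "top_1_accuracy") * 100
top5 = parse_metric(EVAL_LINE, "top_5_accuracy") * 100
print(f"time-to-train: {hours:.2f} h, top-1: {top1:.2f}%, top-5: {top5:.2f}%")
```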
  • 63.
    Distributed Training of MnasNet on AWS (4/4)
    Example performance test results
    Num. of instances (p3dn.24xlarge) | Time-to-train (hours) | Top-1 validation accuracy (%)
    1                                 | 29                    | 75.2
    2                                 | 24.3                  | 74.5
    4                                 | 9.0                   | 74.67
    8                                 | 4.6                   | 74.16
    16                                | 1.8 ~ 2.6             | 73.9 ~ 74.6
    Machines
    • p3dn.24xlarge
    TensorFlow
    • TENSORFLOW_VERSION: v1.13.1
    • HOROVOD_VERSION: 0.16.1
    • CUDNN_VERSION: 7.4.2.24-1+cuda10.0
    • NCCL_VERSION: 2.4.2-1+cuda10.0
    • OPENMPI 4.0.0
    Dataset (ImageNet)
    • 1.28 million images (1000 classes)
    • 1024 training files & 128 validation files (TFRecords)
    Optimizations
    • Mixed channel ordering
    • Mixed XLA (for all ops except depth-wise convolution)
    • LARC optimizer (LARC: layer-wise adaptive rate control)
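Speedup and scaling efficiency relative to the single-instance run can be derived from the time-to-train column. The sketch below uses the values reported on this slide; the 16-instance row is given as a range, so it is left out, and the function name is a hypothetical helper.

```python
# Time-to-train in hours by number of p3dn.24xlarge instances, as reported on the slide.
TIME_TO_TRAIN_HOURS = {1: 29.0, 2: 24.3, 4: 9.0, 8: 4.6}


def speedup_and_efficiency(times_by_instances):
    """Return {instances: (speedup vs. 1 instance, speedup / instances)}."""
    baseline = times_by_instances[1]
    results = {}
    for instances, hours in sorted(times_by_instances.items()):
        speedup = baseline / hours
        results[instances] = (speedup, speedup / instances)
    return results


results = speedup_and_efficiency(TIME_TO_TRAIN_HOURS)
for instances, (speedup, efficiency) in results.items():
    print(f"{instances:>2} instances: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

An efficiency of 100% would correspond to perfectly linear scaling, so the value shows how much of the added hardware turns into shorter training time.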
  • 65.
    Summary
    • Train smart with tools that make distributed training easy
    • Experiment, experiment, and experiment
    Option                                        | Benefits
    Amazon EC2 + AWS DL AMI                       | Efficient linear scalability
    Amazon ECS / Amazon EKS + AWS DL Container    | Efficient linear scalability, flexibility
    Amazon SageMaker                              | Efficient linear scalability, flexibility, fully managed service
  • 66.
    Thank you
  • 67.
    We look forward to your feedback! #AWSDEVDAYSEOUL