[AWS Dev Day] AI / Machine Learning | Practical Techniques for Automating and Optimizing Machine Learning on AWS - 남궁영환, AWS Solutions Architect; 김대근, AWS Solutions Architect

Practical Techniques for Automating and Optimizing Machine Learning on AWS
AI / ML
남궁영환, Data Scientist SA, Amazon Web Services
김대근, Data Scientist SA, Amazon Web Services

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
• AI/ML at AWS
• Large-scale machine learning / deep learning training in the cloud
o Part 1
§ Infrastructure for ML on AWS
§ Horovod & TensorFlow distributed training on EC2, EKS, and SageMaker
o Part 2
§ fast.ai on AWS
§ MnasNet on AWS
• Summary
D E V D A Y

AWS ML Stack: the deepest and broadest set of capabilities and technologies
AI SERVICES (app developers with little knowledge of ML)
• Vision: Rekognition Image, Rekognition Video, Textract
• Speech: Polly, Transcribe
• Language: Translate, Comprehend
• Chatbots: Lex
• Forecasting: Forecast
• Recommendations: Personalize
ML SERVICES (ML developers and data scientists)
• Amazon SageMaker: Build, Train, Deploy
✓ Pre-built algorithms & notebooks
✓ Data labeling (Ground Truth)
✓ One-click model training & tuning
✓ Optimization (Neo)
✓ One-click deployment & hosting
✓ Reinforcement learning
✓ Algorithms & models (AWS Marketplace for Machine Learning)
ML FRAMEWORKS & INFRASTRUCTURE (ML researchers and academics)
• Frameworks, Interfaces, Infrastructure
• EC2 P3 & P3dn, EC2 C5, FPGAs, Greengrass, Elastic Inference, Inferentia

Scaling TensorFlow near-linearly to 256 GPUs (2018)
Available on Amazon SageMaker and the AWS Deep Learning AMIs
• Stock TensorFlow: 65% scaling efficiency with 256 GPUs, 30 min training time
• AWS-Optimized TensorFlow: 90% scaling efficiency with 256 GPUs, 14 min training time
https://aws.amazon.com/about-aws/whats-new/2018/11/tensorflow-scalability-to-256-gpus/

Why large-scale machine learning matters (1/3)
• Model performance keeps improving as more data accumulates
("How do data science techniques scale with amount of data?" - Andrew Ng, https://www.slideshare.net/ExtractConf)
• Deep learning applications are growing steadily across many fields
• Training ML/DL models on large volumes of data takes substantial time and resources
• Hence "distributed training"
(The "data parallel" approach to distributed training - Uber, https://eng.uber.com/horovod/)

Scaling to Very Very Large Corpora for Natural Language Disambiguation, Banko and Brill, Microsoft Research (2001)
http://www.aclweb.org/anthology/P01-1005
"These results suggest that we may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development."
Choosing the algorithm matters, but securing a large amount of training data matters most.

Large-scale machine learning can be solved in very different ways depending on the problem and the approach
• Common goals
✓ Compute, networking, containers, distributed training performance tuning, ...
✓ ML engineers focus on developing models that contribute to business success, using their preferred ML/DL framework
• Data Management
✓ Data scale ∝ complexity of the problem and the algorithm
✓ Data durability and availability
• Distributed Computing Frameworks
✓ Data pipeline features (Dask, Ray, PyToolz, ipyparallel, etc.)
✓ CPU ➝ GPU ➝ Multi-GPUs ➝ Multi-nodes
✓ TensorFlow, PyTorch, MXNet, ...
• Build compute clusters to fit the workload!

Where to train and deploy deep learning models
• Amazon SageMaker
• Amazon Elastic Container Service for Kubernetes
• Amazon Elastic Container Service
• Amazon EC2
• AWS Deep Learning AMIs
• AWS Deep Learning Containers
"Choose the ML/DL model training and deployment environment that fits the workload you are solving."

Large-scale machine learning / deep learning training in the cloud
Infrastructure for ML on AWS

P3 instance
Suited to workloads that need massively parallel processing
• Machine learning model training
• HPC (High Performance Computing) simulation
• 3D model rendering
• Video encoding
Up to 8 NVIDIA Tesla V100 GPUs
• 1 PetaFLOPS of compute performance (up to 14x that of P2 instances)
• 300 GB/s GPU-to-GPU communication over NVLink (9x that of P2 instances)
• Supports all ML frameworks and model types
• Multiple instance options available (up to 70% cost savings with Spot Instances)
Three instance sizes, available in 14 Regions
Instance size | V100 GPUs | vCPUs | Memory
p3.2xlarge | 1 | 8 | 61 GB
p3.8xlarge | 4 | 32 | 244 GB
p3.16xlarge | 8 | 64 | 488 GB
https://aws.amazon.com/ko/ec2/instance-types/p3/

P3dn.24xlarge instance
Description | P3.16xlarge | P3dn.24xlarge | Improvement
Number and type of GPUs | 8 x NVIDIA V100 | 8 x NVIDIA V100 | -
GPU memory | 16 GB/GPU | 32 GB/GPU | 100%
GPU peer to peer | NVLink - 300 GB/s | NVLink - 300 GB/s | -
CPU family | Broadwell | Skylake w AVX512 | -
vCPU | 64 | 96 | 50%
System memory | 488 GB | 768 GB | 57%
Networking throughput | 25 Gbps | 100 Gbps | 300%
EBS throughput | 14 Gbps | 14 Gbps | -
Local instance storage | No | 2.0 TB NVMe SSD | -
• The most powerful GPU instance available in the cloud
• Efficient large-scale ML training and HPC simulation
(multi-node clusters of 32 or more instances can be built on 100 Gbps network bandwidth)
• Fast access to data for model training and simulation
(Amazon S3, network file systems, local instance storage)
• Large ML model training and large-scale data processing
(latest NVIDIA V100 GPUs with 32 GB of GPU memory)
• Well suited to optimizing data preprocessing
(96 vCPUs using AWS Custom Skylake CPUs and 768 GB of system memory)
https://aws.amazon.com/ko/ec2/instance-types/p3/#Amazon_EC2_P3dn.24xlarge_Instances

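Amazon FSx for Lustre
• High-performance file system for machine learning, HPC, video processing, financial modeling, and similar workloads
• Natively integrated with Amazon S3
• Lustre delivers sub-millisecond latency and throughput that scales to hundreds of gigabytes per second and millions of IOPS
• POSIX-compliant, so existing Linux-based applications run without changes
• Pay only for the resources you use (no minimum commitment or upfront fees)
• No changes to client OS kernel modules required
(https://aws.amazon.com/ko/fsx/lustre/)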

Infrastructure for ML on AWS (1/3)
Traditional HPC machine learning cluster
[Architecture diagram; components shown:]
• Bastion host | BeeGFS management node | Cluster monitoring
• Auto Scaling BeeGFS RAM storage nodes (BeeGFS RAM-based storage array), Deep Learning Placement Group
• Auto Scaling worker nodes (multi-node parallel), Deep Learning Placement Group
• Amazon EFS: Deep Learning Application Stack, cluster-wide persistent storage
• Amazon S3: model parameter object store
Cloud-native machine learning cluster
[Architecture diagram; components shown:]
• AWS Batch: multi-node TensorFlow on P3 / P3dn container instances
• Amazon FSx for Lustre (Lustre kernel driver), with hydrate / commit operations
• Amazon ECR: container registry

Traditional AWS Deep Learning Cluster
https://aws.amazon.com/ko/blogs/compute/distributed-deep-learning-made-easy/
https://github.com/aws-samples/deep-learning-models/tree/master/hpc-cluster
https://github.com/awslabs/deeplearning-cfn
[Architecture diagram; components shown:]
• AWS Cloud, Default VPC: 10.0.0.0/16, with Internet Gateway, Router, and NAT Gateway
• Public subnet 10.0.0.0/24: EC2 Master Instance (Public: 203.0.113.0, Private: 10.0.0.1), Auto Scaling Group
• Private subnet 10.0.1.0/16: EC2 Workers (10.0.1.1, 10.0.1.2, 10.0.1.3), Auto Scaling Group
• Amazon SQS Master Queue and Worker Queue (worker setup)
• AWS Lambda and Amazon SNS (Auto Scaling setup complete)
• Amazon Elastic File System
• Amazon S3

Cloud-native AWS Deep Learning Cluster
https://aws.amazon.com/ko/blogs/compute/scalable-deep-learning-training-using-multi-node-parallel-jobs-with-aws-batch-and-amazon-fsx-for-lustre/
[Architecture diagram; components shown:]
• AWS Cloud: AWS Step Functions workflow started by an event trigger
• TFRecord input bucket and training output bucket, Amazon Glacier
• TensorFlow container registry
• AWS Batch multi-node parallel job: NVIDIA GPU-backed running containers
• FSx for Lustre
• Amazon CloudWatch

with Horovod & TensorFlow

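Horovod (1/9)
• Open-source framework for distributed deep learning
• Works with stock TensorFlow, Keras, PyTorch, and others
• Quick and simple installation: `pip install horovod`
• Advanced algorithms available
• Supports high-performance networking (RDMA, GPUDirect)
• Separates ML engineers from infrastructure
✓ The infrastructure team provides the container and MPI environment
✓ ML engineers use their preferred deep learning framework
✓ Shared expectations for distributed training across frameworks
(infrastructure team & ML engineers)
horovod.ai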

Horovod (2/9)
• Ring-AllReduce
✓ Data scale ∝ number of cluster nodes
• Synchronous updates
• NVIDIA's NCCL library (for GPU-level communication)
• Configurations
✓ Single-ring NCCL vs. Hierarchical AllReduce
HOROVOD_HIERARCHICAL_ALLREDUCE=1
✓ Tensor Fusion
HOROVOD_FUSION_THRESHOLD=67108864
HOROVOD_CYCLE_TIME=5
✓ FP16 all-reduce
hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)
Ring-AllReduce example (3 workers, each buffer split into 3 chunks of 2 values)
State | Worker A | Worker B | Worker C
Initial | 5 13, 8 19, 42 1 | 8 11, 4 2, 7 7 | 9 27, 3 15, 8 4
After exchange 1 | 5 13, 8 19, 50 5 | 13 24, 4 2, 7 7 | 9 27, 7 17, 8 4
After exchange 2 | 5 13, 15 36, 50 5 | 13 24, 4 2, 57 12 | 22 51, 7 17, 8 4
After exchange 3 | 22 51, 15 36, 50 5 | 13 24, 15 36, 57 12 | 22 51, 7 17, 57 12
After exchange 4 | 22 51, 15 36, 57 12 | 22 51, 15 36, 57 12 | 22 51, 15 36, 57 12
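The Ring-AllReduce exchange pictured on this slide can be reproduced with a small pure-Python simulation. This is a sketch under stated assumptions, not Horovod's or NCCL's implementation: the function name `ring_allreduce` is hypothetical, workers are plain nested lists, and "sending" is a list copy. It assumes the ring order A → B → C → A, which is the order consistent with the intermediate values on the slide.

```python
def ring_allreduce(buffers):
    """Simulate ring all-reduce (sum) over n workers.

    buffers[i][c] is chunk c held by worker i; every worker must hold
    exactly n chunks. Returns the final per-worker buffers.
    """
    n = len(buffers)
    data = [[list(chunk) for chunk in worker] for worker in buffers]

    # Phase 1, reduce-scatter: in step s, worker i sends chunk (i - s) mod n
    # to its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        # Snapshot all payloads first so the sends behave as simultaneous.
        sends = [(i, (i - step) % n, list(data[i][(i - step) % n]))
                 for i in range(n)]
        for src, chunk_id, payload in sends:
            dst = (src + 1) % n
            data[dst][chunk_id] = [a + b for a, b in
                                   zip(data[dst][chunk_id], payload)]

    # Phase 2, all-gather: worker i now owns the fully reduced chunk
    # (i + 1) mod n and passes completed chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, list(data[i][(i + 1 - step) % n]))
                 for i in range(n)]
        for src, chunk_id, payload in sends:
            dst = (src + 1) % n
            data[dst][chunk_id] = payload

    return data


if __name__ == "__main__":
    worker_a = [[5, 13], [8, 19], [42, 1]]
    worker_b = [[8, 11], [4, 2], [7, 7]]
    worker_c = [[9, 27], [3, 15], [8, 4]]
    for name, buf in zip("ABC", ring_allreduce([worker_a, worker_b, worker_c])):
        print(name, buf)
```

With n workers the exchange takes 2 × (n − 1) steps, and each step moves only one chunk per worker, which is why the slide shows four exchanges for three workers.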

Horovod (3/9)
1. Initialize the library
import horovod.tensorflow as hvd
hvd.init()
* Horovod for TensorFlow, Keras, and PyTorch
import horovod.keras as hvd
import horovod.tensorflow.keras as hvd
import horovod.torch as hvd
# more frameworks coming
2. Pin the GPU to use
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
3. Scale the learning rate and add the Horovod distributed optimizer
opt = tf.train.MomentumOptimizer(
    learning_rate=0.01 * hvd.size(),
    momentum=0.9)  # momentum value is an assumption; the slide omits it
opt = hvd.DistributedOptimizer(opt)
4. Synchronize initial state between workers
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, ...) as mon_sess:
    ...
# OR
bcast_op = hvd.broadcast_global_variables(0)
sess.run(bcast_op)
5. Use checkpoints only on the first worker
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir, ...) as mon_sess:
    ...
( source code from https://github.com/horovod/horovod )

Horovod (4/9)
Example run
# Use AWS Deep Learning AMI
laptop$ ssh ubuntu@<aws-ip-1>
aws-ip-1$ source activate tensorflow_p27
aws-ip-1$ ssh-keygen
aws-ip-1$ cat /home/ubuntu/.ssh/id_rsa.pub
[copy contents of the pubkey]
aws-ip-1$ exit
laptop$ ssh ubuntu@<aws-ip-2>
aws-ip-2$ source activate tensorflow_p27
aws-ip-2$ cat >> /home/ubuntu/.ssh/authorized_keys
[paste contents of the pubkey]
aws-ip-2$ exit
laptop$ ssh ubuntu@<aws-ip-1>
aws-ip-1$ ssh aws-ip-2
[will ask for prompt, say yes]
aws-ip-2$ exit
aws-ip-1$ mpirun -np 2 -H aws-ip-1,aws-ip-2 \
    wget https://raw.githubusercontent.com/uber/horovod/master/examples/tensorflow_mnist.py
aws-ip-1$ mpirun -bind-to none -map-by slot \
    -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
    -x LD_LIBRARY_PATH -x PATH \
    -mca btl_tcp_if_exclude lo,docker0 \
    -np 16 -H aws-ip-1:8,aws-ip-2:8 \
    python tensorflow_mnist.py
# Pro tip: hide mpirun args into mpirun.sh
aws-ip-1$ mpirun.sh \
    -np 16 -H aws-ip-1:8,aws-ip-2:8 \
    python tensorflow_mnist.py

Horovod (5/9)
[Reference] Example code: Horovod for TensorFlow
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...

# Scale the learning rate by the number of workers
# (momentum value is an assumption; the slides omit it)
opt = tf.train.MomentumOptimizer(learning_rate=0.01 * hvd.size(), momentum=0.9)

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to synchronize initial state
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Make training operation
train_op = opt.minimize(loss)

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing
# when done or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training
        mon_sess.run(train_op)

Horovod (6/9)
[Reference] Example code: Estimator API
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
def cnn_model_fn(features, labels, mode):
    loss = ...
    # Add Horovod Distributed Optimizer
    return tf.estimator.EstimatorSpec(...)

# Broadcast initial variable state.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn,
    model_dir=ckpt_dir,
    config=tf.estimator.RunConfig(session_config=config))

mnist_classifier.train(
    input_fn=train_input_fn,
    steps=100,
    hooks=hooks)

Horovod (7/9)
[Reference] Example code: Horovod for MXNet
import mxnet as mx
import horovod.mxnet as hvd
from mxnet import autograd

hvd.init()

# Pin GPU to be used to process local rank
context = mx.gpu(hvd.local_rank())
num_workers = hvd.size()

# Build model
model = ...
model.hybridize()

# Create optimizer
optimizer_params = ...
opt = mx.optimizer.create('sgd', **optimizer_params)

# Initialize parameters
model.initialize(initializer, ctx=context)

# Fetch and broadcast parameters
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)

# Create DistributedTrainer, a subclass of gluon.Trainer
trainer = hvd.DistributedTrainer(params, opt)

# Create loss function
loss_fn = ...

# Train model
for epoch in range(num_epoch):
    train_data.reset()
    for nbatch, batch in enumerate(train_data, start=1):
        data = batch.data[0].as_in_context(context)
        label = batch.label[0].as_in_context(context)
        with autograd.record():
            output = model(data.astype(dtype, copy=False))
            loss = loss_fn(output, label)
        loss.backward()
        trainer.step(batch_size)

Horovod (8/9)
[Reference] Example code: Horovod for Keras
import keras
import tensorflow as tf
from keras import backend as K
import horovod.keras as hvd

hvd.init()

# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

# Build model...
model = ...
opt = keras.optimizers.Adadelta(lr=1.0 * hvd.size())

# Add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)

model.compile(
    loss='categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy'])

# Broadcast initial variable state.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
...
model.fit(
    x_train,
    y_train,
    callbacks=callbacks,
    epochs=10,
    validation_data=(x_test, y_test))

Horovod (9/9)
[Reference] Example code: Horovod for PyTorch
import torch
import torch.nn.functional as F
import torch.optim as optim
import horovod.torch as hvd

hvd.init()

# Horovod: pin GPU to local rank
torch.cuda.set_device(hvd.local_rank())

# Build model...
model = Net()
model.cuda()

# lr value is an assumption (the slide omits it), scaled by the worker count
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap optimizer with DistributedOptimizer
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters())

# Horovod: broadcast parameters
hvd.broadcast_parameters(
    model.state_dict(),
    root_rank=0)

for epoch in range(100):
    for batch_idx, (data, target) in ...:
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

Scalable multi-node training (EC2)

Scaling performance using distributed training
TensorFlow & Horovod on Amazon EC2
Example training run with ImageNet
• Preprocess on an instance dedicated to TFRecord conversion
✓ t2.large instance with a 1.0 TB EBS sc1 volume
✓ Download the ImageNet dataset
✓ Transform the raw dataset into TFRecord format
✓ Upload the transformed dataset to Amazon S3
nohup aws s3 sync /data s3://YOUR_BUCKET_NAME >& upload.log &
• Set up all EC2 instances identically: same instance type, AMI, data path, and model path
• Check GPU utilization on P3dn.24xlarge (and/or P3.16xlarge)

Example training run with ImageNet
• Time-to-train: around 45 mins
• 8 * P3dn.24xlarge instances
• ML Models: ResNet-50
• Top-1 Validation Accuracy : 75.59 %
https://docs.aws.amazon.com/ko_kr/dlami/latest/devguide/tutorial-horovod-tensorflow.html

Training using P3 instances (ResNet-50 & ImageNet)
https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-deep-learning-training-using-gpus-in-the-aws-cloud/
Configuration
• 8 * P3.16xlarge instances
• DL framework: TensorFlow, MXNet
• ML model: ResNet-50
• Dataset: ImageNet (1.2 million images)
• Top-1 validation accuracy: 76%
Time-to-train: 47 min ~ 50 min
[Chart: Images/Second (0 to 50,000) vs. Number of GPUs (1, 2, 4, 8, 16, 32, 64)]

Training performance w.r.t. TensorFlow & CUDA (ResNet-50 & ImageNet)
https://aws.amazon.com/ko/blogs/machine-learning/scalable-multi-node-training-with-tensorflow/
Configuration
• 32 * P3.16xlarge instances
• DL framework: TensorFlow
• ML model: ResNet-50
• Dataset: ImageNet
• Top-1 validation accuracy: 75.4%
• Top-5 validation accuracy: 92.6%
Time-to-train: 14.6 min
[Charts: training performance (images/sec); time to train vs. number of GPUs vs. images/sec, efficiency, and communication overhead]

Optimizing distributed deep learning performance on Amazon EKS

Optimizing distributed deep learning performance on Amazon EKS (1/11)
[Reference] Modular and Scalable Amazon EKS Architecture
https://aws.amazon.com/ko/quickstart/architecture/amazon-eks/

Using Horovod in Amazon EKS
• STEP 1. Install Kubeflow to set up a cluster for distributed training
• STEP 2. Set the app name and initialize it
• STEP 3. Install mpi-operator from Kubeflow
• STEP 4. Create an MPI Job template; define the number of nodes (replicas) and the number of GPUs each node has (gpusPerReplica)
• STEP 5. Apply the manifest to the default environment; the MPI Job will create a launch pod
https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-eks-tutorials-distributed-gpu-training.html

EKS Deep Learning Benchmark Utility
https://github.com/aws-samples/aws-eks-deep-learning-benchmark
• Automated benchmark workflow from cluster creation through teardown
• Supports multiple backend storage systems
(e.g., Amazon EFS, Amazon FSx for Lustre)
• Stores configuration and results in Amazon S3
• Backed by Kubeflow operators and kubebench
• Supports multiple deep learning frameworks
(TF, TF + Horovod + OpenMPI, PyTorch, MXNet)
• Kubernetes cluster configuration tailored to user requirements
• Saves intermediate results and terminates the cluster automatically
• Multiple experiments can run in parallel

• Set up NFS
kubectl create -f deploy/benchmark-nfs-svc.yaml
kubectl get svc benchmark-nfs-svc -o=jsonpath={.spec.clusterIP}
# Replace ip in the `deploy/benchmark-nfs-volume.yaml` before following step
kubectl create -f deploy/benchmark-nfs-volume.yaml
• Install Argo Workflow
kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/v2.2.1/manifests/install.yaml
# you can forward port to localhost and look at Argo UI
kubectl port-forward deployment/argo-ui 8001:8001 -n argo
• Configure AWS credentials
• Configure your GitHub token
• Set up S3 buckets for your benchmark results and your training data
• Configure your Kubernetes cluster

• Run the benchmark jobs
s3ResultPath: 's3://kubeflow-pipeline-data/benchmark/',
s3DatasetPath: 's3://eks-dl-benchmark/imagenet/',
clusterConfig: 's3://kubeflow-pipeline-data/benchmark/cluster_config.yaml',
experiments: [{
experiment: 'experiment-20190415-01',
trainingJobConfig: 's3://kubeflow-pipeline-data/benchmark/mpi-job-imagenet.yaml',
trainingJobPkg: 'mpi-job',
trainingJobPrototype: 'mpi-job-custom',
// Change to upstream once https://github.com/kubeflow/kubeflow/pull/3062 is merged
trainingJobRegistry: 'github.com/jeffwan/kubeflow/tree/make_kubebench_reporter_optional/kubeflow',
}],
githubSecretName: 'github-token',
githubSecretTokenKeyName: 'GITHUB_TOKEN',
image: 'seedjeffwan/benchmark-runner:20190424',
name: '20190424-00',
namespace: 'default',
nfsVolume: 'benchmark-pv',
nfsVolumeClaim: 'benchmark-pvc',
region: 'us-west-2',
trainingDatasetVolume: 'dataset-claim',
s3SecretName: 'aws-secret',
s3SecretAccesskeyidKeyName: 'AWS_ACCESS_KEY_ID',
s3SecretSecretaccesskeyKeyName: 'AWS_SECRET_ACCESS_KEY',
storageBackend: 'fsx',
kubeflowRegistry: 'github.com/jeffwan/kubeflow/tree/make_kubebench_reporter_optional/kubeflow'
1. Update your workflow settings using the ks command
2. Update the benchmark workflow manifest directly

( Amazon EKS + Kubeflow + AWS FSx CSI driver )
• Kubernetes
✓ Supports a wide range of container-based ML/DL frameworks
✓ Elasticity and easy scalability
✓ Steadily gaining adoption as an environment for training deep neural networks
• Amazon EKS
✓ Fully managed Kubernetes service
✓ Easily run Kubernetes workloads on EC2 P2 and P3 instances
• Kubeflow
✓ Kubernetes-native platform for efficiently developing, managing, and deploying machine learning workloads
✓ Supports distributed training
(native TensorFlow architecture or MPI AllReduce (NVIDIA NCCL library or Horovod))
https://aws.amazon.com/ko/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/

• Amazon FSx for Lustre
✓ High-performance file system
✓ Best for workloads that need fast processing (e.g., machine learning, HPC)
✓ Integrated with Amazon S3
• AWS FSx CSI driver
✓ Lets containers use FSx for Lustre file systems in a Kubernetes-native way
✓ Static/dynamic volume provisioning
✓ Containers on multiple nodes within a cluster can connect to the same Lustre file system
✓ Amazon S3 can serve as the Lustre data repository

Machines
• 20 * p3.16xlarge (mixed precision)
Amazon EKS-optimized AMI with GPU support
• Kubernetes v1.11.8
• MPI Operator Alpha from Kubeflow 0.4.1
• CUDA 10 with NVIDIA Tesla 410.104 driver
• Docker 18.06.1-ce (incl. nvidia-docker2)
AWS FSx for Lustre filesystem
• FSx CSI Driver v0.1
• Hydrated from an S3 bucket (for ImageNet TFRecords)
TensorFlow (customized image)
• TENSORFLOW_VERSION: v1.13.1
• HOROVOD_VERSION: 0.16.0
• CUDNN_VERSION: 7.4.2.24-1+cuda10.0
• NCCL_VERSION: 2.4.2-1+cuda10.0
• OPENMPI 4.0.0
Dataset (ImageNet)
• 1.28 million images (1,000 classes)
• 1024 training files & 128 validation files (TFRecords)
Relevant tools
• awscli, eksctl, ksonnet, and aws-iam-authenticator
"Near-linear scaling performance of 90%-100% was confirmed"

• Performance optimization checklist (Part #1)
✓ Use the latest deep learning toolkits (e.g., AMI for EKS)
✓ Set the GPU clock speed to its maximum (see: bootstrap command)
✓ Launch instances in a placement group (low latency)
✓ Use the latest AWS VPC CNI plugin (so that all NICs on the EKS cluster use jumbo frames by default)

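• Performance optimization checklist (Part #2)
✓ Choose an appropriate storage backend (EBS, EFS, FSx for Lustre, etc.)
✓ Use the static Kubernetes CPU management policy
✓ MPI processor
✓ Build TensorFlow with Intel MKL-DNN to optimize CPU performance
✓ Optimize TensorFlow to parallelize data transformation processes and threads
✓ Adjust thread pools and tune CPU performance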

TensorFlow distributed training on Amazon SageMaker

TensorFlow distributed training in Amazon SageMaker (1/6)
• Amazon SageMaker provides prebuilt TensorFlow containers (TensorFlow v1.11+)
• Configure hardware resources and hyperparameters for ML model training
• Training instances: cost-effective, automated clusters for ML model training
• Approaches for distributed training
✓ TensorFlow's native parameter server (TF v1.11+)
✓ Horovod (TF v1.12+)
https://aws.amazon.com/ko/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/

Parameter servers
• Multiple dedicated processes to
✓ Collect gradients (computed by "worker" processes)
✓ Aggregate gradients
✓ Distribute the updated gradients back to the workers asynchronously
✓ All-to-all communication model
• In Amazon SageMaker
✓ No need to set up and manage the parameter server cluster manually
✓ A built-in script mode option
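The collect / aggregate / distribute loop described on this slide can be illustrated with a toy, single-process sketch. It is not TensorFlow's parameter server and not what SageMaker runs: the class `ParameterServer`, the helpers `worker_gradient` and `train`, and the tiny linear-regression data are all hypothetical, and asynchrony is only imitated by applying each gradient the moment it is pushed instead of waiting for every worker.

```python
class ParameterServer:
    """Toy in-process parameter server holding one flat weight vector."""

    def __init__(self, weights, learning_rate):
        self.weights = list(weights)
        self.learning_rate = learning_rate

    def push(self, gradient):
        # Apply a worker's gradient as soon as it arrives (asynchronous
        # style: no barrier that waits for the other workers).
        self.weights = [w - self.learning_rate * g
                        for w, g in zip(self.weights, gradient)]

    def pull(self):
        # Workers fetch the latest weights before computing a gradient.
        return list(self.weights)


def worker_gradient(weights, x, y):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w for one sample."""
    error = sum(w * xi for w, xi in zip(weights, x)) - y
    return [error * xi for xi in x]


def train(server, shards, rounds):
    """Each shard stands in for the data held by one worker process."""
    for _ in range(rounds):
        for shard in shards:
            for x, y in shard:
                weights = server.pull()
                server.push(worker_gradient(weights, x, y))
    return server.pull()


if __name__ == "__main__":
    # Samples generated from y = 2 * x1 + 3 * x2, split across two workers.
    shards = [[([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0)],
              [([1.0, 1.0], 5.0)]]
    server = ParameterServer(weights=[0.0, 0.0], learning_rate=0.1)
    print(train(server, shards, rounds=200))
```

In a real deployment the server and workers are separate processes on separate hosts, so every push and pull crosses the network; that traffic is the cost the later slide on choosing between parameter servers and Horovod is weighing.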

Parameter servers – example code
from sagemaker.tensorflow import TensorFlow

ps_instance_type = 'ml.p3.2xlarge'
ps_instance_count = 2

distributions = {
    'parameter_server': {
        'enabled': True
    }
}

hyperparameters = {'epochs': 60, 'batch-size': 256}

estimator_ps = TensorFlow(
    base_job_name='hvd-imagenet-tf',
    source_dir='code',
    entry_point='train_ps.py',
    role=role,
    framework_version='1.13',
    py_version='py3',
    hyperparameters=hyperparameters,
    train_instance_count=ps_instance_count,
    train_instance_type=ps_instance_type,
    model_dir=model_dir,
    distributions=distributions)

# start training; inputs can be in
# Amazon S3, Amazon EFS, or Amazon FSx for Lustre
estimator_ps.fit(inputs)

Horovod
• Amazon SageMaker automates setting up and running a Horovod cluster
• SageMaker TensorFlow container
✓ sets up the MPI environment
✓ runs the mpirun command to start jobs on the cluster nodes
• Consider the values of the following fields in the Estimator's distributions parameter
✓ enabled (bool): sets up the environment for executing mpirun
✓ processes_per_host (int): number of processes MPI launches on each host
✓ custom_mpi_options (str): additional flags passed to mpirun on Amazon SageMaker (for Horovod training)

Horovod – example code
from sagemaker.tensorflow import TensorFlow

hvd_instance_type = 'ml.p3.2xlarge'
hvd_processes_per_host = 1
hvd_instance_count = 2

distributions = {
    'mpi': {
        'enabled': True,
        'processes_per_host': hvd_processes_per_host,
        'custom_mpi_options':
            '-verbose --NCCL_DEBUG=INFO '
            '-x OMPI_MCA_btl_vader_single_copy_mechanism=none'
    }
}

hyperparameters = {'epochs': 60, 'batch-size': 256}

estimator_hvd = TensorFlow(
    base_job_name='hvd-imagenet-tf',
    source_dir='code',
    entry_point='train_hvd.py',
    role=role,
    framework_version='1.13',
    py_version='py3',
    hyperparameters=hyperparameters,
    train_instance_count=hvd_instance_count,
    train_instance_type=hvd_instance_type,
    distributions=distributions)

# start training; inputs can be in
# Amazon S3, Amazon EFS, or Amazon FSx for Lustre
estimator_hvd.fit(inputs)

Choose the distributed training approach that fits your needs
• Scaling up on a single machine with multiple GPUs ("data parallelism")
• Scaling out with either parameter servers or Horovod ("cluster size")
Time to share gradients | When you want higher CPU performance | When you want higher GPU performance
Long (larger # of gradients, bigger model size) | Parameter Server | Parameter Server, or Horovod on a single instance with multiple GPUs
Short (smaller # of gradients, smaller model size) | Parameter Server | Horovod
fast.ai – Now anyone can train Imagenet in 18 minutes

fast.ai: Now anyone can train ImageNet in 18 minutes (1/5)
• ImageNet training results
ü Time : 18 minutes
ü Machines : 16 * p3.16xlarge on AWS (EC2)
ü Compute cost : $48.00
ü PyTorch
• Collaborators
ü Yaroslav Bulatov
ü Jeremy Howard
ü Andrew Shaw
The summary of results

How to train fast?
• Step 1
✓ Find a good baseline for a single machine
• Step 2
✓ Scale to multiple machines
[Figure source: "An analysis of Deep Neural Network models for practical applications" by Alfredo Canziani, Adam Paszke, Eugenio Culurciello]

Single-machine training
• Trained ImageNet in 30 epochs (instead of 90)
• A single p3.16xlarge instance trains to 93% in 1.5 hours
One-cycle learning rate (Leslie Smith)
• Start with a relatively high learning rate
• 20% faster convergence on a single machine
[Plot: learning rate vs. number of steps]
Progressive resizing for classification (fast.ai)
• Faster initial epochs: 2x speedup when training on 128-pixel instead of 224-pixel images
• More accurate final epochs: 288-pixel images increased accuracy by 0.8%
Rectangular image validation (fast.ai)
• Validate on images close to their original aspect ratio (instead of center-cropping images to 224 x 224)
• 23% speedup of training time to reach the benchmark accuracy of 93%
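The learning-rate curve on this slide rises to a peak and then falls again. A minimal sketch of such a schedule follows. It is a simplified, piecewise-linear stand-in for the one-cycle policy, not the fast.ai implementation: the function name `one_cycle_lr`, its parameters, and the 30% warm-up fraction are assumptions made for illustration.

```python
def one_cycle_lr(step, total_steps, max_lr, start_lr, end_lr,
                 warmup_fraction=0.3):
    """Piecewise-linear one-cycle schedule.

    Ramps linearly from start_lr to max_lr over the first
    warmup_fraction of training, then anneals linearly to end_lr.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")

    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        progress = step / warmup_steps
        return start_lr + progress * (max_lr - start_lr)

    # Remaining steps anneal from max_lr down to end_lr; the final step
    # (total_steps - 1) lands exactly on end_lr.
    anneal_steps = max(1, total_steps - warmup_steps - 1)
    progress = (step - warmup_steps) / anneal_steps
    return max_lr + progress * (end_lr - max_lr)


if __name__ == "__main__":
    for step in (0, 15, 30, 65, 99):
        lr = one_cycle_lr(step, total_steps=100,
                          max_lr=1.0, start_lr=0.1, end_lr=0.01)
        print(step, round(lr, 4))
```

Calling the function once per optimizer step and assigning the result to the optimizer's learning rate is enough to try the schedule in any framework.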

Distributed architecture
Distributed Data Parallel - PyTorch
• All-Reduce - NVIDIA NCCL*: sync gradients after backprop
• Optimization: overlap sync with computation
* NCCL: NVIDIA Collective Communications Library
[Diagram: GPU0 to GPU5 each process one shard of the batch (batch0_0 to batch0_5) through Data, Forward, Backprop, and Sync stages; gradients are exchanged between GPUs during Sync]
[Reference] apex.parallel.DistributedDataParallel

Other considerations
Tips and Tricks
• Run lots of experiments
Scaling techniques
• Batch Normalization tuning and learning rate scaling (by Goyal)
• Increase the batch size instead of decaying the learning rate (by Google Brain)
[Plot: images/sec vs. number of steps]
AWS
• Spot instance: 70% cheaper
• AMI: ImageNet baked into the AWS Deep Learning AMI
• Latency: IOPS + Placement Groups
[Diagram: S3, AMI, io2 volume (10k IOPS), P3 instance]
• You can easily build a distributed environment on AWS and run experiments.
git clone git@github.com:diux-dev/imagenet18.git
pip install -r requirements.txt
aws configure
python train.py
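The learning rate scaling attributed to Goyal on this slide is commonly applied as a linear scaling rule with a gradual warm-up: when the global batch grows by a factor k, the learning rate is multiplied by k, and training ramps up to that value over the first steps. The sketch below assumes that interpretation; the function names and the example numbers (a reference rate of 0.1 for batch size 256, a global batch of 8192, 500 warm-up steps) are illustrative assumptions, not values taken from the talk.

```python
def scaled_lr(base_lr, base_batch_size, global_batch_size):
    """Linear scaling rule: the learning rate grows with the global batch."""
    return base_lr * global_batch_size / base_batch_size


def warmup_lr(step, warmup_steps, base_lr, target_lr):
    """Ramp linearly from base_lr to target_lr, then hold target_lr."""
    if step >= warmup_steps:
        return target_lr
    return base_lr + (target_lr - base_lr) * step / warmup_steps


if __name__ == "__main__":
    # Hypothetical setup: reference lr 0.1 at batch 256, global batch 8192.
    target = scaled_lr(base_lr=0.1, base_batch_size=256,
                       global_batch_size=8192)
    for step in (0, 250, 500, 1000):
        print(step, warmup_lr(step, warmup_steps=500,
                              base_lr=0.1, target_lr=target))
```

The Horovod examples earlier in the deck apply the same idea when they multiply the learning rate by `hvd.size()`.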

Distributed Training of MnasNet on AWS

Distributed Training of MnasNet on AWS (1/4)
MnasNet
• An automated mobile NAS* approach
• Trade-off between accuracy and latency
• An example of MnasNet network architecture
* NAS: Neural Architecture Search
https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html
https://arxiv.org/pdf/1807.11626
https://www.youtube.com/watch?v=4uDZxefPd-I

Example run (1/2)
$ pip install ec2-cluster==0.3.1
$ ec3 create ec2cluster_p3_mnasnet_example.yaml
$ ec3 setup-horovod ec2cluster_p3_mnasnet_example.yaml
$ ec3 ssh-cmd ec2cluster_p3_mnasnet_example.yaml
...
...
$ source activate tensorflow_p36
$ mpirun -np 16 -hostfile /home/ubuntu/hostfile \
    -bind-to socket -map-by slot -mca plm_rsh_no_tree_spawn 1 \
    -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x HOROVOD_FUSION_THRESHOLD=16777216 \
    -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib \
    -x NCCL_SOCKET_IFNAME=ens5 -mca btl_tcp_if_exclude lo,docker0 -x TF_CPP_MIN_LOG_LEVEL=0 \
    python /home/ubuntu/aws-ai-optimized-models/mnasnet/mnasnet_main_hvd.py --use_tpu=False \
    --data_dir=/home/ubuntu/data --model_dir=./results_hvd \
    --train_batch_size=256 --eval_batch_size=256 \
    --train_steps=109475 --skip_host_call=False --data_format='channels_first' \
    --transport_input=False --use_horovod=True --eval_on_single_gpu=True
...
[Configuration file excerpt on the slide, section comments only: # Define the base params, # Naming, # Launch Location]

Example run (2/2)
...
I0923 16:15:22.086650 140202663954176 saver.py:1276] Restoring parameters from ./results_hvd/model.ckpt-62560
I0923 16:15:22.418808 140202663954176 session_manager.py:491] Running local_init_op.
I0923 16:15:22.426828 140202663954176 session_manager.py:493] Done running local_init_op.
I0923 16:15:47.475176 140202663954176 evaluation.py:277] Finished evaluation at 2019-09-23-16:15:47
I0923 16:15:47.475430 140202663954176 estimator.py:1979] Saving dict for global step 62560: global_step = 62560, loss = 2.1191003, top_1_accuracy = 0.74759614, top_5_accuracy = 0.9215545
I0923 16:15:47.475846 140202663954176 estimator.py:2039] Saving 'checkpoint_path' summary for global step 62560: ./results_hvd/model.ckpt-62560
I0923 16:15:47.476232 140202663954176 error_handling.py:93] evaluation_loop marked as finished
I0923 16:15:47.476345 140202663954176 mnasnet_main_hvd.py:1041] Eval results at step 62560: {'loss': 2.1191003, 'top_1_accuracy': 0.74759614, 'top_5_accuracy': 0.9215545, 'global_step': 62560}. Hvd rank 0
I0923 16:15:47.476416 140202663954176 mnasnet_main_hvd.py:1051] Finished training up to step 62560. Elapsed seconds 40649.
• Time-to-train: ≈ 11.29 hrs (40,649 seconds)
• Top-1 accuracy: 74.76%
• Top-5 accuracy: 92.16%

Example performance test results
Num. of instances (p3dn.24xlarge) | Time-to-train (hours) | Top-1 validation accuracy (%)
1 | 29 | 75.2
2 | 24.3 | 74.5
4 | 9.0 | 74.67
8 | 4.6 | 74.16
16 | 1.8 ~ 2.6 | 73.9 ~ 74.6
Machines
• p3dn.24xlarge
TensorFlow
• TENSORFLOW_VERSION: v1.13.1
• HOROVOD_VERSION: 0.16.1
• CUDNN_VERSION: 7.4.2.24-1+cuda10.0
• NCCL_VERSION: 2.4.2-1+cuda10.0
• OPENMPI 4.0.0
Dataset (ImageNet)
• 1.28 million images (1,000 classes)
• 1024 training files & 128 validation files (TFRecords)
Optimizations
• Mixed channel ordering
• Mixed XLA (for all ops except depth-wise convolution)
• LARC optimizer (LARC: Layer-wise adaptive rate control)
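To read the time-to-train column as scaling behaviour, it helps to convert it to speedup and scaling efficiency relative to the single-instance run. The short sketch below does that arithmetic on the figures from the table; the function name `scaling_report` is hypothetical, and the 16-instance row is left out because the slide reports it only as a range.

```python
def scaling_report(hours_by_instances):
    """Return {instances: (speedup, efficiency)} relative to one instance.

    speedup    = time on 1 instance / time on n instances
    efficiency = speedup / n   (1.0 would be perfectly linear scaling)
    """
    base_hours = hours_by_instances[1]
    report = {}
    for instances, hours in sorted(hours_by_instances.items()):
        speedup = base_hours / hours
        report[instances] = (speedup, speedup / instances)
    return report


if __name__ == "__main__":
    # Time-to-train in hours, taken from the table on this slide.
    measured = {1: 29.0, 2: 24.3, 4: 9.0, 8: 4.6}
    for instances, (speedup, efficiency) in scaling_report(measured).items():
        print(f"{instances:>2} instances: speedup {speedup:.2f}x, "
              f"efficiency {efficiency:.0%}")
```

The same function can be fed the lower and upper bounds of the 16-instance range separately to bracket its efficiency.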

Summary
• Train smart with tools that make distributed training easy
• Experiment, experiment, and experiment
Amazon EC2, AWS DL AMI
✓ Efficient linear scalability
Amazon ECS, Amazon EKS, AWS DL Container
✓ Efficient linear scalability
✓ Flexibility
Amazon SageMaker
✓ Efficient linear scalability
✓ Flexibility
✓ Fully-managed service

Thank you

We look forward to your feedback!
#AWSDEVDAYSEOUL
