NVIDIA's APEX: Shorter Training Time, Better Performance

While installing fairseq for wav2vec 2.0 experiments,

I read that installing NVIDIA's apex library is a good idea if you want faster training later on.

git clone https://github.com/NVIDIA/apex

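It can be downloaded from the GitHub repository above, and in this post I want to look at what apex actually is.

When training a model, you usually end up tuning all sorts of parameters to make training faster.

NVIDIA's toolkit APEX (A PyTorch Extension) is said to make distributed training and mixed precision easy to use in PyTorch.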

APEX (A PyTorch EXtension)

The APEX package contains two features, mixed precision training and distributed training. This post covers AMP, the mixed precision training part.

AMP (Automatic Mixed Precision) is the part of APEX that lets you fit larger batch sizes, because FP16 tensors take less memory, and shortens training time.

In particular, APEX enables mixed precision with only three lines of code, and it is AMP that makes this possible.

apex.amp is a tool to enable mixed precision training by changing only 3 lines of your script.
Users can easily experiment with different pure and mixed precision training modes by supplying different flags to amp.initialize.

* Note: AMP is now part of PyTorch itself (torch.cuda.amp, added in PyTorch 1.6).

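So what is mixed precision?

- Mixed precision lets you increase the batch size and shorten training time, which improves training performance.

- Mixed precision training mixes FP16 and FP32 operations to increase processing speed.

- Using FP16 instead of FP32 takes only half the memory and, on GPUs with Tensor Cores, offers up to 8x the arithmetic throughput according to NVIDIA.

- Compared with training without mixed precision, it shows the following performance improvement.

[Figure: performance comparison with and without mixed precision]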
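To make the FP16 versus FP32 trade-off above a bit more concrete, here is a minimal sketch that uses only Python's standard `struct` module (no GPU or PyTorch needed). The helper name `to_fp16` and the example values are my own choices for illustration. It shows that FP16 takes half the storage, and also why FP32 is still mixed in: FP16 has so little precision that a small update to a large value can vanish entirely.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Storage per value: "e" is half precision (FP16), "f" is single precision (FP32).
fp16_bytes = struct.calcsize("<e")
fp32_bytes = struct.calcsize("<f")

# FP16 has an 11-bit significand, so from 2048 upward it can no longer
# represent every integer. Adding 1 to 2048 rounds straight back to 2048.
weight = to_fp16(2048.0)
updated = to_fp16(weight + 1.0)

print("bytes per value (FP16, FP32):", fp16_bytes, fp32_bytes)
print("weight before and after the update:", weight, updated)
```

This lost update is the kind of problem that keeping some operations and weights in FP32 is meant to avoid.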

Usage [3 lines]

import torch
from apex import amp

# Declare model and optimizer with default (FP32) precision
# (D_in and D_out are the input and output feature sizes, defined elsewhere)
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Allow AMP to perform casts as required by the opt-level
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
...
# loss.backward() becomes:
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
...

- amp.initialize : wraps the model and optimizer you have already created

- amp.scale_loss : wraps the loss and optimizer during training

- Backpropagation then runs on the wrapped scaled_loss instead of the raw loss
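Why scale the loss before calling backward at all? The sketch below is my own illustration, not apex's implementation. It uses the standard `struct` module to mimic FP16 rounding, and the gradient value and `LOSS_SCALE` constant are made-up numbers. A very small gradient underflows to zero in FP16, but if the loss (and so, by the chain rule, every gradient) is multiplied by a large factor first, the value survives and can be divided back out in higher precision.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 65536.0   # illustrative constant (2**16), not read from apex
true_grad = 1e-8       # hypothetical tiny gradient value

# Without scaling, the gradient is below FP16's smallest positive value
# and is flushed to zero.
unscaled_fp16 = to_fp16(true_grad)

# With scaling, the gradient lands in FP16's representable range ...
scaled_fp16 = to_fp16(true_grad * LOSS_SCALE)

# ... and is unscaled again in higher precision before the optimizer step.
recovered = scaled_fp16 / LOSS_SCALE

print("FP16 gradient without scaling:", unscaled_fp16)
print("gradient recovered after scaling:", recovered)
```

The recovered value matches the true gradient to within FP16's rounding error, while the unscaled one is lost completely.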

Applying it in practice

# compute output
if args.prof >= 0: torch.cuda.nvtx.range_push("forward")
output = model(input)
if args.prof >= 0: torch.cuda.nvtx.range_pop()
loss = criterion(output, target)

# compute gradient and do SGD step
optimizer.zero_grad()

if args.prof >= 0: torch.cuda.nvtx.range_push("backward")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
if args.prof >= 0: torch.cuda.nvtx.range_pop()

if args.prof >= 0: torch.cuda.nvtx.range_push("optimizer.step()")
optimizer.step()
if args.prof >= 0: torch.cuda.nvtx.range_pop()

(Source: https://cvml.tistory.com/8)

- After computing the loss, call zero_grad on the optimizer to clear the old gradients, then wrap the loss with amp.scale_loss and run backward

- Then run the optimizer step
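Since AMP also ships with PyTorch itself, the same training step can be written without apex. The following is a minimal sketch using `torch.cuda.amp.autocast` and `torch.cuda.amp.GradScaler`. The layer sizes, random data, and three-step loop are toy assumptions of mine, and the `use_amp` flag simply turns mixed precision off on machines without a CUDA GPU so the script still runs. Recent PyTorch releases may print a deprecation warning for the `torch.cuda.amp` names.

```python
import torch

# Use mixed precision only when a CUDA GPU is present; otherwise run plain FP32.
use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

# Toy model and data, sized arbitrarily for illustration.
model = torch.nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(8, 16, device=device)
targets = torch.randn(8, 4, device=device)

for _ in range(3):
    optimizer.zero_grad()
    # The forward pass runs under autocast, which picks FP16 or FP32 per op.
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    # Plays the role of amp.scale_loss(...): backward on the scaled loss.
    scaler.scale(loss).backward()
    # Unscales the gradients and skips the step if they contain inf or NaN.
    scaler.step(optimizer)
    # Adjusts the scale factor for the next iteration.
    scaler.update()

final_loss = loss.item()
print("final loss:", final_loss)
```

The order of operations is the same as in the apex version above: zero the gradients, run backward on the scaled loss, then step the optimizer.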

