
    Strange error from Nvidia’s apex library

    Posted by RobinDong on 2022-08-25 23:49:13

    apex is a mixed-precision training library from Nvidia. I have been using it since I got an RTX 3080 Ti GPU. A few days ago, I started to train RegNetY-32GF (previously I had only used RegNetY models smaller than 16GF). After an accidental interruption, I tried to resume the training, but it reported:

    Traceback (most recent call last):
      File "train.py", line 353, in <module>
        train(args, train_loader, eval_loader)
      File "train.py", line 220, in train
        scaled_loss.backward()
      File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 401, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 191, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
    You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
    
    import torch
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.benchmark = True
    torch.backends.cudnn.deterministic = False
    torch.backends.cudnn.allow_tf32 = True
    data = torch.randn([28, 3712, 10, 10], dtype=torch.half, device='cuda', requires_grad=True)
    net = torch.nn.Conv2d(3712, 3712, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
    net = net.cuda().half()
    out = net(data)
    out.backward(torch.randn_like(out))
    torch.cuda.synchronize()
    
    ConvolutionParams 
        data_type = CUDNN_DATA_HALF
        padding = [0, 0, 0]
        stride = [1, 1, 0]
        dilation = [1, 1, 0]
        groups = 1
        deterministic = false
        allow_tf32 = true
    input: TensorDescriptor 0x55d2a620ff60
        type = CUDNN_DATA_HALF
        nbDims = 4
        dimA = 28, 3712, 10, 10, 
        strideA = 371200, 100, 10, 1, 
    output: TensorDescriptor 0x55d2a6215310
        type = CUDNN_DATA_HALF
        nbDims = 4
        dimA = 28, 3712, 10, 10, 
        strideA = 371200, 100, 10, 1, 
    weight: FilterDescriptor 0x7fd9e806f1e0
        type = CUDNN_DATA_HALF
        tensor_format = CUDNN_TENSOR_NCHW
        nbDims = 4
        dimA = 3712, 3712, 1, 1, 
    Pointer addresses: 
        input: 0x7fd73fde3a00
        output: 0x7fd746abb600
        weight: 0x7fd761b5de00
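
    For context, the scaled_loss.backward() in the traceback comes from apex's loss-scaling context manager. A minimal sketch of the training-step pattern I use (the model and optimizer here are illustrative stand-ins, not my actual RegNetY setup):

    import torch
    from apex import amp

    # Illustrative stand-ins for the real model and optimizer.
    model = torch.nn.Conv2d(3712, 3712, kernel_size=1).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # "O1" is apex's mixed-precision mode: selected ops run in fp16.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    def train_step(data, target):
        optimizer.zero_grad()
        loss = (model(data) - target).pow(2).mean()
        # apex scales the loss to avoid fp16 gradient underflow; this
        # is the scaled_loss.backward() call seen in the traceback.
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()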

    The error looks quite scary, so the first thought that came to my mind was: the training environment has crashed! I downloaded the newest GPU driver and pulled the most up-to-date Docker container for PyTorch. But the error persisted.

    On second thought, I began to suspect that apex couldn’t handle models this big (what was I thinking?), so I modified my code to use “torch.cuda.amp” instead of “apex.amp”, following the official documentation. Fortunately, the error disappeared, but I had to use a smaller batch size. It looked as if “torch.cuda.amp” couldn’t reduce GPU memory usage as much as “apex.amp”.
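
    For reference, switching to native AMP means replacing apex’s amp.initialize/scale_loss pattern with torch.cuda.amp’s autocast and GradScaler. A minimal sketch of the replacement, again with illustrative model and optimizer names:

    import torch
    from torch.cuda.amp import GradScaler, autocast

    model = torch.nn.Conv2d(3712, 3712, kernel_size=1).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scaler = GradScaler()

    def train_step(data, target):
        optimizer.zero_grad()
        # autocast runs eligible ops in fp16 and the rest in fp32.
        with autocast():
            loss = (model(data) - target).pow(2).mean()
        # GradScaler scales the loss, then unscales the gradients
        # before the optimizer step.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()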

    However, the story doesn’t end here. Just before writing this article, I tried the smaller batch size with my old “apex.amp” code, just as I had with “torch.cuda.amp”, and it worked well too…

    All in all, the terrible error above was simply caused by insufficient GPU memory, not by a broken environment or a bug in apex.
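
    If you hit a similar cuDNN internal error, one quick sanity check is to log the allocator statistics right before the failing backward pass; if reserved memory is close to the card’s capacity (12 GiB on an RTX 3080 Ti), suspect memory pressure rather than a broken environment. A small helper, just as a sketch:

    import torch

    def log_gpu_memory(tag):
        # Memory held by live tensors vs. reserved by the caching allocator.
        allocated = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        total = torch.cuda.get_device_properties(0).total_memory / 2**30
        print(f"[{tag}] allocated={allocated:.2f} GiB, "
              f"reserved={reserved:.2f} GiB, total={total:.2f} GiB")

    # Call e.g. log_gpu_memory("before backward") around the failing step;
    # cuDNN can surface an out-of-memory condition as
    # CUDNN_STATUS_INTERNAL_ERROR instead of a clean OOM error.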


