
    Strange error from Nvidia’s apex library

    Posted by RobinDong on 2022-08-25 23:49:13

    apex is a mixed-precision training library from Nvidia. I have been using it since I got an RTX 3080 Ti GPU. A few days ago, I started to train RegNetY-32GF (previously I had only used RegNetY models smaller than 16GF). After an accidental interruption, I tried to resume the training, but it reported:

    Traceback (most recent call last):
      File "train.py", line 353, in <module>
        train(args, train_loader, eval_loader)
      File "train.py", line 220, in train
        scaled_loss.backward()
      File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 401, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 191, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
    You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
    
    import torch
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.benchmark = True
    torch.backends.cudnn.deterministic = False
    torch.backends.cudnn.allow_tf32 = True
    data = torch.randn([28, 3712, 10, 10], dtype=torch.half, device='cuda', requires_grad=True)
    net = torch.nn.Conv2d(3712, 3712, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
    net = net.cuda().half()
    out = net(data)
    out.backward(torch.randn_like(out))
    torch.cuda.synchronize()
    
    ConvolutionParams 
        data_type = CUDNN_DATA_HALF
        padding = [0, 0, 0]
        stride = [1, 1, 0]
        dilation = [1, 1, 0]
        groups = 1
        deterministic = false
        allow_tf32 = true
    input: TensorDescriptor 0x55d2a620ff60
        type = CUDNN_DATA_HALF
        nbDims = 4
        dimA = 28, 3712, 10, 10, 
        strideA = 371200, 100, 10, 1, 
    output: TensorDescriptor 0x55d2a6215310
        type = CUDNN_DATA_HALF
        nbDims = 4
        dimA = 28, 3712, 10, 10, 
        strideA = 371200, 100, 10, 1, 
    weight: FilterDescriptor 0x7fd9e806f1e0
        type = CUDNN_DATA_HALF
        tensor_format = CUDNN_TENSOR_NCHW
        nbDims = 4
        dimA = 3712, 3712, 1, 1, 
    Pointer addresses: 
        input: 0x7fd73fde3a00
        output: 0x7fd746abb600
        weight: 0x7fd761b5de00
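
    For context, the scaled_loss.backward() in the traceback comes from apex's loss-scaling context manager. A minimal sketch of the training-step pattern I use (the model and optimizer here are illustrative stand-ins, not my actual RegNetY setup):

    import torch
    from apex import amp

    # Illustrative stand-ins for the real model and optimizer.
    model = torch.nn.Conv2d(3712, 3712, kernel_size=1).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # "O1" is apex's mixed-precision mode: selected ops run in fp16.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    def train_step(data, target):
        optimizer.zero_grad()
        loss = (model(data) - target).pow(2).mean()
        # apex scales the loss to avoid fp16 gradient underflow; this
        # is the scaled_loss.backward() call seen in the traceback.
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()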

    The error looks quite scary, so the first thought that came to my mind was: the training environment has crashed! I downloaded the newest GPU driver and pulled the most up-to-date Docker container for PyTorch. But the error persisted.

    On second thought, I began to suspect that apex couldn’t handle models this big (what was I thinking?), so I modified my code to use “torch.cuda.amp” instead of “apex.amp”, following the official documentation. Fortunately, the error disappeared, but I had to use a smaller batch size. It looked as if “torch.cuda.amp” couldn’t reduce GPU memory usage as much as “apex.amp”.
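
    For reference, switching to native AMP means replacing apex’s amp.initialize/scale_loss pattern with torch.cuda.amp’s autocast and GradScaler. A minimal sketch of the replacement, again with illustrative model and optimizer names:

    import torch
    from torch.cuda.amp import GradScaler, autocast

    model = torch.nn.Conv2d(3712, 3712, kernel_size=1).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scaler = GradScaler()

    def train_step(data, target):
        optimizer.zero_grad()
        # autocast runs eligible ops in fp16 and the rest in fp32.
        with autocast():
            loss = (model(data) - target).pow(2).mean()
        # GradScaler scales the loss, then unscales the gradients
        # before the optimizer step.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()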

    However, the story doesn’t end here. Just before writing this article, I tried the smaller batch size with my old “apex.amp” code, just as I had with “torch.cuda.amp”, and it worked well too…

    All in all, the terrible error above was simply caused by insufficient GPU memory, not by a broken environment or a bug in apex.
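
    If you hit a similar cuDNN internal error, one quick sanity check is to log the allocator statistics right before the failing backward pass; if reserved memory is close to the card’s capacity (12 GiB on an RTX 3080 Ti), suspect memory pressure rather than a broken environment. A small helper, just as a sketch:

    import torch

    def log_gpu_memory(tag):
        # Memory held by live tensors vs. reserved by the caching allocator.
        allocated = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        total = torch.cuda.get_device_properties(0).total_memory / 2**30
        print(f"[{tag}] allocated={allocated:.2f} GiB, "
              f"reserved={reserved:.2f} GiB, total={total:.2f} GiB")

    # Call e.g. log_gpu_memory("before backward") around the failing step;
    # cuDNN can surface an out-of-memory condition as
    # CUDNN_STATUS_INTERNAL_ERROR instead of a clean OOM error.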


