
    Experiments about ‘accelerate’ library of HuggingFace

    Posted by RobinDong on 2025-02-04 00:28:13

    If you want to run your training code with ‘accelerate’ in fp8, you need to install ‘transformer_engine’ or ‘MS-AMP’. But these two packages are hard to install because they depend on specific CUDA/CUDNN versions. After one afternoon’s effort, I finally gave up and switched to using the Docker image ‘nvcr.io/nvidia/pytorch:24.04-py3’ directly.

    docker run \
      --gpus all \
      -it \
      --rm \
      --shm-size="16g" \
      --network host \
      nvcr.io/nvidia/pytorch:24.04-py3
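
    This NGC image already ships with TransformerEngine built against matching CUDA/CUDNN, which is exactly the dependency that was painful to build by hand. A quick sanity check inside the container (a minimal sketch; the ‘te.Linear’ layer is just an arbitrary example to confirm the CUDA extension loads):

    # Sanity check: TransformerEngine (the dependency that is hard to build by hand)
    # should already be importable inside the NGC container.
    import torch
    import transformer_engine.pytorch as te

    print(torch.__version__)
    print(te.Linear(16, 16))  # constructing a TE layer confirms its CUDA extension loads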

    After entering the container with the above command, I still needed to install ‘accelerate’ with ‘python3 -m pip install accelerate’. In ‘accelerate config’, I chose ‘fp8’ with the ‘E4M3’ format. But the training process reported an error about LayerNorm, so I manually modified the code (this may not be correct, but it works):

    # transformer_engine/pytorch/module/layernorm.py
    
    class _LayerNorm(torch.autograd.Function):
        """functional LayerNorm"""
    
        @staticmethod
        def forward(
            ctx,
            inp: torch.Tensor,
            ln_weight: torch.Tensor,
            ln_bias: torch.Tensor,
            eps: float,
            fwd_ln_sm_margin: int,
            bwd_ln_sm_margin: int,
            zero_centered_gamma: bool,
            is_grad_enabled: bool,
            activation_dtype: torch.dtype,
        ) -> torch.Tensor:
            # Make sure input dimensions are compatible
            in_features = ln_weight.numel()
            assert inp.is_cuda, "TransformerEngine needs CUDA."
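            # Manual modification: if the channel dim is not last (e.g. an NCHW tensor),
            # permute it to the end for the shape check, then permute back before
            # flattening the input for the fused LayerNorm kernel.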
            permute = False
            if inp.shape[-1] != in_features:
                inp = inp.permute(0, 2, 3, 1)
                permute = True
            assert inp.shape[-1] == in_features, "LayerNorm not possible"
            if permute:
                inp = inp.permute(0, 3, 1, 2)
            inputmat = inp.reshape((-1, in_features))
    
            # Cast for native AMP
            inputmat = cast_if_needed(inputmat, activation_dtype)
            ln_weight = cast_if_needed(ln_weight, activation_dtype)
            ln_bias = cast_if_needed(ln_bias, activation_dtype)
    
            if is_grad_enabled:
                ln_out, mu, rsigma = tex.layernorm_fwd(inputmat, ln_weight,
                    ln_bias, eps, fwd_ln_sm_margin, zero_centered_gamma)
                ctx.save_for_backward(inputmat, ln_weight, mu, rsigma)
                ctx.inp_shape = inp.shape
                ctx.bwd_ln_sm_margin = bwd_ln_sm_margin
                ctx.zero_centered_gamma = zero_centered_gamma
            else:
                ln_out, mu, rsigma = layernorm_fwd_inf(inputmat, ln_weight,
                    ln_bias, eps, zero_centered_gamma), None, None
            return ln_out.view_as(inp)
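
    For reference, the fp8/E4M3 choice made through ‘accelerate config’ above can also be expressed in code. A minimal sketch, assuming a recent ‘accelerate’ version that exposes FP8RecipeKwargs (the tiny model and optimizer are only placeholders):

    import torch
    from accelerate import Accelerator
    from accelerate.utils import FP8RecipeKwargs

    # Equivalent of answering "fp8" / "E4M3" in `accelerate config`:
    # TransformerEngine backend with the E4M3 format.
    fp8_kwargs = FP8RecipeKwargs(backend="TE", fp8_format="E4M3")
    accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=[fp8_kwargs])

    model = torch.nn.Linear(1024, 1024)  # placeholder model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    model, optimizer = accelerator.prepare(model, optimizer)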
    

    Finally, the training worked properly. But the speed was the same as bf16…


