    Do sinusoidal Positional Embeddings actually work well?

    RobinDong, posted on 2024-03-13 00:52:51

    The GPT part of my Multimodal trials mainly comes from nanoGPT. In nanoGPT, the Positional Encoding is just a learnable tensor (“wpe” stands for “weights of positional embedding”):

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, bias=config.bias),
        ))
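
    For context, the learned table is simply looked up by position index and added to the token embeddings. Here is a minimal, self-contained sketch of that pattern (the sizes and variable names are illustrative, not taken from nanoGPT’s actual forward pass):

        import torch
        import torch.nn as nn

        # Sketch: combine a learned positional table with token embeddings,
        # mirroring the ModuleDict above. Sizes are illustrative only.
        vocab_size, block_size, n_embd = 65, 256, 384

        wte = nn.Embedding(vocab_size, n_embd)   # token embeddings
        wpe = nn.Embedding(block_size, n_embd)   # learned positional embeddings

        idx = torch.randint(0, vocab_size, (4, 32))   # (batch, t) token ids, t <= block_size
        pos = torch.arange(idx.size(1))               # positions 0 .. t-1
        x = wte(idx) + wpe(pos)                       # (4, 32, n_embd); wpe(pos) broadcasts over batch
        print(x.shape)                                # torch.Size([4, 32, 384])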

    It’s different from the implementation in the original paper. The paper mentioned:

    We also experimented with using learned positional embeddings instead, and found that the two versions produced nearly identical results.

    The “vanilla” Positional Embeddings for the transformer are two functions:

    PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})

    PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})
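
    For reference, here is a straightforward PyTorch sketch that builds such a table by alternating sin and cos across embedding dimensions, as the formulas specify (assuming d_model is even; this is not the code used in the experiments below):

        import torch

        def sinusoid_table(block_size: int, d_model: int, base: float = 10000.0) -> torch.Tensor:
            # PE[pos, 2i]   = sin(pos / base^(2i/d_model))
            # PE[pos, 2i+1] = cos(pos / base^(2i/d_model))
            # Assumes d_model is even.
            pos = torch.arange(block_size, dtype=torch.float32).unsqueeze(1)  # (block_size, 1)
            two_i = torch.arange(0, d_model, 2, dtype=torch.float32)          # the "2i" exponents
            inv_denom = torch.pow(base, -two_i / d_model)                     # 1 / base^(2i/d_model)
            pe = torch.zeros(block_size, d_model)
            pe[:, 0::2] = torch.sin(pos * inv_denom)   # even dimensions
            pe[:, 1::2] = torch.cos(pos * inv_denom)   # odd dimensions
            return pe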

    Which one works better in model training? Let me run “python train.py config/train_shakespeare_char.py” in nanoGPT and use the best validation loss as the metric.

    I wrote my own sinusoidal Positional Embeddings for testing:

        class GPT(nn.Module):
            def __init__(self, config):
                ...
                # Position Embedding from the original Transformer paper.
                # Note: this loop alternates sin/cos per token position (even pos -> sin,
                # odd pos -> cos), rather than per embedding dimension as in the formulas above.
                divisor = torch.pow(
                    10000, 2 * torch.arange(1, config.n_embd + 1) / config.n_embd
                )
                pe = []
                for pos in range(1, config.block_size + 1):
                    if pos % 2 == 0:
                        pe.append(torch.sin(pos / divisor).unsqueeze(0))
                    else:
                        pe.append(torch.cos(pos / divisor).unsqueeze(0))
                self.register_buffer("pos_emb", torch.cat(pe, 0))

    The “10000” (let’s call it the “base number” for convenience) looks too big for a short sequence length, so I ran experiments changing it to “block_size”, “2*block_size”, etc.
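
    One way to parameterize the base number for these runs (a sketch, not necessarily the exact change used; block_size = 256 and n_embd = 384 are assumed from the train_shakespeare_char config):

        import torch

        def sinusoid_pos_emb(block_size: int, n_embd: int, base: float) -> torch.Tensor:
            # Same construction as in __init__ above, with the base number as a parameter.
            divisor = torch.pow(base, 2 * torch.arange(1, n_embd + 1) / n_embd)
            rows = [
                (torch.sin if pos % 2 == 0 else torch.cos)(pos / divisor).unsqueeze(0)
                for pos in range(1, block_size + 1)
            ]
            return torch.cat(rows, 0)

        block_size, n_embd = 256, 384  # assumed sizes for train_shakespeare_char
        for base in (10000, 4 * block_size, 2 * block_size, 3.14 / 2 * block_size, block_size):
            pos_emb = sinusoid_pos_emb(block_size, n_embd, float(base))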

    The test results:

    Configuration                        Validation loss
    Original nanoGPT                     1.4754
    Base number: 10000                   1.4959
    Base number: 4 * block_size          1.4916
    Base number: 2 * block_size          1.4995
    Base number: 3.14/2 * block_size     1.4870
    Base number: block_size              1.4947

    From my simple tests, the learnable Positional Embeddings work best. nanoGPT wins this round.

    I have a guess about why the Transformer authors chose “10000”. The smallest “pos” is 1 and the biggest 2i/d_{model} is 2, so the smallest argument passed to sin() is 1/10000^2 = 1e-8, which is very close to the smallest positive value of float16, 5.96e-8.
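
    A quick arithmetic check of the numbers behind this guess (2^-24 is the smallest positive, subnormal float16 value):

        smallest_arg = 1 / 10000 ** 2   # 1e-08: smallest sin() argument under the reasoning above
        fp16_smallest = 2.0 ** -24      # ~5.96e-08: smallest positive (subnormal) float16 value
        print(smallest_arg, fp16_smallest)   # 1e-08 5.960464477539063e-08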


