
    Augmentation helps ALBEF a lot

    Posted by RobinDong on 2024-06-28 05:13:36

    I was trying to implement ALBEF by myself for practice. After finishing all the parts (the vision part and the BERT part, including the Masked Language Model), I trained the model on the COCO-Captions/SBU-Captions/CC3M/CC12M datasets (actually more data than the original paper used). But the results were quite weird: an old steam train was recognised as a building, and a few fish were recognised as statues.

    To fix these weird mistakes, I reviewed the code many times and finally noticed a sentence in the paper:

    Although it’s just an ordinary sentence in the paper, the augmentation it describes improves the ALBEF model significantly. After randomly cropping the 256×256 raw image to 224×224 and also applying RandAugment, I finally got a much more stable model. A minimal sketch of the augmentation pipeline is below, and then some examples.
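    The sketch below uses torchvision's built-in RandAugment rather than the custom implementation shipped in the official ALBEF repository; the crop scale, number of ops, magnitude, and normalization statistics are my own assumptions, not values taken from the paper.

```python
# Training-time augmentation sketch: random crop 256x256 -> 224x224 plus RandAugment.
# Assumptions: torchvision's RandAugment, guessed crop scale / op count / magnitude
# and normalization statistics (not the official ALBEF settings).
from torchvision import transforms

train_transform = transforms.Compose([
    # Randomly crop a region of the 256x256 raw image and resize it to 224x224.
    transforms.RandomResizedCrop(
        224, scale=(0.5, 1.0),
        interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.RandomHorizontalFlip(),
    # Apply two randomly chosen RandAugment ops per image.
    transforms.RandAugment(num_ops=2, magnitude=7),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])
```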

    Previously, the fish had been recognised as “shoes” and the bedroom as “city”. After augmentation, all of them are recognised correctly.

    But there are still some interesting bad cases:

    Adding the prefix “A picture of” helps the ALBEF model improve its recognition, probably because there is a lot of text like “A picture of XXX” in the CC3M and CC12M datasets.
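    To make the prefix trick concrete, here is a hypothetical sketch of how candidate labels can be scored against an image with an image-text similarity; `model.encode_image`, `model.encode_text`, and `tokenizer` are placeholder names for my own implementation, not an official ALBEF API.

```python
# Hypothetical sketch: rank candidate labels for an image by cosine similarity
# between the image embedding and the prompted text embeddings.
# `model.encode_image`, `model.encode_text`, and `tokenizer` are placeholder
# names, not an official ALBEF API.
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_labels(model, tokenizer, image, labels, prefix="A picture of "):
    prompts = [prefix + label for label in labels]
    image_feat = model.encode_image(image.unsqueeze(0))   # shape (1, D)
    text_feat = model.encode_text(tokenizer(prompts))     # shape (N, D)
    # Cosine similarity between the single image and every prompted label.
    sims = F.cosine_similarity(image_feat, text_feat, dim=-1)
    return sorted(zip(labels, sims.tolist()), key=lambda kv: -kv[1])
```

    With the prefix, a label like “train” is compared as “A picture of train”, which sits closer to the caption distribution of CC3M/CC12M than the bare label does.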

    Anyhow, I finally implemented and trained a workable ALBEF model by myself, on a single RTX 3080 Ti card.


