
    Training CIFAR-100 by DeepSpeed

    Posted by RobinDong on 2024-01-19 06:12:18

    The job is launched with the deepspeed launcher, allowing the cluster to elastically shrink to one node or grow to two:

    deepspeed \
      --master_addr=rogpt1 \
      --elastic_training \
      --min_elastic_nodes=1 \
      --max_elastic_nodes=2 \
      --hostfile=hostfile \
      train.py \
      --deepspeed_config ds_config.json
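
    The hostfile given to the launcher lists the nodes and how many GPU slots each offers. A minimal sketch for this setup, assuming one GPU per node (rogpt1 comes from the command above; rogpt2 is a hypothetical second node):

      rogpt1 slots=1
      rogpt2 slots=1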

    After the cluster scales up or down, every node has to resume from the latest checkpoint, which leaves two options for where checkpoints are written (a save/resume sketch follows the list):

    1. Use a shared file system for the cluster (GCP Filestore, AWS EFS, or plain NFS) and let only the master node save the checkpoint. The saved checkpoint is then visible to all other nodes through the shared file system.
    2. Or set "use_node_local_storage" to true in the checkpoint section of the DeepSpeed config, so that every node saves the checkpoint to its own local storage.
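
    As a sketch of what saving and resuming looks like in the training script (the directory name and tag are hypothetical; model_engine comes from deepspeed.initialize(), as in the train.py sketch at the end):

      # Minimal checkpoint sketch. With "use_node_local_storage": true every node
      # writes to its own local disk; with option 1 the directory sits on the
      # shared file system and only the master node needs to call save_checkpoint().
      CKPT_DIR = "checkpoints"  # hypothetical path

      def save(model_engine, step):
          # tag names the checkpoint; client_state is arbitrary metadata stored with it
          model_engine.save_checkpoint(CKPT_DIR, tag=f"step{step}",
                                       client_state={"step": step})

      def resume(model_engine):
          # load_checkpoint() returns (load_path, client_state);
          # load_path is None when no checkpoint exists yet
          load_path, client_state = model_engine.load_checkpoint(CKPT_DIR)
          return 0 if load_path is None else client_state["step"]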

    The ds_config.json passed to the launcher enables node-local checkpoint storage together with elasticity:

    {
       "steps_per_print": 2000,
       "checkpoint": {
         "use_node_local_storage": true
       },
       "elasticity": {
         "enabled": true,
         "micro_batch_sizes": [64,128,256],
         "max_train_batch_size": 1024
       },
       "optimizer": {
         "type": "Adam",
         "params": {
           "lr": 0.001,
           "betas": [
             0.8,
             0.999
           ],
           "eps": 1e-8,
           "weight_decay": 3e-7
         }
       },
       "scheduler": {
         "type": "WarmupLR",
         "params": {
           "warmup_min_lr": 0,
           "warmup_max_lr": 0.001,
           "warmup_num_steps": 1000
         }
       },
       "wall_clock_breakdown": false
    }
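
    With this config, DeepSpeed's elasticity support picks a micro batch size from micro_batch_sizes so that micro batch × number of GPUs × gradient accumulation steps stays within max_train_batch_size: for example, on two single-GPU nodes, a micro batch of 256 with 2 gradient accumulation steps gives a global batch of 1024.

    For completeness, a minimal train.py sketch, assuming torchvision provides the CIFAR-100 data and a stock ResNet-18 stands in for the model (both choices are hypothetical; the optimizer and scheduler are built by DeepSpeed from ds_config.json, not by the script):

      import argparse

      import deepspeed
      import torch
      import torchvision
      import torchvision.transforms as transforms

      def get_args():
          parser = argparse.ArgumentParser(description="CIFAR-100 with DeepSpeed")
          parser.add_argument("--local_rank", type=int, default=-1)
          parser = deepspeed.add_config_arguments(parser)  # adds --deepspeed_config etc.
          return parser.parse_args()

      def main():
          args = get_args()
          model = torchvision.models.resnet18(num_classes=100)  # hypothetical model
          trainset = torchvision.datasets.CIFAR100(
              root="./data", train=True, download=True,
              transform=transforms.ToTensor())

          # Builds the engine, optimizer, data loader and LR scheduler from
          # ds_config.json; the micro batch size comes from the elasticity section.
          model_engine, _, trainloader, _ = deepspeed.initialize(
              args=args, model=model, model_parameters=model.parameters(),
              training_data=trainset)

          criterion = torch.nn.CrossEntropyLoss()
          for images, labels in trainloader:
              images = images.to(model_engine.device)
              labels = labels.to(model_engine.device)
              loss = criterion(model_engine(images), labels)
              model_engine.backward(loss)  # DeepSpeed-managed backward
              model_engine.step()          # optimizer (and scheduler) step

      if __name__ == "__main__":
          main()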

