Inquiry about Nemotron 3 Nano technical report training details

#34
by andresnowak - opened

Hello, I wanted to ask two questions about training details:

  • My first question is about this text: "For the MoE layers, we used DeepSeek’s aux-loss-free load balancing strategy (Wang et al., 2024; DeepSeek-AI, 2025b) with an update rate of 10⁻³ in conjunction with the standard load balancing loss (Lepikhin et al., 2020). We used a load balancing loss coefficient of 10⁻⁴." (Figure 3 in https://arxiv.org/abs/2512.20848v1). So the model was trained with both the aux-loss-free strategy and the standard load-balancing loss? How does that work, and what is the reason for combining the two? (See the first sketch after this list for how I currently understand the combination.)

  • My second question is about the post-training section 3.1.6: "We train for 13000 steps using a batch size of 64 and employ sequence packing to a sequence length of 256K. We use a learning rate of 5·10⁻⁵ and use 800 steps of learning rate warmup. We use a sequence-level MoE load balancing regularizer and set the loss coefficient to 10⁻⁴." I wanted to ask about the batch size of 64, because I also saw in https://arxiv.org/abs/2511.18538 that, during post-training of a coding model, the Qwen3-30B-A3 MoE model was more sensitive to learning-rate and batch-size changes than the dense Qwen2.5-Coder-14B. Is your choice of batch size 64 related to this, and do you have an intuition for why it is needed? For example, is it because the balancing loss depends on the batch size? Then again, you use a sequence-level balancing loss, so maybe that is not the reason. Or was the sequence-level loss chosen to help with expert specialization in the model? (The second sketch below shows the batch-level vs. sequence-level distinction I have in mind.)
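For the first question, here is a minimal sketch of how I understand the two mechanisms coexisting: a DeepSeek-style bias on the routing scores that only affects expert selection and is updated without gradients, plus a standard GShard-style auxiliary loss with a small coefficient. This is just my own illustration, not code from the report; the names (`AuxFreePlusAuxLossRouter`, `expert_bias`, `update_rate`, `aux_coef`) and the exact score function are assumptions.

```python
# Hypothetical sketch (not NVIDIA's actual implementation) of a top-k MoE router
# that combines an aux-loss-free selection bias with a standard auxiliary
# load-balancing loss.
import torch
import torch.nn.functional as F


class AuxFreePlusAuxLossRouter(torch.nn.Module):
    def __init__(self, hidden_dim, num_experts, top_k, update_rate=1e-3, aux_coef=1e-4):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_dim, num_experts, bias=False)
        # Bias used ONLY for expert selection; updated without gradients.
        self.register_buffer("expert_bias", torch.zeros(num_experts))
        self.top_k = top_k
        self.update_rate = update_rate
        self.aux_coef = aux_coef

    def forward(self, x):  # x: [num_tokens, hidden_dim]
        probs = self.gate(x).softmax(dim=-1)                     # [T, E]
        num_experts = probs.size(-1)

        # Selection uses the biased scores; combine weights use the unbiased probs.
        _, topk_idx = (probs + self.expert_bias).topk(self.top_k, dim=-1)
        mask = F.one_hot(topk_idx, num_experts).sum(1).float()   # [T, E], 1 if routed

        # Standard auxiliary load-balancing loss (Lepikhin et al. style):
        # fraction of tokens per expert * mean gate probability per expert,
        # scaled so a perfectly balanced router gives a value around 1.
        frac_tokens = mask.mean(0) * num_experts / self.top_k
        mean_prob = probs.mean(0) * num_experts
        aux_loss = self.aux_coef * (frac_tokens * mean_prob).mean()

        # Aux-loss-free update: nudge the bias toward under-loaded experts,
        # with a fixed update rate and no gradient involved.
        with torch.no_grad():
            load = mask.sum(0)
            self.expert_bias += self.update_rate * (load.mean() - load).sign()

        weights = probs.gather(-1, topk_idx)
        weights = weights / weights.sum(-1, keepdim=True)
        return topk_idx, weights, aux_loss
```

If this picture is right, the bias handles most of the balancing without distorting the gradients, and the small auxiliary loss only acts as a weak regularizer on top of it, which is what I would like confirmed.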
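And for the second question, this is the distinction between batch-level and sequence-level balancing that I have in mind; again this is only my own sketch with assumed shapes and names, not code from either paper. In the sequence-level case the statistics are computed per sequence and then averaged, so the batch size mostly changes how many sequences are averaged rather than the scale of the statistic itself.

```python
# Hypothetical sketch of batch-level vs. sequence-level MoE load-balancing losses.
import torch


def load_balance_loss(probs, mask, top_k, sequence_level=True, coef=1e-4):
    # probs: [batch, seq_len, num_experts] router probabilities
    # mask:  [batch, seq_len, num_experts] 1.0 where the token was routed to that expert
    num_experts = probs.size(-1)
    if sequence_level:
        dims = (1,)      # balance statistics per sequence, then averaged over the batch
    else:
        dims = (0, 1)    # balance statistics over all tokens in the batch at once
    frac_tokens = mask.float().mean(dim=dims) * num_experts / top_k
    mean_prob = probs.mean(dim=dims) * num_experts
    return coef * (frac_tokens * mean_prob).mean()
```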
