Update README.md
README.md
CHANGED
@@ -1,3 +1,189 @@
# LLaDA-MoE

**LLaDA-MoE** is a new and upgraded series of LLaDA diffusion language models. This pre-release includes two cutting-edge models:

- `LLaDA-MoE-7B-A1B-Base`: A base pre-trained model designed for research and secondary development.
- `LLaDA-MoE-7B-A1B-Instruct`: An instruction-tuned model optimized for practical applications.

---




## 🚀 Performance Highlights

- **Leading MoE Architecture**:
  The first open-source **Mixture-of-Experts (MoE) diffusion large language model**, pre-trained from scratch on approximately **20 trillion tokens**.

- **Efficient Inference**:
  With **7 billion total parameters**, only **1.4 billion** are activated during inference. LLaDA-MoE significantly reduces computational costs while outperforming open-source dense models of similar scale.

- **Impressive Performance on Code & Complex Reasoning**:
  Excels in tasks such as **code generation** and **advanced mathematical reasoning**, demonstrating strong reasoning capabilities.

- **Tool Use**:
  Supports **tool calling** and achieves excellent performance in complex agent-based tasks.

- **Open & Extensible**:
  Fully open source, with a commitment to transparency. We plan to release a **leading inference framework** in the future and to keep investing in cutting-edge areas such as **diffusion LLMs (dLLM)** to drive disruptive innovation.

---

## 📦 Model Variants

| Model ID | Description | Hugging Face Link |
|----------|-------------|-------------------|
| [`inclusionAI/LLaDA-MoE-7B-A1B-Base`](https://huggingface.co/inclusionAI/LLaDA-MoE-7B-A1B-Base) | Base pre-trained model for research and fine-tuning. | [🤗 Model Card](https://huggingface.co/inclusionAI/LLaDA-MoE-7B-A1B-Base) |
| [`inclusionAI/LLaDA-MoE-7B-A1B-Instruct`](https://huggingface.co/inclusionAI/LLaDA-MoE-7B-A1B-Instruct) | Instruction-tuned model, ready for downstream applications. | [🤗 Model Card](https://huggingface.co/inclusionAI/LLaDA-MoE-7B-A1B-Instruct) |

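If you prefer to fetch a checkpoint ahead of time (for example, for an offline machine), `huggingface_hub` can mirror either repository locally. A minimal sketch; the `local_dir` path is only an illustrative choice, not a requirement:

```python
# Optional: pre-download a checkpoint from the Hub before loading it.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="inclusionAI/LLaDA-MoE-7B-A1B-Instruct",  # or LLaDA-MoE-7B-A1B-Base
    local_dir="./LLaDA-MoE-7B-A1B-Instruct",          # illustrative local path
)
```
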
---

## 📖 Model Overview

**LLaDA-MoE-7B-A1B** has the following specifications (they can also be read from the released config, as sketched below the list):

- **Type**: Mixture-of-Experts (MoE) Diffusion Language Model
- **Total Parameters (Non-Embedding)**: 7.03B
- **Number of Layers**: 16
- **Attention Heads**: 16
- **Context Length**: 4,096 tokens
- **Position Embedding**: Rotary (RoPE)
- **Vocabulary Size**: 157,184

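A minimal config-inspection sketch: attribute names can differ between releases, so treat the printed config as the source of truth; `vocab_size` is assumed to be exposed, as it is for most `transformers` configs.

```python
# Sketch: inspect the released configuration of the checkpoint.
# Printing the full config shows exactly which fields this release exposes.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "inclusionAI/LLaDA-MoE-7B-A1B-Instruct", trust_remote_code=True
)
print(config)             # full architecture configuration
print(config.vocab_size)  # expected to match the 157,184 listed above
```
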
---

## ⚡ Quickstart

Make sure you have `transformers` and its dependencies installed:

```bash
pip install transformers torch
```

You can then load the model and tokenizer with the `AutoModel` and `AutoTokenizer` classes (note `trust_remote_code=True`) and generate text with the diffusion sampler below:

```python
import torch
import numpy as np
import torch.nn.functional as F

from transformers import AutoTokenizer, AutoModel


def add_gumbel_noise(logits, temperature):
    # Gumbel-max style sampling; temperature == 0 keeps the raw logits (greedy decoding).
    if temperature == 0:
        return logits
    logits = logits.to(torch.float64)
    noise = torch.rand_like(logits, dtype=torch.float64)
    gumbel_noise = (- torch.log(noise)) ** temperature
    return logits.exp() / gumbel_noise


def get_num_transfer_tokens(mask_index, steps):
    # Spread the masked positions evenly across the denoising steps.
    mask_num = mask_index.sum(dim=1, keepdim=True)

    base = mask_num // steps
    remainder = mask_num % steps

    num_transfer_tokens = torch.zeros(mask_num.size(0), steps, device=mask_index.device, dtype=torch.int64) + base

    for i in range(mask_num.size(0)):
        num_transfer_tokens[i, :remainder[i]] += 1

    return num_transfer_tokens


@torch.no_grad()
def generate(model, prompt, steps=128, gen_length=128, block_length=128, temperature=0.,
             cfg_scale=0., remasking='low_confidence', mask_id=156895):
    # mask_id is the token id used for masked (not-yet-generated) positions.
    # Start from the prompt followed by a fully masked completion of length gen_length.
    x = torch.full((1, prompt.shape[1] + gen_length), mask_id, dtype=torch.long).to(model.device)
    x[:, :prompt.shape[1]] = prompt.clone()
    prompt_index = (x != mask_id)

    assert gen_length % block_length == 0
    num_blocks = gen_length // block_length
    assert steps % num_blocks == 0
    steps = steps // num_blocks

    # Semi-autoregressive decoding: denoise one block at a time, left to right.
    for num_block in range(num_blocks):
        block_mask_index = (x[:, prompt.shape[1] + num_block * block_length: prompt.shape[1] + (num_block + 1) * block_length] == mask_id)
        num_transfer_tokens = get_num_transfer_tokens(block_mask_index, steps)
        for i in range(steps):
            mask_index = (x == mask_id)
            if cfg_scale > 0.:
                # Classifier-free guidance: contrast against a copy with the prompt masked out.
                un_x = x.clone()
                un_x[prompt_index] = mask_id
                x_ = torch.cat([x, un_x], dim=0)
                logits = model(x_).logits
                logits, un_logits = torch.chunk(logits, 2, dim=0)
                logits = un_logits + (cfg_scale + 1) * (logits - un_logits)
            else:
                logits = model(x).logits

            logits_with_noise = add_gumbel_noise(logits, temperature=temperature)
            x0 = torch.argmax(logits_with_noise, dim=-1)  # b, l

            if remasking == 'low_confidence':
                p = F.softmax(logits, dim=-1)
                x0_p = torch.squeeze(
                    torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1)  # b, l
            elif remasking == 'random':
                x0_p = torch.rand((x0.shape[0], x0.shape[1]), device=x0.device)
            else:
                raise NotImplementedError(remasking)

            # Never unmask positions beyond the current block.
            x0_p[:, prompt.shape[1] + (num_block + 1) * block_length:] = -np.inf

            x0 = torch.where(mask_index, x0, x)
            confidence = torch.where(mask_index, x0_p, -np.inf)

            # Commit only the most confident predictions this step; the rest stay masked.
            transfer_index = torch.zeros_like(x0, dtype=torch.bool, device=x0.device)
            for j in range(confidence.shape[0]):
                _, select_index = torch.topk(confidence[j], k=num_transfer_tokens[j, i])
                transfer_index[j, select_index] = True
            x[transfer_index] = x0[transfer_index]

    return x


device = 'cuda'
model = AutoModel.from_pretrained('inclusionAI/LLaDA-MoE-7B-A1B-Instruct', trust_remote_code=True, torch_dtype=torch.bfloat16).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained('inclusionAI/LLaDA-MoE-7B-A1B-Instruct', trust_remote_code=True)

prompt = "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?"
m = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": prompt}
]
prompt = tokenizer.apply_chat_template(m, add_generation_prompt=True, tokenize=False)

input_ids = tokenizer(prompt)['input_ids']
input_ids = torch.tensor(input_ids).to(device).unsqueeze(0)

text = generate(model, input_ids, steps=128, gen_length=128, block_length=32, temperature=0., cfg_scale=0., remasking='low_confidence')
print(tokenizer.batch_decode(text[:, input_ids.shape[1]:], skip_special_tokens=False)[0])
```
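
The sampler's arguments trade speed for quality: `gen_length` is the number of tokens to generate, `block_length` sets the size of the semi-autoregressive blocks, and `steps` is split evenly across blocks, so `steps` must be divisible by `gen_length / block_length`. As a rough sketch under those constraints (not a tuned setting), a faster call looks like this:

```python
# Faster, lower-quality sketch: 64 new tokens in 2 blocks of 32, i.e. 32 denoising steps per block.
text = generate(model, input_ids, steps=64, gen_length=64, block_length=32,
                temperature=0., cfg_scale=0., remasking='low_confidence')
print(tokenizer.batch_decode(text[:, input_ids.shape[1]:], skip_special_tokens=True)[0])
```
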
## 📚 Citation (Coming Soon)

We are preparing the technical report and citation information.
Stay tuned: citation details will be available soon.

---

## 📄 License

This project is licensed under the terms of the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).

---

## 🤝 Contact & Collaboration

For questions, collaborations, or feedback, please reach out via [Hugging Face](https://huggingface.co/your-model-page) or open an issue in the [repository](https://github.com/your-repo-link).

🌟 Join us in advancing open, efficient, and intelligent language models!