Update README.md

Commit message: add some bullets emoji to model card

README.md (CHANGED): the updated sections are shown below.

**Notes**

- We compute all evaluation metrics ourselves.
- Language benchmarks are computed following the convention of [the Huggingface Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard): AI2 Reasoning Challenge in 25-shot, HellaSwag in 10-shot, MMLU in 5-shot, and TruthfulQA in 0-shot.
- As reported in prior work, the choice of sampling temperature strongly affects the programming metrics, so we evaluate all models with the following temperatures:
  - Scores for HumanEval are computed with a temperature of 0.2.
  - Scores for MBPP are computed with a temperature of 0.1.
- For a detailed token breakdown of the CrystalCoder dataset, refer to the [CrystalCoder dataset repository](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets).

## About LLM360

LLM360 is an initiative for comprehensive and fully open-sourced LLMs,

Get access now at [LLM360 site](https://www.llm360.ai/)

## 📣 Model Description

- **Model type:** Language model with the same architecture as LLaMA-7B
- **Language(s) (NLP):** English

- [Metrics](https://github.com/LLM360/Analysis360)
- [Fully processed CrystalCoder pretraining data](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets)

# 📣 Model Architecture

CrystalCoder leverages a GPT-like architecture, akin to LLaMA, but with the addition of maximal update parameterization (**muP**); a simplified sketch of the muP idea follows the list below.

For other architecture choices:

- Training sequence length is `2048`.
- Embedding (vocabulary) size is `32032`.
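
For readers unfamiliar with muP: hyperparameters are tuned on a narrow proxy model and then transferred to the full-width model by rescaling quantities such as per-layer learning rates. Below is a rough, illustrative sketch of that learning-rate rescaling only. It is not CrystalCoder's training code; the widths and base learning rate are made-up values, and the sketch omits muP's initialization, output-multiplier, and attention-scaling rules.

```python
import torch
import torch.nn as nn

# Width multiplier of the target model relative to a small proxy model
# on which hyperparameters were tuned (both widths are assumptions).
base_width, width = 256, 4096
width_mult = width / base_width

model = nn.Sequential(
    nn.Embedding(32032, width),   # input embedding: learning rate stays at the base value
    nn.Linear(width, width),      # hidden layer: learning rate scaled down by 1/width_mult
    nn.Linear(width, 32032),      # output head: learning rate scaled down by 1/width_mult
)

base_lr = 1e-3  # tuned on the proxy model (made-up value here)
optimizer = torch.optim.AdamW([
    {"params": model[0].parameters(), "lr": base_lr},
    {"params": model[1].parameters(), "lr": base_lr / width_mult},
    {"params": model[2].parameters(), "lr": base_lr / width_mult},
])
```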

# 📣 Tokenization

Our tokenizer is based on the LLaMA tokenizer, with 22 additional special tokens for the following usage:

- 4 filling-in-middle (FIM) tokens such as `<|fim_prefix|>` to support FIM inference.

Therefore, we extended the LLaMA tokenizer vocabulary size from `32000` to `32032`. Some token ids are reserved and not used.
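
As a quick illustration, the extended vocabulary and the FIM tokens can be inspected directly from the tokenizer. This is a hedged sketch: the `LLM360/CrystalCoder` model id and the `trust_remote_code` flag are assumptions matching the usage section below, and only `<|fim_prefix|>` is named in this card, so the other FIM token names and the prompt layout follow common FIM conventions rather than anything stated here.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("LLM360/CrystalCoder", trust_remote_code=True)

# The extended vocabulary should report 32032 entries (32000 LLaMA tokens + 32 added ids).
print(len(tok))

# Special tokens map to a single id instead of being split into pieces.
print(tok.convert_tokens_to_ids("<|fim_prefix|>"))

# A typical fill-in-the-middle prompt layout; <|fim_suffix|> and <|fim_middle|> are assumed names.
prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"
fim_prompt = "<|fim_prefix|>" + prefix + "<|fim_suffix|>" + suffix + "<|fim_middle|>"
print(tok.tokenize(fim_prompt)[:8])
```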

# 📣 Training

Our training has 3 stages:

- Stage 1: Pretraining on the first half of SlimPajama (50% x 690B = 345B tokens).

For more details of training, please refer to [our paper](https://arxiv.org/pdf/2312.06550.pdf).

# 📣 Dataset

Our tokenized datasets for all phases are available at [CrystalCoderDatasets](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets).
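
To pull the data programmatically, a minimal sketch with the `datasets` library is shown below. Whether `load_dataset` works out of the box depends on how the repository is laid out; the `streaming` flag, the `train` split name, and the absence of a config or `data_dir` argument are assumptions, so you may need to point it at a specific subdirectory or file pattern.

```python
from datasets import load_dataset

# Stream the tokenized data rather than downloading the full dataset up front.
ds = load_dataset("LLM360/CrystalCoderDatasets", streaming=True, split="train")

# Peek at a few records to see the available fields.
for i, example in enumerate(ds):
    print(example.keys())
    if i >= 2:
        break
```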

# 📣 Model Usage

To load a specific checkpoint, use the `revision` argument as shown below, for example `CrystalCoder_phase1_checkpoint_055500`. All revisions can be seen in the branch dropdown of the "Files and versions" tab. If no `revision` argument is provided, the phase 3 final checkpoint `CrystalCoder_phase3_checkpoint_027728` is loaded by default.
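
A minimal sketch of loading a pinned revision and generating from it with the Hugging Face `transformers` API is given below; the prompt, the generation settings, and the `trust_remote_code` flag are illustrative assumptions rather than necessarily the card's exact snippet.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

revision = "CrystalCoder_phase1_checkpoint_055500"  # any branch from "Files and versions"

tokenizer = AutoTokenizer.from_pretrained(
    "LLM360/CrystalCoder", revision=revision, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "LLM360/CrystalCoder", revision=revision, trust_remote_code=True
)

# Illustrative prompt; the sampling temperature mirrors the HumanEval setting in the Notes above.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
gen_tokens = model.generate(
    **inputs, max_new_tokens=64, do_sample=True, temperature=0.2
)
```

The decoded output can then be printed as in the closing lines of the usage snippet: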

```python
print("-"*20 + "Output for model" + 20 * '-')
print(tokenizer.batch_decode(gen_tokens)[0])
```

## 📣 Completion Example

### prompt:

```
...
<unk> import torch
import numpy as np
```

# 📣 Training Logs and Evaluation Results

Please refer to our [W&B project page](https://wandb.ai/llm360/CrystalCoder) for complete training logs and evaluation results.
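
The same project can also be queried programmatically with the `wandb` client. This is a hedged sketch assuming the standard public API and an authenticated `wandb` login; the run names and logged metric keys are not listed in this card.

```python
import wandb

# Query the public llm360/CrystalCoder project and list its runs.
api = wandb.Api()
for run in api.runs("llm360/CrystalCoder"):
    print(run.name, run.state)
```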

Selected Metrics are displayed below.

|<img src="cc-mmlu-1.png" alt="mmlu" width="400"/> | <img src="cc-truthful-1.png" alt="truthfulqa" width="400"/> |

# 📣 CrystalCoder-Instruct

We also have instruction-tuned versions of CrystalCoder, based on the stage 2 and stage 3 final checkpoints. The Instruct version will be released later.

# 📣 Citation

**BibTeX:**