Update README.md

Commit message: add some bullets emoji to model card

README.md (CHANGED): the updated sections are shown below.

**Notes**

- We compute all evaluation metrics ourselves.
- Language benchmarks are computed following the convention of [the Huggingface Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard): AI2 Reasoning Challenge in 25-shot, HellaSwag in 10-shot, MMLU in 5-shot, and TruthfulQA in 0-shot.
- As reported in prior work, the choice of sampling temperature strongly affects the programming metrics, so we evaluate all models with the following temperatures:
  - Scores for HumanEval are computed with a temperature of 0.2.
  - Scores for MBPP are computed with a temperature of 0.1.
- For a detailed token breakdown of the CrystalCoder dataset, refer to the [CrystalCoder dataset repository](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets).

## About LLM360

LLM360 is an initiative for comprehensive and fully open-sourced LLMs,

Get access now at [LLM360 site](https://www.llm360.ai/)

## 📣 Model Description

- **Model type:** Language model with the same architecture as LLaMA-7B
- **Language(s) (NLP):** English

- [Metrics](https://github.com/LLM360/Analysis360)
- [Fully processed CrystalCoder pretraining data](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets)

# 📣 Model Architecture

CrystalCoder leverages a GPT-like architecture, akin to LLaMA, but with the addition of maximal update parameterization (**muP**); a simplified sketch of the muP idea follows the list below.

For other architecture choices:

- Training sequence length is `2048`.
- Embedding (vocabulary) size is `32032`.
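
For readers unfamiliar with muP: hyperparameters are tuned on a narrow proxy model and then transferred to the full-width model by rescaling quantities such as per-layer learning rates. Below is a rough, illustrative sketch of that learning-rate rescaling only. It is not CrystalCoder's training code; the widths and base learning rate are made-up values, and the sketch omits muP's initialization, output-multiplier, and attention-scaling rules.

```python
import torch
import torch.nn as nn

# Width multiplier of the target model relative to a small proxy model
# on which hyperparameters were tuned (both widths are assumptions).
base_width, width = 256, 4096
width_mult = width / base_width

model = nn.Sequential(
    nn.Embedding(32032, width),   # input embedding: learning rate stays at the base value
    nn.Linear(width, width),      # hidden layer: learning rate scaled down by 1/width_mult
    nn.Linear(width, 32032),      # output head: learning rate scaled down by 1/width_mult
)

base_lr = 1e-3  # tuned on the proxy model (made-up value here)
optimizer = torch.optim.AdamW([
    {"params": model[0].parameters(), "lr": base_lr},
    {"params": model[1].parameters(), "lr": base_lr / width_mult},
    {"params": model[2].parameters(), "lr": base_lr / width_mult},
])
```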

# 📣 Tokenization

Our tokenizer is based on the LLaMA tokenizer, with 22 additional special tokens for the following usage:

- 4 filling-in-middle (FIM) tokens such as `<|fim_prefix|>` to support FIM inference.

Therefore, we extended the LLaMA tokenizer vocabulary size from `32000` to `32032`. Some token ids are reserved and not used.
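
As a quick illustration, the extended vocabulary and the FIM tokens can be inspected directly from the tokenizer. This is a hedged sketch: the `LLM360/CrystalCoder` model id and the `trust_remote_code` flag are assumptions matching the usage section below, and only `<|fim_prefix|>` is named in this card, so the other FIM token names and the prompt layout follow common FIM conventions rather than anything stated here.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("LLM360/CrystalCoder", trust_remote_code=True)

# The extended vocabulary should report 32032 entries (32000 LLaMA tokens + 32 added ids).
print(len(tok))

# Special tokens map to a single id instead of being split into pieces.
print(tok.convert_tokens_to_ids("<|fim_prefix|>"))

# A typical fill-in-the-middle prompt layout; <|fim_suffix|> and <|fim_middle|> are assumed names.
prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"
fim_prompt = "<|fim_prefix|>" + prefix + "<|fim_suffix|>" + suffix + "<|fim_middle|>"
print(tok.tokenize(fim_prompt)[:8])
```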

# 📣 Training

Our training has 3 stages:

- Stage 1: Pretraining on the first half of SlimPajama (50% x 690B = 345B tokens).

For more details of training, please refer to [our paper](https://arxiv.org/pdf/2312.06550.pdf).

# 📣 Dataset

Our tokenized datasets for all phases are available at [CrystalCoderDatasets](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets).
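
To pull the data programmatically, a minimal sketch with the `datasets` library is shown below. Whether `load_dataset` works out of the box depends on how the repository is laid out; the `streaming` flag, the `train` split name, and the absence of a config or `data_dir` argument are assumptions, so you may need to point it at a specific subdirectory or file pattern.

```python
from datasets import load_dataset

# Stream the tokenized data rather than downloading the full dataset up front.
ds = load_dataset("LLM360/CrystalCoderDatasets", streaming=True, split="train")

# Peek at a few records to see the available fields.
for i, example in enumerate(ds):
    print(example.keys())
    if i >= 2:
        break
```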

# 📣 Model Usage

To load a specific checkpoint, use the `revision` argument as shown below, for example `CrystalCoder_phase1_checkpoint_055500`. All revisions can be seen in the branch dropdown of the "Files and versions" tab. If no `revision` argument is provided, the phase 3 final checkpoint `CrystalCoder_phase3_checkpoint_027728` is loaded by default.
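
A minimal sketch of loading a pinned revision and generating from it with the Hugging Face `transformers` API is given below; the prompt, the generation settings, and the `trust_remote_code` flag are illustrative assumptions rather than necessarily the card's exact snippet.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

revision = "CrystalCoder_phase1_checkpoint_055500"  # any branch from "Files and versions"

tokenizer = AutoTokenizer.from_pretrained(
    "LLM360/CrystalCoder", revision=revision, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "LLM360/CrystalCoder", revision=revision, trust_remote_code=True
)

# Illustrative prompt; the sampling temperature mirrors the HumanEval setting in the Notes above.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
gen_tokens = model.generate(
    **inputs, max_new_tokens=64, do_sample=True, temperature=0.2
)
```

The decoded output can then be printed as in the closing lines of the usage snippet: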

```python
print("-"*20 + "Output for model" + 20 * '-')
print(tokenizer.batch_decode(gen_tokens)[0])
```

## 📣 Completion Example

### prompt:

```
...
<unk> import torch
import numpy as np
```

# 📣 Training Logs and Evaluation Results

Please refer to our [W&B project page](https://wandb.ai/llm360/CrystalCoder) for complete training logs and evaluation results.
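
The same project can also be queried programmatically with the `wandb` client. This is a hedged sketch assuming the standard public API and an authenticated `wandb` login; the run names and logged metric keys are not listed in this card.

```python
import wandb

# Query the public llm360/CrystalCoder project and list its runs.
api = wandb.Api()
for run in api.runs("llm360/CrystalCoder"):
    print(run.name, run.state)
```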

Selected Metrics are displayed below.

|<img src="cc-mmlu-1.png" alt="mmlu" width="400"/> | <img src="cc-truthful-1.png" alt="truthfulqa" width="400"/> |

# 📣 CrystalCoder-Instruct

We also have instruction-tuned versions of CrystalCoder, based on the stage 2 and stage 3 final checkpoints. The Instruct version will be released later.

# 📣 Citation

**BibTeX:**