Update README.md
README.md
@@ -45,20 +45,22 @@ The summary of the instruction tuning data is as follows:
 ## CrystalChat DataMix
 | Subset | Tokens (Billion) |
 | ----------- | ----------- |
-| OASST1-guanaco | 4.46 |
-| SlimOrca | 225.63 |
-| ShareGPT | 112.91 |
-| Evol-ShareGPT | 85.95 |
-| ChatLogs | 29.34 |
-| CodeAlpaca | 2.62 |
-| Rosetta Code | 7.99 |
-| Evol-CodeAlpaca 1 | 73.80 |
-| Evol-CodeAlpaca 2 | 34.91 |
-| HTML Instruction
-| General Textbooks | 85.59 |
-| Programming Books | 395.63 |
+| [OASST1-guanaco](https://huggingface.co/datasets/openaccess-ai-collective/oasst1-guanaco-extended-sharegpt) | 4.46 |
+| [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) | 225.63 |
+| [ShareGPT](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered) | 112.91 |
+| [Evol-ShareGPT](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) | 85.95 |
+| [ChatLogs](https://huggingface.co/datasets/winglian/chatlogs-en-cleaned) | 29.34 |
+| [CodeAlpaca](https://huggingface.co/datasets/lucasmccabe-lmi/CodeAlpaca-20k) | 2.62 |
+| [Rosetta Code](https://github.com/sahil280114/codealpaca/blob/master/data/rosetta_alpaca.json) | 7.99 |
+| [Evol-CodeAlpaca 1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 73.80 |
+| [Evol-CodeAlpaca 2](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 34.91 |
+| HTML Instruction | 43.67 |
+| [General Textbooks](https://huggingface.co/datasets/open-phi/textbooks) | 85.59 |
+| [Programming Books](https://huggingface.co/datasets/open-phi/programming_books_llama) | 395.63 |
 | Total | 1102.52 |
 
+The HTML Instruction dataset was curated by LLM360 and will be made available shortly.
+
 # Instruction Format
 
 We've added some new special tokens to the CrystalCoder tokenizer to support the instruction tuning.