Commit 12bed0f · Update README.md
Parent(s): b200492

README.md (changed):
It has not been instruction-tuned and cannot directly be expected to function as
## Evaluation

### Performance on Danish
The following plots show the model size on the x-axis and an aggregate performance score for Danish on the y-axis. Each metric is normalized across all evaluated models using min-max normalization to the range [0, 1], and the final score is the average of all normalized metrics.
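As a minimal illustrative sketch (not EuroEval's or this repository's actual code), the min-max normalization and averaging described above could be implemented like this; the model names and scores below are placeholders:

```python
def min_max_normalize(values):
    """Scale one metric's scores to [0, 1] across all evaluated models."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all models tie on this metric
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def aggregate_scores(metric_table):
    """metric_table maps metric name -> list of per-model scores.
    Returns one aggregate score per model: the mean of its
    normalized scores over all metrics."""
    normalized = [min_max_normalize(scores) for scores in metric_table.values()]
    n_models = len(next(iter(metric_table.values())))
    return [sum(col[i] for col in normalized) / len(normalized)
            for i in range(n_models)]

# Hypothetical scores for three models on two metrics
table = {"scala-da": [10.0, 30.0, 50.0], "dala": [0.0, 20.0, 40.0]}
print(aggregate_scores(table))  # → [0.0, 0.5, 1.0]
```

The best model on every metric ends up at 1.0 and the worst at 0.0, so the aggregate score is only comparable within the set of models evaluated together.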
<img src="./images/performance_plot_da.png" width="600"/>
Munin-7B-Open-pt was evaluated using the [EuroEval](https://euroeval.com/) framework, which includes benchmarks across seven task types covering more than 15 European languages.
Below we report results for Danish (English results follow below) across all EuroEval-supported tasks: sentiment classification, named entity recognition, linguistic acceptability, reading comprehension, summarization, and knowledge and common-sense reasoning. In addition, we evaluate the model on DaLA, a Danish linguistic acceptability dataset focusing on common real-world errors.
We compare Munin-7B-Open-pt at various training stages with its base model [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) and two models from the Pleias family ([Pleias-350M-Preview](https://huggingface.co/PleIAs/Pleias-350m-Preview) and [Pleias-1.2B-Preview](https://huggingface.co/PleIAs/Pleias-1.2b-Preview)). All comparison models were trained exclusively on open data, either in the public domain or under a permissive license.
The following tables show the performance on each dataset. For each, we report the respective main metric from EuroEval and its confidence interval.
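EuroEval defines its own procedure for the confidence intervals in the tables below; as a hedged sketch only, a common way to report a "mean ± half-width" over repeated evaluation runs is the normal approximation, where the per-run scores here are invented:

```python
import math
import statistics

def mean_with_ci(run_scores, z=1.96):
    """Mean score with a ~95% normal-approximation confidence half-width:
    z * sample standard deviation / sqrt(number of runs)."""
    mean = statistics.mean(run_scores)
    half = z * statistics.stdev(run_scores) / math.sqrt(len(run_scores))
    return mean, half

scores = [61.2, 60.8, 62.5, 61.9, 61.0]  # hypothetical per-run scores
m, h = mean_with_ci(scores)
print(f"{m:.1f} ± {h:.1f}")  # prints "61.5 ± 0.6"
```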
| Model | scala-da (MCC) | dala (MCC) | angry-tweets (MCC) | dansk (Micro F1, No Misc) | danske-talemaader (MCC) | danish-citizen-tests (MCC) | multi-wiki-qa-da (F1) | hellaswag-da (MCC) | nordjylland-news (BERTScore) |
| ------------------------ | -------------- | ------------ | ------------------ | ------------------------ | ----------------------- | -------------------------- | --------------------- | ------------------ | ---------------------------- |
| Pleias-350m-Preview | -1.0 ± 1.5 | -1.8 ± 1.8 | 10.6 ± 2.9 | 12.9 ± 1.8 | 0.7 ± 2.6 | 4.6 ± 2.3 | 11.6 ± 0.9 | -0.3 ± 0.7 | 56.3 ± 1.5 |
| Pleias-1.2b-Preview | 0.2 ± 1.1 | 0.7 ± 1.0 | 27.7 ± 2.9 | 27.3 ± 2.2 | -0.6 ± 1.9 | 8.6 ± 3.2 | 35.2 ± 1.3 | -0.0 ± 1.5 | 60.3 ± 0.9 |
### Performance on English
<img src="./images/performance_plot_en.png" width="600"/>
The goal of this section is to show how English performance deteriorates when the model is adapted for Danish. We generally observe performance degradation across tasks, with the exception of `squad`.
| Model | scala-en (MCC) | sst5 (MCC) | conll-en (Micro F1 no misc) | life-in-the-uk (MCC) | squad (F1) | hellaswag (MCC) | cnn-dailymail (BERTScore) |
| ------------------------ | -------------- | ------------ | --------------------------- | -------------------- | ------------ | --------------- | ------------------------- |
| base (comma-v0.1-2t) | **29.7** ± 1.9 | **61.8** ± 2.1 | **57.5** ± 2.8 | 41.6 ± 2.4 | **90.4** ± 0.4 | **16.8** ± 0.6 | **63.3** ± 0.9 |
## Training details

Munin-7B-Open-pt is continually pre-trained from [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) on 30B tokens, using a mix of [Danish Dynaword](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword) and the [Comma v0.1 dataset](https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset), both comprising only public-domain and openly licensed data.
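The README does not state the mixing ratio between the two corpora or how they are interleaved; purely as an illustrative sketch (the 50/50 ratio and stream names are placeholders, not the actual recipe), a probabilistic interleave of two document streams could look like:

```python
import random

def mix_streams(corpus_a, corpus_b, ratio_a=0.5, seed=0):
    """Yield documents, drawing from corpus_a with probability ratio_a.
    Stops when the chosen stream is exhausted. ratio_a=0.5 is a
    placeholder -- the actual Dynaword/Comma mix is not stated above."""
    rng = random.Random(seed)
    a, b = iter(corpus_a), iter(corpus_b)
    while True:
        src = a if rng.random() < ratio_a else b
        try:
            yield next(src)
        except StopIteration:
            return

# Hypothetical document streams standing in for the two datasets
mixed = list(mix_streams([f"da-{i}" for i in range(100)],
                         [f"en-{i}" for i in range(100)]))
```

In practice one would stream the actual datasets and tune the ratio to hit the desired token budget per source.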