Text Generation · Safetensors · Danish · English · llama
KennethEnevoldsen committed
Commit 12bed0f · 1 Parent(s): b200492

Update README.md

Files changed (1): README.md (+18 -7)
README.md CHANGED
@@ -28,21 +28,23 @@ It has not been instruction-tuned and cannot directly be expected to function as
 
 ## Evaluation
 
-The following plots show, for Danish and English respectively, model size on the x-axis and an aggregate performance score on the y-axis. Each metric is normalized across all evaluated models using min-max normalization to the range [0, 1], and the final score represents the average of all normalized metrics.
+### Performance on Danish
 
-| Danish | English |
-|:--------------------------:|:--------------------------:|
-| <img src="./images/performance_plot_da.png" width="600"/> | <img src="./images/performance_plot_en.png" width="600"/> |
+The following plot shows model size on the x-axis and an aggregate performance score for Danish on the y-axis. Each metric is normalized across all evaluated models using min-max normalization to the range [0, 1], and the final score is the average of all normalized metrics.
+
+<img src="./images/performance_plot_da.png" width="600"/>
 
 Munin-7B-Open-pt was evaluated using the [EuroEval](https://euroeval.com/) framework, which includes benchmarks across seven task types covering more than 15 European languages.
 
-We report results in both Danish and English for all EuroEval-supported tasks: sentiment classification, named entity recognition, linguistic acceptability, reading comprehension, summarization, and knowledge and common-sense reasoning. In addition, we evaluate the model on DaLA, a Danish linguistic acceptability dataset focusing on real-world common errors.
+Below we report results for Danish (see the English section below) for all EuroEval-supported tasks: sentiment classification, named entity recognition, linguistic acceptability, reading comprehension, summarization, and knowledge and common-sense reasoning. In addition, we evaluate the model on DaLA, a Danish linguistic acceptability dataset focusing on common real-world errors.
 
-We compare Munin-7B-Open-pt at various training stages with its base model [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) and two models from the Pleias family ([Pleias-350M-Preview](https://huggingface.co/PleIAs/Pleias-350m-Preview) and [Pleias-1.2B-Preview](https://huggingface.co/PleIAs/Pleias-1.2b-Preview)). All comparison models were trained exclusively on open data, either in the public domain or under a permissive license.
+We compare Munin-7B-Open-pt at various training stages with its base model [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t)
+and two models from the Pleias family ([Pleias-350M-Preview](https://huggingface.co/PleIAs/Pleias-350m-Preview) and [Pleias-1.2B-Preview](https://huggingface.co/PleIAs/Pleias-1.2b-Preview)).
+All comparison models were trained exclusively on open data, either in the public domain or under a permissive license.
 
-The following tables show, for Danish and English respectively, the performance on each dataset. For each, we report the respective main metric from EuroEval and the confidence interval.
+The following tables show the performance on each dataset.
+For each dataset, we report its main metric from EuroEval together with a confidence interval.
 
 | Model | scala-da (MCC) | dala (MCC) | angry-tweets (MCC) | dansk (Micro F1, No Misc) | danske-talemaader (MCC) | danish-citizen-tests (MCC) | multi-wiki-qa-da (F1) | hellaswag-da (MCC) | nordjylland-news (BERTScore) |
 | ------------------------ | -------------- | ------------ | ------------------ | ------------------------- | ----------------------- | -------------------------- | --------------------- | ------------------ | ---------------------------- |
@@ -55,6 +57,14 @@ The following tables show, for Danish and English respectively, the performance
 | Pleias-350m-Preview | -1.0 ± 1.5 | -1.8 ± 1.8 | 10.6 ± 2.9 | 12.9 ± 1.8 | 0.7 ± 2.6 | 4.6 ± 2.3 | 11.6 ± 0.9 | -0.3 ± 0.7 | 56.3 ± 1.5 |
 | Pleias-1.2b-Preview | 0.2 ± 1.1 | 0.7 ± 1.0 | 27.7 ± 2.9 | 27.3 ± 2.2 | -0.6 ± 1.9 | 8.6 ± 3.2 | 35.2 ± 1.3 | -0.0 ± 1.5 | 60.3 ± 0.9 |
 
+### Performance on English
+
+<img src="./images/performance_plot_en.png" width="600"/>
+
+This section shows how English performance deteriorates when the model is adapted to Danish. We generally observe performance degradation
+across tasks, with the exception of `squad`.
+
+
 | Model | scala-en (MCC) | sst5 (MCC) | conll-en (Micro F1, No Misc) | life-in-the-uk (MCC) | squad (F1) | hellaswag (MCC) | cnn-dailymail (BERTScore) |
 | ------------------------ | -------------- | ------------ | ---------------------------- | -------------------- | ------------ | --------------- | ------------------------- |
 | base (comma-v0.1-2t) | **29.7** ± 1.9 | **61.8** ± 2.1 | **57.5** ± 2.8 | 41.6 ± 2.4 | **90.4** ± 0.4 | **16.8** ± 0.6 | **63.3** ± 0.9 |
@@ -68,6 +78,7 @@ The following tables show, for Danish and English respectively, the performance
 
 
 
+
 ## Training details
 
 Munin-7B-open-pt is continually pre-trained from [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) using 30B tokens, utilizing a mix of [Danish Dynaword](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword) and the [Comma v0.1 dataset](https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset), both comprising only public domain and openly licensed data.
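The aggregate score described in the Evaluation section (per-metric min-max normalization to [0, 1], then averaging) can be sketched as below; the function name and example values are illustrative, not part of EuroEval.

```python
def aggregate_score(scores_by_model: dict[str, list[float]]) -> dict[str, float]:
    """Min-max normalize each metric across models to [0, 1], then average.

    `scores_by_model` maps a model name to its raw metric values, one per
    benchmark, in a fixed order shared by all models.
    """
    models = list(scores_by_model)
    n_metrics = len(next(iter(scores_by_model.values())))
    normalized: dict[str, list[float]] = {m: [] for m in models}
    for i in range(n_metrics):
        column = [scores_by_model[m][i] for m in models]
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1.0  # guard against all models scoring the same
        for m in models:
            normalized[m].append((scores_by_model[m][i] - lo) / span)
    # Final score: mean of the normalized metrics.
    return {m: sum(vals) / n_metrics for m, vals in normalized.items()}
```

With this scheme, a model that is best on every metric scores 1.0 and one that is worst on every metric scores 0.0, which is what makes the aggregate comparable across metrics with different scales (MCC vs. F1 vs. BERTScore).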
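Scores in the tables above are reported as `metric ± CI`. As a rough reading aid (a heuristic of ours, not something EuroEval prescribes), two scores whose intervals overlap may not differ meaningfully:

```python
def intervals_overlap(score_a: float, ci_a: float,
                      score_b: float, ci_b: float) -> bool:
    """Return True if [score_a ± ci_a] and [score_b ± ci_b] overlap.

    A coarse heuristic for eyeballing reported scores, not a formal
    significance test.
    """
    return abs(score_a - score_b) <= ci_a + ci_b

# Pleias-350m vs Pleias-1.2b on scala-da: -1.0 ± 1.5 vs 0.2 ± 1.1
# -> the intervals overlap, so that gap may not be meaningful.
```

For example, the angry-tweets gap between the two Pleias models (10.6 ± 2.9 vs 27.7 ± 2.9) is far outside the combined intervals, while their scala-da scores are statistically indistinguishable.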