Text Generation · Safetensors · Danish · English · llama
KennethEnevoldsen committed
Commit 12bed0f · 1 Parent(s): b200492

Update README.md

Files changed (1): README.md (+18 -7)
README.md CHANGED
@@ -28,21 +28,23 @@ It has not been instruction-tuned and cannot directly be expected to function as
 
 ## Evaluation
 
-The following plots show, for Danish and English respectively, model size on the x-axis and an aggregate performance score on the y-axis. Each metric is normalized across all evaluated models using min-max normalization to the range [0, 1], and the final score represents the average of all normalized metrics.
+### Performance on Danish
 
-| Danish | English |
-|:--------------------------:|:--------------------------:|
-| <img src="./images/performance_plot_da.png" width="600"/> | <img src="./images/performance_plot_en.png" width="600"/> |
+The following plot shows model size on the x-axis and an aggregate performance score for Danish on the y-axis. Each metric is normalized across all evaluated models using min-max normalization to the range [0, 1], and the final score is the average of all normalized metrics.
+
+<img src="./images/performance_plot_da.png" width="600"/>
 
 Munin-7B-Open-pt was evaluated using the [EuroEval](https://euroeval.com/) framework, which includes benchmarks across seven task types covering more than 15 European languages.
 
-We report results in both Danish and English for all EuroEval-supported tasks: sentiment classification, named entity recognition, linguistic acceptability, reading comprehension, summarization, and knowledge and common-sense reasoning. In addition, we evaluate the model on DaLA, a Danish linguistic acceptability dataset focusing on real-world common errors.
+Below we report results for Danish (see the English section below) for all EuroEval-supported tasks: sentiment classification, named entity recognition, linguistic acceptability, reading comprehension, summarization, and knowledge and common-sense reasoning. In addition, we evaluate the model on DaLA, a Danish linguistic acceptability dataset focusing on common real-world errors.
 
-We compare Munin-7B-Open-pt at various training stages with its base model [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) and two models from the Pleias family ([Pleias-350M-Preview](https://huggingface.co/PleIAs/Pleias-350m-Preview) and [Pleias-1.2B-Preview](https://huggingface.co/PleIAs/Pleias-1.2b-Preview)). All comparison models were trained exclusively on open data, either in the public domain or under a permissive license.
+We compare Munin-7B-Open-pt at various training stages with its base model [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t)
+and two models from the Pleias family ([Pleias-350M-Preview](https://huggingface.co/PleIAs/Pleias-350m-Preview) and [Pleias-1.2B-Preview](https://huggingface.co/PleIAs/Pleias-1.2b-Preview)).
+All comparison models were trained exclusively on open data, either in the public domain or under a permissive license.
 
-The following tables show, for Danish and English respectively, the performance on each dataset. For each, we report the respective main metric from EuroEval and the confidence interval.
+The following tables show the performance on each dataset.
+For each dataset, we report its main metric from EuroEval together with a confidence interval.
 
 | Model | scala-da (MCC) | dala (MCC) | angry-tweets (MCC) | dansk (Micro F1, No Misc) | danske-talemaader (MCC) | danish-citizen-tests (MCC) | multi-wiki-qa-da (F1) | hellaswag-da (MCC) | nordjylland-news (BERTScore) |
 | ------------------------ | -------------- | ------------ | ------------------ | ------------------------- | ----------------------- | -------------------------- | --------------------- | ------------------ | ---------------------------- |
@@ -55,6 +57,14 @@ The following tables show, for Danish and English respectively, the performance
 | Pleias-350m-Preview | -1.0 ± 1.5 | -1.8 ± 1.8 | 10.6 ± 2.9 | 12.9 ± 1.8 | 0.7 ± 2.6 | 4.6 ± 2.3 | 11.6 ± 0.9 | -0.3 ± 0.7 | 56.3 ± 1.5 |
 | Pleias-1.2b-Preview | 0.2 ± 1.1 | 0.7 ± 1.0 | 27.7 ± 2.9 | 27.3 ± 2.2 | -0.6 ± 1.9 | 8.6 ± 3.2 | 35.2 ± 1.3 | -0.0 ± 1.5 | 60.3 ± 0.9 |
 
+### Performance on English
+
+<img src="./images/performance_plot_en.png" width="600"/>
+
+This section shows how English performance deteriorates when the model is adapted to Danish. We generally observe performance degradation
+across tasks, with the exception of `squad`.
+
+
 | Model | scala-en (MCC) | sst5 (MCC) | conll-en (Micro F1, No Misc) | life-in-the-uk (MCC) | squad (F1) | hellaswag (MCC) | cnn-dailymail (BERTScore) |
 | ------------------------ | -------------- | ------------ | ---------------------------- | -------------------- | ------------ | --------------- | ------------------------- |
 | base (comma-v0.1-2t) | **29.7** ± 1.9 | **61.8** ± 2.1 | **57.5** ± 2.8 | 41.6 ± 2.4 | **90.4** ± 0.4 | **16.8** ± 0.6 | **63.3** ± 0.9 |
@@ -68,6 +78,7 @@ The following tables show, for Danish and English respectively, the performance
 
 
 
+
 ## Training details
 
 Munin-7B-open-pt is continually pre-trained from [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) using 30B tokens, utilizing a mix of [Danish Dynaword](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword) and the [Comma v0.1 dataset](https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset), both comprising only public domain and openly licensed data.
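The aggregate score described in the Evaluation section (per-metric min-max normalization to [0, 1], then averaging) can be sketched as below; the function name and example values are illustrative, not part of EuroEval.

```python
def aggregate_score(scores_by_model: dict[str, list[float]]) -> dict[str, float]:
    """Min-max normalize each metric across models to [0, 1], then average.

    `scores_by_model` maps a model name to its raw metric values, one per
    benchmark, in a fixed order shared by all models.
    """
    models = list(scores_by_model)
    n_metrics = len(next(iter(scores_by_model.values())))
    normalized: dict[str, list[float]] = {m: [] for m in models}
    for i in range(n_metrics):
        column = [scores_by_model[m][i] for m in models]
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1.0  # guard against all models scoring the same
        for m in models:
            normalized[m].append((scores_by_model[m][i] - lo) / span)
    # Final score: mean of the normalized metrics.
    return {m: sum(vals) / n_metrics for m, vals in normalized.items()}
```

With this scheme, a model that is best on every metric scores 1.0 and one that is worst on every metric scores 0.0, which is what makes the aggregate comparable across metrics with different scales (MCC vs. F1 vs. BERTScore).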
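Scores in the tables above are reported as `metric ± CI`. As a rough reading aid (a heuristic of ours, not something EuroEval prescribes), two scores whose intervals overlap may not differ meaningfully:

```python
def intervals_overlap(score_a: float, ci_a: float,
                      score_b: float, ci_b: float) -> bool:
    """Return True if [score_a ± ci_a] and [score_b ± ci_b] overlap.

    A coarse heuristic for eyeballing reported scores, not a formal
    significance test.
    """
    return abs(score_a - score_b) <= ci_a + ci_b

# Pleias-350m vs Pleias-1.2b on scala-da: -1.0 ± 1.5 vs 0.2 ± 1.1
# -> the intervals overlap, so that gap may not be meaningful.
```

For example, the angry-tweets gap between the two Pleias models (10.6 ± 2.9 vs 27.7 ± 2.9) is far outside the combined intervals, while their scala-da scores are statistically indistinguishable.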