Mayank022 committed e51ab67 · verified · 1 Parent(s): 87ccba6

Update README.md

Files changed (1): README.md (+97 -3)
---
license: apache-2.0
---

**A Transformer-based Language Model Trained on the Harry Potter Corpus for Experimental Research in Training Dynamics and Architecture.**

Authors: Srikiran Bandhakavi, Mayank Pratap Singh

---

* [Research Paper (PDF)](https://huggingface.co/Mayank022/SLM_harry_potter)
* [Colab Notebook (Training Code)](https://colab.research.google.com/drive/1aZxQO9Hgt5d5EVQ0yi5USyqg0Kq-6pRi?usp=sharing)

## Model Summary

This repository contains multiple checkpoints of a GPT-style transformer model trained from scratch on a curated Harry Potter text dataset sourced from Project Gutenberg. The purpose of this repository is to support and supplement the findings of the accompanying research paper:

**“Training a Transformer-based LLM from Scratch on Project Gutenberg Corpus: An Experimental Study”**

The uploaded checkpoints represent the different configurations used in experiments studying the effects of training duration, learning rate, model depth, attention heads, and ablation of architectural components. The objective was to observe how each of these choices influences convergence, generalization, and generation quality in small-scale LLM training.

---

## Dataset

* **Source**: Project Gutenberg (Harry Potter books)
* **Preprocessing**: Cleaned and tokenized using subword tokenization (see the sketch after this list)
* **Sequence Length**: 128 tokens per input block
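
A minimal sketch of this preprocessing step, i.e. encoding the corpus with a subword tokenizer and slicing it into 128-token blocks. The GPT-2 tokenizer and the file name `harry_potter.txt` are illustrative assumptions; the exact tokenizer and file layout used in the paper may differ.

```python
# Illustrative sketch only: "gpt2" tokenizer and "harry_potter.txt" are assumptions.
import torch
from transformers import AutoTokenizer

BLOCK_SIZE = 128  # sequence length reported above

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed subword tokenizer

with open("harry_potter.txt", encoding="utf-8") as f:
    text = f.read()

# Encode the whole corpus as one long token stream.
ids = tokenizer(text, return_tensors="pt").input_ids[0]

# Slice the stream into 128-token input blocks with next-token targets.
examples = [
    ids[i : i + BLOCK_SIZE + 1]
    for i in range(0, ids.size(0) - BLOCK_SIZE - 1, BLOCK_SIZE)
]
data = torch.stack(examples)      # (num_blocks, 129)
x, y = data[:, :-1], data[:, 1:]  # inputs and targets shifted by one token
```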

---

## Model Architecture

* **Model Type**: GPT-style transformer (decoder-only); a minimal sketch follows this list
* **Embedding Dimension**: 384
* **Feed-Forward Dimension**: 1536
* **Activation**: ReLU
* **Attention Heads**: Varied between 1 and 8
* **Transformer Layers**: Varied between 1 and 12
* **Output Layer**: Linear projection to the vocabulary, followed by softmax
* **Core Components**: LayerNorm, residual connections, feed-forward network
* **Training Optimizer**: AdamW with cosine annealing scheduler
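
To make the listed dimensions concrete, here is a minimal PyTorch sketch of one decoder block and the surrounding model (384-d embeddings, 1536-d ReLU feed-forward, LayerNorm, residual connections, causal self-attention, linear head over the vocabulary). It is a reading aid, not the notebook's actual implementation; the class names, the pre-norm layout, the learned position embeddings, and the default 5-layer / 4-head baseline shape are assumptions.

```python
import torch
import torch.nn as nn

class TinyGPTBlock(nn.Module):
    """One decoder block: causal self-attention + ReLU feed-forward,
    each wrapped in a residual connection with LayerNorm (pre-norm assumed)."""

    def __init__(self, d_model=384, n_heads=4, d_ff=1536):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        T = x.size(1)
        # Boolean causal mask: True above the diagonal blocks attention to future tokens.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out               # residual around attention
        x = x + self.ffn(self.ln2(x))  # residual around feed-forward
        return x

class TinyGPT(nn.Module):
    """Token + position embeddings, a stack of decoder blocks, and a linear
    projection to the vocabulary (softmax is applied inside the loss)."""

    def __init__(self, vocab_size, n_layers=5, n_heads=4, d_model=384, block_size=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)
        self.blocks = nn.ModuleList(
            [TinyGPTBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        h = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            h = block(h)
        return self.head(self.ln_f(h))  # logits over the vocabulary
```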

---

## Training Configuration

* **Optimizer**: AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-9, weight decay = 0.1) with cosine annealing
* **Batch Size**: 32
* **Learning Rates Tested**: 1e-2, 1e-3, 1e-4
* **Training Steps**: 5k, 20k, 50k, and 100k iterations, depending on the experiment
* **Loss Function**: Cross-entropy (causal language modeling); a minimal training-step sketch follows this list
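
A minimal sketch of a training loop under these settings, assuming the hypothetical `TinyGPT`, `tokenizer`, and `(x, y)` tensors from the earlier sketches and one of the tested learning rates (1e-3 for 10k steps). The actual training code lives in the Colab notebook linked above.

```python
import torch
import torch.nn.functional as F

# Assumed to come from the earlier sketches, not from the notebook itself.
model = TinyGPT(vocab_size=tokenizer.vocab_size)

max_iters = 10_000
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.95), eps=1e-9, weight_decay=0.1
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_iters)

for step in range(max_iters):
    # Sample a random batch of 32 blocks (inputs and next-token targets).
    batch = torch.randint(0, x.size(0), (32,))
    xb, yb = x[batch], y[batch]

    logits = model(xb)  # (B, T, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), yb.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```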

---

## Checkpoints and Experiments

| Checkpoint Name | Related Experiment | Notes |
| ------------------------------------------------- | ------------------ | --------------------------- |
| `epoch_5000_lr_1e-4_layer_2_head_2` | Training Dynamics | Early-stage underfitting |
| `epoch_100000_lr_1e-2_layer_2_head_2` | Overfitting Test | High LR, overfit behavior |
| `epoch_10000_lr_1e-3_layer_12_head_2` | Model Depth | Deep 12-layer transformer |
| `epoch_10000_lr_1e-3_layer_1_head_2` | Model Depth | Very shallow transformer |
| `epoch_10000_lr_1e-3_layer_5_head_1` | Attention Heads | Single-head model |
| `epoch_10000_lr_1e-3_layer_5_head_8` | Attention Heads | Wide attention capacity |
| `epoch_10000_lr_1e-3_layer_5_head_4_no_ffn` | Ablation | FFN removed |
| `epoch_10000_lr_1e-3_layer_5_head_4_no_residual` | Ablation | Residual connection removed |
| `epoch_10000_lr_1e-3_layer_5_head_4_no_layernorm` | Ablation | LayerNorm removed |

Each checkpoint file can be downloaded directly from this repository, as sketched below.
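
The `.pt` extension and the exact file name in this sketch are assumptions; match them to the actual file names in the repository's file listing.

```python
import torch
from huggingface_hub import hf_hub_download

# File name assumed to mirror the checkpoint names in the table above;
# the ".pt" suffix is a guess, so check the repo's "Files" tab.
path = hf_hub_download(
    repo_id="Mayank022/SLM_harry_potter",
    filename="epoch_10000_lr_1e-3_layer_5_head_8.pt",
)
state = torch.load(path, map_location="cpu")
```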

---

## Key Results Summary

| Configuration | Final Val Loss | Output Quality |
| -------------------------------- | -------------- | ------------------------------- |
| Baseline (5L, 4H, 50k @ 1e-3) | ~3.0 | Mostly coherent |
| Underfitted (5L, 4H, 5k @ 1e-4) | ~5.5 | Incoherent |
| Overfitted (5L, 4H, 100k @ 1e-2) | ~4.5 | Fluent but repetitive |
| Deep (12L, 4H) | ~3.2 | Fluent, long-range coherence |
| Shallow (1L, 4H) | ~5.2 | Poor coherence, short phrases |
| Single-head (5L, 1H) | ~4.8 | Basic fluency, low consistency |
| Multi-head (5L, 8H) | ~3.1 | Fluent, contextually stable |
| No LayerNorm | ~3.8 | Some fluency, unstable phrasing |
| No Residual | ~5.0 | Failed convergence |
| No Feed-Forward | ~3.5 | Coherent, but repetitive output |

The output-quality labels describe text generated from each checkpoint; a minimal sampling sketch follows.
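
This sketch assumes the hypothetical `TinyGPT` and `tokenizer` objects from the earlier sketches; the prompt, temperature, and decoding strategy used for the paper's qualitative assessment are not specified here and are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, tokenizer, prompt="Harry", max_new_tokens=50, temperature=1.0):
    """Autoregressive temperature sampling from the (hypothetical) TinyGPT sketch."""
    model.eval()
    idx = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(idx[:, -128:])  # crop to the 128-token context window
        probs = F.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return tokenizer.decode(idx[0])
```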

---

## Resources

* [Research Paper (PDF)](https://huggingface.co/Mayank022/SLM_harry_potter)
* [Results Spreadsheet](https://docs.google.com/spreadsheets/d/1rmqVQAtsfjA4YNNSv5VmbMMKMfnqpECf5C9XjRKLKWY/edit?usp=sharing)
* [Colab Notebook (Training Code)](https://colab.research.google.com/drive/1aZxQO9Hgt5d5EVQ0yi5USyqg0Kq-6pRi?usp=sharing)