🤝 Open to Collab

Mohammed Hamdy

mmhamdy

hugging-science

https://surfingmanifolds.substack.com/

AI & ML interests

AI4Sci | NLP | Reinforcement Learning

Recent Activity

replied to their post 1 day ago

Decades before the modern scaling laws, this paper showed that the behavior of neural networks under scale follows remarkably predictable laws. In 1993, researchers at Bell Labs were grappling with a constraint that feels entirely contemporary: datasets were outgrowing the available hardware, and training a model to completion was becoming too expensive. Evaluating an architectural tweak to a state-of-the-art model (LeNet, at the time) on 60,000 samples meant burning up to three weeks of compute time. To save compute, people would train candidate architectures on small subsets of the data, assuming that the top performer at small scale would remain the top performer at full scale. With the benefit of hindsight, we know this assumption does not hold in general. In "Learning Curves: Asymptotic Values and Rate of Convergence" (NeurIPS 93), the authors drew on insights from statistical mechanics to propose a practical and principled method for predicting the performance of classifiers trained on large datasets (at the time, models were assumed to be large enough). The method was based on simple power-law modeling of the expected training and test errors. It is often noted that many of today's breakthroughs in AI and deep learning are actually decades-old concepts that simply lacked the computational power to be tested at the time. While there is some truth to that, it highlights a more valuable lesson: there is immense worth in revisiting early literature and reflecting on foundational ideas we may have prematurely left behind. So, go explore and find your own inspiration. The current trend has enough champions already!
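
As a rough illustration of the power-law modeling mentioned in the post, the sketch below fits learning curves measured on small training subsets and extrapolates the test error to a larger set. The functional forms (test error = a + b / n^alpha, training error = a - c / n^alpha), the shared exponent, the function names, and the synthetic numbers are all assumptions made for this example; they are not taken from the post and should not be read as the paper's exact procedure.

```python
import numpy as np

def fit_learning_curves(sizes, train_err, test_err):
    """Fit assumed power-law learning curves with a shared exponent.

    Assumed forms (illustrative only):
        test_err(n)  = a + b / n**alpha
        train_err(n) = a - c / n**alpha
    """
    sizes = np.asarray(sizes, dtype=float)
    train_err = np.asarray(train_err, dtype=float)
    test_err = np.asarray(test_err, dtype=float)

    # The gap is (b + c) / n**alpha, so log(gap) is linear in log(n).
    gap = test_err - train_err
    slope, intercept = np.polyfit(np.log(sizes), np.log(gap), 1)
    alpha = -slope
    b_plus_c = np.exp(intercept)

    # The sum is 2a + (b - c) / n**alpha, so it is linear in n**(-alpha).
    x = sizes ** (-alpha)
    b_minus_c, two_a = np.polyfit(x, test_err + train_err, 1)

    a = two_a / 2.0
    b = (b_plus_c + b_minus_c) / 2.0
    c = (b_plus_c - b_minus_c) / 2.0
    return a, b, c, alpha

def predict_test_error(size, a, b, alpha):
    """Extrapolate the assumed test-error curve to a new training-set size."""
    return a + b / size ** alpha

# Synthetic, noise-free measurements on small subsets (hypothetical values).
true_a, true_b, true_c, true_alpha = 0.05, 2.0, 1.0, 0.5
subset_sizes = np.array([500.0, 1000.0, 2000.0, 4000.0, 8000.0])
observed_test = true_a + true_b / subset_sizes ** true_alpha
observed_train = true_a - true_c / subset_sizes ** true_alpha

a, b, c, alpha = fit_learning_curves(subset_sizes, observed_train, observed_test)
predicted_full = predict_test_error(60000.0, a, b, alpha)
print(f"asymptotic error a = {a:.4f}, exponent alpha = {alpha:.3f}")
print(f"predicted test error at 60,000 samples = {predicted_full:.4f}")
```

With real, noisy measurements the two linear fits would only approximate the parameters, and the shared-exponent simplification may not hold; the point is only that a handful of cheap small-scale runs can be turned into an estimate of the asymptotic error and the rate at which it is approached.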

replied to their post 1 day ago

It has been more than a decade now since the knowledge distillation paper came out. Knowledge Distillation (KD) is one of my favorite topics, but I have to confess that I'm not a huge fan of the term because I find it confusing (or at least, it has become so over time). The idea behind KD is not novel; it was there almost a decade before the paper came out (and arguably even a decade before that, back to 1990-91). But this paper is the one that clicked, the one that made the topic much more popular and introduced it to a broader audience. First, the timing and the authors played a big role: we have Geoffrey Hinton, Oriol Vinyals, and Jeff Dean here. Second, Geoffrey Hinton is really good at idea branding: Model compression?! No, no, no! Let's call it "Knowledge Distillation" and use evocative terms such as "Dark Knowledge" to describe what is being transferred. It's a great name, but as time has passed, the term has become a bit of a relic. KD is no longer solely about compression (KD used to be introduced as a method for model compression, but now model compression is just one application of KD). The other thing is that the word "distillation" implies some sort of potency, that the student is somehow more powerful than the teacher, which is not the case (though counterarguments could be made; for example, the student may be more powerful than a comparable model trained with no teacher). Nevertheless, the paper is incredibly well-written, short, and fun to read. It's one of the few papers that I have read several times. Check it out, and maybe share your thoughts on the topic with us here! If you had to choose another name for Knowledge Distillation, what would it be?

Organizations

liked a Space 8 months ago

Unlocking On-Policy Distillation for Any Model Family

Explore on-policy distillation visualization for any model

liked a dataset 9 months ago

transferable-samplers/many-peptides-md

Updated Dec 15, 2025 • 4.93k • 10

liked 3 Spaces 9 months ago

Science Release Heatmap

Explore AI4Science contributions by organizations and tags

Maintain the unmaintainable

Explore the complex relationships between 400+ machine learning models

Transformers Timeline

Interactive timeline to explore the 🤗Transformers models

liked a model 11 months ago

rednote-hilab/dots.ocr

Image-Text-to-Text • 3B • Updated Oct 31, 2025 • 301k • 1.32k

liked a dataset about 1 year ago

nvidia/Nemotron-Personas-USA

Viewer • Updated Dec 16, 2025 • 1M • 12.8k • 328

liked 2 models about 1 year ago

PlayHT/PlayDiffusion

Updated Jul 29, 2025 • 111

facebook/KernelLLM

Text Generation • 8B • Updated Jan 15 • 136 • 202

liked a model over 1 year ago

sesame/csm-1b

Text-to-Speech • 2B • Updated Dec 1, 2025 • 307k • 2.4k

liked a Space over 1 year ago

The Distill Template

Craft Beautiful Blogs

liked 2 models over 1 year ago

ElectricAlexis/NotaGen

Updated Feb 26, 2025 • 154

microsoft/wham

Updated Dec 17, 2025 • 134 • 272

liked a Space over 1 year ago

The Ultra-Scale Playbook

The ultimate guide to training LLM on large GPU Clusters

liked a model over 1 year ago

hexgrad/Kokoro-82M

Text-to-Speech • Updated Apr 10, 2025 • 15.4M • 6.41k

liked a dataset over 1 year ago

HuggingFaceH4/MATH-500

Viewer • Updated Dec 15, 2025 • 500 • 129k • 317

liked a model over 1 year ago

answerdotai/ModernBERT-base

Fill-Mask • 0.1B • Updated Jan 15, 2025 • 10.1M • 1.06k

liked a Space over 1 year ago

Scaling test-time compute

Boost LLM answers with flexible test‑time search strategies

liked a model over 1 year ago

CohereLabs/c4ai-command-r7b-12-2024

Text Generation • 8B • Updated Oct 30, 2025 • 27.1k • 425

liked a dataset over 1 year ago

CohereLabs/Global-MMLU

Viewer • Updated Aug 14, 2025 • 602k • 18k • 160