Question of the Day
One question per day to look beyond the headlines.
Which part of an AI model’s “value” does distillation copy: weights, data, or just behavior?
Take-away: Distillation captures a teacher's decision surface (its routing/logit patterns) in a new student's weights, so the "value" transfers via behavior, not by copying weights or the original data.
Distillation copies the behavior of the original model rather than its specific weights or training data [1]. It transfers the routing and decision-making patterns of a larger teacher model into a smaller student model, aiming to preserve the quality of the teacher's outputs [1]. Training data may be synthesized along the way, but the core idea is to replicate the refined behavior patterns embodied in the teacher model [1].
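A minimal sketch of how behavior transfer can look in practice, using the classic soft-label (Hinton-style) distillation loss in PyTorch. The function name, temperature value, and tensor shapes are illustrative assumptions, not taken from the source; the point is that only the teacher's outputs flow into the student's training signal.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: pull the student's output distribution
    toward the teacher's temperature-softened distribution (behavior transfer)."""
    # Softening exposes the teacher's relative preferences across classes,
    # not just its single hard prediction.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence measures how far the student's behavior is from the teacher's;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2

# Usage sketch: only outputs (behavior) cross from teacher to student --
# the teacher's weights and original training data are never copied.
teacher_logits = torch.randn(8, 100)                         # stand-in for a frozen teacher's outputs
student_logits = torch.randn(8, 100, requires_grad=True)     # stand-in for the student's outputs
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()                                              # gradients update the student only
```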