Knowledge Distillation#

Note

We will use the abbreviation KD for knowledge distillation throughout this chapter.

What is KD, and do we need it?#

KD refers to transferring the knowledge of one model into another model. It is mainly used to transfer the knowledge of a big model with billions of parameters into a model that is small enough to deploy on edge devices. The small model is trained to mimic the bigger model by replicating the bigger model's outputs, all while being more efficient.
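A common way to make the student replicate the teacher's outputs is a soft-target loss in the style of Hinton et al.: the KL divergence between the temperature-softened output distributions of the two models. The sketch below is one minimal, hedged example of such a loss in PyTorch; the function name `distillation_loss` and the default temperature are illustrative choices, not a prescribed API.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target KD loss: KL divergence between the softened
    output distributions of the teacher and the student."""
    # Soften both distributions with a temperature > 1 so the student
    # also learns from the teacher's non-argmax probabilities.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
```

In practice this soft-target term is often combined with the ordinary cross-entropy loss on the ground-truth labels, weighted by a mixing coefficient.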

Why don’t we train a model from scratch?#

We could, of course, train a smaller model from scratch. However, bigger models tend to be better at finding good solutions than smaller ones, and training a small model to replicate a bigger one usually performs better than training the small model from scratch, which often gets stuck in poor local optima.

The training flow of KD#

  1. First, train a teacher network. This network is much more complicated than the student network that we wish to train in the end.

  2. Freeze the teacher network, and generate some inputs.

  3. Use the frozen teacher network and inputs to generate some target outputs.

  4. Use the input and output pairs to train the student network in a supervised manner, as sketched below.
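The snippet below is a minimal sketch of steps 2 through 4 in PyTorch. It assumes that `teacher`, `student`, and `data_loader` are defined elsewhere (hypothetical names), and it reuses the `distillation_loss` function sketched earlier; it is an illustration of the flow, not a complete training recipe.

```python
import torch

# Step 2: freeze the teacher so that only the student is updated.
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

for inputs, _ in data_loader:  # ground-truth labels are optional here
    # Step 3: the frozen teacher produces the target outputs.
    with torch.no_grad():
        teacher_logits = teacher(inputs)

    # Step 4: train the student to match the teacher's outputs.
    student_logits = student(inputs)
    loss = distillation_loss(student_logits, teacher_logits)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```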