Recent advances in large-scale pre-trained language models (e.g., BERT) have brought significant potential to natural language processing.
However, their large model size hinders their use on IoT and edge devices. Several studies have applied task-specific knowledge distillation to compress pre-trained language models. However, a sound strategy for distilling knowledge into a student model with fewer layers than the teacher model is still lacking.
In this work, we present Layer-wise Adaptive Distillation (LAD), a task-specific distillation framework that can be used to reduce the model size of BERT. We design an iterative aggregation mechanism with multiple gate blocks in LAD to adaptively distill layer-wise internal knowledge from the teacher model to the student model. The proposed method enables effective knowledge transfer to the student model without skipping any teacher layers.
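As a rough illustration only (not the exact formulation used in LAD), the PyTorch-style sketch below shows one way a gated, iterative aggregation of teacher hidden states could look; the names GateBlock and aggregate_teacher_layers, and the sigmoid-gated fusion, are assumptions made for this example.

import torch
import torch.nn as nn

class GateBlock(nn.Module):
    """Hypothetical gate block: fuses the running aggregate with the
    current teacher layer's hidden state via a learned scalar gate."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_size, 1)

    def forward(self, prev_agg: torch.Tensor, teacher_hidden: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) controls how much of the new teacher layer is mixed in.
        g = torch.sigmoid(self.gate(torch.cat([prev_agg, teacher_hidden], dim=-1)))
        return g * teacher_hidden + (1.0 - g) * prev_agg

def aggregate_teacher_layers(teacher_hiddens, gate_blocks):
    """Iteratively aggregate teacher hidden states (no layer is skipped),
    producing one distillation target per gate block / student layer."""
    agg = teacher_hiddens[0]
    targets = []
    for hidden, gate in zip(teacher_hiddens[1:], gate_blocks):
        agg = gate(agg, hidden)
        targets.append(agg)
    return targets

In this sketch, a 12-layer teacher paired with six gate blocks would yield six aggregated targets, each of which could supervise one student layer with a standard regression (e.g., MSE) distillation loss.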
The experimental results show that both the six-layer and four-layer LAD student models outperform previous task-specific distillation approaches on the GLUE tasks.