Recent advances in large-scale pre-trained language models (e.g., BERT) have brought significant potential to natural language processing.
However, their large model size hinders their use on IoT and edge devices. Several studies have applied task-specific knowledge distillation to compress pre-trained language models. However, a sound strategy for distilling knowledge into a student model with fewer layers than the teacher model is still lacking.
In this work, we present Layer-wise Adaptive Distillation (LAD), a task-specific distillation framework that can be used to reduce the model size of BERT. We design an iterative aggregation mechanism with multiple gate blocks in LAD to adaptively distill layer-wise internal knowledge from the teacher model to the student model. The proposed method enables effective knowledge transfer to the student model without skipping any teacher layers.
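As a rough illustration only (not the exact formulation used in LAD), the PyTorch-style sketch below shows one way a gated, iterative aggregation of teacher hidden states could look; the names GateBlock and aggregate_teacher_layers, and the sigmoid-gated fusion, are assumptions made for this example.

import torch
import torch.nn as nn

class GateBlock(nn.Module):
    """Hypothetical gate block: fuses the running aggregate with the
    current teacher layer's hidden state via a learned scalar gate."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_size, 1)

    def forward(self, prev_agg: torch.Tensor, teacher_hidden: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) controls how much of the new teacher layer is mixed in.
        g = torch.sigmoid(self.gate(torch.cat([prev_agg, teacher_hidden], dim=-1)))
        return g * teacher_hidden + (1.0 - g) * prev_agg

def aggregate_teacher_layers(teacher_hiddens, gate_blocks):
    """Iteratively aggregate teacher hidden states (no layer is skipped),
    producing one distillation target per gate block / student layer."""
    agg = teacher_hiddens[0]
    targets = []
    for hidden, gate in zip(teacher_hiddens[1:], gate_blocks):
        agg = gate(agg, hidden)
        targets.append(agg)
    return targets

In this sketch, a 12-layer teacher paired with six gate blocks would yield six aggregated targets, each of which could supervise one student layer with a standard regression (e.g., MSE) distillation loss.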
The experimental results show that both the six-layer and four-layer LAD student models outperform previous task-specific distillation approaches on the GLUE tasks.