AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models

Kavli Affiliate: Xiang Zhang

| First 5 Authors: Dongkuan Xu, Subhabrata Mukherjee, Xiaodong Liu, Debadeepta Dey, Wenhui Wang

| Summary:

Knowledge distillation (KD) methods compress large models into smaller
students with manually designed student architectures for a pre-specified
computational cost. This requires several trials to find a viable student, and
the process must be repeated whenever the student or the computational budget
changes.
We use Neural Architecture Search (NAS) to automatically distill several
compressed students with variable cost from a large model. Current works train
a single SuperLM consisting of millions of subnetworks with weight-sharing,
resulting in interference between subnetworks of different sizes. Our framework
AutoDistil addresses the above challenges with the following steps (see the
sketches below): (a) it incorporates inductive bias and heuristics to partition
the Transformer search space into K compact sub-spaces (K=3 for the typical
student sizes base, small, and tiny); (b) it trains one SuperLM per sub-space
with a task-agnostic objective (e.g., self-attention distillation) and
weight-sharing among the students; (c) it performs a lightweight search for the
optimal student without re-training. Fully task-agnostic training and search
allow the students to be reused for fine-tuning on any downstream task.
Experiments on the GLUE benchmark against state-of-the-art KD and NAS methods
demonstrate that AutoDistil outperforms leading compression techniques with up
to a 2.7x reduction in computational cost and negligible loss in task
performance.
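
The first sketch illustrates step (a): carving the Transformer search space into K=3 compact sub-spaces. It is a minimal, assumed layout; the specific ranges for depth, hidden size, attention heads, and FFN ratio are illustrative, not the values used in the paper.

```python
# Sketch of step (a): partitioning the Transformer search space into K = 3
# compact sub-spaces (base, small, tiny). All ranges below are illustrative
# assumptions, not the paper's published values.
from itertools import product

SUB_SPACES = {
    "base":  {"num_layers": [10, 11, 12], "hidden_size": [640, 704, 768],
              "num_heads": [10, 11, 12],  "ffn_ratio": [3.0, 3.5, 4.0]},
    "small": {"num_layers": [7, 8, 9],    "hidden_size": [448, 512, 576],
              "num_heads": [7, 8, 9],     "ffn_ratio": [2.5, 3.0, 3.5]},
    "tiny":  {"num_layers": [4, 5, 6],    "hidden_size": [256, 320, 384],
              "num_heads": [4, 5, 6],     "ffn_ratio": [2.0, 2.5, 3.0]},
}

def enumerate_students(sub_space):
    """Yield every candidate student architecture in one compact sub-space."""
    keys = list(sub_space)
    for values in product(*(sub_space[k] for k in keys)):
        yield dict(zip(keys, values))

# Each sub-space stays small enough to enumerate exhaustively in step (c).
print(sum(1 for _ in enumerate_students(SUB_SPACES["tiny"])))  # 81 candidates
```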
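
For step (b), here is a minimal sketch of a task-agnostic self-attention distillation loss in the spirit of MiniLM-style attention transfer. Distilling only the last layer's attention distributions and assuming equal head counts are simplifications for illustration; the helper names in the trailing comments are hypothetical.

```python
# Sketch of step (b): a task-agnostic self-attention distillation objective.
# Last-layer-only transfer and equal head counts are simplifying assumptions.
import torch

def self_attention_distill_loss(teacher_attn: torch.Tensor,
                                student_attn: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over self-attention distributions.

    Both tensors have shape [batch, heads, seq_len, seq_len] and already hold
    softmax-normalized attention probabilities for the last Transformer layer.
    """
    t = teacher_attn.clamp_min(1e-9)
    s = student_attn.clamp_min(1e-9)
    # Sum the KL term over keys, then average over batch, heads and queries.
    return (t * (t.log() - s.log())).sum(dim=-1).mean()

# Weight-sharing training loop (comments only; helper names are hypothetical):
#   arch   = sample_architecture(sub_space)            # random student per step
#   s_attn = super_lm.forward_subnetwork(batch, arch)  # shared SuperLM weights
#   loss   = self_attention_distill_loss(t_attn, s_attn)
#   loss.backward(); optimizer.step()
```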
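
Step (c) can then be a lightweight, task-agnostic search: each candidate in a compact sub-space inherits its weights from the trained SuperLM and is ranked by a proxy score (for example, the distillation loss on held-out data) under a cost budget, with no re-training. The hook functions below are hypothetical placeholders, not AutoDistil's API.

```python
# Sketch of step (c): lightweight search without re-training. Candidates keep
# the weights of the trained SuperLM and are only evaluated, never updated.
# The callables passed in (extract_fn, score_fn, cost_fn) are hypothetical.
from typing import Callable, Dict, Iterable, Optional

def search_optimal_student(candidates: Iterable[Dict],
                           extract_fn: Callable[[Dict], object],
                           score_fn: Callable[[object], float],
                           cost_fn: Callable[[Dict], float],
                           cost_budget: float) -> Optional[Dict]:
    """Return the lowest-scoring architecture that fits the cost budget."""
    best_arch, best_score = None, float("inf")
    for arch in candidates:                 # e.g. enumerate_students(...) above
        if cost_fn(arch) > cost_budget:     # skip students over the FLOPs budget
            continue
        student = extract_fn(arch)          # weights taken from the SuperLM as-is
        score = score_fn(student)           # evaluation only, no gradient updates
        if score < best_score:
            best_arch, best_score = arch, score
    return best_arch
```

Because each sub-space is compact, exhaustively enumerating its candidates (as in the first sketch) keeps this search cheap.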

| Search Query: ArXiv Query: search_query=au:"Xiang Zhang"&id_list=&start=0&max_results=10
