Kavli Affiliate: Xiang Zhang | First 5 Authors: Dongkuan Xu, Subhabrata Mukherjee, Xiaodong Liu, Debadeepta Dey, Wenhui Wang | Summary: Knowledge distillation (KD) methods compress large models into smaller students with manually designed student architectures given a pre-specified computational cost. This requires several trials to find a viable student, and further repeating the process for each student […]
Continue reading: AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models
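For context, the snippet below is a minimal sketch of the vanilla knowledge-distillation objective the summary refers to: a blend of hard-label cross-entropy and a temperature-softened KL term transferring the teacher's output distribution. It is not AutoDistil's few-shot, task-agnostic NAS procedure; the function name `kd_loss`, the temperature, and the mixing weight `alpha` are illustrative assumptions.

```python
# Minimal sketch of a standard knowledge-distillation (KD) loss, for context only.
# This is NOT AutoDistil's method; names, dimensions, temperature, and alpha
# are illustrative assumptions.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-label KL term."""
    # Hard-label term: ordinary cross-entropy against the ground truth.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened distributions.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2
    return alpha * ce + (1.0 - alpha) * kl

# Toy usage: a batch of 4 examples over a 10-class output space.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = kd_loss(student_logits, teacher_logits, labels)
loss.backward()
```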