Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment

Kavli Affiliate: Ke Wang

| First 5 Authors: Ke Wang, Ke Wang, , ,

| Summary:

Automatic Pronunciation Assessment (APA) is critical for Computer-Assisted
Language Learning (CALL), requiring evaluation across multiple granularities
and aspects. Large Multimodal Models (LMMs) present new opportunities for APA,
but their effectiveness in fine-grained assessment remains uncertain. This work
investigates fine-tuning LMMs for APA using the Speechocean762 dataset and a
private corpus. Fine-tuning significantly outperforms zero-shot settings and
achieves competitive results on single-granularity tasks compared to public and
commercial systems. The model performs well at word and sentence levels, while
phoneme-level assessment remains challenging. We also observe that the Pearson
Correlation Coefficient (PCC) reaches 0.9, whereas Spearman’s rank Correlation
Coefficient (SCC) remains around 0.6, suggesting that SCC better reflects
ordinal consistency. These findings highlight both the promise and limitations
of LMMs for APA and point to future work on fine-grained modeling and
rank-aware evaluation.

| Search Query: ArXiv Query: search_query=au:”Ke Wang”&id_list=&start=0&max_results=3