Kavli Affiliate: Ke Wang
| First 5 Authors: Ke Wang, Lei He, Kun Liu, Yan Deng, Wenning Wei
| Summary:
Large Multimodal Models (LMMs) have demonstrated exceptional performance
across a wide range of domains. This paper explores their potential in
pronunciation assessment tasks, with a particular focus on evaluating the
capabilities of the Generative Pre-trained Transformer (GPT) model,
specifically GPT-4o. Our study investigates its ability to process speech and
audio for pronunciation assessment across multiple levels of granularity and
dimensions, with an emphasis on feedback generation and scoring. For our
experiments, we use the publicly available Speechocean762 dataset. The
evaluation focuses on two key aspects: multi-level scoring and the practicality
of the generated feedback. Scoring results are compared against the manual
scores provided in the Speechocean762 dataset, while feedback quality is
assessed using Large Language Models (LLMs). The findings highlight the
effectiveness of integrating LMMs with traditional methods for pronunciation
assessment, offering insights into the model’s strengths and identifying areas
for further improvement.
| Search Query: ArXiv Query: search_query=au:"Ke Wang"&id_list=&start=0&max_results=3