AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

¹ServiceNow   ²University of Texas at Austin   *Work done during internship at ServiceNow

TL;DR

AU-Harness is a standardized, efficient, and highly customizable open-source framework for evaluating audio-based language models on Audio-to-Text tasks. Built for researchers and developers, it provides a comprehensive suite of tools for benchmarking and comparing the performance of audio processing models.

  • Blazing Fast — Evaluates multiple models simultaneously across tasks, datasets, and metrics using independent Engines, enabling full parallelization and a speedup of up to 127% over existing toolkits.
  • Immensely Customizable — Filter datasets by accent, language, and length, and customize models, tasks, and score reporting (see the configuration sketch after this list).
  • Super Modular — Streamlined evaluation processes and modularized functions make the framework easy to extend.
  • Wide Task Coverage — Supports 21 unique tasks across 6 categories, 50+ datasets with 380+ unique subsets, and 9 different metrics.
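
To make the customization concrete, here is a minimal, hypothetical sketch of what a run specification covering multiple models, a dataset filter, and per-task metrics could look like. Every field name below (`models`, `dataset_filters`, `tasks`, `report`, and so on) is an illustrative assumption chosen for readability, not AU-Harness's actual configuration schema; see the repository documentation for the real format.

```python
import json

# Hypothetical run specification: field names are illustrative only and
# do not reflect AU-Harness's real configuration schema.
run_config = {
    "models": [
        {"name": "Qwen-2.5-Omni-7B", "endpoint": "http://localhost:8000/v1"},
        {"name": "Voxtral-Mini-3B", "endpoint": "http://localhost:8001/v1"},
    ],
    "dataset_filters": {          # narrow the evaluation data before running
        "accent": ["en-GB", "en-IN"],
        "language": ["en"],
        "max_audio_seconds": 30,
    },
    "tasks": [
        {"name": "asr_librispeech", "metric": "wer"},
        {"name": "emotion_meld", "metric": "llm_judge_binary"},
    ],
    "report": {"format": "json", "per_subset_scores": True},
}

print(json.dumps(run_config, indent=2))
```

Keeping every run parameter in one declarative object is what makes comparisons reproducible: two users who share the same specification also share prompts, decoding parameters, and metrics.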

What is AU-Harness?

Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient toolkits that limit fair comparison and systematic assessment. Current frameworks suffer from three critical issues: slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities.

To address these challenges, we present AU-Harness, a comprehensive evaluation framework that transforms how audio language models are benchmarked through three core contributions:

  • High-Performance Evaluation Engine: Our architecture leverages vLLM batching, dataset sharding, and parallel task execution to scale evaluations to multi-node infrastructure without sacrificing fidelity (a minimal parallelism sketch follows this list).
  • Unified Configuration System: Standardizes prompting, generation parameters, metrics, and reporting across benchmarks, enabling fair, reproducible comparisons and easy task integration.
  • Expanded Reasoning Assessment: Introduces three new spoken-language reasoning tasks for audio-conditioned reasoning and an LLM-Adaptive Diarization task to assess temporal grounding.
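
To picture the kind of parallelism described above, the sketch below shards a toy dataset, fans each (model, shard) pair out to an independent worker process, and gathers the predictions at the end. It is a self-contained stand-in written for this post, not AU-Harness's engine code; the `run_batch` function is a placeholder where a batched vLLM or API call would go.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def shard(dataset, n_shards):
    """Split a dataset into n roughly equal shards."""
    return [dataset[i::n_shards] for i in range(n_shards)]

def run_batch(model, examples):
    """Placeholder for batched inference (e.g. a call to a vLLM server)."""
    return [{"model": model, "id": ex["id"], "pred": f"<output of {model}>"}
            for ex in examples]

def evaluate(job):
    """One unit of work: a single model applied to a single shard."""
    model, examples = job
    return run_batch(model, examples)

if __name__ == "__main__":
    dataset = [{"id": i, "audio": f"clip_{i}.wav"} for i in range(8)]
    models = ["model_a", "model_b"]
    shards = shard(dataset, n_shards=4)

    # Every (model, shard) pair is an independent job, so different models
    # and different slices of the data are evaluated fully in parallel.
    jobs = list(product(models, shards))
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = [r for batch in pool.map(evaluate, jobs) for r in batch]

    print(f"collected {len(results)} predictions")
```

Because each job owns its own model endpoint and data shard, adding another model, task, or dataset simply adds more independent jobs, which is the property that lets evaluation throughput scale with the available workers.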

Leaderboard

More models will be added soon.
| Model | Librispeech (WER) | MELD (llm_judge_binary) | IEMOCAP (llm_judge_binary) | VoxCeleb (llm_judge_binary) | mmau_mini (llm_judge_binary) | CallHome (WDER) | public_sg_speech_qa_test (llm_judge_detailed) | BigBench Audio (llm_judge_big_bench_audio) | Covost2 zh-CN->EN (BLEU) | mnsc_sds P3 (llm_judge_detailed) | SLURP (llm_judge_binary) | audiocaps_qa (llm_judge_detailed) | mu_chomusic_test (llm_judge_binary) | IFEval (instruction_following_score) | MTBench (llm_judge_mt_bench) | Spider (sql_score, EM) | BFCL (bfcl_match_score) | advbench (redteaming_judge) | avspoof (llm_judge_binary) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Voxtral-Mini-3B | 2.10 | 28.4 | 54.9 | 13.0 | 45.8 | 35.38 | 62.12 | 43.5 | 15.27 | 52.2 | 42.5 | 14.96 | 45.4 | 38.06 | 64.12 | 30.17 | 78.5 | 78.5 | 91.5 |
| Phi-4-multimodal-instruct | 1.97 | 30.5 | 50.5 | 27.6 | 47.2 | 39.31 | 63.32 | 40.8 | 24.32 | 54.8 | 23.0 | 26.08 | 44.8 | 49.74 | 58.0 | 27.37 | 17.1 | 97.1 | 10.0 |
| Qwen-2.5-Omni-7B | 1.74 | 49.8 | 85.8 | 28.7 | 62.3 | 35.4 | 69.4 | 53.8 | 28.41 | 52.0 | 57.0 | 38.4 | 59.3 | 50.83 | 62.88 | 38.46 | 68.0 | 98.3 | 30.0 |
| Ultravox-v0_6-llama-3_3-70b | 6.41 | 42.6 | 42.5 | 33.7 | 44.7 | 35.39 | 67.82 | 75.2 | 19.09 | 56.2 | 68.5 | 10.6 | 54.8 | 76.6 | 68.0 | 31.27 | 84.38 | 88.1 | 45.5 |
| GPT-4o-mini-audio-preview [WIP] | 6.25 | 20.2 | 0* | 0* | 42.0 | 37.14 | 70.2 | 65.0 | 21.68 | 61.2 | 48.0 | 15.08 | 50.2 | 72.15 | 62.44 | 45.15 | 86.65 | 88.1 | 0* |
* Reported performance is impacted by the Azure OpenAI content filtering system.

Citation

If you find this work useful for your research, please consider citing our paper:

@article{surapaneni2025auharness,
  title={AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs},
  author={Sidharth Surapaneni and Hoang Nguyen and Jash Mehta and Aman Tiwari and Oluwanifemi Bamgbose and Akshay Kalkunte and Sai Rajeswar and Sathwik Tejaswi Madhusudhan},
  journal={arXiv preprint arXiv:2509.08031},
  year={2025}
}