AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

¹ServiceNow   ²University of Texas at Austin   *Work done during internship at ServiceNow

TL;DR

AU-Harness is a standardized, efficient, and highly customizable open-source framework for evaluating audio-based language models on Audio-to-Text tasks. Built for researchers and developers, it provides a comprehensive suite of tools for benchmarking and comparing the performance of audio processing models.

  • Blazing Fast — Evaluates multiple models simultaneously across tasks, datasets, and metrics using independent Engines, enabling full parallelization and a speedup of up to 127% over existing toolkits.
  • Immensely Customizable — Filter datasets by accent, language, and length, and customize models, tasks, and score reporting (see the configuration sketch after this list).
  • Super Modular — Streamlined evaluation processes and modularized functions make the framework easy to extend.
  • Wide Task Coverage — Supports 21 unique tasks across 6 categories, 50+ datasets with 380+ unique subsets, and 9 different metrics.
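
To make the customization concrete, here is a minimal, hypothetical sketch of what a run specification covering multiple models, a dataset filter, and per-task metrics could look like. Every field name below (`models`, `dataset_filters`, `tasks`, `report`, and so on) is an illustrative assumption chosen for readability, not AU-Harness's actual configuration schema; see the repository documentation for the real format.

```python
import json

# Hypothetical run specification: field names are illustrative only and
# do not reflect AU-Harness's real configuration schema.
run_config = {
    "models": [
        {"name": "Qwen-2.5-Omni-7B", "endpoint": "http://localhost:8000/v1"},
        {"name": "Voxtral-Mini-3B", "endpoint": "http://localhost:8001/v1"},
    ],
    "dataset_filters": {          # narrow the evaluation data before running
        "accent": ["en-GB", "en-IN"],
        "language": ["en"],
        "max_audio_seconds": 30,
    },
    "tasks": [
        {"name": "asr_librispeech", "metric": "wer"},
        {"name": "emotion_meld", "metric": "llm_judge_binary"},
    ],
    "report": {"format": "json", "per_subset_scores": True},
}

print(json.dumps(run_config, indent=2))
```

Keeping every run parameter in one declarative object is what makes comparisons reproducible: two users who share the same specification also share prompts, decoding parameters, and metrics.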

What is AU-Harness?

Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient toolkits that limit fair comparison and systematic assessment. Current frameworks suffer from three critical issues: slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities.

To address these challenges, we present AU-Harness, a comprehensive evaluation framework that transforms how audio language models are benchmarked through three core contributions:

  • High-Performance Evaluation Engine: Our architecture leverages vLLM batching, dataset sharding, and parallel task execution to scale evaluations to multi-node infrastructure without sacrificing fidelity (a minimal parallelism sketch follows this list).
  • Unified Configuration System: Standardizes prompting, generation parameters, metrics, and reporting across benchmarks, enabling fair, reproducible comparisons and easy task integration.
  • Expanded Reasoning Assessment: Introduces three new spoken-language reasoning tasks for audio-conditioned reasoning and an LLM-Adaptive Diarization task to assess temporal grounding.
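
To picture the kind of parallelism described above, the sketch below shards a toy dataset, fans each (model, shard) pair out to an independent worker process, and gathers the predictions at the end. It is a self-contained stand-in written for this post, not AU-Harness's engine code; the `run_batch` function is a placeholder where a batched vLLM or API call would go.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def shard(dataset, n_shards):
    """Split a dataset into n roughly equal shards."""
    return [dataset[i::n_shards] for i in range(n_shards)]

def run_batch(model, examples):
    """Placeholder for batched inference (e.g. a call to a vLLM server)."""
    return [{"model": model, "id": ex["id"], "pred": f"<output of {model}>"}
            for ex in examples]

def evaluate(job):
    """One unit of work: a single model applied to a single shard."""
    model, examples = job
    return run_batch(model, examples)

if __name__ == "__main__":
    dataset = [{"id": i, "audio": f"clip_{i}.wav"} for i in range(8)]
    models = ["model_a", "model_b"]
    shards = shard(dataset, n_shards=4)

    # Every (model, shard) pair is an independent job, so different models
    # and different slices of the data are evaluated fully in parallel.
    jobs = list(product(models, shards))
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = [r for batch in pool.map(evaluate, jobs) for r in batch]

    print(f"collected {len(results)} predictions")
```

Because each job owns its own model endpoint and data shard, adding another model, task, or dataset simply adds more independent jobs, which is the property that lets evaluation throughput scale with the available workers.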

Leaderboard

More models will be added soon.
| Model | Librispeech (WER) | MELD (llm_judge_binary) | IEMOCAP (llm_judge_binary) | VoxCeleb (llm_judge_binary) | mmau_mini (llm_judge_binary) | CallHome (WDER) | public_sg_speech_qa_test (llm_judge_detailed) | BigBench Audio (llm_judge_big_bench_audio) | Covost2 zh-CN->EN (BLEU) | mnsc_sds P3 (llm_judge_detailed) | SLURP (llm_judge_binary) | audiocaps_qa (llm_judge_detailed) | mu_chomusic_test (llm_judge_binary) | IFEval (instruction_following_score) | MTBench (llm_judge_mt_bench) | Spider (sql_score, EM) | BFCL (bfcl_match_score) | advbench (redteaming_judge) | avspoof (llm_judge_binary) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Voxtral-Mini-3B | 2.10 | 28.4 | 54.9 | 13.0 | 45.8 | 35.38 | 62.12 | 43.5 | 15.27 | 52.2 | 42.5 | 14.96 | 45.4 | 38.06 | 64.12 | 30.17 | 78.5 | 78.5 | 91.5 |
| Phi-4-multimodal-instruct | 1.97 | 30.5 | 50.5 | 27.6 | 47.2 | 39.31 | 63.32 | 40.8 | 24.32 | 54.8 | 23.0 | 26.08 | 44.8 | 49.74 | 58.0 | 27.37 | 17.1 | 97.1 | 10.0 |
| Qwen-2.5-Omni-7B | 1.74 | 49.8 | 85.8 | 28.7 | 62.3 | 35.4 | 69.4 | 53.8 | 28.41 | 52.0 | 57.0 | 38.4 | 59.3 | 50.83 | 62.88 | 38.46 | 68.0 | 98.3 | 30.0 |
| Ultravox-v0_6-llama-3_3-70b | 6.41 | 42.6 | 42.5 | 33.7 | 44.7 | 35.39 | 67.82 | 75.2 | 19.09 | 56.2 | 68.5 | 10.6 | 54.8 | 76.6 | 68.0 | 31.27 | 84.38 | 88.1 | 45.5 |
| GPT-4o-mini-audio-preview [WIP] | 6.25 | 20.2 | 0* | 0* | 42.0 | 37.14 | 70.2 | 65.0 | 21.68 | 61.2 | 48.0 | 15.08 | 50.2 | 72.15 | 62.44 | 45.15 | 86.65 | 88.1 | 0* |
* Reported performance is impacted by the Azure OpenAI content filtering system.

Citation

If you find this work useful for your research, please consider citing our paper:

@article{surapaneni2025auharness,
  title={AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs},
  author={Sidharth Surapaneni and Hoang Nguyen and Jash Mehta and Aman Tiwari and Oluwanifemi Bamgbose and Akshay Kalkunte and Sai Rajeswar and Sathwik Tejaswi Madhusudhan},
  journal={arXiv preprint arXiv:2509.08031},
  year={2025}
}