The results, drawn from thousands of spontaneous voice conversations across more than 60 languages, reveal capability gaps that other benchmarks have consistently missed.