MTR-DuplexBench Overview

MTR-DuplexBench is a benchmark designed to evaluate Full-Duplex Speech Language Models (FD-SLMs), such as Moshi, in multi-round conversation settings. It targets the difficulties specific to evaluating full-duplex dialogue, where turn boundaries blur and context inconsistencies accumulate across rounds. In addition to dialogue quality, the benchmark assesses user safety, checking whether the model avoids producing harmful or toxic content.

Architecture

The architecture of MTR-DuplexBench combines a turn segmentation pipeline with a set of evaluated conversational features:

  1. Dialogue Segmentation: Segments continuous full-duplex dialogues into discrete turns so they can be evaluated turn by turn.
  2. Full-Duplex Turn Segmentation Methodology: Identifies the start and end points of user turns within full-duplex dialogues.
  3. Majority Voting with Clustering and Filtering: Stabilizes segmentation by clustering candidate turns based on time overlap and keeping only well-supported segments (a minimal sketch follows this list).
  4. Smooth Turn-taking: Evaluates the model's ability to take the turn seamlessly when the user stops speaking.
  5. Interruption Handling: Assesses whether the model stops speaking upon user interruption and then resumes the conversation appropriately.
  6. Pause Handling: Evaluates whether the model remains silent during brief pauses in user speech.
  7. Background Speech: Evaluates whether the model maintains the correct conversational state when background speech occurs.
  8. Backchanneling: Assesses whether the model provides acknowledgment cues during user speech.
  9. Safety Evaluation: Measures the model's ability to avoid producing harmful or toxic outputs.
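
The majority-voting step can be illustrated with a minimal sketch. The Python below is an assumption-laden illustration, not the paper's implementation: candidate turns are (start, end) times in seconds collected from multiple segmentation passes, and the overlap threshold, vote count, and function names are all illustrative.

```python
from typing import List, Tuple

Turn = Tuple[float, float]  # (start_sec, end_sec)

def overlap_ratio(a: Turn, b: Turn) -> float:
    """Temporal intersection-over-union of two candidate turns."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def cluster_and_vote(candidates: List[Turn],
                     overlap_thresh: float = 0.5,
                     min_votes: int = 2) -> List[Turn]:
    """Group candidate turns by time overlap, drop clusters with too few
    votes, and return the median boundaries of each surviving cluster."""
    clusters: List[List[Turn]] = []
    for cand in sorted(candidates):
        for cluster in clusters:
            if overlap_ratio(cand, cluster[0]) >= overlap_thresh:
                cluster.append(cand)
                break
        else:
            clusters.append([cand])
    voted = []
    for cluster in clusters:
        if len(cluster) < min_votes:   # filtering: unstable segments are dropped
            continue
        starts = sorted(s for s, _ in cluster)
        ends = sorted(e for _, e in cluster)
        mid = len(cluster) // 2
        voted.append((starts[mid], ends[mid]))
    return sorted(voted)

# Example: three segmentation passes agree on one user turn; a stray
# candidate with no support is filtered out.
print(cluster_and_vote([(1.0, 3.2), (1.1, 3.0), (0.9, 3.3), (7.0, 7.4)]))
```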

Goals

The primary goals of MTR-DuplexBench include:

  • Evaluating the performance of FD-SLMs in multi-round conversational scenarios.
  • Addressing the challenges of blurred turn boundaries and context inconsistencies.
  • Ensuring user safety by preventing harmful outputs.
  • Improving dialogue quality in multi-round interactions.
  • Assessing the model's ability to handle various conversational features such as smooth turn-taking, interruption, pause handling, and background speech.

Dataset Information

MTR-DuplexBench is constructed from several data sources:

  • Synthetic dialogues generated with GPT-4o and synthesized into speech using CosyVoice 2.
  • The Llama Question dataset from OpenAudioBench (300 spoken queries).
  • The AdvBench dataset from VoiceBench (520 spoken queries).
  • Natural spoken dialogues, which together with the synthetic data above form the benchmark's own multi-round evaluation set (a minimal sketch of one item's structure follows this list).
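
To make the shape of a multi-round item concrete, here is a minimal sketch of one possible data layout; the class and field names (DialogueItem, Round, feature, and so on) are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Round:
    """One user turn plus the conversational feature to test in that round."""
    user_audio_path: str     # spoken query (synthetic or natural speech)
    feature: str             # e.g. "smooth_turn_taking", "interruption", "pause", "safety"
    reference_text: str = "" # optional reference answer for GPT-based scoring

@dataclass
class DialogueItem:
    """A multi-round full-duplex dialogue assembled from the sources above."""
    dialogue_id: str
    source: str              # e.g. "gpt4o+cosyvoice2", "llama_questions", "advbench"
    rounds: List[Round] = field(default_factory=list)

# Example: a two-round item mixing a question-answering round and a safety round.
item = DialogueItem(
    dialogue_id="demo-001",
    source="gpt4o+cosyvoice2",
    rounds=[
        Round("audio/demo-001_r1.wav", "smooth_turn_taking", "Paris"),
        Round("audio/demo-001_r2.wav", "safety"),
    ],
)
```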

Outputs

The outputs of MTR-DuplexBench include:

  • Evaluation Metrics: Metrics such as GPT-score, success rate for each feature, latency of model response, and refusal rate for safety evaluation.
  • Headline Results: Key findings include increases in success rate and backchannel frequency across several conversational features, as well as a safety success rate of approximately 90% for Moshi.

Evaluation

MTR-DuplexBench supports a comprehensive evaluation framework that includes:

  • Turn-by-turn evaluation of dialogue quality within each round (a minimal loop is sketched below).
  • Multi-round evaluation covering multiple conversational features across successive rounds.
  • Evaluation data drawn from the Candor dataset, the Llama Question dataset, and AdvBench, in addition to the benchmark's own multi-round set.
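
A minimal sketch of a turn-by-turn, multi-round evaluation loop is shown below, reusing the illustrative DialogueItem layout from earlier. The model interface (reset, respond) and the judge callable are hypothetical placeholders for an FD-SLM wrapper and a GPT-based scorer, not an interface defined by the benchmark.

```python
from typing import Callable, Dict, List

def evaluate_dialogue(model, item, judge: Callable[[str, str], float]) -> List[Dict]:
    """Run one multi-round dialogue turn by turn and score each round.

    `model` is assumed to expose reset() and respond(audio_path) -> (text, latency_s);
    `judge` maps (response_text, reference_text) to a GPT-style score.
    Both are placeholders for illustration only.
    """
    model.reset()
    per_round = []
    for i, rnd in enumerate(item.rounds, start=1):
        response_text, latency_s = model.respond(rnd.user_audio_path)
        per_round.append({
            "round": i,
            "feature": rnd.feature,
            "latency_s": latency_s,
            "gpt_score": judge(response_text, rnd.reference_text),
        })
    return per_round
```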

Benchmarks and Metrics

The evaluation metrics used in MTR-DuplexBench include:

  • Success Rate (%)
  • Latency (s)
  • Backchannel Frequency
  • Refusal Rate (%) for safety evaluation (a minimal aggregation sketch follows this list)
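
Assuming per-round records like those produced by the loop above, the headline metrics could be aggregated roughly as follows; field names such as success, backchannels, and refused are illustrative, not the benchmark's schema.

```python
from statistics import mean
from typing import Dict, List

def aggregate_metrics(records: List[Dict]) -> Dict[str, float]:
    """Aggregate per-round records into benchmark-style metrics.

    Each record is assumed to carry: success (bool), latency_s (float),
    backchannels (int), and, for safety rounds, refused (bool).
    """
    safety = [r for r in records if r.get("feature") == "safety"]
    return {
        "success_rate_pct": 100.0 * mean(1.0 if r["success"] else 0.0 for r in records),
        "latency_s": mean(r["latency_s"] for r in records),
        "backchannel_frequency": mean(r.get("backchannels", 0) for r in records),
        "refusal_rate_pct": (100.0 * mean(1.0 if r["refused"] else 0.0 for r in safety)
                             if safety else float("nan")),
    }

# Example with two rounds, one of them a safety probe.
print(aggregate_metrics([
    {"feature": "smooth_turn_taking", "success": True, "latency_s": 0.8, "backchannels": 1},
    {"feature": "safety", "success": True, "latency_s": 1.1, "backchannels": 0, "refused": True},
]))
```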

Robustness Findings

Evaluations with the benchmark reveal performance degradation as the number of interaction rounds increases, underscoring the importance of assessing robustness over extended, real-world interactions.
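
One way to surface this kind of per-round degradation is to break the success rate down by round index, as in this small sketch (same illustrative record format as above).

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

def success_by_round(records: List[Dict]) -> Dict[int, float]:
    """Group per-round records by round index and report success rate per round."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["round"]].append(1.0 if r["success"] else 0.0)
    return {rnd: 100.0 * mean(vals) for rnd, vals in sorted(buckets.items())}

# Example: success drops from round 1 to round 2.
print(success_by_round([
    {"round": 1, "success": True}, {"round": 1, "success": True},
    {"round": 2, "success": True}, {"round": 2, "success": False},
]))  # -> {1: 100.0, 2: 50.0}
```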

Limitations and Open Questions

Despite its advancements, MTR-DuplexBench has limitations:

  • The benchmark relies on a combination of natural and synthetic datasets, which may affect the generalizability of results.
  • Future efforts could enhance the diversity of full-duplex conversational data to better reflect real-world scenarios.

Conclusion

MTR-DuplexBench represents a significant step forward in the evaluation of Full-Duplex Speech Language Models such as Moshi, providing a robust framework for assessing model performance in multi-round interactions. Its comprehensive evaluation metrics and its focus on safety and dialogue quality make it a valuable tool for advancing conversational AI.

Sources

https://arxiv.org/abs/2511.10262v1