XAttnMark: Learning Robust Audio Watermarking with Cross-Attention
(Cross-Attention Robust Audio Watermark)

Lehigh University | Dolby Laboratories Inc.

TL;DR (Summary)

Teaser Figure

Limitations of previous audio watermarking methods: While AudioSeal achieves good robustness against signal processing attacks, it struggles with generative editing attacks and cannot provide accurate attribution. Our XAttnMark addresses both challenges through cross-attention and temporal conditioning mechanisms.

We introduce XAttnMark, a state-of-the-art approach for robust audio watermarking that achieves both reliable detection and accurate attribution through cross-attention and temporal conditioning mechanisms. Our method demonstrates superior robustness against various audio transformations (including challenging generative editing!) while maintaining high perceptual quality.

Abstract

The rapid proliferation of generative audio synthesis and editing technologies has raised significant concerns about copyright infringement, data provenance, and the spread of misinformation through deepfake audio. Watermarking offers a proactive solution by embedding imperceptible, identifiable, and traceable marks into audio content. While recent neural network-based watermarking methods such as WavMark and AudioSeal have improved robustness and quality, they struggle to achieve robust detection and accurate attribution simultaneously. This paper introduces the Cross-Attention Robust Audio Watermark (XAttnMark), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution.

Method Overview

System Overview

System overview of XAttnMark. XAttnMark consists of a watermark generator and a watermark detector that share an embedding table, which facilitates message decoding through a cross-attention module. In the generator, an encoder network first maps the audio into a latent representation, and a temporal modulation then hides the message in that latent. The modulated latent is fed into a decoder to produce the watermark residual. In the detector, a linear detection head detects the presence of a watermark, and a cross-attention module over the shared embedding table decodes the message.
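
To make this layout concrete, below is a minimal PyTorch sketch of a generator/detector pair with a shared message-embedding table, additive temporal conditioning, and cross-attention decoding. The layer sizes, the single-convolution encoder/decoder, and the way bit queries are drawn from the table are illustrative assumptions, not the paper's exact architecture.

    # Minimal sketch only: layer sizes and the one-layer conv encoder/decoder
    # are assumptions, not XAttnMark's exact architecture.
    import torch
    import torch.nn as nn

    N_BITS, DIM = 16, 64

    class Generator(nn.Module):
        def __init__(self, emb_table: nn.Embedding):
            super().__init__()
            self.encoder = nn.Conv1d(1, DIM, kernel_size=7, padding=3)
            self.emb_table = emb_table  # shared with the detector
            self.decoder = nn.Conv1d(DIM, 1, kernel_size=7, padding=3)

        def forward(self, audio, bits):
            # audio: (B, 1, T); bits: (B, N_BITS) with values in {0, 1}
            z = self.encoder(audio)                        # (B, DIM, T)
            # Temporal conditioning: embed each (bit, position) pair, pool,
            # and add the message vector to the latent at every time step.
            ids = bits.long() + 2 * torch.arange(N_BITS)
            msg = self.emb_table(ids).sum(dim=1)           # (B, DIM)
            z = z + msg.unsqueeze(-1)
            return audio + self.decoder(z)                 # add watermark residual

    class Detector(nn.Module):
        def __init__(self, emb_table: nn.Embedding):
            super().__init__()
            self.encoder = nn.Conv1d(1, DIM, kernel_size=7, padding=3)
            self.detect_head = nn.Linear(DIM, 1)           # watermark presence score
            self.attn = nn.MultiheadAttention(DIM, num_heads=4, batch_first=True)
            self.emb_table = emb_table
            self.bit_head = nn.Linear(DIM, 1)

        def forward(self, audio):
            feats = self.encoder(audio).transpose(1, 2)    # (B, T, DIM)
            presence = self.detect_head(feats.mean(dim=1)) # linear detection head
            # Cross-attention: one query per message bit attends over audio features.
            q = self.emb_table.weight[::2].unsqueeze(0).expand(feats.size(0), -1, -1)
            out, _ = self.attn(q, feats, feats)            # (B, N_BITS, DIM)
            return presence, self.bit_head(out).squeeze(-1)

    emb = nn.Embedding(2 * N_BITS, DIM)                    # shared embedding table
    gen, det = Generator(emb), Detector(emb)
    wm = gen(torch.randn(2, 1, 16000), torch.randint(0, 2, (2, N_BITS)))
    presence, bits = det(wm)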

Audio Examples

Watermarking Examples

Here are examples demonstrating our watermarking system. Each pair consists of an original audio and its watermarked counterpart.

Sample #1
Original Audio
Watermarked Audio

Generative Editing Examples

These examples demonstrate the imperceptibility of our watermark and its detectability after generative editing. The editing pipeline is shown below:

Generative Editing Pipeline

Overview of our generative editing pipeline for testing watermark robustness.

Original Clean Audio
Original Watermarked Audio
After Generative Editing (EDM Style)

Main Results

Detection and Attribution Performance

Our method achieves state-of-the-art performance in both detection (99.19% average accuracy) and attribution (93% average accuracy) across various audio transformations. We maintain high detection accuracy even under challenging conditions:

  • Superior robustness against standard audio edits (99.5% detection for speed changes)
  • Strong performance on neural audio codecs (96.5% detection for EnCodec)
  • Reliable attribution across different user pool sizes (92-94% average accuracy)
Detection and Attribution Results

Detection and attribution performance comparison across different watermarking methods. Our approach consistently outperforms baselines in both tasks.

Attribution Accuracy

Attribution accuracy with different numbers of users, demonstrating the scalability of our approach.
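
The attribution numbers above correspond to matching the decoded message against a pool of per-user codes. Below is a hedged sketch of one such matching rule, nearest neighbor in Hamming distance; the paper's exact attribution procedure may differ.

    # Hedged sketch: attribution as nearest-neighbor matching of the decoded
    # message against each user's assigned code; Hamming distance is an
    # assumption, not necessarily the paper's exact rule.
    import numpy as np

    def attribute(decoded_bits: np.ndarray, user_codes: np.ndarray) -> int:
        """Index of the user whose code has minimum Hamming distance."""
        return int(np.argmin((user_codes != decoded_bits).sum(axis=1)))

    rng = np.random.default_rng(0)
    codes = rng.integers(0, 2, size=(1000, 16))   # 1,000-user pool, 16-bit messages
    noisy = codes[42] ^ (rng.random(16) < 0.1)    # decoded bits with ~10% bit flips
    print(attribute(noisy, codes))                # ideally recovers user 42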

Robustness Against Standard Audio Edits

XAttnMark demonstrates exceptional robustness across a wide range of audio transformations:

  • Lossy Compression: Near-perfect performance on MP3 and AAC (98-100% attribution)
  • Frequency Operations: Maintains 97.5-99.5% detection accuracy across different frequency filters
  • Volume Changes: 97.5% detection and 100% attribution for volume adjustments
Robustness Results Against Standard Audio Edits

Comprehensive evaluation of robustness against various audio transformations, showing superior performance across different types and strengths of edits.

Robustness Results Against Standard Audio Edits with Different Configurations

Robustness against the same audio transformations evaluated under different parameter configurations, again showing superior performance across edit types and strengths.
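
As a companion to these results, here is a small robustness-harness sketch built on torchaudio's DSP functions. The edit set and parameters are illustrative, and codec attacks (MP3, AAC, EnCodec) would require an external encoder that is not shown.

    # Robustness-harness sketch using torchaudio's DSP ops; the edit set and
    # parameters are illustrative.
    import torch
    import torchaudio.functional as F

    SR = 16000

    def edits(x: torch.Tensor):
        """Yield (name, edited_audio) pairs for a few standard transformations."""
        yield "lowpass_4kHz", F.lowpass_biquad(x, SR, cutoff_freq=4000.0)
        yield "highpass_500Hz", F.highpass_biquad(x, SR, cutoff_freq=500.0)
        yield "volume_x0.5", 0.5 * x
        yield "noise_30dB", x + 10 ** (-30 / 20) * torch.randn_like(x)
        # Speed-up by 1.25x: resample to a lower rate, then play back at SR.
        yield "speed_x1.25", F.resample(x, orig_freq=SR, new_freq=int(SR / 1.25))

    x = torch.randn(1, SR)  # stand-in for a watermarked clip
    for name, y in edits(x):
        print(name, tuple(y.shape))  # feed y to the detector and log accuracy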

Robustness to Generative Editing

Our method shows remarkable robustness against state-of-the-art generative models:

  • 91-94% detection accuracy with strong editing strength on Stable Audio
  • Consistent 94% accuracy across all editing strengths on AudioLDM2
  • Only method maintaining high performance under strong generative edits
Generative Editing Results

Detection performance under different generative editing strengths using Stable Audio and AudioLDM2.
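
The strength sweep behind these numbers can be organized as below; generative_edit and detect are placeholders for a diffusion-based audio editor (e.g., Stable Audio or AudioLDM2) and the watermark detector, replaced here by stubs rather than real model calls.

    # Sketch of the editing-strength sweep; `generative_edit` and `detect`
    # are placeholders, not real model APIs.
    import numpy as np

    def detection_accuracy(clips, strengths, generative_edit, detect):
        """Fraction of watermarked clips still detected after editing, per strength."""
        return {
            s: sum(detect(generative_edit(x, strength=s)) for x in clips) / len(clips)
            for s in strengths
        }

    clips = [np.random.randn(16000) for _ in range(8)]
    stub_edit = lambda x, strength: x + 0.01 * strength * np.random.randn(*x.shape)
    stub_detect = lambda x: True  # replace with the real detector's decision
    print(detection_accuracy(clips, [0.3, 0.5, 0.7], stub_edit, stub_detect))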

Robustness to Adversarial Attacks

We demonstrate strong resilience against black-box adversarial attacks:

  • 68% detection accuracy under waveform-domain HSJA attacks (vs. 15% for AudioSeal)
  • 36% detection accuracy under spectrogram-domain attacks (vs. 15% for AudioSeal)
  • Maintains higher perceptual quality after attacks (PESQ: 2.80 vs. 1.14)
Adversarial Attack Results

Performance comparison under HSJA-based adversarial attacks in waveform and spectrogram domains.
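
HSJA (HopSkipJump) is a decision-based attack that only observes the detector's hard accept/reject output. The sketch below shows the boundary bisection at its core, with the gradient-estimation step omitted; is_detected is a placeholder for the detector's binary decision.

    # Core of a HopSkipJump-style black-box attack: bisect between a detected
    # and an undetected clip to land on the decision boundary.
    import numpy as np

    def boundary_search(x_wm, x_adv, is_detected, tol=1e-3):
        lo, hi = 0.0, 1.0  # lo: still detected; hi: evades detection
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if is_detected((1 - mid) * x_wm + mid * x_adv):
                lo = mid
            else:
                hi = mid
        return (1 - hi) * x_wm + hi * x_adv  # smallest evading interpolation found

    x_wm = np.random.randn(16000)                    # watermarked clip (stub)
    x_adv = x_wm + np.random.randn(16000)            # large perturbation that evades
    stub_detect = lambda x: np.linalg.norm(x - x_wm) < 10.0
    print(np.linalg.norm(boundary_search(x_wm, x_adv, stub_detect) - x_wm))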

Quality Assessment

Quality Assessment Results

Comparison of perceptual quality metrics across different watermarking methods. Our XAttnMark achieves competitive or superior performance across all metrics while maintaining better robustness.

Our watermarking approach maintains high perceptual quality while ensuring robust protection:

  • PESQ Score: 4.43 (competitive with state-of-the-art)
  • STOI Score: 1.000 (best in class)
  • SI-SNR: 29.00 dB
  • Lowest watermark residual loudness: -54.63 dB
  • MUSHRA subjective listening test score: ~91 (comparable to AudioSeal)
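
Of the metrics above, SI-SNR is simple enough to compute directly; here is a self-contained implementation of the standard scale-invariant definition (PESQ and STOI require dedicated packages such as pesq and pystoi and are not reimplemented here).

    # Self-contained SI-SNR using the standard scale-invariant definition.
    import numpy as np

    def si_snr(est: np.ndarray, ref: np.ndarray) -> float:
        """Scale-invariant signal-to-noise ratio in dB."""
        est, ref = est - est.mean(), ref - ref.mean()
        s_target = (est @ ref) / (ref @ ref) * ref   # projection of est onto ref
        e_noise = est - s_target
        return 10 * np.log10((s_target @ s_target) / (e_noise @ e_noise))

    ref = np.random.randn(16000)
    print(si_snr(ref + 0.03 * np.random.randn(16000), ref))  # roughly 30 dB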

Ablation Studies

Key architectural components contribute significantly to performance:

  • Cross-attention mechanism is crucial for message retrieval
  • Temporal modulation improves message hiding capabilities
  • Psychoacoustic-aligned TF masking loss enhances imperceptibility
Ablation Study Results

Ablation study results showing the impact of each key component, including the cross-attention mechanism and temporal modulation.


Additional Insights from Extended Analysis

Architecture Comparison

Our analysis of different architectural choices reveals distinct learning characteristics:

  • WavMark (fully-shared): Quick initial learning but suffers from destructive interference between detection and decoding tasks
  • AudioSeal (fully-disjoint): Slow convergence in message-bit decoding despite good detection
  • XAttnMark (blended): Achieves optimal balance with steady improvement in both tasks
Architecture Comparison

Training dynamics comparison showing validation accuracy and quality curves for different architectural choices.

Speed Reversion Enhancement

We explore a post-hoc speed reversion layer to further improve robustness against speed changes:

For a speed-modified audio signal with unknown speed factor γ, we search for the optimal reversion factor α that maximizes:

\[ r(\alpha) = p(\alpha) + \bar{s}_{m(\alpha)}, \qquad \bar{s}_{m(\alpha)} = \frac{1}{L} \sum_{i=1}^{L} s_{m_i(\alpha)} \]

where \(p(\alpha)\) is the detection score and \(\bar{s}_{m(\alpha)}\) is the average standard deviation of the \(L\) predicted message bits. The search space is defined as:

\[ \alpha \in \left[\frac{1}{\gamma_{\text{max}}}, \frac{1}{\gamma_{\text{min}}}\right], \text{ where } \gamma_{\text{min}} = 0.8, \gamma_{\text{max}} = 1.25 \]
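
A compact sketch of this grid search is given below; resample, p_score, and msg_conf are stand-ins for a time-stretching op, the detector's score \(p(\alpha)\), and the bit-confidence term \(\bar{s}_{m(\alpha)}\), and the 20-point grid resolution is an assumption.

    # Runnable sketch of the black-box grid search; all three callables are
    # placeholders, replaced here by stubs.
    import numpy as np

    GAMMA_MIN, GAMMA_MAX = 0.8, 1.25

    def speed_revert(audio, resample, p_score, msg_conf, n_grid=20):
        """Grid-search alpha in [1/gamma_max, 1/gamma_min] maximizing r(alpha)."""
        alphas = np.linspace(1 / GAMMA_MAX, 1 / GAMMA_MIN, n_grid)
        best = max(alphas, key=lambda a: p_score(resample(audio, a))
                                         + msg_conf(resample(audio, a)))
        return resample(audio, best), best

    # Stub usage: pretend the clip was sped up by gamma = 1.1, so the right
    # reversion factor is 1/1.1 ~ 0.909.
    audio = np.random.randn(16000)
    stub_resample = lambda x, a: a * x                    # placeholder "stretch"
    stub_p = lambda y: -abs(y.std() - 1 / 1.1)            # peaks near alpha = 1/gamma
    _, alpha = speed_revert(audio, stub_resample, stub_p, lambda y: 0.0)
    print(round(alpha, 3))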
Main Findings:
  • Uses black-box grid search to find optimal speed factor
  • Achieves up to 100% attribution success rate (vs. 0% baseline)
  • Maintains efficient detection time (0.5-0.6s)
  • Zero false positive rate on unwatermarked audio
Speed Reversion Results

Performance comparison with and without speed reversion layer across different user pool sizes.

Dataset Generalization

Our method demonstrates strong generalization across diverse audio domains:

  • In-distribution: 96-99% detection and 87-93% attribution across music, speech, and general audio
  • Out-of-distribution: 93-99% detection and 86-94% attribution on unseen datasets
Dataset Generalization Results

Performance comparison across different in-distribution and out-of-distribution datasets, showing strong generalization capabilities.

Performance Analysis Across Duration and Strength

We analyze the model's performance across different audio durations and watermark strengths:

  • Duration Impact:
    • Detection accuracy remains robust (98.6-99.3%) across all durations from 1-10s
    • Attribution accuracy improves from 81.2% at 1s to peak of 93.0% at 5s
    • Stable performance (91.5-92.6%) for longer durations
  • Watermark Strength Impact:
    • Weak watermarks (α ≤ 0.3): Limited effectiveness
    • Moderate watermarks (0.4 ≤ α ≤ 0.7): Balanced performance
    • Strong watermarks (α ≥ 0.8): Asymptotic performance with >98.3% detection and >92.3% attribution
    • Default α=1.0 achieves 98.5% detection and 92.7% attribution
Performance Analysis Results

Performance analysis across different audio durations and watermark strengths (α), showing optimal configurations for both parameters.
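
For illustration, the strength knob can be modeled as a scalar α scaling the generator's residual before it is added back to the carrier, as is common practice; whether XAttnMark applies α exactly this way is an assumption.

    # Illustrative only: alpha scales the watermark residual before embedding;
    # the residual here is random noise standing in for the generator output.
    import numpy as np

    def embed(audio: np.ndarray, residual: np.ndarray, alpha: float = 1.0) -> np.ndarray:
        """Add an alpha-scaled watermark residual to the carrier audio."""
        return audio + alpha * residual

    audio = np.random.randn(16000)
    residual = 0.002 * np.random.randn(16000)   # stand-in for the generator output
    for alpha in (0.3, 0.5, 1.0):
        wm = embed(audio, residual, alpha)
        loudness_db = 20 * np.log10(np.sqrt(np.mean((wm - audio) ** 2)))
        print(f"alpha={alpha}: residual RMS {loudness_db:.1f} dB")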