Limitations of previous audio watermarking methods: While AudioSeal achieves good robustness against signal processing attacks, it struggles with generative editing attacks and cannot provide accurate attribution. Our XAttnMark addresses both challenges through cross-attention and temporal conditioning mechanisms.
We introduce XAttnMark, a state-of-the-art approach to robust audio watermarking that achieves both reliable detection and accurate attribution through cross-attention and temporal conditioning mechanisms. Our method is robust to a wide range of audio transformations, including challenging generative edits, while maintaining high perceptual quality.
The rapid proliferation of generative audio synthesis and editing technologies has raised significant concerns about copyright infringement, data provenance, and the spread of misinformation through deepfake audio. Watermarking offers a proactive solution by embedding imperceptible, identifiable, and traceable marks into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to achieve both robust detection and accurate attribution simultaneously. This paper introduces Cross-Attention Robust Audio Watermark (XAttnMark), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution.
System Overview of XAttnMark. XAttnMark consists of a watermark generator and a watermark detector, with a shared embedding table that facilitates message decoding through a cross-attention module. In the generator, an encoder network first maps the audio into a latent representation, and temporal modulation is applied to hide the message; the modulated latent is then fed into a decoder to produce the watermark residual. In the detector, a linear detection head determines the presence of a watermark, and a cross-attention module over the shared embedding table decodes the message.
Here are examples demonstrating our watermarking system. Each pair consists of an original audio clip and its watermarked counterpart.
These examples demonstrate the imperceptibility of our watermark and its detectability under generative editing. The editing pipeline is shown below:
Overview of our generative editing pipeline for testing watermark robustness.
Our method achieves state-of-the-art performance in both detection (99.19% average accuracy) and attribution (93% average accuracy) across various audio transformations. We maintain high detection accuracy even under challenging conditions:
Detection and attribution performance comparison across different watermarking methods. Our approach consistently outperforms baselines in both tasks.
Attribution accuracy with different numbers of users, demonstrating the scalability of our approach.
XAttnMark demonstrates exceptional robustness across a wide range of audio transformations:
Comprehensive evaluation of robustness against various audio transformations, showing superior performance across different types and strengths of edits.
Our method shows remarkable robustness against editing by state-of-the-art generative models:
Detection performance under different generative editing strengths using Stable Audio and AudioLDM2.
We demonstrate strong resilience against black-box adversarial attacks:
Performance comparison under HSJA-based adversarial attacks in waveform and spectrogram domains.
Our watermarking approach maintains high perceptual quality while ensuring robust protection:
Comparison of perceptual quality metrics across different watermarking methods. Our XAttnMark achieves competitive or superior performance across all metrics while maintaining better robustness.
Key architectural components contribute significantly to performance:
Ablation study results showing the impact of each key component: the cross-attention mechanism and temporal modulation.
Our analysis of different architectural choices reveals distinct learning characteristics:
Training dynamics comparison showing validation accuracy and quality curves for different architectural choices.
We explore a post-hoc speed reversion layer to further improve robustness against speed changes:
For a speed-modified audio signal with an unknown speed factor \(\gamma\), we search for the reversion factor \(\alpha\) that maximizes an objective combining \(p(\alpha)\), the detection score, with \(\bar{s}_{m(\alpha)}\), the average standard deviation of the predicted message bits, over a predefined search space of candidate reversion factors.
Performance comparison with and without speed reversion layer across different user pool sizes.
Our method demonstrates strong generalization across diverse audio domains:
Performance comparison across different in-distribution and out-of-distribution datasets, showing strong generalization capabilities.
We analyze the model's performance across different audio durations and watermark strengths:
Performance analysis across different audio durations and watermark strengths (α), showing optimal configurations for both parameters.