Recent works have established that AI models introduce spectral artifacts into generated images and propose approaches for learning to capture them using labeled data. However, the significant differences in such artifacts among different generative models hinder these approaches from generalizing to generators not seen during training. In this work, we build upon the key idea that the spectral distribution of real images constitutes both an invariant and highly discriminative pattern for AI-generated image detection. To model this under a self-supervised setup, we employ masked spectral learning using the pretext task of frequency reconstruction. Since generated images constitute out-of-distribution samples for this model, we propose spectral reconstruction similarity to capture this divergence. Moreover, we introduce spectral context attention, which enables our approach to efficiently capture subtle spectral inconsistencies in images of any resolution. Our spectral AI-generated image detection approach (SPAI) achieves a 5.5% absolute improvement in AUC over the previous state-of-the-art across 13 recent generative approaches, while exhibiting robustness against common online perturbations.
Our work introduces several key innovations in AI-generated image detection:
To build a spectral model of real images, we propose using the pretext task of frequency reconstruction under a self-supervised learning setup, using only real images. We randomly mask the low- or high-frequency component of the input images and train the model under the objective of reconstructing the missing frequencies. This approach allows us to learn the spectral distribution of real images without requiring labeled AI-generated samples.
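The masking step described above can be illustrated with a minimal NumPy sketch. It assumes a 2-D grayscale image and a circular cut-off in the centred Fourier spectrum; the function names, the `radius` parameter, and the 50/50 choice between bands are illustrative assumptions, not details confirmed by the source.

```python
import numpy as np

def radial_frequency_mask(height, width, radius):
    """Boolean mask that is True inside a centred disc of the shifted spectrum."""
    ys, xs = np.ogrid[:height, :width]
    cy, cx = height // 2, width // 2
    return (ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2

def split_frequencies(image, radius):
    """Split a 2-D image into its low- and high-frequency components."""
    # Move the zero-frequency term to the centre so the disc covers low frequencies.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    low_mask = radial_frequency_mask(image.shape[0], image.shape[1], radius)
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high

def random_frequency_masking(image, radius, rng):
    """Randomly drop one band; return (masked input, reconstruction target)."""
    low, high = split_frequencies(image, radius)
    if rng.random() < 0.5:
        return high, image  # low frequencies were removed
    return low, image       # high frequencies were removed

rng = np.random.default_rng(0)
image = rng.random((32, 48))
masked, target = random_frequency_masking(image, radius=6, rng=rng)
```

Because the two masks are complementary, the low and high components sum back to the original image, which is what makes the missing band a well-defined reconstruction target.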
As our model constitutes a spectral model of real images, it is expected to reconstruct the missing frequencies of real images better than those of AI-generated ones. We introduce Spectral Reconstruction Similarity (SRS) to measure this divergence. SRS calculates the similarity between the features of the original image and those of its low- and high-frequency components, with lower similarity (i.e., larger distance) expected for AI-generated images than for real ones.
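As a rough illustration of the idea, the sketch below compares feature vectors with cosine similarity. The choice of cosine similarity, the function names, and the use of plain vectors as stand-ins for model features are assumptions made for this example; the source does not specify the exact similarity measure.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def spectral_reconstruction_similarity(feat_original, feat_low, feat_high):
    """Similarity of the original features to each frequency-band feature vector.

    All three inputs are hypothetical feature vectors produced by some encoder.
    """
    return (
        cosine_similarity(feat_original, feat_low),
        cosine_similarity(feat_original, feat_high),
    )

# Toy features: the low-band features match the original, the high-band ones do not.
original = np.array([1.0, 2.0, 3.0])
sim_low, sim_high = spectral_reconstruction_similarity(original, original, -original)
```

Under this sketch, a real image would be expected to yield values closer to 1 for both bands than an AI-generated image does.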
Capturing subtle clues in images is crucial for effectively distinguishing between real and AI-generated content. However, most computer vision models cannot efficiently scale to the native resolution of modern photos. Our Spectral Context Attention (SCA) enables processing of high-resolution images without resizing, combining the most discriminative spectral reconstruction similarity values from different patches according to their respective context. This allows our approach to efficiently capture subtle spectral inconsistencies in images of any resolution.
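The following sketch shows one way patch-wise processing with attention-based aggregation could look. It is not the SCA module itself: the tiling scheme, the single learned `query` vector, and the scaled dot-product weighting are assumptions chosen to keep the example small.

```python
import numpy as np

def extract_patches(image, size, stride):
    """Cut a 2-D image into size x size patches without resizing it.

    The last row and column of patches are aligned to the image border,
    so every pixel is covered even when the stride does not divide evenly.
    """
    h, w = image.shape
    if h < size or w < size:
        raise ValueError("image is smaller than the patch size")
    ys = list(range(0, h - size + 1, stride))
    xs = list(range(0, w - size + 1, stride))
    if ys[-1] != h - size:
        ys.append(h - size)
    if xs[-1] != w - size:
        xs.append(w - size)
    return np.stack([image[y:y + size, x:x + size] for y in ys for x in xs])

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(patch_features, query):
    """Combine per-patch feature vectors using scaled dot-product attention weights."""
    d = patch_features.shape[1]
    weights = softmax(patch_features @ query / np.sqrt(d))
    return weights @ patch_features, weights

image = np.arange(100 * 130, dtype=float).reshape(100, 130)
patches = extract_patches(image, size=32, stride=32)
# Stand-in per-patch features; a real system would use learned similarity values.
features = patches.reshape(len(patches), -1)[:, :8] / image.max()
pooled, weights = attention_pool(features, query=np.ones(8))
```

The number of patches grows with the image size while the pooled vector keeps a fixed length, which is the property that lets a fixed-size classifier operate on images of any resolution.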
We evaluated SPAI against 12 state-of-the-art methods across 13 different generative models, including early approaches like Stable Diffusion 1.3/1.4 and recent high-fidelity generators like Stable Diffusion 3, DALLE-3, and Midjourney v6.1. While competing methods often excel on some generators but fail catastrophically on others, SPAI consistently performs well across all tested generative models.
AUC (%) per generator. Columns are ordered by image resolution group: < 0.5 MPixels, 0.5 - 1.0 MPixels, and > 1.0 MPixels.

Approach | Glide | SD1.3 | SD1.4 | Flux | DALLE2 | SD2 | SDXL | SD3 | GigaGAN | MJv5 | MJv6.1 | DALLE3 | Firefly | AVG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NPR | 72.2 | 89.6 | 60.5 | 19.8 | 3.9 | 12.5 | 18.1 | 60.6 | 83.2 | 15.3 | 19.8 | 97.1 | 38.0 | 45.4 |
Dire | 33.3 | 59.9 | 61.3 | 45.7 | 52.2 | 68.5 | 46.9 | 49.2 | 36.3 | 41.9 | 50.3 | 65.2 | 49.9 | 50.8 |
CNNDet. | 59.2 | 59.0 | 61.2 | 39.8 | 71.5 | 57.5 | 67.4 | 30.2 | 73.4 | 48.8 | 56.7 | 23.5 | 73.4 | 55.5 |
FreqDet. | 43.6 | 92.3 | 92.7 | 36.5 | 47.4 | 42.5 | 66.5 | 69.8 | 63.2 | 36.9 | 27.5 | 42.2 | 80.9 | 57.1 |
Fusing | 63.0 | 62.8 | 62.2 | 57.5 | 76.7 | 66.9 | 62.1 | 38.8 | 80.4 | 64.0 | 74.0 | 25.2 | 76.3 | 62.3 |
LGrad | 76.5 | 82.4 | 83.4 | 74.9 | 85.7 | 60.7 | 70.2 | 12.7 | 89.9 | 69.2 | 79.6 | 30.0 | 42.0 | 65.9 |
UnivFD | 63.3 | 80.8 | 81.2 | 36.3 | 91.4 | 84.3 | 78.3 | 28.6 | 86.2 | 57.1 | 60.5 | 31.0 | 95.5 | 67.3 |
GramNet | 78.2 | 83.9 | 84.3 | 78.6 | 85.2 | 66.7 | 77.8 | 19.2 | 85.0 | 63.8 | 84.9 | 42.9 | 38.0 | 68.4 |
DeFake | 86.1 | 64.2 | 63.6 | 90.5 | 41.4 | 66.2 | 52.3 | 87.7 | 71.7 | 67.0 | 87.5 | 93.3 | 39.4 | 70.1 |
PatchCr. | 78.4 | 95.7 | 96.2 | 86.9 | 81.8 | 95.7 | 96.7 | 33.8 | 98.0 | 79.0 | 96.1 | 28.1 | 79.1 | 80.4 |
DMID | 73.1 | 100.0 | 100.0 | 97.2 | 54.3 | 99.7 | 99.6 | 67.9 | 67.9 | 99.9 | 94.4 | 41.3 | 90.2 | 83.5 |
RINE | 95.6 | 99.9 | 99.9 | 93.0 | 93.0 | 96.6 | 99.3 | 39.1 | 92.9 | 96.4 | 81.2 | 41.8 | 82.9 | 85.5 |
SPAI (Ours) | 90.2 | 99.6 | 99.6 | 83.0 | 91.1 | 96.5 | 97.4 | 75.9 | 85.4 | 94.5 | 84.0 | 90.2 | 96.0 | 91.0 |
As shown in the table above, SPAI achieves a 5.5% absolute improvement in average AUC over the previous state-of-the-art (91.0 vs. 85.5 for RINE). Individual methods do score higher on specific generators, but each of them drops below 42 AUC on at least one generator, whereas SPAI's lowest score is 75.9 (on SD3), demonstrating its superior generalization capability.
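The averages can be checked directly from the table. The short script below recomputes the AVG column for SPAI and RINE from their per-generator AUC values as listed in the table; the variable names are arbitrary.

```python
# Per-generator AUC values copied from the SPAI and RINE rows of the table.
spai = [90.2, 99.6, 99.6, 83.0, 91.1, 96.5, 97.4, 75.9, 85.4, 94.5, 84.0, 90.2, 96.0]
rine = [95.6, 99.9, 99.9, 93.0, 93.0, 96.6, 99.3, 39.1, 92.9, 96.4, 81.2, 41.8, 82.9]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

spai_avg = round(mean(spai), 1)       # average over the 13 generators
rine_avg = round(mean(rine), 1)
gap = round(mean(spai) - mean(rine), 1)

print(f"SPAI avg: {spai_avg}, RINE avg: {rine_avg}, gap: {gap}")
print(f"Worst case: SPAI {min(spai)}, RINE {min(rine)}")
```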
Our Spectral Context Attention mechanism correctly identifies problematic regions in AI-generated images, such as anatomical anomalies (e.g., six-fingered hands) and texture inconsistencies.
We evaluated the robustness of SPAI against common online perturbations that images typically undergo when shared on the internet. The results demonstrate that our approach maintains superior performance even when images are subjected to various types of degradation.
The robustness evaluation shows that SPAI consistently outperforms competing methods across various types of perturbations, including JPEG and WebP compression, Gaussian blur, Gaussian noise, and image resizing. This demonstrates the practical applicability of our approach in real-world scenarios where images often undergo multiple transformations.
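For readers who want to reproduce this kind of test, the sketch below implements three of the listed perturbations (Gaussian noise, Gaussian blur, and resizing) for 2-D grayscale images with values in [0, 1]. The parameter values and the nearest-neighbour downscaling are assumptions for illustration, and JPEG and WebP compression are left out because they require an image codec library.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip back to the [0, 1] range."""
    return np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)

def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian kernel truncated at about three standard deviations."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with edge padding; output has the input's shape."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    padded = np.pad(image, radius, mode="edge")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 0, rows)

def downscale_nearest(image, factor):
    """Nearest-neighbour downscaling by an integer factor."""
    return image[::factor, ::factor]

rng = np.random.default_rng(0)
image = rng.random((32, 48))
noisy = add_gaussian_noise(image, sigma=0.05, rng=rng)
blurred = gaussian_blur(image, sigma=1.5)
small = downscale_nearest(image, factor=2)
```

A detector's scores on `noisy`, `blurred`, and `small` can then be compared with its scores on the unmodified image to gauge how much each perturbation degrades detection.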
While SPAI exhibits superior generalization performance and robustness, it still faces challenges with derivative AI-generated images (screenshots, memes, photographs of screens, printed material). Compression algorithms and noisy digital/analog channels can corrupt the spectral information needed for detection. These limitations affect any detector relying solely on image signal properties and highlight potential future directions combining spectral learning with semantic context understanding.
@article{karageorgiou2025any,
title={Any-Resolution AI-Generated Image Detection by Spectral Learning},
author={Karageorgiou, Dimitrios and Papadopoulos, Symeon and Kompatsiaris, Ioannis and Gavves, Efstratios},
journal={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}