Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?

Despina Konstantinidou, Dimitrios Karageorgiou, Christos Koutlis, Olga Papadopoulou, Emmanouil Schinas, Symeon Papadopoulos

Information Technologies Institute @ CERTH
The 5th ACM International Workshop on Multimedia AI against Disinformation (MAD '26), 2026

Abstract

As generative Artificial Intelligence (AI) advances, the realism of AI-generated imagery has reached a threshold capable of deceiving even vigilant human observers. Yet, while current AI-generated Image Detection (AID) approaches perform exceptionally well on controlled benchmark datasets, they struggle significantly with real-world cases. To study this behavior, we introduce the ITW-SM dataset, a curated collection of real and AI-generated images originating from major social media platforms. We use it to analyze the effects of key design choices typically considered when building a detector: its architecture, pre-trained latent spaces, training data, and pre-processing approaches. We show that naively scaling the pre-training stage or opting for more training data does not always lead to better detection performance. Instead, our work reveals that it is crucial to optimize each design choice so that the processing pipeline can propagate and effectively analyze both low-level traces and high-level image semantics. Building on our findings, we achieve a substantial average improvement of 26.87% in AUC across multiple state-of-the-art detection approaches under real-world conditions, providing a roadmap for developing more resilient detectors.

Paper Highlights

A systematic experimental evaluation that reveals strengths and weaknesses of various AID approaches under real-world conditions.
The introduction of ITW-SM, a new in-the-wild AID benchmark dataset collected from four major social media platforms.
An impact analysis of training data, pre-trained latent spaces, model architectures, pre-processing stages, and data augmentations on AID performance.
An average improvement of 26.87% in AUC across multiple detection approaches under real-world conditions.
A set of recommendations for designing more robust AID models capable of handling in-the-wild variations.

The ITW-SM Dataset

To facilitate a realistic evaluation setup and address gaps in existing resources, we introduce ITW-SM, a new in-the-wild test dataset of 10,000 real and AI-generated images collected from four major social media platforms (Facebook, Instagram, LinkedIn, and X).

Sample real and AI-generated images from ITW-SM.

Unlike previous datasets, ITW-SM captures the resolution variability and post-processing artifacts natively present on social media platforms:

Characteristic	ITW-SM	Chameleon	TWIGMA
Source of AI images	Social Media Users	AI-Painting Communities	Twitter Users
Real Image Source	Verified Social Media Accounts	Unsplash (Photographers)	-
Number of platforms	4	0	1
Resolution Range	0.1 - 45 Megapixels	0.1 - 31 Megapixels	< 0.01 - 47 Megapixels
Size	10,000	26,033	800,000
Intended Focus	General in-the-wild robustness	Generalization to realistic AI	Analysis of AI art trends

Topic and Resolution Analysis

To better understand the composition of ITW-SM, we compare its topic and resolution distributions against existing web-sampled datasets, such as Chameleon. As shown below, topic distributions differ significantly between social media platforms and online painting communities. Additionally, ITW-SM exhibits a substantially broader resolution range, particularly for real images.

Topic distribution in web-collected datasets.

Resolution distribution in web-collected datasets.

Methodology

We define an experimental framework for systematically studying the effect of four key components typically considered when building an AID model:

Framework for studying the factors impacting expected performance and generalization in AID models.

Training Data Composition: Evaluating how dataset diversity (e.g., combining benchmark data with in-the-wild TWIGMA data) affects generalization.

Dataset	LDM Real	LDM Gen.	TWIGMA Real	TWIGMA Gen.
LDM	200K	200K	-	-
TWIGMA	-	-	600K	600K
LDM+TWIGMA (400K)	200K	200K	200K	200K
LDM+TWIGMA (1.2M)	200K	200K	600K	600K

Table: Training data configurations used in our experiments.
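
Reading the table, the 400K and 1.2M labels seem to denote the size of the TWIGMA portion of each combined set rather than the total number of training images. The short Python sketch below makes that arithmetic explicit. The counts are copied from the table; the dictionary layout and the `summarize` helper are illustrative names, and the interpretation of the labels is an inference from the table, not a statement by the authors.

```python
# Sample counts (in thousands) copied from the training-configuration table.
# Tuple order: (LDM real, LDM generated, TWIGMA real, TWIGMA generated).
CONFIGS = {
    "LDM": (200, 200, 0, 0),
    "TWIGMA": (0, 0, 600, 600),
    "LDM+TWIGMA (400K)": (200, 200, 200, 200),
    "LDM+TWIGMA (1.2M)": (200, 200, 600, 600),
}


def summarize(counts):
    """Return total size, TWIGMA size, and real-image fraction for one configuration."""
    ldm_real, ldm_gen, tw_real, tw_gen = counts
    total = ldm_real + ldm_gen + tw_real + tw_gen
    return {
        "total_k": total,
        "twigma_k": tw_real + tw_gen,
        "real_fraction": (ldm_real + tw_real) / total,
    }


if __name__ == "__main__":
    for name, counts in CONFIGS.items():
        print(name, summarize(counts))
```

Under this reading, every configuration is balanced between real and generated images, and the two combined sets contain 800K and 1.6M images in total.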

Backbone Architectures: Assessing different vision encoders (e.g., CLIP, BLIP2, DINO-V2) for their ability to capture low-level artifacts vs. high-level semantics.
Cropping Strategies: Comparing resizing, center cropping, 10-cropping, and texture-based cropping to optimally preserve generation traces.
Data Augmentations: Applying geometric and noise transformations to simulate real-world distortions.
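
To illustrate why the choice of crop can matter, the following Python sketch contrasts center cropping with a simple texture-driven crop on a toy grayscale image. It is a minimal sketch of the general idea only: the use of pixel variance as the texture score, the sliding-window search, and all function names are assumptions made for illustration, not the texture-based cropping procedure used in the experiments.

```python
def center_crop(image, size):
    """Return the central size x size window of a 2-D grayscale image (list of rows)."""
    height, width = len(image), len(image[0])
    top, left = (height - size) // 2, (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def variance(patch):
    """Population variance of all pixel values; an assumed, crude texture score."""
    pixels = [p for row in patch for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def texture_crop(image, size, stride):
    """Slide a size x size window and keep the window with the highest texture score."""
    height, width = len(image), len(image[0])
    best_patch, best_score = None, -1.0
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patch = [row[left:left + size] for row in image[top:top + size]]
            score = variance(patch)
            if score > best_score:
                best_patch, best_score = patch, score
    return best_patch


def demo_image():
    """8 x 8 toy image: flat gray, with a checkerboard in the bottom-right 4 x 4 corner."""
    image = [[0.5] * 8 for _ in range(8)]
    for y in range(4, 8):
        for x in range(4, 8):
            image[y][x] = float((x + y) % 2)
    return image


if __name__ == "__main__":
    image = demo_image()
    print("center crop variance: ", variance(center_crop(image, 4)))
    print("texture crop variance:", variance(texture_crop(image, 4, 2)))
```

On the toy image, the center crop covers mostly flat pixels, while the texture-driven crop lands on the checkerboard corner, the only region with high-frequency content.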

Core Results

Our findings confirm that while most methods achieve strong results on curated benchmark datasets, their performance degrades significantly when applied to in-the-wild AI-generated images.

Impact of Design Choices

By systematically updating AID models with optimal design choices, we achieved a substantial average improvement of 26.87% in AUC across various state-of-the-art detection approaches in real-world conditions.

Detailed Ablation Results

The tables below give the detailed results of each ablation study. All values are reported as AUC / AP.
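
For readers less familiar with the two metrics, the Python sketch below computes the area under the ROC curve (AUC) and average precision (AP) from scratch. It assumes label 1 marks an AI-generated image and that a higher score means "more likely generated"; the toy labels and scores are made up for illustration and are unrelated to the reported results. The AP helper assumes no tied scores, and the quadratic AUC loop is meant for clarity, not speed.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


if __name__ == "__main__":
    # Hypothetical detector outputs: 1 = AI-generated, 0 = real.
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    print("AUC:", roc_auc(labels, scores))
    print("AP: ", average_precision(labels, scores))
```

Both metrics are threshold-free, which is presumably why they suit a comparison across benchmarks where a single decision threshold would not transfer well.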

Model	Pre-train Data	Synthbuster	Chameleon	ITW-SM	Average
CLIP L/14	400M	96.98 / 97.33	82.25 / 81.34	96.53 / 96.98	91.92 / 91.88
OpenCLIP L/14	2B	74.82 / 81.11	85.86 / 83.03	90.01 / 91.54	83.56 / 85.23
CLIP H/14	2B	97.02 / 81.71	81.22 / 76.98	90.56 / 91.81	89.60 / 83.50
BLIP2	129M	99.37 / 99.48	86.58 / 86.28	96.49 / 96.97	94.15 / 94.24
DINO-V2-L/14	142M	99.14 / 99.18	87.33 / 85.51	98.23 / 98.50	94.90 / 94.40

Performance (AUC/AP) of the RINE method with different backbones. Other components fixed to LDM+TWIGMA (1.2M), Texture cropping, With augmentations.
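
The Average column appears to be the unweighted mean of the three benchmark scores. The following Python check reproduces it for the AUC values of the backbone table; the numbers are copied from the table, while the helper names and the rounding tolerance are illustrative choices.

```python
# (Synthbuster, Chameleon, ITW-SM, reported Average) AUC values from the backbone table.
ROWS = {
    "CLIP L/14": (96.98, 82.25, 96.53, 91.92),
    "OpenCLIP L/14": (74.82, 85.86, 90.01, 83.56),
    "CLIP H/14": (97.02, 81.22, 90.56, 89.60),
    "BLIP2": (99.37, 86.58, 96.49, 94.15),
    "DINO-V2-L/14": (99.14, 87.33, 98.23, 94.90),
}


def unweighted_mean(values):
    """Plain arithmetic mean, giving each benchmark the same weight."""
    return sum(values) / len(values)


def matches_reported(row, tolerance=0.006):
    """True if the mean of the benchmark scores agrees with the reported average
    up to two-decimal rounding."""
    *benchmarks, reported = row
    return abs(unweighted_mean(benchmarks) - reported) <= tolerance


if __name__ == "__main__":
    for name, row in ROWS.items():
        print(name, round(unweighted_mean(row[:3]), 2), matches_reported(row))
```

Because each benchmark carries equal weight, a large gap on a single dataset, such as the Synthbuster score of OpenCLIP L/14, moves the average noticeably.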

Method / Dataset	Synthbuster	Chameleon	ITW-SM	Average
NPR
LDM	55.95 / 58.47	53.04 / 44.57	54.85 / 55.92	54.61 / 52.99
TWIGMA	69.79 / 71.67	60.54 / 53.96	80.00 / 81.63	70.11 / 69.09
LDM+TWIGMA (400K)	63.51 / 66.56	63.29 / 55.77	43.59 / 52.66	56.80 / 58.33
LDM+TWIGMA (1.2M)	71.31 / 74.86	56.29 / 49.89	77.54 / 77.73	68.38 / 67.49
Gemma 3 IT 27B
zero-shot	70.61 / 70.54	85.01 / 79.56	84.04 / 75.50	79.89 / 75.20
DMID
LDM	81.26 / 82.79	54.69 / 55.03	79.96 / 81.87	71.97 / 73.23
TWIGMA	90.27 / 90.15	82.97 / 78.51	91.55 / 92.03	88.26 / 86.90
LDM+TWIGMA (400K)	89.23 / 89.20	77.81 / 72.85	88.04 / 85.95	85.03 / 82.67
LDM+TWIGMA (1.2M)	92.40 / 91.65	83.71 / 79.33	92.26 / 92.58	89.46 / 87.85
RINE
LDM	95.04 / 95.39	63.94 / 52.79	83.52 / 83.07	80.83 / 77.08
TWIGMA	97.77 / 97.96	89.03 / 86.69	97.66 / 98.07	94.82 / 94.24
LDM+TWIGMA (400K)	99.61 / 99.66	86.51 / 85.12	97.90 / 98.20	94.67 / 94.33
LDM+TWIGMA (1.2M)	99.14 / 99.18	87.33 / 85.51	98.23 / 98.50	94.90 / 94.40
SPAI
LDM	93.87 / 94.17	72.28 / 61.28	78.93 / 80.17	81.69 / 78.54
LDM+TWIGMA (400K)	98.81 / 99.02	89.38 / 87.26	97.59 / 97.80	95.26 / 94.69
LDM+TWIGMA (1.2M)	97.45 / 98.00	90.21 / 88.51	98.10 / 98.35	95.25 / 94.95

Performance (AUC/AP) of detection methods trained on different data configurations. Other components fixed to DINO-V2-L/14, Texture cropping, With augmentations.

Method / Cropping	Synthbuster	Chameleon	ITW-SM	Average
DMID
Center cropping	89.41 / 89.58	63.20 / 58.98	83.74 / 80.06	78.78 / 76.21
Texture cropping	92.40 / 91.65	83.71 / 79.33	92.26 / 92.58	89.46 / 87.85
RINE
Center cropping	94.65 / 94.65	84.08 / 82.70	95.04 / 95.75	91.26 / 91.03
Texture cropping	99.14 / 99.18	87.33 / 85.51	98.23 / 98.50	94.90 / 94.40
NPR
Center cropping	69.79 / 71.67	60.54 / 53.96	80.00 / 81.63	70.11 / 69.09
Texture cropping	64.08 / 65.78	62.34 / 54.24	76.03 / 76.98	67.48 / 65.67

Performance (AUC/AP) of detection methods with different cropping methods. Other components fixed to DINO-V2-L/14, LDM+TWIGMA (1.2M), With augmentations.

Method / Augmentation	Synthbuster	Chameleon	ITW-SM	Average
DMID
Without augmentations	76.53 / 66.47	76.21 / 69.36	82.20 / 81.42	78.31 / 72.42
With augmentations	92.40 / 91.65	83.71 / 79.33	92.26 / 92.58	89.46 / 87.85
RINE
Without augmentations	93.63 / 95.02	92.16 / 90.24	93.70 / 94.03	93.16 / 93.10
With augmentations	99.14 / 99.18	87.33 / 85.51	98.23 / 98.50	94.90 / 94.40
NPR
Without augmentations	72.35 / 69.61	60.80 / 48.49	68.92 / 64.09	67.36 / 60.73
With augmentations	64.08 / 65.78	62.34 / 54.24	76.03 / 76.98	67.48 / 65.67
SPAI
Without augmentations	94.80 / 94.83	82.73 / 81.02	91.46 / 91.75	89.66 / 89.20
With augmentations	97.45 / 98.00	90.21 / 88.51	98.10 / 98.35	95.25 / 94.95

Performance (AUC/AP) of detection methods with and without data augmentations. Trained on LDM+TWIGMA (1.2M). Other components fixed to DINO-V2-L/14, Texture cropping.
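
The per-benchmark effect of augmentations is easier to see as a difference. The Python sketch below subtracts the "without" AUC from the "with" AUC for each method and lists the cases where augmentations lowered the score. The values are copied from the table above; the data layout and function names are illustrative.

```python
# (AUC without augmentations, AUC with augmentations), copied from the table above.
AUC = {
    "DMID": {"Synthbuster": (76.53, 92.40), "Chameleon": (76.21, 83.71), "ITW-SM": (82.20, 92.26)},
    "RINE": {"Synthbuster": (93.63, 99.14), "Chameleon": (92.16, 87.33), "ITW-SM": (93.70, 98.23)},
    "NPR": {"Synthbuster": (72.35, 64.08), "Chameleon": (60.80, 62.34), "ITW-SM": (68.92, 76.03)},
    "SPAI": {"Synthbuster": (94.80, 97.45), "Chameleon": (82.73, 90.21), "ITW-SM": (91.46, 98.10)},
}


def auc_gains(method):
    """AUC change (with minus without augmentations) per benchmark, in points."""
    return {
        bench: round(with_aug - without, 2)
        for bench, (without, with_aug) in AUC[method].items()
    }


def regressions():
    """(method, benchmark) pairs where augmentations reduced the AUC."""
    return [(m, b) for m in AUC for b, gain in auc_gains(m).items() if gain < 0]


if __name__ == "__main__":
    for method in AUC:
        print(method, auc_gains(method))
    print("regressions:", regressions())
```

On ITW-SM every method gains from augmentations, while the two regressions (RINE on Chameleon and NPR on Synthbuster) occur on the other benchmarks.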

Key Findings

Backbones: DINO-V2 significantly outperforms CLIP-based encoders for AID tasks due to its self-supervised training focused on visual understanding rather than image-text semantic alignment.

Training Data: Retraining on in-the-wild collected data consistently benefits performance, though end-to-end models benefit more from scale than models reliant on pre-trained spaces.

Pre-processing: TextureCrop preserves critical high-frequency synthetic artifacts in high-resolution images much better than standard center cropping or resizing.

Augmentations: Incorporating augmentations simulating compression and noise is vital to bridge the gap between training and real-world data.
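
As a rough illustration of the augmentation finding, the Python sketch below chains a geometric transform, additive noise, and a coarse quantization step on a toy grayscale image with values in [0, 1]. Everything here is an assumption made for illustration: the function names, the parameter values, and in particular the quantization step, which is only a crude stand-in for lossy compression and not JPEG. It is not the augmentation pipeline used in the experiments.

```python
import random


def horizontal_flip(image):
    """Mirror each row of a 2-D grayscale image (list of rows)."""
    return [list(reversed(row)) for row in image]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clamp the result to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in image]


def quantize(image, levels):
    """Round pixels to `levels` gray levels; a crude stand-in for lossy compression."""
    top = levels - 1
    return [[round(p * top) / top for p in row] for row in image]


def augment(image, rng, flip_prob=0.5, sigma=0.02, levels=32):
    """Apply a random flip, then noise, then quantization (hypothetical parameters)."""
    out = image
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    out = add_gaussian_noise(out, sigma, rng)
    return quantize(out, levels)


if __name__ == "__main__":
    image = [[(x + y) / 6 for x in range(4)] for y in range(4)]
    for row in augment(image, random.Random(0)):
        print([round(p, 3) for p in row])
```

Passing a seeded `random.Random` instance keeps the sketch reproducible; in training, each sample would typically draw a fresh random configuration.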

Acknowledgements

We thank Zacharias Chrysidis for his invaluable assistance on late-stage experimentation with VL models. This work is funded by the Horizon Europe projects vera.ai (GA No. 101070093), AI-CODE (GA No. 101135437), and ELIAS (GA No. 101120237). Computational resources were provided by the National Infrastructures for Research and Technology GRNET and funded by the EU Recovery and Resiliency Facility.

BibTeX

@inproceedings{konstantinidou2026navigating,
  title={Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?},
  author={Konstantinidou, Despina and Karageorgiou, Dimitrios and Koutlis, Christos and Papadopoulou, Olga and Schinas, Emmanouil and Papadopoulos, Symeon},
  booktitle={The 5th ACM International Workshop on Multimedia AI against Disinformation (MAD '26)},
  year={2026},
  doi={10.1145/3810988.3812665},
  url={https://arxiv.org/abs/2507.10236}
}