Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?

Information Technologies Institute @ CERTH
The 5th ACM International Workshop on Multimedia AI against Disinformation (MAD '26), 2026

Abstract

As generative Artificial Intelligence (AI) advances, the realism of AI-generated imagery has reached a threshold capable of deceiving even vigilant human observers. Yet, while current AI-generated Image Detection (AID) approaches perform exceptionally well on controlled benchmark datasets, they struggle significantly with real-world cases. To study this behavior, we introduce the ITW-SM dataset, a curated collection of real and AI-generated images originating from major social media platforms. We employ it to analyze the effects of key design choices typically considered when building a detector, involving its architecture, pre-trained latent spaces, training data, and pre-processing approaches. We show that naively scaling the pre-training stage or opting for more training data does not always lead to better detection performance. Instead, our work reveals that it is crucial to optimize each design choice so that the processing pipeline propagates and effectively analyzes both low-level traces and high-level image semantics. Building on our findings, we achieve a substantial average improvement of 26.87% in AUC across multiple state-of-the-art detection approaches under real-world conditions, providing a roadmap for developing more resilient detectors.

Paper Highlights

  • A systematic experimental evaluation that reveals strengths and weaknesses of various AID approaches under real-world conditions.
  • The introduction of ITW-SM, a new in-the-wild AID benchmark dataset collected from four major social media platforms.
  • An impact analysis of training data, pre-trained latent spaces, model architectures, pre-processing stages, and data augmentations on AID performance.
  • An average improvement of 26.87% in AUC across multiple detection approaches under real-world conditions.
  • A set of recommendations for designing more robust AID models capable of handling in-the-wild variations.

The ITW-SM Dataset

To facilitate a realistic evaluation setup and address gaps in existing resources, we introduce ITW-SM, a new in-the-wild test dataset of 10,000 real and AI-generated images collected from four major social media platforms (Facebook, Instagram, LinkedIn, and X).


Unlike previous datasets, ITW-SM captures the resolution variability and post-processing artifacts natively present on social media platforms:

| Characteristic | ITW-SM | Chameleon | TWIGMA |
| --- | --- | --- | --- |
| Source of AI images | Social Media Users | AI-Painting Communities | Twitter Users |
| Real Image Source | Verified Social Media Accounts | Unsplash (Photographers) | - |
| Number of platforms | 4 | 0 | 1 |
| Resolution Range | 0.1 - 45 Megapixels | 0.1 - 31 Megapixels | < 0.01 - 47 Megapixels |
| Size | 10,000 | 26,033 | 800,000 |
| Intended Focus | General in-the-wild robustness | Generalization to realistic AI | Analysis of AI art trends |

Topic and Resolution Analysis

To better understand the composition of ITW-SM, we compare its topic and resolution distributions against existing web-sampled datasets, such as Chameleon. As shown below, topic distributions differ significantly between social media platforms and online painting communities. Additionally, ITW-SM exhibits a substantially broader resolution range, particularly for real images.

Topic Distribution Comparison

Topic distribution in web-collected datasets.

Resolution Distribution Comparison

Resolution distribution in web-collected datasets.

Methodology

We define an experimental framework for systematically studying the effect of four key components typically considered when building an AID model:

Figure: Framework for studying the factors impacting expected performance and generalization in AID models.

  • Training Data Composition: Evaluating how dataset diversity (e.g., combining benchmark data with in-the-wild TWIGMA data) affects generalization.
    | Dataset | LDM Real | LDM Gen. | TWIGMA Real | TWIGMA Gen. |
    | --- | --- | --- | --- | --- |
    | LDM | 200K | 200K | - | - |
    | TWIGMA | - | - | 600K | 600K |
    | LDM+TWIGMA (400K) | 200K | 200K | 200K | 200K |
    | LDM+TWIGMA (1.2M) | 200K | 200K | 600K | 600K |

    Table: Training data configurations used in our experiments.

  • Backbone Architectures: Assessing different vision encoders (e.g., CLIP, BLIP2, DINO-V2) for their ability to capture low-level artifacts vs. high-level semantics.
  • Cropping Strategies: Comparing resizing, center cropping, 10-cropping, and texture-based cropping to optimally preserve generation traces.
  • Data Augmentations: Applying geometric and noise transformations to simulate real-world distortions.
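
As a rough illustration of the augmentation stage, the sketch below applies random noise, flipping, and crude rescaling to an image array. This is a minimal numpy-only example of the kinds of distortions such a stage simulates; the specific transformations and parameter ranges are illustrative assumptions, not the paper's training configuration (and the rescaling step assumes even image dimensions).

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply simple in-the-wild style distortions to an HxWx3 uint8 image.

    Illustrative only: real pipelines would also simulate JPEG/WEBP
    compression and proper interpolation-based resizing.
    """
    out = img.astype(np.float32)
    # Additive Gaussian noise of random strength (sensor/platform noise)
    sigma = rng.uniform(0.0, 8.0)
    out += rng.normal(0.0, sigma, out.shape)
    # Horizontal flip half the time (geometric transformation)
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Crude down/up-scaling to mimic platform resizing (even dims assumed)
    if rng.random() < 0.5:
        out = out[::2, ::2, :].repeat(2, axis=0).repeat(2, axis=1)
    return np.clip(out, 0, 255).astype(np.uint8)
```

In practice each distortion would be sampled independently per training image, so the detector sees the same content under many plausible degradations.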

Core Results

Our findings confirm that while most methods achieve strong results on curated benchmark datasets, their performance degrades significantly when applied to in-the-wild AI-generated images.

Performance degradation in the wild

Impact of Design Choices

By systematically updating AID models with optimal design choices, we achieved a substantial average improvement of 26.87% in AUC across various state-of-the-art detection approaches in real-world conditions.

Original vs Updated performance

Detailed Ablation Results

The tables below present the detailed results of each ablation study. All values are reported as AUC / AP.
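
For reference, the AUC reported throughout can be computed directly from detector scores as a pairwise ranking statistic. The snippet below is a minimal, dependency-free illustration of that formulation, not the evaluation code used in the paper.

```python
def auc(scores_real, scores_fake):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen fake image receives a higher detector score than a
    randomly chosen real one (ties counted as 0.5)."""
    pairs = len(scores_real) * len(scores_fake)
    wins = sum(
        1.0 if f > r else 0.5 if f == r else 0.0
        for f in scores_fake
        for r in scores_real
    )
    return wins / pairs

# A perfect detector separates the two score distributions completely:
print(auc(scores_real=[0.05, 0.10], scores_fake=[0.90, 0.95]))  # 1.0
```

Average precision (AP) is computed analogously from the precision-recall curve; in practice, library routines such as scikit-learn's `roc_auc_score` and `average_precision_score` are typically used for both.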

| Model | Pre-train Data | Synthbuster | Chameleon | ITW-SM | Average |
| --- | --- | --- | --- | --- | --- |
| CLIP L/14 | 400M | 96.98 / 97.33 | 82.25 / 81.34 | 96.53 / 96.98 | 91.92 / 91.88 |
| OpenCLIP L/14 | 2B | 74.82 / 81.11 | 85.86 / 83.03 | 90.01 / 91.54 | 83.56 / 85.23 |
| CLIP H/14 | 2B | 97.02 / 81.71 | 81.22 / 76.98 | 90.56 / 91.81 | 89.60 / 83.50 |
| BLIP2 | 129M | 99.37 / 99.48 | 86.58 / 86.28 | 96.49 / 96.97 | 94.15 / 94.24 |
| DINO-V2-L/14 | 142M | 99.14 / 99.18 | 87.33 / 85.51 | 98.23 / 98.50 | 94.90 / 94.40 |

Performance (AUC/AP) of the RINE method with different backbones. Other components fixed to LDM+TWIGMA (1.2M), Texture cropping, With augmentations.

| Method | Training Data | Synthbuster | Chameleon | ITW-SM | Average |
| --- | --- | --- | --- | --- | --- |
| NPR | LDM | 55.95 / 58.47 | 53.04 / 44.57 | 54.85 / 55.92 | 54.61 / 52.99 |
| NPR | TWIGMA | 69.79 / 71.67 | 60.54 / 53.96 | 80.00 / 81.63 | 70.11 / 69.09 |
| NPR | LDM+TWIGMA (400K) | 63.51 / 66.56 | 63.29 / 55.77 | 43.59 / 52.66 | 56.80 / 58.33 |
| NPR | LDM+TWIGMA (1.2M) | 71.31 / 74.86 | 56.29 / 49.89 | 77.54 / 77.73 | 68.38 / 67.49 |
| Gemma 3 IT 27B | zero-shot | 70.61 / 70.54 | 85.01 / 79.56 | 84.04 / 75.50 | 79.89 / 75.20 |
| DMID | LDM | 81.26 / 82.79 | 54.69 / 55.03 | 79.96 / 81.87 | 71.97 / 73.23 |
| DMID | TWIGMA | 90.27 / 90.15 | 82.97 / 78.51 | 91.55 / 92.03 | 88.26 / 86.90 |
| DMID | LDM+TWIGMA (400K) | 89.23 / 89.20 | 77.81 / 72.85 | 88.04 / 85.95 | 85.03 / 82.67 |
| DMID | LDM+TWIGMA (1.2M) | 92.40 / 91.65 | 83.71 / 79.33 | 92.26 / 92.58 | 89.46 / 87.85 |
| RINE | LDM | 95.04 / 95.39 | 63.94 / 52.79 | 83.52 / 83.07 | 80.83 / 77.08 |
| RINE | TWIGMA | 97.77 / 97.96 | 89.03 / 86.69 | 97.66 / 98.07 | 94.82 / 94.24 |
| RINE | LDM+TWIGMA (400K) | 99.61 / 99.66 | 86.51 / 85.12 | 97.90 / 98.20 | 94.67 / 94.33 |
| RINE | LDM+TWIGMA (1.2M) | 99.14 / 99.18 | 87.33 / 85.51 | 98.23 / 98.50 | 94.90 / 94.40 |
| SPAI | LDM | 93.87 / 94.17 | 72.28 / 61.28 | 78.93 / 80.17 | 81.69 / 78.54 |
| SPAI | LDM+TWIGMA (400K) | 98.81 / 99.02 | 89.38 / 87.26 | 97.59 / 97.80 | 95.26 / 94.69 |
| SPAI | LDM+TWIGMA (1.2M) | 97.45 / 98.00 | 90.21 / 88.51 | 98.10 / 98.35 | 95.25 / 94.95 |

Performance (AUC/AP) of detection methods trained on different data configurations. Other components fixed to DINO-V2-L/14, Texture cropping, With augmentations.

| Method | Cropping | Synthbuster | Chameleon | ITW-SM | Average |
| --- | --- | --- | --- | --- | --- |
| DMID | Center cropping | 89.41 / 89.58 | 63.20 / 58.98 | 83.74 / 80.06 | 78.78 / 76.21 |
| DMID | Texture cropping | 92.40 / 91.65 | 83.71 / 79.33 | 92.26 / 92.58 | 89.46 / 87.85 |
| RINE | Center cropping | 94.65 / 94.65 | 84.08 / 82.70 | 95.04 / 95.75 | 91.26 / 91.03 |
| RINE | Texture cropping | 99.14 / 99.18 | 87.33 / 85.51 | 98.23 / 98.50 | 94.90 / 94.40 |
| NPR | Center cropping | 69.79 / 71.67 | 60.54 / 53.96 | 80.00 / 81.63 | 70.11 / 69.09 |
| NPR | Texture cropping | 64.08 / 65.78 | 62.34 / 54.24 | 76.03 / 76.98 | 67.48 / 65.67 |

Performance (AUC/AP) of detection methods with different cropping methods. Other components fixed to DINO-V2-L/14, LDM+TWIGMA (1.2M), With augmentations.

| Method | Augmentation | Synthbuster | Chameleon | ITW-SM | Average |
| --- | --- | --- | --- | --- | --- |
| DMID | Without augmentations | 76.53 / 66.47 | 76.21 / 69.36 | 82.20 / 81.42 | 78.31 / 72.42 |
| DMID | With augmentations | 92.40 / 91.65 | 83.71 / 79.33 | 92.26 / 92.58 | 89.46 / 87.85 |
| RINE | Without augmentations | 93.63 / 95.02 | 92.16 / 90.24 | 93.70 / 94.03 | 93.16 / 93.10 |
| RINE | With augmentations | 99.14 / 99.18 | 87.33 / 85.51 | 98.23 / 98.50 | 94.90 / 94.40 |
| NPR | Without augmentations | 72.35 / 69.61 | 60.80 / 48.49 | 68.92 / 64.09 | 67.36 / 60.73 |
| NPR | With augmentations | 64.08 / 65.78 | 62.34 / 54.24 | 76.03 / 76.98 | 67.48 / 65.67 |
| SPAI | Without augmentations | 94.80 / 94.83 | 82.73 / 81.02 | 91.46 / 91.75 | 89.66 / 89.20 |
| SPAI | With augmentations | 97.45 / 98.00 | 90.21 / 88.51 | 98.10 / 98.35 | 95.25 / 94.95 |

Performance (AUC/AP) of detection methods with and without data augmentations. Trained on LDM+TWIGMA (1.2M). Other components fixed to DINO-V2-L/14, Texture cropping.

Key Findings

Backbones: DINO-V2 significantly outperforms CLIP-based encoders for AID tasks due to its self-supervised training focused on visual understanding rather than image-text semantic alignment.

Training Data: Retraining on in-the-wild collected data consistently benefits performance, though end-to-end models benefit more from scale than models reliant on pre-trained spaces.

Pre-processing: TextureCrop preserves critical high-frequency synthetic artifacts in high-resolution images much better than standard center cropping or resizing.
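
The idea behind texture-based cropping can be sketched as follows: scan candidate windows and keep the one richest in high-frequency content, rather than blindly taking the image center. This is a simplified numpy illustration of that principle (scoring by mean squared neighbor differences on a non-overlapping grid), not the paper's TextureCrop implementation.

```python
import numpy as np

def texture_crop(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Return the size x size window of an HxWx3 image with the most
    high-frequency texture, scanned over a non-overlapping grid."""
    gray = img.mean(axis=2).astype(np.float64)
    best_crop, best_score = None, -1.0
    for y in range(0, gray.shape[0] - size + 1, size):
        for x in range(0, gray.shape[1] - size + 1, size):
            patch = gray[y:y + size, x:x + size]
            # High-frequency energy: mean squared difference between
            # vertically and horizontally adjacent pixels
            score = (np.diff(patch, axis=0) ** 2).mean() \
                  + (np.diff(patch, axis=1) ** 2).mean()
            if score > best_score:
                best_crop, best_score = img[y:y + size, x:x + size], score
    return best_crop
```

Center cropping a high-resolution image can easily land on a flat region (sky, wall) where generation traces are weakest; selecting the most textured window keeps those high-frequency artifacts inside the detector's fixed input size.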

Augmentations: Incorporating augmentations simulating compression and noise is vital to bridge the gap between training and real-world data.

Acknowledgements

We thank Zacharias Chrysidis for his invaluable assistance on late-stage experimentation with VL models. This work is funded by the Horizon Europe projects vera.ai (GA No. 101070093), AI-CODE (GA No. 101135437), and ELIAS (GA No. 101120237). Computational resources were provided by the National Infrastructures for Research and Technology GRNET and funded by the EU Recovery and Resiliency Facility.

BibTeX

@inproceedings{konstantinidou2026navigating,
  title={Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?},
  author={Konstantinidou, Despina and Karageorgiou, Dimitrios and Koutlis, Christos and Papadopoulou, Olga and Schinas, Emmanouil and Papadopoulos, Symeon},
  booktitle={The 5th ACM International Workshop on Multimedia AI against Disinformation (MAD '26)},
  year={2026},
  doi={10.1145/3810988.3812665},
  url={https://arxiv.org/abs/2507.10236}
}