Report 001 · Calibration
Where we actually are
What the score means, and what it doesn't.
A number like 60 on its own is almost meaningless. The point of calibration is to show what that number has looked like, in practice, on material we already knew the answer to. That's what this page is about. There's also an honest comparison with the published research, because otherwise we'd just be claiming things.
The histograms below are illustrative while we collect a real labelled test set. Running python scripts/calibrate.py over a folder of known real and synthetic samples overwrites them with measured distributions; a sketch of that workflow follows the histograms.
[Histograms: score distributions for known real vs synthetic samples, binned 0–10 through 90–100. Distribution overlap: image 54%, video 46%, audio 48%, text 47%.]
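The overlap figures come from comparing the real and synthetic score distributions bin by bin. A minimal sketch of that workflow, not the actual scripts/calibrate.py: it assumes a hypothetical score_file() call into the Veritas pipeline and a data/real plus data/synthetic folder layout.

```python
# Minimal sketch of the calibration workflow, not the actual scripts/calibrate.py.
# Assumes a hypothetical score_file(path) -> 0-100 score from the Veritas pipeline,
# and a folder layout like data/real/* and data/synthetic/*.
from pathlib import Path

import numpy as np

def score_file(path: Path) -> float:
    """Placeholder for the real scoring call; returns a 0-100 synthetic-likelihood score."""
    raise NotImplementedError

def histogram(scores: list[float]) -> np.ndarray:
    # Ten bins of width 10, matching the 0-10 ... 90-100 buckets shown above.
    counts, _ = np.histogram(scores, bins=10, range=(0, 100))
    return counts / max(counts.sum(), 1)

def overlap(real: np.ndarray, synthetic: np.ndarray) -> float:
    # Shared area of the two normalised histograms, as a percentage.
    return 100 * np.minimum(real, synthetic).sum()

real_scores = [score_file(p) for p in Path("data/real").glob("*")]
fake_scores = [score_file(p) for p in Path("data/synthetic").glob("*")]
print(f"overlap: {overlap(histogram(real_scores), histogram(fake_scores)):.0f}%")
```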
Report 002 · Accuracy
Today vs. what's possible
Where it sits, and where it could go.
Rough estimates, not benchmark numbers. Until we run a real evaluation, these are informed guesses based on the performance claimed by upstream models and our own spot testing. The ceiling column assumes a couple of weeks of training work per modality.
Image
Today: 75 to 80% on SDXL-era generations. Drops to 55 to 65% on newer diffusion models (Flux, Midjourney v6) and heavily post-processed images.
Ceiling: 88 to 92% with a DINOv2 head trained on the union of FF++, DFDC, Celeb-DF, and WildDeepFake.
Blocker: No fine-tuned head yet. Current classifiers were trained on earlier generator families.
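For context, a linear probe on DINOv2 is the kind of head the ceiling estimate assumes. A minimal sketch using Hugging Face transformers; the class name and the choice of facebook/dinov2-base are illustrative, not what the repo ships.

```python
# Hedged sketch of a DINOv2 linear-probe head for real-vs-synthetic image classification.
# Assumes Hugging Face transformers' Dinov2Model; VeritasImageHead is an illustrative name.
import torch
import torch.nn as nn
from transformers import AutoImageProcessor, Dinov2Model

class VeritasImageHead(nn.Module):
    def __init__(self, backbone: str = "facebook/dinov2-base"):
        super().__init__()
        self.backbone = Dinov2Model.from_pretrained(backbone)
        for p in self.backbone.parameters():
            p.requires_grad = False  # linear probe: freeze the backbone, train only the head
        self.classifier = nn.Linear(self.backbone.config.hidden_size, 2)  # real vs synthetic

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        features = self.backbone(pixel_values=pixel_values).last_hidden_state[:, 0]  # CLS token
        return self.classifier(features)

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = VeritasImageHead()
```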
Video
Today: 60 to 70% on FF++ and DFDC. Weaker on face-swap because we only run per-frame image checks with a light temporal pass.
Ceiling: 85 to 90% with CNN-RNN or ViViT temporal fusion plus audio-visual dissonance.
Blocker: No trained temporal model. The video pipeline currently ignores the audio track.
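A minimal sketch of what CNN-RNN style temporal fusion could look like: per-frame embeddings from the existing image model pooled by a small recurrent head. Dimensions and names are illustrative assumptions.

```python
# Hedged sketch of CNN-RNN temporal fusion: per-frame embeddings -> GRU -> clip verdict.
# The frame embeddings stand in for whatever per-frame image model the pipeline already runs.
import torch
import torch.nn as nn

class TemporalFusionHead(nn.Module):
    def __init__(self, frame_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.gru = nn.GRU(frame_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 2)  # real vs synthetic for the whole clip

    def forward(self, frame_embeddings: torch.Tensor) -> torch.Tensor:
        # frame_embeddings: (batch, num_frames, frame_dim) from the per-frame image model
        _, h = self.gru(frame_embeddings)
        clip_repr = torch.cat([h[-2], h[-1]], dim=-1)  # last forward and backward hidden states
        return self.classifier(clip_repr)

clip = torch.randn(1, 32, 768)  # 32 frames of 768-dim per-frame features
logits = TemporalFusionHead()(clip)
```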
Audio
Today: Around 90% on ASVspoof-style synthetic speech. Shakier on recent cloning systems (ElevenLabs, XTTS).
Ceiling: 93 to 95% with a Whisper head fine-tuned on WaveFake plus 2024-era cloned voices.
Blocker: The Whisper classifier head scaffold is in place but untrained.
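A minimal sketch of the kind of Whisper classifier head the ceiling assumes, built on transformers' WhisperModel; the repo's untrained scaffold may be wired differently, and the class name is illustrative.

```python
# Hedged sketch of a classifier head on the Whisper encoder for synthetic-speech detection.
# Assumes transformers' WhisperModel; WhisperSpoofHead is an illustrative name.
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

class WhisperSpoofHead(nn.Module):
    def __init__(self, checkpoint: str = "openai/whisper-base"):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(checkpoint).encoder
        self.classifier = nn.Linear(self.encoder.config.d_model, 2)  # bona fide vs spoofed

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_features=input_features).last_hidden_state
        return self.classifier(hidden.mean(dim=1))  # mean-pool over time, then classify

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperSpoofHead()
```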
Text
Today: Around 65% on modern LLM output. The RoBERTa and GPT-2 detectors were calibrated against GPT-3-era writing.
Ceiling: 80 to 85% with a discriminator fine-tuned on GPT-4, Claude, Gemini, and Llama 3 output.
Blocker: Needs a current corpus of labelled human vs. LLM writing. We don't have one.
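Once a labelled corpus exists, the fine-tune itself is routine. A hedged sketch using the Hugging Face Trainer; the corpus path, file format, and label convention are assumptions, since no such dataset exists in the repo yet.

```python
# Hedged sketch of fine-tuning a RoBERTa discriminator on a labelled human-vs-LLM corpus.
# The data/human_vs_llm.jsonl path and {"text", "label"} schema are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Hypothetical JSONL corpus with {"text": ..., "label": 0 for human, 1 for LLM}.
dataset = load_dataset("json", data_files="data/human_vs_llm.jsonl")["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out/text-head",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=dataset,
    tokenizer=tokenizer,  # enables padded batching of the tokenized examples
)
trainer.train()
```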
Report 003 · Against the literature
What the research says
How we stack up against a systematic review.
Ramanaharan, Guruge, and Agbinya published a systematic review of 108 deepfake video detection studies in Data and Information Management (2025). It covers work from 2018 to early 2024 and lands on a handful of clear conclusions. Here’s how Veritas compares, honestly.
Match
CNN-based spatial methods dominate the field
Primary image classifiers are Swin-v2 and the DINOv2 scaffold, both spatial.
Partial
Temporal methods are needed for video
We have a temporal-flicker heuristic and a TemporalTransformer scaffold, but no trained weights yet.
Gap
Multimodal fusion (audio + visual) raises accuracy
Real gap. The video pipeline ignores the audio track. Chugh et al. show this is worth 15 to 20% on face-swap video.
Match
Ensemble methods hit 88 to 97% on benchmarks
Veritas is an ensemble: two image classifiers, two audio models, two text models. Calibration and agreement-adjustment logic match the paper's recipe; a sketch of the fusion step follows the reference below.
Partial
Detectors overfit to FF++, DFDC, Celeb-DF. Performance drops cross-dataset.
We sidestep this by not training our own models, but we inherit it from the upstream ones: Organika/sdxl-detector was trained on SDXL, so newer generators slip through.
Gap
Capsule networks, disentangled representation, rPPG
None implemented. rPPG (heart rate from facial colour, Vinay et al.) would be a real forensic lift and isn't in the repo.
Partial
46.3% of reviewed studies claim their model generalises
The paper's headline caveat is that state-of-the-art still degrades on unseen manipulations. We agree with the caveat more than the claim.
Gap
Standardised datasets with scoring systems are needed
This page gestures in that direction, but we don't have a real benchmark suite yet.
Reference: Ramanaharan, R., Guruge, D. B., & Agbinya, J. I. (2025). DeepFake video detection: Insights into model generalisation. A systematic review. Data and Information Management, 9(2), 100099.
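On the ensemble point above, here is a minimal sketch of calibrated score fusion with an agreement adjustment. The weights, thresholds, and the size of the agreement bonus are illustrative, not the values Veritas actually uses.

```python
# Hedged sketch of calibrated ensemble fusion with an agreement adjustment.
# Thresholds and the agreement bonus are illustrative, not the repo's actual values.
from statistics import mean, pstdev

def fuse(scores: dict[str, float]) -> tuple[float, str]:
    """scores: per-model 0-100 synthetic-likelihood scores, e.g. {"swin_v2": 72, "dinov2": 68}."""
    base = mean(scores.values())
    spread = pstdev(scores.values())
    # When the models agree (low spread), keep the score away from 50; when they
    # disagree, pull it back toward 50 so the verdict stays "inconclusive".
    agreement = max(0.0, 1.0 - spread / 25.0)
    fused = 50 + (base - 50) * (0.5 + 0.5 * agreement)
    if 40 <= fused <= 60:
        return fused, "inconclusive"
    return fused, "likely synthetic" if fused > 60 else "likely real"

print(fuse({"swin_v2": 74.0, "dinov2": 69.0}))  # agreeing models -> confident verdict
print(fuse({"swin_v2": 80.0, "dinov2": 35.0}))  # disagreeing models -> inconclusive
```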
Report 004 · The hedge
The paper’s takeaway is our takeaway.
Cross-dataset generalisation in deepfake detection is unsolved. The best published models hit 92 to 96% on the dataset they were trained on, and 65 to 75% on everything else. Until generation and detection hit a saturation point together, nobody is going to nail 95% on arbitrary in-the-wild fakes. A realistic ceiling for a careful ensemble like this one is around 85%, with honest hedging on the other 15%. That’s why “inconclusive” is a real verdict here and not a failure state.