Report 001 · Calibration
Where we actually are
What the score means, and what it doesn't.
A number like 60 on its own is almost meaningless. The point of calibration is to show what that number has looked like, in practice, on material we already knew the answer to. That's what this page is about. There's also an honest comparison with the published research, because otherwise we'd just be claiming things.
The histograms below are illustrative while we collect a real labelled test set. Running python scripts/calibrate.py over a folder of known real and synthetic samples overwrites them with measured distributions; a sketch of that workflow follows the histograms.
[Histograms: score distributions for known real vs synthetic samples, binned 0–10 through 90–100. Distribution overlap: image 54%, video 46%, audio 48%, text 47%.]
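The overlap figures come from comparing the real and synthetic score distributions bin by bin. A minimal sketch of that workflow, not the actual scripts/calibrate.py: it assumes a hypothetical score_file() call into the Veritas pipeline and a data/real plus data/synthetic folder layout.

```python
# Minimal sketch of the calibration workflow, not the actual scripts/calibrate.py.
# Assumes a hypothetical score_file(path) -> 0-100 score from the Veritas pipeline,
# and a folder layout like data/real/* and data/synthetic/*.
from pathlib import Path

import numpy as np

def score_file(path: Path) -> float:
    """Placeholder for the real scoring call; returns a 0-100 synthetic-likelihood score."""
    raise NotImplementedError

def histogram(scores: list[float]) -> np.ndarray:
    # Ten bins of width 10, matching the 0-10 ... 90-100 buckets shown above.
    counts, _ = np.histogram(scores, bins=10, range=(0, 100))
    return counts / max(counts.sum(), 1)

def overlap(real: np.ndarray, synthetic: np.ndarray) -> float:
    # Shared area of the two normalised histograms, as a percentage.
    return 100 * np.minimum(real, synthetic).sum()

real_scores = [score_file(p) for p in Path("data/real").glob("*")]
fake_scores = [score_file(p) for p in Path("data/synthetic").glob("*")]
print(f"overlap: {overlap(histogram(real_scores), histogram(fake_scores)):.0f}%")
```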
Report 002 · Accuracy
Today vs. what's possible
Where it sits, and where it could go.
Rough estimates, not benchmark numbers. Until we run a real evaluation, these are informed guesses based on the performance claimed by upstream models and our own spot testing. The ceiling column assumes a couple of weeks of training work per modality.
Image
Today: 75 to 80% on SDXL-era generations. Drops to 55 to 65% on newer diffusion models (Flux, Midjourney v6) and heavily post-processed images.
Ceiling: 88 to 92% with a DINOv2 head trained on the union of FF++, DFDC, Celeb-DF, and WildDeepFake.
Blocker: No fine-tuned head yet. Current classifiers were trained on earlier generator families.
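For context, a linear probe on DINOv2 is the kind of head the ceiling estimate assumes. A minimal sketch using Hugging Face transformers; the class name and the choice of facebook/dinov2-base are illustrative, not what the repo ships.

```python
# Hedged sketch of a DINOv2 linear-probe head for real-vs-synthetic image classification.
# Assumes Hugging Face transformers' Dinov2Model; VeritasImageHead is an illustrative name.
import torch
import torch.nn as nn
from transformers import AutoImageProcessor, Dinov2Model

class VeritasImageHead(nn.Module):
    def __init__(self, backbone: str = "facebook/dinov2-base"):
        super().__init__()
        self.backbone = Dinov2Model.from_pretrained(backbone)
        for p in self.backbone.parameters():
            p.requires_grad = False  # linear probe: freeze the backbone, train only the head
        self.classifier = nn.Linear(self.backbone.config.hidden_size, 2)  # real vs synthetic

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        features = self.backbone(pixel_values=pixel_values).last_hidden_state[:, 0]  # CLS token
        return self.classifier(features)

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = VeritasImageHead()
```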
Video
Today: 60 to 70% on FF++ and DFDC. Weaker on face-swap because we only run per-frame image checks with a light temporal pass.
Ceiling: 85 to 90% with CNN-RNN or ViViT temporal fusion plus audio-visual dissonance.
Blocker: No trained temporal model. The video pipeline currently ignores the audio track.
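A minimal sketch of what CNN-RNN style temporal fusion could look like: per-frame embeddings from the existing image model pooled by a small recurrent head. Dimensions and names are illustrative assumptions.

```python
# Hedged sketch of CNN-RNN temporal fusion: per-frame embeddings -> GRU -> clip verdict.
# The frame embeddings stand in for whatever per-frame image model the pipeline already runs.
import torch
import torch.nn as nn

class TemporalFusionHead(nn.Module):
    def __init__(self, frame_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.gru = nn.GRU(frame_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 2)  # real vs synthetic for the whole clip

    def forward(self, frame_embeddings: torch.Tensor) -> torch.Tensor:
        # frame_embeddings: (batch, num_frames, frame_dim) from the per-frame image model
        _, h = self.gru(frame_embeddings)
        clip_repr = torch.cat([h[-2], h[-1]], dim=-1)  # last forward and backward hidden states
        return self.classifier(clip_repr)

clip = torch.randn(1, 32, 768)  # 32 frames of 768-dim per-frame features
logits = TemporalFusionHead()(clip)
```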
Audio
Today: Around 90% on ASVspoof-style synthetic speech. Shakier on recent cloning systems (ElevenLabs, XTTS).
Ceiling: 93 to 95% with a Whisper head fine-tuned on WaveFake plus 2024-era cloned voices.
Blocker: The Whisper classifier head scaffold is in place but untrained.
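A minimal sketch of the kind of Whisper classifier head the ceiling assumes, built on transformers' WhisperModel; the repo's untrained scaffold may be wired differently, and the class name is illustrative.

```python
# Hedged sketch of a classifier head on the Whisper encoder for synthetic-speech detection.
# Assumes transformers' WhisperModel; WhisperSpoofHead is an illustrative name.
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

class WhisperSpoofHead(nn.Module):
    def __init__(self, checkpoint: str = "openai/whisper-base"):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(checkpoint).encoder
        self.classifier = nn.Linear(self.encoder.config.d_model, 2)  # bona fide vs spoofed

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_features=input_features).last_hidden_state
        return self.classifier(hidden.mean(dim=1))  # mean-pool over time, then classify

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperSpoofHead()
```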
Text
Today: Around 65% on modern LLM output. The RoBERTa and GPT-2 detectors were calibrated against GPT-3-era writing.
Ceiling: 80 to 85% with a discriminator fine-tuned on GPT-4, Claude, Gemini, and Llama 3 output.
Blocker: Needs a current corpus of labelled human vs. LLM writing. We don't have one.
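Once a labelled corpus exists, the fine-tune itself is routine. A hedged sketch using the Hugging Face Trainer; the corpus path, file format, and label convention are assumptions, since no such dataset exists in the repo yet.

```python
# Hedged sketch of fine-tuning a RoBERTa discriminator on a labelled human-vs-LLM corpus.
# The data/human_vs_llm.jsonl path and {"text", "label"} schema are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Hypothetical JSONL corpus with {"text": ..., "label": 0 for human, 1 for LLM}.
dataset = load_dataset("json", data_files="data/human_vs_llm.jsonl")["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out/text-head",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=dataset,
    tokenizer=tokenizer,  # enables padded batching of the tokenized examples
)
trainer.train()
```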
Report 003 · Against the literature
What the research says
How we stack up against a systematic review.
Ramanaharan, Guruge, and Agbinya published a systematic review of 108 deepfake video detection studies in Data and Information Management (2025). It covers work from 2018 to early 2024 and lands on a handful of clear conclusions. Here’s how Veritas compares, honestly.
Match
CNN-based spatial methods dominate the field
Primary image classifiers are Swin-v2 and the DINOv2 scaffold, both spatial.
Partial
Temporal methods are needed for video
We have a temporal-flicker heuristic and a TemporalTransformer scaffold, but no trained weights yet.
Gap
Multimodal fusion (audio + visual) raises accuracy
Real gap. The video pipeline ignores the audio track. Chugh et al. show this is worth 15 to 20% on face-swap video.
Match
Ensemble methods hit 88 to 97% on benchmarks
Veritas is an ensemble: two image classifiers, two audio models, two text models. Calibration and agreement-adjustment logic match the paper's recipe; a sketch of the fusion step follows the reference below.
Partial
Detectors overfit to FF++, DFDC, Celeb-DF. Performance drops cross-dataset.
We sidestep this by not training our own models, but we inherit it from the upstream ones: Organika/sdxl-detector was trained on SDXL, so newer generators slip through.
Gap
Capsule networks, disentangled representation, rPPG
None implemented. rPPG (heart rate from facial colour, Vinay et al.) would be a real forensic lift and isn't in the repo.
Partial
46.3% of reviewed studies claim their model generalises
The paper's headline caveat is that state-of-the-art still degrades on unseen manipulations. We agree with the caveat more than the claim.
Gap
Standardised datasets with scoring systems are needed
This page gestures in that direction, but we don't have a real benchmark suite yet.
Reference: Ramanaharan, R., Guruge, D. B., & Agbinya, J. I. (2025). DeepFake video detection: Insights into model generalisation. A systematic review. Data and Information Management, 9(2), 100099.
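On the ensemble point above, here is a minimal sketch of calibrated score fusion with an agreement adjustment. The weights, thresholds, and the size of the agreement bonus are illustrative, not the values Veritas actually uses.

```python
# Hedged sketch of calibrated ensemble fusion with an agreement adjustment.
# Thresholds and the agreement bonus are illustrative, not the repo's actual values.
from statistics import mean, pstdev

def fuse(scores: dict[str, float]) -> tuple[float, str]:
    """scores: per-model 0-100 synthetic-likelihood scores, e.g. {"swin_v2": 72, "dinov2": 68}."""
    base = mean(scores.values())
    spread = pstdev(scores.values())
    # When the models agree (low spread), keep the score away from 50; when they
    # disagree, pull it back toward 50 so the verdict stays "inconclusive".
    agreement = max(0.0, 1.0 - spread / 25.0)
    fused = 50 + (base - 50) * (0.5 + 0.5 * agreement)
    if 40 <= fused <= 60:
        return fused, "inconclusive"
    return fused, "likely synthetic" if fused > 60 else "likely real"

print(fuse({"swin_v2": 74.0, "dinov2": 69.0}))  # agreeing models -> confident verdict
print(fuse({"swin_v2": 80.0, "dinov2": 35.0}))  # disagreeing models -> inconclusive
```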
Report 004 · The hedge
The paper’s takeaway is our takeaway.
Cross-dataset generalisation in deepfake detection is unsolved. The best published models hit 92 to 96% on the dataset they were trained on, and 65 to 75% on everything else. Until generation and detection hit a saturation point together, nobody is going to nail 95% on arbitrary in-the-wild fakes. A realistic ceiling for a careful ensemble like this one is around 85%, with honest hedging on the other 15%. That’s why “inconclusive” is a real verdict here and not a failure state.