Geolocation Benchmark
Orca 2.1 vs the field.
Visual geolocation accuracy on four independent datasets. Standard Im2GPS thresholds. All models evaluated on the same images — no cherry-picking. The benchmark code is open source and fully reproducible.
Im2GPS3k
n=2,997 · D2 · 2026-05-31
Accuracy = % of images predicted within threshold. Higher is better. Independently run May 31 2026. n=2,123 parsed of 2,997. Last updated: 2026-05-31.
Datasets
Im2GPS3k
n=2,997
2,997 Flickr photos. Standard academic benchmark since 2016.
GeoRC
n=500
500 images annotated by world-champion GeoGuessr players with expert reasoning chains.
OSV-5M
n=5,000
OpenStreetView-5M test split. Global street-level coverage. CVPR 2024.
EarthWhere
n=810
810 images across two tiers: country-level and street-level difficulty.
Try it yourself
Verify it yourself
Im2GPS3k is publicly available. Download the dataset, write any script that calls the Orca API, score results with haversine distance at the standard thresholds, and compare. The public benchmark key below has no rate limit for benchmark use.
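For anyone who prefers to script the API call rather than shell out to curl, the request can be built in Python roughly as follows. This is a minimal sketch: the endpoint, header, and JSON fields are copied from the curl command on this page, while the function names, the timeout, and the assumption that the response body is JSON are illustrative only. The response schema is not documented here, so the sketch returns the decoded body untouched.

```python
import base64
import json
import urllib.request

# Endpoint and key are taken verbatim from the curl example on this page.
API_URL = "https://app.oceanir.ai/api/v1/geolocate"
API_KEY = "orca_bench_public"


def build_request(image_bytes: bytes, depth: int = 2) -> urllib.request.Request:
    """Build (but do not send) the POST request for a single image."""
    payload = {
        "image_b64": base64.b64encode(image_bytes).decode("ascii"),
        "depth": depth,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )


def geolocate(image_bytes: bytes, timeout: float = 60.0):
    """Send one image and return the decoded response body as-is.

    Assumption: the API replies with JSON. No field names are assumed,
    because the response format is not described on this page.
    """
    request = build_request(image_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping `build_request` separate from `geolocate` makes it possible to inspect the exact payload for each image without touching the network.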
# 1. Download Im2GPS3k (2,997 images + ground truth coords)
# mediafire.com/file/7ht7sn78q27o9we/im2gps3ktest.zip
# coords: huggingface.co/datasets/Wendy-Fly/AAAI-2026 → im2gps3k_places365.csv
# 2. Call the Orca API for each image
curl https://app.oceanir.ai/api/v1/geolocate \
-H "x-api-key: orca_bench_public" \
-H "Content-Type: application/json" \
-d '{"image_b64": "<BASE64>", "depth": 2}'
# 3. Score with haversine distance at 1 / 25 / 200 / 750 / 2500 km
Methodology
All models are evaluated on the same image set. Accuracy is the percentage of images where the top-1 predicted location falls within the stated threshold (haversine distance). Orca is run at D2 depth. Open-source comparison models are served via vLLM at bfloat16 precision. Paper-sourced numbers are labeled as such; everything else is independently run. GeoVista and GeoAgent runs are in progress.
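The accuracy metric described above can be sketched in a few lines of Python. This is an illustrative scorer, not the benchmark's own code: the function names, the dictionary input format, and the Earth-radius constant are assumptions. It also assumes that an image with no usable prediction counts as a miss at every threshold; this page does not state how unparsed responses enter the denominator, so adjust that choice to match the convention you are comparing against.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant)
THRESHOLDS_KM = (1, 25, 200, 750, 2500)  # thresholds listed on this page


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_at_thresholds(predictions, ground_truth, thresholds=THRESHOLDS_KM):
    """Return {threshold_km: percent of images within that distance}.

    predictions:  image id -> (lat, lon), or None / absent if unparsed
    ground_truth: image id -> (lat, lon)
    Assumption: missing predictions are misses; the denominator is
    every image in ground_truth.
    """
    total = len(ground_truth)
    if total == 0:
        raise ValueError("ground_truth is empty")
    hits = {t: 0 for t in thresholds}
    for image_id, (gt_lat, gt_lon) in ground_truth.items():
        pred = predictions.get(image_id)
        if pred is None:
            continue  # counted as a miss at every threshold
        distance = haversine_km(pred[0], pred[1], gt_lat, gt_lon)
        for t in thresholds:
            if distance <= t:
                hits[t] += 1
    return {t: 100.0 * hits[t] / total for t in thresholds}
```

Because the thresholds are nested, the returned percentages are non-decreasing from 1 km to 2500 km; a result that violates this points to a scoring bug.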
Every comparison number is scored on the identical Im2GPS3k test images. Orca's training data is deduplicated against this test set at both the image and coordinate level, so no test image or its location can influence Orca's result. The accuracy gap is measured on a clean split.
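To make the two levels of deduplication concrete, here is one way such a filter could look. This page does not describe how Orca's deduplication is implemented, so everything below is a hypothetical illustration: the exact-hash image match, the coordinate rounding to four decimal places, and all function names are assumptions chosen for brevity.

```python
import hashlib


def coord_key(lat, lon, decimals=4):
    """Bucket a coordinate by rounding (hypothetical precision of 4 decimals)."""
    return (round(lat, decimals), round(lon, decimals))


def filter_training_set(train, test, decimals=4):
    """Drop training items that collide with the test set.

    train, test: lists of (image_bytes, lat, lon) tuples.
    An item is dropped if its bytes hash to a test image (image level)
    or if its rounded coordinate equals a test coordinate (coordinate level).
    """
    test_hashes = {hashlib.sha256(img).hexdigest() for img, _, _ in test}
    test_coords = {coord_key(lat, lon, decimals) for _, lat, lon in test}
    kept = []
    for img, lat, lon in train:
        if hashlib.sha256(img).hexdigest() in test_hashes:
            continue  # same image bytes as a test image
        if coord_key(lat, lon, decimals) in test_coords:
            continue  # same rounded location as a test image
        kept.append((img, lat, lon))
    return kept
```

An exact byte hash misses resized or re-encoded copies, and rounding can separate two nearby points that straddle a bucket boundary, so a production filter would likely use perceptual hashing and a distance radius instead.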
This benchmark compares specialized geolocation systems — models and pipelines built specifically for finding where an image was taken. General-purpose vision models are not included. They are a different category: a geolocation product should be measured against other geolocation products, not against the foundation layer any of them might use.
What we could not independently test
The geolocation AI field has grown quickly. Several strong systems exist that we cannot reproduce in a controlled benchmark — not because we are avoiding them, but because the constraints are real:
- REVERSE (arXiv:2605.26861, 48.3% @ 25km on Im2GPS3k) — the best open-source result we are aware of. Final model weights have not been released. The system also requires a live web search backend and a zoom-and-crop tool pipeline during inference; it cannot be served via vLLM like the models in this benchmark.
- G3 (NeurIPS'24, 40.9% @ 25km) — a retrieval-augmented framework that needs a faiss index built from MP16-Pro (4.72 million images), plus GPT-4V API calls for generation. Reproducing it requires hundreds of GB of storage and a separate embedding pipeline.
- Geo-R (AAAI'26, 41.5% @ 25km) — the final GRPO-trained checkpoint has not been confirmed publicly available. An SFT-only intermediate checkpoint exists but would not reflect the paper numbers.
We include their published numbers in the table above for context. Our benchmark covers models that can be independently served, scored, and reproduced by anyone with a GPU and the benchmark code. If weights or reproducible inference code for the above systems become available, we will add them.