Geolocation Benchmark

Orca 2.1 vs the field.

Visual geolocation accuracy on four independent datasets, scored at the standard Im2GPS thresholds (1, 25, 200, 750, and 2,500 km). All models are evaluated on the same images, with no cherry-picking. The benchmark code is open source, so the results can be reproduced.

Im2GPS3k

n=2,123 scored of 2,997 · D2 · 2026-05-31

Model	Label	1 km	25 km	200 km	750 km	2500 km
Orca 2.1	Oceanir	32.2%	64.3%	75.7%	86.1%	93.3%
REVERSE	Paper	22.5%	48.3%	59.3%	73.5%	84.8%
Geo-R	Paper	18.1%	41.5%	58.3%	75.3%	86.4%
GAEA-7B	WACV'26	8.4%	36.9%	56.0%	73.2%	85.6%
GeoCLIP	NeurIPS'23	14.1%	34.5%	50.6%	69.7%	83.8%
GeoVista	Agentic	—	—	—	—	—
GeoAgent	CVPR'26	—	—	—	—	—

Accuracy = % of images predicted within threshold. Higher is better. Independently run May 31, 2026. n=2,123 parsed of 2,997. Last updated: 2026-05-31.

Datasets

Im2GPS3k

n=2,997

2,997 Flickr photos. Standard academic benchmark since 2016.

GeoRC

n=500

500 images annotated by world-champion GeoGuessr players with expert reasoning chains.

OSV-5M

n=5,000

OpenStreetView-5M test split. Global street-level coverage. CVPR 2024.

EarthWhere

n=810

810 images across two tiers: country-level and street-level difficulty.

Try it yourself

Live D2 analysis via the public API. No account needed.

POST /api/v1/geolocate

JPG, PNG, WebP — any outdoor photo works

Using public benchmark key · D2 depth · No account required

Verify it yourself

Im2GPS3k is publicly available. Download the dataset, write any script that calls the Orca API, score results with haversine distance at the standard thresholds, and compare. The public benchmark key below has no rate limit for benchmark use.

# 1. Download Im2GPS3k (2,997 images + ground truth coords)
#    mediafire.com/file/7ht7sn78q27o9we/im2gps3ktest.zip
#    coords: huggingface.co/datasets/Wendy-Fly/AAAI-2026 → im2gps3k_places365.csv

# 2. Call the Orca API for each image
curl https://app.oceanir.ai/api/v1/geolocate \
  -H "x-api-key: orca_bench_public" \
  -H "Content-Type: application/json" \
  -d '{"image_b64": "<BASE64>", "depth": 2}'

# 3. Score with haversine distance at 1 / 25 / 200 / 750 / 2500 km
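
Steps 1 and 2 can also be scripted. The following is a minimal Python sketch, not the official benchmark code. It sends the same request as the curl command above, using only the standard library. The CSV column names (IMG_ID, LAT, LON) are an assumption about im2gps3k_places365.csv and may need adjusting. The response schema is not documented on this page, so geolocate returns the decoded JSON unchanged rather than guessing at field names.

```python
import base64
import csv
import io
import json
import urllib.request

API_URL = "https://app.oceanir.ai/api/v1/geolocate"
API_KEY = "orca_bench_public"  # public benchmark key from the curl example


def load_ground_truth(csv_text, id_col="IMG_ID", lat_col="LAT", lon_col="LON"):
    """Map image id -> (lat, lon).

    The default column names are assumptions; check the CSV header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_col]: (float(row[lat_col]), float(row[lon_col])) for row in reader}


def build_request(image_bytes, depth=2):
    """Build the POST request that mirrors the curl command."""
    body = json.dumps({
        "image_b64": base64.b64encode(image_bytes).decode("ascii"),
        "depth": depth,
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )


def geolocate(image_bytes, depth=2, timeout=60):
    """Send one image and return the decoded JSON response as-is."""
    request = build_request(image_bytes, depth)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To run it over the dataset, read each image file as bytes, call geolocate, and extract the predicted latitude and longitude from whatever fields the response actually contains.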

Got different results? Email us

Methodology

All models are evaluated on the same image set. Accuracy is the percentage of images where the top-1 predicted location falls within the stated threshold (haversine distance). Orca is run at D2 depth. Open-source comparison models are served via vLLM at bfloat16 precision. Paper-sourced numbers are labeled as such; everything else is independently run. GeoVista and GeoAgent runs are in progress.
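
The scoring rule described above can be sketched in a few lines of Python. This is an illustration, not the benchmark's own scorer. It assumes a spherical Earth with a mean radius of 6371 km, treats "within" as less than or equal to the threshold, and computes accuracy over the predictions passed in (that is, over scored images only). The page does not confirm any of these three choices.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius
THRESHOLDS_KM = (1, 25, 200, 750, 2500)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_at_thresholds(pairs, thresholds_km=THRESHOLDS_KM):
    """Percentage of predictions within each threshold.

    pairs: iterable of ((pred_lat, pred_lon), (true_lat, true_lon)).
    """
    distances = [haversine_km(*pred, *true) for pred, true in pairs]
    if not distances:
        raise ValueError("no scored predictions")
    return {
        t: 100.0 * sum(d <= t for d in distances) / len(distances)
        for t in thresholds_km
    }
```

Because each threshold contains the smaller ones, the percentages for any single model must be non-decreasing from 1 km to 2500 km, which is a quick sanity check on a scoring script.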

Every comparison number is scored on the identical Im2GPS3k test images. Orca's training data is deduplicated against this test set at both the image and coordinate level, so no test image or its location can influence Orca's result. The accuracy gap is measured on a clean split.
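
The page does not describe how this deduplication is implemented. As an illustration only, a two-level filter could look like the Python sketch below. Everything in it is hypothetical: the function name, the exact-byte SHA-256 match (which would miss resized or re-encoded copies), and the 1 km exclusion radius are assumptions chosen for clarity, not Orca's actual procedure.

```python
import hashlib
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def filter_training_set(train, test, exclusion_km=1.0):
    """Drop training records that collide with the test set.

    train, test: iterables of (image_bytes, lat, lon).
    exclusion_km is a hypothetical radius, not a documented value.
    """
    test = list(test)
    test_hashes = {hashlib.sha256(img).hexdigest() for img, _, _ in test}
    test_coords = [(lat, lon) for _, lat, lon in test]
    kept = []
    for img, lat, lon in train:
        # Image-level check: identical bytes to any test image.
        if hashlib.sha256(img).hexdigest() in test_hashes:
            continue
        # Coordinate-level check: too close to any test location.
        if any(haversine_km(lat, lon, tlat, tlon) <= exclusion_km
               for tlat, tlon in test_coords):
            continue
        kept.append((img, lat, lon))
    return kept
```

The linear scan over test coordinates is fine for 2,997 test points and a small sample, but a training set with millions of images would need a spatial index.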

This benchmark compares specialized geolocation systems — models and pipelines built specifically for finding where an image was taken. General-purpose vision models are not included. They are a different category: a geolocation product should be measured against other geolocation products, not against the foundation layer any of them might use.

What we could not independently test

The geolocation AI field has grown quickly. Several strong systems exist that we cannot reproduce in a controlled benchmark — not because we are avoiding them, but because the constraints are real:

REVERSE (arXiv:2605.26861, 48.3% @ 25km on Im2GPS3k) — the best open-source result we are aware of. Final model weights have not been released. The system also requires a live web search backend and a zoom-and-crop tool pipeline during inference; it cannot be served via vLLM like the models in this benchmark.
G3 (NeurIPS'24, 40.9% @ 25km) — a retrieval-augmented framework that needs a faiss index built from MP16-Pro (4.72 million images), plus GPT-4V API calls for generation. Reproducing it requires hundreds of GB of storage and a separate embedding pipeline.
Geo-R (AAAI'26, 41.5% @ 25km) — the final GRPO-trained checkpoint has not been confirmed publicly available. An SFT-only intermediate checkpoint exists but would not reflect the paper numbers.

We include the published REVERSE and Geo-R numbers in the table above for context; G3 is listed here only. Our benchmark covers models that can be independently served, scored, and reproduced by anyone with a GPU and the benchmark code. If weights or reproducible inference code for the above systems become available, we will add them.
