Foundation Model Selection for CRE Development Workflows: A Practical Evaluation Guide
How institutional development teams should choose and route foundation models across CRE workflows for accuracy, cost and performance.
Which AI model you use matters more than most development teams realize -- and the answer changes depending on the task.
The Model Is Not a Detail
Most institutional teams evaluating AI tools focus on the software layer: the interface, the workflow design, the integration points. The foundation model underneath gets treated as a technical detail. That framing is wrong.
The model determines accuracy, context capacity, reasoning depth and cost. For high-stakes CRE work -- site feasibility analysis, capital stack modeling, lease abstraction, interconnection queue review -- model selection shapes outcomes in ways that interface design cannot compensate for.
The Models Doing Serious Work in 2026
Four model families are running in production CRE development workflows right now.
OpenAI GPT-4o and o-series
GPT-4o is the most widely deployed commercial model in enterprise tools. Strong structured output, reliable tool use and broad integration support. The o-series reasoning models (o3, o4-mini) are meaningfully better on multi-step analytical tasks: interconnection queue modeling, capital stack scenario analysis, IC memo assembly. o4-mini is the workhorse for reasoning-heavy tasks where cost matters. The trade-off is higher latency on every call.
Anthropic Claude 3.5 and 3.7 Sonnet
Claude is the preferred model for document-intensive workflows: ground lease review, PSA analysis, construction contract extraction, environmental report parsing. Its 200k context window handles large document sets without chunking artifacts. Claude 3.7 adds extended thinking for structured analytical tasks. Slightly higher latency on long-context runs than GPT-4o.
Google Gemini 2.0 and 2.5 Flash/Pro
Gemini's strength is multimodal analysis and speed. For workflows involving site plan review, ALTA surveys, aerial imagery or architectural drawings layered with text, Gemini's native vision capability leads the field. Gemini 2.5 Pro is the top-performing model on multimodal tasks. Flash variants compete on cost-per-token for high-volume screening runs.
Meta Llama 3.x (hosted or self-hosted)
Llama runs at a fraction of the cost of proprietary APIs. For structured extraction tasks with a fixed schema -- permit status lookups, rent roll parsing, budget line normalization -- a fine-tuned Llama variant can match GPT-4o performance at roughly 80% lower cost. The trade-off is implementation overhead and no frontier reasoning capability.
Task-to-Model Routing
| Workflow | Best Fit | Reason |
|---|---|---|
| Site screening (high volume) | Gemini Flash / Llama | Cost efficiency at scale |
| Document extraction (lease, PSA, title) | Claude 3.5/3.7 | Large context, accuracy |
| Pro forma modeling / scenario analysis | o3 / o4-mini | Reasoning depth |
| Multimodal (site plans, aerials) | Gemini 2.5 Pro | Native vision |
| IC memo drafting | Claude 3.7 / GPT-4o | Structured long-form |
| Permitting and regulatory research | GPT-4o with tools | Browser and web access |
No single model dominates every CRE task. The highest-performing teams run two or three models in parallel, routing by task type.
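In code, task-type routing amounts to a lookup with a safe fallback. A minimal sketch follows; the task labels and model identifiers are illustrative placeholders mirroring the table, not vendor API model names.

```python
# Hypothetical task-to-model routing table. Labels and model names are
# illustrative; substitute the identifiers your provider SDKs expect.
ROUTING_TABLE = {
    "site_screening": "gemini-flash",        # high volume, cost-sensitive
    "document_extraction": "claude-sonnet",  # large context, accuracy
    "scenario_analysis": "o4-mini",          # reasoning depth
    "multimodal_review": "gemini-pro",       # native vision
    "memo_drafting": "claude-sonnet",        # structured long-form
    "regulatory_research": "gpt-4o",         # tool and web access
}

def route(task_type: str, default: str = "gpt-4o") -> str:
    """Return the model for a task type, falling back to a general default."""
    return ROUTING_TABLE.get(task_type, default)
```

The fallback matters: new task types should land on a capable general-purpose model rather than fail, and the table can be revised as benchmarks shift without touching calling code.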
Evaluation Criteria That Actually Matter
Context window. CRE transactions involve long documents. An environmental report can run 150 pages. A title commitment with exceptions frequently exceeds 80 pages. Models with sub-32k context windows need chunking, which introduces extraction errors. Gemini 2.5 Pro supports over 1M tokens natively; Claude's 200k window fits most single-transaction document sets without chunking.
Tool use reliability. Agentic CRE workflows require models that call external tools: data APIs, GIS layers, utility databases, market data sources. GPT-4o and Claude have the most consistent tool call behavior in production. Gemini 2.5 has closed most of the reliability gaps present in earlier versions.
Structured output fidelity. If a workflow depends on JSON extraction from documents -- rent roll tables, budget variances, contract terms -- the model must reliably produce valid structured output without hallucinating fields. Test against your actual document formats before committing.
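Testing structured output fidelity can be automated with a strict field check that rejects both hallucinated and dropped fields. A minimal sketch, assuming a rent-roll extraction task; the field names here are hypothetical and should be replaced with your actual schema.

```python
import json

# Hypothetical rent-roll schema; swap in the fields your workflow expects.
EXPECTED_FIELDS = {"unit", "tenant", "sqft", "base_rent", "lease_end"}

def validate_extraction(raw: str) -> tuple[bool, str]:
    """Accept model output only if it is valid JSON whose rows carry
    exactly the expected fields -- no hallucinated extras, no omissions."""
    try:
        rows = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    for i, row in enumerate(rows):
        extra = set(row) - EXPECTED_FIELDS    # hallucinated fields
        missing = EXPECTED_FIELDS - set(row)  # dropped fields
        if extra or missing:
            return False, f"row {i}: extra={sorted(extra)}, missing={sorted(missing)}"
    return True, "ok"
```

Run a check like this over model output from a sample of your real documents before committing to a model; pass rates diverge quickly between models on messy scanned formats.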
Cost at scale. A single site screening campaign can generate 500-2,000 model calls. At $0.015 per 1k tokens for GPT-4o, that compounds quickly. For high-volume, lower-complexity tasks, route to Flash or a hosted Llama variant.
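The compounding is easy to make concrete. Using the article's illustrative rate of $0.015 per 1k tokens and an assumed 3,000 tokens per call (token count is a placeholder; measure your own):

```python
def campaign_cost(calls: int, avg_tokens_per_call: int, price_per_1k: float) -> float:
    """Estimated token spend for a screening campaign."""
    return calls * (avg_tokens_per_call / 1000) * price_per_1k

# 2,000 calls at 3,000 tokens each, $0.015 per 1k tokens -> $90 per campaign
frontier = campaign_cost(2000, 3000, 0.015)
# The same volume routed to a model at ~80% lower per-token cost -> $18
budget = campaign_cost(2000, 3000, 0.003)
```

A $72 difference per campaign is trivial once, material across hundreds of screening runs a year.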
Latency. Synchronous workflows (a team member waiting on a result) tolerate up to 30 seconds. Batch overnight jobs can tolerate 5-10 minutes. Match latency requirements to model speed before deploying at scale.
Build vs. Buy: The Model Layer Question
Most enterprise CRE platforms bundle a default model. That default may not be optimal for your most important workflows. The right question is not which platform uses which model -- it is whether the platform allows model routing or override.
AI-native firms building CRE workflows on the full stack have the flexibility to match model to task. For a 12-workflow development operation, model optimization across the stack can cut inference cost 40-60% and materially improve analytical accuracy on the highest-stakes outputs.
The selection framework is not complicated. Classify workflows by document intensity, reasoning depth, multimodal requirement and volume. Test two or three models on representative inputs. The performance gap is visible within a day.
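The classification step above can be sketched as a simple decision rule. The thresholds and model-class labels below are assumptions for illustration, not benchmarks; tune them to your own workflow mix.

```python
from dataclasses import dataclass

@dataclass
class Workflow:
    doc_pages: int        # document intensity (largest typical document set)
    reasoning_steps: int  # depth of multi-step analysis required
    multimodal: bool      # site plans, surveys, aerial imagery involved
    monthly_calls: int    # expected volume

def suggest_model_class(w: Workflow) -> str:
    """Map workflow attributes to a model class, in priority order.
    Thresholds are illustrative assumptions."""
    if w.multimodal:
        return "native-vision model"
    if w.doc_pages > 50:
        return "long-context model"
    if w.reasoning_steps > 3:
        return "reasoning model"
    if w.monthly_calls > 1000:
        return "low-cost model"
    return "general-purpose model"
```

The priority order encodes the article's logic: a hard capability requirement (vision, context, reasoning) dominates; cost only drives the choice when no capability constraint binds.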