
Foundation Model Selection for CRE Development Workflows: A Practical Evaluation Guide

Foundation model selection has a direct impact on CRE workflow accuracy, cost and analytical depth. This guide maps GPT-4o, Claude, Gemini and o-series models to specific development tasks and provides an evaluation framework covering context window, tool use reliability, structured output fidelity and cost at scale. Teams that route by task type rather than defaulting to a single model cut inference costs by 40-60% while improving accuracy on high-stakes outputs.

by Build Team, April 23, 2026

How institutional development teams should choose and route foundation models across CRE workflows for accuracy, cost and performance.

Which AI model you use matters more than most development teams realize -- and the answer changes depending on the task.

The Model Is Not a Detail

Most institutional teams evaluating AI tools focus on the software layer: the interface, the workflow design, the integration points. The foundation model underneath gets treated as a technical detail. That framing is wrong.

The model determines accuracy, context capacity, reasoning depth and cost. For high-stakes CRE work -- site feasibility analysis, capital stack modeling, lease abstraction, interconnection queue review -- model selection shapes outcomes in ways that interface design cannot compensate for.

The Models Doing Serious Work in 2026

Four model families are running in production CRE development workflows right now.

OpenAI GPT-4o and o-series
GPT-4o is the most widely deployed commercial model in enterprise tools. Strong structured output, reliable tool use and broad integration support. The o-series reasoning models (o3, o4-mini) are meaningfully better on multi-step analytical tasks: interconnection queue modeling, capital stack scenario analysis, IC memo assembly. o4-mini is the workhorse for reasoning-heavy tasks where cost matters. The trade-off is higher latency on every call.

Anthropic Claude 3.5 and 3.7 Sonnet
Claude is the preferred model for document-intensive workflows: ground lease review, PSA analysis, construction contract extraction, environmental report parsing. Its 200k context window handles large document sets without chunking artifacts. Claude 3.7 adds extended thinking for structured analytical tasks. Slightly higher latency on long-context runs than GPT-4o.

Google Gemini 2.0 and 2.5 Flash/Pro
Gemini's strength is multimodal analysis and speed. For workflows involving site plan review, ALTA surveys, aerial imagery or architectural drawings layered with text, Gemini's native vision capability leads the field. Gemini 2.5 Pro is the top-performing model on multimodal tasks. Flash variants compete on cost-per-token for high-volume screening runs.

Meta Llama 3.x (hosted or self-hosted)
Llama runs at a fraction of the cost of proprietary APIs. For structured extraction tasks with fixed schema -- permit status lookups, rent roll parsing, budget line normalization -- a fine-tuned Llama variant can match GPT-4o performance at 80% lower cost. The trade-off is implementation overhead and no frontier reasoning capability.

Task-to-Model Routing

| Workflow | Best fit | Reason |
| --- | --- | --- |
| Site screening (high volume) | Gemini Flash / Llama | Cost efficiency at scale |
| Document extraction (lease, PSA, title) | Claude 3.5/3.7 | Large context, accuracy |
| Pro forma modeling / scenario analysis | o3 / o4-mini | Reasoning depth |
| Multimodal (site plans, aerials) | Gemini 2.5 Pro | Native vision |
| IC memo drafting | Claude 3.7 / GPT-4o | Structured long-form |
| Permitting and regulatory research | GPT-4o with tools | Browser and web access |

No single model dominates every CRE task. The highest-performing teams run two or three models in parallel, routing by task type.
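In practice, routing by task type can be as simple as a lookup table with a general-purpose fallback. The sketch below mirrors the routing table above; the model identifiers and task-type keys are illustrative assumptions, not vendor API names.

```python
# Minimal task-type router mirroring the routing table above.
# Model names and task keys are illustrative, not vendor defaults.

ROUTES = {
    "site_screening": "gemini-flash",       # high volume -> cheapest capable model
    "document_extraction": "claude-3.7",    # long context for leases, PSAs, title
    "scenario_analysis": "o4-mini",         # reasoning depth at moderate cost
    "multimodal_review": "gemini-2.5-pro",  # native vision for plans and aerials
    "ic_memo_draft": "claude-3.7",          # structured long-form drafting
    "regulatory_research": "gpt-4o",        # reliable tool use for web lookups
}

def route(task_type: str, fallback: str = "gpt-4o") -> str:
    """Return the model id for a task type, falling back to a general model."""
    return ROUTES.get(task_type, fallback)
```

Keeping the routing rules in one table makes it cheap to re-test and swap models as pricing and benchmarks shift, without touching workflow code.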

Evaluation Criteria That Actually Matter

Context window. CRE transactions involve long documents. An environmental report can run 150 pages. A title commitment with exceptions frequently exceeds 80 pages. Models with sub-32k context windows need chunking, which introduces extraction errors. Gemini 2.5 Pro supports over 1M tokens natively; Claude's 200k window covers most single-transaction document sets without chunking.
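A rough pre-flight check can flag when a document set will need chunking before any calls are made. The sketch below uses the common 4-characters-per-token rule of thumb (not an exact tokenizer count) and illustrative context limits; verify both against your vendor's documentation.

```python
# Pre-flight check: estimate whether a document set fits a model's
# context window. The 4-chars-per-token ratio is a rule of thumb,
# and the limits below are illustrative, not authoritative.

CONTEXT_LIMITS = {          # tokens, illustrative values
    "claude-3.7": 200_000,
    "gemini-2.5-pro": 1_000_000,
    "gpt-4o": 128_000,
}

def fits_in_context(total_chars: int, model: str, reserve: int = 8_000) -> bool:
    """True if the estimated input tokens plus a response reserve fit."""
    est_tokens = total_chars // 4
    return est_tokens + reserve <= CONTEXT_LIMITS[model]
```

A multi-document transaction file of around 2 million characters (roughly 500k tokens) would fail this check for a 200k-token model but pass for a 1M-token model, which is exactly the routing decision the check is meant to surface.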

Tool use reliability. Agentic CRE workflows require models that call external tools: data APIs, GIS layers, utility databases, market data sources. GPT-4o and Claude have the most consistent tool call behavior in production. Gemini 2.5 has closed most of the reliability gaps present in earlier versions.

Structured output fidelity. If a workflow depends on JSON extraction from documents -- rent roll tables, budget variances, contract terms -- the model must reliably produce valid structured output without hallucinating fields. Test against your actual document formats before committing.
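One cheap guard is to validate every model response against the fields the workflow expects before it enters downstream systems, rejecting both missing and hallucinated keys. The field names below are hypothetical examples for a rent roll row, not a fixed schema.

```python
# Post-call guard: validate model JSON output against the expected
# fields before it enters downstream systems. Field names are
# hypothetical examples, not a standard rent roll schema.
import json

REQUIRED_FIELDS = {"tenant", "base_rent", "lease_start", "lease_end"}

def parse_rent_roll_row(raw: str) -> dict:
    """Parse model output; reject invalid JSON or missing/extra fields."""
    row = json.loads(raw)  # raises ValueError on malformed JSON
    missing = REQUIRED_FIELDS - row.keys()
    extra = row.keys() - REQUIRED_FIELDS
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing}, extra={extra}")
    return row
```

Running a guard like this over a few hundred rows of your actual documents is the fastest way to compare structured-output fidelity across candidate models.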

Cost at scale. A single site screening campaign can generate 500-2,000 model calls. At $0.015 per 1k tokens for GPT-4o, that compounds quickly. For high-volume, lower-complexity tasks, route to Flash or a hosted Llama variant.
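The arithmetic is simple enough to put in a helper and run before every campaign. The per-1k-token rates below are placeholders (the GPT-4o figure is taken from the text above); substitute current vendor pricing, which typically also splits input and output rates.

```python
# Back-of-envelope cost model for a screening campaign. Rates are
# placeholders per 1k tokens; substitute current vendor pricing.

RATE_PER_1K = {
    "gpt-4o": 0.015,        # figure from the text above
    "gemini-flash": 0.001,  # assumed low-cost tier
    "llama-hosted": 0.003,  # assumed hosted rate
}

def campaign_cost(calls: int, tokens_per_call: int, model: str) -> float:
    """Total dollars for `calls` requests of `tokens_per_call` tokens each."""
    return calls * tokens_per_call / 1_000 * RATE_PER_1K[model]
```

At 2,000 calls of 3,000 tokens each, the GPT-4o rate above yields $90 per campaign versus $6 on the assumed Flash rate, which is the kind of gap that justifies routing high-volume screening to a cheaper model.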

Latency. Synchronous workflows (a team member waiting on a result) tolerate up to 30 seconds. Batch overnight jobs can tolerate 5-10 minutes. Match latency requirements to model speed before deploying at scale.

Build vs. Buy: The Model Layer Question

Most enterprise CRE platforms bundle a default model. That default may not be optimal for your most important workflows. The right question is not which platform uses which model -- it is whether the platform allows model routing or override.

AI-native firms building CRE workflows on the full stack have the flexibility to match model to task. For a 12-workflow development operation, model optimization across the stack can cut inference cost 40-60% and materially improve analytical accuracy on the highest-stakes outputs.

The selection framework is not complicated. Classify workflows by document intensity, reasoning depth, multimodal requirement and volume. Test two or three models on representative inputs. The performance gap is visible within a day.