The Architecture: Three Layers
Behind every enterprise assistant lies a repeating pattern. It doesn't matter whether it's for HR, compliance, or B2B sales — the structure is the same:
┌─────────────────────────────────────────────────┐
│                 INTERFACE LAYER                 │
│   Chat · Dashboard · API · Webhook · VoiceERP   │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│               ORCHESTRATION LAYER               │
│    MCP Server · Connectors · Context Router     │
│     Flexible Instructions · Session Memory      │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│               INTELLIGENCE LAYER                │
│         Pre-trained / Fine-tuned Model          │
│      RAG (Retrieval Augmented Generation)       │
│    Vector DB · Document Processing Pipeline     │
└─────────────────────────────────────────────────┘
Layer 1: Intelligence — The Trained Model
The base model knows nothing about your company. Fine-tuning and RAG are the two strategies for giving it context — and in 2026, the range of models available for fine-tuning is broader and more accessible than ever.
Fine-tuning adjusts the model's weights with domain-specific data. It's used when you need the model to reason like an expert in your vertical — a labor attorney, a compliance auditor, a credit risk analyst.
RAG (Retrieval Augmented Generation) doesn't modify the model. Instead, it searches for relevant information in a vector database before each response. It's ideal for frequently changing data — regulations, internal policies, product catalogs.
Most implementations I saw at LATU use both. Fine-tuning for the domain's tone and logic. RAG for the live data.
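Wired together, the hybrid is simpler than it sounds. Here's a minimal sketch, assuming a fine-tuned model served behind an OpenAI-compatible API; the model ID and the `search_index` object are placeholders, not a specific product's API:

```python
# Hybrid pattern: the fine-tuned model supplies domain tone and logic,
# RAG supplies the live facts. Model ID and search_index are placeholders.
from openai import OpenAI

client = OpenAI()

def answer(question: str, search_index) -> str:
    # 1. RAG step: fetch the freshest relevant passages
    passages = search_index.search(question, top_k=5)  # hypothetical index API
    context = "\n\n".join(p.text for p in passages)

    # 2. The fine-tuned model reasons over them in the domain's voice
    response = client.chat.completions.create(
        model="ft:gpt-4o-mini-2024-07-18:acme::abc123",  # illustrative FT model ID
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content
```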
Which Model to Choose for Fine-tuning?
Not all models can be fine-tuned, and not all that can are suited to the same tasks. Here's the real landscape, starting with the managed APIs:
| Model | Provider | Training cost (per 1M tokens) | FT Inference (input / output per 1M) | Best for |
|---|---|---|---|---|
| GPT-4o-mini | OpenAI | $3.00 | $0.30 / $1.20 | Best overall cost/quality ratio |
| GPT-4o | OpenAI | $25.00 | $3.75 / $15.00 | Maximum accuracy on complex tasks |
| GPT-4.1 | OpenAI | $25.00 | Similar to GPT-4o FT | Advanced reasoning, long instructions |
| Gemini 2.5 Flash | Google Vertex | ~$2-4 | Same price as base model | High volume — no FT surcharge |
| Gemini 2.5 Pro | Google Vertex | Variable | Same price as base model | Complex analysis, multimodal |
| Mistral Nemo (12B) | Mistral | $1.00 (+$2/mo storage) | Base rate | Minimal cost, good Spanish |
| Claude 3 Haiku | AWS Bedrock | ~$4.00 | Requires Provisioned Throughput | Only fine-tunable Claude (limited) |
And the open-weight models you can fine-tune on your own hardware:

| Model | Parameters | VRAM with QLoRA | License | Strength |
|---|---|---|---|---|
| Llama 4 Scout | 17B active (MoE 16 experts) | ~20 GB | Community License | 200+ languages, multimodal |
| Llama 3.1 8B | 8B | ~10-16 GB | Community License | Affordable entry to fine-tuning |
| Qwen 2.5 72B | 72B | ~35 GB (QLoRA) | Apache 2.0 | Best open-weight model for Spanish and Portuguese |
| Qwen3-235B | 22B active (MoE) | Variable | Apache 2.0 | Top multilingual benchmarks |
| Phi-4-mini | 3.8B | ~6 GB | MIT | Edge deployment, 128K context |
| DeepSeek-V3.1 | MoE | Variable | MIT | Ultra-affordable ($0.15/1M input) |
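Those VRAM figures come from 4-bit quantization plus low-rank adapters (QLoRA). For reference, here is a minimal sketch of such a run with the Hugging Face stack, assuming recent versions of transformers, peft, trl, and bitsandbytes; the model ID and dataset path are illustrative:

```python
# Minimal QLoRA fine-tuning sketch. Assumes a JSONL dataset with a "text"
# field (SFTTrainer's default); model ID and paths are illustrative.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM from the table

# 4-bit quantization is what keeps the VRAM requirements in the table low
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

# Low-rank adapters: only a small fraction of parameters are trained
lora = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear",
                  task_type="CAUSAL_LM")

dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora,
    args=SFTConfig(output_dir="ft-out", num_train_epochs=3,
                   per_device_train_batch_size=2),
)
trainer.train()
trainer.save_model("ft-out")  # saves only the small adapter weights
```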
The Dominant Pattern in 2026: Distillation
The most common strategy in production is no longer fine-tuning a large model. It's using a large model (GPT-4.1, Claude Opus) as a teacher to generate synthetic training data, and then fine-tuning a smaller model as a student for production:
// Pseudocode: Distillation teacher model → student
function distill(teacher_model, student_model, domain_data):
    // 1. The teacher generates high-quality responses
    synthetic_dataset = []
    for each example in domain_data:
        response = teacher_model.generate(example, quality="maximum")
        synthetic_dataset.add({ input: example, output: response })

    // 2. The student learns from the teacher
    student_model.fine_tune(
        dataset = synthetic_dataset,
        method = "LoRA",  // memory-efficient
        epochs = 3
    )

    // Result: near-teacher quality, 10x lower inference cost
    return student_model
This delivers quality close to the large model with 10x cheaper inference and faster response times. For LATAM, where every infrastructure dollar counts, it's the difference between a viable project and one that dies in the budget.
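In practice, the teacher step is just batch generation plus a JSONL export in the provider's fine-tuning format. A minimal sketch with the OpenAI SDK; the teacher and student model IDs and the prompts file are illustrative:

```python
# Distillation, concretely: the teacher generates gold answers, the student
# is fine-tuned on them. Model names and domain_prompts.txt are illustrative.
import json
from openai import OpenAI

client = OpenAI()
TEACHER = "gpt-4.1"
STUDENT = "gpt-4o-mini-2024-07-18"

prompts = [line.strip() for line in open("domain_prompts.txt") if line.strip()]

with open("distilled.jsonl", "w") as f:
    for prompt in prompts:
        answer = client.chat.completions.create(
            model=TEACHER,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # OpenAI fine-tuning expects one chat transcript per JSONL line
        f.write(json.dumps({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]}) + "\n")

# Hand the synthetic dataset to the managed fine-tuning service
training_file = client.files.create(file=open("distilled.jsonl", "rb"),
                                    purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model=STUDENT)
print(job.id)
```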
RAG vs Fine-tuning: When to Use Each
| Criterion | RAG | Fine-tuning | Hybrid |
|---|---|---|---|
| Frequently changing data | Excellent | Poor (static) | Best of both |
| Latency | Higher (search step) | Lower (sub-second) | Variable |
| Data privacy | Data in your DB | Data embedded in model | Combined |
| Accuracy on specific tasks | Good | Excellent | Maximum |
| Initial cost | Low | Medium | High |
| Maintenance | Update knowledge base | Re-train periodically | Both |
In practice, the two combine into one training pipeline:

// Pseudocode: Enterprise training pipeline
function train_enterprise_model(config):
    // 1. Data collection
    documents = process_sources([
        pdf_files,
        legal_contracts,
        training_videos,  // automatic transcription
        historical_emails,
        knowledge_bases
    ])

    // 2. Dataset preparation
    dataset = []
    all_chunks = []
    for each document in documents:
        chunks = split_into_fragments(document, size=512_tokens)
        for each chunk in chunks:
            all_chunks.add(chunk)
            // Generate domain-specific question-answer pairs
            pair = generate_training_pair(chunk, context=config.vertical)
            dataset.add(pair)

    // 3. Fine-tuning the base model
    model = load_base_model(config.base_model)  // e.g. "gpt-4o-mini"
    model.fine_tune(
        dataset = dataset,
        epochs = 4,
        learning_rate = 1e-5,
        validation = 0.2  // 20% held out for validation
    )

    // 4. Building the vector index (RAG)
    vector_db = create_vector_index()
    for each chunk in all_chunks:
        embedding = generate_embedding(chunk)
        vector_db.insert(embedding, metadata=chunk.source)

    return { model, vector_db }
Layer 2: Orchestration — The MCP Server
This is where the magic becomes engineering. The MCP Server (Model Context Protocol) acts as the operational brain: it receives requests, decides which tools to use, queries the right context, and orchestrates the response.
Think of MCP as an orchestra conductor. It doesn't play any instrument, but it knows exactly when each one comes in.
// Pseudocode: Enterprise MCP Server
server MCP_Enterprise:

    tools = register_tools([
        // Connectors to existing systems
        ERPConnector { endpoint: "polpo.api/v2", auth: oauth2 },
        HRConnector { endpoint: "hr-system.internal/api" },
        LegalConnector { endpoint: "legal-docs.s3/bucket" },
        CRMConnector { endpoint: "crm.empresa.com/graphql" },

        // Document processors
        PDFProcessor { ocr: true, languages: ["es", "en", "pt"] },
        VideoProcessor { transcription: whisper, summary: true },
        EmailProcessor { classification: automatic },

        // Analysis tools
        RiskAnalyzer { model: "fine-tuned-risk-v3" },
        ReportGenerator { templates: "compliance-2026" },
        RegulatoryValidator { jurisdiction: "UY", updated: "2026-02" }
    ])

    flexible_instructions = load_instructions([
        // The business defines the rules without code
        "Always verify compliance before approving contracts > 50K USD",
        "Escalate to human if model confidence < 0.85",
        "Use formal tone for external communications",
        "Include regulatory reference in legal responses",
        "Do not process sensitive data without explicit consent"
    ])

    function handle_request(message, user_context):
        // 1. Classify intent
        intent = classify(message)  // hr | legal | commercial | compliance

        // 2. Retrieve relevant context (RAG)
        rag_context = vector_db.search(
            query = message,
            filters = { vertical: intent, company: user_context.company },
            top_k = 5
        )

        // 3. Select required tools
        active_tools = select_tools(intent, message)

        // 4. Execute with flexible instructions
        response = model.generate(
            prompt = build_prompt(message, rag_context, flexible_instructions),
            tools = active_tools,
            max_tokens = 2048,
            temperature = 0.3  // low for enterprise precision
        )

        // 5. Validate before delivering
        if response.confidence < 0.85:
            return escalate_to_human(response, user_context)

        log_audit(message, response, active_tools)
        return response
The interesting part is that flexible instructions allow non-technical teams to modify the system's behavior. A compliance manager can add a new rule — "verify against the new BCU 2026 regulation" — without anyone touching a single line of code.
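The mechanism is deliberately mundane: the rules live in a file (or a database table) that the business edits, and the server folds them into every request. Here's a minimal sketch of the tool side using the official MCP Python SDK (the `mcp` package); the tool logic and `rules.txt` are illustrative placeholders:

```python
# Flexible instructions as editable data, served over MCP.
# Tool logic and rules.txt are illustrative placeholders.
from pathlib import Path
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("enterprise-assistant")

def load_instructions() -> list[str]:
    # Non-technical staff edit this file; no deploy needed
    return [r.strip() for r in Path("rules.txt").read_text().splitlines()
            if r.strip()]

@mcp.tool()
def review_contract(contract_id: str) -> str:
    """Summarize a contract and list the business rules that apply to it."""
    rules = load_instructions()
    applicable = [r for r in rules if "contract" in r.lower()]
    # A real implementation would fetch the contract from the ERP first
    return (f"Contract {contract_id}: {len(applicable)} business rules apply:\n"
            + "\n".join(applicable))

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; MCP clients connect to this
```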
Layer 3: Interface — Where the User Interacts
The interface layer varies by use case. It can be a conversational chat, a dashboard with metrics, a REST API for integrations, or even a voice interface like Polpo's VoiceERP.
What ties everything together is the automated workflow platform: workflows triggered by business events.
// Pseudocode: Automated workflow engine
workflow NewContracts:
    trigger: "new_contract_uploaded"
    steps:
        1. PDFProcessor.extract(contract)
        2. RiskAnalyzer.evaluate(contract.clauses)
        3. RegulatoryValidator.verify(contract, jurisdiction="UY")
        4. if risk > HIGH_THRESHOLD:
               notify(legal_team, urgency="high")
               generate_risk_report(contract)
           else:
               auto_approve(contract)
               notify(requester, status="approved")
        5. register_in_erp(contract, status)

workflow EmployeeOnboarding:
    trigger: "new_employee_registered"
    steps:
        1. HRConnector.get_profile(employee)
        2. generate_documents([
               employment_contract,
               confidentiality_agreement,
               personal_data_policy
           ], personalized=employee)
        3. EmailConnector.send(employee, documents)
        4. schedule_followup(days=7, type="onboarding_check")

workflow ComplianceAudit:
    trigger: cron("first_monday_each_month")
    steps:
        1. documents = LegalConnector.get_all(period="last_month")
        2. for each doc in documents:
               result = RegulatoryValidator.verify(doc)
               if result.observations:
                   flag_for_review(doc, result)
        3. ReportGenerator.create("monthly_compliance", results)
        4. notify(board, report)
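Stripped of the platform, the trigger → steps pattern needs nothing exotic; it's an event registry. A minimal sketch, where the event names and the stub risk check are illustrative:

```python
# Bare-bones event-driven workflow registry, the skeleton behind
# platforms like n8n. Event names and handlers are illustrative.
from typing import Callable

workflows: dict[str, list[Callable[[dict], None]]] = {}

def on(event: str):
    """Register a handler for a business event."""
    def register(handler: Callable[[dict], None]):
        workflows.setdefault(event, []).append(handler)
        return handler
    return register

def emit(event: str, payload: dict) -> None:
    for handler in workflows.get(event, []):
        handler(payload)

@on("new_contract_uploaded")
def review_contract(payload: dict) -> None:
    # Stub risk check; a real pipeline would call the analyzer tools
    risk = 0.9 if payload.get("value_usd", 0) > 50_000 else 0.2
    if risk > 0.7:
        print(f"notify legal team: contract {payload['id']} is high risk")
    else:
        print(f"auto-approved contract {payload['id']}")

emit("new_contract_uploaded", {"id": "C-1042", "value_usd": 82_000})
```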
Document Processing: It's Not Trivial
One of the most impressive capabilities showcased at the event was multi-format processing. We're not talking about merely reading a PDF; we're talking about extracting structured knowledge from any source:
| Format | Process | Complexity |
|---|---|---|
| Text PDF | Direct extraction + chunking | Low |
| Scanned PDF | OCR → text → chunking | Medium |
| Corporate video | Whisper → transcription → summary → embedding | High |
| Email threads | Classification → entity extraction → relationships | Medium |
| Legal contracts | Legal NER → clauses → risk analysis | High |
| Excel spreadsheets | Schema detection → normalization → embedding | Medium |
| Meeting audio | Diarization → transcription → action items | High |
// Pseudocode: Multi-format processing pipeline
function process_document(file):
    type = detect_type(file)

    switch type:
        case "text_pdf":
            text = extract_pdf_text(file)
        case "scanned_pdf":
            text = ocr_process(file, language="es")
        case "video":
            audio = extract_audio(file)
            text = whisper.transcribe(audio, language="es")
            text = summarize_transcription(text)
        case "email":
            text = parse_email(file)
            entities = extract_entities(text)
            return { text, entities, type: "email" }

    // Intelligent chunking
    chunks = smart_split(text,
        max_size = 512_tokens,
        overlap = 50_tokens,
        preserve = ["paragraphs", "lists", "tables"]
    )

    // Generate embeddings
    for each chunk in chunks:
        chunk.embedding = embedding_model.generate(chunk.text)
        chunk.metadata = {
            source: file.name,
            date: file.date,
            type: type,
            company: file.company,
            section: chunk.detect_section()
        }

    return chunks
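The last two steps (embeddings plus metadata into a vector store) map directly onto off-the-shelf tools. A minimal sketch assuming sentence-transformers and qdrant-client, both of which reappear in the stack recommendations below; the collection name, embedding model, and the sample chunk are illustrative:

```python
# Index processed chunks in Qdrant. Collection name, embedding model,
# and the sample chunk are illustrative.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("intfloat/multilingual-e5-small")  # handles es/pt/en
client = QdrantClient(url="http://localhost:6333")

client.recreate_collection(
    collection_name="enterprise_docs",
    vectors_config=VectorParams(
        size=encoder.get_sentence_embedding_dimension(),
        distance=Distance.COSINE,
    ),
)

chunks = [  # output of process_document(); one sample shown
    {"text": "Employees accrue 20 vacation days per year.",
     "source": "hr-policy.pdf", "type": "text_pdf"},
]
client.upsert(
    collection_name="enterprise_docs",
    points=[
        PointStruct(id=i, vector=encoder.encode(c["text"]).tolist(),
                    payload={"source": c["source"], "type": c["type"]})
        for i, c in enumerate(chunks)
    ],
)

hits = client.search(
    collection_name="enterprise_docs",
    query_vector=encoder.encode("vacation policy").tolist(),
    limit=5,
)
print(hits[0].payload["source"])
```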
Costs: The Reality of Building This
This is where most articles get vague. I won't. These are the real numbers for building and maintaining an enterprise AI system in 2026.
Fine-tuning: The Cost of Training
For a typical enterprise dataset of 5M tokens trained for 4 epochs (20M billed training tokens):
| Route | Model | Training cost | FT Inference (per 1M tokens) |
|---|---|---|---|
| API managed | GPT-4o-mini (OpenAI) | ~$60 USD | $0.30 / $1.20 |
| API managed | GPT-4o (OpenAI) | ~$500 USD | $3.75 / $15.00 |
| API managed | Gemini 2.5 Flash (Vertex) | ~$10-20 USD | No surcharge vs base |
| API managed | Mistral Nemo 12B | ~$20 USD (+$2/mo) | Base rate |
| Platform | Together AI (Llama/Qwen LoRA) | ~$10 USD | No surcharge vs base |
| Platform | Fireworks AI (Llama/DeepSeek) | ~$10 USD | No surcharge vs base |
| Self-hosted | Qwen 2.5 72B (QLoRA, A100 80GB) | ~$50-100 USD (GPU-hours) | Your infrastructure |
| Self-hosted | Llama 4 Scout (QLoRA, A100 40GB) | ~$30-60 USD (GPU-hours) | Your infrastructure |
| AWS Bedrock | Claude 3 Haiku | Variable (~$4 per 1M tokens) | Provisioned Throughput required |
Key insight: Self-hosted with QLoRA on an A100 80GB GPU (~$2.50/hour) lets you fine-tune a 70B-parameter model for under $100. An RTX 4090 (~$0.50/hour) is enough for models up to 13B. The barrier to entry is no longer hardware — it's knowing which data to use.
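The arithmetic behind the table is worth making explicit, since billed training tokens are simply dataset size times epochs. A quick sanity check against the rates quoted above:

```python
# Sanity-check the training costs quoted above.
# rate_per_m = training price per 1M tokens from the provider tables.
def training_cost(dataset_tokens: float, epochs: int, rate_per_m: float) -> float:
    billed_tokens = dataset_tokens * epochs
    return billed_tokens / 1_000_000 * rate_per_m

print(training_cost(5_000_000, 4, 3.00))   # GPT-4o-mini: 60.0 USD
print(training_cost(5_000_000, 4, 25.00))  # GPT-4o: 500.0 USD
print(training_cost(5_000_000, 4, 1.00))   # Mistral Nemo: 20.0 USD

# Self-hosted: cost is GPU-hours, not tokens
print(40 * 2.50)  # ~40 h on an A100 80GB at ~$2.50/h ≈ 100 USD for a 70B QLoRA
```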
MCP Server: Infrastructure
| Scenario | Estimated monthly cost |
|---|---|
| Prototype (Cloudflare Workers / Vercel) | $0 - $50 USD |
| Small production (1 company, <1K req/day) | $200 - $500 USD |
| Medium production (multi-company, <10K req/day) | $2,000 - $6,000 USD |
| Enterprise with high availability | $10,000 - $25,000 USD |
Initial Development
| Component | Estimated hours | Approximate cost |
|---|---|---|
| Basic MCP Server | 60 - 110 h | $9,000 - $15,000 USD |
| MCP Server with enterprise connectors | 160 - 260 h | $15,000 - $35,000 USD |
| Document processing pipeline | 80 - 150 h | $10,000 - $20,000 USD |
| Fine-tuning pipeline + RAG | 40 - 80 h | $5,000 - $12,000 USD |
| Workflow platform | 120 - 200 h | $15,000 - $30,000 USD |
| Complete system total | 460 - 800 h | $54,000 - $112,000 USD |
Multi-model: Different Purposes, Different Costs
When a company needs multiple specialized models — one for HR, another for compliance, another for sales — costs multiply but not linearly:
| Configuration | Monthly operating cost |
|---|---|
| 1 API model (GPT-4o-mini FT, moderate usage) | $500 - $2,000 USD |
| 3 specialized API models (OpenAI/Gemini) | $1,500 - $5,000 USD |
| 3 models via Together/Fireworks (Llama/Qwen FT) | $800 - $3,000 USD |
| 3 self-hosted QLoRA models (1x A100 80GB) | $1,800 - $3,000 USD (GPU) |
| Complete platform with workflows | $5,000 - $20,000 USD |
Which Model for Each Vertical
| Vertical | Recommended model | Why |
|---|---|---|
| HR / Talent | GPT-4o-mini FT or Phi-4-mini (on-premise) | Cost/quality for CV parsing, JD generation. Phi-4 if data cannot leave the company |
| Legal / Compliance | Hybrid: Llama 4 Scout FT + RAG | Air-gapped for sensitive documents. RAG for changing regulations. 16-expert MoE delivers precision |
| B2B / Commercial | Gemini 2.5 Flash FT or GPT-4o-mini FT | High request volume, no FT inference surcharge (Gemini). Ideal for customer-facing APIs |
| Multilingual (ES/PT) | Qwen 2.5 72B or Llama 4 Scout | Qwen leads benchmarks in Spanish and Portuguese. Llama 4 trained with 10x more multilingual data than Llama 3 |
Recommended Stack for LATAM
For a company in the region that wants to get started without burning capital:
The managed-API route:

| Component | Recommendation | Monthly cost |
|---|---|---|
| Base model | GPT-4o-mini FT or Llama via Together AI | $500 - $2,000 |
| Vector DB | Qdrant Cloud / Pinecone | $0 - $100 |
| MCP Server | Cloudflare Workers | $0 - $50 |
| Workflows | n8n self-hosted | $0 - $50 |
| Doc processing | Custom pipeline | $100 - $500 |
| Total operating | | $600 - $2,700 USD/mo |
The self-hosted route:

| Component | Recommendation | Monthly cost |
|---|---|---|
| Base model | Qwen 2.5 or Llama 4 Scout (QLoRA, self-hosted) | $1,800 - $3,000 (GPU) |
| Vector DB | Qdrant self-hosted | $0 - $200 |
| MCP Server | Own infrastructure | $200 - $500 |
| Workflows | n8n self-hosted | $0 - $50 |
| Total operating | | $2,000 - $3,750 USD/mo |
What Changes in Practice
The companies I saw at LATU aren't selling "AI." They're selling friction reduction. A contract that took 3 days to review now goes through the pipeline in 15 minutes. An onboarding process that required 4 people is now handled by one person with an assistant.
But the architecture matters. A chatbot with GPT and no enterprise context is a toy. A system with fine-tuning, RAG, MCP connectors, flexible instructions, and automated workflows is a real tool.
The difference between the two isn't the model. It's the engineering around the model.
To Wrap Up
The LATU event showed something that isn't said enough: these solutions are already being built in Latin America. They're not copies of Silicon Valley products — they're implementations adapted to the regulatory, linguistic, and economic realities of the region.
The AI market in LATAM is projected at $30 billion by 2033. But the number that matters isn't the market's — it's your operation's. If a system costing $2,700/mo replaces $9,000/mo in manual labor and does it with greater accuracy and traceability, the decision isn't technological. It's arithmetic.
The architecture is available. Costs have come down. The platforms exist. What's missing, as always, is the decision to start.
Want to explore how to implement any of this in your company? Let's talk.


