The Architecture: Three Layers

Behind every enterprise assistant lies a repeating pattern. It doesn't matter whether it's for HR, compliance, or B2B sales — the structure is the same:

┌─────────────────────────────────────────────────┐
│               INTERFACE LAYER                   │
│  Chat · Dashboard · API · Webhook · VoiceERP    │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│               ORCHESTRATION LAYER               │
│  MCP Server · Connectors · Context Router       │
│  Flexible Instructions · Session Memory         │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│               INTELLIGENCE LAYER                │
│  Pre-trained / Fine-tuned Model                 │
│  RAG (Retrieval Augmented Generation)           │
│  Vector DB · Document Processing Pipeline       │
└─────────────────────────────────────────────────┘

Layer 1: Intelligence — The Trained Model

The base model knows nothing about your company. Fine-tuning and RAG are the two strategies for giving it context — and in 2026, the range of models available for fine-tuning is broader and more accessible than ever.

Fine-tuning adjusts the model's weights with domain-specific data. It's used when you need the model to reason like an expert in your vertical — a labor attorney, a compliance auditor, a credit risk analyst.

RAG (Retrieval Augmented Generation) doesn't modify the model. Instead, it searches for relevant information in a vector database before each response. It's ideal for frequently changing data — regulations, internal policies, product catalogs.

Most implementations I saw at LATU use both. Fine-tuning for the domain's tone and logic. RAG for the live data.

Which Model to Choose for Fine-tuning?

Not all models can be fine-tuned, and not all that can be fine-tuned are suited for the same tasks. Here's the real landscape:

| Model | Provider | Training cost (per 1M tokens) | FT inference (input / output, per 1M) | Best for |
|---|---|---|---|---|
| GPT-4o-mini | OpenAI | $3.00 | $0.30 / $1.20 | Best overall cost/quality ratio |
| GPT-4o | OpenAI | $25.00 | $3.75 / $15.00 | Maximum accuracy on complex tasks |
| GPT-4.1 | OpenAI | $25.00 | Similar to GPT-4o FT | Advanced reasoning, long instructions |
| Gemini 2.5 Flash | Google Vertex | ~$2-4 | Same price as base model | High volume — no FT surcharge |
| Gemini 2.5 Pro | Google Vertex | Variable | Same price as base model | Complex analysis, multimodal |
| Mistral Nemo (12B) | Mistral | $1.00 (+$2/mo storage) | Base rate | Minimal cost, good Spanish |
| Claude 3 Haiku | AWS Bedrock | ~$4.00/1K tokens | Requires Provisioned Throughput | Only fine-tuneable Claude (limited) |

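If you take the managed-API route, launching a job is short. Here's a minimal sketch with the OpenAI Python SDK, assuming a chat-format train.jsonl has already been prepared (the file name and model snapshot are illustrative):

# Python sketch: launching a managed fine-tuning job (OpenAI SDK)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the prepared chat-format dataset
train_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the job on the chosen base model
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4o-mini-2024-07-18",  # snapshot name is illustrative
)
print(job.id, job.status)
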
Open-weight models — the ones you can self-host and fine-tune without depending on an API:

| Model | Parameters | VRAM with QLoRA | License | Strength |
|---|---|---|---|---|
| Llama 4 Scout | 17B active (MoE, 16 experts) | ~20 GB | Community License | 200+ languages, multimodal |
| Llama 3.1 8B | 8B | ~10-16 GB | Community License | Affordable entry to fine-tuning |
| Qwen 2.5 72B | 72B | ~35 GB (QLoRA) | Apache 2.0 | Best open-weight model for Spanish and Portuguese |
| Qwen3-235B | 22B active (MoE) | Variable | Apache 2.0 | Top multilingual benchmarks |
| Phi-4-mini | 3.8B | ~6 GB | MIT | Edge deployment, 128K context |
| DeepSeek-V3.1 | MoE | Variable | MIT | Ultra-affordable ($0.15/1M input) |

Platforms like Together AI (~$0.48/1M tokens for LoRA) and Fireworks AI (~$0.50/1M tokens) allow fine-tuning open-weight models without your own infrastructure and with no inference surcharge.

The Dominant Pattern in 2026: Distillation

The most common strategy in production is no longer fine-tuning a large model. It's using a large model (GPT-4.1, Claude Opus) as a teacher to generate synthetic training data, and then fine-tuning a smaller model as a student for production:

// Pseudocode: Distillation teacher model → student

function distill(teacher_model, student_model, domain_data):
    // 1. The teacher generates high-quality responses
    synthetic_dataset = []
    for each example in domain_data:
        response = teacher_model.generate(example, quality="maximum")
        synthetic_dataset.add({ input: example, output: response })

    // 2. The student learns from the teacher
    student_model.fine_tune(
        dataset  = synthetic_dataset,
        method   = "LoRA",          // memory-efficient
        epochs   = 3
    )

    // Result: near-teacher quality, 10x lower inference cost
    return student_model

This delivers quality close to the large model with 10x cheaper inference and faster response times. For LATAM, where every infrastructure dollar counts, it's the difference between a viable project and one that dies in the budget.
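
For the teacher step in practice, here's a minimal runnable sketch that generates the synthetic pairs and writes them as chat-format JSONL, ready for a fine-tuning job. The teacher model name, prompts, and file path are placeholders:

# Python sketch: teacher step of distillation (synthetic pair generation)
import json

from openai import OpenAI

client = OpenAI()

def build_synthetic_dataset(domain_prompts, out_path="distill_train.jsonl"):
    # Ask the teacher for high-quality answers and store them as JSONL
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in domain_prompts:
            answer = client.chat.completions.create(
                model="gpt-4.1",  # the teacher; any strong model works
                messages=[{"role": "user", "content": prompt}],
            ).choices[0].message.content
            record = {"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": answer},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")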

RAG vs Fine-tuning: When to Use Each

| Criterion | RAG | Fine-tuning | Hybrid |
|---|---|---|---|
| Frequently changing data | Excellent | Poor (static) | Best of both |
| Latency | Higher (search step) | Lower (sub-second) | Variable |
| Data privacy | Data in your DB | Data embedded in model | Combined |
| Accuracy on specific tasks | Good | Excellent | Maximum |
| Initial cost | Low | Medium | High |
| Maintenance | Update knowledge base | Re-train periodically | Both |

The practical rule: RAG for the facts, fine-tuning for the behavior. If your system needs to know what the current BCU regulations say, that's RAG. If it needs to reason like a Uruguayan auditor about those regulations, that's fine-tuning. In production, you use both.

// Pseudocode: Enterprise training pipeline

function train_enterprise_model(config):
    // 1. Data collection
    documents = process_sources([
        pdf_files,
        legal_contracts,
        training_videos,          // automatic transcription
        historical_emails,
        knowledge_bases
    ])

    // 2. Dataset preparation
    dataset    = []
    all_chunks = []
    for each document in documents:
        chunks = split_into_fragments(document, size=512_tokens)
        for each chunk in chunks:
            all_chunks.add(chunk)   // collected for the RAG index (step 4)
            // Generate domain-specific question-answer pairs
            pair = generate_training_pair(chunk, context=config.vertical)
            dataset.add(pair)

    // 3. Fine-tuning the base model
    model = load_base_model(config.base_model)  // e.g.: "gpt-4o-mini"
    model.fine_tune(
        dataset       = dataset,
        epochs        = 4,
        learning_rate = 1e-5,
        validation    = 0.2    // 20% for validation
    )

    // 4. Building the vector index (RAG)
    vector_db = create_vector_index()
    for each chunk in all_chunks:
        embedding = generate_embedding(chunk)
        vector_db.insert(embedding, metadata=chunk.source)

    return { model, vector_db }
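
To make the RAG half concrete, here's a self-contained retrieval sketch using sentence-transformers and an in-memory index. The documents and model choice are placeholders; in production you'd swap the array for Qdrant or Pinecone, as in the stack tables below:

# Python sketch: minimal embedding search for RAG
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

chunks = [
    "BCU regulation X requires quarterly reporting.",        # placeholder docs
    "Vacation policy: 20 days per year after 12 months.",
]
index = model.encode(chunks, normalize_embeddings=True)

def search(query: str, top_k: int = 2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                  # cosine similarity (vectors normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [(chunks[i], float(scores[i])) for i in best]

print(search("how many vacation days do employees get?"))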

Layer 2: Orchestration — The MCP Server

This is where the magic becomes engineering. The MCP Server (Model Context Protocol) acts as the operational brain: it receives requests, decides which tools to use, queries the right context, and orchestrates the response.

Think of MCP as an orchestra conductor. It doesn't play any instrument, but it knows exactly when each one comes in.

// Pseudocode: Enterprise MCP Server

server MCP_Enterprise:

    tools = register_tools([
        // Connectors to existing systems
        ERPConnector        { endpoint: "polpo.api/v2", auth: oauth2 },
        HRConnector         { endpoint: "hr-system.internal/api" },
        LegalConnector      { endpoint: "legal-docs.s3/bucket" },
        CRMConnector        { endpoint: "crm.empresa.com/graphql" },

        // Document processors
        PDFProcessor        { ocr: true, languages: ["es", "en", "pt"] },
        VideoProcessor      { transcription: whisper, summary: true },
        EmailProcessor      { classification: automatic },

        // Analysis tools
        RiskAnalyzer        { model: "fine-tuned-risk-v3" },
        ReportGenerator     { templates: "compliance-2026" },
        RegulatoryValidator { jurisdiction: "UY", updated: "2026-02" }
    ])

    flexible_instructions = load_instructions([
        // The business defines the rules without code
        "Always verify compliance before approving contracts > 50K USD",
        "Escalate to human if model confidence < 0.85",
        "Use formal tone for external communications",
        "Include regulatory reference in legal responses",
        "Do not process sensitive data without explicit consent"
    ])

    function handle_request(message, user_context):
        // 1. Classify intent
        intent = classify(message)  // hr | legal | commercial | compliance

        // 2. Retrieve relevant context (RAG)
        rag_context = vector_db.search(
            query     = message,
            filters   = { vertical: intent, company: user_context.company },
            top_k     = 5
        )

        // 3. Select required tools
        active_tools = select_tools(intent, message)

        // 4. Execute with flexible instructions
        response = model.generate(
            prompt          = build_prompt(message, rag_context, flexible_instructions),
            tools           = active_tools,
            max_tokens      = 2048,
            temperature     = 0.3    // low for enterprise precision
        )

        // 5. Validate before delivering
        if response.confidence < 0.85:
            return escalate_to_human(response, user_context)

        log_audit(message, response, active_tools)
        return response

The interesting part is that flexible instructions allow non-technical teams to modify the system's behavior. A compliance manager can add a new rule — "verify against the new BCU 2026 regulation" — without anyone touching a single line of code.
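
One way to implement that, sketched under the assumption that the rules live in a plain-text file the business team edits directly (the file name and format are illustrative):

# Python sketch: flexible instructions loaded from an editable file
from pathlib import Path

def build_system_prompt(base_prompt: str, rules_file: str = "rules.txt") -> str:
    # Every non-empty, non-comment line becomes a hard rule in the prompt
    lines = Path(rules_file).read_text(encoding="utf-8").splitlines()
    rules = [ln.strip() for ln in lines
             if ln.strip() and not ln.lstrip().startswith("#")]
    return (base_prompt
            + "\n\nRules (always follow):\n"
            + "\n".join(f"- {r}" for r in rules))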

Layer 3: Interface — Where the User Interacts

The interface layer varies by use case. It can be a conversational chat, a dashboard with metrics, a REST API for integrations, or even a voice interface like Polpo's VoiceERP.

What ties everything together is the automated workflow platform: workflows triggered by business events.

// Pseudocode: Automated workflow engine

workflow NewContracts:
    trigger: "new_contract_uploaded"
    steps:
        1. PDFProcessor.extract(contract)
        2. RiskAnalyzer.evaluate(contract.clauses)
        3. RegulatoryValidator.verify(contract, jurisdiction="UY")
        4. if risk > HIGH_THRESHOLD:
              notify(legal_team, urgency="high")
              generate_risk_report(contract)
           else:
              auto_approve(contract)
              notify(requester, status="approved")
        5. register_in_erp(contract, status)

workflow EmployeeOnboarding:
    trigger: "new_employee_registered"
    steps:
        1. HRConnector.get_profile(employee)
        2. generate_documents([
              employment_contract,
              confidentiality_agreement,
              personal_data_policy
           ], personalized=employee)
        3. EmailConnector.send(employee, documents)
        4. schedule_followup(days=7, type="onboarding_check")

workflow ComplianceAudit:
    trigger: cron("first_monday_each_month")
    steps:
        1. documents = LegalConnector.get_all(period="last_month")
        2. for each doc in documents:
              result = RegulatoryValidator.verify(doc)
              if result.observations:
                  flag_for_review(doc, result)
        3. ReportGenerator.create("monthly_compliance", results)
        4. notify(board, report)
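
Under the hood, a trigger/steps engine reduces to an event dispatcher. Here's a minimal sketch of that idea; the event name and risk logic are placeholders, and in production you'd reach for n8n or a similar platform, as recommended in the stack below:

# Python sketch: toy event-driven workflow engine
from collections import defaultdict
from typing import Callable

HANDLERS: dict[str, list[Callable]] = defaultdict(list)

def on(event: str):
    # Register a workflow function for a business event
    def decorator(fn: Callable):
        HANDLERS[event].append(fn)
        return fn
    return decorator

def emit(event: str, payload: dict):
    for handler in HANDLERS[event]:
        handler(payload)

@on("new_contract_uploaded")
def review_contract(payload: dict):
    # Placeholder for extract -> risk analysis -> routing
    risk = 0.9 if "penalty" in payload.get("text", "").lower() else 0.1
    route = "legal_team" if risk > 0.5 else "auto_approve"
    print(f"contract {payload['id']} -> {route}")

emit("new_contract_uploaded", {"id": "C-102", "text": "Penalty clause: ..."})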


Document Processing: It's Not Trivial

One of the most impressive capabilities showcased at the event was multi-format processing. We're not talking about reading a PDF — we're talking about extracting structured knowledge from any source:

| Format | Process | Complexity |
|---|---|---|
| Text PDF | Direct extraction + chunking | Low |
| Scanned PDF | OCR → text → chunking | Medium |
| Corporate video | Whisper → transcription → summary → embedding | High |
| Email threads | Classification → entity extraction → relationships | Medium |
| Legal contracts | Legal NER → clauses → risk analysis | High |
| Excel spreadsheets | Schema detection → normalization → embedding | Medium |
| Meeting audio | Diarization → transcription → action items | High |

// Pseudocode: Multi-format processing pipeline

function process_document(file):
    type = detect_type(file)

    switch type:
        case "text_pdf":
            text = extract_pdf_text(file)
        case "scanned_pdf":
            text = ocr_process(file, language="es")
        case "video":
            audio = extract_audio(file)
            text = whisper.transcribe(audio, language="es")
            text = summarize_transcription(text)
        case "email":
            text = parse_email(file)
            entities = extract_entities(text)
            return { text, entities, type: "email" }

    // Intelligent chunking
    chunks = smart_split(text,
        max_size      = 512_tokens,
        overlap       = 50_tokens,
        preserve      = ["paragraphs", "lists", "tables"]
    )

    // Generate embeddings
    for each chunk in chunks:
        chunk.embedding = embedding_model.generate(chunk.text)
        chunk.metadata  = {
            source:    file.name,
            date:      file.date,
            type:      type,
            company:   file.company,
            section:   chunk.detect_section()
        }

    return chunks
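
The smart_split step is where much of the retrieval quality is won or lost. Here's a simplified, runnable version that works on words rather than true tokens; a production version would count model tokens (e.g. with tiktoken) and respect the preserve rules above:

# Python sketch: chunking with overlap (word-based approximation)
def smart_split(text: str, max_size: int = 512, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # step back to create the overlap window
    return chunks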


Costs: The Reality of Building This

This is where most articles get vague. I won't. These are the real numbers for building and maintaining an enterprise AI system in 2026.

Fine-tuning: The Cost of Training

For a typical enterprise dataset of 5M tokens, 4 epochs:

| Route | Model | Training cost | FT inference (per 1M tokens) |
|---|---|---|---|
| API managed | GPT-4o-mini (OpenAI) | ~$60 USD | $0.30 / $1.20 |
| API managed | GPT-4o (OpenAI) | ~$500 USD | $3.75 / $15.00 |
| API managed | Gemini 2.5 Flash (Vertex) | ~$10-20 USD | No surcharge vs base |
| API managed | Mistral Nemo 12B | ~$20 USD (+$2/mo) | Base rate |
| Platform | Together AI (Llama/Qwen LoRA) | ~$10 USD | No surcharge vs base |
| Platform | Fireworks AI (Llama/DeepSeek) | ~$10 USD | No surcharge vs base |
| Self-hosted | Qwen 2.5 72B (QLoRA, A100 80GB) | ~$50-100 USD (GPU-hours) | Your infrastructure |
| Self-hosted | Llama 4 Scout (QLoRA, A100 40GB) | ~$30-60 USD (GPU-hours) | Your infrastructure |
| AWS Bedrock | Claude 3 Haiku | Variable (~$4/1K tokens) | Provisioned Throughput required |

The recommendation for LATAM: To get started quickly, GPT-4o-mini or Together AI with Llama/Qwen. For high volume, Gemini Flash (no FT inference surcharge) or self-hosted with QLoRA. Training cost is negligible — the real cost is in inference, and that's where smaller models and no-surcharge platforms win by a landslide.

Key insight: Self-hosted with QLoRA on an A100 80GB GPU (~$2.50/hour) lets you fine-tune a 70B-parameter model for under $100. An RTX 4090 (~$0.50/hour) is enough for models up to 13B. The barrier to entry is no longer hardware — it's knowing which data to use.
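
The arithmetic behind the training-cost column is simple enough to sanity-check yourself:

# Python sketch: reproducing the training-cost column
def ft_training_cost(dataset_tokens_m: float, epochs: int, rate_per_m: float) -> float:
    # cost = dataset size (in millions of tokens) x epochs x per-1M rate
    return dataset_tokens_m * epochs * rate_per_m

print(ft_training_cost(5, 4, 3.00))   # GPT-4o-mini   -> $60
print(ft_training_cost(5, 4, 25.00))  # GPT-4o        -> $500
print(ft_training_cost(5, 4, 0.48))   # Together LoRA -> ~$10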

MCP Server: Infrastructure

| Scenario | Estimated monthly cost |
|---|---|
| Prototype (Cloudflare Workers / Vercel) | $0 - $50 USD |
| Small production (1 company, <1K req/day) | $200 - $500 USD |
| Medium production (multi-company, <10K req/day) | $2,000 - $6,000 USD |
| Enterprise with high availability | $10,000 - $25,000 USD |

MCP carries a 30-50% higher cost than a traditional REST API because it needs to maintain session state, persistent memory, and tool orchestration.
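
To see where that overhead comes from, here's a toy version of the session memory an MCP server has to keep alive between requests (a stateless REST endpoint never pays this cost; the structure is illustrative):

# Python sketch: per-session memory kept by the orchestration layer
import time
from collections import defaultdict

SESSIONS: dict[str, list[dict]] = defaultdict(list)

def remember(session_id: str, role: str, content: str):
    SESSIONS[session_id].append({"role": role, "content": content, "ts": time.time()})

def context_for(session_id: str, max_turns: int = 10) -> list[dict]:
    # The last N turns travel with every model call for this session
    return SESSIONS[session_id][-max_turns:]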

Initial Development

| Component | Estimated hours | Approximate cost |
|---|---|---|
| Basic MCP Server | 60 - 110 h | $9,000 - $15,000 USD |
| MCP Server with enterprise connectors | 160 - 260 h | $15,000 - $35,000 USD |
| Document processing pipeline | 80 - 150 h | $10,000 - $20,000 USD |
| Fine-tuning pipeline + RAG | 40 - 80 h | $5,000 - $12,000 USD |
| Workflow platform | 120 - 200 h | $15,000 - $30,000 USD |
| Complete system total | 460 - 800 h | $54,000 - $112,000 USD |

Multi-model: Different Purposes, Different Costs

When a company needs multiple specialized models — one for HR, another for compliance, another for sales — costs multiply but not linearly:

| Configuration | Monthly operating cost |
|---|---|
| 1 API model (GPT-4o-mini FT, moderate usage) | $500 - $2,000 USD |
| 3 specialized API models (OpenAI/Gemini) | $1,500 - $5,000 USD |
| 3 models via Together/Fireworks (Llama/Qwen FT) | $800 - $3,000 USD |
| 3 self-hosted QLoRA models (1x A100 80GB) | $1,800 - $3,000 USD (GPU) |
| Complete platform with workflows | $5,000 - $20,000 USD |

The multi-LoRA strategy is key here: platforms like Together AI allow serving multiple LoRA adapters on a single base model, without multiplying infrastructure costs. Three specialized models for the cost of one.
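
As a self-hosted illustration of the same idea, Hugging Face PEFT can mount several LoRA adapters on one base model and switch between them per request. The adapter paths and names here are hypothetical:

# Python sketch: serving three LoRA adapters on one base model (PEFT)
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# One copy of the base weights, several small adapters on top
model = PeftModel.from_pretrained(base, "adapters/hr", adapter_name="hr")
model.load_adapter("adapters/legal", adapter_name="legal")
model.load_adapter("adapters/sales", adapter_name="sales")

model.set_adapter("legal")  # route a compliance request
# ... generate ...
model.set_adapter("hr")     # route an HR request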

Which Model for Each Vertical

| Vertical | Recommended model | Why |
|---|---|---|
| HR / Talent | GPT-4o-mini FT or Phi-4-mini (on-premise) | Cost/quality for CV parsing, JD generation. Phi-4 if data cannot leave the company |
| Legal / Compliance | Hybrid: Llama 4 Scout FT + RAG | Air-gapped for sensitive documents. RAG for changing regulations. 16-expert MoE delivers precision |
| B2B / Commercial | Gemini 2.5 Flash FT or GPT-4o-mini FT | High request volume, no FT inference surcharge (Gemini). Ideal for customer-facing APIs |
| Multilingual (ES/PT) | Qwen 2.5 72B or Llama 4 Scout | Qwen leads benchmarks in Spanish and Portuguese. Llama 4 trained with 10x more multilingual data than Llama 3 |

Recommended Stack for LATAM

For a company in the region that wants to get started without burning capital:

| Component | Recommendation | Monthly cost |
|---|---|---|
| Base model | GPT-4o-mini FT or Llama via Together AI | $500 - $2,000 |
| Vector DB | Qdrant Cloud / Pinecone | $0 - $100 |
| MCP Server | Cloudflare Workers | $0 - $50 |
| Workflows | n8n self-hosted | $0 - $50 |
| Doc processing | Custom pipeline | $100 - $500 |
| Total operating | | $600 - $2,700 USD/mo |

For companies that need data sovereignty (legal, healthcare, government):

| Component | Recommendation | Monthly cost |
|---|---|---|
| Base model | Qwen 2.5 or Llama 4 Scout (QLoRA, self-hosted) | $1,800 - $3,000 (GPU) |
| Vector DB | Qdrant self-hosted | $0 - $200 |
| MCP Server | Own infrastructure | $200 - $500 |
| Workflows | n8n self-hosted | $0 - $50 |
| Total operating | | $2,000 - $3,750 USD/mo |

Compared to hiring a team of 3 people to do the same work manually: $4,500 - $9,000 USD/mo in salaries for the region. And the system doesn't take vacations, doesn't call in sick, and scales without linear costs.


What Changes in Practice

The companies I saw at LATU aren't selling "AI." They're selling friction reduction. A contract that took 3 days to review now goes through the pipeline in 15 minutes. An onboarding process that required 4 people is now handled by 1 with an assistant.

But the architecture matters. A chatbot with GPT and no enterprise context is a toy. A system with fine-tuning, RAG, MCP connectors, flexible instructions, and automated workflows is a real tool.

The difference between the two isn't the model. It's the engineering around the model.


To Wrap Up

The LATU event showed something that isn't said enough: these solutions are already being built in Latin America. They're not copies of Silicon Valley products — they're implementations adapted to the regulatory, linguistic, and economic realities of the region.

The AI market in LATAM is projected at $30 billion by 2033. But the number that matters isn't the market's — it's your operation's. If a system costing $2,700/mo replaces $9,000/mo in manual labor and does it with greater accuracy and traceability, the decision isn't technological. It's arithmetic.

The architecture is available. Costs have come down. The platforms exist. What's missing, as always, is the decision to start.


Want to explore how to implement any of this in your company? Let's talk.