The Architecture: Three Layers

Behind every enterprise assistant lies a repeating pattern. It doesn't matter whether it's for HR, compliance, or B2B sales — the structure is the same:

┌─────────────────────────────────────────────────┐
│               INTERFACE LAYER                   │
│  Chat · Dashboard · API · Webhook · VoiceERP    │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│               ORCHESTRATION LAYER               │
│  MCP Server · Connectors · Context Router       │
│  Flexible Instructions · Session Memory         │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│               INTELLIGENCE LAYER                │
│  Pre-trained / Fine-tuned Model                 │
│  RAG (Retrieval Augmented Generation)           │
│  Vector DB · Document Processing Pipeline       │
└─────────────────────────────────────────────────┘

Layer 1: Intelligence — The Trained Model

The base model knows nothing about your company. Fine-tuning and RAG are the two strategies for giving it context — and in 2026, the range of models available for fine-tuning is broader and more accessible than ever.

Fine-tuning adjusts the model's weights with domain-specific data. It's used when you need the model to reason like an expert in your vertical — a labor attorney, a compliance auditor, a credit risk analyst.

RAG (Retrieval Augmented Generation) doesn't modify the model. Instead, it searches for relevant information in a vector database before each response. It's ideal for frequently changing data — regulations, internal policies, product catalogs.

Most implementations I saw at LATU use both. Fine-tuning for the domain's tone and logic. RAG for the live data.

Which Model to Choose for Fine-tuning?

Not all models can be fine-tuned, and not all that can be fine-tuned are suited for the same tasks. Here's the real landscape:

| Model | Provider | Training cost (per 1M tokens) | FT inference (input / output, per 1M) | Best for |
|---|---|---|---|---|
| GPT-4o-mini | OpenAI | $3.00 | $0.30 / $1.20 | Best overall cost/quality ratio |
| GPT-4o | OpenAI | $25.00 | $3.75 / $15.00 | Maximum accuracy on complex tasks |
| GPT-4.1 | OpenAI | $25.00 | Similar to GPT-4o FT | Advanced reasoning, long instructions |
| Gemini 2.5 Flash | Google Vertex | ~$2-4 | Same price as base model | High volume — no FT surcharge |
| Gemini 2.5 Pro | Google Vertex | Variable | Same price as base model | Complex analysis, multimodal |
| Mistral Nemo (12B) | Mistral | $1.00 (+$2/mo storage) | Base rate | Minimal cost, good Spanish |
| Claude 3 Haiku | AWS Bedrock | ~$4.00/1K tokens | Requires Provisioned Throughput | Only fine-tuneable Claude (limited) |

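If you take the managed-API route, launching a job is short. Here's a minimal sketch with the OpenAI Python SDK, assuming a chat-format train.jsonl has already been prepared (the file name and model snapshot are illustrative):

# Python sketch: launching a managed fine-tuning job (OpenAI SDK)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the prepared chat-format dataset
train_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the job on the chosen base model
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4o-mini-2024-07-18",  # snapshot name is illustrative
)
print(job.id, job.status)
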
Open-weight models — the ones you can self-host and fine-tune without depending on an API:

| Model | Parameters | VRAM with QLoRA | License | Strength |
|---|---|---|---|---|
| Llama 4 Scout | 17B active (MoE, 16 experts) | ~20 GB | Community License | 200+ languages, multimodal |
| Llama 3.1 8B | 8B | ~10-16 GB | Community License | Affordable entry to fine-tuning |
| Qwen 2.5 72B | 72B | ~35 GB (QLoRA) | Apache 2.0 | Best open-weight model for Spanish and Portuguese |
| Qwen3-235B | 22B active (MoE) | Variable | Apache 2.0 | Top multilingual benchmarks |
| Phi-4-mini | 3.8B | ~6 GB | MIT | Edge deployment, 128K context |
| DeepSeek-V3.1 | MoE | Variable | MIT | Ultra-affordable ($0.15/1M input) |

Platforms like Together AI (~$0.48/1M tokens for LoRA) and Fireworks AI (~$0.50/1M tokens) allow fine-tuning open-weight models without your own infrastructure and with no inference surcharge.

The Dominant Pattern in 2026: Distillation

The most common strategy in production is no longer fine-tuning a large model. It's using a large model (GPT-4.1, Claude Opus) as a teacher to generate synthetic training data, and then fine-tuning a smaller model as a student for production:

// Pseudocode: Distillation teacher model → student

function distill(teacher_model, student_model, domain_data):
    // 1. The teacher generates high-quality responses
    synthetic_dataset = []
    for each example in domain_data:
        response = teacher_model.generate(example, quality="maximum")
        synthetic_dataset.add({ input: example, output: response })

    // 2. The student learns from the teacher
    student_model.fine_tune(
        dataset  = synthetic_dataset,
        method   = "LoRA",          // memory-efficient
        epochs   = 3
    )

    // Result: near-teacher quality, 10x lower inference cost
    return student_model

This delivers quality close to the large model with 10x cheaper inference and faster response times. For LATAM, where every infrastructure dollar counts, it's the difference between a viable project and one that dies in the budget.
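
For the teacher step in practice, here's a minimal runnable sketch that generates the synthetic pairs and writes them as chat-format JSONL, ready for a fine-tuning job. The teacher model name, prompts, and file path are placeholders:

# Python sketch: teacher step of distillation (synthetic pair generation)
import json

from openai import OpenAI

client = OpenAI()

def build_synthetic_dataset(domain_prompts, out_path="distill_train.jsonl"):
    # Ask the teacher for high-quality answers and store them as JSONL
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in domain_prompts:
            answer = client.chat.completions.create(
                model="gpt-4.1",  # the teacher; any strong model works
                messages=[{"role": "user", "content": prompt}],
            ).choices[0].message.content
            record = {"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": answer},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")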

RAG vs Fine-tuning: When to Use Each

| Criterion | RAG | Fine-tuning | Hybrid |
|---|---|---|---|
| Frequently changing data | Excellent | Poor (static) | Best of both |
| Latency | Higher (search step) | Lower (sub-second) | Variable |
| Data privacy | Data in your DB | Data embedded in model | Combined |
| Accuracy on specific tasks | Good | Excellent | Maximum |
| Initial cost | Low | Medium | High |
| Maintenance | Update knowledge base | Re-train periodically | Both |

The practical rule: RAG for the facts, fine-tuning for the behavior. If your system needs to know what the current BCU regulations say, that's RAG. If it needs to reason like a Uruguayan auditor about those regulations, that's fine-tuning. In production, you use both.

// Pseudocode: Enterprise training pipeline

function train_enterprise_model(config):
    // 1. Data collection
    documents = process_sources([
        pdf_files,
        legal_contracts,
        training_videos,          // automatic transcription
        historical_emails,
        knowledge_bases
    ])

    // 2. Dataset preparation
    dataset    = []
    all_chunks = []
    for each document in documents:
        chunks = split_into_fragments(document, size=512_tokens)
        for each chunk in chunks:
            all_chunks.add(chunk)   // collected for the RAG index (step 4)
            // Generate domain-specific question-answer pairs
            pair = generate_training_pair(chunk, context=config.vertical)
            dataset.add(pair)

    // 3. Fine-tuning the base model
    model = load_base_model(config.base_model)  // e.g.: "gpt-4o-mini"
    model.fine_tune(
        dataset       = dataset,
        epochs        = 4,
        learning_rate = 1e-5,
        validation    = 0.2    // 20% for validation
    )

    // 4. Building the vector index (RAG)
    vector_db = create_vector_index()
    for each chunk in all_chunks:
        embedding = generate_embedding(chunk)
        vector_db.insert(embedding, metadata=chunk.source)

    return { model, vector_db }
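
To make the RAG half concrete, here's a self-contained retrieval sketch using sentence-transformers and an in-memory index. The documents and model choice are placeholders; in production you'd swap the array for Qdrant or Pinecone, as in the stack tables below:

# Python sketch: minimal embedding search for RAG
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

chunks = [
    "BCU regulation X requires quarterly reporting.",        # placeholder docs
    "Vacation policy: 20 days per year after 12 months.",
]
index = model.encode(chunks, normalize_embeddings=True)

def search(query: str, top_k: int = 2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                  # cosine similarity (vectors normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [(chunks[i], float(scores[i])) for i in best]

print(search("how many vacation days do employees get?"))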

Layer 2: Orchestration — The MCP Server

This is where the magic becomes engineering. The MCP Server (Model Context Protocol) acts as the operational brain: it receives requests, decides which tools to use, queries the right context, and orchestrates the response.

Think of MCP as an orchestra conductor. It doesn't play any instrument, but it knows exactly when each one comes in.

// Pseudocode: Enterprise MCP Server

server MCP_Enterprise:

    tools = register_tools([
        // Connectors to existing systems
        ERPConnector        { endpoint: "polpo.api/v2", auth: oauth2 },
        HRConnector         { endpoint: "hr-system.internal/api" },
        LegalConnector      { endpoint: "legal-docs.s3/bucket" },
        CRMConnector        { endpoint: "crm.empresa.com/graphql" },

        // Document processors
        PDFProcessor        { ocr: true, languages: ["es", "en", "pt"] },
        VideoProcessor      { transcription: whisper, summary: true },
        EmailProcessor      { classification: automatic },

        // Analysis tools
        RiskAnalyzer        { model: "fine-tuned-risk-v3" },
        ReportGenerator     { templates: "compliance-2026" },
        RegulatoryValidator { jurisdiction: "UY", updated: "2026-02" }
    ])

    flexible_instructions = load_instructions([
        // The business defines the rules without code
        "Always verify compliance before approving contracts > 50K USD",
        "Escalate to human if model confidence < 0.85",
        "Use formal tone for external communications",
        "Include regulatory reference in legal responses",
        "Do not process sensitive data without explicit consent"
    ])

    function handle_request(message, user_context):
        // 1. Classify intent
        intent = classify(message)  // hr | legal | commercial | compliance

        // 2. Retrieve relevant context (RAG)
        rag_context = vector_db.search(
            query     = message,
            filters   = { vertical: intent, company: user_context.company },
            top_k     = 5
        )

        // 3. Select required tools
        active_tools = select_tools(intent, message)

        // 4. Execute with flexible instructions
        response = model.generate(
            prompt          = build_prompt(message, rag_context, flexible_instructions),
            tools           = active_tools,
            max_tokens      = 2048,
            temperature     = 0.3    // low for enterprise precision
        )

        // 5. Validate before delivering
        if response.confidence < 0.85:
            return escalate_to_human(response, user_context)

        log_audit(message, response, active_tools)
        return response

The interesting part is that flexible instructions allow non-technical teams to modify the system's behavior. A compliance manager can add a new rule — "verify against the new BCU 2026 regulation" — without anyone touching a single line of code.
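
One way to implement that, sketched under the assumption that the rules live in a plain-text file the business team edits directly (the file name and format are illustrative):

# Python sketch: flexible instructions loaded from an editable file
from pathlib import Path

def build_system_prompt(base_prompt: str, rules_file: str = "rules.txt") -> str:
    # Every non-empty, non-comment line becomes a hard rule in the prompt
    lines = Path(rules_file).read_text(encoding="utf-8").splitlines()
    rules = [ln.strip() for ln in lines
             if ln.strip() and not ln.lstrip().startswith("#")]
    return (base_prompt
            + "\n\nRules (always follow):\n"
            + "\n".join(f"- {r}" for r in rules))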

Layer 3: Interface — Where the User Interacts

The interface layer varies by use case. It can be a conversational chat, a dashboard with metrics, a REST API for integrations, or even a voice interface like Polpo's VoiceERP.

What ties everything together is the automated workflow platform: workflows triggered by business events.

// Pseudocode: Automated workflow engine

workflow NewContracts:
    trigger: "new_contract_uploaded"
    steps:
        1. PDFProcessor.extract(contract)
        2. RiskAnalyzer.evaluate(contract.clauses)
        3. RegulatoryValidator.verify(contract, jurisdiction="UY")
        4. if risk > HIGH_THRESHOLD:
              notify(legal_team, urgency="high")
              generate_risk_report(contract)
           else:
              auto_approve(contract)
              notify(requester, status="approved")
        5. register_in_erp(contract, status)

workflow EmployeeOnboarding:
    trigger: "new_employee_registered"
    steps:
        1. HRConnector.get_profile(employee)
        2. generate_documents([
              employment_contract,
              confidentiality_agreement,
              personal_data_policy
           ], personalized=employee)
        3. EmailConnector.send(employee, documents)
        4. schedule_followup(days=7, type="onboarding_check")

workflow ComplianceAudit:
    trigger: cron("first_monday_each_month")
    steps:
        1. documents = LegalConnector.get_all(period="last_month")
        2. for each doc in documents:
              result = RegulatoryValidator.verify(doc)
              if result.observations:
                  flag_for_review(doc, result)
        3. ReportGenerator.create("monthly_compliance", results)
        4. notify(board, report)
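
Under the hood, a trigger/steps engine reduces to an event dispatcher. Here's a minimal sketch of that idea; the event name and risk logic are placeholders, and in production you'd reach for n8n or a similar platform, as recommended in the stack below:

# Python sketch: toy event-driven workflow engine
from collections import defaultdict
from typing import Callable

HANDLERS: dict[str, list[Callable]] = defaultdict(list)

def on(event: str):
    # Register a workflow function for a business event
    def decorator(fn: Callable):
        HANDLERS[event].append(fn)
        return fn
    return decorator

def emit(event: str, payload: dict):
    for handler in HANDLERS[event]:
        handler(payload)

@on("new_contract_uploaded")
def review_contract(payload: dict):
    # Placeholder for extract -> risk analysis -> routing
    risk = 0.9 if "penalty" in payload.get("text", "").lower() else 0.1
    route = "legal_team" if risk > 0.5 else "auto_approve"
    print(f"contract {payload['id']} -> {route}")

emit("new_contract_uploaded", {"id": "C-102", "text": "Penalty clause: ..."})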


Document Processing: It's Not Trivial

One of the most impressive capabilities showcased at the event was multi-format processing. We're not talking about reading a PDF — we're talking about extracting structured knowledge from any source:

| Format | Process | Complexity |
|---|---|---|
| Text PDF | Direct extraction + chunking | Low |
| Scanned PDF | OCR → text → chunking | Medium |
| Corporate video | Whisper → transcription → summary → embedding | High |
| Email threads | Classification → entity extraction → relationships | Medium |
| Legal contracts | Legal NER → clauses → risk analysis | High |
| Excel spreadsheets | Schema detection → normalization → embedding | Medium |
| Meeting audio | Diarization → transcription → action items | High |

// Pseudocode: Multi-format processing pipeline

function process_document(file):
    type = detect_type(file)

    switch type:
        case "text_pdf":
            text = extract_pdf_text(file)
        case "scanned_pdf":
            text = ocr_process(file, language="es")
        case "video":
            audio = extract_audio(file)
            text = whisper.transcribe(audio, language="es")
            text = summarize_transcription(text)
        case "email":
            text = parse_email(file)
            entities = extract_entities(text)
            return { text, entities, type: "email" }

    // Intelligent chunking
    chunks = smart_split(text,
        max_size      = 512_tokens,
        overlap       = 50_tokens,
        preserve      = ["paragraphs", "lists", "tables"]
    )

    // Generate embeddings
    for each chunk in chunks:
        chunk.embedding = embedding_model.generate(chunk.text)
        chunk.metadata  = {
            source:    file.name,
            date:      file.date,
            type:      type,
            company:   file.company,
            section:   chunk.detect_section()
        }

    return chunks
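
The smart_split step is where much of the retrieval quality is won or lost. Here's a simplified, runnable version that works on words rather than true tokens; a production version would count model tokens (e.g. with tiktoken) and respect the preserve rules above:

# Python sketch: chunking with overlap (word-based approximation)
def smart_split(text: str, max_size: int = 512, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # step back to create the overlap window
    return chunks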


Costs: The Reality of Building This

This is where most articles get vague. I won't. These are the real numbers for building and maintaining an enterprise AI system in 2026.

Fine-tuning: The Cost of Training

For a typical enterprise dataset of 5M tokens, 4 epochs:

| Route | Model | Training cost | FT inference (per 1M tokens) |
|---|---|---|---|
| API managed | GPT-4o-mini (OpenAI) | ~$60 USD | $0.30 / $1.20 |
| API managed | GPT-4o (OpenAI) | ~$500 USD | $3.75 / $15.00 |
| API managed | Gemini 2.5 Flash (Vertex) | ~$10-20 USD | No surcharge vs base |
| API managed | Mistral Nemo 12B | ~$20 USD (+$2/mo) | Base rate |
| Platform | Together AI (Llama/Qwen LoRA) | ~$10 USD | No surcharge vs base |
| Platform | Fireworks AI (Llama/DeepSeek) | ~$10 USD | No surcharge vs base |
| Self-hosted | Qwen 2.5 72B (QLoRA, A100 80GB) | ~$50-100 USD (GPU-hours) | Your infrastructure |
| Self-hosted | Llama 4 Scout (QLoRA, A100 40GB) | ~$30-60 USD (GPU-hours) | Your infrastructure |
| AWS Bedrock | Claude 3 Haiku | Variable (~$4/1K tokens) | Provisioned Throughput required |

The recommendation for LATAM: To get started quickly, GPT-4o-mini or Together AI with Llama/Qwen. For high volume, Gemini Flash (no FT inference surcharge) or self-hosted with QLoRA. Training cost is negligible — the real cost is in inference, and that's where smaller models and no-surcharge platforms win by a landslide.

Key insight: Self-hosted with QLoRA on an A100 80GB GPU (~$2.50/hour) lets you fine-tune a 70B-parameter model for under $100. An RTX 4090 (~$0.50/hour) is enough for models up to 13B. The barrier to entry is no longer hardware — it's knowing which data to use.
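
The arithmetic behind the training-cost column is simple enough to sanity-check yourself:

# Python sketch: reproducing the training-cost column
def ft_training_cost(dataset_tokens_m: float, epochs: int, rate_per_m: float) -> float:
    # cost = dataset size (in millions of tokens) x epochs x per-1M rate
    return dataset_tokens_m * epochs * rate_per_m

print(ft_training_cost(5, 4, 3.00))   # GPT-4o-mini   -> $60
print(ft_training_cost(5, 4, 25.00))  # GPT-4o        -> $500
print(ft_training_cost(5, 4, 0.48))   # Together LoRA -> ~$10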

MCP Server: Infrastructure

| Scenario | Estimated monthly cost |
|---|---|
| Prototype (Cloudflare Workers / Vercel) | $0 - $50 USD |
| Small production (1 company, <1K req/day) | $200 - $500 USD |
| Medium production (multi-company, <10K req/day) | $2,000 - $6,000 USD |
| Enterprise with high availability | $10,000 - $25,000 USD |

MCP carries a 30-50% higher cost than a traditional REST API because it needs to maintain session state, persistent memory, and tool orchestration.
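
To see where that overhead comes from, here's a toy version of the session memory an MCP server has to keep alive between requests (a stateless REST endpoint never pays this cost; the structure is illustrative):

# Python sketch: per-session memory kept by the orchestration layer
import time
from collections import defaultdict

SESSIONS: dict[str, list[dict]] = defaultdict(list)

def remember(session_id: str, role: str, content: str):
    SESSIONS[session_id].append({"role": role, "content": content, "ts": time.time()})

def context_for(session_id: str, max_turns: int = 10) -> list[dict]:
    # The last N turns travel with every model call for this session
    return SESSIONS[session_id][-max_turns:]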

Initial Development

| Component | Estimated hours | Approximate cost |
|---|---|---|
| Basic MCP Server | 60 - 110 h | $9,000 - $15,000 USD |
| MCP Server with enterprise connectors | 160 - 260 h | $15,000 - $35,000 USD |
| Document processing pipeline | 80 - 150 h | $10,000 - $20,000 USD |
| Fine-tuning pipeline + RAG | 40 - 80 h | $5,000 - $12,000 USD |
| Workflow platform | 120 - 200 h | $15,000 - $30,000 USD |
| Complete system total | 460 - 800 h | $54,000 - $112,000 USD |

Multi-model: Different Purposes, Different Costs

When a company needs multiple specialized models — one for HR, another for compliance, another for sales — costs multiply but not linearly:

| Configuration | Monthly operating cost |
|---|---|
| 1 API model (GPT-4o-mini FT, moderate usage) | $500 - $2,000 USD |
| 3 specialized API models (OpenAI/Gemini) | $1,500 - $5,000 USD |
| 3 models via Together/Fireworks (Llama/Qwen FT) | $800 - $3,000 USD |
| 3 self-hosted QLoRA models (1x A100 80GB) | $1,800 - $3,000 USD (GPU) |
| Complete platform with workflows | $5,000 - $20,000 USD |

The multi-LoRA strategy is key here: platforms like Together AI allow serving multiple LoRA adapters on a single base model, without multiplying infrastructure costs. Three specialized models for the cost of one.
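
As a self-hosted illustration of the same idea, Hugging Face PEFT can mount several LoRA adapters on one base model and switch between them per request. The adapter paths and names here are hypothetical:

# Python sketch: serving three LoRA adapters on one base model (PEFT)
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# One copy of the base weights, several small adapters on top
model = PeftModel.from_pretrained(base, "adapters/hr", adapter_name="hr")
model.load_adapter("adapters/legal", adapter_name="legal")
model.load_adapter("adapters/sales", adapter_name="sales")

model.set_adapter("legal")  # route a compliance request
# ... generate ...
model.set_adapter("hr")     # route an HR request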

Which Model for Each Vertical

| Vertical | Recommended model | Why |
|---|---|---|
| HR / Talent | GPT-4o-mini FT or Phi-4-mini (on-premise) | Cost/quality for CV parsing, JD generation. Phi-4 if data cannot leave the company |
| Legal / Compliance | Hybrid: Llama 4 Scout FT + RAG | Air-gapped for sensitive documents. RAG for changing regulations. 16-expert MoE delivers precision |
| B2B / Commercial | Gemini 2.5 Flash FT or GPT-4o-mini FT | High request volume, no FT inference surcharge (Gemini). Ideal for customer-facing APIs |
| Multilingual (ES/PT) | Qwen 2.5 72B or Llama 4 Scout | Qwen leads benchmarks in Spanish and Portuguese. Llama 4 trained with 10x more multilingual data than Llama 3 |

Recommended Stack for LATAM

For a company in the region that wants to get started without burning capital:

| Component | Recommendation | Monthly cost |
|---|---|---|
| Base model | GPT-4o-mini FT or Llama via Together AI | $500 - $2,000 |
| Vector DB | Qdrant Cloud / Pinecone | $0 - $100 |
| MCP Server | Cloudflare Workers | $0 - $50 |
| Workflows | n8n self-hosted | $0 - $50 |
| Doc processing | Custom pipeline | $100 - $500 |
| Total operating | | $600 - $2,700 USD/mo |

For companies that need data sovereignty (legal, healthcare, government):

| Component | Recommendation | Monthly cost |
|---|---|---|
| Base model | Qwen 2.5 or Llama 4 Scout (QLoRA, self-hosted) | $1,800 - $3,000 (GPU) |
| Vector DB | Qdrant self-hosted | $0 - $200 |
| MCP Server | Own infrastructure | $200 - $500 |
| Workflows | n8n self-hosted | $0 - $50 |
| Total operating | | $2,000 - $3,750 USD/mo |

Compared to hiring a team of 3 people to do the same work manually: $4,500 - $9,000 USD/mo in salaries for the region. And the system doesn't take vacations, doesn't call in sick, and scales without linear costs.


What Changes in Practice

The companies I saw at LATU aren't selling "AI." They're selling friction reduction. A contract that took 3 days to review now goes through the pipeline in 15 minutes. An onboarding process that required 4 people is now handled by 1 with an assistant.

But the architecture matters. A chatbot with GPT and no enterprise context is a toy. A system with fine-tuning, RAG, MCP connectors, flexible instructions, and automated workflows is a real tool.

The difference between the two isn't the model. It's the engineering around the model.


To Wrap Up

The LATU event showed something that isn't said enough: these solutions are already being built in Latin America. They're not copies of Silicon Valley products — they're implementations adapted to the regulatory, linguistic, and economic realities of the region.

The AI market in LATAM is projected at $30 billion by 2033. But the number that matters isn't the market's — it's your operation's. If a system costing $2,700/mo replaces $9,000/mo in manual labor and does it with greater accuracy and traceability, the decision isn't technological. It's arithmetic.

The architecture is available. Costs have come down. The platforms exist. What's missing, as always, is the decision to start.


Want to explore how to implement any of this in your company? Let's talk.