Article

The Problem: 20 Years of Budget Data in 5 Different Formats

Uruguay publishes its budget data. It's one of the most transparent countries in the region on fiscal matters. But "publishing" doesn't mean "having a clean, unified dataset." It means data scattered across:

  • CKAN (datos.gub.uy): CSVs with budget credits, but with different schemas per year.
  • CGN SIIF: The official accountability report from the Contaduria General de la Nacion — the real execution numbers.
  • OPP PDFs: Historical reports from the Oficina de Planeamiento y Presupuesto. Budget tables... trapped inside PDFs.
If you want to analyze the evolution of Uruguay's national budget from 2005 to 2024, there's no single dataset that covers it all. You have to build it yourself.

This article is the behind-the-scenes of that pipeline: from LLM extraction to validation against official data, through a classic data engineering problem that no tutorial mentions — double-counting across sources.


Architecture: Simple Where It Matters, Complex Where There's No Choice

┌─────────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Sources         │     │  GCS Lake    │     │  BigQuery    │     │  Looker      │
│                  │     │              │     │              │     │  Studio      │
│  CKAN APIs ─────►│────►│ raw/         │────►│ External     │────►│              │
│  CGN SIIF ──────►│     │ processed/   │     │ Tables       │     │ Dashboard    │
│  OPP PDFs ──────►│     │              │     │ + dbt Views  │     │              │
└─────────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
         │                                           │
         │ vLLM + Qwen2.5-VL                         │ dbt
         │ (PDF extraction)                          │ (transforms)
         ▼                                           ▼
   Parquet checkpoints                     staging → intermediate → marts
   in GCS                                 (dedup + union + validation)

The idea is classic: ELT (Extract-Load-Transform). Jupyter notebooks handle extract+load to GCS in Parquet format. BigQuery exposes them as external tables. dbt transforms.

What's not classic is the extraction. Because when your source is a 2007 PDF with budget tables, there's no read_csv() that's going to save you.


PDF Extraction With Qwen2.5-VL and vLLM

The OPP reports for years 2005-2010, 2022, and 2024 only exist as PDFs. Tables with columns for current budget, executed budget, and investment. Perfectly readable for a human. Unreadable for a conventional tabular parser.

The solution: use a multimodal LLM — Qwen2.5-VL-7B — via vLLM for batch inference.

The Prompt

prompt = """Analyze this budget table image. Extract ALL rows.
For EACH row, provide a JSON object with:
- inciso: (integer) the inciso number
- denominacion: (string) the agency name
- credito_vigente: (float) current budget amount
- credito_ejecutado: (float) executed budget amount
- inversion: (float) investment amount

Return ONLY a JSON array. No explanations."""

The model receives the image of each PDF page and returns structured JSON. It works surprisingly well — for a clean 2022 PDF the accuracy is high. For a scanned 2006 one, less so, but good enough.

Checkpoints: You Don't Want to Lose 3 Hours of GPU Time

Each processed PDF gets saved as a checkpoint to GCS:

def save_checkpoint(records, pdf_name, year):
    df = pl.DataFrame(records, infer_schema_length=None)
    #                          ^^^^^^^^^^^^^^^^^^^^^^
    # Without this: ComputeError when the LLM returns 9999999999
    # Polars infers i32 from the first rows, then crashes on large values

    path = f"gs://bucket/processed/pdf_extractions/{pdf_name}.parquet"
    df.write_parquet(path)

That infer_schema_length=None line is the kind of fix that takes 2 hours of debugging and one line of code. The LLM sometimes returns absurd numeric values (like 9999999999) that don't fit in an i32. Without the parameter, Polars infers the type from the first rows and then blows up.


The Double-Counting Trap

With three data sources overlapping across years, the most serious problem isn't extraction — it's double-counting.

My first version was naive: UNION ALL everything.

The result: for 2020, the total budget showed up 3x inflated. Three sources covered the same year at different granularities.

Diagnosis

YearAvailable sourcesProblem
2005-2010PDFs onlyNo conflict
2011-2019credits_2011_2019 + credits_resumen + PDFsTriple-counting
2020CGN SIIF + credits_resumen (34K rows) + PDFsTriple-counting + 2x inflation
2021credits_2021No conflict
2022PDFs onlyNo conflict
2023CGN SIIFNo conflict
2024PDFs onlyNo conflict
For 2020, the credits_resumen source had 34,000 rows at very fine granularity. When compared per-inciso against CGN (the official source), every agency showed values 1.9x-2.0x the real amount.

The Solution: Source Priority Per Year

-- Source priority per year (avoids double-counting):
--   2011-2019: credits_2020_resumen (consistent growth, has ejecutado)
--   2020, 2023: CGN SIIF (authoritative government totals)
--   2021: credits_2021
--   2005-2010, 2022, 2024: PDF extractions (only available source)

with credits_resumen as (
    select ...
    from stg_budget_credits
    where data_source = 'credits_2020_resumen'
      and fiscal_year between 2011 and 2019
),
cgn as (
    select ...
    from stg_cgn_execution
),
pdf as (
    select ...
    from stg_pdf_extractions
    where fiscal_year < 2011
       or fiscal_year in (2022, 2024)
)

Each year uses a single source, chosen by reliability:

  1. CGN SIIF when available (2020, 2023) — it's the government's official accountability report.
  2. Credits resumen for 2011-2019 — medium granularity data with execution figures.
  3. Credits 2021 — year-specific dataset.
  4. PDFs — only for years without another source.

Validation: Don't Trust Your Data

Validation against CGN was the moment of truth. CGN SIIF publishes the official budget execution totals for the Uruguayan government. If my numbers don't match, the pipeline is useless.

Results

YearPipeline (UYU)Official CGN (UYU)Match
2020355,764,655,039355,764,655,039100%
2023492,099,120,022492,099,120,022100%
For 2011-2019, without CGN data available, I validated that year-over-year growth was consistent:

PeriodAnnual growthCoherence
2011-2012+10%Normal
2014-2015+17%Election year
2018-2019+13%Normal
Sustained growth of 10-17% per year, consistent with inflation and budget expansion in Uruguay during that decade.


The dbt Pipeline: Staging, Intermediate, Marts

dbt does the heavy lifting on transformation. Three classic layers:

Staging

1:1 views over BigQuery external tables. Type casting, column renaming, normalization:

  • stg_budget_credits — Unions 3 credit sources with different schemas (INT64 vs STRING for amounts)
  • stg_cgn_execution — CGN SIIF with validity filters
  • stg_pdf_extractions — LLM-extracted data with v3 schema

Intermediate

The layer where complexity gets resolved:

  • int_budget_unified — The deduplicated union by source priority
  • int_budget_enriched — Execution rate calculations and name normalization

Marts

The consumable layer:

  • fct_budget_execution — Fact table partitioned by fiscal year, with current budget, execution, and investment per inciso

Canonical Names: A Subtle Problem

Uruguay's 36 government agencies (incisos) appear with slightly different names in each source:

  • CGN: "PRESIDENCIA DE LA REPUBLICA"
  • Credits: "Presidencia de la Republica"
  • PDF: "PRESIDENCIA DE LA REPUBLICA" (no accent)
The fix was a canonical names CTE:

canonical_names as (
    select distinct
        org_id as inciso,
        first_value(org_nombre) over (
            partition by org_id
            order by org_nombre
        ) as canonical_name
    from stg_budget_credits
    where org_nombre is not null
      and length(trim(org_nombre)) > 3
)

It picks the name from the most reliable source (credits) and applies it as a lookup for PDF extractions.


Lessons

1. LLMs are viable extractors, not perfect ones

Qwen2.5-VL extracts PDF tables with acceptable accuracy. But it returns garbage sometimes — absurd values, duplicate rows, hallucinated fields. You need downstream validation, not blind trust.

2. Double-counting is the silent killer

When you have multiple sources for the same data, the instinct is "more is better." It isn't. A naive UNION ALL gives you inflated numbers that look reasonable until you check them against ground truth.

3. Always validate against the official source

Without the CGN SIIF comparison, the pipeline would have shipped 2-3x inflated data for several years. Validation isn't optional — it's the most important feature of the pipeline.

4. infer_schema_length=None in Polars

If you're creating DataFrames from external data (APIs, LLMs, scraping), always use infer_schema_length=None. The cost is minimal. The risk of not doing it is a ComputeError at 3 AM.

5. Checkpoints save lives (GPU lives)

LLM extraction at scale is expensive in time and GPU. Saving checkpoints per processed PDF lets you restart without losing work. A basic pattern that's not as common in Jupyter notebooks as it should be.


Stack

ComponentTechnology
API extractionPython + requests
PDF extractionvLLM + Qwen2.5-VL-7B
StorageGCS (Parquet)
WarehouseBigQuery (external tables)
Transformationdbt-bigquery
OrchestrationManual notebooks (Jupyter)
VisualizationLooker Studio
DataFramesPolars
InfrastructureGCP (GPU VM for vLLM)

Open Data in Uruguay: State of the Art

Uruguay has an unusually good open data ecosystem for the region. The datos.gub.uy portal offers datasets in consumable formats. The CGN publishes budget execution with inciso-level detail. The OPP maintains historical reports.

But "open" doesn't mean "easy." Inconsistent formats between years, amount columns that switch from INT to STRING, decimals with commas in some datasets and dots in others, PDFs where there should be CSVs.

The pipeline I built isn't impressive for its scale — it's about 30K rows at the end. It's interesting because it demonstrates the real work of data engineering: it's not about moving bytes from A to B, it's about reconciling contradictory versions of the same reality.


What's Missing?

  • Real orchestration: Airflow or Prefect replacing manual notebook execution.
  • More CGN years: Getting SIIF data for 2011-2019 would eliminate the dependency on credits_resumen.
  • Improved PDF extraction: Fine-tuning the model for Uruguayan budget tables specifically.
  • dbt tests: Referential integrity assertions and expected value range checks.

Working with public data in LATAM and need to build extraction pipelines? Let's talk.