
The Problem: 11 Datasets, Zero Integration

Uruguay's Interior Ministry publishes open data on crime, domestic violence, sexual offenses, femicides, the prison system, and police infrastructure. It's a remarkable level of transparency for a country of 3.5 million.

But the data arrives as isolated datasets:

  • 6 tabular datasets: 2.4 million crime records spanning 2003-2025, each with different schemas, column naming conventions, and department spellings.
  • 5 geographic layers: Police precincts, stations, gender violence centers, fire stations, headquarters — in KML and SHP formats with no shared identifiers.

There's no cross-dataset ontology, no entity resolution, and no way to ask "show me all crime patterns near a police station that lacks a gender violence center." Each dataset is a silo.

This project builds the integration layer — and then goes further. Instead of stopping at a clean warehouse, I added a neuro-symbolic analysis layer: formal verification with Z3, entity linking with an LLM, and crime prediction combining Graph Neural Networks with domain knowledge encoded as symbolic rules.


Architecture: From CSVs to Predictions

 Data Sources (Interior Ministry)
 ┌──────────────────┐    ┌──────────────────┐
 │  6 Tabular CSVs  │    │  5 Geographic    │
 │  (CKAN API)      │    │  (KML / SHP)     │
 └────────┬─────────┘    └────────┬─────────┘
          │                       │
          ▼                       ▼
 ┌──────────────────────────────────────────────┐
 │          Bruin Pipeline (local)              │
 │  ingest → transform → verify → graph         │
 └──────────────────┬───────────────────────────┘
                    │
                    ▼
 ┌──────────────────────────────────────────────┐
 │        GCS Data Lake (Parquet)               │
 │  13 processed files + knowledge graph        │
 │  627K triples / 125K nodes                   │
 └──────────────────┬───────────────────────────┘
                    │
          ┌─────────┴──────────┐
          ▼                    ▼
 ┌─────────────────┐  ┌────────────────────────┐
 │  BigQuery       │  │  Google Colab (A100)   │
 │  External Tables│  │  NB04: Z3 verification │
 └─────────────────┘  │  NB05: Qwen Text2SQL   │
                      │  NB06: GNN + symbolic  │
                      └────────────────────────┘

The first half is classic data engineering — ingest, transform, load. The second half is where it gets interesting.


The Department Problem: 19 Names, 47 Spellings

Before any analysis, I had to solve something embarrassingly basic: departments are spelled differently across datasets.

Dataset A: "MALDONADO"
Dataset B: "MALDONDADO"     ← typo
Dataset C: "PAYSANDÚ"      ← accent
Dataset D: "Tacuarembo"    ← lowercase
Dataset E: "RÍO NEGRO"     ← accent
Dataset F: "SAN JOSÉ"      ← accent

This isn't a minor issue. If "PAYSANDÚ" and "PAYSANDU" don't match, your knowledge graph has two separate nodes for the same department. Your GNN has phantom departments. Your cross-dataset joins silently drop rows.

The fix was a normalization function applied across all 7 transform scripts:

DEPARTMENT_FIXES = {
    "MALDONDADO": "MALDONADO",
    "TACUEREMBO": "TACUAREMBO",
    "PAYSANDÚ": "PAYSANDU",
    "RÍO NEGRO": "RIO NEGRO",
    "SAN JOSÉ": "SAN JOSE",
    "TACUAREMBÓ": "TACUAREMBO",
}

def normalize_department(name: str) -> str:
    upper = name.strip().upper()
    return DEPARTMENT_FIXES.get(upper, upper)

Applied once in the shared library, called from every transform. After normalization: 19 departments, 19 spellings, zero duplicates.
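
In each transform, the normalization runs before any join or aggregation. A minimal sketch with Polars, assuming a hypothetical file name and a departamento column:

import polars as pl

df = pl.read_csv("delitos.csv")  # hypothetical input; each transform loads its own dataset
df = df.with_columns(
    pl.col("departamento").map_elements(normalize_department, return_dtype=pl.String)
)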


Knowledge Graph: 627,000 Triples

With normalized departments, I could build a proper knowledge graph. Each crime record, each police station, each geographic boundary becomes an RDF triple:

crime:EVT_123456 a crime:CrimeEvent ;
    crime:hasType "HURTO" ;
    crime:occurredIn dept:MONTEVIDEO ;
    crime:inYear 2023 ;
    crime:inMonth "MARZO" .

dept:MONTEVIDEO a geo:Department ;
    geo:hasStation station:COMISARIA_1 ;
    geo:hasCEVDG cevdg:MONTEVIDEO_1 .

The result: 627,376 triples connecting 125,535 nodes across all 11 datasets. Now I can run SPARQL queries that no single dataset could answer:

# Which departments have high DV rates but no gender violence center?
SELECT ?dept (COUNT(?event) as ?dv_count)
WHERE {
    ?event a crime:DomesticViolence ;
           crime:occurredIn ?dept .
    FILTER NOT EXISTS { ?dept geo:hasCEVDG ?center }
}
GROUP BY ?dept ORDER BY DESC(?dv_count)
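
A minimal sketch of how the graph is assembled and queried with RDFLib; the namespace URIs and file name are assumptions, and the triple shapes follow the Turtle example above:

from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

CRIME = Namespace("http://example.org/crime#")  # assumed namespace URIs
DEPT = Namespace("http://example.org/dept#")

g = Graph()
g.bind("crime", CRIME)
g.bind("dept", DEPT)

# One CrimeEvent per record, keyed by an event id produced in the transform step
event = CRIME["EVT_123456"]
g.add((event, RDF.type, CRIME.CrimeEvent))
g.add((event, CRIME.hasType, Literal("HURTO")))
g.add((event, CRIME.occurredIn, DEPT.MONTEVIDEO))
g.add((event, CRIME.inYear, Literal(2023, datatype=XSD.integer)))

g.serialize("crime_graph.ttl", format="turtle")

# SPARQL runs directly against the in-memory graph
for dept, n in g.query("""
    PREFIX crime: <http://example.org/crime#>
    SELECT ?dept (COUNT(?event) AS ?n)
    WHERE { ?event a crime:CrimeEvent ; crime:occurredIn ?dept }
    GROUP BY ?dept ORDER BY DESC(?n)
"""):
    print(dept, n)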


Formal Verification With Z3: Proving Your Data Isn't Broken

Most data quality checks are assertions: "this column should not be null." But assertions only tell you that a specific row failed — they don't tell you whether the schema itself is coherent.

Z3 is a theorem prover. Instead of checking individual rows, I asked it: can any valid row exist under these constraints?

from z3 import *

ano = Int('ano')
depto = String('depto')

# Schema constraints
constraints = And(
    ano >= 2003, ano <= 2025,
    Or(*[depto == StringVal(d) for d in VALID_DEPARTMENTS]),
)

solver = Solver()
solver.add(constraints)

if solver.check() == sat:
    print("SAT — schema is consistent")
    model = solver.model()
    print(f"Example valid row: year={model[ano]}, dept={model[depto]}")

Three verification phases:

  1. Spec consistency: Can any valid row exist? (SAT = yes, the spec isn't contradictory)
  2. Data boundary validation: Do actual min/max values fit within the spec?
  3. Cross-dataset consistency: Do year ranges overlap where expected?

All phases returned SAT across all datasets. The constraints are coherent, the data fits within them, and datasets that should overlap in time actually do.
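
Phase 2, for example, reduces to asking whether the observed bounds in the data are admissible under the spec. A minimal sketch, assuming a hypothetical Parquet path and the ano column:

import polars as pl
from z3 import And, Int, Solver, sat

df = pl.read_parquet("delitos.parquet")  # hypothetical path
bounds = df.select(
    pl.col("ano").min().alias("ano_min"),
    pl.col("ano").max().alias("ano_max"),
)
obs_min, obs_max = bounds.row(0)

ano = Int("ano")
spec = And(ano >= 2003, ano <= 2025)

for observed in (obs_min, obs_max):
    solver = Solver()
    solver.add(spec, ano == observed)  # can this observed value satisfy the spec?
    assert solver.check() == sat, f"year {observed} violates the spec"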


LLM-Powered Text2SQL: Asking Questions in Plain Language

Notebook 05 uses Qwen2.5-7B-Instruct (4-bit quantized on A100) for two tasks:

1. Entity Linking

Matching crime type strings to a canonical taxonomy. "HURTO" → theft, "RAPIÑA" → robbery, "ABIGEATO" → cattle rustling. The model handles Spanish legal terminology correctly.

2. Text2SQL

Natural language questions → DuckDB SQL over the crime dataset:

Q: "How many crimes were reported in Montevideo in 2023?"
→ SELECT COUNT(*) FROM delitos WHERE depto = 'MONTEVIDEO' AND ano = 2023
→ Result: 94,001

The key lesson: prompt engineering matters more than model size. My first version used DESCRIBE delitos output as schema context. The model generated wrong column references (YEAR(fecha) instead of ano, barrio_montevideo instead of depto for department filtering).

The fix was including sample values and explicit column hints:

schema_lines = []
for col_name, dtype in df.schema.items():  # Polars schema: column name → dtype
    uniques = df[col_name].drop_nulls().unique().sort()
    if len(uniques) <= 10:
        sample = uniques.to_list()
    else:
        sample = uniques.head(5).to_list() + ["..."]
    schema_lines.append(f"  {col_name} ({dtype}): {sample}")

With sample values in the prompt, the model generates correct SQL on the first try. No fine-tuning needed.
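
A sketch of the full loop, loading the model in 4-bit with Transformers and executing the generated SQL with DuckDB; the prompt wording, file path, and generation settings are assumptions:

import duckdb
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-7B-Instruct"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")

schema_context = "\n".join(schema_lines)  # built as shown above
question = "How many crimes were reported in Montevideo in 2023?"
messages = [
    {"role": "system", "content": "Translate the question into DuckDB SQL over table delitos. Reply with SQL only."},
    {"role": "user", "content": f"Columns and sample values:\n{schema_context}\n\nQuestion: {question}"},
]

inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=128, do_sample=False)
sql = tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True).strip()

con = duckdb.connect()
con.execute("CREATE VIEW delitos AS SELECT * FROM 'delitos.parquet'")  # hypothetical path
print(con.sql(sql).fetchall())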

Model Selection: Why Not a Reasoning Model?

I initially planned to use QwQ-32B (a reasoning model) for semantic evaluation. Bad idea. Taxonomy matching is classification, not deep reasoning. QwQ generates massive <think> blocks — 10+ minutes for a 931-character prompt, even in 4-bit quantization. I switched back to Qwen2.5-7B and the same task completed in seconds.

Lesson: match model capability to task complexity. A reasoning model for a classification task is like using a crane to move a chair.


GNN Crime Prediction: Where Neural Meets Symbolic

This is the core of the neuro-symbolic approach. The goal: predict next year's crime counts per department.

The Graph

Uruguay has 19 departments. I dissolved 280 police precinct polygons into department boundaries and built an adjacency graph: two departments are neighbors if they share a border.

Adjacency graph: 19 nodes, 37 edges
Average degree: 3.9

MONTEVIDEO neighbors: [CANELONES, SAN JOSE]
CANELONES neighbors: [FLORIDA, LAVALLEJA, MALDONADO, MONTEVIDEO, SAN JOSE]
SALTO neighbors: [ARTIGAS, PAYSANDU, RIVERA, TACUAREMBO]
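
A minimal sketch of the dissolve-and-adjacency step with GeoPandas; the shapefile name and the departamento column are assumptions:

import geopandas as gpd

precincts = gpd.read_file("seccionales_policiales.shp")  # 280 precinct polygons (hypothetical file name)
depts = precincts.dissolve(by="departamento").reset_index()

# Two departments are neighbors if their dissolved polygons touch (a reasonable proxy for a shared border)
edges = []
for i in range(len(depts)):
    for j in range(i + 1, len(depts)):
        if depts.geometry.iloc[i].touches(depts.geometry.iloc[j]):
            edges.append((depts["departamento"].iloc[i], depts["departamento"].iloc[j]))

print(f"{len(depts)} nodes, {len(edges)} edges")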

The Neural Component

A 2-layer Graph Convolutional Network (GCN) with 1,281 parameters. Input: crime counts per type per department for year T. Output: predicted total crimes for year T+1.
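
A minimal sketch of the network in PyTorch Geometric. The dimensions are assumptions; one combination that reproduces the 1,281-parameter count is 38 crime-type features and 32 hidden units:

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class CrimeGCN(torch.nn.Module):
    def __init__(self, num_crime_types: int = 38, hidden: int = 32):
        super().__init__()
        self.conv1 = GCNConv(num_crime_types, hidden)  # 38*32 + 32 = 1,248 params
        self.conv2 = GCNConv(hidden, 1)                # 32*1 + 1  =    33 params

    def forward(self, x, edge_index):
        # x: [19, num_crime_types] crime counts for year T
        # edge_index: [2, 2*37] undirected department adjacency
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index).squeeze(-1)   # predicted totals for year T+1

model = CrimeGCN()
print(sum(p.numel() for p in model.parameters()))      # 1,281 with the assumed dimensions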

Training uses temporal split — no data leakage:

  • Train: 11 year-pairs (2013→2014 through 2023→2024)
  • Validation: 2024→2025

The Symbolic Component

Five rules, grounded in the actual 2.4M crime records and established criminological knowledge rather than invented:

  • Monthly seasonality (per-crime-type distributions): hurto peaks in summer (Jan-Mar); abigeato peaks in winter (Jul-Sep), the opposite pattern, because cattle rustling benefits from long rural nights.
  • Hour-of-day risk (hour distributions): 3x variation, with a peak at 20h and a trough at 3-4h.
  • Urban concentration (department distributions): Montevideo accounts for 48% of crime with ~40% of the population, a 1.2x overrepresentation.
  • DV escalation (domain knowledge): high domestic violence plus no nearby CEVDG → 1.5x risk multiplier.
  • Crime interaction (criminological theory): a hurto surge is followed by a rapiña increase (opportunity theory).

Fusion

The final prediction multiplies neural and symbolic:

prediction = neural_pred × seasonal × hour × urban × escalation

This decomposition is the entire point. A stakeholder can ask "why is Montevideo predicted high in January at 20h?" and get: "the GNN learned spatial patterns (Montevideo concentrates crime), summer increases hurto by 10%, evening hours add 14%, and the urban factor is 1.2x."
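
A sketch of the fusion step. The factor values come from the rules above; the function names, DV threshold, and lookup logic are illustrative:

DV_THRESHOLD = 0.05  # hypothetical threshold on the department's DV rate

def symbolic_factors(dept: str, month: int, hour: int, dv_rate: float, has_cevdg: bool) -> dict:
    seasonal = 1.10 if month in (1, 2, 3) else 1.0    # summer hurto peak (~10%)
    hour_risk = 1.14 if 18 <= hour <= 22 else 0.90    # evening peak vs. overnight trough
    urban = 1.2 if dept == "MONTEVIDEO" else 1.0      # urban over-representation
    escalation = 1.5 if dv_rate > DV_THRESHOLD and not has_cevdg else 1.0
    return {"seasonal": seasonal, "hour": hour_risk, "urban": urban, "escalation": escalation}

def fuse(neural_pred: float, factors: dict) -> float:
    prediction = neural_pred
    for factor in factors.values():
        prediction *= factor
    return prediction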

Results and Class Imbalance

VALIDATION SCORES (2024 → 2025)
  MAE:       1,735 crimes/dept
  RMSE:      2,853 crimes/dept
  MAPE:       41.7%
  R²:        0.9738

R² looks great. MAPE tells the real story. The model predicts well for departments with enough data:

  Department    Actual    Predicted    Error
  MONTEVIDEO    81,437       91,602   +12.5%
  CANELONES     20,374       19,106    -6.2%
  MALDONADO     10,444       10,282    -1.6%
  ...
  FLORES           739        2,469    +234%

Montevideo, Canelones, and Maldonado represent ~66% of all crime. The GCN learns their patterns well. For departments like Flores (739 total crimes in a year), the model can't distinguish them from the graph mean — classic class imbalance on a small graph.

The high R² is driven by Montevideo's scale dominating the variance. MAPE is the honest metric: useful for high-volume departments, limited for rural ones.


Why Neuro-Symbolic?

A pure neural model would be a black box. A pure rule-based system can't learn spatial diffusion patterns from data. The combination gives you:

  1. Interpretability: Every prediction decomposes into explainable factors. Auditable for government stakeholders.
  2. Data efficiency: Symbolic rules encode domain knowledge that would require decades of data for a neural model to learn independently.
  3. Scenario planning: Change the month, hour, or DV threshold and immediately see the impact. Impossible with a black-box model.

Lessons

1. Normalize before everything

Department normalization was a 30-line function that unblocked the entire downstream pipeline. Without it, the knowledge graph had phantom nodes, the GNN had misaligned features, and cross-dataset joins silently dropped rows.

2. R² lies when your data has class imbalance

With one department (Montevideo) having 100x more crime than the smallest (Flores), R² will be high even if the model just predicts the mean. Always check MAPE and per-class metrics.

3. Match model complexity to task complexity

QwQ-32B for taxonomy matching: 10+ minutes, massive <think> blocks, same accuracy as Qwen2.5-7B in 3 seconds. A reasoning model for a classification task burns GPU time for zero benefit.

4. Symbolic rules need to be data-driven

My first version had made-up seasonal factors. The actual data showed abigeato peaks in winter — the opposite of most crime types. You can't invent domain knowledge; you have to derive it from data.

5. Small graphs over-smooth

A GCN on 19 nodes with average degree 3.9 will propagate information across the entire graph in 2 hops. All small departments converge to the same prediction. This is a known GCN limitation, not a bug — but it means the architecture is better suited for denser graphs (e.g., 280 precincts).


Stack

  Component             Technology
  Cloud                 Google Cloud Platform (GCS, BigQuery)
  Infrastructure        Terraform
  Storage               GCS (Parquet)
  Orchestration         Bruin (DAG pipeline)
  Transforms            Polars
  Knowledge Graph       RDFLib (627K+ triples, Turtle format)
  Formal Verification   Z3 SMT Solver
  LLM                   Qwen2.5-7B-Instruct (4-bit, A100)
  GNN                   PyTorch Geometric (GCN)
  Geospatial            GeoPandas, Shapely, Folium
  SQL Engine            DuckDB (in-notebook)

What's Next?

  • Temporal GNN: Replace static GCN with T-GCN or DCRNN to learn time-varying patterns directly instead of relying on year-pair snapshots.
  • Real DV integration: Pull actual domestic violence rates from the VD dataset instead of simulating them for the escalation rule.
  • Precinct-level graph: With 280 seccionales instead of 19 departments, the GCN would have enough nodes to avoid over-smoothing.
  • Backtesting: Proper temporal cross-validation instead of a single train/val split.

Building data pipelines over government open data? Let's talk.