The Problem: 11 Datasets, Zero Integration
Uruguay's Interior Ministry publishes open data on crime, domestic violence, sexual offenses, femicides, the prison system, and police infrastructure. It's a remarkable level of transparency for a country of 3.5 million.
But the data arrives as isolated datasets:
- 6 tabular datasets: 2.4 million crime records spanning 2003-2025, each dataset with its own schema, column naming conventions, and department spellings.
- 5 geographic layers: Police precincts, stations, gender violence centers, fire stations, headquarters — in KML and SHP formats with no shared identifiers.
This project builds the integration layer — and then goes further. Instead of stopping at a clean warehouse, I added a neuro-symbolic analysis layer: formal verification with Z3, entity linking with an LLM, and crime prediction combining Graph Neural Networks with domain knowledge encoded as symbolic rules.
Architecture: From CSVs to Predictions
Data Sources (Interior Ministry)
┌──────────────────┐ ┌──────────────────┐
│ 6 Tabular CSVs │ │ 5 Geographic │
│ (CKAN API) │ │ (KML / SHP) │
└────────┬─────────┘ └────────┬─────────┘
│ │
▼ ▼
┌──────────────────────────────────────────────┐
│ Bruin Pipeline (local) │
│ ingest → transform → verify → graph │
└──────────────────┬───────────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ GCS Data Lake (Parquet) │
│ 13 processed files + knowledge graph │
│ 627K triples / 125K nodes │
└──────────────────┬───────────────────────────┘
│
┌─────────┴──────────┐
▼ ▼
┌─────────────────┐ ┌────────────────────────┐
│ BigQuery │ │ Google Colab (A100) │
│ External Tables│ │ NB04: Z3 verification │
└─────────────────┘ │ NB05: Qwen Text2SQL │
│ NB06: GNN + symbolic │
└────────────────────────┘
The first half is classic data engineering — ingest, transform, load. The second half is where it gets interesting.
The Department Problem: 19 Names, 47 Spellings
Before any analysis, I had to solve something embarrassingly basic: departments are spelled differently across datasets.
Dataset A: "MALDONADO"
Dataset B: "MALDONDADO" ← typo
Dataset C: "PAYSANDÚ" ← accent
Dataset D: "Tacuarembo" ← lowercase
Dataset E: "RÍO NEGRO" ← accent
Dataset F: "SAN JOSÉ" ← accent
This isn't a minor issue. If "PAYSANDÚ" and "PAYSANDU" don't match, your knowledge graph has two separate nodes for the same department. Your GNN has phantom departments. Your cross-dataset joins silently drop rows.
The fix was a normalization function applied across all 7 transform scripts:
DEPARTMENT_FIXES = {
    # Typos and accented variants observed across the source datasets
    "MALDONDADO": "MALDONADO",
    "TACUEREMBO": "TACUAREMBO",
    "PAYSANDÚ": "PAYSANDU",
    "RÍO NEGRO": "RIO NEGRO",
    "SAN JOSÉ": "SAN JOSE",
    "TACUAREMBÓ": "TACUAREMBO",
}

def normalize_department(name: str) -> str:
    """Trim, uppercase, and map known misspellings to the canonical name."""
    upper = name.strip().upper()
    return DEPARTMENT_FIXES.get(upper, upper)
Applied once in the shared library, called from every transform. After normalization: 19 departments, 19 spellings, zero duplicates.
Knowledge Graph: 627,000 Triples
With normalized departments, I could build a proper knowledge graph. Each crime record, each police station, each geographic boundary becomes an RDF triple:
crime:EVT_123456 a crime:CrimeEvent ;
crime:hasType "HURTO" ;
crime:occurredIn dept:MONTEVIDEO ;
crime:inYear 2023 ;
crime:inMonth "MARZO" .
dept:MONTEVIDEO a geo:Department ;
geo:hasStation station:COMISARIA_1 ;
geo:hasCEVDG cevdg:MONTEVIDEO_1 .
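Constructing those triples is plain RDFLib. A minimal sketch — the namespace URIs and column names are illustrative, not the project's exact vocabulary:

from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

# Illustrative namespaces; the real ontology URIs may differ
CRIME = Namespace("http://example.org/crime#")
DEPT = Namespace("http://example.org/dept#")

g = Graph()
for row in df.iter_rows(named=True):   # Polars DataFrame from the transforms
    event = CRIME[f"EVT_{row['id_evento']}"]            # column names assumed
    g.add((event, RDF.type, CRIME.CrimeEvent))
    g.add((event, CRIME.hasType, Literal(row["delito"])))
    g.add((event, CRIME.occurredIn, DEPT[row["depto"]]))
    g.add((event, CRIME.inYear, Literal(row["ano"], datatype=XSD.integer)))

g.serialize("crime_graph.ttl", format="turtle")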
The result: 627,376 triples connecting 125,535 nodes across all 11 datasets. Now I can run SPARQL queries that no single dataset could answer:
# Which departments have high DV rates but no gender violence center?
SELECT ?dept (COUNT(?event) as ?dv_count)
WHERE {
?event a crime:DomesticViolence ;
crime:occurredIn ?dept .
FILTER NOT EXISTS { ?dept geo:hasCEVDG ?center }
}
GROUP BY ?dept ORDER BY DESC(?dv_count)
Formal Verification With Z3: Proving Your Data Isn't Broken
Most data quality checks are assertions: "this column should not be null." But assertions only tell you that a specific row failed — they don't tell you whether the schema itself is coherent.
Z3 is an SMT solver — a theorem prover for constraints over integers, strings, and other theories. Instead of checking individual rows, I asked it: can any valid row exist under these constraints?
from z3 import *

ano = Int('ano')
depto = String('depto')

# Schema constraints; VALID_DEPARTMENTS is the list of 19 normalized names
constraints = And(
    ano >= 2003, ano <= 2025,
    Or(*[depto == StringVal(d) for d in VALID_DEPARTMENTS]),
)

solver = Solver()
solver.add(constraints)
if solver.check() == sat:
    print("SAT — schema is consistent")
    model = solver.model()
    print(f"Example valid row: year={model[ano]}, dept={model[depto]}")
Three verification phases:
- Spec consistency: Can any valid row exist? (SAT = yes, the spec isn't contradictory)
- Data boundary validation: Do actual min/max values fit within the spec? (see the sketch after this list)
- Cross-dataset consistency: Do year ranges overlap where expected?
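Phase 2 only takes a few lines. A minimal sketch — the helper name is illustrative, and the observed years come from the 2003-2025 range in the data:

from z3 import Int, Solver, And, sat

ano = Int("ano")
spec = And(ano >= 2003, ano <= 2025)

def year_fits_spec(observed_year: int) -> bool:
    # SAT means the observed value is admissible under the schema spec
    s = Solver()
    s.add(spec, ano == observed_year)
    return s.check() == sat

assert year_fits_spec(2003)   # observed minimum year
assert year_fits_spec(2025)   # observed maximum year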
LLM-Powered Text2SQL: Asking Questions in Plain Language
Notebook 05 uses Qwen2.5-7B-Instruct (4-bit quantized on A100) for two tasks:
1. Entity Linking
Matching crime type strings to a canonical taxonomy. "HURTO" → theft, "RAPIÑA" → robbery, "ABIGEATO" → cattle rustling. The model handles Spanish legal terminology correctly.
2. Text2SQL
Natural language questions → DuckDB SQL over the crime dataset:
Q: "How many crimes were reported in Montevideo in 2023?"
→ SELECT COUNT(*) FROM delitos WHERE depto = 'MONTEVIDEO' AND ano = 2023
→ Result: 94,001
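Executing the generated SQL in-notebook is a one-liner with DuckDB. A hedged sketch — registering the Polars DataFrame under the delitos table name is an assumption about the notebook setup:

import duckdb

con = duckdb.connect()
con.register("delitos", df.to_arrow())   # expose the Polars DataFrame as a table
count = con.execute(
    "SELECT COUNT(*) FROM delitos WHERE depto = 'MONTEVIDEO' AND ano = 2023"
).fetchone()[0]
print(count)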
The key lesson: prompt engineering matters more than model size. My first version used DESCRIBE delitos output as schema context. The model generated wrong column references (YEAR(fecha) instead of ano, barrio_montevideo for department filtering).
The fix was including sample values and explicit column hints:
schema_lines = []
for col_name in df.columns:
    dtype = df.schema[col_name]
    uniques = df[col_name].drop_nulls().unique().sort()
    if len(uniques) <= 10:
        sample = uniques.to_list()
    else:
        sample = uniques.head(5).to_list() + ["..."]
    schema_lines.append(f"  {col_name} ({dtype}): {sample}")
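The schema lines are then joined into the prompt. A hedged sketch — the exact wording used in the notebook isn't reproduced here:

# Illustrative prompt assembly; the real instructions may be phrased differently
schema_block = "\n".join(schema_lines)
prompt = (
    "You translate questions about Uruguayan crime data into DuckDB SQL.\n"
    f"Table delitos has these columns (with sample values):\n{schema_block}\n"
    "Return only the SQL query."
)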
With sample values in the prompt, the model generates correct SQL on the first try. No fine-tuning needed.
Model Selection: Why Not a Reasoning Model?
I initially planned to use QwQ-32B (a reasoning model) for semantic evaluation. Bad idea. Taxonomy matching is classification, not deep reasoning. QwQ generates massive <think> blocks — 10+ minutes for a 931-character prompt, even in 4-bit quantization. I switched back to Qwen2.5-7B and the same task completed in seconds.
Lesson: match model capability to task complexity. A reasoning model for a classification task is like using a crane to move a chair.
GNN Crime Prediction: Where Neural Meets Symbolic
This is the core of the neuro-symbolic approach. The goal: predict next year's crime counts per department.
The Graph
Uruguay has 19 departments. I dissolved 280 police precinct polygons into department boundaries and built an adjacency graph: two departments are neighbors if they share a border.
Adjacency graph: 19 nodes, 37 edges
Average degree: 3.9
MONTEVIDEO neighbors: [CANELONES, SAN JOSE]
CANELONES neighbors: [FLORIDA, LAVALLEJA, MALDONADO, MONTEVIDEO, SAN JOSE]
SALTO neighbors: [ARTIGAS, PAYSANDU, RIVERA, TACUAREMBO]
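That adjacency falls out of GeoPandas. A hedged sketch — the shapefile name and column are assumptions, and touches() is a simplification of "shares a border" (real dissolved boundaries can have slivers that need a small buffer):

import geopandas as gpd

precincts = gpd.read_file("seccionales_policiales.shp")          # assumed filename
departments = precincts.dissolve(by="departamento").reset_index()  # 280 polygons → 19

edges = []
for i, a in departments.iterrows():
    for j, b in departments.iterrows():
        if i < j and a.geometry.touches(b.geometry):
            edges.append((a["departamento"], b["departamento"]))

print(len(departments), "nodes,", len(edges), "edges")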
The Neural Component
A 2-layer Graph Convolutional Network (GCN) with 1,281 parameters. Input: crime counts per type per department for year T. Output: predicted total crimes for year T+1.
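A minimal PyTorch Geometric sketch of that architecture — the hidden size here is an assumption, not the exact configuration behind the 1,281-parameter count:

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class CrimeGCN(torch.nn.Module):
    def __init__(self, num_crime_types: int, hidden: int = 32):
        super().__init__()
        self.conv1 = GCNConv(num_crime_types, hidden)
        self.conv2 = GCNConv(hidden, 1)   # one value per department: total crimes in T+1

    def forward(self, x, edge_index):
        # x: [19 departments, crime-type counts for year T]
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index).squeeze(-1)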
Training uses a temporal split — no data leakage:
- Train: 11 year-pairs (2013→2014 through 2023→2024)
- Validation: 2024→2025
The Symbolic Component
Five rules derived from the actual 2.4M crime records, not invented (a derivation sketch follows the table):
| Rule | Source | Key Pattern |
|---|---|---|
| Monthly seasonality | Per-crime-type distributions | Hurto peaks in summer (Jan-Mar). Abigeato peaks in winter (Jul-Sep) — opposite pattern, because cattle rustling benefits from long rural nights. |
| Hour-of-day risk | Hour distributions | 3x variation: peak at 20h, trough at 3-4h |
| Urban concentration | Department distributions | Montevideo: 48% of crime with ~40% of population → 1.2x overrepresentation |
| DV escalation | Domain knowledge | High domestic violence + no nearby CEVDG → 1.5x risk multiplier |
| Crime interaction | Criminological theory | Hurto surge → rapiña follows (opportunity theory) |
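The seasonality factors, for example, come straight out of the records. A minimal Polars sketch, assuming illustrative column names ("delito", "mes"):

import polars as pl

# Count crimes per type and month, then normalize by each type's monthly mean:
# a factor > 1 means that month is over-represented for that crime type.
monthly = (
    df.group_by(["delito", "mes"])
      .agg(pl.len().alias("n"))
      .with_columns(
          (pl.col("n") / pl.col("n").mean().over("delito")).alias("seasonal_factor")
      )
)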
Fusion
The final prediction multiplies neural and symbolic:
prediction = neural_pred × seasonal × hour × urban × escalation
This decomposition is the entire point. A stakeholder can ask "why is Montevideo predicted high in January at 20h?" and get: "the GNN learned spatial patterns (Montevideo concentrates crime), summer increases hurto by 10%, evening hours add 14%, and the urban factor is 1.2x."
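As a toy illustration with the multipliers quoted above (the neural input is arbitrary, the factor values are the illustrative ones from the explanation, not exact model outputs):

def fused_prediction(neural_pred, seasonal=1.10, hour=1.14, urban=1.2, escalation=1.0):
    return neural_pred * seasonal * hour * urban * escalation

print(fused_prediction(60_000))   # → 90288.0 under these example multipliers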
Results and Class Imbalance
VALIDATION SCORES (2024 → 2025)
MAE: 1,735 crimes/dept
RMSE: 2,853 crimes/dept
MAPE: 41.7%
R²: 0.9738
R² looks great. MAPE tells the real story. The model predicts well for departments with enough data:
| Department | Actual | Predicted | Error |
|---|---|---|---|
| MONTEVIDEO | 81,437 | 91,602 | +12.5% |
| CANELONES | 20,374 | 19,106 | -6.2% |
| MALDONADO | 10,444 | 10,282 | -1.6% |
| ... | ... | ... | ... |
| FLORES | 739 | 2,469 | +234% |
The high R² is driven by Montevideo's scale dominating the variance. MAPE is the honest metric: useful for high-volume departments, limited for rural ones.
Why Neuro-Symbolic?
A pure neural model would be a black box. A pure rule-based system can't learn spatial diffusion patterns from data. The combination gives you:
- Interpretability: Every prediction decomposes into explainable factors. Auditable for government stakeholders.
- Data efficiency: Symbolic rules encode domain knowledge that would require decades of data for a neural model to learn independently.
- Scenario planning: Change the month, hour, or DV threshold and immediately see the impact. Impossible with a black-box model.
Lessons
1. Normalize before everything
Department normalization was a 30-line function that unblocked the entire downstream pipeline. Without it, the knowledge graph had phantom nodes, the GNN had misaligned features, and cross-dataset joins silently dropped rows.
2. R² lies when your data has class imbalance
With one department (Montevideo) having 100x more crime than the smallest (Flores), R² will be high even if the model just predicts the mean. Always check MAPE and per-class metrics.
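A quick illustration using only the four departments shown in the results table (the full validation set has 19, but the effect is the same):

from sklearn.metrics import r2_score, mean_absolute_percentage_error

actual    = [81_437, 20_374, 10_444,   739]   # MVD, Canelones, Maldonado, Flores
predicted = [91_602, 19_106, 10_282, 2_469]

print(round(r2_score(actual, predicted), 3))                        # ~0.97: scale dominates
print(round(mean_absolute_percentage_error(actual, predicted), 2))  # ~0.64: Flores blows it up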
3. Match model complexity to task complexity
QwQ-32B for taxonomy matching: 10+ minutes, massive <think> blocks, same accuracy as Qwen2.5-7B in 3 seconds. A reasoning model for a classification task burns GPU time for zero benefit.
4. Symbolic rules need to be data-driven
My first version had made-up seasonal factors. The actual data showed abigeato peaks in winter — the opposite of most crime types. You can't invent domain knowledge; you have to derive it from data.
5. Small graphs over-smooth
A GCN on 19 nodes with average degree 3.9 will propagate information across the entire graph in 2 hops. All small departments converge to the same prediction. This is a known GCN limitation, not a bug — but it means the architecture is better suited for denser graphs (e.g., 280 precincts).
Stack
| Component | Technology |
|---|---|
| Cloud | Google Cloud Platform (GCS, BigQuery) |
| Infrastructure | Terraform |
| Storage | GCS (Parquet) |
| Orchestration | Bruin (DAG pipeline) |
| Transforms | Polars |
| Knowledge Graph | RDFLib (627K+ triples, Turtle format) |
| Formal Verification | Z3 SMT Solver |
| LLM | Qwen2.5-7B-Instruct (4-bit, A100) |
| GNN | PyTorch Geometric (GCN) |
| Geospatial | GeoPandas, Shapely, Folium |
| SQL Engine | DuckDB (in-notebook) |
What's Next?
- Temporal GNN: Replace static GCN with T-GCN or DCRNN to learn time-varying patterns directly instead of relying on year-pair snapshots.
- Real DV integration: Pull actual domestic violence rates from the VD dataset instead of simulating them for the escalation rule.
- Precinct-level graph: With 280 seccionales instead of 19 departments, the GCN would have enough nodes to avoid over-smoothing.
- Backtesting: Proper temporal cross-validation instead of a single train/val split.
Building data pipelines over government open data? Let's talk.
