Introduction
In document-heavy workflows, invoice processing remains one of the most stubborn problems to automate. Layouts vary wildly. Fields appear in different places. Vendors change templates frequently — sometimes without notice. And when you add regional formatting (such as Swedish decimal conventions and OCR-/fakturanummer requirements), capturing accurate structured data becomes even more challenging.
In an earlier blog post, I set out to design a system capable of learning invoice templates, validating extracted fields, governing template promotions with RBAC, and improving itself automatically over time. The first version worked, but its reliance on generative, OCR-like models made extraction unstable and its output unpredictable.
This new, upgraded version represents a major shift in approach. By removing nondeterministic components and building the pipeline around Tesseract OCR, rule-based extraction, spaCy linguistic hints, and hybrid validation, the system has become:
Deterministic
Regionally correct
Explainable
Stable across runs
Governed by robust policy controls
This article walks through the redesigned system in detail — its architectural decisions, extraction logic, validation framework, template lifecycle, and the improvements over earlier designs.
GitHub Repo: https://github.com/dhanuka84/local-secure-rag-invoice/tree/spacy-extraction
Part 1 — Why the First System Needed a Major Redesign
The first version demonstrated promise but also exposed limitations that are common in automated document extraction systems.
1. OCR unpredictability
The previous pipeline sometimes produced garbled or inconsistent text, especially for Swedish characters like Å, Ä, Ö. This led to inconsistent extraction results and unreliable field matching.
2. Fragile extraction patterns
Without region-aware formatting or structural signals, amounts would sometimes parse incorrectly. Comma and dot mismatches frequently caused numeric interpretation errors.
3. Inconsistent validation
Because the extracted data varied run-to-run, the validation logic — especially checks like subtotal + moms ≈ total — behaved inconsistently.
4. Template promotion issues
Cerbos integration was incorrect in several places, resulting in unexpected promotion denials or approvals.
5. Lack of Swedish formatting
Output such as 172.00 instead of 172,00 kr created friction for accounting users and weakened validation messages.
These issues collectively reduced trust in the system — and in financial workflows, trust is essential.
Part 2 — Design Principles for the New System
To address the limitations of the earlier implementation, I established clear goals for the redesign:
1. Determinism
The system must produce the same output for the same PDF every time.
2. Explainability
Every extracted field should be traceable to OCR output, a matching rule, and a validation logic path.
3. Region-aware extraction
The system must speak Swedish invoice language:
Comma decimals
“kr” suffix
“Moms (%)”
OCR-/fakturanummer
Fakturadatum
4. Robust validation workflow
Validation should rely on OCR text, numeric consistency, layout cues, and a hybrid confidence model — not generative output.
5. Proper template governance
Cerbos must control template promotion using robust RBAC rules, preventing accidental or unauthorized changes.
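The determinism goal can be turned into a simple regression check. The sketch below is an illustration, not code from the repository: result_fingerprint and is_deterministic are hypothetical helper names, and extract stands for whatever callable runs the pipeline on a PDF path and returns a JSON-serializable dict.

```python
import hashlib
import json


def result_fingerprint(result: dict) -> str:
    """Hash an extraction result so that key order does not matter."""
    canonical = json.dumps(
        result, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_deterministic(extract, pdf_path: str, runs: int = 3) -> bool:
    """Run `extract` several times and check that every run hashes the same."""
    fingerprints = {result_fingerprint(extract(pdf_path)) for _ in range(runs)}
    return len(fingerprints) == 1
```

A check like this fits naturally into a test suite: any stage that reintroduces randomness makes the fingerprints diverge.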
Part 3 — The New Architecture Overview
The redesigned system follows a simplified yet highly reliable extraction and validation pipeline. Every component is transparent and deterministic.
+---------------------+
|       PDF In        |
+----------+----------+
           |
           v
+----------+----------+
|    Tesseract OCR    |
+----------+----------+
           |  clean Swedish OCR text
           v
+----------+----------+
|     spaCy Hints     |
+----------+----------+
           |  detect DATE, NUMBERS, PERSONS
           v
+----------+----------+
|   Regex Extractor   |
+----------+----------+
           |  field-level extraction
           v
+----------+----------+
|  LayoutLMv3 Hybrid  |
|  (optional fields)  |
+----------+----------+
           |  cross-model agreement
           v
+----------+----------+
|  Swedish Formatter  |
+----------+----------+
           |  137,25 kr, not 137.25
           v
+----------+----------+
|  Vision Validator   |
|  (OCR bounding box  |
|    consistency)     |
+----------+----------+
           |  score: 0.95
           v
+----------+----------+
|   Template Cache    |
|  (active/staging)   |
+----------+----------+
           |
           v
+----------+----------+
|     Cerbos RBAC     |
+----------+----------+
           |
           v
+----------+----------+
|   Promotion Flow    |
+---------------------+
This architecture ensures that each stage adds clarity, structure, or validation — never randomness.
Part 4 — Clean, Deterministic Tesseract OCR
OCR is now handled entirely through Tesseract in Swedish mode. This ensures:
Correct Scandinavian characters
Accurate numeric formatting
Predictable output for regex matching
Transparent debugging by printing raw OCR text
The output already contains correct Swedish-formatted values, including comma-separated decimals and “kr”.
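As a rough illustration of this stage, the sketch below renders each PDF page and passes it to Tesseract with the Swedish language pack (swe). It assumes the pdf2image and pytesseract packages, which may differ from what the repository actually uses; ocr_pdf and normalize_ocr_text are hypothetical names. The normalization step only tidies whitespace and Unicode composition so that later regex matching sees stable input.

```python
import re
import unicodedata


def ocr_pdf(pdf_path: str, dpi: int = 300) -> str:
    """Render each PDF page to an image and OCR it in Swedish."""
    # Imported lazily so normalize_ocr_text works without the OCR stack.
    from pdf2image import convert_from_path  # assumption: pdf2image is installed
    import pytesseract  # assumption: Tesseract plus the "swe" data are installed

    pages = convert_from_path(pdf_path, dpi=dpi)
    texts = [pytesseract.image_to_string(page, lang="swe") for page in pages]
    return normalize_ocr_text("\n".join(texts))


def normalize_ocr_text(text: str) -> str:
    """Stabilize OCR text for regex matching without changing its content."""
    text = unicodedata.normalize("NFC", text)  # compose A + ring into a single Å
    text = text.replace("\u00a0", " ")  # non-breaking space becomes a space
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)  # drop empty lines
```

Printing the returned string is the quickest way to see exactly what the regex stage will be matching against.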
Part 5 — Rule-Based Swedish Invoice Extraction
Instead of relying on generative models, the system uses highly targeted regex patterns tuned to Swedish electricity invoices:
OCR-/fakturanummer
Fakturadatum
Summa exkl moms
Moms
Totalt belopp
Moms (%)
These patterns map directly to standard labels used by providers like GodEl, E.ON, Vattenfall, and Fortum.
The output is deterministic, and every value is copied directly from the OCR text by a fixed pattern, so the extractor cannot hallucinate fields.
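A minimal sketch of what label-anchored patterns for these fields might look like. The patterns, the internal field names, and the assumed line layout (label followed by value on the same line) are illustrative assumptions, not the repository's actual rules; only the labels and sample values come from this article.

```python
import re
from decimal import Decimal

AMOUNT = r"(\d[\d ]*,\d{2})"  # e.g. 137,25 or 1 234,56
GAP = r"[^\d\n]*"  # label-to-value gap that never crosses a line break

# Hypothetical patterns keyed by internal field name.
PATTERNS = {
    "invoice_number": re.compile(r"OCR-/fakturanummer" + GAP + r"(\d+)"),
    "invoice_date": re.compile(r"Fakturadatum" + GAP + r"(\d{4}-\d{2}-\d{2})"),
    "subtotal": re.compile(r"Summa exkl\.? moms" + GAP + AMOUNT),
    "vat_rate": re.compile(r"^Moms\s*\(\s*(\d{1,2})\s*%\s*\)", re.MULTILINE),
    "vat": re.compile(r"^Moms\b.*?" + AMOUNT + r"\s*kr", re.MULTILINE),
    "total": re.compile(r"Totalt belopp" + GAP + AMOUNT),
}


def parse_amount(raw: str) -> Decimal:
    """Convert a Swedish-formatted amount such as '1 234,56' to a Decimal."""
    cleaned = raw.replace(" ", "").replace("\u00a0", "").replace(",", ".")
    return Decimal(cleaned)


def extract_fields(ocr_text: str) -> dict:
    """Return the fields whose patterns match; unmatched fields are omitted."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        if not match:
            continue  # leave the field out rather than guess a value
        raw = match.group(1)
        if name in ("subtotal", "vat", "total"):
            fields[name] = parse_amount(raw)
        elif name == "vat_rate":
            fields[name] = Decimal(raw) / 100  # 25 becomes 0.25
        else:
            fields[name] = raw
    return fields
```

Parsing amounts into Decimal rather than float keeps the later subtotal + moms check free of binary rounding noise.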
Part 6 — spaCy Hints for Linguistic Sanity Checks
spaCy does not extract values directly. Instead, it provides hints:
DATE validation
NUM detection
PERSON recognition
Local context checks
spaCy confirms that extracted fields are linguistically consistent and appear in plausible OCR regions.
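The hint idea can be sketched without tying it to a particular model. The helpers below are hypothetical and take entities as plain (text, label) pairs, which with spaCy could be built as [(ent.text, ent.label_) for ent in nlp(ocr_text).ents]. The label names are an assumption: they depend on the pipeline loaded, so the accepted set is passed in rather than hard-coded.

```python
from datetime import date
from typing import Iterable, Tuple

Entity = Tuple[str, str]  # (entity text, entity label)


def is_iso_date(value: str) -> bool:
    """True if value parses as a real calendar date in ISO format."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def entity_confirms(value: str, entities: Iterable[Entity], labels: Iterable[str]) -> bool:
    """True if value appears inside an entity carrying one of the accepted labels."""
    accepted = set(labels)
    return any(label in accepted and value in text for text, label in entities)


def date_hint(value: str, entities: Iterable[Entity], labels=("DATE",)) -> dict:
    """Combine a format check with entity confirmation; never alters the value."""
    return {
        "well_formed": is_iso_date(value),
        "entity_confirmed": entity_confirms(value, entities, labels),
    }
```

The hint only annotates a value that the regex stage already extracted, which keeps the linguistic model out of the extraction path.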
Part 7 — Optional LayoutLMv3 Hybrid Structure Validation
LayoutLMv3 acts as a structural validator, not an extractor. It checks:
Whether “Totalt belopp” is correctly located in the totals block
Whether amounts and labels match on the same line
Whether numeric regions align with expected invoice layouts
This provides robustness when templates shift slightly over time.
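The "same line" check can be approximated with plain geometry. The sketch below is a stand-in that works on OCR word boxes (left, top, width, height in pixels) and is not LayoutLMv3 itself; Box and on_same_line are hypothetical names and the tolerance is an assumed default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    text: str
    left: int
    top: int
    width: int
    height: int

    @property
    def center_y(self) -> float:
        return self.top + self.height / 2


def on_same_line(label: Box, value: Box, tolerance: float = 0.5) -> bool:
    """True if value sits to the right of label at roughly the same height."""
    # Allow the vertical centers to differ by a fraction of the smaller box height.
    max_offset = tolerance * min(label.height, value.height)
    vertically_aligned = abs(label.center_y - value.center_y) <= max_offset
    right_of_label = value.left >= label.left + label.width
    return vertically_aligned and right_of_label
```

A check like this catches the case where a regex pairs a label with an amount from the row above or below it.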
Part 8 — Swedish Formatting Layer
To provide a polished, human-readable output, extracted data is reformatted:
137.25 → 137,25 kr
34.32 → 34,32 kr
172.00 → 172,00 kr
0.25 → 25%
Field names are also translated to Swedish, producing:
OCR-/fakturanummer
Fakturadatum
Summa exkl moms
Moms
Totalt belopp
Moms (%)
This makes the JSON directly compatible with Swedish bookkeeping systems.
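A small sketch of such a formatting layer, assuming amounts are held as Decimal values internally. The function names and the internal field names in FIELD_LABELS are hypothetical; the Swedish labels are the ones listed above. A plain space is used as the thousands separator, which is an assumption about the desired output.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical internal names mapped to the Swedish labels used in the output.
FIELD_LABELS = {
    "invoice_number": "OCR-/fakturanummer",
    "invoice_date": "Fakturadatum",
    "subtotal": "Summa exkl moms",
    "vat": "Moms",
    "total": "Totalt belopp",
    "vat_rate": "Moms (%)",
}


def format_sek(amount: Decimal) -> str:
    """Format 137.25 as '137,25 kr' with a comma decimal and two decimals."""
    quantized = amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    whole, _, fraction = f"{quantized:,.2f}".partition(".")
    return f"{whole.replace(',', ' ')},{fraction} kr"


def format_percent(rate: Decimal) -> str:
    """Format a fractional rate such as 0.25 as '25%'."""
    percent = (rate * 100).normalize()
    return f"{percent:f}".replace(".", ",") + "%"
```

For example, format_sek(Decimal("172")) gives "172,00 kr" and format_percent(Decimal("0.25")) gives "25%".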
Part 9 — Vision Validator with Confidence Scoring
The vision validator now computes a stable, meaningful confidence score using:
OCR bounding boxes
spaCy entity confirmation
LayoutLM optional validation
Mathematical checks
Page-level consistency
A typical successful extraction produces:
vision_pass: true
vision_score: 0.95
vision_critique: "Fields are complete and subtotal + moms ≈ totalt belopp."
This drastically improves trust in the system compared to previous iterations.
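One way to combine such signals is a weighted sum of boolean checks. The sketch below is illustrative only: the check names, the weights, and the 0.50 kr tolerance are assumptions, not the repository's actual scoring, and it does not attempt to reproduce the 0.95 shown above. The tolerance matters for the sample invoice, where 137,25 + 34,32 = 171,57 while the total is 172,00, a gap of 0,43 kr.

```python
from decimal import Decimal
from typing import Mapping

# Hypothetical weights per check; integers avoid float summation surprises.
WEIGHTS = {
    "fields_complete": 30,
    "totals_consistent": 30,
    "boxes_consistent": 20,
    "entities_confirmed": 10,
    "layout_confirmed": 10,
}


def totals_consistent(
    subtotal: Decimal, vat: Decimal, total: Decimal,
    tolerance: Decimal = Decimal("0.50"),
) -> bool:
    """Check subtotal + moms ≈ total within an assumed rounding tolerance."""
    return abs(subtotal + vat - total) <= tolerance


def vision_score(checks: Mapping[str, bool]) -> float:
    """Weighted share of passed checks, from 0.0 to 1.0; missing checks count as failed."""
    earned = sum(weight for name, weight in WEIGHTS.items() if checks.get(name))
    return round(earned / sum(WEIGHTS.values()), 2)
```

Because every input is a deterministic check, the same invoice always yields the same score, which is what makes the number usable as a promotion gate.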
Part 10 — Template Learning, Staging, and Active Lifecycle
The template lifecycle now works as intended:
A new invoice format → template created → staged
Successful extraction increments success_count
After meeting the threshold, promotion is attempted
Cerbos authorizes or denies based on RBAC
Once promoted → template becomes active
Active templates cannot be re-promoted (“already_active”)
The logic is clean, predictable, and audit-friendly.
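The lifecycle above can be sketched as a small state machine. This is an in-memory illustration under assumptions: TemplateStore and Template are hypothetical names, the authorize callable stands in for the Cerbos check, the "denied" status is invented for the sketch, and the first sighting of a format is assumed to stage the template with a count of zero.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Template:
    signature: str
    state: str = "staging"
    success_count: int = 0


class TemplateStore:
    def __init__(self, threshold: int, authorize: Callable[[str], bool]):
        self.threshold = threshold
        self.authorize = authorize  # stand-in for the RBAC decision
        self.templates: Dict[str, Template] = {}

    def record_success(self, signature: str) -> str:
        """Register one successful extraction and return the promotion status."""
        template = self.templates.get(signature)
        if template is None:
            # New invoice format: create the template and stage it.
            self.templates[signature] = Template(signature)
            return "pending_success_0"
        if template.state == "active":
            return "already_active"
        template.success_count += 1
        if template.success_count < self.threshold:
            return f"pending_success_{template.success_count}"
        if not self.authorize(signature):
            return "denied"  # hypothetical status; template stays in staging
        template.state = "active"
        return "promoted"
```

With a threshold of 1 and an authorizer that always allows, three successive calls for the same signature return pending_success_0, promoted, and already_active.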
Part 11 — Cerbos RBAC for Safe Promotion
Cerbos ensures that only authorized users (e.g., managers) can promote templates.
The corrected integration now:
Constructs correct Principal and Resource descriptors
Uses the appropriate Cerbos API calls
Properly interprets allow/deny results
Reflects errors clearly when PDP is unreachable
This provides governance and auditability for all template changes.
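For illustration, the sketch below talks to the Cerbos PDP over its HTTP check endpoint using only the standard library, rather than the SDK the repository may use. The resource kind invoice_template, the action name promote, the request ID, and the function names are assumptions; port 3592 matches the Cerbos container mapping shown later in this article.

```python
import json
import urllib.request


def build_check_request(user_id: str, role: str, signature: str) -> dict:
    """Build a CheckResources payload asking whether the user may promote."""
    return {
        "requestId": "promote-check",  # hypothetical request ID
        "principal": {"id": user_id, "roles": [role]},
        "resources": [
            {
                "actions": ["promote"],
                # "invoice_template" is an assumed resource kind.
                "resource": {"kind": "invoice_template", "id": signature},
            }
        ],
    }


def is_allowed(response: dict, action: str = "promote") -> bool:
    """Interpret the PDP response; anything other than EFFECT_ALLOW is a deny."""
    for result in response.get("results", []):
        if result.get("actions", {}).get(action) == "EFFECT_ALLOW":
            return True
    return False


def can_promote(user_id: str, role: str, signature: str,
                base_url: str = "http://localhost:3592") -> bool:
    """Ask the PDP for a decision and fail loudly if it cannot be reached."""
    body = json.dumps(build_check_request(user_id, role, signature)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/check/resources",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=5) as reply:
            return is_allowed(json.load(reply))
    except OSError as exc:  # covers URLError and connection failures
        raise RuntimeError(f"Cerbos PDP unreachable at {base_url}") from exc
```

Raising instead of returning False when the PDP is down keeps an outage distinguishable from a genuine deny in the audit trail.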
Part 12 — Final Example Output
Example output for the GodEl invoice:
{
"OCR-/fakturanummer": "2687252805",
"Fakturadatum": "2025-11-06",
"Summa exkl moms": "137,25 kr",
"Moms": "34,32 kr",
"Totalt belopp": "172,00 kr",
"Moms (%)": "25%"
}
Validation:
vision_pass: true
vision_score: 0.95
vision_critique: "Fields are complete and subtotal + moms ≈ totalt belopp."
Promotion:
Initially: pending_success_0
Later: promoted
Final: already_active
Part 13 — What This New System Achieves
The upgraded architecture delivers:
Predictable, stable extraction
Swedish-native formatting
Explainable hybrid validation
Proper RBAC-governed promotion
Self-improving templates
No hallucinations
Strong confidence scoring
This redesign transforms the engine from a prototype into a production-ready, trustworthy system tailored for Nordic invoices.
Part 14 — Hands-On: Running the System
Start services
Clone the GitHub repository (https://github.com/dhanuka84/local-secure-rag-invoice/tree/spacy-extraction), then start the containers and confirm that they are running:
```bash
docker-compose up -d
docker-compose ps
```
The status listing should show all four services:
```
WARN[0000] /home/dhanuka84/research/local-secure-rag-invoice/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
cerbos ghcr.io/cerbos/cerbos:latest "/cerbos server" cerbos 2 days ago Up 16 hours (healthy) 0.0.0.0:3592->3592/tcp, [::]:3592->3592/tcp, 3593/tcp
milvus milvusdb/milvus:v2.4.3 "/tini -- milvus run…" milvus 2 days ago Up 2 days 0.0.0.0:9091->9091/tcp, [::]:9091->9091/tcp, 0.0.0.0:19530->19530/tcp, [::]:19530->19530/tcp
ollama ollama/ollama:latest "/bin/ollama serve" ollama 2 days ago Up 2 days 0.0.0.0:11434->11434/tcp, [::]:11434->11434/tcp
redis redis:7-alpine "docker-entrypoint.s…" redis 2 days ago Up 2 days 0.0.0.0:6379->6379/tcp, [::]:6379->6379/tcp
```
Then run the two make targets:
```bash
make models
make install-deps
```
Install dependencies
```bash
python -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
Process an invoice
```bash
export AUTO_PROMOTE_THRESHOLD=1
docker exec -it redis redis-cli FLUSHALL
APP_ROLE=manager python -m src.graph_invoice.run_invoice_graph samples/invoices/godel.pdf
```
Environment variables:
AUTO_PROMOTE_THRESHOLD=1 → enable fast promotion for testing
APP_ROLE=manager → allowed to promote templates
Staging
The first run learns the template and reports it as pending promotion:
```json
{
  "pdf": "samples/invoices/godel.pdf",
  "signature": "inbetalning_59b3d099",
  "template_source": "learned",
  "promotion_status": "pending_success_0",
  "fields": {
    "OCR-/fakturanummer": "2687252805",
    "Fakturadatum": "2025-11-06",
    "Summa exkl moms": "137,25 kr",
    "Moms": "34,32 kr",
    "Totalt belopp": "172,00 kr",
    "Moms (%)": "25%"
  },
  "vision_pass": true,
  "vision_score": 0.95,
  "vision_critique": "Fields are complete and subtotal + moms \u2248 totalt belopp.",
  "done": true
}
```
```
$ python -m src.invoice.templates_cli list
Active:
Staging:
acme_corporation_123_main_street_invoice_72246d14
```
Promoted
A later run of the same invoice uses the active template and reports the promotion:
```json
{
  "pdf": "samples/invoices/godel.pdf",
  "signature": "inbetalning_59b3d099",
  "template_source": "active",
  "promotion_status": "promoted",
  "fields": {
    "OCR-/fakturanummer": "2687252805",
    "Fakturadatum": "2025-11-06",
    "Summa exkl moms": "137,25 kr",
    "Moms": "34,32 kr",
    "Totalt belopp": "172,00 kr",
    "Moms (%)": "25%"
  },
  "vision_pass": true,
  "vision_score": 0.95,
  "vision_critique": "Fields are complete and subtotal + moms \u2248 totalt belopp.",
  "done": true
}
```
```
$ python -m src.invoice.templates_cli list
Active:
inbetalning_59b3d099
Staging:
```
Checking the Redis Cache
```
$ docker exec -it redis redis-cli
127.0.0.1:6379> keys *
1) "invoice:template:inbetalning_59b3d099"
2) "invoice_metrics:inbetalning_59b3d099"
```
Part 15 — Future Improvements
Next steps include:
Multi-page table extraction
Vendor clustering via embeddings
Training LayoutLMv3 on Swedish datasets
Automatic anomaly detection
Human-in-the-loop review UI
Conclusion
This 2025 upgrade marks a significant shift from experimental document AI toward a robust, deterministic, and region-aware invoice extraction system. By grounding the design in OCR reality, rule-based logic, hybrid validation, and responsible template governance, the system finally achieves the reliability and explainability required for financial automation.
It is now predictable, self-improving, policy-aware — and ready for real-world deployment.