11/22/2025

Enterprise‑Grade AI & Software Engineering Architecture for Invoice Understanding

 



Introduction

In document-heavy workflows, invoice processing remains one of the most stubborn problems to automate. Layouts vary wildly. Fields appear in different places. Vendors change templates frequently — sometimes without notice. And when you add regional formatting (such as Swedish decimal conventions and OCR-/fakturanummer requirements), capturing accurate structured data becomes even more challenging.

In an earlier blog post, I set out to design a system capable of learning invoice templates, validating extracted fields, governing template promotions with RBAC, and improving itself automatically over time. The first version worked, but its reliance on generative OCR-like models made extraction unstable and the output unpredictable from run to run.

This new, upgraded version represents a major shift in approach. By removing nondeterministic components and building the pipeline around Tesseract OCR, rule-based extraction, spaCy linguistic hints, and hybrid validation, the system has become:

  • Deterministic

  • Regionally correct

  • Explainable

  • Stable across runs

  • Governed by robust policy controls

This article walks through the redesigned system in detail — its architectural decisions, extraction logic, validation framework, template lifecycle, and the improvements over earlier designs.

GitHub Repo: https://github.com/dhanuka84/local-secure-rag-invoice/tree/spacy-extraction


Part 1 — Why the First System Needed a Major Redesign

The first version demonstrated promise but also exposed limitations that are common in automated document extraction systems.

1. OCR unpredictability

The previous pipeline sometimes produced garbled text, especially for Swedish characters like Å, Ä, Ö. The same invoice could therefore yield different extraction results, and field matching became unreliable.

2. Fragile extraction patterns

Without region-aware formatting or structural signals, amounts would sometimes parse incorrectly. Comma and dot mismatches frequently caused numeric interpretation errors.

3. Inconsistent validation

Because the extracted data varied run-to-run, the validation logic — especially checks like subtotal + moms ≈ total — behaved inconsistently.

4. Template promotion issues

The Cerbos integration was incorrect in several places, resulting in unexpected promotion denials or approvals.

5. Lack of Swedish formatting

Output such as 172.00 instead of 172,00 kr created friction for accounting users and weakened validation messages.

These issues collectively reduced trust in the system — and in financial workflows, trust is essential.


Part 2 — Design Principles for the New System

To address the limitations of the earlier implementation, I established clear goals for the redesign:

1. Determinism

The system must produce the same output for the same PDF every time.

2. Explainability

Every extracted field should be traceable to OCR output, a matching rule, and a validation logic path.

3. Region-aware extraction

The system must speak Swedish invoice language:

  • Comma decimals

  • “kr” suffix

  • “Moms (%)”

  • OCR-/fakturanummer

  • Fakturadatum

4. Robust validation workflow

Validation should rely on OCR text, numeric consistency, layout cues, and a hybrid confidence model — not generative output.

5. Proper template governance

Cerbos must control template promotion using robust RBAC rules, preventing accidental or unauthorized changes.


Part 3 — The New Architecture Overview

The redesigned system follows a simplified yet highly reliable extraction and validation pipeline. Every component is transparent and deterministic.



   +---------------------+
   |       PDF In        |
   +----------+----------+
              |
              v
   +----------+----------+
   |    Tesseract OCR    |
   +----------+----------+
              |  clean Swedish OCR text
              v
   +----------+----------+
   |     spaCy Hints     |
   +----------+----------+
              |  detect DATE, NUMBERS, PERSONS
              v
   +----------+----------+
   |   Regex Extractor   |
   +----------+----------+
              |  field-level extraction
              v
   +----------+----------+
   |  LayoutLMv3 Hybrid  |
   |  (optional fields)  |
   +----------+----------+
              |  cross-model agreement
              v
   +----------+----------+
   |  Swedish Formatter  |
   +----------+----------+
              |  137,25 kr, not 137.25
              v
   +----------+----------+
   |  Vision Validator   |
   |  (OCR bounding box  |
   |    consistency)     |
   +----------+----------+
              |  score: 0.95
              v
   +----------+----------+
   |   Template Cache    |
   |  (active/staging)   |
   +----------+----------+
              |
              v
   +----------+----------+
   |     Cerbos RBAC     |
   +----------+----------+
              |
              v
   +----------+----------+
   |   Promotion Flow    |
   +---------------------+



This architecture ensures that each stage adds clarity, structure, or validation — never randomness.


Part 4 — Clean, Deterministic Tesseract OCR

OCR is now handled entirely by Tesseract using its Swedish language data (language code swe). This ensures:

  • Correct Scandinavian characters

  • Accurate numeric formatting

  • Predictable output for regex matching

  • Transparent debugging by printing raw OCR text

The raw OCR text already contains values in Swedish notation, with a decimal comma and the “kr” suffix.
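A minimal sketch of how such an OCR step could be wired up, assuming the tesseract command-line tool and its Swedish language data are installed and that each PDF page has already been rendered to an image (Tesseract reads images, not PDFs). The function names are hypothetical and not taken from the repository, and the page segmentation mode (--psm 6) is only one plausible choice.

```python
import subprocess


def build_tesseract_command(image_path, lang="swe", psm=6):
    """Build the argv list for the tesseract CLI.

    "stdout" makes tesseract print the recognized text instead of writing a file.
    The psm value is an assumption; tune it for your own invoices.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def normalize_ocr_text(raw):
    """Collapse runs of whitespace (including non-breaking spaces) and drop empty lines."""
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def run_ocr(image_path):
    """Run tesseract on one page image and return cleaned text."""
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True,
        encoding="utf-8",  # keep Å, Ä, Ö intact regardless of the system locale
        check=True,
    )
    return normalize_ocr_text(result.stdout)
```

Printing the return value of run_ocr for a page is the "transparent debugging" step mentioned above: every later stage works only from this text.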


Part 5 — Rule-Based Swedish Invoice Extraction

Instead of relying on generative models, the system uses highly targeted regex patterns tuned to Swedish electricity invoices:

  • OCR-/fakturanummer

  • Fakturadatum

  • Summa exkl moms

  • Moms

  • Totalt belopp

  • Moms (%)

These patterns map directly to standard labels used by providers like GodEl, E.ON, Vattenfall, and Fortum.

The output is deterministic and aligns exactly with the OCR text — eliminating hallucinations entirely.
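To make the idea concrete, here is a small sketch of label-anchored patterns for the six fields. The patterns, the internal field names, and the sample OCR text are illustrative assumptions, not the repository's actual rules; real invoices will need more label variants.

```python
import re

# Swedish amount: optional space-separated thousands groups, decimal comma, two decimals.
AMOUNT = r"\d{1,3}(?:[ \xa0]?\d{3})*,\d{2}"

# Hypothetical patterns; each one is anchored to a label at the start of a line.
PATTERNS = {
    "invoice_number": r"^\s*(?:OCR-?/?\s*)?fakturanummer\s*:?\s*(\d{5,25})\b",
    "invoice_date": r"^\s*Fakturadatum\s*:?\s*(\d{4}-\d{2}-\d{2})\b",
    "subtotal": r"^\s*Summa\s+exkl\.?\s+moms\s*:?\s*(" + AMOUNT + r")",
    "vat": r"^\s*Moms\s*(?:\(?\d{1,2}\s*%\)?)?\s*:?\s*(" + AMOUNT + r")",
    "total": r"^\s*Totalt\s+belopp\s*:?\s*(" + AMOUNT + r")",
    "vat_rate": r"^\s*Moms\s*\(?\s*(\d{1,2})\s*%",
}


def extract_fields(ocr_text):
    """Return the first match per field, or None when a label is missing."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, ocr_text, flags=re.IGNORECASE | re.MULTILINE)
        fields[name] = match.group(1) if match else None
    return fields


# Made-up OCR text for illustration (not the real output for any invoice).
SAMPLE_OCR_TEXT = """\
OCR-/fakturanummer: 2687252805
Fakturadatum: 2025-11-06
Summa exkl moms 137,25 kr
Moms 25% 34,32 kr
Totalt belopp 172,00 kr
"""

print(extract_fields(SAMPLE_OCR_TEXT))
```

Because every value is a substring of the OCR text, a field is either found verbatim or reported as None; nothing is generated.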


Part 6 — spaCy Hints for Linguistic Sanity Checks

spaCy does not extract values directly. Instead, it provides hints:

  • DATE validation

  • NUM detection

  • PERSON recognition

  • Local context checks

spaCy confirms that extracted fields are linguistically consistent and appear in plausible OCR regions.


Part 7 — Optional LayoutLMv3 Hybrid Structure Validation

LayoutLMv3 acts as a structural validator, not an extractor. It checks:

  • Whether “Totalt belopp” is correctly located in the totals block

  • Whether amounts and labels match on the same line

  • Whether numeric regions align with expected invoice layouts

This provides robustness when templates shift slightly over time.


Part 8 — Swedish Formatting Layer

To provide a polished, human-readable output, extracted data is reformatted:

  • 137.25 → 137,25 kr

  • 34.32 → 34,32 kr

  • 172.00 → 172,00 kr

  • 0.25 → 25%

Field names are also translated to Swedish, producing:

  • OCR-/fakturanummer

  • Fakturadatum

  • Summa exkl moms

  • Moms

  • Totalt belopp

  • Moms (%)

This makes the JSON use the same labels and number formats that Swedish bookkeeping users see on the invoice itself.
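The formatting rules above can be sketched in a few lines. This assumes values are held internally as dot-decimal numbers under English field names; those internal names and the helper functions are hypothetical, chosen only for this example.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical internal field names mapped to the Swedish labels used in the output.
LABELS = {
    "invoice_number": "OCR-/fakturanummer",
    "invoice_date": "Fakturadatum",
    "subtotal": "Summa exkl moms",
    "vat": "Moms",
    "total": "Totalt belopp",
    "vat_rate": "Moms (%)",
}
MONEY_FIELDS = {"subtotal", "vat", "total"}


def format_sek(amount):
    """Format a number as Swedish kronor: space for thousands, comma for decimals."""
    value = Decimal(str(amount)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    text = f"{value:,.2f}"  # e.g. "1,234.50"
    return text.replace(",", " ").replace(".", ",") + " kr"


def format_vat_rate(rate):
    """Turn a fraction such as 0.25 into "25%"."""
    percent = Decimal(str(rate)) * 100
    if percent == percent.to_integral_value():
        return f"{int(percent)}%"
    return f"{percent.normalize():f}".replace(".", ",") + "%"


def to_swedish_output(fields):
    """Rename fields to Swedish labels and format money and percentage values."""
    output = {}
    for name, value in fields.items():
        if name in MONEY_FIELDS:
            value = format_sek(value)
        elif name == "vat_rate":
            value = format_vat_rate(value)
        output[LABELS[name]] = value
    return output


print(to_swedish_output({"subtotal": "137.25", "vat": "34.32", "total": "172.00", "vat_rate": "0.25"}))
```

Using Decimal rather than float keeps amounts exact, which matters once the values feed into arithmetic checks.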


Part 9 — Vision Validator with Confidence Scoring

The vision validator now computes a stable, meaningful confidence score using:

  • OCR bounding boxes

  • spaCy entity confirmation

  • LayoutLM optional validation

  • Mathematical checks

  • Page-level consistency

A typical successful extraction produces:

vision_pass: true

vision_score: 0.95

vision_critique: "Fields are complete and subtotal + moms ≈ totalt belopp."


This drastically improves trust in the system compared to previous iterations.
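The mathematical check deserves a closer look, because the "≈" matters. For the sample values used in this article, 137,25 + 34,32 = 171,57, while the total is 172,00, a difference of 0,43 kr that is consistent with the payable amount being rounded to whole kronor. A sketch of a tolerance-based check follows; the tolerance of 0,50 kr and the function names are assumptions for illustration, not the repository's actual scoring logic.

```python
from decimal import Decimal


def parse_sv_amount(text):
    """Parse a Swedish-formatted amount such as "1 234,50 kr" into a Decimal."""
    cleaned = (
        text.replace("kr", "")
        .replace("\u00a0", "")
        .replace(" ", "")
        .replace(",", ".")
    )
    return Decimal(cleaned)


def totals_agree(subtotal, vat, total, tolerance=Decimal("0.50")):
    """Check subtotal + vat against total within a tolerance.

    Returns (ok, difference) so the difference can be quoted in a critique message.
    The default tolerance is an assumption that allows for rounding to whole kronor.
    """
    difference = abs(parse_sv_amount(subtotal) + parse_sv_amount(vat) - parse_sv_amount(total))
    return difference <= tolerance, difference


ok, difference = totals_agree("137,25 kr", "34,32 kr", "172,00 kr")
print(ok, difference)
```

An exact equality check would reject this invoice, so whatever tolerance is chosen should be explicit and reported alongside the score.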


Part 10 — Template Learning, Staging, and Active Lifecycle

The template lifecycle now works as intended:

  1. A new invoice format → template created → staged

  2. Successful extraction increments success_count

  3. After meeting the threshold, promotion is attempted

  4. Cerbos authorizes or denies based on RBAC

  5. Once promoted → template becomes active

  6. Active templates cannot be re-promoted (“already_active”)

The logic is clean, predictable, and audit-friendly.
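The six steps above can be sketched as a small in-memory state machine. The status strings pending_success_N, promoted, and already_active come from the output shown in this article; the class names, the "denied" status, and the exact meaning of the counter are assumptions made for the example (the real system keeps templates in Redis).

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Template:
    signature: str
    state: str = "staging"
    success_count: int = 0


class TemplateRegistry:
    """Hypothetical in-memory stand-in for the template cache."""

    def __init__(self, threshold: int, authorize: Callable[[str, str], bool]):
        self.threshold = threshold
        self.authorize = authorize  # e.g. a wrapper around a Cerbos check
        self.templates: Dict[str, Template] = {}

    def record_success(self, signature: str, role: str) -> str:
        template = self.templates.get(signature)
        if template is None:
            # Step 1: a new invoice format creates a staged template.
            self.templates[signature] = Template(signature)
            return "pending_success_0"
        if template.state == "active":
            # Step 6: active templates are never promoted twice.
            return "already_active"
        # Step 2: count the successful extraction.
        template.success_count += 1
        if template.success_count < self.threshold:
            return f"pending_success_{template.success_count}"
        # Steps 3 and 4: threshold reached, ask the policy layer.
        if not self.authorize(role, "promote"):
            return "denied"  # hypothetical status name
        # Step 5: promotion succeeded.
        template.state = "active"
        return "promoted"


registry = TemplateRegistry(threshold=1, authorize=lambda role, action: role == "manager")
for _ in range(3):
    print(registry.record_success("inbetalning_59b3d099", role="manager"))
```

With a threshold of 1 and a manager role, three consecutive runs print pending_success_0, promoted, and already_active, matching the sequence described in this article.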


Part 11 — Cerbos RBAC for Safe Promotion

Cerbos ensures that only authorized users (e.g., managers) can promote templates.

The corrected integration now:

  • Constructs correct Principal and Resource descriptors

  • Uses the appropriate Cerbos API calls

  • Properly interprets allow/deny results

  • Reports errors clearly when the Cerbos PDP (policy decision point) is unreachable

This provides governance and auditability for all template changes.
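As a rough sketch of what such an integration involves, the example below builds a check request, sends it to the Cerbos HTTP endpoint, and interprets the result. The request and response shapes follow my understanding of the Cerbos /api/check/resources API and should be verified against the Cerbos documentation; the resource kind invoice_template, the action promote, and all function names are hypothetical and not taken from the repository. Port 3592 matches the docker-compose setup used in this article.

```python
import json
import urllib.error
import urllib.request

CERBOS_URL = "http://localhost:3592/api/check/resources"


def build_check_request(user_id, role, template_id, action="promote"):
    """Describe who (principal) wants to do what (action) to which resource."""
    return {
        "requestId": f"promote-{template_id}",
        "principal": {"id": user_id, "roles": [role]},
        "resources": [
            {
                "actions": [action],
                # Resource kind is an assumption; it must match your Cerbos policy.
                "resource": {"kind": "invoice_template", "id": template_id},
            }
        ],
    }


def is_allowed(response, template_id, action="promote"):
    """Return True only for an explicit EFFECT_ALLOW on the requested action."""
    for result in response.get("results", []):
        if result.get("resource", {}).get("id") == template_id:
            return result.get("actions", {}).get(action) == "EFFECT_ALLOW"
    return False


def check_promotion(user_id, role, template_id):
    """Ask the PDP; deny and report the error if it cannot be reached."""
    payload = json.dumps(build_check_request(user_id, role, template_id)).encode("utf-8")
    request = urllib.request.Request(
        CERBOS_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=5) as reply:
            body = json.load(reply)
    except (urllib.error.URLError, TimeoutError) as exc:
        return {"allowed": False, "error": f"cerbos_unreachable: {exc}"}
    return {"allowed": is_allowed(body, template_id), "error": None}
```

Treating anything other than an explicit allow as a denial, including a missing result or an unreachable PDP, keeps the promotion path fail-closed.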


Part 12 — Final Example Output

Example output for the GodEl invoice:

{

  "OCR-/fakturanummer": "2687252805",

  "Fakturadatum": "2025-11-06",

  "Summa exkl moms": "137,25 kr",

  "Moms": "34,32 kr",

  "Totalt belopp": "172,00 kr",

  "Moms (%)": "25%"

}

Source invoice: https://github.com/dhanuka84/local-secure-rag-invoice/blob/spacy-extraction/samples/invoices/godel.pdf


Validation:

  • vision_pass: true

  • vision_score: 0.95

  • vision_critique: “Fields are complete and subtotal + moms ≈ totalt belopp.”

Promotion:

  • Initially: pending_success_0

  • Later: promoted

  • Final: already_active


Part 13 — What This New System Achieves

The upgraded architecture delivers:

  • Predictable, stable extraction

  • Swedish-native formatting

  • Explainable hybrid validation

  • Proper RBAC-governed promotion

  • Self-improving templates

  • No hallucinations

  • Strong confidence scoring

This redesign transforms the engine from a prototype into a production-ready, trustworthy system tailored for Nordic invoices.



Part 14 — Hands-On: Running the System

Start services


Clone the GitHub repository: https://github.com/dhanuka84/local-secure-rag-invoice/tree/spacy-extraction



```bash
docker-compose up -d

$ docker-compose ps
WARN[0000] /home/dhanuka84/research/local-secure-rag-invoice/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
NAME      IMAGE                          COMMAND                  SERVICE   CREATED      STATUS                  PORTS
cerbos    ghcr.io/cerbos/cerbos:latest   "/cerbos server"         cerbos    2 days ago   Up 16 hours (healthy)   0.0.0.0:3592->3592/tcp, [::]:3592->3592/tcp, 3593/tcp
milvus    milvusdb/milvus:v2.4.3         "/tini -- milvus run…"   milvus    2 days ago   Up 2 days               0.0.0.0:9091->9091/tcp, [::]:9091->9091/tcp, 0.0.0.0:19530->19530/tcp, [::]:19530->19530/tcp
ollama    ollama/ollama:latest           "/bin/ollama serve"      ollama    2 days ago   Up 2 days               0.0.0.0:11434->11434/tcp, [::]:11434->11434/tcp
redis     redis:7-alpine                 "docker-entrypoint.s…"   redis     2 days ago   Up 2 days               0.0.0.0:6379->6379/tcp, [::]:6379->6379/tcp

make models
make install-deps
```

Install dependencies


python -m venv .venv && source .venv/bin/activate

pip install --upgrade pip

pip install -r requirements.txt


Process an invoice


export AUTO_PROMOTE_THRESHOLD=1

docker exec -it redis redis-cli FLUSHALL

APP_ROLE=manager python -m src.graph_invoice.run_invoice_graph samples/invoices/godel.pdf


Environment variables:

  • AUTO_PROMOTE_THRESHOLD=1 → enable fast promotion for testing

  • APP_ROLE=manager → allowed to promote templates


Staging

{

  "pdf": "samples/invoices/godel.pdf",

  "signature": "inbetalning_59b3d099",

  "template_source": "learned",

  "promotion_status": "pending_success_0",

  "fields": {

    "OCR-/fakturanummer": "2687252805",

    "Fakturadatum": "2025-11-06",

    "Summa exkl moms": "137,25 kr",

    "Moms": "34,32 kr",

    "Totalt belopp": "172,00 kr",

    "Moms (%)": "25%"

  },

  "vision_pass": true,

  "vision_score": 0.95,

  "vision_critique": "Fields are complete and subtotal + moms \u2248 totalt belopp.",

  "done": true

}




(.venv)local-secure-rag-invoice$ python -m src.invoice.templates_cli list

Active:


Staging:

  acme_corporation_123_main_street_invoice_72246d14



Promoted


{

  "pdf": "samples/invoices/godel.pdf",

  "signature": "inbetalning_59b3d099",

  "template_source": "active",

  "promotion_status": "promoted",

  "fields": {

    "OCR-/fakturanummer": "2687252805",

    "Fakturadatum": "2025-11-06",

    "Summa exkl moms": "137,25 kr",

    "Moms": "34,32 kr",

    "Totalt belopp": "172,00 kr",

    "Moms (%)": "25%"

  },

  "vision_pass": true,

  "vision_score": 0.95,

  "vision_critique": "Fields are complete and subtotal + moms \u2248 totalt belopp.",

  "done": true

}

(.venv) dhanuka84@dhanuka84:~/research/local-secure-rag-invoice$ python -m src.invoice.templates_cli list

Active:

   inbetalning_59b3d099

Staging:


Checking the Redis Cache

docker exec -it redis redis-cli

127.0.0.1:6379> keys *

1) "invoice:template:inbetalning_59b3d099"

2) "invoice_metrics:inbetalning_59b3d099"



Part 15 — Future Improvements

Next steps include:

  • Multi-page table extraction

  • Vendor clustering via embeddings

  • Training LayoutLMv3 on Swedish datasets

  • Automatic anomaly detection

  • Human-in-the-loop review UI


Conclusion

This 2025 upgrade marks a significant shift from experimental document AI toward a robust, deterministic, and region-aware invoice extraction system. By grounding the design in OCR reality, rule-based logic, hybrid validation, and responsible template governance, the system finally achieves the reliability and explainability required for financial automation.

It is now predictable, self-improving, policy-aware — and ready for real-world deployment.


