Synthetic Invoice Generation: Testing Document Systems Without Production Data

Automation pipeline for synthetic invoice data generation under real-world constraints.

A system that generates structured, semi-variable invoice datasets to simulate production conditions when real data is unavailable. It removes the dependency on production data to unblock testing and iteration under tight timelines.


💫 System Flow

<aside>

**โœ๏ธ Structured Data Tables** (Google Sheets)

โ†“

**๐Ÿ“‚ Randomized Field Generation** (Formulas + Ranges)

โ†“

**๐Ÿ”Œ Template Injection** (4 Invoice Layouts)

โ†“

โš™๏ธ Apps Script Automation

โ†“

**๐Ÿ“„ PDF Export** (Bulk)

โ†“

๐Ÿงช UAT Testing + Model Evaluation

</aside>


🔧 Overview

During UAT for a document processing workflow, testing was blocked by the absence of accessible, production-grade invoice data. Real invoices were unavailable, delayed, or unsuitable for controlled testing, making it impossible to validate system behavior under realistic conditions.

To address this, I designed a synthetic data generation system that produces structured, semi-variable invoice datasets using a spreadsheet-driven model and an automated export pipeline.

The system generates invoices by combining controlled input tables with bounded randomization, injecting values into multiple template formats to simulate real-world variability while maintaining schema consistency. This made it possible to create both repeatable baseline datasets and distinct, unseen test sets.
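To illustrate bounded randomization over controlled input tables, here is a minimal sketch in plain JavaScript (the language Apps Script runs). It is not the spreadsheet formulas used in the actual system: the vendor names, item table, price bounds, layout names, and the seeded generator are all hypothetical stand-ins chosen for illustration. A fixed seed is one way to obtain a repeatable baseline set, while a new seed draws a fresh set.

```javascript
// Sketch only: every table, field name, and bound below is a hypothetical example.

// Small seedable PRNG (mulberry32) so a given seed reproduces the same dataset.
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Controlled input tables (stand-ins for spreadsheet ranges).
const VENDORS = ["Acme Supply Co.", "Northwind Traders", "Globex Ltd."];
const ITEMS = [
  { description: "Consulting hours", minPrice: 80, maxPrice: 150 },
  { description: "Hardware kit", minPrice: 200, maxPrice: 450 },
  { description: "Support plan", minPrice: 30, maxPrice: 60 },
];
const TEMPLATES = ["layout_a", "layout_b", "layout_c", "layout_d"];

// Pick one entry from a list.
function pick(rand, list) {
  return list[Math.floor(rand() * list.length)];
}

// Integer in the inclusive range [min, max].
function intBetween(rand, min, max) {
  return min + Math.floor(rand() * (max - min + 1));
}

// Build one invoice: values vary within bounds, the shape never changes.
function generateInvoice(rand, index) {
  const lineCount = intBetween(rand, 1, 5);
  const lineItems = [];
  for (let i = 0; i < lineCount; i++) {
    const item = pick(rand, ITEMS);
    const quantity = intBetween(rand, 1, 10);
    // Whole-number prices keep the arithmetic exact in this sketch.
    const unitPrice = intBetween(rand, item.minPrice, item.maxPrice);
    lineItems.push({
      description: item.description,
      quantity: quantity,
      unitPrice: unitPrice,
      amount: quantity * unitPrice,
    });
  }
  return {
    invoiceNumber: "INV-" + String(index + 1).padStart(5, "0"),
    vendor: pick(rand, VENDORS),
    template: pick(rand, TEMPLATES),
    lineItems: lineItems,
    total: lineItems.reduce((sum, li) => sum + li.amount, 0),
  };
}

// Generate `count` invoices from a single seed.
function generateDataset(seed, count) {
  const rand = mulberry32(seed);
  return Array.from({ length: count }, (_, i) => generateInvoice(rand, i));
}
```

Calling `generateDataset(42, 20)` twice returns identical invoices, which is the property a baseline set needs; passing a seed that has not been used before draws a separate set from the same tables and bounds.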

Because the goal was realism under constraint rather than volume alone, the system balances structured data modeling with variability controls, so outputs are diverse enough to reflect real inputs without introducing noise that would break downstream workflows. This required explicitly defining where variability was allowed and where structure had to be preserved.
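One way to make the split between "allowed to vary" and "must stay fixed" explicit is a small policy object plus a check run over each generated invoice. The sketch below is an assumption about how such a rule set could look, not the rules used in the original system: the field names, the `INV-` number pattern, the layout names, and the numeric bounds are all hypothetical.

```javascript
// Sketch only: the policy values and invoice shape are hypothetical examples.
const POLICY = {
  // Structural rules: these must hold for every invoice.
  requiredFields: ["invoiceNumber", "vendor", "template", "lineItems", "total"],
  invoiceNumberPattern: /^INV-\d{5}$/,
  // Variability rules: values may differ, but only within these bounds.
  allowedTemplates: ["layout_a", "layout_b", "layout_c", "layout_d"],
  lineItemCount: { min: 1, max: 5 },
  quantity: { min: 1, max: 10 },
};

// Return a list of human-readable violations; an empty list means the invoice passes.
function findViolations(invoice) {
  const violations = [];

  for (const field of POLICY.requiredFields) {
    if (invoice[field] === undefined || invoice[field] === null) {
      violations.push("missing field: " + field);
    }
  }
  if (!POLICY.invoiceNumberPattern.test(String(invoice.invoiceNumber))) {
    violations.push("invoiceNumber does not match the fixed format");
  }
  if (!POLICY.allowedTemplates.includes(invoice.template)) {
    violations.push("template is not one of the allowed layouts");
  }

  const items = Array.isArray(invoice.lineItems) ? invoice.lineItems : [];
  if (items.length < POLICY.lineItemCount.min || items.length > POLICY.lineItemCount.max) {
    violations.push("line item count is out of bounds");
  }

  let sum = 0;
  for (const item of items) {
    const q = item.quantity;
    if (!Number.isInteger(q) || q < POLICY.quantity.min || q > POLICY.quantity.max) {
      violations.push("quantity is out of bounds: " + q);
    }
    // Whole-number prices are assumed so the comparison below is exact.
    sum += item.quantity * item.unitPrice;
  }
  if (sum !== invoice.total) {
    violations.push("total does not equal the sum of line items");
  }

  return violations;
}
```

Running a check like this before export would catch a generated invoice that drifted outside the intended bounds, so variability stays in the values while the schema seen by downstream processing stays fixed.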

The result is a reusable pipeline that transforms static input tables into scalable, production-like datasets, enabling reliable testing and iteration in the absence of real data.


๐Ÿ— System Design

1. Data Layer (Spreadsheet-driven)

2. Line Item Engine