Why your synthetic fintech data fails code review (and how mixture models fix it)

Published: (June 12, 2026 at 06:01 PM EDT)
2 min read
Source: Dev.to

Every fintech developer has done this: you need test data, you reach for Faker, you generate ten thousand transactions, and your demo works. Then a data scientist on the buying side opens your dataset, runs one df.describe(), and the deal-killing question arrives: “Why are your transaction amounts uniformly distributed?” Real financial data has a shape. Synthetic data that ignores that shape is instantly recognizable, and in testing, ML training, or sales demos, instantly discrediting.

I spent nine years running a savings app in Latin America (30,000+ users, 2015–2024), and when it wound down I kept something most synthetic data generators never had: 506,311 real records to measure that shape against. This post is about the three statistical properties that separate believable synthetic financial data from Faker output, with the actual numbers.

The standard “sophisticated” approach is to sample amounts from a lognormal distribution. It’s better than uniform, and it still fails. When I fitted a single lognormal to 261,070 real deposits, the body of the distribution looked fine (7–10% deviation between p25 and p90), but the tail fell apart: 35–45% deviation at p95–p99.

The reason is that “deposit amount” isn’t one population. It’s at least three: micro-deposits (the $1–$20 spare-change crowd), typical deposits ($100–$800), and large transfers ($6,000+). Each has its own location and spread. A single lognormal averages across them and gets all of them wrong.

The fix is a mixture of lognormals. Fit GaussianMixture from scikit-learn on the log-amounts, select the number of components, sample from the mixture.

One non-obvious lesson from doing this on real data: don’t select K with BIC. Financial amounts have heavy atoms at round values (more on that below), and BIC reacts to those atoms by under-fitting the number of components. Selecting K by minimizing the Kolmogorov–Smirnov statistic against a held-out sample worked far better: a 6-component mixture brought deposits from KS=0.068 down to KS=0.032, and p99 deviation from ~45% to under 5%.
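
The procedure above might be sketched roughly as follows. This is a minimal illustration, not the exact pipeline used on the real records: the toy data is made up (three lognormal populations loosely matching the ranges mentioned above), the helper names fit_lognormal_mixture, sample_amounts and percentile_deviation are hypothetical, and the KS step assumes a two-sample test between mixture samples and the held-out log-amounts.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.mixture import GaussianMixture


def fit_lognormal_mixture(amounts, k_candidates=range(1, 9), holdout_frac=0.2, seed=0):
    """Fit Gaussian mixtures on log-amounts; pick K by held-out KS statistic.

    Returns (ks_statistic, k, fitted_model) for the best candidate.
    """
    rng = np.random.default_rng(seed)
    log_amounts = np.log(np.asarray(amounts, dtype=float))

    # Random train / held-out split.
    idx = rng.permutation(len(log_amounts))
    n_holdout = int(len(idx) * holdout_frac)
    holdout = log_amounts[idx[:n_holdout]]
    train = log_amounts[idx[n_holdout:]].reshape(-1, 1)

    best = None
    for k in k_candidates:
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(train)
        # Assumption: compare a sample drawn from the mixture with the
        # held-out data using the two-sample KS statistic.
        synthetic_log, _ = gmm.sample(len(holdout))
        ks = ks_2samp(synthetic_log.ravel(), holdout).statistic
        if best is None or ks < best[0]:
            best = (ks, k, gmm)
    return best


def sample_amounts(gmm, n):
    """Draw n synthetic amounts by exponentiating samples from the log-space mixture."""
    log_samples, _ = gmm.sample(n)
    return np.exp(log_samples.ravel())


def percentile_deviation(real, synthetic, percentiles=(25, 50, 90, 95, 99)):
    """Relative deviation |synthetic - real| / real at each percentile."""
    real_q = np.percentile(real, percentiles)
    synth_q = np.percentile(synthetic, percentiles)
    return dict(zip(percentiles, np.abs(synth_q - real_q) / real_q))


# Toy data only: three made-up populations (micro, typical, large).
toy_rng = np.random.default_rng(42)
toy_amounts = np.concatenate([
    toy_rng.lognormal(np.log(5), 0.6, 3000),
    toy_rng.lognormal(np.log(300), 0.5, 5000),
    toy_rng.lognormal(np.log(8000), 0.4, 500),
])

ks_stat, best_k, gmm = fit_lognormal_mixture(toy_amounts)
synthetic = sample_amounts(gmm, 10_000)
deviation = percentile_deviation(toy_amounts, synthetic)

print(f"selected K={best_k}, held-out KS={ks_stat:.3f}")
for p, d in deviation.items():
    print(f"p{p}: {d:.1%} deviation")
```

Passing k_candidates=[1] reduces this to the single-lognormal baseline, which makes it easy to compare tail deviations between the two approaches on your own data.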
