Generate Realistic Dummy Data for MongoDB

Published: 1 month ago (March 31, 2026 at 05:17 AM EDT)

3 min read

Source: Dev.to

tl;dr

I built a tool that takes the data schema exported from MongoDB Compass, uses vector search to pick the most appropriate Faker method for each field, and then generates realistic dummy data.

The repository is here: mock-data (https://github.com/zhangyaoxing/mock-data)

MongoDB Compass now supports data modeling, a long‑desired feature. While many use the model as a data dictionary, having a model also enables quick generation of realistic dummy data for coding and testing. Writing scripts for this manually can be time‑consuming, so I set out to automate the process.

How to make it look real?

If you’ve generated dummy data before, you’re probably familiar with Faker. Faker has been around since 2008 and provides 280+ generator methods for common data such as emails, names, addresses, and license plates, and it can be extended with custom providers.

Instead of reinventing the wheel, I use Faker for Python for data mocking.

Associate generator methods with fields

To link Faker methods to fields in a JSON schema without breaking its structure, I embed an annotation in the field’s description property, surrounded by # characters. The tool extracts the text between the two # symbols and uses it as the Faker method (including any parameters). You can still include a regular description alongside the annotation.
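As a rough illustration of the annotation idea, the sketch below pulls the text between two # characters out of a description and splits it into a method name and parameters. The annotation shape `#pyint(min_value=1, max_value=5)#` and the helper names `parse_annotation` and `call_annotated` are assumptions made for this example; the tool's actual syntax and internals may differ.

```python
import ast
import re

# Assumed annotation shape: "#method#" or "#method(arg, key=value)#".
ANNOTATION_RE = re.compile(r"#([^#]+)#")


def parse_annotation(description):
    """Return (method_name, args, kwargs), or None if there is no annotation."""
    match = ANNOTATION_RE.search(description or "")
    if match is None:
        return None
    text = match.group(1).strip()
    if text.isidentifier():
        # Bare method name with no parameters, e.g. "#email#".
        return text, [], {}
    # Parse the call without executing it; only literal parameters are accepted.
    node = ast.parse(text, mode="eval").body
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
        raise ValueError(f"unsupported annotation: {text!r}")
    if any(kw.arg is None for kw in node.keywords):
        raise ValueError(f"unsupported annotation: {text!r}")
    args = [ast.literal_eval(arg) for arg in node.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, args, kwargs


def call_annotated(provider, description):
    """Call the annotated method on `provider` (for example a Faker instance)."""
    parsed = parse_annotation(description)
    if parsed is None:
        return None
    name, args, kwargs = parsed
    return getattr(provider, name)(*args, **kwargs)


if __name__ == "__main__":
    print(parse_annotation("Reader rating. #pyint(min_value=1, max_value=5)#"))
```

With Faker installed, passing `Faker()` as `provider` would dispatch to the real generator method. Parsing with `ast` rather than `eval` keeps a schema file from running arbitrary code.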

Make it even easier

Manually annotating every field is still tedious. To automate the selection of Faker methods, I employ vector search with ChromaDB:

Extract all Faker methods from the library and compute a vector for each method name.
Compute a vector for each field name in the schema.
Search the vector database to find the Faker method whose vector best matches the field name vector.
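The three steps above can be sketched without any dependencies. This is not the tool's code: character-trigram counts stand in for real embeddings, a plain dictionary stands in for ChromaDB, and the short method list is a hand-picked subset of Faker's method names. It only shows the shape of the lookup.

```python
import math
import re
from collections import Counter


def normalize(name):
    """Split camelCase and snake_case into lowercase words."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return spaced.replace("_", " ").lower()


def trigram_vector(name):
    """Toy embedding: a bag of character trigrams of the normalized name."""
    text = f" {normalize(name)} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Step 1: compute a vector for each Faker method name (small subset for illustration).
FAKER_METHODS = ["email", "first_name", "last_name", "street_address",
                 "phone_number", "license_plate", "isbn13"]
INDEX = {method: trigram_vector(method) for method in FAKER_METHODS}


def best_method(field_name):
    """Steps 2 and 3: vectorize the field name and return the closest method."""
    query = trigram_vector(field_name)
    return max(INDEX, key=lambda method: cosine(query, INDEX[method]))


if __name__ == "__main__":
    for field in ["firstName", "customerEmail", "phoneNumber"]:
        print(field, "->", best_method(field))
```

A real embedding model also matches on meaning, so it can link names that share no characters, which trigram overlap cannot do.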

Requirements for the approach

Faker method names must be meaningful (they are).
Field names should be meaningful (generally true for well‑designed schemas).
The chosen Faker method must be able to produce data that can be converted to the type specified in the JSON schema. ChromaDB’s filtering capabilities enforce this constraint.
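To make the type constraint concrete, here is a minimal sketch of a filtered lookup: candidates are first narrowed to methods tagged with a compatible type, and only then ranked by similarity. In ChromaDB the narrowing would typically be done with a metadata `where` filter on the query. Here a dictionary and `difflib` string similarity stand in for the vector store, and the type tags and the name `pick_method` are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical catalogue: Faker method name -> type tag its output converts to.
# The tags are assumptions for illustration, not taken from the tool.
METHOD_TYPES = {
    "email": "string",
    "first_name": "string",
    "year": "string",
    "pyint": "int",
    "random_int": "int",
    "date_time": "date",
}


def pick_method(field_name, field_type):
    """Return the most similar method among those matching `field_type`."""
    # Filter first, so an incompatible method can never win on name alone.
    candidates = [m for m, t in METHOD_TYPES.items() if t == field_type]
    if not candidates:
        return None
    query = field_name.lower()
    return max(candidates, key=lambda m: SequenceMatcher(None, query, m).ratio())


if __name__ == "__main__":
    print(pick_method("year", "string"))  # exact name match is allowed here
    print(pick_method("year", "int"))     # "year" is excluded by the type filter
```

Filtering before ranking matters when the closest name has the wrong type: a field called `year` declared as an integer should not be filled by a method that yields a string.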

Additional minor optimizations improve guessing accuracy. The result is a tool that can automatically generate realistic dummy data from a MongoDB Compass schema.

Usage

# Clone the repository
git clone https://github.com/zhangyaoxing/mock-data.git
cd mock-data

# Install in development mode
pip install -e .

# Generate dummy data as ejson files
mockdata -s schemas/BookStore.json -n 50 -t ejson output/

# Generate dummy data directly into MongoDB
mockdata -s schemas/BookStore.json -n 50 -t mongodb mongodb://localhost/

# Generate dummy data into Kafka
mockdata -s schemas/BookStore.json -n 50 -t kafka localhost:9092
