🛡️ Build a Smart Excel Data Cleaner in Python (Step-by-Step)

Published: (January 16, 2026 at 08:00 PM EST)
4 min read
Source: Dev.to

Source: Dev.to

What You’ll Build

  • A desktop GUI app for Excel cleanup
  • Automatic missing‑value handling
  • Duplicate detection
  • Heuristic “data health” scoring
  • Export results to Excel, PDF, JSON, and TXT

GitHub repo (full script):
👉

🧰 Prerequisites

  • Python 3.9+
  • Basic Python knowledge

Install required packages

pip install pandas numpy openpyxl ttkbootstrap reportlab

📁 Project Structure

SmartExcelGuardian/
│── main.py
│── logo.ico
│── excelguardian.log

1️⃣ Import Required Libraries

# Core
import os
import sys
import threading
import json
import tkinter as tk
from tkinter import filedialog

# UI
import ttkbootstrap as tb
from ttkbootstrap.constants import *

# Misc
from datetime import datetime

Why these modules?

ModulePurpose
tkinterGUI foundation
ttkbootstrapModern dark UI theme
threadingKeep UI responsive during cleanup
pandasData cleaning
numpyNumeric operations
openpyxlExcel export & formatting
reColumn‑name normalization
reportlabGenerate professional PDF reports
# Data & Excel
import pandas as pd
import numpy as np
import re
from openpyxl import Workbook
from openpyxl.styles import PatternFill, Font
from openpyxl.utils.dataframe import dataframe_to_rows

# PDF Export
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm
from reportlab.lib.colors import red, orange, green, black

2️⃣ Global State & Logging

stop_event      = threading.Event()          # Allows canceling cleanup
cleanup_results = {}                         # Shared export data
log_file        = os.path.join(os.getcwd(), "excelguardian.log")

3️⃣ Utility Helper Functions

Resource Loader (for packaged apps)

def resource_path(file_name):
    """Return absolute path for bundled resources (PyInstaller support)."""
    base_path = getattr(sys, "_MEIPASS", os.path.dirname(os.path.abspath(__file__)))
    return os.path.join(base_path, file_name)

Column‑Name Cleaner

def clean_column_name(name):
    """Normalize column names: strip, lower‑case, remove punctuation, replace spaces with '_'."""
    name = name.strip().lower()
    name = re.sub(r"[^\w\s]", "", name)   # Remove non‑alphanumeric chars
    name = re.sub(r"\s+", "_", name)      # Replace spaces with underscores
    return name

Example

OriginalCleaned
Total Sales ($)total_sales

NumPy → JSON Converter

def convert_numpy(obj):
    """Make NumPy types JSON‑serialisable."""
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj)} is not JSON serialisable")

4️⃣ Creating the Main Window

app = tb.Window(themename="darkly")
app.title("SmartExcelGuardian v1.1.0")
app.geometry("1100x650")

Why ttkbootstrap?

  • Modern styling out‑of‑the‑box
  • Built‑in dark mode support
  • Responsive layout helpers

5️⃣ Title Section

tb.Label(app,
         text="SmartExcelGuardian",
         font=("Segoe UI", 22, "bold")).pack(pady=(10, 2))

tb.Label(app,
         text="Professional Excel Data Guardian Tool",
         font=("Segoe UI", 10, "italic"),
         foreground="#9ca3af").pack(pady=(0, 8))

Creates the app header.

6️⃣ Excel File Selector

file_path = tk.StringVar()

# Row container (you’ll need to create `row1` as a Frame first)
tb.Entry(row1,
        textvariable=file_path,
        width=60).pack(side="left", padx=6)

tb.Button(row1,
          text="📄 Excel File",
          command=lambda: file_path.set(
              filedialog.askopenfilename(
                  filetypes=[("Excel Files", "*.xlsx *.xls")]
              )
          )).pack(side="left")

Allows the user to choose an Excel workbook.

7️⃣ Cleanup Control Buttons

start_btn = tb.Button(row2,
                     text="🛡 CLEAN DATA",
                     bootstyle="success")

stop_btn = tb.Button(row2,
                    text="🛑 STOP",
                    bootstyle="danger-outline",
                    state="disabled")
  • CLEAN DATA → starts a background thread that runs the cleanup engine.
  • STOP → safely halts processing.

8️⃣ Results Table (Treeview)

cols = (
    "column", "original_type", "suggested_type",
    "cleaned_type", "missing_values",
    "duplicates_detected", "heuristic_score",
    "rename_suggestion"
)

tree = tb.Treeview(row3, columns=cols, show="headings")

Displays a column‑by‑column health analysis.

9️⃣ Heuristic Scoring System

def heuristic_score(missing, duplicates, type_issue):
    """Return a risk score from 0‑100."""
    score = 0
    score += min(30, missing * 2)          # Missing values (max 30)
    score += min(30, duplicates * 2)       # Duplicates (max 30)
    score += 40 if type_issue else 0       # Type‑mismatch (max 40)
    return min(score, 100)
ScoreRisk LevelIndicator
0‑30Healthy🟢
31‑70Moderate🟠
71‑100High Risk🔴

🔟 Data‑Cleaning Engine

def assess_and_clean(df):
    """Iterate over columns, assess health, and clean data in‑place."""
    for col in df.columns:
        series = df[col]

        # ---------- Numeric Columns ----------
        coerced = pd.to_numeric(series, errors="coerce")
        if coerced.notna().any():                     # At least one numeric value
            cleaned_series = coerced.fillna(coerced.mean())
            df[col] = cleaned_series
            continue

        # ---------- Text Columns ----------
        cleaned_series = series.astype("string").fillna(series.mode()[0])
        df[col] = cleaned_series

Key actions

  • Numeric columns → coerce to numbers, fill missing values with the column mean.
  • Text columns → convert to string type, fill missing values with the most frequent value (mode).

Next Steps (not shown in the excerpt)

  • Populate the Treeview with heuristic scores and rename suggestions.
  • Wire the CLEAN DATA button to launch assess_and_clean in a background thread.
  • Implement export functions for Excel, PDF, JSON, and TXT using the helper utilities defined above.
  • Add proper error handling and logging to excelguardian.log.

Happy cleaning! 🎉

1️⃣ Threaded Cleanup Execution

threading.Thread(
    target=run_cleanup,
    daemon=True
).start()

Why threading?

  • Keeps UI responsive
  • Prevents freezing on large Excel files

2️⃣ Excel Export with Formulas

sum_formula  = f"=SUM(A2:A{ws.max_row})"
mean_formula = f"=AVERAGE(A2:A{ws.max_row})"

Automatically adds:

  • SUM
  • AVERAGE

to numeric columns.

3️⃣ Conditional Formatting

fill = PatternFill(start_color="FF9999", fill_type="solid")
cell.font = Font(bold=True)

High‑risk columns are:

  • Highlighted 🔴
  • Bolded for visibility

4️⃣ PDF Report Export

def score_color(score):
    if score >= 71:
        return red
    elif score >= 31:
        return orange
    else:
        return green

Creates a multi‑page PDF audit report with:

  • Color‑coded scores
  • Column summaries
  • Page numbers

5️⃣ About & Help Window

tb.Label(frame, text="How to Use", font=("Segoe UI", 12, "bold"))

Provides:

  • Feature overview
  • Usage steps
  • Developer info

🚀 Final Result

You now have:

  • A professional Excel cleaner
  • A desktop GUI
  • A heuristic scoring system
  • Multi‑format export

📌 Next Improvements

  • Add per‑sheet selection
  • Add charts (data health trends)
  • Save user presets
  • Package as .exe

🔗 Full Source Code

SmartExcelGuardian on GitHub

SmartExcelGuardian

Back to Blog

Related posts

Read more »