🛡️ Build a Smart Excel Data Cleaner in Python (Step-by-Step)
Source: Dev.to
What You’ll Build
- A desktop GUI app for Excel cleanup
- Automatic missing‑value handling
- Duplicate detection
- Heuristic “data health” scoring
- Export results to Excel, PDF, JSON, and TXT
GitHub repo (full script):
👉
🧰 Prerequisites
- Python 3.9+
- Basic Python knowledge
Install required packages
pip install pandas numpy openpyxl ttkbootstrap reportlab
📁 Project Structure
SmartExcelGuardian/
│── main.py
│── logo.ico
│── excelguardian.log
1️⃣ Import Required Libraries
# Core
import os
import sys
import threading
import json
import tkinter as tk
from tkinter import filedialog
# UI
import ttkbootstrap as tb
from ttkbootstrap.constants import *
# Misc
from datetime import datetime
Why these modules?
| Module | Purpose |
|---|---|
tkinter | GUI foundation |
ttkbootstrap | Modern dark UI theme |
threading | Keep UI responsive during cleanup |
pandas | Data cleaning |
numpy | Numeric operations |
openpyxl | Excel export & formatting |
re | Column‑name normalization |
reportlab | Generate professional PDF reports |
# Data & Excel
import pandas as pd
import numpy as np
import re
from openpyxl import Workbook
from openpyxl.styles import PatternFill, Font
from openpyxl.utils.dataframe import dataframe_to_rows
# PDF Export
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm
from reportlab.lib.colors import red, orange, green, black
2️⃣ Global State & Logging
stop_event = threading.Event() # Allows canceling cleanup
cleanup_results = {} # Shared export data
log_file = os.path.join(os.getcwd(), "excelguardian.log")
3️⃣ Utility Helper Functions
Resource Loader (for packaged apps)
def resource_path(file_name):
"""Return absolute path for bundled resources (PyInstaller support)."""
base_path = getattr(sys, "_MEIPASS", os.path.dirname(os.path.abspath(__file__)))
return os.path.join(base_path, file_name)
Column‑Name Cleaner
def clean_column_name(name):
"""Normalize column names: strip, lower‑case, remove punctuation, replace spaces with '_'."""
name = name.strip().lower()
name = re.sub(r"[^\w\s]", "", name) # Remove non‑alphanumeric chars
name = re.sub(r"\s+", "_", name) # Replace spaces with underscores
return name
Example
| Original | Cleaned |
|---|---|
Total Sales ($) | total_sales |
NumPy → JSON Converter
def convert_numpy(obj):
"""Make NumPy types JSON‑serialisable."""
if isinstance(obj, np.integer):
return int(obj)
if isinstance(obj, np.floating):
return float(obj)
if isinstance(obj, np.ndarray):
return obj.tolist()
raise TypeError(f"Object of type {type(obj)} is not JSON serialisable")
4️⃣ Creating the Main Window
app = tb.Window(themename="darkly")
app.title("SmartExcelGuardian v1.1.0")
app.geometry("1100x650")
Why ttkbootstrap?
- Modern styling out‑of‑the‑box
- Built‑in dark mode support
- Responsive layout helpers
5️⃣ Title Section
tb.Label(app,
text="SmartExcelGuardian",
font=("Segoe UI", 22, "bold")).pack(pady=(10, 2))
tb.Label(app,
text="Professional Excel Data Guardian Tool",
font=("Segoe UI", 10, "italic"),
foreground="#9ca3af").pack(pady=(0, 8))
Creates the app header.
6️⃣ Excel File Selector
file_path = tk.StringVar()
# Row container (you’ll need to create `row1` as a Frame first)
tb.Entry(row1,
textvariable=file_path,
width=60).pack(side="left", padx=6)
tb.Button(row1,
text="📄 Excel File",
command=lambda: file_path.set(
filedialog.askopenfilename(
filetypes=[("Excel Files", "*.xlsx *.xls")]
)
)).pack(side="left")
Allows the user to choose an Excel workbook.
7️⃣ Cleanup Control Buttons
start_btn = tb.Button(row2,
text="🛡 CLEAN DATA",
bootstyle="success")
stop_btn = tb.Button(row2,
text="🛑 STOP",
bootstyle="danger-outline",
state="disabled")
- CLEAN DATA → starts a background thread that runs the cleanup engine.
- STOP → safely halts processing.
8️⃣ Results Table (Treeview)
cols = (
"column", "original_type", "suggested_type",
"cleaned_type", "missing_values",
"duplicates_detected", "heuristic_score",
"rename_suggestion"
)
tree = tb.Treeview(row3, columns=cols, show="headings")
Displays a column‑by‑column health analysis.
9️⃣ Heuristic Scoring System
def heuristic_score(missing, duplicates, type_issue):
"""Return a risk score from 0‑100."""
score = 0
score += min(30, missing * 2) # Missing values (max 30)
score += min(30, duplicates * 2) # Duplicates (max 30)
score += 40 if type_issue else 0 # Type‑mismatch (max 40)
return min(score, 100)
| Score | Risk Level | Indicator |
|---|---|---|
| 0‑30 | Healthy | 🟢 |
| 31‑70 | Moderate | 🟠 |
| 71‑100 | High Risk | 🔴 |
🔟 Data‑Cleaning Engine
def assess_and_clean(df):
"""Iterate over columns, assess health, and clean data in‑place."""
for col in df.columns:
series = df[col]
# ---------- Numeric Columns ----------
coerced = pd.to_numeric(series, errors="coerce")
if coerced.notna().any(): # At least one numeric value
cleaned_series = coerced.fillna(coerced.mean())
df[col] = cleaned_series
continue
# ---------- Text Columns ----------
cleaned_series = series.astype("string").fillna(series.mode()[0])
df[col] = cleaned_series
Key actions
- Numeric columns → coerce to numbers, fill missing values with the column mean.
- Text columns → convert to string type, fill missing values with the most frequent value (mode).
Next Steps (not shown in the excerpt)
- Populate the Treeview with heuristic scores and rename suggestions.
- Wire the CLEAN DATA button to launch
assess_and_cleanin a background thread. - Implement export functions for Excel, PDF, JSON, and TXT using the helper utilities defined above.
- Add proper error handling and logging to
excelguardian.log.
Happy cleaning! 🎉
1️⃣ Threaded Cleanup Execution
threading.Thread(
target=run_cleanup,
daemon=True
).start()
Why threading?
- Keeps UI responsive
- Prevents freezing on large Excel files
2️⃣ Excel Export with Formulas
sum_formula = f"=SUM(A2:A{ws.max_row})"
mean_formula = f"=AVERAGE(A2:A{ws.max_row})"
Automatically adds:
SUMAVERAGE
to numeric columns.
3️⃣ Conditional Formatting
fill = PatternFill(start_color="FF9999", fill_type="solid")
cell.font = Font(bold=True)
High‑risk columns are:
- Highlighted 🔴
- Bolded for visibility
4️⃣ PDF Report Export
def score_color(score):
if score >= 71:
return red
elif score >= 31:
return orange
else:
return green
Creates a multi‑page PDF audit report with:
- Color‑coded scores
- Column summaries
- Page numbers
5️⃣ About & Help Window
tb.Label(frame, text="How to Use", font=("Segoe UI", 12, "bold"))
Provides:
- Feature overview
- Usage steps
- Developer info
🚀 Final Result
You now have:
- A professional Excel cleaner
- A desktop GUI
- A heuristic scoring system
- Multi‑format export
📌 Next Improvements
- Add per‑sheet selection
- Add charts (data health trends)
- Save user presets
- Package as
.exe
