How I Open-Sourced 1,000+ Chinese Exam Questions from WordPress to GitHub
Source: Dev.to
I have run Mandarin Zone, a Chinese language school in Beijing, since 2008. Over the years, I built 12 complete HSK 4 mock exams with the AYS Quiz Maker WordPress plugin so our students could practice online.
Recently, I decided to open‑source all of this content. Here’s how I extracted 1,176 questions from a WordPress database and turned them into a clean, developer‑friendly GitHub repository.
The Challenge
Our quiz data was locked inside WordPress. It was spread across multiple database tables (aysquiz_questions, aysquiz_answers, aysquiz_quizzes), and the content itself was full of embedded HTML, WordPress shortcodes for audio files, and messy formatting.
The Extraction
Step 1: SQL Export
I wrote targeted SQL queries to join the questions, answers, and quiz mapping tables. The core query, which pairs each question with its answers, looked like this:
SELECT
    q.id AS question_id,
    q.question AS question_text,
    q.type AS question_type,
    a.answer AS answer_text,
    a.correct AS is_correct,
    a.ordering AS answer_order
FROM aysquiz_questions q
LEFT JOIN aysquiz_answers a ON a.question_id = q.id
ORDER BY q.id, a.ordering;
The first export came out at 400 MB for just 8,566 rows, because some fields had massive embedded content. After trimming unnecessary columns, it dropped to 1.4 MB.
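One way to find which columns are bloating an export like this is to total the bytes per column. The sketch below is not the original tooling. It assumes the export is available as CSV, and the `explanation` column in the sample data is a hypothetical stand-in for whatever oversized field a real export contains.

```python
import csv
import io
from collections import Counter


def column_sizes(rows):
    """Return (column, total UTF-8 bytes) pairs, largest first."""
    totals = Counter()
    for row in rows:
        for column, value in row.items():
            totals[column] += len((value or "").encode("utf-8"))
    return totals.most_common()


# Tiny in-memory stand-in for a CSV export (hypothetical data).
sample = io.StringIO(
    "question_id,question_text,explanation\n"
    "1,对不对？," + "x" * 500 + "\n"
    "2,他去哪儿？," + "y" * 300 + "\n"
)
sizes = column_sizes(csv.DictReader(sample))

for column, total in sizes:
    print(f"{column}: {total} bytes")
```

Columns that dominate the total without contributing to the final dataset are the ones to drop from the SELECT list.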
Step 2: Data Cleaning
The raw data contained WordPress shortcodes like [audio wav="..."][/audio] and HTML entities everywhere. I wrote a Python script to:
- Extract audio URLs from shortcodes
- Strip HTML tags while preserving Chinese text
- Map question types based on content patterns (listening true/false, reading comprehension, fill‑in‑the‑blank, sentence ordering)
- Group answers by question ID and sort by ordering
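A minimal sketch of three of these steps, using only the Python standard library, might look like the following. It is not the original script. The shortcode regex assumes the URL sits in a `wav`, `mp3`, or `src` attribute, the row keys reuse the aliases from the SQL query above, and the content-based question type mapping is left out because its rules are not described here.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumed shortcode shape: [audio wav="URL"][/audio] (mp3/src also accepted).
AUDIO_SHORTCODE = re.compile(
    r'\[audio\s+[^\]]*?(?:wav|mp3|src)="([^"]+)"[^\]]*\]\s*(?:\[/audio\])?'
)


class _TextExtractor(HTMLParser):
    """Collects text nodes; HTML entities are decoded by the parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Drop tags, decode entities, and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join("".join(parser.parts).split())


def clean_question(raw):
    """Return (plain_text, audio_url or None) for one raw question field."""
    match = AUDIO_SHORTCODE.search(raw)
    audio_url = match.group(1) if match else None
    return strip_html(AUDIO_SHORTCODE.sub("", raw)), audio_url


def group_answers(rows):
    """Group answer rows by question_id, sorted by answer_order."""
    grouped = defaultdict(list)
    for row in rows:
        # A LEFT JOIN yields a NULL answer for questions without answers.
        if row["answer_text"] is not None:
            grouped[row["question_id"]].append(row)
    return {
        qid: sorted(answers, key=lambda r: r["answer_order"])
        for qid, answers in grouped.items()
    }
```

Calling `clean_question` on a raw field returns the plain text together with the audio URL, so both can be written to separate JSON keys later.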
Step 3: Structured JSON
Each test became a clean JSON file:
{
  "quiz_id": 2,
  "title": "HSK 4 Sample Quiz",
  "total_questions": 100,
  "questions": [
    {
      "number": 1,
      "type": "listening_true_false",
      "audio": "https://media.mandarinzone.com/.../hsk4-1-02.wav",
      "options": ["对", "错"],
      "correct_answer_index": 0
    }
  ]
}
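To show how a file in this shape could be consumed, here is a small scoring sketch. It relies only on the `number` and `correct_answer_index` fields from the sample above. The inline quiz data, the `example.com` audio URLs, and the `score` helper are illustrative assumptions, not part of the repository.

```python
import json

# Trimmed, made-up quiz in the same shape as the sample above.
QUIZ_JSON = """
{
  "quiz_id": 2,
  "title": "HSK 4 Sample Quiz",
  "total_questions": 2,
  "questions": [
    {"number": 1, "type": "listening_true_false",
     "audio": "https://example.com/q1.wav",
     "options": ["对", "错"], "correct_answer_index": 0},
    {"number": 2, "type": "listening_true_false",
     "audio": "https://example.com/q2.wav",
     "options": ["对", "错"], "correct_answer_index": 1}
  ]
}
"""


def score(quiz, responses):
    """Return (correct, total); responses maps question number to chosen index."""
    correct = sum(
        1
        for question in quiz["questions"]
        if responses.get(question["number"]) == question["correct_answer_index"]
    )
    return correct, len(quiz["questions"])


quiz = json.loads(QUIZ_JSON)
correct, total = score(quiz, {1: 0, 2: 0})
print(f"{correct}/{total} correct")
```

Storing the correct answer as an index into `options` keeps scoring independent of the option text, so the options can be displayed or translated without touching the answer key.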
The Result
- 12 complete HSK 4 mock exams in JSON format
- 1,176 questions across 6 question types
- GitHub Pages demo where anyone can take the tests online
- CC BY‑NC‑SA 4.0 license — free for non‑commercial use
What is HSK 4?
HSK (汉语水平考试) is China’s official Chinese proficiency test, recognized worldwide. Level 4 is intermediate — it certifies you can discuss a wide range of topics and understand roughly 1,200 vocabulary words. Each exam has 100 questions covering listening, reading, and writing.
What You Can Build With This
- A mobile HSK practice app
- Anki flashcard decks
- NLP training data for Chinese language models
- Your own quiz platform
- Spaced‑repetition study tools
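As one example, the Anki idea could start from a sketch like this, which turns a quiz into tab-separated text that Anki can import as front and back fields. The card layout (audio URL plus options on the front, correct option on the back) and the `quiz_to_tsv` helper are assumptions made for illustration.

```python
import csv
import io


def quiz_to_tsv(quiz):
    """Render each question as one tab-separated front/back row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for question in quiz["questions"]:
        options = question["options"]
        front = f'{question.get("audio", "")} [{" / ".join(options)}]'
        back = options[question["correct_answer_index"]]
        writer.writerow([front, back])
    return buffer.getvalue()


# Made-up single-question quiz in the repository's JSON shape.
quiz = {
    "questions": [
        {
            "number": 1,
            "audio": "https://example.com/q1.wav",
            "options": ["对", "错"],
            "correct_answer_index": 0,
        }
    ]
}
tsv = quiz_to_tsv(quiz)
print(tsv, end="")
```

A real deck would probably embed the audio rather than show its URL, but the row-per-question structure stays the same.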
Try It
Take a test online:
GitHub repo:
If you’re learning Chinese or building language‑learning tools, I hope this helps. PRs and stars are welcome!