How We Built an AI Manga Studio with Google Gemini in a Week
Source: Dev.to
Overview
Manga is one of the world’s most expressive storytelling formats, but creating it traditionally requires years of artistic training. Enpitsu (鉛筆 — Japanese for “pencil”) is a full AI manga studio powered by Google Gemini. Users type a story idea, pick a genre, and Enpitsu generates a complete manga—including script, character art, and illustrated panels—and exports it as a PDF.
Gemini‑Driven Script Generation
The first step uses Gemini 2.5 Flash with structured JSON output. By passing the genre and story prompt, Gemini returns a full manga script containing:
- Title and Japanese title
- Synopsis
- Characters with visual descriptions
- Per‑panel scene descriptions with dialogue
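A script with these fields could be modeled in Pydantic roughly as follows. This is a minimal sketch, not the project's actual schema: only the class name StoryResponse appears in the article's code, and every field name and the sample data are assumptions made for illustration.

```python
from pydantic import BaseModel


class Character(BaseModel):
    name: str
    visual_description: str  # appearance details, reusable for the character sheet


class Panel(BaseModel):
    scene_description: str
    dialogue: list[str]
    characters_present: list[str]


class StoryResponse(BaseModel):
    title: str
    japanese_title: str
    synopsis: str
    characters: list[Character]
    panels: list[Panel]


# Validate a hand-written sample the same way a JSON response body could be validated.
sample = """
{
  "title": "Ink and Thunder",
  "japanese_title": "墨と雷",
  "synopsis": "A young artist discovers that her drawings come alive.",
  "characters": [
    {"name": "Aoi", "visual_description": "Short dark hair, round glasses, ink-stained apron"}
  ],
  "panels": [
    {
      "scene_description": "Aoi stares at a blank page in a cluttered studio.",
      "dialogue": ["Aoi: Not again..."],
      "characters_present": ["Aoi"]
    }
  ]
}
"""
story = StoryResponse.model_validate_json(sample)
```

With a model like this, the text of a JSON response can be validated in one call, and a missing or mistyped field surfaces as a validation error instead of a downstream bug.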
The key Gemini feature is response_mime_type: "application/json" combined with a Pydantic response_schema. Together they constrain the model's output to JSON that matches the schema, so the result can be used directly without fragile text parsing.
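
    from google.genai.types import GenerateContentConfig

    # client is a google.genai.Client; StoryResponse is the Pydantic schema for the script.
    response = await client.aio.models.generate_content(
        model="gemini-2.5-flash",
        contents=user_prompt,
        config=GenerateContentConfig(
            system_instruction=SYSTEM_PROMPT,
            response_mime_type="application/json",  # ask for JSON rather than free text
            response_schema=StoryResponse,  # constrain the JSON to this schema
        ),
    )

Character Sheet Generation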
For each character, Gemini’s image models generate a professional settei (設定)—the character reference sheets used in real anime production. Each sheet includes front, 3/4, and side views plus emotion expressions, rendered with clean linework on a white background.
A three‑model fallback chain across available Gemini image preview models ensures graceful degradation rather than failure.
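The fallback idea can be sketched as a loop that tries each model in turn and returns the first success. This is an illustration under assumptions, not the project's code: the model names are hypothetical placeholders, and the generate callable stands in for whatever SDK call produces the image bytes.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical placeholder names; substitute whichever image models are available.
IMAGE_MODELS = ("image-model-primary", "image-model-secondary", "image-model-last-resort")


async def generate_with_fallback(
    prompt: str,
    generate: Callable[[str, str], Awaitable[bytes]],
    models: Sequence[str] = IMAGE_MODELS,
) -> bytes:
    """Try each model in order and return the first successful image."""
    errors: list[str] = []
    for model in models:
        try:
            return await generate(model, prompt)
        except Exception as exc:  # a real implementation might catch narrower SDK errors
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all image models failed: " + "; ".join(errors))


async def _fake_generate(model: str, prompt: str) -> bytes:
    # Stand-in for the real SDK call: only the last model "works".
    if model != "image-model-last-resort":
        raise RuntimeError("model unavailable")
    return f"{model}:{prompt}".encode()


result = asyncio.run(generate_with_fallback("settei sheet for Aoi", _fake_generate))
```

Collecting the per-model errors before raising keeps the final failure message useful when every model in the chain is unavailable.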
Consistent Panel Generation
Generating a single panel is straightforward, but maintaining character consistency across 20+ panels is challenging. Enpitsu solves this by:
- Passing every character’s settei sheet as a multimodal image reference in each panel‑generation call.
- Labeling each reference as either “IN THIS PANEL” (must match exactly) or “NOT IN PANEL” (provided for style consistency only).
- Including the previous panel as an additional visual cue.

    from google.genai import types

    # Attach every character sheet, followed by a label saying how to use it.
    for char_name, sheet_bytes in character_sheets.items():
        contents.append(types.Part.from_bytes(data=sheet_bytes, mime_type="image/png"))
        if char_name in present_set:
            contents.append(types.Part.from_text(
                text=f"[CHARACTER REFERENCE — IN THIS PANEL] {char_name} — match this design EXACTLY."
            ))
        else:
            contents.append(types.Part.from_text(
                text=f"[CHARACTER REFERENCE — NOT IN PANEL] {char_name} — provided for style consistency."
            ))

This approach keeps characters recognizable from page 1 through page 10.
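The third technique, passing the previous panel as a visual cue, implies that panels are rendered in order, with each call receiving the image produced just before it. A minimal sketch of that threading follows. The render callable and its signature are hypothetical, since the article elides the real arguments of generate_panel.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical signature: (panel description, previous panel PNG or None) -> PNG bytes.
PanelRenderer = Callable[[str, Optional[bytes]], Awaitable[bytes]]


async def render_in_sequence(panels: list[str], render: PanelRenderer) -> list[bytes]:
    """Render panels one at a time, handing each call the image produced just before it."""
    images: list[bytes] = []
    previous: Optional[bytes] = None  # the first panel has no predecessor
    for panel in panels:
        png = await render(panel, previous)
        images.append(png)
        previous = png
    return images


seen_previous: list[Optional[bytes]] = []


async def _fake_render(panel: str, previous: Optional[bytes]) -> bytes:
    # Stand-in for the real image call; records which previous image it was given.
    seen_previous.append(previous)
    return panel.encode()


images = asyncio.run(render_in_sequence(["p1", "p2", "p3"], _fake_render))
```

The trade-off is that panels cannot be generated in parallel, because each one depends on the output of the one before it.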
Real‑Time Streaming with Server‑Sent Events
Generating many panels can take time. Instead of a static loading spinner, Enpitsu streams panels to the UI as they are generated using Server‑Sent Events (SSE), allowing users to watch their manga being drawn in real time.

    # Inside the FastAPI route handler:
    async def event_stream():
        for panel in panels:
            png_bytes = await generate_panel(panel, ...)
            event = PanelGenerationEvent(
                image_base64=base64.b64encode(png_bytes).decode()
            )
            # Each SSE message is a "data:" line terminated by a blank line.
            yield f"data: {event.model_dump_json()}\n\n"
        yield "data: [DONE]\n\n"  # sentinel telling the client the stream is finished

    return StreamingResponse(event_stream(), media_type="text/event-stream")

The completed manga is displayed in a reader UI and exported as a PDF using html2canvas + jsPDF.
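To make the wire format of the SSE stream concrete, here is a small sketch of how a consumer could decode it. The real client is the Next.js frontend, so this Python version only illustrates the framing. It assumes each event carries an image_base64 field, as in the server code, and handles only "data: " lines with a single space, which is a simplification of the full SSE grammar.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_panel_images(lines: Iterable[str]) -> Iterator[bytes]:
    """Yield decoded PNG bytes from SSE lines until the [DONE] sentinel arrives."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("data: "):
            continue  # blank separators, comments, and other fields are skipped
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        event = json.loads(payload)
        yield base64.b64decode(event["image_base64"])


# Build a stream in the same wire format and decode it again.
fake_png = b"\x89PNG fake bytes"
frame = json.dumps({"image_base64": base64.b64encode(fake_png).decode()})
stream = f"data: {frame}\n\ndata: [DONE]\n\n".splitlines()
decoded = list(iter_panel_images(stream))
```

Because each message is self-delimiting, the consumer can display a panel as soon as its data line arrives instead of waiting for the whole response.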
Tech Stack
| Layer | Technology |
|---|---|
| Frontend | Next.js 16, React 19, TypeScript, Tailwind CSS |
| Backend | Python, FastAPI, Uvicorn |
| AI | Google Gemini 2.5 Flash + Gemini Image Models (Google GenAI SDK) |
| Auth | Firebase Authentication + Firebase Admin SDK |
| Export | html2canvas + jsPDF |
Lessons Learned
- Multimodal input is a powerful consistency tool; treating character sheets as “visual anchors” works for any project needing consistent AI characters.
- Structured JSON output with response_schema eliminates post‑processing of Gemini’s text output.
- SSE is a simple, effective protocol for streaming AI results, often preferable to WebSockets for server‑to‑client progress updates.
Future Work
- Phase 2: LiveKit integration—describe a scene with voice and watch it generate in real time.
- Project persistence and panel regeneration are also on the roadmap.
Source Code
The full implementation is available at: