How We Built an AI Manga Studio with Google Gemini in a Week

Published: March 13, 2026 at 08:07 AM EDT
Source: Dev.to

Overview

Manga is one of the world’s most expressive storytelling formats, but creating it traditionally requires years of artistic training. Enpitsu (鉛筆 — Japanese for “pencil”) is a full AI manga studio powered by Google Gemini. Users type a story idea, pick a genre, and Enpitsu generates a complete manga—including script, character art, and illustrated panels—and exports it as a PDF.

Gemini‑Driven Script Generation

The first step uses Gemini 2.5 Flash with structured JSON output. By passing the genre and story prompt, Gemini returns a full manga script containing:

  • Title and Japanese title
  • Synopsis
  • Characters with visual descriptions
  • Per‑panel scene descriptions with dialogue

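The key Gemini feature is response_mime_type: "application/json" combined with a Pydantic response_schema. Together they constrain the model to return JSON that matches the schema, so the response can be loaded directly instead of being extracted from free-form text with fragile parsing.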

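from google import genai
from google.genai.types import GenerateContentConfig

client = genai.Client()  # assumption: the API key is read from the environment

# Runs inside an async function; StoryResponse is the Pydantic model for the script.
response = await client.aio.models.generate_content(
    model="gemini-2.5-flash",
    contents=user_prompt,
    config=GenerateContentConfig(
        system_instruction=SYSTEM_PROMPT,
        response_mime_type="application/json",  # ask for JSON rather than prose
        response_schema=StoryResponse,  # constrain the JSON to this schema
    ),
)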
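
The article does not show the StoryResponse schema itself. The following is a minimal sketch of what such a Pydantic model might look like, based only on the script contents listed above. Every field name here (japanese_title, visual_description, characters_present, and so on) is an assumption made for illustration, not the project's actual schema.

```python
from pydantic import BaseModel


class Character(BaseModel):
    # Hypothetical fields: a name plus the visual description
    # that later feeds character sheet generation.
    name: str
    visual_description: str


class Panel(BaseModel):
    # Hypothetical fields: one scene description, its dialogue lines,
    # and the names of the characters who appear in the panel.
    scene_description: str
    dialogue: list[str]
    characters_present: list[str]


class StoryResponse(BaseModel):
    # Top-level script: titles, synopsis, cast, and ordered panels.
    title: str
    japanese_title: str
    synopsis: str
    characters: list[Character]
    panels: list[Panel]


def parse_story(raw_json: str) -> StoryResponse:
    """Validate the model's JSON text against the schema.

    Raises pydantic.ValidationError if a field is missing or mistyped.
    """
    return StoryResponse.model_validate_json(raw_json)
```

With a schema along these lines, the JSON text returned by the call above (response.text) could be passed to parse_story to get typed objects for the later steps.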

Character Sheet Generation

For each character, Gemini’s image models generate a professional settei (設定)—the character reference sheets used in real anime production. Each sheet includes front, 3/4, and side views plus emotion expressions, rendered with clean linework on a white background.

A three‑model fallback chain across the available Gemini image preview models lets generation degrade gracefully: if one model fails, the request is retried on the next model instead of failing outright.
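
The fallback chain is not shown in the article. One way such a chain could be structured is sketched below, assuming a caller-supplied async function that takes a model name and returns image bytes. The helper name generate_with_fallback, the exception class, and the placeholder model names in the usage note are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


class AllModelsFailed(RuntimeError):
    """Raised when every model in the chain has failed."""


async def generate_with_fallback(
    models: Sequence[str],
    generate: Callable[[str], Awaitable[bytes]],
) -> tuple[str, bytes]:
    """Try each model in order and return (model_name, image_bytes)
    from the first one that succeeds."""
    errors: list[str] = []
    for model in models:
        try:
            return model, await generate(model)
        except Exception as exc:  # any failure moves on to the next model
            errors.append(f"{model}: {exc}")
    raise AllModelsFailed("; ".join(errors))
```

A caller would pass the model names in order of preference, for example ["image-model-a", "image-model-b", "image-model-c"] (placeholders, not real model IDs), together with a function that wraps the actual image-generation request.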

Consistent Panel Generation

Generating a single panel is straightforward, but maintaining character consistency across 20+ panels is challenging. Enpitsu solves this by:

  1. Passing every character’s settei sheet as a multimodal image reference in each panel‑generation call.
  2. Labeling each reference as either “IN THIS PANEL” (must match exactly) or “reference only” (style consistency).
  3. Including the previous panel as an additional visual cue.

from google.genai import types

# contents: the list of parts sent with this panel's request
# present_set: names of the characters who appear in this panel
for char_name, sheet_bytes in character_sheets.items():
    # Each sheet image is followed by a text label telling the model how to use it.
    contents.append(types.Part.from_bytes(data=sheet_bytes, mime_type="image/png"))
    if char_name in present_set:
        contents.append(types.Part.from_text(
            text=f"[CHARACTER REFERENCE — IN THIS PANEL] {char_name} — match this design EXACTLY."
        ))
    else:
        contents.append(types.Part.from_text(
            text=f"[CHARACTER REFERENCE — NOT IN PANEL] {char_name} — provided for style consistency."
        ))

This approach keeps characters recognizable from page 1 through page 10.

Real‑Time Streaming with Server‑Sent Events

Generating many panels can take time. Instead of a static loading spinner, Enpitsu streams panels to the UI as they are generated using Server‑Sent Events (SSE), allowing users to watch their manga being drawn in real time.

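import base64
from fastapi.responses import StreamingResponse

async def event_stream():
    # Generate panels one at a time and emit each as soon as it is ready.
    for panel in panels:
        png_bytes = await generate_panel(panel, ...)
        event = PanelGenerationEvent(
            image_base64=base64.b64encode(png_bytes).decode()
        )
        # An SSE message is a "data:" line terminated by a blank line.
        yield f"data: {event.model_dump_json()}\n\n"
    # Sentinel that tells the client the stream is complete.
    yield "data: [DONE]\n\n"

# Returned from inside the FastAPI route handler:
return StreamingResponse(event_stream(), media_type="text/event-stream")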
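
To make the wire format concrete, here is a small sketch of how a consumer could read that stream. It assumes the event shape from the server code above (a JSON object with an image_base64 field, followed by a [DONE] sentinel). The function name iter_panel_images is hypothetical, and the real frontend would do the equivalent in TypeScript.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_panel_images(lines: Iterable[str]) -> Iterator[bytes]:
    """Yield decoded PNG bytes for each panel event in an SSE line stream.

    Stops at the "[DONE]" sentinel and ignores anything after it.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        event = json.loads(payload)
        yield base64.b64decode(event["image_base64"])
```

Because each panel arrives as its own event, the reader can render images one by one as they are decoded rather than waiting for the whole manga.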

The completed manga is displayed in a reader UI and exported as a PDF using html2canvas + jsPDF.

Tech Stack

Layer      Technology
Frontend   Next.js 16, React 19, TypeScript, Tailwind CSS
Backend    Python, FastAPI, Uvicorn
AI         Google Gemini 2.5 Flash + Gemini Image Models (Google GenAI SDK)
Auth       Firebase Authentication + Firebase Admin SDK
Export     html2canvas + jsPDF

Lessons Learned

  • Multimodal input is a powerful consistency tool; treating character sheets as “visual anchors” works for any project needing consistent AI characters.
  • Structured JSON output with response_schema eliminates post‑processing of Gemini’s text output.
  • SSE is a simple, effective protocol for streaming AI results, often preferable to WebSockets for server‑to‑client progress updates.

Future Work

  • Phase 2: LiveKit integration—describe a scene with voice and watch it generate in real time.
  • Project persistence and panel regeneration are also on the roadmap.

Source Code

The full implementation is available at:
