Self-Hosting a Text-to-Speech App on Google Colab
Source: Dev.to
Introduction
Text‑to‑speech has quietly moved from robotic‑sounding demos to voices that feel natural and expressive. The problem is that good‑quality speech usually comes with usage limits or per‑character pricing. Running modern models locally can also be difficult without a capable GPU. A practical middle ground is to use free cloud compute and host the app yourself.
In this article, we will build a complete text‑to‑speech web application using Google Colab, the Kokoro TTS model, a clean interface with Gradio, and public access through Pinggy. Everything runs inside a Colab notebook and stays active as long as the session is alive.
Why Run Text‑to‑Speech on Colab
- Most commercial TTS platforms charge by characters or audio length. That works for small projects but quickly becomes restrictive when experimenting or generating large amounts of audio.
- Colab provides free access to a Tesla T4 GPU, which is more than enough for lightweight speech models. Even though Kokoro can run on CPU, GPU acceleration makes generation much faster and smoother, especially for longer text.
- Colab notebooks are not publicly reachable by default. This is where Pinggy becomes useful. It creates a secure tunnel and exposes your local web app with a public URL. The result is a setup where you write code in a notebook, run a web app, and access it from any browser.
Getting Started with the Environment
- Open a new notebook at colab.research.google.com.
- From the Runtime menu, change the runtime type and enable GPU (choose T4 if available).
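Before installing anything, it helps to confirm the runtime actually picked up a GPU. A minimal check (PyTorch ships with Colab by default; the helper below falls back to `False` if `torch` isn't installed, so it is safe anywhere):

```python
def gpu_available():
    """Return True if PyTorch can see a CUDA GPU, False otherwise.

    Falls back to False when torch isn't installed, so this check
    also runs safely outside Colab.
    """
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False

print("GPU available:", gpu_available())
```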
Install Pinggy
The tunnel needs to be active before launching the web app.
```python
!pip install pinggy
```
Start a Tunnel
```python
import pinggy

tunnel = pinggy.start_tunnel(
    forwardto="localhost:5000"
)
print("Public URLs:", tunnel.urls)
```
Keep the printed URL handy – you’ll use it to open the application later.
Installing Text‑to‑Speech Dependencies
```python
!pip install kokoro-onnx gradio soundfile torch numpy
```
- kokoro‑onnx – handles speech synthesis.
- gradio – builds the web interface.
- soundfile – saves audio output.
- torch and numpy – numerical dependencies used during synthesis.
Understanding Kokoro TTS
- Kokoro ONNX is based on the Kokoro 82M model and optimized for efficient inference.
- The model is relatively small yet produces speech that sounds natural and clear.
- It supports multiple languages (English, Japanese, German, French, Spanish, Italian, Chinese, Korean, Portuguese, Russian) and several voice styles (male & female).
- Because it’s lightweight, it fits comfortably within Colab’s memory limits and runs reliably on the free GPU tier.
Core Text‑to‑Speech Logic
The code below downloads the model files, loads Kokoro, and defines a function that converts text into speech.
```python
import soundfile as sf
import urllib.request
import tempfile
import uuid
import os

from kokoro_onnx import Kokoro

# ----------------------------------------------------------------------
# Model URLs
# ----------------------------------------------------------------------
MODEL_URL = "https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/"
model_path = "kokoro-v1.0.onnx"
voices_bin_path = "voices-v1.0.bin"

# ----------------------------------------------------------------------
# Download if missing
# ----------------------------------------------------------------------
if not os.path.exists(model_path):
    urllib.request.urlretrieve(MODEL_URL + "kokoro-v1.0.onnx", model_path)
if not os.path.exists(voices_bin_path):
    urllib.request.urlretrieve(MODEL_URL + "voices-v1.0.bin", voices_bin_path)

# ----------------------------------------------------------------------
# Load model
# ----------------------------------------------------------------------
kokoro = Kokoro(model_path, voices_bin_path)

voice_options = list(kokoro.voices.keys())
VOICE_LABELS = {v.replace("_", " ").title(): v for v in voice_options}

LANG_LABELS = {
    "English US": "en-us",
    "English UK": "en-gb",
    "Japanese": "ja-jp",
    "Chinese": "zh-cn",
    "German": "de-de",
    "Spanish": "es-es",
    "French": "fr-fr",
    "Italian": "it-it",
    "Korean": "ko-kr",
    "Portuguese": "pt-br",
    "Russian": "ru-ru",
}

def tts_generate(text, voice_label, speed, language):
    """Generate speech from text."""
    if not text.strip():
        return None, "Please enter text"

    voice_id = VOICE_LABELS[voice_label]
    lang_code = LANG_LABELS[language]

    samples, sr = kokoro.create(
        text=text,
        voice=voice_id,
        speed=float(speed),
        lang=lang_code
    )

    # Write the samples to a uniquely named WAV file in the temp dir
    filename = f"tts_{uuid.uuid4().hex[:8]}.wav"
    path = os.path.join(tempfile.gettempdir(), filename)
    sf.write(path, samples, sr)
    return path, "Audio generated"
```
The function takes text, voice, speed, and language, then returns a playable audio file.
Building the Web Interface with Gradio
Gradio lets us turn the TTS function into a usable web app with very little code.
```python
import gradio as gr

def build_ui():
    with gr.Blocks(title="Text to Speech AI") as app:
        gr.Markdown("### Text to Speech AI")

        # --------------------------------------------------------------
        # Input components
        # --------------------------------------------------------------
        text_input = gr.Textbox(
            label="Text",
            placeholder="Enter text here",
            lines=4
        )
        with gr.Row():
            voice_dropdown = gr.Dropdown(
                label="Voice",
                choices=list(VOICE_LABELS.keys()),
                value=list(VOICE_LABELS.keys())[0]
            )
            language_dropdown = gr.Dropdown(
                label="Language",
                choices=list(LANG_LABELS.keys()),
                value="English US"
            )
        speed_slider = gr.Slider(
            minimum=0.5,
            maximum=2.0,
            value=1.0,
            step=0.1,
            label="Speed"
        )
        generate_btn = gr.Button("Generate Speech")

        # --------------------------------------------------------------
        # Output components
        # --------------------------------------------------------------
        audio_output = gr.Audio(label="Output")
        status_output = gr.Markdown()

        # --------------------------------------------------------------
        # Interaction
        # --------------------------------------------------------------
        generate_btn.click(
            fn=tts_generate,
            inputs=[text_input, voice_dropdown, speed_slider, language_dropdown],
            outputs=[audio_output, status_output]
        )
    return app

# Launch the interface
ui = build_ui()
ui.launch(server_name="0.0.0.0", server_port=5000, share=False)
```
Running the notebook will:
- Start a Pinggy tunnel (see earlier).
- Launch the Gradio UI on localhost:5000.
- Expose the public URL provided by Pinggy, allowing anyone with the link to use the TTS app.
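Gradio takes a moment to bind the port, so the Pinggy URL may briefly return an error right after launch. A small, hypothetical polling helper (standard library only) can confirm the app is listening before you share the link:

```python
import socket
import time

def wait_for_port(host="localhost", port=5000, timeout=30.0):
    """Poll until a TCP server accepts connections on host:port.

    Illustrative helper, not part of the app itself; returns True once
    the port is reachable, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)
    return False

# Example: after launching the Gradio app in another cell
# print("App up:", wait_for_port("localhost", 5000))
```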
Final Notes
- Keep the Colab session alive (e.g., by periodically running a cell) to maintain the tunnel.
- If you need longer audio or higher throughput, consider upgrading to a paid Colab tier or a dedicated GPU instance.
- Feel free to experiment with different voices, languages, and speed settings to find the best fit for your project.
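One way to generate that periodic activity is a simple heartbeat cell. This is only a sketch: it reduces idle timeouts, but it cannot stop Colab from reclaiming the VM when session limits are reached.

```python
import time

def keep_alive(interval_s=300, iterations=12):
    """Print a heartbeat at a fixed interval so the session registers
    activity. Returns the timestamps it emitted."""
    stamps = []
    for _ in range(iterations):
        stamp = time.strftime("%H:%M:%S")
        print("still alive at", stamp)
        stamps.append(stamp)
        time.sleep(interval_s)
    return stamps
```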
Once this cell runs, the app starts listening on port 5000 inside the notebook.
Accessing the App from Your Browser
Open the Pinggy URL printed earlier. You should see the text‑to‑speech interface.
- Type some text.
- Select a voice and language.
- Adjust the speed if needed.
- Generate audio.
The output can be played directly or downloaded as a WAV file.
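If you want to sanity-check a downloaded file without extra dependencies, the standard-library `wave` module can read the WAV header. This helper is an illustrative addition, not part of the app:

```python
import wave

def wav_duration_seconds(path):
    """Read a WAV header with the standard library and return the
    clip duration in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()
```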
Voice and Language Tips
- For the most natural results, match the voice prefix with the language of the text.
- English voices → English text
- Japanese voices → Japanese text
- …and so on.
- The speed control is useful for narration; a slightly slower speech often sounds clearer for long paragraphs.
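In Kokoro v1.0 the voice IDs encode language and gender in a short prefix (for example `af_`/`am_` for American English, `bf_`/`bm_` for British English, `jf_`/`jm_` for Japanese). A sketch of filtering voices by that convention; the prefix table below is a partial assumption, so verify it against the voice list your model actually exposes:

```python
# Map language codes to Kokoro voice-ID prefixes. This table is an
# assumption covering only a few languages; extend it from the voice
# list your model actually exposes.
LANG_PREFIXES = {
    "en-us": ("af", "am"),
    "en-gb": ("bf", "bm"),
    "ja-jp": ("jf", "jm"),
}

def voices_for_lang(voice_ids, lang_code):
    """Return the voice IDs whose prefix matches the language code."""
    prefixes = LANG_PREFIXES.get(lang_code, ())
    return [v for v in voice_ids if v.split("_")[0] in prefixes]

print(voices_for_lang(["af_bella", "jf_alpha"], "en-us"))
```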
Performance Notes
- On the free T4 GPU, short sentences generate almost instantly.
- Longer paragraphs take a few seconds.
- The first request after loading the model may feel slower, but subsequent generations are faster.
- Colab sessions can disconnect after inactivity, so download important audio files before closing the notebook.
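For long paragraphs, one practical workaround is to split the text on sentence boundaries and synthesize each chunk separately, concatenating the audio afterwards. A minimal, hypothetical splitter:

```python
import re

def chunk_text(text, max_chars=300):
    """Split text on sentence boundaries into chunks of at most
    roughly max_chars characters, so each TTS call stays short."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the generation function in turn; note that a single sentence longer than `max_chars` is kept whole rather than split mid-sentence.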
Conclusion
Self‑hosting a text‑to‑speech system on Google Colab is a practical way to explore high‑quality voice synthesis without dealing with usage limits or infrastructure setup.
- Kokoro provides a good balance between model size and audio quality.
- Gradio keeps the interface simple.
- Pinggy bridges the gap between a private notebook and public access.
This setup works well for learning, prototyping, accessibility tools, or content‑creation workflows where flexibility matters more than polished commercial platforms.