Self-Hosting a Text-to-Speech App on Google Colab
Source: Dev.to
Introduction
Text‑to‑speech has quietly moved from robotic‑sounding demos to voices that feel natural and expressive. The problem is that good‑quality speech usually comes with usage limits or per‑character pricing. Running modern models locally can also be difficult without a capable GPU. A practical middle ground is to use free cloud compute and host the app yourself.
In this article, we will build a complete text‑to‑speech web application using Google Colab, the Kokoro TTS model, a clean interface with Gradio, and public access through Pinggy. Everything runs inside a Colab notebook and stays active as long as the session is alive.
Why Run Text‑to‑Speech on Colab
- Most commercial TTS platforms charge by characters or audio length. That works for small projects but quickly becomes restrictive when experimenting or generating large amounts of audio.
- Colab provides free access to a Tesla T4 GPU, which is more than enough for lightweight speech models. Even though Kokoro can run on CPU, GPU acceleration makes generation much faster and smoother, especially for longer text.
- Colab notebooks are not publicly reachable by default. This is where Pinggy becomes useful. It creates a secure tunnel and exposes your local web app with a public URL. The result is a setup where you write code in a notebook, run a web app, and access it from any browser.
Getting Started with the Environment
- Open a new notebook at colab.research.google.com.
- From the Runtime menu, change the runtime type and enable GPU (choose T4 if available).
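Before installing anything, it helps to confirm the runtime actually picked up a GPU. A minimal check (PyTorch ships with Colab by default; the helper below falls back to `False` if `torch` isn't installed, so it is safe anywhere):

```python
def gpu_available():
    """Return True if PyTorch can see a CUDA GPU, False otherwise.

    Falls back to False when torch isn't installed, so this check
    also runs safely outside Colab.
    """
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False

print("GPU available:", gpu_available())
```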
Install Pinggy
The tunnel needs to be active before launching the web app.
```python
!pip install pinggy
```
Start a Tunnel
```python
import pinggy

tunnel = pinggy.start_tunnel(
    forwardto="localhost:5000"
)
print("Public URLs:", tunnel.urls)
```
Keep the printed URL handy – you’ll use it to open the application later.
Installing Text‑to‑Speech Dependencies
```python
!pip install kokoro-onnx gradio soundfile torch numpy
```
- kokoro‑onnx – handles speech synthesis.
- gradio – builds the web interface.
- soundfile – saves audio output.
- torch and numpy – numerical dependencies used during synthesis.
Understanding Kokoro TTS
- Kokoro ONNX is based on the Kokoro 82M model and optimized for efficient inference.
- The model is relatively small yet produces speech that sounds natural and clear.
- It supports multiple languages (English, Japanese, German, French, Spanish, Italian, Chinese, Korean, Portuguese, Russian) and several voice styles (male & female).
- Because it’s lightweight, it fits comfortably within Colab’s memory limits and runs reliably on the free GPU tier.
Core Text‑to‑Speech Logic
The code below downloads the model files, loads Kokoro, and defines a function that converts text into speech.
```python
import soundfile as sf
import urllib.request
import tempfile
import uuid
import os

from kokoro_onnx import Kokoro

# ----------------------------------------------------------------------
# Model URLs
# ----------------------------------------------------------------------
MODEL_URL = "https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/"
model_path = "kokoro-v1.0.onnx"
voices_bin_path = "voices-v1.0.bin"

# ----------------------------------------------------------------------
# Download if missing
# ----------------------------------------------------------------------
if not os.path.exists(model_path):
    urllib.request.urlretrieve(MODEL_URL + "kokoro-v1.0.onnx", model_path)
if not os.path.exists(voices_bin_path):
    urllib.request.urlretrieve(MODEL_URL + "voices-v1.0.bin", voices_bin_path)

# ----------------------------------------------------------------------
# Load model
# ----------------------------------------------------------------------
kokoro = Kokoro(model_path, voices_bin_path)

voice_options = list(kokoro.voices.keys())
VOICE_LABELS = {v.replace("_", " ").title(): v for v in voice_options}

LANG_LABELS = {
    "English US": "en-us",
    "English UK": "en-gb",
    "Japanese": "ja-jp",
    "Chinese": "zh-cn",
    "German": "de-de",
    "Spanish": "es-es",
    "French": "fr-fr",
    "Italian": "it-it",
    "Korean": "ko-kr",
    "Portuguese": "pt-br",
    "Russian": "ru-ru",
}

def tts_generate(text, voice_label, speed, language):
    """Generate speech from text."""
    if not text.strip():
        return None, "Please enter text"

    voice_id = VOICE_LABELS[voice_label]
    lang_code = LANG_LABELS[language]

    samples, sr = kokoro.create(
        text=text,
        voice=voice_id,
        speed=float(speed),
        lang=lang_code
    )

    # Write the samples to a uniquely named WAV file in the temp dir
    filename = f"tts_{uuid.uuid4().hex[:8]}.wav"
    path = os.path.join(tempfile.gettempdir(), filename)
    sf.write(path, samples, sr)
    return path, "Audio generated"
```
The function takes text, voice, speed, and language, then returns a playable audio file.
Building the Web Interface with Gradio
Gradio lets us turn the TTS function into a usable web app with very little code.
```python
import gradio as gr

def build_ui():
    with gr.Blocks(title="Text to Speech AI") as app:
        gr.Markdown("### Text to Speech AI")

        # --------------------------------------------------------------
        # Input components
        # --------------------------------------------------------------
        text_input = gr.Textbox(
            label="Text",
            placeholder="Enter text here",
            lines=4
        )
        with gr.Row():
            voice_dropdown = gr.Dropdown(
                label="Voice",
                choices=list(VOICE_LABELS.keys()),
                value=list(VOICE_LABELS.keys())[0]
            )
            language_dropdown = gr.Dropdown(
                label="Language",
                choices=list(LANG_LABELS.keys()),
                value="English US"
            )
        speed_slider = gr.Slider(
            minimum=0.5,
            maximum=2.0,
            value=1.0,
            step=0.1,
            label="Speed"
        )
        generate_btn = gr.Button("Generate Speech")

        # --------------------------------------------------------------
        # Output components
        # --------------------------------------------------------------
        audio_output = gr.Audio(label="Output")
        status_output = gr.Markdown()

        # --------------------------------------------------------------
        # Interaction
        # --------------------------------------------------------------
        generate_btn.click(
            fn=tts_generate,
            inputs=[text_input, voice_dropdown, speed_slider, language_dropdown],
            outputs=[audio_output, status_output]
        )
    return app

# Launch the interface
ui = build_ui()
ui.launch(server_name="0.0.0.0", server_port=5000, share=False)
```
Running the notebook will:
- Start a Pinggy tunnel (see earlier).
- Launch the Gradio UI on localhost:5000.
- Expose the public URL provided by Pinggy, allowing anyone with the link to use the TTS app.
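Gradio takes a moment to bind the port, so the Pinggy URL may briefly return an error right after launch. A small, hypothetical polling helper (standard library only) can confirm the app is listening before you share the link:

```python
import socket
import time

def wait_for_port(host="localhost", port=5000, timeout=30.0):
    """Poll until a TCP server accepts connections on host:port.

    Illustrative helper, not part of the app itself; returns True once
    the port is reachable, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)
    return False

# Example: after launching the Gradio app in another cell
# print("App up:", wait_for_port("localhost", 5000))
```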
Final Notes
- Keep the Colab session alive (e.g., by periodically running a cell) to maintain the tunnel.
- If you need longer audio or higher throughput, consider upgrading to a paid Colab tier or a dedicated GPU instance.
- Feel free to experiment with different voices, languages, and speed settings to find the best fit for your project.
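One way to generate that periodic activity is a simple heartbeat cell. This is only a sketch: it reduces idle timeouts, but it cannot stop Colab from reclaiming the VM when session limits are reached.

```python
import time

def keep_alive(interval_s=300, iterations=12):
    """Print a heartbeat at a fixed interval so the session registers
    activity. Returns the timestamps it emitted."""
    stamps = []
    for _ in range(iterations):
        stamp = time.strftime("%H:%M:%S")
        print("still alive at", stamp)
        stamps.append(stamp)
        time.sleep(interval_s)
    return stamps
```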
Once this cell runs, the app starts listening on port 5000 inside the notebook.
Accessing the App from Your Browser
Open the Pinggy URL printed earlier. You should see the text‑to‑speech interface.
- Type some text.
- Select a voice and language.
- Adjust the speed if needed.
- Generate audio.
The output can be played directly or downloaded as a WAV file.
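If you want to sanity-check a downloaded file without extra dependencies, the standard-library `wave` module can read the WAV header. This helper is an illustrative addition, not part of the app:

```python
import wave

def wav_duration_seconds(path):
    """Read a WAV header with the standard library and return the
    clip duration in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()
```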
Voice and Language Tips
- For the most natural results, match the voice prefix with the language of the text.
- English voices → English text
- Japanese voices → Japanese text
- …and so on.
- The speed control is useful for narration; a slightly slower speech often sounds clearer for long paragraphs.
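In Kokoro v1.0 the voice IDs encode language and gender in a short prefix (for example `af_`/`am_` for American English, `bf_`/`bm_` for British English, `jf_`/`jm_` for Japanese). A sketch of filtering voices by that convention; the prefix table below is a partial assumption, so verify it against the voice list your model actually exposes:

```python
# Map language codes to Kokoro voice-ID prefixes. This table is an
# assumption covering only a few languages; extend it from the voice
# list your model actually exposes.
LANG_PREFIXES = {
    "en-us": ("af", "am"),
    "en-gb": ("bf", "bm"),
    "ja-jp": ("jf", "jm"),
}

def voices_for_lang(voice_ids, lang_code):
    """Return the voice IDs whose prefix matches the language code."""
    prefixes = LANG_PREFIXES.get(lang_code, ())
    return [v for v in voice_ids if v.split("_")[0] in prefixes]

print(voices_for_lang(["af_bella", "jf_alpha"], "en-us"))
```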
Performance Notes
- On the free T4 GPU, short sentences generate almost instantly.
- Longer paragraphs take a few seconds.
- The first request after loading the model may feel slower, but subsequent generations are faster.
- Colab sessions can disconnect after inactivity, so download important audio files before closing the notebook.
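For long paragraphs, one practical workaround is to split the text on sentence boundaries and synthesize each chunk separately, concatenating the audio afterwards. A minimal, hypothetical splitter:

```python
import re

def chunk_text(text, max_chars=300):
    """Split text on sentence boundaries into chunks of at most
    roughly max_chars characters, so each TTS call stays short."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the generation function in turn; note that a single sentence longer than `max_chars` is kept whole rather than split mid-sentence.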
Conclusion
Self‑hosting a text‑to‑speech system on Google Colab is a practical way to explore high‑quality voice synthesis without dealing with usage limits or infrastructure setup.
- Kokoro provides a good balance between model size and audio quality.
- Gradio keeps the interface simple.
- Pinggy bridges the gap between a private notebook and public access.
This setup works well for learning, prototyping, accessibility tools, or content‑creation workflows where flexibility matters more than polished commercial platforms.