Supertonic 3: What On-Device TTS Starts to Look Like When ONNX Is the Contract

Curiosity: Can speech become a local runtime feature?

When I build AI features for games or interactive systems, I keep coming back to the same uncomfortable question: what happens when a feature is technically impressive, but operationally too heavy to ship?

Text-to-speech is a good example. Cloud TTS is convenient, but production teams still have to account for latency, API cost, network availability, privacy boundaries, and platform-specific runtime behavior. Those constraints matter even more in games, digital humans, browser experiences, kiosks, and edge devices where speech is part of the interaction loop rather than a background batch job.

That is why the PyTorch Korea write-up on Supertonic 3 caught my attention. The interesting claim is not only that it is a multilingual TTS model. The stronger claim is that the model, runtime examples, and ONNX assets are shaped like something a product team can actually embed.

Supertonic 3 overview

The PyTorch Korea discussion page frames Supertonic as an on-device, multilingual TTS system from Supertone AI. In the linked official repository, Supertonic is described as an ONNX Runtime-based system that runs locally with no cloud call in the inference path.

The current public signals I verified from the linked sources are:

| Area | What I found | Why it matters |
| --- | --- | --- |
| Model size | Supertonic 3 is presented as an approximately 99M-parameter open-weight model | Small enough to make browser, desktop, mobile, and edge deployment more realistic |
| Languages | 31 supported languages, including Korean, Japanese, English, German, French, Spanish, Hindi, Vietnamese, and more | Useful for global content workflows without separate per-language products |
| Runtime contract | ONNX Runtime examples for Python, Node.js, browser WebGPU/WASM, Java, C++, C#, Go, Swift, iOS, Rust, and Flutter | The model is not locked to one application stack |
| Output | 44.1 kHz 16-bit WAV | Production playback can start from a clean audio format |
| Expressive text | Inline tags such as `<laugh>`, `<breath>`, and `<sigh>` | Speech behavior can be controlled from text without a separate reference-audio workflow |
| Python SDK | PyPI reports `supertonic` version 1.3.1 | The SDK is not just a repo demo; it is packaged for normal Python installation |
| Local server | The May 18 update adds `supertonic serve` with native `/v1/tts` and OpenAI-compatible `/v1/audio/speech` endpoints | Existing agent tools, local apps, and OpenAI-style clients can swap base URLs |

The ONNX contract is the key design decision. Once the model assets and input/output conventions are stable, the product surface becomes much broader: a Unity tool can call a local service, a browser demo can use WebGPU or WASM, a Python pipeline can batch-generate voice lines, and a mobile build can reuse the same conceptual model path.

```mermaid
flowchart LR
    A[Text and voice style] --> B[Supertonic frontend]
    B --> C[ONNX Runtime]
    C --> D[Python app]
    C --> E[Browser WebGPU or WASM]
    C --> F[Mobile or edge runtime]
    C --> G[Local HTTP server]
    G --> H[OpenAI-compatible audio client]
```
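As a minimal sketch of the "Python pipeline can batch-generate voice lines" path, the helper below loops over a dictionary of lines and writes one WAV per line. It assumes a `tts` object that exposes the `get_voice_style`, `synthesize`, and `save_audio` calls used in the SDK quick-start later in this post. The function name `batch_synthesize`, the line IDs, and the output layout are my own illustrative choices, not part of the SDK.

```python
from pathlib import Path


def batch_synthesize(tts, lines, out_dir, lang="en", voice_name="M1",
                     total_steps=8, speed=1.05):
    """Write one WAV per entry in `lines` (a dict of line_id -> text).

    `tts` is assumed to behave like the SDK object in the quick-start:
    get_voice_style(voice_name=...), synthesize(...), save_audio(wav, path).
    Returns a dict of line_id -> duration in seconds.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    # Load the voice style once and reuse it for every line.
    style = tts.get_voice_style(voice_name=voice_name)

    durations = {}
    for line_id, text in lines.items():
        wav, duration = tts.synthesize(
            text=text,
            lang=lang,
            voice_style=style,
            total_steps=total_steps,
            speed=speed,
        )
        tts.save_audio(wav, str(out_path / f"{line_id}.wav"))
        # The quick-start indexes duration[0], so treat it as a sequence.
        durations[line_id] = float(duration[0])
    return durations
```

Passing the `tts` object in, rather than constructing it inside the helper, keeps model loading out of the loop and lets the same function be exercised with a stub in tests.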

The evidence: accuracy, size, and runtime footprint

The source page includes three useful visual anchors. I downloaded the original images so this post does not depend on remote onebox previews.

Figure: Supertonic 3 multilingual WER and CER comparison

The first chart compares reading accuracy across languages. The practical takeaway is not that one chart settles all TTS quality questions. It is that a compact on-device model is being evaluated against larger open TTS systems with the right kind of metric pressure: word error rate and character error rate.
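For readers who have not worked with these metrics, here is a small sketch of how WER and CER are usually defined: the edit distance between a reference and a hypothesis, divided by the reference length. In TTS evaluation the hypothesis is typically an ASR transcript of the synthesized audio, but I do not know the exact pipeline behind the chart, and the choice to ignore whitespace in CER is my assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the hypothesis
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over non-whitespace characters (an assumption)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

CER is the more meaningful of the two for languages without whitespace word boundaries, which is presumably why a multilingual comparison reports both.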

Figure: Supertonic 3 parameter size comparison

The second chart is the deployment story. Model size is not just an ML benchmark detail. It affects cold start, download friction, browser cache behavior, package size, memory pressure, and whether a feature survives contact with mid-range hardware.
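To make the size argument concrete, here is a back-of-envelope estimate of raw weight storage for a 99M-parameter model at a few numeric precisions. The sources I read do not say which precision the published ONNX assets use, so treat these as bounds to compare against the real download, not as the actual file size. The estimate also ignores graph metadata, tokenizer files, and voice-style assets.

```python
def weight_size_mb(num_params, bytes_per_param):
    """Raw weight storage in decimal megabytes, ignoring file overhead."""
    return num_params * bytes_per_param / 1_000_000


PARAMS = 99_000_000  # "approximately 99M" from the overview table

for label, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{label}: {weight_size_mb(PARAMS, nbytes):.0f} MB")

# float32: 396 MB
# float16: 198 MB
# int8: 99 MB
```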

Figure: Supertonic runtime footprint comparison

The third chart is where my production instincts kick in. A CPU-friendly runtime footprint changes the default architecture discussion. Instead of starting from “Where is the GPU server?”, a team can start from “Which local runtime is acceptable for this product surface?”

Implementation notes: the sample code is the real story

I attached representative sample files from the official repository:

The quick Python SDK path is intentionally simple:

```python
from supertonic import TTS

tts = TTS(auto_download=True)
style = tts.get_voice_style(voice_name="M1")

text = "Supertonic is a lightning fast, on-device TTS system."
wav, duration = tts.synthesize(
    text=text,
    lang="en",
    voice_style=style,
    total_steps=8,
    speed=1.05,
)

tts.save_audio(wav, "output.wav")
print(f"Generated {duration[0]:.2f}s of audio")
```
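A quick way to confirm that the generated file matches the advertised 44.1 kHz 16-bit format is to read its header with Python's standard `wave` module. This is a sketch that assumes the SDK writes plain PCM WAV. The `wave` module raises `wave.Error` on other encodings, which is itself a useful signal. The helper name `describe_wav` is mine.

```python
import wave


def describe_wav(path):
    """Read basic format facts from a PCM WAV file header."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return {
            "sample_rate_hz": rate,
            "bit_depth": wf.getsampwidth() * 8,  # bytes per sample -> bits
            "channels": wf.getnchannels(),
            "duration_s": wf.getnframes() / rate,
        }
```

Calling `describe_wav("output.wav")` after the quick-start should then report a sample rate of 44100 and a bit depth of 16 if the file matches the table above. Comparing `duration_s` against the duration returned by `synthesize` is a cheap sanity check for a build pipeline.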

For teams that want direct ONNX assets rather than the SDK abstraction, the repository flow is also clear:

```bash
git clone https://github.com/supertone-inc/supertonic.git
cd supertonic

git lfs install
git clone https://huggingface.co/Supertone/supertonic-3 assets

cd py
uv sync
uv run example_onnx.py
```
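One failure mode worth guarding against in this flow: if Git LFS is not active when the assets are cloned, large files arrive as small text pointer files, and the runtime fails later with a confusing model-load error. The sketch below scans for such pointers by checking the standard Git LFS pointer header. I am assuming the model files use the `.onnx` extension somewhere under `assets`; the helper name is hypothetical.

```python
from pathlib import Path

# Every Git LFS pointer file begins with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(assets_dir, pattern="*.onnx"):
    """Return files under assets_dir that are still unresolved LFS pointers."""
    pointers = []
    for path in sorted(Path(assets_dir).rglob(pattern)):
        with path.open("rb") as fh:
            if fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointers.append(path)
    return pointers
```

An empty list from `find_lfs_pointers("assets")` means the weights were actually downloaded. A non-empty list usually means `git lfs pull` still needs to run inside the assets directory.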

The May 18 SDK update is especially relevant for agentic and tool-heavy workflows:

```bash
pip install 'supertonic[serve]'
supertonic serve --host 127.0.0.1 --port 7788
```

That exposes a native /v1/tts endpoint and an OpenAI-compatible /v1/audio/speech endpoint. In practical terms, a local agent, Electron tool, browser extension, or internal build pipeline can treat local speech synthesis as a service without moving audio generation into the cloud.
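Here is roughly what a base-URL swap could look like from a plain Python client, using only the standard library. The request body follows the shape of OpenAI's speech endpoint (`model`, `input`, `voice`, `response_format`). The specific values the local server accepts are assumptions on my part: `"supertonic"` as a model name is a placeholder, and `"M1"` is borrowed from the SDK quick-start. Check the server's own documentation before relying on either.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://127.0.0.1:7788",
                         voice="M1", model="supertonic"):
    """Build a POST request for an OpenAI-style /v1/audio/speech endpoint.

    The `model` and `voice` defaults are placeholders, not confirmed values.
    """
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path, **kwargs):
    """Send the request to a running server and save the returned audio."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        audio = response.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
```

Separating request construction from the network call means the payload can be inspected or unit-tested without a server running, and the only thing that changes between a cloud and a local deployment is `base_url`.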

Innovation: Where I would use this in production

For games and interactive media, I would not start by asking whether Supertonic replaces every premium cloud voice product. That is the wrong first question.

I would ask where local, fast, private, good-enough speech unlocks a feature that was previously too expensive or too fragile:

| Use case | Why local TTS matters | First experiment |
| --- | --- | --- |
| NPC bark generation in internal tools | Designers can preview lines without waiting on a service call | Add `supertonic serve` behind an editor button |
| Accessibility narration | Text-heavy interfaces can generate speech without sending player text out | Test latency and voice consistency on target hardware |
| Browser-based companion apps | WebGPU/WASM path keeps inference near the user | Prototype page narration with cached ONNX assets |
| Offline kiosks or exhibitions | Network independence is part of the product requirement | Package fixed voice styles with the app |
| Agent workflows | Agents can generate reviewable audio artifacts from local text | Use the OpenAI-compatible endpoint as a base-URL swap |

My main caution is the same one I apply to most AI runtime work: benchmark the real product path, not the README path. Voice quality, startup time, package size, browser support, threading behavior, and target-device thermals all matter. But the architecture is promising because it gives teams multiple ways to run the same capability.
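A small harness makes "benchmark the real product path" actionable. The sketch below separates cold-start load time from steady-state synthesis and reports the real-time factor (processing time divided by audio duration, where values below 1.0 mean faster than real time). The `load_fn` and `synth_fn` callables are hypothetical hooks you would wire to your actual integration, whether that is the SDK, the local server, or a native binding.

```python
import statistics
import time


def benchmark(load_fn, synth_fn, texts, warmup=1):
    """Measure load time and real-time factor for a TTS integration.

    load_fn()            -> engine object (timed once as cold start)
    synth_fn(engine, t)  -> duration of the generated audio in seconds
    """
    start = time.perf_counter()
    engine = load_fn()
    load_s = time.perf_counter() - start

    # Warm-up runs absorb one-time costs such as lazy initialization.
    for text in texts[:warmup]:
        synth_fn(engine, text)

    rtfs = []
    for text in texts:
        start = time.perf_counter()
        audio_s = synth_fn(engine, text)
        elapsed = time.perf_counter() - start
        rtfs.append(elapsed / audio_s)

    return {
        "load_s": load_s,
        "median_rtf": statistics.median(rtfs),
        "worst_rtf": max(rtfs),
    }
```

Run it on the slowest device you intend to support, with lines drawn from real content. Worst-case RTF tends to matter more than the median when speech sits inside an interaction loop.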

Takeaways

| Insight | Implication | Next step |
| --- | --- | --- |
| ONNX is the product contract | One model path can support many runtime surfaces | Test the same asset in Python, browser, and one native target |
| Small model size changes feasibility | Speech can move closer to the user and the device | Measure cold start and memory before debating architecture |
| The local server matters | Existing OpenAI-style clients can integrate with less glue code | Try a base-URL swap against `/v1/audio/speech` |
| Expressive tags are product controls | Designers can shape speech behavior directly in text | Build authoring guidelines for tags, speed, and language codes |
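Authoring guidelines are easier to enforce when a tool checks them. As one possible starting point, this sketch flags inline tags that are not on an allow-list before text reaches the synthesizer. The allow-list contains only the three tags named in this post. The model may well support more, so treat the set as a placeholder to fill in from the official documentation.

```python
import re

# Only the tags mentioned in this post; extend from the official docs.
ALLOWED_TAGS = {"laugh", "breath", "sigh"}

# Matches simple inline tags such as <laugh>; comparisons like "a < b" are ignored.
TAG_PATTERN = re.compile(r"<([A-Za-z_]+)>")


def find_unknown_tags(text, allowed=ALLOWED_TAGS):
    """Return tag names found in text that are not in the allow-list."""
    return [
        match.group(1)
        for match in TAG_PATTERN.finditer(text)
        if match.group(1).lower() not in allowed
    ]


print(find_unknown_tags("Well <laugh> that was close <sigh>"))  # []
print(find_unknown_tags("Hi <giggle> there <breath>"))          # ['giggle']
```

A check like this fits naturally into a dialogue-authoring tool or a content CI step, where a typo in a tag would otherwise be read aloud or silently dropped.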

The new question this raises for me is simple: if speech synthesis becomes a local runtime primitive, what other “cloud-only” AI features should we start redesigning as device-native interaction systems?

This post is licensed under CC BY 4.0 by the author.