Model Training

Fine-tune TTS Model
Orpheus: high quality, GPU inference. Piper: fast, CPU-friendly ONNX inference.
Or enter a custom fine-tuned Orpheus model ID to continue training from a prior run.
A Hugging Face dataset with audio and text (or transcription) columns; see the requirements below.
Dataset Requirements
  • Columns: Must have audio and text (or transcription) columns
  • Audio: Any sample rate (resampled to 24kHz automatically)
  • Duration: ~49s max per sample at default settings (4096 seq length, ~82 tokens/sec)
  • Single speaker: All audio should be from one speaker

Example: canopylabs/zac-sample-dataset
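The requirements above can be sketched as a small pre-flight check. This is a hypothetical helper, not the trainer's actual validation: `check_dataset`, the per-sample `duration` field, and the list-of-dicts shape are all assumptions, and the 49 s cap comes from the default 4096 sequence length noted below.

```python
# Pre-flight check mirroring the dataset requirements (a sketch, not the
# trainer's real validation). MAX_SECONDS ≈ 4096 tokens / ~82 tokens per sec.
MAX_SECONDS = 49

def check_dataset(samples):
    """samples: list of dicts with 'text' (or 'transcription') and
    'duration' in seconds. Returns (kept, skipped), mirroring the
    silent-skip behaviour described in the sequence-length notes."""
    kept, skipped = [], []
    for s in samples:
        text = s.get("text") or s.get("transcription")
        if not text:
            raise ValueError("sample missing text/transcription column")
        (kept if s["duration"] <= MAX_SECONDS else skipped).append(s)
    return kept, skipped
```

Over-long samples go into `skipped` rather than raising, matching the trainer's silent-skip behaviour; a missing text column fails loudly since training cannot proceed without it.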

If left empty, 5% of the training data is held out for validation.
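A minimal sketch of that default hold-out split (a seeded shuffle; `split_validation` is a hypothetical name, not the trainer's API):

```python
import random

def split_validation(samples, val_fraction=0.05, seed=42):
    """Hold out a fraction of samples for validation (default 5%, as above)."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)          # deterministic shuffle
    n_val = max(1, int(len(samples) * val_fraction))
    val = [samples[i] for i in idx[:n_val]]
    train = [samples[i] for i in idx[n_val:]]
    return train, val
```

Seeding the shuffle keeps the split reproducible across runs, and `max(1, ...)` guarantees at least one validation sample even for tiny datasets.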

Uploads the trained model to the Hugging Face Hub.
Required for private datasets or to push the model to the Hub. Get token
Track training progress with W&B. Get API key

Hyperparameters will be automatically optimized for your dataset size and model.
Covers interleaved text + audio tokens. Orpheus (SNAC) emits ~82 audio tokens/sec, so 4096 ≈ ~49s and 2048 ≈ ~24s. The base model was trained on 8192 (~100s); going above that is extrapolation. Samples over the limit are silently skipped (check the logs for the count).
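The arithmetic above can be written out as a small helper. This is an illustrative sketch under the stated ~82 tokens/sec assumption; `max_clip_seconds` and the `text_tokens` reserve are hypothetical, not part of the trainer:

```python
# Estimate the longest audio clip that fits in a given sequence length,
# assuming Orpheus/SNAC emits ~82 audio tokens per second (from the notes above).
AUDIO_TOKENS_PER_SEC = 82

def max_clip_seconds(seq_len: int, text_tokens: int = 0) -> float:
    """Seconds of audio that fit after reserving room for text tokens."""
    return (seq_len - text_tokens) / AUDIO_TOKENS_PER_SEC

print(round(max_clip_seconds(4096)))  # prints 50 (~49s in the notes above)
print(round(max_clip_seconds(2048)))  # prints 25
```

Since the sequence interleaves text and audio tokens, passing a nonzero `text_tokens` shows how longer transcripts eat into the audio budget.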
Voice label used during training. If your dataset has a source column, it is used automatically.
Current Job
No job

No training running. Submit a job to see progress here.

Training History
Time | Base Model | Dataset | Output | Train Loss | Val Loss | Cost | Status
No training jobs yet