Model Training

Fine-tune TTS Model
Orpheus: high quality, GPU inference. Piper: fast, CPU-friendly ONNX inference.
Or enter a custom fine-tuned Orpheus model ID to continue training from a prior run.
A Hugging Face dataset with audio and text (or transcription) columns; see the requirements below.
Dataset Requirements
  • Columns: Must have audio and text (or transcription) columns
  • Audio: Any sample rate (resampled to 24kHz automatically)
  • Duration: ~49s max per sample at default settings (4096 seq length, ~82 tokens/sec)
  • Single speaker: All audio should be from one speaker

Example: canopylabs/zac-sample-dataset
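The requirements above can be sketched as a small pre-flight check. This is a hypothetical helper, not the trainer's actual validation: `check_dataset`, the per-sample `duration` field, and the list-of-dicts shape are all assumptions, and the 49 s cap comes from the default 4096 sequence length noted below.

```python
# Pre-flight check mirroring the dataset requirements (a sketch, not the
# trainer's real validation). MAX_SECONDS ≈ 4096 tokens / ~82 tokens per sec.
MAX_SECONDS = 49

def check_dataset(samples):
    """samples: list of dicts with 'text' (or 'transcription') and
    'duration' in seconds. Returns (kept, skipped), mirroring the
    silent-skip behaviour described in the sequence-length notes."""
    kept, skipped = [], []
    for s in samples:
        text = s.get("text") or s.get("transcription")
        if not text:
            raise ValueError("sample missing text/transcription column")
        (kept if s["duration"] <= MAX_SECONDS else skipped).append(s)
    return kept, skipped
```

Over-long samples go into `skipped` rather than raising, matching the trainer's silent-skip behaviour; a missing text column fails loudly since training cannot proceed without it.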

If left empty, 5% of the training data is held out for validation.
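A minimal sketch of that default hold-out split (a seeded shuffle; `split_validation` is a hypothetical name, not the trainer's API):

```python
import random

def split_validation(samples, val_fraction=0.05, seed=42):
    """Hold out a fraction of samples for validation (default 5%, as above)."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)          # deterministic shuffle
    n_val = max(1, int(len(samples) * val_fraction))
    val = [samples[i] for i in idx[:n_val]]
    train = [samples[i] for i in idx[n_val:]]
    return train, val
```

Seeding the shuffle keeps the split reproducible across runs, and `max(1, ...)` guarantees at least one validation sample even for tiny datasets.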

Uploads the trained model to the Hugging Face Hub.
Required for private datasets or to push the model to the Hub. Get token
Track training progress with W&B. Get API key

Hyperparameters will be automatically optimized for your dataset size and model.
Covers interleaved text + audio tokens. Orpheus (SNAC) emits ~82 audio tokens/sec, so 4096 ≈ ~49s and 2048 ≈ ~24s. The base model was trained on 8192 (~100s); going above that is extrapolation. Samples over the limit are silently skipped (check the logs for the count).
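The arithmetic above can be written out as a small helper. This is an illustrative sketch under the stated ~82 tokens/sec assumption; `max_clip_seconds` and the `text_tokens` reserve are hypothetical, not part of the trainer:

```python
# Estimate the longest audio clip that fits in a given sequence length,
# assuming Orpheus/SNAC emits ~82 audio tokens per second (from the notes above).
AUDIO_TOKENS_PER_SEC = 82

def max_clip_seconds(seq_len: int, text_tokens: int = 0) -> float:
    """Seconds of audio that fit after reserving room for text tokens."""
    return (seq_len - text_tokens) / AUDIO_TOKENS_PER_SEC

print(round(max_clip_seconds(4096)))  # prints 50 (~49s in the notes above)
print(round(max_clip_seconds(2048)))  # prints 25
```

Since the sequence interleaves text and audio tokens, passing a nonzero `text_tokens` shows how longer transcripts eat into the audio budget.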
Voice label used during training. If your dataset has a source column, it is used automatically.
Current Job
No job

No training running. Submit a job to see progress here.

Training History
Time | Base Model | Dataset | Output | Train Loss | Val Loss | Cost | Status
No training jobs yet