Data Preparation

Create Speech Dataset
Drop audio + transcript files here
or click to browse (individual files or .zip)
Supported: audio (.wav, .mp3, .m4a, .sph) plus transcripts (.srt, .vtt, .txt), or a .zip containing both
Max: 10 GB per file, 10 GB total (up to 10 ZIPs per session)
For more than 10 audio files, upload a ZIP archive.
Matching: Files are paired by name (e.g., audio1.wav ↔ audio1.vtt)
Note: .txt files (no timestamps) require audio under 5 minutes. Use .srt/.vtt for longer audio.
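The name-based matching rule can be sketched as follows. This is only an illustration under assumptions: the helper `pair_by_stem` is hypothetical, and the tool's actual matching logic (for example, how it treats case or duplicate names) is not documented here.

```python
from pathlib import Path

# Extensions taken from the "Supported" line above.
AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".sph"}
TRANSCRIPT_EXTS = {".srt", ".vtt", ".txt"}


def pair_by_stem(filenames):
    """Pair audio and transcript files that share a base name.

    Hypothetical helper: returns (pairs, unmatched), where pairs maps
    base name -> (audio file, transcript file).
    """
    audio, transcripts = {}, {}
    for name in filenames:
        path = Path(name)
        ext = path.suffix.lower()
        if ext in AUDIO_EXTS:
            audio[path.stem] = name
        elif ext in TRANSCRIPT_EXTS:
            transcripts[path.stem] = name

    pairs = {
        stem: (audio[stem], transcripts[stem])
        for stem in audio
        if stem in transcripts
    }
    unmatched = sorted(
        [n for s, n in audio.items() if s not in pairs]
        + [n for s, n in transcripts.items() if s not in pairs]
    )
    return pairs, unmatched


pairs, unmatched = pair_by_stem(
    ["audio1.wav", "audio1.vtt", "audio2.mp3", "notes.txt"]
)
```

Running a check like this locally before uploading makes it easy to spot files that would be left without a partner.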
How to create a ZIP

Place your audio + transcript files together and zip them:

  • Mac: Select all files → right-click → Compress
  • Windows: Select all files → right-click → Send to → Compressed (zipped) folder
  • Linux/CLI: zip dataset.zip *.wav *.vtt
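If a script is more convenient than the options above, the same archive can be built with Python's standard `zipfile` module. This is a minimal sketch; the function name `zip_dataset` and its defaults are illustrative, and the default extensions simply mirror the CLI example.

```python
import zipfile
from pathlib import Path


def zip_dataset(src_dir, out_path="dataset.zip", exts=(".wav", ".vtt")):
    """Zip every file in src_dir whose extension is in exts.

    Files are stored flat (no folder prefix) so that audio and
    transcript files sit side by side inside the archive.
    """
    src = Path(src_dir)
    files = sorted(
        p for p in src.iterdir()
        if p.is_file() and p.suffix.lower() in exts
    )
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in files:
            zf.write(p, arcname=p.name)
    return [p.name for p in files]
```

For example, `zip_dataset("recordings", "dataset.zip", exts=(".mp3", ".srt"))` would archive a folder of MP3 files with SRT transcripts.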
How it works
  1. Upload paired audio + transcript files (.srt/.vtt/.txt)
  2. Process: forced alignment + smart chunking (≤30s)
  3. Output: HuggingFace dataset ready for training

Access token: required to push the dataset to the Hub.
Language of the audio and transcripts, as an ISO 639-3 code (e.g., eng for English). Affects text alignment accuracy.
Advanced Audio Settings
Speech detection (VAD) threshold. Lower = more sensitive (detects more speech). Default: 0.5
Minimum silence gap that triggers a segment split. Speech padding is set to half this value.
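To illustrate how a silence gap and a padding of half that gap interact, here is a rough sketch. It assumes speech regions arrive as (start, end) pairs in seconds; the function `split_on_silence` and its parameter names are hypothetical and do not describe the tool's actual implementation.

```python
def split_on_silence(speech, min_silence_s, total_s):
    """Merge speech regions separated by less than min_silence_s,
    then pad each resulting segment by half the silence gap.

    speech: iterable of (start, end) pairs in seconds.
    total_s: length of the recording, used to clip the padding.
    """
    pad = min_silence_s / 2
    merged = []
    for start, end in sorted(speech):
        if merged and start - merged[-1][1] < min_silence_s:
            # Gap is too short to count as a split: extend the segment.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Clip the padding so segments stay inside the recording.
    return [(max(0.0, s - pad), min(total_s, e + pad)) for s, e in merged]


segments = split_on_silence(
    [(1.0, 2.0), (2.2, 3.0), (5.0, 6.0)], min_silence_s=0.5, total_s=6.1
)
```

Because any two segments that survive the merge are at least one full gap apart, padding each side by half the gap can never make neighbouring segments overlap.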

Soft target for chunk length. Ignored if Pack is off. Default: 20s
Hard cap (max 30s for Whisper). Default: 30s
Filter out chunks shorter than this (always applied). Default: 5s
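One way the three length settings could interact is sketched below as a greedy packer. This is an assumption for illustration only: `pack_segments` is a hypothetical helper, the defaults are the ones listed above, and dropping a single segment that exceeds the hard cap is a simplification rather than documented behaviour.

```python
def pack_segments(durations, target_s=20.0, max_s=30.0, min_s=5.0):
    """Greedily group consecutive segment durations (seconds) into chunks.

    A chunk is closed once it reaches the soft target, or earlier if
    adding the next segment would exceed the hard cap. Chunks shorter
    than min_s (or longer than max_s) are filtered out at the end.
    """
    chunks, current = [], []
    for d in durations:
        if current and sum(current) + d > max_s:
            chunks.append(current)  # next segment would break the hard cap
            current = []
        current.append(d)
        if sum(current) >= target_s:
            chunks.append(current)  # soft target reached
            current = []
    if current:
        chunks.append(current)
    return [c for c in chunks if min_s <= sum(c) <= max_s]


chunks = pack_segments([8.0, 7.0, 6.0, 12.0, 3.0, 25.0, 2.0])
```

In this run the trailing 2-second segment forms its own chunk and is removed by the minimum-duration filter.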

When Pack is enabled, speech regions are concatenated and silence is removed. When it is disabled, each VAD segment becomes a separate chunk with its original timing.

Audio cleanup, applied to speech chunks after VAD extraction. Basic removes low-frequency rumble and normalizes volume. Denoise additionally reduces background noise (fans, AC, etc.).

Filter out poorly aligned chunks based on character density. Chunks with 6–32 chars/sec are kept; typical speech is ~20–25 chars/sec. The minimum chunk duration always applies separately.
Note: transcripts with digits (e.g. "1472" vs "one four seven two") have fewer characters for the same spoken duration, which may cause false filtering. Consider disabling this filter for number-heavy data.
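The character density check and the digit caveat can be made concrete with a short sketch. The helpers `chars_per_second` and `keep_chunk` are hypothetical, and counting every character including spaces is an assumption; the tool may count differently.

```python
def chars_per_second(text, duration_s):
    """Character density of a transcript chunk (spaces included)."""
    return len(text.strip()) / duration_s


def keep_chunk(text, duration_s, lo=6.0, hi=32.0):
    """Keep a chunk only if its density falls inside [lo, hi] chars/sec."""
    if duration_s <= 0:
        return False
    return lo <= chars_per_second(text, duration_s) <= hi


# The same two seconds of speech, transcribed two ways:
spelled = keep_chunk("one four seven two", 2.0)  # 18 chars -> 9 chars/sec
digits = keep_chunk("1472", 2.0)                 # 4 chars  -> 2 chars/sec
```

Here the spelled-out transcript passes while the digit version falls below the lower bound, which is the false filtering the note warns about.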
Current Job
No job

No processing running. Upload files and click Process to start.
