Audio Module (stx.audio)

Text-to-speech synthesis and audio file playback. Supports multiple TTS backends with automatic fallback.

Quick Start

import scitex as stx

# Text-to-speech
stx.audio.speak("Analysis complete. 42 significant results found.")

# Play an audio file
stx.audio.play("notification.wav")

Key Functions

speak(text, backend=None, **kwargs)

Convert text to speech and play it through the default audio output. If no backend is specified, the first available backend in the fallback order (elevenlabs, luxtts, gtts, pyttsx3) is used.

# Use default backend
stx.audio.speak("Hello from SciTeX")

# Specify a backend (an explicit backend does not fall back by default)
stx.audio.speak("Hello", backend="pyttsx3")

play(path)

Play an audio file (WAV, MP3, etc.).

stx.audio.play("alert.wav")
stx.audio.play("recording.mp3")

available_backends()

List the TTS backends available on the current system.

backends = stx.audio.available_backends()
# e.g., ['gtts', 'pyttsx3']

Use Cases

Audible notifications are useful for long-running experiments:

import scitex as stx

@stx.session
def main(CONFIG=stx.INJECTED):
    result = train_model()  # Takes hours
    stx.audio.speak(f"Training done. Accuracy: {result.accuracy:.1%}")
    return 0

API Reference

SciTeX Audio - Text-to-Speech with Multiple Backends

Backends (fallback order):
  • elevenlabs: ElevenLabs (paid, high quality, speed=1.2)

  • luxtts: LuxTTS (open-source, offline, voice-cloning, speed=2.0)

  • gtts: Google TTS (free, requires internet, speed=1.5)

  • pyttsx3: System TTS (offline, free, uses espeak/SAPI5)

Usage:

from scitex_audio import speak
speak("Hello, world!")

from scitex_audio import get_tts, LuxTTS
tts = get_tts("luxtts")
tts.speak("Hello!")

scitex.audio.speak(text, backend=None, voice=None, play=True, output_path=None, fallback=None, rate=None, speed=None, mode=None, **kwargs)

Convert text to speech with smart local/remote switching.

Modes:
  • local: Always use local TTS backends (fails if audio unavailable)

  • remote: Always forward to relay server

  • auto: Smart routing - prefers relay if local audio unavailable

Smart Routing (auto mode):
  1. Checks if local audio sink is available (not SUSPENDED)

  2. If local unavailable and relay configured, uses relay

  3. If both unavailable, returns error with clear message
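
The three modes and the auto-mode steps above can be read as a small decision function. The following is a minimal sketch, not the library's implementation: choose_route and its arguments are hypothetical names, and raising an exception stands in for whatever error value speak() actually returns.

```python
def choose_route(mode: str, local_available: bool, relay_configured: bool) -> str:
    """Return "local" or "remote" for a given mode (hypothetical helper)."""
    if mode == "local":
        return "local"   # always local, even if playback may then fail
    if mode == "remote":
        return "remote"  # always forward to the relay server
    if mode != "auto":
        raise ValueError(f"unknown mode: {mode!r}")
    # auto, step 1: use local playback when the sink is usable
    if local_available:
        return "local"
    # auto, step 2: otherwise use the relay when one is configured
    if relay_configured:
        return "remote"
    # auto, step 3: nothing can play audio (assumption: modeled as an exception)
    raise RuntimeError("no local audio sink and no relay configured")
```

In this sketch, choose_route("auto", False, True) selects the relay, while choose_route("local", False, True) still selects local playback because the mode is fixed.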

Fallback order (local, only when backend is None): elevenlabs -> luxtts -> gtts -> pyttsx3
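
A fallback chain of this kind can be sketched as a loop over backend names. This is an illustration under assumptions, not SciTeX code: speak_with_fallback, resolve_fallback, and the synthesizers mapping are hypothetical, and the order used when an explicit backend is combined with fallback=True is a guess.

```python
from typing import Callable, Dict, Optional, Sequence

FALLBACK_ORDER = ("elevenlabs", "luxtts", "gtts", "pyttsx3")


def resolve_fallback(backend: Optional[str], fallback: Optional[bool]) -> bool:
    """Documented default: fall back only when no backend was requested."""
    if fallback is None:
        return backend is None
    return fallback


def speak_with_fallback(
    text: str,
    synthesizers: Dict[str, Callable[[str], object]],
    backend: Optional[str] = None,
    fallback: Optional[bool] = None,
    order: Sequence[str] = FALLBACK_ORDER,
) -> str:
    """Try candidate backends in turn; return the name of the first that works."""
    first = [backend] if backend is not None else []
    # Assumption: a requested backend is tried first, then the usual order.
    candidates = first + [name for name in order if name != backend]
    if not resolve_fallback(backend, fallback):
        candidates = candidates[:1]  # no fallback: only one attempt
    errors = []
    for name in candidates:
        try:
            synthesizers[name](text)  # a missing entry counts as a failure
            return name
        except Exception as exc:  # a real implementation would be narrower
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

With this rule, an explicit backend="luxtts" fails immediately if LuxTTS is unavailable, whereas backend=None walks the whole chain.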

Parameters:
  • text (str) – Text to speak.

  • backend (Optional[str]) – TTS backend (‘elevenlabs’, ‘luxtts’, ‘gtts’, ‘pyttsx3’). Auto-selects with fallback if None.

  • voice (Optional[str]) – Voice name, ID, or language code.

  • play (bool) – Whether to play the audio.

  • output_path (Optional[str]) – Path to save audio file.

  • fallback (Optional[bool]) – If None (default), fallback is enabled when backend is None and disabled when a backend is explicitly specified, so an explicit backend request fails loudly rather than silently falling back. Pass True/False to override.

  • rate (Optional[int]) – Speech rate in words per minute (pyttsx3 only, default 150).

  • speed (Optional[float]) – Speed multiplier for gtts (1.0=normal, >1.0=faster, <1.0=slower).

  • mode (Optional[str]) – Override mode (‘local’, ‘remote’, ‘auto’). Uses env if None.

  • **kwargs – Additional backend options.

Returns:

Dict with keys success, played, play_requested, backend, path (if saved), mode.

Return type:

dict

Environment Variables:

  • SCITEX_AUDIO_MODE: Default mode (‘local’, ‘remote’, ‘auto’)

  • SCITEX_AUDIO_RELAY_URL: Relay server URL for remote mode
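
Because speak() reports its outcome in a dict rather than by raising, callers may want to inspect the documented keys. A minimal sketch follows; describe_speak_result is a hypothetical helper, and the value types of the keys are assumed from their names.

```python
from typing import Any, Dict


def describe_speak_result(result: Dict[str, Any]) -> str:
    """Summarize a speak()-style result dict in one line (hypothetical helper)."""
    if not result.get("success"):
        return "speech failed"
    parts = [f"backend={result.get('backend')}", f"mode={result.get('mode')}"]
    # play_requested and played can differ, e.g. when no sink is available
    if result.get("play_requested") and not result.get("played"):
        parts.append("requested playback did not happen")
    # path is only present when the audio was saved
    if result.get("path"):
        parts.append(f"saved to {result['path']}")
    return ", ".join(parts)
```

A typical call would pass the return value of stx.audio.speak("Done", output_path="done.mp3") straight to describe_speak_result.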

scitex.audio.generate_bytes(text, backend=None, voice=None, **kwargs)

Generate TTS audio as raw bytes without playing.

Return type:

bytes
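
One use for raw bytes is writing them to a file or handing them to another service. The sketch below takes the generator as a parameter so it does not depend on any backend being installed; save_speech_bytes is a hypothetical name, and the audio format of the bytes (and so the right file extension) is not stated in this reference.

```python
from pathlib import Path
from typing import Callable


def save_speech_bytes(text: str, output_path: str,
                      generate: Callable[[str], bytes]) -> Path:
    """Write the bytes returned by generate(text) to output_path."""
    data = generate(text)
    if not data:
        raise ValueError("backend returned no audio data")
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    path.write_bytes(data)
    return path
```

In practice the generate argument would be something like stx.audio.generate_bytes, possibly wrapped in a lambda to fix the backend or voice.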

scitex.audio.stop_speech()

Stop any currently playing speech by killing espeak processes.

Return type:

None

scitex.audio.get_tts(backend=None, **kwargs)

Get a TTS instance for the specified backend.

Return type:

BaseTTS

scitex.audio.available_backends()

Return list of available TTS backends.

Return type:

list[str]

scitex.audio.announce_context(include_full_path=False, speak_aloud=True, branch_resolver=None, speak_fn=None)

Announce the current working directory and git branch.

Builds an orientation sentence (e.g. "Working in scitex-audio, on branch develop") and, by default, speaks it aloud. Useful when starting work in a new session.

Parameters:
  • include_full_path (bool) – Include the absolute path instead of just the directory name.

  • speak_aloud (bool) – Speak the announcement (default True). When False, only the context dict is returned.

  • branch_resolver (callable, optional) – Injectable callable (cwd: str) -> str | None returning the git branch name (testing seam). Defaults to a real git rev-parse subprocess.

  • speak_fn (callable, optional) – Injectable speak function (testing seam). Defaults to speak().

Returns:

{"directory": str, "directory_name": str, "git_branch": str | None, "announced_text": str, "spoke": bool}.

Return type:

dict
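
The two injectable parameters make it possible to exercise announce_context without git or an audio device. The following sketch shows stand-ins for both seams; RecordingSpeaker, fixed_branch, and announce_silently are hypothetical names, and it is an assumption that speak_fn is called with the text as its first positional argument.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class RecordingSpeaker:
    """Stand-in for speak(): records text instead of producing sound."""

    def __init__(self) -> None:
        self.spoken: List[str] = []

    def __call__(self, text: str, **kwargs: Any) -> Dict[str, Any]:
        self.spoken.append(text)
        return {"success": True, "played": False}


def fixed_branch(cwd: str) -> Optional[str]:
    """Stand-in branch resolver: report 'develop' for every directory."""
    return "develop"


def announce_silently(
    announce: Callable[..., Dict[str, Any]],
) -> Tuple[Dict[str, Any], List[str]]:
    """Call an announce_context-like function with both seams stubbed."""
    speaker = RecordingSpeaker()
    context = announce(branch_resolver=fixed_branch, speak_fn=speaker)
    return context, speaker.spoken
```

A test could then call announce_silently(stx.audio.announce_context) and assert on the returned git_branch and on the recorded sentences.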

class scitex.audio.TTS(api_key=None, voice_name=None, voice_id=None, client=None, client_factory=None, **kwargs)

Bases: object

Text-to-Speech using ElevenLabs API.

Examples

from scitex.audio import TTS

# Basic usage
tts = TTS()
tts.speak("Hello, world!")

# With custom voice
tts = TTS(voice_name="Adam")
tts.speak("Processing complete")

# Save to file without playing
tts.speak("Test", output_path="/tmp/test.mp3", play=False)

__init__(api_key=None, voice_name=None, voice_id=None, client=None, client_factory=None, **kwargs)

Initialize TTS.

Parameters:
  • api_key (Optional[str]) – ElevenLabs API key. Defaults to ELEVENLABS_API_KEY env var.

  • voice_name (Optional[str]) – Voice name (e.g., “Adam”, “Sarah”, “George”, which are free-tier voices).

  • voice_id (Optional[str]) – Direct voice ID (overrides voice_name).

  • client – Optional pre-built client (testing). When given, the lazy-load is skipped.

  • client_factory – Optional callable (api_key) -> client used by the lazy client property instead of the real ElevenLabs SDK (testing). Lets a test exercise the import-error path without uninstalling the dependency.

  • **kwargs – Additional config options (stability, speed, etc.)

property client

Lazy-load ElevenLabs client.

speak(text, output_path=None, play=True, voice_name=None, voice_id=None)

Convert text to speech and optionally play it.

Parameters:
  • text (str) – Text to convert to speech.

  • output_path (Optional[str]) – Path to save audio file. Auto-generated if None.

  • play (bool) – Whether to play the audio after generation.

  • voice_name (Optional[str]) – Override voice name for this call.

  • voice_id (Optional[str]) – Override voice ID for this call.

Return type:

Optional[Path]

Returns:

Path to the generated audio file, or None if only played.

list_voices()

List available voices from ElevenLabs.

Return type:

list

class scitex.audio.GoogleTTS(lang='en', slow=False, speed=1.5, gtts_factory=None, **kwargs)

Bases: BaseTTS

Google Text-to-Speech backend using gTTS.

Free to use, requires internet connection. Good quality voices with multi-language support. Supports speed control via pydub (requires ffmpeg).

Install: pip install gTTS pydub

property name: str

Return the backend name.

property requires_internet: bool

Whether this backend requires internet connection.

synthesize(text, output_path)

Synthesize text using Google TTS with optional speed control.

Return type:

Path

get_voices()

Get available languages as ‘voices’.

Return type:

List[dict]

class scitex.audio.ElevenLabsTTS(api_key=None, voice='adam', model_id='eleven_multilingual_v2', stability=0.5, similarity_boost=0.75, speed=1.0, client=None, **kwargs)

Bases: BaseTTS

ElevenLabs TTS backend.

High-quality voices but requires API key and has usage costs.

Environment:

ELEVENLABS_API_KEY: Your ElevenLabs API key

property name: str

Return the backend name.

property requires_api_key: bool

Whether this backend requires an API key.

property requires_internet: bool

Whether this backend requires internet connection.

property client

Lazy-load ElevenLabs client.

synthesize(text, output_path)

Synthesize text using ElevenLabs API.

Return type:

Path

get_voices()

Get available voices.

Return type:

List[dict]

class scitex.audio.SystemTTS(rate=150, volume=1.0, voice=None, engine=None, **kwargs)

Bases: BaseTTS

System TTS backend using pyttsx3.

Works offline using system’s built-in TTS engine. Quality varies by platform and available voices.

Platforms:
  • Linux: espeak/espeak-ng

  • Windows: SAPI5

  • macOS: NSSpeechSynthesizer

property name: str

Return the backend name.

property engine

Lazy-load pyttsx3 engine.

synthesize(text, output_path)

Synthesize text using system TTS.

Return type:

Path

speak_direct(text)

Speak directly without saving to file (faster).

get_voices()

Get available system voices.

Return type:

List[dict]

class scitex.audio.LuxTTS(device=None, model_id='YatharthS/LuxTTS', reference_audio=None, num_steps=4, speed=2.0, rms=0.01, t_shift=0.9, return_smooth=False, ref_duration=5.0, trim_start=None, **kwargs)

Bases: BaseTTS

LuxTTS backend - open-source voice-cloning TTS.

High-quality 48 kHz output. Near-realtime on CPU and more than 150x realtime on GPU. Requires a reference audio file for voice cloning.

Install: pip install git+https://github.com/ysharma3501/LuxTTS.git

property name: str

Return the backend name.

property requires_internet: bool

Whether this backend requires internet connection.

synthesize(text, output_path)

Synthesize text using LuxTTS.

Return type:

Path

speak(text, output_path=None, play=True, voice=None)

Synthesize and optionally play. Uses .wav temp files (not .mp3).

Return type:

dict

get_voices()

Get available voices (reference audio files).

Return type:

List[dict]
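
The backend classes above share a common surface: a name property, synthesize(text, output_path) returning a Path, and get_voices() returning a list of dicts. The sketch below mirrors that surface in a standalone toy class that writes silence with the standard-library wave module. It does not subclass the real BaseTTS, whose abstract methods are not listed in this reference; SilenceTTS, frames_per_char, and the voice dict keys are all assumptions made for illustration.

```python
import wave
from pathlib import Path
from typing import List


class SilenceTTS:
    """Toy backend: writes silence whose length scales with the text."""

    def __init__(self, sample_rate: int = 16000, frames_per_char: int = 800):
        self.sample_rate = sample_rate
        self.frames_per_char = frames_per_char

    @property
    def name(self) -> str:
        return "silence"

    @property
    def requires_internet(self) -> bool:
        return False

    def synthesize(self, text: str, output_path: str) -> Path:
        """Write a mono 16-bit WAV file and return its path."""
        path = Path(output_path)
        n_frames = len(text) * self.frames_per_char
        with wave.open(str(path), "wb") as wav:
            wav.setnchannels(1)       # mono
            wav.setsampwidth(2)       # 2 bytes per sample (16-bit PCM)
            wav.setframerate(self.sample_rate)
            wav.writeframes(b"\x00\x00" * n_frames)
        return path

    def get_voices(self) -> List[dict]:
        return [{"id": "silence", "name": "Silence"}]
```

A stand-in like this can be useful in tests that need a file-producing backend without network access or a speech engine.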

scitex.audio.check_wsl_audio()

Check WSL audio status and connectivity.

Return type:

dict

scitex.audio.check_local_audio_available()

Check if local audio playback is available.

Checks PulseAudio sink state to determine if audio can actually be heard. On NAS or headless servers, the sink is typically SUSPENDED.

In WSL environments, also checks for Windows playback fallback via PowerShell.

Return type:

dict

Returns:

dict with keys:
  • available (bool): True if local audio output is likely to work

  • state (str): ‘RUNNING’, ‘IDLE’, ‘SUSPENDED’, ‘NO_SINK’, etc.

  • reason (str): Human-readable explanation

  • fallback (str, optional): Fallback method if primary unavailable
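
A caller might turn this status dict into a log line before deciding whether to speak. A minimal sketch, assuming only the documented keys; explain_audio_status is a hypothetical helper and the wording of its messages is arbitrary.

```python
from typing import Any, Dict


def explain_audio_status(status: Dict[str, Any]) -> str:
    """Format a check_local_audio_available()-style dict for logging."""
    state = status.get("state", "UNKNOWN")
    if status.get("available"):
        message = f"local audio OK (sink {state})"
    else:
        reason = status.get("reason", "no reason given")
        message = f"local audio unavailable (sink {state}): {reason}"
    # fallback is optional, so only mention it when present
    fallback = status.get("fallback")
    if fallback:
        message += f"; fallback: {fallback}"
    return message
```

For example, the result of stx.audio.check_local_audio_available() could be passed to explain_audio_status at the start of a long-running job.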

scitex.audio.generate_env_template(include_sensitive=True, include_defaults=True)

Generate a template .src file with all environment variables.

Parameters:
  • include_sensitive (bool) – Include sensitive variables (API keys) as commented placeholders.

  • include_defaults (bool) – Include default values for variables that have them.

Returns:

Bash-compatible .src file content.

Return type:

str

scitex.audio.transcribe(audio_path, language='ja', model='tiny', whisper_cli=None, model_path=None)

Transcribe audio file to text using whisper.cpp.

Parameters:
  • audio_path (str) – Path to audio file (any format ffmpeg supports).

  • language (Optional[str]) – Language code (e.g., “ja”, “en”). None for auto-detect.

  • model (str) – Whisper model name (tiny, base, small, medium, large-v3-turbo).

  • whisper_cli (Optional[str]) – Override path to whisper-cli binary.

  • model_path (Optional[str]) – Override path to model file.

Return type:

dict

Returns:

Dict with keys: success, text, segments, language, model, audio_path.
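
Since transcribe() signals failure through the success key, a thin wrapper can convert that into an exception. This is a sketch under assumptions: transcript_or_raise is a hypothetical helper, and it only touches the success, text, language, model, and audio_path keys because the structure of segments is not described here.

```python
from typing import Any, Dict


def transcript_or_raise(result: Dict[str, Any]) -> str:
    """Return the transcript text, or raise if transcription failed."""
    if not result.get("success"):
        raise RuntimeError(
            f"transcription failed for {result.get('audio_path')!r} "
            f"(model={result.get('model')}, language={result.get('language')})"
        )
    # Strip surrounding whitespace that speech recognizers often emit
    return str(result.get("text", "")).strip()
```

A call such as transcript_or_raise(stx.audio.transcribe("memo.wav", language="en")) would then either yield text or fail visibly; memo.wav is a placeholder file name.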

scitex.audio.find_whisper_cli()

Find whisper-cli binary.

Return type:

Optional[str]

Returns:

Path to whisper-cli, or None if not found.

scitex.audio.find_whisper_model(model='tiny')

Find a whisper model file.

Parameters:

model (str) – Model name (tiny, base, small, medium, large-v3-turbo, etc.)

Return type:

Optional[str]

Returns:

Path to model file, or None if not found.

scitex.audio.available_models()

List available whisper models.

Return type:

list[str]

Returns:

List of model names (e.g., ["tiny", "base", "medium"]).
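
The list of installed models can be used to pick the most capable one present before calling transcribe(). The sketch below is one possible policy, not part of the library: pick_model and the PREFERENCE ordering (larger models first) are assumptions.

```python
from typing import Optional, Sequence

# Assumed preference: larger models first, since they are usually more accurate
PREFERENCE = ("large-v3-turbo", "medium", "small", "base", "tiny")


def pick_model(available: Sequence[str],
               preference: Sequence[str] = PREFERENCE) -> Optional[str]:
    """Return the first preferred model that is installed, else None."""
    installed = set(available)
    for name in preference:
        if name in installed:
            return name
    return None
```

The chosen name could then be passed as the model argument of transcribe(), with a None result handled as "no model installed".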