Izwi AI Engine Update: Advanced Local Audio Inference Capabilities

A significant update to the Izwi AI engine expands its local audio inference capabilities with several new features:

  • Speaker Diarization: Using Sortformer models, Izwi can now identify and separate multiple speakers, making it well suited to producing accurate meeting transcripts.
  • Forced Alignment: The Qwen3-ForcedAligner generates word-level timestamps that align audio with text, which is particularly useful for producing precise subtitles (see the SRT sketch after this list).
  • Real-Time Streaming: Transcription, chat, and text-to-speech (TTS) responses can now be streamed with incremental delivery, so output reaches the client as it is produced (a streaming sketch follows the list).
  • Multi-Format Audio: WAV, MP3, FLAC, and OGG audio is now decoded natively via Symphonia (a decoding sketch follows the list).
  • Performance Enhancements: Parallel execution, batch automatic speech recognition (ASR), a paged key-value cache, and Metal optimizations improve speed and efficiency.
  • Model Support: Supported models for TTS, ASR, chat, and diarization now include Qwen3-TTS, LFM2.5-Audio, Qwen3-ASR, Parakeet TDT, and Sortformer.

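To make the forced-alignment bullet concrete, here is a minimal sketch that turns word-level timestamps into SRT subtitles. The `AlignedWord` struct and its field names are hypothetical stand-ins; the post does not document the aligner's actual output type.

```rust
/// Hypothetical shape of a word-level alignment; the real output type of
/// Izwi's forced aligner is not documented in this post.
struct AlignedWord {
    word: String,
    start_secs: f64,
    end_secs: f64,
}

/// Format seconds as an SRT timestamp: HH:MM:SS,mmm.
fn srt_timestamp(secs: f64) -> String {
    let total_ms = (secs * 1000.0).round() as u64;
    let (h, rem) = (total_ms / 3_600_000, total_ms % 3_600_000);
    let (m, rem) = (rem / 60_000, rem % 60_000);
    let (s, ms) = (rem / 1000, rem % 1000);
    format!("{h:02}:{m:02}:{s:02},{ms:03}")
}

/// Emit one SRT cue per aligned word.
fn words_to_srt(words: &[AlignedWord]) -> String {
    words
        .iter()
        .enumerate()
        .map(|(i, w)| {
            format!(
                "{}\n{} --> {}\n{}\n",
                i + 1,
                srt_timestamp(w.start_secs),
                srt_timestamp(w.end_secs),
                w.word
            )
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let words = vec![
        AlignedWord { word: "Hello".into(), start_secs: 0.12, end_secs: 0.48 },
        AlignedWord { word: "world".into(), start_secs: 0.52, end_secs: 0.95 },
    ];
    print!("{}", words_to_srt(&words));
}
```
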
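For the streaming bullet, the sketch below shows the client-side pattern of reading a response incrementally rather than waiting for the full body. The endpoint URL, request fields, and model name are assumptions (the post does not document Izwi's HTTP API); only the incremental-delivery pattern is the point. It assumes `reqwest` (with the `json` feature), `tokio`, and `serde_json` as dependencies.

```rust
use std::io::Write;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical endpoint and payload: Izwi's actual HTTP API is not
    // documented in the post, so the URL, fields, and model name are assumed.
    let mut res = reqwest::Client::new()
        .post("http://localhost:8080/v1/chat/completions")
        .json(&serde_json::json!({
            "model": "qwen3",
            "stream": true,
            "messages": [{ "role": "user", "content": "Summarize this meeting." }]
        }))
        .send()
        .await?;

    // Consume body chunks as they arrive instead of buffering the whole
    // response; this is what incremental delivery looks like client-side.
    while let Some(chunk) = res.chunk().await? {
        print!("{}", String::from_utf8_lossy(&chunk));
        std::io::stdout().flush()?;
    }
    Ok(())
}
```
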
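Since the new format support comes from Symphonia, the following sketch shows the standard Symphonia probe-and-decode loop that such an integration typically builds on. The file name is a placeholder, and how Izwi actually feeds decoded buffers into its ASR front end is an assumption; the relevant codec features (e.g. `mp3`) must be enabled in Cargo.toml.

```rust
use std::fs::File;
use symphonia::core::codecs::{DecoderOptions, CODEC_TYPE_NULL};
use symphonia::core::formats::FormatOptions;
use symphonia::core::io::MediaSourceStream;
use symphonia::core::meta::MetadataOptions;
use symphonia::core::probe::Hint;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path; any WAV/MP3/FLAC/OGG file works with the matching
    // codec feature enabled.
    let file = File::open("meeting.mp3")?;
    let mss = MediaSourceStream::new(Box::new(file), Default::default());

    // Let Symphonia probe the container format, hinted by the extension.
    let mut hint = Hint::new();
    hint.with_extension("mp3");
    let probed = symphonia::default::get_probe().format(
        &hint,
        mss,
        &FormatOptions::default(),
        &MetadataOptions::default(),
    )?;
    let mut format = probed.format;

    // Pick the first decodable audio track and build a decoder for it.
    let track = format
        .tracks()
        .iter()
        .find(|t| t.codec_params.codec != CODEC_TYPE_NULL)
        .ok_or("no supported audio track")?;
    let track_id = track.id;
    let mut decoder = symphonia::default::get_codecs()
        .make(&track.codec_params, &DecoderOptions::default())?;

    // Decode packet by packet; a real integration would hand each decoded
    // buffer to the ASR front end here (that hand-off is an assumption).
    let mut total_frames = 0usize;
    while let Ok(packet) = format.next_packet() {
        if packet.track_id() != track_id {
            continue;
        }
        let decoded = decoder.decode(&packet)?;
        total_frames += decoded.frames();
    }
    println!("decoded {total_frames} frames");
    Ok(())
}
```
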
For a deeper dive into these updates, see the Izwi documentation or the GitHub repository. Users are invited to share feedback and support the project by starring it on GitHub.

Photo by Andrey Matveev on Pexels