What is the Maya1 TTS Model?

Maya1 is an open-source speech model designed for expressive voice generation with rich human emotion and precise voice design. Built by Maya Research, this model represents a significant step forward in making high-quality voice AI accessible to everyone.

This model allows you to create natural-sounding voices by simply describing them in plain language. Instead of working with complex parameters or technical settings, you describe the voice as if you were briefing a voice actor. The model handles over 20 different emotions including laughter, crying, whispering, anger, sighing, and gasping, bringing a human touch to synthetic speech.

Maya1 runs on a single GPU and uses a 3-billion parameter architecture based on the Llama transformer model. It generates audio at 24 kHz quality and supports real-time streaming, making it suitable for production environments where low latency matters.


Overview of Maya1

Feature | Description
Model Name | Maya1
Category | Text-to-Speech AI
Function | Emotional Voice Synthesis
Parameters | 3 Billion
License | Apache 2.0 (Open Source)
Audio Quality | 24 kHz, Mono
Hardware Requirements | Single GPU (16GB+ VRAM)
Streaming Support | Real-time with SNAC codec

Why Maya1 is Different

Most voice AI tools today are either closed-source services that charge per second of audio generated, or open-source models that lack emotional range and natural voice control. Maya1 bridges this gap by offering production-quality emotional speech synthesis with full transparency and no usage fees.

The model stands out because it accepts natural language descriptions for voice design. You can specify age, accent, pitch, timbre, pacing, and character traits in plain English. This approach makes voice creation intuitive for developers, content creators, and researchers who need expressive speech without technical audio engineering knowledge.

Maya1 supports inline emotion tags that let you place emotional expressions exactly where they belong in your text. For example, you can insert laughter mid-sentence or add a whisper for dramatic effect. This granular control over emotional delivery creates more natural and engaging speech output.
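For example, a single line of dialogue can shift register twice. The <laugh> tag appears in the official Quick Start example below; the other tag here assumes the same angle-bracket form as the emotion names listed under Key Features.

# Inline emotion tags placed exactly where the delivery should change
text = (
    "You finished the whole project in one night? <laugh> "
    "That's incredible. <whisper> Just don't make it a habit."
)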

The model uses the SNAC neural codec for audio generation, which enables real-time streaming at approximately 0.98 kilobits per second. This efficient encoding makes Maya1 practical for applications like voice assistants, interactive agents, game characters, and live content generation where latency matters.

Key Features of Maya1

  • Natural Language Voice Control

    Describe voices the way you would brief a human voice actor. Specify age, accent, pitch, character traits, and delivery style in plain language. The model interprets these descriptions and generates matching voice output without requiring technical audio parameters or training data.

  • 20+ Inline Emotions

    Insert emotion tags directly into your text to control expressive delivery. Supported emotions include laugh, giggle, chuckle, sigh, whisper, angry, gasp, cry, and over a dozen more. Place these tags exactly where you want emotional expression to occur for natural-sounding speech.

  • Real-Time Streaming

    Generate audio in real-time with the SNAC neural codec operating at approximately 0.98 kbps. The streaming capability makes Maya1 suitable for voice assistants, interactive AI agents, live content generation, and other applications where low latency is important.

  • Single GPU Deployment

    Run the entire model on a single GPU with 16GB or more of VRAM. Compatible with consumer hardware like RTX 4090 as well as data center GPUs like A100 and H100. This accessibility makes Maya1 practical for individual developers and small teams.

  • Apache 2.0 License

    Fully open-source under Apache 2.0 license. Use Maya1 commercially, modify the code, deploy in production, and build products without licensing fees or usage restrictions. Own your deployment completely.

  • Production-Ready Infrastructure

    Includes vLLM integration for scaling, automatic prefix caching for efficiency, and WebAudio compatibility for browser playback. The architecture is designed for production deployment with features that improve performance in real-world applications.
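As a rough illustration of the vLLM path, the sketch below assumes Maya1 loads as a standard Hugging Face causal LM in vLLM; the model name and prompt format come from the Quick Start example later on this page, and the sampling values mirror it.

from vllm import LLM, SamplingParams

# Prefix caching speeds up repeated voice descriptions, as noted above
llm = LLM(model="maya-research/maya1", dtype="bfloat16",
          enable_prefix_caching=True)

prompt = '<description="Realistic male voice in the 30s age with american accent."> Hello from vLLM!'
params = SamplingParams(temperature=0.4, top_p=0.9, max_tokens=500)

outputs = llm.generate([prompt], params)
# The generated IDs are SNAC codec tokens and still need decoding to audio
# (see the Quick Start example for the decode step)
print(len(outputs[0].outputs[0].token_ids), "tokens generated")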

Technical Architecture

Maya1 uses a 3-billion parameter decoder-only transformer based on the Llama architecture. Rather than predicting raw audio waveforms, the model generates SNAC neural codec tokens. These tokens represent audio in a compressed hierarchical format that enables efficient streaming and generation.

The SNAC codec uses a multi-scale hierarchical structure operating at approximately 12, 23, and 47 Hz. This structure keeps autoregressive sequences compact while maintaining audio quality. Each audio frame requires 7 tokens, making the generation process efficient enough for real-time applications.
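As a sanity check, these figures multiply out to the bitrate quoted earlier: roughly 12 frames per second, 7 tokens per frame, and 12-bit codes (the 4096-entry codebooks that the Quick Start's % 4096 arithmetic assumes).

# Back-of-the-envelope bitrate from the figures in this section
frames_per_second = 12   # coarsest SNAC level (~12 Hz)
tokens_per_frame = 7     # 1 coarse + 2 mid + 4 fine codes
bits_per_token = 12      # log2(4096) possible values per code
print(frames_per_second * tokens_per_frame * bits_per_token)  # 1008 bits/s ≈ 0.98 kbps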

The model was pretrained on an internet-scale English speech corpus to learn broad acoustic patterns and natural coarticulation. After pretraining, supervised fine-tuning used a curated dataset of studio recordings with human-verified voice descriptions, over 20 emotion tags per sample, multi-accent English coverage, and character variations.

The training pipeline includes:

  • 24 kHz mono resampling with loudness normalization
  • Voice activity detection for silence trimming
  • Forced alignment for clean phrase boundaries
  • Text deduplication using MinHash-LSH
  • Audio deduplication with Chromaprint
  • SNAC encoding with 7-token frame packing
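To make the MinHash-LSH step concrete, here is an illustrative near-duplicate text check using the datasketch library; the library choice and parameters are assumptions for illustration, not details of Maya Research's actual pipeline.

from datasketch import MinHash, MinHashLSH

def text_minhash(text: str, num_perm: int = 128) -> MinHash:
    # Hash overlapping 5-character shingles into a MinHash signature
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(1, len(text) - 4))}:
        m.update(shingle.encode("utf-8"))
    return m

# The LSH index flags transcripts whose estimated Jaccard similarity
# exceeds the threshold (0.8 here is an illustrative value)
lsh = MinHashLSH(threshold=0.8, num_perm=128)
lsh.insert("sample-1", text_minhash("Hello! This is Maya1, the open source voice model."))

query = text_minhash("Hello! This is Maya1, the open-source voice model.")
print(lsh.query(query))  # ['sample-1'] if the near-duplicate is caught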

Installation and Setup

Getting started with Maya1 requires installing a few Python packages and loading the model from the Hugging Face model hub. The process takes just a few minutes on a system with the appropriate GPU.

Requirements

Install the necessary Python packages:

pip install torch transformers snac soundfile

Loading the Model

You can load Maya1 directly from Hugging Face or clone the repository:

# Load directly in Python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "maya-research/maya1",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("maya-research/maya1")
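If you prefer to clone the repository instead, the standard Hugging Face Hub Git URL pattern should apply:

git clone https://huggingface.co/maya-research/maya1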

Quick Start Example

Generate your first emotional speech with this example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC
import soundfile as sf

# Load models
model = AutoModelForCausalLM.from_pretrained(
    "maya-research/maya1",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("maya-research/maya1")
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to("cuda")

# Design your voice
description = "Realistic male voice in the 30s age with american accent. Normal pitch, warm timbre, conversational pacing."
text = "Hello! This is Maya1 <laugh> the best open source voice AI model with emotions."

# Generate speech
prompt = f'<description="{description}"> {text}'
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=500,
        temperature=0.4,
        top_p=0.9,
        do_sample=True
    )

# Extract the SNAC audio tokens from the generated continuation.
# IDs 128266-156937 span 7 codebooks x 4096 entries (28,672 IDs),
# which is why the decoding below subtracts 128266 and takes % 4096.
generated_ids = outputs[0, inputs['input_ids'].shape[1]:]
snac_tokens = [t.item() for t in generated_ids if 128266 <= t <= 156937]

# Un-interleave each 7-token frame into SNAC's three hierarchical levels
# (1 coarse + 2 mid + 4 fine codes per frame); any incomplete trailing
# frame is dropped by the integer division.
frames = len(snac_tokens) // 7
codes = [[], [], []]
for i in range(frames):
    s = snac_tokens[i*7:(i+1)*7]
    codes[0].append((s[0]-128266) % 4096)
    codes[1].extend([(s[1]-128266) % 4096, (s[4]-128266) % 4096])
    codes[2].extend([(s[2]-128266) % 4096, (s[3]-128266) % 4096, (s[5]-128266) % 4096, (s[6]-128266) % 4096])

# Decode the code lists back to a 24 kHz waveform with the SNAC decoder
codes_tensor = [torch.tensor(c, dtype=torch.long, device="cuda").unsqueeze(0) for c in codes]
with torch.inference_mode():
    audio = snac_model.decoder(snac_model.quantizer.from_codes(codes_tensor))[0, 0].cpu().numpy()

# Save output
sf.write("output.wav", audio, 24000)
print("Voice generated successfully! Play output.wav")

Use Cases for Maya1

Game Character Voices

Create unique character voices with emotions on the fly without requiring voice actor recording sessions. Generate dialogue for NPCs, narrators, and interactive characters with appropriate emotional delivery based on game events.

Podcast and Audiobook Production

Narrate long-form content with emotional range and consistent personas across hours of audio. Use different voice descriptions for different characters or sections while maintaining quality throughout the production.

AI Voice Assistants

Build conversational agents with natural emotional responses in real-time. The streaming capability allows for responsive interactions where the assistant can express appropriate emotions based on context.

Video Content Creation

Generate voiceovers for video content with expressive delivery. Useful for educational videos, social media content, explainer videos, and any video production that needs narration with emotional nuance.

Accessibility Tools

Build screen readers and assistive technologies with natural, engaging voices. The emotional range makes content more pleasant to listen to for users who rely on text-to-speech for extended periods.

Customer Service AI

Deploy voice bots that understand context and respond with appropriate emotions. An empathetic voice can improve customer experience in automated support systems.

How Maya1 Compares

Feature | Maya1 | ElevenLabs | OpenAI TTS | Coqui TTS
Open Source | Yes | No | No | Yes
Emotions | 20+ | Limited | No | No
Voice Design | Natural Language | Voice Library | Fixed | Complex
Streaming | Real-time | Yes | Yes | No
Cost | Free | Pay-per-use | Pay-per-use | Free
Customization | Full | Limited | None | Moderate

Frequently Asked Questions