Transcription captures what was said. Acoustic indexing captures how it was said. When a caller says “fine” after a long pause, the transcript is identical to when they say it quickly in a neutral tone, but the meaning is completely different. Keyword search misses this. Sentiment dashboards summarize it away. Acoustic indexing preserves it at every turn, for every call. Mise indexes five acoustic dimensions at each transcript turn. These signals power corpus search, defect detection, and live alerting.
The five dimensions
Tone
Sentiment, irony, and sarcasm. Detects not just positive or negative valence but whether the sentiment is genuine or inverted.
Prosody
Pace, pauses, and emphasis. Captures when a speaker speeds up, slows down, or places stress on specific words.
Tension
Frustration and escalation signals. Identifies when a conversation is moving toward conflict or disengagement.
Rhythm
Cadence, interruptions, and overlap. Detects when speakers talk over each other or when silence extends past normal turn boundaries.
Intent
What the caller actually wants. Infers underlying goals from acoustic and linguistic context, not just the surface request.
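One way to picture a turn-level index entry is as a record with one group of fields per dimension. The sketch below is illustrative only: the class name `TurnAcoustics`, its field names, and the score keys are assumptions made for this example, not Mise's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TurnAcoustics:
    """Hypothetical per-turn record; names and score keys are assumed."""
    turn_index: int
    speaker: str                                             # e.g. "caller" or "agent"
    text: str                                                # transcript for this turn
    tone: Dict[str, float] = field(default_factory=dict)     # sentiment, irony, sarcasm
    prosody: Dict[str, float] = field(default_factory=dict)  # pace, pauses, emphasis
    tension: Dict[str, float] = field(default_factory=dict)  # frustration, escalation
    rhythm: Dict[str, float] = field(default_factory=dict)   # cadence, interruptions, overlap
    intent: Optional[str] = None                             # inferred underlying goal


# Two turns with the same transcript but different acoustics.
quick = TurnAcoustics(3, "caller", "fine", prosody={"pause_before_s": 0.2})
slow = TurnAcoustics(3, "caller", "fine", prosody={"pause_before_s": 2.8},
                     tension={"frustration": 0.7})

print(quick.text == slow.text)        # True
print(quick.prosody == slow.prosody)  # False
```

Keeping the transcript and the acoustic fields on the same record is what lets a query distinguish the two "fine" turns even though their text matches.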
Tone
Tone goes beyond positive/negative sentiment scores. Mise indexes:
- Sentiment: the emotional valence of what’s being said
- Irony: when tone inverts the literal meaning
- Sarcasm: a specific pattern of exaggerated or clipped delivery that signals disbelief or irritation
Prosody
Prosody is the music of speech: the variation in pace, pitch, and rhythm that carries meaning beyond words. Mise indexes:
- Pace: speaking rate and whether it’s accelerating or decelerating
- Pauses: silence duration and placement (mid-sentence pauses can signal confusion; post-response pauses can signal dissatisfaction)
- Emphasis: which words receive stress, indicating what the caller considers important
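Pace and pauses can be derived from word-level timestamps. The following is a minimal sketch under that assumption; the `Word` type, the 0.5-second gap threshold, and both function names are hypothetical and are not taken from Mise's API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the call
    end: float


def speaking_rate(words: List[Word]) -> float:
    """Words per second from the first word's start to the last word's end."""
    if not words:
        return 0.0
    duration = words[-1].end - words[0].start
    return len(words) / duration if duration > 0 else 0.0


def pauses(words: List[Word], min_gap: float = 0.5) -> List[Tuple[int, float]]:
    """Return (index of the word after the gap, gap in seconds) for long gaps."""
    found = []
    for i in range(1, len(words)):
        gap = words[i].start - words[i - 1].end
        if gap >= min_gap:
            found.append((i, round(gap, 3)))
    return found


words = [
    Word("I", 0.0, 0.2),
    Word("guess", 0.25, 0.6),
    Word("that's", 1.8, 2.1),   # preceded by a 1.2 s mid-sentence pause
    Word("fine", 2.15, 2.5),
]

print(speaking_rate(words))  # 1.6
print(pauses(words))         # [(2, 1.2)]
```

Comparing `speaking_rate` across consecutive turns would give a rough acceleration or deceleration signal; emphasis needs pitch and energy features, which this sketch does not cover.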
Tension
Tension signals indicate that a conversation is heading toward conflict or disengagement before the caller says so explicitly. Mise indexes:
- Frustration: prosodic and tonal markers associated with growing impatience
- Escalation: patterns that precede requests for a human agent or explicit complaints
Rhythm
Rhythm captures the structural dynamics of a conversation (who speaks, when, and how turns are exchanged):
- Cadence: the natural turn-taking pattern and whether it breaks down
- Interruptions: one speaker cutting off another mid-turn
- Overlap: both speakers talking simultaneously
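Overlap and interruptions reduce to interval arithmetic once each turn has a start and end time. The sketch below assumes a hypothetical `Turn` type and turns sorted by start time. It treats any cross-speaker overlap as an interruption, which is a simplification: a production system would likely separate short backchannels such as "mm-hm" from real interruptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    speaker: str
    start: float  # seconds from the start of the call
    end: float


def overlap_seconds(a: Turn, b: Turn) -> float:
    """Length of the time interval during which both turns are active."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def find_interruptions(turns: List[Turn]) -> List[int]:
    """Indices of turns that begin before the previous speaker has finished.

    Assumes `turns` is sorted by start time.
    """
    hits = []
    for i in range(1, len(turns)):
        prev, cur = turns[i - 1], turns[i]
        if cur.speaker != prev.speaker and cur.start < prev.end:
            hits.append(i)
    return hits


turns = [
    Turn("agent", 0.0, 4.0),
    Turn("caller", 3.2, 6.0),   # starts while the agent is still speaking
    Turn("agent", 6.5, 9.0),
]

print(find_interruptions(turns))  # [1]
```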
Intent
Intent indexing looks past the literal request to what the caller actually needs. A caller who says “I just wanted to check on something” while asking about a charge may actually be disputing it. Intent signals combine acoustic and linguistic context to surface:
- The caller’s underlying goal
- Whether the agent’s response addressed that goal
- Mismatches between what was asked and what was answered
How this differs from transcription and keyword search
- Keyword search
- Transcription alone
- Acoustic indexing
Keyword search finds calls where the word “cancel” appears in the transcript. It returns every call where a customer or agent said “cancel”, including accidental mentions, confirmations of a past cancellation, and discussions of a cancellation policy.
It does not distinguish between a caller who calmly asked about cancellation options and one who threatened to cancel in a tense, elevated tone.
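The gap can be shown with a toy filter. In this sketch, each turn carries a hypothetical `tension` score between 0 and 1; the field name, the sample data, and the 0.6 threshold are assumptions for illustration rather than Mise's query interface.

```python
turns = [
    {"call_id": "a", "speaker": "caller",
     "text": "How do I cancel if I ever need to?", "tension": 0.10},
    {"call_id": "b", "speaker": "agent",
     "text": "Your cancel request from May went through.", "tension": 0.05},
    {"call_id": "c", "speaker": "caller",
     "text": "Fix this or I will cancel today.", "tension": 0.85},
]


def keyword_hits(turns, word):
    """Calls where the word appears anywhere in a turn's transcript."""
    return sorted({t["call_id"] for t in turns if word in t["text"].lower()})


def tense_keyword_hits(turns, word, min_tension=0.6):
    """Calls where the word appears in a turn that also scores high on tension."""
    return sorted({
        t["call_id"] for t in turns
        if word in t["text"].lower() and t["tension"] >= min_tension
    })


print(keyword_hits(turns, "cancel"))        # ['a', 'b', 'c']
print(tense_keyword_hits(turns, "cancel"))  # ['c']
```

The keyword query returns all three calls, while adding the acoustic condition isolates the one where the caller threatened to cancel.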
Real-time vs. post-call indexing
Mise indexes acoustics at two points in the call lifecycle:
Live detection (during the call)
Tension, frustration, and escalation signals are surfaced in real time as calls progress. Your integrations can subscribe to these live events and trigger alerts or human escalation before the call ends.
Live detection prioritizes latency. Post-call indexing prioritizes completeness. Both use the same underlying acoustic models, but post-call indexing runs with access to the full conversation context.
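A consumer of live events might debounce tension scores before escalating, so that a single noisy turn does not page a human. The sketch below assumes a hypothetical event shape (`call_id`, `turn`, `tension`) and a hypothetical `escalate_to_human` action; how events are actually delivered and what fields they carry are not specified here.

```python
from typing import Dict, Iterable, Iterator


def escalation_alerts(events: Iterable[Dict], threshold: float = 0.7,
                      consecutive: int = 2) -> Iterator[Dict]:
    """Yield one alert per call once tension stays high for N turns in a row."""
    streak: Dict[str, int] = {}
    for event in events:
        call_id = event["call_id"]
        if event["tension"] >= threshold:
            streak[call_id] = streak.get(call_id, 0) + 1
        else:
            streak[call_id] = 0  # a calmer turn resets the streak
        if streak[call_id] == consecutive:
            yield {"call_id": call_id, "turn": event["turn"],
                   "action": "escalate_to_human"}


# Simulated stream of live events for a single call.
events = [
    {"call_id": "x", "turn": 1, "tension": 0.30},
    {"call_id": "x", "turn": 2, "tension": 0.75},
    {"call_id": "x", "turn": 3, "tension": 0.80},
    {"call_id": "x", "turn": 4, "tension": 0.90},
]

alerts = list(escalation_alerts(events))
print(alerts)  # [{'call_id': 'x', 'turn': 3, 'action': 'escalate_to_human'}]
```

Because the function is a generator over an iterable, it processes events as they arrive and never needs the full conversation, which matches the latency-first trade-off of live detection.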
How acoustic indexing powers other features
Acoustic signals feed directly into the rest of the platform:
Corpus Search
Queries like “calls where the caller expressed frustration” resolve against turn-level acoustic scores, not keyword matches.
Defect Signatures
Clusters are formed by acoustic similarity — calls with matching tension arcs and rhythm patterns group together even when the transcript content differs.
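To compare tension arcs from calls of different lengths, one simple approach is to resample each arc to a fixed number of points and measure the average gap between them. The sketch below uses linear interpolation and mean absolute difference as stand-ins chosen for illustration; which similarity measure Mise uses is not stated, and the function names are hypothetical.

```python
from typing import List


def resample(arc: List[float], n: int = 10) -> List[float]:
    """Linearly interpolate a non-empty per-turn score sequence to n >= 2 points."""
    if len(arc) == 1:
        return [arc[0]] * n
    out = []
    for k in range(n):
        pos = k * (len(arc) - 1) / (n - 1)   # fractional index into arc
        lo = int(pos)
        hi = min(lo + 1, len(arc) - 1)
        frac = pos - lo
        out.append(arc[lo] * (1 - frac) + arc[hi] * frac)
    return out


def arc_distance(a: List[float], b: List[float], n: int = 10) -> float:
    """Mean absolute difference between two arcs after resampling."""
    ra, rb = resample(a, n), resample(b, n)
    return sum(abs(x - y) for x, y in zip(ra, rb)) / n


# Two calls that build tension steadily, and one that stays calm.
slow_burn_short = [0.1, 0.2, 0.4, 0.7, 0.9]
slow_burn_long = [0.1, 0.15, 0.3, 0.35, 0.5, 0.7, 0.8, 0.9]
calm = [0.1, 0.1, 0.15, 0.1]

print(arc_distance(slow_burn_short, slow_burn_long)
      < arc_distance(slow_burn_short, calm))  # True
```

The two slow-burn calls land close together despite having different turn counts and, potentially, entirely different transcripts. A clustering step would then group calls by these pairwise distances.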
Call Replay
Each turn in the replay timeline displays its acoustic scores, so you can see exactly when sentiment dropped or tension spiked.
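Given per-turn scores like those shown in the timeline, locating the spike programmatically is a matter of finding the largest turn-over-turn rise. This is a minimal sketch; the function name and the sample scores are hypothetical.

```python
from typing import List, Optional, Tuple


def largest_jump(scores: List[float]) -> Tuple[Optional[int], float]:
    """Return (turn index, increase) for the biggest rise between adjacent turns.

    Returns (None, 0.0) if the scores never increase.
    """
    best_turn, best_rise = None, 0.0
    for i in range(1, len(scores)):
        rise = scores[i] - scores[i - 1]
        if rise > best_rise:
            best_turn, best_rise = i, rise
    return best_turn, best_rise


tension_by_turn = [0.10, 0.15, 0.20, 0.65, 0.70]
turn, rise = largest_jump(tension_by_turn)
print(turn)  # 3
```

Here the tension score rises most sharply going into turn 3, which is the point in the replay worth listening to first.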