Protocol Documentation

stenomatic.proto

Stenomatic service - The Voice SaaS.

Supports transcription, translation and speech synthesis (TTS), including a direct voice-to-voice flow from one language to a different language. Multiple target languages are supported at once, e.g. voice-to-voice from English into French, German and Chinese at the same time. Billing is per language, i.e. one request with N target languages is billed the same as N requests with one target language each.

The set of supported languages differs per API call, and we constantly add new ones. The current list can be obtained by calling https://api.stenomatic.com/api/v1/languages with your API key set in the x-mint-api-key HTTP header.
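
A minimal sketch of preparing that call with the Python standard library. The helper name `build_languages_request` and the placeholder key are assumptions made for illustration, and the format of the response body is not described here, so the sketch stops before parsing it.

```python
import urllib.request

LANGUAGES_URL = "https://api.stenomatic.com/api/v1/languages"


def build_languages_request(api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) the GET request for the supported-languages list."""
    request = urllib.request.Request(LANGUAGES_URL, method="GET")
    # The API key travels in the x-mint-api-key header. urllib stores the name
    # as "X-mint-api-key"; HTTP header names are case-insensitive.
    request.add_header("x-mint-api-key", api_key)
    return request


request = build_languages_request("YOUR_API_KEY")
# To actually call the service: urllib.request.urlopen(request).read()
```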

For API calls that send audio, we support only a 16kHz sampling rate, mono audio, and two encodings:

1) the raw Linear 16 PCM (uncompressed 16-bit signed little-endian samples)

2) OGG OPUS
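
For the Linear 16 PCM case, a client can check a WAV file against these constraints before streaming it. This is a sketch using Python's standard `wave` module. The function names are hypothetical, and `make_silent_wav` exists only to produce test input.

```python
import io
import wave


def check_input_wav(wav_bytes: bytes) -> bytes:
    """Return the raw PCM payload if the WAV is 16 kHz, mono, 16-bit; raise otherwise."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getframerate() != 16000:
            raise ValueError(f"expected 16000 Hz, got {wav.getframerate()} Hz")
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        # The frames of a 16-bit PCM WAV are signed little-endian samples.
        return wav.readframes(wav.getnframes())


def make_silent_wav(sample_rate: int, channels: int, seconds: float) -> bytes:
    """Build a small silent WAV in memory, used here only to exercise the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * channels * int(sample_rate * seconds))
    return buffer.getvalue()


pcm = check_input_wav(make_silent_wav(16000, 1, 0.5))
print(len(pcm))  # 16000 bytes: 8000 samples of 2 bytes each
```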

Every API call, message, and property has its own documentation about its format and potential fallbacks.

Synthesized voices audio:

We support 16kHz and 24kHz sampling rates for TTS audio in MP3, and 8kHz, 16kHz and 24kHz in raw PCM/RIFF encodings. Not every voice supports the 8kHz and 24kHz sampling rates, though -- read the fallbacks section below.

Fallbacks:

In some cases we fall back to parameters different from the ones set in the config request. Not every voice is available in every gender and sampling rate. If the requested gender is not supported at all, we fall back to the other gender. If the requested sampling rate is not supported, we fall back to 16kHz (see `VoiceConfig`). We always obey the gender setting if that gender is supported in any sampling rate. All voices support the 16kHz sampling rate.
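
One way to read these rules, sketched as a small function. This is an interpretation, not the service's implementation: the `available` set of (gender, sampling rate) pairs is a hypothetical data model, and the fallback to 16kHz follows the `VoiceConfig` description.

```python
def resolve_voice(available, gender, sample_rate):
    """Pick the (gender, sample_rate) pair that would be used for a request.

    `available` is a hypothetical set of (gender, sample_rate) pairs for one language.
    """
    other = "FEMALE" if gender == "MALE" else "MALE"
    genders_present = {g for g, _ in available}
    # The gender is kept whenever it exists at any sampling rate.
    chosen_gender = gender if gender in genders_present else other
    if (chosen_gender, sample_rate) in available:
        return chosen_gender, sample_rate
    # Otherwise fall back to 16 kHz, which every voice supports.
    return chosen_gender, 16000


voices = {("FEMALE", 16000), ("FEMALE", 24000), ("MALE", 16000)}
print(resolve_voice(voices, "MALE", 24000))    # ('MALE', 16000)
print(resolve_voice(voices, "FEMALE", 24000))  # ('FEMALE', 24000)
```

In the first call the gender is kept and only the sampling rate changes, because a male voice exists at 16kHz.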

Authentication:

API calls are authenticated via a header (metadata in gRPC) called x-mint-api-key. Set your API key as its value.
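
In gRPC for Python, metadata is passed as a sequence of (key, value) string pairs. The sketch below builds such a list from the authentication header and a few of the optional control headers listed under "Request headers". The helper `build_metadata`, its parameter names and the example IDs are assumptions for illustration only.

```python
from typing import List, Optional, Tuple


def build_metadata(
    api_key: str,
    request_id: Optional[str] = None,
    group_id: Optional[str] = None,
    allow_partial_translate: bool = False,
    profanity_filter: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """Build the header list in the (key, value) tuple form that gRPC metadata uses."""
    metadata = [("x-mint-api-key", api_key)]
    if request_id is not None:
        metadata.append(("x-mint-client-request-id", request_id))
    if group_id is not None:
        metadata.append(("x-mint-client-request-group-id", group_id))
    if allow_partial_translate:
        metadata.append(("x-mint-allow-partial-translate", "true"))
    if profanity_filter is not None:
        if profanity_filter not in ("raw", "masked", "removed"):
            raise ValueError(f"unknown profanity filter: {profanity_filter}")
        metadata.append(("x-mint-profanity-filter", profanity_filter))
    return metadata


# Two sides of one phone call: separate request ids, one shared group id.
caller = build_metadata("YOUR_API_KEY", request_id="call-42-caller", group_id="call-42")
callee = build_metadata("YOUR_API_KEY", request_id="call-42-callee", group_id="call-42")
```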

Request headers:

We support several optional "control" headers that affect the API calls.

"x-mint-client-request-id" sets your own ID for the request, so that you can find the request in the logs later.

"x-mint-client-request-group-id" groups several different API calls into one, e.g. the two sides of a phone call -- each side has its own "x-mint-client-request-id" but the same "x-mint-client-request-group-id".

"x-mint-allow-partial-translate" applies only to the VoiceTranslate API call. If set to "true", every "partial" response with a transcription also carries a translation of that transcription. With any other value, or when the header is missing, partial responses should not include translations, although they still may in some cases.

"x-mint-send-push-notifications" enables push notifications for API calls via our WebSocket server. They are sent via the secure WebSocket endpoint wss://api.stenomatic.com/notifications

"x-mint-allow-branch-notifications" sends push notifications (if enabled with the option above) into the client branch's specific topic instead of only the client's topic. The default is false, in which case notifications arrive in the client's topic only.

"x-mint-debug-record-audio" is for debugging purposes only. When set to "true", it records the incoming audio of the SpeechRecognition and VoiceTranslate API calls into a file. This file is saved into a Google Cloud Storage bucket and is automatically deleted after 14 days. Only selected Google Cloud project users can access/download these files.

"x-mint-profanity-filter" sets the profanity filter. Possible values are `raw`, `masked` and `removed`.

"x-mint-api-key" is the authentication header where you put your API key.

Push notifications:

The platform supports sending push notifications for the API calls' responses. We send them via our secure WebSocket endpoint (wss://api.stenomatic.com/notifications), and every "customer+client+branch+API call" combination has its own channel.

AnalyzeEntitiesRequest

Field	Type	Label	Description
text	string		Text for entities analysis.
language_code	string		The BCP-47 code of the input text's language e.g. "en-US".

AnalyzeEntitiesResponse

Field	Type	Label	Description
entities	NlpEntity	repeated	Array of found entities and their types.

AnalyzeSentimentRequest

Field	Type	Label	Description
text	string		Text for sentiment analysis.
language_code	string		The BCP-47 code of the input text's language e.g. "en-US".

AnalyzeSentimentResponse

Field	Type	Label	Description
sentiment	AnalyzeSentimentResponse.Sentiment		Sentiment of the input text. Mixed sentiment means that the text has both positive and negative parts in approximately the same amount.
score	float		Score of the sentiment, from -1.0 (negative) to +1.0 (positive). Values around zero can indicate neutral or mixed sentiment. Mixed means that the text has both positive and negative parts which cancel each other out, i.e. they have similar magnitudes.

BatchTranslationFromMultipleLanguagesRequest

The BatchTranslationFromMultipleLanguagesRequest translates multiple texts from multiple languages into one target language at once. The translations are returned in the same order as the inputs.

Field	Type	Label	Description
texts	string	repeated	Array of texts for translation.
languages_codes_from	string	repeated	The BCP-47 codes of the input texts' languages. These must be in the same order as the input texts and their count must be the same.
target_language_code	string		The BCP-47 code of the target language.
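
Because `texts` and `languages_codes_from` are parallel arrays, it can help to check their lengths on the client before sending. In this sketch, plain dicts stand in for the generated message classes, the helper names are hypothetical, and the translated strings are placeholders rather than real service output.

```python
def build_batch_request(texts, languages_codes_from, target_language_code):
    """Return a dict mirroring the BatchTranslationFromMultipleLanguagesRequest fields."""
    if len(texts) != len(languages_codes_from):
        raise ValueError(
            f"{len(texts)} texts but {len(languages_codes_from)} language codes"
        )
    return {
        "texts": list(texts),
        "languages_codes_from": list(languages_codes_from),
        "target_language_code": target_language_code,
    }


def pair_translations(request, translations):
    """Match each translation to its source text, relying on the preserved order."""
    return list(zip(request["languages_codes_from"], request["texts"], translations))


request = build_batch_request(["Bonjour", "Guten Tag"], ["fr-FR", "de-DE"], "en-US")
pairs = pair_translations(request, ["<translation 1>", "<translation 2>"])
print(pairs[1])  # ('de-DE', 'Guten Tag', '<translation 2>')
```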

BatchTranslationRequest

The BatchTranslationRequest translates multiple texts from one shared source language into one target language at once. The translations are returned in the same order as the inputs.

Field	Type	Label	Description
texts	string	repeated	Array of texts for translation.
language_code_from	string		The BCP-47 code of the input texts' language e.g. "en-US".
target_language_code	string		The BCP-47 code of the target language.

BatchTranslationResponse

Translation response with array of translations.

Field	Type	Label	Description
translations	string	repeated	Array of translations. They are in the same order as the input texts.

NlpEntity

Field	Type	Label	Description
name	string		Entity name i.e. a token/word from the input text.
type	string		Type in upper case e.g. PERSON, PLACE, etc.

PingReply

Ping reply with info about the API key.

Field	Type	Label	Description
message	string		Contains information about the API key.

PingRequest

Ping request to test API key.

SpeechRecognitionConfig

Streaming speech recognition configuration.

Field	Type	Label	Description
audio_language_code	string		The BCP-47 code of the input audio's language e.g. "en-US".
audio_encoding	SpeechRecognitionConfig.AudioEncoding		Linear PCM or OGG OPUS audio encoding. Linear PCM is uncompressed 16-bit signed little-endian samples.
phrases	string	repeated	Array of strings containing words and phrases which the speech recognition engine will prefer during recognition.
extras_json	string		Additional config parameters, such as Boost, Block and Replace words in the transcript. Stored as a JSON string.

SpeechRecognitionRequest

Streaming speech recognition request for real-time audio transcription.

Field	Type	Label	Description
config	SpeechRecognitionConfig		Configuration of the recognition. This must be the first request.
audio_content	bytes		Bytes of an audio data chunk in the chosen format (Linear PCM or OGG OPUS) with a 16kHz sampling rate, mono audio only. Chunks are sent in sequential `SpeechRecognitionRequest` requests. The config request must be sent before the first `audio_content` request. Audio data must be sent at a near real-time rate and in ~100ms chunks (3200 bytes for PCM).
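
The 3200-byte figure follows from 16000 samples per second times 2 bytes per sample times 0.1 seconds. The message ordering and chunking can be sketched as a generator. Plain dicts stand in for the generated `SpeechRecognitionRequest` class here, and `request_stream` is a hypothetical helper name.

```python
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2  # 16-bit Linear PCM
CHUNK_MS = 100
CHUNK_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 3200


def request_stream(config, pcm_audio):
    """Yield the config first, then the audio in 100 ms chunks.

    Plain dicts stand in for SpeechRecognitionRequest messages.
    """
    yield {"config": config}
    for start in range(0, len(pcm_audio), CHUNK_BYTES):
        yield {"audio_content": pcm_audio[start:start + CHUNK_BYTES]}
        # A real client would pace itself here, e.g. time.sleep(CHUNK_MS / 1000),
        # so that audio arrives at a near real-time rate.


config = {"audio_language_code": "en-US", "audio_encoding": "LINEAR_16_PCM"}
one_second = b"\x00" * (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
requests = list(request_stream(config, one_second))
print(len(requests))  # 11: one config message plus ten audio chunks
```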

SpeechRecognitionResponse

Response with audio transcription that will be sent back to client in real-time.

Field	Type	Label	Description
recognition	string		The recognized text from the input audio.
is_final	bool		Marks the recognized text as final i.e. this part will not change anymore. Once the text is final it will never be sent again and only text in the audio stream after that will be recognized from then on.
closed_caption	string		The latest stable segment to be added to the live/real-time closed-captions. Experimental and not supported for every configuration. DO NOT depend on this. May change or could be removed in the future.
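
A client typically keeps final segments and overwrites the pending text with each new partial result. The sketch below assumes that each non-final response carries the full text of the segment that is not yet final, rather than an increment. That is an assumption, and dicts stand in for `SpeechRecognitionResponse` messages.

```python
def assemble_transcript(responses):
    """Combine streamed responses into (stable_text, pending_text)."""
    final_parts = []
    pending = ""
    for response in responses:
        text = response["recognition"].strip()
        if response["is_final"]:
            final_parts.append(text)
            pending = ""  # the partial text has been superseded by the final one
        else:
            pending = text  # the latest partial replaces the previous one
    return " ".join(final_parts), pending


stream = [
    {"recognition": "hello", "is_final": False},
    {"recognition": "hello world", "is_final": False},
    {"recognition": "Hello world.", "is_final": True},
    {"recognition": "how are", "is_final": False},
]
print(assemble_transcript(stream))  # ('Hello world.', 'how are')
```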

TemplatedTtsRequest

Field	Type	Label	Description
id	int64		Client request id
template_id	string		ID of the template.
language_code	string		Language of the template
data	string	repeated	Data to be filled into the chosen template.
voice	VoiceConfig		Voice configuration. See the `VoiceConfig` message.

TemplatedTtsResponse

Field	Type	Label	Description
id	int64		Client request id
language_code	string		BCP-47 language code of the synthesized audio.
audio	bytes		Bytes of the audio file. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`.

Translation

Translation structure that contains the language's BCP-47 code and the text.

Field	Type	Label	Description
language_code	string		BCP-47 language code of the translated text.
text	string		Translated text.

TranslationRequest

Translation request that contains the text for translation, source language BCP-47 code and an array of target languages' BCP-47 codes. Supports translation into multiple languages at once. It is billed per language/characters i.e. billing is the same for one request with N target languages and N requests with one target language each.

Field	Type	Label	Description
text	string		Text for translation.
language_code_from	string		The BCP-47 code of the input text's language e.g. "en-US".
target_languages_codes	string	repeated	Array of BCP-47 codes of the translation's target languages.

TranslationResponse

Translation response with array of translations.

Field	Type	Label	Description
translations	Translation	repeated	Array of translations.

TranslationTtsRequest

Translation TTS request for translate + TTS call.

Field	Type	Label	Description
text	string		Text for translation.
language_code_from	string		The BCP-47 code of the text e.g. "en-US".
target_languages_codes	string	repeated	Array of BCP-47 languages codes into which the text should be translated.
voice	VoiceConfig		Voice configuration. See the `VoiceConfig` message.

TranslationTtsResponse

Translation TTS response with array of translations and their synthesised audio.

Field	Type	Label	Description
translations	Translation	repeated	Array of translations.
tts	Tts	repeated	Array of audio files with synthesised translated texts for multiple languages. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`.

Tts

TTS structure that contains the audio's BCP-47 language code and the synthesized text in RIFF or MP3 encoding.

Field	Type	Label	Description
language_code	string		BCP-47 language code of the synthesised audio.
audio	bytes		Bytes of the audio file. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`.

TtsRequest

TTS request with text, its BCP-47 language code, and voice configuration for speech synthesis.

Field	Type	Label	Description
text	string		Text for speech synthesis.
language_code	string		The BCP-47 code of the input text's language e.g. "en-US".
voice	VoiceConfig		Voice configuration. See the `VoiceConfig` message.

TtsResponse

TTS response with raw bytes which could be RIFF or MP3 encoded.

Field	Type	Label	Description
audio	bytes		Bytes of the audio file. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`.
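
When RIFF is requested, the header of the returned audio shows which sampling rate was actually used, which is one way to notice a fallback. This sketch assumes the RIFF payload is a standard PCM WAV file that Python's `wave` module can parse. The function names are hypothetical, and `make_wav` only fabricates a silent stand-in for a real `audio` field.

```python
import io
import wave


def riff_sample_rate(audio: bytes) -> int:
    """Read the sample rate from the WAV/RIFF header of a TTS response."""
    with wave.open(io.BytesIO(audio), "rb") as wav:
        return wav.getframerate()


def used_fallback(audio: bytes, requested_rate: int) -> bool:
    """True when the returned audio is not at the requested sampling rate."""
    return riff_sample_rate(audio) != requested_rate


def make_wav(sample_rate: int) -> bytes:
    """Build a tiny silent WAV in memory; a stand-in for a real `audio` field."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * 160)
    return buffer.getvalue()


response_audio = make_wav(16000)
print(used_fallback(response_audio, 24000))  # True: 24 kHz requested, 16 kHz returned
print(response_audio[:4])  # b'RIFF'
```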

VoiceChangeConfig

Voice change configuration.

Field	Type	Label	Description
audio_language_code	string		The BCP-47 code of the input audio's language e.g. "en-US".
voice	VoiceConfig		Voice configuration. See the `VoiceConfig` message.
audio_encoding	VoiceChangeConfig.AudioEncoding		Linear PCM or OGG OPUS audio encoding. Linear PCM is uncompressed 16-bit signed little-endian samples.
phrases	string	repeated	Array of strings containing words and/or phrases which the speech recognition engine will prefer during recognition.
extras_json	string		Additional config parameters, such as Boost, Block and Replace words in the transcript. Stored as a JSON string.

VoiceChangeConfigChange

Voice change configuration changes.

Field	Type	Label	Description
recognition_mode	VoiceChangeConfigChange.RecognitionMode		Voice change recognition mode:

VoiceChangeRequest

Voice change request for synthesizing recognized text with different voice.

Field	Type	Label	Description
config	VoiceChangeConfig		Configuration of the recognition. This must be the first request.
audio_content	bytes		Bytes of an audio data chunk in the chosen format (Linear PCM or OGG OPUS) with a 16kHz sampling rate, mono audio only. Chunks are sent in sequential `VoiceChangeRequest` requests. The config request must be sent before the first `audio_content` request. Audio data must be sent at a near real-time rate and in ~100ms chunks (3200 bytes for PCM).
config_change	VoiceChangeConfigChange		Config change for voice change

VoiceChangeResponse

Response with recognized text and synthesized audio that will be streamed back to the client.

Field	Type	Label	Description
recognition	string		The recognized text from the input audio.
tts	Tts		Audio file with recognition synthesized with a specific voice. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`.
is_final	bool		Marks the recognized text as final i.e. this part of the recognition will not change anymore. Once the text is final it will never be sent again and only text in the audio stream after that will be recognized from then on.

VoiceConfig

Configuration of the voice e.g. gender, audio encoding and audio sample rate.

The audio sampling rate will fall back to 16kHz if the specified voice is not supported in the wanted sampling rate. The gender will fall back to the other gender if the specified voice is not available in the wanted gender.

Field	Type	Label	Description
gender	VoiceConfig.Gender		Male or female voice for TTS. The default is male as per proto3 specification: For enums, the default value is the first defined enum value, which must be 0.
audio_encoding	VoiceConfig.AudioEncoding		RIFF or MP3 audio encoding. RIFF is uncompressed 16-bit signed little-endian (Linear PCM) including the WAV/RIFF header.
audio_sample_rate	int32		The synthesis' audio sample rate in hertz. Valid values are 0, 16000 & 24000 for MP3 and 0, 8000, 16000 & 24000 for RIFF/PCM. Set to ZERO to disable TTS i.e. no MP3/WAV is sent back with the spoken translation.
speech_rate	VoiceConfig.SpeechRate		The synthesis' rate of speech. Valid values are DEFAULT and FAST.
names	VoiceName	repeated	Optional list of names of the voices which should be used for specific language. Some languages support different voices, so pick one of them. If a matching voice name is not found, or no voice name is set, then we will use the default voice.
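
The valid `audio_sample_rate` values depend on `audio_encoding`, so a client-side check can catch mistakes before a request is sent. The sketch below uses the values from the table above. The helper name `validate_voice_config` and the use of strings for enum names are assumptions.

```python
VALID_SAMPLE_RATES = {
    "MP3": {0, 16000, 24000},
    "RIFF_LINEAR_16": {0, 8000, 16000, 24000},
}


def validate_voice_config(audio_encoding: str, audio_sample_rate: int) -> bool:
    """Return True when TTS is enabled, False when it is disabled (rate 0)."""
    allowed = VALID_SAMPLE_RATES.get(audio_encoding)
    if allowed is None:
        raise ValueError(f"unknown audio encoding: {audio_encoding}")
    if audio_sample_rate not in allowed:
        raise ValueError(
            f"{audio_sample_rate} Hz is not valid for {audio_encoding}; "
            f"use one of {sorted(allowed)}"
        )
    return audio_sample_rate != 0


print(validate_voice_config("RIFF_LINEAR_16", 8000))  # True
print(validate_voice_config("MP3", 0))                # False: TTS disabled
```

Note that 8000 is accepted for RIFF but rejected for MP3.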

VoiceName

Field	Type	Label	Description
language_code	string
name	string

VoiceTranslateConfig

Voice translation configuration for multiple target languages.

Field	Type	Label	Description
audio_language_code	string		The BCP-47 code of the input audio's language e.g. "en-US".
target_languages_codes	string	repeated	Array of BCP-47 codes of the target/translation languages e.g. "es-ES","it-IT". At least one code must be set.
voice	VoiceConfig		Voice configuration. See the `VoiceConfig` message.
audio_encoding	VoiceTranslateConfig.AudioEncoding		Linear PCM or OGG OPUS audio encoding. Linear PCM is uncompressed 16-bit signed little-endian samples.
phrases	string	repeated	Array of strings containing words and/or phrases which the speech recognition engine will prefer during recognition.
extras_json	string		Additional config parameters, such as Boost, Block and Replace words in the transcript/translation. Stored as a JSON string.

VoiceTranslateConfigChange

Voice translation configuration changes.

Field	Type	Label	Description
recognition_mode	VoiceTranslateConfigChange.RecognitionMode		Voice translate recognition mode:

VoiceTranslateRequest

Voice translation request for multiple target translation languages.

Field	Type	Label	Description
config	VoiceTranslateConfig		Configuration of the recognition. This must be the first request.
audio_content	bytes		Bytes of an audio data chunk in the chosen format (Linear PCM or OGG OPUS) with a 16kHz sampling rate, mono audio only. Chunks are sent in sequential `VoiceTranslateRequest` requests. The config request must be sent before the first `audio_content` request. Audio data must be sent at a near real-time rate and in ~100ms chunks (3200 bytes for PCM).
config_change	VoiceTranslateConfigChange		Config change for voice translate

VoiceTranslateResponse

Response with recognized text, translation and synthesized audio that will be streamed back to the client.

Field	Type	Label	Description
recognition	string		The recognized text from the input audio.
translations	Translation	repeated	Array of translations into multiple languages.
tts	Tts	repeated	Array of audio files with the final translated texts for multiple languages. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`. They arrive for the final recognitions i.e. stable recognized text with `is_final` == true. The audio can arrive together with the final text, or as a following message that carries just the audio data. However, if a TTS heuristic is used, audio for some parts of the non-final recognition will arrive as well. In that case, the audio sent with a final recognition does not cover the whole text, only the remaining part that has not been synthesized yet.
is_final	bool		Marks the recognized and translated texts as final i.e. this part of recognition will not change anymore. Once the text is final it will never be sent again and only text in the audio stream after that will be recognized from then on.
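
Since audio can arrive with the final text, in a later audio-only message, or in several pieces when a TTS heuristic is used, a client may want to queue the payloads per language in arrival order. In this sketch dicts stand in for `VoiceTranslateResponse` and `Tts` messages, and `collect_tts` is a hypothetical helper. Each payload is described as an audio file, so the items are kept separate for sequential playback instead of joining their bytes.

```python
from collections import defaultdict


def collect_tts(responses):
    """Group audio payloads by language, in arrival order."""
    audio_by_language = defaultdict(list)
    for response in responses:
        for tts in response.get("tts", []):
            audio_by_language[tts["language_code"]].append(tts["audio"])
    return dict(audio_by_language)


responses = [
    {"recognition": "good morning", "is_final": False,
     "tts": [{"language_code": "es-ES", "audio": b"part-1"}]},
    {"recognition": "Good morning everyone.", "is_final": True, "tts": []},
    {"recognition": "", "is_final": True,
     "tts": [{"language_code": "es-ES", "audio": b"part-2"},
             {"language_code": "it-IT", "audio": b"whole"}]},
]
queues = collect_tts(responses)
print(queues["es-ES"])  # [b'part-1', b'part-2']
```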

AnalyzeSentimentResponse.Sentiment

Sentiment.

Name	Number	Description
POSITIVE	0
NEGATIVE	1
MIXED	2
NEUTRAL	3
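
A score near zero does not by itself separate neutral from mixed text, so clients usually read the enum together with the score. The sketch below mirrors the numbers from the table in a Python `IntEnum`. Generated protobuf code would provide its own enum type, so this class and the `describe` helper are illustrative only. Because POSITIVE is 0, it is also the proto3 default value of the field.

```python
from enum import IntEnum


class Sentiment(IntEnum):
    POSITIVE = 0
    NEGATIVE = 1
    MIXED = 2
    NEUTRAL = 3


def describe(sentiment_number: int, score: float) -> str:
    """Render a response as text, using the enum name for the sentiment."""
    return f"{Sentiment(sentiment_number).name} ({score:+.2f})"


print(describe(2, 0.05))  # MIXED (+0.05)
```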

SpeechRecognitionConfig.AudioEncoding

Audio encoding, Linear PCM (Uncompressed 16-bit signed little-endian samples) or OGG OPUS.

Name	Number	Description
LINEAR_16_PCM	0
OGG_OPUS	1

VoiceChangeConfig.AudioEncoding

Audio encoding, Linear PCM (Uncompressed 16-bit signed little-endian samples) or OGG OPUS. 16kHz mono audio only.

Name	Number	Description
LINEAR_16_PCM	0
OGG_OPUS	1

VoiceChangeConfigChange.RecognitionMode

Config mode. NLP stands for natural language processing, i.e. all special modes together.

Name	Number	Description
NORMAL	0
NLP	1
DIGITS	2
EMAIL	3

VoiceConfig.AudioEncoding

Voice audio encoding RIFF or MP3.

Name	Number	Description
MP3	0
RIFF_LINEAR_16	1

VoiceConfig.Gender

Voice gender.

Name	Number	Description
MALE	0
FEMALE	1

VoiceConfig.SpeechRate

Rate of speech

Name	Number	Description
DEFAULT	0
FAST	1

VoiceTranslateConfig.AudioEncoding