Protocol Documentation

stenomatic.proto

Stenomatic service - The Voice SaaS.

Supports transcription, translation and speech synthesis (TTS), including direct voice-to-voice flow from one language to a different language. Multiple target languages are supported at once e.g. voice-to-voice from English into French, German and Chinese at the same time. It is billed per language i.e. billing is the same for one request with N target languages and N requests with one target language each.

We support different sets of languages per API call (and constantly add new ones). The supported list can be obtained by calling https://api.stenomatic.com/api/v1/languages with your API key set in the x-mint-api-key HTTP header.
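As a minimal sketch of the call described above, the snippet below builds (but does not send) the request with Python's standard library. The function name `build_languages_request` is hypothetical, and the shape of the response is not documented here, so parsing it is left out.

```python
import urllib.request

LANGUAGES_URL = "https://api.stenomatic.com/api/v1/languages"


def build_languages_request(api_key: str) -> urllib.request.Request:
    """Build a GET request for the supported-languages list (hypothetical helper)."""
    request = urllib.request.Request(LANGUAGES_URL, method="GET")
    # The API key travels in the x-mint-api-key HTTP header.
    request.add_header("x-mint-api-key", api_key)
    return request
```

Passing the returned object to `urllib.request.urlopen` would perform the actual call.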

For API calls that send audio, we support only a 16kHz sampling rate, mono audio, and two encodings:

1) the raw Linear 16 PCM (uncompressed 16-bit signed little-endian samples)

2) OGG OPUS
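If the source audio starts life as a WAV file, its parameters can be checked before the samples are sent as raw PCM. The following is a rough sketch using Python's `wave` module; both helper names are hypothetical, and raw PCM itself carries no header, so this check only applies to a WAV container on the client side.

```python
import io
import wave


def is_supported_pcm_wav(wav_bytes: bytes) -> bool:
    """Return True if a WAV container holds 16 kHz, mono, 16-bit PCM audio."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return (
            wav.getframerate() == 16000   # 16 kHz sampling rate
            and wav.getnchannels() == 1   # mono
            and wav.getsampwidth() == 2   # 16-bit samples
        )


def make_silent_wav(rate: int, channels: int, seconds: float = 0.1) -> bytes:
    """Create a short silent 16-bit WAV in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    return buffer.getvalue()
```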

Every API call, message, and property has its own documentation about its format and potential fallbacks.

Synthesized voices audio:

We support 16kHz and 24kHz sampling rates for TTS audio in MP3, and 8kHz, 16kHz and 24kHz in raw PCM/RIFF encodings. Not every voice supports the 8kHz and 24kHz sampling rates, though; see the Fallbacks section below.

Fallbacks:

In some cases we fall back to parameters other than the ones set in the config request. Not every voice is available in every gender and sampling rate. If the gender is not supported, we fall back to the other gender. If the sampling rate is not supported, we fall back to 16kHz, which all voices support. We always obey the gender setting if that gender is supported at any sampling rate.
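The fallback rules can be illustrated with a small client-side model. This is only one reading of the rules, not the service's actual implementation: the function `resolve_voice` and the `available` mapping are hypothetical, and the sketch assumes the sampling rate falls back to 16kHz, as stated for `VoiceConfig`.

```python
def resolve_voice(available, gender, sample_rate):
    """Pick the (gender, sample_rate) that would be used after fallbacks.

    `available` maps a gender name to the set of sampling rates offered
    for that gender, e.g. {"FEMALE": {16000, 24000}} (hypothetical shape).
    """
    other = "FEMALE" if gender == "MALE" else "MALE"
    # The gender setting wins whenever that gender exists at any rate.
    chosen_gender = gender if available.get(gender) else other
    rates = available.get(chosen_gender, set())
    # Every voice is documented to support 16 kHz, so use it as the fallback.
    chosen_rate = sample_rate if sample_rate in rates else 16000
    return chosen_gender, chosen_rate
```

For example, requesting a male 24kHz voice when the male voice exists only at 16kHz keeps the gender and lowers the rate, instead of switching to a female 24kHz voice.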

Authentication:

API calls are authenticated via a header (metadata in gRPC) called x-mint-api-key. Set your API key as its value.

Request headers:

We support several optional "control" headers that affect the API calls.

"x-mint-client-request-id" sets your own ID for the request, so that you can find the request in the logs later.

"x-mint-client-request-group-id" this will "group" several different API calls into one e.g. two sides of a phone call -- every side has its own "x-mint-client-request-id" but the same "x-mint-client-request-group-id".

"x-mint-allow-partial-translate" applies only to the VoiceTranslate API call. If set to "true", every "partial" response with a transcription also carries a translation of that transcription. With any other value, or when the header is missing, partial translations are not requested, although they may still be returned in some cases.

"x-mint-send-push-notifications" enables push notifications for API calls via our WebSocket server. They are sent via the secure WebSocket endpoint wss://api.stenomatic.com/notifications

"x-mint-allow-branch-notifications" sends push notifications (if enabled with the option above) to the client branch's specific topic instead of only to the client's topic. The default is false, in which case notifications arrive in the client's topic only.

"x-mint-debug-record-audio" is for debugging purposes only. It will record the incoming audio ("true") for the SpeechRecognition and VoiceTranslate API calls into a file. This file is saved into a Google Cloud Storage bucket and is automatically deleted after 14 days. Only selected Google Cloud project users can access/download these files.

"x-mint-profanity-filter" sets the profanity filter. Possible values are `raw`, `masked` and `removed`.

"x-mint-api-key" is the authentication header where you put your API key.
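The headers above can be assembled as a list of key/value pairs, which is the form gRPC metadata usually takes in Python. The helper below is a hypothetical sketch: its name and parameters are illustrative, and it covers only a subset of the headers.

```python
def build_metadata(api_key, request_id=None, group_id=None,
                   allow_partial_translate=False, profanity_filter=None):
    """Return gRPC-style metadata as a list of (key, value) string pairs."""
    if profanity_filter not in (None, "raw", "masked", "removed"):
        raise ValueError("profanity_filter must be raw, masked or removed")
    metadata = [("x-mint-api-key", api_key)]  # required for authentication
    if request_id is not None:
        metadata.append(("x-mint-client-request-id", request_id))
    if group_id is not None:
        metadata.append(("x-mint-client-request-group-id", group_id))
    if allow_partial_translate:
        metadata.append(("x-mint-allow-partial-translate", "true"))
    if profanity_filter is not None:
        metadata.append(("x-mint-profanity-filter", profanity_filter))
    return metadata
```

For the two sides of a phone call, each side would use its own `request_id` and both would share one `group_id`.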

Push notifications:

The platform supports sending push notifications for the API calls' responses. We send them via our secure WebSocket endpoint (wss://api.stenomatic.com/notifications) and every "customer+client+branch+API call" combination has its own channel.

AnalyzeEntitiesRequest

Field | Type | Label | Description
text string

Text for entities analysis.

language_code string

The BCP-47 code of the input text's language e.g. "en-US".

AnalyzeEntitiesResponse

Field | Type | Label | Description
entities NlpEntity repeated

Array of found entities and their types.

AnalyzeSentimentRequest

Field | Type | Label | Description
text string

Text for sentiment analysis.

language_code string

The BCP-47 code of the input text's language e.g. "en-US".

AnalyzeSentimentResponse

Field | Type | Label | Description
sentiment AnalyzeSentimentResponse.Sentiment

Sentiment of the input text. Mixed sentiment means that the text has both positive and negative parts in approximately the same amount.

score float

Score of the sentiment, from -1.0 (negative) to +1.0 (positive). Values around zero can indicate neutral or mixed sentiment. Mixed means that the text has both positive and negative parts which cancel each other out i.e. they have similar magnitudes.

BatchTranslationFromMultipleLanguagesRequest

A BatchTranslationFromMultipleLanguagesRequest translates multiple texts from multiple languages into one target language at once. The translations are returned in the same order as the inputs.

Field | Type | Label | Description
texts string repeated

Array of texts for translation.

languages_codes_from string repeated

The BCP-47 codes of the input texts' languages. These must be in the same order as the input texts and their count must be the same.

target_language_code string

The BCP-47 code of the target language.
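Because the request relies on parallel arrays and the response preserves input order, a client can check the lengths up front and pair the results afterwards. A minimal sketch, with a hypothetical helper name and plain Python lists standing in for the protobuf fields:

```python
def pair_batch_results(texts, languages_codes_from, translations):
    """Zip inputs with translations, relying on the documented ordering."""
    if len(texts) != len(languages_codes_from):
        raise ValueError("texts and languages_codes_from must have equal length")
    if len(translations) != len(texts):
        raise ValueError("expected one translation per input text")
    return [
        {"source_language": lang, "text": text, "translation": translated}
        for text, lang, translated in zip(texts, languages_codes_from, translations)
    ]
```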

BatchTranslationRequest

A BatchTranslationRequest translates multiple texts from one shared source language into one target language at once. The translations are returned in the same order as the inputs.

Field | Type | Label | Description
texts string repeated

Array of texts for translation.

language_code_from string

The BCP-47 code of the input texts' language e.g. "en-US".

target_language_code string

The BCP-47 code of the target language.

BatchTranslationResponse

Translation response with array of translations.

Field | Type | Label | Description
translations string repeated

Array of translations. They are in the same order as the input texts.

NlpEntity

Field | Type | Label | Description
name string

Entity name i.e. a token/word from the input text.

type string

Type in upper case e.g. PERSON, PLACE, etc.

PingReply

Ping reply with info about the API key.

Field | Type | Label | Description
message string

Contains information about the API key.

PingRequest

Ping request to test API key.

SpeechRecognitionConfig

Streaming speech recognition configuration.

Field | Type | Label | Description
audio_language_code string

The BCP-47 code of the input audio's language e.g. "en-US".

audio_encoding SpeechRecognitionConfig.AudioEncoding

Linear PCM or OGG OPUS audio encoding. Linear PCM is uncompressed 16-bit signed little-endian samples.

phrases string repeated

Array of strings containing words and phrases which the speech recognition engine will prefer during recognition.

SpeechRecognitionRequest

Streaming speech recognition request for real-time audio transcription.

Field | Type | Label | Description
config SpeechRecognitionConfig

Configuration of the recognition. This must be the first request.

audio_content bytes

Bytes of an audio data chunk in the chosen format (Linear PCM or OGG OPUS) with 16kHz sampling rate, mono audio only. Chunks are sent in sequential `SpeechRecognitionRequest` requests. The config request must be sent before the first `audio_content` request. Audio data must be sent at a near real-time rate and in ~100ms chunks (3200 bytes for PCM).
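The 3200-byte figure follows from the audio format: 16000 samples per second, 2 bytes per sample, one channel, and 0.1 seconds per chunk. The sketch below reproduces the arithmetic and splits a raw PCM buffer accordingly; the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 16000   # only supported input rate
BYTES_PER_SAMPLE = 2     # 16-bit signed little-endian
CHANNELS = 1             # mono


def chunk_size_bytes(chunk_ms: int = 100) -> int:
    """Bytes of raw PCM that cover `chunk_ms` milliseconds of audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000


def split_pcm(pcm: bytes, chunk_ms: int = 100):
    """Yield consecutive chunks; the last one may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

When streaming a pre-recorded buffer, pausing for roughly the chunk duration between sends is one way to approximate the near real-time rate; that pacing strategy is an assumption, not something the service prescribes here.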

SpeechRecognitionResponse

Response with audio transcription that will be sent back to client in real-time.

Field | Type | Label | Description
recognition string

The recognized text from the input audio.

is_final bool

Marks the recognized text as final i.e. this part will not change anymore. Once the text is final it will never be sent again and only text in the audio stream after that will be recognized from then on.

closed_caption string

The latest stable segment to be added to the live/real-time closed-captions. Experimental and not supported for every configuration. DO NOT depend on this. May change or could be removed in the future.
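One way to consume the `recognition` and `is_final` fields is to keep finalized segments and overwrite the current partial with each newer one. The sketch below assumes responses are available as `(recognition, is_final)` pairs and that segments are joined with a space; both are illustrative choices, and the helper name is hypothetical.

```python
def assemble_transcript(responses):
    """Build a running transcript from (recognition, is_final) pairs.

    Returns (final_text, pending_text): final segments never change,
    while the pending segment is replaced by each newer partial result.
    """
    final_segments = []
    pending = ""
    for recognition, is_final in responses:
        if is_final:
            final_segments.append(recognition)
            pending = ""            # the partial has been finalized
        else:
            pending = recognition   # newer partial supersedes the old one
    return " ".join(final_segments), pending
```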

TemplatedTtsRequest

Field | Type | Label | Description
id int64

Client request id

template_id string

ID of the template.

language_code string

Language of the template

data string repeated

Data to be filled into the chosen template.

voice VoiceConfig

Voice configuration. See the `VoiceConfig` message.

TemplatedTtsResponse

Field | Type | Label | Description
id int64

Client request id

language_code string

BCP-47 language code of the synthesized audio.

audio bytes

Bytes of the audio file. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`.

Translation

Translation structure that contains the language's BCP-47 code and the text.

Field | Type | Label | Description
language_code string

BCP-47 language code of the translated text.

text string

Translated text.

TranslationRequest

Translation request that contains the text for translation, source language BCP-47 code and an array of target languages' BCP-47 codes. Supports translation into multiple languages at once. It is billed per language/characters i.e. billing is the same for one request with N target languages and N requests with one target language each.

Field | Type | Label | Description
text string

Text for translation.

language_code_from string

The BCP-47 code of the input text's language e.g. "en-US".

target_languages_codes string repeated

Array of BCP-47 codes of the translation's target languages.

TranslationResponse

Translation response with array of translations.

Field | Type | Label | Description
translations Translation repeated

Array of translations.

TranslationTtsRequest

Translation TTS request for translate + TTS call.

Field | Type | Label | Description
text string

Text for translation.

language_code_from string

The BCP-47 code of the text e.g. "en-US".

target_languages_codes string repeated

Array of BCP-47 languages codes into which the text should be translated.

voice VoiceConfig

Voice configuration. See the `VoiceConfig` message.

TranslationTtsResponse

Translation TTS response with array of translations and their synthesized audio.

Field | Type | Label | Description
translations Translation repeated

Array of translations.

tts Tts repeated

Array of audio files with synthesized translated texts for multiple languages. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`.

Tts

TTS structure that contains the audio's BCP-47 language code and the synthesized text in RIFF or MP3 encoding.

Field | Type | Label | Description
language_code string

BCP-47 language code of the synthesized audio.

audio bytes

Bytes of the audio file. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`.

TtsRequest

TTS request with text, its BCP-47 language code, and voice configuration for speech synthesis.

Field | Type | Label | Description
text string

Text for speech synthesis.

language_code string

The BCP-47 code of the input text's language e.g. "en-US".

voice VoiceConfig

Voice configuration. See the `VoiceConfig` message.

TtsResponse

TTS response with raw bytes which could be RIFF or MP3 encoded.

Field | Type | Label | Description
audio bytes

Bytes of the audio file. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`.
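The encoding is known from the request's `VoiceConfig`, but a client that stores or forwards the bytes may still want a sanity check. The sketch below inspects the leading bytes: a WAV file starts with `RIFF` and carries `WAVE` at offset 8, while MP3 data typically starts with an `ID3` tag or an MPEG frame sync. The helper name is hypothetical and the check is a heuristic only.

```python
def sniff_audio_format(data: bytes) -> str:
    """Guess whether `data` is a RIFF/WAV file or an MP3 stream."""
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "riff"
    if data[:3] == b"ID3":
        return "mp3"   # MP3 file starting with an ID3v2 tag
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"   # MPEG audio frame sync: eleven set bits
    return "unknown"
```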

VoiceChangeConfig

Voice change configuration.

Field | Type | Label | Description
audio_language_code string

The BCP-47 code of the input audio's language e.g. "en-US".

voice VoiceConfig

Voice configuration. See the `VoiceConfig` message.

audio_encoding VoiceChangeConfig.AudioEncoding

Linear PCM or OGG OPUS audio encoding. Linear PCM is uncompressed 16-bit signed little-endian samples.

phrases string repeated

Array of strings containing words and/or phrases which the speech recognition engine will prefer during recognition.

VoiceChangeRequest

Voice change request for synthesizing recognized text with different voice.

Field | Type | Label | Description
config VoiceChangeConfig

Configuration of the recognition. This must be the first request.

audio_content bytes

Bytes of an audio data chunk in the chosen format (Linear PCM or OGG OPUS) with 16kHz sampling rate, mono audio only. Chunks are sent in sequential `VoiceChangeRequest` requests. The config request must be sent before the first `audio_content` request. Audio data must be sent at a near real-time rate and in ~100ms chunks (3200 bytes for PCM).

VoiceChangeResponse

Response with recognized text and synthesized audio that will be streamed back to the client.

Field | Type | Label | Description
recognition string

The recognized text from the input audio.

tts Tts

Audio file with recognition synthesized with a specific voice. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`.

is_final bool

Marks the recognized text as final i.e. this part of recognition will not change anymore. Once the text is final it will never be sent again and only text in the audio stream after that will be recognized from then on.

VoiceConfig

Configuration of the voice e.g. gender, audio encoding and audio sample rate.

The audio sampling rate falls back to 16kHz if the specified voice is not supported at the requested sampling rate. The gender falls back to the other gender if the specified voice is not available in the requested gender.

Field | Type | Label | Description
gender VoiceConfig.Gender

Male or female voice for TTS. The default is male as per proto3 specification: For enums, the default value is the first defined enum value, which must be 0.

audio_encoding VoiceConfig.AudioEncoding

RIFF or MP3 audio encoding. RIFF is uncompressed 16-bit signed little-endian (Linear PCM) including the WAV/RIFF header.

audio_sample_rate int32

The synthesis' audio sample rate in hertz. Valid values are 0, 16000 & 24000 for MP3 and 0, 8000, 16000 & 24000 for RIFF/PCM. Set to ZERO to disable TTS i.e. no MP3/WAV is sent back with the spoken translation.

speech_rate VoiceConfig.SpeechRate

The synthesis' rate of speech. Valid values are DEFAULT and FAST.

names VoiceName repeated

Optional list of names of the voices which should be used for specific language. Some languages support different voices, so pick one of them. If a matching voice name is not found, or no voice name is set, then we will use the default voice.
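A practical consequence of the proto3 defaults is that an unset `VoiceConfig` means a male MP3 voice with `audio_sample_rate` 0, i.e. TTS disabled. The sketch below models this with a plain dataclass instead of the generated protobuf class; `VoiceConfigSketch` and `VALID_RATES` are hypothetical names, while the enum values and valid rates are taken from this document.

```python
from dataclasses import dataclass
from enum import IntEnum


class Gender(IntEnum):
    MALE = 0
    FEMALE = 1


class AudioEncoding(IntEnum):
    MP3 = 0
    RIFF_LINEAR_16 = 1


VALID_RATES = {
    AudioEncoding.MP3: {0, 16000, 24000},
    AudioEncoding.RIFF_LINEAR_16: {0, 8000, 16000, 24000},
}


@dataclass
class VoiceConfigSketch:
    # proto3 defaults: enums take their zero value, int32 defaults to 0.
    gender: Gender = Gender.MALE
    audio_encoding: AudioEncoding = AudioEncoding.MP3
    audio_sample_rate: int = 0   # 0 disables TTS

    def validate(self) -> None:
        """Raise ValueError if the rate is not valid for the encoding."""
        if self.audio_sample_rate not in VALID_RATES[self.audio_encoding]:
            raise ValueError(
                f"{self.audio_sample_rate} Hz is not valid for "
                f"{self.audio_encoding.name}"
            )

    @property
    def tts_enabled(self) -> bool:
        return self.audio_sample_rate != 0
```

Clients that want audio back therefore need to set `audio_sample_rate` explicitly.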

VoiceName

Field | Type | Label | Description
language_code string

name string

VoiceTranslateConfig

Voice translation configuration for multiple target languages.

Field | Type | Label | Description
audio_language_code string

The BCP-47 code of the input audio's language e.g. "en-US".

target_languages_codes string repeated

Array of BCP-47 codes of the target/translation languages e.g. "es-ES","it-IT". At least one code must be set.

voice VoiceConfig

Voice configuration. See the `VoiceConfig` message.

audio_encoding VoiceTranslateConfig.AudioEncoding

Linear PCM or OGG OPUS audio encoding. Linear PCM is uncompressed 16-bit signed little-endian samples.

phrases string repeated

Array of strings containing words and/or phrases which the speech recognition engine will prefer during recognition.

VoiceTranslateRequest

Voice translation request for multiple target translation languages.

Field | Type | Label | Description
config VoiceTranslateConfig

Configuration of the recognition. This must be the first request.

audio_content bytes

Bytes of an audio data chunk in the chosen format (Linear PCM or OGG OPUS) with 16kHz sampling rate, mono audio only. Chunks are sent in sequential `VoiceTranslateRequest` requests. The config request must be sent before the first `audio_content` request. Audio data must be sent at a near real-time rate and in ~100ms chunks (3200 bytes for PCM).
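The required ordering (one config request, then audio chunks) maps naturally onto a generator that feeds the request stream. In the sketch below, plain dictionaries stand in for the generated `VoiceTranslateRequest` messages, and the function name is hypothetical.

```python
def voice_translate_requests(config: dict, audio_chunks):
    """Yield request payloads in the documented order: config, then audio."""
    if not config.get("target_languages_codes"):
        raise ValueError("at least one target language code must be set")
    # The first request carries only the configuration.
    yield {"config": config}
    # Every following request carries one chunk of audio.
    for chunk in audio_chunks:
        yield {"audio_content": chunk}
```

Because this is a generator, the validation error surfaces when the first item is requested rather than when the function is called.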

VoiceTranslateResponse

Response with recognized text, translation and synthesized audio that will be streamed back to the client.

Field | Type | Label | Description
recognition string

The recognized text from the input audio.

translations Translation repeated

Array of translations into multiple languages.

tts Tts repeated

Array of audio files with the synthesized final translated texts for multiple languages. Each is a RIFF file (including the header) or MP3, depending on the `AudioEncoding` in `VoiceConfig`. Audio normally arrives for final recognitions only, i.e. when the recognized text is stable and `is_final` == true, either together with the final text or in a following message that carries just the audio data. If a TTS heuristic is used, however, audio for some parts of a non-final recognition may arrive early. In that case, the final recognition carries audio only for the remaining, not-yet-synthesized part of the text rather than for the whole text.

is_final bool

Marks the recognized and translated texts as final i.e. this part of recognition will not change anymore. Once the text is final it will never be sent again and only text in the audio stream after that will be recognized from then on.
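Since audio may arrive with the final text, in a later audio-only message, or in several parts, a client can simply collect segments per language in arrival order. In this sketch, dictionaries mirror the `VoiceTranslateResponse` and `Tts` fields, and the helper name is hypothetical.

```python
from collections import defaultdict


def collect_tts_segments(responses):
    """Group audio segments by language in arrival order.

    Each response is a dict with an optional "tts" list whose items have
    "language_code" and "audio" keys, mirroring the `Tts` message.
    """
    segments = defaultdict(list)
    for response in responses:
        for tts in response.get("tts", []):
            if tts["audio"]:   # skip empty payloads
                segments[tts["language_code"]].append(tts["audio"])
    return dict(segments)
```

The segments are kept as separate items because each one is described as its own audio file (a RIFF with a header, or MP3), so playing them in sequence is safer than concatenating the raw bytes.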

AnalyzeSentimentResponse.Sentiment

Sentiment.

Name | Number | Description
POSITIVE 0

NEGATIVE 1

MIXED 2

NEUTRAL 3

SpeechRecognitionConfig.AudioEncoding

Audio encoding, Linear PCM (Uncompressed 16-bit signed little-endian samples) or OGG OPUS.

Name | Number | Description
LINEAR_16_PCM 0

OGG_OPUS 1

VoiceChangeConfig.AudioEncoding

Audio encoding, Linear PCM (Uncompressed 16-bit signed little-endian samples) or OGG OPUS. 16kHz mono audio only.

Name | Number | Description
LINEAR_16_PCM 0

OGG_OPUS 1

VoiceConfig.AudioEncoding

Voice audio encoding RIFF or MP3.

Name | Number | Description
MP3 0

RIFF_LINEAR_16 1

VoiceConfig.Gender

Voice gender.

Name | Number | Description
MALE 0

FEMALE 1

VoiceConfig.SpeechRate

Rate of speech

Name | Number | Description
DEFAULT 0

FAST 1

VoiceTranslateConfig.AudioEncoding

Audio encoding, Linear PCM (Uncompressed 16-bit signed little-endian samples) or OGG OPUS. 16kHz mono audio only.

Name | Number | Description
LINEAR_16_PCM 0

OGG_OPUS 1

Stenomatic

Stenomatic service

Method Name | Request Type | Response Type | Description
Ping PingRequest PingReply

Test connection with the server and your API key. Do not use in production code!

Translate TranslationRequest TranslationResponse

Translation.

BatchTranslate BatchTranslationRequest BatchTranslationResponse

Batch translation of multiple texts from one language to one target language.

BatchTranslateFromMultipleLanguages BatchTranslationFromMultipleLanguagesRequest BatchTranslationResponse

Batch Translation of multiple texts from multiple languages to one target language.

Tts TtsRequest TtsResponse

TTS i.e. text to speech. The voice's sampling rate falls back to 16kHz if the specified voice is not supported at the requested sampling rate. The voice's gender falls back to the other gender if the specified voice is not available in the requested gender.

TemplatedTts TemplatedTtsRequest TemplatedTtsResponse

Synthesizes texts based on pre-defined templates.

TranslateAndTts TranslationTtsRequest TranslationTtsResponse

Translation with TTS of the translated texts. The voice's sampling rate falls back to 16kHz if the specified voice is not supported at the requested sampling rate. The voice's gender falls back to the other gender if the specified voice is not available in the requested gender.

SpeechRecognition SpeechRecognitionRequest stream SpeechRecognitionResponse stream

Performs bidirectional streaming speech recognition. Stream audio to the server and receive recognized text in real-time.

VoiceTranslate VoiceTranslateRequest stream VoiceTranslateResponse stream

Performs bidirectional streaming speech recognition and translation into multiple languages at once. It is billed per language/characters i.e. billing is the same for one request with N target languages and N requests with one target language each. Responses can contain audio files of the translated text, either RIFF (including the header) or MP3, depending on the `AudioEncoding` in `VoiceConfig`. The voice's sampling rate falls back to 16kHz if the specified voice is not supported at the requested sampling rate. The voice's gender falls back to the other gender if the specified voice is not available in the requested gender.

VoiceChange VoiceChangeRequest stream VoiceChangeResponse stream

Performs bidirectional streaming speech recognition and synthesis into a specific voice in the same language. It is billed by audio length, at per-second granularity. Responses can contain audio files of the recognized text, either RIFF (including the header) or MP3, depending on the `AudioEncoding` in `VoiceConfig`. The voice's sampling rate falls back to 16kHz if the specified voice is not supported at the requested sampling rate. The voice's gender falls back to the other gender if the specified voice is not available in the requested gender.

AnalyzeSentiment AnalyzeSentimentRequest AnalyzeSentimentResponse

Performs sentiment analysis (natural language processing) on the input text. The score ranges from -1.0 to 1.0, where a negative score is negative sentiment and a positive score is positive sentiment. Values around zero can indicate neutral or mixed sentiment. Mixed means that the text has both positive and negative parts which cancel each other out i.e. they have similar magnitudes.

AnalyzeEntities AnalyzeEntitiesRequest AnalyzeEntitiesResponse

Performs entities analysis (natural language processing) on the input text.

Scalar Value Types

.proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby
double | | double | double | float | float64 | double | float | Float
float | | float | float | float | float32 | float | float | Float
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum
uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required)
uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required)
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required)
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum
sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum
bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8)
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT)