Stenomatic service - The Voice SaaS.
Supports transcription, translation and speech synthesis (TTS), including direct voice-to-voice flow from one language to a different language. Multiple target languages are supported at once e.g. voice-to-voice from English into French, German and Chinese at the same time. It is billed per language i.e. billing is the same for one request with N target languages and N requests with one target language each.
We support different sets of languages (and constantly add new ones) per API call. The supported list can be obtained by calling https://api.stenomatic.com/api/v1/languages with your API key set in the x-mint-api-key HTTP header.
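A minimal sketch of building that request with Python's standard library. The helper name `build_languages_request` is hypothetical, and the response format is not described here, so the sketch stops short of parsing it.

```python
import urllib.request

LANGUAGES_URL = "https://api.stenomatic.com/api/v1/languages"


def build_languages_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the supported-languages list."""
    request = urllib.request.Request(LANGUAGES_URL, method="GET")
    # The API key travels in the x-mint-api-key HTTP header.
    request.add_header("x-mint-api-key", api_key)
    return request
```

The caller can send the request with `urllib.request.urlopen(build_languages_request("YOUR_API_KEY"))` and inspect the body.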
For API calls that send audio we support only 16kHz sampling rate, MONO only, and two encodings:
1) the raw Linear 16 PCM (uncompressed 16-bit signed little-endian samples)
2) OGG OPUS
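As an illustration of the raw Linear 16 PCM layout, the sketch below packs floating-point samples into 16-bit signed little-endian bytes and derives the duration of a 16kHz mono buffer. The helper names and the scaling by 32767 are assumptions made for the example, not part of the API.

```python
import struct

SAMPLE_RATE_HZ = 16000  # the only input sampling rate accepted
BYTES_PER_SAMPLE = 2    # 16-bit samples, one channel (mono)


def to_linear16(samples):
    """Pack float samples in [-1.0, 1.0] as 16-bit signed little-endian PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


def duration_seconds(pcm: bytes) -> float:
    """Duration of a raw 16kHz mono Linear 16 PCM buffer."""
    return len(pcm) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
```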
Every API call, message, and property has its own documentation about its format and potential fallbacks.
Synthesized voices audio:
We do support 16kHz & 24kHz sampling rates for TTS audio in MP3, and 8kHz, 16kHz & 24kHz in raw PCM/RIFF encodings. Not every voice supports the 8kHz and 24kHz sampling rates though -- read the fallbacks section below.
Fallbacks:
In some cases we fall back to different parameters than the ones set in the config request. Not every voice is available in every gender and sampling rate. If the gender is not supported, we fall back to the other gender. If a sampling rate is not supported, we fall back to 16kHz, which all voices support. We always obey the gender setting if that gender is supported in any sampling rate.
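The fallback rules can be sketched as a small client-side predictor. This is one reading of the rules above, not the service's actual implementation; `resolve_voice` and the `(gender, sample_rate)` tuples are hypothetical.

```python
def resolve_voice(available, gender, sample_rate):
    """Predict the (gender, sample_rate) used for a request.

    `available` is a set of (gender, sample_rate) pairs offered for a language.
    The requested gender wins whenever it exists at any sampling rate;
    an unsupported sampling rate falls back to 16000.
    """
    other = "FEMALE" if gender == "MALE" else "MALE"
    for candidate in (gender, other):
        rates = {r for g, r in available if g == candidate}
        if rates:
            return candidate, (sample_rate if sample_rate in rates else 16000)
    raise ValueError("no voice available for this language")
```

For example, with only a 16kHz male voice and 16kHz/24kHz female voices available, a request for a 24kHz male voice resolves to the 16kHz male voice rather than switching gender.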
Authentication:
API calls are authenticated via a header (metadata in gRPC) called x-mint-api-key. Set your API key as its value.
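With gRPC in Python, call metadata is passed as a sequence of `(key, value)` string pairs. The sketch below assembles such a list with the API key plus a few of the optional control headers described under "Request headers". The helper `build_metadata` and its parameter names are hypothetical.

```python
PROFANITY_FILTER_VALUES = ("raw", "masked", "removed")


def build_metadata(api_key, request_id=None, group_id=None,
                   allow_partial_translate=False, profanity_filter=None):
    """Return gRPC call metadata as a list of (key, value) string pairs."""
    metadata = [("x-mint-api-key", api_key)]
    if request_id is not None:
        metadata.append(("x-mint-client-request-id", request_id))
    if group_id is not None:
        metadata.append(("x-mint-client-request-group-id", group_id))
    if allow_partial_translate:
        metadata.append(("x-mint-allow-partial-translate", "true"))
    if profanity_filter is not None:
        if profanity_filter not in PROFANITY_FILTER_VALUES:
            raise ValueError("unknown profanity filter: %r" % profanity_filter)
        metadata.append(("x-mint-profanity-filter", profanity_filter))
    return metadata
```

For the two sides of a phone call, each side would use its own `request_id` and share one `group_id`, and the resulting list would be passed as the `metadata=` argument of the generated stub's method.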
Request headers:
We support several optional "control" headers that affect the API calls.
"x-mint-client-request-id" sets your own ID for the request. You can find the request in the logs via this ID later.
"x-mint-client-request-group-id" groups several different API calls into one e.g. two sides of a phone call -- every side has its own "x-mint-client-request-id" but the same "x-mint-client-request-group-id".
"x-mint-allow-partial-translate" only for the VoiceTranslate API call. If set to "true", every "partial" response with a transcription will also carry a translation of that transcription. With any other value, or when the header is missing, partial translations are not requested, although the service may still return them in some cases.
"x-mint-send-push-notifications" enables sending of push notifications for API calls via our WebSocket server. We send them via the secure WebSocket endpoint wss://api.stenomatic.com/notifications
"x-mint-allow-branch-notifications" will send push notifications (if enabled with the option above) into the client branch's specific topic instead of just the client's topic. Default is false, in which case notifications arrive in the client's topic only.
"x-mint-debug-record-audio" is for debugging purposes only. When set to "true", it records the incoming audio for the SpeechRecognition and VoiceTranslate API calls into a file. This file is saved into a Google Cloud Storage bucket and is automatically deleted after 14 days. Only selected Google Cloud project users can access/download these files.
"x-mint-profanity-filter" sets the profanity filter. Possible values are `raw`, `masked`, `removed`.
"x-mint-api-key" is the authentication header where you put your API key.
Push notifications:
The platform supports sending push notifications for the API calls' responses. We send them via our secure WebSocket endpoint (wss://api.stenomatic.com/notifications) and every "customer+client+branch+API call" combination has its own channel.
Field | Type | Label | Description |
text | string | | Text for entities analysis. |
language_code | string | | The BCP-47 code of the input text's language e.g. "en-US". |
Field | Type | Label | Description |
entities | NlpEntity | repeated | Array of found entities and their types. |
Field | Type | Label | Description |
text | string | | Text for sentiment analysis. |
language_code | string | | The BCP-47 code of the input text's language e.g. "en-US". |
Field | Type | Label | Description |
sentiment | AnalyzeSentimentResponse.Sentiment | | Sentiment of the input text. Mixed sentiment means that the text has both positive and negative parts in approximately the same amount. |
score | float | | Score of the sentiment, from -1.0 (negative) to +1.0 (positive). Values around zero can be neutral or mixed sentiment. Mixed means that the text has both positive and negative parts which cancel each other out i.e. they have similar magnitudes. |
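Because a score near zero cannot tell neutral from mixed text, a client would read the `sentiment` enum and the score together. A small sketch, mirroring the `Sentiment` enum numbers listed later in this document; the `describe` helper and its output format are hypothetical.

```python
from enum import IntEnum


class Sentiment(IntEnum):
    # Numbers mirror the Sentiment enum of AnalyzeSentimentResponse.
    POSITIVE = 0
    NEGATIVE = 1
    MIXED = 2
    NEUTRAL = 3


def describe(sentiment: int, score: float) -> str:
    """Render a response's sentiment and score as a short label."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must be within [-1.0, 1.0]")
    return f"{Sentiment(sentiment).name} ({score:+.2f})"
```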
BatchTranslationFromMultipleLanguagesRequest translates multiple texts from multiple languages into one target language at once. The translations are returned in the same order as the inputs.
Field | Type | Label | Description |
texts | string | repeated | Array of texts for translation. |
languages_codes_from | string | repeated | The BCP-47 codes of the input texts' languages. These must be in the same order as the input texts and their count must be the same. |
target_language_code | string | | The BCP-47 code of the target language. |
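Since `texts` and `languages_codes_from` must line up one-to-one, it can help to validate the payload before sending it. The sketch below uses plain dictionaries in place of the generated protobuf classes; both helper names are hypothetical.

```python
def build_batch_from_multiple_languages(texts, languages_codes_from,
                                        target_language_code):
    """Validate and assemble the request fields as a plain dict."""
    if len(texts) != len(languages_codes_from):
        raise ValueError("texts and languages_codes_from must have the same count")
    return {
        "texts": list(texts),
        "languages_codes_from": list(languages_codes_from),
        "target_language_code": target_language_code,
    }


def pair_with_inputs(texts, translations):
    """Translations come back in input order, so they can be zipped."""
    return list(zip(texts, translations))
```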
BatchTranslationRequest translates multiple texts from one shared source language into one target language at once. The translations are returned in the same order as the inputs.
Field | Type | Label | Description |
texts | string | repeated | Array of texts for translation. |
language_code_from | string | | The BCP-47 code of the input texts' language e.g. "en-US". |
target_language_code | string | | The BCP-47 code of the target language. |
Translation response with array of translations.
Field | Type | Label | Description |
translations | string | repeated | Array of translations. They are in the same order as the input texts. |
Field | Type | Label | Description |
name | string | | Entity name i.e. a token/word from the input text. |
type | string | | Type in upper case e.g. PERSON, PLACE, etc. |
Ping reply with info about the API key.
Field | Type | Label | Description |
message | string | | Contains information about the API key. |
Ping request to test API key.
Streaming speech recognition configuration.
Field | Type | Label | Description |
audio_language_code | string | | The BCP-47 code of the input audio's language e.g. "en-US". |
audio_encoding | SpeechRecognitionConfig.AudioEncoding | | Linear PCM or OGG OPUS audio encoding. Linear PCM is uncompressed 16-bit signed little-endian samples. |
phrases | string | repeated | Array of strings containing words and phrases which the speech recognition engine will prefer during recognition. |
Streaming speech recognition request for real-time audio transcription.
Field | Type | Label | Description |
config | SpeechRecognitionConfig | | Configuration of the recognition. This must be the first request. |
audio_content | bytes | | Bytes of audio data chunk in chosen format (Linear PCM or OGG OPUS) with 16kHz sampling rate, mono audio only. Chunks are sent in sequential `SpeechRecognitionRequest` requests. Config request must be sent before the first `audio_content` request. Audio data must be sent in near real-time rate and in ~100ms chunks (3200 bytes for PCM). |
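The 3200-byte figure follows from 16000 samples/s x 2 bytes x 0.1 s. A sketch of a request generator that sends the config first and then paced ~100ms PCM chunks; plain dictionaries stand in for the generated request messages, and `request_stream` is a hypothetical helper.

```python
import time

SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2  # 16-bit mono
CHUNK_MS = 100
CHUNK_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHUNK_MS // 1000


def request_stream(config, pcm, pace=True):
    """Yield the config request first, then the audio in ~100ms chunks."""
    yield {"config": config}
    for offset in range(0, len(pcm), CHUNK_BYTES):
        yield {"audio_content": pcm[offset:offset + CHUNK_BYTES]}
        if pace:
            # Approximate a near real-time sending rate.
            time.sleep(CHUNK_MS / 1000)
```

With real generated stubs, a generator of this shape would be passed to the streaming method, yielding request messages instead of dictionaries.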
Response with audio transcription that will be sent back to client in real-time.
Field | Type | Label | Description |
recognition | string | | The recognized text from the input audio. |
is_final | bool | | Marks the recognized text as final i.e. this part will not change anymore. Once the text is final it will never be sent again and only text in the audio stream after that will be recognized from then on. |
closed_caption | string | | The latest stable segment to be added to the live/real-time closed-captions. Experimental and not supported for every configuration. DO NOT depend on this. May change or could be removed in the future. |
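A client typically keeps the final segments and overwrites the pending one. The sketch below assumes each non-final response replaces the previous non-final text, which is not stated explicitly here; dictionaries stand in for the response messages and `assemble_transcript` is a hypothetical helper.

```python
def assemble_transcript(responses):
    """Return (final_segments, pending_partial) from a response stream."""
    finals = []
    partial = ""
    for response in responses:
        if response["is_final"]:
            # Final text never changes and is never sent again.
            finals.append(response["recognition"])
            partial = ""
        else:
            # Assumption: a newer partial supersedes the older one.
            partial = response["recognition"]
    return finals, partial
```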
Field | Type | Label | Description |
id | int64 | | Client request id |
template_id | string | | ID of the template. |
language_code | string | | Language of the template |
data | string | repeated | Data to be filled into the chosen template. |
voice | VoiceConfig | | Voice configuration. See the `VoiceConfig` message. |
Field | Type | Label | Description |
id | int64 | | Client request id |
language_code | string | | BCP-47 language code of the synthesized audio. |
audio | bytes | | Bytes of the audio file. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`. |
Translation structure that contains the language's BCP-47 code and the text.
Field | Type | Label | Description |
language_code | string | | BCP-47 language code of the translated text. |
text | string | | Translated text. |
Translation request that contains the text for translation, source language BCP-47 code and an array of target languages' BCP-47 codes. Supports translation into multiple languages at once. It is billed per language/characters i.e. billing is the same for one request with N target languages and N requests with one target language each.
Field | Type | Label | Description |
text | string | | Text for translation. |
language_code_from | string | | The BCP-47 code of the input text's language e.g. "en-US". |
target_languages_codes | string | repeated | Array of BCP-47 codes of the translation's target languages. |
Translation response with array of translations.
Field | Type | Label | Description |
translations | Translation | repeated | Array of translations. |
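Each `Translation` carries its own language code, so a client can index the results by language rather than rely on array position. A small sketch using dictionaries in place of the generated messages; both helper names are hypothetical.

```python
def translations_by_language(translations):
    """Map each BCP-47 language code to its translated text."""
    return {t["language_code"]: t["text"] for t in translations}


def missing_targets(requested_codes, translations):
    """List requested target languages that have no translation yet."""
    received = translations_by_language(translations)
    return [code for code in requested_codes if code not in received]
```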
Translation TTS request for translate + TTS call.
Field | Type | Label | Description |
text | string | | Text for translation. |
language_code_from | string | | The BCP-47 code of the text e.g. "en-US". |
target_languages_codes | string | repeated | Array of BCP-47 languages codes into which the text should be translated. |
voice | VoiceConfig | | Voice configuration. See the `VoiceConfig` message. |
Translation TTS response with array of translations and their synthesised audio.
Field | Type | Label | Description |
translations | Translation | repeated | Array of translations. |
tts | Tts | repeated | Array of audio files with synthesised translated texts for multiple languages. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`. |
TTS structure that contains the audio's BCP-47 language code and the synthesized text in RIFF or MP3 encoding.
Field | Type | Label | Description |
language_code | string | | BCP-47 language code of the synthesised audio. |
audio | bytes | | Bytes of the audio file. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`. |
TTS request with text, its BCP-47 language code, and voice configuration for speech synthesis.
Field | Type | Label | Description |
text | string | | Text for speech synthesis. |
language_code | string | | The BCP-47 code of the input text's language e.g. "en-US". |
voice | VoiceConfig | | Voice configuration. See the `VoiceConfig` message. |
TTS response with raw bytes which could be RIFF or MP3 encoded.
Field | Type | Label | Description |
audio | bytes | | Bytes of the audio file. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`. |
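Because the sampling rate may fall back, it can be useful to inspect the returned bytes. The sketch below applies a simple header heuristic to tell RIFF/WAV from MP3 and reads the actual sampling rate from a RIFF payload with the standard `wave` module. Both helpers are hypothetical, and the MP3 check (ID3 tag or frame sync bits) is a heuristic, not a full parser.

```python
import io
import wave


def detect_container(audio: bytes) -> str:
    """Guess whether the bytes are a RIFF/WAV file or an MP3 stream."""
    if len(audio) >= 12 and audio[:4] == b"RIFF" and audio[8:12] == b"WAVE":
        return "RIFF"
    if audio[:3] == b"ID3":
        return "MP3"
    # An MPEG audio frame starts with 11 set sync bits.
    if len(audio) >= 2 and audio[0] == 0xFF and (audio[1] & 0xE0) == 0xE0:
        return "MP3"
    return "UNKNOWN"


def read_riff(audio: bytes):
    """Return (sample_rate, channels, pcm_frames) of a RIFF/WAV payload."""
    with wave.open(io.BytesIO(audio), "rb") as wav:
        frames = wav.readframes(wav.getnframes())
        return wav.getframerate(), wav.getnchannels(), frames
```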
Voice change configuration.
Field | Type | Label | Description |
audio_language_code | string | | The BCP-47 code of the input audio's language e.g. "en-US". |
voice | VoiceConfig | | Voice configuration. See the `VoiceConfig` message. |
audio_encoding | VoiceChangeConfig.AudioEncoding | | Linear PCM or OGG OPUS audio encoding. Linear PCM is uncompressed 16-bit signed little-endian samples. |
phrases | string | repeated | Array of strings containing words and/or phrases which the speech recognition engine will prefer during recognition. |
Voice change request for synthesizing recognized text with different voice.
Field | Type | Label | Description |
config | VoiceChangeConfig | | Configuration of the recognition. This must be the first request. |
audio_content | bytes | | Bytes of audio data chunk in chosen format (Linear PCM or OGG OPUS) with 16kHz sampling rate, mono audio only. Chunks are sent in sequential `VoiceChangeRequest` requests. Config request must be sent before the first `audio_content` request. Audio data must be sent in near real-time rate and in ~100ms chunks (3200 bytes for PCM). |
Response with recognized text and synthesized audio that will be streamed back to the client.
Field | Type | Label | Description |
recognition | string | | The recognized text from the input audio. |
tts | Tts | | Audio file with recognition synthesized with a specific voice. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`. |
is_final | bool | | Marks the recognized text as final i.e. this part of recognition will not change anymore. Once the text is final it will never be sent again and only text in the audio stream after that will be recognized from then on. |
Configuration of the voice e.g. gender, audio encoding and audio sample rate.
Audio sampling rate will fallback to 16kHz if the specified voice is not supported in the wanted sampling rate. Gender will fallback to the other gender if the specified voice is not available in the wanted gender.
Field | Type | Label | Description |
gender | VoiceConfig.Gender | | Male or female voice for TTS. The default is male as per proto3 specification: For enums, the default value is the first defined enum value, which must be 0. |
audio_encoding | VoiceConfig.AudioEncoding | | RIFF or MP3 audio encoding. RIFF is uncompressed 16-bit signed little-endian (Linear PCM) including the WAV/RIFF header. |
audio_sample_rate | int32 | | The synthesis' audio sample rate in hertz. Valid values are 0, 16000 & 24000 for MP3 and 0, 8000, 16000 & 24000 for RIFF/PCM. Set to ZERO to disable TTS i.e. no MP3/WAV is sent back with the spoken translation. |
speech_rate | VoiceConfig.SpeechRate | | The synthesis' rate of speech. Valid values are DEFAULT and FAST. |
names | VoiceName | repeated | Optional list of names of the voices which should be used for specific language. Some languages support different voices, so pick one of them. If a matching voice name is not found, or no voice name is set, then we will use the default voice. |
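The valid `audio_sample_rate` values depend on the encoding, so a client-side check can catch mistakes before a request is sent. A sketch using the enum names `MP3` and `RIFF_LINEAR_16` from `VoiceConfig.AudioEncoding`; the helper `check_voice_config` is hypothetical.

```python
VALID_SAMPLE_RATES = {
    "MP3": {0, 16000, 24000},
    "RIFF_LINEAR_16": {0, 8000, 16000, 24000},
}


def check_voice_config(audio_encoding: str, audio_sample_rate: int) -> bool:
    """Validate the pair; return True when TTS audio is requested at all."""
    if audio_encoding not in VALID_SAMPLE_RATES:
        raise ValueError("unknown audio encoding: %r" % audio_encoding)
    if audio_sample_rate not in VALID_SAMPLE_RATES[audio_encoding]:
        raise ValueError(
            "%d Hz is not valid for %s" % (audio_sample_rate, audio_encoding))
    # A sample rate of zero disables TTS, so no audio is sent back.
    return audio_sample_rate != 0
```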
Field | Type | Label | Description |
language_code | string | | |
name | string | | |
Voice translation configuration for multiple target languages.
Field | Type | Label | Description |
audio_language_code | string | | The BCP-47 code of the input audio's language e.g. "en-US". |
target_languages_codes | string | repeated | Array of BCP-47 codes of the target/translation languages e.g. "es-ES","it-IT". At least one code must be set. |
voice | VoiceConfig | | Voice configuration. See the `VoiceConfig` message. |
audio_encoding | VoiceTranslateConfig.AudioEncoding | | Linear PCM or OGG OPUS audio encoding. Linear PCM is uncompressed 16-bit signed little-endian samples. |
phrases | string | repeated | Array of strings containing words and/or phrases which the speech recognition engine will prefer during recognition. |
Voice translation request for multiple target translation languages.
Field | Type | Label | Description |
config | VoiceTranslateConfig | | Configuration of the recognition. This must be the first request. |
audio_content | bytes | | Bytes of audio data chunk in chosen format (Linear PCM or OGG OPUS) with 16kHz sampling rate, mono audio only. Chunks are sent in sequential `VoiceTranslateRequest` requests. Config request must be sent before the first `audio_content` request. Audio data must be sent in near real-time rate and in ~100ms chunks (3200 bytes for PCM). |
Response with recognized text, translation and synthesized audio that will be streamed back to the client.
Field | Type | Label | Description |
recognition | string | | The recognized text from the input audio. |
translations | Translation | repeated | Array of translations into multiple languages. |
tts | Tts | repeated | Array of audio files with the final translated texts for multiple languages. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`. They will arrive for the final recognitions i.e. stable recognized text `is_final` == true. It could arrive together with the final text or it will arrive as a next message with just the audio data. BUT if a TTS heuristic is used, then audio of some parts of the non-final recognition will arrive. In such case (TTS heuristic), all the final recognitions will not have the whole audio synthesized, just the remaining not-yet-synthesized part of the text. |
is_final | bool | | Marks the recognized and translated texts as final i.e. this part of recognition will not change anymore. Once the text is final it will never be sent again and only text in the audio stream after that will be recognized from then on. |
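Since audio may arrive with the final text, in a following message, or piecewise when the TTS heuristic is active, a client can simply gather audio segments per language in arrival order. A sketch with dictionaries standing in for the response messages; `collect_tts` is a hypothetical helper. Segments are kept separate because each one may be a complete file with its own header.

```python
from collections import defaultdict


def collect_tts(responses):
    """Group received audio segments by language code, in arrival order."""
    segments = defaultdict(list)
    for response in responses:
        # `tts` may be absent or empty, e.g. when audio_sample_rate is 0.
        for tts in response.get("tts", []):
            segments[tts["language_code"]].append(tts["audio"])
    return dict(segments)
```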
Sentiment.
Name | Number | Description |
POSITIVE | 0 | |
NEGATIVE | 1 | |
MIXED | 2 | |
NEUTRAL | 3 |
Audio encoding, Linear PCM (Uncompressed 16-bit signed little-endian samples) or OGG OPUS.
Name | Number | Description |
LINEAR_16_PCM | 0 | |
OGG_OPUS | 1 |
Audio encoding, Linear PCM (Uncompressed 16-bit signed little-endian samples) or OGG OPUS. 16kHz mono audio only.
Name | Number | Description |
LINEAR_16_PCM | 0 | |
OGG_OPUS | 1 |
Voice audio encoding RIFF or MP3.
Name | Number | Description |
MP3 | 0 | |
RIFF_LINEAR_16 | 1 |
Voice gender.
Name | Number | Description |
MALE | 0 | |
FEMALE | 1 |
Rate of speech
Name | Number | Description |
DEFAULT | 0 | |
FAST | 1 |
Audio encoding, Linear PCM (Uncompressed 16-bit signed little-endian samples) or OGG OPUS. 16kHz mono audio only.
Name | Number | Description |
LINEAR_16_PCM | 0 | |
OGG_OPUS | 1 |
Stenomatic service
Method Name | Request Type | Response Type | Description |
Ping | PingRequest | PingReply | Test connection with the server and your API key. Do not use in production code! |
Translate | TranslationRequest | TranslationResponse | Translation. |
BatchTranslate | BatchTranslationRequest | BatchTranslationResponse | Batch translation of multiple texts from one language to one target language. |
BatchTranslateFromMultipleLanguages | BatchTranslationFromMultipleLanguagesRequest | BatchTranslationResponse | Batch Translation of multiple texts from multiple languages to one target language. |
Tts | TtsRequest | TtsResponse | TTS i.e. text to speech. Voice's sampling rate will fallback to 16kHz if the specified voice is not supported in the wanted sampling rate. Voice's gender will fallback to the other gender if the specified voice is not available in the wanted gender. |
TemplatedTts | TemplatedTtsRequest | TemplatedTtsResponse | Synthesizes texts based on pre-defined templates. |
TranslateAndTts | TranslationTtsRequest | TranslationTtsResponse | Translation with TTS of the translated texts. Voice's sampling rate will fallback to 16kHz if the specified voice is not supported in the wanted sampling rate. Voice's gender will fallback to the other gender if the specified voice is not available in the wanted gender. |
SpeechRecognition | SpeechRecognitionRequest stream | SpeechRecognitionResponse stream | Performs bidirectional streaming speech recognition. Stream audio to the server and receive recognized text in real-time. |
VoiceTranslate | VoiceTranslateRequest stream | VoiceTranslateResponse stream | Performs bidirectional streaming speech recognition and translation into multiple languages at once. It is billed per language/characters i.e. billing is the same for one request with N target languages and N requests with one target language each. Responses can contain audio files of the translated text. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`. Voice's sampling rate will fallback to 16kHz if the specified voice is not supported in the wanted sampling rate. Voice's gender will fallback to the other gender if the specified voice is not available in the wanted gender. |
VoiceChange | VoiceChangeRequest stream | VoiceChangeResponse stream | Performs bidirectional streaming speech recognition and synthesis into a specific voice in the same language. It is billed per second of audio. Responses can contain audio files of the recognized text. Could be a RIFF including the header or MP3. Depends on the `AudioEncoding` in `VoiceConfig`. Voice's sampling rate will fall back to 16kHz if the specified voice is not supported in the wanted sampling rate. Voice's gender will fall back to the other gender if the specified voice is not available in the wanted gender. |
AnalyzeSentiment | AnalyzeSentimentRequest | AnalyzeSentimentResponse | Performs sentiment analysis (natural language processing) on the input text. Score is in range from -1.0 to 1.0, where negative score is negative sentiment and positive score is positive sentiment. Values around zero can be neutral or mixed sentiment. Mixed means that the text has both positive and negative parts which cancel each other out i.e. they have similar magnitudes. |
AnalyzeEntities | AnalyzeEntitiesRequest | AnalyzeEntitiesResponse | Performs entities analysis (natural language processing) on the input text. |
.proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby |
double | | double | double | float | float64 | double | float | Float |
float | | float | float | float | float32 | float | float | Float |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum |
uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required) |
uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required) |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required) |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum |
sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum |
bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8) |
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT) |