Last week, Amazon announced that its automatic speech recognition (ASR) service, Amazon Transcribe, now supports WebSockets. According to Amazon, “WebSocket support opens Amazon Transcribe Streaming up to a wider audience and makes integrations easier for customers that might have existing WebSocket-based integrations or knowledge”.
Amazon Transcribe allows developers to add speech-to-text capability to their applications easily with its ASR service. Amazon announced the general availability of Amazon Transcribe in the AWS San Francisco Summit 2018. With Amazon Transcribe API, users can analyze audio files stored in Amazon S3 and have the service return a text file of the transcribed speech. Real-time transcripts from a live audio stream are also possible with the Transcribe API.
Until now, the Amazon Transcribe Streaming API has been available using HTTP/2 streaming. However, Amazon adds the new WebSockets support as another integration option for bringing real-time voice capabilities to different projects built using Transcribe.
What are WebSockets?
WebSockets are a protocol built atop TCP, similar to HTTP. HTTP is excellent for short-lived requests, however, it does not handle persistent real-time communications well. Due to this, the first Amazon Transcribe Streaming API made available uses HTTP/2 streams that solve a lot of the issues that HTTP had with real-time communications.
Amazon states, “an HTTP connection is normally closed at the end of the message, a WebSocket connection remains open”. With this advantage, messages can be sent bi-directionally with no bandwidth or latency added by handshaking and negotiating a connection.
WebSocket connections are full-duplex, which means that the server and client can both transmit data to and fro at the same time. WebSockets were also designed “for cross-domain usage, so there’s no messing around with cross-origin resource sharing (CORS) as there is with HTTP”.
Amazon Transcribe Streaming using Websockets
While using the WebSocket protocol to stream audio, Amazon Transcribe transcribes the stream in real-time. When a user encodes the audio with event stream encoding, Amazon Transcribe responds with a JSON structure, which is also encoded using event stream encoding.
Key components of a WebSocket request to Amazon Transcribe are:
- Creating a pre-signed URL to access Amazon Transcribe.
- Creating binary WebSocket frames containing event stream encoded audio data.
- Handling WebSocket frames in the response.
The different languages that Amazon Transcribe currently supports during real-time transcription include British English (en-GB), US English (en-US), French (fr-FR), Canadian French (fr-CA), and US Spanish (es-US).
To know more about WebSockets API in detail, visit Amazon’s official post.
Read Next
Understanding WebSockets and Server-sent Events in Detail
Implementing a non-blocking cross-service communication with WebClient[Tutorial]
Introducing Kweb: A Kotlin library for building rich web applications