WebRTC in FreeSWITCH

In this article by Anthony Minessale and Giovanni Maruzzelli, authors of Mastering FreeSWITCH, we will cover the following topics:

What WebRTC is and how it works
Encryption and NAT traversal (STUN, TURN, etc.)
Signaling and media
Interconnection with PSTN and SIP networks
FreeSWITCH as a WebRTC server, gateway, and application server
SIP signaling clients with JavaScript (SIP.js)
Verto signaling clients with JavaScript (mod_verto, verto.js)

WebRTC

Finally something new! How refreshing it is to be learning and experimenting again, especially if you’re an old hand! After at least ten years of linear evolution, here we are with a quantum leap, the black swan that truly disrupts the communication sector.

Browsers are already out there, waiting

With an installed base of hundreds of millions, and soon to be in the billions ballpark, browsers (both on PCs and on smartphones) are now complete communication terminals, audio/video endpoints that do not need any additional software, plugins, hardware, or whatever. Browsers now incorporate, by default and in a standard way, all the software needed to interact with loudspeakers, microphones, headsets, cameras, screens, etc.

Browsers are the new endpoints, the CPEs, the phones. They have an API, they’re updated automatically, and are compatible with your system. You don’t have to procure, configure, support, or upgrade them. They’re ready for your new service; they just work, and are waiting for your business.

Web Real-Time Communication is coming

There are two completely separate flows in communication: signaling and media. Signaling is a flow of information that defines who is calling whom, taking what paths, and which technology is used to transmit which content. Media is the actual digitized content of the communication, for example, audio, video, screen-sharing, etc.

Media and signaling often take completely unrelated paths to go from caller to callee, for example, their IP packets traverse different gateways and routers. Also, the two flows are managed by separate software (or by different parts of the same application) using different protocols.

WebRTC defines how a browser accesses its own media capture, how it sends and receives media from a peer through the network and how it renders the media stream that it receives. It represents this using the same Session Description Protocol (SDP) as SIP does.

So, WebRTC is all about media, and doesn’t prescribe a signaling system. This is a design decision, embedded in the standard definition. Popular signaling systems include SIP, XMPP, and proprietary or custom protocols. Also, WebRTC is all about encryption. All WebRTC media streams are mandatorily encrypted.
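
To make the SDP description mentioned above a bit more concrete, here is a minimal sketch that splits an SDP text into its media sections and lists the codecs each one announces. The function name listMedia and the sample SDP are invented for illustration; a real browser offer contains many more attributes.

```javascript
// Split an SDP text into its media sections ("m=" lines) and collect the
// codecs announced by each section's a=rtpmap attributes.
function listMedia(sdp) {
  const sections = [];
  let current = null;
  for (const line of sdp.split(/\r?\n/)) {
    if (line.startsWith("m=")) {
      // m=<kind> <port> <profile> <payload types...>
      const [kind, port, profile] = line.slice(2).split(" ");
      current = { kind, port: Number(port), profile, codecs: [] };
      sections.push(current);
    } else if (current && line.startsWith("a=rtpmap:")) {
      // a=rtpmap:<payload type> <codec>/<clock rate>[/<channels>]
      const codec = line.split(" ")[1].split("/")[0];
      current.codecs.push(codec);
    }
  }
  return sections;
}

// Hand-written, heavily shortened sample (an assumption, not a captured offer).
const sampleSdp = [
  "v=0",
  "o=- 46117317 2 IN IP4 127.0.0.1",
  "s=-",
  "t=0 0",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111 0",
  "a=rtpmap:111 opus/48000/2",
  "a=rtpmap:0 PCMU/8000",
  "m=video 9 UDP/TLS/RTP/SAVPF 96",
  "a=rtpmap:96 VP8/90000",
].join("\r\n");

console.log(listMedia(sampleSdp));
```

The same text format is what a SIP endpoint would exchange, which is why a gateway can work on both sides with one SDP parser.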

Chrome, Firefox, and Opera (together they account for more than 70 percent of the browsers in use) already implement the standard; Edge is announcing the first steps in supporting WebRTC basic features, while only Safari is still holding its cards (Skype and FaceTime on WebRTC with proprietary signaling? Wink wink).

Under the hood

More or less, WebRTC works like this:

Browser connects to a web server and loads a webpage with some JavaScript in it
JavaScript in the webpage takes control of browser’s media interfaces (microphone, camera, speakers, and so on), resulting in an API media object
The WebRTC API media object will contain the capabilities of all devices and codecs available (for example, definition, sample rate, and so on), and it will let the user choose their own capability preferences (for example, use QVGA video to minimize CPU and bandwidth)
The webpage will interact with the browser’s user, getting some input for signing in to the web server’s communication service (if any)
JavaScript will use whatever signaling method (SIP, XMPP, proprietary, custom) over a secure websocket (wss://) for signing in to the communication service, finding peers, and originating and receiving calls
Once signed in to the service, a call can be made and received. Signaling will give the protocol address of the peer (for example, sip:gmaruzz@opentelecom.it)
These points are represented in the following image:
Now is the moment to find out actual IP addresses. JavaScript will generate a WebRTC API object for finding its own IP addresses, transports and ports (ICE candidates) to be offered to the peer for exchanging media (the JavaScript WebRTC API will use ICE, STUN, and TURN, and will send to the peer its own local LAN address, its own public IP address, and maybe the IP address of a TURN server it can use)
Then, the WebRTC Net API will exchange ICE candidates with the peer, until they both find the most “rational” triplets of IP address, port and transport (UDP, TCP, and so on) for each stream (for example, audio, video, screen share, and so on)
Once they get the best addresses, the signaling will establish the call.
These points are represented in the following image:
Once signaling communication with the peer is established, media capabilities are exchanged in SDP format (exactly as in SIP), and the two peers agree on media formats (sample rates, codecs, and so on)
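When media formats are agreed, the JavaScript WebRTC Transport API will send media as SRTP, keyed via DTLS, over the addresses selected by ICE; data channels travel over the same DTLS-protected transport. Secure websockets (wss://) carry only the signaling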
JavaScript WebRTC Media API will be used to render the media streams received (for example, render video, play sound, capture microphone, and so on)
In addition to media, or instead of it, peers can establish one or more data channels, through which they bidirectionally exchange raw or structured data (file transfers, augmented reality, stock tickers, and so on)
At hangup, signaling will tear down the call, and JavaScript WebRTC Media API will be used to shut down streams and renderings
These points are represented in the following image:

This is a high level, but complete, view of how a WebRTC system works.
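
The caller side of the steps above can be sketched in a few lines. This is only an outline under stated assumptions: the signaling object (with send and waitForAnswer methods) is hypothetical, since WebRTC does not define signaling, and the STUN URL is a placeholder. The browser objects are passed in as parameters so the function can also be exercised outside a browser with test doubles.

```javascript
// Caller-side sketch. Dependencies are injected:
//   mediaDevices   -> navigator.mediaDevices in a browser
//   PeerConnection -> RTCPeerConnection in a browser
//   signaling      -> hypothetical object with send(msg) and waitForAnswer()
async function placeCall({ mediaDevices, PeerConnection, signaling }, peerAddress) {
  // Take control of microphone and camera.
  const stream = await mediaDevices.getUserMedia({ audio: true, video: true });

  // Create the peer connection; the STUN server URL is a placeholder.
  const pc = new PeerConnection({ iceServers: [{ urls: "stun:stun.example.org" }] });
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));

  // Forward each ICE candidate to the peer through the signaling channel.
  pc.onicecandidate = (event) => {
    if (event.candidate) {
      signaling.send({ type: "candidate", to: peerAddress, candidate: event.candidate });
    }
  };

  // Describe our media in SDP and offer it to the peer.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signaling.send({ type: "offer", to: peerAddress, sdp: offer.sdp });

  // Apply the peer's SDP answer; media then flows outside the signaling path.
  const answer = await signaling.waitForAnswer();
  await pc.setRemoteDescription(answer);
  return pc;
}
```

In a browser, the call would look like placeCall({ mediaDevices: navigator.mediaDevices, PeerConnection: RTCPeerConnection, signaling }, "sip:gmaruzz@opentelecom.it"), where signaling is whatever client (SIP, XMPP, Verto, custom) the page uses.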

Encryption – security

Please note that in normal operation everything is encrypted, uses real PKI certificates from real Certification Authorities, actual DNS names, SSL, TLS, HTTPS, WSS, DTLS-SRTP. This is how it is supposed to work. In WebRTC, security is not an afterthought: It is mandatory.

To make signaling work without encryption (for example, for debugging signaling protocols) is not so easy, but it is possible. Browsers will often raise security exceptions, and will ask for permission each time they access a camera or microphone. Some hiccups will happen, but it is doable. Signaling is not part of WebRTC standard, as you know.

On the contrary, it is not possible to have the media or data streams leave the browser in the clear, without encryption.

The use of plain RTP to transmit media is explicitly forbidden by the standard. Media is transmitted by SRTP (Secure RTP), where encryption keys are pre-exchanged via DTLS (Datagram Transport Layer Security, a version of TLS for Datagrams), basically a secure version of UDP.
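
One way to see this rule at work is to look at the SDP itself: an encrypted audio or video section uses a DTLS-SRTP transport profile and the description carries a certificate fingerprint. The following check is a rough sketch, not a validator; the function name usesDtlsSrtp and both sample descriptions are made up for illustration, and the fingerprint value is a placeholder.

```javascript
// Return true when every audio/video section uses a DTLS-SRTP profile and
// the SDP carries a DTLS certificate fingerprint.
function usesDtlsSrtp(sdpText) {
  const lines = sdpText.split(/\r?\n/);
  const avLines = lines.filter((l) => l.startsWith("m=audio") || l.startsWith("m=video"));
  const hasFingerprint = lines.some((l) => l.startsWith("a=fingerprint:"));
  const allSecure = avLines.every((l) => {
    const profile = l.split(" ")[2];
    return profile === "UDP/TLS/RTP/SAVPF" || profile === "UDP/TLS/RTP/SAVP";
  });
  return avLines.length > 0 && hasFingerprint && allSecure;
}

// WebRTC-style description (shortened, placeholder fingerprint).
const webrtcStyle = [
  "v=0",
  "a=fingerprint:sha-256 AB:CD:EF",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111",
].join("\r\n");

// Legacy VoIP-style description with plain RTP.
const plainRtpStyle = ["v=0", "m=audio 49170 RTP/AVP 0"].join("\r\n");

console.log(usesDtlsSrtp(webrtcStyle), usesDtlsSrtp(plainRtpStyle));
```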

Beyond peer to peer – WebRTC to communication networks and services

WebRTC is a technique for browsers to send media to each other via Internet, peer to peer, perhaps with the help of a relay server (TURN), if they can’t reach each other directly. That’s it.

No directories, no means to find another person, and no way to “call” that person even if we know “where” to call her.

No way to transfer calls, to react to a busy user or to a user that does not pick up, and so on.

Let’s say WebRTC is a half-built phone: it has the handset, complete with working microphone and speaker, with the wiring coming out of it left loose. You can cross join that wiring with the wiring of another half-built phone, and they can talk to each other.

Then, if you want to talk to another device, you must find it and then join the wires anew.

No dial pad, no Telecom Central Office, no interconnection between Local Carriers or with International Carriers. No PBX. No way to call your grandma, and no possibility of navigating the IVR at Federal Express’ Customer Care.

We need to integrate the media capabilities and the ubiquity of WebRTC with the world of telecommunication services that constitute the planet’s nervous system.

Enter the “WebRTC Gateway” and the “WebRTC Application Server”; in our case both are embodied by FreeSWITCH.

WebRTC gateways and application servers

The problem to be solved is this: we can implement some kind of signaling plane, even a complete SIP signaling stack in JavaScript (there are some very good open source ones, as we’ll see later), but at both the network plane and the media plane WebRTC is only “kind of” compatible with the existing telecommunication world. It uses techniques and concepts that are “similar”, and protocols that are mostly an “evolution” of those implemented in usual Voice over IP.

At the network plane, WebRTC uses the ICE protocol to traverse NAT via STUN and TURN servers. ICE has been developed as an Internet standard to be the ultimate tool to solve all NAT problems, but it has not yet been implemented in telco infrastructure, nor in most VoIP clients. Also, ICE candidates (the various addresses at which the browser thinks it could be reached) need to be passed in SDP and negotiated between peers, in the same way codecs are negotiated. Being able to pass through corporate firewalls (UDP blocked, TCP open only on ports 80 and 443, and perhaps through protocol-aware proxies) is an absolute necessity for serious WebRTC deployment.
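
As an illustration of what those candidates look like when they travel in SDP, the sketch below parses a few a=candidate lines and orders them by priority. The three sample lines use documentation-range addresses and are invented for this example; the function name parseCandidate is likewise hypothetical. The candidate types map onto the addresses mentioned earlier: host is the local LAN address, srflx is the public address discovered via STUN, and relay is an address on a TURN server.

```javascript
// Parse one ICE candidate attribute:
// candidate:<foundation> <component> <transport> <priority> <address> <port> typ <type> ...
function parseCandidate(line) {
  const fields = line.replace(/^a=/, "").replace(/^candidate:/, "").split(" ");
  const [foundation, component, transport, priority, address, port, typKeyword, type] = fields;
  if (typKeyword !== "typ") {
    throw new Error("not an ICE candidate line: " + line);
  }
  return {
    foundation,
    component: Number(component),
    transport: transport.toLowerCase(),
    priority: Number(priority),
    address,
    port: Number(port),
    type,
  };
}

// Invented sample candidates (addresses from documentation ranges).
const candidateLines = [
  "a=candidate:3 1 udp 41885439 198.51.100.20 3478 typ relay raddr 203.0.113.7 rport 54321",
  "a=candidate:1 1 udp 2122260223 192.168.1.10 54321 typ host",
  "a=candidate:2 1 udp 1686052607 203.0.113.7 54321 typ srflx raddr 192.168.1.10 rport 54321",
];

// Higher priority values are preferred, so sort in descending order.
const rankedCandidates = candidateLines
  .map(parseCandidate)
  .sort((x, y) => y.priority - x.priority);

console.log(rankedCandidates.map((c) => c.type + " " + c.address + ":" + c.port));
```

A gateway has to read, select among, and answer with lines of this shape on behalf of legacy endpoints that know nothing about ICE.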

At the media plane, WebRTC-specific codecs (VP8 for video and Opus for audio) are incompatible with the telco world, with G.711 audio as the only common denominator.
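
The “common denominator” idea can be shown with a tiny negotiation sketch: keep the offered codecs that the other side also supports, in the offerer’s order of preference. The two codec lists are assumptions chosen for the example (the telco side is given G729 purely as an illustration), and commonCodecs is a hypothetical helper, not part of any library.

```javascript
// Keep the offered codecs that the other side supports, preserving the
// offerer's preference order. Codec names are compared case-insensitively.
function commonCodecs(offered, supported) {
  const supportedSet = new Set(supported.map((c) => c.toLowerCase()));
  return offered.filter((c) => supportedSet.has(c.toLowerCase()));
}

const browserAudio = ["opus", "PCMU", "PCMA"]; // assumed browser offer
const telcoAudio = ["G729", "PCMA", "PCMU"];   // assumed telco-side support

console.log(commonCodecs(browserAudio, telcoAudio));
```

When the result is empty, or when it excludes the codec one side would prefer, something in the middle has to transcode.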

Worse yet, all media are encrypted as SRTP with DTLS key exchange, and that’s unheard of in today’s telco infrastructure.

So, we need to create the signaling plane, and then convert the network transport, convert the codecs, manage the ICE candidates selection in SDP, and allow access to the wealth of ready-made services (PSTN calls, IVRs, PBXs, conference rooms, etc), and then complement the legacy services with special features and new interconnected services enabled by the unique capabilities of WebRTC endpoints.

Yeah, that’s a job for FreeSWITCH.

Which architecture? Legacy on the Web, or Web on the Telco?

Real-time communication via the Web: From the building blocks we just saw, we can implement it in many ways.

We have one degree of freedom: signaling. Media will in any case be negotiated via SDP, transmitted as SRTP packets, and encrypted with keys exchanged via DTLS.

We still have to choose how we will find the peer to exchange media with. So, this is an exercise in directory, location, registration, routing, presence, status, etc. You get the idea.

So, at the end of the day you need to come up with a JavaScript library to implement your signaling on the browsers, commanding their underlying mechanisms (Comet, Websockets, WebRTC Data Channel) to find your beloved communication peer.

In practice, it boils down to a few possibilities:

SIP
XMPP (for example, Jabber)
In-house signaling implementation
VERTO (open source)

SIP and XMPP make today’s world spin around. SIP is mostly known for carrying the majority of telephone and VoIP signaling traffic. The biggest implementations of instant messaging and chatting are based on XMPP. And there is more: Those two signaling protocols are often used together, although each one of them has extensions that provide the other one’s functionality.

Both SIP and XMPP have been designed to be expandable and modular, and SIP particularly is an abstract protocol, for the management of “sessions” (where a “session” can be whatever has a beginning and an end in time, as a voice or video call, a screen share, a whiteboard, a collaboration platform, a payment, a message, and so on).

Both have robust JavaScript implementations available (for SIP check SIP.js, JsSIP, SIPML, while for XMPP check Strophe, stanza.io, jingle.js).

If your company has considerable investments and/or expertise in those protocols, then it makes sense to expand their usage on the web too.

If you’re running Skype, or similar services, you may find it an attractive option to maintain your proprietary, closed-signaling protocol and implement it in JavaScript, so you can expand your service reach to browsers and exploit the common transport and media technologies.

VERTO is our open source signaling proposal, designed from the ground up to be familiar to Web application developers, and allowing for a high degree of integration between FreeSWITCH-provided services and browsers. It is implemented on the FreeSWITCH side by a module (mod_verto) that talks JSON with the JavaScript library (verto.js) on the browser side.

FreeSWITCH accommodates them ALL

FreeSWITCH implements all of WebRTC low-level protocols, codecs and requirements. It’s got encryption, SRTP, DTLS, RTP, websocket and secure websocket transports (ws:// and wss://). Having got it all, it is able to serve SIP endpoints over WebRTC via mod_sofia (they’ll be just other SIP phones, exactly like the rest of soft and hard SIP phones), and it interacts with XMPP via mod_jingle.

Crucially, FreeSWITCH has been designed since its inception to be able to manage and process high-definition media, both audio and video. Support for the Opus audio codec (8 up to 48 kHz, enough for actual audio-CD quality) started years ago as a pioneering feature, and has evolved over the years to be so robust and self-healing as to sustain a loss of more than 40% (yep, as in FORTY PERCENT) of packets and maintain understandability. WebRTC’s VP8 video codec is routinely carrying our mixed video conferences in FullHD (as in 1920×1080 pixels), and we’re looking forward to investing in fiber and in some facial cream to look good in 4K.

That’s why FreeSWITCH can be the pivot of your next big WebRTC project: its architecture was designed from the start to be a multimedia powerhouse.

There is a lot of experience out there in using FreeSWITCH to expand the reach of existing SIP services by having browsers act as SIP phones via JavaScript libraries, without modifying the service logic and implementation in any way. You just add SIP extensions that happen to be browsers.

For the remainder of this article we’ll write about VERTO, a FreeSWITCH proposal especially dedicated to Web development.

What is Verto (module and jslib)?

Verto is a FreeSWITCH module (mod_verto) that allows for JSON interaction with FreeSWITCH, via secure websockets (wss). All the power and complexity of FreeSWITCH can be harnessed via Verto: Session management, call control, text messaging, and user data exchange and synchronization. Take a note for yourself: “User data exchange and synchronization”. We’ll be back to this later.
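
To give a feel for what “JSON interaction” means on the wire, here is a small sketch that builds JSON-RPC 2.0 style requests of the kind a Verto client could send over the secure websocket. Treat the method names (login, verto.invite) and the parameter names as assumptions made for illustration and check them against your FreeSWITCH version; the user, password, and destination number are hypothetical, and the SDP is a placeholder.

```javascript
// Build a JSON-RPC 2.0 request as a text frame for a websocket.
// Method and parameter names below are assumptions for illustration.
let nextRequestId = 1;
function vertoRequest(method, params) {
  return JSON.stringify({ jsonrpc: "2.0", method, params, id: nextRequestId++ });
}

const loginFrame = vertoRequest("login", {
  login: "1008@example.com", // hypothetical user
  passwd: "secret",          // hypothetical password
});

const inviteFrame = vertoRequest("verto.invite", {
  sdp: "v=0 ...",                               // placeholder for the browser's SDP offer
  dialogParams: { destination_number: "3500" }, // hypothetical extension to call
});

console.log(loginFrame);
console.log(inviteFrame);
```

In a browser these strings would be written to a WebSocket opened on the wss:// address mod_verto listens on; verto.js hides this layer behind its own objects.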

Verto is like Event Socket Layer (ESL) on steroids: Anything you can do in ESL (subscribe, send and receive messages in FS core message pumps/queues) you can do in Verto, but Verto is actually much more and can do much more. Verto is also made for high-level control of WebRTC!

Verto has an accompanying JavaScript library, verto.js. Using verto.js, a web developer can videoconference-enable a website and/or add a collaboration platform to a CRM system in a few lines of code. And those are lines of code that the developer understands, written in a logic that’s familiar to web developers, without forcing references to foreign knowledge domains like SIP.

Also, Verto allows for the simplest way to extend your existing SIP services to WebRTC browsers.

The added benefit of “user data exchange and synchronization” (see, I’m back to it) is not to be taken lightly: You can create data structures (for example, in JSON) and have them synchronized on server and all clients, with each modification made by the client or server to be automatically, immediately and transparently reflected on all other clients.

Imagine a dynamic list of conference participants, or a chat, or a stock ticker, or a multiuser ping pong game, and so on.
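
The idea behind such synchronized structures can be sketched in plain JavaScript, without any network: a master copy that pushes every change to its subscribers, and client-side mirrors that replay the changes. This is a conceptual illustration only, not Verto’s implementation; the class SharedList, the helper makeMirror, and the action names are all invented for the example.

```javascript
// The "server" keeps the master copy and pushes every change to all subscribers.
class SharedList {
  constructor() {
    this.items = [];
    this.subscribers = new Set();
  }
  subscribe(onChange) {
    this.subscribers.add(onChange);
    onChange({ action: "init", items: [...this.items] }); // initial snapshot
    return () => this.subscribers.delete(onChange);
  }
  apply(change) {
    if (change.action === "add") {
      this.items.push(change.item);
    } else if (change.action === "del") {
      this.items = this.items.filter((i) => i !== change.item);
    } else {
      throw new Error("unknown action: " + change.action);
    }
    for (const notify of this.subscribers) notify(change);
  }
}

// Client-side mirror that replays the changes it receives.
function makeMirror() {
  const mirror = { items: [] };
  mirror.onChange = (change) => {
    if (change.action === "init") mirror.items = [...change.items];
    else if (change.action === "add") mirror.items.push(change.item);
    else if (change.action === "del") mirror.items = mirror.items.filter((i) => i !== change.item);
  };
  return mirror;
}

const participants = new SharedList();
const firstClient = makeMirror();
const secondClient = makeMirror();
participants.subscribe(firstClient.onChange);
participants.apply({ action: "add", item: "Alice" });
participants.subscribe(secondClient.onChange); // joins late, gets a snapshot
participants.apply({ action: "add", item: "Bob" });
participants.apply({ action: "del", item: "Alice" });

console.log(firstClient.items, secondClient.items); // both mirrors hold the same list
```

In a real deployment the notifications would travel as JSON messages over the websocket, and a client joining late would receive the snapshot first, as the second mirror does here.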

Configure mod_verto

Mod_verto is installed by default in a standard FreeSWITCH installation. Let’s have a look at its configuration file, verto.conf.xml.

The most important parameter here, and the only one I had to modify from the stock configuration file, is ext-rtp-ip. If your server is behind a NAT (that is, it sits on a private network and exchanges packets with the public internet via some sort of port forwarding by a router or firewall), you must set this parameter to the public IP address that clients connect to.
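
A quick way to tell whether this applies to you is to check if the address your server binds to falls in one of the private IPv4 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). The helper below is a hypothetical convenience for that check and is not part of FreeSWITCH; the sample addresses are invented.

```javascript
// Return true when an IPv4 address belongs to one of the private ranges,
// which suggests the server sits behind a NAT.
function isPrivateIPv4(ip) {
  const octets = ip.split(".");
  const valid =
    octets.length === 4 &&
    octets.every((o) => /^\d{1,3}$/.test(o) && Number(o) <= 255);
  if (!valid) {
    throw new Error("not an IPv4 address: " + ip);
  }
  const [a, b] = octets.map(Number);
  return a === 10 || (a === 172 && b >= 16 && b <= 31) || (a === 192 && b === 168);
}

console.log(isPrivateIPv4("192.168.1.10")); // private: ext-rtp-ip is likely needed
console.log(isPrivateIPv4("203.0.113.7"));  // public: the stock value may be fine
```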

Other very important parameters are the codec strings. Those two parameters determine the absolute string that will be used in SDP media negotiation. The list in the string will represent all the media formats to be proposed and accepted. WebRTC has mandatory (so, assured) support for the vp8 video codec, while the mandatory audio codecs are opus and pcmu/pcma (that is, G.711). Pcmu and pcma are much less CPU hungry than opus. So, if you are willing to settle for less quality (G.711 is “old PSTN” audio quality), you can use “pcmu,pcma,vp8” as your strings, and have both clients and server use far less CPU power for audio processing.

This can make a real difference, and makes a lot of sense, in certain setups, for example, if you must cope with low-power devices. Also, if you route/bridge calls to/from the PSTN, they will have no use for opus high-definition audio; it is much better to directly offer the original G.711 stream than to decode and re-encode it in opus.
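
The trade-off just described can be written down as a small decision helper. It is only a sketch: the function codecString and its two options are hypothetical, and the default string that includes opus is an assumption; only “pcmu,pcma,vp8” comes from the text above.

```javascript
// Choose a codec string for the two codec parameters discussed above.
// lowPower: clients or server are short on CPU
// pstnOnly: calls are bridged to/from the PSTN, where audio is G.711 anyway
function codecString({ lowPower = false, pstnOnly = false } = {}) {
  const audio = lowPower || pstnOnly ? ["pcmu", "pcma"] : ["opus", "pcmu", "pcma"];
  return [...audio, "vp8"].join(",");
}

console.log(codecString({ pstnOnly: true }));
console.log(codecString());
```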

Test with Communicator

Once configured, you want to test your mod_verto install. What better moment than now to get to know the awesomeness of Verto Communicator, a JavaScript videoconference and collaboration advanced client, developed by Italo Rossi, Jonatas Oliveira and Stefan Yohansson from Brazil, Joao Mesquita from Argentina, and our core devs Ken Rice and Brian West from Tennessee and Oklahoma?

If it’s not already done, copy the Verto Communicator distribution directory (/usr/src/freeswitch.git/html5/verto/verto_communicator/dist/) into a directory served by your web server over SSL (be sure you have all the SSL certificates right).

To see it in all its splendor, be sure to call from two different clients, one as simple participant, the other as moderator, and you’ll be presented with controls to manage the conference layout, for giving floor, for screen sharing, for creating banners with name and title for each participant, for real-time chatting, and much more. It is simply astonishing what can be done with JavaScript and mod_verto.

Summary

In this article we delved into WebRTC design: what infrastructure it requires, and in what it is similar to and different from known VoIP.

We understood that WebRTC is only about media, and leaves the signaling to the implementor.

Also, we got the specifics of WebRTC: its way of traversing NAT, its omnipresent encryption, its peer-to-peer nature.

We saw that going beyond peer to peer, connecting with the telecommunication world of services, needs gateways that do transport, protocol and media translation.

FreeSWITCH is the perfect fit as WebRTC server, WebRTC gateway, and also as application server.

And then we saw how to implement Verto, a signaling protocol born on WebRTC, a JSON web protocol designed to exploit the additional features of WebRTC and of FreeSWITCH, like real-time data structure synchronization, session rehydration, event systems, and so on.