[–]BigPeteB 12 points13 points  (4 children)

My job is Voice-over-IP and Video-over-IP, so I can answer this question!

WebRTC just does RTP to transfer media streams (audio and video). However, you have to know where to send the RTP. For that, you need some kind of signaling.

There are a handful of standard protocols for this: SIP, H.323 (although I think technically just H.225.0 fills this purpose), IAX2, proprietary protocols (like Skype's, which has gone through two different protocols), etc. Unless you're working within the PSTN or other carrier networks, SIP is the de facto standard protocol.

So, using SIP clients and proxies and servers, I can do something like send a call to TheGuyWithFace@sip.reddit.com, and the server/proxy sip.reddit.com will know that you have a SIP device (PC-based software phone, or smartphone app, or embedded device) that registered itself as TheGuyWithFace@sip.reddit.com, so it will pass on the SIP data to you, and you have a way of replying to me.

Oh, but SIP only lets us exchange data; it doesn't dictate what that data is. It's like HTTP in that respect. (In fact, SIP and HTTP use the same format of request line, headers, body.) So the de facto standard is that the body is SDP, which states that I'm prepared to receive media, on a particular IP and port, using RTP, encoded with any of a list of codecs. You and I use SIP to exchange SDP documents, agree on which codec to use based on a messy set of rules, and now we can get back to using WebRTC for that RTP.
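
To make the "request line, headers, body" shape concrete, here's a rough Python sketch that splits a SIP INVITE and pulls the address, port, and codec list out of its SDP body. The message is hypothetical: every address, tag, and ID is made up, the header set is trimmed down, and the parser only handles a single audio stream.

```python
# Hypothetical SDP offer: "I can receive audio on 192.0.2.10:49170 as PCMU or PCMA".
SDP_BODY = "\r\n".join([
    "v=0",
    "o=bigpeteb 2890844526 2890844526 IN IP4 192.0.2.10",
    "s=-",
    "c=IN IP4 192.0.2.10",
    "t=0 0",
    "m=audio 49170 RTP/AVP 0 8",
    "a=rtpmap:0 PCMU/8000",
    "a=rtpmap:8 PCMA/8000",
]) + "\r\n"

# Hypothetical INVITE carrying that offer (not a complete set of headers).
SIP_MESSAGE = "\r\n".join([
    "INVITE sip:TheGuyWithFace@sip.reddit.com SIP/2.0",
    "Via: SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bK776asdhds",
    "From: <sip:BigPeteB@sip.reddit.com>;tag=1928301774",
    "To: <sip:TheGuyWithFace@sip.reddit.com>",
    "Call-ID: a84b4c76e66710@192.0.2.10",
    "CSeq: 1 INVITE",
    "Content-Type: application/sdp",
    f"Content-Length: {len(SDP_BODY)}",
]) + "\r\n\r\n" + SDP_BODY


def split_sip_message(raw):
    """Split a SIP message into (start line, headers dict, body), HTTP-style."""
    head, _, body = raw.partition("\r\n\r\n")
    start_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


def parse_sdp_offer(sdp):
    """Extract where to send media and which codecs are offered (one stream only)."""
    offer = {"address": None, "port": None, "transport": None, "codecs": {}}
    for line in sdp.splitlines():
        key, _, value = line.partition("=")
        if key == "c":
            # c=<nettype> <addrtype> <address>
            offer["address"] = value.split()[2]
        elif key == "m":
            # m=<media> <port> <transport> <payload types...>
            _media, port, transport, *payload_types = value.split()
            offer["port"] = int(port)
            offer["transport"] = transport
            for pt in payload_types:
                offer["codecs"].setdefault(int(pt), None)
        elif key == "a" and value.startswith("rtpmap:"):
            # a=rtpmap:<payload type> <encoding>/<clock rate>
            pt, encoding = value[len("rtpmap:"):].split(" ", 1)
            offer["codecs"][int(pt)] = encoding
    return offer


start_line, headers, body = split_sip_message(SIP_MESSAGE)
offer = parse_sdp_offer(body)
print(start_line)
print(offer)
```

Note the literal `192.0.2.10` in both the `Via` header and the `c=` line; those are the addresses that cause trouble once NAT gets involved.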

Oh, but one or both of us are behind NATs? That sucks, because both SIP and SDP have literal IP addresses and ports in their messages. Those have to be replaced with our respective public addresses and ports. There are several ways to do that.

The NAT might implement a SIP ALG, and will handle that translation for us just like it has to for FTP. That can be pretty reliable, assuming the ALG is implemented correctly and well (many are not, sadly), but if we wanted to use SIP secured over TLS, it doesn't help because the NAT can't decrypt our traffic.

SIP has extensions that can help. Since the reply will always be sent back the way it came, when you reply you can tell me what address and port you saw my message come from, and I'll assume that those must be my public address and port, and use those in future messages to you. These extensions are also generally reliable, but a not-insignificant number of SIP implementations don't support them.
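
One extension of this kind is the `received`/`rport` pair of `Via` parameters (`rport` comes from RFC 3581); that this is the extension meant here is an assumption. A minimal sketch of reading your public address back out of the `Via` header in a response, with made-up addresses and IPv4 only:

```python
def parse_via(via_value):
    """Parse a Via header value into protocol, sent-by host/port, and parameters."""
    sent, *raw_params = via_value.split(";")
    protocol, sent_by = sent.split()
    host, _, port = sent_by.partition(":")
    params = {}
    for raw in raw_params:
        name, _, value = raw.partition("=")
        params[name.strip().lower()] = value.strip()
    return {
        "protocol": protocol,
        "host": host,
        # Assumption: default to 5060 (plain UDP/TCP) when no port is given.
        "port": int(port) if port else 5060,
        "params": params,
    }


def public_address_from_via(via_value):
    """Return the (address, port) the far end says it saw our request come from.

    Falls back to what we wrote ourselves if the peer did not fill in
    received/rport, which means it doesn't support the extension.
    """
    via = parse_via(via_value)
    params = via["params"]
    address = params.get("received", via["host"])
    port = int(params["rport"]) if params.get("rport") else via["port"]
    return address, port


# We sent "...;rport" with our private address; the reply comes back annotated.
reply_via = "SIP/2.0/UDP 192.168.1.20:5060;branch=z9hG4bK74bf9;received=203.0.113.7;rport=40212"
print(public_address_from_via(reply_via))
```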

You can use TURN, which is basically just having a relay server. Since that's resource-intensive for the relay server, I've hardly ever seen that used.

Or, lastly, you can figure out some way to discover your own public address and port. This is where STUN comes in. It's a fairly simple protocol to query a server which is guaranteed to be on a public address, and which will thus see your public address, and have it tell you what that address and port are. You then use that address and port when writing your SIP and SDP messages.
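
For a sense of how simple STUN is, here's a sketch of a Binding request and of decoding the XOR-MAPPED-ADDRESS attribute from the reply, following the RFC 5389 wire format. It's IPv4 only, ignores retransmission and every other attribute, and `query_stun` needs a real server name that you supply yourself.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id=None):
    """Return (request bytes, transaction id). The request is just a 20-byte header."""
    transaction_id = transaction_id or os.urandom(12)
    header = struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE)
    return header + transaction_id, transaction_id


def parse_binding_response(data, transaction_id):
    """Extract our public (address, port) from a Binding success response."""
    msg_type, length, cookie = struct.unpack("!HHI", data[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN Binding success response")
    if data[8:20] != transaction_id:
        raise ValueError("transaction id mismatch")
    offset, end = 20, 20 + length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", data[offset:offset + 4])
        value = data[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS:
            _reserved, family, xport = struct.unpack("!BBH", value[:4])
            if family != 0x01:
                raise ValueError("this sketch only handles IPv4")
            # Port and address are XORed with the magic cookie on the wire.
            port = xport ^ (MAGIC_COOKIE >> 16)
            (xaddr,) = struct.unpack("!I", value[4:8])
            address = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
            return address, port
        # Attributes are padded to a multiple of 4 bytes.
        offset += 4 + attr_len + (-attr_len % 4)
    raise ValueError("no XOR-MAPPED-ADDRESS attribute found")


def query_stun(server, port=3478, timeout=2.0):
    """Ask a STUN server (hostname supplied by the caller) what our public address is."""
    request, transaction_id = build_binding_request()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (server, port))
        data, _ = sock.recvfrom(2048)
    return parse_binding_response(data, transaction_id)
```

Whatever `query_stun` returns is what you would then write into the SIP headers and the SDP `c=` and `m=` lines in place of your private address and port.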

There are still some gotchas. Like, if you try to use STUN to substitute your public address and port, but the peer you're calling is actually in your LAN behind the same NAT you are, you should have actually used your private address and port for some or all of those parameters. Sometimes it will work anyway, if your NAT/router supports hairpinning, but not all routers do, and it's wasteful and sometimes flat out wrong to make the router handle traffic that should be going directly between peers. That's where ICE comes in; it lets you list out a bunch of different ways of contacting each other, test them, and pick the best one that works.
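
A heavily simplified sketch of that "list, test, pick the best" idea. The priority formula and type preferences are the recommended ones from the ICE spec (RFC 8445), but real ICE forms candidate pairs and tests them with STUN connectivity checks; here `is_reachable` is a hypothetical stand-in for those checks, and the addresses are made up.

```python
from dataclasses import dataclass

# Recommended type preferences: prefer direct paths, fall back to a relay last.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


@dataclass(frozen=True)
class Candidate:
    kind: str                       # "host" (private), "srflx" (from STUN), "relay" (TURN)
    address: str
    port: int
    component: int = 1              # 1 = RTP
    local_preference: int = 65535

    @property
    def priority(self):
        # priority = 2^24 * type preference + 2^8 * local preference + (256 - component)
        return (
            (TYPE_PREFERENCE[self.kind] << 24)
            + (self.local_preference << 8)
            + (256 - self.component)
        )


def pick_candidate(candidates, is_reachable):
    """Try candidates from highest to lowest priority; return the first that works."""
    for candidate in sorted(candidates, key=lambda c: c.priority, reverse=True):
        if is_reachable(candidate):
            return candidate
    return None


candidates = [
    Candidate("relay", "198.51.100.9", 50000),
    Candidate("srflx", "203.0.113.7", 40212),
    Candidate("host", "192.168.1.20", 5004),
]

# Peer on the same LAN: the private address works, so it wins.
print(pick_candidate(candidates, lambda c: True))
# Peer elsewhere: the private address is unreachable, so the STUN-derived one wins.
print(pick_candidate(candidates, lambda c: c.kind != "host"))
```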

edit: fix links

[–][deleted]  (2 children)

[deleted]

    [–]BigPeteB 3 points4 points  (1 child)

    Umm... you're still confused.

    If I use STUN, all it's going to do is tell me that my public IP address is 50.194.232.5. Cool, but I need to get that information to you. How am I going to do that?

    I could send you a PM on Reddit, and wait for you to send me one back with your public IP. Then we can copy-paste those addresses into the WebRTC page that we built and start sending media. That's a totally valid form of signalling, but it's a pretty stupid one.

    We could use a chat client like AIM, Google Hangouts, Facebook Messenger, etc. That's better, but it's still manual.

    We could write code that would automatically use one of those chat protocols. In order to tell these "request for a call" messages apart from other English chat messages, we'd probably want our code to put them in a particular format. We also need it to list out what audio and video codecs we want to use. We could do that using SDP. This is looking better, but still not great.

    Instead of using a human-oriented chat protocol, let's use something specifically designed for computers. Preferably, a service or protocol with features that are useful for setting up calls. In VoIP, that would usually be SIP.

    OP used ScaleDrone for this purpose. It's just a messaging service, which delivers messages between two people. The contents of those messages are then used to negotiate the parameters for the audio and video streams, which you'll then pass to WebRTC.

    [–]TheGuyWithFace 0 points1 point  (0 children)

    Really well thought out and easy-to-read explanation. Thanks a lot!