Voice conferences

Recently, in an ongoing attempt to avoid Skype, I've read about protocols suitable for voice conferences, and have now decided to write it down.

1 Preface

Voice conferences never seemed handy to me for a number of reasons: they are strictly real-time (quite an issue if you don't maintain a sleep regimen or if any of the participants has anything else scheduled), "half-duplex" (in that only one person can talk at a time for the speech to be distinguishable), there's no reliable and easy way to get greppable logs (transcriptions), and unless they are combined with textual chat, there's no way to copy and paste things, to share links or program output, etc.

They are probably good for multiplayer games, when you are busy controlling a character, but also need to coordinate actions in real-time. They may also be nice for those who are not used to reading and writing/typing, or even for a casual chat.

What puzzles me is that even programmers (who should be used to reading and writing, and do share links and other textual things quite often) seem to prefer voice communication surprisingly often. Likewise, projects such as Tox and secushare aim to provide nice (secure and private) voice/video conferences while there's still no usable protocol for such textual conferences, which should be easier to implement; this suggests that their developers deem voice/video conferences pretty important.

1.1 Further observations

1.1.1 November 2017

  1. The conferences keep ending with variations of "let's discuss this in more detail by email".
  2. The primary argument I hear in favor of the conferences is that "sometimes it's faster to discuss something that way".
  3. Occasionally people propose a voice conference instead of answering a question by email; then it gets scheduled for a few days ahead, when everyone can attend.
  4. Other times those are also scheduled in advance, but at least they don't block email questions.

At least I haven't been getting into long ones in the past few years, but it's still rather annoying.

1.1.2 January 2018

Commonly used and mainstream software and protocols still don't manage to mitigate echoing; unless there's suitable equipment, participants have to mute and unmute themselves all the time in order to circumvent it.

2 Requirements and concerns

A particularly nasty thing about voice communication is speaker recognition and identification: you leave your biometrics along with what you are saying. If a protocol is centralized, doesn't provide end-to-end encryption, and/or is unknown, using it is roughly as good as making the conversations public and signed at once, and possibly stored until the end of civilization, which makes it considerably less comfortable to talk. Even apart from all the ethical and security issues, and considering just a casual chat: while some people write silly things in public and sign those with their names all the time, for others it'd be embarrassing. Perhaps forgiving silliness and not judging yourself and others helps to care about that less, but that's out of the scope of this note.

So, my initial requirements are: end-to-end encryption, an open protocol, at least an open source (preferably libre) client in existence, preferably a distributed protocol.

Apparently the requirements imposed by the majority of users, which should also be taken into account in order to actually use such a protocol, are that it should be extremely easy to set up and to use on various systems: not more than a few mouse clicks or touchscreen taps. Perhaps being well-known is another thing that is important to inexperienced users, since the lesser-known things they find tend to be malware – even by the relaxed, non-RMS definition of malware (while by the RMS definition, most of the widely used programs are malware).

And the obvious requirement for it is to work well: acceptable sound/video quality (no perceivable noise, pauses, or delays) even with not-so-good channels, NAT traversal, etc.

3 Protocols

There's a comparison of VoIP software and a few more lists on Wikipedia, the YBTI map, and just widely known protocols. Seems like newbie users mostly think in terms of client software that implements those protocols, by the way, so the clients are even more widely known.

3.1 XMPP + Jingle

Though it might be nice, with XMPP it's pretty tricky even to ensure textual, one-to-one message delivery: different clients and servers support different sets of XEPs, users are not willing to configure anything, clients hide most of the settings and logs from users, and because of that even the users who are willing to configure or debug things can't do it with reasonable effort.

3.2 SIP and H.323

Apparently those are very common for VoIP, and there's a lot of software (even an Android API for SIP), but they require setting up a server, and users don't hear about them that often, which makes them rather obscure. As long as it's not immediately available, "let's rather not deal with it" is a possible reaction.

3.3 WebRTC

WebRTC is a weird thing to find in web browsers (how's that functionality even related to the web?), yet it seems to be pretty nice: good NAT traversal (ICE, STUN, TURN), end-to-end encryption (DTLS), voice and video conferences, and support in all the common web browsers for a while now, which makes it extremely easy to use: a single mouse click to get into a conference. It's not perfect, but it's open and standardised, and uses other standards, too.

When I tried to use it with some public server, there was just noise and no distinguishable words; tcpdump showed that it used a slow TURN server instead of STUN with hole punching, even though UDP hole punching worked fine between the networks in question when I tried it manually, using netcat. Apparently further debugging will require reading plenty of docs, since there's nowhere to ask about it: no webrtc channel on freenode, and the webrtc.org contacts only list Twitter, Google+, and Google groups (which ignore my letters most of the time for an unknown reason, even with a usually negative spamassassin score).
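The manual netcat check amounts to roughly the following. This is only a minimal sketch, assuming both peers already know each other's public address and port and run the same code at about the same time; the function name `punch` and the probe payload are made up for illustration, and it is not what WebRTC itself does internally (ICE is considerably more involved).

```python
import socket


def punch(sock, peer, attempts=5, timeout=1.0):
    """Try to open a UDP path to `peer` (a (host, port) tuple).

    Each outgoing probe creates or refreshes a mapping in our own NAT;
    once the peer's probe arrives through it, the path is usable.
    Returns True if a datagram from `peer` was received.
    """
    sock.settimeout(timeout)
    for _ in range(attempts):
        sock.sendto(b"punch", peer)
        try:
            _data, addr = sock.recvfrom(1500)
        except socket.timeout:
            continue  # nothing arrived yet, send another probe
        if addr == peer:
            return True
    return False
```

Each side would bind a UDP socket to a known local port and call `punch(sock, (peer_ip, peer_port))`; whether it succeeds depends on the NAT types on both ends.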

3.3.1 Update

Found test.webrtc.org and discovered that my ISP now puts me behind a NAT, so I barely managed to pass the "Reflexive connectivity" test: I acquired a NAT-free IP address and redirected all the UDP traffic from my router to my host (all of it because there's no fixed port; the idea is that one shouldn't have to set up port forwarding at all, if only it worked). The redirection is done by adding the following into /etc/config/firewall (that's on libreCMC, should be the same with OpenWrt):

config redirect
        option enabled '1'
        option src 'wan'
        option dest 'lan'
        option proto 'udp'
        option name 'all udp'
        option dest_ip ''

(Then /etc/init.d/firewall restart to apply the changes.)

I don't know why it didn't work without that though.
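For context, a reflexive address is the one a STUN server sees your packets coming from. Below is a minimal sketch of the STUN messages involved (a Binding request and the XOR-MAPPED-ADDRESS attribute, per RFC 5389), IPv4 only; the function names are made up for illustration, the server is whatever STUN server you pass in, and this is not claimed to be how the test page works.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(txid=None):
    """Return (message, transaction_id) for an attribute-less Binding request."""
    if txid is None:
        txid = os.urandom(12)
    # Header: message type, attributes length (0), magic cookie, transaction ID.
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + txid, txid


def parse_xor_mapped_address(msg, txid):
    """Return (ip, port) from a Binding success response, or None if absent."""
    if len(msg) < 20:
        raise ValueError("message shorter than a STUN header")
    mtype, length, cookie = struct.unpack("!HHI", msg[:8])
    if mtype != BINDING_SUCCESS or cookie != MAGIC_COOKIE or msg[8:20] != txid:
        raise ValueError("not a matching Binding success response")
    pos, end = 20, min(20 + length, len(msg))
    while pos + 4 <= end:
        atype, alen = struct.unpack("!HH", msg[pos:pos + 4])
        value = msg[pos + 4:pos + 4 + alen]
        # value: reserved byte, family (0x01 = IPv4), X-Port, X-Address
        if atype == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            port = struct.unpack("!H", value[2:4])[0] ^ (MAGIC_COOKIE >> 16)
            addr = struct.unpack("!I", value[4:8])[0] ^ MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", addr)), port
        pos += 4 + alen + (-alen % 4)  # attributes are padded to 4 bytes
    return None


def stun_query(host, port=3478, timeout=2.0):
    """Ask the STUN server at host:port for our reflexive (ip, port)."""
    request, txid = build_binding_request()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (host, port))
        response, _ = sock.recvfrom(2048)
    return parse_xor_mapped_address(response, txid)
```

If the address returned by `stun_query` differs from the one configured on your own interface, there is a NAT somewhere in between.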

3.3.2 Update #2

I still haven't tested it; the idea of using voice conferences was ditched.

3.3.3 Update #3

Finally tested; for some reason it didn't quite work for sound anyway, even though it passed all the tests. Maybe it's caused by my microphone though.

3.4 Tox, secushare, etc

Though Tox seems nice and almost works, it's not that stable, and doesn't have quite usable software even for those who are willing to build and configure it. I haven't tried its voice conferences though. But still, it's not likely to work for regular users at the moment, even if voice conferences work well.

Secushare is similar, just the bad parts are currently worse (harder to set up, fewer things seem to be ready), and the good parts may be better.

4 Conclusion

It's 2016 (update: 2018), and there's still no satisfactory way to communicate over the internet – neither in text nor in voice. That's not so much because of incompatible views on what's better (there is that "convenience versus security" thing, but it doesn't have to be that bad), but simply because it's not done.