If you want to go with HTML5 only, you will need a browser implementing the HTML Media Capture draft (available here) in order to access the raw data from the microphone.
Once you have this data in hand, you need to send it over the network. WebSockets would be the HTML5 option for round trips with the server that are fast enough to send local audio data and receive remote audio data at the same time.
Since you mention Python, I would recommend looking at the Twisted implementation of WebSockets.
You can have all your clients "register" on the websocket server with a callerID, so the server knows where to find a given callerID.
Then your server will need an "invite" API where caller1 "invites" caller2.
Once the call is set up and each client starts sending its audio data, the server will be able to forward this audio data to the other party.
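To make the register / invite / forward flow above more concrete, here is a minimal sketch of the server-side bookkeeping, with no networking in it. Everything here is hypothetical naming on my part (CallRouter, register, invite, route_audio), and the send callable stands in for whatever your websocket connection object uses to push a message to a client. It also assumes the invite is accepted automatically, which a real server would not do.

```python
class CallRouter:
    """In-memory sketch of the signalling logic (hypothetical names)."""

    def __init__(self):
        self.clients = {}  # caller_id -> callable that sends a message to that client
        self.peers = {}    # caller_id -> caller_id of the other party in the call

    def register(self, caller_id, send):
        # Remember how to reach this caller_id.
        self.clients[caller_id] = send

    def invite(self, caller, callee):
        # Both parties must have registered first.
        if caller not in self.clients or callee not in self.clients:
            return False
        # Simplification: the call is considered set up right away.
        self.peers[caller] = callee
        self.peers[callee] = caller
        self.clients[callee](("invite", caller))
        return True

    def route_audio(self, sender, data):
        # Forward audio bytes from one party to the other.
        peer = self.peers.get(sender)
        if peer is None:
            return False
        self.clients[peer](("audio", data))
        return True


# Usage: plain lists stand in for the two websocket connections.
inbox1, inbox2 = [], []
router = CallRouter()
router.register("caller1", inbox1.append)
router.register("caller2", inbox2.append)
router.invite("caller1", "caller2")
router.route_audio("caller1", b"\x00\x01")
```

In a Twisted-based server, the send callable would presumably wrap the connection's own write method, but I have left that out since it depends on which websocket implementation you pick.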
Upon receiving audio data, the browser will need to play it on the speakers, probably using the HTML5 audio tag.
To do this, you may be forced to use a "trick": instead of having the websocket server forward the raw audio data to the client, you may need to simulate 2 "infinite" files:
- caller1.wav : sound captured on caller1 mic
- caller2.wav : sound captured on caller2 mic
Once the call is set up (caller1 would be informed of this event via websocket), the caller1 browser would set caller2.wav as the audio.src attribute. If the Python server keeps appending the raw audio data to caller2.wav as it receives it, playback should hopefully start.
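One way the "infinite" file idea could be sketched on the Python side: a WAV file starts with a 44-byte header that declares the data length, so a stream of unknown length has to put something there. A common workaround is to declare the maximum 32-bit size and then keep sending PCM chunks. The function names below (wav_header, stream_wav) and the default format (8 kHz, mono, 16-bit PCM) are my assumptions, and whether a given browser accepts such a header is something you would have to test.

```python
import struct


def wav_header(sample_rate=8000, channels=1, bits_per_sample=16,
               data_size=0xFFFFFFFF):
    """Build a 44-byte PCM WAV header; data_size defaults to the 32-bit maximum."""
    byte_rate = sample_rate * channels * bits_per_sample // 8
    block_align = channels * bits_per_sample // 8
    # The RIFF chunk size is the data size plus the 36 remaining header bytes,
    # capped so that it still fits in an unsigned 32-bit field.
    riff_size = min(0xFFFFFFFF, data_size + 36)
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", riff_size, b"WAVE",
        b"fmt ", 16, 1, channels, sample_rate,
        byte_rate, block_align, bits_per_sample,
        b"data", data_size,
    )


def stream_wav(chunks, **fmt):
    """Yield the header once, then every raw PCM chunk as it arrives."""
    yield wav_header(**fmt)
    for chunk in chunks:
        yield chunk
```

The server would write whatever stream_wav yields into the HTTP response for caller2.wav, with chunks fed by the audio data arriving from caller2 over the websocket.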
This sounds like a cool prototype you're going to hack up!
Good luck on your journey,
Jerome Wagner