I can offer a general point of view. I will assume SIP-based VoIP, which is pretty much omnipresent (IMS, LTE, 3GPP, etc.).
VoIP has two parts, which you may have spotted while searching:
- SIP (the control plane)
- RTP (the data or payload plane = audio)
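
To make the split a bit more concrete: SIP messages are text, while RTP is a small binary header followed by the audio payload. A minimal sketch of reading the 12-byte fixed RTP header (layout per RFC 3550) could look like the following. The function name `parse_rtp_header` is my own, and the sketch deliberately ignores header extensions:

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the 12-byte fixed RTP header (RFC 3550) and return its fields.

    Simplification for illustration: a header extension (X bit) is reported
    but not skipped, so it would remain part of "payload".
    """
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")

    # Network byte order: two flag bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])

    csrc_count = b0 & 0x0F
    header_len = 12 + 4 * csrc_count  # each CSRC entry is 4 bytes

    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": csrc_count,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[header_len:],
    }
```

Payload type 0 is PCMU (G.711 u-law) in the standard audio/video profile, so a typical 20 ms audio frame shows up as a 160-byte payload.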
In general, there are two approaches. The first comes from the peer-to-peer world, where every change in the media flow is communicated to the other party, with REFER doing the actual call transfer for whatever purpose. That is usually not the preferred way of doing things. The second approach hides whatever changes on the B-party (called party) side. This is also used in IMS (which is behind modern mobile networks, e.g. VoLTE). The trick is that the A-party (caller) actually reaches the B-party's proxy. In SIP terms, this is a B2BUA, a back-to-back user agent, which, as the name suggests, covers all the magic that happens in the called party's network.
The magic is hidden behind that B2BUA, which behaves as an entity in the middle and can therefore manipulate both SIP and RTP.
This entity can therefore fork the audio, using an MGW (media gateway), towards the "real" B-party (a human operator) while also directing the audio to the ML/AI/expert-system analysis. The process also involves the appropriate control-plane events, such as attaching and starting the analytic process, the actual audio forking (RTP), and triggering the SIP INVITE towards the final B-party. Once the analysis is concluded, the result is delivered via out-of-band messaging to a "rich" client at the SIP agent (a computer or tablet with a softphone) or to a CRM system attached to the call centre system. That message should inform the B-party about the result of the analysis.
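
The forking idea itself is simple, and a toy sketch may help. This is not how a real media gateway is implemented: there are no sockets, no transcoding and no SIP here, and `MediaFork`, `agent_leg` and `analysis_leg` are names I made up for illustration. It only shows that every incoming packet is copied to each attached leg:

```python
from typing import Callable, List

# A sink is anything that accepts one packet; in a real system this would be
# an RTP sender towards the agent or a feed into the analysis engine.
Sink = Callable[[bytes], None]


class MediaFork:
    """Toy stand-in for the media gateway: copies each packet to every sink."""

    def __init__(self) -> None:
        self._sinks: List[Sink] = []

    def attach(self, sink: Sink) -> None:
        # Would be triggered by a control-plane event, e.g. "analysis attached".
        self._sinks.append(sink)

    def on_packet(self, packet: bytes) -> None:
        # The same audio goes to every leg; the caller never sees the fork.
        for sink in self._sinks:
            sink(packet)


agent_leg: List[bytes] = []      # hypothetical leg towards the human operator
analysis_leg: List[bytes] = []   # hypothetical leg towards the analysis

fork = MediaFork()
fork.attach(agent_leg.append)
fork.attach(analysis_leg.append)

for packet in (b"frame-1", b"frame-2"):
    fork.on_packet(packet)
```

After the loop both legs hold the same two frames, which is the whole point: the A-party sends one stream and the called side decides who else gets a copy.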
All of this is hidden either inside the B2BUA or inside a SIP application server, which is a generic name for various services such as call distribution to call centre agents, voice mail, IVR, etc.
Voice analysis is used today at banks for caller verification, mood analysis and many other kinds of "smart" audio processing.
In that domain, there are both open-source and proprietary SIP systems. They tend to be somewhat complex. Moreover, the logic is quite different from request-response systems (like HTTP). A call is a stateful system with a "session" (call ~ Call-ID), and everything is bound to that.
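
To illustrate the stateful part, here is a rough sketch of a call table keyed by Call-ID. It is heavily simplified and the names `call_id_of` and `CallTable` are my own: a real stack tracks dialogs and transactions (tags, CSeq, timers), handles folded headers and many more responses. The state names are also just assumptions for the example:

```python
from typing import Dict


def call_id_of(message: str) -> str:
    """Return the Call-ID header value of a raw SIP message."""
    for line in message.split("\r\n")[1:]:   # skip the start line
        if not line:                         # empty line ends the headers
            break
        name, _, value = line.partition(":")
        # "i" is the compact form of Call-ID.
        if name.strip().lower() in ("call-id", "i"):
            return value.strip()
    raise ValueError("no Call-ID header found")


class CallTable:
    """Tracks one coarse state per call; everything is bound to the Call-ID."""

    def __init__(self) -> None:
        self.calls: Dict[str, str] = {}

    def on_message(self, message: str) -> str:
        call_id = call_id_of(message)
        start_line = message.split("\r\n", 1)[0]
        state = self.calls.get(call_id, "idle")

        if start_line.startswith("INVITE ") and state == "idle":
            state = "inviting"
        elif start_line.startswith("SIP/2.0 200") and state == "inviting":
            state = "established"
        elif start_line.startswith("BYE "):
            state = "terminated"

        self.calls[call_id] = state
        return state
```

Unlike an HTTP handler, the reaction to a message here depends on what happened earlier in the same call, which is why the same 200 response means different things in different states.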
I hope this helps.