The why is simple to explain: to carry several media over one bit stream. Consider DVB (digital TV): each transponder (= frequency) provides one bit stream. But a single TV channel already needs at least two streams, audio and video, plus several more that you never see directly because they carry meta-information. So instead of transporting each of these streams on a separate frequency, they are multiplexed into one bit stream. That is the MPEG-TS (Transport Stream). A demuxer then takes this stream and separates it back into the substreams that carry the actual content.
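To make the demuxing step concrete, here is a minimal sketch in Python. It relies only on the basic packet layout: a Transport Stream is a sequence of 188-byte packets, each starting with the sync byte 0x47 and carrying a 13-bit packet identifier (PID) that says which substream it belongs to. The function names `demux` and `make_packet` and the PID values are made up for illustration. A real demuxer would also read the meta-information tables to learn which PIDs belong to which channel, and would go on to parse the payloads, which this sketch does not do.

```python
from collections import defaultdict
from typing import Dict

TS_PACKET_SIZE = 188   # fixed MPEG-TS packet length in bytes
SYNC_BYTE = 0x47       # first byte of every TS packet


def demux(stream: bytes) -> Dict[int, bytes]:
    """Split a byte string of TS packets into one payload buffer per PID.

    Trailing bytes that do not fill a whole packet are ignored.
    """
    substreams = defaultdict(bytearray)
    for offset in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = stream[offset:offset + TS_PACKET_SIZE]
        if packet[0] != SYNC_BYTE:
            raise ValueError(f"lost sync at offset {offset}")
        pid = ((packet[1] & 0x1F) << 8) | packet[2]    # 13-bit packet identifier
        adaptation_field_control = (packet[3] >> 4) & 0x3
        payload_start = 4                              # header is 4 bytes
        if adaptation_field_control & 0x2:             # adaptation field present
            payload_start += 1 + packet[4]             # skip length byte + field
        if adaptation_field_control & 0x1:             # payload present
            substreams[pid] += packet[payload_start:]
    return {pid: bytes(data) for pid, data in substreams.items()}


def make_packet(pid: int, payload: bytes) -> bytes:
    """Build a payload-only test packet, padded with 0xFF (illustration only)."""
    header = bytes([SYNC_BYTE, (pid >> 8) & 0x1F, pid & 0xFF, 0x10])
    return header + payload.ljust(TS_PACKET_SIZE - 4, b"\xff")


# Hypothetical PIDs for one channel's video and audio substreams.
VIDEO_PID, AUDIO_PID = 0x100, 0x101
muxed = (make_packet(VIDEO_PID, b"V1")
         + make_packet(AUDIO_PID, b"A1")
         + make_packet(VIDEO_PID, b"V2"))

streams = demux(muxed)
print(sorted(streams))          # [256, 257]
print(len(streams[VIDEO_PID]))  # 368: two packets with 184 payload bytes each
```

The interleaved packets come back out as one buffer per PID, which is all that "separating into substreams" means at this level.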
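Thanks to multiplexing, a typical DVB-T transponder in Europe can carry about four TV channels (together called a multiplex; "bouquet" is also heard, though in DVB that term strictly means a provider's package of channels). The number can vary. It is a decision of the stream provider, since the transponder's bit rate is fixed: a trade-off between higher quality with fewer channels, which is more expensive per channel, and lower quality with more channels, which is cheaper, I guess.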
As to which audio stream is played: a TV channel can have several audio streams (for example, normal audio, audio with descriptions for the visually impaired, another language, etc.). By default, a player will probably play the first audio stream, but it can switch audio streams at any time.
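The selection logic can be sketched in a few lines of Python. This is only an illustration of "first stream by default, switch on request": the `AudioStream` class, the `pick_audio` function, and the PIDs and language codes are all hypothetical, and real players may apply different rules.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AudioStream:
    pid: int           # packet identifier of this audio substream
    language: str      # language code announced for the stream
    description: str   # human-readable label


def pick_audio(streams: List[AudioStream],
               preferred_language: Optional[str] = None) -> AudioStream:
    """Return the first stream in the preferred language, else the first stream."""
    if not streams:
        raise ValueError("channel has no audio streams")
    for stream in streams:
        if stream.language == preferred_language:
            return stream
    return streams[0]   # default: the first audio stream


# Hypothetical audio streams of one TV channel.
channel_audio = [
    AudioStream(0x101, "deu", "normal audio"),
    AudioStream(0x102, "deu", "audio description"),
    AudioStream(0x103, "eng", "original language"),
]

print(pick_audio(channel_audio).description)         # normal audio
print(pick_audio(channel_audio, "eng").description)  # original language
```

Switching at any time then just means the player starts decoding packets from a different audio PID while the video keeps playing.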