Incase anyone runs into this:
You can treat the samples per pixel as your zoom level, at higher levels (zoomed out more) you will probably want to subsample that for performance reasons.
You will most likely want a fixed width that fits on the screen to draw on and use virtual scrolling (so you don't potentially have a draw area of several million pixels).
You can calculate the value for each pixel by iterating over the audio data with: skip (scroll position * samples per pixel) + (pixel * samples per pixel) take samples per pixel.
This allows for performant infinite zoom and scroll as you only read and draw the minimum amount to fill the view.
The scroll width is calculated with audio data length / samples per pixel.
Audio samples are generally shown in one of two ways, the peak value of the sample range or the rms value. The rms value is calculated by summing the squares of all values in the sample range, divide the sum by sample length, the rms value if the squareroot of this (rms will be a bit higher than average and is a good measure of perceived loudness)
You can increase performance in multiple ways such as increasing sub sampling (causes loss of detail), throttling the scroll and making the draw requests cancelable incase new scroll fires before previous is rendered.