Explanation of audio stat using sox

Asked 14/4, 2017 at 16:18 Answered 8/3, 2020 at 1:0

I have a bunch of audio files and need to split each file on silence using SoX. However, some files have a very noisy background and some don't, so I can't use a single set of parameters to split all of them. I'm trying to figure out how to separate the files by how noisy their background is. Here is what I got from sox input1.flac -n stat and sox input2.flac -n stat:

Samples read:          18207744
Length (seconds):    568.992000
Scaled by:         2147483647.0
Maximum amplitude:     0.999969
Minimum amplitude:    -1.000000
Midline amplitude:    -0.000015
Mean    norm:          0.031888
Mean    amplitude:    -0.000361
RMS     amplitude:     0.053763
Maximum delta:         0.858917
Minimum delta:         0.000000
Mean    delta:         0.018609
RMS     delta:         0.039249
Rough   frequency:         1859
Volume adjustment:        1.000

and

Samples read:         198976896
Length (seconds):   6218.028000
Scaled by:         2147483647.0
Maximum amplitude:     0.999969
Minimum amplitude:    -1.000000
Midline amplitude:    -0.000015
Mean    norm:          0.156168
Mean    amplitude:    -0.000010
RMS     amplitude:     0.211787
Maximum delta:         1.999969
Minimum delta:         0.000000
Mean    delta:         0.091605
RMS     delta:         0.123462
Rough   frequency:         1484
Volume adjustment:        1.000

The former does not have a noisy background and the latter does. I suspect I can use the sample mean or the Maximum delta because of the big gap between the two files. Can anyone explain to me the meaning of these stats, or at least show me where I can look them up myself? (I tried the official documentation, but it doesn't explain them.) Many thanks.

Caudate answered 14/4, 2017 at 16:18 Comment(0)

I don't know how I've managed to miss stat in the SoX docs all this time; it's right there.

Length
- length of the audio file in seconds
Scaled by
- what the input is scaled by. By default 2^31-1, to go from 32-bit signed integer to [-1, 1]
Maximum amplitude
- maximum sample value
Minimum amplitude
- minimum sample value
Midline amplitude
- aka mid-range, midpoint between the max and minimum values.
Mean norm
- arithmetic mean of samples' absolute values
Mean amplitude
- arithmetic mean of samples' values
RMS amplitude
- root mean square: the square root of the mean of the squared sample values
Maximum delta
- maximum absolute difference between two successive samples
Minimum delta
- minimum absolute difference between two successive samples
Mean delta
- arithmetic mean of absolute differences between successive samples
RMS delta
- root mean square of differences between successive samples
Rough frequency
- estimate of the input file's frequency, in hertz. Unsure of the method used.
Volume adjustment
- value that should be passed to -v so that the peak absolute amplitude becomes 1
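
The definitions above can be checked by hand on a few samples. Below is a minimal sketch, not SoX's own code: sample_stats and rough_freq are hypothetical helper names, the four sample values are made up, and the deltas are taken as absolute differences averaged over n-1 pairs, an assumption that fits the 0.000000 minimum delta in the question. The rough_freq helper tries the relation RMS delta / RMS amplitude × sample rate / 2π. With the question's figures and an assumed 16 kHz sample rate (Samples read / Length gives 32000 samples per second, which would fit 16 kHz stereo) it reproduces the reported 1859 and 1484, but treat that as an observation rather than documented behaviour.

```shell
#!/bin/sh
# Recompute some stat-style figures from samples given one per line,
# already scaled to [-1, 1]. Assumption: deltas are absolute differences.
sample_stats() {
  awk '
    {
      x = $1 + 0
      if (NR == 1) { max = x; min = x }
      if (x > max) max = x
      if (x < min) min = x
      sum += x
      abssum += (x < 0 ? -x : x)
      sqsum += x * x
      if (NR > 1) {
        d = x - prev
        if (d < 0) d = -d
        if (NR == 2) { dmax = d; dmin = d }
        if (d > dmax) dmax = d
        if (d < dmin) dmin = d
        dsum += d
        dsqsum += d * d
      }
      prev = x
    }
    END {
      if (NR < 2) exit 1
      printf "Midline amplitude: %.6f\n", (max + min) / 2
      printf "Mean norm: %.6f\n", abssum / NR
      printf "Mean amplitude: %.6f\n", sum / NR
      printf "RMS amplitude: %.6f\n", sqrt(sqsum / NR)
      printf "Maximum delta: %.6f\n", dmax
      printf "Minimum delta: %.6f\n", dmin
      printf "Mean delta: %.6f\n", dsum / (NR - 1)
      printf "RMS delta: %.6f\n", sqrt(dsqsum / (NR - 1))
    }'
}

# Hypothetical cross-check of "Rough frequency".
# Usage: rough_freq RMS_DELTA RMS_AMPLITUDE SAMPLE_RATE
rough_freq() {
  awk -v d="$1" -v a="$2" -v r="$3" 'BEGIN {
    pi = atan2(0, -1)
    printf "%d\n", int(d / a * r / (2 * pi))
  }'
}

printf '%s\n' 0.5 -0.5 0.25 0.25 | sample_stats
rough_freq 0.039249 0.053763 16000   # first file in the question
rough_freq 0.123462 0.211787 16000   # second file in the question
```

For the four made-up samples, Mean norm comes out as 0.375 (the mean of 0.5, 0.5, 0.25, 0.25) while Mean amplitude is 0.125, which shows why the two differ: positive and negative samples cancel in the latter but not in the former.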

Personally, I'd rather use the stats effect, whose output I find much more useful in practice.

To tell the noisier files from the cleaner ones, I'd try using the difference between the highest and lowest sound levels. The quietest parts can never be quieter than the background noise alone, so if the difference is small, the audio is either noisy or just loud all the time, like a compressed pop song. You could take the difference between the maximum and minimum RMS values, or between the peak level and the minimum RMS. The RMS window length should be kept fairly short, say between 10 and 200 ms. If the audio has fade-in or fade-out sections, those should be trimmed away first, though I didn't include that in the code.

audio="input1.flac"
width=0.01  # RMS window length in seconds

# Mix multi-channel files down to mono so stats prints a single column.
# stats writes its report to stderr, hence the 2>&1.
stats=$(sox "$audio" -n channels 1 stats -w "$width" 2>&1 |
  grep -E "Pk lev dB|RMS Pk dB|RMS Tr dB" |
  sed 's/[^0-9.-]*//g')

# The matching lines appear in this order: Pk lev dB, RMS Pk dB, RMS Tr dB
peak=$(head -n 1 <<< "$stats")
rmsmax=$(head -n 2 <<< "$stats" | tail -n 1)
rmsmin=$(tail -n 1 <<< "$stats")

rmsdif=$(bc <<< "scale=3; $rmsmax - $rmsmin")
pkmindif=$(bc <<< "scale=3; $peak - $rmsmin")

echo "
  max RMS: $rmsmax
  min RMS: $rmsmin

  diff RMS: $rmsdif
  peak-min: $pkmindif
"

Theatre answered 15/4, 2017 at 17:16 Comment(6)
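
To apply the measure from the answer above across a whole batch, the parsing can be wrapped in a function and compared against a cut-off. A minimal sketch follows; rms_range and is_noisy are hypothetical helper names, and the 30 dB threshold and the *.flac glob are placeholders to tune on files whose noise level is already known, not recommended values.

```shell
#!/bin/sh
# Read a `sox ... stats` report on stdin and print RMS Pk dB minus RMS Tr dB.
# Assumes a single-channel report; fails if either line is missing or
# non-numeric (for example -inf on digital silence).
rms_range() {
  awk '
    /^RMS Pk dB/ { pk = $NF; havepk = 1 }
    /^RMS Tr dB/ { tr = $NF; havetr = 1 }
    END {
      if (!havepk || !havetr) exit 1
      if (pk !~ /^-?[0-9.]+$/ || tr !~ /^-?[0-9.]+$/) exit 1
      printf "%.2f\n", pk - tr
    }'
}

# Succeed (exit 0) when the range in dB is below the threshold.
is_noisy() {
  awk -v range="$1" -v limit="$2" 'BEGIN { exit !(range + 0 < limit + 0) }'
}

threshold=30   # dB; placeholder value, tune on known files

if command -v sox >/dev/null 2>&1; then
  for f in *.flac; do
    [ -e "$f" ] || continue
    range=$(sox "$f" -n channels 1 stats -w 0.05 2>&1 | rms_range) || continue
    if is_noisy "$range" "$threshold"; then
      echo "noisy $f $range"
    else
      echo "clean $f $range"
    fi
  done
fi
```

Files whose report cannot be parsed are skipped rather than classified, so a silent or unreadable file does not end up in either group by accident.
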

Can I please ask what's a good way to determine or estimate whether audio is clipped? Checking whether 'Min level' == 1.0 or 'Pk lev dB' == 0.0? Thanks – Bonar 7/10, 2019 at 23:3

@Blue482: Those values could also just mean that the audio has been normalized, and they won't detect clipping if the audio was scaled afterwards. Better to use Flat factor and Pk count. The first is a measure of how many successive samples have a delta of zero; the second is a measure of how many times the audio reaches max amplitude. Both will increase with clipping. The docs claim that Pk count is a count of occurrences rather than samples, but don't expand on that, and it tracks the number of samples rather well. – Theatre 8/10, 2019 at 12:4

Flat factor is also not very well defined, but its maximum appears to be 87.6, and it does not depend on the length of the audio. See for example: sox -n -p synth 10 square 1 norm -3 | sox - -n stats – Theatre 8/10, 2019 at 12:13

Thanks @AkselA! Are there any intuitions on how to select thresholds for Flat factor and Pk count when deciding whether audio is clipped? – Bonar 8/10, 2019 at 14:9

@Blue482: I think you have to test for yourself. Find out what type I/II error rates you are most comfortable with. It also depends a lot on what kind of music you're testing. Slight clipping in a crunched electric guitar is not nearly as noticeable as in a solo clarinet, say. – Theatre 8/10, 2019 at 19:5

Thank you very much @AkselA! Appreciated! My audio files are conversational recordings. – Bonar 8/10, 2019 at 21:13
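
The two clipping indicators discussed in the comments above can be read out of a stats report with a small filter. A minimal sketch, assuming a single-channel report; clip_indicators is a hypothetical helper name, and SoX may abbreviate large counts (for example with a k suffix), which this sketch passes through unchanged.

```shell
#!/bin/sh
# Print "Flat factor" and "Pk count" from a `sox ... stats` report on stdin.
clip_indicators() {
  awk '
    /^Flat factor/ { flat = $NF }
    /^Pk count/    { count = $NF }
    END { printf "flat_factor=%s pk_count=%s\n", flat, count }'
}

# Example from the comments: a normalized square wave sits at its peak
# level almost all the time, so both figures should come out high.
if command -v sox >/dev/null 2>&1; then
  sox -n -p synth 10 square 1 norm -3 | sox - -n stats 2>&1 | clip_indicators
fi
```

Running the same filter over a few recordings that are known to be clean and a few that are known to be clipped is one way to get a feel for where a threshold might sit.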

The documentation is found in sox.pdf in the install directory.

For example, if you install the Windows 32-bit version of SoX 14.4.2, the PDF is found at C:\Program Files (x86)\sox-14-4-2\sox.pdf and the documentation for stat is on pages 35 - 36.

I also found a webpage version here.

Goddart answered 7/9, 2019 at 19:33 Comment(0)

I'd use the "mean norm" value as the decider. It works for me, especially if you get pops or clicks on the line. If the line is clean, however, then Maximum amplitude might be a better value to use (I notice your Maximum amplitude is the same in both files, so don't use it in your case).
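
A minimal sketch of that approach, assuming the stat report format shown in the question. mean_norm is a hypothetical helper name, and the 0.1 cut-off is only a placeholder that happens to fall between the two files in the question (0.031888 and 0.156168); it would need tuning on more files.

```shell
#!/bin/sh
# Print the "Mean norm" value from a `sox ... -n stat` report on stdin.
mean_norm() {
  awk '/^Mean +norm:/ { print $NF }'
}

cutoff=0.1   # placeholder, not a recommended value

if command -v sox >/dev/null 2>&1; then
  for f in *.flac; do
    [ -e "$f" ] || continue
    # stat writes its report to stderr, hence the 2>&1
    norm=$(sox "$f" -n stat 2>&1 | mean_norm)
    [ -n "$norm" ] || continue
    if awk -v n="$norm" -v c="$cutoff" 'BEGIN { exit !(n + 0 > c + 0) }'; then
      echo "noisy $f $norm"
    else
      echo "clean $f $norm"
    fi
  done
fi
```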

Risley answered 8/3, 2020 at 1:0 Comment(0)
