Stripping silence with sox

Asked 21/12, 2016 at 23:39 Answered 29/5, 2022 at 18:59

I have around 20,000 .wav files (all voice lines) that I need to strip the silence from the start AND end of.

The "silence" isn't pure silence, so I'll need to set a threshold.

I'd also like to leave a little "silence" before the actual sound/voice starts, so each file would get trimmed but .X seconds of the original silence remains.

I've tried various commands and can't get it to set a threshold correctly. I've seen a lot of internet comments about doing this, so I must be using the command wrong.

I also can't figure out how to leave .X seconds of silence.

I assume sox can do this, or at least most of it?

Sequoia asked 21/12, 2016 at 23:39

I found this very useful guide to the SoX silence effect. The official SoX manual entry for silence is quite a mess and hard to follow, but this guide explains it thoroughly, with examples: https://digitalcardboard.com/blog/2009/08/25/the-sox-of-silence/comment-page-2/

You can try:

sox input.wav output.wav silence 1 X 0.1% 1 X 0.1% : newfile : restart

where X is a duration in seconds, such as 0.75.

Pilcomayo answered 13/4, 2018 at 22:41

Do you have any idea how to do this in Python? Is there any library that does the same? – Defelice 3/4, 2019 at 12:18

@DeepanRaj - Use Python to fork sox? I doubt you will find a Python library that supports all the audio formats and features that sox does... Unless it's just a library that wraps sox... – Karli 30/7, 2019 at 5:34

I found the pydub library, which does the job for me. Thanks – Defelice 12/8, 2019 at 6:48

You are right, the man page is no help here, and the article is great! Thanks – Admonitory 9/6, 2020 at 21:20

Trimming silence at the start and end

One solution would be (based on this Digital Cardboard blog post) to call sox like this:

sox in.wav out.wav silence 1 0.1 0.1% reverse silence 1 0.1 0.1% reverse


Here is the same command with placeholders X and Y instead of specific values, so that I can explain below exactly what is happening:

sox in.wav out.wav silence 1 X Y reverse silence 1 X Y reverse

X is the minimum duration (in seconds) that a sound must last to be interpreted as non-silence by sox. For example, there might be a loud click at the beginning of the audio that is 0.15 seconds long. If we set X to 0.2, this loud but short click will be interpreted as silence and removed. If we set X to 0.1, the click will be interpreted as the start of the non-silent part, meaning everything before the click will be removed but not the click itself.
Also note that the duration should be written with a decimal point even when it is a whole number: use 1.0 instead of 1 to avoid unexpected behavior.

Y defines a loudness threshold. Everything below it is interpreted as silence, no matter how long or short it is. So a long but quiet rumbling sound at the beginning might fall below the threshold, be interpreted as silence, and therefore be removed. Anything loud enough to exceed the threshold is interpreted as the start of non-silence if it lasts long enough (see X).
Note that Digital Cardboard recommends 0.1% as the smallest value to use, rather than 0.

The 1 (the above-periods argument in the SoX manual) specifies that silence is removed only at the beginning. To trim silence at the end, we reverse the audio, apply the same effect again, and reverse it back. Why this is the correct way to trim the end should become apparent further below, where I analyze what the other answers do.
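The chain above can be wrapped in a small function so that X and Y become arguments. This is a minimal sketch, assuming bash; the name trim_silence, its defaults, and the SOX variable (which lets you substitute echo for a dry run) are hypothetical and not part of SoX itself:

```shell
#!/usr/bin/env bash
# Trim leading and trailing silence from one file using
# "silence 1 X Y reverse silence 1 X Y reverse".
# trim_silence and the SOX override are hypothetical helpers, not SoX features.
trim_silence() {
  local in="$1" out="$2"
  local dur="${3:-0.1}"    # X: minimum duration (seconds) of sound that counts as non-silence
  local thr="${4:-0.1%}"   # Y: loudness threshold; anything quieter counts as silence
  local sox="${SOX:-sox}"  # set SOX=echo to print the arguments instead of running sox
  "$sox" "$in" "$out" \
    silence 1 "$dur" "$thr" \
    reverse \
    silence 1 "$dur" "$thr" \
    reverse
}
```

For example, `trim_silence in.wav out.wav 0.1 1%` runs the command with a 1% threshold, and `SOX=echo trim_silence in.wav out.wav` only prints the sox arguments, which is handy for checking values before touching real files.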

Leaving a certain amount of silence at the beginning

The simple answer is: sox does not support this.

But we can work around this by trimming the silence and then adding a fixed amount of silence at the beginning. This can be done with:

sox in.wav out.wav silence 1 0.1 0.1% reverse silence 1 0.1 0.1% reverse pad X 0

X is the duration (in seconds) of the silence that we want to prepend.

0 in this position means that no padding should be added at the end.

Of course, this is not the same as keeping some of the original silence. With that approach, a file that has no silence at the beginning would still have none afterwards, whereas padding always prepends X seconds. Still, trimming + padding is the best I could come up with.
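Since the question involves around 20,000 files, the trim + pad command can be run over a whole directory. This is a minimal sketch, assuming bash and that all inputs are .wav files in one directory; trim_dir, its argument order and defaults, and the SOX dry-run variable are hypothetical:

```shell
#!/usr/bin/env bash
# Trim start/end silence from every .wav in SRC, prepend PAD seconds of
# silence ("pad PAD 0"), and write the results to DST under the same names.
# trim_dir and the SOX override are hypothetical, not SoX features.
trim_dir() {
  local src="$1" dst="$2"
  local pad="${3:-0.25}"   # seconds of silence to prepend (X in "pad X 0")
  local dur="${4:-0.1}"    # minimum non-silence duration in seconds
  local thr="${5:-0.1%}"   # silence threshold
  local sox="${SOX:-sox}"  # SOX=echo prints each command instead of running it
  local f
  mkdir -p "$dst"          # note: created even during a dry run
  for f in "$src"/*.wav; do
    [ -e "$f" ] || continue  # no matches: skip the literal "*.wav" pattern
    "$sox" "$f" "$dst/${f##*/}" \
      silence 1 "$dur" "$thr" reverse \
      silence 1 "$dur" "$thr" reverse \
      pad "$pad" 0 || echo "failed: $f" >&2
  done
}
```

For example, `trim_dir voice_lines trimmed 0.3` writes trimmed copies with 0.3 seconds of leading silence into trimmed/. The loop is sequential; for this many files, a tool such as xargs -P or GNU parallel could spread the work across CPU cores.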

Other answers

So far, none of the other answers here solves the question. The OP wanted to remove silence from the start and the end. For the interested, here is what the previous solutions do instead:

Kid_Learning_C: Multiple output files are generated, each containing one of the non-silent parts of the input. Suppose the input file consists of Silence->Non-Silence-A->Silence->Non-Silence-B->Silence. With the arguments from this answer we would get output001.wav containing Non-Silence-A, output002.wav containing Non-Silence-B and, for some reason, an extremely short output003.wav.
DSBLR: The end is not trimmed.
Anas Naguib: Also removes in-between silence. So with an input file that consists of Silence->Non-Silence-A->Silence->Non-Silence-B->Silence we would get Non-Silence-A->Non-Silence-B which means all silence is removed. Not just at the start and end.

Oh, and none of those answers provides a way to keep some of the silence at the beginning, as the OP asked.

Inference answered 29/5, 2022 at 18:59

Trim the silence at the beginning of the audio:

sox in.wav out1.wav silence 1 0.1 1%

Source: https://digitalcardboard.com/blog/2009/08/25/the-sox-of-silence/
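The same effect can trim only the end if the audio is reversed before and after it. This is a minimal sketch, assuming bash and the 0.1 / 1% values from this answer; trim_start, trim_end and the SOX dry-run variable are hypothetical names:

```shell
#!/usr/bin/env bash
# trim_start/trim_end and the SOX override are hypothetical helpers.
# trim_start IN OUT: the command from this answer (leading silence only).
trim_start() {
  "${SOX:-sox}" "$1" "$2" silence 1 0.1 1%
}
# trim_end IN OUT: reverse, trim the new start (the old end), reverse back.
trim_end() {
  "${SOX:-sox}" "$1" "$2" reverse silence 1 0.1 1% reverse
}
```

Setting `SOX=echo` prints the arguments instead of running sox.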

Emission answered 11/6, 2019 at 1:21

You can use this command with SoX:

sox inputfile.wav tmpoutput.wav silence 1 0.75 0.1% -1 0.75 0.1%

Quinnquinol answered 26/10, 2021 at 22:20

Nope, that cuts a lot of samples away from the beginning, if there are pauses in between – Adorl 7/1, 2023 at 18:10
