Trimming silence at the start and end
One solution, based on this Digital Cardboard blog post, is to call sox like this:
sox in.wav out.wav silence 1 0.1 0.1% reverse silence 1 0.1 0.1% reverse
Here is the same command with placeholders X and Y instead of specific values, to make the explanation below easier to follow:
sox in.wav out.wav silence 1 X Y reverse silence 1 X Y reverse
X is the minimum duration (in seconds) that a sound must last to be interpreted as non-silence by sox. For example, there might be a loud click at the beginning of the audio that is 0.15 seconds long. If we set X to 0.2, this loud but short click is interpreted as silence and removed. If we set X to 0.1, sox interprets the click as the start of the non-silent part, so everything before the click is removed but not the click itself.
Also note that a trailing zero should be used if the duration is a whole number, so 1.0 should be used instead of 1 to avoid unexpected behavior.
Y defines a loudness threshold. Everything below it is interpreted as silence, no matter how long or short it is. So a long rumbling sound at the beginning that is not very loud might fall below the threshold, be interpreted as silence, and thus be removed. Everything loud enough to be above the threshold is interpreted as the start of non-silence if it lasts long enough (see X).
Note that the Digital Cardboard post states that the smallest value to be used should be 0.1% instead of 0.
1 simply specifies to remove silence only at the beginning. To trim silence at the end we use the same effect but reverse the audio first, then reverse it back afterwards. Why this approach is correct for trimming the end should become apparent below, where I analyze what the solutions of the other answers do.
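For repeated use, the command above can be wrapped in a small shell function. This is only a sketch: the function name trim_silence and the variables MIN_DURATION and THRESHOLD are hypothetical names standing in for X and Y, and the values are simply the ones used in the command above.

```shell
#!/bin/sh
# Hypothetical wrapper around the command above; names are illustrative only.
MIN_DURATION="0.1"   # X: minimum duration (seconds) for a sound to count as non-silence
THRESHOLD="0.1%"     # Y: loudness threshold below which audio counts as silence

trim_silence() {
    # $1 = input file, $2 = output file
    # Trim leading silence, reverse, trim leading silence again
    # (which is the original trailing silence), then reverse back.
    sox "$1" "$2" \
        silence 1 "$MIN_DURATION" "$THRESHOLD" reverse \
        silence 1 "$MIN_DURATION" "$THRESHOLD" reverse
}
```

It would be called as trim_silence in.wav out.wav. Comparing the output of soxi -D in.wav and soxi -D out.wav (duration in seconds) is one way to check how much was actually trimmed.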
Leaving a certain amount of silence at the beginning
The simple answer is: sox does not support this.
But we can work around this by trimming the silence and then adding a fixed amount of silence at the beginning. This can be done with:
sox in.wav out.wav silence 1 0.1 0.1% reverse silence 1 0.1 0.1% reverse pad X 0
X is the duration (in seconds) of the silence that we want to prepend. The 0 in this position means that no padding should be added at the end.
Of course this is not the same as keeping some duration of the original silence (if present): that would produce result files without any silence at the beginning whenever the input has none, whereas padding always adds it. Still, trimming + padding is the best I could come up with.
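The trim-then-pad workaround can be sketched the same way. Again, the names trim_and_pad and LEAD_SILENCE are hypothetical, and 0.5 seconds is just an assumed example value for the prepended silence.

```shell
#!/bin/sh
# Hypothetical wrapper for the trim + pad workaround; names and the 0.5 value are assumptions.
MIN_DURATION="0.1"   # minimum duration (seconds) for a sound to count as non-silence
THRESHOLD="0.1%"     # loudness threshold below which audio counts as silence
LEAD_SILENCE="0.5"   # seconds of fresh silence to prepend (example value)

trim_and_pad() {
    # $1 = input file, $2 = output file
    # Trim both ends as before, then prepend LEAD_SILENCE seconds of silence
    # and append none (the trailing 0).
    sox "$1" "$2" \
        silence 1 "$MIN_DURATION" "$THRESHOLD" reverse \
        silence 1 "$MIN_DURATION" "$THRESHOLD" reverse \
        pad "$LEAD_SILENCE" 0
}
```

Called as trim_and_pad in.wav out.wav, this always yields LEAD_SILENCE seconds of digital silence at the start, regardless of how much silence the input had.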
Other answers
So far, none of the other answers here solve the question. The OP wanted to remove silence from the start and the end. For those interested, here is what the previous solutions do instead:
- Kid_Learning_C: Multiple output files are generated, where each file contains one of the non-silent parts of the input. So suppose the input file consists of Silence->Non-Silence-A->Silence->Non-Silence-B->Silence. Using the arguments from this answer we would get output001.wav containing Non-Silence-A, output002.wav containing Non-Silence-B, and, for some reason, an extremely short output003.wav.
- DSBLR: The end is not trimmed.
- Anas Naguib: Also removes in-between silence. So with an input file that consists of Silence->Non-Silence-A->Silence->Non-Silence-B->Silence we would get Non-Silence-A->Non-Silence-B, which means all silence is removed, not just the silence at the start and end.
Oh, and none of those answers provides a solution for keeping some of the silence at the beginning, as asked by the OP.