How to create a numpy array from a pydub AudioSegment?

Asked 24/6, 2016 at 14:3 Answered 15/12, 2021 at 15:54

I'm aware of the following question: How to create a pydub AudioSegment using an numpy array?

My question is the exact opposite: if I have a pydub AudioSegment, how can I convert it to a numpy array?

I would like to use scipy filters and so on. The internal structure of the AudioSegment raw data is not very clear to me.

Hatred answered 24/6, 2016 at 14:3 Comment(0)

Pydub has a facility for getting the audio data as an array of samples. It returns an array.array instance (not a numpy array), but you should be able to convert it to a numpy array relatively easily:

from pydub import AudioSegment
sound = AudioSegment.from_file("sound1.wav")

# this is an array
samples = sound.get_array_of_samples()

You may be able to create a numpy variant of the implementation though. That method is implemented pretty simply:

def get_array_of_samples(self):
    """
    returns the raw_data as an array of samples
    """
    return array.array(self.array_type, self._data)
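One possible numpy variant of that method, as a minimal sketch rather than pydub's own code: np.frombuffer can read the raw bytes directly. The helper name raw_to_np and the width-to-dtype mapping are assumptions made for illustration; with a real segment you would presumably pass sound.raw_data and sound.sample_width. A packed array.array stands in for the raw data here so the snippet runs without an audio file.

```python
import array

import numpy as np

# Assumed mapping from bytes per sample to a signed integer dtype.
WIDTH_TO_DTYPE = {1: np.int8, 2: np.int16, 4: np.int32}


def raw_to_np(raw_data, sample_width):
    """Interpret raw PCM bytes as a 1-D array of signed integer samples."""
    return np.frombuffer(raw_data, dtype=WIDTH_TO_DTYPE[sample_width])


# Stand-in for sound.raw_data: four 16-bit samples packed into bytes.
raw = array.array("h", [0, 1, -1, 1234]).tobytes()
samples = raw_to_np(raw, sample_width=2)
```

Note that np.frombuffer does not copy a bytes object, so the result is read-only; call .copy() on it if you need to modify the samples.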

Creating a new audio segment from a (modified?) array of samples is also possible:

new_sound = sound._spawn(samples)

The above is a little hacky: it was written for internal use within the AudioSegment class, but it mainly just figures out what type of audio data you're using (array of samples, list of samples, bytes, bytestring, etc). It's safe to use despite the underscore prefix.
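If the samples were modified as a numpy array, they have to go back into an array.array of the original type before being handed to _spawn. A rough sketch of that step follows; the helper name np_to_sample_array is hypothetical, and rounding plus clipping to the integer range is my own choice for handling overflow, not something pydub does for you.

```python
import array

import numpy as np


def np_to_sample_array(values, typecode):
    """Round, clip and pack numeric values into an array.array of `typecode`."""
    dtype = np.dtype(typecode)  # e.g. "h" is a 16-bit signed integer
    info = np.iinfo(dtype)
    clipped = np.clip(np.rint(values), info.min, info.max)
    return array.array(typecode, clipped.astype(dtype).tobytes())


# Stand-in for sound.get_array_of_samples() on 16-bit audio.
original = array.array("h", [0, 1000, -1000, 32767])
louder = np.array(original, dtype=np.float64) * 2
result = np_to_sample_array(louder, original.typecode)
```

With a real segment, result would then be what you pass to sound._spawn(...).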

Wickman answered 24/6, 2016 at 20:32 Comment(3)

is there a way to do the reverse too? ie. create an AS object from raw/array data on the fly, without accessing file system. – Triolet 4/12, 2017 at 7:58

@Triolet I added info about that to my answer – Wickman 5/12, 2017 at 18:53

This question was useful for me, but did not solve my problem fully. I found the simplest way to convert back and forth was with this code: audio_segment = pydub.AudioSegment(audio.tobytes(), sample_width=audio.dtype.itemsize, frame_rate=sample_rate, channels=num_channels), followed by np.frombuffer(audio_segment.get_array_of_samples(), dtype=np.float32). Comparing the bytes of the original audio (audio.tobytes()) with the bytes that come from np.frombuffer(...).tobytes(), you can see they're identical. – Hubris 15/9, 2022 at 21:41

None of the existing answers is perfect: they miss reshaping and sample width. I have written this function, which converts the audio to the standard numpy audio representation:

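from typing import Tuple

import numpy as np
import pydub


def pydub_to_np(audio: pydub.AudioSegment) -> Tuple[np.ndarray, int]:
    """
    Converts pydub audio segment into np.float32 of shape [duration_in_seconds*sample_rate, channels],
    where each value is in range [-1.0, 1.0].
    Returns tuple (audio_np_array, sample_rate).
    """
    # Samples are interleaved per frame, so each row of the reshaped array is one frame.
    samples = np.array(audio.get_array_of_samples(), dtype=np.float32).reshape((-1, audio.channels))
    # Divide by the magnitude of the most negative value of a signed sample of this width.
    return samples / (1 << (8 * audio.sample_width - 1)), audio.frame_rate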
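To make the scaling concrete: for 16-bit audio the divisor is 1 << 15 = 32768. Here is a small sketch of the same arithmetic on a plain list, so it runs without pydub. The function name normalize_interleaved and the sample values are made up for illustration.

```python
import numpy as np


def normalize_interleaved(samples, channels, sample_width):
    """Reshape interleaved integer samples to [frames, channels] and scale to [-1.0, 1.0]."""
    frames = np.asarray(samples, dtype=np.float32).reshape((-1, channels))
    return frames / (1 << (8 * sample_width - 1))


# Two stereo frames of 16-bit audio, interleaved as L0, R0, L1, R1.
interleaved = [0, 16384, -32768, 32767]
frames = normalize_interleaved(interleaved, channels=2, sample_width=2)
```

The most negative sample maps to exactly -1.0, while the most positive one lands just below 1.0 (32767 / 32768).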

Gonroff answered 2/4, 2021 at 16:23 Comment(0)

You can get an array.array from an AudioSegment then convert it to a numpy.ndarray:

from pydub import AudioSegment
import numpy as np
song = AudioSegment.from_mp3('song.mp3')
samples = song.get_array_of_samples()
samples = np.array(samples)

Fenian answered 2/3, 2017 at 22:51 Comment(4)

The array won't be shaped / ordered as necessary for a scipy filter. After the above code block, you'll likely need: samples = samples.reshape(song.channels, -1, order='F'); samples.shape # (<probably 2>, <len(song) in samples>). The samples waveform is then ready for filtering, FFT analysis, plotting, etc (although you may want to cast it to float). – Breathless 19/3, 2018 at 19:47
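For anyone unsure what order='F' does here, a tiny sketch with made-up sample values: the interleaved samples L0, R0, L1, R1, ... end up with one row per channel, which as far as I can tell is the same as reshaping to (frames, channels) and transposing.

```python
import numpy as np

# Made-up interleaved stereo samples: L0, R0, L1, R1, L2, R2.
interleaved = np.array([10, -10, 20, -20, 30, -30], dtype=np.int16)
channels = 2

# Column-major fill puts each consecutive (L, R) pair into one column.
by_channel = interleaved.reshape(channels, -1, order="F")

# Equivalent layout: one row per frame, then transpose.
same_layout = interleaved.reshape(-1, channels).T
```

Here by_channel[0] holds the left channel and by_channel[1] the right channel.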

This comment is really helpful, combined with the answer ... solves my problem – Stupe 10/8, 2019 at 17:23

Is the code after the ; in the comment necessary? – Elisa 19/3, 2021 at 17:14

@ChrisP No, it is not necessary - it's just for explanation – Cassandry 30/4, 2021 at 7:24

get_array_of_samples (not found on [ReadTheDocs.AudioSegment]: audiosegment module) returns a 1-dimensional array, which doesn't work well since it loses information about the audio stream (frames, channels, ...).

A couple of days ago, I ran into this problem, as I used [PyPI]: sounddevice (which expects a numpy.ndarray) to play the sound (I needed to play it on different output audio devices). Here's what I came up with.

code00.py:

#!/usr/bin/env python

import sys
from pprint import pprint as pp

import numpy as np
import pydub
import sounddevice as sd


def audio_file_to_np_array(file_name):
    asg = pydub.AudioSegment.from_file(file_name)
    dtype = getattr(np, "int{:d}".format(asg.sample_width * 8))  # Or could create a mapping: {1: np.int8, 2: np.int16, 4: np.int32, 8: np.int64}
    arr = np.ndarray((int(asg.frame_count()), asg.channels), buffer=asg.raw_data, dtype=dtype)
    print("\n", asg.frame_rate, arr.shape, arr.dtype, arr.size, len(asg.raw_data), len(asg.get_array_of_samples()))  # @TODO: Comment this line!!!
    return arr, asg.frame_rate


def main(*argv):
    pp(sd.query_devices())  # @TODO: Comment this line!!!
    a, fr = audio_file_to_np_array("./test00.mp3")
    dvc = 5  # Index of an OUTPUT device (from sd.query_devices() on YOUR machine)
    #sd.default.device = dvc  # Change default OUTPUT device
    sd.play(a, samplerate=fr)
    sd.wait()


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)

Output:

[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q038015319]> set PATH=%PATH%;f:\Install\pc064\FFMPEG\FFMPEG\4.3.1\bin

[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q038015319]> dir /b
code00.py
test00.mp3

[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q038015319]> "e:\Work\Dev\VEnvs\py_pc064_03.09.01_test0\Scripts\python.exe" code00.py
Python 3.9.1 (tags/v3.9.1:1e5d33e, Dec  7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)] 064bit on win32

   0 Microsoft Sound Mapper - Input, MME (2 in, 0 out)
>  1 Microphone (Logitech USB Headse, MME (2 in, 0 out)
   2 Microphone (Realtek Audio), MME (2 in, 0 out)
   3 Microsoft Sound Mapper - Output, MME (0 in, 2 out)
<  4 Speakers (Logitech USB Headset), MME (0 in, 2 out)
   5 Speakers / Headphones (Realtek , MME (0 in, 2 out)
   6 Primary Sound Capture Driver, Windows DirectSound (2 in, 0 out)
   7 Microphone (Logitech USB Headset), Windows DirectSound (2 in, 0 out)
   8 Microphone (Realtek Audio), Windows DirectSound (2 in, 0 out)
   9 Primary Sound Driver, Windows DirectSound (0 in, 2 out)
  10 Speakers (Logitech USB Headset), Windows DirectSound (0 in, 2 out)
  11 Speakers / Headphones (Realtek Audio), Windows DirectSound (0 in, 2 out)
  12 Realtek ASIO, ASIO (2 in, 2 out)
  13 Speakers (Logitech USB Headset), Windows WASAPI (0 in, 2 out)
  14 Speakers / Headphones (Realtek Audio), Windows WASAPI (0 in, 2 out)
  15 Microphone (Logitech USB Headset), Windows WASAPI (1 in, 0 out)
  16 Microphone (Realtek Audio), Windows WASAPI (2 in, 0 out)
  17 Microphone (Realtek HD Audio Mic input), Windows WDM-KS (2 in, 0 out)
  18 Speakers (Realtek HD Audio output), Windows WDM-KS (0 in, 2 out)
  19 Stereo Mix (Realtek HD Audio Stereo input), Windows WDM-KS (2 in, 0 out)
  20 Microphone (Logitech USB Headset), Windows WDM-KS (1 in, 0 out)
  21 Speakers (Logitech USB Headset), Windows WDM-KS (0 in, 2 out)

 44100 (82191, 2) int16 164382 328764 164382

--- (Manually inserted line) Sound is playing :) ---

Done.

Notes:

As seen, no value is hardcoded (in terms of dimensions, dtype, ...)
I also need to return the sample rate (as it can't be embedded in the array), since it's required by the device (in this case it's 44.1k, which is the default, but I've tested with files having half that value)
All the existing answers use float to represent a sample. That doesn't work for me, as for most of the test files the sample width is 16 bits, and np.float16 is not supported (by my FPU), so I had to use int
As a side note, when testing on various files, an .m4a could not be played on my Win laptop by SoundDevice (most likely because of its 32k sample rate), but PyDub was able to play it
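For reference, the mapping alternative mentioned in the code comment could look like the sketch below. It works on plain bytes so it runs without pydub or an audio file; raw_to_frames is a hypothetical helper, and with a real segment you would pass asg.raw_data, asg.sample_width and asg.channels. One thing to be aware of: an array built on top of a bytes buffer is read-only, so take a copy before modifying it.

```python
import numpy as np

# Explicit mapping from bytes per sample to dtype, instead of getattr(np, "int16").
SAMPLE_WIDTH_TO_DTYPE = {1: np.int8, 2: np.int16, 4: np.int32, 8: np.int64}


def raw_to_frames(raw_data, sample_width, channels):
    """View raw interleaved PCM bytes as an array of shape (frames, channels)."""
    dtype = np.dtype(SAMPLE_WIDTH_TO_DTYPE[sample_width])
    frame_count = len(raw_data) // (dtype.itemsize * channels)
    return np.ndarray((frame_count, channels), buffer=raw_data, dtype=dtype)


# Stand-in for asg.raw_data: two stereo frames of 16-bit samples.
raw = np.array([1, -1, 2, -2], dtype=np.int16).tobytes()
frames = raw_to_frames(raw, sample_width=2, channels=2)

# The view shares memory with the immutable bytes, so copy before editing.
editable = frames.copy()
editable[0, 0] = 100
```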

Jackstraw answered 15/12, 2021 at 15:54 Comment(0)
