
Investigating Python Memory Usage

Alternate title: Why Is This Program Using So Much Memory and What Can I Do About It??

This came out of work on my speech transcriber project, which aims to transcribe longer recordings while using less memory by segmenting them with DeepSegment. It’s still very much a work in progress.

While testing on an AWS EC2 instance with 2GB of RAM, though, it crashed with a memory error, even though it shouldn’t use nearly that much. This post is about how I diagnosed and solved the problem, and what tools are available.

Getting an Overview of the Problem

First, I narrowed down my code to something that was more easily repeatable.

from pydub import AudioSegment

segment = AudioSegment.from_file("test_audio.mp3") # Open the 57MB mp3
segment = segment.set_frame_rate(16000) # Change the frame rate (returns a new AudioSegment)

All of the graphs below were based on running this simple test.

Now, it’s time to introduce psrecord, which is capable of measuring the CPU and RAM usage of a process. It can attach to an already-running process with the command psrecord <pid> --plot plot.png, which is useful for peeking at a long-running process.

For our purposes, though, psrecord can start the process for us and monitor it from start to finish. Just put the command to run in quotation marks in place of the pid. It’ll look like psrecord "python test_memory.py" --plot plot.png

Here’s what the resulting graph looks like:

Pydub memory usage before changes

The red line plots CPU usage (left axis) and the blue line plots memory usage (right axis). The peak memory usage is roughly 2,300MB. Definitely too much for my 2GB EC2 instance.

This is a good overview of the scope of the problem, and gives a baseline of CPU and time to compare to. In other words, if a change gets us below the 2GB mark on RAM, but suddenly takes longer to process, or uses more CPU, that’s something we want to be aware of.

Finding the Root of the Problem

What psrecord does not tell us is where the memory is being allocated in the program. What line(s) of code, specifically, are using up all of this memory?

That’s where Fil comes in. It produces a flamegraph, much like Py-Spy, but with memory usage instead of CPU. This will let us zoom in on the specific lines in pydub that allocate memory.

(Note that Fil’s actual output is an SVG and much easier to use)

According to Fil, the peak memory was 2,147MB, and there are a number of places where memory is allocated. Our goal, then, is to look through those places and see if any of the allocations can be removed.

Diving into the Pydub Source

To do that, we’re going to have to dig into the source code and try to understand the flow of data. The following samples come from audio_segment.py in the pydub repository.

def from_file(cls, file, format=None, codec=None, parameters=None, **kwargs):
    ... # Open the file and convert it to the WAV format
    p_out = bytearray(p_out) # Cast to bytearray to make it mutable
    fix_wav_headers(p_out) # Mutate the WAV data to fix the headers
    obj = cls._from_safe_wav(BytesIO(p_out)) # Create the AudioSegment

def _from_safe_wav(cls, file):
    file, close_file = _fd_or_path_or_tempfile(file, 'rb', tempfile=False)
    file.seek(0)
    obj = cls(data=file)
    if close_file:
        file.close()
    return obj

def __init__(self, data=None, *args, **kwargs):
    ...
    else:
        # normal construction
        try:
            data = data if isinstance(data, (basestring, bytes)) else data.read()
        ...
        wav_data = read_wav_audio(data)

def read_wav_audio(data, headers=None):
    ... # Read the headers to get various metadata to store in the WavData
    return WavData(audio_format, channels, sample_rate, bits_per_sample,
                   data[pos:pos + data_hdr.size])

When opening a file using AudioSegment.from_file, the flow is basically:

  1. Open the file and convert it to WAV.
  2. Cast the bytes to a bytearray, then mutate that bytearray in place to fix the WAV headers.
  3. Wrap the bytearray in a BytesIO, then use AudioSegment._from_safe_wav to create the instance of AudioSegment.
  4. _from_safe_wav makes sure the file is opened and at the beginning of the file, before constructing the AudioSegment using the data.
  5. __init__ reads the data from the BytesIO object.
  6. The data is passed to read_wav_audio so the headers can be extracted and the data being operated on is only the raw audio.
  7. read_wav_audio extracts the headers and returns them as part of a WavData object, along with the raw audio data. It cuts off the headers by slicing the bytes.

As Fil showed, there are several copies of the data being passed around. Some can’t really be avoided. For example, slicing bytes is going to make a copy.
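
To make that concrete, here’s a tiny illustration of why the final slicing step is expensive. The sizes and the 44-byte header are stand-ins, not pydub’s actual numbers:

header_size = 44                # a typical PCM WAV header is 44 bytes (assumed here)
data = bytes(50 * 1024 * 1024)  # stand-in for the decoded WAV file (~50MB of zeros)
audio = data[header_size:]      # slicing bytes allocates a brand-new ~50MB object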

The Solution

It took quite a bit of experimenting to arrive at the solution. I started by using a memoryview, which would allow the last step (slicing the data) to not make a copy. That worked for my use, but it broke a number of functions, so it wasn’t acceptable as a contribution.
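
As a rough sketch of that first attempt (the variable names are mine, not pydub’s): slicing a memoryview just creates a view into the existing buffer, so cutting off the headers costs almost nothing.

data = bytes(50 * 1024 * 1024)  # stand-in for the decoded WAV bytes
view = memoryview(data)
audio = view[44:]               # a view into the same buffer, not a ~50MB copy
raw = audio.tobytes()           # a copy only happens if the view is materialized later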

My next try used a bytearray, which again allowed me to cut off the headers without making a big copy. This got closer (at least, most things didn’t break), but it did break Python 2.7 support. More importantly, it made AudioSegments mutable.
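
The bytearray version looked roughly like this (again a sketch, not the actual patch). Deleting the header bytes in place avoids a second full-size buffer, but it leaves the audio data open to modification afterwards.

data = bytearray(50 * 1024 * 1024)  # mutable stand-in for the decoded WAV bytes
del data[:44]                       # drop the headers in place, no second full-size buffer
data[0] = 255                       # ...but now nothing stops callers from mutating the audio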

Finally, I realized that I was focusing on the wrong end of the stack. The last operation naturally drew my attention first, since it showed up as the cause of the exception when my program ran out of memory. However, there’s a much easier place to reduce copying earlier in the call stack.

Here’s how I changed from_file:

p_out = bytes(p_out) # Cast the fixed-up bytearray back to immutable bytes
obj = cls(p_out) # Construct the AudioSegment directly, no BytesIO or _from_safe_wav

Yes, all I did was replace the cast to BytesIO and the call to _from_safe_wav with a cast back to bytes, then instantiate the class directly. If you look back at it, this is exactly what _from_safe_wav was doing anyway; it just had several layers of indirection: wrapping the bytes in BytesIO, then reading them back out later.
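
To see why that indirection matters, here’s a small, self-contained illustration (the size is made up) of the extra copies a BytesIO round trip introduces:

from io import BytesIO

p_out = bytearray(50 * 1024 * 1024)  # stand-in for the fixed-up WAV data
buf = BytesIO(p_out)                 # BytesIO copies the bytearray into its own buffer
data = buf.read()                    # reading it back allocates yet another bytes object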

So, was that small change worth it? Let’s see what Fil says about it now.

I would say that a ~900MB savings in RAM is worthwhile!

And for completeness, here’s the psrecord graph:

Pydub memory usage after changes

As might be expected, removing work only made things better: memory usage peaks lower, and the whole program runs much faster. A lot of the run time seems to have been spent just copying data around, so that makes sense.

Lessons Learned

First, keep looking until you find the right tools for the job. When I first set out to understand the memory usage, the tools I reached for were designed more for finding memory leaks, which is a different category of memory problem. Finding the right tools made the solution much easier to find.

Second, slow down and think through the options. My initial efforts focused on only one of the places where memory usage could be reduced, and it ended up being the wrong place to focus.

On the other hand, don’t let analysis paralysis win. Even if it’s not clear where the solution might end up being, jumping in and experimenting can give you a better idea of what might work.

Third, don’t be afraid to explore whether an open source library could be improved for your use case! For small files, the overhead of making all those copies isn’t very significant, so not many people had likely looked into reducing memory usage. Taking the time to explore the issue allowed me to make a contribution.

Thanks for reading! I would appreciate your feedback, or to hear about a tricky memory problem that you debugged.