Investigating Python Memory Usage

Alternate title: Why Is This Program Using So Much Memory and What Can I Do About It??

This post comes out of work on my speech transcriber project, which aims to transcribe longer recordings while using less memory by segmenting the audio based on DeepSegment. It’s still very much a work in progress.

While I was testing on an AWS EC2 instance with 2GB of RAM, though, the program crashed with a memory error, even though it shouldn’t need nearly that much. This post is about how I diagnosed and solved the problem, and what tools are available.

Getting an Overview of the Problem

First, I narrowed my code down to something small and easily repeatable.

from pydub import AudioSegment

segment = AudioSegment.from_file("test_audio.mp3")  # Open the 57MB mp3
segment.set_frame_rate(16000)  # Change the frame rate (returns a new AudioSegment; the result is discarded in this test)

All of the graphs below were based on running this simple test.

Now, it’s time to introduce psrecord, which is capable of measuring the CPU and RAM usage of a process. It can attach to an already-running process with the command psrecord <pid> --plot plot.png, which is useful for peeking at a long-running process.

For our purposes, though, psrecord can start the process for us and monitor it from start to finish. Just put the command to run in quotation marks in place of the pid. It’ll look like psrecord "python test_memory.py" --plot plot.png

Here’s what the resulting graph looks like:

Pydub memory usage before changes

The red line plots CPU usage (on the left) and the blue line memory usage (on the right). The peak memory usage is roughly 2,300MB. Definitely too much for my 2GB EC2 instance.

This is a good overview of the scope of the problem, and gives a baseline of CPU and time to compare to. In other words, if a change gets us below the 2GB mark on RAM, but suddenly takes longer to process, or uses more CPU, that’s something we want to be aware of.

Finding the Root of the Problem

What psrecord does not tell us is where the memory is being allocated in the program. What line(s) of code, specifically, are using up all of this memory?

That’s where Fil comes in. It produces a flamegraph, much like Py-Spy, but with memory usage instead of CPU. This will let us zoom in on the specific lines in pydub that allocate memory.

(Note that Fil’s actual output is an SVG and is much easier to use.)

According to Fil, the peak memory usage was 2,147MB, and there are a number of places where memory is allocated. Our goal, then, is to look through those places and see whether any of them can be eliminated.

Diving into the Pydub Source

To do that, we’re going to have to dig into the source code and try to understand the flow of data. The following samples come from this file in the pydub repository.

def from_file(cls, file, format=None, codec=None, parameters=None, **kwargs):
    ...  # Open the file and convert it to the WAV format
    p_out = bytearray(p_out)  # Cast to bytearray to make it mutable
    fix_wav_headers(p_out)  # Mutate the WAV data to fix the headers
    obj = cls._from_safe_wav(BytesIO(p_out))  # Create the AudioSegment

def _from_safe_wav(cls, file):
    file, close_file = _fd_or_path_or_tempfile(file, 'rb', tempfile=False)
    file.seek(0)
    obj = cls(data=file)
    if close_file:
        file.close()
    return obj

def __init__(self, data=None, *args, **kwargs):
    ...
    else:
        # normal construction
        try:
            data = data if isinstance(data, (basestring, bytes)) else data.read()
        ...
        wav_data = read_wav_audio(data)

def read_wav_audio(data, headers=None):
    ...  # Read the headers to get various metadata to store in the WavData
    return WavData(audio_format, channels, sample_rate, bits_per_sample,
                   data[pos:pos + data_hdr.size])

When opening a file using AudioSegment.from_file, the flow is basically:

  1. Open the file and convert it to WAV.
  2. Cast the bytes to a bytearray, then mutate that bytearray to fix the WAV headers.
  3. Cast the bytearray to BytesIO, then use AudioSegment._from_safe_wav to create the instance of AudioSegment.
  4. _from_safe_wav makes sure the file is opened and at the beginning of the file, before constructing the AudioSegment using the data.
  5. __init__ reads the data from the BytesIO object.
  6. The data is passed to read_wav_audio to get headers extracted, so the raw data being operated on is only the audio data.
  7. read_wav_audio extracts the headers and returns them as part of a WavData object, along with the raw audio data. It cuts off the headers by slicing the bytes.

As Fil showed, there are several copies of the data being passed around. Some can’t really be avoided. For example, slicing bytes is going to make a copy.
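
To make that concrete, the copies roughly stack up like this. This is a simplified sketch of the flow above, not the literal pydub code; wav_bytes and header_size are stand-ins for the real values:

from io import BytesIO

wav_bytes = b"RIFF" + bytes(1000)  # stand-in for the converted WAV data
header_size = 44                   # stand-in for the parsed header length

p_out = bytearray(wav_bytes)       # copy: bytes -> mutable bytearray
buf = BytesIO(p_out)               # copy: BytesIO stores its own copy of the bytearray's contents
data = buf.read()                  # copy: reading the BytesIO back out as a new bytes object
audio = data[header_size:]         # copy: slicing bytes produces yet another bytes object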

The Solution

It took quite a bit of experimenting to arrive at the solution. I started by using a memoryview, which would allow the last step (slicing the data) to not make a copy. That worked for my use, but it broke a number of functions, so it wasn’t acceptable as a contribution.
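
For illustration, here’s the kind of zero-copy slicing a memoryview allows. This is a minimal sketch of the idea, not the actual patch, and the sizes are made up:

data = bytes(10_000_000)       # pretend this is the decoded WAV data
header_size = 44

view = memoryview(data)        # no copy: just a view onto the same buffer
audio = view[header_size:]     # still no copy: slicing a memoryview returns another view

# The catch: audio is a memoryview, not bytes, so downstream code that expects
# bytes behavior can break, and converting back makes the copy after all.
audio_bytes = audio.tobytes()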

My next try used a bytearray, which again allowed me to cut off the headers without making a big copy. This got closer (at least, most things didn’t break), but it did break Python 2.7 support. More importantly, it made AudioSegments mutable.
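
Again, just to illustrate the idea (not the exact change I tried), a bytearray lets you delete the header slice in place:

data = bytearray(bytes(10_000_000))  # the WAV data as a mutable bytearray
header_size = 44

del data[:header_size]               # drops the headers in place, without a second full-size buffer

# The trade-off: whatever holds on to this object now holds mutable audio data,
# which is why this approach made AudioSegments mutable.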

Finally, I realized that I was focusing on the wrong end of the stack. The last operation is naturally what drew my attention first, since it showed up as the cause of the exception when my program ran out of memory. However, there’s a much easier place to reduce copying earlier in the call stack.

Here’s how I changed from_file:

p_out = bytes(p_out)  # Cast back to immutable bytes
obj = cls(p_out)      # Construct the AudioSegment directly

Yes, that’s all that happened: I replaced the cast to BytesIO and the call to _from_safe_wav with a cast back to bytes, then instantiated the class directly. If you look back at the source above, this is exactly what _from_safe_wav was doing anyway; it just had several layers of indirection, wrapping the bytes in BytesIO and then reading them back out later.

So, was that small change worth it? Let’s see what Fil says about it now.

I would say that a ~900MB savings in RAM is worthwhile!

And for completeness, here’s the psrecord graph:

Pydub memory usage after changes

As might be expected, removing work didn’t hurt anything: memory usage peaks lower, and the whole program runs much faster. A lot of the run time seems to have been spent just copying data around, so that makes sense.

Lessons Learned

First, keep looking until you find the right tools for the job. When I first set out to understand the memory usage, the tools I tried were designed more for finding memory leaks, which is a different category of memory problem. Finding the right tools made the solution much easier to find.

Second, slow down and think through the options. My initial efforts focused on only one portion of the possible locations that memory usage could be reduced, which ended up being the wrong place to focus.

On the other hand, don’t let analysis paralysis win. Even if it’s not clear where the solution might end up being, jumping in and experimenting can give you a better idea of what might work.

Third, don’t be afraid to explore whether an open source library could be better for your use case! For small files, the overhead of making all those copies is not so significant, so not as many people have likely looked into improving memory usage. Taking the time to explore the issue allowed me to make a contribution.

Thanks for reading! I would appreciate your feedback, or to hear about a tricky memory problem that you debugged.


Measuring Python Performance Using Py-Spy

When optimizing the performance of a program, it’s essential to test and measure what the bottlenecks are. Programmers are bad at guessing what part of a program will be the slowest. Trying to guess is likely to lead to sacrificing code readability for uncertain gains, or even losses in performance (Code Complete, 2nd Edition page 594).

In my case, I have a starry sky generator whose performance I wanted to improve. The goal: to allow people to generate bigger images in a reasonable amount of time. So, how can we find out where improvements need to be made?

Enter Py-Spy

Py-Spy is a tool for profiling a Python program. When I was first optimizing the performance of my program, I used PyFlame, but that project is no longer maintained. Py-Spy does everything PyFlame did and more. Another nice bonus is that it isn’t limited to Linux. On top of all that, it’s also installable through pip, so it seems to be a big win!

To install it, just run pip install py-spy.

Py-Spy has a number of commands and options we can use to customize the output. For one, we can attach it to an already-running process (for example, a production web server that’s having issues we want to diagnose) using -p PID. This method will probably require you to run it as root (sudo) so it can access the memory of the other process. The method I will be using is to pass py-spy the command to start the Python program itself, which will look like py-spy [command] -- python stars.py.

Speaking of commands, there are three available: record, top, and dump. Top is an interesting one: it looks like the Unix top command, but instead shows data about which functions your program is spending the most time in. Dump just prints out the current call stack, and can only be used by attaching to an already-running process. This is useful if your program is getting hung up somewhere and you want to find out where.

For our purposes, though, the record command is the most useful. It comes with various options.

Record’s Options

$ py-spy record --help
py-spy-record 
Records stack trace information to a flamegraph, speedscope or raw file

USAGE:
    py-spy record [OPTIONS] --output <filename> --pid <pid> [python_program]...

OPTIONS:
    -p, --pid <pid>              PID of a running python program to spy on
    -o, --output <filename>      Output filename
    -f, --format <format>        Output file format [default: flamegraph]  [possible values:
                                 flamegraph, raw, speedscope]
    -d, --duration <duration>    The number of seconds to sample for [default: unlimited]
    -r, --rate <rate>            The number of samples to collect per second [default: 100]
    -s, --subprocesses           Profile subprocesses of the original process
    -F, --function               Aggregate samples by function name instead of by line number
    -g, --gil                    Only include traces that are holding on to the GIL
    -t, --threads                Show thread ids in the output
    -i, --idle                   Include stack traces for idle threads
    -n, --native                 Collect stack traces from native extensions written in Cython, C
                                 or C++
        --nonblocking            Don't pause the python process when collecting samples. Setting
                                 this option will reduce the perfomance impact of sampling, but
                                 may lead to inaccurate results
    -h, --help                   Prints help information
    -V, --version                Prints version information

ARGS:
    <python_program>...    commandline of a python program to run

There are a few options that are particularly noteworthy here. --format lets us pick between flamegraph, raw, or speedscope. We’ll be using flamegraph, but speedscope is interesting, too. You can examine a speedscope file using the webapp.

--function will group the output by function, rather than by line number. Both have pros and cons. Grouping by function is helpful to get an easier-to-understand overview, while grouping by line number can help you narrow it down further.

Finally, --rate tells py-spy how many times per second to sample the program. The default is 100, but I’ve found that adjusting this either up or down can help, especially if there are a lot of small, quick functions (or lines) that add up. It doesn’t hurt to play around with this and compare the resulting flamegraphs to see which seem the most useful.

Now, we can generate a flamegraph of the starry sky generator. I’ll be running py-spy record -o profile.svg --function -- python stars.py on this commit, modified to generate one image with the dimensions (2000, 2000).

Reading a Flamegraph

Here it is!

py-spy flamegraph, grouped by function. The biggest blocks: <module> (stars.py) ~99%, generate_sky (stars.py) ~98%, generate_sky_pixel (stars.py) ~75%, cast (stars.py) ~24%, generate_star_pixel (stars.py) ~17%, putpixel (PIL/Image.py) ~17% (3,951 samples total).

When running it yourself, you’ll get an SVG file that, when opened in your browser, will be bigger, easier to read, and include some nice JavaScript features, like being able to click on a block to zoom in on it, and search. For the moment, take some time to explore the embedded SVG above: hover over a block to get the full text and percentage.

The graph is read from top to bottom. So, in this case, all of the time was spent in the module stars.py. Underneath that is the call to generate_sky, which also basically takes up all of the time. From there, things get more interesting. A portion of the time is taken up just by generate_sky (the part that doesn’t have any blocks beneath it), most of it is taken up by generate_sky_pixel, and some is used by putpixel.

Note that this isn’t grouped by time, but by function. These functions are called one-by-one, so if it were grouped by time, it would be a tiny block for generate_sky_pixel, then a tiny block for putpixel and so on several thousand times.

Since it’s grouped by function, we can more easily compare overall how much time is spent in a particular function versus another. At a glance, we can see that much more time is spent generating the pixel vs. putting it into the image.

A lot of the time in generate_sky_pixel isn’t taken up by any sub-function, but a fairly significant amount is still spent in cast and others.

Let’s get a new graph, but grouped by line number instead of function: py-spy record -o profile.svg -- python stars.py

py-spy flamegraph, grouped by line number. Highlights: <module> (stars.py:145) ~98%, generate_sky (stars.py:140) ~97%; within generate_sky_pixel, stars.py:117 (~18%), stars.py:103 (~15%), and stars.py:110 (~15%) are the biggest lines; cast (stars.py:79) accounts for ~22% across two call sites; generate_star_pixel splits fairly evenly across stars.py:87, 88, and 89; putpixel spans several lines in PIL/Image.py (4,100 samples total).

There’s a lot more information in this graph. For example, it calls attention to the fact that cast is called on two different lines in generate_sky_pixel. The time spent in generate_star_pixel is pretty evenly distributed between lines 87, 88, and 89, which makes sense, because those are the three lines that call planck.

There’s one more piece of information that will be useful: the total time it takes to generate an image. The flamegraph tells us what percentage of time each function/line takes, but it isn’t meant to measure the total run time. I created a performance.py file which uses timeit to generate ten images with the dimensions (900, 900) and return the average number of seconds it took per image. In this case, it took 7.64 seconds. We can definitely do better.
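
Here’s roughly what that script looks like. It’s a sketch that assumes stars.py exposes a generate_sky(width, height) function, which may not match the real code exactly:

# performance.py: average the time to generate ten 900x900 images
import timeit

setup = "from stars import generate_sky"
stmt = "generate_sky(900, 900)"

runs = 10
total = timeit.timeit(stmt, setup=setup, number=runs)
print(f"Average seconds per image: {total / runs:.2f}")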

Tuning the Performance

Now that we have the information from these two graphs, as well as the run time, we can get to work making it faster. Looking again at the first flamegraph above, it seems putpixel uses up a total of ~17% of the run time. The docs specifically warn that putpixel is relatively slow, so it seems like this should be a pretty easy win.

I experimented with several methods, including storing the data as a list of lists of tuples, then converting to a numpy array, then feeding that to Image.fromarray, with a result of 7.1 seconds, about a 7% savings. As you might imagine, this still wasn’t very good.
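
That attempt looked roughly like this. It’s a sketch, and generate_sky_pixel returning an (r, g, b) tuple is my assumption about the real code, not a quote from it:

import numpy as np
from PIL import Image

from stars import generate_sky_pixel  # assumed helper; the real module may differ

width, height = 900, 900

# Build nested Python lists of (r, g, b) tuples, then hand them to NumPy and Pillow
rows = [[generate_sky_pixel(x, y) for x in range(width)] for y in range(height)]
arr = np.array(rows, dtype=np.uint8)  # shape (height, width, 3)
image = Image.fromarray(arr)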

The natural progression from there seemed to be to skip the lists altogether and start with a numpy array, filled with zeroes initially. For some reason, this was actually slower than putpixel: 7.89 seconds. I’m not a NumPy expert, so I’m not sure why this is. Perhaps mutating is a slow operation for NumPy, or maybe I was just doing it the wrong way. If someone takes a look and wants to let me know, I’d be happy to learn about this.
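
The pre-allocated version was along these lines (same assumptions as above). One thing worth noting is that assigning elements one at a time from a Python loop goes through NumPy’s indexing machinery on every write, which may account for some of the slowdown:

import numpy as np
from PIL import Image

from stars import generate_sky_pixel  # assumed helper; the real module may differ

width, height = 900, 900
arr = np.zeros((height, width, 3), dtype=np.uint8)

for y in range(height):
    for x in range(width):
        arr[y, x] = generate_sky_pixel(x, y)  # one NumPy indexing operation per pixel

image = Image.fromarray(arr)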

After that, I tried building up a bytearray, extending it by the three pixels each time they were generated, then converting that to bytes and passing it to Image.frombytes(). Total run time: 6.44 seconds. That’s about a 17% savings over putpixel.
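
The version I settled on looks roughly like this, again as a sketch under the same assumptions about generate_sky_pixel:

from PIL import Image

from stars import generate_sky_pixel  # assumed helper; the real module may differ

width, height = 900, 900
pixels = bytearray()

for y in range(height):
    for x in range(width):
        r, g, b = generate_sky_pixel(x, y)
        pixels.extend((r, g, b))  # three bytes per pixel, appended in place

# frombytes expects exactly width * height * 3 bytes for an RGB image
image = Image.frombytes("RGB", (width, height), bytes(pixels))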

Here’s what our flamegraph looks like now that we’ve settled on a putpixel replacement (and after splitting the bytearray.extend onto its own line, so that it will show up separately):

py-spy flamegraph after the change, grouped by line number. Highlights: <module> (stars.py:161) ~100%, generate_sky (stars.py:153) ~92%, with the new bytearray.extend line (stars.py:154) at ~6%; within generate_sky_pixel, stars.py:112 (~18%), stars.py:119 (~18%), and stars.py:126 (~23%) are the biggest lines (3,232 samples total).

Line 154 (bytearray.extend(pixels)) now only took up about 6% of the time. Even on a small image of 900 by 900 pixels, this resulted in a savings of over a second per image. For bigger images, this savings is in the range of several seconds.

Everything else in the program is directly related to image generation and the math and random number generation behind it, so assuming all of that is already optimal (spoiler alert: it isn’t; the cast() function was entirely unnecessary), this is about as fast as the program can get.

Conclusion

Flamegraphs and the profilers that generate them are useful tools for understanding the performance of a program. Using them, you can avoid trying to guess where bottlenecks are and potentially doing a lot of work for little gain.

For further reading, I recommend this article about the reasoning behind the creation of flamegraphs and the problem they were trying to solve. If you’re struggling to understand how to read the graph, it may help more than my explanation.

Now, go forth and conquer your performance problems!