Using psutil.Process.memory_info memory usage differs from Pandas.memory_usage
I'm profiling a program that uses Pandas to process some CSVs. I'm using psutil's Process.memory_info to report the Virtual Memory Size (vms) and the Resident Set Size (rss) values, and Pandas' DataFrame.memory_usage (df.memory_usage().sum()) to report the size of my dataframes in memory.
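
Roughly, the measurement looks like this (the file name is illustrative; all columns are read as strings, as noted in the comments below):

import os
import psutil
import pandas as pd

# Hypothetical input file; the original CSV is not part of the question.
df = pd.read_csv("data.csv", dtype=str)

# Note: the default memory_usage() counts only the column buffers; for
# object/string columns, deep=True is needed to include the string payloads.
print("df.memory_usage().sum():", df.memory_usage().sum())

mem = psutil.Process(os.getpid()).memory_info()
print("rss:", mem.rss)
print("vms:", mem.vms)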

There's a conflict between the reported vms and df.memory_usage values, where Pandas is reporting more memory just for the dataframe than the Process.memory_info call is reporting for the whole (single-threaded) process.

For example:

  • rss: 334671872 B
  • vms: 663515136 B
  • df.memory_usage().sum(): 670244208 B

The Process.memory_info call is made immediately after the memory_usage call. My expectation was that df.memory_usage().sum() < vms at all times, but this doesn't hold up. Am I misinterpreting the meaning of the vms value?

Voiced answered 14/10, 2019 at 15:10 Comment(5)
Are you running this in Jupyter or an IDE like PyCharm? If you're running it in Jupyter, try an IDE and post if you get the same results. The ipykernel doesn't seem to manage memory in an expected way. – Drilling
These values were coming from logs generated by nosetests. I see similar behavior poking around in ipython. – Voiced
I ran some tests in a loop, creating data, creating a dataframe from the data, and measuring the memory, but I haven't been able to reproduce the issue of df.memory_usage().sum() > vms. – Drilling
If you're creating the data in your test program, that might cause a difference from my setup. The data exists as a CSV that has been read in with read_csv, all columns as strings. – Voiced
I updated my process: I created a file, repeatedly read it in, and measured the memory. df.memory_usage().sum() = 670,000,128 and vms = 815,214,592 consistently. I have 32GB of RAM and 5GB of virtual memory. It seems like your readings mean the size of df is larger than the amount of virtual memory (pagefile) being used. Incidentally, VM is just space allocated on the hard drive. – Drilling
Here is a reference related to your problem: whether to use rss or vms to track memory. The relationship between RSS and VMS is a bit confusing, so it's worth learning about these concepts in detail, including how to calculate total memory usage.

**To summarize and complement the above:**

RSS:

Resident set size shows how much of a process's memory is currently held in RAM. Remember: it does not include memory that has been swapped out.

It does include the resident portions of shared libraries, as well as all stack and heap memory.

VMS:

Virtual memory size includes all memory that the process can access: memory that is swapped out, memory that is allocated but not yet used, and memory from shared libraries.

Example:

Let's assume Process-X has a 500K binary and is linked against 2500K of shared libraries, and has 200K of stack/heap allocations of which 100K is actually in memory (the rest is swapped out or unused). It has actually loaded only 1000K of the shared libraries and 400K of its own binary. Then:

RSS: 400K + 1000K + 100K = 1500K
VMS: 500K + 2500K + 200K = 3200K

In this example, since part of the memory is shared, many processes may use it, so if you add up all of the RSS values you can easily end up with more space than your system has.

As you can see when you simply run this:

import os
import psutil

process = psutil.Process(os.getpid())
print("vms:", process.memory_info().vms)
print("rss:", process.memory_info().rss)

Output:

vms: 7217152
rss: 13975552

By simply adding import pandas as pd, you can see the difference:

import os
import psutil
import pandas as pd

process = psutil.Process(os.getpid())
print("vms:", process.memory_info().vms)
print("rss:", process.memory_info().rss)

Here is the output:

vms: 276295680
rss: 54116352

Memory that is allocated may not show up in RSS until the program actually uses (touches) it. So if your program allocates a bunch of memory up front and then uses it over time:

  • You could see RSS going up while VMS stays the same (see the sketch below).
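
To make this concrete, here is a minimal sketch (assuming a 4 KiB page size) that reserves anonymous memory with mmap and only touches it afterwards:

import mmap
import os
import psutil

process = psutil.Process(os.getpid())

def report(label):
    mem = process.memory_info()
    print(label, "rss:", mem.rss, "vms:", mem.vms)

report("before mapping:")

# Reserve 256 MiB of anonymous memory; no pages are resident yet.
size = 256 * 1024 * 1024
buf = mmap.mmap(-1, size)
report("after mapping: ")   # vms grows, rss barely moves

# Touch one byte per (assumed 4 KiB) page so the kernel backs it with RAM.
for offset in range(0, size, 4096):
    buf[offset] = 1
report("after touching:")   # now rss has grown as well

The untouched mapping shows up in vms immediately, but rss only grows once the pages are actually written.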

Now, whichever you go with, df.memory_usage().sum() or Process.memory_info, keep in mind that RSS includes resident pages from dynamically linked libraries, which are shared between processes. So if you sum the RSS of several processes, that shared memory is counted multiple times, and the total will exceed the memory actually used.
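
As a rough illustration of that double counting (a sketch, not from the original answer), you can sum rss across every process psutil can see and compare it to physical RAM:

import psutil

# Shared pages (e.g. libc) are resident in many processes at once, so this
# sum counts them repeatedly and can exceed the RAM actually in use.
total_rss = 0
for proc in psutil.process_iter(["memory_info"]):
    info = proc.info["memory_info"]
    if info is not None:  # None when we lack permission to inspect a process
        total_rss += info.rss

print("sum of rss over all processes:", total_rss)
print("total physical memory:", psutil.virtual_memory().total)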

Taillight answered 5/11, 2019 at 11:44 Comment(1)
I appreciate the context of the RSS and VMS values, but that doesn't explain how Pandas' df.memory_usage().sum() could report a higher value than VMS, which should be the full extent of memory accessible. – Voiced
