Memory profiling with data.table

What is the correct way to profile memory in R code that contains calls to data.table functions? Let's say I want to determine the maximum memory usage during an expression.

This reference indicates that Rprofmem may not be the right choice: https://cran.r-project.org/web/packages/profmem/vignettes/profmem.html

All memory allocations that are done via the native allocVector3() part of R's native API are logged, which means that nearly all memory allocations are logged. Any objects allocated this way are automatically deallocated by R's garbage collector at some point. Garbage collection events are not logged by profmem(). Allocations not logged are those done by non-R native libraries or R packages that use native code Calloc() / Free() for internal objects. Such objects are not handled by the R garbage collector.
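To illustrate what the quoted passage means by logged allocations, here is a minimal sketch in base R. The helper name `logged_bytes` is hypothetical, and the parsing assumes each allocation line in the Rprofmem file starts with a byte count followed by " :". Memory obtained by compiled code through Calloc() or malloc() would not appear in this total.

```r
# Hypothetical helper: sum the bytes that Rprofmem() logs while `expr` runs.
# Returns NA if this R build was compiled without memory profiling.
logged_bytes <- function(expr) {
  if (!capabilities("profmem")) return(NA_real_)
  log_file <- tempfile(fileext = ".out")
  # Always stop logging and remove the temporary file, even on error.
  on.exit({ Rprofmem(NULL); unlink(log_file) })
  Rprofmem(log_file, threshold = 0)  # log every allocation made via the R API
  force(expr)                        # the lazily passed expression runs here
  Rprofmem(NULL)                     # stop logging before reading the file
  lines <- readLines(log_file)
  # Assumption: allocation lines look like '<bytes> :"call" ...';
  # other lines (e.g. "new page:") carry no size and become NA.
  sizes <- suppressWarnings(as.numeric(sub(" :.*$", "", lines)))
  sum(sizes, na.rm = TRUE)
}

logged_bytes(numeric(1e6))
```

The same call could wrap a data.table expression instead of `numeric(1e6)`; the result would then cover only the part of the work that goes through R's allocator.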

The data.table source code contains plenty of calls to Calloc() and malloc(), which suggests that Rprofmem will not measure all memory allocated by data.table functions. If Rprofmem is not the right tool, why does Matthew Dowle use it in the answer to "R: loop over columns in data.table"?

I've found a reference suggesting similar potential issues for gc() (which can be used to measure maximum memory usage between two calls to gc()): https://r.789695.n4.nabble.com/Determining-the-maximum-memory-usage-of-a-function-td4669977.html

gc() is a good start. Call gc(reset = TRUE) before and gc() after your task, and you will see the maximum extra memory used by R in the interim. (This does not include memory malloced by compiled code, which is much harder to measure as it gets re-used.)
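A minimal sketch of that suggestion in base R follows. The helper name `peak_extra_mb` is hypothetical, and, as the quote says, the result covers only memory managed by R, not memory malloced by compiled code.

```r
# Hypothetical helper: extra memory (in Mb) that R's own allocator reports
# as the peak while `expr` is evaluated.
peak_extra_mb <- function(expr) {
  before <- gc(reset = TRUE)  # collect garbage and reset the "max used" counters
  force(expr)                 # the lazily passed expression is evaluated here
  after <- gc()
  # Each count column is followed by its "(Mb)" column. Look the columns up
  # by name because the layout can differ between R configurations.
  used_mb <- before[, which(colnames(before) == "used") + 1]
  max_mb <- after[, which(colnames(after) == "max used") + 1]
  sum(max_mb) - sum(used_mb)  # Ncells and Vcells combined
}

peak_extra_mb(numeric(1e7))
```

Any expression can be passed in place of `numeric(1e7)`, for example a data.table join or grouping call.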

Nothing I've found suggests that similar issues exist with Rprof(memory.profiling=TRUE). Does this mean that the Rprof approach will work for data.table even though it doesn't always use the R API to allocate memory?

If Rprof(memory.profiling=TRUE) in fact is not the right tool for the job, what is?

Would ssh.utils::mem.usage work?

Deceitful answered 8/10, 2019 at 0:50 Comment(9)
Basically you have three options: 1) probe memory usage at regular time intervals with Rprof; 2) identify every non-low-level memory allocation with Rprofmem; 3) identify each and every memory allocation using the valgrind tool (or similar). Really complicated. - Catchfly
Regarding how Rprof works see also: https://mcmap.net/q/393746/-interpretation-of-memory-profiling-output-of-rprof - Catchfly
But am I correct in surmising that for evaluating data.table code option #2 (Rprofmem) won't work correctly? - Deceitful
It depends. If you want to find and reduce the maximum memory used, Rprof and gc do work, but they only offer a guess as to which statement in the executed function is to blame (because of the time-interval probing and asynchronous garbage collection). If you want to find out more exactly how often and how much memory is allocated by a single statement, Rprofmem is better because it tracks exactly this (except C-level/OS memory requests, of course), but it does not tell you the total amount of memory used. BTW: you can use both together at the same time with github.com/HenrikBengtsson/profmem - Catchfly
Neither Rprof nor Rprofmem can track C/OS-level memory allocations, and IMHO I wouldn't expect data.table to use non-R memory for data under R's control, because this would be a waste of time (due to memory copies). Indexing may be an exception, but it takes much less space than the data (we use data.table for its ability to cope with big data sizes). But if you really want to know the total memory consumption (incl. C/OS allocations) or look for memory leaks, see the instructions here: cran.r-project.org/doc/manuals/r-release/… - Catchfly
Instead of ssh.utils::mem.usage (which shows the process memory usage only) you could use the Windows Task Manager or Linux top & friends. But what can you do with this value? You can run out of R-internal memory even when you still have enough computer memory, since R uses its own memory allocation logic, and this value does not give you any hint as to which R statement(s) to blame for that... - Catchfly
Could you please explain why ssh.utils::mem.usage reporting process-level memory isn't the same as the process-level memory shown in top? I don't understand the distinction. - Deceitful
Also, it sounds like you're suggesting that while data.table may use calloc/malloc, you think that the bulk of actual data allocation is being done through the R API. Does this suggest that if we're interested in how to correctly use data.table to be memory-efficient, then Rprof/Rprofmem is sufficient, but that if we think that data.table doesn't allocate memory correctly, then we'd need to use valgrind? Am I interpreting what you're saying correctly? - Deceitful
Let us continue this discussion in chat. - Catchfly

This is not specific to data.table. Recently there was a discussion on Twitter about the same behaviour in dplyr: https://mobile.twitter.com/healthandstats/status/1182840075001819136

On Linux, you can measure the peak resident memory of the whole R process with GNU time:

    /usr/bin/time -v Rscript -e 'library(data.table); CJ(1:1e4, 1:1e4)' |& grep resident

There is also the interesting cgmemtime project, but it requires a bit more setup.
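If starting a fresh Rscript process is inconvenient, a rough in-session alternative on Linux is to read the kernel's own peak counter for the current process. This is only a sketch: the helper name `peak_rss_kb` is hypothetical, and it assumes a Linux system where /proc/self/status has a VmHWM line (peak resident set size, in kB). Like the time command above, it counts all memory of the process, including allocations made by compiled code outside the R API.

```r
# Hypothetical helper: peak resident set size of the current R process in kB,
# read from /proc/self/status. Returns NA where that file or the VmHWM
# field is not available (e.g. on Windows or macOS).
peak_rss_kb <- function() {
  path <- "/proc/self/status"
  if (!file.exists(path)) return(NA_real_)
  status <- readLines(path, warn = FALSE)
  hwm <- grep("^VmHWM:", status, value = TRUE)
  if (length(hwm) == 0) return(NA_real_)
  # The line looks like "VmHWM:   123456 kB"; keep only the digits.
  as.numeric(gsub("[^0-9]", "", hwm[1]))
}

before <- peak_rss_kb()
x <- numeric(1e7)       # stand-in for the data.table call being measured
after <- peak_rss_kb()
after - before          # growth of the peak in kB, NA if unsupported
```

The counter only ever grows within a process, so the difference is zero whenever the measured call stays below an earlier peak. Running the call in a fresh session gives the cleanest reading.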

If you are on Windows, I suggest moving to Linux.

Hilarity answered 12/10, 2019 at 18:27 Comment(2)
Of course. Obviously it applies to any package that calls compiled code that allocates memory without going through the R API. If anything, since data.table is written in C and at least somewhat uses the R API, maybe this is less of an issue for data.table than it is for dplyr, which seems to be written in C++. It's my understanding that any object accessible to the user in R must be allocated via allocVector3. It's only temporary objects that exist within function calls using compiled code that might not be allocated via allocVector3. - Deceitful
@Deceitful Moreover, using allocVector[3] is not always feasible; for example, it is not thread-safe. - Hilarity

If you are using Windows, you can query the PowerShell process objects for RGui and Memory Compression through system commands from R and read their various memory counters. I don't yet have a way to store the PowerShell objects in R. PowerShell code for RGui and 'Memory Compression' (the process Windows uses to hold compressed memory pages):

    $t1 = ps | where {$_.Name -EQ 'RGui' -or $_.Name -EQ 'Memory Compression'};
    $t2 = $t1 | Select { $_.Id;
    [math]::Round($_.WorkingSet64/1MB);
    [math]::Round($_.PrivateMemorySize64/1MB);
    [math]::Round($_.VirtualMemorySize64/1MB) };
    $t2 | ft * 

    $t1 | gm -View All
    $t1.Modules
    $t1.MaxWorkingSet

PowerShell embedded in R:

    ps_f <- function() { system("powershell -ExecutionPolicy Bypass -command $t1 = ps | where {$_.Name -EQ 'RGui' -or $_.Name -EQ 'Memory Compression'};
    $t2 = $t1 | Select { 
     $_.Id;
     [math]::Round($_.WorkingSet64/1MB);
     [math]::Round($_.PrivateMemorySize64/1MB);
     [math]::Round($_.VirtualMemorySize64/1MB) };
    $t2 | ft * "); }

    ps_f()

     $_.Id;                                                                                                                
     [math]::Round($_.WorkingSet64/1MB);                                                                                   
     [math]::Round($_.PrivateMemorySize64/1MB);                                                                            
     [math]::Round($_.VirtualMemorySize64/1MB)                                                                             
    -----------------------------------------------------------------------------------------------------------------------
    {2264, 1076, 3, 1401}                                                                                                  
    {15832, 3544, 6691, 11965}   



    ps_mem <- function() { system("powershell -ExecutionPolicy Bypass -command $t1 = ps | where {$_.Name -EQ 'RGui' -or $_.Name -EQ 'Memory Compression'};
    $t1 | Select ProcessName,MaxWorkingSet,MinWorkingSet,PagedMemorySize64,NonpagedSystemMemorySize64;")} 

    > ps_mem()

    ProcessName                : Memory Compression
    MaxWorkingSet              : 
    MinWorkingSet              : 
    PagedMemorySize64          : 3411968
    NonpagedSystemMemorySize64 : 0

    ProcessName                : Rgui
    MaxWorkingSet              : 1413120
    MinWorkingSet              : 204800
    PagedMemorySize64          : 7014719488
    NonpagedSystemMemorySize64 : 6662736

    # run some data.table operation

    > ps_mem()
    ProcessName                : Memory Compression
    MaxWorkingSet              : 
    MinWorkingSet              : 
    PagedMemorySize64          : 3411968
    NonpagedSystemMemorySize64 : 0

    ProcessName                : Rgui
    MaxWorkingSet              : 1413120
    MinWorkingSet              : 204800
    PagedMemorySize64          : 7015915520
    NonpagedSystemMemorySize64 : 6662736

PowerShell code:

    $t1 | where {$_.ProcessName -eq "Rgui"} | Measure-Object -Maximum *memory* | ft  Property,Maximum

PowerShell embedded in R:

    ps_mem_ <- function() { system("powershell -ExecutionPolicy Bypass -command $t1 = ps | where {$_.Name -EQ 'RGui' -or $_.Name -EQ 'Memory Compression'};
    $t2 = $t1 | where {$_.ProcessName -eq 'Rgui'}; 
    $t2 | Measure-Object -Maximum *memory* | ft  Property,Maximum ")} 

    # the 32-bit properties (those without the 64 suffix) roll over and show
    # negative values; read the *64 properties instead

    > ps_mem_()

    Property                       Maximum
    --------                       -------
    NonpagedSystemMemorySize       6662736
    NonpagedSystemMemorySize64     6662736
    PagedMemorySize            -1570734080
    PagedMemorySize64           7019200512
    PagedSystemMemorySize           680240
    PagedSystemMemorySize64         680240
    PeakPagedMemorySize        -1260961792
    PeakPagedMemorySize64      11623940096
    PeakVirtualMemorySize       -161009664
    PeakVirtualMemorySize64    17018859520
    PrivateMemorySize          -1570734080
    PrivateMemorySize64         7019200512
    VirtualMemorySize           -339103744
    VirtualMemorySize64        12545798144

    # run some data.table operation

    > ps_mem_()

    Property                       Maximum
    --------                       -------
    NonpagedSystemMemorySize       6662736
    NonpagedSystemMemorySize64     6662736
    PagedMemorySize            -1570734080
    PagedMemorySize64           7019200512
    PagedSystemMemorySize           680240
    PagedSystemMemorySize64         680240
    PeakPagedMemorySize        -1260961792
    PeakPagedMemorySize64      11623940096
    PeakVirtualMemorySize       -161009664
    PeakVirtualMemorySize64    17018859520
    PrivateMemorySize          -1570734080
    PrivateMemorySize64         7019200512
    VirtualMemorySize           -339103744
    VirtualMemorySize64        12545798144

To see all the members of the Rgui process object:

    $t1 | gm -View All


       TypeName: System.Diagnostics.Process

    Name                       MemberType     Definition
    ----                       ----------     ----------
    Handles                    AliasProperty  Handles = Handlecount
    Name                       AliasProperty  Name = ProcessName
    NPM                        AliasProperty  NPM = NonpagedSystemMemorySize64
    PM                         AliasProperty  PM = PagedMemorySize64
    SI                         AliasProperty  SI = SessionId
    VM                         AliasProperty  VM = VirtualMemorySize64
    WS                         AliasProperty  WS = WorkingSet64
    Disposed                   Event          System.EventHandler Disposed(System.Object, System.EventArgs)
    ErrorDataReceived          Event          System.Diagnostics.DataReceivedEventHandler ErrorDataReceived(System.Object, System.Diagnostics.DataReceivedEventArgs)
    ...
Sik answered 8/10, 2019 at 23:1 Comment(0)
