Caveat
This is NOT a duplicate of this question. I'm not interested in finding out what my memory consumption is, as I'm already doing that below; the question is WHY the memory consumption looks like this.
Also, even if I did need a way to profile my memory, note that guppy (the Python memory profiler suggested in the aforementioned link) does not support Python 3, and the alternative, guppy3, does not give accurate results whatsoever, yielding output such as the following (see the actual sizes below):
Partition of a set of 45968 objects. Total size = 5579934 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  13378  29  1225991  22   1225991  22 str
     1  11483  25   843360  15   2069351  37 tuple
     2   2974   6   429896   8   2499247  45 types.CodeType
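For completeness, that guppy3 output comes from a heap dump along these lines (a minimal sketch; the exact call site may have differed). guppy3 is installed as the guppy3 package but imported as guppy:

from guppy import hpy  # guppy3 installs under the "guppy" import name

heap = hpy().heap()  # snapshot of the objects tracked in the current process
print(heap)          # prints a partition table like the one shown above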
Background
Right, so I have this simple script which I'm using to do some RAM consumption tests, by reading a file in 2 different ways:

1. reading the file one line at a time, processing it, and discarding it (via generators), which is efficient and recommended for basically any file size (especially large files) and works as expected (see the sketch right after this list);
2. reading the whole file into memory (I know this is advised against, however this was just for educational purposes).
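For reference, a minimal sketch of the line-at-a-time approach (the function name and the filtering condition are illustrative only; they are not part of the test script below):

def read_lines(path):
    # Yield one line at a time so only a single line is held in memory.
    with open(path) as handle:
        for line in handle:
            yield line.rstrip('\n')

# Example usage: count error lines without loading the whole file.
error_count = sum(1 for line in read_lines('errors.log') if 'ERROR' in line)
print(f'{error_count} error lines')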
Test script
import os
import psutil
import time

with open('errors.log') as file_handle:
    statistics = os.stat('errors.log')  # see below for the contents of this file
    file_size = statistics.st_size / 1024 ** 2

    process = psutil.Process(os.getpid())
    ram_usage_before = process.memory_info().rss / 1024 ** 2

    print(f'File size: {file_size} MB')
    print(f'RAM usage before opening the file: {ram_usage_before} MB')

    file_handle.read()  # loading the whole file into memory

    ram_usage_after = process.memory_info().rss / 1024 ** 2
    print(f'Expected RAM usage after loading the file: {file_size + ram_usage_before} MB')
    print(f'Actual RAM usage after loading the file: {ram_usage_after} MB')

    # time.sleep(30)
Output
File size: 111.75 MB
RAM usage before opening the file: 8.67578125 MB
Expected RAM usage after loading the file: 120.42578125 MB
Actual RAM usage after loading the file: 343.2109375 MB
I also added a 30 second sleep so I could check the memory usage at the OS level with ps and awk, using the following command:
ps aux | awk '{print $6/1024 " MB\t\t" $11}' | sort -n
which yields:
...
343.176 MB python # my script
619.883 MB /Applications/PyCharm.app/Contents/MacOS/pycharm
2277.09 MB com.docker.hyperkit
The file contains about 800K copies of the following line:
[2019-09-22 16:50:17,236] ERROR in views, line 62: 404 Not Found: The
following URL: http://localhost:5000/favicon.ico was not found on the
server.
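For reproducibility, an equivalent test file can be generated with something like this (800,000 is my approximation of the "about 800K" figure above):

line = ('[2019-09-22 16:50:17,236] ERROR in views, line 62: 404 Not Found: The '
        'following URL: http://localhost:5000/favicon.ico was not found on the '
        'server.\n')

with open('errors.log', 'w') as handle:
    handle.writelines(line for _ in range(800_000))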
Is it because of block sizes or dynamic allocation, whereby the contents would be loaded in blocks and a lot of that memory would actually go unused?
open('errors.log', 'rb') makes a difference. – Citolerb
the size is roughly the same, although I'd like to understand a bit more about this... if you're up for a formal answer I'm happy to upvote & accept. – Terbium
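To make that comparison concrete, here is a hypothetical sketch of measuring the binary-mode read the same way as the text-mode read in the script above (the rss_mb helper is mine, not part of the original script):

import os
import psutil

def rss_mb():
    # Resident set size of the current process, in MB.
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

before = rss_mb()
with open('errors.log', 'rb') as file_handle:  # binary mode: no decoding to str
    data = file_handle.read()
print(f'RSS increase with binary read: {rss_mb() - before} MB')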