Perl: Decreasing string length increases memory use in array of strings

I am reading through a massive file to store data in a very large hash. I am trying to keep the RAM use as small as possible.

I have an MWE (minimal working example) that shows strange behavior in Perl:

#!/usr/bin/env perl

use 5.038;
use warnings FATAL => 'all';
use autodie ':default';
use DDP {output => 'STDOUT', array_max => 10, show_memsize => 1}; # pretty print with "p"

my @l = split /\s+/, 'OC   Pimascovirales; Iridoviridae; Betairidovirinae; Iridovirus.';
p @l;
$_ =~ s/[\.;]$// foreach @l; # single line keeps code shorter
p @l;

which has output:

[
    [0] "OC",
    [1] "Pimascovirales;",
    [2] "Iridoviridae;",
    [3] "Betairidovirinae;",
    [4] "Iridovirus."
] (356B)
[
    [0] "OC",
    [1] "Pimascovirales",
    [2] "Iridoviridae",
    [3] "Betairidovirinae",
    [4] "Iridovirus"
] (400B)

While this example is trivially small, I'm going to be doing this many many times, so RAM management is important.

How did decreasing the string length increase the memory used by this array from 356B to 400B?

If possible, can I avoid increases like this?

Wenzel answered 18/1 at 15:54 Comment(4)
You're probably seeing an increase in allocated memory due to some internal process in the regex substitution. More importantly, why do you need to store a massive amount of data in a gigantic hash? That is the question you should probably be asking first. - Airman
You can trim the punctuation in the split pattern itself, e.g. /[.;]?\s+/ (the punctuation comes before the whitespace, so it has to come first in the pattern; a final "." with nothing after it still needs separate handling). - Airman
In "single line keeps code shorter" you don't need $_ =~ ... because a regex binds to $_ by default, and s/// changes its target in place. So s/.../.../ for @ary; does the substitution on the array elements. Use it if you are comfortable with this form and won't later overlook what it's doing. - Lambent
In a character class in a regex, the . does not have its usual metacharacter meaning; it's just a period. So instead of s/[\.;]$// you can write s/[.;]$// (I'm assuming the backslash is only there to escape the . and that you don't mean to remove a literal backslash). - Lambent

It's a consequence of copy-on-write (COW). In other words, until you start changing the strings, Perl just knows where to look in the original string to find them, but doesn't copy them.

Use Devel::Peek to see:

use Devel::Peek qw{ Dump };
Dump @l;

Before the substitution:

SV = PVAV(0x5565ec854f20) at 0x5565ec8d8dc8
  REFCNT = 1
  FLAGS = ()
  ARRAY = 0x5565ecde8350
  FILL = 400
  MAX = 473
  FLAGS = (REAL)
  Elt No. 0
  SV = PV(0x5565ec853de0) at 0x5565ec853220
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x5565ec8d6dd0 "OC"\0
    CUR = 2
    LEN = 10
  Elt No. 1
  SV = PV(0x5565ec853eb0) at 0x5565ec853418
    REFCNT = 1
    FLAGS = (POK,IsCOW,pPOK)
    PV = 0x5565eca6b9d0 "Pimascovirales;"\0
    CUR = 15
    LEN = 17
    COW_REFCNT = 0
  Elt No. 2
...

After:

SV = PVAV(0x5565ec854f20) at 0x5565ec8d8dc8
  REFCNT = 1
  FLAGS = ()
  ARRAY = 0x5565ecde8350
  FILL = 400
  MAX = 473
  FLAGS = (REAL)
  Elt No. 0
  SV = PV(0x5565ec853de0) at 0x5565ec853220
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x5565ec8d6dd0 "OC"\0
    CUR = 2
    LEN = 10
  Elt No. 1
  SV = PV(0x5565ec853eb0) at 0x5565ec853418
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x5565ecdf1030 "Pimascovirales"\0
    CUR = 14
    LEN = 32
  Elt No. 2
...

All elements (but the 1st one) originally had the IsCOW flag.

Diopside answered 18/1 at 16:17 Comment(1)
I used a longer string (original x 100) as my version of Devel::Size doesn't show bytes but kilobytes, so no difference was visible for short strings. Might depend on Perl version, too. - Diopside

Why does the OP's post-modification data structure use more memory?

  • s/// creates new scalars rather than modifying the string in-place, and
  • s/// happens to create new scalars with larger string buffers than split in the OP's example.

I explain both of these in far more detail below, but that is really all there is to it.


Why doesn't s/// modify the string in-place?

At least in the ways that matter for this post, the following two snippets are equivalent since 5.20:

$_ =~ s/[\.;]$//
$_ = $_ =~ s/[\.;]$//r

A new scalar is created and assigned to the bound scalar instead of modifying the scalar directly.
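
A quick way to check that the two forms at least produce the same strings is the minimal sketch below. The sample values and variable names are made up for illustration, and /r requires Perl 5.14 or later. It compares results only; it says nothing about what happens to the buffers internally.

```perl
use strict;
use warnings;

my @in_place = ('Iridoviridae;', 'Iridovirus.', 'OC');
my @via_copy = @in_place;

# Form 1: in-place substitution on each (aliased) element.
$_ =~ s/[.;]$// for @in_place;

# Form 2: s///r returns a modified copy, which is then assigned back.
$_ = $_ =~ s/[.;]$//r for @via_copy;

print "@in_place\n";    # Iridoviridae Iridovirus OC
print "@via_copy\n";    # Iridoviridae Iridovirus OC
```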

But this wasn't always the case. Once upon a time, Perl would simply reduce the used size of the buffer when removing from its end using s///, resulting in no additional memory used. This is demonstrated by the following simple program:

$ 5.18t/bin/perl -MDevel::Peek -e'$_ = "abc"; $_ .= "d"; Dump($_); s/d\z//; Dump($_);'
SV = PV(0x55da3c065ce0) at 0x55da3c0a4830
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x55da3c08e7c0 "abcd"\0
  CUR = 4
  LEN = 16
SV = PV(0x55da3c065ce0) at 0x55da3c0a4830
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x55da3c08e7c0 "abc"\0
  CUR = 3
  LEN = 16

Note that the string buffer is at 0x55da3c08e7c0 before and after. Only the used amount of the buffer (CUR) changed.

Skip ahead to 5.20, and you get something different.

$ 5.20t/bin/perl -MDevel::Peek -e'$_ = "abc"; $_ .= "d"; Dump($_); s/d\z//; Dump($_);'
SV = PV(0x55ee06d20d20) at 0x55ee06d61ee0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x55ee06d4d530 "abcd"\0
  CUR = 4
  LEN = 10
SV = PV(0x55ee06d20d20) at 0x55ee06d61ee0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x55ee06d3acf0 "abc"\0
  CUR = 3
  LEN = 10

Note that the string buffer moved from 0x55ee06d4d530 to 0x55ee06d3acf0.

A copy of the buffer is being made, resulting in at least temporary additional memory use.

What changed is that 5.20 introduced the copy-on-write ("COW") mechanism. Thanks to this mechanism, copies of scalars containing strings no longer copy the string buffer. Only the pointer to the buffer is copied, and the string buffer is flagged as shared with the IsCOW flag.

When you perform a regex match, a copy is made of the scalar being matched. This copy is attached through magic to all applicable capture variables ($1, etc.), including $& and similar. But thanks to the new COW mechanism, no copy is made of the string buffer. Both the original and the copy share the same string buffer until one is changed.

In our scenario, one of them is changed just a moment later, since we're performing an in-place substitution. $_ therefore gets a new buffer to hold the modified value. This is what gets us the equivalency I described at the start of this answer.

We can see the COW mechanism in action if we avoid changing the original scalar.

$ 5.20t/bin/perl -MDevel::Peek -e'$_ = "abc"; $_ .= "d"; Dump($_); my $y = s/d\z//r; Dump($_); Dump($y);'
SV = PV(0x55b9dacc8d20) at 0x55b9dad09ee0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x55b9dacf5790 "abcd"\0
  CUR = 4
  LEN = 10
SV = PV(0x55b9dacc8d20) at 0x55b9dad09ee0
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x55b9dacf5790 "abcd"\0
  CUR = 4
  LEN = 10
  COW_REFCNT = 1
SV = PV(0x55b9dacc8e50) at 0x55b9dacf4548
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x55b9dacf5be0 "abc"\0
  CUR = 3
  LEN = 10

Note that the scalar has the IsCOW flag set after the regex match. Its buffer (0x55b9dacf5790) is being shared with a scalar associated with $&.

Using COW for capture variables made the code cleaner, fixed bugs, and improved performance.

The memory used by the copy of the matched string will be freed the next time you do a match in the same scope, so the memory "lost" to this copy doesn't accumulate. This means the memory lost from this isn't related to the length of @l in the OP's example.


Why does s/// create scalars with larger string buffers than split?

Because s/// "builds up" the string, whereas split knows the strings it wants to return before it creates the scalars for them.

Perl favours speed at the expense of (often substantial amounts of) memory. One way in which it does this is by allocating string buffers that are larger than necessary. In this case, the new scalars are being created with larger buffers.

split doesn't "build up" the string. It knows the exact length of the string it wants to place in the scalar when it creates the scalar.

s///r doesn't know the final length of the string it's going to return up front. It "builds it up" by appending to a scalar it created. As the scalar's string buffer becomes full, it undergoes size expansions.

This difference in how the strings are built accounts for the differences in the size of the buffers.

  • split allocates scalars with buffers of size 16, 17, 16, 19, 16 in the OP's example.
  • s/// allocates scalars with buffers of size 16, 40, 16, 40, 16 in the OP's example.
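
To see the buffer sizes on your own build, the core B module can report the used length (CUR) and allocated size (LEN) of each scalar without parsing Devel::Peek output. This is a rough sketch: buffer_info is a hypothetical helper name, and the actual LEN values depend on the Perl version and the system allocator, so no expected numbers are shown.

```perl
use strict;
use warnings;
use B ();    # core module; exposes the CUR and LEN fields of a string scalar

# Hypothetical helper: return [used bytes (CUR), allocated bytes (LEN)].
sub buffer_info {
    my ($ref) = @_;
    my $sv = B::svref_2object($ref);
    return [ $sv->CUR, $sv->LEN ];
}

my @l = split /\s+/, 'OC   Pimascovirales; Iridoviridae; Betairidovirinae; Iridovirus.';

my @before = map { buffer_info(\$_) } @l;    # buffers as allocated by split
s/[.;]$// for @l;                            # same substitution as in the question
my @after  = map { buffer_info(\$_) } @l;    # buffers after s///

for my $i (0 .. $#l) {
    printf "%-17s CUR %2d -> %2d   LEN %2d -> %2d\n",
        $l[$i], $before[$i][0], $after[$i][0], $before[$i][1], $after[$i][1];
}
```

CUR should drop by one for every element that lost its trailing punctuation, while LEN may grow; that growth is what the larger total reported by DDP reflects.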
Meri answered 18/1 at 18:37 Comment(2)
If the memory is not occupied by the members of @l, why is the number shown by DDP different? - Diopside
@choroba, Good question (which isn't answered by your Answer either). split returns buffers just big enough. s/// returns buffers with more extra space. The last section definitely needs to be corrected. - Meri

To answer the second part of your question: you can use split /[;.\s]+/ and the resulting array will be 354B, and contain your wanted values with no post-processing (and no string copying) necessary.

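That assumes that there are no semicolons or dots anywhere except at the ends of words; if that's untrue you can use the less-pretty (and probably marginally slower) split /[;.]?(?:\s+|\z)/. The \z alternative is needed so that punctuation at the very end of the string (the final "." in your example) is also removed; a pattern that insists on whitespace after the punctuation would leave it in place.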
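
The sketch below compares the two-step approach from the question with the single split. The third pattern is an assumed variant for data with punctuation inside words, not something taken from the question, and the "v1.2" input is made up to exercise that case.

```perl
use strict;
use warnings;

my $line = 'OC   Pimascovirales; Iridoviridae; Betairidovirinae; Iridovirus.';

# Approach from the question: split, then strip trailing punctuation in place.
my @two_step = split /\s+/, $line;
s/[.;]$// for @two_step;

# One step: treat any run of ";", "." and whitespace as the separator.
my @one_step = split /[;.\s]+/, $line;

# Assumed variant: only strip ";" or "." when it is followed by
# whitespace or the end of the string, so "v1.2" keeps its inner dot.
my @strict = split /[;.]?(?:\s+|\z)/, 'OC   v1.2; Iridovirus.';

print join('|', @one_step), "\n";    # OC|Pimascovirales|Iridoviridae|Betairidovirinae|Iridovirus
print join('|', @strict),   "\n";    # OC|v1.2|Iridovirus
```

Both split forms produce the final strings directly, so s/// never has to allocate replacement buffers.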

Audiometer answered 18/1 at 20:44 Comment(0)
