Why does the OP's post-modification data structure use more memory?
s/// creates new scalars rather than modifying the string in-place, and s/// happens to create new scalars with larger string buffers than split in the OP's example.
I explain both of these in far more detail below, but this is really it.
Why doesn't s/// modify the string in-place?
At least in the ways that matter for this post, the following two snippets are equivalent since 5.20:
$_ =~ s/[\.;]$//
$_ = $_ =~ s/[\.;]$//r
A new scalar is created and assigned to the bound scalar instead of modifying the scalar directly.
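As a quick sanity check of that equivalence, here is a minimal sketch that applies both forms to the same data and prints the results. The sample strings and variable names are made up for illustration; they are not the OP's data.

```perl
use strict;
use warnings;

# Hypothetical sample values; any strings would do.
my @in_place = ("foo.", "bar;", "baz");
my @via_r    = @in_place;

# In-place form: s/// acts on $_, which is aliased to each element.
s/[.;]$// for @in_place;

# /r form: build the new string, then assign it back to the element.
$_ = s/[.;]$//r for @via_r;

print "@in_place\n";
print "@via_r\n";
```

Both arrays end up with the same contents. This only shows that the visible result is the same; it says nothing by itself about what happens to the string buffers.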
But this wasn't always the case. Once upon a time, Perl would simply reduce the used size of the buffer when removing from its end using s///, resulting in no additional memory used. This is demonstrated by the following simple program:
$ 5.18t/bin/perl -MDevel::Peek -e'$_ = "abc"; $_ .= "d"; Dump($_); s/d\z//; Dump($_);'
SV = PV(0x55da3c065ce0) at 0x55da3c0a4830
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x55da3c08e7c0 "abcd"\0
CUR = 4
LEN = 16
SV = PV(0x55da3c065ce0) at 0x55da3c0a4830
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x55da3c08e7c0 "abc"\0
CUR = 3
LEN = 16
Note that the string buffer is at 0x55da3c08e7c0 before and after. Only the used amount of the buffer (CUR) changed.
Skip ahead to 5.20, and you get something different.
$ 5.20t/bin/perl -MDevel::Peek -e'$_ = "abc"; $_ .= "d"; Dump($_); s/d\z//; Dump($_);'
SV = PV(0x55ee06d20d20) at 0x55ee06d61ee0
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x55ee06d4d530 "abcd"\0
CUR = 4
LEN = 10
SV = PV(0x55ee06d20d20) at 0x55ee06d61ee0
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x55ee06d3acf0 "abc"\0
CUR = 3
LEN = 10
Note that the string buffer moved from 0x55ee06d4d530 to 0x55ee06d3acf0.
A copy of the buffer is being made, resulting in at least temporary additional memory use.
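If you'd rather check this from a script than read Dump output, something like the following sketch should work. The buf_info helper is my own name, not part of any module. It assumes a pointer is the same size as a Perl UV (true on typical 64-bit builds) and bails out otherwise. Which message you get depends on the perl you run it with.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: report the buffer address, used length (CUR) and
# allocated length (LEN) of a scalar. $_[0] is an alias, so no copy is made.
sub buf_info {
    # Assumption: pointers and UVs have the same size on this build.
    length(pack('p', $_[0])) == length(pack('J', 0))
        or die "pointer size differs from UV size on this perl\n";
    my $addr = unpack('J', pack('p', $_[0]));
    my $sv   = B::svref_2object(\$_[0]);
    return { addr => $addr, cur => $sv->CUR, len => $sv->LEN };
}

my $s = "abc";
$s .= "d";
my $before = buf_info($s);
$s =~ s/d\z//;
my $after = buf_info($s);

printf "before: addr=%#x CUR=%d LEN=%d\n", @$before{qw(addr cur len)};
printf "after:  addr=%#x CUR=%d LEN=%d\n", @$after{qw(addr cur len)};
print $before->{addr} == $after->{addr}
    ? "buffer reused in place\n"
    : "buffer was replaced\n";
```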
What changed is that 5.20 introduced the copy-on-write ("COW") mechanism. Thanks to this mechanism, copies of scalars containing strings no longer copy the string buffer. Only the pointer to the buffer is copied, and the string buffer is flagged as shared with the IsCOW flag.
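A rough way to watch the sharing happen on a plain assignment is to compare buffer addresses before and after modifying the copy. This is only a sketch: buf_addr is a made-up helper, it assumes pointers fit in a UV, and whether the first line reports sharing depends on your perl version and build.

```perl
use strict;
use warnings;

# Hypothetical helper: address of a scalar's string buffer.
# Assumption: a pointer is the same size as a UV on this build.
sub buf_addr { unpack('J', pack('p', $_[0])) }

my $x = "abc";
$x .= "d";

my $y = $x;    # With COW, this may copy only the pointer.
my $shared_after_copy = buf_addr($x) == buf_addr($y);
print $shared_after_copy
    ? "after copy:   buffer shared\n"
    : "after copy:   separate buffers\n";

$y .= "!";     # Writing to $y forces it to have a buffer of its own.
my $shared_after_write = buf_addr($x) == buf_addr($y);
print $shared_after_write
    ? "after write:  buffer shared\n"
    : "after write:  separate buffers\n";

print "x=$x y=$y\n";
```

Whatever the first line says, the two scalars can't share a buffer after the write, and $x keeps its original value.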
When you perform a regex match, a copy is made of the scalar being matched. This copy is attached through magic to all applicable capture vars ($1, etc), including $& and similar. But thanks to the new COW mechanism, no copy is made of the string buffer. Both the original and the copy share the same string buffer until one is changed.
In our scenario, one of them is changed just a moment later since we're performing an in-place substitution. $_ therefore gets a new buffer to hold the modified value. This is what gets us the equivalency I described at the start of this answer.
We can see the COW mechanism in action if we avoid changing the original scalar.
$ 5.20t/bin/perl -MDevel::Peek -e'$_ = "abc"; $_ .= "d"; Dump($_); my $y = s/d\z//r; Dump($_); Dump($y);'
SV = PV(0x55b9dacc8d20) at 0x55b9dad09ee0
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x55b9dacf5790 "abcd"\0
CUR = 4
LEN = 10
SV = PV(0x55b9dacc8d20) at 0x55b9dad09ee0
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK)
PV = 0x55b9dacf5790 "abcd"\0
CUR = 4
LEN = 10
COW_REFCNT = 1
SV = PV(0x55b9dacc8e50) at 0x55b9dacf4548
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x55b9dacf5be0 "abc"\0
CUR = 3
LEN = 10
Note that the scalar has the IsCOW flag set after the regex match. Its buffer (0x55b9dacf5790) is being shared with a scalar associated with $&.
Using COW for capture variables made the code cleaner, fixed bugs, and improved performance.
The memory used by the copy of the matched string will be freed the next time you do a match in the same scope, so the memory "lost" to this copy doesn't accumulate. This means it isn't related to the length of @l in the OP's example.
Why does s/// create scalars with larger string buffers than split?
Because s/// "builds up" the string, whereas split knows the strings it wants to return before it creates the scalars for them.
Perl favours speed at the expense of (often substantial amounts of) memory. One way in which it does this is by allocating string buffers that are larger than necessary. In this case, the new scalars are being created with larger buffers.
split doesn't "build up" the string. It knows the exact length of the string it wants to place in the scalar when it creates the scalar.
s///r doesn't know the final length of the string it's going to return up front. It "builds it up" by appending to a scalar it created. As the scalar's string buffer becomes full, it undergoes size expansions.
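To get a feel for what "size expansions" look like, here is a small sketch that appends one character at a time from Perl code and records each point where the allocated size (LEN) changes. It is only an analogy: it uses .= rather than the substitution operator's internal code path, and I'm assuming the same buffer-growth policy applies to both. The exact numbers vary between perl versions and builds.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: allocated size (LEN) of a scalar's string buffer.
sub buf_len { B::svref_2object(\$_[0])->LEN }

my $s    = "";
my $last = -1;
my @growth;    # [ used length, allocated length ] at each change of LEN

for my $i (1 .. 200) {
    $s .= "x";
    my $len = buf_len($s);
    if ($len != $last) {
        push @growth, [ length($s), $len ];
        $last = $len;
    }
}

printf "CUR=%-4d LEN=%d\n", @$_ for @growth;
```

The buffer is usually grown by more than the one byte that was needed, which is the speed-for-memory trade described above.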
This difference in how the strings are built accounts for the differences in the size of the buffers.
split allocates scalars with buffers of size 16, 17, 16, 19, 16 in the OP's example.
s/// allocates scalars with buffers of size 16, 40, 16, 40, 16 in the OP's example.
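If you want to take this kind of measurement yourself, a sketch along these lines should do. The input line is invented (it is not the OP's data), buf_len is my own helper built on the B module, and the sizes you see will depend on your perl version and on the strings involved.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: allocated size (LEN) of a scalar's string buffer.
sub buf_len { B::svref_2object(\$_[0])->LEN }

# Hypothetical input, not the OP's data.
my $line = "alpha beta. gamma; delta";

my @l = split ' ', $line;
my @before_str = @l;                       # keep the originals for display
my @split_len  = map { buf_len($_) } @l;   # $_ aliases each element

s/[.;]$// for @l;                          # same in-place form as above
my @subst_len = map { buf_len($_) } @l;

for my $i (0 .. $#l) {
    printf "%-8s LEN after split=%-3d LEN after s///=%d\n",
        $before_str[$i], $split_len[$i], $subst_len[$i];
}
```

Elements that the pattern doesn't match are left alone, so only the ones that were actually changed can show a different buffer size.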
/\s+[.;]*/ – Airman
$_ =~ ... -- regex binds to $_ by default, and s/// changes its target in place. So, s/.../.../ for @ary; does the substitution on array elements. If you are comfortable with this form and won't later overlook what it's doing – Lambent
Inside a character class, . does not have its usual special, metacharacter meaning; it's just a period. So instead of s/[\.;]$// you can do s/[.;]$// (I'm assuming you don't mean to remove the backslash but with it only want to escape .) – Lambent