Perl substr based on bytes

Asked 24/4, 2012 at 16:56 Answered 24/4, 2012 at 17:42

I'm using SimpleDB for my application. Everything works well, except that a single attribute is limited to 1024 bytes. So I have to chop a long string into chunks before saving it.

My problem is that my string sometimes contains Unicode characters (Chinese, Japanese, Greek), and the substr() function counts characters, not bytes.

I tried the bytes pragma (use bytes) for byte semantics, and later substr(encode_utf8($str), $start, $length), but neither helped.

Any help would be appreciated.

Alcinia answered 24/4, 2012 at 16:56 Comment(1)

Which version of Perl are you using? – Epode 24/4, 2012 at 17:10

UTF-8 was engineered so that character boundaries are easy to detect. To split the string into chunks of valid UTF-8, you can simply use the following:

use Encode qw( encode_utf8 decode_utf8 );
my $utf8 = encode_utf8($text);
my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;

Then do one of the following, depending on what your saving code expects:

# The saving code expects bytes.
store($_) for @utf8_chunks;

# The saving code expects decoded text.
store(decode_utf8($_)) for @utf8_chunks;

Demonstration:

$ perl -e'
    use Encode qw( encode_utf8 );

    # This character encodes to three bytes using UTF-8.
    my $text = "\N{U+2660}" x 342;

    my $utf8 = encode_utf8($text);
    my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;

    CORE::say(length($_)) for @utf8_chunks;
'
1023
3
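
As a further check, here is a minimal round-trip sketch using the same regex. The sample text (a mix of 1-, 2- and 3-byte characters) and the variable names are assumptions chosen for illustration. It verifies that every chunk fits the 1024-byte limit, that each chunk decodes on its own, and that joining the chunks restores the original text:

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

# Illustrative sample text: "a" is 1 byte, U+03B1 is 2 bytes and
# U+2660 is 3 bytes when encoded as UTF-8.
my $text = "a\N{U+03B1}\N{U+2660}" x 500;

my $utf8   = encode_utf8($text);
my @chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;

for my $chunk (@chunks) {
    die "chunk exceeds 1024 bytes" if length($chunk) > 1024;

    # decode_utf8 with a CHECK value may modify its argument, so decode a copy.
    # FB_CROAK makes the call die if the chunk is not valid UTF-8 on its own.
    my $copy = $chunk;
    decode_utf8($copy, Encode::FB_CROAK);
}

# Concatenating the byte chunks and decoding once restores the original text.
my $restored = decode_utf8(join '', @chunks);
print $restored eq $text ? "round trip ok\n" : "round trip FAILED\n";
```

Because the chunks are plain bytes, they can be stored in order and joined again before decoding, whichever of the two saving styles is used.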

Claman answered 24/4, 2012 at 17:42 Comment(2)

@Borodin, Yes and no. It wouldn't make a difference to the regex engine, but it would make a difference to the reader. We want to match where the previous match left off, so let's not hide that. – Claman 24/4, 2012 at 17:53

@mob, Thanks, but it's UTF-8's designers that were clever. They specifically engineered UTF-8 so that character boundaries are easy to detect. – Claman 24/4, 2012 at 17:54

substr operates on 1-byte characters unless the string has the UTF-8 flag on. So this will give you the first 1024 bytes of the UTF-8 encoding of a decoded string:

substr encode_utf8($str), 0, 1024;

However, this does not necessarily split the string on a character boundary. To discard a partial character at the end, you can decode the result with:

$str = decode_utf8($str, Encode::FB_QUIET);
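
Putting the two steps together, a minimal sketch might look like the following. The helper name utf8_prefix and the test string are hypothetical, introduced only for illustration. It takes the first $max bytes, lets Encode::FB_QUIET drop a trailing partial character, and re-encodes the result so the caller gets bytes back:

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

# Hypothetical helper: returns the longest prefix of $bytes that is at most
# $max bytes long and ends on a UTF-8 character boundary.
sub utf8_prefix {
    my ($bytes, $max) = @_;

    my $prefix = substr($bytes, 0, $max);

    # With FB_QUIET, decoding stops quietly at the first malformed sequence,
    # which here is the character cut in half by substr (if any).
    my $text = decode_utf8($prefix, Encode::FB_QUIET);

    return encode_utf8($text);
}

# U+2660 encodes to three bytes, so 342 of them occupy 1026 bytes.
my $bytes = encode_utf8("\N{U+2660}" x 342);
my $first = utf8_prefix($bytes, 1024);
print length($first), "\n";    # prints 1023
```

Note that this sketch assumes the input is otherwise valid UTF-8: FB_QUIET also stops at malformed bytes in the middle of the data, not only at a character split by substr.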

Ciapas answered 24/4, 2012 at 17:24 Comment(3)

This doesn't necessarily split the string on character boundaries, so the OP couldn't call decode_utf8 on an individual chunk (which might be OK). – Thyme 24/4, 2012 at 17:36

Incorrect; substr operates on characters (whatever their size in bytes may be), not bytes. – Downhearted 29/10, 2016 at 18:25

@mob: they could, passing Encode::FB_QUIET as the second parameter. – Ciapas 31/10, 2016 at 8:34
