Why would var_dump return a bigger value than the string length?

I am fetching song lyrics through an API and converting the lyrics string into an array of words. I am getting some unusual behavior from the preg_replace function. When I did some debugging with var_dump, I saw that it reports a length of 10 for the string "you", which tells me that something might be wrong. After that, preg_replace acts weirdly.

This is my code:

$source = get_chart_lyrics_data("madonna","frozen");
$pieces = explode("\n", $source);
$lyrics = array();
for($i=0;$i<count($pieces);$i++){
  if($i>10){
    $words = explode(" ",$pieces[$i]);
    foreach($words as $_word){
      if($_word=="")
        continue;
      var_dump($_word);
      $word = strtolower($_word);
      var_dump($word);
      $word = trim($word);
      var_dump($word);
      $word = preg_replace("/[^A-Za-z ]/", '', $word);
      var_dump($word);
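      // Count the word; initialise the counter first to avoid an undefined-index notice.
      $lyrics[$word] = isset($lyrics[$word]) ? $lyrics[$word] + 1 : 1;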
    }
  }
}

These are the first four lines this code outputs:

string(10) “You”
string(10) “you”
string(10) “you”
string(8) “lyricyou”

How come var_dump is reporting a length of 10 for "you"? And why is preg_replace acting like that?

Thanks.

Thielen answered 23/1, 2015 at 4:17 Comment(7)
This is probably an encoding problem: var_dump reports the number of bytes, not the number of characters. Can you show the original string or, better, where it comes from? – Ames
Try echo htmlentities(htmlentities($word)); to see if there are any special characters or something. – C
“You” and "You" are different. – Biologist
Those are not double-quotes but some other characters. If they're UTF-8 encoded that might explain why the string length is greater than the number of visible characters. – Frigidarium
The double quotes should be part of the var_dump output; I'd assume they just turned "fancy" while copying and pasting into SO. – Relational
Also use the u modifier with preg_replace: preg_replace('/[^\pL ]+/u', '', $word); – Ames
Hi @CasimiretHippolyte, I tried replacing preg_replace with your suggestion. It didn't make any difference. Let me know if you have any other suggestions. – Thielen
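
To illustrate the byte-versus-character point raised in these comments, here is a minimal sketch. The sample string (the word wrapped in UTF-8 curly quotes) is an assumption for illustration, not the asker's actual data, and the \u{...} escape needs PHP 7 or later.

```php
<?php
// Assumed sample: "you" wrapped in curly quotes (U+201C, U+201D), UTF-8 encoded.
$word = "\u{201C}you\u{201D}";

// strlen() counts bytes, which is also the number var_dump() shows.
echo strlen($word), "\n";                   // 9 (3 + 3 + 3 bytes)

// Counting matches of "." in UTF-8 mode gives the number of code points.
echo preg_match_all('/./su', $word), "\n";  // 5

// With the u modifier the pattern works on code points, so each quote is removed whole.
echo preg_replace('/[^\pL ]+/u', '', $word), "\n"; // you
```

If the byte count and the code point count differ, the string contains multi-byte characters; if both are larger than the number of visible characters, something invisible is in there.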

The likeliest answer is that the string contains non-printable characters beyond "you". To figure out what exactly it contains, you'll have to look at the raw bytes. Do this with echo bin2hex($word). This outputs a string like 666f6f..., where every two characters represent one byte in hexadecimal notation. You can make that more readable with something like:

echo join(' ', str_split(bin2hex($word), 2));
// 66 6f 6f ...

Now use your favourite ASCII/Unicode table (depending on the encoding of the string) to figure out what individual characters those represent and where you got them from.

Perhaps your string is encoded in UTF-16, in which case you should see a telltale 00 byte next to every ASCII character, i.e. in every other position of the hex dump.
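
A minimal sketch of the inspection steps above. The helper name hexDump is hypothetical, and the sample value is an assumption taken from the ten bytes reported in the comments on this answer.

```php
<?php
// Hypothetical helper: render each byte of $s as two hex digits, separated by spaces.
function hexDump(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// Assumed sample input, matching the bytes reported in the comments.
$word = '<lyric>you';

echo strlen($word), "\n";    // 10, the byte count var_dump() shows
echo hexDump($word), "\n";   // 3c 6c 79 72 69 63 3e 79 6f 75

// For comparison, "you" in UTF-16LE: each ASCII character is followed by a 00 byte.
echo hexDump("y\x00o\x00u\x00"), "\n"; // 79 00 6f 00 75 00
```

Run this from the command line, or view the page source, so that the browser does not hide anything that looks like markup.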

Relational answered 23/1, 2015 at 4:25 Comment(5)
Yep, the echo call is returning this: 3c 6c 79 72 69 63 3e 79 6f 75 – Thielen
That's <lyric>you. The missing characters are getting interpreted as a tag by the browser. – Ocrea
@Thielen Protip: don't use the browser for debugging, or at least look at the page's source code. – Relational
@duskwuff Thanks for the feedback. Yes, I never thought about that. Apparently it was SOAP output I was getting. – Thielen
You are the wind beneath my wings, @deceze. – Mckelvey
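
The diagnosis in these comments can be reproduced with a short sketch. The raw value '<lyric>You' is an assumption based on the hex dump above, and strip_tags() is only one possible way to clean the input; the thread itself does not say how the asker fixed it.

```php
<?php
// Assumed sample: markup from the API response glued to the first word.
$raw = '<lyric>You';

// Same steps as in the question: lowercase, trim, drop non-letters.
$word = trim(strtolower($raw));
$cleaned = preg_replace('/[^A-Za-z ]/', '', $word);

echo strlen($word), "\n";                    // 10
echo $cleaned, ' ', strlen($cleaned), "\n";  // lyricyou 8

// Escaping before output makes the tag visible in a browser.
echo htmlspecialchars($word), "\n";          // &lt;lyric&gt;you

// One possible fix: remove the markup before splitting into words.
echo strip_tags($raw), "\n";                 // You
```

The regular expression removes only < and >, which is why the tag name survives and ends up fused to the word.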
