Pulling data from API, memory growth
I'm working on a project where I pull data (JSON) from an API. The problem I'm having is that the memory is slowly growing until I get the dreaded fatal error:

Fatal error: Allowed memory size of * bytes exhausted (tried to allocate * bytes) in C:... on line *

I don't think there should be any memory growth. I tried unsetting everything at the end of the loop but no difference. So my question is: am I doing something wrong? Is it normal? What can I do to fix this problem?

<?php

$start = microtime(true);

$time = microtime(true) - $start;
echo "Start: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br/>";

include ('start.php');
include ('connect.php');

set_time_limit(0);

$api_key = 'API-KEY';
$tier = 'Platinum';
$threads = 10; //number of urls called simultaneously

function multiRequest($urls, $start) {

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp;start function: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    $nbrURLS = count($urls); // number of urls in array $urls
    $ch = array(); // array of curl handles
    $result = array(); // data to be returned

    $mh = curl_multi_init(); // create a multi handle 

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp;Creation multi handle: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    // set URL and other appropriate options
    for($i = 0; $i < $nbrURLS; $i++) {
        $ch[$i]=curl_init();

        curl_setopt($ch[$i], CURLOPT_URL, $urls[$i]);
        curl_setopt($ch[$i], CURLOPT_RETURNTRANSFER, 1); // return data as string
        curl_setopt($ch[$i], CURLOPT_SSL_VERIFYPEER, 0); // don't verify the SSL certificate

        curl_multi_add_handle ($mh, $ch[$i]); // Add a normal cURL handle to a cURL multi handle
    }

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp;For loop options: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    // execute the handles
    do {
        $mrc = curl_multi_exec($mh, $active);          
        curl_multi_select($mh, 0.1); // without this, we will busy-loop here and use 100% CPU
    } while ($active);

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp;Execution: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    echo '&nbsp;&nbsp;&nbsp;For loop2<br>';

    // get content and remove handles
    for($i = 0; $i < $nbrURLS; $i++) {

        $error = curl_getinfo($ch[$i], CURLINFO_HTTP_CODE); // Last received HTTP code 

        echo "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;error: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

        //error handling if not 200 ok code
        if($error != 200){

            if($error == 429 || $error == 500 || $error == 503 || $error == 504){
                echo "Again error: $error<br>";
                $result['again'][] = $urls[$i];

            } else {
                echo "Error error: $error<br>";
                $result['errors'][] = array("Url" => $urls[$i], "errornbr" => $error);
            }

        } else {
            $result['json'][] = curl_multi_getcontent($ch[$i]);

            echo "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Content: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";
        }

        curl_multi_remove_handle($mh, $ch[$i]);
        curl_close($ch[$i]);
    }

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp; after loop2: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    curl_multi_close($mh);

    return $result;
}


$gamesId = mysqli_query($connect, "SELECT gameId FROM `games` WHERE `region` = 'EUW1' AND `tier` = '$tier' LIMIT 20");
$urls = array();

while($result = mysqli_fetch_array($gamesId))
{
    $urls[] = 'https://euw.api.pvp.net/api/lol/euw/v2.2/match/' . $result['gameId'] . '?includeTimeline=true&api_key=' . $api_key;
}

$time = microtime(true) - $start;
echo "After URL array: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br/>";

$x = 1; //number of loops

while($urls){ 

    $chunk = array_splice($urls, 0, $threads); // take the first chunk ($threads) of all urls

    $time = microtime(true) - $start;
    echo "<br>After chunk: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br/>";

    $result = multiRequest($chunk, $start); // Get json

    unset($chunk);

    $nbrComplete = count($result['json']); //number of returned json strings

    echo 'For loop: <br/>';

    for($y = 0; $y < $nbrComplete; $y++){
        // parse the json
        $decoded = json_decode($result['json'][$y], true);

        $time = microtime(true) - $start;
        echo "&nbsp;&nbsp;&nbsp;Decode: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br/>";


    }

    unset($nbrComplete);
    unset($decoded);

    $time = microtime(true) - $start;
    echo $x . ": ". memory_get_peak_usage(true) . " | " . $time . "<br>";

    // reuse urls
    if(isset($result['again'])){
        $urls = array_merge($urls, $result['again']);
        unset($result['again']);
    }

    unset($result);
    unset($time);

    sleep(15); // limit the request rate

    $x++;
}

include ('end.php');

?>

PHP Version 5.3.9 - 100 loops:

loop: memory | time (sec)
1: 5505024 | 0.98330211639404
3: 6291456 | 33.190237045288
65: 6553600 | 1032.1401019096
73: 6815744 | 1160.4345710278
75: 7077888 | 1192.6274609566
100: 7077888 | 1595.2397520542

EDIT:
After trying it with PHP 5.6.14 xampp on windows:

loop: memory | time (sec)
1: 5505024 | 1.0365679264069
3: 6291456 | 33.604479074478
60: 6553600 | 945.90159296989
62: 6815744 | 977.82566595078
93: 7077888 | 1474.5941500664
94: 7340032 | 1490.6698410511
100: 7340032 | 1587.2434458733

EDIT2: I only see the memory increase after json_decode

Start: 262144 | 135448
After URL array: 262144 | 151984
After chunk: 262144 | 152272
   start function: 262144 | 152464
   Creation multi handle: 262144 | 152816
   For loop options: 262144 | 161424
   Execution: 3145728 | 1943472
   For loop2
      error: 3145728 | 1943520
      Content: 3145728 | 2095056
      error: 3145728 | 1938952
      Content: 3145728 | 2131992
      error: 3145728 | 1938072
      Content: 3145728 | 2135424
      error: 3145728 | 1933288
      Content: 3145728 | 2062312
      error: 3145728 | 1928504
      Content: 3145728 | 2124360
      error: 3145728 | 1923720
      Content: 3145728 | 2089768
      error: 3145728 | 1918936
      Content: 3145728 | 2100768
      error: 3145728 | 1914152
      Content: 3145728 | 2089272
      error: 3145728 | 1909368
      Content: 3145728 | 2067184
      error: 3145728 | 1904616
      Content: 3145728 | 2102976
    after loop2: 3145728 | 1899824
For loop: 
   Decode: 3670016 | 2962208
   Decode: 4980736 | 3241232
   Decode: 5242880 | 3273808
   Decode: 5242880 | 2802024
   Decode: 5242880 | 3258152
   Decode: 5242880 | 3057816
   Decode: 5242880 | 3169160
   Decode: 5242880 | 3122360
   Decode: 5242880 | 3004216
   Decode: 5242880 | 3277304
Minica answered 27/10, 2015 at 10:21 Comment(16)
This will be difficult without a real example to try (as it may be difficult to represent your actual dataset). My suggestions are: 1. Use a profiler (e.g. Blackfire) 2. If you cannot use a profiler, spread some more memory_get_peak_usage calls around (one on each line is what I would do) so you can see exactly where the memory is growing. My best guess is cURL is leaking memory ;)Teutonize
that's been allocated to your PHP script; perhaps part of it is cached memory. So even though you unset your variables, actions elsewhere could have raised the memory. Peak usage doesn't reflect the actual memory in use, but rather the most you ever used while processing the data. For example, the activity you do in your "manipulate data" step could raise it; that memory was used and is no longer used afterwards, but the system will still cache it temporarilyRodrique
By the way, what's your PHP version?Teutonize
@RicardoVelhote I was using PHP 5.3.9 and changed to 5.6.14, but not much of a difference. See the edited post. Now I'm going to try a profiler.Minica
@prix I'm testing without the manipulation part.Minica
In the multiRequest function, before the return, try to unset($ch).Guereza
Apparently PHP 7, soon to come out, has much better memory consumption when dealing with arrays. Not really a solution, but something to think about: nikic.github.io/2014/12/22/…Carvajal
@Minica Can you put more memory_get_peak_usage(true) calls in your code and show us the results? We have to know where the memory actually grows so we can pinpoint the exact place. One iteration is enough, because at the first loop it's already at 5.5MB. The growth in the other loops is not significant.Teutonize
Might be helpful reddit.com/r/PHP/comments/3q1ymn/…Carvajal
@RicardoVelhote I have printed some more memory_get_peak_usage(true) calls. I only see the memory grow after json_decode(); now I'm going to do a big loop to see what happens.Minica
@Minica Please also include memory_get_usage. I think it's weird that there is such a big increase from Start to After URL array. Can you also update the code you posted with the memory debug lines so we can see exactly where they are?Teutonize
@RicardoVelhote No idea why that big increase happened; now it's not increasing at all, maybe I didn't copy the right loop. The memory is increasing mostly at json_decode(), but once it grew at the second for loop of multiRequest()Minica
I’d really hope you’d make a runnable test case; now we can more or less just guess what specific thing goes wrong...Punke
Some culprits I’ve commonly found are native resources, like $ch in this case, and PHP’s idiomatic way of handling foreach loops. For debugging memory problems, xdebug is useful; kcachegrind works well for reading its output. Without exact code to review, it’s really hard to look at this any further, but even then I’d want to vote to close this as a “Why is this code not working?” question, as there is no simple piece of code you want to know why it works the way it does.Punke
@RicardoVelhote thank you for your time and effort, but after a long time of trying to solve this problem, I guess I'll just admit defeat. Maybe I'm going to try to write this in another, better-suited language I don't know yet (any suggestions?). If I find a solution you'll be the first to know.Minica
(any suggestions?) You've already been given a suggestion: "I’d really hope you’d make a runnable test case". Giving up doesn't make the problem go away.Rodrique

I tested your script on 10 URLs. I removed all your comments except one at the end of the script and one in the problem loop where json_decode is used. I also opened one of the pages you decode from the API and saw a very big array, so I think you're right: you have an issue with json_decode.

Results and fixes.

Result without changes:

Code:

for($y = 0; $y < $nbrComplete; $y++){
   $decoded = json_decode($result['json'][$y], true);
   $time = microtime(true) - $start;
   echo "Decode: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "\n";
}

Result:

Decode: 3407872 | 2947584
Decode: 3932160 | 2183872
Decode: 3932160 | 2491440
Decode: 4980736 | 3291288
Decode: 6291456 | 3835848
Decode: 6291456 | 2676760
Decode: 6291456 | 4249376
Decode: 6291456 | 2832080
Decode: 6291456 | 4081888
Decode: 6291456 | 3214112
Decode: 6291456 | 244400

Result with unset($decoded):

Code:

for($y = 0; $y < $nbrComplete; $y++){
   $decoded = json_decode($result['json'][$y], true);
   unset($decoded);
   $time = microtime(true) - $start;
   echo "Decode: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "\n";
}

Result:

Decode: 3407872 | 1573296
Decode: 3407872 | 1573296
Decode: 3407872 | 1573296
Decode: 3932160 | 1573296
Decode: 4456448 | 1573296
Decode: 4456448 | 1573296
Decode: 4980736 | 1573296
Decode: 4980736 | 1573296
Decode: 4980736 | 1573296
Decode: 4980736 | 1573296
Decode: 4980736 | 244448

Also you can add gc_collect_cycles:

Code:

for($y = 0; $y < $nbrComplete; $y++){
   $decoded = json_decode($result['json'][$y], true);
   unset($decoded);
   gc_collect_cycles();
   $time = microtime(true) - $start;
   echo "Decode: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "\n";
}

It can help in some cases, but it can also lead to performance degradation.

You can try restarting the script with unset, and with unset + gc_collect_cycles(), and write again if you still have the same issue after the changes.

Also, I don't see where you use the $decoded variable; if that's an error in the code, you can remove json_decode :)

Elene answered 3/11, 2015 at 20:5 Comment(1)
This is a simplified version of my real script, that's why $decoded isn't used. I'll try unsetting and gc.Minica

Your method is quite long, and I believe garbage collection won't fire until the very end of the function, which means your unused variables can build up. If they aren't going to be used anymore, garbage collection would take care of them for you once the function returns.

You might think about refactoring this code into smaller methods to take advantage of this (along with all the other good stuff that comes with smaller methods); in the meantime, you could try putting gc_collect_cycles(); at the very end of your loop to see if you can free some memory:

if(isset($result['again'])){
    $urls = array_merge($urls, $result['again']);
    unset($result['again']);
}

unset($result);
unset($time);

gc_collect_cycles();//add this line here
sleep(15); // limit the request rate

Edit: the segment I updated actually doesn't belong to the big function; however, I suspect the size of $result may bowl things over, and it possibly won't get cleaned up until the loop terminates. This is worth a try, however.
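The refactoring idea could be sketched like this — a hypothetical helper (handleJson is not a name from the original code) that makes $decoded function-local, so it is released on every return instead of lingering in one long function body:

```php
<?php
// Hypothetical sketch: move the per-response work into a small function so
// that $decoded (and any temporaries) go out of scope when it returns.
function handleJson($json) {
    $decoded = json_decode($json, true);
    if ($decoded === null) {
        return false; // invalid JSON (or the literal "null")
    }
    // ... insert / aggregate the decoded data here ...
    return true;
}

// In the main loop, the decode step would then become:
// foreach ($result['json'] as $json) { handleJson($json); }
```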

Goodhen answered 30/10, 2015 at 11:20 Comment(9)
In my experience, GC got radically better at handling long functions with PHP 5.4; before that, it indeed worked like that.Punke
@smar Bearing that in mind, I have done a double take. The calling function in question isn't that long, but it does substantial work. I'm also going off of this: hackingwithphp.com/18/1/10Goodhen
Well, I guess that’s true; I’ve used OOP in my PHP scripts, so they are almost always leaving at least one function. Also, does leaving a PHP built-in function count as leaving a function? I do not know.Punke
So in that scenario I believe it would be when the entire script exits, which is a bit late.Goodhen
Yes, I’m not saying you’re wrong (I think you’re correct); I just don’t want to verify it by constructing a runnable case myself from that code and then reading PHP’s source to see why it actually does what it does, hence I left a comment based on my experience. (For the same reason I can’t leave an answer myself, since I just don’t know :)Punke
I think this is a good suggestion to try; however, I believe the memory increase results from the amount of data returned by the API that is processed and stored in $result, not something related to the code or PHP. I guess without actually trying it for real we are doing guesswork ;)Teutonize
@Octopi thank you for your time, I tried your suggestion but it didn't change the outcome.Minica
I have managed to run this thing from the console, and in its current state I can't see any excess memory leaks. As we can't see what is happening with the decoded JSON, I guess we can't help. Running with PHP 5.5.9Selfconfidence
I would highly support this: use more functions, or build a class that uses more methods. That way, when a method has ended, all the memory it was using is forcibly cleaned up; it also means you're not using the PHP global scope.Wegner

So my question is: am I doing something wrong? Is it normal? What can I do to fix this problem?

Yes, running out of memory is normal when you use all of it. You are making 10 simultaneous HTTP requests and unserializing the JSON responses into PHP memory. Without limiting the size of the responses, you will always be in danger of running out of memory.

What else can you do?

  1. Do not run multiple HTTP connections simultaneously. Turn $threads down to 1 to test this. If there is a memory leak in a C extension, calling gc_collect_cycles() will not free any memory; it only affects memory allocated in the Zend Engine which is no longer reachable.
  2. Save the results to a folder and process them in another script. You can move the processed files into a sub directory to mark when you have successfully processed a json file.
  3. Investigate forking or a message queue to have multiple processes work on a portion of the problem at the same time - either multiple PHP processes listening to a queue bucket or forked children of the parent process with their own process memory.
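Suggestion 2 could look roughly like the following — a minimal sketch assuming a writable data/ directory, with hypothetical function names (saveResponse and processPending are not from the original code):

```php
<?php
// Hypothetical sketch of suggestion 2: the fetching script writes each raw
// JSON body to its own file instead of decoding it, and a separate PHP
// process decodes the files one at a time, freeing each decoded array
// before the next file is read.

// In the fetcher: persist one raw JSON response per file.
function saveResponse($json, $dir = 'data/pending') {
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    $file = $dir . '/' . uniqid('match_', true) . '.json';
    file_put_contents($file, $json);
    return $file;
}

// In a second script: decode one file at a time, then move it to a
// "processed" directory to mark it as done.
function processPending($pendingDir = 'data/pending', $doneDir = 'data/processed') {
    if (!is_dir($doneDir)) {
        mkdir($doneDir, 0777, true);
    }
    foreach (glob($pendingDir . '/*.json') ?: array() as $file) {
        $decoded = json_decode(file_get_contents($file), true);
        // ... work with $decoded here ...
        unset($decoded); // release the array before the next iteration
        rename($file, $doneDir . '/' . basename($file));
    }
}
```

Because the processing script runs as its own process, all of its memory is returned to the OS when it exits, which sidesteps any leak in the fetching script entirely.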
Lewiss answered 31/10, 2015 at 18:55 Comment(0)

So my question is: am I doing something wrong? Is it normal? What can I do to fix this problem?

There is nothing wrong with your code because this is the normal behaviour, you are requesting data from an external source, which in turn is loaded into memory.

Of course a solution to your problem could be as simple as:

ini_set('memory_limit', -1);

Which allows for all the memory needed to be used.


When I'm using dummy content the memory usage stays the same between requests.

This is using PHP 5.5.19 in XAMPP on Windows.

There was a cURL-related memory leak bug that was fixed in PHP 5.5.4.

Abixah answered 29/10, 2015 at 13:25 Comment(2)
Thank you for your time. I was using PHP 5.3.9 and switched to the newest XAMPP, but the memory usage didn't change much. See the edited post.Minica
Although this will prevent the memory error as it is, it may not be a "sensible" solution unless the deployment environment is entirely under your control. It may simply solve the problem until you deploy to a server without enough memory, or where you run other processes and end up borking your server with PHP eating all the memory. It'll "remove" the problem, rather than "fixing" itHereof
