Prevent loading from remote source if file is larger than a given size
T

3

2

Let's say I want to load XML files from a remote server only if they are at most 10 MB in size.

Something like

$xml_file = "http://example.com/largeXML.xml";// size= 500MB

//PRACTICAL EXAMPLE: $xml_file = "http://www.cs.washington.edu/research/xmldatasets/data/pir/psd7003.xml";// size= 683MB

/* GOAL: do whatever can be done to stop DOMDocument from loading this large file, without having to load the file first just to check its size */

$dom =  new DOMDocument();

$dom->load($xml_file /*LOAD only IF the file_size is <= 10MB....else...echo 'File is too large'*/);

How can this be achieved? Any idea, alternative, or best approach would be highly appreciated.

I checked "PHP: Remote file size without downloading file", but when I try something like

var_dump(
    curl_get_file_size(
        "http://www.dailymotion.com/rss/user/dialhainaut/"
    )
);

I get string 'unknown' (length=7)

When I try with get_headers as suggested below, the Content-Length is missing in the headers, so this will not work reliably either.

Please advise how to determine the length and avoid passing the file to DOMDocument if it exceeds 10 MB.

Twaddle answered 21/4, 2016 at 6:30 Comment(5)
Did you look at filesize() function?Salzburg
@MawiaHL Can you try: var_dump(filesize("http://www.cs.washington.edu/research/xmldatasets/data/pir/psd7003.xml"))Twaddle
Page not found is the result.Salzburg
@MawiaHL -- This loads in the browser: w3.org/TR/2001/REC-xsl-20011015/xslspec.xml .... but doesn't work with filesize()..... in var_dump(filesize('https://www.w3.org/TR/2001/REC-xsl-20011015/xslspec.xml'))Twaddle
@DownVoters.... Please advise what is wrong with the Question. Thank You!Twaddle
X
2

OK, finally working. The headers solution was never going to work across arbitrary servers, because Content-Length is not always sent. In this solution, we open a file handle and read the XML line by line until the running total reaches the $max_B threshold. If the file is too big, we still pay the overhead of reading it up to the 10 MB mark, but it works as expected. If the file is smaller than $max_B, it proceeds...

$xml_file = "http://www.dailymotion.com/rss/user/dialhainaut/";
//$xml_file = "http://www.cs.washington.edu/research/xmldatasets/data/pir/psd7003.xml";

$fh = fopen($xml_file, "r");

if ($fh) {
    $file_string = '';
    $total_B = 0;
    $max_B = 10485760; // 10 MB
    // Run through the file piece by piece, concatenating into a string.
    // The length argument caps each read, so a file without line breaks
    // cannot pull one enormous "line" into memory.
    while (!feof($fh)) {
        $line = fgets($fh, 8192);
        if ($line === false) {
            break; // end of file or read error
        }
        $total_B += strlen($line);
        if ($total_B >= $max_B) {
            break; // threshold reached: stop reading
        }
        $file_string .= $line;
    }

    fclose($fh);

    if ($total_B < $max_B) {
        echo 'File ok. Total size = '.$total_B.' bytes. Proceeding...';
        // proceed
        $dom = new DOMDocument();
        $dom->loadXML($file_string); // NOTE the method change because we're loading from a string
    } else {
        // reject
        echo 'File too big! Max size = '.$max_B.' bytes.';
    }
} else {
    // fopen() can fail for reasons other than a 404
    echo 'Could not open the remote file!';
}
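The same idea can be wrapped in a reusable function that reads fixed-size chunks with fread() instead of lines. This is only a sketch: the names read_capped and load_xml_capped are hypothetical helpers introduced here, not part of the answer above, and it assumes allow_url_fopen is enabled for remote URLs.

```php
<?php
/**
 * Read an open stream until EOF, giving up as soon as more than
 * $maxBytes have arrived. Returns the data, or null when the stream
 * is too large or a read fails. (Hypothetical helper.)
 */
function read_capped($handle, int $maxBytes, int $chunkSize = 8192): ?string
{
    $data = '';
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            return null; // read error: treat as a rejection
        }
        $data .= $chunk;
        if (strlen($data) > $maxBytes) {
            return null; // over the limit: stop reading immediately
        }
    }
    return $data;
}

/**
 * Load XML from $url into a DOMDocument only if it fits in $maxBytes.
 * Returns null when the URL cannot be opened, is too large, or fails to parse.
 */
function load_xml_capped(string $url, int $maxBytes = 10485760): ?DOMDocument
{
    $handle = @fopen($url, 'rb');
    if ($handle === false) {
        return null;
    }
    $xml = read_capped($handle, $maxBytes);
    fclose($handle);
    if ($xml === null || $xml === '') {
        return null;
    }
    $dom = new DOMDocument();
    return @$dom->loadXML($xml) ? $dom : null;
}
```

At most $maxBytes plus one chunk is ever held in memory, regardless of how the remote file is split into lines.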
Xe answered 21/4, 2016 at 6:56 Comment(12)
This crashes when tested with: file_get_contents("http://www.cs.washington.edu/research/xmldatasets/data/pir/psd7003.xml"); // size: 683MB ... Kindly adviseTwaddle
The script hung and I had to restart the server... file_get_contents tries to load the entire 683 MB into memory before it is worked onTwaddle
Yeah, we're banging up against the maximum size for a single variable; it's the memory_limit setting in php.ini. We need a better solution - one that can test the file without loading the whole thing.Xe
Nice try! However, these files come from random servers. This fails on something like: `$xml_file = "dailymotion.com/rss/user/dialhainaut/";` `$head = array_change_key_case(get_headers($xml_file, TRUE));` The headers do not include Content-Length. Please see @Mawia HL's answer. Thanks at least for trying. :)Twaddle
Ok, I've added some simple error handling. You can update precisely how errors are handled as you like. But this does answer the question - if the file exists, and it's too big, you can now determine that without downloading it.Xe
That's really great! I'll give it some thought. I'm just wondering whether dailymotion.com users trying to load their XML content will always get the ERROR. A link like dailymotion.com/rss/user/dialhainaut/ shows the XML in the web browser, but the headers don't include 'Content-Length'. Thanks a lot, thoughTwaddle
get_headers() does a full GET request, so you need to provide a stream context and change the method to HEAD to prevent downloading the thing just to get its size. See example #2 in the PHP manual.Broida
Ok, finally working for daily motion, and also rejecting the oversize files. Check it out.Xe
@Xe So far your suggestion has worked perfectly in multiple tests. You deserve "un coup de chapeau" unless a different solution is suggested, but this works very well. Merci beaucoup!Twaddle
That's great! Thanks for the challenge - for sure the most interesting question of the day.Xe
@Xe you're awesome, buddy. If you like the question, please upvote it. Thanks again!Twaddle
I would still check whether the headers are there, because if they are, it'll save us the trouble of potentially downloading 10MB of garbage to decide that it's garbage.Vestal
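The two comments above about HEAD requests and checking the headers first can be sketched as follows. The function names are hypothetical, and passing a stream context as the third argument of get_headers() assumes PHP 7.1 or later. A null result means the server did not report a size, so a capped read is still needed as the fallback.

```php
<?php
/**
 * Extract Content-Length from the array returned by get_headers($url, true).
 * Returns null when the header is absent or not numeric. After redirects the
 * value can be an array with one entry per response; the last entry belongs
 * to the final response. (Hypothetical helper.)
 */
function content_length_from_headers(array $headers): ?int
{
    // Header names are case-insensitive, so normalise the keys first.
    $headers = array_change_key_case($headers, CASE_LOWER);
    if (!isset($headers['content-length'])) {
        return null;
    }
    $value = $headers['content-length'];
    if (is_array($value)) {
        $value = end($value);
    }
    $value = trim((string) $value);
    return preg_match('/^\d+$/', $value) ? (int) $value : null;
}

/**
 * Ask the server for the size with a HEAD request, so no body is transferred.
 * Returns null when the request fails or no Content-Length is sent.
 */
function remote_size_via_head(string $url): ?int
{
    $context = stream_context_create([
        'http' => ['method' => 'HEAD', 'timeout' => 10],
    ]);
    // Assumption: PHP >= 7.1, where get_headers() accepts a context.
    $headers = @get_headers($url, true, $context);
    return $headers === false ? null : content_length_from_headers($headers);
}
```

A caller could reject the URL outright when remote_size_via_head() returns a value above 10485760, and only fall back to reading the body with a byte cap when it returns null.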
S
1

10 MB is equal to 10485760 bytes. If Content-Length is not specified, the function falls back to cURL, which is available in PHP 5 and later. I found this code somewhere on SO but cannot remember where:

function get_filesize($url) {
    $headers = get_headers($url, 1);
    foreach (array('Content-Length', 'Content-length') as $key) {
        if (isset($headers[$key])) {
            // After redirects the value is an array; the last entry is the final response.
            $length = is_array($headers[$key]) ? end($headers[$key]) : $headers[$key];
            return (int) $length;
        }
    }
    // No Content-Length header: fall back to cURL. Note that this downloads
    // the body and then reports how many bytes were received.
    $c = curl_init();
    curl_setopt_array($c, array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        // The header value must stay on a single line.
        CURLOPT_HTTPHEADER => array('User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3'),
    ));
    curl_exec($c);
    return curl_getinfo($c, CURLINFO_SIZE_DOWNLOAD);
}

$filesize = get_filesize("http://www.dailymotion.com/rss/user/dialhainaut/");
if ($filesize <= 10485760) {
    echo 'Fine';
} else {
    echo $filesize.' bytes: file is too big';
}


Check demo here

Salzburg answered 21/4, 2016 at 6:44 Comment(9)
@Mawia HL We have tried that before; it fails when used on this XML: $head = array_change_key_case(get_headers("http://www.dailymotion.com/rss/user/dialhainaut/", TRUE)); The headers do not include Content-Length. Please try and advise. ThxTwaddle
@ErickBest, dailymotion.com/rss/user/dialhainaut/ does not return anything. It only returns "Page not found. The page you're looking for is either restricted or doesn't exist." So how can anyone know the size of the file when it does not exist at all?Salzburg
--Please try this in the Web-Browser: dailymotion.com/rss/user/dialhainautTwaddle
If there is no content at all, get_headers won't return anything.Salzburg
Sure... makes sense. but when run in the browser the XML shows up... if it is larger than 10MB it should not be loaded to the DOMDocumentTwaddle
get_headers() does a full GET request, so you need to provide a stream context and change the method to HEAD to prevent downloading the thing just to get its size. See example #2 in the PHP manual.Broida
@Mawia HL ... Yes... Confirmed ..., Working perfectly ... (Wish I could have 2 Accepted Answers)Twaddle
@Mawia HL The code in the demo works in multiple tests, but the code in your answer does not. Can you paste the demo code as your answer?Twaddle
@ErickBest, The advantage of using my answer over the accepted one is that some hosting providers disable the fopen() function. On a Windows webserver, when using fopen with a file path stored in a variable, PHP will return an error if the variable isn't encoded in ASCII. When using SSL, Microsoft IIS will violate the protocol by closing the connection without sending a close_notify indicator. So cURL is better.Salzburg
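For anyone who prefers cURL but still wants to stop at 10 MB when no Content-Length is sent, one possible approach is a write callback that aborts the transfer once the limit is passed. This is a sketch under assumptions: make_capped_writer and curl_fetch_capped are hypothetical names, and it relies on the documented rule that a CURLOPT_WRITEFUNCTION callback must return the exact number of bytes it handled or the transfer is aborted.

```php
<?php
/**
 * Build a CURLOPT_WRITEFUNCTION callback that appends each chunk to $buffer
 * and aborts once the total would exceed $maxBytes. (Hypothetical helper.)
 */
function make_capped_writer(string &$buffer, int $maxBytes): callable
{
    return function ($curl, string $chunk) use (&$buffer, $maxBytes): int {
        if (strlen($buffer) + strlen($chunk) > $maxBytes) {
            return -1; // any value other than strlen($chunk) aborts the transfer
        }
        $buffer .= $chunk;
        return strlen($chunk);
    };
}

/**
 * Download $url with cURL, giving up as soon as more than $maxBytes arrive.
 * Returns the body, or null when the transfer failed or was aborted.
 */
function curl_fetch_capped(string $url, int $maxBytes = 10485760): ?string
{
    $buffer = '';
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_WRITEFUNCTION  => make_capped_writer($buffer, $maxBytes),
    ]);
    // With a write callback set, curl_exec() returns true or false, not the body.
    $ok = curl_exec($curl);
    return $ok === false ? null : $buffer;
}
```

The returned string can then be passed to DOMDocument::loadXML(), and a null result can be reported as "file too large or unreachable".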
M
-1

Edit: New answer, a bit of a workaround:
You can't check the DOM element's length, but you can make a HEAD request and read the file size from the response headers:

<?php

function i_hope_this_works( $XmlUrl ) {
    // Assume failure until proven otherwise: -1 means the request failed
    // or the status was not acceptable.
    $result = -1;

    $request = curl_init( $XmlUrl );

    // Send a HEAD request (no body), so a 1 GB file costs the same as a 1 KB file.
    curl_setopt( $request, CURLOPT_NOBODY, true );
    curl_setopt( $request, CURLOPT_HEADER, true );
    curl_setopt( $request, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $request, CURLOPT_FOLLOWLOCATION, true );
    // Placeholder user agent string; substitute your own.
    curl_setopt( $request, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; size-check)' );

    $requesteddata = curl_exec( $request );
    curl_close( $request );

    if ( $requesteddata ) {
        $content_length = "unknown";
        $status = "unknown";

        // Also matches HTTP/2 status lines such as "HTTP/2 200".
        if ( preg_match( "/^HTTP\/\d(?:\.\d)? (\d{3})/", $requesteddata, $matches ) ) {
            $status = (int) $matches[1];
        }

        // Header names are case-insensitive. With redirects there are several
        // header blocks, so take the last Content-Length.
        if ( preg_match_all( "/^Content-Length:\s*(\d+)/mi", $requesteddata, $matches ) ) {
            $content_length = (int) end( $matches[1] );
        }

        // You can look up the status codes; 200 is OK, for example.
        if ( $status == 200 || ( $status > 300 && $status <= 308 ) ) {
            $result = $content_length;
        }
    }

    return $result;
}
?>

You should now be able to get the file size for a URL, as long as the server sends a Content-Length header, with:

$file_size = i_hope_this_works('yourURLasString');
Milieu answered 21/4, 2016 at 6:40 Comment(11)
RESULT: Warning: Illegal string offset 'size' in C:\.....\fileSize_tst\index.php on line 5Twaddle
What is the value of size ?Milieu
The size is unknown. It can be any size, but must not be > 10MB. The file comes from a remote server. (Please read the question once more)Twaddle
I mean what is in the variable sizeMilieu
@johannes You added the variable size in your answer, and I said that there is no way the variable size can be used because the file is loaded from a remote server with unknown size.Twaddle
Actually, size will be generated by PHP for you, but I don't know if it works with a remote server since I don't use it a lot. I will edit my answer to have more specific code. That at least works for meMilieu
@johannes-- The file is not being uploaded it is already uploaded somewhere in a different server, we simply want to read it using the DOMDocument but we want to STOP reading ...or ...**NOT-READ at all** if its size exceeds 10MB... i.e: FILES LIKE: http://www.cs.washington.edu/research/xmldatasets/data/pir/psd7003.xml ......or..... http://www.cs.washington.edu/research/xmldatasets/data/SwissProt/SwissProt.xml are just too largeTwaddle
Try int filesize ( string $filename ) filename is the path to the file, from which Server are you getting the files?Milieu
__From Multiple Servers... The Servers are extremely random. but the Main GOAL is we do whatever we can so that NO FILE with size beyond 10MB should be passed to the DOMDocumentTwaddle
Let us continue this discussion in chat.Milieu
This approach failed ... returns unknown... when tested with http://www.dailymotion.com/rss/user/dialhainaut/Twaddle
