CakePHP: recommended way to iterate over a huge table and generate a sitemap?
Q

6

6

I'm trying to create an XML sitemap with CakePHP from a table that currently has more than 50,000 records, each record corresponding to a URI in the sitemap. The problem I'm facing is that CakePHP runs out of memory while generating it, for two reasons:

  1. A find('all') builds a huge associative array of the entire set of 50,000 URIs.
  2. Since I don't want to output HTML from the controller itself, I transfer the associative array containing the URI, priority, change frequency, etc. to the view with a $this->set() call -- which again is huge, containing 50,000 entries.

Is it possible at all to do this while following MVC and CakePHP guidelines?
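
For reference, this is roughly what the current action looks like (a minimal sketch in CakePHP 2.x syntax; the controller, model, and field names are placeholders, not my actual code):

// app/Controller/SitemapsController.php -- illustrative only
class SitemapsController extends AppController {
    public $uses = array('Url');   // hypothetical model backing the 50,000-row table

    public function index() {
        // (1) find('all') materialises every row as one big associative array
        $urls = $this->Url->find('all');

        // (2) the whole array is then handed to the view in one go
        $this->set('urls', $urls);
        $this->layout = false;     // the view template renders the XML
    }
}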

Quevedo asked 3/3, 2010 at 6:34 Comment(0)
B
2

Are you sure you have to run out of memory on 50,000 records? Even if a row is 1 KB in size (pretty huge), you would only have to deal with ~50 MB of data. My P1 had enough RAM to handle that. Set memory_limit in php.ini higher than the default. (Consider also tweaking max_execution_time.)

On the other hand, if you consider the data set too huge and processing it too resource-intensive, you should not serve that page dynamically; it is perfect DDoS bait. (At the very least I would cache it heavily.) You could schedule a cron job to re-generate the page every X hours with a server-side script that is free from the MVC penalty of handing all the data to the view at once; it could work through the rows sequentially.
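
A minimal sketch of that cron-driven approach, written as a CakePHP 2.x console shell (the SitemapShell name, the Url model, its loc/priority fields, and the cron schedule are all assumptions):

// app/Console/Command/SitemapShell.php -- run from cron, e.g.:
//   0 */6 * * * /path/to/app/Console/cake sitemap
class SitemapShell extends AppShell {
    public $uses = array('Url');   // hypothetical model for the 50,000-row table

    public function main() {
        $file = fopen(TMP . 'sitemap.xml', 'w');
        fwrite($file, '<?xml version="1.0" encoding="UTF-8"?>' . "\n");
        fwrite($file, '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n");

        // Work through the table in small chunks instead of loading it all at once
        $limit = 1000;
        $page = 1;
        while ($rows = $this->Url->find('all', array(
            'fields' => array('Url.loc', 'Url.priority'),
            'limit' => $limit,
            'page' => $page++,
        ))) {
            foreach ($rows as $row) {
                fwrite($file, '  <url><loc>' . h($row['Url']['loc']) . '</loc>'
                    . '<priority>' . h($row['Url']['priority']) . '</priority></url>' . "\n");
            }
        }

        fwrite($file, '</urlset>' . "\n");
        fclose($file);
    }
}

The finished file can then be served statically (or copied into webroot/), so no request ever has to build it on the fly.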

Batch answered 3/3, 2010 at 20:33 Comment(1)
The sitemap works absolutely fine on my local dev box. Once I deploy to my shared hosting, where I have very limited memory, it explodes. This is exactly what I was thinking of; I just wanted to double-check and make sure I'm going that way because there are no CakePHP/MVC options left. Thanks!Quevedo
T
4

I know this question is old, but for really huge queries there is still no good solution, I think.

To iterate through a huge result set you can use DboSource methods directly.

First, get the DBO:

$dbo = $this->Model->getDataSource();

Build the query:

$sql = $dbo->buildStatement($options);

Then execute the statement and iterate through the results:

if ($dbo->execute($sql)) {
    while ($dbo->hasResult() && $row = $dbo->fetchResult()) {
        // $row is an array with the same structure as a find('first') result
    }
}
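
Continuing from the snippet above, one way to use that loop without accumulating anything in memory is to stream each row straight into the sitemap file (a sketch; the Url alias and loc field are assumptions, and note the comment below about buildStatement() needing a second argument in newer 2.x releases):

$file = fopen(TMP . 'sitemap.xml', 'w');
fwrite($file, '<?xml version="1.0" encoding="UTF-8"?>' . "\n" . '<urlset>' . "\n");

if ($dbo->execute($sql)) {
    while ($dbo->hasResult() && $row = $dbo->fetchResult()) {
        // Write one <url> entry per row; nothing is collected in a PHP array
        fwrite($file, '<url><loc>' . h($row['Url']['loc']) . '</loc></url>' . "\n");
    }
}

fwrite($file, '</urlset>' . "\n");
fclose($file);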
Turenne answered 29/2, 2012 at 14:52 Comment(1)
As of CakePHP 2.4.4, $dbo->buildStatement() requires two parameters. Furthermore, $dbo->buildStatement() only returns invalid SQL without the table name. After digging into the source I used $this->generateAssociationQuery($model, null, null, null, null, $queryData, false, $null); insteadNuggar
W
3

I had a similar problem this week and stumbled across the Containable behavior. It allows you to cut down any relationship-related queries (if you have any).

The best solution would be to use LIMIT and OFFSET programmatically and loop through the record set in small chunks. This saves you from stuffing 50K records into memory at once.
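
A hedged sketch of that chunked loop, assuming a Url model with id and loc fields and an arbitrary chunk size of 1000:

$limit = 1000;
$offset = 0;
do {
    // Fetch one small slice of the table per query
    $chunk = $this->Url->find('all', array(
        'fields' => array('Url.loc'),
        'order' => array('Url.id' => 'ASC'),   // stable order so OFFSET paging is deterministic
        'limit' => $limit,
        'offset' => $offset,
    ));
    foreach ($chunk as $row) {
        // append $row['Url']['loc'] to the sitemap output here
    }
    $offset += $limit;
} while (count($chunk) === $limit);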

Wendiewendin answered 3/3, 2010 at 7:34 Comment(2)
Thanks for the reply. I don't have any related tables, and the query is more or less a simple SELECT operation. Firing multiple queries is something I wanted to avoid. Plus, even if I do that, how do I still transfer the data to the view?Quevedo
Would you be displaying all 50k records in the same view? If not, most pagination patterns work well with the limit/offset query. If you do have to display them all at once, maybe dig into your php.ini (if you have admin access to your server) and change memory_limit to a higher value. That might fix your memory problems with a find('all'). If your table has a lot of fields, use the 'fields' value to narrow down what is necessary (as mentioned in another comment).Wendiewendin
P
2

find('all') is way too greedy; you'll have to be more specific if you don't want to run out of memory.

As stated above, use the Containable behavior. If you only need results from your table (without associated tables), and only a couple of fields, a more explicit query like this should be better:

$results = $this->YourModel->find('all', array(
    'contain' => false,  // skip all associated models
    'fields' => array('YourModel.name', 'YourModel.url')
));

You should also consider adding an HTML cache mechanism (CakePHP has a built-in one, or you can use the one suggested by Matt Curry).

Of course it will be a cached version and won't be perfectly up to date with your list. If you want more control, you can always save the result in Cake's cache (using Cache::write), using the afterSave/afterDelete callbacks of your model to update the cached value and recreate the cached XML file from there.
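
For example, a minimal sketch of the callback side; this variant simply invalidates the cached copy so the next run rebuilds it (the 'sitemap_xml' cache key and the model name are assumptions):

// In the model (e.g. app/Model/Url.php -- name is an assumption)
public function afterSave($created, $options = array()) {
    // Drop the cached sitemap whenever a record is added or changed
    Cache::delete('sitemap_xml');
}

public function afterDelete() {
    Cache::delete('sitemap_xml');
}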

Pace answered 3/3, 2010 at 13:57 Comment(1)
Specifically, look at the "fields" argument to find().Agenesis
K
1

Have you tried unbindModel (if you have relations)...

Whenever I have to do huge queries in CakePHP, I just use the "regular" MySQL functions like mysql_query, mysql_fetch_assoc, etc. Much faster, and no running out of memory...
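
A rough sketch of that approach (credentials, database, and column names are assumptions; the mysql_* extension is deprecated in later PHP versions, but it was the standard at the time):

// Bypass the ORM and stream the rows with the plain mysql_* API
$link = mysql_connect('localhost', 'db_user', 'db_pass');  // assumed credentials
mysql_select_db('my_database', $link);                     // assumed database name

$result = mysql_query('SELECT loc FROM urls', $link);
// mysql_unbuffered_query() can be used instead if even the buffered result set is too large
while ($row = mysql_fetch_assoc($result)) {
    // Handle one row at a time, e.g. write it to the sitemap file
    echo '<url><loc>' . htmlspecialchars($row['loc']) . '</loc></url>' . "\n";
}
mysql_free_result($result);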

Kitchens answered 3/3, 2010 at 6:47 Comment(1)
This is a single-table SELECT. Looks like old-school MySQL is the efficient and only way to go, but I'm way outside CakePHP by then. And I still have the problem of how to render the resulting associative array.Quevedo
D
1

Use https://github.com/jamiemill/cakephp_find_batch or implement this logic yourself.

Dogface answered 30/6, 2012 at 18:17 Comment(0)
