Doctrine Batch Processing Iterate High Memory Usage

I've been looking into using iterators for batch processing in Doctrine (http://docs.doctrine-project.org/en/2.0.x/reference/batch-processing.html). I've got a database with 20,000 images which I would like to iterate over.

I understand that using an iterator is supposed to prevent Doctrine from loading every row into memory. However, the memory usage of the two examples below is almost exactly the same. I am measuring memory before and after each run using (memory_get_usage() / 1024).

$query = $this->em->createQuery('SELECT i FROM Acme\Entities\Image i');
$iterable = $query->iterate();

while (($image = $iterable->next()) !== false) {
    // Do something here!
}

Memory usage for the iterator.

Memory usage before: 2823.36328125 KB
Memory usage after: 50965.3125 KB

This second example loads the entire result set into memory using the findAll method.

$images = $this->em->getRepository('Acme\Entities\Image')->findAll();

Memory usage for findAll.

Memory usage before: 2822.828125 KB
Memory usage after: 51329.03125 KB
Palumbo answered 8/5, 2014 at 15:12 Comment(0)
78

Batch processing with Doctrine is trickier than it seems, even with the help of iterate() and IterableResult.

Just as you expected, the greatest benefit of IterableResult is that it does not load all of the elements into memory. The second benefit is that it doesn't hold references to the entities you load, so IterableResult itself doesn't prevent the GC from freeing their memory.

However, there's another object, Doctrine's EntityManager (more specifically its UnitOfWork), which holds a reference to every object you have queried explicitly or implicitly (through EAGER associations).

In simple words: whenever you get entities back, whether from findAll(), findOneBy(), a DQL query, or even IterableResult, a reference to each of those entities is saved inside Doctrine. The reference is simply stored in an associative array; in pseudocode: $identityMap['Acme\Entities\Image'][0] = $image0;

So, because on each iteration of your loop your previous images (despite no longer being in the loop's scope or in IterableResult's scope) are still present inside this identityMap, the GC cannot clean them up, and your memory consumption ends up the same as when calling findAll().
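To make this concrete, here is what the identity map conceptually looks like after a few loop iterations (simplified pseudocode, not Doctrine's actual internals):

// after 3 iterations, the UnitOfWork still references every hydrated entity,
// keyed by class name and identifier
$identityMap = [
    'Acme\Entities\Image' => [
        1 => $image1, // hydrated on iteration 1, no longer in your scope
        2 => $image2, // hydrated on iteration 2, no longer in your scope
        3 => $image3, // hydrated on iteration 3, the current one
    ],
];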

Now let's go through the code and see what is actually happening:

$query = $this->em->createQuery('SELECT i FROM Acme\Entities\Image i');
// here Doctrine only creates the Query object, no DB access happens

$iterable = $query->iterate();
// unlike findAll(), no DB access happens upon this call either;
// the Query object is simply wrapped in an Iterator

while (($image_row = $iterable->next()) !== false) {
    // upon the first call to next() the DB WILL BE ACCESSED FOR THE FIRST TIME
    // the first resulting row will be returned
    // the row will be hydrated into an Image object
    // ----> A REFERENCE TO THE OBJECT WILL BE SAVED INSIDE $identityMap <----
    // and the row will be returned to you via next()

    // to access the actual Image object, you need to take the [0]th element of the array
    $image = $image_row[0];

    // Do something here!
    write_image_data_to_file($image, 'myimage.data.bin');

    // now, as this loop iteration ends, $image (and $image_row) go out of scope
    // and from what we see should be ready for GC;
    // however, because a reference to this specific Image object is still held
    // by the EntityManager (inside $identityMap), the GC will NOT clean it up
}
// and by the end of your loop you will have consumed as much memory
// as you would have by using findAll()

So the first solution is to actually tell Doctrine's EntityManager to detach the object from the $identityMap. I've also replaced the while loop with foreach to make it more readable.

foreach ($iterable as $image_row) {
    $image = $image_row[0];

    // do something with the image
    write_image_data_to_file($image);

    $entity_manager->detach($image);
    // this line tells Doctrine to remove the _reference_to_the_object_
    // from the identity map, so the object becomes ready for GC
}

However, the example above has a few flaws, even though it is featured in Doctrine's documentation on batch processing. It works well as long as your Image entity doesn't EAGERly load any of its associations. But if you're EAGERly loading any association, e.g.:

/**
 * @ORM\Entity
 */
class Image {

  /**
   * @ORM\Column(type="integer")
   * @ORM\Id
   */
  private $id;

  /**
   * @ORM\Column(type="string")
   */
  private $imageName;

  /**
   * @ORM\ManyToOne(targetEntity="Acme\Entity\User", fetch="EAGER")
   *
   * This association will be automatically (EAGERly) loaded by Doctrine
   * every time you query the Image entity from the DB, whether by findXXX(), DQL or iterate()
   */
  private $owner;

  // getters/setters left out for clarity
}

So if we use the same piece of code as above:

foreach ($iterable as $image_row) {
    $image = $image_row[0];
    // here, because of EAGER loading, we already have the owner entity
    // in memory, accessible via $image->getOwner()

    // do something with the image
    write_image_data_to_file($image);

    $entity_manager->detach($image);
    // here we detach the Image entity, but the $owner User entity is still
    // referenced in Doctrine's $identityMap, so we are still leaking memory
}

A possible solution is to use EntityManager::clear() instead of EntityManager::detach(), which clears the identity map COMPLETELY.

foreach ($iterable as $image_row) {
    $image = $image_row[0];
    // here, because of EAGER loading, we already have the owner entity
    // in memory, accessible via $image->getOwner()

    // do something with the image
    write_image_data_to_file($image);

    $entity_manager->clear();
    // now $identityMap is cleared of ALL the entities it held:
    // the Image and the User loaded in this iteration, and as a
    // SIDE EFFECT all OTHER entities you may have loaded earlier.
    // So while running this loop you must NOT rely on any entities
    // you have persist()ed or remove()ed; all changes since the
    // last flush() will be lost.
}
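If clear()'s side effects are a problem for you, an alternative is to explicitly detach the eagerly loaded association as well; detach() only cascades to associations marked cascade={"detach"}, so each entity has to be detached by hand. A sketch, assuming the Image entity exposes a getOwner() getter for the $owner association above:

foreach ($iterable as $image_row) {
    $image = $image_row[0];

    write_image_data_to_file($image);

    // detach the eagerly loaded User first, then the Image itself,
    // so neither keeps a reference alive in the identity map
    $entity_manager->detach($image->getOwner());
    $entity_manager->detach($image);
}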
Uncharitable answered 2/11, 2014 at 11:31 Comment(10)
Thank you for the detailed explanation. I will add from my experience that $iterable = $query->iterate(); can consume tons of memory as well, because of query buffering: php.net/manual/en/mysqlinfo.concepts.buffering.php – Sandfly
@dimitri_k Thank you for the details. Btw, is there a way to use HYDRATE_SCALAR with the iterator? If yes, would that be another alternative for dealing with this identityMap issue? – Schuh
@Schuh That's an interesting suggestion to use HYDRATE_SCALAR. Originally I didn't think of it as a viable option, because with HYDRATE_SCALAR you wouldn't get objects returned in a nice traversable object graph. You will NOT be able to follow a relationship from one object to another, as in $image->getUser()->getProfession()->getSalary(); they are returned as one big flat array (akin to a SELECT ... JOIN). HYDRATE_SCALAR may work for you when you have flat object relationships (1 or 2 levels). I am not sure how it works when more relationships are added to the object later... – Uncharitable
But I looked at the Doctrine source, and ArrayHydrator doesn't seem to be calling $this->registerManaged(..), which probably means that you shouldn't be getting memory leaks. github.com/doctrine/doctrine2/blob/master/lib/Doctrine/ORM/… – Uncharitable
"Unlike findAll(), upon this call no db access happens" is a completely false statement. The data is fully fetched but hydrated row by row. See the code in @BlocksByLukas's answer. – Chukar
Thank you @SalmanvonAbbas for being so scrupulous about my post. From what I remember from 5 years ago when I wrote it, the problem was the getIterator() / iterate() method within the ORM classes (I couldn't quickly find which file it was in), which had a lazy implementation. Whereas the answer by @BlocksByLukas that you've referred to shows the implementation not of the ORM, but of the underlying driver in the mysqli namespace, Doctrine\DBAL\Driver\Mysqli\MysqliStatement (and it uses SQL, not DQL). MysqliStatement is part of the Doctrine project (as a driver, but not the ORM). That's what I can remember off the top of my head... – Uncharitable
@SalmanvonAbbas Also, the point I was trying to make with the call to ->iterate() was that it simply returns a wrapper object (no actual physical request to MySQL or the MySQL driver is made yet), and only upon the request of the first item from the iterator is the query initiated. (Thus this behaviour doesn't contradict what @BlocksByLukas stated, assuming his research results were right and apply here.) P.S. In my previous comment I mistakenly mentioned two methods, getIterator() / iterate(); it should correctly read only iterate() (I can't edit that comment anymore). – Uncharitable
'detach' is deprecated. – Headsail
@AntonDuzenko So I wonder what the new approach is? From the Doctrine docs about batch processing we can read the following: "An ORM tool is not primarily well-suited for mass inserts, updates or deletions. Every RDBMS has its own, most effective way of dealing with such operations..." It seems the answer is: "Do not use Doctrine for batch processing" :S You may also find this Doctrine issue on the deprecation of EntityRepository#clear() helpful. – Uncharitable
At least we do want to use it for batch selects (many thousands of nested entities). EntityManager::clear() seems to do the job, BUT it still wastes a lot of CPU time on this cache nonsense. I'm generally displeased with Doctrine's internal sloppiness. @DimitryK If you want a job done properly, do it yourself ;) – Headsail
8

If you combine Doctrine's iterate() with a batching strategy, you should be able to iterate over large record sets.

For example:


$batchSize = 1000;
$numberOfRecordsPerPage = 5000;

// get the total number of records to iterate over
$totalRecords = $entityManager
    ->createQuery('SELECT COUNT(i.id) FROM Acme\Entities\Image i')
    ->getSingleScalarResult();

$totalProcessed = 0;

while ($totalProcessed < $totalRecords) {
    $query = $entityManager->createQuery('SELECT i FROM Acme\Entities\Image i')
        ->setMaxResults($numberOfRecordsPerPage) // maximum records to fetch at a time
        ->setFirstResult($totalProcessed);

    $iterableResult = $query->iterate();

    while (($row = $iterableResult->next()) !== false) {
        $image = $row[0];
        $image->updateSomethingImportant();

        if (($totalProcessed % $batchSize) === 0) {
            $entityManager->flush();
            $entityManager->clear();
        }
        $totalProcessed++;
    }
}

$entityManager->flush();


See Iterating large data in Doctrine 2

Emogene answered 3/5, 2019 at 14:29 Comment(1)
Iterate should be enough. Maybe he should disable the Doctrine cache? – Trudytrue
5

I strongly believe that batch processing with Doctrine, or any kind of iteration with MySQL (PDO or mysqli), is just an illusion.

@dimitri-k provided a nice explanation, especially about the unit of work. The problem is the misleading $query->iterate(), which doesn't really iterate over the data source: it's just a \Traversable wrapper around an already fully fetched result set.

Here's an example demonstrating that even with the Doctrine abstraction layer removed completely from the picture, we still run into memory issues:

echo 'Starting with memory usage: ' . memory_get_usage(true) / 1024 / 1024 . " MB \n";

$pdo  = new \PDO("mysql:dbname=DBNAME;host=HOST", "USER", "PW");
$stmt = $pdo->prepare('SELECT * FROM my_big_table LIMIT 100000');
$stmt->execute();

while ($rawCampaign = $stmt->fetch()) {
    // echo $rawCampaign['id'] . "\n";
}

echo 'Ending with memory usage: ' . memory_get_usage(true) / 1024 / 1024 . " MB \n";

Output:

Starting with memory usage: 6 MB 
Ending with memory usage: 109.46875 MB
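The memory growth here comes from MySQL's client-side result buffering (the same buffering Sandfly's comment above links to): by default, PDO copies the entire result set into PHP memory on execute(). As a sketch of the same test (same hypothetical table and credentials), disabling buffering streams rows from the server instead; the trade-off is that the connection can't run another query until this result set is fully consumed:

$pdo = new \PDO("mysql:dbname=DBNAME;host=HOST", "USER", "PW");

// disable client-side buffering: rows are fetched from the server one at a
// time instead of being copied into PHP memory up front
$pdo->setAttribute(\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$stmt = $pdo->prepare('SELECT * FROM my_big_table LIMIT 100000');
$stmt->execute();

while ($rawCampaign = $stmt->fetch()) {
    // process one row at a time; peak memory stays roughly flat
}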

Here is the disappointing getIterator() method:

// in Doctrine\DBAL\Driver\Mysqli\MysqliStatement

/**
 * {@inheritdoc}
 */
public function getIterator()
{
    $data = $this->fetchAll();

    return new \ArrayIterator($data);
}

You can use my little library to actually stream heavy tables using Doctrine, DQL, or just pure SQL, however you find appropriate: https://github.com/EnchanterIO/remote-collection-stream

Squarerigged answered 19/6, 2017 at 14:44 Comment(6)
You didn't explain why your library is better, though. – Speculator
@EnchanterIO How does your library improve the issue? stackoverflow.com/a/36201665 – Hoosegow
Thank you @blocksbylukas for the interesting idea, but I don't have time to understand it properly right now, as I don't use Doctrine that much anymore. – Uncharitable
I think you are wrong regarding batch processing and Doctrine, at least nowadays. It is not simply an array iterator over already-fetched data. I was just able to reduce memory consumption a lot by implementing the iterator pattern as outlined by others and in the docs. – Nabila
But what kind of queries happened behind the scenes? How much data did you fetch into memory? Surely Doctrine is doing some fantastic tricks, and I don't remember exactly what the issue was 3 years ago, but I think I wrote this answer because I was under the impression that the Iterator was going to somehow "stream" the rows from a table, and I believe it instead fetched all the rows and THEN did PHP optimization on the entities. It's probably different now, 3 years later. That little library (I honestly haven't used it since) is just very explicit about what it does: SQL offsets + an iterator for looping. – Squarerigged
Oh, I miss PHP! Feeling nostalgic now; this was one of the last things I did in PHP. – Squarerigged
2

tl;dr

When running the command, use --no-debug or set the SQL logger to null to prevent Doctrine from keeping every query it runs in memory.

Every now and then, call EntityManager::clear(); the memory leak will go down to almost zero.
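A minimal sketch of both steps (assuming the classic DBAL setSQLLogger() configuration hook; newer DBAL versions replace SQLLogger with middlewares):

// stop Doctrine DBAL from accumulating a log entry per executed query
$entityManager->getConnection()
    ->getConfiguration()
    ->setSQLLogger(null);

// ... and inside your batch loop, periodically:
$entityManager->clear(); // drop all entity references held by the UnitOfWork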

Giuliana answered 24/6, 2021 at 8:31 Comment(0)
2

For batching, I like to use yield this way:

ImageRepository.php

<?php

class ImageRepository extends ServiceEntityRepository
{
    // The rest of your code...

    public function findAllImages(): \Iterator
    {
        // Find the total amount of images.
        $total_images = $this->createQueryBuilder('i')
->select('COUNT(i.id)')
           ->getQuery()
           ->getSingleScalarResult();

        $processed_records = 0;

        while (true) {
            $query = $this->createQueryBuilder('i')
                // Process batches of 100 results; this can be
                // whatever amount you can fit in memory.
                ->setMaxResults(100)
                ->setFirstResult($processed_records)
                ->getQuery()
                ->getResult();

            $processed_records += count($query);

            // With yield you are not storing the full
            // amount of images in memory.    
            yield from $query;

            if ($processed_records >= $total_images) {
                break;
            }
        }
    }
}

Then you can use it somewhere:

<?php

class ImageCommand extends Command
{
    // The rest of your code...

    public function processAllImages(ImageRepository $repository): void
    {
        $images = $repository->findAllImages();

        foreach ($images as $image) {
            $this->processImage($image);
        }
    }
}
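One caveat, per the accepted answer: every yielded page of entities stays tracked by the UnitOfWork. For read-only processing you could clear the EntityManager after each page; a sketch of the end of the loop body above with that change (using the repository's inherited getEntityManager(), and assuming no pending changes need flushing):

yield from $query;

// drop the references the UnitOfWork holds to this page's entities,
// otherwise memory still grows with every batch
$this->getEntityManager()->clear();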
Teahouse answered 24/1, 2023 at 15:45 Comment(0)
-6

The results could be similar because the DB client may be allocating additional memory you can't see. Also, your code uses the IterableResult returned from $query->iterate(), which is meant to allow processing large results without memory problems. Just quick thoughts; hope it helps a little.

Knesset answered 8/5, 2014 at 15:23 Comment(1)
This answer does not help at all. – Clergy
