Working with large Magento collections

Magento collections are a very convenient way of grouping models together. We can grab a group of models from the database with little to no SQL code, thanks to the special methods that the collection model provides. Collections can get really big (products, customers, categories, lots of attributes etc.) and become tricky to work with. Let’s see what can be done to make large collections more manageable.
Let’s say we want to change the first name of every customer on our site to uppercase. One would probably do something like this:
$customers = Mage::getModel('customer/customer')->getCollection()->addAttributeToSelect(array('firstname'), 'inner');
foreach ($customers as $customer) {
    $customer->setFirstname(strtoupper($customer->getFirstname()));
    $customer->save();
}
The above code works perfectly well for smaller collections, but with large ones the script will eventually run out of memory and produce the following error:
"Fatal error: Allowed memory size of X bytes exhausted (tried to allocate X bytes) in magento_site\lib\Varien\Data\Collection.php on line 550".
Why is that so?
Well, a collection is basically an array of objects, in this case an array of customer objects. The more objects (customers) we have and the more attributes we add to the selection, the bigger the resulting array becomes, which leads to high memory usage.
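To see this in practice, here is a rough check with PHP’s memory_get_usage() around an explicit load() of the collection (a minimal sketch; the actual numbers will vary with the number of customers and attributes):
$customers = Mage::getModel('customer/customer')
    ->getCollection()
    ->addAttributeToSelect(array('firstname'), 'inner');

echo memory_get_usage(true) . PHP_EOL; // before the collection is loaded
$customers->load();                    // load() pulls every row into PHP objects at once
echo memory_get_usage(true) . PHP_EOL; // after loading; grows with the size of the collection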
A solution to this kind of problem is built into Magento in the form of the Mage_Core_Model_Resource_Iterator model. It enables us to fetch rows from the database one by one, in contrast to loading all results at once like we did in the example above. The main part of the iterator is the walk() method, which needs two mandatory parameters: the collection query string and the callback method(s).
If we look at the method implementation:
public function walk($query, array $callbacks, array $args=array(), $adapter = null)
{
    $stmt = $this->_getStatement($query, $adapter);
    $args['idx'] = 0;
    while ($row = $stmt->fetch()) {
        $args['row'] = $row;
        foreach ($callbacks as $callback) {
            $result = call_user_func($callback, $args);
            if (!empty($result)) {
                $args = array_merge($args, $result);
            }
        }
        $args['idx']++;
    }
    return $this;
}
we can see that it takes the provided query string, executes it, fetches the results one by one and sends them to the callback method(s). The result passed to each callback is placed in the $args array, which holds the data for the current row. In the callback function we can access that data with $args['row'].
Here is what our example would look like using iterator:
public function uppercaseAction()
{
    // get customer collection
    $customers = Mage::getModel('customer/customer')->getCollection()->addAttributeToSelect(array('firstname'), 'inner');
    // call iterator walk method with collection query string and callback method as parameters
    Mage::getSingleton('core/resource_iterator')->walk($customers->getSelect(), array(array($this, 'customerCallback')));
}

// callback method
public function customerCallback($args)
{
    $customer = Mage::getModel('customer/customer'); // get customer model
    $customer->setData($args['row']); // map data to customer model
    $customer->setFirstname(strtoupper($customer->getFirstname())); // set value of firstname attribute
    $customer->getResource()->saveAttribute($customer, 'firstname'); // save only changed attribute instead of whole object
}
Things are fairly simple here. We define the collection and pass its SQL query to the iterator's walk() method. The iterator then executes the given query and calls the customerCallback() method on each row returned from the database. In the callback method we instantiate a customer model, map the data from $args['row'] onto it and change its firstname attribute to uppercase.
One thing to watch out for is object saving. Since we are saving lots of models, we want our script to be as efficient as possible. Calling the "classic" save() on a model is a pretty heavy operation that takes time and resources. Since we are changing only one attribute, it is best to use the saveAttribute() method, which is much faster.
That's it, no more "out of memory" errors :).
Happy coding!
10 comments
Have you done anything similar with Magento 2?
Hey Guys,
I like this solution BUT I'd rather not "walk" and query each single row individually.
Jarrod had a point back in 2014 (see below) for the best solution. Use a ResourceCollection instead of a Collection and then you will be able to use pagination:
->setCurPage($pageNum)
->setPageSize($size)
Your script will be much more efficient if you optimize the size of one step/page and read 10 or 100 rows each time and loop over those 10 or 100 (or however many) records.
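A minimal sketch of that idea (the page size, the customer resource collection and the loop's exit condition are assumptions to adapt to your own data):
$pageSize = 100;
$pageNum  = 1;
do {
    // a fresh page of customers on every pass; only $pageSize models live in memory at a time
    $customers = Mage::getResourceModel('customer/customer_collection')
        ->addAttributeToSelect('firstname')
        ->setPageSize($pageSize)
        ->setCurPage($pageNum);

    foreach ($customers as $customer) {
        $customer->setFirstname(strtoupper($customer->getFirstname()));
        $customer->getResource()->saveAttribute($customer, 'firstname');
    }

    $pageNum++;
} while ($customers->count() == $pageSize); // a short (or empty) page means we are done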
I had a table with 2.5 million rows which was using an excessive amount of RAM (2.7GB). With the help of this article and using the code in my Gist it uses 5.4MB
https://gist.github.com/cmtickle/0190fafd5c741dceeea87dc507f91b9e
It’s still slow, because there is only one update per SQL query. I use the insertOnDuplicate() function to update thousands of rows per query; it's hundreds of times faster 🙂
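A rough sketch of that bulk approach with the write adapter's insertOnDuplicate() (the table and column names are placeholders for a flat table; note that the article's firstname attribute actually lives in an EAV value table, so this maps most directly to flat tables):
$resource  = Mage::getSingleton('core/resource');
$write     = $resource->getConnection('core_write');
$tableName = $resource->getTableName('mymodule/customer_flat'); // hypothetical flat table

$rows = array();
foreach ($collectedData as $row) { // $collectedData gathered earlier, e.g. inside the iterator callback
    $rows[] = array(
        'entity_id' => $row['entity_id'],
        'firstname' => strtoupper($row['firstname']),
    );
    if (count($rows) >= 1000) { // flush in batches of ~1000 rows per query
        $write->insertOnDuplicate($tableName, $rows, array('firstname'));
        $rows = array();
    }
}
if (!empty($rows)) {
    $write->insertOnDuplicate($tableName, $rows, array('firstname'));
}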
Great.
I have the same problem with an order collection, and I solved it this way (I had not yet read your post 🙂 ).
When reading all the information of all the orders, my server says: out of memory (16G!!!). But I select only the fields that I need with addFieldToSelect.
With this, no problem.
So perhaps you can use this, since you want to work with firstname:
$customers = Mage::getModel('customer/customer')->getCollection()->addAttributeToSelect(array('firstname'), 'inner')->addFieldToSelect('firstname');
Of course you will still need memory, but in my example it uses only 14% of total memory, instead of hitting 100% and crashing.
Really a great article on managing large collections.
Really helpful for optimized collection update!
Cool one!!
I would be happy to see more examples of this.
Thanks
I’m trying to confirm my existing customer database (~3,500 customers) imported from an older website, and this method still times out.
Another good method for keeping your memory usage low on collections is to make use of ->setCurPage($pageNum) and ->setPageSize($size).
Then, when you iterate through the collection, you wrap your foreach loop in a while loop and increment your page (starting at 1). Use memory_get_usage() and tweak the page size as necessary to keep your numbers in your desired range.