Analysis patterns: paged result sets


I've been working on a side project that involves scraping data off one site and displaying it on another. Technically, this is largely a solved problem, but I've been reading Domain-Driven Design and it's adding a lot of new context to the work I'm doing.

For context

I have a large dataset, too large to pull out of the database in its entirety on every single page request. Until this point, I've been working with Entities and Repositories, each layer being responsible for its own concerns. However, paginating result sets largely feels like a concept outside of the Repository. My gut instinct is to mirror the Doctrine findBy* methods and add $limit and $offset parameters to every method that returns an array. But I forget, or I'm lazy, and it's a hassle to go back and add parameters to methods that will clearly never need them. And it's ugly. So I wanted something better. Something more explicit about the concept of significant result sets.

Enter this article. It does a good job of explaining my own concerns, as well as outlining a solution that strikes me as incredibly useful.

if a collection can grow enough so that it cannot be embraced in a single query, it becomes a domain concern!

The solution

The solution presented is a Paged interface. Repository methods that would otherwise return a collection should return a Paged object instead. Part of the beauty of this solution is that I can define the interface in the Domain layer, and then the actual implementation can live with the Repository in the Infrastructure layer. So... (Domain layer)

interface Paged
{
  /**
   * Count the total available results in a set
   *
   * @return int : the count
   */
  public function count();

  /**
   * Define which slice of the total result set to return.
   *
   * @param int $limit : how many Entities to return
   * @param int $offset : where in the result set to start
   *
   * @return array : an array of matched Entities
   */
  public function getRange($limit, $offset);
}

and... (Infrastructure layer)

class DoctrineRepository implements Repository
{
  public function findByName($name)
  {
    $qb = $this->objectManager->createQueryBuilder();

    // Some QueryBuilder logic trimmed for legibility

    // Paged is an interface, so return the concrete Doctrine implementation
    return new DoctrinePaged($qb->getQuery());
  }
}

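class DoctrinePaged implements Paged
{
  private $query;

  public function __construct(\Doctrine\ORM\Query $query)
  {
    $this->query = $query;
  }

  // count() trimmed for legibility

  public function getRange($limit, $offset)
  {
    return $this->query
      ->setMaxResults($limit)
      ->setFirstResult($offset)
      ->getResult();
  }
}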

I'm now able to model subsets of the full collection inside my Domain, without worrying about any of the actual implementation details. How about that!
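
Because the Domain only ever sees the Paged interface, a throwaway in-memory implementation can stand in for Doctrine, in unit tests for example. What follows is a minimal sketch of that idea and not something from the article: ArrayPaged is a hypothetical name, and the interface is repeated here only so the snippet runs on its own.

```php
<?php

interface Paged
{
  public function count();
  public function getRange($limit, $offset);
}

// Hypothetical in-memory implementation, handy for tests
class ArrayPaged implements Paged
{
  private $items;

  public function __construct(array $items)
  {
    // Reindex so offsets always line up with positions
    $this->items = array_values($items);
  }

  public function count()
  {
    return count($this->items);
  }

  public function getRange($limit, $offset)
  {
    // array_slice returns an empty array once $offset passes the end
    return array_slice($this->items, $offset, $limit);
  }
}

$paged = new ArrayPaged(['a', 'b', 'c', 'd', 'e']);
$firstPage = $paged->getRange(2, 0); // ['a', 'b']
$lastPage = $paged->getRange(2, 4);  // ['e']
```

Anything written against Paged should behave the same whether it is handed this class or the Doctrine-backed one.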

Deeper insight and opportunities

On what I thought was an unrelated note, the scraper portion of my application was a terrifying mess. While it was still in development, I wanted it to be able to save "batches" of Entities so that it could be restarted halfway through if something went wrong, and so that I could start working with the smaller subset of Entities before the entire document had been scraped. To accomplish this, a commit named "Basically ruin SRP so that the scraper could save incrementally" was passing the Doctrine Repository into the HTML Repository so that it could incrementally flush the Doctrine cache. I'm not proud of what I did, but it's a website scraper, they're always garbage, right?

The addition of the Paged result set was a wonderfully timed insight. Now, I could have the HTML Repository return a Paged object, where getRange would incrementally return a more manageable set of Entities that the Application Service could save in the Doctrine Repository, and then request the next batch of HTML Entities. Single responsibility principle restored, code 100x more legible. I'm working in a project that is less than two weeks old, and refactoring is already paying off.

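class ApplicationService
{
  private $sourceRepository;
  private $destinationRepository;

  public function __construct($sourceRepository, $destinationRepository)
  {
    $this->sourceRepository = $sourceRepository;
    $this->destinationRepository = $destinationRepository;
  }

  // Copy Entities from source to destination in batches (method name is illustrative)
  public function import($limit = 10)
  {
    $offset = 0;
    $paged = $this->sourceRepository->findAll();

    while ($offset < $paged->count()) {
      $entities = $paged->getRange($limit, $offset);
      $this->destinationRepository->add($entities);
      $this->destinationRepository->flush();
      $offset += $limit;
    }
  }
}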
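
If the same offset bookkeeping starts showing up in more than one service, one option is to pull it into a small generator. This is only a sketch under my own assumptions: batches() is a hypothetical helper, and the anonymous class is a stand-in for a real repository result so that the snippet runs on its own.

```php
<?php

interface Paged
{
  public function count();
  public function getRange($limit, $offset);
}

// Hypothetical helper: yields one batch at a time until the set is exhausted
function batches(Paged $paged, $limit)
{
  if ($limit < 1) {
    throw new InvalidArgumentException('Limit must be at least 1');
  }

  // Count once up front so the total is not recomputed per batch
  $total = $paged->count();

  for ($offset = 0; $offset < $total; $offset += $limit) {
    yield $paged->getRange($limit, $offset);
  }
}

// Stand-in for a real repository result, holding the numbers 1 to 25
$paged = new class(range(1, 25)) implements Paged {
  private $items;

  public function __construct(array $items)
  {
    $this->items = $items;
  }

  public function count()
  {
    return count($this->items);
  }

  public function getRange($limit, $offset)
  {
    return array_slice($this->items, $offset, $limit);
  }
};

$sizes = [];
foreach (batches($paged, 10) as $entities) {
  // In the Application Service, add() and flush() would go here
  $sizes[] = count($entities);
}
// $sizes is now [10, 10, 5]
```

One trade-off: counting once up front saves repeated count queries, but it would miss Entities that appear in the source while the loop is still running.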