PHP OOP design - limiting parameters to specific child classes while implementing generic interfaces

I often do PHP projects designed to scrape hierarchical data from web pages and save it to the DB (essentially, structure the data - think scraping government websites that do have the data, but do not provide it in a structured way). Each time, I try to come up with an OOP design that would allow me to achieve the following:

  • Easily replace current HTML parsing scripts with new ones, in case the original web page changes
  • Allow easy extensions of the data scraped and saved, as these projects are also meant for others to take and build on. My aim is to collect the "base" data, while others might decide to include something extra, change the way it is saved and so on.

So far I have yet to find the solution, but the closest I got is something like this:

I define an abstract class for data containers that would implement common tree-traversing functions:

abstract class DataContainer {

  protected $parent = NULL;
  protected $children = NULL;   

  public function getParent() {
    return $this->parent;
  }

  public function getChildren() {
    return $this->children;
  }             
}

And then I have the actual data containers. Imagine, I am scraping data on participation in parliamentary sessions down to a "specific question in a sitting" level. I would have SessionContainer, SittingContainer, QuestionContainer that would all extend the DataContainer.

The session, sitting and question data are each scraped from a different URL. Leaving the mechanism of getting the URL content aside, let's just say I need scraper classes, which would take the containers and a DOMDocument for the actual parsing. So I would define a generic interface like this:

interface Scraper {
  public function scrapeData(DOMDocument $Dom, DataContainer $DataContainer);   
}

Then the session, sitting and question would each have their own scraper, which implements the interface. But I'd also like to ensure that each scraper can only accept the container it is meant for. So it would look like:

class SessionScraper implements Scraper {
  public function scrapeData(DOMDocument $DOM, SessionContainer $DataContainer) {
  }
}

Finally, I would have a generic Factory class that also implements Scraper interface and just distributes the scraping to relevant scrapers. Like this:

public function scrapeData(DOMDocument $DOM, DataContainer $DataContainer) {
  // get the scraper class name from the configuration array
  $class = $this->config[get_class($DataContainer)];
  $scraper = new $class();
  $scraper->scrapeData($DOM, $DataContainer);
}

This is the class that would be actually called in the code. Very similarly, I could deal with saving to DB - each data container could have its DBSaver class, which would implement DBSaver interface. Again, all the calls could be done via the Factory class, which would also implement the DBSaver interface.

Everything would be perfect, but the problem is that classes implementing the interface must implement the exact signature of the interface. E.g. the method SessionScraper::scrapeData cannot accept only SessionContainer objects, it must accept all DataContainer objects. But it is not meant to!

Finally, the question:

  • Is my design wrong and I should be structuring everything in a completely different way? (how?), or:
  • My design is OK, it's just that I need to enforce types within methods with instanceof and similar checks instead of enforcing it via typehinting?
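To make the second option concrete, here is a minimal sketch of what a runtime check could look like. The class bodies are stripped-down stand-ins for the ones above, and the choice of InvalidArgumentException is just an assumption for illustration:

```php
<?php
// Minimal stand-ins for the classes described above.
abstract class DataContainer {}
class SessionContainer extends DataContainer {}
class SittingContainer extends DataContainer {}

interface Scraper {
  public function scrapeData(DOMDocument $DOM, DataContainer $DataContainer);
}

class SessionScraper implements Scraper {
  // The signature has to match the interface, so the narrower
  // requirement is checked at runtime instead of in the type hint.
  public function scrapeData(DOMDocument $DOM, DataContainer $DataContainer) {
    if (!($DataContainer instanceof SessionContainer)) {
      throw new InvalidArgumentException(
        'SessionScraper expects a SessionContainer, got ' . get_class($DataContainer)
      );
    }
    // ... parse $DOM and fill $DataContainer here ...
  }
}
```

The trade-off is that a mismatch only shows up when the method is actually called, not when the class is declared.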

Thanks in advance for all the suggestions / criticisms. I am completely happy with somebody overturning this code on its head, if necessary!

Toadstool answered 6/10, 2011 at 21:40 Comment(0)

The name Container jumps out at me. It is very generic, and you might need something more dynamic. I think you have Data and you classify it, so it has a type.

So instead of hardcoding the exact class into the type hint, you should resolve this dynamically.

If each Container had a type, the Scraper could signal whether or not it is applicable to that type of Container.

The concrete form of scraping is actually the strategy you use for specific data to parse it. Your container encapsulates this strategy providing an interface to the normalized data.

You only need to add some logic/contract between Container and Scraper so that they can talk to each other. You can put this contract inside the interfaces of both.
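One way such a contract could look, as a rough sketch only: the names TypedData, getType() and supports() and the type string 'session' are made up for illustration, they are not from the question.

```php
<?php
// Hypothetical contract: every container reports a type string ...
interface TypedData {
  public function getType();
}

// ... and every scraper says whether it can handle a given type.
interface Scraper {
  public function supports($type);
  public function scrapeData(DOMDocument $DOM, TypedData $Data);
}

class DataContainer implements TypedData {
  private $type;

  public function __construct($type) {
    $this->type = $type;
  }

  public function getType() {
    return $this->type;
  }
}

class SessionScraper implements Scraper {
  public function supports($type) {
    return $type === 'session';
  }

  public function scrapeData(DOMDocument $DOM, TypedData $Data) {
    // The scraper refuses data it has not declared support for.
    if (!$this->supports($Data->getType())) {
      throw new InvalidArgumentException('Unsupported type: ' . $Data->getType());
    }
    // ... parsing goes here ...
  }
}
```

The interface signature stays generic, and the restriction lives in supports() where it can be queried before any scraping happens.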

This would also allow you to have a Scraper that can deal with multiple types if you want to stretch it.

For your Container, take a look at the SPL as well: if you implement some of its interfaces, you get iterators (and recursive iterators) for free. This might be the generic structure you're referring to, and the SPL could boost the usability of your Container classes.
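A small sketch of the SPL idea, assuming PHP 7.1 or later for the return types. ContainerIterator, addChild() and the public $name property are invented for the example; IteratorAggregate, ArrayIterator, RecursiveIterator and RecursiveIteratorIterator are the real SPL pieces.

```php
<?php
// Hypothetical iterator that walks DataContainer children recursively.
class ContainerIterator extends ArrayIterator implements RecursiveIterator {
  public function hasChildren(): bool {
    return count($this->current()->getChildren()) > 0;
  }

  public function getChildren(): ?RecursiveIterator {
    return new self($this->current()->getChildren());
  }
}

abstract class DataContainer implements IteratorAggregate {
  protected $parent = null;
  protected $children = array();
  public $name; // illustrative payload only

  public function __construct($name) {
    $this->name = $name;
  }

  public function addChild(DataContainer $child) {
    $child->parent = $this;
    $this->children[] = $child;
    return $child;
  }

  public function getParent() {
    return $this->parent;
  }

  public function getChildren() {
    return $this->children;
  }

  // Makes the container usable in foreach and with recursive iterators.
  public function getIterator(): Traversable {
    return new ContainerIterator($this->children);
  }
}

class SessionContainer extends DataContainer {}
class SittingContainer extends DataContainer {}
class QuestionContainer extends DataContainer {}

// Usage: collect the names of all descendants, parents before children.
$session = new SessionContainer('session 1');
$sitting = $session->addChild(new SittingContainer('sitting 1'));
$sitting->addChild(new QuestionContainer('question 1'));

$names = array();
$walker = new RecursiveIteratorIterator(
  $session->getIterator(),
  RecursiveIteratorIterator::SELF_FIRST
);
foreach ($walker as $node) {
  $names[] = $node->name;
}
```

With this in place the tree-traversing code does not have to be written by hand in every container class.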

You do not need to hardcode everything in OOP, you can keep things dynamic and especially in PHP you normally resolve things at runtime.

This will also make it easier to replace Scrapers with a new version. As Scrapers would now have a type by definition (as suggested above), you can resolve at runtime which concrete class should do the scraping, e.g. by dynamically loading them from .php files in a nice file-system structure.
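Resolving the scraper at runtime could be sketched like this. ScraperResolver, register(), resolve() and the type strings are hypothetical names, and the scrapers are registered by hand here rather than loaded from files:

```php
<?php
interface Scraper {
  public function supports($type);
}

class SessionScraper implements Scraper {
  public function supports($type) {
    return $type === 'session';
  }
}

class SittingScraper implements Scraper {
  public function supports($type) {
    return $type === 'sitting';
  }
}

// Hypothetical resolver: asks each registered scraper whether it
// handles the requested type and returns the first one that does.
class ScraperResolver {
  private $scrapers = array();

  public function register(Scraper $scraper) {
    $this->scrapers[] = $scraper;
  }

  public function resolve($type) {
    foreach ($this->scrapers as $scraper) {
      if ($scraper->supports($type)) {
        return $scraper;
      }
    }
    throw new RuntimeException('No scraper registered for type: ' . $type);
  }
}

$resolver = new ScraperResolver();
$resolver->register(new SessionScraper());
$resolver->register(new SittingScraper());
$scraper = $resolver->resolve('sitting');
```

Swapping in a new scraper version then means registering a different class; the calling code stays the same. Loading the classes from files could be layered on top with an autoloader such as spl_autoload_register.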

Just my 2 cents.

Sutphin answered 7/10, 2011 at 0:4 Comment(2)
thanks for the extensive answer - triggered a couple other ideas, too! One clarification - do I understand you correctly that you essentially suggest having one Data/Container class for keeping all the data, and identifying it by a type property rather than by creating child classes? Or would it be both a type property and child classes, with the scrapers taking only the type into account? - Toadstool
I don't know your data specifically, so it's hard to tell. If the data is very similar and only differs in its properties, you don't need to create many data classes; you can go with dynamic properties. That's much better for the overall application later. Mostly the scrapers will change, sometimes the data with them. Otherwise you would need to create a new data class every time some website changed a little. Not good :) - Sutphin
