I often work on PHP projects that scrape hierarchical data from web pages and save it to the DB (essentially, structuring the data: think of government websites that publish the data, but not in a structured way). Each time, I try to come up with an OOP design that would allow me to achieve the following:
- Easily replace current HTML parsing scripts with new ones, in case the original web page changes
- Allow easy extension of the data scraped and saved, as these projects are also meant for others to take and build on. My aim is to collect the "base" data, while others might decide to include something extra, change the way it is saved, and so on.
I have yet to find the solution, but the closest I have got is something like this:
I define an abstract class for data containers that would implement common tree-traversing functions:
abstract class DataContainer {
    protected $parent = NULL;
    protected $children = NULL;

    public function getParent() {
        return $this->parent;
    }

    public function getChildren() {
        return $this->children;
    }
}
And then I have the actual data containers. Imagine I am scraping data on participation in parliamentary sessions, down to the level of a specific question in a sitting. I would have SessionContainer, SittingContainer and QuestionContainer, which would all extend DataContainer.
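To make the hierarchy concrete, here is a minimal sketch of the three containers. The addChild() helper and the public fields ($title, $date, $text) are hypothetical and only there for illustration; my real containers would hold whatever each page provides. I also initialise $children to an empty array here so that the helper can append to it.

```php
<?php
// Base class as described above, plus a hypothetical addChild() helper
// (an assumption for this sketch) so the tree can actually be built.
abstract class DataContainer {
    protected $parent = NULL;
    protected $children = array();

    public function getParent() {
        return $this->parent;
    }

    public function getChildren() {
        return $this->children;
    }

    // Hypothetical helper: links a child container to this one.
    public function addChild(DataContainer $child) {
        $child->parent = $this;
        $this->children[] = $child;
    }
}

// Placeholder fields; the real ones depend on the scraped pages.
class SessionContainer extends DataContainer {
    public $title;
}

class SittingContainer extends DataContainer {
    public $date;
}

class QuestionContainer extends DataContainer {
    public $text;
}

// Build a session -> sitting -> question chain.
$session = new SessionContainer();
$sitting = new SittingContainer();
$question = new QuestionContainer();
$session->addChild($sitting);
$sitting->addChild($question);
```

With this in place, $question->getParent() gives back the sitting, and calling getParent() again on that gives back the session.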
The session, sitting and question data are each scraped from a different URL. Leaving the mechanism of getting the URL content aside, let's just say I need scraper classes, which would take a container and a DOMDocument for the actual parsing. So I would define a generic interface like this:
interface Scraper {
    public function scrapeData(DOMDocument $Dom, DataContainer $DataContainer);
}
Then the session, sitting and question would each have their own scraper, which implements the interface. But I'd also like to ensure that each scraper can only accept the container it is meant for. So it would look like:
class SessionScraper implements Scraper {
    public function scrapeData(DOMDocument $DOM, SessionContainer $DataContainer) {
    }
}
Finally, I would have a generic Factory class that also implements the Scraper interface and just distributes the scraping to the relevant scrapers. Like this:
public function scrapeData(DOMDocument $DOM, DataContainer $DataContainer) {
    // get the scraper class name from the configuration array
    $class = $this->config[get_class($DataContainer)];
    $scraper = new $class();
    $scraper->scrapeData($DOM, $DataContainer);
}
This is the class that would actually be called in the code. Very similarly, I could deal with saving to the DB: each data container could have its own saver class, which would implement a DBSaver interface. Again, all the calls could be made via the Factory class, which would also implement the DBSaver interface.
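A rough sketch of what I mean on the saving side, mirroring the scraper setup. The names saveData(), SessionDBSaver, SittingDBSaver and DBSaverFactory are placeholders I made up for this example, and the savers only return a string instead of touching a real database.

```php
<?php
abstract class DataContainer {}
class SessionContainer extends DataContainer {}
class SittingContainer extends DataContainer {}

// Hypothetical interface; a real version would probably also take a DB handle.
interface DBSaver {
    public function saveData(DataContainer $DataContainer);
}

class SessionDBSaver implements DBSaver {
    public function saveData(DataContainer $DataContainer) {
        // A real implementation would run an INSERT/UPDATE here.
        return 'saved ' . get_class($DataContainer) . ' via ' . get_class($this);
    }
}

class SittingDBSaver implements DBSaver {
    public function saveData(DataContainer $DataContainer) {
        return 'saved ' . get_class($DataContainer) . ' via ' . get_class($this);
    }
}

// The factory implements the same interface and only dispatches.
class DBSaverFactory implements DBSaver {
    private $config;

    // $config maps container class names to saver class names.
    public function __construct(array $config) {
        $this->config = $config;
    }

    public function saveData(DataContainer $DataContainer) {
        $class = $this->config[get_class($DataContainer)];
        $saver = new $class();
        return $saver->saveData($DataContainer);
    }
}

$factory = new DBSaverFactory(array(
    'SessionContainer' => 'SessionDBSaver',
    'SittingContainer' => 'SittingDBSaver',
));
$result = $factory->saveData(new SittingContainer());
```

Note that every saver here still accepts any DataContainer, so this sketch has exactly the same type-hinting limitation as the scrapers.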
Everything would be perfect, but the problem is that a class implementing an interface must declare its methods with a signature compatible with the interface, so a parameter type cannot be narrowed. E.g. the method SessionScraper::scrapeData cannot accept only SessionContainer objects, it must accept all DataContainer objects. But it is not meant to!
Finally, the question:
- Is my design wrong, and should I be structuring everything in a completely different way (and if so, how)? Or:
- Is my design OK, and I just need to enforce types within the methods with instanceof and similar checks instead of enforcing them via type hinting?
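To illustrate the second option, this is roughly what I imagine the instanceof variant would look like. It is only a sketch: throwing InvalidArgumentException is my assumption about how a wrong container should be reported, and the parsing itself is left out.

```php
<?php
abstract class DataContainer {}
class SessionContainer extends DataContainer {}
class SittingContainer extends DataContainer {}

interface Scraper {
    public function scrapeData(DOMDocument $DOM, DataContainer $DataContainer);
}

class SessionScraper implements Scraper {
    // The signature stays compatible with the interface...
    public function scrapeData(DOMDocument $DOM, DataContainer $DataContainer) {
        // ...and the narrower type is enforced at runtime instead.
        if (!($DataContainer instanceof SessionContainer)) {
            throw new InvalidArgumentException(
                'SessionScraper expects a SessionContainer, got ' . get_class($DataContainer)
            );
        }
        // actual parsing of $DOM into $DataContainer would go here
    }
}

$scraper = new SessionScraper();

// Accepted: the container matches the scraper.
$scraper->scrapeData(new DOMDocument(), new SessionContainer());

// Rejected: a SittingContainer is passed to the session scraper.
$rejected = false;
try {
    $scraper->scrapeData(new DOMDocument(), new SittingContainer());
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
```

The downside is that a mismatch only shows up when the method is actually called, rather than being visible in the signature.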
Thanks in advance for all suggestions and criticism. I am completely happy for somebody to turn this code on its head, if necessary!