Changes between Version 1 and Version 2 of ApertureSimpleDataCrawler


Timestamp:
10/17/05 11:25:38
Author:
anonymous
Comment:

--

  • ApertureSimpleDataCrawler

Remarks Chris:

The concept of a hierarchy, including a root, is strongly present in this interface. I don't really like this, as some data sources have no intrinsic hierarchy, e.g. graph- or table-oriented data sources, so that crawling such sources becomes awkward or even impossible. For example, what kind of folders would a !WebCrawler return? How would it be able to know all folders and all items in those folders a priori?

In our code we prevent this problem by keeping the traversal of the hierarchy, graph, table or whatever kind of structure the data source has internal to the crawler implementation. Information about the folder/row/whatever becomes just another part of the metadata of the returned !DataObjects.

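A minimal sketch of this approach (all class and method names here are illustrative, not the actual Aperture API): the crawler walks a toy folder tree entirely by itself and reports the parent folder as ordinary metadata on each returned object, so the interface never has to expose the hierarchy:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InternalTraversalDemo {
    // Simplified stand-in for Aperture's DataObject.
    static class DataObject {
        final String uri;
        final Map<String, String> metadata = new HashMap<>();
        DataObject(String uri) { this.uri = uri; }
    }

    // A toy hierarchical source: folder URI -> child URIs.
    static final Map<String, List<String>> TREE = Map.of(
        "source:/",  List.of("source:/a", "source:/doc1"),
        "source:/a", List.of("source:/a/doc2"));

    // Depth-first traversal kept entirely inside the crawler; the
    // caller only sees a flat stream of DataObjects whose metadata
    // happens to record the parent folder.
    static List<DataObject> crawl(String root) {
        List<DataObject> out = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String uri = stack.pop();
            for (String child : TREE.getOrDefault(uri, List.of())) {
                DataObject obj = new DataObject(child);
                obj.metadata.put("parentFolder", uri); // hierarchy as metadata
                out.add(obj);
                stack.push(child);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (DataObject obj : crawl("source:/"))
            System.out.println(obj.uri + " parent=" + obj.metadata.get("parentFolder"));
    }
}
```

The same `crawl` shape works for a graph or table source; only the internal traversal changes, not the returned type.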
Another complaint is that change ''detection'' is apparently performed outside the crawler. I assume the idea here is that there will be a single piece of code to detect changes that works with any kind of !SimpleDataCrawler? Otherwise there would be no reason to have this getChangeDataOfObject method. In my opinion change detection will ideally be highly crawler-dependent. You might be able to generalize it this way, but at the cost of not being able to do source-specific optimizations. Consider for example HTTP-specific improvements: the If-Modified-Since header lets the webserver tell you that a resource has not changed. Maybe we can also improve IMAP in a similar way, I don't know yet.

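As a hedged illustration of such an HTTP-specific optimization (the class and helper names are made up, but `setIfModifiedSince` and the 304 status code are standard `java.net.URLConnection`/HTTP features): the crawler sends its stored crawl time as an If-Modified-Since header, and a 304 response means the object can be skipped without transferring the body.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalFetch {
    // Pure decision helper: HTTP 304 Not Modified means the cached
    // copy is still current and the object can be skipped.
    static boolean isUnchanged(int responseCode) {
        return responseCode == HttpURLConnection.HTTP_NOT_MODIFIED; // 304
    }

    // Ask the server whether the resource changed since we last
    // crawled it; the body is only transferred when it did.
    static boolean needsRecrawl(URL url, long lastCrawlMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setIfModifiedSince(lastCrawlMillis); // sends If-Modified-Since
        return !isUnchanged(conn.getResponseCode());
    }

    public static void main(String[] args) {
        // No network access in this demo; just show the decision logic.
        System.out.println(isUnchanged(304)); // true
        System.out.println(isUnchanged(200)); // false
    }
}
```

A generic, crawler-independent change detector would have to fetch the full body and compare it, losing exactly this kind of shortcut.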
"Crawler" is probably not a good term to use in the name of this interface, as the actual crawling (retrieving the folder and object URIs and deciding whether to fetch them) is done by other code that uses an instance of this interface.

What I do like is that folders become more prominent. This may seem to contradict what I said before; what I mean is that it becomes possible to retrieve information about the folders themselves. With a simple extension (e.g. a getDataOfFolder method) it becomes possible to retrieve all metadata of a folder. This is something we have not considered before in our own architecture.

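A minimal sketch of that extension, with invented names (the real !SimpleDataCrawler interface may look quite different): a getDataOfFolder method sits next to the per-object accessor, shown here with a toy in-memory implementation.

```java
import java.util.Map;

public class FolderDataDemo {
    // Hypothetical extension of the crawler interface: folders get
    // their own metadata accessor, symmetric to data objects.
    interface SimpleDataCrawlerWithFolders {
        Map<String, String> getDataOfObject(String objectUri);
        Map<String, String> getDataOfFolder(String folderUri);
    }

    // Toy implementation that derives metadata from the URI alone.
    static class InMemoryCrawler implements SimpleDataCrawlerWithFolders {
        public Map<String, String> getDataOfObject(String uri) {
            return Map.of("uri", uri, "kind", "object");
        }
        public Map<String, String> getDataOfFolder(String uri) {
            String name = uri.substring(uri.lastIndexOf('/') + 1);
            return Map.of("uri", uri, "kind", "folder", "name", name);
        }
    }

    public static void main(String[] args) {
        SimpleDataCrawlerWithFolders crawler = new InMemoryCrawler();
        System.out.println(crawler.getDataOfFolder("imap://host/INBOX"));
    }
}
```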
Here's a new idea that in my opinion merges this idea with our own architecture. Create a super interface of !DataObject (Resource? - has a strong RDF association. Entity? - has other associations here at Aduna). !DataObject then gets a sibling named Folder. Crawlers do not only produce !DataObject instances, they produce instances of its supertype. This way, crawlers that crawl data sources with an intrinsic hierarchy can return Folder instances, which contain all metadata of the folder, similar to how !DataObjects contain metadata of that object. Similarly, we can introduce other !DataObject siblings for capturing table- or graph-related metadata that is not specific to a single !DataObject. Crawler-using applications that have no interest in this information can simply ignore these events. Also, the crawler interface itself does not need to specify folder-/graph-/table-specific information.

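The proposed hierarchy could be sketched like this (using "DataItem" as a stand-in for the still-undecided supertype name; all other names are illustrative, not the Aperture API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HierarchyDemo {
    // Stand-in for the debated supertype name (Resource? Entity?).
    interface DataItem {
        String getUri();
        Map<String, String> getMetadata();
    }

    static abstract class AbstractItem implements DataItem {
        private final String uri;
        private final Map<String, String> metadata = new HashMap<>();
        AbstractItem(String uri) { this.uri = uri; }
        public String getUri() { return uri; }
        public Map<String, String> getMetadata() { return metadata; }
    }

    // Siblings: ordinary crawled objects and the folders holding them.
    static final class DataObject extends AbstractItem {
        DataObject(String uri) { super(uri); }
    }
    static final class Folder extends AbstractItem {
        Folder(String uri) { super(uri); }
    }

    // A crawler of a hierarchical source emits both kinds of items;
    // consumers with no interest in folders can simply skip them.
    static List<DataItem> crawlSample() {
        Folder inbox = new Folder("imap://host/INBOX");
        inbox.getMetadata().put("messageCount", "1");
        DataObject msg = new DataObject("imap://host/INBOX/1");
        msg.getMetadata().put("parentFolder", inbox.getUri());
        return List.of(inbox, msg);
    }

    public static void main(String[] args) {
        for (DataItem item : crawlSample())
            System.out.println(item.getClass().getSimpleName() + ": " + item.getUri());
    }
}
```

A table- or graph-oriented crawler would add further `DataItem` subtypes instead of forcing those sources into folders.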
In our use case this also facilitates metadata indexing, because currently our MetadataFetcher (the class transforming the information inside a DataObject to RDF statements) interprets the document URIs and "reinvents" the folder hierarchy, modeling it as Resources with a partOf relation. This would then no longer be necessary; the Folder instance would already contain all necessary information.
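To make the difference concrete, a small sketch (the `ex:partOf` predicate URI and both helper names are invented for illustration): today the parent must be re-derived by parsing the document URI, whereas a Folder item would supply it directly and only the statement-emitting step would remain.

```java
public class PartOfDemo {
    // Current approach: "reinvent" the folder by parsing the URI.
    static String parentFromUri(String documentUri) {
        return documentUri.substring(0, documentUri.lastIndexOf('/'));
    }

    // With Folder items, the parent URI is already known and the
    // partOf statement can be emitted directly.
    static String partOfStatement(String objectUri, String folderUri) {
        return "<" + objectUri + "> <ex:partOf> <" + folderUri + "> .";
    }

    public static void main(String[] args) {
        String doc = "file:///projects/notes/report.txt";
        System.out.println(partOfStatement(doc, parentFromUri(doc)));
    }
}
```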