Changes between Version 5 and Version 6 of ApertureDataAccessor

Timestamp:: 10/18/05 16:58:43 (20 years ago)
Author:: anonymous
Comment:: --

ApertureDataAccessor

-                      v5
+                      v6
 = DataAccessors =
+== Changelog ==
+ * Using URI as the identifier instead of String, which is more type-safe; hence also the !UriNotFoundException. This parallels !ApertureDataObject, which is also based on java.net.URI.
+Chris: in our proposal this was a String containing a URL. Our rationale for this signature is that the specified parameter does not serve as a formal identifier; rather, it is the address used to access the physical resource. Hence a URL and not a URI. Another reason is that the URI of the returned !DataObject may be completely different: the !HttpDataAccessor follows HTTP redirects and uses the URL of the page it is redirected to as the URI. It is conceptually rather awkward to request a !DataObject for a specific URI and get a !DataObject with a different URI back.
+Naturally you would then expect a java.net.URL as a parameter, but Java requires a URLStreamHandler for every scheme that you use, meaning that you cannot easily "invent" new schemes, e.g. "imap:", "outlook:", etc. Hence the String as a compromise.
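The scheme limitation described above can be checked with a small sketch. The class name `SchemeDemo`, its helper methods and the `outlook:` address are made up for illustration; only `java.net.URL` and `java.net.URI` are real JDK classes. The sketch assumes a stock JDK with no custom stream handler installed for the invented scheme.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Illustrative only: SchemeDemo and its helpers are hypothetical names.
public class SchemeDemo {

    /** True when java.net.URL accepts the address, i.e. a stream handler exists for its scheme. */
    static boolean parsesAsUrl(String address) {
        try {
            new URL(address); // throws for schemes without a registered URLStreamHandler
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    /** Scheme as extracted by java.net.URI, which needs no handler to parse an address. */
    static String schemeOf(String address) {
        return URI.create(address).getScheme();
    }

    public static void main(String[] args) {
        String invented = "outlook://inbox/message/42"; // made-up address with an invented scheme
        System.out.println("URL accepts invented scheme: " + parsesAsUrl(invented));
        System.out.println("URI scheme of invented address: " + schemeOf(invented));
        System.out.println("URL accepts http: " + parsesAsUrl("http://example.org/"));
    }
}
```

A plain String parameter sidesteps the handler lookup entirely, at the cost of leaving validation of the address to the accessor that receives it.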
+== ToDo ==
+Leo: from my perspective, !DataAccessor, !DataCrawler and !CrawlData are too tightly coupled.
+The return value is defined in a far too complicated way: ''' @return A !DataObject for the specified URI, or null when an !AccessData instance has been specified and the binary resource has not been modified since the last access.''' This return value carries too much meaning. If this is a generic framework, change detection could be left entirely to the !DataCrawler, if it is programmed in a datasource-specific way.
+Chris: I agree that it is rather complicated, so I'm definitely interested in simpler setups. Still, I believe that there are good reasons to pick this architecture.
+The !DataAccessor has been decoupled from the !DataCrawler for the following reasons:
+ * Maximum code reusability in customer projects in complex enterprise environments. Consider for example various document management systems which may have a filesystem- or website-like structure (i.e. a folder tree or hypertext graph). Such systems may only need a dedicated !DataAccessor that knows how to access such a system, as the crawler can then be reused.[[BR]]I must admit however that currently this is only partially the case:[[BR]](1) The !HypertextCrawler is truly scheme-independent, retrieving !DataAccessors based on the scheme of a URL as well as using a !MimeTypeIdentifier and a !LinkExtractor to determine which pages to load next, but it uses URLs internally, leading to the problem described above (in the URLs vs. URIs vs. Strings part).[[BR]](2) The !FileSystemCrawler still uses java.io.File to determine the folder tree and the Files, so we would need to delegate the part that discovers folders, subfolders and files in a scheme-dependent way, similarly to how !HypertextCrawler delegates functionality for type detection and link extraction to other objects.
+ * Another reason for this decoupling is that several crawlers may make use of the same !DataAccessors, e.g. a !FileSystemCrawler and a !HypertextCrawler that both use a !FileDataAccessor. For us this is a realistic scenario, e.g. to crawl intranets that are available on a network drive.
+ * Finally, this allows you to access a resource (i.e. create a !DataObject for it) without having to go through the crawler, which operates more on a !DataSource level and not on the level of individual !DataObjects.
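The decoupling argued for in the list above could be sketched roughly as follows. This is not the actual Aperture API: `AccessorRegistryDemo`, `AccessorRegistry`, `SimpleDataObject` and the one-method `DataAccessor` interface are hypothetical stand-ins, chosen only to show several callers sharing one accessor per scheme and a resource being fetched without any crawler.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: every type in this class is a hypothetical stand-in.
public class AccessorRegistryDemo {

    /** Stand-in for a DataObject: it only remembers the address it was created for. */
    static final class SimpleDataObject {
        final String url;
        SimpleDataObject(String url) { this.url = url; }
    }

    /** Stand-in for a DataAccessor: turns an address into an object. */
    interface DataAccessor {
        SimpleDataObject get(String url);
    }

    /** Maps a scheme such as "file" to the single accessor responsible for it. */
    static final class AccessorRegistry {
        private final Map<String, DataAccessor> byScheme = new HashMap<>();

        void register(String scheme, DataAccessor accessor) {
            byScheme.put(scheme, accessor);
        }

        /** Looks up the accessor by the text before the first ':' of the address. */
        DataAccessor forUrl(String url) {
            int colon = url.indexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("no scheme in: " + url);
            }
            DataAccessor accessor = byScheme.get(url.substring(0, colon));
            if (accessor == null) {
                throw new IllegalArgumentException("no accessor for: " + url);
            }
            return accessor;
        }
    }

    public static void main(String[] args) {
        AccessorRegistry registry = new AccessorRegistry();
        registry.register("file", SimpleDataObject::new);

        // A file system crawler and a hypertext crawler would both resolve
        // "file:" addresses to the same accessor instance.
        DataAccessor forFolders = registry.forUrl("file:///intranet/docs/");
        DataAccessor forPages = registry.forUrl("file:///intranet/index.html");
        System.out.println("same accessor: " + (forFolders == forPages));

        // Direct access to a single resource, with no crawler involved.
        SimpleDataObject object = forPages.get("file:///intranet/index.html");
        System.out.println("fetched: " + object.url);
    }
}
```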
+Other aspects:
+ * The !AccessData/!CrawlData is specified as a parameter so that the !DataAccessor can perform change detection, as it can use scheme-specific optimizations (e.g. the HTTP If-Modified-Since header). The null return value to indicate unmodified resources is used to make handling these resources (typically the majority in incremental scans) as cheap as possible: no object instantiations for these resources.
+ * The !CrawlData interface allows us in the future to get rid of the !CrawlDataBase implementation class, which has its own storage format, and create an adapter that works on top of the Sesame Repository that also contains all extracted metadata. This way all known metadata of a resource is stored in a single place, ensuring consistency, lower resource consumption, improved caching behaviour, etc.
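The null-for-unmodified contract discussed above might look like this in miniature. The sketch is an assumption-laden illustration, not the real interface: `ChangeDetectionDemo`, `AccessData`, `SimpleDataObject` and `InMemoryAccessor` are hypothetical, and an in-memory table of modification times stands in for a scheme-specific check such as a conditional HTTP request.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: every type in this class is a hypothetical stand-in.
public class ChangeDetectionDemo {

    /** Stand-in for AccessData: the modification time seen at the last access, per address. */
    static final class AccessData {
        private final Map<String, Long> lastSeen = new HashMap<>();
        Long get(String url) { return lastSeen.get(url); }
        void put(String url, long modified) { lastSeen.put(url, modified); }
    }

    /** Stand-in for a DataObject. */
    static final class SimpleDataObject {
        final String url;
        final long modified;
        SimpleDataObject(String url, long modified) {
            this.url = url;
            this.modified = modified;
        }
    }

    /** Accessor over an in-memory table of address to modification time. */
    static final class InMemoryAccessor {
        private final Map<String, Long> resources = new HashMap<>();

        /** Creates the resource or updates its modification time. */
        void touch(String url, long modified) { resources.put(url, modified); }

        /**
         * Returns an object for the address, or null when accessData is given
         * and the resource is unchanged since the access recorded in it.
         */
        SimpleDataObject get(String url, AccessData accessData) {
            Long current = resources.get(url);
            if (current == null) {
                throw new IllegalArgumentException("unknown resource: " + url);
            }
            if (accessData != null) {
                Long seen = accessData.get(url);
                if (seen != null && seen.longValue() == current.longValue()) {
                    return null; // unmodified: no object is instantiated
                }
                accessData.put(url, current);
            }
            return new SimpleDataObject(url, current);
        }
    }

    public static void main(String[] args) {
        InMemoryAccessor accessor = new InMemoryAccessor();
        AccessData accessData = new AccessData();
        String url = "mem://docs/report"; // made-up address

        accessor.touch(url, 1000L);
        System.out.println("first scan returns object: " + (accessor.get(url, accessData) != null));
        System.out.println("rescan returns null: " + (accessor.get(url, accessData) == null));

        accessor.touch(url, 2000L); // the resource changes
        System.out.println("after change returns object: " + (accessor.get(url, accessData) != null));
    }
}
```

Passing null for the access data in this sketch disables change detection, so a caller that always wants an object can simply omit it.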
 == Java Interface ==