DataAccessors
Changelog
- Using URI as identifier, not String: more type-safe, hence also the UriNotFoundException. This parallels ApertureDataObject, which is also based on java.net.URI.
Chris: in our proposal this was a String containing a URL. Our rationale for this signature is that the specified parameter does not serve as a formal identifier; it is rather the address used to access the physical resource, hence a URL and not a URI. Another reason is that the returned DataObject's URI may be completely different: the HttpDataAccessor follows HTTP redirects and uses the URL of the page it is redirected to as the URI. It is conceptually rather awkward to request a DataObject for a specific URI and get a DataObject with a different URI back.
Leo: Well, it's conceptually also awkward to request a DataObject for a specific string and get a different one back. In the storage backend this makes things a little more complicated, because the returned URI then serves as the identifier in the database (as graph context in Sesame2; which would you use then, URI or URL?). We might then have dead links in Autofocus when the synchronisation between URL and URI somehow fails.
Leo: The case of redirecting HTTP URLs is special; it makes me sick. But in IMAP the URL may also change into a different URI, namely when you request a URL and the spelling is different: IMAP may replace "imap://leo@server/folder.subfolder/msg#1" with "imap://leo@server/folder%23subfolder/msg#1". Therefore this idea is still interesting, so I do basically agree with the notion of "asking for a URL and getting a URI".
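To make the redirect case concrete, here is a minimal sketch using plain java.net (the address is hypothetical). After following redirects, the connection's final URL can differ from the requested one; an HttpDataAccessor would use that final URL as the DataObject's URI:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectDemo {
    public static void main(String[] args) throws IOException {
        URL requested = new URL("http://example.org/old-page"); // hypothetical url
        HttpURLConnection conn = (HttpURLConnection) requested.openConnection();
        conn.setInstanceFollowRedirects(true); // the default, shown for clarity
        conn.getResponseCode();                // forces the request, follows redirects
        // After redirects the connection's URL may differ from the requested one.
        System.out.println("requested: " + requested);
        System.out.println("final:     " + conn.getURL());
    }
}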
Naturally you would then expect a java.net.URL as a parameter, but Java requires a URLStreamHandler for every scheme that you use, meaning that you cannot easily "invent" new schemes, e.g. "imap:", "outlook:", etc. Hence the String as a compromise.
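A small sketch of this limitation (the "outlook:" scheme is only an example): java.net.URL rejects schemes that have no registered URLStreamHandler, while java.net.URI performs only syntax checks, so "invented" schemes pass:

import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class SchemeDemo {
    public static void main(String[] args) throws URISyntaxException {
        try {
            new URL("outlook://inbox/message42"); // hypothetical scheme and path
        } catch (MalformedURLException e) {
            System.out.println("URL: " + e.getMessage()); // "unknown protocol: outlook"
        }
        // URI only validates the syntax, so custom schemes are fine:
        URI uri = new URI("outlook://inbox/message42");
        System.out.println("URI: " + uri);
    }
}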
ToDo
The DataAccessor has been decoupled from the DataCrawler for the following reasons:
- Maximum code reusability in customer projects in complex enterprise environments. Consider, for example, various document management systems that may have a filesystem- or website-like structure (i.e. a folder tree or hypertext graph). Such systems may only need a dedicated DataAccessor that knows how to access them; the crawler can then be reused.
I must admit, however, that currently this is only partially the case:
(1) The HypertextCrawler is truly scheme-independent: it retrieves DataAccessors based on the scheme of a url (see the registry sketch after this list) and uses a MimeTypeIdentifier and a LinkExtractor to determine which pages to load next. However, it uses URLs internally, leading to the problem described above (in the URLs vs. URIs vs. Strings part).
(2) The FileSystemCrawler still uses java.io.File to determine the folder tree and the Files, so we would need to delegate the part that discovers folders, subfolders and files in a scheme-dependent way, similarly to how the HypertextCrawler delegates type detection and link extraction to other objects.
- Another reason for this decoupling is that several crawlers may make use of the same DataAccessors, e.g. a FileSystemCrawler and a HypertextCrawler that both use a FileDataAccessor. For us this is a realistic scenario, e.g. crawling intranets that are available on a network drive.
- Finally, this allows you to access a resource (i.e. create a DataObject for it) without having to go through the crawler, which operates more on the DataSource level and not on the level of individual DataObjects. Leo: Using a DataAccessor without a DataCrawler should force the usage of a different method call; see below for the interface. (Also, the loose/close coupling we mentioned regarding IMAP comes in here.)
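To illustrate the reuse argument, here is a hypothetical sketch of a scheme-based accessor lookup, roughly what a crawler like the HypertextCrawler might do. The registry class and its method names are assumptions, not existing API:

import java.util.HashMap;
import java.util.Map;

public class DataAccessorRegistry {
    private final Map accessorsByScheme = new HashMap(); // scheme -> DataAccessor

    public void register(String scheme, DataAccessor accessor) {
        accessorsByScheme.put(scheme, accessor);
    }

    /** Look up the accessor for a url by its scheme, e.g. "http" or "file". */
    public DataAccessor get(String url) {
        int colon = url.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("url has no scheme: " + url);
        }
        DataAccessor accessor = (DataAccessor) accessorsByScheme.get(url.substring(0, colon));
        if (accessor == null) {
            throw new IllegalArgumentException("no DataAccessor registered for: " + url);
        }
        return accessor;
    }
}

With such a registry, a FileSystemCrawler and a HypertextCrawler could both call register("file", new FileDataAccessor()) once and share the same accessor instance.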
Other aspects:
- The AccessData/CrawlData is specified as a parameter so that the DataAccessor can perform change detection using scheme-specific optimizations (e.g. the HTTP If-Modified-Since header). The null return value indicating an unmodified resource makes handling these resources (typically the majority in incremental scans) as cheap as possible: no object instantiations for them.
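A minimal sketch of such a scheme-specific optimization, assuming a simple in-memory map in place of the real AccessData/CrawlData; all names here are illustrative:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class HttpChangeDetectionSketch {
    private final Map lastModifiedByUrl = new HashMap(); // stand-in for AccessData

    /** Returns null when the resource is unmodified, mirroring the accessor contract. */
    public HttpURLConnection openIfModified(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        String previous = (String) lastModifiedByUrl.get(url);
        if (previous != null) {
            conn.setRequestProperty("If-Modified-Since", previous); // let the server decide
        }
        if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return null; // cheap path: no DataObject instantiation for unmodified resources
        }
        String current = conn.getHeaderField("Last-Modified");
        if (current != null) {
            lastModifiedByUrl.put(url, current); // record this access for the next crawl
        }
        return conn;
    }
}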
- The CrawlData interface allows us in the future to get rid of the CrawlDataBase implementation class, which has its own storage format, and to create an adapter on top of the Sesame Repository that also contains all extracted metadata. This way all known metadata of a resource is stored in a single place, ensuring consistency, lower resource consumption, improved caching behaviour, etc.
- Leo: to simplify and separate getDataObject (= get it now!) and getDataObjectCrawl (= check if changed, get if changed), I would suggest defining two methods: one for really getting a resource and one for the crawling scenario.
- I also renamed the method from get() to getDataObject(); when an object implements both DataAccessor and other interfaces, the semantics of a method named get() are fuzzy.
Java Interface
Probably equal to source:trunk/gnowsis/src/org/gnowsis/data/adapter/CBDAdapter.java
import java.io.IOException;
import java.util.Map;

/**
 * A DataAccessor provides access to physical resources by creating DataObjects
 * representing the resource, based on a url and optionally data about a previous
 * access and other parameters.
 *
 * The main task of a DataAccessor is to find the resource identified by the URL
 * String and create a DataObject that represents the resource. When crawling, the
 * DataAccessor additionally uses the passed CrawlData interface to check and update
 * information about the last crawl.
 *
 * About the returned DataObject: in most cases the DataObject is just a passive
 * container of information that the DataAccessor has filled. However, the
 * DataAccessor may also return a dedicated DataObject implementation that
 * determines some things dynamically; that is up to the DataAccessor to decide.
 *
 * During one call, the DataAccessor has the following tasks:
 * - check if the URL is ok
 * - check redirects: if the URL is redirected to another URI,
 *   the DataObject will have the new URI as identifier
 * - check changes (was the object changed since the last crawl); only needed
 *   in getDataObjectCrawl()
 * - if crawling: update the CrawlData with the new datetime, size, etc.
 * - detect if the result is going to be a DataObject, DataObjectFile or
 *   DataObjectFolder, and go on accordingly
 * - open the stream
 * - detect the mime type (using all tricks available: http headers, file
 *   extensions, magic bytes)
 * - detect the byte size
 * - extract the most basic metadata (only the data that is there already)
 * - create a new DataObject with all of the above
 * - and return it
 */
public interface DataAccessor {

    /**
     * Get a DataObject for the specified url during crawling. The resulting
     * DataObject's ID may differ from the specified url due to normalization
     * schemes, following of redirected URLs, etc.
     *
     * A CrawlData instance has to be specified with which the DataAccessor has to
     * store and retrieve information about previous accesses to resources. This is
     * mostly useful for DataCrawlers that want to be able to incrementally scan a
     * DataSource. The resulting DataObject can be null, indicating that the binary
     * resource has not been modified since the last access.
     *
     * A DataAccessor is always required to store something in the CrawlData when a
     * url is accessed, so that afterwards CrawlData.isKnownId will return true.
     *
     * Specific DataAccessor implementations may accept additional parameters
     * through the params Map.
     *
     * @param url The url locator of the resource. If the resource is identified by
     *        some other URI, the DataAccessor will follow redirects accordingly.
     * @param source The source that will be registered as the source of the
     *        DataObject.
     * @param crawlData Database containing information about previous accesses, in
     *        which this access is stored.
     * @param params Optional additional parameters needed to access the physical
     *        resource. Parameters may also determine how the metadata should be
     *        extracted or which detail of metadata is needed. Applications may pass
     *        params through the whole chain.
     * @return A DataObject for the specified URI, or null when the binary resource
     *         has not been modified since the last access.
     * @throws UriNotFoundException when the binary resource could not be found
     * @throws IOException when any other kind of I/O error occurs
     */
    public DataObject getDataObjectCrawl(String url, DataSource source,
            CrawlData crawlData, Map params) throws UriNotFoundException, IOException;

    /**
     * Get a DataObject for the specified url. The resulting DataObject's ID may
     * differ from the specified url due to normalization schemes, following of
     * redirected URLs, etc. This method is independent of access during crawling
     * sessions.
     *
     * Specific DataAccessor implementations may accept additional parameters
     * through the params Map.
     *
     * @param url The url locator of the resource. If the resource is identified by
     *        some other URI, the DataAccessor will follow redirects accordingly.
     * @param source The source that will be registered as the source of the
     *        DataObject.
     * @param params Optional additional parameters needed to access the physical
     *        resource. Parameters may also determine how the metadata should be
     *        extracted or which detail of metadata is needed. Applications may pass
     *        params through the whole chain.
     * @return A DataObject for the specified URI
     * @throws UriNotFoundException when the binary resource could not be found
     * @throws IOException when any other kind of I/O error occurs
     */
    public DataObject getDataObject(String url, DataSource source, Map params)
            throws UriNotFoundException, IOException;
}
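A hypothetical usage sketch for the non-crawling method; the FileDataAccessor, the file url and the getID() accessor are assumptions for illustration only:

import java.io.IOException;
import java.util.HashMap;

public class AccessorUsageSketch {
    // accessor and source are assumed to come from the application's setup
    public static void printOne(DataAccessor accessor, DataSource source)
            throws UriNotFoundException, IOException {
        DataObject obj = accessor.getDataObject(
                "file:///home/leo/report.txt", // hypothetical file url
                source,
                new HashMap());                // no extra params
        System.out.println(obj.getID());       // the ID may differ from the requested url
    }
}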