Changes between Version 8 and Version 9 of ApertureDataAccessor


Timestamp: 10/20/05 13:26:41
Author: sauermann
Comment: --

Legend: lines with no prefix are unmodified; lines prefixed with + were added in v9; lines prefixed with - were removed in v9.

  • ApertureDataAccessor (v8 → v9)
== ToDo ==
-
- Leo: from my perspective, !DataAccessor, !DataCrawler and !CrawlData are too tightly coupled.
- The definition of the return value is far too complicated: ''' @return A !DataObject for the specified URI, or null when an !AccessData instance has been specified and the binary resource has not been modified since the last access.''' This return value carries too much semantics. In a generic framework, change detection could be left entirely to the !DataCrawler, provided the crawler is programmed datasource-specifically.
-
- Chris: I agree that it is rather complicated, so I'm definitely interested in simpler setups. Still, I believe there are good reasons to pick this architecture.
The !DataAccessor has been decoupled from the !DataCrawler for the following reasons:

 * Maximum code reusability in customer projects in complex enterprise environments. Consider, for example, various document management systems which may have a filesystem- or website-like structure (i.e. a folder tree or hypertext graph). Such systems may only need a dedicated !DataAccessor that knows how to access such a system, as the crawler can then be reused.[[BR]]I must admit, however, that currently this is only partially the case:[[BR]](1) The !HypertextCrawler is truly scheme-independent, retrieving !DataAccessors based on the scheme of a URL (see the sketch after this list) as well as using a !MimeTypeIdentifier and a !LinkExtractor to determine which pages to load next, but it uses URLs internally, leading to the problem described above (in the URLs vs. URIs vs. Strings part).[[BR]](2) The !FileSystemCrawler still uses java.io.File to determine the folder tree and the files, so we would need to delegate the part that discovers folders, subfolders and files in a scheme-dependent way, similarly to how !HypertextCrawler delegates functionality for type detection and link extraction to other objects.
 * Another reason for this decoupling is that several crawlers may make use of the same !DataAccessors, e.g. a !FileSystemCrawler and a !HypertextCrawler that both use a !FileDataAccessor. For us this is a realistic scenario, e.g. to crawl intranets that are available on a network drive.
 * Finally, this allows you to access a resource (i.e. create a !DataObject for it) without having to go through the crawler, which operates more on a !DataSource level and not on the level of individual !DataObjects.
+ * '''Leo:''' Using a DataAccessor without a DataCrawler should force the usage of a different method call; see below for the interface. (Also, the loose/close coupling we mentioned regarding IMAP comes into play here.)
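To make the scheme-based reuse concrete, here is a minimal sketch of the kind of lookup the !HypertextCrawler performs. The DataAccessorRegistry name and its Map-based implementation are illustrative assumptions, not the actual Aperture API:

{{{
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry: crawlers stay scheme-independent by looking up
// a DataAccessor for the scheme of each URL they encounter.
public class DataAccessorRegistry {

    private final Map accessors = new HashMap(); // scheme -> DataAccessor

    public void register(String scheme, DataAccessor accessor) {
        accessors.put(scheme, accessor);
    }

    /** Returns the DataAccessor registered for the URL's scheme, or null. */
    public DataAccessor get(String url) {
        int colon = url.indexOf(':');
        if (colon < 0) {
            return null;
        }
        return (DataAccessor) accessors.get(url.substring(0, colon));
    }
}
}}}

With such a registry, registering one !FileDataAccessor under the file scheme would let both a !FileSystemCrawler and a !HypertextCrawler reuse it, as described above.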
Other aspects:

 * The !CrawlData interface allows us in the future to get rid of the !CrawlDataBase implementation class, which has its own storage format, and to create an adapter that works on top of the Sesame Repository that also contains all extracted metadata. This way all known metadata of a resource is stored in a single place, ensuring consistency, lower resource consumption, improved caching behaviour, etc.
-  * Leo: to simplify and separate the '''get (=get it now!)''' and '''getCrawl (=check if changed, get if changed)''' I would suggest defining two methods, one for really getting a resource and one for the crawling scenario. The getCrawl method would be the existing one, the get method a simpler one.
+  * Leo: to simplify and separate the '''getDataObject (=get it now!)''' and '''getDataObjectCrawl (=check if changed, get if changed)''' I would suggest defining two methods, one for really getting a resource and one for the crawling scenario (a usage sketch follows this list).
+  * I also renamed the method from '''get()''' to '''getDataObject()''': when an object implements both DataAccessor and other interfaces, the semantics of the method name get() are fuzzy.
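A hypothetical usage sketch of the proposed split; the accessor, source and crawlData instances and the file URL are invented for illustration:

{{{
// Direct access: always fetch the resource, no change detection.
DataObject object = accessor.getDataObject("file:/tmp/example.txt", source, null);

// Crawling access: fetch only if the resource changed since the last
// crawl recorded in crawlData.
DataObject crawled = accessor.getDataObjectCrawl("file:/tmp/example.txt", source, crawlData, null);
if (crawled == null) {
    // unmodified since the last access: nothing to process
}
}}}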
…
  * also have returned a dedicated DataObject implementation that determines some things
  * dynamically; that is up to the DataAccessor to decide.
+ * During one call, the DataAccessor has the following tasks:
+ * - check if the URL is ok
+ * - check redirects: if the URL is redirected to another URI,
+ *   the DataObject will have the new URI as identifier
+ * - check changes (was the object changed since the last crawl?); only needed in getDataObjectCrawl()
+ * - if crawling: update the CrawlData with new datetime, size, etc.
+ * - detect whether the DataObject is going to be a DataObject, DataObjectFile or DataObjectFolder,
+ *   and go on accordingly
+ * - open the stream
+ * - detect the mime-type (using all tricks available: http headers, file extensions, magic bytes)
+ * - detect the byte-size
+ * - extract the most basic metadata (only the data that is there already)
+ * - create a new DataObject with all of the above
+ * - and return it
  */
public interface DataAccessor {

        /**
-        * Get a DataObject for the specified url. The resulting DataObject's ID may differ
+        * Get a DataObject for the specified url during crawling. The resulting DataObject's ID may differ
         * from the specified url due to normalization schemes, following of redirected URLs, etc.
         *
-        * An AccessData instance can optionally be specified with which the DataAccessor can store
+        * An AccessData instance has to be specified with which the DataAccessor has to store
         * and retrieve information about previous accesses to resources. This is mostly useful
         * for DataCrawlers that want to be able to incrementally scan a DataSource.
-        * When an AccessData instance is specified, the resulting DataObject can be null,
+        * The resulting DataObject can be null,
         * indicating that the binary resource has not been modified since the last access.
         *
…
         * Specific DataAccessor implementations may accept additional parameters through the params Map.
         *
-        * @param uri         The uri used to address the resource.
+        * @param url         The URL locator of the resource. If the resource is identified by some
+        *                    other URI, then the DataAccessor will follow redirects accordingly.
         * @param source      The source that will be registered as the source of the DataObject.
-        * @param crawlData   Optional database containing information about previous accesses.
+        * @param crawlData   Database containing information about previous accesses
+        *                    and where this access is stored.
         * @param params      Optional additional parameters needed to access the physical resource.
         *                    Parameters may also be passed that determine how the metadata should be
         *                    extracted or which level of detail of metadata is needed.
         *                    Applications may pass params through the whole chain.
-        * @return A DataObject for the specified URI, or null when an AccessData instance has been
-        * specified and the binary resource has not been modified since the last access.
+        * @return A DataObject for the specified URI, or null when the
+        *         binary resource has not been modified since the last access.
         * @throws UriNotFoundException when the binary resource could not be found
         * @throws IOException when any other kind of I/O error occurs.
         */
-       public DataObject get(URI uri, DataSource source,
-           CrawlData crawlData, Map<?,?> params) throws UriNotFoundException, IOException;
+       public DataObject getDataObjectCrawl(String url, DataSource source,
+           CrawlData crawlData, Map params) throws UriNotFoundException, IOException;
+
+       /**
+        * Get a DataObject for the specified url. The resulting DataObject's ID may differ
+        * from the specified url due to normalization schemes, following of redirected URLs, etc.
+        * This method is independent of access during crawling sessions.
+        *
+        * Specific DataAccessor implementations may accept additional parameters through the params Map.
+        *
+        * @param url         The URL locator of the resource. If the resource is identified by some
+        *                    other URI, then the DataAccessor will follow redirects accordingly.
+        * @param source      The source that will be registered as the source of the DataObject.
+        * @param params      Optional additional parameters needed to access the physical resource.
+        *                    Parameters may also be passed that determine how the metadata should be
+        *                    extracted or which level of detail of metadata is needed.
+        *                    Applications may pass params through the whole chain.
+        * @return A DataObject for the specified URI.
+        * @throws UriNotFoundException when the binary resource could not be found
+        * @throws IOException when any other kind of I/O error occurs.
+        */
+       public DataObject getDataObject(String url, DataSource source,
+           Map params) throws UriNotFoundException, IOException;

}
}}}
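To see how the task list above plays out in practice, here is a hedged sketch of a file-based implementation. FileDataObject, its constructor, the CrawlData methods getLastModified() and touch(), and the UriNotFoundException constructor are illustrative assumptions, not the actual Aperture API; redirect handling and mime-type detection are omitted for brevity.

{{{
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

// Sketch only: follows the task list from the interface comment above.
public class FileDataAccessorSketch implements DataAccessor {

    public DataObject getDataObject(String url, DataSource source,
            Map params) throws UriNotFoundException, IOException {
        // check if the URL is ok and maps to an existing file
        File file = toFile(url);
        // open the stream
        InputStream stream = new FileInputStream(file);
        // detect byte-size and extract the most basic metadata
        long size = file.length();
        long lastModified = file.lastModified();
        // create a new DataObject with all of the above and return it
        // (FileDataObject and this constructor are assumed for illustration)
        return new FileDataObject(url, source, stream, size, lastModified);
    }

    public DataObject getDataObjectCrawl(String url, DataSource source,
            CrawlData crawlData, Map params)
            throws UriNotFoundException, IOException {
        File file = toFile(url);
        // check changes: was the object modified since the last crawl?
        // (getLastModified and touch are assumed CrawlData methods)
        long lastModified = file.lastModified();
        if (crawlData.getLastModified(url) == lastModified) {
            return null; // unmodified since the last access
        }
        // if crawling: update the CrawlData with new datetime, size, etc.
        crawlData.touch(url, lastModified, file.length());
        return getDataObject(url, source, params);
    }

    // check if the URL is ok; throw UriNotFoundException otherwise
    private File toFile(String url) throws UriNotFoundException {
        File file = new File(url.replaceFirst("^file:(//)?", ""));
        if (!file.isFile()) {
            throw new UriNotFoundException(url);
        }
        return file;
    }
}
}}}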