Changes between Version 6 and Version 7 of ApertureArchitecture


Timestamp: 10/12/05 14:43:56
Author: anonymous

= Aperture Architecture =
The central parts of the architecture are currently !DataSource, !DataCrawler,
!DataAccessor and !DataObject. Together they are used to access the contents of
an information system, such as a file system or web site.
     6 
A !DataSource contains all information necessary to locate the information
items in a source. For example, a !FileSystemDataSource has a set of one or
more directories on a file system, a set of patterns describing which files
to include or exclude, etc. Beyond that it is completely passive.
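
As a rough illustration of this passivity, the following sketch shows a data source that only holds configuration and offers no crawling behavior. All class names, fields and the constructor are hypothetical simplifications, not the actual Aperture API:

```java
import java.util.List;

// Hypothetical, simplified sketch of a passive data source: it only holds
// configuration (root folders, include/exclude patterns) and does no crawling.
public class FileSystemDataSource {

    private final List<String> rootFolders;
    private final List<String> includePatterns;
    private final List<String> excludePatterns;

    public FileSystemDataSource(List<String> rootFolders,
                                List<String> includePatterns,
                                List<String> excludePatterns) {
        this.rootFolders = rootFolders;
        this.includePatterns = includePatterns;
        this.excludePatterns = excludePatterns;
    }

    public List<String> getRootFolders() { return rootFolders; }
    public List<String> getIncludePatterns() { return includePatterns; }
    public List<String> getExcludePatterns() { return excludePatterns; }

    public static void main(String[] args) {
        FileSystemDataSource source = new FileSystemDataSource(
            List.of("/home/user/documents"),
            List.of("*.txt", "*.pdf"),
            List.of("*.tmp"));
        // The source is passive: it merely exposes its configuration
        // to whatever crawler is pointed at it.
        System.out.println(source.getRootFolders().get(0));
        System.out.println(source.getIncludePatterns().size());
    }
}
```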
     11 
A !DataCrawler is responsible for actually accessing the physical source and
reporting the individual information items as !DataObjects. Each !DataObject
contains all metadata provided by the data source, such as file names,
modification dates, etc., as well as the !InputStream that provides access to
the physical resource (e.g. the file itself).
     17 
We have chosen to separate the functionalities offered by !DataSource and
!DataCrawler because there may be several alternative crawling strategies for a
single !DataSource type. Consider for example a generic !FileSystemCrawler that
handles any kind of file system accessible through java.io.File versus a
!WindowsFileSystemCrawler that uses OS-native functionality to be notified of
file additions, deletions and changes. Another possibility is various
!DataCrawler implementations with different trade-offs between speed and accuracy.
     25 
Currently, a !DataSource also contains support for writing its configuration
to, or initializing it from, an XML file. We might consider moving this to a
separate utility class, because the best way to store such information is
often application-dependent.
     30 
A !DataCrawler creates !DataObjects for the individual information items it
encounters in the physical data source. These !DataObjects are reported to the
!DataCrawlerListeners registered at the !DataCrawler. An abstract base class
(!DataCrawlerBase) provides common functionality for maintaining information
about which files have been reported in the past, allowing for incremental
scanning.
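
The listener pattern and the "reported before" bookkeeping can be sketched as follows. This is a toy reconstruction under assumed names (the real !DataCrawlerBase and listener interfaces differ); the crawler, the listener callbacks and the ids are all illustrative:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the listener pattern: a crawler reports each
// encountered item to registered listeners, while a base class remembers
// what it has reported before, enabling incremental scans.
interface DataCrawlerListener {
    void objectNew(String id);
    void objectUnchanged(String id);
}

abstract class DataCrawlerBase {
    private final List<DataCrawlerListener> listeners = new ArrayList<>();
    private final Set<String> reportedBefore = new HashSet<>();

    void addListener(DataCrawlerListener l) { listeners.add(l); }

    // Subclasses call this for every item found in the physical source.
    protected void report(String id) {
        if (reportedBefore.add(id)) {
            listeners.forEach(l -> l.objectNew(id));
        } else {
            listeners.forEach(l -> l.objectUnchanged(id));
        }
    }

    abstract void crawl();
}

class ToyFileSystemCrawler extends DataCrawlerBase {
    @Override
    void crawl() {
        // A real crawler would traverse directories; here we report fixed ids.
        report("file:///tmp/a.txt");
        report("file:///tmp/b.txt");
    }
}

public class ListenerDemo {
    public static void main(String[] args) {
        ToyFileSystemCrawler crawler = new ToyFileSystemCrawler();
        crawler.addListener(new DataCrawlerListener() {
            public void objectNew(String id) { System.out.println("new: " + id); }
            public void objectUnchanged(String id) { System.out.println("unchanged: " + id); }
        });
        crawler.crawl(); // first scan: both files are new
        crawler.crawl(); // second scan: both are reported as unchanged
    }
}
```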
     37 
In order to create a !DataObject for a single resource, a !DataAccessor is used.
This functionality is deliberately kept out of the !DataCrawler implementations
because there may be several crawlers that can make good use of the same data
accessing functionality. A good example is the !FileSystemCrawler and the
!HypertextCrawler, which both make use of the !FileDataAccessor. Although they
arrive at the physical resource in different ways (by traversing folder trees
vs. following links from other documents), they can use the same functionality
to turn a java.io.File into a !FileDataObject.
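
The sharing described above can be sketched as a scheme-to-accessor lookup. This is a minimal illustration under assumed names (the toy accessor returns a string where the real one would build a !FileDataObject):

```java
import java.util.Map;

// Hypothetical sketch of why accessors are shared: two crawlers arrive at a
// resource in different ways but delegate to the same accessor for the scheme.
interface DataAccessor {
    String get(String url); // returns a (toy) data object for the url
}

class FileDataAccessor implements DataAccessor {
    public String get(String url) {
        return "FileDataObject(" + url + ")";
    }
}

public class AccessorDemo {
    public static void main(String[] args) {
        // Accessors registered by URL scheme; crawlers look them up.
        Map<String, DataAccessor> byScheme = Map.of("file", new FileDataAccessor());

        // A FileSystemCrawler found this resource by traversing folder trees...
        String fromTreeWalk = byScheme.get("file").get("file:///docs/a.txt");
        // ...a HypertextCrawler found one of the same kind by following a link.
        String fromLink = byScheme.get("file").get("file:///docs/b.html");

        System.out.println(fromTreeWalk);
        System.out.println(fromLink);
    }
}
```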
     47 
It should be clear now that a !DataCrawler is specific to the kind of
!DataSource it supports, whereas a !DataAccessor is specific to the URL
scheme(s) it supports.
     51 
== Incremental Scanning ==
     53 
The !AccessData instance used in !DataCrawlerBase maintains the information
about which objects have been scanned before. This instance is passed to the
!DataAccessor, as that is the best place to do this change detection:
     57 
 * This prevents object creation when the resource has not been modified since the last scan (!DataAccessor.get returns null).
     59 
 * This allows for more sophisticated optimizations, e.g. the !HttpDataAccessor uses HTTP-specific functionality so that the web server can decide whether the resource has changed since the last scan. This prevents an unchanged web page from being transported to the crawling side in the first place.
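
The first point, returning null for an unmodified resource, can be sketched like this. The bookkeeping here uses a plain timestamp map; the real !AccessData stores richer information, and the method shown is an assumed stand-in for !DataAccessor.get:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the change-detection contract: the accessor receives the
// bookkeeping data and returns null when the resource is unchanged,
// so no data object is created at all. Names are illustrative.
class AccessData {
    private final Map<String, Long> lastModified = new HashMap<>();
    Long get(String url) { return lastModified.get(url); }
    void put(String url, long stamp) { lastModified.put(url, stamp); }
}

public class IncrementalDemo {
    // Returns a toy data object, or null when unchanged since the last scan.
    static String get(String url, long modified, AccessData accessData) {
        Long previous = accessData.get(url);
        if (previous != null && previous == modified) {
            return null; // unchanged: skip object creation entirely
        }
        accessData.put(url, modified);
        return "DataObject(" + url + ")";
    }

    public static void main(String[] args) {
        AccessData data = new AccessData();
        System.out.println(get("file:///a.txt", 100L, data)); // first scan: new object
        System.out.println(get("file:///a.txt", 100L, data)); // unchanged: null
        System.out.println(get("file:///a.txt", 200L, data)); // modified: new object again
    }
}
```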
     61 
== HypertextCrawler ==
     63 
The !HypertextCrawler makes use of two external components: a mime type
identifier and a hypertext link extractor.
     66 
The latter component is required to know which resources are linked from a specific resource and should be crawled next. This functionality is realized as a separate component/service because there are many document types that support links (PDF might be a nice one to support next).
     68 
A link extractor is therefore mimetype-specific. In order to know which link extractor to use, one first needs to know the mime type of the starting resource, which is handled by the first component.
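
The two-step flow, identify the mime type first, then select the mimetype-specific link extractor, can be sketched as follows. Both components here are deliberately naive stand-ins (the identifier just sniffs for an `<html` tag, the extractor finds a single href), and all names are assumed rather than taken from the actual component interfaces:

```java
import java.util.List;
import java.util.Map;

// Sketch of the two-step flow: identify the mime type, then pick the
// mimetype-specific link extractor registered for it.
interface LinkExtractor {
    List<String> extractLinks(String content);
}

class HtmlLinkExtractor implements LinkExtractor {
    public List<String> extractLinks(String content) {
        // A real extractor would parse HTML; this toy version finds one href.
        int i = content.indexOf("href=\"");
        if (i < 0) return List.of();
        int end = content.indexOf('"', i + 6);
        return List.of(content.substring(i + 6, end));
    }
}

public class HypertextDemo {
    // Step 1: a (very naive) mime type identifier.
    static String identifyMimeType(String content) {
        return content.contains("<html") ? "text/html" : "application/octet-stream";
    }

    public static void main(String[] args) {
        Map<String, LinkExtractor> extractors = Map.of("text/html", new HtmlLinkExtractor());
        String page = "<html><a href=\"http://example.org/next\">next</a></html>";

        // Step 2: the identified mime type selects the extractor to use.
        LinkExtractor extractor = extractors.get(identifyMimeType(page));
        System.out.println(extractor.extractLinks(page));
    }
}
```

The extracted links would then be fed back to the crawler as the next resources to visit.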