Changes between Version 1 and Version 2 of ApertureDataSource

Timestamp: 10/12/05 13:02:15
Author: anonymous

= DataSources et al =

The central parts in the architecture are currently DataSource, DataCrawler,
DataAccessor and DataObject. Together they are used to access the contents of
an information system, such as a file system or web site.

A DataSource contains all information necessary to locate the information
items in a source. For example, a FileSystemDataSource has a set of one or
more directories on a file system, a set of patterns that describe what files
to include or exclude, etc.

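To make the configuration side concrete, the sketch below shows what a file
system source with root directories and include/exclude patterns could look
like. All names (SimpleFileSystemSource, accepts, ...) are illustrative
assumptions, not the actual Aperture interfaces.

{{{
#!java
// Illustrative sketch only; class and method names are hypothetical,
// not the actual Aperture API.
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SimpleFileSystemSource {

    private final List<File> rootDirectories = new ArrayList<File>();
    private final List<Pattern> includePatterns = new ArrayList<Pattern>();
    private final List<Pattern> excludePatterns = new ArrayList<Pattern>();

    public void addRootDirectory(File directory) {
        rootDirectories.add(directory);
    }

    public void addIncludePattern(String regex) {
        includePatterns.add(Pattern.compile(regex));
    }

    public void addExcludePattern(String regex) {
        excludePatterns.add(Pattern.compile(regex));
    }

    public List<File> getRootDirectories() {
        return rootDirectories;
    }

    /** Decides whether a crawler should report this file, based on the patterns. */
    public boolean accepts(File file) {
        String name = file.getName();
        for (Pattern exclude : excludePatterns) {
            if (exclude.matcher(name).matches()) {
                return false;
            }
        }
        if (includePatterns.isEmpty()) {
            return true;
        }
        for (Pattern include : includePatterns) {
            if (include.matcher(name).matches()) {
                return true;
            }
        }
        return false;
    }
}
}}}
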
A DataCrawler is responsible for actually accessing the physical source and
reporting the individual information items as DataObjects. Each DataObject
contains all metadata provided by the data source, such as file names,
modification dates, etc., as well as the InputStream providing access to the
physical resource.

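As a rough illustration, a DataObject along these lines could minimally expose
an identifier, the source-provided metadata and the content stream. The
interface below is a hedged sketch, not the actual Aperture API.

{{{
#!java
// Illustrative sketch of what a DataObject could minimally expose;
// these signatures are assumptions, not the actual Aperture API.
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

public interface SimpleDataObject {

    /** URL-like identifier of the resource, e.g. "file:///home/user/report.txt". */
    String getIdentifier();

    /** Metadata provided by the data source: file name, modification date, size, etc. */
    Map<String, String> getMetadata();

    /** Stream providing access to the physical resource's content. */
    InputStream getContent() throws IOException;
}
}}}
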
We have chosen to distinguish between a DataSource and a DataCrawler as there
may be several alternative crawling strategies for a single DataSource type.
Consider for example a generic FileSystemCrawler that handles any kind of
file system accessible through java.io.File versus a WindowsFileSystemCrawler
using OS-native functionality to get notified about file additions, deletions
and changes. Another possibility is various DataCrawler implementations that
have different trade-offs in speed and accuracy.

Currently, a DataSource also contains support for writing its configuration
to or initializing it from an XML file. We might consider putting this in a
separate utility class, because the best way to store such information is
often application dependent.

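As a rough illustration of the kind of XML round-trip involved, plain
java.util.Properties can already persist a simple key/value configuration;
the actual format and utility class used by Aperture may well differ.

{{{
#!java
// Illustrative sketch: persisting a simple key/value configuration as XML with
// java.util.Properties; Aperture's actual XML handling may differ.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

public class DataSourceConfigIO {

    public static void save(Properties configuration, String fileName) throws IOException {
        FileOutputStream out = new FileOutputStream(fileName);
        try {
            configuration.storeToXML(out, "data source configuration");
        } finally {
            out.close();
        }
    }

    public static Properties load(String fileName) throws IOException {
        Properties configuration = new Properties();
        FileInputStream in = new FileInputStream(fileName);
        try {
            configuration.loadFromXML(in);
        } finally {
            in.close();
        }
        return configuration;
    }
}
}}}
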
A DataCrawler creates DataObjects for the individual information items it
encounters in the data source. These DataObjects are reported to
DataCrawlerListeners registered at the DataCrawler. An abstract base class
(DataCrawlerBase) is provided that implements base functionality for
maintaining information about which files have been reported in the past,
allowing for incremental scanning.

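The listener mechanism could look roughly like the sketch below (reusing the
SimpleDataObject sketch from above); the callback names are assumptions, not
the actual Aperture interface.

{{{
#!java
// Illustrative sketch of the listener callbacks a DataCrawler could offer
// (SimpleDataObject is the sketch interface shown earlier on this page);
// the names are assumptions, not the actual Aperture interface.
public interface SimpleCrawlerListener {

    /** Called for a resource that has not been reported in a previous scan. */
    void objectNew(SimpleDataObject object);

    /** Called for a previously reported resource that changed since the last scan. */
    void objectChanged(SimpleDataObject object);

    /** Called for a previously reported resource that no longer exists. */
    void objectRemoved(String identifier);
}
}}}
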
In order to create a DataObject for a single resource encountered by the
DataCrawler, a DataAccessor is used. This functionality is kept out of the
DataCrawler implementations on purpose because there may be several crawlers
that can make good use of the same data accessing functionality. A good
example is the FileSystemCrawler and HypertextCrawler, which both make use of
the FileDataAccessor. Although they arrive at the physical resource in
different ways (by traversing folder trees vs. following links from other
documents), they can use the same functionality to turn a java.io.File into a
FileDataObject.

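A hedged sketch of the accessor idea follows: one reusable component that
turns a URL into a data object, here shown for file: URLs. Interface and class
names are assumptions, not the actual FileDataAccessor.

{{{
#!java
// Illustrative sketch of the accessor idea: a reusable component that turns a URL
// into a data object, independent of the crawler that encountered it.
// Names and signatures are assumptions, not the actual FileDataAccessor.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

interface SimpleDataAccessor {

    /** Creates a data object for the resource identified by the given URL. */
    SimpleDataObject getDataObject(String url) throws IOException;
}

class SimpleFileAccessor implements SimpleDataAccessor {

    public SimpleDataObject getDataObject(final String url) throws IOException {
        final File file = new File(java.net.URI.create(url));
        final Map<String, String> metadata = new HashMap<String, String>();
        metadata.put("name", file.getName());
        metadata.put("lastModified", String.valueOf(file.lastModified()));
        metadata.put("size", String.valueOf(file.length()));

        // SimpleDataObject is the sketch interface shown earlier on this page.
        return new SimpleDataObject() {
            public String getIdentifier() { return url; }
            public Map<String, String> getMetadata() { return metadata; }
            public InputStream getContent() throws IOException {
                return new FileInputStream(file);
            }
        };
    }
}
}}}
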
It should be clear now that a DataCrawler is specific for the kind of
DataSource it supports, whereas a DataAccessor is specific for the URL
scheme(s) it supports.

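One way a crawler could pick the right accessor is a simple scheme-to-accessor
registry, sketched below; this registry is hypothetical and only meant to
illustrate the scheme-specific nature of accessors.

{{{
#!java
// Illustrative sketch: looking up an accessor by URL scheme. The registry is
// hypothetical and only meant to show the scheme-specific nature of accessors.
import java.util.HashMap;
import java.util.Map;

public class AccessorRegistry {

    private final Map<String, SimpleDataAccessor> accessorsByScheme =
            new HashMap<String, SimpleDataAccessor>();

    public void register(String scheme, SimpleDataAccessor accessor) {
        accessorsByScheme.put(scheme, accessor);
    }

    /** Returns the accessor registered for the scheme of the given URL ("file", "http", ...). */
    public SimpleDataAccessor getAccessor(String url) {
        String scheme = url.substring(0, url.indexOf(':'));
        return accessorsByScheme.get(scheme);
    }
}
}}}
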
The AccessData instance used in DataCrawlerBase maintains the information
about which objects have been scanned before. This instance is passed to the
DataAccessor, as that is the best place to perform this detection. For
example, this allows the HttpDataAccessor to use HTTP-specific functionality
to let the web server decide whether the resource has changed since the last
scan, preventing an unchanged file from being transported to the crawling
side in the first place.

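The HTTP-specific functionality referred to here is typically a conditional
GET: the crawler passes the timestamp of the last scan and the server answers
"304 Not Modified" if nothing changed. A minimal sketch with the standard
java.net API (the real HttpDataAccessor may work differently):

{{{
#!java
// Illustrative sketch of a conditional HTTP GET: the server is asked to return the
// resource only if it changed after the last scan. The actual HttpDataAccessor may differ.
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalFetch {

    /**
     * Returns the content stream if the resource changed since lastScanMillis,
     * or null if the server answers "304 Not Modified".
     */
    public static InputStream fetchIfModified(String url, long lastScanMillis)
            throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(url).openConnection();
        connection.setIfModifiedSince(lastScanMillis); // sends an If-Modified-Since header

        if (connection.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
            connection.disconnect();
            return null; // unchanged: nothing needs to be transported or reported
        }
        return connection.getInputStream();
    }
}
}}}
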
== HypertextCrawler ==

The HypertextCrawler makes use of two external components: a MIME type
identifier and a hypertext link extractor. The latter component is required
to know which resources are linked from a specific resource and should be
crawled next. This functionality is realized as a separate component/service
as there are many document types that support links (PDF might be a nice one
to support next). A specific link extractor is thus MIME type-specific.
However, in order to know which link extractor to use, one first needs to
know the MIME type of the starting resource, which is handled by the first
component.
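
To illustrate how the two components fit together, the sketch below shows a
link extractor interface, a deliberately naive HTML implementation and a
MIME-type-based lookup; all of this is hypothetical and not the actual
Aperture components.

{{{
#!java
// Illustrative sketch of the two components described above: a link extractor selected
// by MIME type, plus a trivial HTML implementation. Names and the regex-based extraction
// are simplifying assumptions, not the actual Aperture components.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

interface SimpleLinkExtractor {

    /** Returns the URLs linked from the given document content. */
    List<String> extractLinks(String content);
}

class HtmlLinkExtractor implements SimpleLinkExtractor {

    // Very naive: picks up href="..." attributes; a real extractor would use an HTML parser.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    public List<String> extractLinks(String content) {
        List<String> links = new ArrayList<String>();
        Matcher matcher = HREF.matcher(content);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }
}

class LinkExtractorRegistry {

    private final Map<String, SimpleLinkExtractor> extractorsByMimeType =
            new HashMap<String, SimpleLinkExtractor>();

    public void register(String mimeType, SimpleLinkExtractor extractor) {
        extractorsByMimeType.put(mimeType, extractor);
    }

    /** The MIME type must be determined first (by the identifier component) to pick the extractor. */
    public SimpleLinkExtractor getExtractor(String mimeType) {
        return extractorsByMimeType.get(mimeType);
    }
}
}}}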