A DataSource contains all information necessary to locate the information items in a source. For example, a FileSystemDataSource has a set of one or more directories on a file system, a set of patterns that describe which files to include or exclude, etc.
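As a rough illustration of this idea, the sketch below models a file-system data source as a plain description of roots plus include/exclude patterns. The class and method names here are illustrative assumptions, not the actual Aperture API; note that the data source only describes where items live and does no I/O itself.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of a file-system data source. It holds the
// configuration (roots, include/exclude patterns) but performs no I/O.
class FileSystemDataSource {
    private final List<String> rootDirectories;
    private final List<Pattern> includePatterns;
    private final List<Pattern> excludePatterns;

    FileSystemDataSource(List<String> roots,
                         List<Pattern> includes,
                         List<Pattern> excludes) {
        this.rootDirectories = roots;
        this.includePatterns = includes;
        this.excludePatterns = excludes;
    }

    // A path is accepted when it matches at least one include pattern
    // and no exclude pattern.
    boolean accepts(String path) {
        boolean included = includePatterns.stream()
                .anyMatch(p -> p.matcher(path).matches());
        boolean excluded = excludePatterns.stream()
                .anyMatch(p -> p.matcher(path).matches());
        return included && !excluded;
    }
}
```

A crawler would later walk the configured roots and consult `accepts` for each candidate file.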
A DataCrawler is responsible for actually accessing the physical source and reporting the individual information items as DataObjects. Each DataObject contains all metadata provided by the data source, such as file names, modification dates, etc., as well as the InputStream providing access to the physical resource.
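The crawler/listener contract described above might look roughly like the following. This is a minimal sketch under assumed names; the real Aperture interfaces may differ in signatures and detail.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical DataObject: metadata from the data source plus a stream
// giving access to the physical resource.
class DataObject {
    final String id;                    // e.g. a file URI
    final Map<String, String> metadata; // file name, modification date, ...
    final InputStream content;

    DataObject(String id, Map<String, String> metadata, InputStream content) {
        this.id = id;
        this.metadata = metadata;
        this.content = content;
    }
}

interface DataCrawlerListener {
    void objectFound(DataObject object);
}

class DataCrawler {
    private final List<DataCrawlerListener> listeners = new ArrayList<>();

    void addListener(DataCrawlerListener l) { listeners.add(l); }

    // A real crawler would walk the data source; this stub reports a
    // single object to demonstrate the reporting flow.
    void crawl() {
        Map<String, String> meta = new HashMap<>();
        meta.put("fileName", "report.txt");
        DataObject obj = new DataObject("file:///docs/report.txt", meta,
                new ByteArrayInputStream("hello".getBytes()));
        for (DataCrawlerListener l : listeners) l.objectFound(obj);
    }
}
```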
We have chosen to distinguish between a DataSource and a DataCrawler because there may be several alternative crawling strategies for a single DataSource type. Consider, for example, a generic FileSystemCrawler that handles any kind of file system accessible through java.io.File versus a WindowsFileSystemCrawler that uses OS-native functionality to get notified about file additions, deletions and changes. Another possibility is various DataCrawler implementations with different trade-offs between speed and accuracy.
A DataCrawler creates DataObjects for the individual information items it encounters in the data source. These DataObjects are reported to the DataCrawlerListeners registered at the DataCrawler. An abstract base class, DataCrawlerBase, provides base functionality for maintaining information about which files have been reported in the past, allowing for incremental scanning.
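The incremental-scanning bookkeeping could be sketched as below. The method name and use of a modification stamp are assumptions for illustration; the real DataCrawlerBase may track this differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the bookkeeping an incremental crawler base
// class could provide.
abstract class DataCrawlerBase {
    // Maps a resource id to the modification stamp seen on the last crawl.
    private final Map<String, Long> reported = new HashMap<>();

    // Records the current stamp and returns true if the resource is new
    // or has changed since the previous scan.
    protected boolean isNewOrChanged(String id, long lastModified) {
        Long previous = reported.put(id, lastModified);
        return previous == null || previous != lastModified;
    }
}
```

A concrete crawler would call this for every resource it encounters and skip reporting the unchanged ones.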
In order to create a DataObject for a single resource encountered by the DataCrawler, a DataAccessor is used. This functionality is deliberately kept out of the DataCrawler implementations because several crawlers may make good use of the same data-accessing functionality. A good example is the FileSystemCrawler and the HypertextCrawler, which both make use of the FileDataAccessor. Although they arrive at the physical resource in different ways (by traversing folder trees vs. following links from other documents), they can use the same functionality to turn a java.io.File into a FileDataObject.
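The shared-accessor idea can be sketched as follows: the accessor converts a java.io.File into a data object regardless of how the crawler found that file. The field and method names are illustrative assumptions.

```java
import java.io.File;

// Hypothetical minimal FileDataObject; the real one would also carry
// the InputStream and further metadata.
class FileDataObject {
    final String uri;
    final long lastModified;

    FileDataObject(String uri, long lastModified) {
        this.uri = uri;
        this.lastModified = lastModified;
    }
}

// Hypothetical accessor shared by crawlers: both a FileSystemCrawler
// (folder traversal) and a HypertextCrawler (link following) can call
// this once they hold a java.io.File.
class FileDataAccessor {
    FileDataObject getDataObject(File file) {
        return new FileDataObject(file.toURI().toString(), file.lastModified());
    }
}
```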
The AccessData instance used in DataCrawlerBase maintains the information about which objects have been scanned before. This instance is passed to the DataAccessor, as that is the best class to do this change detection. For example, this allows the HttpDataAccessor to use HTTP-specific functionality to let the web server decide whether the resource has changed since the last scan, preventing an unchanged file from being transported to the crawling side in the first place.
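The HTTP-specific change detection mentioned here is typically a conditional request: sending an If-Modified-Since header lets the server answer 304 Not Modified, so unchanged content is never transferred. The sketch below (using the standard java.net.http API, Java 11+) only builds such a request; the class name and the idea that the date comes from AccessData are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper showing how an HTTP accessor could turn the
// last-scan date recorded in AccessData into a conditional GET.
class HttpChangeDetection {
    static HttpRequest conditionalGet(String url, String lastScanDate) {
        return HttpRequest.newBuilder(URI.create(url))
                // RFC 7232 conditional request: the server replies
                // 304 Not Modified if the resource is unchanged.
                .header("If-Modified-Since", lastScanDate)
                .GET()
                .build();
    }
}
```

On a 304 response the accessor would report the resource as unmodified instead of creating a new DataObject.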
The HypertextCrawler makes use of two external components: a MIME type identifier and a hypertext link extractor. The latter component is required to know which resources are linked from a specific resource and should be crawled next. This functionality is realized as a separate component/service because there are many document types that support links (PDF might be a nice one to support next); a specific link extractor is thus MIME type-specific. However, in order to know which link extractor to use, one first needs to know the MIME type of the starting resource, which is handled by the first component.
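The two-step lookup — identify the MIME type, then pick the extractor registered for that type — could be wired up as in this sketch. The registry and interface names are illustrative assumptions, not the actual service API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical MIME type-specific link extractor.
interface LinkExtractor {
    List<String> extractLinks(String content);
}

// Hypothetical registry: once the MIME type identifier has classified
// the starting resource, the crawler looks up the matching extractor.
class LinkExtractorRegistry {
    private final Map<String, LinkExtractor> extractors = new HashMap<>();

    void register(String mimeType, LinkExtractor extractor) {
        extractors.put(mimeType, extractor);
    }

    // Returns null when no extractor supports the type (e.g.
    // application/pdf before a PDF extractor is added).
    LinkExtractor get(String mimeType) {
        return extractors.get(mimeType);
    }
}
```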