The !DataAccessor has been decoupled from the !DataCrawler for the following reasons:

 * Maximum code reusability in customer projects in complex enterprise environments. Consider, for example, various document management systems which may have a filesystem- or website-like structure (i.e. a folder tree or a hypertext graph). Such systems may only need a dedicated !DataAccessor that knows how to access them, as the crawler itself can then be reused.[[BR]]I must admit, however, that this is currently only partially the case:[[BR]](1) The !HypertextCrawler is truly scheme-independent: it retrieves !DataAccessors based on the scheme of a URL and uses a !MimeTypeIdentifier and a !LinkExtractor to determine which pages to load next, but it uses URLs internally, leading to the problem described above (in the "URLs vs. URIs vs. Strings" part).[[BR]](2) The !FileSystemCrawler still uses java.io.File to discover the folder tree and the files, so we would need to delegate the part that discovers folders, subfolders and files in a scheme-dependent way, similar to how the !HypertextCrawler delegates type detection and link extraction to other objects.
 * Another reason for this decoupling is that several crawlers may use the same !DataAccessors, e.g. a !FileSystemCrawler and a !HypertextCrawler that both use a !FileDataAccessor. For us this is a realistic scenario, e.g. for crawling intranets that are available on a network drive.
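The scheme-based lookup described above can be sketched as follows. This is a minimal illustration, not the actual Aperture API: `DataAccessorRegistry`, `forUrl` and the two accessor classes are hypothetical names invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical marker interface; the real DataAccessor interface is shown further below.
interface Accessor { }

class FileAccessor implements Accessor { }
class HttpAccessor implements Accessor { }

// Illustrative registry: a crawler only needs the scheme of a URL to find the
// right accessor, so the same crawler code can serve files, web sites, or a
// document management system with its own dedicated accessor.
class DataAccessorRegistry {
    private final Map<String, Accessor> byScheme = new HashMap<>();

    public void register(String scheme, Accessor accessor) {
        byScheme.put(scheme, accessor);
    }

    public Accessor forUrl(String url) {
        int colon = url.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("URL has no scheme: " + url);
        }
        Accessor accessor = byScheme.get(url.substring(0, colon));
        if (accessor == null) {
            throw new IllegalArgumentException("unsupported scheme in: " + url);
        }
        return accessor;
    }
}
```

With such a registry, a !FileSystemCrawler and a !HypertextCrawler could both be handed the same `FileAccessor` instance for `file:` URLs, which is exactly the network-drive intranet scenario mentioned above.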
 * Leo: to simplify and separate the '''getDataObject (=get it now!)''' and '''getDataObjectCrawl (=check if changed, get if changed)''' I would suggest defining two methods, one for really getting a resource and one for the crawling scenario.

 * I also renamed the method from '''get()''' to '''getDataObject()''': when an object implements both DataAccessor and other interfaces, the semantics of the method name get() are fuzzy.
 * During one call, the DataAccessor has the following tasks:
 * - check if the URL is ok
 * - check redirects: if the URL is redirected to another URI,
 *   the DataObject will have the new URI as identifier
 * - check changes (was the object changed since the last crawl); only needed in getDataObjectCrawl()
 * - if crawling: update the CrawlData with the new datetime, size, etc.
 * - detect whether the result is going to be a plain DataObject, a DataObjectFile or a DataObjectFolder,
 *   and go on accordingly
 * - open the stream
 * - detect the mime type (using all tricks available: http headers, file extensions, magic bytes)
 * - detect the byte size
 * - extract the most basic metadata (only the data that is already there)
 * - create a new DataObject with all of the above
 * - and return it
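The task list above can be walked through with a toy file-based accessor. This is a hedged sketch under simplifying assumptions: `SimpleDataObject` and `SimpleFileAccessor` are invented names, redirect handling is reduced to path canonicalization, and mime detection uses only the file extension (no http headers or magic bytes).

```java
import java.io.File;
import java.io.IOException;

// Illustrative stand-in for a DataObject; the real class carries far more metadata.
class SimpleDataObject {
    final String id;        // possibly differs from the requested url (canonicalization)
    final long size;        // byte size, part of the "basic metadata that is there already"
    final boolean folder;   // DataObjectFolder vs. DataObjectFile decision
    final String mimeType;  // null for folders

    SimpleDataObject(String id, long size, boolean folder, String mimeType) {
        this.id = id;
        this.size = size;
        this.folder = folder;
        this.mimeType = mimeType;
    }
}

class SimpleFileAccessor {
    public SimpleDataObject getDataObject(String url) throws IOException {
        // 1. check if the URL is ok (this toy accessor only handles file: URLs)
        if (!url.startsWith("file:")) {
            throw new IOException("not a file url: " + url);
        }
        File f = new File(url.substring("file:".length()));
        if (!f.exists()) {
            throw new IOException("not found: " + url);
        }
        // 2. "redirects": use the canonical path as the object's identifier,
        //    so the DataObject's id may differ from the requested url
        String id = "file:" + f.getCanonicalPath();
        // 3. folder vs. file decides which kind of DataObject to build
        boolean folder = f.isDirectory();
        // 4. mime type from the extension only (headers and magic bytes omitted)
        String mime = folder ? null
                : url.endsWith(".txt") ? "text/plain" : "application/octet-stream";
        // 5. byte size and other basic metadata, then build and return the object
        long size = folder ? 0 : f.length();
        return new SimpleDataObject(id, size, folder, mime);
    }
}
```

The change-detection and CrawlData-update steps are deliberately left out here, since they only apply in the getDataObjectCrawl() variant.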
public DataObject getDataObjectCrawl(String url, DataSource source,
        CrawlData crawlData, Map params) throws UriNotFoundException, IOException;

/**
 * Get a DataObject for the specified url. The resulting DataObject's ID may differ
 * from the specified url due to normalization schemes, following of redirected URLs, etc.
 * This method is independent of access during crawling sessions.
 *
 * Specific DataAccessor implementations may accept additional parameters through the params Map.
 *
 * @param url The url locator of the resource. If the resource is identified by some
 *            other URI, the DataAccessor will follow redirects accordingly.
 * @param source The DataSource that will be registered as the source of the DataObject.
 * @param params Optional additional parameters needed to access the physical resource.
 *            Parameters may also be passed that determine how the metadata should be
 *            extracted or which level of detail is needed. Applications may pass
 *            params through the whole chain.
 * @return A DataObject for the specified url
 * @throws UriNotFoundException when the binary resource could not be found
 * @throws IOException when any other kind of I/O error occurs
 */
public DataObject getDataObject(String url, DataSource source,
        CrawlData crawlData, Map params) throws UriNotFoundException, IOException;

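To show how a caller would use the two-exception contract of getDataObject(), here is a minimal mock. Everything in it is invented for illustration (`MockAccessor`, the single-argument signature, lower-casing as a stand-in for URL normalization); only the exception-handling pattern mirrors the interface above.

```java
import java.io.IOException;

// Illustrative checked exception mirroring UriNotFoundException in the interface above.
class NotFoundException extends IOException {
    NotFoundException(String url) { super("not found: " + url); }
}

class MockAccessor {
    // Pretends every http URL exists; returns a "normalized" id (here: lower-cased)
    // to mimic the note that the DataObject's ID may differ from the requested url.
    public String getDataObject(String url) throws NotFoundException, IOException {
        if (!url.startsWith("http:")) {
            throw new NotFoundException(url);
        }
        return url.toLowerCase();
    }
}
```

A caller can then distinguish "resource does not exist" (catch the not-found exception) from transient I/O trouble (catch IOException), which is the point of declaring both in the interface.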