*** DataSources

The central parts of the architecture are currently DataSource, DataCrawler,
DataAccessor and DataObject. Together they are used to access the contents of
an information system, such as a file system or web site.

A DataSource contains all information necessary to locate the information
items in a source. For example, a FileSystemDataSource has a set of one or
more directories on a file system, a set of patterns that describe what files
to include or exclude, etc.

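A minimal sketch of what this could look like (field names are illustrative,
not the actual API):

    import java.util.List;

    // Sketch only: a DataSource holds the information needed to locate
    // items, but does no crawling itself.
    public class FileSystemDataSource {
        private List folders;         // directories to scan
        private List includePatterns; // e.g. "*.doc"
        private List excludePatterns; // e.g. "*.tmp"

        // getters and setters omitted
    }
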
A DataCrawler is responsible for actually accessing the physical source and
reporting the individual information items as DataObjects. Each DataObject
contains all metadata provided by the data source, such as file names,
modification dates, etc., as well as the InputStream that provides access to
the physical resource.

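A DataObject could then be as simple as the following sketch (the method
names are assumptions, not the actual interface):

    import java.io.InputStream;
    import java.util.Map;

    // Sketch only: source-provided metadata plus a stream for the content.
    public interface DataObject {
        String getId();           // identifies the resource, e.g. by URL
        Map getMetadata();        // file names, modification dates, etc.
        InputStream getContent(); // access to the physical resource
    }
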
We have chosen to distinguish between a DataSource and a DataCrawler because
there may be several alternative crawling strategies for a single DataSource
type. Consider for example a generic FileSystemCrawler that handles any kind
of file system accessible through java.io.File versus a
WindowsFileSystemCrawler that uses OS-native functionality to get notified
about file additions, deletions and changes. Another possibility is a set of
DataCrawler implementations with different trade-offs between speed and
accuracy.

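In code, the separation could look like this (hypothetical names and
signatures; a real design would have a common DataSource supertype):

    // Sketch only: the same interface can be implemented by a generic
    // FileSystemCrawler (java.io.File traversal) as well as by a
    // WindowsFileSystemCrawler (OS-native change notification).
    public interface DataCrawler {
        void setDataSource(FileSystemDataSource source); // hypothetical
        void crawl(); // report encountered items as DataObjects
    }
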
Currently, a DataSource also contains support for writing its configuration
to, or initializing it from, an XML file. We might consider putting this in a
separate utility class, because the best way to store such information is
often application dependent.

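If we do split it off, such a utility could be as small as this sketch
(entirely hypothetical):

    import java.io.File;

    // Sketch only: configuration (de)serialization kept outside of
    // DataSource, since the best storage format is application dependent.
    public class DataSourceConfigIO {

        public static void write(FileSystemDataSource source, File xmlFile) {
            // serialize the folders and patterns to XML
        }

        public static FileSystemDataSource read(File xmlFile) {
            return null; // parse the XML and reconstruct the DataSource
        }
    }
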
A DataCrawler creates DataObjects for the individual information items it
encounters in the data source. These DataObjects are reported to the
DataCrawlerListeners registered with the DataCrawler. An abstract base class
(DataCrawlerBase) provides base functionality for maintaining information
about which files have been reported in the past, allowing for incremental
scanning.

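The callback interface might look like this (the method names are guesses,
chosen to show how incremental scanning could surface to listeners):

    // Sketch only: a DataCrawler pushes DataObjects to its listeners.
    public interface DataCrawlerListener {
        void objectNew(DataObject object);     // not reported before
        void objectChanged(DataObject object); // reported before, modified
        void objectRemoved(String id);         // reported before, now gone
    }

A listener would then be attached with something like
crawler.addDataCrawlerListener(listener).
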
In order to create a DataObject for a single resource encountered by the
DataCrawler, a DataAccessor is used. This functionality is deliberately kept
out of the DataCrawler implementations because several crawlers may be able
to make good use of the same data accessing functionality. A good example is
the FileSystemCrawler and the HypertextCrawler, which both make use of the
FileDataAccessor. Although they arrive at the physical resource in different
ways (by traversing folder trees vs. following links from other documents),
they can use the same functionality to turn a java.io.File into a
FileDataObject.

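A minimal DataAccessor interface in this spirit (hypothetical signatures):

    import java.io.IOException;

    // Sketch only: turns a single resource identifier into a DataObject,
    // independent of how a crawler arrived at that identifier.
    public interface DataAccessor {
        String getSupportedScheme(); // e.g. "file" or "http"

        // the real method also receives the AccessData discussed below
        DataObject getDataObject(String url) throws IOException;
    }
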
It should now be clear that a DataCrawler is specific to the kind of
DataSource it supports, whereas a DataAccessor is specific to the URL
scheme(s) it supports.

The AccessData instance used in DataCrawlerBase maintains the information
about which objects have been scanned before. This instance is passed to the
DataAccessor, as that is the class best suited to perform this detection. For
example, this allows the HttpDataAccessor to use HTTP-specific functionality
to let the web server decide whether the resource has changed since the last
scan, preventing an unchanged file from being transported to the crawling
side in the first place.

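As an illustration of that HTTP-specific functionality (not the actual
HttpDataAccessor code), a conditional GET with plain java.net looks like
this:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    class ConditionalGetExample {

        // lastScan would be the timestamp recorded in AccessData during
        // the previous crawl
        static boolean hasChanged(String url, long lastScan)
                throws IOException {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
            conn.setIfModifiedSince(lastScan); // If-Modified-Since header
            // 304 Not Modified: the server decided that nothing changed,
            // so no content is transported to the crawling side at all
            return conn.getResponseCode()
                != HttpURLConnection.HTTP_NOT_MODIFIED;
        }
    }
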
*** HypertextCrawler

The HypertextCrawler makes use of two external components: a mime type
identifier and a hypertext link extractor. The latter component is required
to determine which resources are linked from a specific resource and should
be crawled next. This functionality is realized as a separate
component/service because there are many document types that support links
(PDF might be a nice one to support next); a link extractor is thus
mimetype-specific. However, in order to know which link extractor to use, one
first needs to know the mime type of the starting resource, which is handled
by the first component.

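The two components could be captured in interfaces along these lines (the
names are made up for illustration):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.List;

    // Sketch only: first determine the mime type of the resource...
    interface MimeTypeIdentifier {
        String identify(InputStream document, String fileName)
            throws IOException;
    }

    // ...then select the matching, mimetype-specific link extractor.
    interface LinkExtractor {
        String getSupportedMimeType(); // e.g. "text/html"

        // the URLs linked from this document, to be crawled next
        List extractLinks(InputStream document) throws IOException;
    }
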
*** Email interpretation

The ImapDataAccessor is a fairly complex class that puts a lot of effort into
interpreting a mime message. Rather than just delivering the raw InputStream
of the Message, it produces a DataObject, possibly with child DataObjects,
that reflects as closely as possible the way in which mail readers display
the mail.

For example, what may seem to be a simple mail with a few headers and a body
may in fact be a multipart mail with two alternative bodies, one in plain
text and one in HTML. What conceptually is a single "information object" is
spread over four different JavaMail objects (a MimeMessage with a Multipart
containing two BodyParts, if I remember correctly). The ImapDataAccessor
tries to hide this multipart complexity and just creates a single DataObject
with headers and content.

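A simplified sketch of the kind of unfolding involved, using plain JavaMail
(the real ImapDataAccessor does considerably more):

    import javax.mail.Multipart;
    import javax.mail.Part;

    class MailBodyExample {

        // pick a displayable body from a possibly multipart message
        static String pickBody(Part part) throws Exception {
            if (part.isMimeType("text/plain")
                    || part.isMimeType("text/html")) {
                return (String) part.getContent();
            }
            if (part.isMimeType("multipart/alternative")) {
                Multipart mp = (Multipart) part.getContent();
                // by convention the last alternative is the preferred one
                for (int i = mp.getCount() - 1; i >= 0; i--) {
                    String body = pickBody(mp.getBodyPart(i));
                    if (body != null) {
                        return body;
                    }
                }
            }
            return null;
        }
    }
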
It may be a good idea to adapt the other mail crawlers, such as the existing
Outlook and Mozilla mail crawlers, so that they produce javax.mail.Message
objects. We can then refactor the ImapDataAccessor so that this
Message-interpretation code lives elsewhere, making it possible to also apply
it to the Messages created by these other mail crawlers. This allows us to
reuse the mail interpretation code across these mail formats.

If these other mail crawlers have access to the raw mail content (i.e. the
message as transported through SMTP), this may be rather easy to realize, as
the functionality to parse these lines and convert them into a Message data
structure is part of JavaMail. We should see if this functionality is
publicly available in the library.

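For reference, the relevant entry point appears to be public: MimeMessage has
a constructor that parses a raw RFC 822 message stream. A minimal sketch:

    import java.io.InputStream;
    import java.util.Properties;
    import javax.mail.Session;
    import javax.mail.internet.MimeMessage;

    class RawMailExample {

        // parse a raw message (as transported through SMTP) into a Message
        static MimeMessage parse(InputStream rawMail) throws Exception {
            Session session = Session.getDefaultInstance(new Properties());
            return new MimeMessage(session, rawMail);
        }
    }
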
*** Extractors

This API is still under discussion; that's why I shipped the older
TextExtractor implementations to DFKI.

The purpose of Extractor is to extract all information (full text and other)
from an InputStream of a specific document. Extractors are therefore
mimetype-specific.

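Purely as a strawman for that discussion, the interface could be as small as
this:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Map;

    // Sketch only: mimetype-specific extraction of full text and other
    // information from a document stream.
    interface Extractor {
        String getSupportedMimeType(); // e.g. "application/pdf"

        // the result container is itself under discussion; a Map or
        // "something RDF" (see below) are candidates
        void extract(InputStream document, Map results) throws IOException;
    }
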
Todo: describe and discuss final API

*** OSGi

Both Aduna and DFKI are in favour of using OSGi as a way to bundle these
components. At Aduna we have followed a specific way of modelling a service,
using a factory for every implementation of a service and a separate registry
that registers all implementations of a specific service. It is the
responsibility of the bundle activator of a service to register an instance
of a service implementation's factory with the service registry. This allows
for a very light-weight initialization of the system, provided that creating
a factory instance is itself very light-weight.

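A sketch of that pattern, using only org.osgi.framework classes
(DataCrawlerFactory and FileSystemCrawlerFactory are hypothetical names):

    import org.osgi.framework.BundleActivator;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.ServiceRegistration;

    public class FileSystemCrawlerActivator implements BundleActivator {

        private ServiceRegistration registration;

        public void start(BundleContext context) {
            // register only the factory, which is cheap to create; the
            // actual crawler is instantiated later, on demand
            registration = context.registerService(
                DataCrawlerFactory.class.getName(),
                new FileSystemCrawlerFactory(), null);
        }

        public void stop(BundleContext context) {
            registration.unregister();
        }
    }
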
Currently, Leo and Chris think that we should base our code only on pure OSGi
code (i.e. org.osgi.*) and not use any other utilities such as the dependency
manager that is currently used in the Aduna code. Perhaps Herko can tell us
more about what we're in for, because we both have hardly any experience with
OSGi yet.

*** Archives

Some functionality that is still missing, but that we at Aduna would really
like to have, is support for handling archives such as zip and rar files.

The interface for doing archive extraction will probably be a mixture of
Extractor and DataSource/DataCrawler. On the one hand, archive extractors
will be mimetype-specific and will operate on an InputStream (perhaps a
DataObject), just like an Extractor; on the other hand, they deliver a stream
of new DataObjects.

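One possible shape for that mixture (illustrative only):

    import java.io.InputStream;
    import java.util.Iterator;

    // Sketch only: selected by mime type and consuming a stream, like an
    // Extractor, but delivering new DataObjects, like a DataCrawler.
    interface ArchiveExtractor {
        String getSupportedMimeType(); // e.g. "application/zip"

        // the entries packed inside the archive, delivered as DataObjects
        Iterator extractEntries(InputStream archive);
    }
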
A URI scheme also has to be developed for such nested objects, so that you
can identify a stream packed inside an archive.

Support for zip and gzip is probably trivial, as these formats are already
accessible through java.util.zip. Rar is another format we encounter
sometimes. As far as I know there is no Java library available for it, but it
is an open format, i.e. the specs are available.

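For the zip case, java.util.zip indeed makes enumeration nearly trivial. The
sketch below also shows one conceivable nested-URI notation, borrowed from
Java's own "jar:<url>!/<entry>" convention; the notation is only an
illustration, not a decision:

    import java.io.FileInputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    class ZipExample {

        public static void main(String[] args) throws Exception {
            ZipInputStream zip =
                new ZipInputStream(new FileInputStream(args[0]));
            for (ZipEntry e = zip.getNextEntry(); e != null;
                    e = zip.getNextEntry()) {
                // e.g. zip:file:/data/docs.zip!/reports/january.txt
                System.out.println(
                    "zip:file:" + args[0] + "!/" + e.getName());
            }
            zip.close();
        }
    }
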
*** Opening resources

Besides crawling resources, we should also be able to open them.

At first this may look like a job for the DataAccessor, which after all has
knowledge about the details of the physical source.

On second thought, I believe that for opening files you need some other
service, parallel to DataAccessor, that is also scheme-specific and that
takes care of opening the files. Reasons:

- DataAccessors actually retrieve the files, which is not necessary for some
file openers. For example, for opening a local file you can instruct Windows
to do just that. Similarly, a web page can be retrieved and shown by a web
browser; there is no need for us to retrieve the contents and feed them to
the browser.

- There may be several alternative ways of opening a resource. For example,
the java.net JDIC project contains functionality for opening files and web
pages, whereas we have our own classes to do that.

This may be a good reason to decouple this functionality from the
DataAccessor and run it in parallel.

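Such a service could be as small as this sketch (the name DataOpener is made
up):

    import java.io.IOException;
    import java.net.URI;

    // Sketch only: scheme-specific like a DataAccessor, but it opens the
    // resource instead of retrieving it; an implementation might simply
    // delegate to the OS or to JDIC.
    interface DataOpener {
        String getSupportedScheme(); // e.g. "file" or "http"

        void open(URI uri) throws IOException;
    }
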
*** The use of RDF

We should discuss where and how RDF is used in this framework. In previous
email discussions we already thought about using RDF as a way to let an
Extractor output its extracted information, because of the flexibility it
provides:

- no assumptions about what the metadata looks like; it can be very simple
or very complex
- easy to store in RDF stores, no transformation necessary (provided that
you have named graphs support)

The same technique could also be used in the DataObjects, which now use a Map
with dedicated keys, defined per DataObject type. I would be in favour of
changing this to "something RDF", as it considerably eases development.

Leo came up with an idea that allows delivering RDF while at the same time
providing a simpler interface for programmers not knowledgeable in RDF. The
idea is to create a class that implements both the org.openrdf.model.Graph
interface and the java.util.Map interface. The effect of

    result.put(authorURI, "chris");

with authorURI being the URI of the author predicate, would then be
equivalent to

    result.add(documentURI, authorURI, "chris");

I.e., you can use the Map methods to insert simple resource-predicate-literal
statements (the majority), which is simple to document and understand,
whereas people who know what they are doing can also add arbitrary RDF
statements.

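A minimal sketch of that idea, reduced to a put method; a real version would
implement all of java.util.Map and org.openrdf.model.Graph:

    import java.util.ArrayList;
    import java.util.List;

    class MetadataGraph {

        private final String documentURI; // subject of all put() calls
        private final List statements = new ArrayList();

        MetadataGraph(String documentURI) {
            this.documentURI = documentURI;
        }

        // Map-style convenience for resource-predicate-literal statements
        public Object put(Object predicateURI, Object literal) {
            add(documentURI, (String) predicateURI, literal);
            return null; // a real Map.put would return the previous value
        }

        // Graph-style method for arbitrary statements
        public void add(String subject, String predicate, Object object) {
            statements.add(new Object[] { subject, predicate, object });
        }
    }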