= Leo: please don't edit, I'm working on this right now =

== DataSources and Friends ==

The central parts of the architecture are currently DataSource, DataCrawler,
DataAccessor and DataObject. Together they are used to access the contents of
an information system, such as a file system or a web site.

A DataSource contains all information necessary to locate the information
items in a source. For example, a FileSystemDataSource has a set of one or
more directories on a file system, a set of patterns that describe what files
to include or exclude, etc.
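
To make this more concrete, here is a minimal sketch in Java; all names are
hypothetical and not the final API:

  import java.io.File;
  import java.util.ArrayList;
  import java.util.List;

  interface DataSource {
      String getName();
  }

  class FileSystemDataSource implements DataSource {

      private final String name;
      private final List<File> rootFolders = new ArrayList<File>();
      private final List<String> includePatterns = new ArrayList<String>();
      private final List<String> excludePatterns = new ArrayList<String>();

      FileSystemDataSource(String name) {
          this.name = name;
      }

      public String getName() {
          return name;
      }

      // a DataCrawler consults these to decide which files to report
      public List<File> getRootFolders() { return rootFolders; }
      public List<String> getIncludePatterns() { return includePatterns; }
      public List<String> getExcludePatterns() { return excludePatterns; }
  }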

A DataCrawler is responsible for actually accessing the physical source and
reporting the individual information items as DataObjects. Each DataObject
contains all metadata provided by the data source, such as file names,
modification dates, etc., as well as the InputStream providing access to the
physical resource.
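
Again as a hypothetical sketch, a DataObject might boil down to this:

  import java.io.InputStream;
  import java.util.Map;

  interface DataObject {

      // e.g. a file: or http: URL identifying the resource
      String getId();

      // file name, modification date, etc., as reported by the source
      Map getMetadata();

      // access to the content of the physical resource
      InputStream getContent();
  }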

We have chosen to distinguish between a DataSource and a DataCrawler because
there may be several alternative crawling strategies for a single DataSource
type. Consider for example a generic FileSystemCrawler that handles any kind
of file system accessible through java.io.File, versus a
WindowsFileSystemCrawler that uses OS-native functionality to get notified
about file additions, deletions and changes. Another possibility is a set of
DataCrawler implementations with different trade-offs between speed and
accuracy.

Currently, a DataSource also contains support for writing its configuration
to or initializing it from an XML file. We might consider putting this in a
separate utility class, because the best way to store such information is
often application-dependent.

A DataCrawler creates DataObjects for the individual information items it
encounters in the data source. These DataObjects are reported to the
DataCrawlerListeners registered at the DataCrawler. An abstract base class
(DataCrawlerBase) provides the base functionality for maintaining information
about which files have been reported in the past, allowing for incremental
scanning.
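
The listener part is plain observer-style Java; a sketch, with hypothetical
method names and using the DataObject interface sketched above:

  import java.util.ArrayList;
  import java.util.List;

  interface DataCrawlerListener {

      // called for items that are new or changed since the previous scan
      void objectNew(DataObject object);
      void objectChanged(DataObject object);

      // called for items that have disappeared since the previous scan
      void objectRemoved(String id);
  }

  abstract class DataCrawlerBase {

      private final List<DataCrawlerListener> listeners =
          new ArrayList<DataCrawlerListener>();

      public void addListener(DataCrawlerListener listener) {
          listeners.add(listener);
      }

      protected void reportNew(DataObject object) {
          for (DataCrawlerListener listener : listeners) {
              listener.objectNew(object);
          }
      }
  }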

In order to create a DataObject for a single resource encountered by the
DataCrawler, a DataAccessor is used. This functionality is kept out of the
DataCrawler implementations on purpose, because several crawlers may be able
to make good use of the same data accessing functionality. A good example is
the FileSystemCrawler and HypertextCrawler pair, which both make use of the
FileDataAccessor. Although they arrive at the physical resource in different
ways (by traversing folder trees vs. following links from other documents),
they can use the same functionality to turn a java.io.File into a
FileDataObject.
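
A DataAccessor would then roughly have this shape (again a hypothetical
sketch, reusing the DataObject sketch above):

  import java.io.IOException;
  import java.util.Set;

  interface DataAccessor {

      // the scheme(s) this accessor handles, e.g. "file" or "http"
      Set<String> getSupportedSchemes();

      // fetch the resource and wrap it in a DataObject
      DataObject getDataObject(String url) throws IOException;
  }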

It should be clear now that a DataCrawler is specific to the kind of
DataSource it supports, whereas a DataAccessor is specific to the URL
scheme(s) it supports.

The AccessData instance used in DataCrawlerBase maintains the information
about which objects have been scanned before. This instance is passed to the
DataAccessor, as that is the class best placed to do this change detection.
For example, this allows the HttpDataAccessor to use HTTP-specific
functionality to let the web server decide whether the resource has changed
since the last scan, preventing an unchanged file from being transported to
the crawling side in the first place.
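
In plain java.net terms this comes down to a conditional GET; a sketch, where
lastScanned would come from the AccessData instance:

  import java.io.IOException;
  import java.net.HttpURLConnection;
  import java.net.URL;

  class ChangeDetection {

      // returns false when the server reports 304 Not Modified,
      // in which case no content is transported at all
      static boolean isModified(String location, long lastScanned)
              throws IOException {
          HttpURLConnection connection =
              (HttpURLConnection) new URL(location).openConnection();
          connection.setIfModifiedSince(lastScanned);
          return connection.getResponseCode()
              != HttpURLConnection.HTTP_NOT_MODIFIED;
      }
  }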

== HypertextCrawler ==

The HypertextCrawler makes use of two external components: a mime type
identifier and a hypertext link extractor. The latter component is required
to know which resources are linked from a specific resource and should be
crawled next. This functionality is realized as a separate component/service
because there are many document types that support links (PDF might be a nice
one to support next); a link extractor is thus mimetype-specific. However, in
order to know which link extractor to use, one first needs to know the mime
type of the starting resource, which is handled by the first component.
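
Composed, the two components might look like this (a sketch, all names
hypothetical):

  import java.io.IOException;
  import java.io.InputStream;
  import java.util.List;

  interface MimeTypeIdentifier {

      // determine the mime type, e.g. "text/html"
      String identify(InputStream stream, String fileName);
  }

  interface LinkExtractor {

      // returns the URLs of all resources linked from this document
      List<String> extractLinks(InputStream stream) throws IOException;
  }

  interface LinkExtractorRegistry {

      // look up the extractor for a mime type; null when unsupported
      LinkExtractor get(String mimeType);
  }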

== Email interpretation ==

The ImapDataAccessor is a fairly complex class that puts a lot of effort into
interpreting a mime message. Rather than just delivering the raw InputStream
of the Message, it produces a DataObject with possible child DataObjects that
reflect as closely as possible the way in which mail readers display the mail.

For example, what may seem to be a simple mail with a few headers and a body
may in fact be a multipart mail with two alternative bodies, one in plain
text and one in HTML. What conceptually is a single "information object" is
spread over four different JavaMail objects (a MimeMessage with a Multipart
containing two BodyParts, if I remember correctly). The ImapDataAccessor
tries to hide this multipart complexity and just creates a single DataObject
with headers and content.
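
The kind of JavaMail traversal involved looks roughly like this (a simplified
sketch; the real code handles many more cases):

  import java.io.IOException;
  import javax.mail.BodyPart;
  import javax.mail.Message;
  import javax.mail.MessagingException;
  import javax.mail.Multipart;

  class BodySelector {

      // pick a single displayable body from a multipart/alternative
      // message, the way a mail reader would
      static String getBodyText(Message message)
              throws MessagingException, IOException {
          Object content = message.getContent();
          if (content instanceof String) {
              return (String) content;
          }
          if (content instanceof Multipart) {
              Multipart multipart = (Multipart) content;
              for (int i = 0; i < multipart.getCount(); i++) {
                  BodyPart part = multipart.getBodyPart(i);
                  if (part.isMimeType("text/plain")) {
                      return (String) part.getContent();
                  }
              }
          }
          return null;
      }
  }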

It may be a good idea to adapt the other mail crawlers, such as the existing
Outlook and Mozilla mail crawlers, so that they produce javax.mail.Message
objects. We can then refactor the ImapDataAccessor so that this Message
interpretation code lives elsewhere, making it possible to also apply it to
the Messages created by these other mail crawlers. This allows us to reuse
the mail interpretation code across these mail formats.

If these other mail crawlers have access to the raw mail content (i.e. the
message as transported through SMTP), this may be rather easy to realize, as
the functionality to parse these lines and convert them into a Message data
structure is part of JavaMail. We should see if this functionality is
publicly available in the library.
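
It appears to be: MimeMessage has a public constructor that parses a raw
RFC 822 stream. A sketch:

  import java.io.InputStream;
  import java.util.Properties;
  import javax.mail.MessagingException;
  import javax.mail.Session;
  import javax.mail.internet.MimeMessage;

  class RawMailParser {

      // parse a raw message (as transported through SMTP) into a
      // Message object that shared interpretation code can work on
      static MimeMessage parse(InputStream rawMessage)
              throws MessagingException {
          Session session = Session.getDefaultInstance(new Properties());
          return new MimeMessage(session, rawMessage);
      }
  }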

== Extractors ==

This API is still under discussion, which is why I shipped the older
TextExtractor implementations to DFKI.

The purpose of Extractor is to extract all information (full text and other)
from an InputStream of a specific document. Extractors are therefore
mimetype-specific.
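
As a strawman for that discussion, the interface could be as small as this
(a hypothetical sketch, not the final API):

  import java.io.IOException;
  import java.io.InputStream;
  import java.util.Map;
  import java.util.Set;

  interface Extractor {

      // the mime type(s) this extractor can handle
      Set<String> getSupportedMimeTypes();

      // extract the full text and other metadata from the stream,
      // keyed in whatever form we settle on (see the RDF discussion)
      Map extract(String id, InputStream stream) throws IOException;
  }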

Todo: describe and discuss final API

== OSGi ==

Both Aduna and DFKI are in favour of using OSGi as a way to bundle these
components. At Aduna we have followed a specific way of modelling a service:
a factory for every implementation of a service, and a separate registry that
registers all implementations of a specific service. It is the responsibility
of the bundle activator of a service to register an instance of a service
implementation's factory with the service registry. This allows for a very
light-weight initialization of the system, provided that creation of a
factory instance is very light-weight.
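
In pure org.osgi.* terms the pattern boils down to something like the
following sketch (the factory and extractor types are hypothetical):

  import org.osgi.framework.BundleActivator;
  import org.osgi.framework.BundleContext;

  // hypothetical service interface: a light-weight factory that
  // creates the (possibly expensive) extractor on demand
  interface ExtractorFactory {
      Object createExtractor();
  }

  class PlainTextExtractorFactory implements ExtractorFactory {
      public Object createExtractor() {
          // a real factory would construct the actual Extractor here
          return new Object();
      }
  }

  class PlainTextExtractorActivator implements BundleActivator {

      public void start(BundleContext context) {
          // only a cheap factory instance is created at startup
          context.registerService(
              ExtractorFactory.class.getName(),
              new PlainTextExtractorFactory(),
              null);
      }

      public void stop(BundleContext context) {
          // services registered by this bundle are unregistered
          // automatically when it stops
      }
  }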

Currently, Leo and Chris think that we should base our code only on pure OSGi
code (i.e. org.osgi.*) and not use any other utilities such as the dependency
manager that is currently used in the Aduna code. Perhaps Herko can tell us
more about what we're in for, because we both have hardly any experience with
OSGi yet.

== Archives ==

Some functionality that is still missing but that we at Aduna would really
like to have is support for handling archives such as zip and rar files.

The interface for doing archive extraction will probably be a mixture of
Extractor and DataSource/DataCrawler. On the one hand, archive extractors
will be mimetype-specific and will operate on an InputStream (perhaps a
DataObject), just like an Extractor; on the other hand, they deliver a stream
of new DataObjects.

A URI scheme also has to be developed for such nested objects, so that you
can identify a stream packed inside an archive.
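
Java's jar: URLs may be a useful precedent here: they separate the archive
URL from the path inside it with a "!/" delimiter, for example:

  jar:file:///data/mails.zip!/folder/message.eml

Something similar could identify entries in arbitrary archive types.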

Support for zip and gzip is probably trivial, as these formats are already
accessible through java.util.zip. Rar is another format we encounter
sometimes. As far as I know there is no Java library available for it, but it
is an open format, i.e. the specs are available.
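
For zip, the enumeration sketch is indeed small:

  import java.io.IOException;
  import java.io.InputStream;
  import java.util.zip.ZipEntry;
  import java.util.zip.ZipInputStream;

  class ZipWalker {

      // walk all entries in a zip stream; an archive handler would
      // wrap each entry in a new DataObject
      static void listEntries(InputStream in) throws IOException {
          ZipInputStream zipStream = new ZipInputStream(in);
          ZipEntry entry;
          while ((entry = zipStream.getNextEntry()) != null) {
              if (!entry.isDirectory()) {
                  // zipStream is now positioned at this entry's content
                  System.out.println(entry.getName());
              }
              zipStream.closeEntry();
          }
      }
  }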

== Opening resources ==

Besides crawling resources, we should also be able to open them.

At first this may look like a job for the DataAccessor, which after all has
knowledge about the details of the physical source.

On second thought, I believe that for opening resources you need some other
service, parallel to DataAccessor, that is also scheme-specific and that
takes care of the opening. Reasons:

- DataAccessors actually retrieve the files, which is not necessary for some
file openers. For example, for opening a local file you can instruct Windows
to do just that. Similarly, a web page can be retrieved and shown by a web
browser; there is no need for us to retrieve the contents and feed them to
the browser.

- There may be several alternative ways of opening a resource. For example,
the java.net JDIC project contains functionality for opening files and web
pages, whereas we have our own classes to do that as well.

These seem good reasons to decouple this functionality from the DataAccessor
and offer it as a parallel, scheme-specific service.
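
A sketch of such a service, with hypothetical names; the Windows command used
here is one well-known way to let Windows pick the registered application for
a URL:

  import java.io.IOException;
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.Set;

  interface ResourceOpener {

      // e.g. "file", "http"
      Set<String> getSupportedSchemes();

      // open the resource without retrieving its contents ourselves
      void open(String url) throws IOException;
  }

  class WindowsResourceOpener implements ResourceOpener {

      public Set<String> getSupportedSchemes() {
          return new HashSet<String>(Arrays.asList("file", "http"));
      }

      public void open(String url) throws IOException {
          // delegates to the application registered for this URL type
          Runtime.getRuntime().exec(new String[] {
              "rundll32", "url.dll,FileProtocolHandler", url });
      }
  }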

== The use of RDF ==

We should discuss where and how RDF is used in this framework. In previous
email discussions we already thought about using RDF as the way for an
Extractor to output its extracted information, because of the flexibility it
provides:

- no assumptions about what the metadata looks like; it can be very simple
or very complex
- easy to store in RDF stores, no transformation necessary (provided that
you have named graph support)

The same technique could also be used in the DataObjects, which now use a Map
with dedicated keys, defined per DataObject type. I would be in favour of
changing this to "something RDF", as it considerably eases development.

Leo came up with an idea that allows delivering RDF while at the same time
providing a simpler interface to programmers not knowledgeable in RDF. The
idea is to create a class that implements both the org.openrdf.model.Graph
interface and the java.util.Map interface. The effect of

  result.put(authorURI, "chris");

where authorURI is the URI of the author predicate, would then be the same
as that of

  result.add(documentURI, authorURI, "chris");

I.e., you can use the Map methods to insert simple resource-predicate-literal
statements (the majority of cases), which is easy to document and understand,
whereas people who know what they are doing can also add arbitrary RDF
statements.
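
A much-simplified sketch of the idea (it ignores the full Map and Graph
contracts, and addStatement stands in for the real Graph API):

  import java.util.HashMap;
  import java.util.Map;

  class MetadataGraph {

      private final String subjectURI; // e.g. the document's URI
      private final Map predicateToLiteral = new HashMap();

      MetadataGraph(String subjectURI) {
          this.subjectURI = subjectURI;
      }

      // the simple Map-style view: predicate -> literal becomes a
      // statement about the fixed subject
      public Object put(Object predicateURI, Object literal) {
          addStatement(subjectURI, (String) predicateURI, (String) literal);
          return predicateToLiteral.put(predicateURI, literal);
      }

      // stand-in for Graph.add(subject, predicate, object); people who
      // know RDF can call this directly with arbitrary statements
      public void addStatement(String subject, String predicate,
              String object) {
          // a real implementation would create and store a Statement
      }
  }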