= Aperture Architecture =

*** DataSources

The central parts of the architecture are currently DataSource, DataCrawler,
DataAccessor and DataObject. Together they are used to access the contents of
an information system, such as a file system or web site.

A DataSource contains all information necessary to locate the information
items in a source. For example, a FileSystemDataSource has a set of one or
more directories on a file system, a set of patterns that describe which
files to include or exclude, etc.
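
To make this concrete, a FileSystemDataSource configuration could look
roughly as follows. This is a minimal sketch; the class shape and setter
names are assumptions for illustration, not the actual API.

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical sketch of a file system DataSource; names are
    // illustrative only.
    public class FileSystemDataSource {
        private List roots;            // directories to crawl
        private List includePatterns;  // e.g. "*.pdf"
        private List excludePatterns;  // e.g. "*.tmp"

        public void setRoots(List roots) { this.roots = roots; }
        public void setIncludePatterns(List p) { includePatterns = p; }
        public void setExcludePatterns(List p) { excludePatterns = p; }

        public static void main(String[] args) {
            FileSystemDataSource source = new FileSystemDataSource();
            source.setRoots(Arrays.asList(new String[] { "/home/chris/docs" }));
            source.setIncludePatterns(Arrays.asList(new String[] { "*.pdf" }));
            source.setExcludePatterns(Arrays.asList(new String[] { "*.tmp" }));
        }
    }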

A DataCrawler is responsible for actually accessing the physical source and
reporting the individual information items as DataObjects. Each DataObject
contains all metadata provided by the data source, such as file names,
modification dates, etc., as well as an InputStream providing access to the
physical resource.
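
Conceptually, a DataObject then boils down to something like this (a sketch;
the method names are illustrative, not the fixed interface):

    import java.io.InputStream;
    import java.util.Map;

    // Conceptual sketch only; method names are illustrative.
    public interface DataObject {
        String getID();            // identifier, e.g. a file path or URL
        Map getMetadata();         // file name, modification date, etc.
        InputStream getContent();  // access to the physical resource
    }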

We have chosen to distinguish between a DataSource and a DataCrawler as there
may be several alternative crawling strategies for a single DataSource type.
Consider for example a generic FileSystemCrawler that handles any kind of
file system accessible through java.io.File versus a WindowsFileSystemCrawler
that uses OS-native functionality to get notified about file additions,
deletions and changes. Another possibility is a set of DataCrawler
implementations with different trade-offs between speed and accuracy.

Currently, a DataSource also contains support for writing its configuration
to or initializing it from an XML file. We might consider putting this in a
separate utility class, because the best way to store such information is
often application-dependent.

A DataCrawler creates DataObjects for the individual information items it
encounters in the data source. These DataObjects are reported to the
DataCrawlerListeners registered at the DataCrawler. An abstract base class
(DataCrawlerBase) provides the functionality for maintaining information
about which files have been reported in the past, allowing for incremental
scanning.
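
The listener interface might look roughly like this, reusing the DataObject
sketch above (the callback names are assumptions based on the new/changed/
removed bookkeeping just described):

    // Sketch of the listener callbacks; names are assumptions.
    public interface DataCrawlerListener {
        void objectNew(DataObject object);      // reported for the first time
        void objectChanged(DataObject object);  // changed since the last scan
        void objectRemoved(String id);          // no longer in the source
    }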

In order to create a DataObject for a single resource encountered by the
DataCrawler, a DataAccessor is used. This functionality is kept out of the
DataCrawler implementations on purpose, because several crawlers may be able
to make good use of the same data accessing functionality. A good example is
the FileSystemCrawler and the HypertextCrawler, which both make use of the
FileDataAccessor. Although they arrive at the physical resource in different
ways (by traversing folder trees vs. following links from other documents),
they can use the same functionality to turn a java.io.File into a
FileDataObject.
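
In other words, a DataAccessor amounts to something like the following
(again a sketch; the method name is an assumption):

    import java.io.IOException;

    // Sketch of a scheme-specific accessor; the signature is illustrative.
    public interface DataAccessor {
        // Turns a URL such as "file:///home/chris/report.pdf" into a
        // DataObject, regardless of how a crawler discovered that URL.
        DataObject getDataObject(String url) throws IOException;
    }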

It should be clear now that a DataCrawler is specific to the kind of
DataSource it supports, whereas a DataAccessor is specific to the URL
scheme(s) it supports.

The AccessData instance used in DataCrawlerBase maintains the information
about which objects have been scanned before. This instance is passed to the
DataAccessor, as that is the best place to do this change detection. For
example, this allows the HttpDataAccessor to use HTTP-specific functionality
to let the web server decide whether the resource has changed since the last
scan, preventing an unchanged file from being transported to the crawling
side in the first place.
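
Concretely, that HTTP-specific functionality could be a conditional GET.
A minimal sketch using java.net.HttpURLConnection (the AccessData
bookkeeping around it is omitted):

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ConditionalGetExample {
        // Returns true if the resource changed since 'lastScanTime'
        // (milliseconds). The server answers 304 Not Modified for unchanged
        // resources, so their content is never transported to our side.
        public static boolean hasChanged(String urlString, long lastScanTime)
                throws IOException {
            URL url = new URL(urlString);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setIfModifiedSince(lastScanTime);
            return conn.getResponseCode() != HttpURLConnection.HTTP_NOT_MODIFIED;
        }
    }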

*** HypertextCrawler

The HypertextCrawler makes use of two external components: a mime type
identifier and a hypertext link extractor. The latter component is required
to know which resources are linked from a specific resource and should be
crawled next. This functionality is realized as a separate component/service
as there are many document types that support links (PDF might be a nice one
to support next). A specific link extractor is thus mimetype-specific.
However, in order to know which link extractor to use, one first needs to
know the mime type of the starting resource, which is handled by the first
component.
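
The two components could be shaped roughly like this (a sketch; both
interface names and signatures are assumptions):

    import java.io.InputStream;
    import java.util.List;

    // Sketches only; the real interfaces may differ.
    interface MimeTypeIdentifier {
        // e.g. returns "text/html" for an HTML stream
        String identify(InputStream stream, String fileName);
    }

    interface LinkExtractor {
        // returns the URLs linked from the given document, telling the
        // HypertextCrawler what to crawl next; mimetype-specific
        List extractLinks(InputStream stream);
    }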

*** Email interpretation

The ImapDataAccessor is a fairly complex class that puts a lot of effort
into interpreting a mime message. Rather than just delivering the raw
InputStream of the Message, it produces a DataObject with possible child
DataObjects that reflects as closely as possible the way in which mail
readers display the mail.

For example, what may seem to be a simple mail with a few headers and a body
may in fact be a multipart mail with two alternative bodies, one in plain
text and one in HTML. What conceptually is a single "information object" is
spread over four different JavaMail objects (a MimeMessage with a Multipart
containing two BodyParts, if I remember correctly). The ImapDataAccessor
tries to hide the complexity of these multiparts and just creates a single
DataObject with headers and content.
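
The kind of unwrapping involved looks roughly like this in JavaMail (a
simplified sketch; the real ImapDataAccessor handles many more cases):

    import java.io.IOException;
    import javax.mail.BodyPart;
    import javax.mail.Message;
    import javax.mail.MessagingException;
    import javax.mail.Multipart;

    public class MailBodyExample {
        // Picks the plain text body from a multipart message, hiding the
        // multipart structure from the caller.
        public static Object getPlainTextBody(Message message)
                throws MessagingException, IOException {
            Object content = message.getContent();
            if (content instanceof Multipart) {
                Multipart multipart = (Multipart) content;
                for (int i = 0; i < multipart.getCount(); i++) {
                    BodyPart part = multipart.getBodyPart(i);
                    if (part.isMimeType("text/plain")) {
                        return part.getContent();
                    }
                }
            }
            return content; // simple, non-multipart mail
        }
    }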

It may be a good idea to adapt the other mail crawlers, such as the existing
Outlook and Mozilla mail crawlers, so that they produce javax.mail.Message
objects. We can then refactor the ImapDataAccessor so that this
Message-interpretation code lives elsewhere, making it possible to also
apply it to the Messages created by these other mail crawlers. This allows
us to reuse the mail interpretation code across these mail formats.

If these other mail crawlers have access to the raw mail content (i.e. the
message as transported through SMTP), this may be rather easy to realize, as
the functionality to parse these lines and convert them into a Message data
structure is part of JavaMail. We should see if this functionality is
publicly available in the library.
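
For what it's worth, MimeMessage has a public constructor that parses a raw
RFC 822 message stream, which suggests this is feasible. A minimal sketch:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Properties;
    import javax.mail.Session;
    import javax.mail.internet.MimeMessage;

    public class RawMailParseExample {
        // Parses a raw mail (as transported through SMTP) into a Message.
        public static MimeMessage parse(String fileName) throws Exception {
            Session session = Session.getDefaultInstance(new Properties());
            InputStream stream = new FileInputStream(fileName);
            try {
                return new MimeMessage(session, stream);
            } finally {
                stream.close();
            }
        }
    }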

*** Extractors

This API is still under discussion; that is why I shipped the older
TextExtractor implementations to DFKI.

The purpose of an Extractor is to extract all information (full text and
other) from an InputStream of a specific document. Extractors are therefore
mimetype-specific.

Todo: describe and discuss final API
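
As a starting point for that discussion, one possible shape (purely a
sketch, not an agreed-on proposal):

    import java.io.InputStream;
    import java.util.Map;

    // One possible shape for the API under discussion; purely a sketch.
    interface Extractor {
        // Extracts full text and other information from a document stream.
        // The form of the result container is itself an open question
        // (see the RDF discussion below).
        Map extract(InputStream stream);
    }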

*** OSGi

Both Aduna and DFKI are in favour of using OSGi as a way to bundle these
components. At Aduna we have followed a specific way of modelling a service,
using a factory for every implementation of a service, and a separate
registry that registers all implementations of a specific service. It is the
responsibility of the bundle activator of a service to register an instance
of a service implementation's factory with the service registry. This allows
for a very light-weight initialization of the system, provided that creation
of a factory instance is very light-weight.
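
In plain OSGi terms the pattern could look like this; ExtractorFactory and
the other names below are hypothetical illustrations, not existing classes:

    import org.osgi.framework.BundleActivator;
    import org.osgi.framework.BundleContext;

    // All names below are hypothetical illustrations of the pattern.
    interface ExtractorFactory {
        Object createExtractor(); // would return an Extractor
    }

    class PlainTextExtractorFactory implements ExtractorFactory {
        public Object createExtractor() {
            return null; // would create the actual Extractor on demand
        }
    }

    public class PlainTextExtractorActivator implements BundleActivator {
        public void start(BundleContext context) {
            // Register only a cheap factory instance; the actual service
            // object is created on demand, keeping start-up light-weight.
            context.registerService(ExtractorFactory.class.getName(),
                new PlainTextExtractorFactory(), null);
        }

        public void stop(BundleContext context) {
            // services registered by this bundle are unregistered
            // automatically when it is stopped
        }
    }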

Currently, Leo and Chris think that we should base our code only on pure
OSGi code (i.e. org.osgi.*) and not use any other utilities, such as the
dependency manager that is currently used in the Aduna code. Perhaps Herko
can tell us more about what we're in for, because we both have hardly any
experience with OSGi yet.

*** Archives

Some functionality that is still missing, but that we at Aduna would really
like to have, is support for handling archives such as zip and rar files.

The interface for doing archive extraction will probably be a mixture of
Extractor and DataSource/DataCrawler. On the one hand it will be
mimetype-specific and will operate on an InputStream (perhaps a DataObject),
just like an Extractor; on the other hand, it will deliver a stream of new
DataObjects.
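
A purely speculative sketch of such a mixture:

    import java.io.InputStream;
    import java.util.Iterator;

    // Speculative sketch; nothing here is a settled design.
    interface ArchiveExtractor {
        // Mimetype-specific and stream-based like Extractor, but it
        // produces new DataObjects like a DataCrawler, one per entry.
        Iterator extractEntries(InputStream archiveStream);
    }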

A URI scheme also has to be developed for such nested objects, so that you
can identify a stream packed inside an archive.
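
Java's own jar: URLs could serve as a precedent here; they identify an entry
inside an archive by combining the archive URL and the entry path:

    jar:file:/home/chris/archive.zip!/subdir/report.txt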

Support for zip and gzip is probably trivial, as these formats are already
accessible through java.util.zip. Rar is another format we encounter
sometimes. As far as I know there is no Java library available for it, but
it is an open format, i.e. the specs are available.
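
For instance, listing zip entries takes only a few lines with the standard
library:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class ZipListExample {
        // Lists the entries of a zip file using only java.util.zip,
        // illustrating why zip support should be straightforward.
        public static void main(String[] args) throws IOException {
            ZipInputStream zip =
                new ZipInputStream(new FileInputStream(args[0]));
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                System.out.println(entry.getName());
            }
            zip.close();
        }
    }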

*** Opening resources

Besides crawling resources, we should also be able to open them.

At first this may look like a job for the DataAccessor, which after all has
knowledge about the details of the physical source.

On second thought, I believe that for the opening of files you need some
other service, parallel to DataAccessor, that is also scheme-specific and
that takes care of opening the files. Reasons:

- DataAccessors actually retrieve the files, which is not necessary for some
file openers. For example, for opening a local file you can instruct Windows
to do just that. Similarly, a web page can be retrieved and shown by a web
browser; there is no need for us to retrieve the contents and feed them to
the browser.

- There may be several alternative ways of opening a resource. For example,
the java.net JDIC project contains functionality for opening files and web
pages, whereas we have our own classes to do that.

These may be good reasons to decouple this functionality from the
DataAccessor and run it in parallel.
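
Such an opener service could be as small as this (a sketch; the name
ResourceOpener is an assumption):

    import java.io.IOException;

    // Sketch of a scheme-specific opener, parallel to DataAccessor.
    interface ResourceOpener {
        // Opens the resource in an appropriate external application,
        // e.g. by asking the OS or a web browser, without the framework
        // retrieving the contents itself.
        void open(String url) throws IOException;
    }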

*** The use of RDF

We should discuss where and how RDF is used in this framework. In previous
email discussions we already thought about using RDF as a way to let an
Extractor output its extracted information, because of the flexibility it
provides:

- no assumptions about what the metadata looks like; it can be very simple
or very complex
- easy to store in RDF stores, with no transformation necessary (provided
that you have named graphs support)

The same technique could also be used in the DataObjects, which now use a
Map with dedicated keys, defined per DataObject type. I would be in favour
of changing this to "something RDF", as it considerably eases development.

Leo came up with an idea that allows delivering RDF while at the same time
providing a simpler interface to programmers not knowledgeable in RDF. The
idea is to create a class that implements both the org.openrdf.model.Graph
interface and the java.util.Map interface. The effect of

        result.put(authorURI, "chris");

with authorURI being equal to the URI of the author predicate, would then be
equal to

        result.add(documentURI, authorURI, "chris");

I.e., you can use the Map methods to insert simple resource-predicate-literal
statements (the majority), which is simple to document and understand,
whereas people who know what they are doing can also add arbitrary RDF
statements.
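
A highly simplified sketch of that idea, leaving out the actual Sesame
types (a real implementation would implement both org.openrdf.model.Graph
and java.util.Map; everything below is illustrative):

    import java.util.ArrayList;
    import java.util.List;

    // Simplified sketch: a container with an implicit subject, so that
    // Map-style put() calls become full RDF statements.
    class RdfMap {
        private final String subjectURI;   // e.g. the document URI
        private final List statements = new ArrayList();

        RdfMap(String subjectURI) {
            this.subjectURI = subjectURI;
        }

        // Graph-style method: add an arbitrary statement
        public void add(String subject, String predicate, Object object) {
            statements.add(new String[] {
                subject, predicate, object.toString() });
        }

        // Map-style method: the subject is implicit, so
        // put(authorURI, "chris") == add(subjectURI, authorURI, "chris")
        public Object put(Object predicateURI, Object value) {
            add(subjectURI, (String) predicateURI, value);
            return null;
        }
    }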