
Aperture Architecture

Leo: please don't edit, I'm working on this right now

DataSources and Friends

The central parts of the architecture are currently DataSource, DataCrawler, DataAccessor and DataObject. Together they are used to access the contents of an information system, such as a file system or web site.

A DataSource contains all information necessary to locate the information items in a source. For example, a FileSystemDataSource has a set of one or more directories on a file system, a set of patterns that describe what files to include or exclude, etc.
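To make this concrete, a file system source might carry configuration roughly like the following sketch (class and member names are made up, not the actual Aperture API):

import java.io.File;
import java.util.List;

// Illustrative sketch only: the real FileSystemDataSource may use different
// names and a different structure for its configuration.
class FileSystemDataSourceSketch {
    private final List<File> rootFolders;        // directories to crawl
    private final List<String> includePatterns;  // e.g. "*.pdf"
    private final List<String> excludePatterns;  // e.g. "*.tmp"

    FileSystemDataSourceSketch(List<File> rootFolders,
                               List<String> includePatterns,
                               List<String> excludePatterns) {
        this.rootFolders = rootFolders;
        this.includePatterns = includePatterns;
        this.excludePatterns = excludePatterns;
    }

    List<File> getRootFolders() { return rootFolders; }
    List<String> getIncludePatterns() { return includePatterns; }
    List<String> getExcludePatterns() { return excludePatterns; }
}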

A DataCrawler is responsible for actually accessing the physical source and reporting the individual information items as DataObjects. Each DataObject contains all metadata provided by the data source, such as file names, modification dates, etc., as well as the InputStream providing access to the physical resource.

We have chosen to distinguish between a DataSource and a DataCrawler as there may be several alternative crawling strategies for a single DataSource type. Consider for example a generic FileSystemCrawler that handles any kind of file system accessible through java.io.File versus a WindowsFileSystemCrawler using OS-native functionality to get notified about file additions, deletions and changes. Another possibility is various DataCrawler implementations that have different trade-offs in speed and accuracy.

Currently, a DataSource also contains support for writing its configuration to, or initializing it from, an XML file. We might consider putting this in a separate utility class, because the best way to store such information is often application-dependent.

A DataCrawler creates DataObjects for the individual information items it encounters in the data source. These DataObjects are reported to the DataCrawlerListeners registered at the DataCrawler. An abstract base class (DataCrawlerBase) provides functionality for maintaining information about which files have been reported in the past, allowing for incremental scanning.
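A rough guess at what this crawler/listener contract could look like (interface and method names are invented, not the actual Aperture API):

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

// Guessed shape of the contract described above; the real Aperture
// interfaces may differ in names and signatures.
interface DataObjectSketch {
    String getId();                               // identifier of the resource, e.g. its URI
    Map<String, Object> getMetadata();            // file name, modification date, ...
    InputStream getContent() throws IOException;  // the raw content stream
}

interface DataCrawlerListenerSketch {
    void objectNew(DataObjectSketch object);      // not seen in any previous crawl
    void objectChanged(DataObjectSketch object);  // seen before, but modified since
    void objectRemoved(String id);                // no longer present in the source
}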

In order to create a DataObject for a single resource encountered by the DataCrawler, a DataAccessor is used. This functionality is kept out of the DataCrawler implementations on purpose because there may be several crawlers that can make good use of the same data accessing functionality. A good example is the FileSystemCrawler and HypertextCrawler, which both make use of the FileDataAccessor. Although they arrive at the physical resource in different ways (by traversing folder trees vs. following links from other documents), they can use the same functionality to turn a java.io.File into a FileDataObject.
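A minimal sketch of what such an accessor does for a single file, using a plain map as a stand-in for a DataObject (the real FileDataAccessor/FileDataObject API will differ):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustration only: gather the metadata the file system can provide and
// expose the raw content stream, the two things a DataObject must carry.
class FileAccessorSketch {
    Map<String, Object> access(File file) throws IOException {
        Map<String, Object> dataObject = new HashMap<String, Object>();
        dataObject.put("name", file.getName());
        dataObject.put("lastModified", new java.util.Date(file.lastModified()));
        dataObject.put("size", Long.valueOf(file.length()));
        dataObject.put("content", new FileInputStream(file)); // caller must close
        return dataObject;
    }
}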

It should be clear now that a DataCrawler is specific to the kind of DataSource it supports, whereas a DataAccessor is specific to the URL scheme(s) it supports.

The AccessData instance used in DataCrawlerBase maintains the information about which objects have been scanned before. This instance is passed to the DataAccessor as this is the best class to do this detection. For example, this allows the HttpDataAccessor to use HTTP-specific functionality to let the webserver decide on whether the resource has changed since the last scan, preventing an unchanged file from being transported to the crawling side in the first place.
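As a sketch of that HTTP-specific shortcut (the class and method names are made up; only the conditional GET mechanism is standard java.net usage), with the lastCrawled timestamp coming from AccessData:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// If the server answers 304 Not Modified, the unchanged content is never
// transferred to the crawling side at all.
class HttpFreshnessCheckSketch {
    static boolean changedSince(URL url, long lastCrawled) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setIfModifiedSince(lastCrawled);
        return connection.getResponseCode() != HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}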

HypertextCrawler

The HypertextCrawler makes use of two external components: a mime type identifier and a hypertext link extractor. The latter component is required to know which resources are linked from a specific resource and should be crawled next. This functionality is realized as a separate component/service as there are many document types that support links (PDF might be a nice one to support next). A specific link extractor is thus mimetype-specific. However, in order to know which link extractor to use, one first needs to know the mime type of the starting resource, which is handled by the first component.
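A guess at the shape of these two components (interface names are illustrative, not the actual ones):

import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Two services the HypertextCrawler would combine: first identify the type
// of a fetched document, then pick a link extractor for that type.
interface MimeTypeIdentifierSketch {
    String identify(byte[] firstBytes, String resourceName);
}

interface LinkExtractorSketch {
    List<String> extractLinks(InputStream content) throws IOException;
}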

Email interpretation

The ImapDataAccessor is a fairly complex class that puts a lot of effort into interpreting a mime message. Rather than just delivering the raw InputStream of the Message, it produces a DataObject with possible child DataObjects that reflect, as closely as possible, the way in which mail readers display the mail.

For example, what may seem to be a simple mail with a few headers and a body may in fact be a multipart mail with two alternative bodies, one in plain text and one in HTML. What conceptually is a single "information object" is spread over 4 different JavaMail objects (a MimeMessage with a Multipart containing two BodyParts, if I remember correctly). The ImapDataAccessor tries to hide this complexity of multiparts and just creates a single DataObject with headers and content.
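The following simplified JavaMail walk gives an idea of the unwrapping involved; the real ImapDataAccessor handles many more cases (nested multiparts, attachments, etc.):

import java.io.IOException;
import javax.mail.BodyPart;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.Multipart;

// Picks the plain text alternative out of a multipart message, hiding the
// MimeMessage/Multipart/BodyPart structure behind a single string result.
class MailBodySketch {
    static String plainTextBody(Message message) throws MessagingException, IOException {
        Object content = message.getContent();
        if (content instanceof String) {
            return (String) content; // simple single-part mail
        }
        if (content instanceof Multipart) {
            Multipart multipart = (Multipart) content;
            for (int i = 0; i < multipart.getCount(); i++) {
                BodyPart part = multipart.getBodyPart(i);
                if (part.isMimeType("text/plain")) {
                    return (String) part.getContent();
                }
            }
        }
        return null; // no plain text alternative found
    }
}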

It may be a good idea to adapt the other mail crawlers, such as the existing Outlook and Mozilla mail crawlers, so that they produce javax.mail.Message objects. We can then refactor the ImapDataAccessor so that this Message-interpretation code lives elsewhere, making it possible to also apply it to the Messages created by these other mail crawlers. This allows us to reuse the mail interpretation code across these mail formats.

If these other mail crawlers have access to the raw mail content (i.e. the message as transported through SMTP), this may be rather easy to realize, as the functionality to parse these lines and convert them into a Message data structure is part of JavaMail. We should see if this functionality is publicly available in the library.
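It appears JavaMail does expose this: MimeMessage has a public constructor that parses an RFC 822 message from an InputStream, roughly as follows:

import java.io.InputStream;
import java.util.Properties;
import javax.mail.MessagingException;
import javax.mail.Session;
import javax.mail.internet.MimeMessage;

// Parse a raw (SMTP-transported) message into a JavaMail Message.
class RawMailParsingSketch {
    static MimeMessage parse(InputStream rawMessage) throws MessagingException {
        Session session = Session.getDefaultInstance(new Properties());
        return new MimeMessage(session, rawMessage);
    }
}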

Extractors

This API is still under discussion, which is why I shipped the older TextExtractor implementations to DFKI.

The purpose of Extractor is to extract all information (full text and other) from an InputStream of a specific document. Extractors are therefore mimetype-specific.
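Since the API is still under discussion, the following is only a guess at its general shape:

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

// Guessed shape only: mimetype-specific, reads the document stream and
// returns the full text plus whatever other metadata it can find.
interface ExtractorSketch {
    String[] getSupportedMimeTypes();   // e.g. { "application/pdf" }
    Map<String, Object> extract(InputStream document) throws IOException;
}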

Todo: describe and discuss final API

OSGi

Both Aduna and DFKI are in favour of using OSGi as a way to bundle these components. At Aduna we have followed a specific way of modelling a service, using a factory for every implementation of a service, and a separate registry that registers all implementations of a specific service. It is the responsibility of the bundle activator of a service to register an instance of a service implementation's factory with the service registry. This allows for a very light-weight initialization of the system, provided that creation of a factory instance is very light-weight.
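A minimal sketch of this pattern using only plain OSGi API (the factory service interface here is hypothetical):

import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;

// Hypothetical factory interface for one service implementation; creating
// the factory is cheap, the actual extractor is only instantiated on demand.
interface ExtractorFactorySketch {
    Object createExtractor();
}

// The bundle activator only registers the factory, which keeps activation
// of the whole system light-weight as described above.
public class PlainTextExtractorActivator implements BundleActivator {

    public void start(BundleContext context) {
        ExtractorFactorySketch factory = new ExtractorFactorySketch() {
            public Object createExtractor() {
                return new Object(); // placeholder for a real PlainTextExtractor
            }
        };
        context.registerService(ExtractorFactorySketch.class.getName(), factory, null);
    }

    public void stop(BundleContext context) {
        // services registered by this bundle are unregistered automatically
    }
}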

Currently, Leo and Chris think that we should base our code only on pure OSGi code (i.e. org.osgi.*) and not use any other utilities, such as the dependency manager that is currently used in the Aduna code. Perhaps Herko can tell us more about what we're in for, because we both have hardly any experience with OSGi yet.

Archives

Some functionality that is still missing but that we at Aduna would really like to have is support for handling archives such as zip and rar files.

The interface for doing archive extraction will probably be a mixture of Extractor and DataSource/DataCrawler. On the one hand they will be mimetype-specific and will operate on an InputStream (perhaps a DataObject), just like Extractor; on the other hand they deliver a stream of new DataObjects.

A URI scheme also has to be developed for such nested objects, so that you can identify a stream packed inside an archive.

Support for zip and gzip is probably trivial as these formats are already accessible through java.util.zip. Rar is another format we encounter sometimes. As far as I know there is no Java library available for it, but it is an open format, i.e. the specs are available.
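To illustrate how little is needed for zip, and one possible (entirely made-up) URI scheme for nested entries such as zip:file:/home/chris/archive.zip!/docs/report.txt:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Enumerates the entries of a zip file; each entry would become a new
// DataObject identified by a nested URI.
class ZipCrawlSketch {
    static void listEntries(String zipPath) throws IOException {
        ZipInputStream zip = new ZipInputStream(new FileInputStream(zipPath));
        try {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // hypothetical nested URI scheme, similar to jar: URLs
                    System.out.println("zip:file:" + zipPath + "!/" + entry.getName());
                }
            }
        } finally {
            zip.close();
        }
    }
}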

Opening resources

Besides crawling resources, we should also be able to open them.

At first this may look like a job for the DataAccessor, which after all has knowledge about the details of the physical source.

On second thought, I believe that for the opening of files you need some other service, parallel to DataAccessor, that is also scheme-specific and that takes care of opening the files. Reasons:

  • DataAccessors actually retrieve the files, which is not necessary for some file openers. For example, for opening a local file you can instruct Windows to do just that. Similarly, a web page can be retrieved and shown by a web browser; there is no need for us to retrieve the contents and feed it to the browser.

  • There may be several alternative ways of opening a resource. For example, the java.net JDIC project contains functionality for opening files and web pages, whereas we have our own classes to do that.

This may be a good reason to decouple this functionality from the DataAccessor and run it in parallel.
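A guess at what such a parallel, scheme-specific service could look like (names are illustrative):

import java.io.IOException;

// Unlike a DataAccessor, an opener never has to deliver the content itself;
// it only hands the resource off to a native application.
interface ResourceOpenerSketch {
    String[] getSupportedSchemes();      // e.g. { "file", "http" }
    void open(String url) throws IOException;
}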

The use of RDF

We should discuss where and how RDF is used in this framework. In previous email discussions we already thought about using RDF as a way to let an Extractor output its extracted information, because of the flexibility it provides:

  • no assumption on what the metadata looks like, it can be very simple or very complex

  • easy to store in RDF stores, no transformation necessary (provided that you have named graphs support)

The same technique could also be used in the DataObjects, which now use a Map with dedicated keys, defined per DataObject type. I would be in favour of changing this to "something RDF", as it considerably eases development.

Leo came up with an idea that allows delivering RDF while at the same time providing a simpler interface to programmers not knowledgeable in RDF. The idea is to create a class that implements both the org.openrdf.model.Graph interface as well as the java.util.Map interface. The effect of

result.put(authorURI, "chris");

with the authorURI being equal to the URI of the author predicate, would then be equal to

result.add(documentURI, authorURI, "chris");

I.e., you can use the Map methods to insert simple resource-predicate-literal statements (the majority), which is simple to document and understand, whereas people who know what they are doing can also add arbitrary RDF statements.
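A sketch of this hybrid idea, with plain Strings standing in for the org.openrdf URI and value types so the example stays self-contained (the real class would implement both org.openrdf.model.Graph and java.util.Map):

import java.util.ArrayList;
import java.util.List;

// Every Map-style put(predicate, value) is turned into a statement about one
// fixed subject (the document), while add() accepts arbitrary triples.
class SubjectCenteredGraphSketch {
    private final String subjectUri;                    // e.g. the document URI
    private final List<String[]> statements = new ArrayList<String[]>();

    SubjectCenteredGraphSketch(String subjectUri) {
        this.subjectUri = subjectUri;
    }

    // Map-style: result.put(authorURI, "chris")
    void put(String predicateUri, String value) {
        add(subjectUri, predicateUri, value);
    }

    // Graph-style: result.add(documentURI, authorURI, "chris")
    void add(String subject, String predicate, String object) {
        statements.add(new String[] { subject, predicate, object });
    }
}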