*** DataSources

The central parts of the architecture are currently DataSource, DataCrawler,
DataAccessor and DataObject. Together they are used to access the contents of
an information system, such as a file system or web site.

A DataSource contains all information necessary to locate the information
items in a source. For example, a FileSystemDataSource has a set of one or
more directories on a file system, a set of patterns that describe what files
to include or exclude, etc.

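A minimal sketch of what this could look like (field names are illustrative,
not the actual API):

    import java.util.List;

    // Sketch only: a DataSource holds the information needed to locate
    // items, but does no crawling itself.
    public class FileSystemDataSource {
        private List folders;         // directories to scan
        private List includePatterns; // e.g. "*.doc"
        private List excludePatterns; // e.g. "*.tmp"

        // getters and setters omitted
    }
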
A DataCrawler is responsible for actually accessing the physical source and
reporting the individual information items as DataObjects. Each DataObject
contains all metadata provided by the data source, such as file names,
modification dates, etc., as well as the InputStream that provides access to
the physical resource.

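A DataObject could then be as simple as the following sketch (the method
names are assumptions, not the actual interface):

    import java.io.InputStream;
    import java.util.Map;

    // Sketch only: source-provided metadata plus a stream for the content.
    public interface DataObject {
        String getId();           // identifies the resource, e.g. by URL
        Map getMetadata();        // file names, modification dates, etc.
        InputStream getContent(); // access to the physical resource
    }
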
We have chosen to distinguish between a DataSource and a DataCrawler because
there may be several alternative crawling strategies for a single DataSource
type. Consider for example a generic FileSystemCrawler that handles any kind
of file system accessible through java.io.File versus a
WindowsFileSystemCrawler that uses OS-native functionality to get notified
about file additions, deletions and changes. Another possibility is a set of
DataCrawler implementations with different trade-offs between speed and
accuracy.

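In code, the separation could look like this (hypothetical names and
signatures; a real design would have a common DataSource supertype):

    // Sketch only: the same interface can be implemented by a generic
    // FileSystemCrawler (java.io.File traversal) as well as by a
    // WindowsFileSystemCrawler (OS-native change notification).
    public interface DataCrawler {
        void setDataSource(FileSystemDataSource source); // hypothetical
        void crawl(); // report encountered items as DataObjects
    }
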
Currently, a DataSource also contains support for writing its configuration
to, or initializing it from, an XML file. We might consider putting this in a
separate utility class, because the best way to store such information is
often application dependent.

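If we do split it off, such a utility could be as small as this sketch
(entirely hypothetical):

    import java.io.File;

    // Sketch only: configuration (de)serialization kept outside of
    // DataSource, since the best storage format is application dependent.
    public class DataSourceConfigIO {

        public static void write(FileSystemDataSource source, File xmlFile) {
            // serialize the folders and patterns to XML
        }

        public static FileSystemDataSource read(File xmlFile) {
            return null; // parse the XML and reconstruct the DataSource
        }
    }
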
A DataCrawler creates DataObjects for the individual information items it
encounters in the data source. These DataObjects are reported to the
DataCrawlerListeners registered with the DataCrawler. An abstract base class
(DataCrawlerBase) provides base functionality for maintaining information
about which files have been reported in the past, allowing for incremental
scanning.

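The callback interface might look like this (the method names are guesses,
chosen to show how incremental scanning could surface to listeners):

    // Sketch only: a DataCrawler pushes DataObjects to its listeners.
    public interface DataCrawlerListener {
        void objectNew(DataObject object);     // not reported before
        void objectChanged(DataObject object); // reported before, modified
        void objectRemoved(String id);         // reported before, now gone
    }

A listener would then be attached with something like
crawler.addDataCrawlerListener(listener).
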
In order to create a DataObject for a single resource encountered by the
DataCrawler, a DataAccessor is used. This functionality is deliberately kept
out of the DataCrawler implementations because several crawlers may be able
to make good use of the same data accessing functionality. A good example is
the FileSystemCrawler and the HypertextCrawler, which both make use of the
FileDataAccessor. Although they arrive at the physical resource in different
ways (by traversing folder trees vs. following links from other documents),
they can use the same functionality to turn a java.io.File into a
FileDataObject.

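A minimal DataAccessor interface in this spirit (hypothetical signatures):

    import java.io.IOException;

    // Sketch only: turns a single resource identifier into a DataObject,
    // independent of how a crawler arrived at that identifier.
    public interface DataAccessor {
        String getSupportedScheme(); // e.g. "file" or "http"

        // the real method also receives the AccessData discussed below
        DataObject getDataObject(String url) throws IOException;
    }
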
It should now be clear that a DataCrawler is specific to the kind of
DataSource it supports, whereas a DataAccessor is specific to the URL
scheme(s) it supports.

The AccessData instance used in DataCrawlerBase maintains the information
about which objects have been scanned before. This instance is passed to the
DataAccessor, as that is the class best suited to perform this detection. For
example, this allows the HttpDataAccessor to use HTTP-specific functionality
to let the web server decide whether the resource has changed since the last
scan, preventing an unchanged file from being transported to the crawling
side in the first place.

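As an illustration of that HTTP-specific functionality (not the actual
HttpDataAccessor code), a conditional GET with plain java.net looks like
this:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    class ConditionalGetExample {

        // lastScan would be the timestamp recorded in AccessData during
        // the previous crawl
        static boolean hasChanged(String url, long lastScan)
                throws IOException {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
            conn.setIfModifiedSince(lastScan); // If-Modified-Since header
            // 304 Not Modified: the server decided that nothing changed,
            // so no content is transported to the crawling side at all
            return conn.getResponseCode()
                != HttpURLConnection.HTTP_NOT_MODIFIED;
        }
    }
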
*** HypertextCrawler

The HypertextCrawler makes use of two external components: a mime type
identifier and a hypertext link extractor. The latter component is required
to determine which resources are linked from a specific resource and should
be crawled next. This functionality is realized as a separate
component/service because there are many document types that support links
(PDF might be a nice one to support next); a link extractor is thus
mimetype-specific. However, in order to know which link extractor to use, one
first needs to know the mime type of the starting resource, which is handled
by the first component.

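The two components could be captured in interfaces along these lines (the
names are made up for illustration):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.List;

    // Sketch only: first determine the mime type of the resource...
    interface MimeTypeIdentifier {
        String identify(InputStream document, String fileName)
            throws IOException;
    }

    // ...then select the matching, mimetype-specific link extractor.
    interface LinkExtractor {
        String getSupportedMimeType(); // e.g. "text/html"

        // the URLs linked from this document, to be crawled next
        List extractLinks(InputStream document) throws IOException;
    }
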
*** Email interpretation

The ImapDataAccessor is a fairly complex class that puts a lot of effort into
interpreting a mime message. Rather than just delivering the raw InputStream
of the Message, it produces a DataObject, possibly with child DataObjects,
that reflects as closely as possible the way in which mail readers display
the mail.

For example, what may seem to be a simple mail with a few headers and a body
may in fact be a multipart mail with two alternative bodies, one in plain
text and one in HTML. What conceptually is a single "information object" is
spread over four different JavaMail objects (a MimeMessage with a Multipart
containing two BodyParts, if I remember correctly). The ImapDataAccessor
tries to hide this multipart complexity and just creates a single DataObject
with headers and content.

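A simplified sketch of the kind of unfolding involved, using plain JavaMail
(the real ImapDataAccessor does considerably more):

    import javax.mail.Multipart;
    import javax.mail.Part;

    class MailBodyExample {

        // pick a displayable body from a possibly multipart message
        static String pickBody(Part part) throws Exception {
            if (part.isMimeType("text/plain")
                    || part.isMimeType("text/html")) {
                return (String) part.getContent();
            }
            if (part.isMimeType("multipart/alternative")) {
                Multipart mp = (Multipart) part.getContent();
                // by convention the last alternative is the preferred one
                for (int i = mp.getCount() - 1; i >= 0; i--) {
                    String body = pickBody(mp.getBodyPart(i));
                    if (body != null) {
                        return body;
                    }
                }
            }
            return null;
        }
    }
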
It may be a good idea to adapt the other mail crawlers, such as the existing
Outlook and Mozilla mail crawlers, so that they produce javax.mail.Message
objects. We can then refactor the ImapDataAccessor so that this
Message-interpretation code lives elsewhere, making it possible to also apply
it to the Messages created by these other mail crawlers. This allows us to
reuse the mail interpretation code across these mail formats.

If these other mail crawlers have access to the raw mail content (i.e. the
message as transported through SMTP), this may be rather easy to realize, as
the functionality to parse these lines and convert them into a Message data
structure is part of JavaMail. We should see if this functionality is
publicly available in the library.

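For reference, the relevant entry point appears to be public: MimeMessage has
a constructor that parses a raw RFC 822 message stream. A minimal sketch:

    import java.io.InputStream;
    import java.util.Properties;
    import javax.mail.Session;
    import javax.mail.internet.MimeMessage;

    class RawMailExample {

        // parse a raw message (as transported through SMTP) into a Message
        static MimeMessage parse(InputStream rawMail) throws Exception {
            Session session = Session.getDefaultInstance(new Properties());
            return new MimeMessage(session, rawMail);
        }
    }
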
*** Extractors

This API is still under discussion; that's why I shipped the older
TextExtractor implementations to DFKI.

The purpose of Extractor is to extract all information (full text and other)
from an InputStream of a specific document. Extractors are therefore
mimetype-specific.

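Purely as a strawman for that discussion, the interface could be as small as
this:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Map;

    // Sketch only: mimetype-specific extraction of full text and other
    // information from a document stream.
    interface Extractor {
        String getSupportedMimeType(); // e.g. "application/pdf"

        // the result container is itself under discussion; a Map or
        // "something RDF" (see below) are candidates
        void extract(InputStream document, Map results) throws IOException;
    }
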
Todo: describe and discuss final API

*** OSGi

Both Aduna and DFKI are in favour of using OSGi as a way to bundle these
components. At Aduna we have followed a specific way of modelling a service,
using a factory for every implementation of a service and a separate registry
that registers all implementations of a specific service. It is the
responsibility of the bundle activator of a service to register an instance
of a service implementation's factory with the service registry. This allows
for a very light-weight initialization of the system, provided that creating
a factory instance is itself very light-weight.

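A sketch of that pattern, using only org.osgi.framework classes
(DataCrawlerFactory and FileSystemCrawlerFactory are hypothetical names):

    import org.osgi.framework.BundleActivator;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.ServiceRegistration;

    public class FileSystemCrawlerActivator implements BundleActivator {

        private ServiceRegistration registration;

        public void start(BundleContext context) {
            // register only the factory, which is cheap to create; the
            // actual crawler is instantiated later, on demand
            registration = context.registerService(
                DataCrawlerFactory.class.getName(),
                new FileSystemCrawlerFactory(), null);
        }

        public void stop(BundleContext context) {
            registration.unregister();
        }
    }
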
Currently, Leo and Chris think that we should base our code only on pure OSGi
code (i.e. org.osgi.*) and not use any other utilities such as the dependency
manager that is currently used in the Aduna code. Perhaps Herko can tell us
more about what we're in for, because we both have hardly any experience with
OSGi yet.

*** Archives

Some functionality that is still missing, but that we at Aduna would really
like to have, is support for handling archives such as zip and rar files.

The interface for doing archive extraction will probably be a mixture of
Extractor and DataSource/DataCrawler. On the one hand, archive extractors
will be mimetype-specific and will operate on an InputStream (perhaps a
DataObject), just like an Extractor; on the other hand, they deliver a stream
of new DataObjects.

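One possible shape for that mixture (illustrative only):

    import java.io.InputStream;
    import java.util.Iterator;

    // Sketch only: selected by mime type and consuming a stream, like an
    // Extractor, but delivering new DataObjects, like a DataCrawler.
    interface ArchiveExtractor {
        String getSupportedMimeType(); // e.g. "application/zip"

        // the entries packed inside the archive, delivered as DataObjects
        Iterator extractEntries(InputStream archive);
    }
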
A URI scheme also has to be developed for such nested objects, so that you
can identify a stream packed inside an archive.

Support for zip and gzip is probably trivial, as these formats are already
accessible through java.util.zip. Rar is another format we encounter
sometimes. As far as I know there is no Java library available for it, but it
is an open format, i.e. the specs are available.

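For the zip case, java.util.zip indeed makes enumeration nearly trivial. The
sketch below also shows one conceivable nested-URI notation, borrowed from
Java's own "jar:<url>!/<entry>" convention; the notation is only an
illustration, not a decision:

    import java.io.FileInputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    class ZipExample {

        public static void main(String[] args) throws Exception {
            ZipInputStream zip =
                new ZipInputStream(new FileInputStream(args[0]));
            for (ZipEntry e = zip.getNextEntry(); e != null;
                    e = zip.getNextEntry()) {
                // e.g. zip:file:/data/docs.zip!/reports/january.txt
                System.out.println(
                    "zip:file:" + args[0] + "!/" + e.getName());
            }
            zip.close();
        }
    }
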
*** Opening resources

Besides crawling resources, we should also be able to open them.

At first this may look like a job for the DataAccessor, which after all has
knowledge about the details of the physical source.

On second thought, I believe that for opening files you need some other
service, parallel to DataAccessor, that is also scheme-specific and that
takes care of opening the files. Reasons:

- DataAccessors actually retrieve the files, which is not necessary for some
file openers. For example, for opening a local file you can instruct Windows
to do just that. Similarly, a web page can be retrieved and shown by a web
browser; there is no need for us to retrieve the contents and feed them to
the browser.

- There may be several alternative ways of opening a resource. For example,
the java.net JDIC project contains functionality for opening files and web
pages, whereas we have our own classes to do that.

This may be a good reason to decouple this functionality from the
DataAccessor and run it in parallel.

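Such a service could be as small as this sketch (the name DataOpener is made
up):

    import java.io.IOException;
    import java.net.URI;

    // Sketch only: scheme-specific like a DataAccessor, but it opens the
    // resource instead of retrieving it; an implementation might simply
    // delegate to the OS or to JDIC.
    interface DataOpener {
        String getSupportedScheme(); // e.g. "file" or "http"

        void open(URI uri) throws IOException;
    }
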
*** The use of RDF

We should discuss where and how RDF is used in this framework. In previous
email discussions we already thought about using RDF as a way to let an
Extractor output its extracted information, because of the flexibility it
provides:

- no assumptions about what the metadata looks like; it can be very simple
or very complex
- easy to store in RDF stores, no transformation necessary (provided that
you have named graphs support)

The same technique could also be used in the DataObjects, which now use a Map
with dedicated keys, defined per DataObject type. I would be in favour of
changing this to "something RDF", as it considerably eases development.

Leo came up with an idea that allows delivering RDF while at the same time
providing a simpler interface for programmers not knowledgeable in RDF. The
idea is to create a class that implements both the org.openrdf.model.Graph
interface and the java.util.Map interface. The effect of

    result.put(authorURI, "chris");

with authorURI being the URI of the author predicate, would then be
equivalent to

    result.add(documentURI, authorURI, "chris");

I.e., you can use the Map methods to insert simple resource-predicate-literal
statements (the majority), which is simple to document and understand,
whereas people who know what they are doing can also add arbitrary RDF
statements.

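A minimal sketch of that idea, reduced to a put method; a real version would
implement all of java.util.Map and org.openrdf.model.Graph:

    import java.util.ArrayList;
    import java.util.List;

    class MetadataGraph {

        private final String documentURI; // subject of all put() calls
        private final List statements = new ArrayList();

        MetadataGraph(String documentURI) {
            this.documentURI = documentURI;
        }

        // Map-style convenience for resource-predicate-literal statements
        public Object put(Object predicateURI, Object literal) {
            add(documentURI, (String) predicateURI, literal);
            return null; // a real Map.put would return the previous value
        }

        // Graph-style method for arbitrary statements
        public void add(String subject, String predicate, Object object) {
            statements.add(new Object[] { subject, predicate, object });
        }
    }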