
Version 9 (modified by sauermann, 18 years ago)


Open Issues in Aperture Extractors

Some pieces of code (such as MP3 extraction) do not work on InputStreams, only on files.

There are different approaches to solve that:

ideaA: rewrite all file-based extractors to use InputStream

Somebody writes new Extractors that implement the InputStream-based extraction interface.

  • issue: do these have to be written completely from scratch?
    • idea: have somebody else write them.
  • pro: they would probably perform better than the existing ones and have less overhead
  • con: they have to be written from scratch

ideaB: add a new method to Extractor, passing in the file as argument

This is the existing method: {{{extract(URI id, InputStream stream, Charset charset, String mimeType, RDFContainer result)}}}

We could add a new one to the Extractor interface: {{{extract(URI id, File file, Charset charset, String mimeType, RDFContainer result)}}}

  • pro: no new interface
  • issue: looking at the interface, it is not clear which method to call and which one is actually implemented. Should a caller first try the InputStream method and see whether it fails?
  • issue: this depends on ideaC
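
The "which method do I call?" issue can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `DualExtractor`, `FileOnlyExtractor` and `DualExtractorCaller` are hypothetical names, `java.net.URI` stands in for the URI type used by Aperture, and a `List<String>` stands in for `RDFContainer`. The sketch shows that a file-only extractor can only reject the stream variant at run time, so the caller has to try and fall back.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.List;

// Hypothetical two-method interface as proposed in ideaB.
// List<String> is a stand-in for RDFContainer.
interface DualExtractor {
    void extract(URI id, InputStream stream, Charset charset,
                 String mimeType, List<String> result) throws IOException;

    void extract(URI id, File file, Charset charset,
                 String mimeType, List<String> result) throws IOException;
}

// A file-only extractor: the interface gives it no way to say so up front.
class FileOnlyExtractor implements DualExtractor {
    public void extract(URI id, InputStream stream, Charset charset,
                        String mimeType, List<String> result) {
        throw new UnsupportedOperationException("stream extraction not supported");
    }

    public void extract(URI id, File file, Charset charset,
                        String mimeType, List<String> result) {
        result.add(id + " size=" + file.length());
    }
}

class DualExtractorCaller {
    // Tries the stream variant first and falls back to the file variant.
    // Returns which variant did the work: "stream" or "file".
    static String extract(DualExtractor extractor, URI id, File file, Charset charset,
                          String mimeType, List<String> result) throws IOException {
        try (InputStream in = Files.newInputStream(file.toPath())) {
            extractor.extract(id, in, charset, mimeType, result);
            return "stream";
        } catch (UnsupportedOperationException e) {
            extractor.extract(id, file, charset, mimeType, result);
            return "file";
        }
    }
}
```

The try-and-fall-back in `DualExtractorCaller` is exactly the guesswork the issue above describes; ideaB1 avoids it by making the kind of extractor visible in its type.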

ideaB1: create a new interface FileExtractor, passing in the file as argument

Create a new interface FileExtractor that declares only one method. Document that this interface should only be used when no InputStream-based extraction library is available, and that a FileExtractor is the second choice compared to the normal Extractor. {{{extract(URI id, File file, Charset charset, String mimeType, RDFContainer result)}}}

  • pro: developers can determine which kind of Extractor they face and which method to call
  • con: we need a new registry for FileExtractors
  • issue: this depends on ideaC
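
A rough sketch of how ideaB1 could look, including the second registry named in the "con" above. The names `StreamExtractor`, `SketchContainer` and `ExtractorLookup` are hypothetical, `java.net.URI` stands in for the URI type used by Aperture, and `SketchContainer` stands in for `RDFContainer`. The lookup prefers the stream-based extractor and only reports a FileExtractor when nothing else is registered for the MIME type.

```java
import java.io.File;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for RDFContainer; it only collects strings.
class SketchContainer {
    final List<String> statements = new ArrayList<>();
}

// Stand-in for the existing, stream-based Extractor interface.
interface StreamExtractor {
    void extract(URI id, InputStream stream, Charset charset,
                 String mimeType, SketchContainer result) throws Exception;
}

// The new interface from ideaB1: one method, taking a File.
interface FileExtractor {
    void extract(URI id, File file, Charset charset,
                 String mimeType, SketchContainer result) throws Exception;
}

// Two registries side by side (assumed design, keyed by MIME type).
class ExtractorLookup {
    enum Kind { STREAM, FILE, NONE }

    private final Map<String, StreamExtractor> streamExtractors = new HashMap<>();
    private final Map<String, FileExtractor> fileExtractors = new HashMap<>();

    void registerStream(String mimeType, StreamExtractor extractor) {
        streamExtractors.put(mimeType, extractor);
    }

    void registerFile(String mimeType, FileExtractor extractor) {
        fileExtractors.put(mimeType, extractor);
    }

    // Stream-based extraction is preferred; FileExtractor is the fallback.
    Kind kindFor(String mimeType) {
        if (streamExtractors.containsKey(mimeType)) return Kind.STREAM;
        if (fileExtractors.containsKey(mimeType)) return Kind.FILE;
        return Kind.NONE;
    }
}
```

With this split, a caller knows from `kindFor` whether it needs an InputStream or has to ask the data object for a file (see ideaC).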

ideaC: Add a new method getFile() to FileDataObject

Add a new method getFile(), returning a file, to FileDataObject. This is easily implemented on file-based data objects (crawling the local file system). For remote FileDataObjects, the method will be implemented by buffering the InputStream in a local file. ideaB and ideaB1 depend on this getFile() method.

  • pro: optimizes the implementation for the file system crawler
  • issue: in some setups (crawling remote MP3s), there will only be FileExtractors, and everything will be buffered on the local hard disk
    • idea: this is not much of an issue; the benefit to the end user of having more data outweighs it
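
The two getFile() cases described above could be sketched as follows. This is an illustration under assumptions, not Aperture's implementation: `SketchFileDataObject`, `LocalFileDataObject` and `BufferedRemoteDataObject` are hypothetical names, and buffering into a temporary file via `java.nio.file.Files` is just one possible strategy.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical, reduced view of FileDataObject with the proposed getFile().
interface SketchFileDataObject {
    InputStream getContent() throws IOException;
    File getFile() throws IOException;
}

// Local case: the file already exists, so getFile() is trivial.
class LocalFileDataObject implements SketchFileDataObject {
    private final File file;

    LocalFileDataObject(File file) {
        this.file = file;
    }

    public InputStream getContent() throws IOException {
        return Files.newInputStream(file.toPath());
    }

    public File getFile() {
        return file;
    }
}

// Remote case: the stream is copied to a temporary file on first request.
class BufferedRemoteDataObject implements SketchFileDataObject {
    private final InputStream content;
    private File buffered; // null until getFile() has been called

    BufferedRemoteDataObject(InputStream content) {
        this.content = content;
    }

    public synchronized InputStream getContent() throws IOException {
        // After buffering, the original stream is consumed, so read the copy.
        return buffered != null ? Files.newInputStream(buffered.toPath()) : content;
    }

    public synchronized File getFile() throws IOException {
        if (buffered == null) {
            Path tmp = Files.createTempFile("aperture-buffer-", ".tmp");
            try (InputStream in = content) {
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            }
            tmp.toFile().deleteOnExit();
            buffered = tmp.toFile();
        }
        return buffered;
    }
}
```

The temporary file is created lazily, so data objects that are only ever handled by stream-based extractors never touch the disk.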

Should throw a "NotFoundException" if the element does not exist. This serves as a fallback: if the element was moved, the calling GUI that uses DataOpener can then search for the element's new location and suggest it. The new URI could then work better.

This blocks:

Add rdfs:label to everything

Make sure that every resource has an rdfs:label, in addition to dc:title, etc.

Relate DataOpeners to DataSources

In the end, DataOpeners are tightly bound to DataSources, not to URI schemes. The method DataOpenerRegistry.get(String urischeme) is not good, because the same URI scheme "gnowsis://" is used quite often: for Outlook, Thunderbird, and some other sources.

Still, opening by URI scheme is useful when I have a resource at hand for which I know only the URI (and not the originating DataSource), so I would keep it as a fallback.


  • Add a new method getDataOpener to DataSource, which returns an instance of DataOpener (or uses the DataOpenerRegistry internally when the DataSource does not define its own DataOpener).
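
A minimal sketch of this lookup order, under stated assumptions: the `Sketch*` types are hypothetical reductions of DataOpener, DataOpenerRegistry and DataSource, `open` returns a String only so the result can be inspected, and the `URI` parameter on `getDataOpener` is an assumption added so that the registry fallback has a scheme to look up.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, reduced DataOpener; returns a String for demonstration only.
interface SketchDataOpener {
    String open(URI uri);
}

// Scheme-based registry, kept as the fallback.
class SketchDataOpenerRegistry {
    private final Map<String, SketchDataOpener> byScheme = new HashMap<>();

    void add(String scheme, SketchDataOpener opener) {
        byScheme.put(scheme, opener);
    }

    SketchDataOpener get(String scheme) {
        return byScheme.get(scheme);
    }
}

// A DataSource that may or may not define its own DataOpener.
class SketchDataSource {
    private final SketchDataOpener ownOpener; // may be null
    private final SketchDataOpenerRegistry registry;

    SketchDataSource(SketchDataOpener ownOpener, SketchDataOpenerRegistry registry) {
        this.ownOpener = ownOpener;
        this.registry = registry;
    }

    // Prefer the source-specific opener; otherwise fall back to the URI scheme.
    SketchDataOpener getDataOpener(URI uri) {
        if (ownOpener != null) {
            return ownOpener;
        }
        return registry.get(uri.getScheme());
    }
}
```

Two data sources that both produce "gnowsis://" URIs can then return different openers, which a lookup by scheme alone cannot do.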

This blocks:

Use reusable web GUIs to configure datasources, reusable crawlers, reusable registry

Gnowsis, Autofocus and possibly the Aduna Metadata Server all need GUIs to configure datasources (restrictions, passwords, setting up and enabling datasources).

  • we could say that datasource configuration is done using servlets; then it is easier to reuse
  • we could use the same crawler / registration classes in Nepomuk, Autofocus, and AMS (what ApertureCrawler and ApertureDataSourceRegistry are in gnowsis)

Vocabulary: use DC instead of data

We are using many properties of Dublin Core, but redefining them. For compatibility's sake, we should include the real DC vocabularies right from the start, and not use our own URIs.


Remove the "init" target; everything in there can be top-level.

  • simplifies the Ant file.

old aperture pages