Version 22 (modified by mylka, 16 years ago)


Open Issues in Aperture

Most of these issues should ideally be fixed before ISWC2006 (5 November 2006).

CrawlerHandler vs. RDFContainerFactory

These interfaces shouldn't be mixed together. I suggest removing the getRDFContainerFactory method from the CrawlerHandler interface. The RDFContainerFactory should be set on the crawler before the crawling process begins, not requested from the CrawlerHandler.

Also, all examples should be rewritten to use a simple RDFContainerFactory that creates a new RDFContainer, backed by a fresh and empty Repository/Model, on every invocation of the getRDFContainer method. The data would then have to be transferred explicitly from this temporary container to the backing model/repository. This would be more elegant and more instructive as far as the intended usage is concerned: the user could 'see' the flow of data in the source code. If we have a Repository in the CrawlerHandler and generate RDFContainers backed by this repository, it may be non-obvious where the data in the repository actually comes from.
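The idea of a fresh container per call plus an explicit transfer could look roughly like the sketch below. It does not use the real Aperture interfaces: Triple, SimpleContainer, FreshContainerFactory, StoringHandler and handleNewObject are hypothetical stand-ins, chosen only to make the flow of data visible.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: simplified stand-ins, not the real Aperture interfaces. */
class FreshContainerSketch {

    /** Hypothetical minimal triple; real code would use the RDF library's statement type. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Stand-in for RDFContainer: collects statements about one resource. */
    static class SimpleContainer {
        private final String describedUri;
        private final List<Triple> triples = new ArrayList<>();

        SimpleContainer(String describedUri) {
            this.describedUri = describedUri;
        }

        void put(String predicate, String value) {
            triples.add(new Triple(describedUri, predicate, value));
        }

        List<Triple> getTriples() {
            return new ArrayList<>(triples); // defensive copy
        }
    }

    /** Stand-in for RDFContainerFactory: every call returns a fresh, empty container. */
    static class FreshContainerFactory {
        SimpleContainer getRDFContainer(String uri) {
            return new SimpleContainer(uri);
        }
    }

    /** Stand-in for a CrawlerHandler that owns the backing store. */
    static class StoringHandler {
        final List<Triple> backingStore = new ArrayList<>();

        void handleNewObject(SimpleContainer metadata) {
            // The data flow is visible here: temporary container -> backing store.
            backingStore.addAll(metadata.getTriples());
        }
    }

    public static void main(String[] args) {
        FreshContainerFactory factory = new FreshContainerFactory();
        StoringHandler handler = new StoringHandler();

        SimpleContainer metadata = factory.getRDFContainer("file:/tmp/a.txt");
        metadata.put("dc:title", "a.txt");
        handler.handleNewObject(metadata);

        System.out.println("statements stored: " + handler.backingStore.size());
    }
}
```

The handler never hands its store to the factory, so the only way data reaches the store is the explicit addAll call.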

RDFContainer based on RDF2GO

Casting an RDFContainer to its Sesame 2 implementation just to read values is unacceptable.

We should definitely move to RDF2GO, and RDF2GO will be improved to fit us. Alternatively we use ONLY sesame2.

A problem that Heinz Kirchmann found when programming was that the downcast from RDFContainer to SesameRDFContainer was not obvious. The CrawlerHandler in Gnowsis is separated from the actual processing of the data. We used RDFContainer as the interface inside gnowsis, in many methods. We should have used SesameRDFContainer everywhere inside gnowsis.

If sesame is the right framework for gnowsis, then passing a SesameRDFContainer would have been right. But if we remove sesame and use kowari or something else, using RDF2GO may be better.

Also, we never used an RDFContainer other than the Sesame2 one. So the obvious choice may be to cast everything and define everything as SesameRDFContainer, BUT this will cause trouble for projects that are based on Sesame1. If somebody who is based on Sesame1 wants to use Aperture, then using OpenRDF may be good.

In the long run, to have a generic API, using RDF2GO as the interface and something like sesame2 as the default implementation is better.

The RDFContainer interface did not include all the methods that we needed in the end; only SesameRDFContainer implemented the methods needed to read data. So the interface must be extended to declare all the methods needed.

Discussion pros:

  • bindings for various RDF stores that we get for free


  • is RDF2GO still using That would mean a lot of conversions that are potentially not necessary, e.g. when using a Sesame Repository: org.openrdf.model.URIs get translated to and back to org.openrdf.model.URIs.

  • RDFContainer lacks full RDF graph access. A simple getStatements method with a subject parameter would solve this though. I've also read comments by Gunnar about having to cast RDFContainer to SesameRDFContainer in code he wrote, I guess he had the same problem?
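A getStatements method with a subject parameter, as mentioned in the last point, might look like the following sketch. The types are hypothetical (ReadableContainer, InMemoryContainer and a string-based Statement) and are not the actual RDFContainer API; the point is only that reading works through the interface, without a downcast.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: hypothetical types, not the real RDFContainer interface. */
class GraphAccessSketch {

    /** Hypothetical string-based statement. */
    static final class Statement {
        final String subject;
        final String predicate;
        final String object;

        Statement(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** The read method is part of the interface, so callers need no downcast. */
    interface ReadableContainer {
        void add(String subject, String predicate, String object);

        List<Statement> getStatements(String subject);
    }

    /** Trivial in-memory implementation used to demonstrate the interface. */
    static class InMemoryContainer implements ReadableContainer {
        private final List<Statement> statements = new ArrayList<>();

        @Override
        public void add(String subject, String predicate, String object) {
            statements.add(new Statement(subject, predicate, object));
        }

        @Override
        public List<Statement> getStatements(String subject) {
            List<Statement> result = new ArrayList<>();
            for (Statement st : statements) {
                if (st.subject.equals(subject)) {
                    result.add(st);
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        ReadableContainer container = new InMemoryContainer();
        container.add("urn:mail1", "dc:title", "Hello");
        container.add("urn:mail1", "dc:creator", "urn:person1");
        container.add("urn:person1", "rdfs:label", "Some Name");

        for (Statement st : container.getStatements("urn:mail1")) {
            System.out.println(st.predicate + " = " + st.object);
        }
    }
}
```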

Gunnar and Leo are happy with this:

  • We keep RDFContainer as it is and only switch "getModel()" to return a RDF2GO model.
  • We switch the methods of RDFContainer to use RDF2GO interfaces (RDF2GO's predicate, resource, etc.) for the setProperty(property, value) methods.

Extractors

We have some pieces of code (like the MP3 extraction) that do not work on InputStreams but only on files.

There are different approaches to solve that:

ideaA: rewrite all File-Based extractors using inputstream

Somebody writes new Extractors implementing the InputStream-based extraction interface.

  • issue: do these have to be written completely from scratch?
    • idea: have somebody else write them.
  • pro: They are probably more performant than the existing ones and have less overhead
  • con: they have to be written from scratch

ideaB: add a new method to Extractor

This is the existing method:

{{{
extract(URI id, InputStream stream, Charset charset, String mimeType, RDFContainer result)
}}}

We could add a new one to the Extractor interface:

{{{
extract(URI id, File file, Charset charset, String mimeType, RDFContainer result)
}}}

  • pro: no new interface
  • issue: looking at the interface, it is not clear which method to use and which one is actually implemented. Should I call the InputStream method first and see whether it fails?
  • issue: this depends on ideaC

ideaB1: create a new Interface FileExtractor

Create a new interface FileExtractor that declares only one method. State that this interface should only be used when no InputStream-based extraction library is available, and that a FileExtractor is a second choice compared to the normal Extractor.

{{{
extract(URI id, File file, Charset charset, String mimeType, RDFContainer result)
}}}

  • pro: developers can determine which kind of Extractor they face and which method to call
  • con: we need a new registry for FileExtractors
  • issue: this depends on ideaC
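How a caller could tell the two kinds of extractor apart under ideaB1 is sketched below. The interfaces are simplified assumptions: StreamExtractor stands in for the existing Extractor, the result is a plain Map instead of an RDFContainer, and the charset and MIME type parameters are left out. The run helper is hypothetical as well.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Sketch only: simplified stand-ins for Extractor and the proposed FileExtractor. */
class ExtractorDispatchSketch {

    /** Stand-in for the existing InputStream-based Extractor. */
    interface StreamExtractor {
        void extract(String id, InputStream stream, Map<String, String> result) throws IOException;
    }

    /** Stand-in for the proposed FileExtractor with its single method. */
    interface FileExtractor {
        void extract(String id, File file, Map<String, String> result) throws IOException;
    }

    /** The caller can see which kind of extractor it faces and calls the matching method. */
    static void run(Object extractor, String id, File file, Map<String, String> result) {
        try {
            if (extractor instanceof StreamExtractor) {
                try (InputStream in = new FileInputStream(file)) {
                    ((StreamExtractor) extractor).extract(id, in, result);
                }
            } else if (extractor instanceof FileExtractor) {
                ((FileExtractor) extractor).extract(id, file, result);
            } else {
                throw new IllegalArgumentException("not an extractor: " + extractor);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("extractor-sketch", ".bin");
        tmp.deleteOnExit();

        Map<String, String> result = new HashMap<>();
        FileExtractor byFile = (id, file, out) -> out.put("size", Long.toString(file.length()));
        run(byFile, "urn:example", tmp, result);
        System.out.println(result);
    }
}
```

In this sketch the File is always available; for remote objects that presupposes something like the getFile() method of ideaC.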

ideaC: new method getFile() to FileDataObject

Add a new method getFile(), returning a File, to FileDataObject. This is easily implemented for file-based data objects (crawling the local file system). For remote FileDataObjects, the method will be implemented by buffering the InputStream on the local hard disk. ideaB and ideaB1 depend on this getFile() method.

  • pro: optimizes the implementation for file-system-crawler
  • issue: in some constellations (crawling remote MP3s), there will only be FileExtractors, and everything will be buffered on the local hard disk
    • idea: this is not much of an issue; the benefit for the end user of having more data outweighs it
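A possible shape for getFile() is sketched below, assuming that "buffering" means copying the stream into a temporary file once and reusing that file afterwards. FileBackedObject, LocalObject and RemoteObject are hypothetical names; the real FileDataObject is not used here.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Sketch only: hypothetical stand-ins for local and remote FileDataObjects. */
class GetFileSketch {

    interface FileBackedObject {
        File getFile() throws IOException;
    }

    /** Local case: the file already exists, nothing has to be copied. */
    static class LocalObject implements FileBackedObject {
        private final File file;

        LocalObject(File file) {
            this.file = file;
        }

        @Override
        public File getFile() {
            return file;
        }
    }

    /** Remote case: the stream is buffered into a temporary file on first access. */
    static class RemoteObject implements FileBackedObject {
        private final InputStream content;
        private File buffered;

        RemoteObject(InputStream content) {
            this.content = content;
        }

        @Override
        public File getFile() throws IOException {
            if (buffered == null) {
                Path tmp = Files.createTempFile("aperture-buffer-", ".tmp");
                // createTempFile already created the file, so it has to be replaced.
                Files.copy(content, tmp, StandardCopyOption.REPLACE_EXISTING);
                buffered = tmp.toFile();
                buffered.deleteOnExit();
            }
            return buffered;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "pretend this is an MP3".getBytes(StandardCharsets.UTF_8);
        FileBackedObject remote = new RemoteObject(new ByteArrayInputStream(data));
        System.out.println("buffered bytes: " + remote.getFile().length());
    }
}
```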

The DataOpener should throw a "NotFoundException" if the element does not exist. This enables a fallback: if the element was moved, the calling GUI that uses the DataOpener could then search for the new location of the element and suggest it. The new URI could then work better.
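The fallback described above could be wired up roughly as follows. Everything in this sketch is an assumption: Opener stands in for DataOpener, Relocator is an invented helper for "search for the new location", and NotFoundException is modelled as a checked exception.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Sketch only: hypothetical types illustrating the NotFoundException fallback. */
class OpenerFallbackSketch {

    /** Assumed to be a checked exception; the real design might differ. */
    static class NotFoundException extends Exception {
        NotFoundException(String uri) {
            super("not found: " + uri);
        }
    }

    /** Stand-in for DataOpener. */
    interface Opener {
        void open(String uri) throws NotFoundException;
    }

    /** Invented helper that searches for the new location of a moved element. */
    interface Relocator {
        Optional<String> findNewLocation(String oldUri);
    }

    /** Tries the given URI first and, on failure, the suggested new location. Returns the URI opened. */
    static String openWithFallback(Opener opener, Relocator relocator, String uri)
            throws NotFoundException {
        try {
            opener.open(uri);
            return uri;
        } catch (NotFoundException e) {
            Optional<String> moved = relocator.findNewLocation(uri);
            if (!moved.isPresent()) {
                throw e; // nothing to suggest, report the original failure
            }
            opener.open(moved.get());
            return moved.get();
        }
    }

    public static void main(String[] args) throws NotFoundException {
        Set<String> existing = new HashSet<>();
        existing.add("file:/new/report.txt");

        Opener opener = uri -> {
            if (!existing.contains(uri)) {
                throw new NotFoundException(uri);
            }
        };
        Relocator relocator = oldUri -> Optional.of("file:/new/report.txt");

        System.out.println(openWithFallback(opener, relocator, "file:/old/report.txt"));
    }
}
```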

This blocks:

Add rdfs:label to everything

Ensure that every resource has an rdfs:label, in addition to dc:title, etc.

Relate DataOpeners to DataSources

In the end, DataOpeners are tightly coupled to DataSources, not to URI schemes, so the method DataOpenerRegistry.get(String urischeme) is not a good fit. For example, we use the URI scheme "gnowsis://" quite often: for Outlook, Thunderbird, and some other sources.

Still, opening by URI scheme is a good fallback when I have a resource at hand of which I know only the URI (and not the originating datasource), so I would keep it as a fallback.


  • Add a new method getDataOpener to DataSource, which returns an instance of DataOpener (or uses the DataOpenerRegistry internally when the DataSource does not define its own DataOpener).
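A rough sketch of that lookup, with the scheme-based registry kept as fallback, is given below. Opener, OpenerRegistry and Source are hypothetical stand-ins for DataOpener, DataOpenerRegistry and DataSource, and their constructors and methods are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: hypothetical stand-ins for DataOpener, DataOpenerRegistry and DataSource. */
class OpenerLookupSketch {

    interface Opener {
        String describe();
    }

    /** Scheme-based lookup, kept as the fallback. */
    static class OpenerRegistry {
        private final Map<String, Opener> byScheme = new HashMap<>();

        void register(String scheme, Opener opener) {
            byScheme.put(scheme, opener);
        }

        Opener get(String scheme) {
            return byScheme.get(scheme);
        }
    }

    static class Source {
        private final String uriScheme;
        private final Opener ownOpener; // null when the source defines no opener of its own
        private final OpenerRegistry registry;

        Source(String uriScheme, Opener ownOpener, OpenerRegistry registry) {
            this.uriScheme = uriScheme;
            this.ownOpener = ownOpener;
            this.registry = registry;
        }

        /** Prefers the source's own opener and falls back to the registry otherwise. */
        Opener getDataOpener() {
            return ownOpener != null ? ownOpener : registry.get(uriScheme);
        }
    }

    public static void main(String[] args) {
        OpenerRegistry registry = new OpenerRegistry();
        registry.register("gnowsis", () -> "generic gnowsis opener");

        // Two sources share one URI scheme but open their items differently.
        Source outlook = new Source("gnowsis", () -> "Outlook opener", registry);
        Source other = new Source("gnowsis", null, registry);

        System.out.println(outlook.getDataOpener().describe());
        System.out.println(other.getDataOpener().describe());
    }
}
```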

This blocks:

use reusable web-guis to configure datasources

Reusable crawlers, reusable registry: gnowsis, Autofocus, and possibly the Aduna Metadata Server all need GUIs to configure the datasources (restrictions, passwords, setting up and enabling datasources).

  • we could say that datasource config is done using servlets - then it is easier to reuse
  • we could use the same crawler / registration classes in Nepomuk, Autofocus, and AMS (these are ApertureCrawler and ApertureDataSourceRegistry in gnowsis)

Vocabulary: use DC instead of data

We are using many properties of Dublin Core, but redefining them. For compatibility's sake, we should include the real DC vocabularies right from the start, and not use our own URIs.

Heinz Kirchmann ran into the problem that our DATA vocabulary was not Dublin Core, and in his project he had to write a converter. Although the converter is easy to write, using Dublin Core in the first place would have been better.

Note that Leo designed the DATA ontology very badly and used our own properties only because the f* generation code from Schemagen (Jena) supports only one namespace at a time. We can circumvent that now by using better tools or other tricks. So it would be possible to generate one Java file named DATA containing all properties, which would then be a mixture of DC and others.

Suggested Vocabularies to use for Crawled Data:

Another thing to work on is an agreement (or consensus that there should be no agreement!) on what values these properties can take. For example, does a given property only allow Literal values or also other Resources, how are data types like dates encoded, what value can a language property take (e.g. only "en" or also "en-us", lowercase vs. uppercase, ...). This goes beyond the choice of a specific XML Schema data type.

About email senders and receivers: they are now modeled as a URI of the form "email:<address>" with a name and address property. Of course people can use different names in combination with the same address. You definitely want to be able to retrieve the specific combination used in a specific mail. In Aperture applications this is now typically solved by using contextualized statements. This already goes wrong when the same address is used multiple times in the same mail. This often happens with mails obtained from a mailing list: the From and To use the same address but a different name. Also this gets problematic for RDF applications that cannot use context (e.g. because their RDF store does not support it) or they cannot use it in a DataObject-centric way.


Remove the "init" target from the ant file; everything in there can be top-level.

  • simplifies the ant file.
