
Version 16 (modified by sauermann, 17 years ago)


Open Issues in Aperture

Most of these issues should ideally be fixed before ISWC 2006 (5 November 2006).

RDFContainer based on RDF2GO

Casting RDFContainer to the Sesame 2 implementation just to read values is unacceptable.

We should definitely move to RDF2GO, and RDF2GO will be improved to fit our needs. Alternatively, we use ONLY Sesame 2.

A problem that Heinz Kirchmann found when programming catwiesel.opendfki.de was that the downcast from RDFContainer to SesameRDFContainer was not obvious. The crawler handler in gnowsis is separated from the actual processing of the data. We used RDFContainer as the interface in many methods inside gnowsis; we should have used SesameRDFContainer everywhere instead.

If Sesame is the right framework for gnowsis, then passing SesameRDFContainer around would have been right. But if we remove Sesame and use Kowari or something else, using RDF2GO may be better.

Also, we never used any RDFContainer implementation besides the Sesame 2 one. So the obvious choice may be to cast everything and declare everything as SesameRDFContainer, BUT this will cause trouble for projects that are based on Sesame 1. If somebody who is based on Sesame 1 wants to use Aperture, then using OpenRDF may be good.

In the long run, a generic API that uses RDF2GO as the interface and something like Sesame 2 as the default implementation is better.

The RDFContainer interface did not include all the methods that we needed in the end; only SesameRDFContainer implemented the methods needed to read data. So the interface must be extended to declare all the methods needed.
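The gap described above can be illustrated with a minimal sketch. The interface `ReadableRdfContainer`, its methods, and the in-memory implementation are all hypothetical and are not the actual Aperture API. The sketch only shows that once read accessors are declared on the interface, handler code no longer needs to downcast to a store-specific class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical container interface (not the real Aperture RDFContainer):
// it declares read access next to write access, so callers never downcast.
interface ReadableRdfContainer {
    void put(String propertyUri, String value);

    // Returns the stored value, or null if the property is not set.
    String getString(String propertyUri);
}

// Hypothetical in-memory implementation standing in for a store-specific one.
class InMemoryRdfContainer implements ReadableRdfContainer {
    private final Map<String, String> values = new HashMap<>();

    @Override
    public void put(String propertyUri, String value) {
        values.put(propertyUri, value);
    }

    @Override
    public String getString(String propertyUri) {
        return values.get(propertyUri);
    }
}

class ContainerReadDemo {
    static final String DC_TITLE = "http://purl.org/dc/elements/1.1/title";

    // Handler code depends only on the interface, not on the implementation class.
    static String readTitle(ReadableRdfContainer container) {
        return container.getString(DC_TITLE);
    }

    public static void main(String[] args) {
        ReadableRdfContainer container = new InMemoryRdfContainer();
        container.put(DC_TITLE, "Example");
        System.out.println(readTitle(container));
    }
}
```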

Discussion

pros:

  • bindings for various RDF stores that we get for free

cons:

  • is RDF2GO still using java.net.URIs? That would mean a lot of conversions that are potentially unnecessary, e.g. when using a Sesame Repository: org.openrdf.model.URIs get translated to java.net.URIs and back to org.openrdf.model.URIs.

  • RDFContainer lacks full RDF graph access. A simple getStatements method with a subject parameter would solve this, though. I've also read comments by Gunnar about having to cast RDFContainer to SesameRDFContainer in code he wrote; I guess he had the same problem?
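A getStatements method with a subject parameter could look roughly like the sketch below. `Triple` and `GraphContainer` are hypothetical stand-ins; a real container would return the statement type of its RDF library rather than plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical triple type using plain strings for subject, predicate, object.
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

// Hypothetical container that exposes graph access by subject.
class GraphContainer {
    private final List<Triple> triples = new ArrayList<>();

    void add(String subject, String predicate, String object) {
        triples.add(new Triple(subject, predicate, object));
    }

    // Returns every statement whose subject equals the given URI.
    List<Triple> getStatements(String subject) {
        List<Triple> result = new ArrayList<>();
        for (Triple t : triples) {
            if (t.subject.equals(subject)) {
                result.add(t);
            }
        }
        return result;
    }
}
```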

Gunnar and Leo are happy with this:

  • We keep RDFContainer as it is and only switch "getModel()" to return an RDF2GO model.
  • We switch the setProperty(property, value) methods of RDFContainer to use RDF2GO interfaces (RDF2Go's predicate, resource, etc.).

java.io.File-Based Extractors

We have some pieces of code (like MP3 extraction) that do not work on InputStreams but only on files.

There are different approaches to solve that:

ideaA: rewrite all File-Based extractors using inputstream

Somebody writes new Extractors implementing the InputStream-based extraction interface.

  • issue: do these have to be written completely from scratch?
    • idea: have somebody else write them.
  • pro: they are probably faster than the existing ones and have less overhead
  • con: they have to be written from scratch

ideaB: add a new method to Extractor

This is the existing method:

{{{
extract(URI id, InputStream stream, Charset charset,
        String mimeType, RDFContainer result)
}}}

We could add a new one to the Extractor interface:

{{{
extract(URI id, File file, Charset charset,
        String mimeType, RDFContainer result)
}}}

  • pro: no new interface
  • issue: looking at the interface, it is not clear which method to use and which one is implemented. Should I first call the method with InputStream and see if it fails?
  • issue: this depends on ideaC

ideaB1: create a new Interface FileExtractor

Create a new interface FileExtractor that declares only one method. State that this interface should only be used when no InputStream-based extraction library is available, and that a FileExtractor is second choice compared to the normal Extractor.

{{{
extract(URI id, File file, Charset charset,
        String mimeType, RDFContainer result)
}}}

  • pro: developers can determine which kind of Extractor they face and which method to call
  • con: we need a new registry for FileExtractors
  • issue: this depends on ideaC
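A rough sketch of ideaB1, using the two method signatures quoted above. The empty `RDFContainer` placeholder and the helper `kindOf` are assumptions made for illustration, and any exception declarations of the real interface are left out. The point is that calling code can tell from the type which kind of extractor it faces.

```java
import java.io.File;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.Charset;

// Empty placeholder for the container type named in the signatures.
interface RDFContainer {
}

// Stream-based interface with the signature quoted in the text.
interface Extractor {
    void extract(URI id, InputStream stream, Charset charset,
                 String mimeType, RDFContainer result);
}

// Proposed file-based interface (ideaB1), meant as second choice.
interface FileExtractor {
    void extract(URI id, File file, Charset charset,
                 String mimeType, RDFContainer result);
}

class ExtractorKindDemo {
    // Hypothetical helper: the stream-based kind is checked first,
    // because FileExtractor is only meant as a fallback.
    static String kindOf(Object extractor) {
        if (extractor instanceof Extractor) {
            return "stream";
        }
        if (extractor instanceof FileExtractor) {
            return "file";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        Extractor streamBased = (id, stream, charset, mimeType, result) -> { };
        FileExtractor fileBased = (id, file, charset, mimeType, result) -> { };
        System.out.println(kindOf(streamBased));
        System.out.println(kindOf(fileBased));
    }
}
```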

ideaC: new method getFile() to FileDataObject

Add a new method getFile(), returning a File, to FileDataObject. This is easily implemented on file-based data objects (crawling the local file system). For remote FileDataObjects, the method will be implemented by buffering the InputStream to the local hard disk. ideaB and ideaB1 depend on this getFile() method.

  • pro: optimizes the implementation for file-system-crawler
  • issue: in some constellations (crawling remote MP3s), there will only be FileExtractors and everything will be buffered on the local hard disk
    • idea: this is not much of an issue; the benefit to the end user of having more data outweighs it
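For remote data objects, the buffering behind getFile() could be done along these lines. This is only a sketch: the class `StreamBuffer`, the method name, and the temp-file prefix are made up here, and a real implementation would also have to decide when the buffered file gets deleted.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class StreamBuffer {
    // Copies the whole stream into a temporary file and returns that file.
    static File bufferToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("aperture-buffer-", ".tmp");
        // createTempFile already created the file, so it must be replaced.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        File file = tmp.toFile();
        // Assumption: removing the buffer at JVM exit is good enough for a sketch.
        file.deleteOnExit();
        return file;
    }

    public static void main(String[] args) throws IOException {
        // A byte array stands in for the stream of a remote data object.
        InputStream remote = new ByteArrayInputStream(
                "fake mp3 bytes".getBytes(StandardCharsets.UTF_8));
        File local = bufferToTempFile(remote);
        System.out.println(local.length() + " bytes buffered");
    }
}
```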

DataOpener.open(uri)

open(uri) should throw a "NotFoundException" if the element does not exist. This enables a fallback: if the element was moved, the calling GUI that uses DataOpener could then search for the element's new location and suggest it. The new URI could then work better.
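The fallback could be sketched as follows. The shape of `DataOpener` and `NotFoundException` is assumed here (a checked exception, a one-method interface), and the `knownMoves` map is a hypothetical stand-in for whatever search the calling GUI would really perform.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Assumed to be a checked exception; the real design may differ.
class NotFoundException extends Exception {
    NotFoundException(String message) {
        super(message);
    }
}

// Simplified one-method version of the opener interface.
interface DataOpener {
    void open(URI uri) throws NotFoundException;
}

class OpenWithFallback {
    // Tries the original URI first. If the element is not found, looks up a
    // new location and retries once. Returns the URI that was opened.
    static URI open(DataOpener opener, URI uri, Map<URI, URI> knownMoves)
            throws NotFoundException {
        try {
            opener.open(uri);
            return uri;
        } catch (NotFoundException e) {
            URI moved = knownMoves.get(uri);
            if (moved == null) {
                throw e; // no new location known, give up
            }
            opener.open(moved);
            return moved;
        }
    }

    public static void main(String[] args) throws NotFoundException {
        URI oldUri = URI.create("file:/old.txt");
        URI newUri = URI.create("file:/new.txt");
        // Demo opener that fails for the old location only.
        DataOpener opener = u -> {
            if (u.equals(oldUri)) {
                throw new NotFoundException("not found: " + u);
            }
        };
        Map<URI, URI> knownMoves = new HashMap<>();
        knownMoves.put(oldUri, newUri);
        System.out.println(open(opener, oldUri, knownMoves));
    }
}
```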

This blocks:

Add rdfs:label to everything

Ensure that every resource has an rdfs:label, in addition to dc:title, etc.

Relate DataOpeners to DataSources

In the end, DataOpeners are tightly coupled to DataSources, not to URI schemes. The method DataOpenerRegistry.get(String urischeme) is not good, because the URI scheme "gnowsis://" is used quite often: for Outlook, Thunderbird, and some other sources.

Still, opening by URI scheme is a good fallback when I have a resource at hand of which I know only the URI (and not the originating datasource), so I would keep it as a fallback.

Idea:

  • Add a new method to DataSource, getDataOpener, which returns an instance of DataOpener (or uses the DataOpenerRegistry internally when the DataSource does not define its own DataOpener).
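The idea might be sketched like this. `DataSource`, `DataOpener`, and `DataOpenerRegistry` are heavily simplified here, and the constructor and the `add` method are assumptions. Only `getDataOpener()` and `get(String urischeme)` come from the text above.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Simplified opener interface for this sketch.
interface DataOpener {
    void open(URI uri);
}

// Simplified registry keyed by URI scheme; "add" is a hypothetical method.
class DataOpenerRegistry {
    private final Map<String, DataOpener> byScheme = new HashMap<>();

    void add(String uriScheme, DataOpener opener) {
        byScheme.put(uriScheme, opener);
    }

    DataOpener get(String uriScheme) {
        return byScheme.get(uriScheme);
    }
}

// Simplified data source that may carry its own opener.
class DataSource {
    private final String uriScheme;
    private final DataOpener ownOpener; // null if the source defines none
    private final DataOpenerRegistry registry;

    DataSource(String uriScheme, DataOpener ownOpener, DataOpenerRegistry registry) {
        this.uriScheme = uriScheme;
        this.ownOpener = ownOpener;
        this.registry = registry;
    }

    // Prefers the source's own opener, otherwise falls back to the URI scheme.
    DataOpener getDataOpener() {
        return ownOpener != null ? ownOpener : registry.get(uriScheme);
    }
}

class DataOpenerDemo {
    public static void main(String[] args) {
        DataOpenerRegistry registry = new DataOpenerRegistry();
        registry.add("gnowsis", uri -> System.out.println("generic opener: " + uri));

        // Both sources share the "gnowsis" scheme but open differently.
        DataSource outlook = new DataSource("gnowsis",
                uri -> System.out.println("Outlook opener: " + uri), registry);
        DataSource other = new DataSource("gnowsis", null, registry);

        outlook.getDataOpener().open(URI.create("gnowsis://outlook/item1"));
        other.getDataOpener().open(URI.create("gnowsis://other/item2"));
    }
}
```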

This blocks:

use reusable web-guis to configure datasources

Reusable crawlers, reusable registry: gnowsis, Autofocus, and possibly the Aduna Metadata Server all need GUIs to configure the datasources (restrictions, passwords, setting up and enabling datasources).

  • we could say that datasource configuration is done using servlets; then it is easier to reuse
  • we could use the same crawler / registration classes in Nepomuk, Autofocus, and AMS (what are ApertureCrawler and ApertureDataSourceRegistry in gnowsis)

Vocabulary: use DC instead of data

We are using many properties of Dublin Core, but redefining them. For compatibility's sake, we should include the real DC vocabularies right from the start, and not use our own URIs.

Heinz Kirchmann had the problem that our DATA vocabulary was not Dublin Core, and in his project he had to write a converter. Although the converter is easy to write, using Dublin Core in the first place would have been better.

Note that Leo designed the DATA ontology very badly and used his own properties only because the generation code from Schemagen (Jena) supports only one namespace at a time. We can circumvent that now by using better tools or other tricks. So it would be possible to generate one Java file named DATA containing all properties, which would then be a mixture of DC and others.
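A hand-written sketch of what such a mixed DATA class might look like. The Dublin Core element namespace is the real one; the own namespace `http://example.org/aperture/data#` and the property `byteSize` are placeholders invented for this sketch, and plain strings are used instead of a URI class from an RDF library.

```java
// One vocabulary class whose constants come from more than one namespace.
final class DATA {
    // Real Dublin Core element set namespace.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Placeholder for the project's own namespace (assumption, not the real URI).
    static final String NS = "http://example.org/aperture/data#";

    // Properties reused from Dublin Core instead of being redefined.
    static final String TITLE = DC_NS + "title";
    static final String CREATOR = DC_NS + "creator";
    static final String DATE = DC_NS + "date";

    // Hypothetical property that has no Dublin Core equivalent.
    static final String BYTE_SIZE = NS + "byteSize";

    private DATA() {
        // constants only, no instances
    }
}
```

Code that consumes the extracted metadata would then see standard DC URIs for title, creator, and date, and would not need a converter for those properties.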

build.xml

Remove the "init" target; everything in there can be top-level.

  • simplifies the ant file.

old aperture pages