wiki:ApertureDiscussion

Version 23 (modified by sauermann, 17 years ago)

--

Open Issues in Aperture

see also ApertureClosedIssues for a list of things that have been completed.

CrawlerHandler vs. RDFContainerFactory

These interfaces shouldn't be mixed together. I would suggest removing the getRDFContainerFactory method from the CrawlerHandler interface. The RDFContainerFactory should be set on the crawler before the crawling process begins, not obtained from the CrawlerHandler. Also, all examples should be rewritten to use a simple RDFContainerFactory that creates a new RDFContainer, backed by a fresh and empty Repository/Model, on every invocation of the getRDFContainer method. The data would then have to be transferred explicitly from this temporary container to the backing model/repository. This would be more elegant and more instructive as far as the intended usage is concerned: the user could 'see' the flow of data in the source code. If we have a Repository in the CrawlerHandler and generate RDFContainers backed by this repository, it may not be obvious where the data in the repository actually comes from.
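The proposed data flow can be sketched as follows. This is only an illustration under stated assumptions: TempContainer, FreshContainerFactory and CollectingHandler are simplified, hypothetical stand-ins (plain strings instead of RDF statements), not the real Aperture interfaces. The point is that the factory hands out an empty container on every call and that the handler copies the data into the repository in one visible place.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical stand-ins; these are NOT the real Aperture interfaces.
class FreshContainerSketch {

    /** Stand-in for an RDFContainer backed by its own fresh, empty model. */
    static final class TempContainer {
        final String describedUri;
        final List<String> statements = new ArrayList<>();

        TempContainer(String describedUri) {
            this.describedUri = describedUri;
        }

        void add(String property, String value) {
            statements.add(describedUri + " " + property + " " + value);
        }
    }

    /** Stand-in for an RDFContainerFactory: every call returns a new container. */
    static final class FreshContainerFactory {
        TempContainer getRDFContainer(String uri) {
            return new TempContainer(uri);
        }
    }

    /** Stand-in for a CrawlerHandler that owns the backing repository. */
    static final class CollectingHandler {
        final List<String> repository = new ArrayList<>();

        // The explicit transfer: the only place where data enters the repository.
        void objectNew(TempContainer metadata) {
            repository.addAll(metadata.statements);
        }
    }

    public static void main(String[] args) {
        FreshContainerFactory factory = new FreshContainerFactory();
        CollectingHandler handler = new CollectingHandler();

        // What a crawler would do for each object it finds.
        TempContainer container = factory.getRDFContainer("file:/tmp/a.txt");
        container.add("dc:title", "\"a.txt\"");
        handler.objectNew(container);

        System.out.println(handler.repository);
    }
}
```

With this shape, nothing reaches the repository unless the handler copies it there, which is the visibility argued for above.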

java.io.File-Based Extractors

We have some pieces of code (like MP3 extraction) that do not work on InputStreams but only on files.

There are different approaches to solving that:

ideaA: rewrite all File-Based extractors using InputStream

Somebody writes new Extractors implementing the InputStream-based extraction interface.

  • issue: do these have to be written completely from scratch?
    • idea: have somebody else write them.
  • pro: They probably perform better than the existing ones and have less overhead
  • con: they have to be written from scratch

ideaB: add a new method to Extractor

This is the existing method: {{{extract(URI id, InputStream stream, Charset charset, String mimeType, RDFContainer result)}}}

We could add a new one to the Extractor interface: {{{extract(URI id, File file, Charset charset, String mimeType, RDFContainer result)}}}

  • pro: no new interface
  • issue: looking at the interface, it is not clear which method to use and which one is implemented. Should I first call the method with the InputStream and see whether it fails?
  • issue: this depends on ideaC

ideaB1: create a new Interface FileExtractor

Create a new interface FileExtractor that declares only one method. State that this interface should only be used when no InputStream-based extraction library is available, and that a FileExtractor is inferior to the normal Extractor. {{{extract(URI id, File file, Charset charset, String mimeType, RDFContainer result)}}}

  • pro: developers can determine which kind of Extractor they face and which method to call
  • con: we need a new registry for FileExtractors
  • issue: this depends on ideaC
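A rough sketch of how a caller might choose between the two kinds of extractor under ideaB1. The signatures are deliberately simplified assumptions: Charset is dropped, a StringBuilder stands in for the RDFContainer, and the two maps keyed by MIME type are a hypothetical stand-in for the two registries. The file is requested through a Supplier so that it is only produced when a FileExtractor is the sole option.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Simplified, hypothetical signatures: Charset is dropped and a StringBuilder
// stands in for the RDFContainer result.
class ExtractorLookupSketch {

    interface StreamExtractor {
        void extract(String id, InputStream stream, StringBuilder result);
    }

    interface FileExtractor {
        void extract(String id, File file, StringBuilder result);
    }

    // Two registries keyed by MIME type (the "con" of ideaB1).
    final Map<String, StreamExtractor> streamExtractors = new HashMap<>();
    final Map<String, FileExtractor> fileExtractors = new HashMap<>();

    /**
     * Prefers a stream-based extractor. The file supplier is only invoked
     * when a FileExtractor is the sole option, so no file is created needlessly.
     * Returns false if no extractor is registered for the MIME type.
     */
    boolean extract(String id, String mimeType, InputStream stream,
                    Supplier<File> file, StringBuilder result) {
        StreamExtractor byStream = streamExtractors.get(mimeType);
        if (byStream != null) {
            byStream.extract(id, stream, result);
            return true;
        }
        FileExtractor byFile = fileExtractors.get(mimeType);
        if (byFile != null) {
            byFile.extract(id, file.get(), result);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ExtractorLookupSketch registry = new ExtractorLookupSketch();
        registry.streamExtractors.put("text/plain",
                (id, in, out) -> out.append("stream:").append(id));
        registry.fileExtractors.put("audio/mpeg",
                (id, f, out) -> out.append("file:").append(f.getName()));

        StringBuilder result = new StringBuilder();
        registry.extract("urn:song", "audio/mpeg",
                new ByteArrayInputStream(new byte[0]), () -> new File("song.mp3"), result);
        System.out.println(result);
    }
}
```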

ideaC: new method getFile() to FileDataObject

Add a new method getFile(), returning a file, to FileDataObject. This is easy to implement for File-based data objects (crawling the local file system). For remote FileDataObjects, the method would be implemented by buffering the InputStream into a local file. ideaB and ideaB1 depend on this getFile() method.

  • pro: optimizes the implementation for file-system-crawler
  • issue: in some configurations (crawling remote MP3s), there will only be FileExtractors and everything will be buffered on the local hard disk
    • idea: this is not so much of an issue; the benefit to the end user of having more data outweighs it
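For the remote case of ideaC, buffering could look roughly like the sketch below. The helper name bufferToTempFile and the temp-file prefix are assumptions made for illustration; only java.nio.file calls from the standard library are used. A local, File-based data object would simply return its existing file and skip this step.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; the name and the temp-file prefix are assumptions.
class GetFileSketch {

    /** Copies the whole stream into a temporary file and returns that file. */
    static File bufferToTempFile(InputStream stream) {
        try (InputStream in = stream) {
            Path tmp = Files.createTempFile("aperture-buffer-", ".tmp");
            tmp.toFile().deleteOnExit(); // best-effort cleanup when the JVM exits
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp.toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A byte array stands in for a remote stream here.
        InputStream remote =
                new ByteArrayInputStream("not a real MP3".getBytes(StandardCharsets.UTF_8));
        File file = bufferToTempFile(remote);
        System.out.println(file + " (" + file.length() + " bytes)");
    }
}
```

The cost named in the issue above is visible here: every remote object handled this way is written to disk in full before extraction starts.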

DataOpener.open(uri)

The method should throw a "NotFoundException" if the element does not exist. This enables a fallback: if the element was moved, the calling GUI that uses the DataOpener could search for the element's new location and suggest it. Opening the new URI could then succeed.
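The fallback described above might look like the following sketch. Everything here is an assumption made for illustration: the DataOpener interface is a simplified stand-in, NotFoundException is modeled as a checked exception, and the findNewLocation function is a hypothetical hook for whatever search the GUI performs.

```java
import java.net.URI;
import java.util.Optional;
import java.util.function.Function;

class OpenerFallbackSketch {

    /** The proposed exception; making it a checked exception is an assumption. */
    static class NotFoundException extends Exception {
        NotFoundException(URI uri) {
            super("not found: " + uri);
        }
    }

    /** Simplified stand-in for DataOpener. */
    interface DataOpener {
        void open(URI uri) throws NotFoundException;
    }

    /**
     * Tries the given URI first. If the element is gone, asks the hypothetical
     * findNewLocation hook for a replacement URI and tries that once.
     * Returns the URI that was actually opened, or empty if nothing could be opened.
     */
    static Optional<URI> openWithFallback(DataOpener opener, URI uri,
                                          Function<URI, Optional<URI>> findNewLocation) {
        try {
            opener.open(uri);
            return Optional.of(uri);
        } catch (NotFoundException e) {
            Optional<URI> moved = findNewLocation.apply(uri);
            if (!moved.isPresent()) {
                return Optional.empty();
            }
            try {
                opener.open(moved.get());
                return moved;
            } catch (NotFoundException stillMissing) {
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        URI oldUri = URI.create("file:/old/a.txt");
        URI newUri = URI.create("file:/new/a.txt");
        // Pretend only the new location exists.
        DataOpener opener = u -> {
            if (!u.equals(newUri)) {
                throw new NotFoundException(u);
            }
        };
        System.out.println(openWithFallback(opener, oldUri, u -> Optional.of(newUri)));
    }
}
```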

This blocks:

Add rdfs:label to everything

Ensure that every resource has an rdfs:label, in addition to dc:title, etc.

Relate DataOpeners to DataSources

In the end, DataOpeners are tightly bound to DataSources, not to URI schemes. The method DataOpenerRegistry.get(String urischeme) is not good, because the URI scheme "gnowsis://" is used quite often: for Outlook, Thunderbird, and some other sources.

Still, opening by URI scheme is a good fallback when I have a resource at hand of which I know only the URI (and not the originating DataSource), so I would keep it as a fallback.

Idea:

  • Add a new method to DataSource, getDataOpener, which returns an instance of DataOpener (or uses the DataOpenerRegistry internally when the DataSource does not define its own DataOpener).
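A minimal sketch of this idea, with simplified and hypothetical stand-ins for DataOpener, DataOpenerRegistry and DataSource (the constructor arguments and the name() method are assumptions for illustration only). It shows two sources sharing the "gnowsis" scheme: one supplies its own opener, the other falls back to the scheme-based registry lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical stand-ins; not the real Aperture types.
class DataSourceOpenerSketch {

    interface DataOpener {
        String name(); // only here so the example can show which opener was chosen
    }

    /** Scheme-based lookup, kept as the fallback. */
    static final class DataOpenerRegistry {
        private final Map<String, DataOpener> byScheme = new HashMap<>();

        void add(String uriScheme, DataOpener opener) {
            byScheme.put(uriScheme, opener);
        }

        DataOpener get(String uriScheme) {
            return byScheme.get(uriScheme);
        }
    }

    static final class DataSource {
        private final String uriScheme;
        private final DataOpener ownOpener; // null if the source defines none
        private final DataOpenerRegistry registry;

        DataSource(String uriScheme, DataOpener ownOpener, DataOpenerRegistry registry) {
            this.uriScheme = uriScheme;
            this.ownOpener = ownOpener;
            this.registry = registry;
        }

        /** Prefers the source's own opener, else falls back to the registry. */
        DataOpener getDataOpener() {
            return ownOpener != null ? ownOpener : registry.get(uriScheme);
        }
    }

    public static void main(String[] args) {
        DataOpenerRegistry registry = new DataOpenerRegistry();
        registry.add("gnowsis", () -> "generic gnowsis opener");

        DataSource outlook = new DataSource("gnowsis", () -> "outlook opener", registry);
        DataSource other = new DataSource("gnowsis", null, registry);

        System.out.println(outlook.getDataOpener().name());
        System.out.println(other.getDataOpener().name());
    }
}
```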

This blocks:

use reusable web-guis to configure datasources

Reusable crawlers, reusable registry: gnowsis, Autofocus, and possibly the Aduna Metadata Server all need GUIs to configure the datasources (restrictions, passwords, setting up and enabling datasources).

  • we could say that datasource configuration is done using servlets; then it is easier to reuse
  • we could use the same crawler / registration classes in Nepomuk, Autofocus, and AMS (what ApertureCrawler and ApertureDataSourceRegistry are in gnowsis)

Vocabulary: use DC instead of data

We are using many properties of Dublin Core, but redefining them. For compatibility's sake, we should include the real DC vocabularies right from the start and not use our own URIs.

Heinz Kirchmann had the trouble that our DATA vocabulary was not Dublin core, and in his project he had to write a converter. Although the converter is easy to write, using Dublin Core in the first place would have been better.

Note that Leo designed the Data ontology very badly and used custom properties only because the f* generation code from Schemagen (Jena) supports only one namespace at a time. We can now circumvent that by using better tools or other tricks. So it would be possible to generate one Java file named DATA containing all properties, which would then be a mixture of DC and others.

Suggested Vocabularies to use for Crawled Data:

Another thing to work on is an agreement (or consensus that there should be no agreement!) on what values these properties can take. For example, does a given property only allow Literal values or also other Resources, how are data types like dates encoded, what value can a language property take (e.g. only "en" or also "en-us", lowercase vs. uppercase, ...). This goes beyond the choice of a specific XML Schema data type.

About email senders and receivers: they are now modeled as a URI of the form "email:<address>" with a name and address property. Of course people can use different names in combination with the same address. You definitely want to be able to retrieve the specific combination used in a specific mail. In Aperture applications this is now typically solved by using contextualized statements. This already goes wrong when the same address is used multiple times in the same mail. This often happens with mails obtained from a mailing list: the From and To use the same address but a different name. Also this gets problematic for RDF applications that cannot use context (e.g. because their RDF store does not support it) or they cannot use it in a DataObject-centric way.

build.xml

Remove the "init" target; everything in there can be top-level.

  • simplifies the ant file.

old aperture pages