wiki:ApertureDiscussion

Version 25 (modified by sauermann, 17 years ago)

--

DEPRECATED

MOVED TO

DEPRECATED STUFF, DON'T READ IT

Open Issues in Aperture

See also ApertureClosedIssues for a list of things that have been completed.

CrawlerHandler vs. RDFContainerFactory

These interfaces shouldn't be mixed together. I would suggest removing the getRDFContainerFactory method from the CrawlerHandler interface. The RDFContainerFactory should be set on the crawler before the crawling process begins, not obtained from the CrawlerHandler. Also, all examples should be rewritten to use a simple RDFContainerFactory that creates a new RDFContainer backed by a fresh, empty Repository/Model on every invocation of the getRDFContainer method. The data would then have to be transferred explicitly from this temporary container to the backing model/repository. This would be more elegant and more instructive as far as the intended usage is concerned: the user could 'see' the flow of data in the source code. If we have a Repository in the CrawlerHandler and generate RDFContainers backed by this repository, it might not be obvious where the data in the repository actually comes from.
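The data flow proposed above can be sketched roughly as follows. This is a minimal illustration, not the Aperture API: `SimpleContainer`, `ContainerFactory`, `Handler`, `Crawler` and `objectNew` are hypothetical stand-ins, and a plain list of strings takes the place of a Repository/Model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; NOT the real Aperture interfaces.
final class FreshContainerSketch {

    /** Stand-in for an RDFContainer: holds "subject predicate object" strings. */
    static final class SimpleContainer {
        final String uri;
        final List<String> statements = new ArrayList<>();

        SimpleContainer(String uri) {
            this.uri = uri;
        }

        void add(String predicate, String value) {
            statements.add(uri + " " + predicate + " " + value);
        }
    }

    /** Stand-in factory: every call returns a container backed by a fresh, empty store. */
    interface ContainerFactory {
        SimpleContainer getRDFContainer(String uri);
    }

    /** Stand-in handler: owns the long-lived store and copies data into it explicitly. */
    static final class Handler {
        final List<String> repository = new ArrayList<>();

        void objectNew(SimpleContainer metadata) {
            // The explicit transfer step: temporary container -> backing repository.
            repository.addAll(metadata.statements);
        }
    }

    /** Stand-in crawler: the factory is set up front, the handler only receives results. */
    static final class Crawler {
        private final ContainerFactory factory;
        private final Handler handler;

        Crawler(ContainerFactory factory, Handler handler) {
            this.factory = factory;
            this.handler = handler;
        }

        void crawl(List<String> uris) {
            for (String uri : uris) {
                SimpleContainer container = factory.getRDFContainer(uri);
                container.add("rdfs:label", "\"" + uri + "\"");
                handler.objectNew(container);
            }
        }
    }

    public static void main(String[] args) {
        Handler handler = new Handler();
        // SimpleContainer::new creates a new, empty container on every invocation.
        Crawler crawler = new Crawler(SimpleContainer::new, handler);
        crawler.crawl(List.of("file:/a.txt", "file:/b.txt"));
        System.out.println(handler.repository);
    }
}
```

Because the handler never hands out containers backed by its own store, the only way data reaches `repository` is the visible `addAll` call.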

java.io.File-Based Extractors

We have some pieces of code (like MP3 extraction) that do not work on InputStreams but only on files.

There are different approaches to solve that:

ideaA: rewrite all File-Based extractors using InputStream

Somebody writes new Extractors implementing the InputStream-based extraction interface.

  • issue: do these have to be written completely from scratch?
    • idea: have somebody else write them.
  • pro: they are probably more performant than the existing ones and have less overhead
  • con: they have to be written from scratch

ideaB: add a new method to Extractor

This is the existing method:

{{{
extract(URI id, InputStream stream, Charset charset, String mimeType, RDFContainer result)
}}}

We could add a new one to the Extractor interface:

{{{
extract(URI id, File file, Charset charset, String mimeType, RDFContainer result)
}}}

  • pro: no new interface
  • issue: looking at the interface, it is not clear which method to use and which one is actually implemented. Should I first call the method with the InputStream and see whether it fails?
  • issue: this depends on ideaC

ideaB1: create a new interface FileExtractor

Create a new interface FileExtractor that declares only one method. Document that this interface should only be used when no InputStream-based extraction library is available, and that a FileExtractor is second choice compared to the normal Extractor.

{{{
extract(URI id, File file, Charset charset, String mimeType, RDFContainer result)
}}}

  • pro: developers can determine which kind of Extractor they face and which method to call
  • con: we need a new registry for FileExtractors
  • issue: this depends on ideaC
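A rough sketch of how ideaB1 might look to a caller, assuming two separate registries and a preference for stream-based extractors. All types here are hypothetical stand-ins: a `Map<String, String>` takes the place of RDFContainer, and `Registry` is an invented helper, not an existing Aperture class.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; NOT the real Aperture types.
final class FileExtractorSketch {

    /** Stand-in for the stream-based Extractor; a Map replaces RDFContainer. */
    interface Extractor {
        void extract(URI id, InputStream stream, Charset charset,
                     String mimeType, Map<String, String> result) throws IOException;
    }

    /** The proposed second-choice interface for libraries that need a java.io.File. */
    interface FileExtractor {
        void extract(URI id, File file, Charset charset,
                     String mimeType, Map<String, String> result) throws IOException;
    }

    /** Holds both registries and prefers stream-based extractors when both exist. */
    static final class Registry {
        final Map<String, Extractor> extractors = new HashMap<>();
        final Map<String, FileExtractor> fileExtractors = new HashMap<>();

        /** Returns "stream", "file" or "none", depending on which kind handled the type. */
        String extract(URI id, String mimeType, InputStream stream, File file,
                       Map<String, String> result) throws IOException {
            Extractor streamBased = extractors.get(mimeType);
            if (streamBased != null) {
                streamBased.extract(id, stream, StandardCharsets.UTF_8, mimeType, result);
                return "stream";
            }
            FileExtractor fileBased = fileExtractors.get(mimeType);
            if (fileBased != null) {
                fileBased.extract(id, file, StandardCharsets.UTF_8, mimeType, result);
                return "file";
            }
            return "none";
        }
    }

    public static void main(String[] args) throws IOException {
        Registry registry = new Registry();
        registry.extractors.put("text/plain",
                (id, in, cs, mime, out) -> out.put("kind", "stream"));
        registry.fileExtractors.put("audio/mpeg",
                (id, f, cs, mime, out) -> out.put("file", f.getName()));

        Map<String, String> result = new HashMap<>();
        String used = registry.extract(URI.create("file:/song.mp3"), "audio/mpeg",
                new ByteArrayInputStream(new byte[0]), new File("song.mp3"), result);
        System.out.println(used + " " + result);
    }
}
```

The caller always knows which kind of extractor it got, at the cost of maintaining the second registry.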

ideaC: add a new method getFile() to FileDataObject

Add a new method getFile(), returning a java.io.File, to FileDataObject. This is easily implemented for File-based data objects (crawling the local file system). For remote FileDataObjects, the method will be implemented by buffering the InputStream on the local hard disk. ideaB and ideaB1 depend on this getFile() method.

  • pro: optimizes the implementation for file-system-crawler
  • issue: in some constellations (crawling remote MP3s), there will only be FileExtractors, and everything will be buffered on the local hard disk
    • idea: this is not much of an issue; the benefit to the end user of having more data outweighs it
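One way the two getFile() variants could be implemented is sketched below. The interface and both classes are hypothetical stand-ins (including the `getContent()` accessor), and the temporary-file prefix is an arbitrary choice; only the `java.nio.file.Files` calls are real library API.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-ins for illustration only; NOT the real Aperture types.
final class GetFileSketch {

    interface FileDataObject {
        InputStream getContent() throws IOException;

        File getFile() throws IOException;
    }

    /** Local case: the file already exists, so getFile() is trivial. */
    static final class LocalFileDataObject implements FileDataObject {
        private final File file;

        LocalFileDataObject(File file) {
            this.file = file;
        }

        @Override
        public InputStream getContent() throws IOException {
            return Files.newInputStream(file.toPath());
        }

        @Override
        public File getFile() {
            return file;
        }
    }

    /** Remote case: the stream is buffered into a temporary file on first request. */
    static final class RemoteFileDataObject implements FileDataObject {
        private final InputStream content;
        private File buffered;

        RemoteFileDataObject(InputStream content) {
            this.content = content;
        }

        @Override
        public InputStream getContent() {
            // Note: once getFile() has run, this stream has been fully consumed.
            return content;
        }

        @Override
        public File getFile() throws IOException {
            if (buffered == null) {
                Path tmp = Files.createTempFile("aperture-buffer-", ".tmp");
                tmp.toFile().deleteOnExit();
                // createTempFile already created the file, so it must be replaced.
                Files.copy(content, tmp, StandardCopyOption.REPLACE_EXISTING);
                buffered = tmp.toFile();
            }
            return buffered;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "fake mp3 bytes".getBytes(StandardCharsets.UTF_8);
        FileDataObject remote = new RemoteFileDataObject(new ByteArrayInputStream(data));
        File file = remote.getFile();
        System.out.println(file.length() + " bytes buffered");
    }
}
```

Caching the buffered file means repeated getFile() calls cost one copy at most, which limits the overhead named in the issue above.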

DataOpener.open(uri)

The method should throw a "NotFoundException" if the element does not exist. This enables a fallback: if the element was moved, the calling GUI that uses the DataOpener could search for the element's new location and suggest it to the user. Opening the new URI could then succeed.
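The fallback could look roughly like this on the caller's side. This is a sketch under assumptions: `NotFoundException` is modelled as a checked exception, `InMemoryOpener` is an invented test double, and the `knownMoves` map merely stands in for whatever search the GUI would perform.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-ins for illustration only; NOT the real Aperture types.
final class DataOpenerSketch {

    static final class NotFoundException extends Exception {
        NotFoundException(URI uri) {
            super("Not found: " + uri);
        }
    }

    interface DataOpener {
        void open(URI uri) throws NotFoundException;
    }

    /** Invented test double: only URIs in 'existing' can be opened. */
    static final class InMemoryOpener implements DataOpener {
        final Set<URI> existing = new HashSet<>();

        @Override
        public void open(URI uri) throws NotFoundException {
            if (!existing.contains(uri)) {
                throw new NotFoundException(uri);
            }
            // A real opener would launch the native application here.
        }
    }

    /** Tries the URI first, then a suggested new location; returns what was opened. */
    static Optional<URI> openWithFallback(DataOpener opener, URI uri, Map<URI, URI> knownMoves) {
        try {
            opener.open(uri);
            return Optional.of(uri);
        } catch (NotFoundException notFound) {
            URI suggestion = knownMoves.get(uri); // stand-in for a real search
            if (suggestion == null) {
                return Optional.empty();
            }
            try {
                opener.open(suggestion);
                return Optional.of(suggestion);
            } catch (NotFoundException stillMissing) {
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        InMemoryOpener opener = new InMemoryOpener();
        URI oldUri = URI.create("file:/old/report.pdf");
        URI newUri = URI.create("file:/new/report.pdf");
        opener.existing.add(newUri);
        System.out.println(openWithFallback(opener, oldUri, Map.of(oldUri, newUri)));
    }
}
```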

This blocks:

Add rdfs:label to everything

Ensure that every resource has an rdfs:label, in addition to dc:title, etc.

Relate DataOpeners to DataSources

In the end, DataOpeners are tightly coupled to DataSources, not to URI schemes. The method DataOpenerRegistry.get(String urischeme) is not good, because the URI scheme "gnowsis://" is used quite often: for Outlook, Thunderbird, and some other sources.

Still, opening by URI scheme is a good fallback when I have a resource at hand for which I know only the URI (and not the originating DataSource), so I would keep it as a fallback.

Idea:

  • Add a new method to DataSource, getDataOpener, which returns an instance of DataOpener (or uses the DataOpenerRegistry internally when the DataSource does not define its own DataOpeners).
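The idea might be sketched as follows, assuming simplified stand-in types. `OutlookSource` and `GenericSource` are invented examples, and the stand-in `DataOpener.open` returns a description string purely so the sketch has something observable; the real interfaces may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; NOT the real Aperture types.
final class DataSourceOpenerSketch {

    /** Stand-in opener; returns a description instead of launching an application. */
    interface DataOpener {
        String open(String uri);
    }

    /** Stand-in registry keyed by URI scheme, kept as the fallback mechanism. */
    static final class DataOpenerRegistry {
        private final Map<String, DataOpener> byScheme = new HashMap<>();

        void add(String uriScheme, DataOpener opener) {
            byScheme.put(uriScheme, opener);
        }

        DataOpener get(String uriScheme) {
            return byScheme.get(uriScheme);
        }
    }

    interface DataSource {
        DataOpener getDataOpener();
    }

    /** A source that shares the "gnowsis" scheme with others, so it brings its own opener. */
    static final class OutlookSource implements DataSource {
        @Override
        public DataOpener getDataOpener() {
            return uri -> "outlook opener: " + uri;
        }
    }

    /** A source without its own opener: it consults the registry internally. */
    static final class GenericSource implements DataSource {
        private final String uriScheme;
        private final DataOpenerRegistry registry;

        GenericSource(String uriScheme, DataOpenerRegistry registry) {
            this.uriScheme = uriScheme;
            this.registry = registry;
        }

        @Override
        public DataOpener getDataOpener() {
            return registry.get(uriScheme);
        }
    }

    public static void main(String[] args) {
        DataOpenerRegistry registry = new DataOpenerRegistry();
        registry.add("file", uri -> "file opener: " + uri);

        DataSource outlook = new OutlookSource();
        DataSource files = new GenericSource("file", registry);
        System.out.println(outlook.getDataOpener().open("gnowsis://outlook/mail/1"));
        System.out.println(files.getDataOpener().open("file:/tmp/a.txt"));
    }
}
```

With this shape, two sources that share a URI scheme can still return different openers, while lookup by scheme remains available through the registry.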

This blocks:

build.xml

Remove the "init" target; everything in there can be top-level.

  • simplifies the ant file.

old aperture pages