wiki:ApertureArchives

Context Navigation

Version 12 (modified by sauermann, 19 years ago) (diff)
--

Opened up again: zip files and so
Archives
1. Supported Formats
2. Opening Resources

Opened up again: zip files and so

on the discussion about "how to crawl zip files" going on around Jan 2007:

Solutions:

CompoundObjectProcessor

Leo: how about naming it "Sub-Crawler" or "MicroCrawler". This is, a crawler that is crawling inside a bigger crawl process to crawl sub-resources.

apply a Crawler on a DataSource, producing a queue of DataObjects.
for every DataObject in this set:
- determine the MIME type of the stream
- see if there is a CompoundObjectProcessor impl for this MIME type. if yes:
  - apply the CompoundObjectProcessor on this DataObject and put all resulting DataObjects in the queue
  if no:
  - see if there is an Extractor impl for this MIME type and if so, apply it on the DataObject

The CompoundObjectProcessor could be given an AccessData instance, just like Crawler, to make incremental crawling of such objects possible.

Giving the CompoundObjectProcessor a DataObject rather than, say, an InputStream allows it to add container-specific metadata for the archive itself (#entries, uncompressed size, etc) and to retrieve metadata it may require (e.g. the name of the archive file).

Pro:

Leo: could handle most prolbems

Con:

Leo: When you have the file extension ".xml", there is a billion choices how to extract the info from it.

Vote:

Leo: +

Merge Crawler and Exctractor

Alternative: find a way to generalize the Crawler and Extractor APIs into one XYZ API: you put a source description in and it produces DataObjects that get processed recursively and exhaustively. Feels a bit tricky and over-generalization to me but I wanted to mention it, perhaps someone has good ideas in this direction.

Pro:

Con:

Leo: that would make it soo generic that it is useless.

Vote:

Leo: -

Let Extractor do more

The Extractor interface was designed to return more than one resource anyway. It can do that by wrapping them inside the RDFContainer, we have done that with addresses in e-mails already, using anonymous nodes or URI nodes in between (for sender/cc).

Extractor can return a bigger RDF graph inside one RDF Container (which works already), but the RDFContainer could be extended with a list of resources contained within. The list can be done either using RDF metadata (x aperture:isContainedIn y) or with a Java list.

Pro:

Leo: works today

Con:

Leo: hard to optimize Lucene index afterwards

Context Navigation

Opened up again: zip files and so

CompoundObjectProcessor

Merge Crawler and Exctractor

Let Extractor do more

Archives

Supported Formats

Opening Resources

Download in other formats: