[[PageOutline]]
= Opened up again: zip files and so on =

Picking up the discussion about "how to crawl zip files" that was going on around Jan 2007:

Solutions:
== CompoundObjectProcessor ==

Leo: how about naming it "Sub-Crawler" or "MicroCrawler"? That is, a crawler that runs inside a bigger crawl process to crawl sub-resources.

 * apply a Crawler on a DataSource, producing a queue of DataObjects.
 * for every DataObject in this queue:
   * determine the MIME type of the stream
   * see if there is a CompoundObjectProcessor implementation for this MIME type:
     * if yes: apply the CompoundObjectProcessor on this DataObject and put all resulting DataObjects in the queue
     * if no: see if there is an Extractor implementation for this MIME type and, if so, apply it on the DataObject
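In code, this loop could look roughly like the following sketch. The CompoundObjectProcessor interface and the registry lookups are hypothetical (invented for this sketch); DataObject and Extractor are simplified stand-ins for the real Aperture types:

{{{
#!java
import java.util.List;
import java.util.Queue;

interface DataObject { }              // simplified stand-in for Aperture's DataObject
interface Extractor {                 // simplified stand-in; the real Extractor takes
    void extract(DataObject object);  // an id, InputStream, charset, MIME type and RDFContainer
}

// hypothetical interface for this proposal: unpacks a compound
// DataObject (zip, mbox, ...) into the DataObjects it contains
interface CompoundObjectProcessor {
    List<DataObject> process(DataObject container);
}

class SubCrawlLoop {

    // simplified registry lookups, assumed for this sketch
    CompoundObjectProcessor processorFor(String mimeType) { return null; }
    Extractor extractorFor(String mimeType) { return null; }
    String mimeTypeOf(DataObject object) { return "application/zip"; }

    void process(Queue<DataObject> queue) {
        while (!queue.isEmpty()) {
            DataObject object = queue.poll();
            String mimeType = mimeTypeOf(object);

            CompoundObjectProcessor processor = processorFor(mimeType);
            if (processor != null) {
                // compound object: unpack it and feed the children back
                // into the same queue, so nesting (zip in zip) is handled
                queue.addAll(processor.process(object));
            } else {
                Extractor extractor = extractorFor(mimeType);
                if (extractor != null) {
                    // plain object: extract its full-text and metadata
                    extractor.extract(object);
                }
            }
        }
    }
}
}}}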

The CompoundObjectProcessor could be given an AccessData instance, just like Crawler, to make incremental crawling of such objects possible.
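For example, incremental handling could boil down to a date check per archive entry; a rough sketch, where the AccessData stand-in and its method names are assumptions, not the real API:

{{{
#!java
// simplified stand-in for Aperture's AccessData: a persistent
// key-value store keyed by resource id (method names assumed)
interface AccessData {
    String get(String id, String key);
    void put(String id, String key, String value);
}

class IncrementalCheck {
    // returns true when the entry's stored modification date still
    // matches, so extraction can be skipped for this crawl
    boolean isUnchanged(AccessData accessData, String entryId, long lastModified) {
        String stored = accessData.get(entryId, "date");
        String current = String.valueOf(lastModified);
        if (current.equals(stored)) {
            return true;
        }
        accessData.put(entryId, "date", current);  // remember for the next crawl
        return false;
    }
}
}}}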

Giving the CompoundObjectProcessor a DataObject rather than, say, an InputStream allows it to add container-specific metadata for the archive itself (#entries, uncompressed size, etc.) and to retrieve metadata it may require (e.g. the name of the archive file).
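A sketch of both directions, with an invented property vocabulary and a simplified RDFContainer stand-in:

{{{
#!java
// simplified stand-in for Aperture's RDFContainer (the metadata
// container attached to every DataObject)
interface RDFContainer {
    void add(String propertyUri, String value);
    String getString(String propertyUri);
}

class ArchiveMetadata {
    void describe(RDFContainer metadata, int entryCount, long uncompressedSize) {
        // add container-specific metadata about the archive itself
        // (property URIs invented for this sketch)
        metadata.add("aperture:entryCount", String.valueOf(entryCount));
        metadata.add("aperture:uncompressedSize", String.valueOf(uncompressedSize));
    }

    String archiveFileName(RDFContainer metadata) {
        // retrieve metadata the processor may need, e.g. the file name
        return metadata.getString("aperture:fileName");
    }
}
}}}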

Pro:
 * Leo: could handle most problems

Con:
 * Leo: when you have the file extension ".xml", there are a billion choices of how to extract the info from it.

Vote:
 * Leo: +

== Merge Crawler and Extractor ==

Alternative: find a way to generalize the Crawler and Extractor APIs into one XYZ API: you put a source description in and it produces DataObjects that get processed recursively and exhaustively. Feels a bit tricky and like an over-generalization to me, but I wanted to mention it; perhaps someone has good ideas in this direction.

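If someone wants to explore it, the merged API might look roughly like this (all names invented for the sketch):

{{{
#!java
import java.util.Iterator;

interface DataObject { }  // simplified stand-in for Aperture's DataObject

// invented sketch of the merged API: one abstraction that turns any
// source description (a DataSource, a zip archive, a mail folder, ...)
// into DataObjects, which can then be fed back in recursively
interface ResourceProducer {
    Iterator<DataObject> produce(Object sourceDescription);
}
}}}

The untyped Object source description already hints at the genericity problem noted below.
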
Pro:

Con:
 * Leo: that would make it so generic that it is useless.

Vote:
 * Leo: -

== Let Extractor do more ==

The Extractor interface was designed to return more than one resource anyway. It can do that by wrapping them inside the RDFContainer; we have done that with addresses in e-mails already, using anonymous nodes or URI nodes in between (for sender/cc).

Extractor can return a bigger RDF graph inside one RDFContainer (which works already), but the RDFContainer could be extended with a list of the resources contained within. The list could be kept either as RDF metadata (x aperture:isContainedIn y) or as a Java list.

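A sketch of the two bookkeeping options, again with a simplified stand-in and an invented aperture:isContainedIn property:

{{{
#!java
import java.util.ArrayList;
import java.util.List;

// simplified stand-in for an RDFContainer extended with a list of
// the resources contained within (option 2)
class ExtendedContainer {
    final String describedUri;
    final List<String> containedResources = new ArrayList<String>();

    ExtendedContainer(String describedUri) { this.describedUri = describedUri; }

    void addStatement(String subject, String predicate, String object) { /* ... */ }

    void addContainedResource(String childUri) {
        // option 1: record containment as RDF metadata
        addStatement(childUri, "aperture:isContainedIn", describedUri);
        // option 2: also keep a plain Java list for cheap iteration
        containedResources.add(childUri);
    }
}
}}}
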
Pro:
 * Leo: works today

Con:
 * Leo: hard to optimize the Lucene index afterwards