DEPRECATED STUFF, DON'T READ IT
Opened up again: zip files and so on; see the discussion about "how to crawl zip files" going on around Jan 2007:
Leo: how about naming it "Sub-Crawler" or "MicroCrawler". That is, a crawler that runs inside a bigger crawl process to crawl sub-resources.
- apply a Crawler on a DataSource, producing a queue of DataObjects.
- for every DataObject in this set:
- determine the MIME type of the stream
- see if there is a CompoundObjectProcessor impl for this MIME type.
- apply the CompoundObjectProcessor on this DataObject and put all resulting DataObjects in the queue
- see if there is an Extractor impl for this MIME type and if so, apply it on the DataObject
The CompoundObjectProcessor could be given an AccessData instance, just like Crawler, to make incremental crawling of such objects possible.
Giving the CompoundObjectProcessor a DataObject rather than, say, an InputStream allows it to add container-specific metadata for the archive itself (#entries, uncompressed size, etc) and to retrieve metadata it may require (e.g. the name of the archive file).
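The queue-driven loop sketched in the steps above could look roughly like this. This is a hedged sketch only: `DataObject`, the MIME-type lookup, and the `CompoundObjectProcessor`/`Extractor` dispatch are reduced to minimal stand-ins (the real Aperture interfaces are richer), and the "extraction" step just records the object's URI.

```java
import java.util.*;

// Minimal stand-ins for the Aperture types named above; the real interfaces
// are richer, this only sketches the queue-driven dispatch.
class DataObject {
    final String uri;
    final String mimeType;
    final List<DataObject> children; // what a CompoundObjectProcessor would produce

    DataObject(String uri, String mimeType, DataObject... children) {
        this.uri = uri;
        this.mimeType = mimeType;
        this.children = Arrays.asList(children);
    }
}

public class SubCrawlSketch {
    // Drains the queue as described above and returns the URIs of every object
    // that would be handed to an Extractor.
    static List<String> process(List<DataObject> crawled, Set<String> compoundTypes) {
        List<String> extracted = new ArrayList<>();
        Deque<DataObject> queue = new ArrayDeque<>(crawled);
        while (!queue.isEmpty()) {
            DataObject obj = queue.removeFirst();
            // if a CompoundObjectProcessor exists for this MIME type,
            // enqueue the sub-resources it yields
            if (compoundTypes.contains(obj.mimeType)) {
                queue.addAll(obj.children);
            }
            // if an Extractor exists for this MIME type, apply it
            // (here: just record the URI)
            extracted.add(obj.uri);
        }
        return extracted;
    }

    public static void main(String[] args) {
        DataObject doc = new DataObject("file:big.zip::doc/doc1.doc", "application/msword");
        DataObject zip = new DataObject("file:big.zip", "application/zip", doc);
        System.out.println(process(List.of(zip), Set.of("application/zip")));
        // prints [file:big.zip, file:big.zip::doc/doc1.doc]
    }
}
```

Note that the compound object itself still goes through the Extractor step, matching the steps above, so the archive gets its own metadata as well as its children's.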
- Leo: could handle most problems
- Leo: When you have the file extension ".xml", there are a billion choices for how to extract the info from it.
- Leo: +
Merge Crawler and Extractor
Alternative: find a way to generalize the Crawler and Extractor APIs into one XYZ API: you put a source description in and it produces DataObjects that get processed recursively and exhaustively. Feels a bit tricky and like over-generalization to me, but I wanted to mention it; perhaps someone has good ideas in this direction.
- Leo: that would make it so generic that it is useless.
- Leo: -
Let Extractor do more
The Extractor interface was designed to return more than one resource anyway. It can do that by wrapping them inside the RDFContainer; we have done that with addresses in e-mails already, using anonymous nodes or URI nodes in between (for sender/cc).
Extractor can return a bigger RDF graph inside one RDF Container (which works already), but the RDFContainer could be extended with a list of resources contained within. The list can be done either using RDF metadata (x aperture:isContainedIn y) or with a Java list.
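The RDF-metadata variant of the containment list could be queried roughly as below. This is a hedged sketch: the real RDFContainer API is not shown here, so the graph is modeled as plain (subject, predicate, object) triples, and only the proposed `x aperture:isContainedIn y` pattern is assumed.

```java
import java.util.*;

// Plain triple stand-in for statements in the RDFContainer's graph.
record Triple(String subject, String predicate, String object) {}

public class ContainmentSketch {
    // Collect all resources recorded as contained in the given container,
    // following the proposed "x aperture:isContainedIn y" pattern.
    static List<String> containedResources(List<Triple> graph, String container) {
        List<String> result = new ArrayList<>();
        for (Triple t : graph) {
            if (t.predicate().equals("aperture:isContainedIn")
                    && t.object().equals(container)) {
                result.add(t.subject());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Triple> graph = List.of(
            new Triple("mailto:alice@example.org", "aperture:isContainedIn", "email:msg1"),
            new Triple("email:msg1", "rdf:type", "aperture:Email"));
        System.out.println(containedResources(graph, "email:msg1"));
        // prints [mailto:alice@example.org]
    }
}
```

The alternative, a plain Java list on the RDFContainer, would make this lookup a simple getter at the cost of duplicating information that is already in the graph.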
- Leo: works today
- Leo: hard to optimize Lucene index afterwards
Some functionality that is still missing but that we at Aduna would really like to have (customer demand!) is support for handling archives such as zip, gzip and rar files.
The interface for doing archive extraction will probably be a mixture of Extractor and DataSource/DataCrawler. On the one hand they will be mimetype-specific and will operate on an InputStream/DataObject, just like Extractor; on the other hand, they deliver a stream of new DataObjects.
I think it's best to let it operate on a DataObject, as a gzip stream typically does not carry a file name for its contents, whereas people probably expect the reported archived file to have a file name equal to its parent's file name minus the .z/.gz extension.
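The naming expectation above could be implemented with a small helper like the following. This is a hypothetical sketch; a real version would live in the gzip processor and would need access to the parent DataObject's file name, which is exactly why operating on a DataObject rather than a bare InputStream helps.

```java
// Hypothetical helper: derive the contained file's name from the archive's
// own name by stripping the .gz/.z suffix, as users would expect.
public class GzipNaming {
    static String contentName(String archiveName) {
        if (archiveName.endsWith(".gz")) {
            return archiveName.substring(0, archiveName.length() - 3);
        }
        if (archiveName.endsWith(".z")) {
            return archiveName.substring(0, archiveName.length() - 2);
        }
        // no known suffix: fall back to the archive's own name
        return archiveName;
    }

    public static void main(String[] args) {
        System.out.println(contentName("report.txt.gz")); // prints report.txt
    }
}
```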
A URI scheme also has to be developed for such nested objects, so that you can identify a stream packed inside an archive.
A possible solution is to concatenate URIs using a separator. If :: is the separator, a file inside a zip could be identified using file://c:/mydocs/big.zip::doc/doc1.doc. We should also take a look at the URLs used by the Java ClassLoader (e.g. ClassLoader.findResource) for pointing to resources in a jar file. They have the same problem; maybe we can use their solution as well.
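Assuming the :: separator proposed above, building and parsing such nested URIs could look like this. The `nest`/`split` names are made up for illustration; splitting at the last separator is one possible design, chosen so a doubly nested entry first resolves against its innermost container. (For comparison, Java's jar URLs solve the same problem with `!/` as separator, e.g. `jar:file:/path/lib.jar!/com/Foo.class`.)

```java
public class NestedUris {
    static final String SEP = "::";

    // file://c:/mydocs/big.zip + doc/doc1.doc
    //   -> file://c:/mydocs/big.zip::doc/doc1.doc
    static String nest(String containerUri, String entryPath) {
        return containerUri + SEP + entryPath;
    }

    // Split at the LAST separator, so for archives nested in archives the
    // first component is itself a nested URI that can be split again.
    static String[] split(String uri) {
        int i = uri.lastIndexOf(SEP);
        if (i < 0) {
            return new String[] { uri };
        }
        return new String[] { uri.substring(0, i), uri.substring(i + SEP.length()) };
    }

    public static void main(String[] args) {
        String uri = nest("file://c:/mydocs/big.zip", "doc/doc1.doc");
        System.out.println(uri);
        // prints file://c:/mydocs/big.zip::doc/doc1.doc
        System.out.println(split(uri)[1]);
        // prints doc/doc1.doc
    }
}
```

An open question with any separator scheme is escaping: the separator must not legally occur in the container URI or entry path, or must be escaped there.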
The File-DataSource has to be configurable as to whether it supports nested ZIP archives or not, and whether certain paths have to be ignored. This is also tricky. By default, I would suggest leaving it ON so that users are pleasantly surprised by good search results; they can turn it OFF to save space.
Support for zip and gzip is probably trivial, as these formats are already accessible through java.util.zip.
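For instance, the entry loop a zip processor would be built around is just a few lines with java.util.zip; in the real thing, each non-directory entry would become a child DataObject instead of a name in a list. The example builds a tiny zip in memory so it is self-contained.

```java
import java.io.*;
import java.util.*;
import java.util.zip.*;

public class ZipListing {
    // Enumerate the file entries of a zip stream; a zip processor would turn
    // each of these into a child DataObject.
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                }
                zip.closeEntry();
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny zip in memory so the example is self-contained.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buf)) {
            out.putNextEntry(new ZipEntry("doc/doc1.doc"));
            out.write("hello".getBytes());
            out.closeEntry();
        }
        System.out.println(listEntries(new ByteArrayInputStream(buf.toByteArray())));
        // prints [doc/doc1.doc]
    }
}
```

Since ZipInputStream itself only needs an InputStream, the same loop works on an entry of an outer archive, which is what makes the recursive case feasible.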
Opening these resources can also get rather tricky, e.g. how to open a text file in a zip file on a website. Another possibility is recursion: a file inside an archive inside another archive. Good thinking required!