Version 11 (modified by anonymous, 19 years ago)

Archives

One feature that is still missing, and that we at Aduna would really like to have (customer demand!), is support for handling archives such as zip, gzip and rar files.

The interface for doing archive extraction will probably be a mixture of Extractor and DataSource/DataCrawler. On the one hand, implementations will be mimetype-specific and will operate on an InputStream/DataObject, just like Extractor. On the other hand, they deliver a stream of new DataObjects.
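To make the idea a bit more concrete, here is a minimal sketch of what such a mixed interface could look like. All names in it (ArchiveExpander, ArchiveEntrySink, ArchiveExpanderRegistry) are hypothetical and are not part of Aperture; plain strings and streams stand in for DataObjects to keep the sketch self-contained.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback that receives each object found inside an archive. */
interface ArchiveEntrySink {
    void entry(String uri, InputStream content) throws IOException;
}

/**
 * Hypothetical archive handler: mimetype-specific like an Extractor,
 * but it reports a series of new objects like a crawler does.
 */
interface ArchiveExpander {
    boolean supports(String mimeType);

    void expand(String parentUri, InputStream in, ArchiveEntrySink sink) throws IOException;
}

/** Picks the first registered expander that supports a given MIME type. */
final class ArchiveExpanderRegistry {
    private final List<ArchiveExpander> expanders = new ArrayList<>();

    void register(ArchiveExpander expander) {
        expanders.add(expander);
    }

    /** Returns a matching expander, or null if the type is not a supported archive. */
    ArchiveExpander find(String mimeType) {
        for (ArchiveExpander expander : expanders) {
            if (expander.supports(mimeType)) {
                return expander;
            }
        }
        return null;
    }
}
```

A crawler could ask the registry for an expander after MIME type detection and fall back to normal text extraction when find returns null.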

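I think it's best to let it operate on a DataObject rather than a bare InputStream, because the DataObject gives access to the parent's file name. I expect that a gzip stream cannot be relied on to contain a file name for its contents (the name field in the gzip header is optional), whereas people probably expect the reported archived file to have a file name equal to its parent's file name minus the .z/.gz extension.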
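The naming rule for gzip contents could be sketched as follows. The class and method names are made up for illustration, and returning the parent name unchanged when no known extension is present is an assumption, not a decision taken on this page.

```java
final class GzipNames {
    // Extensions named in the text above; matched case-insensitively.
    private static final String[] EXTENSIONS = {".gz", ".z"};

    private GzipNames() {
    }

    /** Derives the name of the unpacked content from the parent's file name. */
    static String unpackedName(String parentName) {
        for (String ext : EXTENSIONS) {
            int offset = parentName.length() - ext.length();
            // offset > 0 keeps names like ".gz" intact instead of returning "".
            if (offset > 0 && parentName.regionMatches(true, offset, ext, 0, ext.length())) {
                return parentName.substring(0, offset);
            }
        }
        // Assumption: without a known extension, reuse the parent's name.
        return parentName;
    }
}
```

For example, a parent called report.txt.gz would yield a child called report.txt.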

A URI scheme also has to be developed for such nested objects, so that you can identify a stream packed inside an archive.

A possible solution is to concatenate URIs using a separator. If :: is the separator, a file inside a zip could be identified using file://c:/mydocs/big.zip::doc/doc1.doc. We should also take a look at the URLs used by the Java ClassLoader (e.g. ClassLoader.findResource) for pointing to resources in a jar file. They have the same problem, so maybe we can use their solution as well.
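A minimal sketch of the separator idea, assuming :: as the separator and hypothetical helper names (NestedUri, child, split). It does no escaping, so it is only meant to show the shape of the scheme.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class NestedUri {
    // Separator proposed in the text above.
    static final String SEPARATOR = "::";

    private NestedUri() {
    }

    /** Builds the identifier of an entry packed inside the given parent. */
    static String child(String parentUri, String entryPath) {
        return parentUri + SEPARATOR + entryPath;
    }

    /** Splits a nested identifier into the outermost URI followed by the entry paths. */
    static List<String> split(String nestedUri) {
        return Arrays.asList(nestedUri.split(Pattern.quote(SEPARATOR), -1));
    }
}
```

Here NestedUri.child("file://c:/mydocs/big.zip", "doc/doc1.doc") gives the identifier from the example above, and split turns it back into its two parts. One caveat: a URI or entry name that itself contains :: (an IPv6 literal such as http://[::1]/ does) would be split in the wrong place, so a real scheme needs escaping. For comparison, Java's own jar URLs have the form jar:<url>!/<entry>, with !/ as the separator.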

The File-DataSource has to be configurable as to whether it descends into nested ZIP archives or not, and whether certain paths have to be ignored. This is also tricky. By default, I would suggest leaving it ON, so that users are pleasantly surprised by good search results, and letting them turn it OFF to save space.

Supported Formats

Support for zip and gzip is probably trivial, as these formats are already accessible through java.util.zip.
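As a rough illustration of how little code this needs, the sketch below lists the file entries of a zip stream and decompresses a gzip stream using only java.util.zip. The wrapper class ArchiveReading and its method names are hypothetical, and reading the whole gzip content into memory is a simplification that a real crawler would probably avoid for large files.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class ArchiveReading {
    private ArchiveReading() {
    }

    /** Returns the names of all non-directory entries in a zip stream, in archive order. */
    static List<String> zipEntryNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    /** Decompresses a gzip stream completely into a byte array. */
    static byte[] gunzip(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(in)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        return out.toByteArray();
    }
}
```

Both methods close the stream they are given, which is fine for a top-level file but would have to change when reading an archive nested inside another one.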

Rar is another format we encounter sometimes. As far as I know there is no Java library available for it. It is an open format though, i.e. the specs are available (link1, link2).

Opening Resources

Opening these resources can also get rather tricky, e.g. how do you open a text file in a zip file on a website? Another complication is recursion: a file inside an archive inside another archive. Good thinking required!
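For the recursive case, a sketch along the following lines might work for zip inside zip. It assumes the :: separator discussed under the URI scheme, detects nested archives by the .zip file extension only, and uses a made-up depth limit (MAX_DEPTH) as a guard against pathological archives. The class name NestedZipWalker is hypothetical as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class NestedZipWalker {
    // Hypothetical limit on how deep nested archives are followed.
    static final int MAX_DEPTH = 5;

    private NestedZipWalker() {
    }

    /** Lists identifiers of all leaf entries; the caller stays responsible for closing in. */
    static List<String> walk(InputStream in, String uri) throws IOException {
        List<String> found = new ArrayList<>();
        walk(new ZipInputStream(in), uri, 0, found);
        return found;
    }

    private static void walk(ZipInputStream zip, String uri, int depth, List<String> found)
            throws IOException {
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String childUri = uri + "::" + entry.getName();
            boolean isZip = entry.getName().toLowerCase(Locale.ROOT).endsWith(".zip");
            if (isZip && depth < MAX_DEPTH) {
                // The outer stream ends at the current entry, so it can be read as a zip itself.
                // The inner stream is deliberately not closed: closing it would close the outer one.
                walk(new ZipInputStream(zip), childUri, depth + 1, found);
            } else {
                // Plain files, and archives beyond the depth limit, are reported as leaves.
                found.add(childUri);
            }
        }
    }
}
```

The same pattern should apply to a zip file fetched from a website, since the walker only needs an InputStream and the URI of the outermost resource.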