Changes between Version 11 and Version 12 of ApertureArchives

Timestamp: 01/08/07 10:17:45
Author: sauermann

[[PageOutline]]
= Opened up again: zip files and so on =

Following up on the discussion about "how to crawl zip files" that was going on around Jan 2007.

Solutions:

== CompoundObjectProcessor ==

Leo: how about naming it "Sub-Crawler" or "MicroCrawler"? That is, a crawler that runs inside a bigger crawl process to crawl sub-resources.

The proposed processing loop (see the sketch below):

 * apply a Crawler on a DataSource, producing a queue of DataObjects.
 * for every DataObject in this queue:
   - determine the MIME type of the stream
   - see if there is a CompoundObjectProcessor impl for this MIME type:
     - if yes: apply the CompoundObjectProcessor on this DataObject and put all resulting DataObjects back in the queue
     - if no: see if there is an Extractor impl for this MIME type and, if so, apply it on the DataObject
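
A rough Java sketch of this loop. All types here are simplified stand-ins for the real Aperture interfaces, and the Registry lookups are invented for illustration only:

{{{
#!java
import java.util.List;
import java.util.Queue;

// All types are simplified stand-ins for Aperture's real interfaces;
// they only illustrate the proposed dispatch loop.
interface DataObject { String getMimeType(); }

interface CompoundObjectProcessor {
    // unpacks a container object (zip, tar, mbox, ...) into child DataObjects
    List<DataObject> process(DataObject container);
}

interface Extractor {
    // extracts full-text and metadata into the object's RDFContainer
    void extract(DataObject object);
}

interface Registry {
    CompoundObjectProcessor getCompoundObjectProcessor(String mimeType); // null if none
    Extractor getExtractor(String mimeType);                             // null if none
}

class CrawlLoop {
    static void drain(Queue<DataObject> queue, Registry registry) {
        while (!queue.isEmpty()) {
            DataObject object = queue.poll();
            String mimeType = object.getMimeType();

            CompoundObjectProcessor cop = registry.getCompoundObjectProcessor(mimeType);
            if (cop != null) {
                // container: children go back into the queue, so nested
                // archives (a zip inside a zip) are handled recursively
                queue.addAll(cop.process(object));
            } else {
                Extractor extractor = registry.getExtractor(mimeType);
                if (extractor != null) {
                    extractor.extract(object);
                }
            }
        }
    }
}
}}}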

The CompoundObjectProcessor could be given an AccessData instance, just like Crawler, to make incremental crawling of such objects possible.

Giving the CompoundObjectProcessor a DataObject rather than, say, an InputStream allows it to add container-specific metadata for the archive itself (#entries, uncompressed size, etc.) and to retrieve metadata it may require (e.g. the name of the archive file).
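
For example, a zip implementation could record such container metadata while unpacking. A sketch, again with an assumed stand-in DataObject; the aperture:entryCount and aperture:uncompressedSize property names are invented:

{{{
#!java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Stand-in for the real Aperture DataObject, reduced to what the sketch needs.
interface DataObject {
    InputStream getStream() throws IOException;
    DataObject createChild(String entryName, InputStream content) throws IOException;
    void putMetadata(String property, String value);
}

class ZipCompoundObjectProcessor {
    List<DataObject> process(DataObject container) throws IOException {
        List<DataObject> children = new ArrayList<DataObject>();
        long entryCount = 0;
        long uncompressedSize = 0;

        try (ZipInputStream zip = new ZipInputStream(container.getStream())) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                entryCount++;
                // getSize() returns -1 when the size is not recorded in the entry
                uncompressedSize += Math.max(entry.getSize(), 0);
                // a real implementation must buffer the entry's bytes here,
                // since the stream is invalidated by the next getNextEntry()
                children.add(container.createChild(entry.getName(), zip));
            }
        }

        // container-specific metadata on the archive's own DataObject
        container.putMetadata("aperture:entryCount", Long.toString(entryCount));
        container.putMetadata("aperture:uncompressedSize", Long.toString(uncompressedSize));
        return children;
    }
}
}}}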

Pro:
 * Leo: could handle most problems

Con:
 * Leo: when you have the file extension ".xml", there are a billion choices for how to extract the info from it.

Vote:
 * Leo: +

== Merge Crawler and Extractor ==

Alternative: find a way to generalize the Crawler and Extractor APIs into one XYZ API: you put a source description in and it produces DataObjects that get processed recursively and exhaustively (see the interface sketch below). It feels a bit tricky and like over-generalization to me, but I wanted to mention it; perhaps someone has good ideas in this direction.
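
A rough sketch of what such a unified API might look like; all names here (DataProducer, SourceDescription, the methods) are invented for illustration, not real Aperture interfaces:

{{{
#!java
import java.util.Iterator;

interface SourceDescription { String getIdentifier(); }

interface DataObject {
    // a produced object can itself be treated as a new source,
    // which is what makes the processing recursive
    SourceDescription asSource();
}

interface DataProducer {
    // whether this producer can handle the given source description
    // (a DataSource, a MIME type, a URL scheme, ...)
    boolean accepts(SourceDescription source);

    // produce DataObjects; each result is fed back through the set of
    // registered producers until nothing accepts it anymore
    Iterator<DataObject> produce(SourceDescription source);
}
}}}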

Pro:

Con:
 * Leo: that would make it so generic that it is useless.

Vote:
 * Leo: -

== Let Extractor do more ==

The Extractor interface was designed to return more than one resource anyway. It can do that by wrapping them inside the RDFContainer; we have already done that with addresses in e-mails, using anonymous nodes or URI nodes in between (for sender/cc).

Extractor can return a bigger RDF graph inside one RDFContainer (which works already), but the RDFContainer could be extended with a list of the resources contained within. The list can be done either using RDF metadata (x aperture:isContainedIn y) or with a Java list (see the sketch below).
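
A sketch of the two variants combined; ExtendedRDFContainer and its methods are invented, and aperture:isContainedIn is the property name proposed above:

{{{
#!java
import java.util.ArrayList;
import java.util.List;

// Invented sketch: a container that stores the extracted RDF statements
// and also keeps a Java list of the contained resources, so a consumer
// (e.g. a Lucene indexer) can iterate them without querying the graph
// for aperture:isContainedIn triples.
class ExtendedRDFContainer {
    private final String describedUri;              // the crawled resource
    private final List<String> containedResources = new ArrayList<String>();

    ExtendedRDFContainer(String describedUri) { this.describedUri = describedUri; }

    void addContainedResource(String resourceUri) {
        containedResources.add(resourceUri);
        // mirror the list in RDF so both access paths stay consistent
        addStatement(resourceUri, "aperture:isContainedIn", describedUri);
    }

    List<String> getContainedResources() { return containedResources; }

    void addStatement(String subject, String predicate, String object) {
        // stand-in: a real implementation would write into an RDF model
        System.out.println(subject + " " + predicate + " " + object);
    }
}
}}}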

Pro:
 * Leo: works today

Con:
 * Leo: hard to optimize the Lucene index afterwards

= Archives =