[[PageOutline]]
= Opened up again: zip files and so on =

On the discussion about "how to crawl zip files" that took place around Jan 2007.

Solutions:

== CompoundObjectProcessor ==

Leo: how about naming it "Sub-Crawler" or "MicroCrawler"? That is, a crawler that runs inside a bigger crawl process to crawl sub-resources.

 * apply a Crawler on a DataSource, producing a queue of DataObjects
 * for every DataObject in this queue:
   * determine the MIME type of the stream
   * see if there is a CompoundObjectProcessor impl for this MIME type:
     * if yes: apply the CompoundObjectProcessor on this DataObject and put all resulting DataObjects in the queue
     * if no: see if there is an Extractor impl for this MIME type and, if so, apply it on the DataObject
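The dispatch loop above can be sketched as follows. This is a minimal illustration, not the real Aperture API: all interface names here (`DataObject`, `CompoundObjectProcessor`, `Extractor`) are hypothetical stand-ins reduced to the one method the loop needs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Hypothetical minimal stand-ins for the types discussed above.
interface DataObject { String mimeType(); }

interface CompoundObjectProcessor {
    // Unpacks a container object (e.g. a zip) into its child DataObjects.
    List<DataObject> process(DataObject container);
}

interface Extractor {
    // Extracts metadata/full text from a leaf object.
    void extract(DataObject object);
}

public class SubCrawlerSketch {
    // The loop described above: pop an object, prefer a
    // CompoundObjectProcessor for its MIME type, fall back to an Extractor.
    static int crawl(Deque<DataObject> queue,
                     Map<String, CompoundObjectProcessor> processors,
                     Map<String, Extractor> extractors) {
        int extracted = 0;
        while (!queue.isEmpty()) {
            DataObject obj = queue.poll();
            CompoundObjectProcessor cop = processors.get(obj.mimeType());
            if (cop != null) {
                queue.addAll(cop.process(obj));   // children re-enter the queue
            } else {
                Extractor ex = extractors.get(obj.mimeType());
                if (ex != null) {
                    ex.extract(obj);
                    extracted++;
                }
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        // A zip containing two plain-text entries.
        DataObject zip = () -> "application/zip";
        DataObject a = () -> "text/plain";
        DataObject b = () -> "text/plain";

        Deque<DataObject> queue = new ArrayDeque<>(List.of(zip));
        Map<String, CompoundObjectProcessor> processors =
                Map.of("application/zip", container -> List.of(a, b));
        Map<String, Extractor> extractors =
                Map.of("text/plain", object -> { /* extract text here */ });

        int n = crawl(queue, processors, extractors);
        if (n != 2) throw new AssertionError("expected 2 extracted, got " + n);
        System.out.println("extracted=" + n);
    }
}
```

Because the children are pushed back onto the same queue, nested containers (a zip inside a zip) are handled by the same loop with no extra code.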

The CompoundObjectProcessor could be given an AccessData instance, just like Crawler, to make incremental crawling of such objects possible.
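For the incremental case, the essential job of such an AccessData instance is to remember per-resource state between crawls. A minimal sketch, assuming a hypothetical timestamp-based API (the real AccessData interface may differ):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of AccessData: remembers a timestamp per resource so a
// CompoundObjectProcessor can skip unchanged archive entries on a re-crawl.
public class AccessDataSketch {
    private final Map<String, Long> lastSeen = new HashMap<>();

    // Returns true if the resource is new or changed since the last crawl,
    // and records the new timestamp either way.
    boolean isModified(String uri, long lastModified) {
        Long previous = lastSeen.put(uri, lastModified);
        return previous == null || previous != lastModified;
    }

    public static void main(String[] args) {
        AccessDataSketch data = new AccessDataSketch();
        if (!data.isModified("zip:/a.zip!/x.txt", 100L)) throw new AssertionError(); // first visit
        if (data.isModified("zip:/a.zip!/x.txt", 100L)) throw new AssertionError();  // unchanged: skip
        if (!data.isModified("zip:/a.zip!/x.txt", 200L)) throw new AssertionError(); // changed: re-process
        System.out.println("ok");
    }
}
```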

Giving the CompoundObjectProcessor a DataObject rather than, say, an InputStream allows it to add container-specific metadata for the archive itself (#entries, uncompressed size, etc.) and to retrieve metadata it may require (e.g. the name of the archive file).
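For the zip case specifically, that container-level metadata (#entries, uncompressed size) can be computed in one pass over the stream with plain `java.util.zip`; how the result is then attached to the DataObject is left open here:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipMetadataSketch {
    // Scans a zip stream and returns {entryCount, uncompressedSize} — the kind
    // of container-specific metadata a CompoundObjectProcessor could attach to
    // the archive's own DataObject.
    static long[] containerMetadata(InputStream in) throws IOException {
        long entries = 0, uncompressed = 0;
        byte[] buf = new byte[8192];
        try (ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                entries++;
                int n;
                while ((n = zin.read(buf)) != -1) {
                    uncompressed += n;   // count decompressed bytes
                }
            }
        }
        return new long[] { entries, uncompressed };
    }

    public static void main(String[] args) throws IOException {
        // Build a small in-memory zip with two entries.
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bout)) {
            zout.putNextEntry(new ZipEntry("a.txt"));
            zout.write("hello".getBytes());
            zout.putNextEntry(new ZipEntry("b.txt"));
            zout.write("world!".getBytes());
        }
        long[] meta = containerMetadata(new ByteArrayInputStream(bout.toByteArray()));
        if (meta[0] != 2) throw new AssertionError();
        if (meta[1] != 11) throw new AssertionError();   // 5 + 6 decompressed bytes
        System.out.println("entries=" + meta[0] + " uncompressedBytes=" + meta[1]);
    }
}
```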

Pro:
 * Leo: could handle most problems

Con:
 * Leo: when you have the file extension ".xml", there are a billion choices for how to extract the info from it.

Vote:
 * Leo: +

== Merge Crawler and Extractor ==

Alternative: find a way to generalize the Crawler and Extractor APIs into one XYZ API: you put a source description in and it produces DataObjects that get processed recursively and exhaustively. It feels a bit tricky and like over-generalization to me, but I wanted to mention it; perhaps someone has good ideas in this direction.

Pro:

Con:
 * Leo: that would make it so generic that it is useless.

Vote:
 * Leo: -

== Let Extractor do more ==

The Extractor interface was designed to return more than one resource anyway. It can do that by wrapping them inside the RDFContainer; we have done that with addresses in e-mails already, using anonymous nodes or URI nodes in between (for sender/cc).

Extractor can return a bigger RDF graph inside one RDFContainer (which works already), but the RDFContainer could be extended with a list of the resources contained within. The list can be kept either as RDF metadata (x aperture:isContainedIn y) or as a Java list.
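The two variants can also be combined: keep the containment triples as the authoritative metadata and mirror them in a Java list for cheap iteration. A minimal sketch — the class and field names are hypothetical, and triples are shown as plain string arrays rather than a real RDF model:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an RDFContainer extended with an explicit list of
// contained resources, mirrored as (x aperture:isContainedIn y) statements.
public class ContainerSketch {
    static final String IS_CONTAINED_IN = "aperture:isContainedIn";

    final String uri;                                           // the container itself
    final List<String[]> statements = new ArrayList<>();        // {subject, predicate, object}
    final List<String> containedResources = new ArrayList<>();  // the Java-list variant

    ContainerSketch(String uri) { this.uri = uri; }

    // Record containment both ways: as RDF metadata and as a plain list.
    void addContainedResource(String childUri) {
        statements.add(new String[] { childUri, IS_CONTAINED_IN, uri });
        containedResources.add(childUri);
    }

    public static void main(String[] args) {
        ContainerSketch zip = new ContainerSketch("zip:file:/docs.zip");
        zip.addContainedResource("zip:file:/docs.zip!/a.txt");
        zip.addContainedResource("zip:file:/docs.zip!/b.txt");
        if (zip.containedResources.size() != 2) throw new AssertionError();
        if (!zip.statements.get(0)[1].equals(IS_CONTAINED_IN)) throw new AssertionError();
        System.out.println(zip.containedResources);
    }
}
```

An explicit list like this is also what a consumer (e.g. a Lucene indexer) would need in order to delete or update all children when the archive changes.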

Pro:
 * Leo: works today

Con:
 * Leo: hard to optimize the Lucene index afterwards