[[PageOutline]]
= Opened up again: zip files and so on =

Picking up the discussion about "how to crawl zip files" that was going on around Jan 2007:

Solutions:
== CompoundObjectProcessor ==

Leo: how about naming it "Sub-Crawler" or "MicroCrawler"? That is, a crawler that runs inside a bigger crawl process to crawl sub-resources.

 * apply a Crawler on a DataSource, producing a queue of DataObjects.
 * for every DataObject in this queue:
   * determine the MIME type of the stream
   * see if there is a CompoundObjectProcessor implementation for this MIME type:
     * if yes: apply the CompoundObjectProcessor on this DataObject and put all resulting DataObjects in the queue
     * if no: see if there is an Extractor implementation for this MIME type and, if so, apply it on the DataObject
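In code, this loop could look roughly like the following sketch. The CompoundObjectProcessor interface and the registry lookups are hypothetical (invented for this sketch); DataObject and Extractor are simplified stand-ins for the real Aperture types:

{{{
#!java
import java.util.List;
import java.util.Queue;

interface DataObject { }              // simplified stand-in for Aperture's DataObject
interface Extractor {                 // simplified stand-in; the real Extractor takes
    void extract(DataObject object);  // an id, InputStream, charset, MIME type and RDFContainer
}

// hypothetical interface for this proposal: unpacks a compound
// DataObject (zip, mbox, ...) into the DataObjects it contains
interface CompoundObjectProcessor {
    List<DataObject> process(DataObject container);
}

class SubCrawlLoop {

    // simplified registry lookups, assumed for this sketch
    CompoundObjectProcessor processorFor(String mimeType) { return null; }
    Extractor extractorFor(String mimeType) { return null; }
    String mimeTypeOf(DataObject object) { return "application/zip"; }

    void process(Queue<DataObject> queue) {
        while (!queue.isEmpty()) {
            DataObject object = queue.poll();
            String mimeType = mimeTypeOf(object);

            CompoundObjectProcessor processor = processorFor(mimeType);
            if (processor != null) {
                // compound object: unpack it and feed the children back
                // into the same queue, so nesting (zip in zip) is handled
                queue.addAll(processor.process(object));
            } else {
                Extractor extractor = extractorFor(mimeType);
                if (extractor != null) {
                    // plain object: extract its full-text and metadata
                    extractor.extract(object);
                }
            }
        }
    }
}
}}}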

The CompoundObjectProcessor could be given an AccessData instance, just like Crawler, to make incremental crawling of such objects possible.
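For example, incremental handling could boil down to a date check per archive entry; a rough sketch, where the AccessData stand-in and its method names are assumptions, not the real API:

{{{
#!java
// simplified stand-in for Aperture's AccessData: a persistent
// key-value store keyed by resource id (method names assumed)
interface AccessData {
    String get(String id, String key);
    void put(String id, String key, String value);
}

class IncrementalCheck {
    // returns true when the entry's stored modification date still
    // matches, so extraction can be skipped for this crawl
    boolean isUnchanged(AccessData accessData, String entryId, long lastModified) {
        String stored = accessData.get(entryId, "date");
        String current = String.valueOf(lastModified);
        if (current.equals(stored)) {
            return true;
        }
        accessData.put(entryId, "date", current);  // remember for the next crawl
        return false;
    }
}
}}}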

Giving the CompoundObjectProcessor a DataObject rather than, say, an InputStream allows it to add container-specific metadata for the archive itself (#entries, uncompressed size, etc.) and to retrieve metadata it may require (e.g. the name of the archive file).
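A sketch of both directions, with an invented property vocabulary and a simplified RDFContainer stand-in:

{{{
#!java
// simplified stand-in for Aperture's RDFContainer (the metadata
// container attached to every DataObject)
interface RDFContainer {
    void add(String propertyUri, String value);
    String getString(String propertyUri);
}

class ArchiveMetadata {
    void describe(RDFContainer metadata, int entryCount, long uncompressedSize) {
        // add container-specific metadata about the archive itself
        // (property URIs invented for this sketch)
        metadata.add("aperture:entryCount", String.valueOf(entryCount));
        metadata.add("aperture:uncompressedSize", String.valueOf(uncompressedSize));
    }

    String archiveFileName(RDFContainer metadata) {
        // retrieve metadata the processor may need, e.g. the file name
        return metadata.getString("aperture:fileName");
    }
}
}}}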

Pro:
 * Leo: could handle most problems

Con:
 * Leo: when you have the file extension ".xml", there are a billion choices of how to extract the info from it.

Vote:
 * Leo: +

== Merge Crawler and Extractor ==

Alternative: find a way to generalize the Crawler and Extractor APIs into one XYZ API: you put a source description in and it produces DataObjects that get processed recursively and exhaustively. Feels a bit tricky and like an over-generalization to me, but I wanted to mention it; perhaps someone has good ideas in this direction.

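If someone wants to explore it, the merged API might look roughly like this (all names invented for the sketch):

{{{
#!java
import java.util.Iterator;

interface DataObject { }  // simplified stand-in for Aperture's DataObject

// invented sketch of the merged API: one abstraction that turns any
// source description (a DataSource, a zip archive, a mail folder, ...)
// into DataObjects, which can then be fed back in recursively
interface ResourceProducer {
    Iterator<DataObject> produce(Object sourceDescription);
}
}}}

The untyped Object source description already hints at the genericity problem noted below.
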
Pro:

Con:
 * Leo: that would make it so generic that it is useless.

Vote:
 * Leo: -

== Let Extractor do more ==

The Extractor interface was designed to return more than one resource anyway. It can do that by wrapping them inside the RDFContainer; we have done that with addresses in e-mails already, using anonymous nodes or URI nodes in between (for sender/cc).

Extractor can return a bigger RDF graph inside one RDFContainer (which works already), but the RDFContainer could be extended with a list of the resources contained within. The list could be kept either as RDF metadata (x aperture:isContainedIn y) or as a Java list.

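A sketch of the two bookkeeping options, again with a simplified stand-in and an invented aperture:isContainedIn property:

{{{
#!java
import java.util.ArrayList;
import java.util.List;

// simplified stand-in for an RDFContainer extended with a list of
// the resources contained within (option 2)
class ExtendedContainer {
    final String describedUri;
    final List<String> containedResources = new ArrayList<String>();

    ExtendedContainer(String describedUri) { this.describedUri = describedUri; }

    void addStatement(String subject, String predicate, String object) { /* ... */ }

    void addContainedResource(String childUri) {
        // option 1: record containment as RDF metadata
        addStatement(childUri, "aperture:isContainedIn", describedUri);
        // option 2: also keep a plain Java list for cheap iteration
        containedResources.add(childUri);
    }
}
}}}
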
Pro:
 * Leo: works today

Con:
 * Leo: hard to optimize the Lucene index afterwards