| 3 | |
| 4 | == java.io.File-Based Extractors == |
| 5 | |
| 6 | Some of our code (such as MP3 extraction) does not work on InputStreams, only on files. |
| 7 | |
| 8 | There are several ways to solve this: |
| 9 | === ideaA: rewrite all File-based extractors using InputStream === |
| 10 | Somebody writes new Extractors that implement the InputStream-based extraction interface. |
| 11 | * issue: do these have to be written completely from scratch? |
| 12 | * idea: have somebody else write them. |
| 13 | * pro: they would probably perform better than the existing ones and have less overhead |
| 14 | * con: they have to be written from scratch |
| 15 | |
| 16 | === ideaB: add a new method to Extractor, passing in the file as argument === |
| 17 | This is the existing method: |
| 18 | {{{extract(URI id, InputStream stream, Charset charset, |
| 19 | String mimeType, RDFContainer result)}}} |
| 20 | We could add a new one to the Extractor interface: |
| 21 | {{{extract(URI id, File file, Charset charset, |
| 22 | String mimeType, RDFContainer result)}}} |
| 23 | |
| 24 | * pro: no new interface |
| 25 | * issue: looking at the interface, it is not clear which method to call and which one is actually implemented. Should a caller try the InputStream method first and see whether it fails? |
| 26 | * issue: this depends on ideaC |
| 27 | |
| 28 | === ideaB1: create a new interface FileExtractor, passing in the file as argument === |
| 29 | Create a new interface FileExtractor that declares only one method. Document that this interface should be used only when no InputStream-based extraction library is available, and that a FileExtractor is inferior to the normal Extractor. |
| 30 | {{{extract(URI id, File file, Charset charset, |
| 31 | String mimeType, RDFContainer result)}}} |
| 32 | |
| 33 | * pro: developers can determine which kind of Extractor they face and which method to call |
| 34 | * con: we need a new registry for FileExtractors |
| 35 | * issue: this depends on ideaC |
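The two points above (callers can tell the extractor kinds apart, but a second registry is needed) could look roughly like the following sketch. Everything here is an assumption made for illustration: the interfaces are simplified stand-ins with String and Map parameters instead of URI, Charset and RDFContainer, and the registry maps and `kindFor` are hypothetical names, not existing API.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

class ExtractorDispatchSketch {

    // Simplified stand-in for the existing InputStream-based interface (assumption).
    interface Extractor {
        void extract(String id, InputStream stream, String mimeType,
                     Map<String, String> result) throws IOException;
    }

    // The interface proposed in ideaB1: a single File-based method (simplified).
    interface FileExtractor {
        void extract(String id, File file, String mimeType,
                     Map<String, String> result) throws IOException;
    }

    // Two registries keyed by MIME type; the second one is the extra cost noted above.
    static final Map<String, Extractor> STREAM_EXTRACTORS = new HashMap<>();
    static final Map<String, FileExtractor> FILE_EXTRACTORS = new HashMap<>();

    /** Reports which kind of extractor is registered: "stream", "file" or "none". */
    static String kindFor(String mimeType) {
        if (STREAM_EXTRACTORS.containsKey(mimeType)) {
            return "stream"; // preferred: no local file needed
        }
        if (FILE_EXTRACTORS.containsKey(mimeType)) {
            return "file";   // fallback: requires getFile() from ideaC
        }
        return "none";
    }

    public static void main(String[] args) {
        STREAM_EXTRACTORS.put("text/plain",
                (id, stream, mime, result) -> result.put("kind", "stream"));
        FILE_EXTRACTORS.put("audio/mpeg",
                (id, file, mime, result) -> result.put("kind", "file"));

        System.out.println(kindFor("text/plain")); // prints "stream"
        System.out.println(kindFor("audio/mpeg")); // prints "file"
        System.out.println(kindFor("image/png"));  // prints "none"
    }
}
```

Checking the stream registry first keeps the File-based path as a last resort, which matches the intent that a FileExtractor is only used when nothing InputStream-based exists.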
| 36 | |
| 37 | === ideaC: Add a new method getFile() to FileDataObject === |
| 38 | Add a new method getFile(), returning a File, to FileDataObject. This is easy to implement for file-based data objects (crawling the local file system). For remote FileDataObjects, the method would be implemented by buffering the InputStream to a local file. ideaB and ideaB1 depend on this getFile() method. |
| 39 | |
| 40 | * pro: optimizes the implementation for the file system crawler |
| 41 | * issue: in some setups (crawling remote MP3s), there will only be FileExtractors and everything will be buffered on the local hard disk |
| 42 | * idea: this is not much of an issue; the benefit to the end user of having more data outweighs it |
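For the remote case, the buffering behind getFile() might be as simple as copying the stream to a temporary file. This is only a sketch under assumptions: `bufferToTempFile` is a hypothetical helper, not part of FileDataObject, and who deletes the temporary file (here, the caller) is a design decision that is still open.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class BufferedFileSketch {

    /**
     * Hypothetical helper: copies the whole stream into a temporary file
     * and returns that file. The caller is responsible for deleting it.
     */
    static File bufferToTempFile(InputStream stream, String suffix) throws IOException {
        Path tmp = Files.createTempFile("extract-", suffix);
        try {
            // createTempFile already created the (empty) file, so allow overwriting it.
            Files.copy(stream, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(tmp); // do not leave a partial file behind
            throw e;
        }
        return tmp.toFile();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "fake mp3 bytes".getBytes(StandardCharsets.UTF_8);
        File file = bufferToTempFile(new ByteArrayInputStream(data), ".mp3");
        try {
            System.out.println(file.length() == data.length); // prints "true"
        } finally {
            Files.deleteIfExists(file.toPath());
        }
    }
}
```

A local FileDataObject would skip this step and return the crawled file directly, which is where the optimization for the file system crawler comes from.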