Changes between Version 14 and Version 15 of ApertureExtractor
- Timestamp: 10/17/05 15:57:34 (19 years ago)
Legend:
- Unmodified
- Added
- Removed
- Modified
ApertureExtractor
--- v14
+++ v15
@@ -1,3 +1,12 @@
 = Extractor =
+
+Notes:
+
+ * The mimetype is specified because the same Extractor can be used for several mimetypes (e.g. !OpenOfficeExtractor or !OpenDocumentExtractor) while there may be slight differences for different mimetypes.
+ * A Sesame Repository is specified to the Extractor to put its contents in. The repository
+is neatly wrapped as RDFMap. The application can then decide whether this is an in-memory repository on which only this Extractor is operating, whether the statements go directly into a persistent storage, whether an application-specific optimized Repository implementation is used, etc.
+
+TODO: Leo questions if the rdfmap should be passed or not.
+
 
 == Java Interface ==
…
@@ -6,54 +15,33 @@
  * source:/trunk/gnoDesktopSearch/src/java/org/gnowsis/desktopsearch/extractor/DocumentExtractor.java
 
+
 {{{
 #!java
+
 /**
  *
  * Extractors are used to extract metadata and fulltext from InputStreams,
  * the inputstream is in a format passed by Mime-Type.
- * These extractors can produce metadatamaps.
+ * These extractors can produce RDFMaps.
  */
 public interface Extractor {
 
-    /**
-     * return a plaintext representation of the file
-     * @param source the file to look into
-     * @param mimetype the mimetype that has been identified by gnowsis that this file is
-     * @return null or a string. Null is returned, if no plaintext is in the file. If it could not be
-     * extracted, an exception is thrown.
-     * @throws ExtractionException when something goes wrong with extraction
-     * @throws FileNotFoundException when the file is not existent
-     */
-    public String getPlaintext(File source, String mimetype) throws FileNotFoundException, ExtractionException ;
 
     /**
-     * create a lucene document.
-     * To see what fields would be needed, look at the top of this class.
-     * @param file
-     * @param uri the uri identifying the passed file. You may need it when you add sophisticated rdf information
+     * create extracted information into the passed RDFMap called "result"
+     * To see what fields should be needed and which must be added, look at the
+     * comments above
+     * @param id the uri identifying the passed object. You may need it when you add sophisticated rdf information. It is also the topResource in the passed result
+     * @param stream an opened inputstream which you can exclusively read. You must call the
+     * stream.close() operation when you are finished extracting.
+     * @param charset the charset in which the inputstream is encoded
      * @param mimetype the mimetype of the passed file/stream. If your extractor can handle multiple mime-types, this can be handy.
-     * @param options optional options that may help you.
-     * @return a lucene document
+     * @param result - the place where the extracted data is to be written to
+     * @throws IOException when problems arise reading the stream.
+     * @throws DocumentExtractorException when the metadata of the stream cannot be extracted,
+     * when the stream does not conform to the MimeType's norms.
      */
-    public Document createLuceneDocument(File file, String uri, String mimetype, Object options) throws IOException, DocumentExtractorException;
-
-    public Document createLuceneDocument(InputStream stream, String uri, String mimetype, Object options) throws IOException, DocumentExtractorException;
-}
-
-
-}}}
-
-A better alternative may be:
-
-{{{
-#!java
-public interface Extractor {
-
-    public void extract(URI id, InputStream stream, Charset charset, String mimetype, Repository repository);
+    public void extract(URI id, InputStream stream, Charset charset, String mimetype, RDFMap result) throws IOException, DocumentExtractorException;
 }
 }}}
 
-Notes:
-
- * The mimetype is specified because the same Extractor can be used for several mimetypes (e.g. !OpenOfficeExtractor or !OpenDocumentExtractor) while there may be slight differences for different mimetypes.
- * A Sesame Repository is specified to the Extractor to put its contents in. The application can then decide whether this is an in-memory repository on which only this Extractor is operating, whether the statements go directly into a persistent storage, whether an application-specific optimized Repository implementation is used, etc.
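
To illustrate how the version 15 `extract(...)` signature might be implemented, here is a minimal sketch of a plain-text extractor. Everything except the `extract` signature itself is an assumption: `RDFMap` is replaced by a hypothetical in-memory stand-in (the real class wraps a Sesame Repository and its API is not shown on this page), `DocumentExtractorException` is declared locally, and `PlainTextExtractor`, `ExtractorDemo`, and the `fullText` and `mimeType` property names are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for RDFMap: stores (subject, predicate) -> value in memory.
class RDFMap {
    private final Map<String, String> statements = new LinkedHashMap<>();

    void put(URI subject, String predicate, String value) {
        statements.put(subject + " " + predicate, value);
    }

    String get(URI subject, String predicate) {
        return statements.get(subject + " " + predicate);
    }
}

// Local stand-in for the exception named in the interface's throws clause.
class DocumentExtractorException extends Exception {
    DocumentExtractorException(String message) {
        super(message);
    }
}

// The version 15 interface, repeated here so the sketch compiles on its own.
interface Extractor {
    void extract(URI id, InputStream stream, Charset charset, String mimetype, RDFMap result)
            throws IOException, DocumentExtractorException;
}

// Hypothetical extractor that accepts any text/* mimetype.
class PlainTextExtractor implements Extractor {
    @Override
    public void extract(URI id, InputStream stream, Charset charset, String mimetype, RDFMap result)
            throws IOException, DocumentExtractorException {
        // try-with-resources closes the reader and the wrapped stream,
        // which satisfies the "you must call stream.close()" requirement.
        try (Reader reader = new InputStreamReader(stream, charset)) {
            if (!mimetype.startsWith("text/")) {
                throw new DocumentExtractorException("Unsupported mimetype: " + mimetype);
            }
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            // Write the extracted data under the id, the top resource of the result.
            result.put(id, "fullText", text.toString());
            result.put(id, "mimeType", mimetype);
        }
    }
}

public class ExtractorDemo {
    public static void main(String[] args) throws Exception {
        URI id = URI.create("file:///tmp/example.txt");
        RDFMap result = new RDFMap();
        InputStream stream = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        new PlainTextExtractor().extract(id, stream, StandardCharsets.UTF_8, "text/plain", result);
        System.out.println(result.get(id, "fullText"));
    }
}
```

Because the caller supplies the result container, the application stays free to pass an in-memory map per extraction or one backed by persistent storage, which matches the rationale given in the notes.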