Version 15 (modified by sauermann, 19 years ago)
Extractor
Notes:
- The mimetype is passed because the same Extractor can serve several mimetypes (e.g. OpenOfficeExtractor or OpenDocumentExtractor), and the handling may differ slightly between them.
- A Sesame Repository is passed to the Extractor to hold the extracted contents. The repository is wrapped as an RDFMap. The application can then decide whether this is an in-memory repository on which only this Extractor operates, whether the statements go directly into persistent storage, whether an application-specific optimized Repository implementation is used, etc.
TODO: Leo questions whether the RDFMap should be passed in at all.
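The two notes above can be illustrated with a small sketch. It is not the project's code: `StatementSink`, `InMemorySink`, `SimpleExtractor`, `OfficeLikeExtractor` and `ExtractorRegistry` are hypothetical stand-ins, and the sink only mimics the role of the RDFMap. The sketch shows one extractor instance registered under two mimetypes, with the caller choosing where the results go.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the RDFMap: it only collects "predicate=value" strings.
interface StatementSink {
    void add(String predicate, String value);
}

// In-memory sink; the application could just as well pass a persistent one.
class InMemorySink implements StatementSink {
    final List<String> statements = new ArrayList<>();

    @Override
    public void add(String predicate, String value) {
        statements.add(predicate + "=" + value);
    }
}

// Hypothetical, simplified extractor. It receives the mimetype because
// one instance may be registered for several mimetypes.
interface SimpleExtractor {
    void extract(String mimetype, StatementSink result);
}

class OfficeLikeExtractor implements SimpleExtractor {
    @Override
    public void extract(String mimetype, StatementSink result) {
        // The slight per-mimetype difference is reduced to a label here.
        String kind = mimetype.endsWith(".text") ? "text document" : "spreadsheet";
        result.add("kind", kind);
    }
}

// Hypothetical registry that maps each mimetype to the extractor handling it.
class ExtractorRegistry {
    private final Map<String, SimpleExtractor> byMimetype = new HashMap<>();

    void register(SimpleExtractor extractor, String... mimetypes) {
        for (String mimetype : mimetypes) {
            byMimetype.put(mimetype, extractor);
        }
    }

    SimpleExtractor get(String mimetype) {
        return byMimetype.get(mimetype);
    }
}

class RegistryDemo {
    static List<String> run() {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register(new OfficeLikeExtractor(),
                "application/vnd.oasis.opendocument.text",
                "application/vnd.oasis.opendocument.spreadsheet");

        // The caller decides where the statements end up.
        InMemorySink sink = new InMemorySink();
        String mimetype = "application/vnd.oasis.opendocument.text";
        registry.get(mimetype).extract(mimetype, sink);
        return sink.statements;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because the extractor only sees the sink interface, swapping the in-memory sink for a persistent one would not require changes to the extractor.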
Java Interface
Probably equal to:
- source:trunk/gnowsis/src/org/gnowsis/data/extractor/ExtractorPlaintext.java and
- source:/trunk/gnoDesktopSearch/src/java/org/gnowsis/desktopsearch/extractor/DocumentExtractor.java
/**
 * Extractors are used to extract metadata and fulltext from InputStreams.
 * The format of the InputStream is given by the passed mimetype.
 * These extractors can produce RDFMaps.
 */
public interface Extractor {

    /**
     * Writes the extracted information into the passed RDFMap called "result".
     * To see which fields are needed and which must be added, look at the
     * comments above.
     *
     * @param id the URI identifying the passed object. You may need it when you
     *        add sophisticated RDF information. It is also the topResource in
     *        the passed result.
     * @param stream an opened InputStream which you can read exclusively. You
     *        must call stream.close() when you have finished extracting.
     * @param charset the charset in which the InputStream is encoded
     * @param mimetype the mimetype of the passed file/stream. If your extractor
     *        can handle multiple mimetypes, this can be handy.
     * @param result the place where the extracted data is to be written
     * @throws IOException when problems arise reading the stream.
     * @throws DocumentExtractorException when the metadata of the stream cannot
     *         be extracted, or when the stream does not conform to the
     *         mimetype's norms.
     */
    public void extract(URI id, InputStream stream, Charset charset,
            String mimetype, RDFMap result)
            throws IOException, DocumentExtractorException;
}
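As a rough illustration of the contract above, here is a minimal sketch of a plain-text implementation. Several things are assumptions made only to keep it self-contained: `RDFMap` and `DocumentExtractorException` are replaced by hypothetical stand-ins (the real RDFMap wraps a Sesame Repository), the `id` parameter is assumed to be a `java.net.URI`, and the property keys `id`, `mimetype` and `fulltext` are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the project's RDFMap.
class RDFMap {
    private final Map<String, String> values = new LinkedHashMap<>();

    void put(String property, String value) { values.put(property, value); }

    String get(String property) { return values.get(property); }
}

// Hypothetical stand-in for the project's exception type.
class DocumentExtractorException extends Exception {
    DocumentExtractorException(String message) { super(message); }
}

// The interface as described in this section (id type assumed to be java.net.URI).
interface Extractor {
    void extract(URI id, InputStream stream, Charset charset, String mimetype, RDFMap result)
            throws IOException, DocumentExtractorException;
}

class PlainTextExtractorSketch implements Extractor {
    @Override
    public void extract(URI id, InputStream stream, Charset charset, String mimetype, RDFMap result)
            throws IOException, DocumentExtractorException {
        // try-with-resources closes the reader and, with it, the passed stream,
        // which the contract requires the extractor to do.
        try (Reader reader = new InputStreamReader(stream, charset)) {
            if (!mimetype.startsWith("text/")) {
                throw new DocumentExtractorException("Unsupported mimetype: " + mimetype);
            }
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            // Property keys are invented for this sketch.
            result.put("id", id.toString());
            result.put("mimetype", mimetype);
            result.put("fulltext", text.toString());
        }
    }
}

class PlainTextExtractorDemo {
    static RDFMap run() {
        byte[] data = "hello extractor".getBytes(StandardCharsets.UTF_8);
        RDFMap result = new RDFMap();
        try {
            new PlainTextExtractorSketch().extract(
                    URI.create("file:///tmp/example.txt"),
                    new ByteArrayInputStream(data),
                    StandardCharsets.UTF_8,
                    "text/plain",
                    result);
        } catch (IOException | DocumentExtractorException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run().get("fulltext"));
    }
}
```

Here the caller creates the result container and hands it to the extractor, which matches the design discussed in the notes; whether that container should be passed in at all is the open TODO.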