Version 13 (modified by anonymous, 19 years ago)
Extractor
Java Interface
Probably equivalent to:
- source:trunk/gnowsis/src/org/gnowsis/data/extractor/ExtractorPlaintext.java and
- source:/trunk/gnoDesktopSearch/src/java/org/gnowsis/desktopsearch/extractor/DocumentExtractor.java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

import org.apache.lucene.document.Document;

// ExtractionException and DocumentExtractorException are project-specific
// exception classes; their imports are omitted here.

/**
 * Extractors extract metadata and full text from files or InputStreams
 * whose format is identified by a MIME type.
 * These extractors can produce metadata maps.
 */
public interface Extractor {

    /**
     * Return a plaintext representation of the file.
     *
     * @param source the file to look into
     * @param mimetype the mimetype that gnowsis has identified for this file
     * @return the plaintext, or null if the file contains no plaintext.
     *         If the plaintext could not be extracted, an exception is thrown.
     * @throws ExtractionException when something goes wrong with extraction
     * @throws FileNotFoundException when the file does not exist
     */
    public String getPlaintext(File source, String mimetype)
            throws FileNotFoundException, ExtractionException;

    /**
     * Create a Lucene document.
     * To see what fields would be needed, look at the top of this class.
     *
     * @param file the file to extract from
     * @param uri the URI identifying the passed file. You may need it when
     *        you add sophisticated RDF information.
     * @param mimetype the mimetype of the passed file/stream. If your
     *        extractor can handle multiple mimetypes, this can be handy.
     * @param options optional options that may help you
     * @return a Lucene document
     */
    public Document createLuceneDocument(File file, String uri, String mimetype, Object options)
            throws IOException, DocumentExtractorException;

    /**
     * Same as above, but reads the content from a stream instead of a file.
     */
    public Document createLuceneDocument(InputStream stream, String uri, String mimetype, Object options)
            throws IOException, DocumentExtractorException;
}
A better alternative may be:
public interface Extractor {

    public void extract(URI id, InputStream stream, Charset charset,
            String mimetype, Repository repository);
}
Notes:
- The mimetype is passed because one Extractor may handle several mimetypes (e.g. OpenOfficeExtractor or OpenDocumentExtractor), and its handling may differ slightly between them.
- A Sesame Repository is passed to the Extractor to hold the statements it produces. The application can then decide whether this is an in-memory repository used only by this Extractor, whether the statements go directly into persistent storage, whether an optimized Repository implementation is used, etc.
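The two notes above can be illustrated with a minimal, self-contained sketch. It is not the Gnowsis or Sesame code: `StatementSink` is a hypothetical stand-in for the Sesame Repository so the example runs without Sesame, `SketchExtractor` mirrors the proposed interface but adds an `IOException`, and the predicate names `mimetype` and `fullText` are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Sesame Repository: it only collects statements.
interface StatementSink {
    void add(URI subject, String predicate, String value);
}

// In-memory sink. The application could pass a sink backed by persistent
// storage instead, without the extractor noticing.
class InMemorySink implements StatementSink {
    final List<String> statements = new ArrayList<>();

    @Override
    public void add(URI subject, String predicate, String value) {
        statements.add(subject + " " + predicate + " " + value);
    }
}

// Mirrors the proposed interface; the sink type and the IOException are assumptions.
interface SketchExtractor {
    void extract(URI id, InputStream stream, Charset charset,
            String mimetype, StatementSink sink) throws IOException;
}

// One extractor serving several mimetypes, branching only where they differ.
class PlainTextExtractor implements SketchExtractor {
    @Override
    public void extract(URI id, InputStream stream, Charset charset,
            String mimetype, StatementSink sink) throws IOException {
        String text = new String(stream.readAllBytes(), charset);
        if ("text/html".equals(mimetype)) {
            // Crude tag stripping, enough for a sketch; not a real HTML parser.
            text = text.replaceAll("<[^>]*>", " ");
        }
        text = text.replaceAll("\\s+", " ").trim();
        sink.add(id, "mimetype", mimetype);
        sink.add(id, "fullText", text);
    }
}

class ExtractorSketch {
    public static void main(String[] args) throws IOException {
        InMemorySink sink = new InMemorySink();
        byte[] html = "<p>Hello <b>world</b></p>".getBytes(StandardCharsets.UTF_8);
        new PlainTextExtractor().extract(URI.create("file:///tmp/a.html"),
                new ByteArrayInputStream(html), StandardCharsets.UTF_8, "text/html", sink);
        sink.statements.forEach(System.out::println);
    }
}
```

Because the extractor only sees the sink interface, the choice between an in-memory store and direct persistence stays with the caller, which is the point of the second note.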