Changes between Version 14 and Version 15 of ApertureExtractor

Timestamp: 10/17/05 15:57:34 (20 years ago)
Author: sauermann
Comment: --

ApertureExtractor

-                      v14
+                      v15
 = Extractor =
+Notes:
+ * The mimetype is specified because the same Extractor can be used for several mimetypes (e.g. !OpenOfficeExtractor or !OpenDocumentExtractor) while there may be slight differences for different mimetypes.
+ * A Sesame Repository is specified to the Extractor to put its contents in. The repository is neatly wrapped as an RDFMap. The application can then decide whether this is an in-memory repository on which only this Extractor is operating, whether the statements go directly into persistent storage, whether an application-specific optimized Repository implementation is used, etc.
+TODO: Leo questions whether the RDFMap should be passed or not.
 == Java Interface ==
 …
  * source:/trunk/gnoDesktopSearch/src/java/org/gnowsis/desktopsearch/extractor/DocumentExtractor.java
 {{{
 #!java
 /**
+ *
  * Extractors are used to extract metadata and fulltext from InputStreams,
  * the inputstream is in a format passed by Mime-Type.
  * These extractors can produce metadatamaps.
+ * These extractors can produce RDFMaps.
  */
 public interface Extractor {
-    /**
-     * return a plaintext representation of the file
-     * @param source the file to look into
-     * @param mimetype the mimetype that has been identified by gnowsis that this file is
-     * @return null or a string. Null is returned, if no plaintext is in the file. If it could not be
-     * extracted, an exception is thrown.
-     * @throws ExtractionException when something goes wrong with extraction
-     * @throws FileNotFoundException when the file does not exist
-     */
-    public String getPlaintext(File source, String mimetype) throws FileNotFoundException, ExtractionException ;
     /**
+     * create a lucene document.
+     * To see what fields would be needed, look at the top of this class.
+     * @param file
+     * @param uri the uri identifying the passed file. You may need it when you add sophisticated rdf information
+     * create extracted information into the passed RDFMap called "result"
+     * To see what fields should be needed and which must be added, look at the
+     * comments above
+     * @param id the uri identifying the passed object. You may need it when you add sophisticated rdf information. It is also the topResource in the passed result
+     * @param stream an opened inputstream which you can exclusively read. You must call
+     * stream.close() when you are finished extracting.
+     * @param charset the charset in which the inputstream is encoded
      * @param mimetype the mimetype of the passed file/stream. If your extractor can handle multiple mime-types, this can be handy.
+     * @param options optional options that may help you.
+     * @return a lucene document
+     * @param result - the place where the extracted data is to be written to
+     * @throws IOException when problems arise reading the stream.
+     * @throws DocumentExtractorException when the metadata of the stream cannot be extracted,
+     * or when the stream does not conform to the MimeType's norms.
      */
+    public Document createLuceneDocument(File file, String uri, String mimetype, Object options) throws IOException, DocumentExtractorException;
+    public Document createLuceneDocument(InputStream stream, String uri, String mimetype, Object options) throws IOException, DocumentExtractorException;
+}
+}}}
+A better alternative may be:
+{{{
+#!java
+public interface Extractor {
+    public void extract(URI id, InputStream stream, Charset charset, String mimetype, Repository repository);
+    public void extract(URI id, InputStream stream, Charset charset, String mimetype, RDFMap result)  throws IOException, DocumentExtractorException;
+}
 }}}
-Notes:
- * The mimetype is specified because the same Extractor can be used for several mimetypes (e.g. !OpenOfficeExtractor or !OpenDocumentExtractor) while there may be slight differences for different mimetypes.
- * A Sesame Repository is specified to the Extractor to put its contents in. The application can then decide whether this is an in-memory repository on which only this Extractor is operating, whether the statements go directly into a persistent storage, whether an application-specific optimized Repository implementation is used, etc.
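
A minimal sketch of how an implementation of the RDFMap-based `extract` signature might look. Everything beyond that signature is an assumption made for illustration: the `RDFMap` class here is a hypothetical in-memory stand-in (the real wrapper around a Sesame Repository is not shown on this page), the property names `mimetype` and `fulltext` are invented, and `PlainTextExtractor` is not part of the source tree. The sketch shows one Extractor accepting several mimetypes and closing the stream when it is done, as the Javadoc above requires.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for RDFMap: a plain property map with a top resource. */
class RDFMap {
    private final Map<String, String> statements = new LinkedHashMap<>();
    private final URI topResource;

    RDFMap(URI topResource) { this.topResource = topResource; }

    URI getTopResource() { return topResource; }
    void put(String property, String value) { statements.put(property, value); }
    String get(String property) { return statements.get(property); }
}

/** Exception name taken from the interface above; the body is an assumption. */
class DocumentExtractorException extends Exception {
    DocumentExtractorException(String message) { super(message); }
}

/** The RDFMap variant of the proposed interface. */
interface Extractor {
    void extract(URI id, InputStream stream, Charset charset, String mimetype, RDFMap result)
            throws IOException, DocumentExtractorException;
}

/** Illustrative extractor that serves two mimetypes with the same code path. */
class PlainTextExtractor implements Extractor {
    @Override
    public void extract(URI id, InputStream stream, Charset charset, String mimetype, RDFMap result)
            throws IOException, DocumentExtractorException {
        // try-with-resources closes the reader and, through it, the passed stream,
        // even when the mimetype is rejected or reading fails.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(stream, charset))) {
            if (!"text/plain".equals(mimetype) && !"text/csv".equals(mimetype)) {
                throw new DocumentExtractorException("Unsupported mimetype: " + mimetype);
            }
            StringBuilder text = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
            // Property names are placeholders, not a vocabulary defined by this page.
            result.put("mimetype", mimetype);
            result.put("fulltext", text.toString());
        }
    }
}

class ExtractorDemo {
    public static void main(String[] args) throws Exception {
        URI id = URI.create("file:///tmp/note.txt");
        // The application decides what backs the result; here it is in-memory.
        RDFMap result = new RDFMap(id);
        InputStream in = new ByteArrayInputStream("hello\nworld".getBytes(StandardCharsets.UTF_8));
        new PlainTextExtractor().extract(id, in, StandardCharsets.UTF_8, "text/plain", result);
        System.out.println(result.get("fulltext"));
    }
}
```

Because the caller constructs the `RDFMap` and hands it in, the extractor does not need to know whether the statements end up in memory or in persistent storage, which is the design point made in the notes.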