Mimetypes: Chris suggests leaving it partially in. My idea is that the crawler only returns what the data source returns. If a specific data source reports MIME types (e.g. some web servers), return that, but otherwise make no effort to determine it. After all, it depends on the use case whether you need it at all and what the best way is to fill in missing values. Our experience, by the way, is that you should never trust the MIME type returned by a web server ;) In our apps we always use a magic number-based mechanism.

Leo: If it is going to be used, we can make it as convenient as possible. That is, DataObjectFile has a getMimetype() method which promises to use all the tricks that exist. If you don't call getMimetype(), no processing power is spent. If you call it, it guarantees that all tricks have already been applied (which is very convenient and a useful feature of the Aperture framework).

Chris: actually, what I propose is to let the metadata of a DataObject only contain the mime type if the data source reports it. It is up to higher level components to apply "tricks" like using a MimeTypeIdentifier. I don't believe this should be hidden behind a method in the DataObject interface, as the question of whether and how you want to process the contents of a DataObject is highly application specific.
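
As a rough illustration of the magic number-based approach applied by a higher-level component, here is a minimal sketch. `MagicSniffer` and its `sniff` method are hypothetical names invented for this example, not part of Aperture; the signature table is deliberately tiny. It checks the leading bytes first and falls back to whatever the data source reported (possibly null) only when no signature matches.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not an Aperture class): guesses a MIME type from leading bytes. */
final class MagicSniffer {
    // A few well-known file signatures; a real identifier would know many more.
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] GIF = "GIF8".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};
    private static final int MAX_MAGIC = 8; // length of the longest signature above

    /**
     * Peeks at the first bytes of a mark-supporting stream, resets it, and returns
     * the detected type, or reportedType (which may be null) if nothing matches.
     */
    static String sniff(InputStream in, String reportedType) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(MAX_MAGIC);
        byte[] head = new byte[MAX_MAGIC];
        int len = 0;
        try {
            int n;
            while (len < MAX_MAGIC && (n = in.read(head, len, MAX_MAGIC - len)) != -1) {
                len += n;
            }
        } finally {
            in.reset(); // leave the stream at its original position for the next consumer
        }
        if (startsWith(head, len, PDF)) return "application/pdf";
        if (startsWith(head, len, PNG)) return "image/png";
        if (startsWith(head, len, GIF)) return "image/gif";
        if (startsWith(head, len, ZIP)) return "application/zip";
        return reportedType; // no signature matched: fall back to the reported value
    }

    private static boolean startsWith(byte[] data, int len, byte[] magic) {
        if (len < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

Because the detection lives outside the data object, an application that never needs the MIME type pays nothing for it, and one that does can swap in its own signature table.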


Some extractors may require extended I/O functionality, like the position() method provided by java.nio.ByteBuffer and its superclass Buffer. Having a ByteBuffer internally would allow returning a new InputStream instance on each call to getInputStream(). Also, we may add (when needed) a getByteBuffer() method that returns the raw data.
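
To make the ByteBuffer idea concrete, the following is a minimal sketch under stated assumptions: `BufferedContent` is a hypothetical class invented here, and it assumes the whole content is already in memory. Each call to getInputStream() wraps a duplicate() of the internal buffer, so every stream has its own position while sharing the same bytes.

```java
import java.io.InputStream;
import java.nio.ByteBuffer;

/** Hypothetical holder (not an Aperture class) that keeps content in a ByteBuffer. */
final class BufferedContent {
    private final ByteBuffer data; // read-only; its own position always stays at 0

    BufferedContent(byte[] bytes) {
        this.data = ByteBuffer.wrap(bytes.clone()).asReadOnlyBuffer();
    }

    /** Returns a new, independent stream positioned at the start on every call. */
    InputStream getInputStream() {
        final ByteBuffer view = data.duplicate(); // shared bytes, separate position
        return new InputStream() {
            private int marked = 0; // position to return to on reset()

            @Override public int read() {
                return view.hasRemaining() ? (view.get() & 0xFF) : -1;
            }

            @Override public int read(byte[] dst, int off, int len) {
                if (len == 0) return 0;
                if (!view.hasRemaining()) return -1;
                int n = Math.min(len, view.remaining());
                view.get(dst, off, n);
                return n;
            }

            @Override public int available() { return view.remaining(); }

            @Override public boolean markSupported() { return true; }

            @Override public synchronized void mark(int readLimit) { marked = view.position(); }

            @Override public synchronized void reset() { view.position(marked); }
        };
    }

    /** Returns a read-only view of the raw data with its own position. */
    ByteBuffer getByteBuffer() {
        return data.duplicate();
    }
}
```

Two extractors can then read the same content without coordinating mark and reset calls, at the cost of holding the whole content in memory.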

The decision to use New-IO or an Extractor that works on Java Files will be postponed until we have a set of three or more extractors that don't support InputStream. For now, we have these incompatible extractors:

  • MP3 extractor by Jens Vonderheide
  • MP3 extractor by Eric Farng
import java.io.IOException;
import java.io.InputStream;

/**
 * A general interface for data objects that have some stream-based file content. These objects may
 * be files on filesystems, web files received through HTTP, or emails stored on an email server.
 * All methods of DataObject are inherited. Added are the <b>binary content</b> and the
 * handling of the stream. For extraction, both the InputStream returned by getContent() and
 * the RDF metadata returned by getMetadata() are important.
 */
public interface DataObjectFile extends DataObject {

        /**
         * Returns the byte size of the represented resource. This has been defined at
         * this global level due to the importance of this attribute for performance reasons.
         * @return the size of the binary resource in bytes, or a negative value when the
         * size is unknown or does not make sense for this particular DataObject implementation.
         */
        public long getSize();
        /**
         * Gets an InputStream containing the content represented by the DataObject.
         * The returned InputStream is required to support marking (markSupported()
         * returns true). Calling this method multiple times may return references to
         * one and the same InputStream instance. After getContent() is called, the
         * stream is positioned at the beginning. This is achieved through
         * an internal New-IO Channel from which the stream is fed.
         * @return An InputStream from which the content of the data object can be read.
         * @throws IOException If an I/O error occurred.
         */
        public InputStream getContent() throws IOException;

        /**
         * Instructs the DataObject that its content stream will most likely be used multiple
         * times in its entirety, which makes the mark-and-reset procedure hard to use,
         * and that it should therefore cache the entire contents.
         * The internal caching should be implemented using a New-IO ByteBuffer.
         * @throws IOException when an I/O error occurred while caching the content.
         */
        public void cacheContent() throws IOException;

        /**
         * Returns the MIME type of the content, if there is content.
         * This is set by the DataAccessor. This method may trigger complicated MIME type detection,
         * like looking at the HTTP MIME type, file extensions, or magic bytes inside the file stream.
         * @return a MIME type identifier like "text/plain", or null if the
         * identifier cannot be determined, even using all tricks available.
         */
        public String getContentMimeType();

        /**
         * Returns the character encoding (using ANSI identifiers like "UTF-8"
         * or "ISO-8859-1") of the content, if there is content. Will
         * return null if not known or if content is null.
         * This is set by the DataAccessor.
         * @return null or an encoding identifier like "UTF-8"
         */
        public String getContentEncoding();
}
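
A caller of this interface has to cope with a negative getSize() and a null getContentEncoding(). The sketch below shows one way to do that when reading the content as text. `ContentSource` is a hypothetical stand-in that mirrors three of the methods above so the example compiles on its own, and `ContentReader` with its caller-supplied fallback charset is likewise an assumption, not part of Aperture.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

/** Hypothetical stand-in mirroring three DataObjectFile methods. */
interface ContentSource {
    long getSize();
    InputStream getContent() throws IOException;
    String getContentEncoding();
}

/** Hypothetical client-side helper that decodes the content as text. */
final class ContentReader {
    private static final int MAX_PREALLOC = 1 << 20; // never pre-allocate more than 1 MiB

    static String readText(ContentSource source, Charset fallback) throws IOException {
        // Use the reported encoding only if it is present and known to this JVM.
        Charset charset = fallback;
        String name = source.getContentEncoding();
        if (name != null) {
            try {
                charset = Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // illegal or unsupported charset name: keep the fallback
            }
        }

        // A negative size means "unknown", so it only serves as a capacity hint.
        long size = source.getSize();
        int initial = (size >= 0 && size <= MAX_PREALLOC) ? (int) size : 8192;

        // The stream is not closed here: the data object may hand out the same instance again.
        InputStream in = source.getContent();
        ByteArrayOutputStream out = new ByteArrayOutputStream(initial);
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return new String(out.toByteArray(), charset);
    }
}
```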
Last modified on 11/11/05 11:52:12