Email Interpretation

The ImapDataAccessor is a fairly complex class that puts a lot of effort into interpreting a MIME message. Rather than just delivering the raw InputStream of the Message, it produces a DataObject, possibly with child DataObjects, that reflects as closely as possible the way mail readers display the mail.

For example, what seems to be a simple mail with a few headers and a body may in fact be a multipart mail with two alternative bodies, one in plain text and one in HTML. What is conceptually a single "information object" is spread over four different JavaMail objects (a MimeMessage with a Multipart containing two BodyParts, if I remember correctly). The ImapDataAccessor tries to hide this multipart complexity and creates a single DataObject with headers and content.
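The flattening described above can be sketched roughly as follows. This is not the ImapDataAccessor code: the Part class is a hypothetical stand-in for the JavaMail part tree so that the example runs without the library, and the rule that text/html wins over text/plain is an assumption made for illustration only.

```java
import java.util.Arrays;
import java.util.List;

class AlternativeFlattening {

    /** Hypothetical stand-in for one node of a MIME part tree (not a JavaMail type). */
    static final class Part {
        final String mimeType;     // assumed to be lower case in this sketch
        final String text;         // null for multipart containers
        final List<Part> children; // empty for leaf parts

        Part(String mimeType, String text, Part... children) {
            this.mimeType = mimeType;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    /**
     * Returns the single leaf part to present as "the" body, or null if none.
     * Assumption: among alternatives, text/html wins over other text types.
     */
    static Part chooseBody(Part part) {
        if (part.mimeType.equals("multipart/alternative")) {
            Part fallback = null;
            for (Part child : part.children) {
                Part candidate = chooseBody(child);
                if (candidate == null) {
                    continue;
                }
                if (candidate.mimeType.equals("text/html")) {
                    return candidate; // preferred rendering found
                }
                if (fallback == null) {
                    fallback = candidate; // remember the first usable alternative
                }
            }
            return fallback;
        }
        if (part.mimeType.startsWith("multipart/")) {
            // Any other container: use the first child that yields a body.
            for (Part child : part.children) {
                Part candidate = chooseBody(child);
                if (candidate != null) {
                    return candidate;
                }
            }
            return null;
        }
        // Leaf: only textual parts count as a body in this sketch.
        return part.mimeType.startsWith("text/") ? part : null;
    }

    public static void main(String[] args) {
        Part mail = new Part("multipart/alternative", null,
                new Part("text/plain", "Hello"),
                new Part("text/html", "<p>Hello</p>"));
        Part body = chooseBody(mail);
        System.out.println(body.mimeType + ": " + body.text);
    }
}
```

The caller only ever sees one body part, which is the point: the two alternatives and their container collapse into a single piece of content that can be attached to one DataObject together with the headers.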

It may be a good idea to adapt the other mail crawlers, such as the existing Outlook and Mozilla mail crawlers, so that they produce javax.mail.Message objects. We can then refactor the ImapDataAccessor so that this Message interpretation code lives elsewhere, making it possible to also apply it to the Messages created by these other mail crawlers. This allows us to reuse the mail interpretation code across these mail formats. If these other mail crawlers have access to the raw mail content (i.e. the message as transported through SMTP), this may be rather easy to realize, as the functionality to parse these lines and convert them into a Message data structure is part of JavaMail. We should see if this functionality is publicly available in the library (the public MimeMessage(Session, InputStream) constructor looks like a candidate).

Leo: Alternatively, the Outlook ApertureDataSource can use its own crawling and extraction mechanism and simply conform to the same ontology. Outlook does not provide the message as plain text; it is (to my knowledge) represented entirely in the Outlook data format. Crawling the message, its plain text and its attachments is not so complicated that we need to mix the two approaches. As long as both conform to the same ontology, it should work.

Last modified on 10/12/05 14:01:51