= Aperture Overview =

Administrators: Christiaan Fluit & Leo Sauermann

Source Code: Interfaces and standard implementations of Aperture

== Project name ==

From [http://www.webster.com/ Merriam-Webster Online]:

Main Entry: '''ap·er·ture''' (sounds like [http://cougar.eb.com/sound/a/apertu01.wav this])[[br]]
Pronunciation: 'ap-&(r)-"chur, -ch&r, -"tyur, -"tur[[br]]
Function: noun[[br]]
Etymology: Middle English, from Latin apertura, from apertus, past participle of aperire to open[[br]]
1 : an opening or open space : HOLE[[br]]
2 a : the opening in a photographic lens that admits the light b : the diameter of the stop in an optical system that determines the diameter of the bundle of rays traversing the instrument c : the diameter of the objective lens or mirror of a telescope

== Features ==

 * Easy to use: easy to learn, easy to code, easy to deploy in industrial projects
 * Extract fulltext from many common file formats and information systems like IMAP email servers
 * Extract metadata like author, date, subject and more from the data sources and file formats
 * Open data objects for viewing
 * Fully configurable framework; configuration files are stored and edited through a Swing GUI
 * Pluggable architecture: easy to extend and easy to integrate into other projects
 * Architecture based on the industry standard OSGi
 * Compatible with RDF, but not solely based on it

== Components ==

 * !DataSource interface
 * !DataSource implementations for file systems, websites (or rather hypertextual sources in general) and IMAP servers
 * Near future work: !OutlookSource, !MozillaSource/!ThunderbirdSource

 * !DataAccessor interface
 * !DataAccessor implementations for file, http(s) and imap schemes

 * !DataCrawler interface
 * One basic !DataCrawler implementation for every !DataSource type
 * Later maybe more specialized !DataCrawler implementations, e.g. a !WindowsFileSystemCrawler with OS-specific optimizations
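
To make the crawler idea concrete, here is a minimal sketch of what a basic file system crawler could look like. It is not Aperture's actual API: `CrawlHandlerSketch` and `FileSystemCrawlerSketch` are hypothetical names invented for this illustration, and the traversal simply relies on `java.nio.file.Files.walk` from the Java standard library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// All names below are hypothetical; they are not Aperture's actual interfaces.
interface CrawlHandlerSketch {
    // Called once for every data object (here: file) the crawler finds.
    void objectFound(Path file);
}

final class FileSystemCrawlerSketch {
    private final Path root;

    FileSystemCrawlerSketch(Path root) {
        this.root = root;
    }

    // Visits every regular file below the root and reports it to the handler.
    void crawl(CrawlHandlerSketch handler) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(handler::objectFound);
        }
    }

    public static void main(String[] args) throws IOException {
        // Crawl the directory given on the command line, or the current one.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        new FileSystemCrawlerSketch(root).crawl(file -> System.out.println(file));
    }
}
```

Reporting results through a handler rather than returning a list keeps memory use flat on large sources, which is one plausible reason to separate crawling from processing.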

 * Extractor interface
 * Extractor implementation for everything we can easily support: PDF, Word, Excel, HTML, plain text, ...
 * New domain for us but also probably very doable: PNG, JPG, AVI, ...
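
As a rough illustration of the Extractor idea, the sketch below pairs a trivial plain text extractor with a registry keyed by MIME type. The interface and class names (`TextExtractorSketch`, `PlainTextExtractorSketch`, `ExtractorRegistrySketch`) are assumptions made for this example and do not describe Aperture's real interfaces; only the plain text case is shown, since the other formats need format-specific parsers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface: turns a byte stream into full text.
interface TextExtractorSketch {
    String extract(InputStream in, Charset charset) throws IOException;
}

// Plain text needs no parsing: read all bytes and decode them.
final class PlainTextExtractorSketch implements TextExtractorSketch {
    @Override
    public String extract(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), charset);
    }
}

// Hypothetical registry: one extractor per MIME type.
final class ExtractorRegistrySketch {
    private final Map<String, TextExtractorSketch> byMimeType = new HashMap<>();

    void register(String mimeType, TextExtractorSketch extractor) {
        byMimeType.put(mimeType, extractor);
    }

    // Returns null when no extractor is registered for the MIME type.
    TextExtractorSketch get(String mimeType) {
        return byMimeType.get(mimeType);
    }

    public static void main(String[] args) throws IOException {
        ExtractorRegistrySketch registry = new ExtractorRegistrySketch();
        registry.register("text/plain", new PlainTextExtractorSketch());
        InputStream in = new ByteArrayInputStream("sample text".getBytes(StandardCharsets.UTF_8));
        System.out.println(registry.get("text/plain").extract(in, StandardCharsets.UTF_8));
    }
}
```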

 * !ArchiveExtractor interface
 * !ArchiveExtractor implementations for Zip and Gzip
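
For the Zip case, the Java standard library already provides the low-level reading. The following sketch shows one way an archive extractor might enumerate the files inside a Zip stream so that each entry can be handed on for further processing. `ZipArchiveSketch` and its methods are hypothetical names; only `java.util.zip` is real, and reading every entry fully into memory is a simplification.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper; not part of Aperture.
final class ZipArchiveSketch {

    // Reads every file entry of a Zip stream into memory, keyed by entry name.
    static Map<String, byte[]> readEntries(InputStream in) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            byte[] buffer = new byte[4096];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                entries.put(entry.getName(), out.toByteArray());
            }
        }
        return entries;
    }

    // Builds a small in-memory archive so the example runs without any files.
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("docs/readme.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> entries = readEntries(new ByteArrayInputStream(sampleArchive()));
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            System.out.println(e.getKey() + " (" + e.getValue().length + " bytes)");
        }
    }
}
```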

 * !LinkExtractor interface
 * !LinkExtractor implementation for HTML and XHTML
 * Later maybe PDF, Flash, ...
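
A link extractor for HTML essentially has to find `href` values and resolve them against the address of the page. The sketch below does this with a regular expression, which is an assumption chosen to keep the example short; a real implementation would more likely use a proper HTML parser, since a regular expression misses unquoted attributes and matches text inside comments or scripts. `HtmlLinkSketch` is a hypothetical name.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Aperture.
final class HtmlLinkSketch {
    // Matches quoted href attributes only; a deliberate simplification.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns all href targets found in the HTML, resolved against the base URI.
    static List<URI> extractLinks(String html, URI base) {
        List<URI> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            try {
                links.add(base.resolve(matcher.group(1).trim()));
            } catch (IllegalArgumentException e) {
                // Not a syntactically valid URI reference: skip it.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"b.html\">relative</a> <A HREF='http://example.org/c'>absolute</A>";
        URI base = URI.create("http://example.com/dir/a.html");
        for (URI link : extractLinks(html, base)) {
            System.out.println(link);
        }
    }
}
```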

 * !MimeTypeIdentifier interface
 * Basic !MimeTypeIdentifier implementation based on magic numbers; an absolute necessity for choosing the right Extractor, !LinkExtractor or !ArchiveExtractor implementation for a given file
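
Identification by magic numbers means comparing the first bytes of a file against known signatures. The sketch below illustrates the technique with a handful of well-known signatures (PDF, PNG, JPEG, Zip, Gzip, RTF). The class name `MagicNumberSketch`, the table layout and the choice of MIME type strings are assumptions for this example; a real identifier would need many more signatures plus fallbacks such as file name extensions for formats without a distinctive header.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of Aperture.
final class MagicNumberSketch {
    // MIME type -> leading bytes that identify the format.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();

    static {
        SIGNATURES.put("application/pdf", bytes(0x25, 0x50, 0x44, 0x46));   // "%PDF"
        SIGNATURES.put("image/png", bytes(0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A));
        SIGNATURES.put("image/jpeg", bytes(0xFF, 0xD8, 0xFF));
        SIGNATURES.put("application/zip", bytes(0x50, 0x4B, 0x03, 0x04));   // "PK\3\4"
        SIGNATURES.put("application/gzip", bytes(0x1F, 0x8B));
        SIGNATURES.put("application/rtf", "{\\rtf".getBytes(StandardCharsets.US_ASCII));
    }

    private static byte[] bytes(int... values) {
        byte[] result = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = (byte) values[i];
        }
        return result;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the MIME type whose signature matches the head of the file, or null.
    static String identify(byte[] head) {
        for (Map.Entry<String, byte[]> signature : SIGNATURES.entrySet()) {
            if (startsWith(head, signature.getValue())) {
                return signature.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(identify(head));
    }
}
```

The result of `identify` could then serve as the lookup key for selecting the matching Extractor, !LinkExtractor or !ArchiveExtractor implementation.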

 * [http://www.osgi.org/ OSGi] bindings and connector code (can be realized so that code is also usable outside an OSGi-based application)
 * Configuration GUI (what needs to be configured? isn't this very application-specific?)
 * Sample GUI application showing how to use it. Can also be used as a test application, e.g. when you are developing new Extractor implementations.
 * Metadata format descriptions (RDFS schema) and example metadata files

== Supported File Formats ==

Right from the beginning we will support these file formats:

 * Plain text
 * HTML
 * XML
 * PDF (Portable Document Format)
 * RTF (Rich Text Format)
 * Microsoft Word 97+
 * Microsoft Excel 97+
 * Microsoft !PowerPoint 97+
 * Microsoft Works
 * !OpenOffice 1.0+: Writer, Calc, Impress, Draw
 * !StarOffice 6.0+: Writer, Calc, Impress, Draw
 * !OpenDocument (!OpenOffice 2.0+)
 * !WordPerfect 5.x
 * Emails (.eml files)