= Aperture Overview =

Administrators: Christiaan Fluit & Leo Sauermann

Source Code: Interfaces and standard implementations of Aperture

== Project name ==

From [http://www.webster.com/ Merriam-Webster Online]:

Main Entry: '''ap·er·ture''' (sounds like [http://cougar.eb.com/sound/a/apertu01.wav this])[[br]]
Pronunciation: 'ap-&(r)-"chur, -ch&r, -"tyur, -"tur[[br]]
Function: noun[[br]]
Etymology: Middle English, from Latin apertura, from apertus, past participle of aperire to open[[br]]
1 : an opening or open space : HOLE[[br]]
2 a : the opening in a photographic lens that admits the light b : the diameter of the stop in an optical system that determines the diameter of the bundle of rays traversing the instrument c : the diameter of the objective lens or mirror of a telescope

== Features ==

 * Extract data objects from common information systems like file systems, websites and mail servers
 * Extract full-text and other metadata from many common file formats
 * Easy to use: easy to learn, easy to code, easy to deploy in industrial projects
 * Open data objects for viewing
 * Fully configurable framework; config files are stored and edited through a Swing GUI
 * Pluggable architecture: can be easily extended with custom file formats, data sources, ...
 * Deployment based on industry standard OSGi (but not exclusively, you can make it work outside OSGi)
 * Based on RDF (to what extent is still to be determined)

== Use Cases ==

Applications and projects in which this project is intended to be used:

at DFKI:
 * [http://www.gnowsis.org/ Gnowsis]
 * Nepomuk (link?)
 * catwiesel resource classification engine.

at Aduna:
 * [http://aduna.biz/products/autofocus/index.html Aduna AutoFocus]
 * [http://aduna.biz/products/autofocus/index.html Aduna Metadata Server]

== Components ==

 * !DataSource interface
   * !DataSource implementations for file systems, websites (or rather hypertextual sources in general) and IMAP servers
   * Near future work: !OutlookSource, !MozillaSource/!ThunderbirdSource
 * !DataAccessor interface
   * !DataAccessor implementations for file, http(s) and imap schemes
 * !DataCrawler interface
   * One basic !DataCrawler implementation for every !DataSource type
   * Later maybe more specialized !DataCrawler implementations, e.g. a !WindowsFileSystemCrawler with OS-specific optimizations
 * Extractor interface
   * Extractor implementations for everything we can easily support: PDF, Word, Excel, HTML, plain text, ...
   * New domain for us but also probably very doable: PNG, JPG, AVI, ...
 * !ArchiveExtractor interface
   * !ArchiveExtractor implementations for Zip and Gzip
 * !LinkExtractor interface
   * !LinkExtractor implementation for HTML and XHTML
   * Later maybe PDF, Flash, ...
 * !MimetypeIdentifier interface
   * Basic !MimetypeIdentifier implementation based on magic numbers; an absolute necessity for choosing the right Extractor, !LinkExtractor or !ArchiveExtractor implementation for a given file
 * [http://www.osgi.org/ OSGi] bindings and connector code (can be realized so that our code is also usable outside an OSGi-based application)
 * Configuration GUI (what needs to be configured? isn't this very application-specific?)
 * Sample GUI application showing how to use it. Can also be used as a test application, e.g. when you are developing new Extractor implementations.
 * Metadata format descriptions (RDFS schema) and example metadata files

== Supported File Formats ==

Right from the beginning we will support these file formats:

 * Plain text
 * HTML
 * XML
 * PDF (Portable Document Format)
 * RTF (Rich Text Format)
 * Microsoft Word 97+
 * Microsoft Excel 97+
 * Microsoft !PowerPoint 97+
 * Microsoft Works
 * !OpenOffice 1.0+: Writer, Calc, Impress, Draw
 * !StarOffice 6.0+: Writer, Calc, Impress, Draw
 * !OpenDocument (!OpenOffice 2.0+)
 * !WordPerfect 5.x
 * Emails (.eml files)
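
The magic-number approach planned for the MIME type identifier (see Components) could look roughly like the following Java sketch. It is an illustration only and not the Aperture API: the names `MagicNumberSketch`, `identify`, `handlerFor` and `Handler` are hypothetical, and the signature table covers just a few of the formats listed above.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical sketch of magic-number MIME type detection; not the Aperture API. */
final class MagicNumberSketch {

    /** Kind of component that would process a file (assumed names, for illustration). */
    enum Handler { EXTRACTOR, ARCHIVE_EXTRACTOR, NONE }

    /** One signature: the bytes a file starts with and the MIME type they imply. */
    private static final class Signature {
        final byte[] magic;
        final String mimeType;

        Signature(byte[] magic, String mimeType) {
            this.magic = magic;
            this.mimeType = mimeType;
        }

        /** True if the given leading bytes start with this signature. */
        boolean matches(byte[] head) {
            if (head.length < magic.length) {
                return false;
            }
            for (int i = 0; i < magic.length; i++) {
                if (head[i] != magic[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // A deliberately small table of well-known file signatures.
    private static final List<Signature> SIGNATURES = List.of(
        new Signature(ascii("%PDF"), "application/pdf"),
        new Signature(ascii("{\\rtf"), "application/rtf"),
        new Signature(new byte[] {0x50, 0x4B, 0x03, 0x04}, "application/zip"),
        new Signature(new byte[] {0x1F, (byte) 0x8B}, "application/gzip"),
        new Signature(new byte[] {(byte) 0x89, 0x50, 0x4E, 0x47,
                                  0x0D, 0x0A, 0x1A, 0x0A}, "image/png"));

    /** Returns the MIME type implied by the first bytes of a file, or a generic fallback. */
    static String identify(byte[] head) {
        for (Signature signature : SIGNATURES) {
            if (signature.matches(head)) {
                return signature.mimeType;
            }
        }
        return "application/octet-stream";
    }

    /** Picks the kind of component for a MIME type (assumed mapping, for illustration). */
    static Handler handlerFor(String mimeType) {
        switch (mimeType) {
            case "application/zip":
            case "application/gzip":
                return Handler.ARCHIVE_EXTRACTOR;
            case "application/pdf":
            case "application/rtf":
                return Handler.EXTRACTOR;
            default:
                return Handler.NONE;
        }
    }
}
```

For example, `MagicNumberSketch.identify("%PDF-1.4".getBytes(StandardCharsets.US_ASCII))` returns `application/pdf`, which `handlerFor` maps to `EXTRACTOR`. Magic numbers alone have limits. !OpenOffice and !OpenDocument files are ZIP containers and share the ZIP signature, and the Word, Excel and !PowerPoint 97 formats share one container signature. A real identifier would probably also need to look at file names or container contents.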