
Aperture Overview

Administrators: Christiaan Fluit & Leo Sauermann

Source Code: Interfaces and standard implementations of Aperture

Project name

From Merriam-Webster Online:

Main Entry: ap·er·ture
Pronunciation: 'ap-&(r)-"chur, -ch&r, -"tyur, -"tur
Function: noun
Etymology: Middle English, from Latin apertura, from apertus, past participle of aperire to open
1 : an opening or open space : HOLE
2 a : the opening in a photographic lens that admits the light b : the diameter of the stop in an optical system that determines the diameter of the bundle of rays traversing the instrument c : the diameter of the objective lens or mirror of a telescope

Features

  • Extract data objects from common information systems like file systems, websites and mail servers
  • Extract full-text and other metadata from many common file formats
  • Easy to use: easy to learn, easy to code, easy to deploy in industrial projects
  • Open data objects for viewing
  • Fully configurable framework: config files are stored and edited through a Swing GUI
  • Pluggable architecture: can be easily extended with custom file formats, data sources, ...
  • Deployment based on the industry-standard OSGi (but not exclusively: it can also be made to work outside OSGi)
  • Based on RDF (to what extent is still to be determined)
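
The pluggable architecture listed above can be pictured as a registry that maps a MIME type to the component able to handle it. The following is only a minimal sketch under assumptions, not Aperture's actual API: the class ExtractorRegistrySketch, the TextExtractor interface and its extractText method are hypothetical names introduced for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a plug-in registry; not part of Aperture. */
final class ExtractorRegistrySketch {

    /** Assumed extractor contract: turns raw file content into full text. */
    interface TextExtractor {
        String extractText(byte[] content);
    }

    // One extractor per MIME type; registering again replaces the old entry.
    private final Map<String, TextExtractor> byMimeType = new HashMap<>();

    /** Plugs in support for a file format identified by its MIME type. */
    void register(String mimeType, TextExtractor extractor) {
        byMimeType.put(mimeType, extractor);
    }

    /** Returns the extractor for a MIME type, or empty if none is registered. */
    Optional<TextExtractor> lookup(String mimeType) {
        return Optional.ofNullable(byMimeType.get(mimeType));
    }

    public static void main(String[] args) {
        ExtractorRegistrySketch registry = new ExtractorRegistrySketch();

        // A custom format is added without touching the registry itself.
        registry.register("text/plain",
                content -> new String(content, StandardCharsets.UTF_8));

        byte[] sample = "hello aperture".getBytes(StandardCharsets.UTF_8);
        String text = registry.lookup("text/plain")
                .map(extractor -> extractor.extractText(sample))
                .orElse("(no extractor registered)");
        System.out.println(text);
    }
}
```

Keying the registry on MIME type rather than file extension is a design assumption here; it keeps the lookup independent of how a data object was named or where it came from.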

Use Cases

Applications and projects for which this project was intended to be used:

at DFKI:

at Aduna:

Components

  • DataSource interface
  • DataSource implementations for file systems, websites (or rather hypertextual sources in general) and IMAP servers
  • Near future work: OutlookSource, MozillaSource/ThunderbirdSource
  • DataAccessor interface
  • DataAccessor implementations for file, http(s) and imap schemes
  • DataCrawler interface
  • One basic DataCrawler implementation for every DataSource type
  • Later maybe more specialized DataCrawler implementations, e.g. a WindowsFileSystemCrawler with OS-specific optimizations
  • Extractor interface
  • Extractor implementation for everything we can easily support: PDF, Word, Excel, HTML, plain text, ...
  • New domain for us but also probably very doable: PNG, JPG, AVI, ...
  • ArchiveExtractor interface
  • ArchiveExtractor implementations for Zip and Gzip
  • LinkExtractor interface
  • LinkExtractor implementation for HTML and XHTML
  • Later maybe PDF, Flash, ...
  • MimeTypeIdentifier interface
  • Basic MimeTypeIdentifier implementation based on magic numbers; absolutely necessary for choosing the right Extractor, LinkExtractor or ArchiveExtractor implementation for a given file
  • OSGi bindings and connector code (can be realized so that our code is also usable outside an OSGi-based application)
  • Configuration GUI (what needs to be configured? isn't this very application-specific?)
  • Sample GUI application showing how to use it. Can also be used as a test application, e.g. when you are developing new Extractor implementations.
  • Metadata format descriptions (RDFS schema) and example metadata files
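
Identification by magic numbers, as mentioned for the basic identifier above, means comparing the first bytes of a file against known signatures. The following is a minimal sketch under assumptions, not the project's implementation: the class MagicNumberSketch, its identify method and the small signature table are hypothetical and chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of magic-number detection; not part of Aperture. */
final class MagicNumberSketch {

    // Leading byte signatures mapped to MIME types, checked in insertion order.
    private static final Map<byte[], String> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        SIGNATURES.put("{\\rtf".getBytes(StandardCharsets.US_ASCII), "application/rtf");
        SIGNATURES.put(new byte[] {'P', 'K', 0x03, 0x04}, "application/zip");
        SIGNATURES.put(new byte[] {0x1f, (byte) 0x8b}, "application/gzip");
        SIGNATURES.put(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png");
    }

    /** Returns the MIME type whose signature starts the given bytes. */
    static String identify(byte[] head) {
        for (Map.Entry<byte[], String> entry : SIGNATURES.entrySet()) {
            if (startsWith(head, entry.getKey())) {
                return entry.getValue();
            }
        }
        // Generic fallback when no signature matches.
        return "application/octet-stream";
    }

    /** True if data begins with every byte of prefix, in order. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(identify(head));
    }
}
```

The resulting MIME type is what a caller would use to pick an Extractor, LinkExtractor or ArchiveExtractor. A signature alone is not always enough: OpenOffice and OpenDocument files are ZIP packages, so they share the ZIP signature and a fuller identifier would need to inspect the archive contents or the file name as well.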

Supported File Formats

Right from the beginning we will support these file formats:

  • Plain text
  • HTML
  • XML
  • PDF (Portable Document Format)
  • RTF (Rich Text Format)
  • Microsoft Word 97+
  • Microsoft Excel 97+
  • Microsoft PowerPoint 97+
  • Microsoft Works
  • OpenOffice 1.0+: Writer, Calc, Impress, Draw
  • StarOffice 6.0+: Writer, Calc, Impress, Draw
  • OpenDocument (OpenOffice 2.0+)
  • WordPerfect 5.x
  • Emails (.eml files)
Last modified on 10/12/05 13:48:40