Version 6 (modified by anonymous, 20 years ago)

Aperture Overview

Administrators: Christiaan Fluit & Leo Sauermann

Source Code: Interfaces and standard implementations of Aperture

Project name

From Merriam-Webster Online:

Main Entry: ap·er·ture
Pronunciation: 'ap-&(r)-"chur, -ch&r, -"tyur, -"tur
Function: noun
Etymology: Middle English, from Latin apertura, from apertus, past participle of aperire to open
1 : an opening or open space : HOLE
2 a : the opening in a photographic lens that admits the light b : the diameter of the stop in an optical system that determines the diameter of the bundle of rays traversing the instrument c : the diameter of the objective lens or mirror of a telescope

Features

Easy to use: easy to learn, easy to code, easy to deploy in industrial projects
Extract fulltext from many common file formats and information systems like IMAP email servers
Extract metadata like author, date, subject and more from the data sources and file formats
Open data objects for viewing
Fully configurable framework: config files are stored and edited through a Swing GUI
Pluggable architecture: easy to extend and easy to integrate into other projects
Architecture based on the industry-standard OSGi
Compatible with RDF, but not solely based on it

Components

DataSource interface
DataSource implementations for file systems, websites (or rather hypertextual sources in general) and IMAP servers
Near future work: OutlookSource, MozillaSource/ThunderbirdSource

DataAccessor interface
DataAccessor implementations for file, http(s) and imap schemes
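
Since DataAccessor implementations are tied to URL schemes (file, http(s), imap), one way to pick the right one is a lookup by scheme. The following is only a minimal sketch of that idea. The names AccessorSketch and AccessorRegistrySketch are hypothetical and do not reflect the actual Aperture interfaces.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a DataAccessor; the real interface may look different.
interface AccessorSketch {
    String describe(URI uri);
}

// Hypothetical registry that maps a URI scheme to the accessor handling it.
final class AccessorRegistrySketch {
    private final Map<String, AccessorSketch> byScheme = new HashMap<>();

    // Schemes are stored in lower case so that lookups are case-insensitive.
    void register(String scheme, AccessorSketch accessor) {
        byScheme.put(scheme.toLowerCase(Locale.ROOT), accessor);
    }

    // Returns the accessor registered for the URI's scheme, or null if there is none.
    AccessorSketch lookup(URI uri) {
        String scheme = uri.getScheme();
        return scheme == null ? null : byScheme.get(scheme.toLowerCase(Locale.ROOT));
    }
}
```

A crawler could call lookup for every URL it encounters and skip URLs whose scheme has no registered accessor.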

DataCrawler interface
One basic DataCrawler implementation for every DataSource type
Later maybe more specialized DataCrawler implementations, e.g. a WindowsFileSystemCrawler with OS-specific optimizations

Extractor interface
Extractor implementations for everything we can easily support: PDF, Word, Excel, HTML, plain text, ...
New territory for us, but probably very doable: PNG, JPG, AVI, ...

ArchiveExtractor interface
ArchiveExtractor implementations for Zip and Gzip
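
To illustrate what a Zip-based ArchiveExtractor has to do at its core, here is a minimal sketch that lists the file entries of a Zip stream using the standard java.util.zip classes. The class name ZipListingSketch is hypothetical. A real ArchiveExtractor would presumably hand each entry on for further processing rather than only collecting names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch; not the actual ArchiveExtractor API.
final class ZipListingSketch {
    // Reads the stream as a Zip archive and returns the names of all
    // non-directory entries, in the order they appear in the archive.
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }
}
```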

LinkExtractor interface
LinkExtractor implementation for HTML and XHTML
Later maybe PDF, Flash, ...
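
As a rough illustration of link extraction from HTML, the sketch below collects quoted href values with a regular expression. The class name HtmlLinkSketch is hypothetical, and the regex approach is a deliberate simplification: it ignores unquoted attributes, comments and scripts, so a real LinkExtractor would more likely build on an HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual LinkExtractor API.
final class HtmlLinkSketch {
    // Matches href="..." or href='...' (quoted values only, any letter case).
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns the raw href values in document order, without resolving relative URLs.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }
}
```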

MimetypeIdentifier interface
Basic MimetypeIdentifier implementation based on magic numbers; an absolute necessity for choosing the right Extractor, LinkExtractor or ArchiveExtractor implementation for a given file
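
Magic-number detection compares the first bytes of a file against known signatures. The sketch below shows the idea for a handful of formats. The class name MagicNumberIdentifier and the signature table are assumptions made for illustration, not the actual Aperture implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the actual MimetypeIdentifier API.
final class MagicNumberIdentifier {
    // Leading byte signatures mapped to MIME types; iteration order is the match order.
    private static final Map<byte[], String> SIGNATURES = new LinkedHashMap<>();

    static {
        SIGNATURES.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        SIGNATURES.put("{\\rtf".getBytes(StandardCharsets.US_ASCII), "application/rtf");
        SIGNATURES.put(new byte[] {'P', 'K', 3, 4}, "application/zip");
        SIGNATURES.put(new byte[] {(byte) 0x1F, (byte) 0x8B}, "application/gzip");
        SIGNATURES.put(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png");
    }

    // Returns the MIME type whose signature prefixes the given bytes, or null if none matches.
    static String identify(byte[] head) {
        for (Map.Entry<byte[], String> e : SIGNATURES.entrySet()) {
            if (startsWith(head, e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Magic numbers alone are not always sufficient: OpenOffice and OpenDocument files are Zip containers, so they share the Zip signature and need a further look inside the archive. Plain text has no signature at all and is typically the fallback when nothing matches.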

OSGi bindings and connector code (can be realized so that code is also usable outside an OSGi-based application)
Configuration GUI (what needs to be configured? isn't this very application-specific?)
Sample GUI application showing how to use it. Can also be used as a test application, e.g. when you are developing new Extractor implementations.
Metadata format descriptions (RDFS schema) and example metadata files

Supported File Formats

Right from the beginning we will support these file formats:

Plain text
HTML
XML
PDF (Portable Document Format)
RTF (Rich Text Format)
Microsoft Word 97+
Microsoft Excel 97+
Microsoft PowerPoint 97+
Microsoft Works
OpenOffice 1.0+: Writer, Calc, Impress, Draw
StarOffice 6.0+: Writer, Calc, Impress, Draw
OpenDocument (OpenOffice 2.0+)
WordPerfect 5.x
Emails (.eml files)
