Aperture Overview
Administrators: Christiaan Fluit & Leo Sauermann
Source Code: Interfaces and standard implementations of Aperture
Project name
From Merriam-Webster Online:
Main Entry: ap·er·ture
Pronunciation: 'ap-&(r)-"chur, -ch&r, -"tyur, -"tur
Function: noun
Etymology: Middle English, from Latin apertura, from apertus, past participle of aperire to open
1 : an opening or open space : HOLE
2 a : the opening in a photographic lens that admits the light b : the diameter of the stop in an optical system that determines the diameter of the bundle of rays traversing the instrument c : the diameter of the objective lens or mirror of a telescope
Features
- Extract data objects from common information systems like file systems, websites and mail servers
- Extract full-text and other metadata from many common file formats
- Easy to use: easy to learn, easy to code, easy to deploy in industrial projects
- Open data objects for viewing
- Fully configurable framework; config files are stored and edited through a Swing GUI
- Pluggable architecture: can be easily extended and easily integrated into other projects
- Architecture based on the industry standard OSGi
- Compatible with RDF, but not solely based on it
Components
- DataSource interface
- DataSource implementations for file systems, websites (or rather hypertextual sources in general) and IMAP servers
- Near future work: OutlookSource, MozillaSource/ThunderbirdSource
- DataAccessor interface
- DataAccessor implementations for file, http(s) and imap schemes
- DataCrawler interface
- One basic DataCrawler implementation for every DataSource type
- Later maybe more specialized DataCrawler implementations, e.g. a WindowsFileSystemCrawler with OS-specific optimizations
- Extractor interface
- Extractor implementation for everything we can easily support: PDF, Word, Excel, HTML, plain text, ...
- New domain for us but also probably very doable: PNG, JPG, AVI, ...
- ArchiveExtractor interface
- ArchiveExtractor implementations for Zip and Gzip
- LinkExtractor interface
- LinkExtractor implementation for HTML and XHTML
- Later maybe PDF, Flash, ...
- MimeTypeIdentifier interface
- Basic MimeTypeIdentifier implementation based on magic numbers; an absolute necessity for choosing the right Extractor, LinkExtractor or ArchiveExtractor implementation for a given file
- OSGi bindings and connector code (can be realized so that code is also usable outside an OSGi-based application)
- Configuration GUI (what needs to be configured? isn't this very application-specific?)
- Sample GUI application showing how to use it. Can also be used as a test application, e.g. when you are developing new Extractor implementations.
- Metadata format descriptions (RDFS schema) and example metadata files
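To illustrate the role of the magic-number based MimeTypeIdentifier listed above, here is a minimal sketch in Java. It is not Aperture's actual API: the class `MagicNumberSketch`, the enum `ComponentKind` and the methods `identify` and `componentFor` are hypothetical names chosen for this example. Only the byte signatures themselves (PDF, Zip, Gzip, PNG, RTF) are well-known facts about those formats. The sketch inspects the leading bytes of a file and uses the detected MIME type to decide which kind of component should process it.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; these names are not part of Aperture.
final class MagicNumberSketch {

    // Which kind of component would handle a given MIME type (assumed mapping).
    enum ComponentKind { EXTRACTOR, ARCHIVE_EXTRACTOR, NONE }

    static final String UNKNOWN = "application/octet-stream";

    // MIME type -> leading byte signature ("magic number") of the format.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("application/pdf", ascii("%PDF-"));
        SIGNATURES.put("application/zip", new byte[] {0x50, 0x4B, 0x03, 0x04});
        SIGNATURES.put("application/gzip", new byte[] {0x1F, (byte) 0x8B});
        SIGNATURES.put("image/png",
                new byte[] {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A});
        SIGNATURES.put("application/rtf", ascii("{\\rtf"));
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // True if 'data' begins with all bytes of 'prefix'.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the MIME type whose signature matches the first bytes, or UNKNOWN.
    static String identify(byte[] firstBytes) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(firstBytes, entry.getValue())) {
                return entry.getKey();
            }
        }
        return UNKNOWN;
    }

    // Picks the kind of component for a MIME type (assumed dispatch rules).
    static ComponentKind componentFor(String mimeType) {
        switch (mimeType) {
            case "application/zip":
            case "application/gzip":
                return ComponentKind.ARCHIVE_EXTRACTOR;
            case "application/pdf":
            case "application/rtf":
            case "image/png":
                return ComponentKind.EXTRACTOR;
            default:
                return ComponentKind.NONE;
        }
    }

    public static void main(String[] args) {
        byte[] sample = ascii("%PDF-1.4 example content");
        String mimeType = identify(sample);
        System.out.println(mimeType + " -> " + componentFor(mimeType));
    }
}
```

A real implementation would likely need more than a prefix check: plain text and HTML have no fixed signature, and formats such as OpenOffice documents are Zip containers, so identifying them requires looking inside the archive or falling back on the file name.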
Supported File Formats
Right from the beginning we will support these file formats:
- Plain text
- HTML
- XML
- PDF (Portable Document Format)
- RTF (Rich Text Format)
- Microsoft Word 97+
- Microsoft Excel 97+
- Microsoft PowerPoint 97+
- Microsoft Works
- OpenOffice 1.0+: Writer, Calc, Impress, Draw
- StarOffice 6.0+: Writer, Calc, Impress, Draw
- OpenDocument (OpenOffice 2.0+)
- WordPerfect 5.x
- Emails (.eml files)