Version 7 (modified by anonymous, 19 years ago)
Aperture: Semantic Data Access by Aduna & DFKI
Goal: to extract data and full text from various data sources and store them in systems like Gnowsis or Aduna Metadata Server.
Sourceforge Project
Administrators: Christiaan Fluit & Leo Sauermann
Source Code: Interfaces and standard implementations of the SeDAF
The source will contain all relevant information about semantic data extraction: everything that is needed to get started with a full-text and metadata extraction framework. Our intent is that developers can download a single distribution file containing a fully working environment, including adapter and extractor implementations. Developers can use this package to fill their Lucene-based applications or other data stores.
The features of the framework will be:
- Easy to use: easy to learn, easy to code against, and easy to deploy in industrial projects
- Extracts full text from many common file formats and from information systems such as IMAP email servers
- Extracts metadata such as author, date, and subject from the data sources
- Opens the data objects for viewing
- Fully configurable: configuration files are stored and edited through a Swing GUI
- Pluggable architecture: easy to extend and easy to integrate into other projects
- Architecture based on the industry standard OSGi
- Compatible with RDF, but not solely based on it
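The pluggable architecture named in the list above could look roughly like the following. This is a minimal sketch only: `ExtractorRegistry` and its methods are hypothetical names chosen for illustration, not part of the Aperture code, and in an OSGi deployment the lookup would presumably be handled by the OSGi service registry rather than a hand-written map.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical plug-in registry; none of these names come from Aperture itself.
final class ExtractorRegistry {
    // Maps a lower-cased MIME type to a function that turns raw bytes into text.
    private final Map<String, Function<byte[], String>> extractors = new LinkedHashMap<>();

    // Plug in a new extractor; a later registration for the same type replaces the earlier one.
    void register(String mimeType, Function<byte[], String> extractor) {
        extractors.put(mimeType.toLowerCase(Locale.ROOT), extractor);
    }

    // Returns the extracted text, or an empty Optional when no extractor is registered.
    Optional<String> extract(String mimeType, byte[] content) {
        Function<byte[], String> extractor = extractors.get(mimeType.toLowerCase(Locale.ROOT));
        return extractor == null ? Optional.empty() : Optional.of(extractor.apply(content));
    }
}
```

A host application would register one function per supported format at startup. Formats without a registered extractor then yield an empty result instead of an error.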
Components in the framework are:
- DataSource Interface
- TextExtractor Interface
- DataSource implementation for Filesystem
- DataSource implementation for IMAP mail servers
- TextExtractor implementations for the formats we know, including PDF, Word, plain text, and Excel
- OSGi bindings and connector code
- Configuration GUI
- Sample application with a GUI showing how to use the framework (one of Autofocus, Sesame, or Gnowsis)
- Metadata format description (RDFS schema) and example file for the metadata
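To make the split between the DataSource and TextExtractor components more concrete, here is a minimal sketch of how the two interfaces might fit together. Only the names `DataSource` and `TextExtractor` come from the list above. The method signatures, the `DataObject` class, and the two small implementations are assumptions for illustration and do not reflect the published API. The code needs Java 9 or later because of `InputStream.readAllBytes`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical: one item delivered by a data source (an identifier plus raw content).
final class DataObject {
    final String uri;
    private final byte[] content;

    DataObject(String uri, byte[] content) {
        this.uri = uri;
        this.content = content.clone();
    }

    InputStream open() {
        return new ByteArrayInputStream(content);
    }
}

// Assumed shape: a data source enumerates the objects it can reach.
interface DataSource {
    List<DataObject> listObjects();
}

// Assumed shape: an extractor declares what it supports and turns a stream into text.
interface TextExtractor {
    boolean supports(String mimeType);
    String extract(InputStream in);
}

// Stand-in for a filesystem or IMAP source: keeps its objects in memory.
final class InMemoryDataSource implements DataSource {
    private final List<DataObject> objects = new ArrayList<>();

    void add(String uri, String text) {
        objects.add(new DataObject(uri, text.getBytes(StandardCharsets.UTF_8)));
    }

    @Override
    public List<DataObject> listObjects() {
        return new ArrayList<>(objects);
    }
}

// Simplest possible extractor: decodes the stream as UTF-8 plain text.
final class PlainTextExtractor implements TextExtractor {
    @Override
    public boolean supports(String mimeType) {
        return "text/plain".equals(mimeType);
    }

    @Override
    public String extract(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the source and the extractor behind separate interfaces means a new storage system or a new file format can be added without touching the other side.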
Right from the beginning we will support the following file types:
- Plain text
- HTML
- XML
- PDF (Portable Document Format)
- RTF (Rich Text Format)
- Microsoft Word 97+
- Microsoft Excel 97+
- Microsoft Powerpoint 97+
- Microsoft Works
- OpenOffice 1.0+: Writer, Calc, Impress, Draw
- StarOffice 6.0+: Writer, Calc, Impress, Draw
- WordPerfect 5.x
- Emails
- IMAP Servers
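Supporting this many formats implies some step that decides which extractor applies to a given file. The sketch below maps file extensions to MIME types for a few of the formats listed above. It is an assumption about how such a lookup could be done, not a description of Aperture's detection logic, which may well inspect file contents instead. The class and method names are hypothetical, and `Map.of` requires Java 9 or later.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: extension-based type lookup for a subset of the listed formats.
final class FileTypes {
    private static final String UNKNOWN = "application/octet-stream";

    private static final Map<String, String> BY_EXTENSION = Map.of(
        "txt", "text/plain",
        "html", "text/html",
        "htm", "text/html",
        "xml", "text/xml",
        "pdf", "application/pdf",
        "rtf", "application/rtf",
        "doc", "application/msword",
        "xls", "application/vnd.ms-excel",
        "ppt", "application/vnd.ms-powerpoint",
        "sxw", "application/vnd.sun.xml.writer" // OpenOffice 1.0 Writer
    );

    private FileTypes() {
    }

    // Returns the MIME type for a file name, or a generic binary type when unknown.
    static String mimeTypeFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return UNKNOWN;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, UNKNOWN);
    }
}
```

Extensions are a cheap first guess. A fallback to the generic binary type lets the caller skip files it cannot handle.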
License
The Aperture project is published under the following licensing policy. The core parts are free to use and to extend. They should be as open as possible so that anyone can include the core of Aperture in a project, whether it is commercial, closed source, or neither. The concrete adapter and extractor implementations, which embody much work and bug fixing, are licensed under a reciprocal license: changes to the code have to be published, so that external developers contribute their bug fixes, improvements, and extensions back to the core adapter implementations.
Still open: AFL or BSD for the core. If BSD is similar to AFL, Leo would recommend BSD, as it is more commonly used.
The core project interfaces and architecture are published under the AFL (http://www.opensource.org/licenses/afl-2.1.php) or BSD (http://www.opensource.org/licenses/bsd-license.php).
The implementations of adapters are licensed under the OSL (Open Software License). http://www.opensource.org/licenses/osl-2.1.php
more about OSL / AFL:
Credits
The following third-party libraries have helped make the metadata framework the success that it is. These freely available libraries deserve a lot of credit for that, and we highly recommend them to others as well!
- Gnowsis: http://www.gnowsis.org/
- HtmlParser: http://htmlparser.sourceforge.net/
- Idmeta: http://www.geocities.com/marcoschmidt.geo/
- Jakarta Commons FileUpload: http://jakarta.apache.org/commons/fileupload/
- Jakarta Lucene: http://jakarta.apache.org/lucene/
- Jakarta POI: http://jakarta.apache.org/poi/
- Java Look and Feel Graphics Repository: http://java.sun.com/developer/techDocs/hi/repository/
- JavaBeans Activation Framework: http://java.sun.com/products/javabeans/glasgow/jaf.html
- JavaMail API: http://java.sun.com/products/javamail/
- JGoodies Looks: http://www.jgoodies.com/freeware/looks/
- NGramJ: http://ngramj.sourceforge.net/
- PDFBox: http://www.pdfbox.org/
- Sesame: http://www.openrdf.org/
- WinLAF: https://winlaf.dev.java.net/
- Xpdf: http://www.foolabs.com/xpdf/
Attachments (2)
- aperture_overview.ppt (35.0 KB) - added by sauermann 19 years ago. Rough-cut overview of the framework
- API changes (20051114).txt (9.5 KB) - added by chris 19 years ago.