wiki:SemanticDataIntegrationFramework

Version 2 (modified by Leo Sauermann <leo.sauermann@…>, 19 years ago)

--

Semantic Data Access by Aduna & DFKI

The framework extracts data and fulltext from various data sources and stores them in systems like gnowsis or Aduna Metadata Server.

Sourceforge Project

Administrators: Christiaan Fluit & Leo Sauermann

Source Code: Interfaces and standard implementations of the SeDAF

The source will contain all relevant information about semantic data extraction: everything that is needed to get started with a fulltext and metadata extraction framework. Our intent is that developers can download a single distribution file with a fully working environment that also includes adapter and extractor implementations. Developers can use this package to fill their Lucene-based applications or other data stores.

The features of the framework will be:

  • Easy to use: easy to learn, easy to code, easy to deploy in industrial projects
  • Extracts fulltext from many common file formats and from information systems like IMAP email servers
  • Extracts metadata such as author, date, subject and more from the data sources
  • Opens the data objects for viewing
  • Fully configurable: config files are stored and edited through a Swing GUI
  • Pluggable architecture: easy to extend and easy to integrate into other projects
  • Architecture based on the industry standard OSGi
  • Compatible with RDF, but not solely based on it

Components in the framework are:

  • DataSource Interface
  • TextExtractor Interface
  • DataSource implementation for Filesystem
  • DataSource implementation for IMAP mail servers
  • TextExtractor implementation for everything we know: PDF, Word, Fulltext, Excel
  • OSGi bindings and connector code
  • Configuration GUI
  • Sample application showing how to use it, with GUI (either Autofocus or Sesame or Gnowsis)
  • Metadata format description (RDFS schema) and example file for the metadata
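
To make the split between the two interfaces concrete, here is a minimal sketch of how a DataSource and a TextExtractor might cooperate. It is not the SeDAF API: the method names (`listObjects`, `supports`, `extract`), the `DataObject` holder, and the in-memory source standing in for a filesystem or IMAP source are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: every name below is hypothetical, not the SeDAF API.
final class SedafSketch {

    /** One item delivered by a data source: an identifier, a type, raw bytes. */
    static final class DataObject {
        final String uri;
        final String mimeType;
        final byte[] content;

        DataObject(String uri, String mimeType, byte[] content) {
            this.uri = uri;
            this.mimeType = mimeType;
            this.content = content;
        }
    }

    /** Assumed shape of a data source: it enumerates the objects it holds. */
    interface DataSource {
        List<DataObject> listObjects();
    }

    /** Assumed shape of a text extractor: it turns one format into fulltext. */
    interface TextExtractor {
        boolean supports(String mimeType);
        String extract(InputStream in) throws IOException;
    }

    /** Trivial extractor for plain text, decoded as UTF-8. */
    static final class PlainTextExtractor implements TextExtractor {
        @Override public boolean supports(String mimeType) {
            return "text/plain".equals(mimeType);
        }
        @Override public String extract(InputStream in) throws IOException {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    /** Stand-in for a filesystem or IMAP source, backed by a list in memory. */
    static final class InMemoryDataSource implements DataSource {
        private final List<DataObject> objects = new ArrayList<>();

        void add(String uri, String mimeType, String text) {
            objects.add(new DataObject(uri, mimeType, text.getBytes(StandardCharsets.UTF_8)));
        }
        @Override public List<DataObject> listObjects() {
            return new ArrayList<>(objects);
        }
    }

    /** Runs the first matching extractor over each object; unsupported types are skipped. */
    static Map<String, String> extractAll(DataSource source, List<TextExtractor> extractors) {
        Map<String, String> fulltextByUri = new LinkedHashMap<>();
        for (DataObject object : source.listObjects()) {
            for (TextExtractor extractor : extractors) {
                if (!extractor.supports(object.mimeType)) continue;
                try (InputStream in = new ByteArrayInputStream(object.content)) {
                    fulltextByUri.put(object.uri, extractor.extract(in));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                break;
            }
        }
        return fulltextByUri;
    }

    public static void main(String[] args) {
        InMemoryDataSource source = new InMemoryDataSource();
        source.add("mem://notes.txt", "text/plain", "hello fulltext");
        source.add("mem://scan.bin", "application/octet-stream", "ignored");
        List<TextExtractor> extractors = new ArrayList<>();
        extractors.add(new PlainTextExtractor());
        System.out.println(extractAll(source, extractors));
    }
}
```

Keeping enumeration (DataSource) separate from format handling (TextExtractor) is what would let a new source or a new file format be plugged in without touching the other side. The resulting map of URI to fulltext is the kind of output an application could then feed into a Lucene index or another store.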

Right from the beginning we will support the following file types:

  • Plain text
  • HTML
  • XML
  • PDF (Portable Document Format)
  • RTF (Rich Text Format)
  • Microsoft Word 97+
  • Microsoft Excel 97+
  • Microsoft PowerPoint 97+
  • Microsoft Works
  • OpenOffice 1.0+: Writer, Calc, Impress, Draw
  • StarOffice 6.0+: Writer, Calc, Impress, Draw
  • WordPerfect 5.x
  • Emails
  • IMAP Servers
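
One way an application might decide which extractor to apply is a lookup from file extension to one of the formats listed above. The sketch below is only an assumption about how such a lookup could look: the `FormatRegistry` class, its methods, and the choice of extensions are hypothetical and cover just part of the list.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only: FormatRegistry is a hypothetical helper, not part of SeDAF.
final class FormatRegistry {

    private final Map<String, String> formatByExtension = new LinkedHashMap<>();

    /** Registers an extension (without the dot) for a format name. */
    void register(String extension, String formatName) {
        formatByExtension.put(extension.toLowerCase(Locale.ROOT), formatName);
    }

    /** Looks up the format for a file name, ignoring the case of the extension. */
    Optional<String> formatFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension to go on
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(formatByExtension.get(extension));
    }

    /** A partial, assumed mapping for some of the formats in the list above. */
    static FormatRegistry withDefaults() {
        FormatRegistry registry = new FormatRegistry();
        registry.register("txt", "Plain text");
        registry.register("html", "HTML");
        registry.register("htm", "HTML");
        registry.register("xml", "XML");
        registry.register("pdf", "PDF");
        registry.register("rtf", "RTF");
        registry.register("doc", "Microsoft Word");
        registry.register("xls", "Microsoft Excel");
        registry.register("ppt", "Microsoft PowerPoint");
        registry.register("sxw", "OpenOffice Writer");
        registry.register("sxc", "OpenOffice Calc");
        return registry;
    }

    public static void main(String[] args) {
        FormatRegistry registry = FormatRegistry.withDefaults();
        System.out.println(registry.formatFor("Report.PDF"));
        System.out.println(registry.formatFor("README"));
    }
}
```

Extension matching is a simplification: sources such as IMAP servers usually report a content type instead of a file name, so a real registry would more likely be keyed by MIME type.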

credits

The following third-party libraries have helped make the metadata framework the success that it is. These freely available libraries deserve a lot of credit for that, and we highly recommend them to others as well!

license

The SeDAF is published under a BSD or CPL compatible license.
