== Semantic Data Access by Aduna & DFKI ==

A framework to extract data and fulltext from various data sources and store them in systems like gnowsis or Aduna Metadata Server.

== Sourceforge Project ==

Administrators: Christiaan Fluit & Leo Sauermann

Source Code: Interfaces and standard implementations of the SeDAF

The source will contain all relevant information about semantic data extraction: everything needed to get started with a fulltext and metadata extraction framework. Our intent is that developers can download a single distribution file with a fully working environment that also includes adapter and extractor implementations. Developers can use this package to fill their Lucene-based applications or other data stores.

The features of the framework will be:

* Easy to use: easy to learn, easy to code, easy to deploy in industrial projects
* Extract fulltext from many common file formats and information systems like IMAP email servers
* Extract metadata like author, date, subject and more from the data sources
* Open the data objects for viewing
* Fully configurable framework: storing and editing config files is done through a Swing GUI
* Pluggable architecture: can be easily extended and easily integrated into other projects
* Architecture based on the industry standard OSGi
* Compatible with RDF, but not solely based on it

Components in the framework are:

* DataSource Interface
* TextExtractor Interface
* DataSource implementation for Filesystem
* DataSource implementation for IMAP mail servers
* TextExtractor implementation for everything we know: PDF, Word, Fulltext, Excel
* OSGi bindings and connector code
* Configuration GUI
* Sample application showing how to use it, with GUI (either Autofocus, Sesame or Gnowsis)
* Metadata format description (RDFS schema) and example file for the metadata

Right from the beginning we will support the following file types:

* Plain text
* HTML
* XML
* PDF (Portable Document Format)
* RTF (Rich Text Format)
* Microsoft Word 97+
* Microsoft Excel 97+
* Microsoft Powerpoint 97+
* Microsoft Works
* OpenOffice 1.0+: Writer, Calc, Impress, Draw
* StarOffice 6.0+: Writer, Calc, Impress, Draw
* WordPerfect 5.x
* Emails
* IMAP Servers

== license ==

The Aperture project is published with the following licensing policy: the core project interfaces and architecture are published under the OSL (Open Software License).

== credits ==

The following third-party libraries have helped make the metadata framework the success that it is. These freely available libraries deserve a lot of credit for that, and we highly recommend them to others as well!

* Gnowsis: http://www.gnowsis.org/
* HtmlParser: http://htmlparser.sourceforge.net/
* Idmeta: http://www.geocities.com/marcoschmidt.geo/
* Jakarta Commons FileUpload: http://jakarta.apache.org/commons/fileupload/
* Jakarta Lucene: http://jakarta.apache.org/lucene/
* Jakarta POI: http://jakarta.apache.org/poi/
* Java Look and Feel Graphics Repository: http://java.sun.com/developer/techDocs/hi/repository/
* JavaBeans Activation Framework: http://java.sun.com/products/javabeans/glasgow/jaf.html
* JavaMail API: http://java.sun.com/products/javamail/
* JGoodies Looks: http://www.jgoodies.com/freeware/looks/
* NGramJ: http://ngramj.sourceforge.net/
* PDFBox: http://www.pdfbox.org/
* Sesame: http://www.openrdf.org/
* WinLAF: https://winlaf.dev.java.net/
* Xpdf: http://www.foolabs.com/xpdf/
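
== example ==

The component list above names a DataSource interface and a TextExtractor interface but does not show what they look like. The following is only a rough sketch of how such a pair might fit together, not the project's actual API: every type, method and field name here (<code>DataObject</code>, <code>listObjects</code>, <code>supports</code>, <code>extractText</code>, <code>InMemoryDataSource</code>, <code>ExtractionPipeline</code>) is a hypothetical name chosen for illustration, and the in-memory data source merely stands in for the filesystem and IMAP implementations. It assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical: one item delivered by a data source (a file, an email, ...).
final class DataObject {
    final String uri;
    final String mimeType;
    final byte[] content;
    // Hypothetical slot for metadata such as author, date, subject.
    final Map<String, String> metadata = new LinkedHashMap<>();

    DataObject(String uri, String mimeType, byte[] content) {
        this.uri = uri;
        this.mimeType = mimeType;
        this.content = content;
    }

    InputStream open() {
        return new ByteArrayInputStream(content);
    }
}

// Hypothetical stand-in for the "DataSource Interface" component.
interface DataSource {
    List<DataObject> listObjects();
}

// Hypothetical stand-in for the "TextExtractor Interface" component.
interface TextExtractor {
    boolean supports(String mimeType);

    String extractText(InputStream in) throws IOException;
}

// Simplest possible extractor: treats the stream as UTF-8 plain text.
final class PlainTextExtractor implements TextExtractor {
    public boolean supports(String mimeType) {
        return "text/plain".equals(mimeType);
    }

    public String extractText(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
}

// Illustration-only data source that serves objects held in memory.
final class InMemoryDataSource implements DataSource {
    private final List<DataObject> objects = new ArrayList<>();

    void add(DataObject object) {
        objects.add(object);
    }

    public List<DataObject> listObjects() {
        return new ArrayList<>(objects);
    }
}

// Walks a data source and hands each object to the first extractor that supports it.
final class ExtractionPipeline {
    private final List<TextExtractor> extractors;

    ExtractionPipeline(List<TextExtractor> extractors) {
        this.extractors = extractors;
    }

    // Returns URI -> extracted text; objects without a matching extractor are skipped.
    Map<String, String> run(DataSource source) {
        Map<String, String> result = new LinkedHashMap<>();
        for (DataObject object : source.listObjects()) {
            for (TextExtractor extractor : extractors) {
                if (!extractor.supports(object.mimeType)) {
                    continue;
                }
                try (InputStream in = object.open()) {
                    result.put(object.uri, extractor.extractText(in));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                break;
            }
        }
        return result;
    }
}
```

In a sketch like this, the resulting map of URI to text is what an application would pass on to a Lucene index or another data store. Keeping format detection (<code>supports</code>) separate from extraction is one way to let new extractors be plugged in without touching the pipeline.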