<h1>Semantic Data Access by Aduna &amp; DFKI <br>
</h1>
The framework extracts data and full text from various data sources and
stores them in systems like Gnowsis or the Aduna Metadata Server.<br>
<h2>SourceForge Project</h2>
Administrators: Christiaan Fluit &amp; Leo Sauermann<br>
Source Code: Interfaces and standard implementations of the SeDAF<br>
<br>
The source will contain all relevant information about semantic data
extraction: everything needed to get started with a full-text
and metadata extraction framework. Our intent is that developers can
download a single distribution file with a fully working environment
that also includes adapter and extractor implementations. Developers
can use this package to populate their Lucene-based applications or other
data stores.<br>
<br>
The features of the framework will be:<br>
<ul>
  <li>Easy to use: easy to learn, easy to code, easy to deploy in
industrial projects<br>
  </li>
  <li>Extracts full text from many common file formats and information
systems like IMAP email servers</li>
  <li>Extracts metadata like author, date, subject and more from the
data sources</li>
  <li>Opens the data objects for viewing<br>
  </li>
  <li>Fully configurable framework; storing and editing config files is
done through a Swing GUI</li>
  <li>Pluggable architecture: can be easily extended and easily
integrated into other projects<br>
  </li>
  <li>Architecture based on the industry standard OSGi</li>
  <li>Compatible with RDF, but not solely based on it</li>
</ul>
Components in the framework are:<br>
<ul>
  <li>DataSource interface</li>
  <li>TextExtractor interface</li>
  <li>DataSource implementation for the file system</li>
  <li>DataSource implementation for IMAP mail servers</li>
  <li>TextExtractor implementations for every format we know: PDF, Word,
Fulltext, Excel</li>
  <li>OSGi bindings and connector code<br>
  </li>
  <li>Configuration GUI</li>
  <li>Sample application with a GUI showing how to use the framework
(either Autofocus, Sesame or Gnowsis)</li>
  <li>Metadata format description (RDFS schema) and an example file for
the metadata<br>
  </li>
</ul>
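As a rough illustration of how a DataSource and a TextExtractor from the
component list above might fit together, here is a minimal Java sketch.
Every type name, method signature and the in-memory data source in it are
assumptions made for this example; none of them is taken from the SeDAF
source code. It assumes Java 9 or later.<br>

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: all names and signatures below are assumptions,
// not the actual SeDAF API.

/** One item delivered by a data source. */
interface DataObject {
    String getId();
    String getMimeType();
    InputStream openContent() throws IOException;
}

/** Anything that can enumerate data objects, e.g. a file system or an IMAP server. */
interface DataSource {
    List<DataObject> listObjects();
}

/** Converts the raw content of one format into plain text. */
interface TextExtractor {
    boolean supports(String mimeType);
    String extractText(InputStream in) throws IOException;
}

/** Extractor for plain text; decodes the whole stream as UTF-8. */
class PlainTextExtractor implements TextExtractor {
    @Override public boolean supports(String mimeType) {
        return "text/plain".equals(mimeType);
    }
    @Override public String extractText(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
}

/** In-memory stand-in for a real data source. */
class InMemoryDataSource implements DataSource {
    private final List<DataObject> objects = new ArrayList<>();

    void add(String id, String mimeType, String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        objects.add(new DataObject() {
            @Override public String getId() { return id; }
            @Override public String getMimeType() { return mimeType; }
            @Override public InputStream openContent() { return new ByteArrayInputStream(bytes); }
        });
    }

    @Override public List<DataObject> listObjects() { return objects; }
}

class ExtractionSketch {
    /** Returns id -> extracted text for every object that some extractor supports. */
    static Map<String, String> extractAll(DataSource source, List<TextExtractor> extractors) {
        Map<String, String> result = new LinkedHashMap<>();
        for (DataObject obj : source.listObjects()) {
            for (TextExtractor extractor : extractors) {
                if (!extractor.supports(obj.getMimeType())) continue;
                try (InputStream in = obj.openContent()) {
                    result.put(obj.getId(), extractor.extractText(in));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                break; // the first matching extractor wins
            }
        }
        return result;
    }

    public static void main(String[] args) {
        InMemoryDataSource source = new InMemoryDataSource();
        source.add("notes.txt", "text/plain", "hello world");
        source.add("photo.png", "image/png", "not text"); // skipped: no extractor supports it
        System.out.println(extractAll(source, List.of(new PlainTextExtractor())));
    }
}
```

In this sketch the data source knows nothing about file formats and the
extractor knows nothing about where the bytes come from, which is one way
to get the pluggability described above: a new format or a new source
could be added without touching the other side.<br>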
Right from the beginning we will support the following file types:<br>
<ul>
  <li>Plain text</li>
  <li>HTML</li>
  <li>XML</li>
  <li>PDF (Portable Document Format)</li>
  <li>RTF (Rich Text Format)</li>
  <li>Microsoft Word 97+</li>
  <li>Microsoft Excel 97+</li>
  <li>Microsoft PowerPoint 97+</li>
  <li>Microsoft Works</li>
  <li>OpenOffice 1.0+: Writer, Calc, Impress, Draw</li>
  <li>StarOffice 6.0+: Writer, Calc, Impress, Draw</li>
  <li>WordPerfect 5.x</li>
  <li>Emails</li>
  <li>IMAP Servers</li>
</ul>
<h2>Credits<br>
</h2>
The following third-party libraries have helped make the metadata
framework the success that it is. These freely available libraries deserve
a lot of credit for that, and we highly recommend them to others
as well!<br>
<ul>
  <li>Gnowsis: http://www.gnowsis.org/</li>
  <li>HtmlParser: http://htmlparser.sourceforge.net/</li>
  <li>Idmeta: http://www.geocities.com/marcoschmidt.geo/</li>
  <li>Jakarta Commons FileUpload:
http://jakarta.apache.org/commons/fileupload/</li>
  <li>Jakarta Lucene: http://jakarta.apache.org/lucene/</li>
  <li>Jakarta POI: http://jakarta.apache.org/poi/</li>
  <li>Java Look and Feel Graphics Repository:
http://java.sun.com/developer/techDocs/hi/repository/</li>
  <li>JavaBeans Activation Framework:
http://java.sun.com/products/javabeans/glasgow/jaf.html</li>
  <li>JavaMail API: http://java.sun.com/products/javamail/</li>
  <li>JGoodies Looks: http://www.jgoodies.com/freeware/looks/</li>
  <li>NGramJ: http://ngramj.sourceforge.net/</li>
  <li>PDFBox: http://www.pdfbox.org/</li>
  <li>Sesame: http://www.openrdf.org/</li>
  <li>WinLAF: https://winlaf.dev.java.net/</li>
  <li>Xpdf: http://www.foolabs.com/xpdf/</li>
</ul>
<h2>License</h2>
SeDAF is published under a BSD- or CPL-compatible license.<br>