wiki:ApertureRDF

The Use of RDF in Aperture

We should discuss where and how RDF is used in this framework. In previous email discussions we already thought about using RDF as a way to let an Extractor output its extracted information, because of the flexibility it provides:

  • no assumptions about what the metadata looks like: it can be very simple or very complex
  • easy to store in RDF stores with no transformation necessary (provided that the store supports named graphs)

Also, this would be a unique selling point, making Aperture stand apart from projects like Nutch, Zilverline, Lius etc., which also provide frameworks for extracting and handling full-text and metadata.

This design decision also applies to DataObjects, which currently use a Map with dedicated keys, defined per DataObject type. I would be in favour of changing this to "something RDF", as it considerably eases development.

Leo came up with an idea that allows delivering RDF while at the same time providing a simpler interface to programmers not knowledgeable in RDF. The idea is to create a class that implements both the org.openrdf.model.Graph interface and the java.util.Map interface. The effect of

result.put(authorURI, "chris");

with the authorURI being equal to the URI of the author predicate, would then be equal to

result.add(documentURI, authorURI, "chris");

I.e., you can use the Map methods to insert simple resource-predicate-literal statements (the majority), which is simple to document and understand, whereas people who know what they are doing can also add arbitrary RDF statements.

The concrete manifestation of these ideas can now be found in wiki:ApertureRDFMap.
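
As a rough illustration of the idea (not the ApertureRDFMap code itself), the following minimal sketch shows a Map view and a statement-oriented view sharing one store. Everything in it is an assumption made for illustration: Triple and RdfMapSketch are hypothetical classes, plain strings stand in for URIs and literals, and the real org.openrdf.model.Graph interface is not implemented. The sketch also has to settle one point the description above leaves open: put replaces an existing value for the same predicate, because the Map contract requires that.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for an RDF statement (not org.openrdf.model.Statement).
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Triple)) {
            return false;
        }
        Triple t = (Triple) other;
        return subject.equals(t.subject)
                && predicate.equals(t.predicate)
                && object.equals(t.object);
    }

    @Override
    public int hashCode() {
        return Objects.hash(subject, predicate, object);
    }
}

// Hypothetical sketch: a Map view over the statements about one subject.
final class RdfMapSketch extends AbstractMap<String, String> {
    private final String subject;
    private final List<Triple> triples = new ArrayList<>();

    RdfMapSketch(String subject) {
        this.subject = subject;
    }

    // Graph-style method: add an arbitrary statement, ignoring duplicates.
    boolean add(String s, String p, String o) {
        Triple t = new Triple(s, p, o);
        return !triples.contains(t) && triples.add(t);
    }

    // Map-style method: shorthand for add(subject, predicate, value).
    // Assumption: an existing value for the predicate is replaced, as Map requires.
    @Override
    public String put(String predicate, String value) {
        String previous = get(predicate);
        triples.removeIf(t -> t.subject.equals(subject) && t.predicate.equals(predicate));
        add(subject, predicate, value);
        return previous;
    }

    // Read-only Map view: the first value per predicate for this subject.
    @Override
    public Set<Map.Entry<String, String>> entrySet() {
        Map<String, String> view = new LinkedHashMap<>();
        for (Triple t : triples) {
            if (t.subject.equals(subject)) {
                view.putIfAbsent(t.predicate, t.object);
            }
        }
        return Collections.unmodifiableMap(view).entrySet();
    }

    // All statements, including those added through the graph-style method.
    List<Triple> statements() {
        return Collections.unmodifiableList(triples);
    }
}
```

With this sketch, calling put(authorURI, "chris") on an instance created for documentURI leaves the same single statement in the store as add(documentURI, authorURI, "chris"), while statements about other subjects stay invisible to the Map view.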

Sesame 2

Some notes on the development of Sesame 2 and how it applies to Aperture. The Sesame guys are finishing the last work towards an alpha release. This means the code is still under development, although the core interfaces are stabilizing.

One change is that the model and model.impl packages are removed from Rio. Furthermore, Graph and GraphImpl have been removed from these packages, as well as all methods that change something in the RDF structure (e.g. Resource.addProperty).

Arjohn explained this decision to me as follows. Have a look at the architecture graphic on http://www.openrdf.org/doc/sesame2/system/ch02.html. The RDF Model at the bottom is the foundation on which the rest of the system manipulates RDF information. Being able to manipulate the RDF at the model level is awkward and may cause problems, as it bypasses the Sail stack and any inferencing, security restrictions, etc. that take place in it. Therefore these interfaces are read-only from now on. This way it is, for example, no longer possible to add properties to Resources obtained from a query result, which could otherwise result in undefined behaviour.

If you want to manipulate statements, the Repository class is the way to go. It contains methods for adding triples as well as functionality for posing queries, extracting RDF, etc. Furthermore, a number of utility classes (URIVertex, LiteralVertex, ...) are provided that take a Repository as an argument and let you treat the RDF statements as a graph data structure.

The only drawback of the Repository class is that it's quite a big class (note that it is a class and not an interface!). Also, just creating a repository is not enough: it always operates on top of a Sail. This architecture provides great flexibility at the cost of more code complexity.

Example: you want to create an in-memory RDF "container" that you can pass to an Extractor:

        Repository repository = new Repository(new MemoryStore());
        repository.initialize();
        extractor.doYourWork(docURI, repository);

Since we're now passing the Repository, we should also pass the document URI so that the Extractor knows around which resource it has to create a CBD (Concise Bounded Description).

The Extractor may then do something as follows:

        repository.add(docURI, Vocabulary.titleURI, new LiteralVertex(repository, titleString));
        repository.add(docURI, Vocabulary.fullTextURI, new LiteralVertex(repository, fullText));

assuming the full text is put as a literal in the RDF.

The following code uses a graph-oriented approach, but its effect is exactly the same:

        URIVertex docVertex = new URIVertex(repository, docURI);
        docVertex.addProperty(Vocabulary.titleURI, new LiteralVertex(repository, titleString));
        docVertex.addProperty(Vocabulary.fullTextURI, new LiteralVertex(repository, fullText));

Since the Repository is specified as a parameter to the Extractor, starting and committing any transactions is the responsibility of the integrator. In the case of the memory store, this can even be omitted.

Clarification by Jeen: the Repository by default works in an 'autoCommit' mode. This means that the user does not have to worry about starting and stopping transactions: every action (such as adding a property or a file) is automatically a single transaction. It is possible (but not necessary) to switch this off or to use transactions explicitly, in order to bundle operations together in batches. This can give better performance, but of course it puts more responsibility on the user, who has to make sure transactions are started and committed properly.
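
The difference between the two modes can be pictured with a toy store. This is only a sketch under stated assumptions: ToyStore and all of its methods (setAutoCommit, add, commit, committedSize, commitCount) are hypothetical names invented for illustration and do not reproduce the Sesame 2 Repository API, which was still changing at the time of writing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy store: it only mimics the commit behaviour described above
// and is not the Sesame 2 Repository class.
final class ToyStore {
    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    private boolean autoCommit = true; // default mode, as in Jeen's description
    private int commitCount = 0;

    void setAutoCommit(boolean autoCommit) {
        this.autoCommit = autoCommit;
    }

    // In autoCommit mode every add is its own transaction.
    void add(String statement) {
        pending.add(statement);
        if (autoCommit) {
            commit();
        }
    }

    // Makes all pending statements visible in one step.
    void commit() {
        if (pending.isEmpty()) {
            return;
        }
        committed.addAll(pending);
        pending.clear();
        commitCount++;
    }

    int committedSize() {
        return committed.size();
    }

    int commitCount() {
        return commitCount;
    }
}
```

In the default mode, two add calls cost two commits. After setAutoCommit(false), two further add calls stay pending until the caller invokes commit() once, which is the batching Jeen describes, together with the extra responsibility of not forgetting that call.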

Conclusion Leo

Since the Sesame 2 Repository API provides less usability than Leo wishes to offer developers, I would suggest sticking to the RDFMap idea and providing about 10-20 methods there that "do the trick". In these methods I would mix resource-based and model-based methods, for ease of implementation. Although it would be nice to have a real "abstract" layer, I still hesitate to use rdf2go, because it requires Java 1.5 and has not been adapted to Sesame 2 yet. I will ask Max, though, whether he can do these two things.

Comment by Jeen: although there is always a careful balance to be struck here between ease of use and performance, I can understand the need to provide a 'simple' API that offers the set of most common operations on RDF to the user. I would however always build in a back door, and keep in mind that you may need to extend this interface later. The Sesame API is not as large as it is because we happen to like it that way: these methods serve real purposes and offer various ways around performance issues, complex querying questions, dealing with various RDF serializations, etc. Making your wrapper too simple will inevitably result in bumping into such issues all over again. Implementing such a wrapper (once the set of methods it should offer has been determined) is a minor task that shouldn't take longer than a few days.

  • Max: backdoor included: ask an rdf2go-Model for getUnderlyingImplementation and you get back your Repository.
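
The back door that Jeen asks for and Max mentions can be pictured as follows. This is a minimal sketch under assumptions: FullStore and SimpleModel are hypothetical classes, not rdf2go or Sesame API, and countWithPredicate merely stands in for the advanced operations a simple wrapper would leave out. Only the method name getUnderlyingImplementation is taken from Max's comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the full-featured store (the role the Repository plays).
final class FullStore {
    private final List<String[]> statements = new ArrayList<>();

    void add(String subject, String predicate, String object) {
        statements.add(new String[] {subject, predicate, object});
    }

    // Stands in for the advanced operations a simple wrapper does not expose.
    int countWithPredicate(String predicate) {
        int count = 0;
        for (String[] statement : statements) {
            if (statement[1].equals(predicate)) {
                count++;
            }
        }
        return count;
    }
}

// Hypothetical simple facade: a few common operations plus a back door.
final class SimpleModel {
    private final FullStore store;
    private final String subject;

    SimpleModel(FullStore store, String subject) {
        this.store = store;
        this.subject = subject;
    }

    // Common case: a statement about the wrapped subject.
    void put(String predicate, String value) {
        store.add(subject, predicate, value);
    }

    // Back door: hands out the wrapped store for anything the facade lacks.
    FullStore getUnderlyingImplementation() {
        return store;
    }
}
```

The facade covers the common case, and callers who hit its limits can reach the wrapped store without the facade having to grow a method for every advanced need.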
Last modified on 10/28/05 17:56:51