wiki:ApertureRdfContainerProblems

Version 2 (modified by grimnes, 19 years ago)

Problems with RDFContainer

Solution: own in-mem sail?

Leo: How about writing our own in-memory Sail that does not support transactions, contexts, etc.? It is only needed to ship RDF from the DataAccessor to the CrawlerHandler. The CrawlerHandler then has to read it again and copy all the triples to another repository.

Then the default RDFContainer (in-memory Sesame) must no longer contain a "context". A context is only used when the RDFContainer is created via the factory methods of the CrawlerHandler.

  • pro: for 80% of the use cases, it gets simpler (extracting data to things like gnowsis or Lucene)
  • pro: for the 20% of the use cases where the DataAccessor directly streams into the RDFContainer (=the sesame database), the CrawlerHandler can provide a context-aware RDFContainer.
  • con: we have to write it.
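
To make the proposal more concrete, here is a rough sketch of what such a context-free, transaction-free buffer and the copy step in the CrawlerHandler could look like. All names here (Triple, SimpleTripleBuffer, TripleSink, CopyingHandler) are hypothetical and not part of Aperture or Sesame; the target repository is reduced to a one-method interface so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical plain triple; no context, no transaction state.
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

// Hypothetical buffer the DataAccessor would fill. Statements are visible
// immediately because there is nothing to commit.
class SimpleTripleBuffer {
    private final List<Triple> triples = new ArrayList<>();

    void add(String subject, String predicate, String object) {
        triples.add(new Triple(subject, predicate, object));
    }

    List<Triple> getTriples() {
        return Collections.unmodifiableList(triples);
    }
}

// Hypothetical stand-in for the real target repository (assumption:
// the target accepts a context per statement).
interface TripleSink {
    void add(Triple triple, String context);
}

// The copy step the CrawlerHandler would perform: read the buffer again
// and write every triple into the target, attaching the context there.
class CopyingHandler {
    static int copyAll(SimpleTripleBuffer buffer, TripleSink target, String context) {
        int copied = 0;
        for (Triple t : buffer.getTriples()) {
            target.add(t, context);
            copied++;
        }
        return copied;
    }
}
```

In this sketch the context only appears at copy time, which matches the idea that the default container should not have to know about it.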

Problem autocommit

When you use a SesameRDFContainer, its Repository typically has its auto-commit mode switched off for performance reasons. However, this breaks the contract of RDFContainer's API. The reason is that statements in the repository are not visible until a commit is performed. So when you do a put() for a certain property and you later do a put() for the same property with a different value, you expect that the latter put() overwrites the former value. However, when the repository hasn't been committed yet, replaceInternal won't see the first value. Likewise, the get methods will not return a value for that property. When you finally commit, you end up with both values being stored, leading to a MultipleValuesException upon retrieval.
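
The effect can be reproduced without Sesame. The following is a simplified simulation, not the actual SesameRDFContainer code: DeferredStore and SketchContainer are hypothetical classes, and IllegalStateException stands in for MultipleValuesException. The only assumption carried over from the description above is that reads see committed statements only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical store with auto-commit switched off: adds and removals
// are queued and stay invisible to reads until commit() is called.
class DeferredStore {
    private final Map<String, List<String>> committed = new HashMap<>();
    private final Map<String, List<String>> pendingAdds = new HashMap<>();
    private final Set<String> pendingRemovals = new HashSet<>();

    void add(String property, String value) {
        pendingAdds.computeIfAbsent(property, k -> new ArrayList<>()).add(value);
    }

    void remove(String property) {
        pendingRemovals.add(property);
    }

    // Reads only see committed statements.
    List<String> get(String property) {
        return new ArrayList<>(committed.getOrDefault(property, new ArrayList<>()));
    }

    void commit() {
        for (String property : pendingRemovals) {
            committed.remove(property);
        }
        pendingAdds.forEach((property, values) ->
                committed.computeIfAbsent(property, k -> new ArrayList<>()).addAll(values));
        pendingRemovals.clear();
        pendingAdds.clear();
    }
}

// Hypothetical container whose put() is supposed to overwrite.
class SketchContainer {
    final DeferredStore store = new DeferredStore();

    void put(String property, String value) {
        // The "replace" step: it can only remove values it can see.
        if (!store.get(property).isEmpty()) {
            store.remove(property);
        }
        store.add(property, value);
    }

    String get(String property) {
        List<String> values = store.get(property);
        if (values.size() > 1) {
            throw new IllegalStateException("multiple values for " + property);
        }
        return values.isEmpty() ? null : values.get(0);
    }
}

class AutoCommitDemo {
    public static void main(String[] args) {
        SketchContainer container = new SketchContainer();
        container.put("title", "first");
        container.put("title", "second");        // does not see "first"
        System.out.println(container.get("title")); // prints "null"

        container.store.commit();                 // both values become visible
        try {
            container.get("title");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());   // prints "multiple values for title"
        }
    }
}
```

Calling commit() between the two put() calls makes the second put() see and replace the first value, which is exactly the Sesame-specific workaround described here.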

My example classes are already crowded with commits to make sure certain put and get methods work correctly. There is something to be said for this, as these classes also provide the CrawlerHandler implementations that manage the repository. However, I now had to add Sesame-specific code in WebCrawler, as it depends on the overwriting capability of the put method.

This is a problem with the SesameRDFContainer, but I can imagine that other implementations working with persistent storage facilities will have similar issues.

Although I still like the simplicity of the RDFContainer API, this is another item on the list of problems I've had with it. Does anyone see a simple solution for this, other than adding a commit() method to RDFContainer or using the Repository's auto-commit mode (which hurts performance badly)?

Problem BlankNodes

For many data sources it is necessary to create some blank nodes to express more complicated metadata.

For instance, in the ImapCrawler we create the following:

myimapsource:Blah-1 a :Message ; :to [ a :Person ; :email "bob@bob.com" ] .

(hope everyone reads N3 :)

Here we need an ID for the node representing bob. Currently each crawler makes up this node ID itself, but this is tricky, because blank node IDs must conform to XML names, which are not trivial to define (the specs are pretty much unreadable). To get around this, Chris and Gunnar have been discussing adding a createBNode() method to the RDFContainer, which returns a bnode guaranteed to be unique within this store/repository/whatever. Good plan?
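
As a starting point for that discussion, here is one way the ID generation behind such a createBNode() method might work. This is only a sketch: BNodeIdGenerator and createBNodeId() are hypothetical names, and a real implementation would probably delegate to the underlying store. It sidesteps the full XML name rules by sticking to a safe subset: an ASCII letter followed by ASCII letters and digits only.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: one instance per store/repository.
class BNodeIdGenerator {
    // Random per-instance prefix, so IDs from different generators are
    // very unlikely to clash. It starts with a letter and the UUID's
    // dashes are stripped, leaving only letters and digits.
    private final String prefix =
            "node" + UUID.randomUUID().toString().replace("-", "");

    // Counter guarantees uniqueness within this generator.
    private final AtomicLong counter = new AtomicLong();

    String createBNodeId() {
        // "x" separates the prefix from the counter and keeps the ID
        // within the letters-and-digits subset.
        return prefix + "x" + counter.incrementAndGet();
    }
}

class BNodeDemo {
    public static void main(String[] args) {
        BNodeIdGenerator generator = new BNodeIdGenerator();
        System.out.println(generator.createBNodeId());
        System.out.println(generator.createBNodeId());
    }
}
```

With something like this behind the container, a crawler such as the ImapCrawler would ask for a node for bob instead of inventing an ID of its own.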