Problems with RDFContainer
Solution: own in-mem sail?
Leo: How about making our own in-memory Sail that does not support transactions, contexts, etc.? It is just needed to ship RDF from the DataAccessor to the CrawlerHandler. The CrawlerHandler then reads it again and copies all the triples to another repository.
Then the default RDFContainer (in-memory Sesame) no longer needs to deal with "context" at all. Context is only used when creating the RDFContainer via the factory methods of CrawlerHandler.
- pro: for 80% of the use cases it gets simpler (extracting data into things like gnowsis or Lucene)
- pro: for the 20% of the use cases where the DataAccessor streams directly into the RDFContainer (i.e. the Sesame database), the CrawlerHandler can provide a context-aware RDFContainer.
- con: we have to write it.
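A minimal sketch of what such a transaction-less in-memory Sail could look like (all class and method names here are hypothetical; the real DataAccessor/CrawlerHandler interfaces are not shown):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a transaction-less, context-less in-memory
// triple buffer. A DataAccessor would fill it; the CrawlerHandler then
// reads the triples back and copies them into the real repository.
class SimpleTripleBuffer {
    // A triple is just subject/predicate/object strings in this sketch.
    record Triple(String subject, String predicate, String object) {}

    private final List<Triple> triples = new ArrayList<>();

    // No transactions, no contexts: add() is immediately visible.
    void add(String s, String p, String o) {
        triples.add(new Triple(s, p, o));
    }

    // The CrawlerHandler iterates over this and copies everything
    // to its own (possibly context-aware) repository.
    List<Triple> getTriples() {
        return List.copyOf(triples);
    }
}
```

The point of the sketch is only that the buffer needs nothing beyond add-and-iterate; all transaction and context handling stays on the CrawlerHandler side.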
When you use a SesameRDFContainer, its Repository typically has its auto-commit mode switched off for performance reasons. However, this breaks the contract of RDFContainer's API. The reason is that statements in the repository are not visible until a commit is performed. So when you do a put() for a certain property and you later do a put() for the same property with a different value, you expect that the latter put() overwrites the former value. However, when the repository hasn't been committed yet, replaceInternal won't see the first value. Likewise, the get methods will not return a value for that property. When you finally commit, you end up with both values being stored, leading to a MultipleValuesException upon retrieval.
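To make the failure mode concrete, here is a small self-contained simulation (a hypothetical class, not the real SesameRDFContainer) of a store whose writes only become visible after commit(): the second put() cannot see the uncommitted first value, so it cannot replace it, and both values survive the commit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of deferred-commit semantics breaking the
// "put() overwrites" contract. Writes go to a pending list and only
// become visible to reads after commit().
class DeferredStore {
    private final Map<String, List<String>> committed = new HashMap<>();
    private final List<String[]> pending = new ArrayList<>();

    // put() is supposed to overwrite: it removes any *visible* old value
    // first, but it can only see committed data, not pending writes.
    void put(String property, String value) {
        committed.remove(property);
        pending.add(new String[] { property, value });
    }

    // get() throws when a property ends up with multiple values,
    // mirroring the MultipleValuesException described above.
    String get(String property) {
        List<String> values = committed.getOrDefault(property, List.of());
        if (values.size() > 1)
            throw new IllegalStateException("multiple values: " + values);
        return values.isEmpty() ? null : values.get(0);
    }

    void commit() {
        for (String[] stmt : pending)
            committed.computeIfAbsent(stmt[0], k -> new ArrayList<>())
                     .add(stmt[1]);
        pending.clear();
    }
}
```

With a commit() between the two put() calls the second put() sees and removes the first value, which is exactly why the example classes end up littered with commits.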
My example classes are already crowded with commits to make sure certain put and get methods work correctly. There is something to be said for this, as these classes also provide the CrawlerHandler implementation that manages the repository. However, I now had to add Sesame-specific code to WebCrawler because it depends on the overwriting capability of the put method.
This is a problem with the SesameRDFContainer, but I can imagine that other implementations working with persistent storage facilities will have similar issues.
Although I still like the simplicity of the RDFContainer API, this is another item on the list of problems I've had with it. Does anyone see a simple solution for this? Other than adding a commit() method to RDFContainer or using the Repository's auto-commit mode (which hurts performance badly)?
For many datasources it is necessary to create some blank nodes to express more complicated metadata.
For instance, in the ImapCrawler we create the following:
myimapsource:Blah-1 a :Message ; :to [ a :Person ; :email "email@example.com" ] .
(hope everyone reads N3 :)
Here we need an ID for the node representing bob. Currently each crawler makes up this node ID, but this is tricky, because blank node IDs must conform to XML names, which are not trivial to define (the specs are pretty much unreadable). To get around this, Chris and Gunnar have been discussing adding a createBNode() method to RDFContainer, which returns a bnode guaranteed to be unique within this store/repository/whatever. Good plan?
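One way such a createBNode() could work (a sketch with made-up class names, not a proposed implementation): a counter behind a fixed letter prefix, so every generated ID starts with a letter and continues with digits, which keeps it a valid XML name, and the counter guarantees uniqueness within one container.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the proposed createBNode(): IDs are unique per
// generator instance and always match the safe subset of XML names
// (a letter followed by letters/digits), sidestepping the unreadable
// parts of the spec entirely.
class BNodeIdGenerator {
    private final AtomicLong counter = new AtomicLong();

    String createBNode() {
        return "node" + counter.incrementAndGet();
    }
}
```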
LeoSauermann: For exactly the same problem in gnowsis, we decided to stop using blank nodes altogether. It is better for everyone to have some kind of URI there. So we thought of just taking the e-mail address of the recipient as a mailto: URI. The advantage is that during smushing there are far fewer triples, and the different names of a person get aggregated automatically (which is OK for me). If the ambiguity is an issue, it might be good to give each blank node a constructed URI like e-mail-uri + ":to:" + e-mail-address-of-recipient.
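The two naming options above could be sketched like this (helper names are made up; only the mailto: scheme and the ":to:" composition come from the suggestion):

```java
// Hypothetical helpers for the two URI options suggested above:
// either use the recipient's address directly as a mailto: URI, or,
// when ambiguity matters, construct a per-message URI from the mail's
// URI, the literal ":to:", and the recipient's address.
class RecipientUris {
    static String mailtoUri(String email) {
        return "mailto:" + email;
    }

    static String constructedUri(String mailUri, String email) {
        return mailUri + ":to:" + email;
    }
}
```

The constructed form trades the automatic aggregation of the plain mailto: URI for an identifier that stays unambiguous per message.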