= Problems with RDFContainer =

= Solution: own in-mem sail? =

Leo: How about making our own in-memory sail that does not support transactions, contexts, etc.?
It is only needed to ship RDF from the DataAccessor to the CrawlerHandler.
The CrawlerHandler then has to read it again and copy all the triples to another repository.

Then, the default RDFContainer (in-mem sesame) must not contain "context" anymore. Context is only used when creating the RDFContainer via the factory methods of CrawlerHandler.

 * pro: for 80% of the use cases, it gets simpler (extracting data to things like gnowsis or Lucene)
 * pro: for the 20% of the use cases where the DataAccessor directly streams into the RDFContainer (=the sesame database), the CrawlerHandler can provide a context-aware RDFContainer.
 * con: we have to write it.
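
A rough sketch of how small such a store could be. All names here (`Triple`, `SimpleTripleStore`, `copyTo`) are made up for illustration and are not part of Aperture or Sesame; a real sail would have to implement the Sesame interfaces.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical value class for one statement; plain strings stand in for RDF values.
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

// Hypothetical store without transactions or contexts:
// every added triple is visible immediately.
final class SimpleTripleStore {
    private final List<Triple> triples = new ArrayList<>();

    void add(String subject, String predicate, String object) {
        triples.add(new Triple(subject, predicate, object));
    }

    List<Triple> getAll() {
        return Collections.unmodifiableList(triples);
    }

    // What the CrawlerHandler would do on its side:
    // read everything again and copy it to another repository.
    void copyTo(SimpleTripleStore target) {
        for (Triple t : triples) {
            target.add(t.subject, t.predicate, t.object);
        }
    }
}
```

The DataAccessor would fill one `SimpleTripleStore` per crawled object and hand it over; the copy step is the price for keeping the default container simple.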

= Problem autocommit =

When you use a SesameRDFContainer, its Repository typically has its 
auto-commit mode switched off for performance reasons. However, this 
breaks the contract of RDFContainer's API. The reason is that statements 
in the repository are not visible until a commit is performed. So when 
you do a put() for a certain property and you later do a put() for the 
same property with a different value, you expect that the latter put() 
overwrites the former value. However, when the repository hasn't been 
committed yet, replaceInternal won't see the first value. Likewise, the 
get methods will not return a value for that property. When you finally 
commit, you end up with both values being stored, leading to a 
MultipleValuesException upon retrieval.
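
To make the effect concrete, here is a toy model of the behaviour described above. It is not Sesame code: `DeferredCommitStore` is a made-up class that only mimics "added statements are invisible until commit", and for brevity removals are applied directly to the committed data.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical): additions are buffered until commit().
final class DeferredCommitStore {
    private final List<String[]> committed = new ArrayList<>();
    private final List<String[]> pending = new ArrayList<>();

    // put() with replace semantics: remove the old value, then add the new one.
    void put(String property, String value) {
        // Only committed statements are visible here, so a value from an
        // earlier, still uncommitted put() is not found and not removed.
        committed.removeIf(st -> st[0].equals(property));
        pending.add(new String[] { property, value });
    }

    // Returns all visible (committed) values for the property.
    List<String> get(String property) {
        List<String> values = new ArrayList<>();
        for (String[] st : committed) {
            if (st[0].equals(property)) {
                values.add(st[1]);
            }
        }
        return values;
    }

    void commit() {
        committed.addAll(pending);
        pending.clear();
    }
}
```

Calling `put("title", "first")` and `put("title", "second")` followed by a single `commit()` leaves two values for `title` in this model, which is the situation that leads to the MultipleValuesException. With a `commit()` between the two puts, only the second value remains.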

My example classes are already crowded with commits to make sure 
certain put and get methods work correctly. There is something to be said 
for this, as these classes also provide the CrawlerHandler implementations 
that manage the repository. However, I have now had to add Sesame-specific 
code to WebCrawler, as it depends on the overwriting capability of the 
put method.

This is a problem with the SesameRDFContainer, but I can imagine that 
other implementations working with persistent storage facilities will 
have similar issues.

Although I still like the simplicity of the RDFContainer API, this is 
another item on the list of problems I've had with it. Does anyone see a 
simple solution for this? Other than adding a commit() method to 
RDFContainer or using the Repository's auto-commit mode (hurts 
performance badly)?

= Problem BlankNodes = 

For many datasources it is necessary to create some blank nodes to express more complicated metadata.

For instance, in the ImapCrawler we create the following: 

{{{

myimapsource:Blah-1 a :Message ; :to [ a :Person ; :email "bob@bob.com" ] .

}}}

(hope everyone reads N3 :) 

Here we need an ID for the node representing bob. Currently each crawler makes up this node ID itself, but this is tricky, because blank node IDs must conform to XML names, which are not trivial to define (the specs are pretty much unreadable). 
To get around this, Chris and Gunnar have been discussing adding a createBNode() method to the RDFContainer, which returns a bnode guaranteed to be unique within this store/repository/whatever. Good plan?
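
A minimal sketch of what could sit behind such a createBNode() method. The class and method names (`BNodeIdGenerator`, `createBNodeId`) are assumptions for illustration, not an existing API. The IDs start with a letter and contain only ASCII letters and digits, which keeps them inside what XML names allow without having to implement the full name rules.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: one instance per store/repository.
final class BNodeIdGenerator {
    // Random per-instance prefix (hex digits only, hyphens stripped),
    // led by "node" so the ID always starts with a letter.
    private final String prefix =
            "node" + UUID.randomUUID().toString().replace("-", "");

    // The counter guarantees uniqueness within this generator instance.
    private final AtomicLong counter = new AtomicLong();

    String createBNodeId() {
        return prefix + "x" + counter.incrementAndGet();
    }
}
```

Uniqueness within one generator comes from the counter; across generators it only rests on the random prefix, so a store that needs a hard guarantee should own exactly one generator.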

LeoSauermann: We had exactly the same problem in gnowsis and decided to use '''no blank nodes''' anymore. It is better for everyone to have some kind of URI there. So we thought of just taking the e-mail address of the recipient as a mailto: URI. The advantage is that there are far fewer triples during smushing and that the different names of a person get aggregated automatically (which is ok for me).
If the ambiguity is an issue, it might be good to say that each blank node gets a constructed URI like e-mail-uri + ':to:' + e-mail-address-of-recipient
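
Both variants as a small sketch. `RecipientUris` and its method names are made up for illustration, and the addresses are used as-is without any escaping or normalisation, which a real implementation would probably need.

```java
// Hypothetical helper showing the two URI schemes discussed above.
final class RecipientUris {
    private RecipientUris() {
    }

    // Variant 1: one URI per address, so all mails to the same
    // address end up pointing at the same resource.
    static String mailtoUri(String address) {
        return "mailto:" + address;
    }

    // Variant 2: one URI per (message, recipient) pair,
    // built as e-mail-uri + ":to:" + address.
    static String constructedUri(String messageUri, String address) {
        return messageUri + ":to:" + address;
    }
}
```

Variant 1 gives the automatic aggregation, variant 2 keeps a separate node per message at the cost of more resources to smush later.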