lucene – mookid on code

In MongoDB, there’s no way to lock a database, collection, or document. The ability to work without locking is a requirement for any db that wishes to be horizontally scalable, and obviously this imposes some limitations and/or possibilities (depending on your point of view :)).

If you want all the goodness that document-orientation brings, it seems we need to cope with this non-locking database.

So how DO you update stuff in MongoDB? And, more importantly: how do you update stuff without race conditions?

In one of my previous posts on MongoDB, I mentioned that the unit of atomicity is a document, i.e., either a document gets saved/updated/deleted or it doesn’t. That means we can only count on one document being updated atomically, so we should build our applications so that they never require multiple documents to be updated together in order to stay consistent. (Which is good practice anyway! In my experience, long and wide db transactions are often used not so much to enforce strict consistency as to allow scenarios like: “when this happens, this should also happen”. But that kind of logic can often be handled by something else, e.g. by reliably publishing events to other processes (logically and/or physically separate) that handle the side effects.)

First, let’s take a look at how to actually update a document.

Naïve attempt to update a document

Well, we could do this:

> use myblog
switched to db myblog
> var doc = {'headline': 'Just checking', 'tags': ['nifty']}
> db.posts.save(doc)
> // omgwtfbbq, we forgot to tag with 'test' as well... let's correct it:
> doc.tags[1] = 'test'
test
> // let's go get a cup of coffee...
> // ....
> // - aand now we're back - let's hit save
> db.posts.save(doc)
> db.posts.find()
{ "_id" : ObjectId("4b965229bf4a0000000043bc"), "headline" : "Just checking", "tags" : [ "nifty", "test" ] }

That is, if you go and save a document that already has an ID, any existing document with that ID will be updated.

This would work if we were the only client on the db. But what if someone was editing the post in that same moment, adding another tag as well? Well, if he was unfortunate enough to save his edits when we were out for coffee, his changes would be lost.
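
To make the lost update concrete, here is a minimal sketch in plain JavaScript. It does not talk to MongoDB: the `collection` map and the `save` and `findOne` helpers are toy stand-ins I made up for illustration, and they only mimic the "replace the whole document by ID" behaviour described above.

```javascript
// Toy in-memory stand-in for a collection; NOT the MongoDB driver.
const collection = new Map();

// Whole-document save: replaces whatever is stored under doc._id.
function save(doc) {
  collection.set(doc._id, JSON.parse(JSON.stringify(doc)));
}

// Returns a private copy, like a client that fetched the document.
function findOne(id) {
  return JSON.parse(JSON.stringify(collection.get(id)));
}

save({ _id: 1, headline: 'Just checking', tags: ['nifty'] });

// Two clients read the same state.
const ours = findOne(1);
const theirs = findOne(1);

// The other client adds a tag and saves while we are out for coffee.
theirs.tags.push('json');
save(theirs);

// We come back and save our stale copy, overwriting their change.
ours.tags.push('test');
save(ours);

const finalTags = findOne(1).tags;
console.log(finalTags); // the other client's 'json' tag is gone
```

The last write wins, and nothing tells either client that an edit was thrown away.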

One way to actually do it

By using the update function!

update accepts the following four arguments:

criteria – document selector that specifies which document to be updated
objNew – document to save
upsert – bool to specify auto-insert if document does not exist (“update if present, insert if missing”)
multi – bool to allow updating multiple documents that match the criteria (default is only first document)

Actually, as you can now probably see, save(doc) on a document that already has an ID is just shorthand for update({'_id': doc._id}, doc, true, false): an upsert, keyed on the ID, with the document we’re saving.

This way, we could easily add an incrementing version field to our documents to make sure that the version we’re saving is the version we retrieved.

Let’s try it out:

> post = {'headline': 'has a version field', 'tags': ['nifty'], 'version': 1}
{
        "headline" : "has a version field",
        "tags" : [
                "nifty"
        ],
        "version" : 1
}
> db.posts.save(post)
> // now we're editing it
> post['tags'][1] = 'test'
test
> post['version']++
1
> post
{
        "headline" : "has a version field",
        "tags" : [
                "nifty",
                "test"
        ],
        "version" : 2,
        "_id" : ObjectId("4b9884d4b54e000000006c69")
}
> // now someone else retrieves the post
> someoneElsesPost = db.posts.findOne()
{
        "_id" : ObjectId("4b9884d4b54e000000006c69"),
        "headline" : "has a version field",
        "tags" : [
                "nifty"
        ],
        "version" : 1
}
> // and we save it, setting criteria to the version we retrieved
> db.posts.update({'_id': post._id, 'version': 1}, post); db.$cmd.findOne({getlasterror: 1})
{ "err" : null, "updatedExisting" : true, "n" : 1, "ok" : 1 }
> // as you can probably tell, n:1 means that 1 document was updated...
> //
> // now that other guy makes an edit and tries to save it
> someoneElsesPost['tags'][1] = 'json'
json
> db.posts.update({'_id': someoneElsesPost._id, 'version': 1}, someoneElsesPost); db.$cmd.findOne({getlasterror: 1})
{ "err" : null, "updatedExisting" : false, "n" : 0, "ok" : 1 }
> // 0 documents were updated! Good!

– good thing he didn’t get through with that one! 🙂

As you can see, we can easily implement optimistic concurrency on each document by constraining updates to the version we checked out. But, as I will show next, you can actually do a lot of things on the server.
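
What the losing client does next is up to the application. One option is to re-read the document, re-apply the edit, and try again. The sketch below shows that loop in plain JavaScript against a toy in-memory store; `updateIfVersion`, `editWithRetry`, and the attempt limit are names and choices I made up for illustration, not anything MongoDB provides.

```javascript
// Toy in-memory store standing in for a collection (an assumption for illustration).
const store = new Map();
const clone = (value) => JSON.parse(JSON.stringify(value));

function insertDoc(doc) { store.set(doc._id, clone(doc)); }
function readDoc(id) { return clone(store.get(id)); }

// Replaces the stored document only if its version still equals expectedVersion.
// Returns the number of documents updated (0 or 1), like the "n" field above.
function updateIfVersion(id, expectedVersion, newDoc) {
  const current = store.get(id);
  if (!current || current.version !== expectedVersion) return 0;
  store.set(id, clone(newDoc));
  return 1;
}

// Hypothetical helper: re-read, re-apply the edit, and retry on a version conflict.
function editWithRetry(id, applyEdit, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const doc = readDoc(id);
    const readVersion = doc.version;
    applyEdit(doc);
    doc.version = readVersion + 1;
    if (updateIfVersion(id, readVersion, doc) === 1) return doc;
  }
  throw new Error('gave up after ' + maxAttempts + ' conflicting attempts');
}

insertDoc({ _id: 1, headline: 'has a version field', tags: ['nifty'], version: 1 });

// Simulate a conflict: someone else saves between our first read and our first write.
let interfered = false;
const result = editWithRetry(1, (doc) => {
  if (!interfered) {
    interfered = true;
    const other = readDoc(1);
    other.tags.push('json');
    other.version += 1;
    updateIfVersion(1, 1, other);
  }
  doc.tags.push('test');
});

console.log(result.tags, result.version);
```

The first attempt is rejected because the stored version has moved on; the second attempt starts from the other client's document, so both edits survive.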

Server-side document updates

Instead of retrieving an entire document, modifying it, and saving it back again with the risk of overwriting someone else’s edits, we can ask the server to make edits as smaller operations. E.g. our attempt to add a missing ‘test’ tag to the document could have been done like this:

> db.posts.update({'_id': ObjectId("4b9884d4b54e000000006c69")}, { $push: { 'tags': 'test' } });
> db.posts.find()
{
        "_id" : ObjectId("4b9884d4b54e000000006c69"),
        "headline" : "has a version field",
        "version" : 2,
        "tags" : [ "nifty", "json", "test" ]
}


See how the $push modifier was used to push a value into the array… this stuff is great. But here, we have a race condition again: what if someone added the ‘test’ tag at almost the same time as we did? Then two ‘test’ tags would be present in the array.

One way is to constrain the update by id and the absence of ‘test’ in the tags array:

> db.posts.update({
    '_id': ObjectId("4b9884d4b54e000000006c69"),
    'tags': { $ne: 'test' }
}, {
    $push: { 'tags': 'test' }
});


Another is to use the $addToSet modifier, which makes MongoDB treat the array as a set:

> db.posts.update({'_id': ObjectId("4b9884d4b54e000000006c69") }, { $addToSet: { 'tags': 'test' } });

Nifty!
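
The difference between the two modifiers boils down to whether the operation is safe to apply twice. Here is a small plain-JavaScript sketch of that idea; `push` and `addToSet` are toy functions that only imitate the effect of the modifiers on an array of strings, and they are not the server's implementation.

```javascript
// Toy imitation of $push: always appends.
function push(array, value) {
  array.push(value);
  return array;
}

// Toy imitation of $addToSet: appends only if the value is not already present.
function addToSet(array, value) {
  if (!array.includes(value)) array.push(value);
  return array;
}

// Two clients both decide to add the 'test' tag.
const tagsAfterPush = ['nifty'];
push(tagsAfterPush, 'test');
push(tagsAfterPush, 'test');

const tagsAfterAddToSet = ['nifty'];
addToSet(tagsAfterAddToSet, 'test');
addToSet(tagsAfterAddToSet, 'test');

console.log(tagsAfterPush);     // 'test' appears twice
console.log(tagsAfterAddToSet); // 'test' appears once
```

Because the set-style operation has the same outcome no matter how many times it runs, neither client needs to know what the other one did.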

My conclusion (so far) is that an application can benefit hugely from using the various modifier operations: performance-wise (obviously), but probably UX-wise as well… It’s a step in another direction from the usual CRUD scenarios that I compulsively associate with the word “update”, and I imagine it could be made to reflect the user’s interactions with the system.

I am thinking that the majority of the user’s interactions with the system could (and probably should) be put in the form

db.collection.update({
        [assumed state before]
}, {
        [modifier operations to "migrate" one or
         more documents to the new state]
}, upsert, multi); db.$cmd.findOne({getlasterror: 1})
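
As a rough illustration of that shape, here is a toy version in plain JavaScript that runs without a database. The `matches`, `applyModifiers`, and `update` functions, and the `status` field, are assumptions of mine for the sake of the example; they support only a tiny subset of what the real criteria and modifiers can do.

```javascript
// Toy matcher: exact equality and { $ne: value } on top-level fields only.
function matches(doc, criteria) {
  return Object.entries(criteria).every(([field, expected]) => {
    const actual = doc[field];
    if (expected !== null && typeof expected === 'object' && '$ne' in expected) {
      return Array.isArray(actual)
        ? !actual.includes(expected.$ne)
        : actual !== expected.$ne;
    }
    return actual === expected;
  });
}

// Toy modifiers: $set, $inc and $push on top-level fields only.
function applyModifiers(doc, modifiers) {
  for (const [field, value] of Object.entries(modifiers.$set || {})) {
    doc[field] = value;
  }
  for (const [field, amount] of Object.entries(modifiers.$inc || {})) {
    doc[field] = (doc[field] || 0) + amount;
  }
  for (const [field, value] of Object.entries(modifiers.$push || {})) {
    (doc[field] = doc[field] || []).push(value);
  }
}

// Updates the first matching document; returns how many were updated (0 or 1).
function update(docs, criteria, modifiers) {
  const doc = docs.find((candidate) => matches(candidate, criteria));
  if (!doc) return 0;
  applyModifiers(doc, modifiers);
  return 1;
}

const posts = [{ _id: 1, status: 'draft', tags: ['nifty'], version: 1 }];

// "Publish" the post only if it is still a draft (hypothetical 'status' field).
const assumedState = { _id: 1, status: 'draft' };
const migration = { $set: { status: 'published' }, $inc: { version: 1 } };

const first = update(posts, assumedState, migration);
const second = update(posts, assumedState, migration);

console.log(first, second, posts[0]);
```

The second call finds no document in the assumed state, so it changes nothing and reports zero updates, which is exactly the signal the caller needs.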

How to perform updates across multiple documents

My first thought is that this situation should be avoided when working with a document-oriented db. I think most people will agree with this one.

I am pretty unsure of this, actually… The rest of this post is just a few thoughts on my first take on this, should I need to do this. Comments are greatly appreciated!

The problem in updating multiple documents is that we can perform an update on one document at a time, each time checking if the update went well or not. But there’s no way to (consistently) roll back update #1 if update #2 fails. So this means that there’s only one way: Forward! But how to proceed then, when an update failed?

How do we usually do stuff reliably across boundaries of multiple things that may or may not succeed, allowing us to handle errors as gracefully as possible and proceed thereafter?

I’m thinking that asynchronous reliable one-way messaging is the answer to this.

So if an application ever needs to update multiple documents in the most reliable way possible, it should probably perform one document update per “transaction” – in NServiceBus terminology that would be updating one document per message handler. And then the handler should throw an exception if an update unexpectedly fails.
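
Just to sketch what I mean, here is a toy version in plain JavaScript. Everything in it is made up for illustration: the in-memory `documents` map, the message types, the handler names, and the retry limit. A real system would use a durable queue (that is the "reliable" part), and this is not how NServiceBus is actually wired up.

```javascript
// Toy documents; each handler below touches exactly one of them.
const documents = new Map([
  ['post:1', { commentCount: 0 }],
  ['author:7', { commentsReceived: 0 }],
]);

let failNextAuthorUpdate = true; // simulates one transient failure

// One document update per handler; a handler throws if its update fails.
const handlers = {
  postCommentAdded(message) {
    const post = documents.get(message.postId);
    if (!post) throw new Error('post not found');
    post.commentCount += 1;
  },
  authorCommentAdded(message) {
    if (failNextAuthorUpdate) {
      failNextAuthorUpdate = false;
      throw new Error('update unexpectedly failed');
    }
    const author = documents.get(message.authorId);
    if (!author) throw new Error('author not found');
    author.commentsReceived += 1;
  },
};

// Processes messages one at a time; a failed message is retried later,
// and set aside once it has failed maxAttempts times.
function processQueue(queue, maxAttempts = 3) {
  const failed = [];
  while (queue.length > 0) {
    const message = queue.shift();
    message.attempts = (message.attempts || 0) + 1;
    try {
      handlers[message.type](message);
    } catch (error) {
      (message.attempts < maxAttempts ? queue : failed).push(message);
    }
  }
  return failed;
}

const failedMessages = processQueue([
  { type: 'postCommentAdded', postId: 'post:1' },
  { type: 'authorCommentAdded', authorId: 'author:7' },
]);

console.log(documents.get('post:1'), documents.get('author:7'), failedMessages);
```

Note that the retry is only safe here because the failing handler throws before it changes anything. A handler that could fail after its update went through would have to be idempotent, or a retry would apply the change twice.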

But again: I’m thinking that this situation should be avoided at all costs with a document-oriented db. If ACID is required, the application should probably have an RDBMS on the side, or implement some kind of transactional mechanism in the document store.

Conclusion

That concludes my little learning series of MongoDB posts.

I must say that I am intrigued by all the NoSQL discussions currently going on in the communities, and I think it is always a sign of health that we question the technologies we use.

I am entirely convinced that document dbs could and should have been used for some parts of systems that I have experienced, and I am blown away by the lack of friction when starting up a project on top of a schemaless db.

As a .NET dude, I am convinced that the future will see more .NET systems built with more than one db underneath – e.g. with MongoDB for all the “soft parts” of the system, NHibernate on SQL Server for the few things that by nature require ACID, and then some NHibernate Search/Lucene/Solr.NET for full-text indexing and searching capabilities etc.

Category: lucene

More checking out MongoDB: Updating
