After my MongoDB presentation today, I was asked a few questions about MongoDB’s sharding capabilities. Like my interest so far, my talk was completely focused on the frictionless aspects of using MongoDB, so I have never tried any of the sharding and replica set configurations that MongoDB can run in.
That has got to end!
So, let’s try spinning up the simplest possible sharding scenario that we can think of: Two durable MongoDB instances on port 27017 and 27018 with one collection sharded across them:
1 2 |
mkdir c:\data\db1 mongod --shardsvr --port 27017 --journal --dbpath c:\data\db1 |
1 2 |
mkdir c:\data\db2 mongod --shardsvr --port 27018 --journal --dbpath c:\data\db2 |
MongoDB sharding requires that you spin up a special configuration server that stores the configuration of which shards are available – let’s spin this up on port 27019:
1 2 |
mkdir c:\data\configdb mongod --configsvr --port 27019 --dbpath c:\data\configdb |
– and finally, we spin up one instance of mongos on port 27020, pointing it towards the configuration server:
1 |
mongos --configdb localhost:27019 --port 27020 |
To finalize the setup, we let the configuration database know of the two shards we have started by connecting to the admin database using the Mongo shell:
1 2 3 4 5 6 7 |
C:\mongodb\bin>mongo localhost:27020 MongoDB shell version: 1.8.1 connecting to: localhost:27020/test > use admin switched to db admin > db.runCommand({addShard:"localhost:27017"}) > db.runCommand({addShard:"localhost:27018"}) |
Now, let’s see if it understood this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
> db.runCommand({listShards: 1}) { "shards" : [ { "_id" : "shard0000", "host" : "localhost:27017" }, { "_id" : "shard0001", "host" : "localhost:27018" } ], "ok" : 1 } |
As you can see, the config database correctly stored information about our two shards – that was easy!
Finally, we need to enable sharding for one particular database and make sure that our collection of unicorns is sharded by the name field:
1 2 3 4 |
> db.runCommand({enableSharding: "fairytale"}) { "ok" : 1 } > db.runCommand({shardCollection: "fairytale.unicorns", key: {name: 1}}) { "collectionsharded" : "fairytale.unicorns", "ok" : 1 } |
At this point, even though the fairytale database and unicorns collection did not exist, they have been created for me, and the required index has been created for the shard key. I can verify this like so:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
> use fairytale > db.unicorns.getIndexes() [ { "name" : "_id_", "ns" : "fairytale.unicorns", "key" : { "_id" : 1 }, "v" : 0 }, { "ns" : "fairytale.unicorns", "key" : { "name" : 1 }, "name" : "name_1", "v" : 0 } ] |
Now, let’s go to C# with mongo-csharp and hammer 10 million randomly named unicorns, each carrying a payload of up to 8 KB of fairy dust in there:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 |
class Program { static readonly Random Random = new Random(); static void Main() { var server = MongoServer.Create("mongodb://localhost:27020"); var database = server.GetDatabase("fairytale"); var unicorns = database.GetCollection("unicorns"); using (database.RequestStart()) { var batch = Enumerable.Range(0, 10000000) .Select(_ => new { name = GenerateRandomUnicornName(), horns = 1, likes = new[] {"eggs", "beer"}, fairy_dust = GenerateRandomBytes(Random.Next(8192)) }); unicorns.InsertBatch(batch); server.GetLastError(); } server.Disconnect(); } static byte[] GenerateRandomBytes(int length) { var bytes = new byte[length]; Random.NextBytes(bytes); return bytes; } static string GenerateRandomUnicornName() { Func<char> charFactory = () => (char)('a' + Random.Next(25)); var funcs = Enumerable.Repeat(charFactory, 10 + Random.Next(30)); return new string(funcs.Select(f => f()).ToArray()); } } |
now, let’s go to the Mongo shell and see if they’re there:
1 2 |
> db.unicorns.count() 10000000 |
Great! 10 million unicorns in there. Let’s check out the disk and see if data was somehow distributed among the shards:
That seems to be pretty well balanced if you ask me. Let’s see what Mongo can tell us:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
> db.printShardingStatus() --- Sharding Status --- sharding version: { "_id" : 1, "version" : 3 } shards: { "_id" : "shard0000", "host" : "localhost:27017" } { "_id" : "shard0001", "host" : "localhost:27018" } databases: { "_id" : "admin", "partitioned" : false, "primary" : "config" } { "_id" : "fairytale", "partitioned" : true, "primary" : "shard0000" } fairytale.unicorns chunks: shard0001 66 shard0000 65 too many chunksn to print, use verbose if you want to force print |
66 chunks on the 1st shard, and 65 chunks on the 2nd shard – it is in fact pretty well balanced.
Conclusion so far
It seems to be pretty easy to begin sharding data, which is perfectly in line with the usual MongoDB feeling. It’s definitely a subject I need to look more into though, so if you want to read more about it, I can really recommend Kristina Chodorow‘s blog: Snail In A Turtleneck.
Very interesting π
Any specific reason for why you have not used Raven DB?
Yes – I like MongoDB more π
I just happened to start working with MongoDB some time ago when RavenDB was not available, and has caused no friction whatsoever that MongoDB is not based on .NET.
Moreover, MongoDB is much more mature than Raven – #justsaying (not that it influences any of my decisions…)
I am planning on giving RavenDB another spin though, because I don’t know too much about it yet.
Excellent Post!!! I just ran step by step, and it works like a charm. 10 million records and I did a search on one particular record, and the speed of MongoDB is just amazing.
Thanks,
Vikas
Awesome π
Thanks mate