Since ElasticSearch is hot sh*# these days, and my old hacker friend Thomas Ardal wrote a nifty guide on how to install it on Windows VMs in Azure, I thought I might as well supplement with a guide on how to do the same thing, only on Ubuntu VMs in Azure….
So, in this guide I’ll take you through the steps necessary to set up three Ubuntu VMs in Azure and install an ElasticSearch node on each of them, and finally connect the nodes into a search cluster… here goes:
First, create a new virtual network
Unless you intend to add your new Ubuntu VMs to an existing virtual network, you should use the “New” button to create a new virtual network. You can just fill in the name and leave all other options at their defaults.
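If you prefer the command line over the portal, the cross-platform Azure tool of that era had a command for this as well. The command name and flags below are an assumption from memory (and “elastica-net” / “West Europe” are just example values), so verify with azure network vnet create --help before relying on it:

# create a virtual network for the cluster (command/flags assumed, verify locally)
azure network vnet create elastica-net --location "West Europe"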
Create virtual machines
Now, go and create a new virtual machine from the gallery.
Select the latest Ubuntu from the list.
Give your virtual machine a sensible name – in this case, since this is the third machine in my ElasticSearch cluster, I’m calling it “elastica3”. For all three machines, I’ve created a user account called “mhg” on the machine so I can SSH to it.
On the first machine, be sure to create a new cloud service that you can use to load balance requests among the machines. When adding the subsequent machines, remember to select the existing cloud service. In this case, since it’s balancing among “elastica1”, “elastica2”, and “elastica3”, I’m calling the cloud service “elastica”.
Moreover, it’s important that you add the machines to the same availability set! By placing them in different fault domains, Azure ensures that the machines are unlikely to crash, lose connectivity, or fail at the same time.
When the first machine was added, the public port 22 on the cloud service “elastica” got automatically mapped to port 22 on the machine. When adding the subsequent machines, select another public port to map to 22 so that you can SSH to each individual machine from the outside. I chose 23 and 24 for the two other machines.
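All of the above can also be scripted with the cross-platform Azure command-line tool instead of clicking through the portal. The exact flags here are from memory and should be treated as assumptions (as are the placeholder image name, password, and the “elastica-as” availability set name), so check azure vm create --help first. The idea is that the first call creates the cloud service “elastica” and the following calls join it:

# first machine: creates the cloud service "elastica" (flags assumed, verify with --help)
azure vm create elastica <ubuntu-image-name> mhg <password> --location "West Europe" --virtual-network-name elastica-net --availability-set elastica-as --ssh 22

# subsequent machines: connect to the existing cloud service and map a different public SSH port
azure vm create elastica <ubuntu-image-name> mhg <password> --connect --availability-set elastica-as --ssh 23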
SSH to each machine
Open up a terminal and
ssh mhg@elastica.cloudapp.net -p22
in order to SSH to the first machine, logging in as “mhg”. In this example, I’m using the (default) port 22 which I will replace with 23 and 24 in order to SSH to the other two machines.
Update apt-get
On each machine, I start out by running a
sudo apt-get update
in order to download the most recent apt-get package lists.
Install Java
Now, on each machine I install Java by going
sudo apt-get install openjdk-7-jre-headless -y
and at this point I usually feel inspired to go grab myself a cup of coffee… 😉
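Once apt-get is done (and the coffee is gone), it’s easy to verify that the JRE actually made it onto each machine:

java -version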
Download and install ElasticSearch
And, finally, we’re ready to install ElasticSearch – go to the download page and copy the URL of the DEB package. At the time of writing this, the most recent DEB package is https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.5.deb which I download and install on each machine like this:
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.5.deb
sudo dpkg -i elasticsearch-0.90.5.deb
sudo service elasticsearch start
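Before configuring anything, I like to check that the service actually came up on each machine. A request against the local HTTP port should return the node’s banner JSON:

curl -XGET 'http://localhost:9200/'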
Configure ElasticSearch cluster
In order to be able to edit the configuration file, I
sudo apt-get install emacs
and go
sudo emacs /etc/elasticsearch/elasticsearch.yml
By default, ElasticSearch uses UDP multicast to dynamically discover an existing cluster, which it will then join automatically. Multicast isn’t supported on Azure’s virtual networks, though, so we must explicitly specify which nodes make up our cluster. To do this, uncomment the line
discovery.zen.ping.multicast.enabled: false
to disable multicast discovery, and then add the full list of the IP addresses of your machines on the following line:
discovery.zen.ping.unicast.hosts: ["10.0.0.4", "10.0.0.5", "10.0.0.6"]
In my case, the IPs assigned to the VMs were 10.0.0.4 through 10.0.0.6. You can use ifconfig on each machine if you’re in doubt which IP was assigned (or you can check it out via the Azure Portal).
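On Ubuntu images of that vintage the address shows up under eth0 in the old-style ifconfig output, so something like this does the trick:

ifconfig eth0 | grep 'inet addr'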
After saving the file on each machine, remember to
sudo service elasticsearch restart
for ElasticSearch to pick up the changes.
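If you’d rather script this configuration step than edit the file by hand on every machine, the following sketch should have the same effect (the IP list is specific to my setup, so adjust it to yours):

# uncomment/override the multicast setting and append the unicast host list
sudo sed -i 's/^# *discovery\.zen\.ping\.multicast\.enabled:.*/discovery.zen.ping.multicast.enabled: false/' /etc/elasticsearch/elasticsearch.yml
echo 'discovery.zen.ping.unicast.hosts: ["10.0.0.4", "10.0.0.5", "10.0.0.6"]' | sudo tee -a /etc/elasticsearch/elasticsearch.yml
sudo service elasticsearch restart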
Check it out
Now, on any of the three machines, try running the following cURL command:
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
which should yield something like this:
{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
Finally, let’s make the cluster accessible from the outside….
Set up load balancing among the three VMs
Go to the “Endpoints” tab on the first VM and add a new endpoint. Remember to check the option to create a new load-balanced set. Just go with the defaults when asked how the load balancer should probe the endpoints.
The last thing to do is add an endpoint to the two other VMs, selecting the existing load-balanced set.
When this step is completed, you should be able to visit your cloud service URL (in my case it was http://elastica.cloudapp.net:9200) and see something like this:
{
  ok: true,
  status: 200,
  name: "Machine Teen",
  version: {
    number: "0.90.5",
    build_hash: "c8714e8e0620b62638f660f6144831792b9dedee",
    build_timestamp: "2013-09-17T12:50:20Z",
    build_snapshot: false,
    lucene_version: "4.4"
  },
  tagline: "You Know, for Search"
}
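The cluster health check from before works through the load balancer as well, which is a nice way to confirm that requests are actually being routed to healthy nodes (substitute your own cloud service name):

curl -XGET 'http://elastica.cloudapp.net:9200/_cluster/health?pretty=true'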
So, is it usable yet?
Not sure, actually – I haven’t had time to investigate how to properly set up an authorization mechanism so as to make my cluster accessible only to specific applications.
If anyone knows how to do that on Azure, please don’t hesitate to enlighten me 🙂
Great guide! Running ElasticSearch on Linux seems like the mature choice.
Regarding security, neither ElasticSearch nor Azure supports this out of the box. I would really like Microsoft to implement “local endpoints” or something, making your cluster accessible from Azure websites and cloud services only. Until this happens (maybe never), you can use the Jetty plugin for ElasticSearch, which supports both HTTPS and basic authentication. I’m using that with a locally generated key on elmah.io. Seems to work pretty well.
Cool! I think I’ll have a look at the Jetty plugin – seems like a fairly simple way of enabling secure communication.
@Kenneth: Seems like a nice and simple solution to providing HTTP basic auth, but it should probably not be used for any kind of sensitive data when it is exposed to the internet. Looks neat though 🙂
Nice article. I did more or less the above steps about a month ago, and it works beautifully. For authentication I use https://github.com/Asquera/elasticsearch-http-basic, which provides basic authentication. It’s not a problem for me, because I run a node web server on the same machine that queries the ES cluster on its behalf.
Kenneth
The Jetty plugin offers HTTPS support, which in my head makes it more usable than elasticsearch-http-basic.
@Mogens Please let me know if you need help setting up Jetty. Mads Jensen helped me a bit with generating the keystore etc.
@Thomas cool, thanks – I’ll see if I can describe the procedure in another blog post then