Friday, May 14, 2010

Running Hadoop 0.20.2 on Windows without Cygwin

About four months ago one of our customers asked us to develop a special-purpose web crawler. The idea wasn't new, so we decided to use the standard set of technologies for this kind of task (Nutch 1.0, Hadoop 0.19.2, Zookeeper 3.2.2, Solr 1.4). There was one problem, though: our client wanted us to deploy all components of the cluster on Windows 2008. From experience we knew that Solr and Zookeeper work fine on Windows, but we had no idea whether Hadoop would. After a brief search on the internet we found several articles on setting up Hadoop on Windows using Cygwin and decided to try that route. However, we soon discovered that Hadoop doesn't work well on Cygwin. The problem wasn't just speed: the cluster was not sufficiently stable. We were not able to run jobs, because they started to fail without any apparent reason after approximately 3 hours of cluster operation. After a week of studying the Hadoop source code we decided to write a patch that would allow us to run Hadoop on Windows without Cygwin. The aim of this article is to describe our patch and the steps you need to follow if you want to run Hadoop on Windows without Cygwin.

Why is Hadoop on Cygwin a bad idea?

Cygwin is a DLL (cygwin1.dll) that acts as a Linux API emulation layer, providing substantial Linux API functionality, together with a collection of tools that provide a Linux look and feel. Although Cygwin is a really nice emulation layer, it is not 24x7 ready. Running Hadoop on Cygwin on production servers is a bad idea for the following reasons:
  • First of all, it is officially "for development purposes only".
  • It can be quite tricky to install Cygwin and the SSHD components on all of your servers.
  • Like any other software, Cygwin has its own bugs, and these are added to the bugs you already have in Hadoop. Sometimes you will end up with something like:
    2010-xx-xx xx:xx:xx,430 WARN mapred.TaskTracker - Error initializing attempt_201001280757_0129_m_000002_0:
    org.apache.hadoop.util.Shell$ExitCodeException: assertion "root_idx != -1" failed: file "/ext/build/netrel/src/cygwin-1.7.1-1/winsup/cygwin/mount.cc", line 363, function: void mount_info::init()
    Stack trace:
    Frame Function Args
    00289984 77461184 (00000084, 0000EA60, 00000000, 00289AA8)
    00289998 77461138 (00000084, 0000EA60, 000000A4, 00289A8C)
    ...
    End of stack trace
  • Windows has a slow process startup time compared to Linux. At the same time, Hadoop does some of its work by running shell commands (measuring disk size and file sizes, starting Mappers and Reducers). Even though this works well on Linux, on Windows it results in poor performance.

Cluster Setup

In this article I assume that you are installing Hadoop on a single machine. For a multi-server setup, repeat all the steps in this document on each of your servers.

First of all, download Hadoop 0.20.2 from an Apache mirror site and configure it. Please note that you should use the Windows path separator "\" for paths to files or folders on the local filesystem.

Now you'll need the patched Hadoop, Windows shell scripts and Java Service Wrapper configuration files to be able to run JobTracker, NameNode, TaskTracker and DataNode as Windows services. You can download all of these components from Hadoop Jira: please download the file Hadoop-0.20.2-patched.zip. If you want to build Hadoop yourself, read the Building Patched Hadoop From Source section of this document.
Unpack the downloaded archive to a directory of your choice and copy:

  • hadoop-0.20.2-core.jar file and service folder to the root of your Hadoop installation
  • cpappend.bat, hadoop.bat files from bin folder to the bin folder of your Hadoop installation
  • commons-compress-1.0.jar, jna-3.2.2.jar, commons-io-1.4.jar from lib folder to the lib folder of your Hadoop installation
Next, make sure you've set the JAVA_HOME environment variable, and set the HADOOP_USER environment variable to the name of the account that will be used to run the Hadoop services. Also ensure that you have granted the "Log on as a service" privilege to that account.
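
Before installing the services, it can help to confirm that both variables are actually visible to a Java process. The following is only a minimal sketch of such a check. The class name EnvCheck and its methods are hypothetical and are not part of Hadoop or of the patch. Only the variable names JAVA_HOME and HADOOP_USER come from the steps above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-flight check; not part of Hadoop or of the patch. */
final class EnvCheck {

    // Variables that the setup steps in this article rely on.
    private static final String[] REQUIRED = {"JAVA_HOME", "HADOOP_USER"};

    /** Returns the names of required variables that are unset or blank in env. */
    static List<String> missing(Map<String, String> env) {
        List<String> result = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> missing = missing(System.getenv());
        if (missing.isEmpty()) {
            System.out.println("JAVA_HOME and HADOOP_USER are set.");
        } else {
            System.out.println("Missing environment variables: " + missing);
        }
    }
}
```

Note that this only checks that the variables are present. It cannot verify the "Log on as a service" privilege, which still has to be checked in the Windows security policy.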

Start the Windows Command Shell and go to the service\bin folder in your Hadoop installation. If you are installing on Windows 7 or Windows 2008, start the Command Shell as administrator. Run the commands

InstallService.bat ..\conf\JobTracker.conf
InstallService.bat ..\conf\NameNode.conf
InstallService.bat ..\conf\TaskTracker.conf
InstallService.bat ..\conf\DataNode.conf
You will be asked to enter the password for the account you set in the HADOOP_USER environment variable, and you should see the following output
wrapper | Hadoop XXXXXXX installed.
Finally, you should format the DFS filesystem. To do so, go to the bin folder in the root of your Hadoop installation and run the shell command
hadoop.bat namenode -format
Now you are ready to start Hadoop. Open Services (services.msc) and start the services in the following order:
  1. Hadoop NameNode
  2. Hadoop DataNode
  3. Hadoop JobTracker
  4. Hadoop TaskTracker
If there were problems during service startup, see the log file in the service\logs folder of your Hadoop installation.

Cluster Uninstallation

To remove the services, go to the service\bin directory of your Hadoop installation and run the shell commands:
UninstallService.bat ..\conf\JobTracker.conf
UninstallService.bat ..\conf\NameNode.conf
UninstallService.bat ..\conf\TaskTracker.conf
UninstallService.bat ..\conf\DataNode.conf
These commands stop all Hadoop Windows services and remove them.

How does it work?

Hadoop uses Linux shell commands to accomplish some of its tasks. For example, it uses the Linux df and du commands to get file system disk space usage and to measure folder size. We implemented this functionality with the help of JNA, which gives us access to the native shared libraries Kernel32.dll and Advapi32.dll.
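
To illustrate the general idea of answering du and df style questions inside the JVM instead of starting a shell command, here is a minimal sketch. It is not the code from the patch: it does not use JNA or Kernel32.dll, but swaps in plain Java standard library calls (java.nio.file, which requires Java 7 or later, and java.io.File). The class name DiskUsage and its methods are hypothetical. The result is also only a rough equivalent, because du reports disk blocks used while this sketch sums logical file sizes.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical helper; not part of Hadoop and not the JNA code from the patch. */
final class DiskUsage {

    /** Rough "du" equivalent: sum of the sizes, in bytes, of all regular files under root. */
    static long directorySize(Path root) throws IOException {
        final long[] total = {0L};
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (attrs.isRegularFile()) {
                    total[0] += attrs.size();
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Skip entries that disappear or cannot be read while walking.
                return FileVisitResult.CONTINUE;
            }
        });
        return total[0];
    }

    /** Rough "df" equivalent: { total bytes, usable bytes } of the partition holding path. */
    static long[] diskSpace(File path) {
        return new long[] {path.getTotalSpace(), path.getUsableSpace()};
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        long[] space = diskSpace(root.toFile());
        System.out.println("Size of " + root + ": " + directorySize(root) + " bytes");
        System.out.println("Partition total: " + space[0] + " bytes, usable: " + space[1] + " bytes");
    }
}
```

Whichever API is used underneath, the point is the same as in the patch: no child process is started, so the slow process startup on Windows is avoided.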

Building Patched Hadoop From Source

You can build Hadoop on both Windows and Linux. To build Hadoop on Windows you will need Cygwin. First check out the Hadoop 0.20.2 source code and download our patch from Hadoop Jira. Put the patch in the folder where you've checked out Hadoop and apply it by running
patch -p0 < HADOOP-6767.patch


Now simply build Hadoop
ant clean jar
The built Hadoop will be located in the build folder.

Shortcomings

Although we tried to test our patch as thoroughly as we could, there may still be numerous bugs in it. Here is a list of known shortcomings of the patch:
  • We haven't tested patched Hadoop with contributed modules
  • The JNA library is provided under the LGPL 2.1 license, which is not fully compatible with the license of Hadoop
  • I have only patched Hadoop 0.20.2, but I am planning to provide patches later for Hadoop 0.18 and for the Hadoop version currently in trunk
  • JNA is not the best choice for accessing Windows native API functions
I will address all these shortcomings as soon as I have some free time and energy. If you have any remarks or proposals, please leave your comments below or on the corresponding Hadoop Jira issue.