Friday, May 14, 2010

Running Hadoop 0.20.2 on Windows without Cygwin

About four months ago one of our customers asked us to develop a special-purpose web crawler. The idea wasn't new, so we decided to use the standard set of technologies for this kind of task (Nutch 1.0, Hadoop 0.19.2, Zookeeper 3.2.2, Solr 1.4). There was one problem, though: our client wanted us to deploy all components of the cluster on Windows 2008. From experience we knew that Solr and Zookeeper work fine on Windows, but we had no idea whether Hadoop does. After a brief search on the internet we found several articles on setting up Hadoop on Windows using Cygwin and decided to try that approach. However, we soon discovered that Hadoop doesn't work well on Cygwin. The problem wasn't just speed: the cluster was not sufficiently stable. We were not able to run jobs, because they started to fail without any apparent reason after approximately 3 hours of cluster operation. After a week of studying the Hadoop source code we decided to write a patch that would allow us to run Hadoop on Windows without Cygwin. The aim of this article is to describe our patch and the steps you'll need to take if you want to run Hadoop on Windows without Cygwin.

Why is Hadoop on Cygwin a bad idea?

Cygwin is a DLL (cygwin1.dll) that acts as a Linux API emulation layer, providing substantial Linux API functionality, together with a collection of tools that provide a Linux look and feel. Although Cygwin is a really nice emulation layer, it is not 24x7 ready. Running Hadoop on Cygwin on production servers is a bad idea for the following reasons:
  • First of all, it is officially "for development purposes only".
  • It can be quite tricky to install Cygwin and SSHD components on all of your servers.
  • Like any other software, Cygwin has its own bugs, and these add to the bugs you already have in Hadoop. Sometimes you will end up with something like:
    2010-xx-xx xx:xx:xx,430 WARN mapred.TaskTracker - Error initializing attempt_201001280757_0129_m_000002_0:
    org.apache.hadoop.util.Shell$ExitCodeException: assertion "root_idx != -1" failed: file "/ext/build/netrel/src/cygwin-1.7.1-1/winsup/cygwin/mount.cc", line 363, function: void mount_info::init()
    Stack trace:
    Frame Function Args
    00289984 77461184 (00000084, 0000EA60, 00000000, 00289AA8)
    00289998 77461138 (00000084, 0000EA60, 000000A4, 00289A8C)
    ...
    End of stack trace
  • Windows has a slow process startup time compared to Linux. At the same time, Hadoop does some of its work by running shell commands (measuring disk and file sizes, starting Mappers and Reducers). Even though this works well on Linux, on Windows it results in poor performance.

Cluster Setup

In this article I assume that you are installing Hadoop on a single machine. For a multi-server setup, repeat all the steps in this document on each of your servers.

First of all, download Hadoop 0.20.2 from an Apache mirror site and configure it. Please note that you should use the Windows path separator "\" in paths to files or folders on the local filesystem.

Now you'll need the patched Hadoop, Windows shell scripts and Java Service Wrapper configuration files to be able to run JobTracker, NameNode, TaskTracker and DataNode as Windows services. You can download all of these components from Hadoop Jira; please download the file Hadoop-0.20.2-patched.zip. If you want to build Hadoop yourself, read the Building Patched Hadoop From Source section of this document.
Unpack the downloaded archive to a directory of your choice and copy:

  • hadoop-0.20.2-core.jar file and service folder to the root of your Hadoop installation
  • cpappend.bat, hadoop.bat files from bin folder to the bin folder of your Hadoop installation
  • commons-compress-1.0.jar, jna-3.2.2.jar, commons-io-1.4.jar from lib folder to the lib folder of your Hadoop installation
Next, make sure you've set the JAVA_HOME environment variable, and set the HADOOP_USER environment variable to the name of the account that will be used to run the Hadoop services. Also ensure that you have granted the "Log on as a service" privilege to that account.
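
Before installing the services it can be handy to confirm that both variables are actually visible to new processes. The following is a minimal sketch of such a check, not part of the patch: the class name ServiceEnvCheck and its method are hypothetical, and only the variable names JAVA_HOME and HADOOP_USER come from the steps above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not shipped with the patch): reports which of the
// environment variables required by the setup steps are unset or blank.
final class ServiceEnvCheck {
    // Variable names taken from the article.
    static final List<String> REQUIRED = Arrays.asList("JAVA_HOME", "HADOOP_USER");

    /** Returns the required variable names that are missing or blank in env. */
    static List<String> missingVariables(Map<String, String> env) {
        List<String> missing = new ArrayList<String>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // System.getenv() returns the environment of the current process.
        List<String> missing = missingVariables(System.getenv());
        if (missing.isEmpty()) {
            System.out.println("JAVA_HOME and HADOOP_USER are set.");
        } else {
            System.out.println("Missing environment variables: " + missing);
        }
    }
}
```

Run it from a freshly opened Command Shell, since variables set through the System Properties dialog are only picked up by processes started afterwards. It does not check the "Log on as a service" privilege.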

Start the Windows Command Shell and go to the service\bin folder in your Hadoop installation. If you are installing on Windows 7 or Windows 2008, start the Command Shell as an administrator. Run the commands:

InstallService.bat ..\conf\JobTracker.conf
InstallService.bat ..\conf\NameNode.conf
InstallService.bat ..\conf\TaskTracker.conf
InstallService.bat ..\conf\DataNode.conf
You will be asked to enter the password for the account you set in the HADOOP_USER environment variable, and you should see the following output:
wrapper | Hadoop XXXXXXX installed.
Finally, format the DFS filesystem. To do this, go to the bin folder in the root of your Hadoop installation and run the shell command:
hadoop.bat namenode -format
Now you are ready to start Hadoop. Open Services (services.msc) and start the services in the following order:
  1. Hadoop NameNode
  2. Hadoop DataNode
  3. Hadoop JobTracker
  4. Hadoop TaskTracker
If there were problems during service startup, see the log files in the service\logs folder of your Hadoop installation.

Cluster Uninstallation

To remove the services, go to the service\bin directory of your Hadoop installation and run the shell commands:
UninstallService.bat ..\conf\JobTracker.conf
UninstallService.bat ..\conf\NameNode.conf
UninstallService.bat ..\conf\TaskTracker.conf
UninstallService.bat ..\conf\DataNode.conf
These commands will stop all Hadoop Windows services and remove them.

How does it work?

Hadoop uses Linux shell commands to accomplish some of its tasks. For example, it uses the Linux df and du commands to get file system disk space usage and to measure folder size. We implemented this functionality with the help of JNA, which gives us access to the native shared libraries Kernel32.dll and Advapi32.dll.
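
To illustrate the general idea of answering df- and du-style questions inside the JVM instead of starting a shell process, here is a minimal, dependency-free sketch. It is not the code from the patch: the patch uses JNA, while this sketch swaps in the standard java.io.File API (available since Java 6), and the class and method names are hypothetical. It also sums logical file lengths, whereas du reports allocated disk blocks, so the numbers can differ.

```java
import java.io.File;

// Hypothetical illustration (not the patch's JNA code): in-process
// counterparts of the du and df shell commands using java.io.File only.
final class LocalDiskStats {

    /** Rough counterpart of du: sum of the lengths of all files under root. */
    static long directorySize(File root) {
        if (root.isFile()) {
            return root.length();
        }
        long total = 0;
        // listFiles() returns null if root is not a readable directory.
        File[] children = root.listFiles();
        if (children != null) {
            for (File child : children) {
                total += directorySize(child);
            }
        }
        return total;
    }

    /** Rough counterpart of df: bytes available to this JVM on root's partition. */
    static long usableSpace(File root) {
        return root.getUsableSpace();
    }

    public static void main(String[] args) {
        // Measure the directory given as the first argument, or the current one.
        File root = new File(args.length > 0 ? args[0] : ".");
        System.out.println("Size of " + root + ": " + directorySize(root) + " bytes");
        System.out.println("Usable space: " + usableSpace(root) + " bytes");
    }
}
```

Because no child process is started, a call like this avoids the Windows process startup cost mentioned earlier. The recursion follows whatever listFiles() returns, so directory links that form a cycle would need extra handling.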

Building Patched Hadoop From Source

You can build Hadoop on both Windows and Linux. To build Hadoop on Windows you will need Cygwin. First, check out the Hadoop 0.20.2 source code and get our patch from Hadoop Jira. Put the patch into the folder where you've checked out Hadoop and apply it by running:
patch -p0 < HADOOP-6767.patch


Now simply build Hadoop:
ant clean jar
The built Hadoop will be located in the build folder.

Shortcomings

Although we tested our patch as thoroughly as we could, there may still be numerous bugs in it. Here is a list of known shortcomings of the patch:
  • We haven't tested the patched Hadoop with contributed modules
  • The JNA library is provided under the LGPL 2.1 license, which is not fully compatible with the license of Hadoop
  • I have only patched Hadoop 0.20.2, but I am planning to provide patches for Hadoop 0.18 and for the Hadoop currently in trunk later
  • JNA is not the best choice for accessing Windows native API functions
I will address all of these shortcomings as soon as I have some free time and energy. If you have remarks or proposals, please leave a comment below or on the corresponding Hadoop Jira issue.

10 comments:

  1. I'm dying to hear about your progress on implementing this task - I read in your last comment on your Jira issue that you are going to change your patch a little. I'm also facing issues deploying Hadoop on a large number of Windows machines because of Cygwin - I wish I could help but I'm not so good at Java.

  2. I've updated the issue. Batch scripts and scripts for running Hadoop as Windows services have been added. But these changes don't allow running Hadoop w/o Cygwin

  3. Can you tell me if we need cygwin installed though we are not running hadoop on it? I saw that in hadoop.bat, the path to cygwin is being added.

  4. when it asks for the password it appends ".\" in front of the username as set in HADOOP_USER.

    example, HADOOP_USER=user

    Please input the password for account '.\user':

  5. yes, u need to enable the user to logon as a service..

  6. Hello Orlov,
    Can I install Hadoop 2.203 also with the same patch? Really appreciate the help.

    Thanks
    Babu

  7. Hi
    I am getting this error when trying to start name node in Windows XP
    The Hadoop NameNode service is starting..
    The Hadoop NameNode service could not be started.

    A service specific error occurred: 4294967295.

    More help is available by typing NET HELPMSG 3547.

    How to resolve this issue.

  8. I'm running Sun Java 1.6.0_35-b10 with the windows port of hadoop 0.20.2, and when I
    run 'hadoop namenode -format', I get:
    The java class could not be loaded. java.lang.UnsupportedClassVersionError: (org/apache/hadoop/hdfs/server/namenode/NameNode) bad major version at offset=6

    I also cannot start the windows services - I get "Windows could not start Hadoop NameNode on Local Computer.
    For more information, review the system event log. If this is a non-Microsoft service, contact the service
    vendor, and refer to service-specific error code 1."

    Thanks for your efforts.

  9. The information which you have provided is very good and easily understood.
    It is very useful who is looking for hadoop Online Training.
