Grid Environment Setup (4): Installing gt4 and the gt4-sge adapter


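Installing gt4 on vm1
First install the required packages. From the installation CD I installed postgresql-lib, postgresql 7.3.4 and postgresql-server, followed by jdk-1_5_0_05-linux-i586.bin and apache-ant-1.6.5-bin.tar. Check that gcc, g++, sed, make, perl, sudo and tar are installed. The Globus package used is gt4.0.2-x86_rh_9-installer.tar, which is a binary installer, so the installation is very fast.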
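The tool check mentioned above can be scripted. The following is a minimal sketch, not part of the original setup: the function name missing_tools is made up for illustration, and the tool list is simply the one named in the paragraph above.

```shell
# Print the name of every tool in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool list taken from the prerequisites paragraph.
missing=$(missing_tools gcc g++ sed make perl sudo tar)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "missing tools:" $missing
else
    echo "all build tools found"
fi
```

Anything it reports as missing should be installed before running the Globus installer.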
For the configuration of globus after installation, see http://blog.csdn.net/jcwKyl/archive/2009/07/18/4360031.aspx or http://www.globus.org/toolkit/docs/4.0/admin/docbook/quickstart.html

Installing the globus sge adapter
See the documentation at http://www.globusconsortium.org/tutorial/ch8/page_2.php.
Download the four packages:
[whb@jcwkyl gridsoft]$ wget http://www.lesc.ic.ac.uk/projects/globus_gram_job_manager_setup_sge-1.1.tar.gz
[whb@jcwkyl gridsoft]$ wget http://www.lesc.ic.ac.uk/projects/globus_scheduler_event_generator_sge-1.1.tar.gz
[whb@jcwkyl gridsoft]$ wget http://www.lesc.ic.ac.uk/projects/globus_scheduler_event_generator_sge_setup-1.1.tar.gz
[whb@jcwkyl gridsoft]$ wget http://www.lesc.ic.ac.uk/projects/globus_wsrf_gram_service_java_setup_sge-1.1.tar.gz

[globus@vm1 globus]$ cd $SGE_ROOT
[globus@vm1 sge]$ source default/common/settings.sh
[globus@vm1 sge]$ source $GLOBUS_LOCATION/etc/globus-user-env.sh
[globus@vm1 sge]$ cd
[globus@vm1 globus]$ gpt-build /software/globus_gram_job_manager_setup_sge-1.1.tar.gz
[globus@vm1 globus]$ gpt-build /software/globus_scheduler_event_generator_sge-1.1.tar.gz gcc32dbg
[globus@vm1 globus]$ gpt-build /software/globus_scheduler_event_generator_sge_setup-1.1.tar.gz
[globus@vm1 globus]$ gpt-build /software/globus_wsrf_gram_service_java_setup_sge-1.1.tar.gz
[globus@vm1 globus]$ gpt-postinstall
Now the GRAM WS SGE jobmanager can be tested.
First start the container:
-bash-2.05b$ postmaster -i -D /opt/pgsql/data/ > logfile 2>&1&
[globus@vm1 globus]$ globus-start-container > logfile 2>&1&
When globus-start-container starts, the following warning appears:
2009-11-28 14:46:33,893 WARN usefulrp.GLUEResourceProperty [GLUE refresher 0,runScript:315] ScriptExecution error when executing shell /opt/globus-4.0.2/libexec/globus-scheduler-provider-sge
java.io.IOException: java.io.IOException: /opt/globus-4.0.2/libexec/globus-scheduler-provider-sge: not found
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
    at java.lang.ProcessImpl.start(ProcessImpl.java:65)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:451)
    at java.lang.Runtime.exec(Runtime.java:591)
    at java.lang.Runtime.exec(Runtime.java:429)
    at java.lang.Runtime.exec(Runtime.java:326)
The page http://www.globusconsortium.org/tutorial/ch8/page_3.php says this message can be ignored, but job submission kept failing. Searching on Google showed that the gt4 and sge integration method given there comes from http://www.lesc.ic.ac.uk/projects/SGE-GT4.html, and the sge integration section of the globus developer's guide (http://docs.huihoo.com/globus/toolkit/4.0/execution/wsgram/developer-index.html) also points to www.lesc.ic.ac.uk.
Submitting a job, however, always ends in an Unsubmitted state, as follows:
[guest@vm1 guest]$ globusrun-ws -submit -s -F vm1 -Ft SGE -c /bin/echo "just a test"
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:5b2ab3c0-dcf3-11de-96f9-080027f48588
Termination time: 11/30/2009 14:27 GMT
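At this point it hangs, and after a long wait it reports:
Current job state: Unsubmitted
In fact, the job had already been executed by SGE. The job above was submitted on vm1 as user guest, and on vm2 we can see: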
[guest@vm2 guest]$ ls
5b2ab3c0-dcf3-11de-96f9-080027f48588.0.stderr test
5b2ab3c0-dcf3-11de-96f9-080027f48588.0.stdout transfer.xfr
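The leading part of these file names is the uuid shown when the job was submitted. The content of 5b2ab3c0-dcf3-11de-96f9-080027f48588.0.stdout is "just a test". So the adapter does work and the submitted job really was executed by SGE; only the status information is wrong.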
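The relationship between the job ID and the output file name can be written down as a tiny helper. This is only a sketch based on the listing above: the function name job_stdout_name is invented here, and the ".0.stdout" suffix is taken from the file names observed in this session rather than from any Globus documentation.

```shell
# Map a job ID as printed by globusrun-ws ("uuid:<id>") to the
# "<id>.0.stdout" file name seen in the directory listing above.
job_stdout_name() {
    printf '%s.0.stdout\n' "${1#uuid:}"
}

job_stdout_name "uuid:5b2ab3c0-dcf3-11de-96f9-080027f48588"
```

On vm2, something like cat "$(job_stdout_name "$id")" in the guest home directory would then show the job's output even when globusrun-ws itself reports Unsubmitted.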
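Google turned up this link: http://dev.uabgrid.uab.edu/uabgrid-stage/wiki/BuildTheStage. Its author describes the same situation and says "Many people have reported this bug, but could not find any solution yet." The author's approach was to rebuild globus_scheduler_event_generator_sge-1.1.tar.gz with gpt-build using the gcc64dbg flavor. I tried to imitate this but did not know which flavor to build: gcc64dbg certainly fails, and I could not work out which flavors were available. gpt-build has an -all-flavors option, but that failed as well. This line of investigation stalled for the time being.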

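With nothing better to try, I submitted a job to see how the SGE reporting file records it, in order to verify that the job really is executed correctly by SGE when globusrun-ws ends in Unsubmitted.
This can be read off the log file.
Empty the log file: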
[sgeadmin@vm2 sgeadmin]$ cd /opt/sge/default/common/
[sgeadmin@vm2 common]$ echo "" > reporting
Submit a job again:
[guest@vm1 guest]$ globusrun-ws -submit -s -F vm1 -Ft SGE -c /bin/hostname
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:b4329c1a-dcfa-11de-8a96-080027f48588
Termination time: 11/30/2009 15:20 GMT
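In the command above, -s requests output streaming (which is what produces the job's stdout and stderr files), -F specifies the factory, -Ft specifies the factory type, and -c specifies the command to execute.
On vm2 we can see that the job has already completed, as follows: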
[root@vm2 root]# su - guest
[guest@vm2 guest]$ ls
b4329c1a-dcfa-11de-8a96-080027f48588.0.stderr test
b4329c1a-dcfa-11de-8a96-080027f48588.0.stdout transfer.xfr
The contents of the log file:
1259508019:new_job:1259508019:26:-1:NONE:sge_job_script.13114:guest:guest::defaultdepartment:sge:1024
1259508019:job_log:1259508019:pending:26:-1:NONE::guest:vm1:0:1024:1259508019:sge_job_script.13114:guest:guest::defaultdepartment:sge:new job
1259508026:job_log:1259508026:sent:26:0:NONE:t:master:vm1:0:1024:1259508019:sge_job_script.13114:guest:guest::defaultdepartment:sge:sent to execd
1259508026:queue_consumable:all.q:vm3:1259508026::slots=1.000000=1.000000
1259508026:job_log:1259508026:delivered:26:0:NONE:r:master:vm1:0:1024:1259508019:sge_job_script.13114:guest:guest::defaultdepartment:sge:job received by execd
1259508027:acct:all.q:vm3:guest:guest:sge_job_script.13114:26:sge:0:1259508019:1259508025:1259508025:0:0:0:0:0:0.000000:0:0:0:0:4424:6324:0:0.000000:0:0:0:0:0:0:NONE:defaultdepartment:NONE:1:0:0.000000:0.000000:0.000000:NONE:0.000000:NONE:0.000000
1259508027:job_log:1259508027:finished:26:0:NONE:r:execution daemon:vm3:0:1024:1259508019:sge_job_script.13114:guest:guest::defaultdepartment:sge:job exited
1259508027:job_log:1259508027:finished:26:0:NONE:r:master:vm1:0:1024:1259508019:sge_job_script.13114:guest:guest::defaultdepartment:sge:job waits for schedds deletion
1259508027:queue_consumable:all.q:vm3:1259508027::slots=0.000000=1.000000
1259508041:job_log:1259508041:deleted:26:0:NONE:T:scheduler:vm1:0:1024:1259508019:sge_job_script.13114:guest:guest::defaultdepartment:sge:job deleted by schedd
From these records we can roughly see that the job was placed in the all.q@vm3 queue and executed by the execd on vm3. Now look at the job's output:
[root@vm2 root]# su - guest
[guest@vm2 guest]$ cat b4329c1a-dcfa-11de-8a96-080027f48588.0.stdout
vm3
The job ran /bin/hostname on vm3, so the output is vm3.
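Reading the reporting file by eye gets tedious, so the same information can be pulled out with awk. This is a rough sketch: the function names are invented for illustration, and the field positions are read off the sample records shown earlier in this post, not taken from the SGE documentation, so they may differ in other SGE versions.

```shell
# List "timestamp event" pairs for one job from an SGE reporting file.
# Assumed layout of job_log records (from the sample above):
# field 2 = record type, field 4 = event, field 5 = job number.
job_events() {
    awk -F: -v id="$2" '$2 == "job_log" && $5 == id { print $1, $4 }' "$1"
}

# Print queue@host for one job from its acct record.
# Assumed layout (from the sample above):
# field 3 = queue, field 4 = host, field 8 = job number.
job_exec_host() {
    awk -F: -v id="$2" '$2 == "acct" && $8 == id { print $3 "@" $4 }' "$1"
}
```

For the job above, job_exec_host /opt/sge/default/common/reporting 26 would be expected to name the queue instance the job ran in, and job_events with the same arguments lists its transitions from pending to deleted.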
Next, submit an array job:
[guest@vm1 guest]$ globusrun-ws -submit -s -F vm1 -Ft SGE -c /opt/sge/examples/jobs/array_submitter.sh 7
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:fd714f56-dcfb-11de-a6c4-080027f48588
Termination time: 11/30/2009 15:29 GMT
Viewing the execution state at one moment with qstat gives:
[guest@vm2 guest]$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@vm1 BIP 1/1 0.00 lx24-x86
29 0.55500 StepB guest r 11/29/2009 10:30:26 1 1
----------------------------------------------------------------------------
all.q@vm2 BIP 0/1 0.00 lx24-x86
----------------------------------------------------------------------------
all.q@vm3 BIP 1/1 0.00 lx24-x86
29 0.55500 StepB guest r 11/29/2009 10:30:26 1 2



############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
29 0.00000 StepB guest qw 11/29/2009 10:29:41 1 3-7:1
The output files are as follows:
[guest@vm2 guest]$ ls
fd714f56-dcfb-11de-a6c4-080027f48588.0.stderr StepA.e28.6 StepA.o28.6 StepB.e29.6 StepB.o29.6
fd714f56-dcfb-11de-a6c4-080027f48588.0.stdout StepA.e28.7 StepA.o28.7 StepB.e29.7 StepB.o29.7
StepA.e28.1 StepA.o28.1 StepB.e29.1 StepB.o29.1 test
StepA.e28.2 StepA.o28.2 StepB.e29.2 StepB.o29.2 transfer.xfr
StepA.e28.3 StepA.o28.3 StepB.e29.3 StepB.o29.3
StepA.e28.4 StepA.o28.4 StepB.e29.4 StepB.o29.4
StepA.e28.5 StepA.o28.5 StepB.e29.5 StepB.o29.5
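Running file on globusrun-ws shows that it is an ELF file. When debugging it with gdb, only assembly code is available, probably because gt4 was installed from the binary installer. I will try the source installer once to see whether that solves the problem or reveals its root cause.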


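An idea: since the warning at container startup can be ignored, and it has been shown not to prevent jobs from being submitted to SGE and executed, I wanted to get rid of it. I made a copy of $GLOBUS_LOCATION/libexec/globus-gram-jobmanager-fork named globus-gram-jobmanager-sge and restarted the container. The warning did disappear, but the Unsubmitted error remained.
https://www.nbcr.net/pub/wiki/index.php?title=GT4_Installation_and_Configuration
This article describes the gt4 installation process concisely.
Note: this way of suppressing the warning is not a proper fix; I did it only because the warning is inconsequential.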


The globus developer's guide (http://www.globus.org/toolkit/docs/4.0/execution/wsgram/developer-index.html) has a solution for this problem as it occurs with pbs. Following those steps, what I have done so far is edit $GLOBUS_LOCATION/container-log4j.properties and turn on all the debug options in it.


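Submitting jobs directly on SGE, I found that shell jobs work without problems, while submitting a binary, for example a direct qsub /bin/hostname, fails. A shell script that calls hostname works, though. So I wrote such a script and submitted it with globusrun-ws, and the result was still Unsubmitted.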
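The wrapper script described above can be generated rather than written by hand. The sketch below is illustrative only: make_wrapper is a made-up helper, and it handles only simple command lines whose arguments contain no spaces or quotes.

```shell
# Write a small /bin/sh wrapper script that execs the given command line.
# Sketch only: arguments containing spaces or quotes are not handled.
make_wrapper() {
    out=$1
    shift
    printf '#!/bin/sh\nexec %s\n' "$*" > "$out"
    chmod +x "$out"
}
```

For example, make_wrapper "$HOME/run_hostname.sh" /bin/hostname followed by qsub "$HOME/run_hostname.sh" reproduces the plain SGE case that worked. As far as I know, SGE 6 also documents a -b y option to qsub for submitting binaries directly; whether the version installed here supports it was not checked in this session. Neither variant is claimed to change the Unsubmitted behaviour of globusrun-ws.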

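Also, while running gpt-build on the four gt4-sge adapter packages, I noticed a globus_core-4.30 in the BUILD directory. Would switching to that globus version make the problem go away?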

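Continuing the earlier idea of building every flavor: after running gpt-build on the four packages, I had already searched the BUILD directory with find -name "*" -exec grep flavor {} \; and found nothing useful. Searching again this time produced an unexpected discovery: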
[globus@vm1 globus_scheduler_event_generator_sge-1.1]$ grep flavor *
aclocal.m4:#extract whether the package is built with flavors from the src metadata
aclocal.m4: GLOBUS_FLAVOR_NAME="noflavor"
aclocal.m4:AC_ARG_WITH(flavor,
aclocal.m4: [ --with-flavor=<FL> Specify the globus build flavor or without-flavor for a flavor independent ],
aclocal.m4: echo "Please specify a globus build flavor" >&2
aclocal.m4: if test "x$GLOBUS_FLAVOR_NAME" = "xnoflavor"; then
aclocal.m4: echo "Warning: package doesn't build with flavors $withval ignored" >&2
aclocal.m4: if test ! -f "$GLOBUS_LOCATION/etc/globus_core/flavor_$GLOBUS_FLAVOR_NAME.gpt"; then
aclocal.m4: echo "Please specify a globus build flavor" >&2
aclocal.m4:if test "x$GLOBUS_FLAVOR_NAME" != "xnoflavor" ; then
config.log: $ /home/globus/BUILD/globus_scheduler_event_generator_sge-1.1//configure --with-threads=pthreads --with-flavor=gcc32pthr
config.status: with options \"'--with-threads=pthreads' '--with-flavor=gcc32pthr'\"
config.status: echo "running /bin/sh /home/globus/BUILD/globus_scheduler_event_generator_sge-1.1//configure" '--with-threads=pthreads' '--with-flavor=gcc32pthr' $ac_configure_extra_args " --no-create --no-recursion" >&6
config.status: exec /bin/sh /home/globus/BUILD/globus_scheduler_event_generator_sge-1.1//configure '--with-threads=pthreads' '--with-flavor=gcc32pthr' $ac_configure_extra_args --no-create --no-recursion
configure: --with-flavor=<FL> Specify the globus build flavor or without-flavor for a flavor independent
configure:#extract whether the package is built with flavors from the src metadata
configure: GLOBUS_FLAVOR_NAME="noflavor"
configure:# Check whether --with-flavor or --without-flavor was given.
configure:if test "${with_flavor+set}" = set; then
configure: withval="$with_flavor"
configure: echo "Please specify a globus build flavor" >&2
configure: if test "x$GLOBUS_FLAVOR_NAME" = "xnoflavor"; then
configure: echo "Warning: package doesn't build with flavors $withval ignored" >&2
configure: if test ! -f "$GLOBUS_LOCATION/etc/globus_core/flavor_$GLOBUS_FLAVOR_NAME.gpt"; then
configure: echo "Please specify a globus build flavor" >&2
configure:if test "x$GLOBUS_FLAVOR_NAME" != "xnoflavor" ; then
globus_automake_pre:flavorincludedir = $(GLOBUS_LOCATION)/include/$(GLOBUS_FLAVOR_NAME)
globus_automake_pre:## flavorinclude = [ HEADERS ]
Makefile:flavorincludedir = $(GLOBUS_LOCATION)/include/$(GLOBUS_FLAVOR_NAME)
Makefile.in:flavorincludedir = $(GLOBUS_LOCATION)/include/$(GLOBUS_FLAVOR_NAME)
The last two lines of the grep output were the hint. I quickly ran ls on $GLOBUS_LOCATION/include and found:
[globus@vm1 globus_scheduler_event_generator_sge-1.1]$ ls $GLOBUS_LOCATION/include
gcc32 gcc32dbg gcc32dbgpthr gcc32pthr
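The step from this listing to a build command can be sketched as a dry run. The helpers below are hypothetical names, and they rest on the assumption made in this post that each subdirectory of $GLOBUS_LOCATION/include corresponds to an installed flavor; that is an inference from the Makefile lines, not something confirmed from the GPT documentation.

```shell
# List the flavor names found as subdirectories of an include directory.
list_flavors() {
    for d in "$1"/*/; do
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done
}

# Print (without running it) a gpt-build command that rebuilds one
# package with every flavor found under the given include directory.
rebuild_cmd() {
    # The command substitution is unquoted on purpose so that the
    # flavors end up on one line separated by spaces.
    echo "gpt-build -force $1" $(list_flavors "$2")
}
```

Calling rebuild_cmd /software/globus_scheduler_event_generator_sge-1.1.tar.gz "$GLOBUS_LOCATION/include" prints the command for inspection; nothing is built until that printed line is run by hand.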
So I ran gpt-build once more:
...
[globus@vm1 globus]$ gpt-build -force /software/globus_scheduler_event_generator_sge-1.1.tar.gz gcc32 gcc32dbg gcc32dbgpthr gcc32pthr
...
This time the problem was finally solved!
[guest@vm1 guest]$ ps ax
...
5424 pts/0 S 0:00 /opt/globus-4.0.2/libexec/globus-scheduler-event-generator -s fork -t 125
5442 pts/0 S 0:00 /opt/globus-4.0.2/libexec/globus-scheduler-event-generator -s sge -t 1259
...
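Checking for the sge event generator process can also be done with a small filter over the ps output. This is only a sketch: seg_running_for is an invented name, and the match is based purely on the command line format visible in the ps listing above.

```shell
# Succeed if the ps listing on stdin contains a scheduler event generator
# that was started with "-s <scheduler>".
seg_running_for() {
    grep -Eq "globus-scheduler-event-generator -s $1( |\$)"
}
```

A typical use would be ps ax | seg_running_for sge && echo "event generator for sge is running". If ps truncates long command lines on a narrow terminal, a wider listing may be needed for the match to work.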
[guest@vm1 guest]$ globusrun-ws -submit -s -F vm1 -Ft SGE -c /bin/hostname
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:84413516-dfac-11de-9018-080027f48588
Termination time: 12/04/2009 01:38 GMT
Current job state: Pending
Current job state: Active
vm1
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
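To tell the working case from the earlier failure without reading the whole transcript, the last reported state can be extracted from the globusrun-ws output. A minimal sketch, assuming the "Current job state:" lines look exactly as in the sessions shown in this post; last_job_state is a made-up name.

```shell
# Print the last job state reported in globusrun-ws output read from stdin.
last_job_state() {
    sed -n 's/^Current job state: //p' | tail -n 1
}
```

Piping a saved transcript through last_job_state yields Done for the session above and Unsubmitted for the failing sessions earlier. Whether globusrun-ws writes these lines to stdout or stderr was not checked here, so capturing both streams is the safer choice.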

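This was my first time working with a system as large as globus. When debugging I had no idea where to start and simply guessed and experimented, knowing little of the underlying principles, just to piece together a working result. The notes above are for reference only. With this, the great majority of the work is done. What remains is installing and configuring csf and vjm, which is comparatively simple work.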
