Ansible Best Practices


Author: Haohao Zhang

Summary

To manage thousands of servers, we need a deployment tool that can handle all kinds of tasks.

The most widely used tools are Puppet, SaltStack, and Ansible.

Puppet and SaltStack both rely on agents, while Ansible has no agent. This is an advantage, because you do not have to manage the agents with yet another tool.

Ansible is also written in Python and ships with many modules.

You can develop your own modules and contribute them back to the community.

Ansible uses the SSH protocol to transfer data.

Here are some best practices that we want to share with you.

Practice1

Problem:

The ansible and ansible-playbook commands print their results to the terminal. Sometimes you want a run to continue in the background and write its output to a log file. You might reach for “nohup” to do this, but you will find it is a disaster.

nohup ansible-playbook -i inventory main.yml -k -K -U root -u test -s -f 10 > ansible_log

Output:

  File "/usr/lib/python2.6/site-packages/ansible/runner/connection_plugins/ssh.py", line 162, in _communicate
    rfd, wfd, efd = select.select(rpipes, [], rpipes, 1)
ValueError: filedescriptor out of range in select()

Reason:

The Python client uses select() to wait for socket activity. select() is used because it is available on most platforms. However, select() has a hard limit on the value of a file descriptor. If a socket is created whose file descriptor is greater than or equal to FD_SETSIZE, the exception shown above is thrown.

Note well: this is caused by the numeric value of the fd, not by the number of open fds.
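
The point that the descriptor's value matters, not the count, can be checked directly. The following is a minimal sketch in Python 3 (the environment above was Python 2.6). It assumes FD_SETSIZE is 1024, which is typical on Linux but is not exposed by the select module, and the helper name fits_in_select is made up for illustration.

```python
import select

# Assumption: FD_SETSIZE is 1024 on this platform (typical on Linux).
FD_SETSIZE_ASSUMED = 1024


def fits_in_select(fd):
    """Return False if select() rejects this descriptor number as out of range."""
    try:
        select.select([fd], [], [], 0)
    except ValueError:
        # e.g. "filedescriptor out of range in select()"
        return False
    except OSError:
        # The number is in range; the descriptor simply is not open.
        return True
    return True


print(fits_in_select(FD_SETSIZE_ASSUMED - 1))
print(fits_in_select(FD_SETSIZE_ASSUMED))
```

Under that assumption, descriptor number 1024 is rejected even though the process has almost nothing open, which is why a long-running process that keeps acquiring higher-numbered descriptors eventually hits this error.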

Related issues in the community:

https://github.com/ansible/ansible/issues/10157

https://issues.apache.org/jira/browse/QPID-5588

https://github.com/ansible/ansible/issues/14143

Reproduce:

cat test.py

#!/usr/bin/env python
# Python 2 script, matching the Python 2.6 environment shown in the traceback.
import subprocess
import os
import select
import getpass

host = 'example.com'
timeout = 5
password = getpass.getpass(prompt="Enter your password:")

for i in range(10):
        # Pipe used to hand the password to sshpass through a file descriptor.
        (r, w) = os.pipe()
        ssh_cmd = ['sshpass', '-d%d' % r]
        ssh_cmd += ['ssh', '%s' % host, 'uptime']
        print ssh_cmd
        os.write(w, password + '\n')
        os.close(w)
        p = subprocess.Popen(ssh_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        # Close the read end only after the child process has inherited it.
        os.close(r)
        rpipes = [p.stdout, p.stderr]
        print "file descriptor: %r" % [x.fileno() for x in rpipes]
        rfd, wfd, efd = select.select([p.stdout, p.stderr], [], [p.stderr], timeout)
        if rfd:
                print p.stdout.read()
                print p.stderr.read()
        p.stdout.close()
        p.stderr.close()

Solution:

Running the ansible-playbook command under nohup results in a file descriptor leak.

The right way to do it:

ansible-playbook -i inventory main.yml -k -K -U root -u test -s -f 10 >ansible_log 2>&1 </dev/null

Alternatively, you could use “screen” to keep the session alive.
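
If the playbook is started from a Python wrapper rather than a shell, the same redirections can be expressed with subprocess. This is only a sketch under stated assumptions: run_detached is a hypothetical helper, start_new_session is an addition that is not part of the shell command above, and password prompts such as -k and -K cannot be answered this way, so it presumes key-based authentication.

```python
import subprocess


def run_detached(cmd, log_path):
    """Start cmd with stdin from /dev/null and stdout+stderr written to log_path."""
    with open(log_path, "wb") as log:
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # same role as </dev/null
            stdout=log,                 # same role as >ansible_log
            stderr=subprocess.STDOUT,   # same role as 2>&1
            start_new_session=True,     # assumption: run in its own session
        )


# Hypothetical usage:
# proc = run_detached(["ansible-playbook", "-i", "inventory", "main.yml"], "ansible_log")
# proc.wait()
```

The parent closes its copy of the log file as soon as the child has started, so the wrapper itself does not hold on to extra descriptors.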

Practice2

Problem:

Imagine that we change the Hadoop configuration files and then need to restart the Hadoop service.

We do not want to restart the whole cluster at once, so we need a rolling restart.

Let’s say 10 servers per batch.

How do we do this?

Solution:

Add "serial: $NUM" to the main playbook. Sample:

cat main.yml

---
- hosts: upgrade_rack
  gather_facts: yes
  vars_files:
    - vars.yml
  pre_tasks:
    - include: turn_off_monitor.yml
  tasks:
    - include: pre_upgrade.yml
    - include: upgrade.yml
    - include: post_upgrade.yml
  post_tasks:
    - include: turn_on_monitor.yml
  serial: 10
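
To make the effect of "serial: 10" concrete, here is a rough Python model of the batching. It is not Ansible's implementation, just a sketch of the idea that the play runs to completion on one batch of hosts before moving to the next; the function name and the host names are hypothetical.

```python
def serial_batches(hosts, serial):
    """Split hosts into consecutive batches of at most `serial` hosts."""
    if serial < 1:
        raise ValueError("serial must be a positive integer")
    return [hosts[i:i + serial] for i in range(0, len(hosts), serial)]


# 25 hypothetical hosts in inventory order
hosts = ["node%02d" % n for n in range(1, 26)]

for number, batch in enumerate(serial_batches(hosts, 10), start=1):
    print("batch %d: %d hosts, %s .. %s" % (number, len(batch), batch[0], batch[-1]))
```

With 25 hosts and a batch size of 10, the last batch holds the remaining 5 hosts, so at most 10 nodes are restarting at any time.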

Practice3

Problem:

Ansible is used to deploy things to remote hosts, but sometimes we want to log in to another server and do something there while the playbook tasks are running.

How do we do this?

Solution:

Ansible provides the "delegate_to" feature for this. Ansible hands the task over to the delegated host to execute the command, then switches back. Sample:

- name: "Refresh nodes on resourcemanager"
  shell: "yarn rmadmin -refreshNodes"
  delegate_to: "example.com"

Practice4

Problem:

Sometimes we need to modify files with Ansible. Ansible provides several modules for this, such as “lineinfile”, “replace”, and “blockinfile”.

Now consider a more complex case: many hosts use the “replace” module to modify one configuration file on the same server, with forks set to 10.

What will happen?

The configuration file may end up corrupted, because multiple processes write to it at the same time.

- name: "Add hosts into mapred-exclude"
  replace: dest=mapred-exclude regexp='\Z' replace='{{inventory_hostname}}\n' owner=hadoop group=hadoop mode=644 backup=yes
  delegate_to: "example.com"

Solution:

We could add a file lock to the source code of the “replace” module, and release the lock after writing to the file.

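import fcntl

f = open(dest, 'rb+')
# Take an exclusive lock; other writers block here until it is released.
fcntl.flock(f, fcntl.LOCK_EX)
contents = f.read()
# Placeholder for the module's substitution step; result[0] holds the new contents.
result = do_something_to_contents
f.seek(0)
f.write(result[0])
f.truncate()
# Release the lock only after the file has been rewritten.
fcntl.flock(f, fcntl.LOCK_UN)
f.close()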
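
The same locking pattern can be tried outside of Ansible. Below is a minimal, self-contained sketch in Python 3 of a read-modify-write cycle guarded by flock(). The function name append_line_locked is hypothetical, the file must already exist, and the buffered write is flushed before the lock is released so that the next writer sees the full contents.

```python
import fcntl


def append_line_locked(path, line):
    """Append `line` to an existing file while holding an exclusive flock."""
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)       # blocks until no one else holds the lock
        try:
            contents = f.read()
            if contents and not contents.endswith("\n"):
                contents += "\n"
            contents += line + "\n"
            f.seek(0)
            f.write(contents)
            f.truncate()
            f.flush()                        # push data to the file before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Note that flock() is advisory: it only protects the file if every writer takes the lock, which is why the change has to live inside the module that all forks use.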

Practice5

Problem:

When running ansible-playbook against a batch of hosts, we often hit the following error in the “gather facts” step:

failed: [example.com] => {"cmd": "/bin/lsblk -ln --output UUID /dev/sdn1", "failed": true, "rc": 257}
msg: Traceback (most recent call last):
…………………
TimeoutError: Timer expired

Solution:

The reason is that the “timeout” for the get_mount_facts function in /usr/lib/python2.6/site-packages/ansible/module_utils/facts.py is hardcoded to 10 seconds.

Hadoop nodes often have high I/O load, so disks may be slow to respond, and 10 seconds is not enough.

This problem has been fixed in Ansible 2.2 with the introduction of the gather_timeout parameter.
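
To illustrate why a slow disk trips a "Timer expired" error, here is a minimal sketch of an alarm-based timeout decorator in Python 3. It is an illustration of the general technique, not Ansible's actual code; FactTimeout, with_timeout, and slow_disk_probe are hypothetical names, and the approach works only on Unix in the main thread.

```python
import functools
import signal
import time


class FactTimeout(Exception):
    """Raised when the wrapped call runs longer than allowed (hypothetical name)."""


def with_timeout(seconds, message="Timer expired"):
    """Abort the decorated function with FactTimeout after `seconds`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def on_alarm(signum, frame):
                raise FactTimeout(message)

            previous = signal.signal(signal.SIGALRM, on_alarm)
            signal.setitimer(signal.ITIMER_REAL, seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)   # cancel the pending alarm
                signal.signal(signal.SIGALRM, previous)   # restore the old handler
        return wrapper
    return decorator


@with_timeout(0.2)
def slow_disk_probe():
    time.sleep(2)   # stands in for a command waiting on a busy disk
    return "uuid"
```

With a fixed limit, any probe that takes longer than the limit fails regardless of whether the disk is healthy, which is why a configurable timeout is the practical fix on I/O-heavy nodes.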

0 0