rsync - Spidering Hacks

Source: Internet. Editor: 程序博客网. Time: 2024/06/14 07:55
 

Hack 92 Mirroring Web Sites with wget and rsync

Mirroring Directly with the Server

In this case, you have access to the server. Most likely, you want to mirror your own site or perhaps some other data. For this, rsync (http://rsync.samba.org/) is the ideal tool. rsync is a versatile tool for mirroring or backing up data across computers. There are multiple ways of using rsync between machines; however, here we are going to use ssh. This is the easiest to configure and has the added advantage of providing good security for the files being transferred.

Obviously, you will need ssh installed and configured on both systems. You will also need to make sure you can log in to the system you want to mirror from. Next, you'll need to determine which directories you want to mirror and where you want them mirrored to on your system. With that in mind, you just need to run rsync, passing it the necessary options:

rsync -a -e ssh remote.machine.com:/some/directory /local/directory

The -a option tells rsync you want to mirror the directory; it sets a series of options that make rsync keep timestamps, permissions, user and group ownership, symbolic links, and so on for all the files, and it recurses through the directory. The -e option followed by ssh tells rsync to use ssh to connect to the remote server; if you are not using public key authentication, ssh will prompt you for the password when it connects to the server. The next argument is the server to connect to, followed by the directory to mirror from and, finally, the directory to mirror to. Make sure the last argument is a directory that already exists on your system, because rsync will create directories only inside this one.
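The advice about the target directory can be turned into a small guard. The following is only a sketch: the function name mirror is made up for illustration, and the host and paths are the same placeholders used above.

```shell
#!/bin/sh
# Sketch: mirror a remote directory over ssh, refusing to run if the
# local target directory is missing. "mirror" is a hypothetical helper.

mirror() {
    remote=$1   # e.g. remote.machine.com:/some/directory
    target=$2   # e.g. /local/directory (must already exist)

    if [ ! -d "$target" ]; then
        echo "mirror: target directory '$target' does not exist" >&2
        return 1
    fi

    # -a: archive mode (recursion, timestamps, permissions, ownership, symlinks)
    # -e ssh: use ssh as the transport
    rsync -a -e ssh "$remote" "$target"
}
```

Calling mirror remote.machine.com:/some/directory /local/directory would place the copy at /local/directory/directory. Adding a trailing slash to the source (/some/directory/) makes rsync copy the contents of the directory straight into the target instead.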

Before getting into more options, it is a good idea to take a look at what this command just did. The mirrored directory should now appear the same as the directory on the server. Every file should be the same and should have the same timestamps, permissions, and so on. If you ran the command as root, the mirrored files will also have the same user and group ownership, assuming those accounts exist on this system.

Now, we'll talk a bit about how rsync works; it was designed for exactly what we are using it for. It checks each file to see whether it has changed (by default, by comparing size and modification time). If changes exist, it attempts to update the local file by sending only the parts that have changed. For new files, it sends the whole file. This is great, because not only is it better at checking for changes than wget's use of the HTTP protocol, but it also tries to send only the data necessary to update the file, saving nicely on bandwidth.
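To see which files rsync considers changed before transferring anything, a dry run can help. This is a minimal sketch; the function name preview is hypothetical, while -n and -i are standard rsync options.

```shell
#!/bin/sh
# Sketch: preview what a mirror run would transfer without changing anything.
# "preview" is a hypothetical helper name.

preview() {
    # -n (--dry-run): report what would be done, but transfer nothing
    # -i (--itemize-changes): print one line per item describing what differs
    rsync -a -n -i -e ssh "$1" "$2"
}
```

For example, preview remote.machine.com:/some/directory /local/directory lists the items that differ and leaves the local copy untouched.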

Now, on to the other options. The -z option is probably one you will always want to use; it tells rsync to compress the data stream, reducing bandwidth use and most likely making the entire process go faster.

The -v option tells rsync to spit out the names of the files it is syncing; this works well when coupled with the --progress and --stats options. The former adds a progress indicator to each file as it is downloaded, and the latter details statistics about the entire mirroring operation.

The -u option tells rsync to only update files (it does not touch local files with a timestamp newer than the one on the server). This option is useful only if you modify the files locally and want to keep those changes. If you intend to keep a fully accurate mirror of the remote site, do not use this option; however, keep in mind that any changes you make to the files locally will be overwritten.

Finally, the --delete option deletes local files that no longer exist on the server. If a file is deleted on the server, it will also be deleted from your mirror. Again, this is very useful if you want to maintain an exact mirror of the files on the server.
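Put together, the options above suggest two typical invocations: one for an exact mirror and one that preserves local edits. The sketch below wraps each in a function; both function names are hypothetical, and only the flags described in this section are used.

```shell
#!/bin/sh
# Sketch: two ways to combine the options discussed above.
# Both function names are hypothetical.

# Exact mirror: compress, show names/progress/statistics, and remove
# local files that have disappeared from the server.
mirror_exact() {
    rsync -az -v --progress --stats --delete -e ssh "$1" "$2"
}

# Keep local edits: skip files that are newer locally (-u) and never delete.
mirror_keep_local() {
    rsync -az -v -u -e ssh "$1" "$2"
}
```

Because --delete removes files, it is worth adding -n (dry run) to the first command once to confirm that the target directory is the one you intended.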

Hacking the Hack

The way we use rsync here is a secure, easy way to handle it. However, if you do not have ssh installed, you may be looking for an alternative. There are basically two other options. One is to use rsync with rsh instead of ssh. This still requires setup on the server, and although rsh is the more traditional tool, it is considerably less secure than ssh. If you use rsh, replace the -e ssh option with -e rsh (older versions of rsync used rsh by default, so simply removing -e ssh was enough; current versions default to ssh) and make sure you have rsh set up correctly on your server. Another option is to run rsync as a service on the server. This option does not have the security of ssh, but it allows you to use rsync without having ssh set up. To do this, you still need rsync installed on both machines, but you have to create an rsync configuration file on the server and make sure rsync runs as a service.

To begin, you'll want the following command run at startup on the server:

rsync --daemon

Then, you will want to create a configuration file for rsync, such as the following:

[backup]
path = /some/directory

Put this in the /etc/rsyncd.conf file and have rsync start as shown previously. Now, when you connect to the rsync server, you will want to change the options a bit. Instead of remote.machine.com:/some/directory, you will want remote.machine.com::backup. This tells rsync to connect to the backup module on the rsync server. You will also want to omit the -e ssh option. There is more you can do with the rsyncd.conf file, including restricting access based on usernames, setting read-only access, and so on. For a complete list of options, view the manpage for rsyncd.conf by typing man rsyncd.conf.
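As an illustration of the username and read-only restrictions mentioned above, the following sketch writes an example configuration to a scratch file for review. The user name mirror, the secrets file path, and the RSYNCD_CONF_EXAMPLE variable are assumptions made for this example; check man rsyncd.conf before relying on any of it.

```shell
#!/bin/sh
# Sketch: write an example rsyncd.conf to a scratch file for review.
# The "mirror" user, the secrets path, and RSYNCD_CONF_EXAMPLE are
# hypothetical placeholders, not values from the original text.

conf=${RSYNCD_CONF_EXAMPLE:-${TMPDIR:-/tmp}/rsyncd.conf.example}

cat > "$conf" <<'EOF'
[backup]
path = /some/directory
# Refuse uploads to this module
read only = yes
# Only this user may connect; passwords live in the secrets file
auth users = mirror
secrets file = /etc/rsyncd.secrets
EOF

echo "Wrote $conf; review it, then install it as /etc/rsyncd.conf on the server."
echo "Client side: rsync -az mirror@remote.machine.com::backup /local/directory"
```

The secrets file holds one user:password pair per line and should not be readable by other users. On the client, rsync prompts for that password when it connects to the module.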
