Scrapy Source Code Analysis (1): Framework Entry Point and Configuration Loading
Source: Internet · Editor: 程序博客网 · Date: 2024/05/17
This series covers Scrapy version 1.2.1, running on Python 2.7.

First, let's look at setup.py:
```python
entry_points={
    'console_scripts': ['scrapy = scrapy.cmdline:execute']
},
```

As you can see, the framework's only entry point is the command-line `scrapy` command, which maps to the execute method in scrapy.cmdline.
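As an aside, the console_scripts machinery is essentially "import the module, call the function". Below is a hypothetical sketch of what the generated `scrapy` wrapper script boils down to; `make_console_script` is an illustrative name of my own, not a setuptools API:

```python
# Hypothetical sketch of a console_scripts wrapper; not setuptools' real code.
import importlib

def make_console_script(spec):
    """Turn an entry-point spec like 'scrapy.cmdline:execute' into a
    zero-argument callable, the way a console_scripts wrapper does."""
    module_name, func_name = spec.split(':')
    def main():
        mod = importlib.import_module(module_name)
        return getattr(mod, func_name)()
    return main

# Resolve a stdlib function the same way, just to show the mechanism:
main = make_console_script('platform:python_version')
print(main())  # e.g. '3.11.4'
```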
Now let's look at the execute method:
```python
def execute(argv=None, settings=None):
    if argv is None:
        argv = sys.argv
    # --- backwards compatibility for scrapytest.conf.settings singleton ---
    if settings is None and 'scrapytest.conf' in sys.modules:
        from scrapytest import conf
        if hasattr(conf, 'settings'):
            settings = conf.settings
    # ------------------------------------------------------------------

    if settings is None:
        settings = get_project_settings()
    check_deprecated_settings(settings)

    # --- backwards compatibility for scrapytest.conf.settings singleton ---
    import warnings
    from scrapytest.exceptions import ScrapyDeprecationWarning
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ScrapyDeprecationWarning)
        from scrapytest import conf
        conf.settings = settings
    # ------------------------------------------------------------------

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(),
                                   conflict_handler='resolve')
    if not cmdname:
        _print_commands(settings, inproject)
        sys.exit(0)
    elif cmdname not in cmds:
        _print_unknown_command(settings, cmdname, inproject)
        sys.exit(2)

    cmd = cmds[cmdname]
    parser.usage = "scrapytest %s %s" % (cmdname, cmd.syntax())
    parser.description = cmd.long_desc()
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

    cmd.crawler_process = CrawlerProcess(settings)
    _run_print_help(parser, _run_command, cmd, args, opts)
    sys.exit(cmd.exitcode)
```
For brevity, I won't go over the backwards-compatibility parts of the code; I'll focus on the main flow and the ideas behind it.

1. Loading the settings.py configuration

After entering the function, the argv and settings variables are set up first:

```python
if argv is None:
    argv = sys.argv
if settings is None:
    settings = get_project_settings()
check_deprecated_settings(settings)
```

argv is taken directly from sys.argv, while settings requires a call to the get_project_settings() function:
```python
ENVVAR = 'SCRAPY_SETTINGS_MODULE'

def get_project_settings():
    if ENVVAR not in os.environ:
        project = os.environ.get('SCRAPY_PROJECT', 'default')
        init_env(project)

    settings = Settings()
    settings_module_path = os.environ.get(ENVVAR)
    if settings_module_path:
        settings.setmodule(settings_module_path, priority='project')

    # XXX: remove this hack
    pickled_settings = os.environ.get("SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE")
    if pickled_settings:
        settings.setdict(pickle.loads(pickled_settings), priority='project')

    # XXX: deprecate and remove this functionality
    env_overrides = {k[7:]: v for k, v in os.environ.items()
                     if k.startswith('SCRAPY_')}
    if env_overrides:
        settings.setdict(env_overrides, priority='project')

    return settings
```
When the environment variable ENVVAR is not set, get_project_settings() initializes the environment via init_env(project). As for how 'SCRAPY_PROJECT' gets written into the environment in the first place, I haven't tracked that down yet.

init_env() and its related functions:
```python
def closest_scrapy_cfg(path='.', prevpath=None):
    """Return the path to the closest scrapytest.cfg file by traversing the
    current directory and its parents
    """
    if path == prevpath:
        return ''
    path = os.path.abspath(path)
    cfgfile = os.path.join(path, 'scrapytest.cfg')
    if os.path.exists(cfgfile):
        return cfgfile
    return closest_scrapy_cfg(os.path.dirname(path), path)


def get_sources(use_closest=True):
    xdg_config_home = os.environ.get('XDG_CONFIG_HOME') or \
        os.path.expanduser('~/.config')
    sources = ['/etc/scrapytest.cfg', r'c:\scrapy\scrapy.cfg',
               xdg_config_home + '/scrapytest.cfg',
               os.path.expanduser('~/.scrapytest.cfg')]
    if use_closest:
        sources.append(closest_scrapy_cfg())
    return sources


from six.moves.configparser import SafeConfigParser

def get_config(use_closest=True):
    """Get Scrapy config file as a SafeConfigParser"""
    sources = get_sources(use_closest)
    cfg = SafeConfigParser()
    cfg.read(sources)
    return cfg
```
```python
def init_env(project='default', set_syspath=True):
    """Initialize environment to use command-line tool from inside a project
    dir. This sets the Scrapy settings module and modifies the Python path to
    be able to locate the project module.
    """
    cfg = get_config()
    if cfg.has_option('settings', project):
        os.environ['SCRAPY_SETTINGS_MODULE'] = cfg.get('settings', project)
    closest = closest_scrapy_cfg()
    if closest:
        projdir = os.path.dirname(closest)
        if set_syspath and projdir not in sys.path:
            sys.path.append(projdir)
```

closest_scrapy_cfg() looks for a scrapy.cfg file in the current directory; if none is found, it recursively walks up the parent directories until it reaches the filesystem root. If a scrapy.cfg file is found, its absolute path is returned.
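The upward traversal can be demonstrated against a throwaway directory tree; `closest_cfg` below is a simplified re-implementation for illustration, not the Scrapy function itself:

```python
# Simplified re-implementation of the "walk up until you find the cfg" idea.
import os
import tempfile

def closest_cfg(path, name='scrapy.cfg', prevpath=None):
    if path == prevpath:              # reached the filesystem root
        return ''
    path = os.path.abspath(path)
    cfgfile = os.path.join(path, name)
    if os.path.exists(cfgfile):
        return cfgfile
    return closest_cfg(os.path.dirname(path), name, path)

root = tempfile.mkdtemp()
open(os.path.join(root, 'scrapy.cfg'), 'w').close()   # fake project root
sub = os.path.join(root, 'tutorial', 'spiders')
os.makedirs(sub)                                      # nested project dirs

found = closest_cfg(sub)              # search starts deep inside the project
print(found == os.path.join(root, 'scrapy.cfg'))  # True
```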
get_sources() returns a list of candidate paths where a .cfg file may exist, including the one found in the current directory tree by closest_scrapy_cfg().

An example scrapy.cfg:
```ini
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = tutorial.settings

[deploy]
#url = http://localhost:6800/
project = tutorial
```

get_config() returns a SafeConfigParser instance that wraps the .cfg file.
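To make the lookup concrete, here is a minimal sketch of what init_env() reads out of a cfg like the one above, using the Python 3 stdlib ConfigParser in place of the py2 SafeConfigParser:

```python
# Minimal sketch of init_env()'s cfg lookup; the cfg content mirrors the
# example above, parsed from a string instead of the real files.
from configparser import ConfigParser

cfg = ConfigParser()
cfg.read_string("""
[settings]
default = tutorial.settings

[deploy]
project = tutorial
""")

# init_env('default') effectively does cfg.get('settings', project):
settings_module = cfg.get('settings', 'default')
print(settings_module)  # tutorial.settings
```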
Back to init_env(). The function first sets 'SCRAPY_SETTINGS_MODULE' to the settings module named in the .cfg file, then calls closest_scrapy_cfg() and appends the project directory to sys.path.

Back up to the caller, get_project_settings(): its main job is to create a Settings object, initialize it from the module named by 'SCRAPY_SETTINGS_MODULE', and return it.

Next, execute() calls check_deprecated_settings(settings) to check for deprecated settings; if any are found, a warning line is printed.

With that, the configuration has been fully loaded.
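The priority strings passed around above ('project', 'command', 'cmdline') are central to how Settings resolves values: a write only wins over an existing value if its priority is at least as high. A toy sketch of the idea; ToySettings is an illustrative model, not Scrapy's Settings class, and the priority numbers are assumptions mirroring Scrapy's documented SETTINGS_PRIORITIES:

```python
# Toy priority-based settings store illustrating set/setdict semantics.
PRIORITIES = {'default': 0, 'command': 10, 'project': 20, 'cmdline': 40}

class ToySettings(object):
    def __init__(self):
        self._values = {}  # name -> (value, priority)

    def set(self, name, value, priority='project'):
        pri = PRIORITIES[priority]
        # a write only wins if its priority is >= the stored one
        if name not in self._values or pri >= self._values[name][1]:
            self._values[name] = (value, pri)

    def setdict(self, d, priority='project'):
        for name, value in d.items():
            self.set(name, value, priority)

    def __getitem__(self, name):
        return self._values[name][0]

s = ToySettings()
s.setdict({'LOG_LEVEL': 'DEBUG'}, priority='default')   # framework default
s.setdict({'LOG_LEVEL': 'INFO'}, priority='project')    # settings.py
s.set('LOG_LEVEL', 'WARNING', priority='cmdline')       # -s LOG_LEVEL=WARNING
s.set('LOG_LEVEL', 'ERROR', priority='default')         # too weak: ignored
print(s['LOG_LEVEL'])  # WARNING
```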
2. Collecting all runnable commands

Next, let's analyze this block:
```python
inproject = inside_project()
cmds = _get_commands_dict(settings, inproject)
cmdname = _pop_command_name(argv)
parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(),
                               conflict_handler='resolve')
if not cmdname:
    _print_commands(settings, inproject)
    sys.exit(0)
elif cmdname not in cmds:
    _print_unknown_command(settings, cmdname, inproject)
    sys.exit(2)
```

First, the inside_project() function:

```python
def inside_project():
    scrapy_module = os.environ.get('SCRAPY_SETTINGS_MODULE')
    if scrapy_module is not None:
        try:
            import_module(scrapy_module)
        except ImportError as exc:
            warnings.warn("Cannot import scrapytest settings module %s: %s"
                          % (scrapy_module, exc))
        else:
            return True
    return bool(closest_scrapy_cfg())
```

This function tries to import the settings module; if the import succeeds, we are inside a project, so it returns True. If the import fails (or the environment variable is not set), it falls back to checking recursively whether the current directory or any of its parents contains a scrapy.cfg: True if one is found, False otherwise.
```python
def _iter_command_classes(module_name):
    # TODO: add `name` attribute to commands and merge this function with
    # scrapytest.utils.spider.iter_spider_classes
    for module in walk_modules(module_name):
        for obj in vars(module).values():
            if inspect.isclass(obj) and \
                    issubclass(obj, ScrapyCommand) and \
                    obj.__module__ == module.__name__:
                yield obj


def _get_commands_from_module(module, inproject):
    d = {}
    for cmd in _iter_command_classes(module):
        if inproject or not cmd.requires_project:
            cmdname = cmd.__module__.split('.')[-1]
            d[cmdname] = cmd()
    return d


def _get_commands_from_entry_points(inproject, group='scrapytest.commands'):
    cmds = {}
    for entry_point in pkg_resources.iter_entry_points(group):
        obj = entry_point.load()
        if inspect.isclass(obj):
            cmds[entry_point.name] = obj()
        else:
            raise Exception("Invalid entry point %s" % entry_point.name)
    return cmds


def _get_commands_dict(settings, inproject):
    cmds = _get_commands_from_module('scrapytest.commands', inproject)
    cmds.update(_get_commands_from_entry_points(inproject))
    cmds_module = settings['COMMANDS_MODULE']
    if cmds_module:
        cmds.update(_get_commands_from_module(cmds_module, inproject))
    return cmds
```
Now look at _get_commands_dict(settings, inproject): it gathers runnable commands from three places:

1. _iter_command_classes: walks the classes under scrapytest.commands and instantiates those that satisfy the inproject condition into cmds.
2. _get_commands_from_entry_points: loads external extension entry points registered under the group scrapytest.commands and adds them to cmds.
3. settings['COMMANDS_MODULE']: commands from the module named in settings.py are added to cmds.
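The module-scanning step (item 1 above) can be sketched with a fake module; `Command` and `iter_command_classes` below are simplified stand-ins for ScrapyCommand and _iter_command_classes, not the real objects:

```python
# Toy sketch of the module scan: collect classes *defined* in a module
# (hence the obj.__module__ check) that subclass a base command class.
import inspect
import types

class Command(object):
    requires_project = False

def iter_command_classes(module):
    for obj in vars(module).values():
        if (inspect.isclass(obj) and issubclass(obj, Command)
                and obj is not Command and obj.__module__ == module.__name__):
            yield obj

# Build a fake "commands" module containing one command class:
mod = types.ModuleType('fakecommands.crawl')

class CrawlCommand(Command):
    pass

CrawlCommand.__module__ = mod.__name__   # pretend it was defined there
mod.CrawlCommand = CrawlCommand

# The command name is the last segment of the defining module's path:
cmds = {cls.__module__.split('.')[-1]: cls()
        for cls in iter_command_classes(mod)}
print(sorted(cmds))  # ['crawl']
```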
3. Validating the command

Next, let's analyze this block:

```python
cmdname = _pop_command_name(argv)
parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(),
                               conflict_handler='resolve')
if not cmdname:
    _print_commands(settings, inproject)
    sys.exit(0)
elif cmdname not in cmds:
    _print_unknown_command(settings, cmdname, inproject)
    sys.exit(2)
```
First, the _pop_command_name(argv) function: it finds the first argument that does not start with '-' and returns it:

```python
def _pop_command_name(argv):
    i = 0
    for arg in argv[1:]:
        if not arg.startswith('-'):
            del argv[i]
            return arg
        i += 1
```
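Its behavior is easiest to see on a concrete argv. Note the subtlety: `del argv[i]` deletes the element just *before* the command (the program name, in the common case), so when execute() later parses `argv[1:]`, both the program name and the command itself are skipped. A standalone re-implementation for illustration:

```python
# Standalone copy of the function, applied to a typical command line.
def pop_command_name(argv):
    i = 0
    for arg in argv[1:]:
        if not arg.startswith('-'):
            del argv[i]   # removes the element before the command
            return arg
        i += 1

argv = ['scrapy', 'crawl', 'myspider', '-s', 'LOG_LEVEL=INFO']
name = pop_command_name(argv)
print(name)   # crawl
print(argv)   # ['crawl', 'myspider', '-s', 'LOG_LEVEL=INFO']
```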
The _print_commands function:

```python
def _print_commands(settings, inproject):
    _print_header(settings, inproject)
    print("Usage:")
    print("  scrapytest <command> [options] [args]\n")
    print("Available commands:")
    cmds = _get_commands_dict(settings, inproject)
    for cmdname, cmdclass in sorted(cmds.items()):
        print("  %-13s %s" % (cmdname, cmdclass.short_desc()))
    if not inproject:
        print()
        print("  [ more ]      More commands available when run from project directory")
    print()
    print('Use "scrapytest <command> -h" to see more info about a command')
```

The _print_header function:

```python
def _print_header(settings, inproject):
    if inproject:
        print("Scrapy %s - project: %s\n" % (scrapytest.__version__,
                                             settings['BOT_NAME']))
    else:
        print("Scrapy %s - no active project\n" % scrapytest.__version__)
```

The _print_unknown_command function:

```python
def _print_unknown_command(settings, cmdname, inproject):
    _print_header(settings, inproject)
    print("Unknown command: %s\n" % cmdname)
    print('Use "scrapytest" to see available commands')
```
To sum up this block: it first pops the command name from argv, then builds an optparse.OptionParser() object for later use, and finally handles the two error cases, no command given and an unknown command name, printing the appropriate message for each.

4. Running the command

Finally, let's analyze this block:
```python
cmd = cmds[cmdname]
parser.usage = "scrapytest %s %s" % (cmdname, cmd.syntax())
parser.description = cmd.long_desc()
settings.setdict(cmd.default_settings, priority='command')
cmd.settings = settings
cmd.add_options(parser)
opts, args = parser.parse_args(args=argv[1:])
_run_print_help(parser, cmd.process_options, args, opts)

cmd.crawler_process = CrawlerProcess(settings)
_run_print_help(parser, _run_command, cmd, args, opts)
sys.exit(cmd.exitcode)
```

The first few steps are straightforward: the parser's usage and description attributes are filled in from the command's syntax() and long_desc(), the command's default_settings are merged into settings at 'command' priority, settings is assigned to cmd.settings, and parser is passed to cmd.add_options() so the command can register its own options.
Before going further, here is the base class of cmd:

```python
def arglist_to_dict(arglist):
    """Convert a list of arguments like ['arg1=val1', 'arg2=val2', ...] to a
    dict
    """
    return dict(x.split('=', 1) for x in arglist)
```
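A quick demonstration of arglist_to_dict's behavior; note that `split('=', 1)` splits only on the first '=', so '=' characters inside the value are preserved:

```python
# Demonstration of arglist_to_dict on typical -s NAME=VALUE pairs.
def arglist_to_dict(arglist):
    return dict(x.split('=', 1) for x in arglist)

overrides = arglist_to_dict(['LOG_LEVEL=INFO', 'FEED_URI=file:///tmp/x=1.json'])
print(overrides['LOG_LEVEL'])  # INFO
print(overrides['FEED_URI'])   # file:///tmp/x=1.json
```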
```python
class ScrapyCommand(object):

    requires_project = False
    crawler_process = None

    # default settings to be used for this command instead of global defaults
    default_settings = {}

    exitcode = 0

    def __init__(self):
        self.settings = None  # set in scrapytest.cmdline

    def set_crawler(self, crawler):
        assert not hasattr(self, '_crawler'), "crawler already set"
        self._crawler = crawler

    def syntax(self):
        """
        Command syntax (preferably one-line). Do not include command name.
        """
        return ""

    def short_desc(self):
        """
        A short description of the command
        """
        return ""

    def long_desc(self):
        """A long description of the command. Return short description when not
        available. It cannot contain newlines, since contents will be formatted
        by optparser which removes newlines and wraps text.
        """
        return self.short_desc()

    def help(self):
        """An extensive help for the command. It will be shown when using the
        "help" command. It can contain newlines, since not post-formatting will
        be applied to its contents.
        """
        return self.long_desc()

    def add_options(self, parser):
        """
        Populate option parse with options available for this command
        """
        group = OptionGroup(parser, "Global Options")
        group.add_option("--logfile", metavar="FILE",
            help="log file. if omitted stderr will be used")
        group.add_option("-L", "--loglevel", metavar="LEVEL", default=None,
            help="log level (default: %s)" % self.settings['LOG_LEVEL'])
        group.add_option("--nolog", action="store_true",
            help="disable logging completely")
        group.add_option("--profile", metavar="FILE", default=None,
            help="write python cProfile stats to FILE")
        group.add_option("--pidfile", metavar="FILE",
            help="write process ID to FILE")
        group.add_option("-s", "--set", action="append", default=[],
            metavar="NAME=VALUE",
            help="set/override setting (may be repeated)")
        group.add_option("--pdb", action="store_true",
            help="enable pdb on failure")

        parser.add_option_group(group)

    def process_options(self, args, opts):
        try:
            self.settings.setdict(arglist_to_dict(opts.set),
                                  priority='cmdline')
        except ValueError:
            raise UsageError("Invalid -s value, use -s NAME=VALUE",
                             print_help=False)

        if opts.logfile:
            self.settings.set('LOG_ENABLED', True, priority='cmdline')
            self.settings.set('LOG_FILE', opts.logfile, priority='cmdline')

        if opts.loglevel:
            self.settings.set('LOG_ENABLED', True, priority='cmdline')
            self.settings.set('LOG_LEVEL', opts.loglevel, priority='cmdline')

        if opts.nolog:
            self.settings.set('LOG_ENABLED', False, priority='cmdline')

        if opts.pidfile:
            with open(opts.pidfile, "w") as f:
                f.write(str(os.getpid()) + os.linesep)

        if opts.pdb:
            failure.startDebugMode()

    def run(self, args, opts):
        """
        Entry point for running commands
        """
        raise NotImplementedError
```

The function we mainly need here is process_options(): it copies the values given with -s (opts.set) into the command's settings at 'cmdline' priority, which is what lets us override settings from the command line.
It then handles the common logging options (logfile, loglevel, nolog), along with pidfile and pdb.

Now look at _run_print_help:
```python
def _run_print_help(parser, func, *a, **kw):
    try:
        func(*a, **kw)
    except UsageError as e:
        if str(e):
            parser.error(str(e))
        if e.print_help:
            parser.print_help()
        sys.exit(2)
```
It wraps the given function: when func raises a UsageError, the error and/or help message is printed and the process exits.
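This is an instance of a common pattern: run a callable and translate a domain-specific exception into user-facing error output. A toy version, where `run_print_help` and `UsageError` are simplified stand-ins (the real code calls parser.error(), parser.print_help(), and sys.exit(2)):

```python
# Toy model of the _run_print_help error-wrapping pattern.
class UsageError(Exception):
    def __init__(self, msg='', print_help=True):
        super(UsageError, self).__init__(msg)
        self.print_help = print_help

def run_print_help(report_error, func, *a, **kw):
    try:
        return func(*a, **kw)
    except UsageError as e:
        if str(e):
            report_error(str(e))
        return 2   # stand-in for sys.exit(2)

def bad_command():
    raise UsageError('Invalid -s value, use -s NAME=VALUE')

errors = []
rc = run_print_help(errors.append, bad_command)
print(rc, errors)  # 2 ['Invalid -s value, use -s NAME=VALUE']
```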
Next, a CrawlerProcess(settings) object is created and assigned to cmd.crawler_process.
```python
class CrawlerProcess(CrawlerRunner):
    """
    A class to run multiple scrapytest crawlers in a process simultaneously.

    This class extends :class:`~scrapytest.crawler.CrawlerRunner` by adding
    support for starting a Twisted `reactor`_ and handling shutdown signals,
    like the keyboard interrupt command Ctrl-C. It also configures top-level
    logging.

    This utility should be a better fit than
    :class:`~scrapytest.crawler.CrawlerRunner` if you aren't running another
    Twisted `reactor`_ within your application.

    The CrawlerProcess object must be instantiated with a
    :class:`~scrapytest.settings.Settings` object.

    This class shouldn't be needed (since Scrapy is responsible of using it
    accordingly) unless writing scripts that manually handle the crawling
    process. See :ref:`run-from-script` for an example.
    """

    def __init__(self, settings=None):
        super(CrawlerProcess, self).__init__(settings)
        install_shutdown_handlers(self._signal_shutdown)
        configure_logging(self.settings)
        log_scrapy_info(self.settings)

    def _signal_shutdown(self, signum, _):
        install_shutdown_handlers(self._signal_kill)
        signame = signal_names[signum]
        logger.info("Received %(signame)s, shutting down gracefully. Send again to force ",
                    {'signame': signame})
        reactor.callFromThread(self._graceful_stop_reactor)

    def _signal_kill(self, signum, _):
        install_shutdown_handlers(signal.SIG_IGN)
        signame = signal_names[signum]
        logger.info('Received %(signame)s twice, forcing unclean shutdown',
                    {'signame': signame})
        reactor.callFromThread(self._stop_reactor)

    def start(self, stop_after_crawl=True):
        """
        This method starts a Twisted `reactor`_, adjusts its pool size to
        :setting:`REACTOR_THREADPOOL_MAXSIZE`, and installs a DNS cache based
        on :setting:`DNSCACHE_ENABLED` and :setting:`DNSCACHE_SIZE`.

        If `stop_after_crawl` is True, the reactor will be stopped after all
        crawlers have finished, using :meth:`join`.

        :param boolean stop_after_crawl: stop or not the reactor when all
            crawlers have finished
        """
        if stop_after_crawl:
            d = self.join()
            # Don't start the reactor if the deferreds are already fired
            if d.called:
                return
            d.addBoth(self._stop_reactor)

        reactor.installResolver(self._get_dns_resolver())
        tp = reactor.getThreadPool()
        tp.adjustPoolsize(maxthreads=self.settings.getint('REACTOR_THREADPOOL_MAXSIZE'))
        reactor.addSystemEventTrigger('before', 'shutdown', self.stop)
        reactor.run(installSignalHandlers=False)  # blocking call

    def _get_dns_resolver(self):
        if self.settings.getbool('DNSCACHE_ENABLED'):
            cache_size = self.settings.getint('DNSCACHE_SIZE')
        else:
            cache_size = 0
        return CachingThreadedResolver(
            reactor=reactor,
            cache_size=cache_size,
            timeout=self.settings.getfloat('DNS_TIMEOUT')
        )

    def _graceful_stop_reactor(self):
        d = self.stop()
        d.addBoth(self._stop_reactor)
        return d

    def _stop_reactor(self, _=None):
        try:
            reactor.stop()
        except RuntimeError:  # raised if already stopped or in shutdown stage
            pass
```

Scrapy's asynchronous nature is embodied in CrawlerProcess; I'll analyze this class further in a later article.
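The two-stage shutdown in _signal_shutdown/_signal_kill (first Ctrl-C asks for a graceful stop, the second forces an unclean one) can be modeled without Twisted or real OS signals; `ShutdownPolicy` below is purely an illustrative toy of the handler-swapping idea:

```python
# Toy model of the two-stage shutdown: the first signal re-arms the handler
# with a harsher one (as install_shutdown_handlers does), the second forces.
class ShutdownPolicy(object):
    def __init__(self):
        self.state = 'running'
        self.handler = self.graceful

    def graceful(self):
        self.state = 'stopping'       # ~ _graceful_stop_reactor
        self.handler = self.forced    # re-arm for the next signal

    def forced(self):
        self.state = 'killed'         # ~ _stop_reactor

    def signal(self):
        self.handler()

p = ShutdownPolicy()
p.signal()        # first Ctrl-C: graceful stop requested
assert p.state == 'stopping'
p.signal()        # second Ctrl-C: forced shutdown
print(p.state)    # killed
```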
Now for the second _run_print_help call, where the wrapped function is _run_command:

```python
def _run_command(cmd, args, opts):
    if opts.profile:
        _run_command_profiled(cmd, args, opts)
    else:
        cmd.run(args, opts)

def _run_command_profiled(cmd, args, opts):
    if opts.profile:
        sys.stderr.write("scrapytest: writing cProfile stats to %r\n"
                         % opts.profile)
    loc = locals()
    p = cProfile.Profile()
    p.runctx('cmd.run(args, opts)', globals(), loc)
    if opts.profile:
        p.dump_stats(opts.profile)
```

This mainly wraps cmd.run() with cProfile: when --profile is given, the profiling stats are dumped to a file, which makes them easy to organize and inspect later.
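The same runctx/dump_stats pattern can be tried standalone; `work()` and the file path here are illustrative:

```python
# Minimal sketch of the cProfile pattern used by _run_command_profiled:
# run a statement under the profiler via runctx, then dump stats to a file.
import cProfile
import os
import pstats
import tempfile

def work():
    return sum(i * i for i in range(1000))

loc = {'work': work}            # runctx's local namespace, like loc = locals()
p = cProfile.Profile()
p.runctx('result = work()', globals(), loc)   # 'result' lands in loc

statsfile = os.path.join(tempfile.mkdtemp(), 'stats.prof')
p.dump_stats(statsfile)         # same call as the opts.profile branch

# The dumped file can be loaded later for inspection:
stats = pstats.Stats(statsfile)
print(loc['result'])  # 332833500
```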
I will cover the CrawlerProcess and Command classes in separate articles. Thanks for reading!
Some related links:

ConfigParser: Python's configuration parsing module in detail
Python entry points
Measuring Python performance with cProfile