Scrapy Source Code Analysis (Part 1): Framework Entry Point and Configuration File Loading


This series covers Scrapy 1.2.1, running on Python 2.7.

Let's start by looking at setup.py:

entry_points={
    'console_scripts': ['scrapy = scrapy.cmdline:execute']
},
As you can see, the framework's only entry point is the command-line scrapy command, which maps to the execute function in scrapy.cmdline.
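In other words, typing scrapy in a shell simply calls this function. A minimal sketch of driving the same thing from Python (the spider name 'myspider' is a hypothetical placeholder):

# Roughly what the console script does when you type "scrapy crawl myspider";
# 'myspider' is a hypothetical spider name used only for illustration.
from scrapy.cmdline import execute

# Note: execute() ends with sys.exit(), so it does not return to the caller.
execute(['scrapy', 'crawl', 'myspider'])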

Now let's look at the execute function:

def execute(argv=None, settings=None):
    if argv is None:
        argv = sys.argv
    # --- backwards compatibility for scrapy.conf.settings singleton ---
    if settings is None and 'scrapy.conf' in sys.modules:
        from scrapy import conf
        if hasattr(conf, 'settings'):
            settings = conf.settings
    # ------------------------------------------------------------------
    if settings is None:
        settings = get_project_settings()
    check_deprecated_settings(settings)

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    import warnings
    from scrapy.exceptions import ScrapyDeprecationWarning
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ScrapyDeprecationWarning)
        from scrapy import conf
        conf.settings = settings
    # ------------------------------------------------------------------

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(), \
        conflict_handler='resolve')
    if not cmdname:
        _print_commands(settings, inproject)
        sys.exit(0)
    elif cmdname not in cmds:
        _print_unknown_command(settings, cmdname, inproject)
        sys.exit(2)

    cmd = cmds[cmdname]
    parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
    parser.description = cmd.long_desc()
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

    cmd.crawler_process = CrawlerProcess(settings)
    _run_print_help(parser, _run_command, cmd, args, opts)
    sys.exit(cmd.exitcode)

For brevity, I will not explain the backward-compatibility parts of the code; I will focus on the main flow and the ideas behind it.

1. Loading the settings.py configuration file

Upon entering the function, the first step is to set up the argv and settings variables:

    if argv is None:
        argv = sys.argv
    if settings is None:
        settings = get_project_settings()
    check_deprecated_settings(settings)

argv is taken directly from sys.argv, while settings requires a call to get_project_settings(), shown below:

ENVVAR = 'SCRAPY_SETTINGS_MODULE'

def get_project_settings():
    if ENVVAR not in os.environ:
        project = os.environ.get('SCRAPY_PROJECT', 'default')
        init_env(project)
    settings = Settings()
    settings_module_path = os.environ.get(ENVVAR)
    if settings_module_path:
        settings.setmodule(settings_module_path, priority='project')

    # XXX: remove this hack
    pickled_settings = os.environ.get("SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE")
    if pickled_settings:
        settings.setdict(pickle.loads(pickled_settings), priority='project')

    # XXX: deprecate and remove this functionality
    env_overrides = {k[7:]: v for k, v in os.environ.items() if
                     k.startswith('SCRAPY_')}
    if env_overrides:
        settings.setdict(env_overrides, priority='project')

    return settings

When the environment variable ENVVAR ('SCRAPY_SETTINGS_MODULE') is not set, get_project_settings() initializes the environment via init_env(project). As for how 'SCRAPY_PROJECT' gets written into the environment, I haven't found that yet.
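To make the environment-variable handling concrete, here is a small sketch of calling get_project_settings() with the variables set by hand (normally init_env() does this for you; 'tutorial.settings' is a hypothetical project module that must be importable for the sketch to run):

import os
from scrapy.utils.project import get_project_settings

# Point Scrapy at a settings module directly, as init_env() would do
# after reading scrapy.cfg; the module name is hypothetical.
os.environ['SCRAPY_SETTINGS_MODULE'] = 'tutorial.settings'
# Any other SCRAPY_-prefixed variable is stripped of its prefix and
# applied as a project-priority override (the env_overrides block above).
os.environ['SCRAPY_LOG_LEVEL'] = 'INFO'

settings = get_project_settings()
print(settings['BOT_NAME'], settings['LOG_LEVEL'])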

init_env() and its related functions:

def closest_scrapy_cfg(path='.', prevpath=None):
    """Return the path to the closest scrapy.cfg file by traversing the current
    directory and its parents
    """
    if path == prevpath:
        return ''
    path = os.path.abspath(path)
    cfgfile = os.path.join(path, 'scrapy.cfg')
    if os.path.exists(cfgfile):
        return cfgfile
    return closest_scrapy_cfg(os.path.dirname(path), path)
def get_sources(use_closest=True):
    xdg_config_home = os.environ.get('XDG_CONFIG_HOME') or \
        os.path.expanduser('~/.config')
    sources = ['/etc/scrapy.cfg', r'c:\scrapy\scrapy.cfg',
               xdg_config_home + '/scrapy.cfg',
               os.path.expanduser('~/.scrapy.cfg')]
    if use_closest:
        sources.append(closest_scrapy_cfg())
    return sources
from six.moves.configparser import SafeConfigParser

def get_config(use_closest=True):
    """Get Scrapy config file as a SafeConfigParser"""
    sources = get_sources(use_closest)
    cfg = SafeConfigParser()
    cfg.read(sources)
    return cfg
def init_env(project='default', set_syspath=True):
    """Initialize environment to use command-line tool from inside a project
    dir. This sets the Scrapy settings module and modifies the Python path to
    be able to locate the project module.
    """
    cfg = get_config()
    if cfg.has_option('settings', project):
        os.environ['SCRAPY_SETTINGS_MODULE'] = cfg.get('settings', project)
    closest = closest_scrapy_cfg()
    if closest:
        projdir = os.path.dirname(closest)
        if set_syspath and projdir not in sys.path:
            sys.path.append(projdir)
closest_scrapy_cfg() looks for a scrapy.cfg file in the current directory; if none is found, it recursively walks up through the parent directories until it reaches the root. If scrapy.cfg is found, it returns the absolute path to it (otherwise an empty string).

get_sources() returns a list of paths where a .cfg file may live, including the one found from the current directory via closest_scrapy_cfg().
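To make the search order concrete, here is a sketch with a hypothetical project layout (all paths are made up for illustration):

# Hypothetical layout:
#
#   /home/user/tutorial/scrapy.cfg
#   /home/user/tutorial/tutorial/settings.py
#   /home/user/tutorial/tutorial/spiders/       <- current working directory
#
# closest_scrapy_cfg() walks up from the current directory and returns
#   '/home/user/tutorial/scrapy.cfg'
#
# get_sources() then returns (files read later override earlier ones):
#   ['/etc/scrapy.cfg',
#    'c:\\scrapy\\scrapy.cfg',
#    '/home/user/.config/scrapy.cfg',   # or $XDG_CONFIG_HOME/scrapy.cfg
#    '/home/user/.scrapy.cfg',
#    '/home/user/tutorial/scrapy.cfg']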

An example scrapy.cfg file:

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = tutorial.settings

[deploy]
#url = http://localhost:6800/
project = tutorial
get_config() returns a SafeConfigParser instance that wraps the .cfg files.
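The wrapping is just standard ConfigParser usage; a minimal sketch of reading the [settings] section from the example file above (the file path is assumed to be the project's scrapy.cfg in the current directory):

from six.moves.configparser import SafeConfigParser

# Read the example scrapy.cfg shown above.
cfg = SafeConfigParser()
cfg.read(['scrapy.cfg'])

if cfg.has_option('settings', 'default'):
    # For the example file this prints 'tutorial.settings',
    # which is exactly what init_env() puts into SCRAPY_SETTINGS_MODULE.
    print(cfg.get('settings', 'default'))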

Back to init_env(). It first sets 'SCRAPY_SETTINGS_MODULE' to the settings module declared in the .cfg file, then calls closest_scrapy_cfg() and appends the project directory to sys.path.

Back in the caller, get_project_settings(): its main job is to create a Settings object, populate it from the module named by 'SCRAPY_SETTINGS_MODULE', and return it.

Next, execute() calls check_deprecated_settings(settings) to check for deprecated settings; if any are found, a warning line is printed.

At this point, the configuration files have been loaded.

2. Collecting all executable commands

Next, let's analyze this piece of code:

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(), \
        conflict_handler='resolve')
    if not cmdname:
        _print_commands(settings, inproject)
        sys.exit(0)
    elif cmdname not in cmds:
        _print_unknown_command(settings, cmdname, inproject)
        sys.exit(2)
First, the inside_project() function:

def inside_project():
    scrapy_module = os.environ.get('SCRAPY_SETTINGS_MODULE')
    if scrapy_module is not None:
        try:
            import_module(scrapy_module)
        except ImportError as exc:
            warnings.warn("Cannot import scrapy settings module %s: %s" % (scrapy_module, exc))
        else:
            return True
    return bool(closest_scrapy_cfg())
This function tries to import the settings module; if the import succeeds, we are inside a project, so it returns True. If the import fails, it falls back to recursively checking whether a scrapy.cfg exists in the current directory or one of its parents: if so, it returns True, otherwise we are not inside a project and it returns False.

def _iter_command_classes(module_name):
    # TODO: add `name` attribute to commands and merge this function with
    # scrapy.utils.spider.iter_spider_classes
    for module in walk_modules(module_name):
        for obj in vars(module).values():
            if inspect.isclass(obj) and \
                    issubclass(obj, ScrapyCommand) and \
                    obj.__module__ == module.__name__:
                yield obj
def _get_commands_from_module(module, inproject):
    d = {}
    for cmd in _iter_command_classes(module):
        if inproject or not cmd.requires_project:
            cmdname = cmd.__module__.split('.')[-1]
            d[cmdname] = cmd()
    return d
def _get_commands_from_entry_points(inproject, group='scrapy.commands'):
    cmds = {}
    for entry_point in pkg_resources.iter_entry_points(group):
        obj = entry_point.load()
        if inspect.isclass(obj):
            cmds[entry_point.name] = obj()
        else:
            raise Exception("Invalid entry point %s" % entry_point.name)
    return cmds
def _get_commands_dict(settings, inproject):
    cmds = _get_commands_from_module('scrapy.commands', inproject)
    cmds.update(_get_commands_from_entry_points(inproject))
    cmds_module = settings['COMMANDS_MODULE']
    if cmds_module:
        cmds.update(_get_commands_from_module(cmds_module, inproject))
    return cmds

Now look at _get_commands_dict(settings, inproject), which collects the executable commands from three places:

1. _iter_command_classes (via _get_commands_from_module): iterates over the classes under scrapy.commands, instantiates those that satisfy the inproject condition, and puts them into cmds.

2. _get_commands_from_entry_points: looks up external entry points registered under the 'scrapy.commands' group and adds them to cmds.

3. settings['COMMANDS_MODULE']: commands declared in settings.py, added to cmds (a sketch of such a custom command follows this list).
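The third source means a project can register its own commands. A minimal sketch, assuming a hypothetical project package named tutorial with a module tutorial/commands/hello.py (the file name becomes the command name, because _get_commands_from_module keys the dict on cmd.__module__.split('.')[-1]):

# tutorial/commands/hello.py -- hypothetical custom command
from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):
    requires_project = True

    def short_desc(self):
        return "Print the bot name of the current project"

    def run(self, args, opts):
        # self.settings is assigned by execute() before run() is called.
        print(self.settings['BOT_NAME'])

With COMMANDS_MODULE = 'tutorial.commands' added to settings.py, a "hello" command would then show up in the dict built by _get_commands_dict().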

3. Validating the command

Next, this piece of code:

    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(), \
        conflict_handler='resolve')
    if not cmdname:
        _print_commands(settings, inproject)
        sys.exit(0)
    elif cmdname not in cmds:
        _print_unknown_command(settings, cmdname, inproject)
        sys.exit(2)

First, the _pop_command_name(argv) function: it finds the first argument that does not start with '-' and returns it as the command name:

def _pop_command_name(argv):
    i = 0
    for arg in argv[1:]:
        if not arg.startswith('-'):
            del argv[i]
            return arg
        i += 1
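A quick illustrative trace of the common case (the spider name and output file are made up):

from scrapy.cmdline import _pop_command_name  # private helper, imported here only to illustrate

argv = ['scrapy', 'crawl', 'myspider', '-o', 'items.json']
cmdname = _pop_command_name(argv)
# cmdname == 'crawl'
# argv is mutated in place: the element just before the command name is removed,
# so argv[1:] == ['myspider', '-o', 'items.json'], which is exactly what
# parser.parse_args(args=argv[1:]) will see later in execute().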

The _print_commands function:

def _print_commands(settings, inproject):
    _print_header(settings, inproject)
    print("Usage:")
    print("  scrapy <command> [options] [args]\n")
    print("Available commands:")
    cmds = _get_commands_dict(settings, inproject)
    for cmdname, cmdclass in sorted(cmds.items()):
        print("  %-13s %s" % (cmdname, cmdclass.short_desc()))
    if not inproject:
        print()
        print("  [ more ]      More commands available when run from project directory")
    print()
    print('Use "scrapy <command> -h" to see more info about a command')

The _print_header function:

def _print_header(settings, inproject):
    if inproject:
        print("Scrapy %s - project: %s\n" % (scrapy.__version__, \
            settings['BOT_NAME']))
    else:
        print("Scrapy %s - no active project\n" % scrapy.__version__)

The _print_unknown_command function:

def _print_unknown_command(settings, cmdname, inproject):
    _print_header(settings, inproject)
    print("Unknown command: %s\n" % cmdname)
    print('Use "scrapy" to see available commands')

To sum up this part:

It first pops the command name from argv, then creates an optparse.OptionParser() object for later use, and finally handles the cases where no command was given or the command name is unknown, printing the corresponding message and exiting.

4. Running the command

Finally, this piece of code:

    cmd = cmds[cmdname]
    parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
    parser.description = cmd.long_desc()
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

    cmd.crawler_process = CrawlerProcess(settings)
    _run_print_help(parser, _run_command, cmd, args, opts)
    sys.exit(cmd.exitcode)
The first few steps are straightforward: the parser's usage and description are set from the command's syntax() and long_desc(), the command's default_settings are merged into settings at 'command' priority, settings is assigned to cmd.settings, and parser is passed into cmd.add_options(parser) so the command can register its own options.

Before going further, here is the base class of cmd:

def arglist_to_dict(arglist):
    """Convert a list of arguments like ['arg1=val1', 'arg2=val2', ...] to a
    dict
    """
    return dict(x.split('=', 1) for x in arglist)

class ScrapyCommand(object):

    requires_project = False
    crawler_process = None

    # default settings to be used for this command instead of global defaults
    default_settings = {}

    exitcode = 0

    def __init__(self):
        self.settings = None  # set in scrapy.cmdline

    def set_crawler(self, crawler):
        assert not hasattr(self, '_crawler'), "crawler already set"
        self._crawler = crawler

    def syntax(self):
        """
        Command syntax (preferably one-line). Do not include command name.
        """
        return ""

    def short_desc(self):
        """
        A short description of the command
        """
        return ""

    def long_desc(self):
        """A long description of the command. Return short description when not
        available. It cannot contain newlines, since contents will be formatted
        by optparser which removes newlines and wraps text.
        """
        return self.short_desc()

    def help(self):
        """An extensive help for the command. It will be shown when using the
        "help" command. It can contain newlines, since no post-formatting will
        be applied to its contents.
        """
        return self.long_desc()

    def add_options(self, parser):
        """
        Populate option parse with options available for this command
        """
        group = OptionGroup(parser, "Global Options")
        group.add_option("--logfile", metavar="FILE",
            help="log file. if omitted stderr will be used")
        group.add_option("-L", "--loglevel", metavar="LEVEL", default=None,
            help="log level (default: %s)" % self.settings['LOG_LEVEL'])
        group.add_option("--nolog", action="store_true",
            help="disable logging completely")
        group.add_option("--profile", metavar="FILE", default=None,
            help="write python cProfile stats to FILE")
        group.add_option("--pidfile", metavar="FILE",
            help="write process ID to FILE")
        group.add_option("-s", "--set", action="append", default=[], metavar="NAME=VALUE",
            help="set/override setting (may be repeated)")
        group.add_option("--pdb", action="store_true", help="enable pdb on failure")
        parser.add_option_group(group)

    def process_options(self, args, opts):
        try:
            self.settings.setdict(arglist_to_dict(opts.set),
                                  priority='cmdline')
        except ValueError:
            raise UsageError("Invalid -s value, use -s NAME=VALUE", print_help=False)

        if opts.logfile:
            self.settings.set('LOG_ENABLED', True, priority='cmdline')
            self.settings.set('LOG_FILE', opts.logfile, priority='cmdline')

        if opts.loglevel:
            self.settings.set('LOG_ENABLED', True, priority='cmdline')
            self.settings.set('LOG_LEVEL', opts.loglevel, priority='cmdline')

        if opts.nolog:
            self.settings.set('LOG_ENABLED', False, priority='cmdline')

        if opts.pidfile:
            with open(opts.pidfile, "w") as f:
                f.write(str(os.getpid()) + os.linesep)

        if opts.pdb:
            failure.startDebugMode()

    def run(self, args, opts):
        """
        Entry point for running commands
        """
        raise NotImplementedError
The function we mainly care about here is process_options, which takes the values collected in opts.set (from -s NAME=VALUE) and puts them into the command's settings at 'cmdline' priority, so we can adjust settings from the command line.

It then handles the common logging options (--logfile, -L/--loglevel, --nolog), plus --pidfile and --pdb.
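In practice this is the familiar -s mechanism; a small sketch of what it boils down to (the spider name and values are made up for illustration):

from scrapy.settings import Settings

# Simulate what "scrapy crawl myspider -s LOG_LEVEL=INFO -s DOWNLOAD_DELAY=2"
# ends up doing inside process_options(): opts.set collects
# ['LOG_LEVEL=INFO', 'DOWNLOAD_DELAY=2'], arglist_to_dict() turns it into a
# dict, and the dict is applied at 'cmdline' priority.
settings = Settings()
settings.setdict({'LOG_LEVEL': 'INFO', 'DOWNLOAD_DELAY': '2'}, priority='cmdline')
print(settings['LOG_LEVEL'])  # 'INFO', overriding lower-priority values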

Now look at _run_print_help:

def _run_print_help(parser, func, *a, **kw):
    try:
        func(*a, **kw)
    except UsageError as e:
        if str(e):
            parser.error(str(e))
        if e.print_help:
            parser.print_help()
        sys.exit(2)

It wraps the given function and prints the help message when func raises a UsageError.

A CrawlerProcess(settings) object is then created and assigned to cmd.crawler_process.

class CrawlerProcess(CrawlerRunner):
    """
    A class to run multiple scrapy crawlers in a process simultaneously.

    This class extends :class:`~scrapy.crawler.CrawlerRunner` by adding support
    for starting a Twisted `reactor`_ and handling shutdown signals, like the
    keyboard interrupt command Ctrl-C. It also configures top-level logging.

    This utility should be a better fit than
    :class:`~scrapy.crawler.CrawlerRunner` if you aren't running another
    Twisted `reactor`_ within your application.

    The CrawlerProcess object must be instantiated with a
    :class:`~scrapy.settings.Settings` object.

    This class shouldn't be needed (since Scrapy is responsible of using it
    accordingly) unless writing scripts that manually handle the crawling
    process. See :ref:`run-from-script` for an example.
    """

    def __init__(self, settings=None):
        super(CrawlerProcess, self).__init__(settings)
        install_shutdown_handlers(self._signal_shutdown)
        configure_logging(self.settings)
        log_scrapy_info(self.settings)

    def _signal_shutdown(self, signum, _):
        install_shutdown_handlers(self._signal_kill)
        signame = signal_names[signum]
        logger.info("Received %(signame)s, shutting down gracefully. Send again to force ",
                    {'signame': signame})
        reactor.callFromThread(self._graceful_stop_reactor)

    def _signal_kill(self, signum, _):
        install_shutdown_handlers(signal.SIG_IGN)
        signame = signal_names[signum]
        logger.info('Received %(signame)s twice, forcing unclean shutdown',
                    {'signame': signame})
        reactor.callFromThread(self._stop_reactor)

    def start(self, stop_after_crawl=True):
        """
        This method starts a Twisted `reactor`_, adjusts its pool size to
        :setting:`REACTOR_THREADPOOL_MAXSIZE`, and installs a DNS cache based
        on :setting:`DNSCACHE_ENABLED` and :setting:`DNSCACHE_SIZE`.

        If `stop_after_crawl` is True, the reactor will be stopped after all
        crawlers have finished, using :meth:`join`.

        :param boolean stop_after_crawl: stop or not the reactor when all
            crawlers have finished
        """
        if stop_after_crawl:
            d = self.join()
            # Don't start the reactor if the deferreds are already fired
            if d.called:
                return
            d.addBoth(self._stop_reactor)

        reactor.installResolver(self._get_dns_resolver())
        tp = reactor.getThreadPool()
        tp.adjustPoolsize(maxthreads=self.settings.getint('REACTOR_THREADPOOL_MAXSIZE'))
        reactor.addSystemEventTrigger('before', 'shutdown', self.stop)
        reactor.run(installSignalHandlers=False)  # blocking call

    def _get_dns_resolver(self):
        if self.settings.getbool('DNSCACHE_ENABLED'):
            cache_size = self.settings.getint('DNSCACHE_SIZE')
        else:
            cache_size = 0
        return CachingThreadedResolver(
            reactor=reactor,
            cache_size=cache_size,
            timeout=self.settings.getfloat('DNS_TIMEOUT')
        )

    def _graceful_stop_reactor(self):
        d = self.stop()
        d.addBoth(self._stop_reactor)
        return d

    def _stop_reactor(self, _=None):
        try:
            reactor.stop()
        except RuntimeError:  # raised if already stopped or in shutdown stage
            pass
Scrapy's asynchronous nature is embodied in CrawlerProcess; I will analyze this class further in a later article.
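This is also the class behind the documented "run Scrapy from a script" pattern. A minimal sketch (the spider below is a hypothetical placeholder, just so there is something to run):

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class QuotesSpider(scrapy.Spider):
    # Hypothetical spider used only for this example.
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for text in response.css('span.text::text').extract():
            yield {'text': text}

process = CrawlerProcess(get_project_settings())
process.crawl(QuotesSpider)
process.start()  # starts the Twisted reactor and blocks until the crawl finishes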

Now the second _run_print_help call, where the wrapped function becomes _run_command:

def _run_command(cmd, args, opts):
    if opts.profile:
        _run_command_profiled(cmd, args, opts)
    else:
        cmd.run(args, opts)

def _run_command_profiled(cmd, args, opts):
    if opts.profile:
        sys.stderr.write("scrapy: writing cProfile stats to %r\n" % opts.profile)
    loc = locals()
    p = cProfile.Profile()
    p.runctx('cmd.run(args, opts)', globals(), loc)
    if opts.profile:
        p.dump_stats(opts.profile)
This is mainly a wrapper around cProfile: when --profile is given, the stats of the whole cmd.run() call are dumped to the specified file, which makes it convenient to collect and inspect where the time went.
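The dumped file can then be examined with the standard library's pstats module; for example (the stats file name 'crawl.prof' is whatever was passed to --profile, hypothetical here):

import pstats

# Inspect the file produced by e.g. "scrapy crawl myspider --profile crawl.prof".
stats = pstats.Stats('crawl.prof')
stats.sort_stats('cumulative').print_stats(20)  # top 20 entries by cumulative time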

I will cover the CrawlerProcess class and the Command classes in separate articles. Thanks for reading!


