Think Stats Study Notes (1): Statistical Thinking for Programmers

Source: Internet. Editor: 程序博客网. Time: 2024/04/29 04:30

These notes record my progress and understanding while studying Allen B. Downey's book "Think Stats: Probability and Statistics for Programmers".

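Chapter 1: Statistical Thinking for Programmers

Background

Do most first babies arrive after their due date?

Data sources

National Survey of Family Growth (NSFG) data
NSFG data processing code
Mean pregnancy length code
Online survey documentation
Questionnaire contents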

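Data processing code: survey.py

Input

Place the NSFG data processing script survey.py in the same directory as the NSFG data files and run it. The program reads the data files and prints the number of records in each file.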

Output

>>>Number of respondents 7643
>>>Number of pregnancies 13593

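Code analysis

survey.py defines the following six classes:

Class name | Description
Record | An object representing one record
Respondent | Subclass of Record; an object representing one respondent record
Pregnancy | Subclass of Record; an object representing one pregnancy record
Table | A table object holding a collection of records
Respondents | Subclass of Table; the table of respondent records
Pregnancies | Subclass of Table; the table of pregnancy records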

Table

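Function prototype | Description | Parameters | Returns
ReadFile(self, data_dir, filename, fields, constructor, n=None) | Reads a compressed data file and builds one object per record | data_dir: string directory name; filename: string name of the file to read; fields: sequence of (name, start, end, cast) tuples specifying the fields to extract; constructor: what kind of object to create | -
MakeRecord(self, line, fields, constructor) | Scans a line and returns an object with the appropriate fields | line: string line from a data file; fields: sequence of (name, start, end, cast) tuples specifying the fields to extract; constructor: callable that makes an object for the record | Record with the appropriate fields
AddRecord(self, record) | Adds a record to this table | record: an object of one of the record types | -
ExtendRecords(self, records) | Adds a sequence of records to this table | records: a sequence of record objects | -
Recode(self) | Child classes can override this to recode values | - | -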
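The slicing rule that MakeRecord applies to each (name, start, end, cast) tuple can be tried on a made-up line. This is only a minimal sketch: the sample lines, the column positions in toy_fields, and the helper name parse_fixed_width are hypothetical and do not come from the NSFG files.

```python
from types import SimpleNamespace


def parse_fixed_width(line, fields):
    """Build one record from a fixed-width line.

    fields is a sequence of (name, start, end, cast) tuples, where start
    and end are 1-based inclusive column positions, as in survey.py.
    """
    record = SimpleNamespace()
    for name, start, end, cast in fields:
        try:
            # 1-based inclusive columns map to a 0-based half-open slice
            value = cast(line[start - 1:end])
        except ValueError:
            # blank or malformed field: mark it as not available
            value = 'NA'
        setattr(record, name, value)
    return record


# Hypothetical layout: columns 1-4 caseid, 5-6 prglength, 7 outcome
toy_fields = [
    ('caseid', 1, 4, int),
    ('prglength', 5, 6, int),
    ('outcome', 7, 7, int),
]

rec = parse_fixed_width('  42391', toy_fields)
rec_missing = parse_fixed_width('  43  1', toy_fields)  # prglength left blank

print(rec.caseid, rec.prglength, rec.outcome)
print(rec_missing.caseid, rec_missing.prglength, rec_missing.outcome)
```

The only subtle step is the index shift: the field positions are 1-based and inclusive, so the slice starts at start - 1 and stops at end.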

Respondents

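Function prototype | Description | Returns
ReadRecords(self, data_dir='.', n=None) | Reads records and builds the respondent table | -
GetFilename(self) | Returns the data file name | 2002FemResp.dat.gz
GetFields(self) | Returns a list of tuples specifying the record fields; these fields become attributes of the Record objects | caseid: integer ID of the respondent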

Pregnancies

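Function prototype | Description | Returns
ReadRecords(self, data_dir='.', n=None) | Reads records and builds the pregnancy table | -
GetFilename(self) | Returns the data file name | 2002FemPreg.dat.gz
GetFields(self) | Returns a list of tuples specifying the record fields; these fields become attributes of the Record objects | caseid: integer ID of the respondent; prglength: pregnancy length in weeks; outcome: integer code for the pregnancy outcome, where 1 means a live birth; birthord: birth order of each live birth; finalwgt: statistical weight of the respondent; nbrnaliv; babysex; birthwgt_lb; birthwgt_oz; agepreg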
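Several of these fields are recoded after loading by Pregnancies.Recode in the source. The sketch below applies the same two conversions to a single record so the effect is visible on concrete numbers. The function name recode and the sample values are hypothetical. Reading agepreg as hundredths of a year is an assumption based on the source dividing it by 100.

```python
from types import SimpleNamespace


def recode(rec):
    """Apply the two conversions from Pregnancies.Recode to one record."""
    # assumption: agepreg is stored in hundredths of a year (2575 -> 25.75)
    if rec.agepreg != 'NA':
        rec.agepreg /= 100.0
    # combine pounds and ounces into total ounces when both look plausible
    if (rec.birthwgt_lb != 'NA' and rec.birthwgt_lb < 20 and
            rec.birthwgt_oz != 'NA' and rec.birthwgt_oz <= 16):
        rec.totalwgt_oz = rec.birthwgt_lb * 16 + rec.birthwgt_oz
    else:
        rec.totalwgt_oz = 'NA'
    return rec


# Made-up records, not NSFG data
complete = recode(SimpleNamespace(agepreg=2575, birthwgt_lb=7, birthwgt_oz=8))
partial = recode(SimpleNamespace(agepreg='NA', birthwgt_lb=7, birthwgt_oz='NA'))

print(complete.agepreg, complete.totalwgt_oz)
print(partial.agepreg, partial.totalwgt_oz)
```

A record with a missing pound or ounce value keeps totalwgt_oz as 'NA', so later code has to check for that marker before doing arithmetic.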

Source code

    """This file contains code for use with "Think Stats",
    by Allen B. Downey, available from greenteapress.com

    Copyright 2010 Allen B. Downey
    License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
    """

    import sys
    import gzip
    import os

    class Record(object):
        """Represents a record."""

    class Respondent(Record):
        """Represents a respondent."""

    class Pregnancy(Record):
        """Represents a pregnancy."""

    class Table(object):
        """Represents a table as a list of objects"""

        def __init__(self):
            self.records = []

        def __len__(self):
            return len(self.records)

        def ReadFile(self, data_dir, filename, fields, constructor, n=None):
            """Reads a compressed data file builds one object per record.

            Args:
                data_dir: string directory name
                filename: string name of the file to read

                fields: sequence of (name, start, end, case) tuples specifying
                the fields to extract

                constructor: what kind of object to create
            """
            filename = os.path.join(data_dir, filename)

            if filename.endswith('gz'):
                fp = gzip.open(filename)
            else:
                fp = open(filename)

            for i, line in enumerate(fp):
                if i == n:
                    break
                record = self.MakeRecord(line, fields, constructor)
                self.AddRecord(record)
            fp.close()

        def MakeRecord(self, line, fields, constructor):
            """Scans a line and returns an object with the appropriate fields.

            Args:
                line: string line from a data file

                fields: sequence of (name, start, end, cast) tuples specifying
                the fields to extract

                constructor: callable that makes an object for the record.

            Returns:
                Record with appropriate fields.
            """
            obj = constructor()
            for (field, start, end, cast) in fields:
                try:
                    s = line[start-1:end]
                    val = cast(s)
                except ValueError:
                    # If you are using Visual Studio, you might see an
                    # "error" at this point, but it is not really an error;
                    # I am just using try...except to handle not-available (NA)
                    # data.  You should be able to tell Visual Studio to
                    # ignore this non-error.
                    val = 'NA'
                setattr(obj, field, val)
            return obj

        def AddRecord(self, record):
            """Adds a record to this table.

            Args:
                record: an object of one of the record types.
            """
            self.records.append(record)

        def ExtendRecords(self, records):
            """Adds records to this table.

            Args:
                records: a sequence of record object
            """
            self.records.extend(records)

        def Recode(self):
            """Child classes can override this to recode values."""
            pass

    class Respondents(Table):
        """Represents the respondent table."""

        def ReadRecords(self, data_dir='.', n=None):
            filename = self.GetFilename()
            self.ReadFile(data_dir, filename, self.GetFields(), Respondent, n)
            self.Recode()

        def GetFilename(self):
            return '2002FemResp.dat.gz'

        def GetFields(self):
            """Returns a tuple specifying the fields to extract.

            The elements of the tuple are field, start, end, case.

                    field is the name of the variable
                    start and end are the indices as specified in the NSFG docs
                    cast is a callable that converts the result to int, float, etc.
            """
            return [
                ('caseid', 1, 12, int),
                ]

    class Pregnancies(Table):
        """Contains survey data about a Pregnancy."""

        def ReadRecords(self, data_dir='.', n=None):
            filename = self.GetFilename()
            self.ReadFile(data_dir, filename, self.GetFields(), Pregnancy, n)
            self.Recode()

        def GetFilename(self):
            return '2002FemPreg.dat.gz'

        def GetFields(self):
            """Gets information about the fields to extract from the survey data.

            Documentation of the fields for Cycle 6 is at
            http://nsfg.icpsr.umich.edu/cocoon/WebDocs/NSFG/public/index.htm

            Returns:
                sequence of (name, start, end, type) tuples
            """
            return [
                ('caseid', 1, 12, int),
                ('nbrnaliv', 22, 22, int),
                ('babysex', 56, 56, int),
                ('birthwgt_lb', 57, 58, int),
                ('birthwgt_oz', 59, 60, int),
                ('prglength', 275, 276, int),
                ('outcome', 277, 277, int),
                ('birthord', 278, 279, int),
                ('agepreg', 284, 287, int),
                ('finalwgt', 423, 440, float),
                ]

        def Recode(self):
            for rec in self.records:

                # divide mother's age by 100
                try:
                    if rec.agepreg != 'NA':
                        rec.agepreg /= 100.0
                except AttributeError:
                    pass

                # convert weight at birth from lbs/oz to total ounces
                # note: there are some very low birthweights
                # that are almost certainly errors, but for now I am not
                # filtering
                try:
                    if (rec.birthwgt_lb != 'NA' and rec.birthwgt_lb < 20 and
                        rec.birthwgt_oz != 'NA' and rec.birthwgt_oz <= 16):
                        rec.totalwgt_oz = rec.birthwgt_lb * 16 + rec.birthwgt_oz
                    else:
                        rec.totalwgt_oz = 'NA'
                except AttributeError:
                    pass

    def main(name, data_dir='.'):
        resp = Respondents()
        resp.ReadRecords(data_dir)
        print ('Number of respondents', len(resp.records))

        preg = Pregnancies()
        preg.ReadRecords(data_dir)
        print ('Number of pregnancies', len(preg.records))

    if __name__ == '__main__':
        main(*sys.argv)

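Mean pregnancy length code: first.py

Input

Place first.py in the same directory as survey.py and the NSFG data files, then run it. The program reads the data files and compares the mean pregnancy length of first babies with that of other babies.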

Output

>>>Number of first babies 4413
>>>Number of others 4735
>>>Mean gestation in weeks:
>>>First babies 38.60095173351461
>>>Others 38.52291446673706
>>>Difference in days 0.5462608674428466
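The last line of this output is simply the gap between the two means converted from weeks to days. A quick check of that arithmetic, using only the two means printed above (the variable names are mine):

```python
mean_first = 38.60095173351461   # weeks, first babies, from the output above
mean_other = 38.52291446673706   # weeks, other babies, from the output above

diff_weeks = mean_first - mean_other
diff_days = diff_weeks * 7.0      # same conversion as in first.py
diff_hours = diff_days * 24.0

print(round(diff_days, 3), round(diff_hours, 1))  # prints 0.546 13.1
```

So the difference of about 0.55 days corresponds to roughly 13 hours.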

Source code

    """This file contains code used in "Think Stats",
    by Allen B. Downey, available from greenteapress.com

    Copyright 2010 Allen B. Downey
    License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
    """

    import survey

    # copying Mean from thinkstats.py so we don't have to deal with
    # importing anything in Chapter 1

    def Mean(t):
        """Computes the mean of a sequence of numbers.

        Args:
            t: sequence of numbers

        Returns:
            float
        """
        return float(sum(t)) / len(t)

    def PartitionRecords(table):
        """Divides records into two lists: first babies and others.

        Only live births are included

        Args:
            table: pregnancy Table
        """
        firsts = survey.Pregnancies()
        others = survey.Pregnancies()

        for p in table.records:
            # skip non-live births
            if p.outcome != 1:
                continue

            if p.birthord == 1:
                firsts.AddRecord(p)
            else:
                others.AddRecord(p)

        return firsts, others

    def Process(table):
        """Runs analysis on the given table.

        Args:
            table: table object
        """
        table.lengths = [p.prglength for p in table.records]
        table.n = len(table.lengths)
        table.mu = Mean(table.lengths)

    def MakeTables(data_dir='.'):
        """Reads survey data and returns tables for first babies and others."""
        table = survey.Pregnancies()
        table.ReadRecords(data_dir)

        firsts, others = PartitionRecords(table)

        return table, firsts, others

    def ProcessTables(*tables):
        """Processes a list of tables

        Args:
            tables: gathered argument tuple of Tuples
        """
        for table in tables:
            Process(table)

    def Summarize(data_dir):
        """Prints summary statistics for first babies and others.

        Returns:
            tuple of Tables
        """
        table, firsts, others = MakeTables(data_dir)
        ProcessTables(firsts, others)

        # print() is called as a function here (Python 3), which matches
        # the output listed above; the original used Python 2 print statements
        print('Number of first babies', firsts.n)
        print('Number of others', others.n)

        mu1, mu2 = firsts.mu, others.mu

        print('Mean gestation in weeks:')
        print('First babies', mu1)
        print('Others', mu2)

        print('Difference in days', (mu1 - mu2) * 7.0)

    def main(name, data_dir='.'):
        Summarize(data_dir)

    if __name__ == '__main__':
        import sys
        main(*sys.argv)

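Analysis of results

On average, first babies are born about 13 hours later than other babies (a difference of roughly 0.55 days). This is an apparent effect, but the following questions still need to be considered:

What do the other summary statistics look like, such as the median and the variance?
Could the difference have arisen purely by chance?
Could the difference be caused by selection bias or by an error in the experimental setup?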
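The first two questions can be explored with a few lines of code. The sketch below is one possible approach, not the method used in first.py: it prints the mean, median, and variance for two groups, then uses a simple permutation test (shuffling the group labels) to estimate how often chance alone gives a gap at least as large as the observed one. The two lists are made-up numbers, not NSFG data, and the function name permutation_p_value is hypothetical. The third question is about how the data was collected and cannot be answered by code alone.

```python
import random
import statistics


def permutation_p_value(a, b, iters=1000, seed=0):
    """Estimate how often shuffled labels give a mean gap >= the observed gap."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        # split the shuffled pool into two groups of the original sizes
        gap = abs(statistics.mean(pooled[:len(a)]) -
                  statistics.mean(pooled[len(a):]))
        if gap >= observed:
            count += 1
    return count / iters


# Synthetic pregnancy lengths in weeks (illustrative only, NOT NSFG data)
firsts = [39, 40, 38, 41, 39, 37, 40, 42, 39, 38]
others = [39, 38, 39, 40, 37, 39, 38, 41, 39, 36]

for label, data in (('firsts', firsts), ('others', others)):
    print(label,
          'mean', statistics.mean(data),
          'median', statistics.median(data),
          'variance', statistics.pvariance(data))

p_value = permutation_p_value(firsts, others)
print('estimated p-value', p_value)
```

With the real tables, the lengths attribute that Process attaches to firsts and others could be passed in place of the synthetic lists, although a pool of several thousand records would make each shuffle slower.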

0 0