Think Stats Study Notes (1): Statistical Thinking for Programmers
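These notes record my study of, and some thoughts on, Allen B. Downey's Think Stats (Chinese edition: 《统计思维:程序员数学之概率统计》).
Chapter 1: Statistical Thinking for Programmers
Research background
Do most first babies arrive after their due date?
Data sources
- National Survey of Family Growth (NSFG) data
- NSFG data-processing code
- Mean pregnancy length code
- Online survey documentation
- Survey questionnaire
Data-processing code: survey.py
Input
Place the NSFG data-processing code survey.py in the same directory as the NSFG data files and run it. The program reads the data files and prints the number of records in each one.
Output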
>>>Number of respondents 7643
>>>Number of pregnancies 13593
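Code analysis
survey.py defines the following six classes:
Record: base class representing a single record
Respondent: a Record representing one respondent
Pregnancy: a Record representing one pregnancy
Table: a table, stored as a list of record objects
Respondents: the Table of respondents
Pregnancies: the Table of pregnancies
Source code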
"""This file contains code for use with "Think Stats",
by Allen B. Downey, available from greenteapress.com

Copyright 2010 Allen B. Downey
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
"""

import sys
import gzip
import os


class Record(object):
    """Represents a record."""


class Respondent(Record):
    """Represents a respondent."""


class Pregnancy(Record):
    """Represents a pregnancy."""


class Table(object):
    """Represents a table as a list of objects"""

    def __init__(self):
        self.records = []

    def __len__(self):
        return len(self.records)

    def ReadFile(self, data_dir, filename, fields, constructor, n=None):
        """Reads a compressed data file builds one object per record.

        Args:
            data_dir: string directory name
            filename: string name of the file to read

            fields: sequence of (name, start, end, case) tuples specifying
            the fields to extract

            constructor: what kind of object to create
        """
        filename = os.path.join(data_dir, filename)

        if filename.endswith('gz'):
            fp = gzip.open(filename)
        else:
            fp = open(filename)

        for i, line in enumerate(fp):
            if i == n:
                break
            record = self.MakeRecord(line, fields, constructor)
            self.AddRecord(record)
        fp.close()

    def MakeRecord(self, line, fields, constructor):
        """Scans a line and returns an object with the appropriate fields.

        Args:
            line: string line from a data file

            fields: sequence of (name, start, end, cast) tuples specifying
            the fields to extract

            constructor: callable that makes an object for the record.

        Returns:
            Record with appropriate fields.
        """
        obj = constructor()
        for (field, start, end, cast) in fields:
            try:
                s = line[start-1:end]
                val = cast(s)
            except ValueError:
                # If you are using Visual Studio, you might see an
                # "error" at this point, but it is not really an error;
                # I am just using try...except to handle not-available (NA)
                # data. You should be able to tell Visual Studio to
                # ignore this non-error.
                val = 'NA'
            setattr(obj, field, val)
        return obj

    def AddRecord(self, record):
        """Adds a record to this table.

        Args:
            record: an object of one of the record types.
        """
        self.records.append(record)

    def ExtendRecords(self, records):
        """Adds records to this table.

        Args:
            records: a sequence of record object
        """
        self.records.extend(records)

    def Recode(self):
        """Child classes can override this to recode values."""
        pass


class Respondents(Table):
    """Represents the respondent table."""

    def ReadRecords(self, data_dir='.', n=None):
        filename = self.GetFilename()
        self.ReadFile(data_dir, filename, self.GetFields(), Respondent, n)
        self.Recode()

    def GetFilename(self):
        return '2002FemResp.dat.gz'

    def GetFields(self):
        """Returns a tuple specifying the fields to extract.

        The elements of the tuple are field, start, end, case.

            field is the name of the variable
            start and end are the indices as specified in the NSFG docs
            cast is a callable that converts the result to int, float, etc.
        """
        return [
            ('caseid', 1, 12, int),
            ]


class Pregnancies(Table):
    """Contains survey data about a Pregnancy."""

    def ReadRecords(self, data_dir='.', n=None):
        filename = self.GetFilename()
        self.ReadFile(data_dir, filename, self.GetFields(), Pregnancy, n)
        self.Recode()

    def GetFilename(self):
        return '2002FemPreg.dat.gz'

    def GetFields(self):
        """Gets information about the fields to extract from the survey data.

        Documentation of the fields for Cycle 6 is at
        http://nsfg.icpsr.umich.edu/cocoon/WebDocs/NSFG/public/index.htm

        Returns:
            sequence of (name, start, end, type) tuples
        """
        return [
            ('caseid', 1, 12, int),
            ('nbrnaliv', 22, 22, int),
            ('babysex', 56, 56, int),
            ('birthwgt_lb', 57, 58, int),
            ('birthwgt_oz', 59, 60, int),
            ('prglength', 275, 276, int),
            ('outcome', 277, 277, int),
            ('birthord', 278, 279, int),
            ('agepreg', 284, 287, int),
            ('finalwgt', 423, 440, float),
            ]

    def Recode(self):
        for rec in self.records:

            # divide mother's age by 100
            try:
                if rec.agepreg != 'NA':
                    rec.agepreg /= 100.0
            except AttributeError:
                pass

            # convert weight at birth from lbs/oz to total ounces
            # note: there are some very low birthweights
            # that are almost certainly errors, but for now I am not
            # filtering
            try:
                if (rec.birthwgt_lb != 'NA' and rec.birthwgt_lb < 20 and
                    rec.birthwgt_oz != 'NA' and rec.birthwgt_oz <= 16):
                    rec.totalwgt_oz = rec.birthwgt_lb * 16 + rec.birthwgt_oz
                else:
                    rec.totalwgt_oz = 'NA'
            except AttributeError:
                pass


def main(name, data_dir='.'):
    resp = Respondents()
    resp.ReadRecords(data_dir)
    print ('Number of respondents', len(resp.records))

    preg = Pregnancies()
    preg.ReadRecords(data_dir)
    print ('Number of pregnancies', len(preg.records))


if __name__ == '__main__':
    main(*sys.argv)
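The core of survey.py is the fixed-width slicing in Table.MakeRecord: each field is cut out of the line by its 1-based column range, converted with a cast function, and stored as 'NA' when the conversion fails. The following is a minimal standalone sketch of that idea, not the author's code. The function name make_record, the FIELDS layout, and the sample line are all made up for illustration and do not match the real NSFG column positions.

```python
# Minimal sketch of fixed-width field extraction, modeled on Table.MakeRecord.
# The sample line and the field layout below are hypothetical; they are NOT
# the real NSFG column positions.


class Record(object):
    """Empty container; fields are attached as attributes."""


def make_record(line, fields, constructor=Record):
    """Builds one object from a fixed-width line.

    fields is a sequence of (name, start, end, cast) tuples, where start and
    end are 1-based, inclusive column numbers.
    """
    obj = constructor()
    for name, start, end, cast in fields:
        try:
            # Columns are 1-based and inclusive, so shift start by one.
            val = cast(line[start - 1:end])
        except ValueError:
            # Blank or malformed fields are stored as the string 'NA'.
            val = 'NA'
        setattr(obj, name, val)
    return obj


# Hypothetical layout: caseid in columns 1-4, weeks in columns 5-6,
# weight in columns 7-8 (left blank in the sample line).
FIELDS = [('caseid', 1, 4, int), ('weeks', 5, 6, int), ('weight', 7, 8, int)]

rec = make_record('  1239  ', FIELDS)
print(rec.caseid, rec.weeks, rec.weight)
```

Because missing values become the string 'NA' rather than a number, any later arithmetic on a field has to check for 'NA' first, which is what Pregnancies.Recode does.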
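Mean pregnancy length code: first.py
Input
Place first.py in the same directory as survey.py and the NSFG data files and run it. The program reads the data files and compares the mean pregnancy length of first babies with that of other babies.
Output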
>>>Number of first babies 4413
>>>Number of others 4735
>>>Mean gestation in weeks:
>>>First babies 38.60095173351461
>>>Others 38.52291446673706
>>>Difference in days 0.5462608674428466
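The last line of the output is in days. As a quick check of the arithmetic, the two means printed above can be turned into a difference in days and then in hours. The variable names here are only illustrative.

```python
# Means copied from the first.py output above (gestation length in weeks).
mean_firsts = 38.60095173351461
mean_others = 38.52291446673706

diff_weeks = mean_firsts - mean_others
diff_days = diff_weeks * 7.0    # same conversion that first.py applies
diff_hours = diff_days * 24.0

print('Difference in days: %.3f' % diff_days)
print('Difference in hours: %.1f' % diff_hours)
```

This gives roughly 0.546 days, or about 13.1 hours.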
Source code
"""This file contains code used in "Think Stats",
by Allen B. Downey, available from greenteapress.com

Copyright 2010 Allen B. Downey
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
"""

import survey


# copying Mean from thinkstats.py so we don't have to deal with
# importing anything in Chapter 1
def Mean(t):
    """Computes the mean of a sequence of numbers.

    Args:
        t: sequence of numbers

    Returns:
        float
    """
    return float(sum(t)) / len(t)


def PartitionRecords(table):
    """Divides records into two lists: first babies and others.

    Only live births are included

    Args:
        table: pregnancy Table
    """
    firsts = survey.Pregnancies()
    others = survey.Pregnancies()

    for p in table.records:
        # skip non-live births
        if p.outcome != 1:
            continue

        if p.birthord == 1:
            firsts.AddRecord(p)
        else:
            others.AddRecord(p)

    return firsts, others


def Process(table):
    """Runs analysis on the given table.

    Args:
        table: table object
    """
    table.lengths = [p.prglength for p in table.records]
    table.n = len(table.lengths)
    table.mu = Mean(table.lengths)


def MakeTables(data_dir='.'):
    """Reads survey data and returns tables for first babies and others."""
    table = survey.Pregnancies()
    table.ReadRecords(data_dir)

    firsts, others = PartitionRecords(table)

    return table, firsts, others


def ProcessTables(*tables):
    """Processes a list of tables

    Args:
        tables: gathered argument tuple of Tuples
    """
    for table in tables:
        Process(table)


def Summarize(data_dir):
    """Prints summary statistics for first babies and others.

    Returns:
        tuple of Tables
    """
    table, firsts, others = MakeTables(data_dir)
    ProcessTables(firsts, others)

    # print() calls (Python 3), consistent with survey.py and the output above
    print('Number of first babies', firsts.n)
    print('Number of others', others.n)

    mu1, mu2 = firsts.mu, others.mu

    print('Mean gestation in weeks:')
    print('First babies', mu1)
    print('Others', mu2)

    print('Difference in days', (mu1 - mu2) * 7.0)


def main(name, data_dir='.'):
    Summarize(data_dir)


if __name__ == '__main__':
    import sys
    main(*sys.argv)
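Analysis of results
On average, first babies are born about 13 hours later than other babies (0.546 days × 24 ≈ 13.1 hours). This is an apparent effect, but the following questions still need to be considered:
- What do other summary statistics, such as the median and the variance, look like?
- Could the difference have arisen purely by chance?
- Could the difference be caused by selection bias or by an error in the experimental setup?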
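The first two questions can be explored with a few lines of code. The sketch below computes the median and variance alongside the mean, then uses a permutation test, which is one possible way to judge whether a difference in means could be due to chance. This is my own illustration rather than the book's code: the function names and the two small samples are hypothetical, and the numbers are not taken from the NSFG data.

```python
import random
import statistics


def summarize(lengths):
    """Returns (mean, median, population variance) of a sequence of numbers."""
    return (statistics.mean(lengths),
            statistics.median(lengths),
            statistics.pvariance(lengths))


def permutation_p_value(group_a, group_b, iters=1000, seed=0):
    """Estimates how often a random relabeling of the pooled data gives a
    difference in means at least as large (in absolute value) as the
    observed one."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) -
                   statistics.mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    return count / iters


# Hypothetical gestation lengths in weeks, NOT taken from the NSFG data.
firsts = [39, 40, 38, 41, 39, 37, 40, 39]
others = [39, 38, 39, 40, 38, 39, 37, 39]

print('firsts (mean, median, variance):', summarize(firsts))
print('others (mean, median, variance):', summarize(others))
print('permutation p-value:', permutation_p_value(firsts, others))
```

A large p-value would suggest that a difference of this size shows up often under random relabeling, so it could plausibly be chance. The third question, selection bias, cannot be answered by code alone and needs a look at how the data were collected.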