Python Cookbook Study Notes, Chapter 1: Data Structures and Algorithms

Source: Internet  Editor: 程序博客网  Time: 2024/05/14 18:31

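Today I finished part of the Cookbook, so here are my notes.
1. Problem: how to unpack a tuple or sequence of N elements into N variables.
Solution: any iterable object (a tuple, list, file, iterator, or generator) can be unpacked with a simple assignment. The only requirement is that the number of variables matches the number of elements in the sequence. For example:

>>> data = [ 'ACME', 50, 91.1, (2012, 12, 21) ]
>>> name, shares, price, date = data
>>> name
'ACME'
>>> date
(2012, 12, 21)
>>> name, shares, price, (year, mon, day) = data
>>> name
'ACME'
>>> year
2012
>>> mon
12
>>> day
21
>>>

If the counts do not match, a ValueError is raised.

To ignore the elements at particular positions, assign them to a throwaway name such as _.
For example:

>>> data = [ 'ACME', 50, 91.1, (2012, 12, 21) ]
>>> _, shares, price, _ = data
>>> shares
50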

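2. Unpacking sequences of arbitrary length
Problem: how to unpack a sequence whose length is unknown or large.
Solution: use the * operator. The * operator appears in many places in Python and makes a lot of operations more flexible.
Suppose we have a customer record consisting of a name, an e-mail address, and phone numbers, where the number of phone numbers differs from customer to customer. To unpack such a record for later processing, we can use the * operator:

>>> record = ('Dave', 'dave@example.com', '773-555-1212', '847-555-1212')
>>> name, email, *phones = record
>>> name
'Dave'
>>> email
'dave@example.com'
>>> phones
['773-555-1212', '847-555-1212']
>>> type(phones)
<class 'list'>
>>>

As the code shows, the starred variable always ends up as a list.
To discard a run of consecutive elements while unpacking, use *_. For example:

>>> record = ('ACME', 50, 123.45, (12, 18, 2012))
>>> name, *_, (*_, year) = record
>>> name
'ACME'
>>> year
2012
>>>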

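The book closes with a recursive application:

>>> items = [1, 10, 7, 4, 5, 9]
>>> def sum(items):
...     head, *tail = items
...     return head + sum(tail) if tail else head
...
>>> sum(items)
36

It is listed here only to broaden one's thinking. Note that this definition shadows the built-in sum().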

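3. Keeping the last N items
Problem: keep the last N items seen during an iteration or some other kind of processing.
Solution: use the deque class from the collections module as a fixed-size queue. Suppose we are scanning a file for lines that contain a pattern and, for each match, also want the last N matching lines seen before it. The code is as follows:

>>> from collections import deque
>>> def search(pattern, lines, history=5):
    preresult = deque(maxlen=history)
    for line in lines:
        if pattern in line:
            yield line, preresult
            preresult.append(line)

The code differs slightly from the book because the intended behavior is slightly different: here only matching lines are appended to the history.
A bounded queue like this can be implemented by hand (for example, with a list), but deque is faster.
If maxlen is not specified, the deque is unbounded, and elements are added or removed at either end with append, appendleft, pop, and popleft. These operations run in O(1) time.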
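The bounded and unbounded behavior described above can be sketched as follows. This is a minimal illustration, not code from the book; the names recent and q are made up for the example.

```python
from collections import deque

# A bounded deque silently discards the oldest item once it is full.
recent = deque(maxlen=3)
for n in range(1, 6):
    recent.append(n)
print(recent)            # deque([3, 4, 5], maxlen=3)

# An unbounded deque supports O(1) operations at both ends.
q = deque()
q.append(1)              # add on the right
q.append(2)
q.appendleft(0)          # add on the left
right = q.pop()          # removes 2 from the right
left = q.popleft()       # removes 0 from the left
print(left, right, q)    # 0 2 deque([1])
```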
4. Finding the largest or smallest N items
Problem: find the N largest or smallest items in a collection.
Solution: use Python's heapq module, which implements a heap on top of an ordinary list. Two of its functions solve this problem directly: heapq.nlargest(n, iterable, key=None) and heapq.nsmallest(n, iterable, key=None). For example:

>>> import heapq
>>> nums = [32, 42, 423, 13, 133, 4, 3534, 53, 2]
>>> print(heapq.nlargest(3, nums))
[3534, 423, 133]
>>> print(heapq.nsmallest(3, nums))
[2, 4, 13]
>>> portfolio = [{'name': 'IBM', 'shares': 100, 'price': 91.1},
             {'name': 'AAPL', 'shares': 50, 'price': 543.22},
             {'name': 'FB', 'shares': 200, 'price': 21.09},
             {'name': 'HPQ', 'shares': 35, 'price': 31.75},
             {'name': 'YHOO', 'shares': 45, 'price': 16.35},
             {'name': 'ACME', 'shares': 75, 'price': 115.65}]
>>> expensive = heapq.nlargest(3, portfolio, key=lambda x: x['price'])
>>> print(expensive)
[{'name': 'AAPL', 'shares': 50, 'price': 543.22}, {'name': 'ACME', 'shares': 75, 'price': 115.65}, {'name': 'IBM', 'shares': 100, 'price': 91.1}]

heapq.nlargest(n, iterable, key=None) and heapq.nsmallest(n, iterable, key=None) perform well when n is small compared with the size of the collection. For n = 1, min() and max() are faster; when n is close to the size of the collection, sorting and then slicing is usually faster.
heapq.heapify(x) turns a list into a heap in linear time.
heapq implements a min-heap. The functions in the module are as follows:
heapq.heappush(heap, item)
Push the value item onto the heap, maintaining the heap invariant.

heapq.heappop(heap)
Pop and return the smallest item from the heap, maintaining the heap invariant. If the heap is empty, IndexError is raised. To access the smallest item without popping it, use heap[0].

heapq.heappushpop(heap, item)
Push item on the heap, then pop and return the smallest item from the heap. The combined action runs more efficiently than heappush() followed by a separate call to heappop().

heapq.heapify(x)
Transform list x into a heap, in-place, in linear time.

heapq.heapreplace(heap, item)
Pop and return the smallest item from the heap, and also push the new item. The heap size doesn’t change. If the heap is empty, IndexError is raised.

This one step operation is more efficient than a heappop() followed by heappush() and can be more appropriate when using a fixed-size heap. The pop/push combination always returns an element from the heap and replaces it with item.

The value returned may be larger than the item added. If that isn’t desired, consider using heappushpop() instead. Its push/pop combination returns the smaller of the two values, leaving the larger value on the heap.

The module also offers three general purpose functions based on heaps.

heapq.merge(*iterables, key=None, reverse=False)
Merge multiple sorted inputs into a single sorted output (for example, merge timestamped entries from multiple log files). Returns an iterator over the sorted values.

Similar to sorted(itertools.chain(*iterables)) but returns an iterable, does not pull the data into memory all at once, and assumes that each of the input streams is already sorted (smallest to largest).

Has two optional arguments which must be specified as keyword arguments.

key specifies a key function of one argument that is used to extract a comparison key from each input element. The default value is None (compare the elements directly).

reverse is a boolean value. If set to True, then the input elements are merged as if each comparison were reversed.

Changed in version 3.5: Added the optional key and reverse parameters.

heapq.nlargest(n, iterable, key=None)
Return a list with the n largest elements from the dataset defined by iterable. key, if provided, specifies a function of one argument that is used to extract a comparison key from each element in the iterable (for example, key=str.lower). Equivalent to: sorted(iterable, key=key, reverse=True)[:n]

heapq.nsmallest(n, iterable, key=None)
Return a list with the n smallest elements from the dataset defined by iterable. key, if provided, specifies a function of one argument that is used to extract a comparison key from each element in the iterable (for example, key=str.lower). Equivalent to: sorted(iterable, key=key)[:n]
Heaps are also useful for sorting large data sets stored on disk, for example by merging chunks that have already been sorted.
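The basic heap functions quoted above can be tried on a small list. This is a minimal sketch; the three short lists passed to merge() are hypothetical stand-ins for larger, already-sorted inputs.

```python
import heapq

nums = [32, 42, 423, 13, 133, 4, 3534, 53, 2]
heap = list(nums)
heapq.heapify(heap)                  # in place, linear time
smallest = heap[0]                   # peek at the minimum without popping

heapq.heappush(heap, 1)
first_three = [heapq.heappop(heap) for _ in range(3)]

# merge() lazily combines inputs that are each already sorted.
merged = list(heapq.merge([1, 4, 7], [2, 5, 8], [3, 6, 9]))
print(smallest, first_three, merged)
```

Because merge() returns an iterator, the inputs could just as well be file objects or generators that are read one item at a time.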

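5. Implementing a priority queue with a heap

>>> import heapq
>>> class PriorityQueue:
    def __init__(self):
        self._queue = []
        self._index = 0
    def push(self, item, priority):
        heapq.heappush(self._queue, (-priority, self._index, item))
        self._index += 1
    def pop(self):
        return heapq.heappop(self._queue)[-1]

Each entry pushed onto the queue is a tuple that wraps the priority, an index, and the item. This makes entries comparable, so an item can be inserted safely even if it does not support comparison itself. Tuples are compared element by element: the first elements are compared, and only if they are equal are the second elements compared, and so on. The index is different for every entry, so any two tuples can always be ordered, and entries with equal priority come out in insertion order. Finally, heapq implements a min-heap, so the priority is negated to make the highest-priority item pop first.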
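A short usage sketch of the queue follows. The class is repeated so the block runs on its own, and Item is a hypothetical payload type that defines no ordering of its own.

```python
import heapq

class PriorityQueue:
    def __init__(self):
        self._queue = []
        self._index = 0

    def push(self, item, priority):
        # Negate the priority: heapq pops the smallest tuple first.
        heapq.heappush(self._queue, (-priority, self._index, item))
        self._index += 1

    def pop(self):
        return heapq.heappop(self._queue)[-1]

class Item:
    """Hypothetical payload that cannot be compared with < or >."""
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return 'Item({!r})'.format(self.name)

q = PriorityQueue()
q.push(Item('foo'), 1)
q.push(Item('bar'), 5)
q.push(Item('spam'), 4)
q.push(Item('grok'), 1)
order = [q.pop().name for _ in range(4)]
print(order)   # ['bar', 'spam', 'foo', 'grok']
```

'foo' and 'grok' share priority 1, so the index decides and they come out in the order they were pushed.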

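6. Mapping a key to multiple values in a dictionary
Solution: the natural choice is to map each key to a list or a set, depending on whether duplicate values should be kept.

# pairs is an iterable of (key, value) tuples
d = {}
for key, value in pairs:
    if key not in d:
        d[key] = []
    d[key].append(value)

Alternatively, use defaultdict from collections, which simplifies the code above to:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> for key, value in pairs:
    d[key].append(value)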
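To illustrate the list-versus-set choice mentioned above, here is a minimal sketch. The contents of pairs are hypothetical sample data.

```python
from collections import defaultdict

# Hypothetical sample data containing a repeated (key, value) pair.
pairs = [('a', 1), ('a', 2), ('b', 4), ('a', 1)]

as_lists = defaultdict(list)   # keeps duplicates and insertion order
as_sets = defaultdict(set)     # drops duplicates
for key, value in pairs:
    as_lists[key].append(value)
    as_sets[key].add(value)

print(dict(as_lists))   # {'a': [1, 2, 1], 'b': [4]}
print(dict(as_sets))    # {'a': {1, 2}, 'b': {4}}
```

One thing to keep in mind is that a defaultdict creates an entry for any key that is looked up with d[key], even if nothing is ever stored under it.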

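7. Preserving insertion order in a dictionary
Solution: use OrderedDict from collections.

from collections import OrderedDict
d = OrderedDict()
d['foo'] = 1
d['bar'] = 2
d['spam'] = 3
d['grok'] = 4
# Outputs "foo 1", "bar 2", "spam 3", "grok 4"
for key in d:
    print(key, d[key])

OrderedDict preserves insertion order; assigning a new value to an existing key afterwards does not change its position. (Since Python 3.7 the built-in dict also preserves insertion order.)
Internally, OrderedDict maintains a doubly linked list, so it takes roughly twice as much memory as an ordinary dict.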
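8. Finding the minimum, maximum, and sorted order of a dictionary
Problem: sometimes we need to sort a dictionary by its values, or find its largest or smallest value, and we also want the key that goes with that value. Applying max, min, or sorted directly to the dictionary does not work, because they operate on the keys.
Solution: use zip. The zip function combines two lists (or any iterables) and returns an iterator, for example:

>>> a = ["ew","wr"]
>>> b = [12,21]
>>> c = zip(a,b)
>>> for item in c:
    print(item)

('ew', 12)
('wr', 21)

Note that an iterator can only be consumed once.
With that, we can solve our problem. Putting the values first makes the resulting tuples compare by value:

prices = {
    'ACME': 45.23,
    'AAPL': 612.78,
    'IBM': 205.55,
    'HPQ': 37.20,
    'FB': 10.75
}
min_price = min(zip(prices.values(), prices.keys()))
# min_price is (10.75, 'FB')
max_price = max(zip(prices.values(), prices.keys()))
# max_price is (612.78, 'AAPL')
prices_sorted = sorted(zip(prices.values(), prices.keys()))
# prices_sorted is [(10.75, 'FB'), (37.2, 'HPQ'),
#                   (45.23, 'ACME'), (205.55, 'IBM'),
#                   (612.78, 'AAPL')]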
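When only the key is needed, an alternative to zip is to pass a key function to min, max, or sorted. This is a sketch of that alternative rather than the approach taken above; the names cheapest, priciest, and by_price are made up for the example.

```python
prices = {'ACME': 45.23, 'AAPL': 612.78, 'IBM': 205.55, 'HPQ': 37.20, 'FB': 10.75}

# prices.get maps each key to its value, so keys are compared by value.
cheapest = min(prices, key=prices.get)
priciest = max(prices, key=prices.get)
by_price = sorted(prices, key=prices.get)

print(cheapest, prices[cheapest])   # FB 10.75
print(priciest, prices[priciest])   # AAPL 612.78
print(by_price)                     # ['FB', 'HPQ', 'ACME', 'IBM', 'AAPL']
```

This needs a second lookup to get the value, whereas the zip version returns the value and key together.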

9. Finding what two dictionaries have in common
Solution: through keys(), values(), and items(), a dictionary provides three views: a keys view, a values view, and an items view. The keys view and the items view support the set-like operations &, |, and -, which can be used to do the job. The values view does not, because values are not guaranteed to be unique.

a = {'x' : 1, 'y' : 2, 'z' : 3}
b = {'w' : 10, 'x' : 11, 'y' : 2}
# Find keys in common
a.keys() & b.keys() # { 'x', 'y' }
# Find keys in a that are not in b
a.keys() - b.keys() # { 'z' }
# Find (key,value) pairs in common
a.items() & b.items() # { ('y', 2) }
# Make a new dictionary with certain keys removed
c = {key:a[key] for key in a.keys() - {'z', 'w'}}
# c is {'x': 1, 'y': 2}

10. Removing duplicates from a sequence while preserving order
Solution: if order does not matter, a set does the job. If order must be preserved, a more general approach is the following generator:

>>> def dedupe(items):
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)

>>> a = [1, 5, 2, 1, 9, 1, 5, 10]
>>> list(dedupe(a))
[1, 5, 2, 9, 10]

The code above works when the items are hashable. If they are not hashable, there is an even more general version that takes a key function to convert each item into a hashable value:

>>> def dedupe(items, key=None):
    seen = set()
    for item in items:
        val = item if key is None else key(item)
        if val not in seen:
            yield item
            seen.add(val)

>>> a = [ {'x':1, 'y':2}, {'x':1, 'y':3}, {'x':1, 'y':2}, {'x':2, 'y':4}]
>>> list(dedupe(a, key=lambda d: (d['x'],d['y'])))
[{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 2, 'y': 4}]
>>>

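11. Naming a slice
Solution: to make the code easier to read, we can give a slice a name.

record = '....................100          .......513.25     ..........'
cost = int(record[20:32]) * float(record[40:48])

In this example we slice a sequence with hard-coded indices, which is not very readable. To make the code easier to maintain, we would like to give names to 20:32 and 40:48.
We can use the built-in slice() function, which takes three arguments: start, stop, and step.
The code above can then be rewritten in the following, more readable form:

SHARES = slice(20,32)
PRICE = slice(40,48)
cost = int(record[SHARES]) * float(record[PRICE])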
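A slice object also exposes its three arguments as attributes and can be fitted to a sequence of a given length with indices(). The following is a small sketch; the string 'HelloWorld' is just sample data.

```python
a = slice(5, 50, 2)
print(a.start, a.stop, a.step)        # 5 50 2

s = 'HelloWorld'
# indices() clips the slice to a sequence of the given length.
start, stop, step = a.indices(len(s))
print(start, stop, step)              # 5 10 2

chars = [s[i] for i in range(start, stop, step)]
print(chars)                          # ['W', 'r', 'd']
```

Clipping with indices() avoids an IndexError when the slice bounds are later used as explicit indices.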

12. Finding the most frequent items in a sequence
Solution: the Counter class in collections solves this problem with its most_common method. For example:

>>> from collections import Counter
>>> words = ['look', 'into', 'my', 'eyes', 'look', 'into', 'my', 'eyes',
         'the', 'eyes', 'the', 'eyes', 'the', 'eyes', 'not', 'around', 'the',
         'eyes', "don't", 'look', 'around', 'the', 'eyes', 'look', 'into',
         'my', 'eyes', "you're", 'under']
>>> word_counts = Counter(words)
>>> word_counts.most_common(3)
[('eyes', 8), ('the', 5), ('look', 4)]
>>>

A Counter behaves much like a dictionary:

>>> word_counts['not']
1

If a new batch of words arrives later, call update to add them to the counts.

>>> morewords = ['why','are','you','not','looking','in','my','eyes']
>>> word_counts.update(morewords)

In addition, Counter overloads arithmetic operators, so + and - can be used to combine counts.

>>> a = Counter(words)
>>> b = Counter(morewords)
>>> a
Counter({'eyes': 8, 'the': 5, 'look': 4, 'into': 3, 'my': 3, 'around': 2,
"you're": 1, "don't": 1, 'under': 1, 'not': 1})
>>> b
Counter({'eyes': 1, 'looking': 1, 'are': 1, 'in': 1, 'not': 1, 'you': 1,
'my': 1, 'why': 1})
>>> # Combine counts
>>> c = a + b
>>> c
Counter({'eyes': 9, 'the': 5, 'look': 4, 'my': 4, 'into': 3, 'not': 2,
'around': 2, "you're": 1, "don't": 1, 'in': 1, 'why': 1,
'looking': 1, 'are': 1, 'under': 1, 'you': 1})
>>> # Subtract counts
>>> d = a - b
>>> d
Counter({'eyes': 7, 'the': 5, 'look': 4, 'into': 3, 'my': 2, 'around': 2,
"you're": 1, "don't": 1, 'under': 1})
>>>

13. Sorting a list of dictionaries by a common key
Solution: use itemgetter from the operator module.

>>> from operator import itemgetter
>>> rows = [{'fname': 'Brian', 'lname': 'Jones', 'uid': 1003},
        {'fname': 'David', 'lname': 'Beazley', 'uid': 1002},
        {'fname': 'John', 'lname': 'Cleese', 'uid': 1001},
        {'fname': 'Big', 'lname': 'Jones', 'uid': 1004}]
>>> rows_by_fname = sorted(rows, key=itemgetter('fname'))
>>> print(rows_by_fname)
[{'fname': 'Big', 'lname': 'Jones', 'uid': 1004}, {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003}, {'fname': 'David', 'lname': 'Beazley', 'uid': 1002}, {'fname': 'John', 'lname': 'Cleese', 'uid': 1001}]
>>>

itemgetter also accepts multiple keys.
Of course, a traditional lambda works as well, i.e. key=lambda r: r['fname'],
but itemgetter is usually a little faster.
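The multiple-key form mentioned above can be sketched like this, reusing the same rows. The name by_name is made up for the example.

```python
from operator import itemgetter

rows = [
    {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003},
    {'fname': 'David', 'lname': 'Beazley', 'uid': 1002},
    {'fname': 'John', 'lname': 'Cleese', 'uid': 1001},
    {'fname': 'Big', 'lname': 'Jones', 'uid': 1004},
]

# With several keys, itemgetter returns a tuple, so rows are ordered
# by last name first and by first name within equal last names.
by_name = sorted(rows, key=itemgetter('lname', 'fname'))
print([r['uid'] for r in by_name])                  # [1002, 1001, 1004, 1003]

# The same key function also works with min() and max().
print(min(rows, key=itemgetter('uid'))['fname'])    # John
```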
14. Sorting objects that do not support comparison natively
Solution: a lambda would of course work, but here we take another approach and use attrgetter from the operator module.

>>> class User:
...     def __init__(self, user_id):
...         self.user_id = user_id
...     def __repr__(self):
...         return 'User({})'.format(self.user_id)
...
>>> users = [User(23), User(3), User(99)]
>>> from operator import attrgetter
>>> sorted(users, key=attrgetter('user_id'))
[User(3), User(23), User(99)]

15. Grouping records by a field
Solution: use the groupby function from itertools. groupby only groups consecutive items that share a key, so the data must be sorted by that key first.

from operator import itemgetter
from itertools import groupby

# rows is a list of dicts, each with 'date' and 'address' keys
# Sort by the desired field first
rows.sort(key=itemgetter('date'))
# Iterate in groups
for date, items in groupby(rows, key=itemgetter('date')):
    print(date)
    for i in items:
        print(' ', i)

# output
07/01/2012
  {'date': '07/01/2012', 'address': '5412 N CLARK'}
  {'date': '07/01/2012', 'address': '4801 N BROADWAY'}
07/02/2012
  {'date': '07/02/2012', 'address': '5800 E 58TH'}
  {'date': '07/02/2012', 'address': '5645 N RAVENSWOOD'}
  {'date': '07/02/2012', 'address': '1060 W ADDISON'}
07/03/2012
  {'date': '07/03/2012', 'address': '2122 N CLARK'}
07/04/2012
  {'date': '07/04/2012', 'address': '5148 N CLARK'}
  {'date': '07/04/2012', 'address': '1039 W GRANVILLE'}
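The effect of sorting before groupby can be seen in a small sketch. The four records below are hypothetical sample data, and the defaultdict variant at the end is an alternative that avoids sorting, not something groupby itself provides.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical sample records, deliberately not sorted by date.
rows = [
    {'address': '5412 N CLARK', 'date': '07/01/2012'},
    {'address': '5148 N CLARK', 'date': '07/04/2012'},
    {'address': '5800 E 58TH', 'date': '07/02/2012'},
    {'address': '4801 N BROADWAY', 'date': '07/01/2012'},
]

# Without sorting, groupby() starts a new group at every change of key.
unsorted_keys = [date for date, _ in groupby(rows, key=itemgetter('date'))]

# Sorting first yields exactly one group per distinct date.
sorted_rows = sorted(rows, key=itemgetter('date'))
sorted_keys = [date for date, _ in groupby(sorted_rows, key=itemgetter('date'))]

# Alternative without sorting: collect the rows into a multidict.
rows_by_date = defaultdict(list)
for row in rows:
    rows_by_date[row['date']].append(row)

print(unsorted_keys)
print(sorted_keys)
print(len(rows_by_date['07/01/2012']))
```

The multidict uses more memory because it holds every group at once, but it allows random access to any date afterwards.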

16. Filtering the elements of a sequence
Solution: 1. Use a list comprehension, or a generator expression when the results should be produced lazily.

>>> mylist = [1, 4, -5, 10, -7, 2, 3, -1]
>>> [n for n in mylist if n > 0]
[1, 4, 10, 2, 3]
>>> [n for n in mylist if n < 0]
[-5, -7, -1]
>>>
>>> pos = (n for n in mylist if n > 0)
>>> pos
<generator object <genexpr> at 0x1006a0eb0>
>>> for x in pos:
...     print(x)
...
1
4
10
2
3

2. Use the filter function
The first argument of filter is a function and the second is an iterable; its signature is filter(function, iterable).

values = ['1', '2', '-3', '-', '4', 'N/A', '5']
def is_int(val):
    try:
        x = int(val)
        return True
    except ValueError:
        return False
ivals = list(filter(is_int, values))
print(ivals)
# Outputs ['1', '2', '-3', '4', '5']

3. Use the compress function from itertools
Its signature and an equivalent implementation are:
itertools.compress(data, selectors)
Make an iterator that filters elements from data returning only those that have a corresponding element in selectors that evaluates to True. Stops when either the data or selectors iterables has been exhausted. Equivalent to:

def compress(data, selectors):
    # compress('ABCDEF', [1,0,1,0,1,1]) --> A C E F
    return (d for d, s in zip(data, selectors) if s)

>>> # addresses and counts are parallel lists defined earlier (not shown in these notes)
>>> from itertools import compress
>>> more5 = [n > 5 for n in counts]
>>> more5
[False, False, True, False, False, True, True, False]
>>> list(compress(addresses, more5))
['5800 E 58TH', '4801 N BROADWAY', '1039 W GRANVILLE']
>>>
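Since the session above relies on lists that are not shown, here is a self-contained sketch of the same pattern. The lists names and scores are hypothetical parallel data.

```python
from itertools import compress

# Hypothetical parallel lists: names[i] belongs to scores[i].
names = ['ann', 'bob', 'cat', 'dan']
scores = [3, 9, 7, 1]

# Build a Boolean selector from one list, then apply it to the other.
selectors = [s > 5 for s in scores]
passed = list(compress(names, selectors))

print(selectors)   # [False, True, True, False]
print(passed)      # ['bob', 'cat']
```

compress is handy exactly in this situation, where the filter condition is computed on one sequence and applied to another.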

17. Extracting a subset of a dictionary
Solution: use a dict comprehension.

prices = {
    'ACME': 45.23,
    'AAPL': 612.78,
    'IBM': 205.55,
    'HPQ': 37.20,
    'FB': 10.75
}
# Make a dictionary of all prices over 200
p1 = { key:value for key, value in prices.items() if value > 200 }
# Make a dictionary of tech stocks
tech_names = { 'AAPL', 'IBM', 'HPQ', 'MSFT' }
p2 = { key:value for key,value in prices.items() if key in tech_names }

18. Using namedtuple()
collections.namedtuple(typename, field_names, verbose=False, rename=False)
Returns a new tuple subclass named typename. The new subclass is used to create tuple-like objects that have fields accessible by attribute lookup as well as being indexable and iterable. Instances of the subclass also have a helpful docstring (with typename and field_names) and a helpful repr() method which lists the tuple contents in a name=value format.

The field_names are a single string with each fieldname separated by whitespace and/or commas, for example 'x y' or 'x, y'. Alternatively, field_names can be a sequence of strings such as ['x', 'y'].

Any valid Python identifier may be used for a fieldname except for names starting with an underscore. Valid identifiers consist of letters, digits, and underscores but do not start with a digit or underscore and cannot be a keyword such as class, for, return, global, pass, or raise.

If rename is true, invalid fieldnames are automatically replaced with positional names. For example, ['abc', 'def', 'ghi', 'abc'] is converted to ['abc', '_1', 'ghi', '_3'], eliminating the keyword def and the duplicate fieldname abc.

If verbose is true, the class definition is printed after it is built. This option is outdated; instead, it is simpler to print the _source attribute.

Named tuple instances do not have per-instance dictionaries, so they are lightweight and require no more memory than regular tuples.

Changed in version 3.1: Added support for rename.

>>> from collections import namedtuple
>>> # Basic example
>>> Point = namedtuple('Point', ['x', 'y'])
>>> p = Point(11, y=22)     # instantiate with positional or keyword arguments
>>> p[0] + p[1]             # indexable like the plain tuple (11, 22)
33
>>> x, y = p                # unpack like a regular tuple
>>> x, y
(11, 22)
>>> p.x + p.y               # fields also accessible by name
33
>>> p                       # readable __repr__ with a name=value style
Point(x=11, y=22)

The above covers the basic usage; see the Python documentation for details.
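19. Transforming and Reducing Data at the Same Time
Problem: we want to apply a reduction function (e.g., sum(), min(), max()) but need to transform the data first. How can both operations be done in one step?
Solution: pass a generator expression as the argument. The simplest example:

nums = [1, 2, 3, 4, 5]
s = sum(x * x for x in nums)

Going further: a generator expression is more efficient and uses less memory than building a temporary list first.

s = sum((x * x for x in nums)) # Pass generator-expr as argument
s = sum(x * x for x in nums) # More elegant syntax

Both forms work.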
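The same pattern applies to other reduction functions. The sketch below uses a small hypothetical portfolio list to contrast a generator expression with the key= argument of min().

```python
# Hypothetical sample records.
portfolio = [
    {'name': 'GOOG', 'shares': 50},
    {'name': 'YHOO', 'shares': 75},
    {'name': 'AOL', 'shares': 20},
    {'name': 'SCOX', 'shares': 65},
]

# Generator expression: reduces to the smallest share count only.
min_shares = min(p['shares'] for p in portfolio)

# key= variant: returns the whole record with the smallest share count.
min_record = min(portfolio, key=lambda p: p['shares'])

# any() stops at the first True, so the generator is consumed lazily.
has_big = any(p['shares'] > 70 for p in portfolio)

print(min_shares, min_record['name'], has_big)   # 20 AOL True
```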
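20. Combining multiple dictionaries into one
Solution: use ChainMap from collections or the dictionary update method. The difference is that a ChainMap does not copy anything: it keeps references to the original dictionaries, so later changes to them are visible through the ChainMap, whereas update copies the entries into a separate dictionary.
For example:

a = {'x': 1, 'z': 3 }
b = {'y': 2, 'z': 4 }
from collections import ChainMap
c = ChainMap(a, b)
print(c['x']) # Outputs 1 (from a)
print(c['y']) # Outputs 2 (from b)
print(c['z']) # Outputs 3 (from a)

>>> a = {'x': 1, 'z': 3 }
>>> b = {'y': 2, 'z': 4 }
>>> merged = dict(b)
>>> merged.update(a)
>>> merged['x']
1
>>> merged['y']
2
>>> merged['z']
3
>>>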
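The difference between the two approaches shows up once one of the original dictionaries changes. A minimal sketch, with chained and merged as made-up names:

```python
from collections import ChainMap

a = {'x': 1, 'z': 3}
b = {'y': 2, 'z': 4}

chained = ChainMap(a, b)     # a view: looks in a first, then in b
merged = dict(b)
merged.update(a)             # a copy: values from a override those from b

a['x'] = 42                  # change one of the underlying dictionaries
print(chained['x'])          # 42, the ChainMap sees the change
print(merged['x'])           # 1, the copy does not

chained['z'] = 10            # writes always go to the first mapping
print(a['z'], b['z'])        # 10 4
```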