[Python for Data Analysis] CH04 NumPy Basics -- Arrays and Vectorized Computation

Source: Internet. Editor: 程序博客网. Time: 2024/05/21 19:28

NumPy Basics: Arrays and Vectorized Computation

NumPy, short for Numerical Python, is the fundamental package required for high
performance scientific computing and data analysis.

Among other things, it provides:

ndarray, a fast multidimensional array object
Mathematical functions for fast operations on entire arrays of data without having to write loops
Tools for reading array data from disk and writing it back
Linear algebra, random number generation, and Fourier transforms
Tools for integrating code written in C, C++, and Fortran

Basic setup

```python
%matplotlib inline
from __future__ import division
from numpy.random import randn
import numpy as np
np.set_printoptions(precision=4, suppress=True)
```

NumPy ndarray: A Multidimensional Array Object

Basic usage

```python
data = randn(2, 3)
data * 10
data + data
data.shape
data.dtype
```

Creating ndarray

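array
Accepts any sequence-like object and creates a new NumPy array containing the passed data.
zeros and ones
zeros and ones create arrays of the given shape filled entirely with 0s or 1s, respectively.
empty
empty creates an array without initializing its values to any particular value.
arange
arange is the array-valued version of the built-in range function.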

```python
# array
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
# nested sequences become a two-dimensional array
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
# zeros, ones
a1 = np.zeros(10)
a2 = np.ones((2, 3))
# empty
np.empty(10)
# arange
np.arange(15)
```

| Function | Description |
| --- | --- |
| array | Convert input data (list, tuple, array, or other sequence type) to an ndarray, either by inferring a dtype or by explicitly specifying one. Copies the input data by default. |
| asarray | Convert input to an ndarray, but do not copy if the input is already an ndarray. |
| arange | Like the built-in range, but returns an ndarray instead of a list. |
| ones, ones_like | Produce an array of all 1s with the given shape and dtype. ones_like takes another array and produces a ones array of the same shape and dtype. |
| zeros, zeros_like | Like ones and ones_like, but producing arrays of 0s instead. |
| empty, empty_like | Create new arrays by allocating new memory, but do not populate them with any values as ones and zeros do. |
| eye, identity | Create a square N x N identity matrix (1s on the diagonal and 0s elsewhere). |
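A minimal sketch of a few functions from the table above, assuming nothing beyond NumPy itself; the variable names are illustrative. It contrasts the default copy made by array with the no-copy behavior of asarray.

```python
import numpy as np

base = np.array([1.0, 2.0, 3.0])

copied = np.array(base)      # np.array copies its input by default
aliased = np.asarray(base)   # np.asarray returns the same object for an ndarray

base[0] = 99.0
print(copied[0])    # unaffected by the change to base
print(aliased[0])   # reflects the change, because no copy was made

like = np.ones_like(base)    # same shape and dtype as base, filled with 1s
ident = np.eye(3)            # 3 x 3 identity matrix
print(like.dtype, ident.shape)
```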

Data Types for ndarrays

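A dtype mainly determines how much memory each element takes; the number in a dtype name is the element size in bits. A double-precision float occupies 8 bytes, hence float64.

```python
arr1 = np.array([1, 2, 3], dtype=np.float64)
arr2 = np.array([1, 2, 3], dtype=np.int32)
arr1.dtype
arr2.dtype
```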
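The link between a dtype and memory use can be checked directly. The sketch below uses the standard itemsize and nbytes attributes; the array names are illustrative.

```python
import numpy as np

a64 = np.array([1, 2, 3], dtype=np.float64)
a32 = np.array([1, 2, 3], dtype=np.int32)

# itemsize is the number of bytes per element; nbytes covers the whole array
print(a64.dtype, a64.itemsize, a64.nbytes)   # float64: 8 bytes each, 24 in total
print(a32.dtype, a32.itemsize, a32.nbytes)   # int32: 4 bytes each, 12 in total
```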

casting dtypes between different arrays

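Ways a dtype gets set:
1. Inferred by default at creation
2. Specified explicitly at creation
3. arr.astype(dtype), where the dtype is given directly or taken from another array (arr2.dtype)
By default, astype always creates a new array (a copy of the data), even if the new dtype is the same as the old one.

```python
# 1. dtype inferred at creation
arr = np.arange(1, 6)
# 2. dtype specified at creation
# (np.bytes_ is the current name of the older np.string_ alias)
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)
# 3. changing the dtype
float_arr = arr.astype(np.float64)  # cast int64 to float64
# if the cast fails (e.g. a string that is not a number), a ValueError is raised
numeric_strings.astype(float)
# NumPy is smart enough to alias Python types to the equivalent dtypes
# borrowing the dtype of another array
arr1 = np.arange(10)
arr2 = randn(2, 3)
arr1.astype(arr2.dtype), arr1.dtype
```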
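A small sketch of the astype behavior described above, with illustrative inputs: the result is a new array even when the dtype does not change, float-to-int casts truncate, and a failed string cast raises ValueError.

```python
import numpy as np

ints = np.arange(1, 6)
same = ints.astype(ints.dtype)       # same dtype, but still a new array by default
print(same is ints, np.shares_memory(same, ints))

floats = np.array([3.7, -1.2, 0.5])
truncated = floats.astype(np.int32)  # the fractional part is truncated toward zero
print(truncated)

try:
    np.array(['1.5', 'abc']).astype(float)
except ValueError as exc:
    print('cast failed:', exc)
```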

Operations between Arrays and Scalars

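As in R, the operators *, +, - and / all act element-wise between arrays of the same shape (in MATLAB the element-wise forms are written .* and ./).

```python
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr
# operations between two arrays
arr + arr
arr - arr
arr * arr
arr / arr
```

```python
# operations with a scalar are applied to every element
1 / arr
arr ** 0.5
```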

Basic Indexing and Slicing

One dimension

Array slices are views on the original array,
and any modifications to the view will be reflected in the source array.

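```python
arr = np.arange(10)
arr
arr[5]
arr[5:8]
arr[5:8] = 12
arr
```

```python
arr_slice = arr[5:8]
arr_slice[1] = 12345
arr
arr_slice[:] = 64
arr
```

To get an independent copy of a slice instead of a view, call .copy() explicitly:

```python
arr[5:8].copy()
arr_slice_copy = arr[5:8].copy()
arr_slice_copy[1] = 1
arr_slice_copy
arr  # unchanged by the edit to the copy
```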
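One way to check whether two arrays are a view and its source is np.shares_memory. The sketch below uses illustrative names and values.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]            # basic slicing returns a view
copy = arr[5:8].copy()     # an explicit, independent copy

print(np.shares_memory(arr, view))   # True: same underlying buffer
print(np.shares_memory(arr, copy))   # False

view[:] = -1               # writes through to arr
copy[:] = 100              # leaves arr untouched
print(arr)
```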

Higher Dimension

In a multidimensional array, the element at each index is no longer a scalar but a lower-dimensional array.

```python
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d[2]
arr2d[0][2], arr2d[0, 2]
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
arr3d
arr3d.shape
arr3d[0]
arr3d[0] = 42
arr3d[1, 0]
```

Indexing with slices

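Slicing along one or more axes also returns a view of the original array.

```python
arr[1:6]
arr2d
# a single slice selects along the first axis (rows)
arr2d[:2]
# two slices select rows and columns, respectively
arr2d[:2, 1:]
arr2d[1, :2]
```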

Boolean Indexing

```python
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = randn(7, 4)
names
data
```

```python
names == 'Bob'
data[names == 'Bob']
data[names == 'Bob', 2:]
data[names == 'Bob', 3]
# the Python keywords and / or do not work on boolean arrays; use & and |
mask = (names == 'Bob') | (names == 'Will')
mask
data[mask]
data[data < 0] = 0
data
data[names != 'Joe'] = 7
data
```
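A deterministic sketch of boolean indexing, with small illustrative arrays in place of random data. It also shows that selecting with a boolean mask returns a copy, so editing the selection does not change the original.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob'])
data = np.arange(12).reshape(4, 3)

is_bob = names == 'Bob'
not_bob = ~is_bob                      # ~ negates a boolean mask
bob_or_will = is_bob | (names == 'Will')

print(data[is_bob])        # rows 0 and 3
print(data[not_bob, 0])    # first column of rows 1 and 2
print(data[bob_or_will].shape)

selected = data[is_bob]    # boolean indexing returns a copy
selected[:] = 0
print(data[0])             # the original row is unchanged
```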

Fancy Indexing

Indexing using integer arrays

```python
arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
arr
```

```python
arr[[4, 3, 0, 6]]
arr[[-3, -5, -7]]
```

```python
arr = np.arange(32).reshape((8, 4))
arr
arr[[1, 5, 7, 2], [0, 3, 1, 2]]
arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
arr[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])]
```
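The three indexing forms above do not all return the same thing. The sketch below, using the same 8 x 4 array, shows that passing two index lists selects individual (row, column) pairs, while np.ix_ or chained indexing selects a rectangular block.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

pairs = arr[rows, cols]              # elements (1,0), (5,3), (7,1), (2,2)
block = arr[np.ix_(rows, cols)]      # 4 x 4 block: selected rows, reordered columns
chained = arr[rows][:, cols]         # the same block via two indexing steps

print(pairs)
print(block.shape)
print(np.array_equal(block, chained))
```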

Transposing arrays and swapping axes

```python
arr = np.arange(15).reshape((3, 5))
arr
arr.T
```

```python
arr = np.random.randn(6, 3)
np.dot(arr.T, arr)
```

transpose() and swapaxes() are not needed for now.
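For reference, a brief sketch of what transpose() and swapaxes() do on a three-dimensional array. The array and the axis orders are illustrative and not taken from the original notes.

```python
import numpy as np

arr = np.arange(24).reshape((2, 3, 4))

t = arr.transpose((1, 0, 2))   # reorder axes: new shape (3, 2, 4)
s = arr.swapaxes(1, 2)         # exchange axes 1 and 2: new shape (2, 4, 3)

print(t.shape, s.shape)
print(t[1, 0, 2] == arr[0, 1, 2])   # same element, addressed with permuted indices
print(np.shares_memory(arr, s))     # swapaxes returns a view, not a copy
```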

Universal Functions: Element-wise Array Functions

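A universal function (ufunc) performs fast element-wise operations on the data in an ndarray.

```python
arr = np.arange(10)
np.sqrt(arr)
np.exp(arr)
```

Binary ufuncs take two arrays:

```python
x = randn(8)
y = randn(8)
x
y
np.maximum(x, y)  # element-wise maximum
```

Some ufuncs return more than one array:

```python
arr = randn(7) * 5
np.modf(arr)
```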

Unary functions

| Function | Description |
| --- | --- |
| abs, fabs | Compute the absolute value element-wise for integer, floating point, or complex values. Use fabs as a faster alternative for non-complex-valued data. |
| sqrt | Compute the square root of each element. Equivalent to arr ** 0.5. |
| square | Compute the square of each element. Equivalent to arr ** 2. |
| exp | Compute the exponential e^x of each element. |
| log, log10, log2, log1p | Natural logarithm (base e), log base 10, log base 2, and log(1 + x), respectively. |
| sign | Compute the sign of each element: 1 (positive), 0 (zero), or -1 (negative). |
| ceil | Compute the ceiling of each element, i.e. the smallest integer greater than or equal to each element. |
| floor | Compute the floor of each element, i.e. the largest integer less than or equal to each element. |
| rint | Round elements to the nearest integer, preserving the dtype. |
| modf | Return the fractional and integral parts of an array as separate arrays. |
| isnan | Return a boolean array indicating whether each value is NaN (Not a Number). |
| isfinite, isinf | Return a boolean array indicating whether each element is finite (non-inf, non-NaN) or infinite, respectively. |
| cos, cosh, sin, sinh, tan, tanh | Regular and hyperbolic trigonometric functions. |
| arccos, arccosh, arcsin, arcsinh, arctan, arctanh | Inverse trigonometric functions. |
| logical_not | Compute the truth value of not x element-wise. Equivalent to ~arr for a boolean array. |

Binary functions

| Function | Description |
| --- | --- |
| add | Add corresponding elements in arrays. |
| subtract | Subtract elements in the second array from the first array. |
| multiply | Multiply array elements. |
| divide, floor_divide | Divide or floor divide (truncating the remainder). |
| power | Raise elements in the first array to the powers indicated in the second array. |
| maximum, fmax | Element-wise maximum. fmax ignores NaN. |
| minimum, fmin | Element-wise minimum. fmin ignores NaN. |
| mod | Element-wise modulus (remainder of division). |
| copysign | Copy the sign of values in the second argument to values in the first argument. |
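The difference between maximum and fmax only shows up when NaN is present. A short sketch with illustrative inputs, which also shows the two arrays returned by modf:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])
y = np.array([2.0, 2.0, np.nan])

print(np.maximum(x, y))   # NaN propagates: [ 2. nan nan]
print(np.fmax(x, y))      # NaN is ignored where possible: [2. 2. 3.]

frac, whole = np.modf(np.array([1.5, -2.25]))
print(frac, whole)        # fractional and integral parts, each keeping the sign
```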

Data processing using arrays

Vectorization replaces explicit loops with array expressions, which is generally much faster than the pure Python equivalent.

Expressing conditional logic as array operations

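pure python

```python
# x, y: arrays of values; c: boolean array of the same length
result = [xv if cv else yv for xv, yv, cv in zip(x, y, c)]
```

numpy

```python
result = np.where(c, x, y)
arr = randn(4, 4)
arr
np.where(arr > 0, 2, -2)
np.where(arr > 0, 2, arr)  # set only positive values to 2
```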
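A self-contained sketch comparing the two approaches on small illustrative arrays (the values of x, y and c are assumptions made for the example):

```python
import numpy as np

x = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
y = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
c = np.array([True, False, True, True, False])

# pure Python: one element at a time
slow = [xv if cv else yv for xv, yv, cv in zip(x, y, c)]

# NumPy: one vectorized call
fast = np.where(c, x, y)

print(fast)
print(np.array_equal(fast, np.array(slow)))
```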

Mathematical and statistical methods

mean

```python
arr = np.random.randn(5, 4)  # normally-distributed data
arr.mean()
np.mean(arr)
arr.sum()
```

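Along an axis: axis=0 aggregates down the rows (one result per column), axis=1 aggregates across the columns (one result per row).
```
arr.mean(axis=1)
arr.sum(0)
```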
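To make the axis convention concrete, here is a sketch on a small 2 x 3 array chosen for illustration:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.sum(axis=0))    # collapses the rows: one sum per column -> [5 7 9]
print(arr.sum(axis=1))    # collapses the columns: one sum per row -> [ 6 15]
print(arr.mean(axis=1))   # row means -> [2. 5.]
```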

cumsum, cumprod

```python
arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
arr.cumsum(0)
arr.cumprod(1)
```

| Method | Description |
| --- | --- |
| sum | Sum of all the elements in the array or along an axis. Zero-length arrays have sum 0. |
| mean | Arithmetic mean. Zero-length arrays have NaN mean. |
| std, var | Standard deviation and variance, respectively, with optional degrees of freedom adjustment (default denominator n). |
| min, max | Minimum and maximum. |
| argmin, argmax | Indices of minimum and maximum elements, respectively. |
| cumsum | Cumulative sum of elements starting from 0. |
| cumprod | Cumulative product of elements starting from 1. |

Methods for boolean arrays

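Counting positive values

```python
arr = randn(100)
(arr > 0).sum()  # number of positive values
```

any tests whether at least one value is True; all tests whether every value is True.

```python
bools = np.array([False, False, True, False])
bools.any()
bools.all()
```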

Sorting

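arr.sort() sorts the array in place:
```
arr = randn(8)
arr
arr.sort()
arr
```
arr.sort(1) sorts a two-dimensional array along axis 1, i.e. within each row:
```
arr = randn(5, 3)
arr.sort(1)
arr
```
np.sort() returns a sorted copy and leaves the original array unchanged:
```
np.sort(arr)
```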
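A deterministic sketch of the difference between the in-place method and the copying function, using small illustrative arrays:

```python
import numpy as np

arr = np.array([3, 1, 2])
sorted_copy = np.sort(arr)   # returns a new sorted array
print(arr, sorted_copy)      # arr keeps its original order

arr.sort()                   # sorts arr itself and returns None
print(arr)

grid = np.array([[3, 1, 2],
                 [9, 7, 8]])
grid.sort(axis=1)            # sort within each row
print(grid)
```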

Unique and other set logic

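np.unique(arr) returns the sorted unique values in an array:

```python
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.unique(names)
ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
np.unique(ints)
```

np.in1d(arr1, arr2) tests whether each element of arr1 is contained in arr2:

```python
values = np.array([6, 0, 0, 3, 2, 5, 6])
np.in1d(values, [2, 3, 6])
```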

| Method | Description |
| --- | --- |
| unique(x) | Compute the sorted, unique elements in x. |
| intersect1d(x, y) | Compute the sorted, common elements in x and y. |
| union1d(x, y) | Compute the sorted union of elements. |
| in1d(x, y) | Compute a boolean array indicating whether each element of x is contained in y. |
| setdiff1d(x, y) | Set difference: elements in x that are not in y. |
| setxor1d(x, y) | Set symmetric difference: elements that are in either of the arrays, but not both. |
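A short sketch of the set functions in the table, on illustrative inputs. For the membership test it uses np.isin, which recent NumPy versions recommend over in1d.

```python
import numpy as np

x = np.array([3, 1, 2, 3, 5])
y = np.array([2, 3, 6])

print(np.intersect1d(x, y))   # [2 3]
print(np.union1d(x, y))       # [1 2 3 5 6]
print(np.setdiff1d(x, y))     # [1 5]
print(np.setxor1d(x, y))      # [1 5 6]
print(np.isin(x, y))          # element-wise membership test
```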

File input and output with arrays

Storing arrays on disk in binary format

```python
arr = np.arange(10)
np.save('some_array', arr)
np.load('some_array.npy')
```

```python
np.savez('array_archive.npz', a=arr, b=arr)
arch = np.load('array_archive.npz')
arch['b']  # dict-like access
```

Saving and loading text files

In practice, the read_csv and read_table functions in pandas are used more often for text files.

```python
arr = np.loadtxt('array_ex.txt', delimiter=',')
arr
```
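The file array_ex.txt is not included in these notes, so the sketch below first writes a small comma-separated file with np.savetxt and then reads it back. The array contents and the temporary path are assumptions made for the example.

```python
import os
import tempfile

import numpy as np

arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

# write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'array_ex.txt')   # hypothetical file location
    np.savetxt(path, arr, delimiter=',')
    loaded = np.loadtxt(path, delimiter=',')

print(np.array_equal(arr, loaded))
```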

Linear algebra

1. Matrix multiplication (A %*% B in R)

```python
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])
x
y
x.dot(y)  # equivalently np.dot(x, y)
```

2. Inverse and QR decomposition

```python
from numpy.linalg import inv, qr
X = randn(5, 5)
mat = X.T.dot(X)
inv(mat)
mat.dot(inv(mat))
q, r = qr(mat)
r
```

| Function | Description |
| --- | --- |
| diag | Return the diagonal (or off-diagonal) elements of a square matrix as a 1D array, or convert a 1D array into a square matrix with zeros on the off-diagonal. |
| dot | Matrix multiplication. |
| trace | Compute the sum of the diagonal elements. |
| det | Compute the matrix determinant. |
| eig | Compute the eigenvalues and eigenvectors of a square matrix. |
| inv | Compute the inverse of a square matrix. |
| pinv | Compute the Moore-Penrose pseudo-inverse of a matrix. |
| qr | Compute the QR decomposition. |
| svd | Compute the singular value decomposition (SVD). |
| solve | Solve the linear system Ax = b for x, where A is a square matrix. |
| lstsq | Compute the least-squares solution to y = Xb. |
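A minimal sketch of solve and qr from the table, on a 2 x 2 system chosen so the answer is easy to verify by hand. The matrix and right-hand side are illustrative.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)          # solves A x = b without forming inv(A)
print(x)                           # the solution is [2, 3] up to rounding
print(np.allclose(A.dot(x), b))

q, r = np.linalg.qr(A)
print(np.allclose(q.dot(r), A))              # Q R reproduces A
print(np.allclose(q.T.dot(q), np.eye(2)))    # Q has orthonormal columns
```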

Random number generation

```python
samples = np.random.normal(size=(4, 4))
samples
```

```python
from random import normalvariate
N = 1000000
# range replaces the Python 2-only xrange
%timeit samples = [normalvariate(0, 1) for _ in range(N)]
%timeit np.random.normal(size=N)
```

| Function | Description |
| --- | --- |
| seed | Seed the random number generator. |
| permutation | Return a random permutation of a sequence, or return a permuted range. |
| shuffle | Randomly permute a sequence in place. |
| rand | Draw samples from a uniform distribution. |
| randint | Draw random integers from a given low-to-high range. |
| randn | Draw samples from a normal distribution with mean 0 and standard deviation 1 (MATLAB-like interface). |
| binomial | Draw samples from a binomial distribution. |
| normal | Draw samples from a normal (Gaussian) distribution. |
| beta | Draw samples from a beta distribution. |
| chisquare | Draw samples from a chi-square distribution. |
| gamma | Draw samples from a gamma distribution. |
| uniform | Draw samples from a uniform [0, 1) distribution. |
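A brief sketch of seed and permutation from the table: reseeding with the same value reproduces the same draws. The seed value and sizes are arbitrary choices for the example.

```python
import numpy as np

np.random.seed(12345)
first = np.random.randn(3)

np.random.seed(12345)             # same seed -> same sequence
second = np.random.randn(3)

print(np.array_equal(first, second))

perm = np.random.permutation(5)   # a shuffled copy of 0..4
print(sorted(perm.tolist()))
```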

Example: Random Walks

pure python

```python
import random
position = 0
walk = [position]
steps = 1000
for i in range(steps):  # range replaces the Python 2-only xrange
    step = 1 if random.randint(0, 1) else -1
    position += step
    walk.append(position)
```

numpy

```python
np.random.seed(12345)
nsteps = 1000
draws = np.random.randint(0, 2, size=nsteps)
steps = np.where(draws > 0, 1, -1)
walk = steps.cumsum()
```

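A first look at the random walk

```python
walk.min()
walk.max()
```

Find the first step at which the walk reaches 10 or -10 (the first crossing time):

```python
(np.abs(walk) >= 10).argmax()
```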
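One caveat with the argmax approach: on an all-False array argmax returns 0, so a walk that never reaches the level looks like it crossed at step 0. The sketch below wraps the check in a hypothetical helper, first_crossing, which is not part of NumPy or the original notes.

```python
import numpy as np

np.random.seed(12345)
steps = np.where(np.random.randint(0, 2, size=1000) > 0, 1, -1)
walk = steps.cumsum()

def first_crossing(walk, level):
    """Return the first index where |walk| >= level, or -1 if it never happens."""
    hit = np.abs(walk) >= level
    # argmax on an all-False array returns 0, so check any() first
    return int(hit.argmax()) if hit.any() else -1

print(first_crossing(walk, 10))
print(first_crossing(walk, 10000))   # unreachable in 1000 unit steps -> -1
```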

Simulating many random walks at once

```python
nwalks = 5000
nsteps = 1000
draws = np.random.randint(0, 2, size=(nwalks, nsteps))  # 0 or 1
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(1)  # cumulative sum along each row
walks
```

A first look at the random walks

```python
walks.max()
walks.min()
hits30 = (np.abs(walks) >= 30).any(1)
hits30
hits30.sum()  # number of walks that hit 30 or -30
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(1)
crossing_times.mean()
```

Random walk with normally distributed steps

```python
steps = np.random.normal(loc=0, scale=0.25,
                         size=(nwalks, nsteps))
```

0 0