Class: Intermediate Python for Data Science

Source: Internet  Editor: 程序博客网  Time: 2024/05/30 04:20
Creating a DataFrame from a dictionary
  • Import pandas as pd.
  • Use the pre-defined lists to create a dictionary called my_dict. There should be three key-value pairs:
    • key 'country' and value names.
    • key 'drives_right' and value dr.
    • key 'cars_per_cap' and value cpc.
  • Use pd.DataFrame() to turn your dict into a DataFrame called cars.
  • Print out cars and see how beautiful it is.


# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]


# Import pandas as pd
import pandas as pd


# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = {'country': names, 'drives_right': dr, 'cars_per_cap': cpc}


# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)


# Print cars
print(cars)
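
As a quick check on the result, the sketch below rebuilds the same DataFrame and inspects its shape, column names, and default row index. The variable names shape, columns, and index_values are introduced here only for illustration.

```python
import pandas as pd

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

cars = pd.DataFrame({'country': names, 'drives_right': dr, 'cars_per_cap': cpc})

shape = cars.shape               # (number of rows, number of columns)
columns = list(cars.columns)     # column order follows the dict's key order
index_values = list(cars.index)  # default row index is 0, 1, 2, ...

print(shape)
print(columns)
print(index_values)
```

Because the default index is just the row position, it carries no meaning of its own, which is why replacing it with country codes can make the table easier to read.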

Changing the index labels
# Definition of row_labels
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']


# Specify row labels of cars
cars.index = row_labels


# Print cars again
print(cars)
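
An alternative worth knowing: the row labels can also be supplied when the DataFrame is built, through the index parameter of pd.DataFrame(). The sketch below uses a shortened three-row version of the data and the illustrative names cars_a and cars_b to compare the two approaches.

```python
import pandas as pd

my_dict = {
    'country': ['United States', 'Australia', 'Japan'],
    'drives_right': [True, False, False],
    'cars_per_cap': [809, 731, 588],
}
row_labels = ['US', 'AUS', 'JAP']

# Option 1: assign the labels after construction
cars_a = pd.DataFrame(my_dict)
cars_a.index = row_labels

# Option 2: pass the labels through the index parameter
cars_b = pd.DataFrame(my_dict, index=row_labels)

# Both routes should give the same table
print(cars_a.equals(cars_b))
```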

Reading data from a CSV file
# Import pandas as pd
import pandas as pd


# Import the cars.csv data: cars
cars = pd.read_csv('cars.csv')


# Print out cars
print(cars)

When reading a CSV file, keep the index column from being read in as ordinary data, which would add an extra "Unnamed" column:

# Import pandas as pd
import pandas as pd


# Fix import by including index_col
cars = pd.read_csv('cars.csv', index_col=0)


# Print out cars
print(cars)
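
To see the difference without needing cars.csv on disk, the sketch below reads a small in-memory CSV twice. The CSV text is a hypothetical stand-in, assuming the real file has an empty first header field for the row labels; without_index and with_index are illustrative names.

```python
import io
import pandas as pd

# Hypothetical CSV text: the first header field is empty because the
# row labels were saved as the first column.
csv_text = (
    ",country,drives_right,cars_per_cap\n"
    "US,United States,True,809\n"
    "AUS,Australia,False,731\n"
)

# Without index_col, the label column becomes a regular column
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(without_index.columns))
print(list(with_index.columns))
print(list(with_index.index))
```

In the first read, pandas names the unlabeled column 'Unnamed: 0'; in the second, that column disappears from the data and its values become the row labels.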



Selecting part of the data in a DataFrame:

  • Use single square brackets to print out the country column of cars as a Pandas Series.
  • Use double square brackets to print out the country column of cars as a Pandas DataFrame.
  • Use double square brackets to print out a DataFrame with both the country and drives_right columns of cars, in this order.

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)


# Print out country column as Pandas Series
print(cars['country'])


# Print out country column as Pandas DataFrame
print(cars[['country']])


# Print out DataFrame with country and drives_right columns
print(cars[['country', 'drives_right']])
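
The difference between single and double brackets is easiest to see by checking the type of each result. This is a minimal sketch on a two-row DataFrame built inline; as_series and as_frame are illustrative names.

```python
import pandas as pd

cars = pd.DataFrame(
    {'country': ['Japan', 'India'], 'drives_right': [False, False]},
    index=['JAP', 'IN'],
)

as_series = cars['country']    # single brackets: one column as a Series
as_frame = cars[['country']]   # double brackets: a one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```

The inner brackets in cars[['country']] are an ordinary Python list of column names, which is why the same syntax extends to several columns.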


Two ways to locate data in a DataFrame:
  • Use loc or iloc to select the observation corresponding to Japan as a Series. The label of this row is JAP, the index is 2. Make sure to print the resulting Series.
  • Use loc or iloc to select the observations for Australia and Egypt as a DataFrame. You can find out about the labels/indexes of these rows by inspecting cars in the IPython Shell. Make sure to print the resulting DataFrame.

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)


# Print out observation for Japan
print(cars.loc['JAP'])
print(cars.iloc[2])


# Print out observations for Australia and Egypt
print(cars.loc[['AUS', 'EG']])
print(cars.iloc[[1,6]])

  • Print out the drives_right value of the row corresponding to Morocco (its row label is MOR)
  • Print out a sub-DataFrame, containing the observations for Russia and Morocco and the columns country and drives_right.

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)


# Print out drives_right value of Morocco
print(cars.loc['MOR','drives_right'])


# Print sub-DataFrame
print(cars.loc[['RU','MOR'], ['country', 'drives_right']])
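
As a sketch of how loc and iloc relate, the example below selects the same sub-DataFrame once by label and once by integer position. It uses a three-row DataFrame built inline rather than cars.csv; by_label and by_position are illustrative names.

```python
import pandas as pd

cars = pd.DataFrame(
    {
        'country': ['Russia', 'Morocco', 'Egypt'],
        'drives_right': [True, True, True],
        'cars_per_cap': [200, 70, 45],
    },
    index=['RU', 'MOR', 'EG'],
)

# loc takes row and column labels
by_label = cars.loc[['RU', 'MOR'], ['country', 'drives_right']]

# iloc takes zero-based row and column positions
by_position = cars.iloc[[0, 1], [0, 1]]

print(by_label.equals(by_position))

# One row label plus one column label returns a single value
print(cars.loc['MOR', 'cars_per_cap'])
```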

Boolean conditions on numeric arrays:

Generate boolean arrays that answer the following questions:
  • Which areas in my_house are greater than 18.5 or smaller than 10?
  • Which areas are smaller than 11 in both my_house and your_house? Make sure to wrap both commands in a print() call, so that you can inspect the output.

# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])


# my_house greater than 18.5 or smaller than 10
print(np.logical_or(my_house > 18.5, my_house < 10))


# Both my_house and your_house smaller than 11
print(np.logical_and(my_house < 11, your_house < 11))
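
For boolean NumPy arrays, the | and & operators give the same element-wise result as np.logical_or() and np.logical_and(). The sketch below repeats both questions with operators; either and both are illustrative names.

```python
import numpy as np

my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# Parentheses are required: | and & bind more tightly than < and >
either = (my_house > 18.5) | (my_house < 10)
both = (my_house < 11) & (your_house < 11)

print(either)
print(both)
```

Python's plain or and and keywords do not work here, because they expect a single truth value rather than an array of them.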


Filtering DataFrame rows with a condition on a column:


  • Select the cars_per_cap column from cars as a Pandas Series and store it as cpc.
  • Use cpc in combination with a comparison operator and 500. You want to end up with a boolean Series that's True if the corresponding country has a cars_per_cap of more than 500 and False otherwise. Store this boolean Series as many_cars.
  • Use many_cars to subset cars, similar to what you did before. Store the result as car_maniac.
  • Print out car_maniac to see if you got it right.

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)


# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars.cars_per_cap
many_cars = cpc > 500
car_maniac = cars[many_cars]


# Print car_maniac
print(car_maniac)
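
The three steps can also be collapsed into one line, and several conditions can be combined with &. The sketch below builds the DataFrame inline from the lists used earlier instead of reading cars.csv; right_and_many is an illustrative name.

```python
import pandas as pd

cars = pd.DataFrame(
    {
        'country': ['United States', 'Australia', 'Japan', 'India',
                    'Russia', 'Morocco', 'Egypt'],
        'drives_right': [True, False, False, False, True, True, True],
        'cars_per_cap': [809, 731, 588, 18, 200, 70, 45],
    },
    index=['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG'],
)

# Same filter, written in one line
car_maniac = cars[cars['cars_per_cap'] > 500]

# Two conditions combined with &; the comparison needs its own parentheses
right_and_many = cars[(cars['cars_per_cap'] > 500) & cars['drives_right']]

print(list(car_maniac.index))
print(list(right_and_many.index))
```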


Iterating over a NumPy array

  • Import the numpy package under the local alias np.
  • Write a for loop that iterates over all elements in np_height and prints out "x inches" for each element, where x is the value in the array.
  • Write a for loop that visits every element of the np_baseball array and prints it out.

# Import numpy as np
import numpy as np


# np_height and np_baseball are assumed to be pre-defined by the exercise environment

# For loop over np_height (1-D)
for each in np_height:
    print(str(each) + " inches")


# For loop over np_baseball(2-D)
for each in np.nditer(np_baseball):
    print(each)
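
To show why np.nditer() is needed for the 2-D case, the sketch below compares it with a plain for loop. The small np_baseball array here is a hypothetical stand-in for the exercise's pre-defined data; rows and elements are illustrative names.

```python
import numpy as np

# Hypothetical stand-in for the exercise's pre-defined 2-D array
np_baseball = np.array([[74, 180], [72, 215], [69, 210]])

# A plain for loop over a 2-D array yields one row per step
rows = [row.tolist() for row in np_baseball]

# np.nditer yields one element per step; for an array created like
# this one, the elements come out row by row
elements = [int(x) for x in np.nditer(np_baseball)]

print(rows)
print(elements)
```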
