class cleaning the data

Source: Internet  Editor: 程序博客网  Time: 2024/06/01 10:05

Inspecting the data in the DataFrame


  • Define a function called check_null_or_valid() that takes in one argument: row_data.
  • Inside the function, convert no_na to a numeric data type using pd.to_numeric().
  • Write an assert statement to make sure the first column (index 0) of the g1800s DataFrame is 'Life expectancy'.
  • Write an assert statement to test that all the values are valid for the g1800s DataFrame. Use the check_null_or_valid() function inside the .apply() method for this. Because you're applying it over the entire DataFrame, not just one column, you have to chain the .all() method twice. Pass the function to .apply() by name, without parentheses.
  • Write an assert statement to make sure that each country occurs only once in the data. Use the .value_counts() method on the 'Life expectancy' column for this. The first entry (position 0) of the .value_counts() result holds the count of the most frequently occurring value. If this equals 1 for the 'Life expectancy' column, no country appears more than once in the data.


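import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# g1800s and gapminder are assumed to be DataFrames that are already loaded.


def check_null_or_valid(row_data):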
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    # The caller passes only the year columns, so every non-missing value is checked
    no_na = row_data.dropna()
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0


# Check whether the first column is 'Life expectancy'
assert g1800s.columns[0] == 'Life expectancy'


# Check whether the values in the row are valid
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all()


# Check that there is only one instance of each country
assert g1800s['Life expectancy'].value_counts().iloc[0] == 1
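
The three checks above can be tried on a small table to see how they behave. The sketch below is only an illustration: the `toy` DataFrame, its country names, and the `non_negative` helper are made up for this example and are not part of the exercise data. It also shows why `.all()` is chained twice.

```python
import pandas as pd

# Hypothetical miniature of g1800s: one label column plus two year columns.
toy = pd.DataFrame({
    'Life expectancy': ['Atlantis', 'Erewhon', 'Utopia'],
    '1800': [30.1, None, 28.4],
    '1801': [30.3, 25.0, 28.9],
})


def non_negative(row_data):
    """Drop missing values in a row and flag which remaining ones are >= 0."""
    return pd.to_numeric(row_data.dropna()) >= 0


# apply(axis=1) yields one boolean Series per row, which pandas aligns into a
# DataFrame. The first .all() reduces each column, the second reduces those
# per-column results to a single value.
flags = toy.iloc[:, 1:].apply(non_negative, axis=1)
all_valid = bool(flags.all().all())

# The count of the most frequent label is 1 exactly when no label repeats.
max_count = toy['Life expectancy'].value_counts().iloc[0]

# A single negative entry makes the validity check fail.
bad = toy.copy()
bad.loc[0, '1801'] = -1.0
bad_valid = bool(bad.iloc[:, 1:].apply(non_negative, axis=1).all().all())

print(all_valid, max_count, bad_valid)
```

Missing values are skipped rather than treated as invalid, which matches the intent of check_null_or_valid().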


  • Convert the year column of gapminder using pd.to_numeric().
  • Assert that the country column is of type np.object_ (NumPy's object dtype; older code spells it np.object). This has been done for you.
  • Assert that the year column is of type np.int64.
  • Assert that the life_expectancy column is of type np.float64.

# Convert the year column to numeric
gapminder.year = pd.to_numeric(gapminder.year)


# Test if country is of type object
assert gapminder.country.dtypes == np.object_


# Test if year is of type int64
assert gapminder.year.dtypes == np.int64


# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64
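
As a rough illustration of the conversion and dtype checks above, the sketch below uses a made-up `toy` table in which year is stored as strings. That is an assumption for the example, not a statement about the real gapminder data. It also shows the `errors='coerce'` option of pd.to_numeric(), which the exercise itself does not use.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for gapminder, with year stored as strings.
toy = pd.DataFrame({
    'country': ['Atlantis', 'Atlantis', 'Erewhon'],
    'year': ['1800', '1801', '1800'],
    'life_expectancy': [30.1, 30.3, 25.0],
})

# Strings that all parse as whole numbers become int64.
toy['year'] = pd.to_numeric(toy['year'])

assert toy['year'].dtype == np.int64
assert toy['life_expectancy'].dtype == np.float64

# errors='coerce' turns unparseable entries into NaN instead of raising.
# The NaN forces the result to float64, so an int64 assertion would fail.
coerced = pd.to_numeric(pd.Series(['1800', 'n/a']), errors='coerce')
assert coerced.dtype == np.float64
assert coerced.isna().tolist() == [False, True]
```

If the int64 assertion fails on real data, a float64 result is a hint that some entries were missing or could not be parsed.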


DataFrame groupby aggregation
  • Create a histogram of the life_expectancy column using the .plot() method of gapminder. Specify kind='hist'.
  • Group gapminder by 'year' and aggregate 'life_expectancy' by the mean. To do this:
    • Use the .groupby() method on gapminder with 'year' as the argument. Then select 'life_expectancy' and chain the .mean() method to it.
  • Print the head and tail of gapminder_agg. This has been done for you.
  • Create a line plot of average life expectancy per year by using the .plot() method (without any arguments) on gapminder_agg.
  • Save gapminder and gapminder_agg to csv files called 'gapminder.csv' and 'gapminder_agg.csv', respectively, using the .to_csv() method.


# Add first subplot
plt.subplot(2, 1, 1)


# Create a histogram of life_expectancy
gapminder.life_expectancy.plot(kind='hist')


# Group gapminder: gapminder_agg
gapminder_agg = gapminder.groupby('year')['life_expectancy'].mean()


# Print the head of gapminder_agg
print(gapminder_agg.head())


# Print the tail of gapminder_agg
print(gapminder_agg.tail())


# Add second subplot
plt.subplot(2, 1, 2)


# Create a line plot of life expectancy per year
gapminder_agg.plot()


# Add title and specify axis labels
plt.title('Life expectancy over the years')
plt.ylabel('Life expectancy')
plt.xlabel('Year')


# Display the plots
plt.tight_layout()
plt.show()


# Save the DataFrame and the aggregated Series to CSV files
gapminder.to_csv('gapminder.csv')
gapminder_agg.to_csv('gapminder_agg.csv')
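
The aggregation and save steps can be checked on a small table without plotting. The following is a minimal sketch with made-up data (`toy` and its values are hypothetical). It writes to an in-memory buffer instead of a file so nothing is left on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for gapminder.
toy = pd.DataFrame({
    'country': ['Atlantis', 'Erewhon', 'Atlantis', 'Erewhon'],
    'year': [1800, 1800, 1801, 1801],
    'life_expectancy': [30.0, 26.0, 31.0, 27.0],
})

# One mean per year; the result is a Series indexed by year.
toy_agg = toy.groupby('year')['life_expectancy'].mean()

# to_csv writes the index (here: year) as the first column by default.
buffer = io.StringIO()
toy_agg.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the year index when reading the data back.
restored = pd.read_csv(buffer, index_col=0)['life_expectancy']

print(toy_agg)
print(restored)
```

Because the groupby result is a Series rather than a DataFrame, the saved file has just two columns: the year index and the mean life expectancy.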