class cleaning the data
Source: Internet  Editor: 程序博客网  Time: 2024/06/01 10:05
Inspecting the data in the DataFrame
- Define a function called check_null_or_valid() that takes in one argument: row_data.
- Inside the function, convert no_na to a numeric data type using pd.to_numeric().
- Write an assert statement to make sure the first column (index 0) of the g1800s DataFrame is 'Life expectancy'.
- Write an assert statement to test that all the values are valid for the g1800s DataFrame. Use the check_null_or_valid() function placed inside the .apply() method for this. Because you're applying it over the entire DataFrame, and not just one column, you'll have to chain the .all() method twice. Remember that you don't have to use () for functions placed inside .apply().
- Write an assert statement to make sure that each country occurs only once in the data. Use the .value_counts() method on the 'Life expectancy' column for this. The result is sorted in descending order, so its first entry (position 0) holds the count of the most frequently occurring value. If this is equal to 1 for the 'Life expectancy' column, then you can be certain that no country appears more than once in the data.
import pandas as pd

# g1800s is assumed to be already loaded as a DataFrame

def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    # Note: the [1:-1] slice also skips the first and last non-missing entries
    no_na = row_data.dropna()[1:-1]
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0

# Check whether the first column is 'Life expectancy'
assert g1800s.columns[0] == 'Life expectancy'
# Check whether the values in the row are valid
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all()
# Check that there is only one instance of each country
# (.iloc[0] reads the top count by position; the index holds country names)
assert g1800s['Life expectancy'].value_counts().iloc[0] == 1
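The three checks above can be tried on a small made-up table. This is only a sketch: the DataFrame toy, its country names, and its values are hypothetical stand-ins for g1800s, and the helper row_is_valid is a simplified variant, not the exercise's function. It omits the [1:-1] slice so that every non-missing value is checked, and it returns one boolean per row, so a single .all() is enough here.

```python
import numpy as np
import pandas as pd

def row_is_valid(row_data):
    """Drop missing values, then report whether every remaining value is >= 0."""
    numeric = pd.to_numeric(row_data.dropna())
    return bool((numeric >= 0).all())

# Hypothetical stand-in for g1800s: one row per country, one column per year
toy = pd.DataFrame({
    'Life expectancy': ['Country A', 'Country B', 'Country C'],
    '1800': [30.1, np.nan, 28.4],
    '1801': [30.3, 25.0, np.nan],
    '1802': [30.2, 25.2, 28.9],
})

# The first column should hold the country names
first_column_ok = toy.columns[0] == 'Life expectancy'

# Apply the helper row by row to the year columns only
values_ok = bool(toy.iloc[:, 1:].apply(row_is_valid, axis=1).all())

# value_counts() sorts counts in descending order, so position 0 is the largest
no_duplicates = int(toy['Life expectancy'].value_counts().iloc[0]) == 1

# A copy with one negative value should fail the validity check
bad = toy.copy()
bad.loc[0, '1800'] = -1.0
bad_values_ok = bool(bad.iloc[:, 1:].apply(row_is_valid, axis=1).all())

print(first_column_ok, values_ok, no_duplicates, bad_values_ok)
```

Returning a single boolean per row keeps the result free of the NaN placeholders that appear when rows have missing values in different columns.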
- Convert the year column of gapminder using pd.to_numeric().
- Assert that the country column is of type object. The original exercise wrote this as np.object; that alias was removed in NumPy 1.24, so use object or np.object_ instead. This has been done for you.
- Assert that the year column is of type np.int64.
- Assert that the life_expectancy column is of type np.float64.
import numpy as np
import pandas as pd

# Convert the year column to a numeric dtype
gapminder.year = pd.to_numeric(gapminder.year)
# Test if country is of type object (np.object was removed in NumPy 1.24)
assert gapminder.country.dtypes == np.object_
# Test if year is of type int64
assert gapminder.year.dtypes == np.int64
# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64
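To make the effect of pd.to_numeric() visible, here is a minimal sketch on a made-up table. The DataFrame toy and its values are hypothetical. The string columns are created with an explicit object dtype as an assumption about how the year values arrive before conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format table; year starts out as strings
toy = pd.DataFrame({
    'country': pd.Series(['Country A', 'Country A', 'Country B'], dtype=object),
    'year': pd.Series(['1800', '1801', '1800'], dtype=object),
    'life_expectancy': [30.1, 30.3, 25.0],
})

# Parse the strings into integers
toy['year'] = pd.to_numeric(toy['year'])

# The same dtype checks as in the exercise, on the toy data
assert toy['country'].dtype == object
assert toy['year'].dtype == np.int64
assert toy['life_expectancy'].dtype == np.float64

print(toy.dtypes)
```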
- Create a histogram of the life_expectancy column using the .plot() method of gapminder. Specify kind='hist'.
- Group gapminder by 'year' and aggregate 'life_expectancy' by the mean. To do this:
  - Use the .groupby() method on gapminder with 'year' as the argument. Then select 'life_expectancy' and chain the .mean() method to it.
- Print the head and tail of gapminder_agg. This has been done for you.
- Create a line plot of average life expectancy per year by using the .plot() method (without any arguments) on gapminder_agg.
- Save gapminder and gapminder_agg to csv files called 'gapminder.csv' and 'gapminder_agg.csv', respectively, using the .to_csv() method.
import matplotlib.pyplot as plt

# Add first subplot
plt.subplot(2, 1, 1)
# Create a histogram of life_expectancy
gapminder.life_expectancy.plot(kind='hist')
# Group gapminder: gapminder_agg
gapminder_agg = gapminder.groupby('year')['life_expectancy'].mean()
# Print the head of gapminder_agg
print(gapminder_agg.head())
# Print the tail of gapminder_agg
print(gapminder_agg.tail())
# Add second subplot
plt.subplot(2, 1, 2)
# Create a line plot of life expectancy per year
gapminder_agg.plot()
# Add title and specify axis labels
plt.title('Life expectancy over the years')
plt.ylabel('Life expectancy')
plt.xlabel('Year')
# Display the plots
plt.tight_layout()
plt.show()
# Save both DataFrames to csv files
gapminder.to_csv('gapminder.csv')
gapminder_agg.to_csv('gapminder_agg.csv')
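The grouping and saving steps can be checked without plotting. The sketch below uses a hypothetical table toy in place of gapminder and writes to in-memory buffers instead of files, which is an assumption made only to keep the example self-contained. It also shows index=False, which may be worth considering for the DataFrame because to_csv() otherwise writes the row index as an extra first column.

```python
import io
import pandas as pd

# Hypothetical stand-in for gapminder
toy = pd.DataFrame({
    'country': ['Country A', 'Country B', 'Country A', 'Country B'],
    'year': [1800, 1800, 1801, 1801],
    'life_expectancy': [30.0, 26.0, 31.0, 27.0],
})

# Mean life expectancy per year; the result is a Series indexed by year
toy_agg = toy.groupby('year')['life_expectancy'].mean()

# Write the aggregate to an in-memory buffer; the year index is kept as a column
agg_buffer = io.StringIO()
toy_agg.to_csv(agg_buffer)
agg_buffer.seek(0)
restored = pd.read_csv(agg_buffer, index_col='year')

# Write the full table without the default integer row index
full_buffer = io.StringIO()
toy.to_csv(full_buffer, index=False)
header_line = full_buffer.getvalue().splitlines()[0]

print(toy_agg)
print(header_line)
```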