Manipulating DataFrames with pandas
More operations on DataFrames
- Slice the row labels 'Perry' to 'Potter' and assign the output to p_counties.
- Print the p_counties DataFrame. This has been done for you.
- Slice the row labels 'Potter' to 'Perry' in reverse order. To do this for hypothetical row labels 'a' and 'b', you could use a step size of -1 like so: df.loc['b':'a':-1].
- Print the p_counties_rev DataFrame. This has also been done for you, so hit 'Submit Answer' to see the result of your slicing!
# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election.loc['Perry':'Potter']
# Print the p_counties DataFrame
print(p_counties)
# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev = election.loc['Potter':'Perry':-1]
# Print the p_counties_rev DataFrame
print(p_counties_rev)
Column indexing and data indexing in pandas DataFrames
- Slice the columns from the starting column to 'Obama' and assign the result to left_columns.
- Slice the columns from 'Obama' to 'winner' and assign the result to middle_columns.
- Slice the columns from 'Romney' to the end and assign the result to right_columns.
- The code to print the first 5 rows of left_columns, middle_columns, and right_columns has been written, so hit 'Submit Answer' to see the results!
# Slice the columns from the starting column to 'Obama': left_columns
left_columns = election.loc[:, :'Obama']
# Print the output of left_columns.head()
print(left_columns.head())
# Slice the columns from 'Obama' to 'winner': middle_columns
middle_columns = election.loc[:,'Obama':'winner']
# Print the output of middle_columns.head()
print(middle_columns.head())
# Slice the columns from 'Romney' to the end: right_columns
right_columns = election.loc[:,'Romney':]
# Print the output of right_columns.head()
print(right_columns.head())
- Create the list of row labels ['Philadelphia', 'Centre', 'Fulton'] and assign it to rows.
- Create the list of column labels ['winner', 'Obama', 'Romney'] and assign it to cols.
- Create a new DataFrame by selecting with rows and cols in .loc[] and assign it to three_counties.
- Print the three_counties DataFrame. This has been done for you, so hit 'Submit Answer' to see your new DataFrame.
# Create the list of row labels: rows
rows = ['Philadelphia', 'Centre', 'Fulton']
# Create the list of column labels: cols
cols = ['winner', 'Obama', 'Romney']
# Create the new DataFrame: three_counties
three_counties = election.loc[rows, cols]
# Print the three_counties DataFrame
print(three_counties)
- Import numpy as np.
- Create a boolean array for the condition where the 'margin' column is less than 1 and assign it to too_close.
- Convert the entries in the 'winner' column where the result was too close to call to np.nan.
- Print the output of election.info(). This has been done for you, so hit 'Submit Answer' to see the results.
import numpy as np
# Create the boolean array: too_close
too_close = election.margin < 1
# Assign np.nan to the 'winner' column where the results were too close to call
election.loc[too_close, 'winner'] = np.nan
# Print the output of election.info()
print(election.info())
- Select the 'age' and 'cabin' columns of titanic and create a new DataFrame df.
- Print the shape of df. This has been done for you.
- Drop rows in df with how='any' and print the shape.
- Drop rows in df with how='all' and print the shape.
- Drop columns from the titanic DataFrame that have more than 1000 missing values by specifying the thresh and axis keyword arguments. Print the output of .info() from this.
# Select the 'age' and 'cabin' columns: df
df = titanic[['age','cabin']]
# Print the shape of df
print(df.shape)
# Drop rows in df with how='any' and print the shape
print(df.dropna(how='any').shape)
# Drop rows in df with how='all' and print the shape
print(df.dropna(how='all').shape)
# Call .dropna() with thresh=1000 and axis='columns' and print the output of .info() from titanic
print(titanic.dropna(thresh=1000, axis='columns').info())
- Apply the to_celsius function over the ['Mean TemperatureF', 'Mean Dew PointF'] columns of the weather DataFrame.
- Reassign the columns of df_celsius to ['Mean TemperatureC', 'Mean Dew PointC'].
- Hit 'Submit Answer' to see the new DataFrame with the converted units.
# Write a function to convert degrees Fahrenheit to degrees Celsius: to_celsius
def to_celsius(F):
return 5/9*(F - 32)
# Apply the function over 'Mean TemperatureF' and 'Mean Dew PointF': df_celsius
df_celsius = weather[['Mean TemperatureF','Mean Dew PointF']].apply(to_celsius)
# Reassign the columns of df_celsius
df_celsius.columns = ['Mean TemperatureC', 'Mean Dew PointC']
# Print the output of df_celsius.head()
print(df_celsius.head())
- Create a dictionary with the key:value pairs 'Obama':'blue' and 'Romney':'red'.
- Use the .map() method on the 'winner' column using the red_vs_blue dictionary you created.
- Print the output of election.head(). This has been done for you, so hit 'Submit Answer' to see the new column!
# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama': 'blue', 'Romney': 'red'}
# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election['winner'].map(red_vs_blue)
# Print the output of election.head()
print(election.head())
- Import zscore from scipy.stats.
- Call zscore with election['turnout'] as input.
- Print the output of type(turnout_zscore). This has been done for you.
- Assign turnout_zscore to a new column in election as 'turnout_zscore'.
- Print the output of election.head(). This has been done for you, so hit 'Submit Answer' to view the result.
# Import zscore from scipy.stats
from scipy.stats import zscore
# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election['turnout'])
# Print the type of turnout_zscore
print(type(turnout_zscore))
# Assign turnout_zscore to a new column: election['turnout_zscore']
election['turnout_zscore'] = turnout_zscore
# Print the output of election.head()
print(election.head())
Some operations on the index:
- Create a list new_idx with the same elements as in sales.index, but with all characters capitalized.
- Assign new_idx to sales.index.
- Print the sales DataFrame. This has been done for you, so hit 'Submit Answer' to see how the index changed.
# Generate the list of new indexes: new_idx
new_idx = [idx.upper() for idx in sales.index]
# Assign new_idx to sales.index
sales.index = new_idx
# Print the sales DataFrame
print(sales)
- Assign the string 'MONTHS' to sales.index.name to create a name for the index.
- Print the sales DataFrame to see the index name you just created.
- Now assign the string 'PRODUCTS' to sales.columns.name to give a name to the set of columns.
- Print the sales DataFrame again to see the columns name you just created.
sales.index.name = 'MONTHS'
# Print the sales DataFrame
print(sales)
# Assign the string 'PRODUCTS' to sales.columns.name
sales.columns.name = 'PRODUCTS'
# Print the sales dataframe again
print(sales)
- Create a MultiIndex by setting the index to be the columns ['state', 'month'].
- Sort the MultiIndex using the .sort_index() method.
- Print the sales DataFrame. This has been done for you, so hit 'Submit Answer' to verify that indeed you have an index with the fields state and month!
sales = sales.set_index(['state', 'month'])
# Sort the MultiIndex: sales
sales = sales.sort_index()
# Print the sales DataFrame
print(sales)
- Set the index of sales to be the column 'state'.
- Print the sales DataFrame to verify that indeed you have an index with state values.
- Access the data from 'NY' and print it to verify that you obtain two rows.
sales = sales.set_index(['state'])
# Print the sales DataFrame
print(sales)
# Access the data from 'NY'
print(sales.loc['NY'])
# Example from the lesson: slice(None) selects all values of the outer index level
stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')), :]
- Look up data for the New York column ('NY') in month 1.
- Look up data for the California and Texas columns ('CA', 'TX') in month 2.
- Look up data for all states in month 2. Use (slice(None), 2) to extract all rows in month 2.
# Look up data for NY in month 1: NY_month1
NY_month1 = sales.loc[('NY', 1), :]
# Look up data for CA and TX in month 2: CA_TX_month2
CA_TX_month2 = sales.loc[(['CA', 'TX'], 2), :]
# Look up data for all states in month 2: all_month2
all_month2 = sales.loc[(slice(None), 2), :]
Pivot tables
- Pivot the users DataFrame with the rows indexed by 'weekday', the columns indexed by 'city', and the values populated with 'visitors'.
- Print the pivoted DataFrame. This has been done for you, so hit 'Submit Answer' to view the result.
- Pivot the users DataFrame with the 'signups' indexed by 'weekday' in the rows and 'city' in the columns.
- Print the new DataFrame. This has been done for you.
- Pivot the users DataFrame with both 'signups' and 'visitors' pivoted - that is, all the variables. This will happen automatically if you do not specify an argument for the values parameter of .pivot().
- Print the pivoted DataFrame. This has been done for you, so hit 'Submit Answer' to see the result.
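No solution code survives for the pivot exercises, so here is a minimal sketch. The course's `users` DataFrame isn't reproduced in these notes; the data below is a small hypothetical stand-in with the same columns.

```python
import pandas as pd

# Hypothetical stand-in for the course's `users` DataFrame
users = pd.DataFrame({
    'weekday':  ['Sun', 'Sun', 'Mon', 'Mon'],
    'city':     ['Austin', 'Dallas', 'Austin', 'Dallas'],
    'visitors': [139, 237, 326, 456],
    'signups':  [7, 12, 3, 5],
})

# Rows indexed by 'weekday', columns by 'city', values from 'visitors'
visitors_pivot = users.pivot(index='weekday', columns='city', values='visitors')
print(visitors_pivot)

# Omitting `values` pivots every remaining column, producing a
# hierarchical column index ('visitors'/'signups' on top, cities below)
all_pivot = users.pivot(index='weekday', columns='city')
print(all_pivot)
```

Note that `.pivot()` does no aggregation: it raises an error if an (index, columns) pair occurs more than once, which is when `.pivot_table()` is needed instead.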
- Define a DataFrame byweekday with the 'weekday' level of users unstacked.
- Print the byweekday DataFrame to see the new data layout. This has been done for you.
- Stack byweekday by 'weekday' and print it to check if you get the same layout as the original users DataFrame.
- Define a DataFrame newusers with the 'city' level stacked back into the index of bycity.
- Swap the levels of the index of newusers.
- Print newusers and verify that the index is not sorted. This has been done for you.
- Sort the index of newusers.
- Print newusers and verify that the index is now sorted. This has been done for you.
- Assert that newusers equals users. This has been done for you, so hit 'Submit Answer' to see the result.
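The stack/unstack/swaplevel round trip described above can be sketched as follows, again with a hypothetical multi-indexed stand-in for `users`:

```python
import pandas as pd

# Hypothetical stand-in for the course's multi-indexed `users` DataFrame
users = pd.DataFrame({
    'city':     ['Austin', 'Austin', 'Dallas', 'Dallas'],
    'weekday':  ['Mon', 'Sun', 'Mon', 'Sun'],
    'visitors': [326, 139, 456, 237],
}).set_index(['city', 'weekday'])

# Unstack the 'weekday' level into columns...
byweekday = users.unstack(level='weekday')
print(byweekday)

# ...and stack it back to recover the original layout
restacked = byweekday.stack(level='weekday')
print(restacked)

# swaplevel reorders the index levels; sort_index then restores row order
newusers = restacked.swaplevel(0, 1).sort_index()
print(newusers)
```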
- Reset the index of visitors_by_city_weekday with .reset_index().
- Print visitors_by_city_weekday and verify that you have just a range index, 0, 1, 2, 3. This has been done for you.
- Melt visitors_by_city_weekday to move the city names from the column labels to values in a single column called city.
- Print visitors to check that the city values are in a single column now and that the DataFrame is longer and skinnier.
- Define a DataFrame skinny where you melt the 'visitors' and 'signups' columns of users into a single column.
- Print skinny to verify the results. Note the value column that had the cell values in users.
- Set the index of users to ['city', 'weekday'].
- Print the DataFrame users_idx to see the new index.
- Obtain the key-value pairs corresponding to visitors and signups by melting users_idx with the keyword argument col_level=0.
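A minimal melt sketch for the exercises above, using the same hypothetical `users` stand-in (the course's actual data isn't shown in these notes):

```python
import pandas as pd

# Hypothetical stand-in for the course's `users` DataFrame
users = pd.DataFrame({
    'weekday':  ['Sun', 'Sun', 'Mon', 'Mon'],
    'city':     ['Austin', 'Dallas', 'Austin', 'Dallas'],
    'visitors': [139, 237, 326, 456],
    'signups':  [7, 12, 3, 5],
})

# id_vars stay as identifier columns; 'visitors' and 'signups' are
# melted into 'variable'/'value' key-value pairs, one row per pair
skinny = pd.melt(users, id_vars=['weekday', 'city'],
                 value_vars=['visitors', 'signups'])
print(skinny)
```

The result is "longer and skinnier": 4 rows x 4 columns becomes 8 rows, with the old column names in `variable` and the old cell values in `value`.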
- Define a DataFrame count_by_weekday1 that shows the count of each column with the parameter aggfunc='count'. The index here is 'weekday'.
- Print count_by_weekday1. This has been done for you.
- Replace aggfunc='count' with aggfunc=len and verify you obtain the same result.
- Define a DataFrame signups_and_visitors that shows the breakdown of signups and visitors by day, as well as the totals. You will need to use aggfunc=sum to do this.
- Print signups_and_visitors. This has been done for you.
- Now pass the additional argument margins=True to the .pivot_table() method to obtain the totals.
- Print signups_and_visitors_total. This has been done for you, so hit 'Submit Answer' to see the result.
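A sketch of the two pivot-table exercises above. The data is a hypothetical stand-in, and `values=` is passed explicitly (an assumption on my part) so that non-numeric columns stay out of the sum:

```python
import pandas as pd

# Hypothetical stand-in for the course's `users` DataFrame
users = pd.DataFrame({
    'weekday':  ['Sun', 'Sun', 'Mon', 'Mon'],
    'city':     ['Austin', 'Dallas', 'Austin', 'Dallas'],
    'visitors': [139, 237, 326, 456],
    'signups':  [7, 12, 3, 5],
})

# Count entries per weekday; aggfunc=len would give the same result
count_by_weekday = users.pivot_table(index='weekday',
                                     values=['signups', 'visitors'],
                                     aggfunc='count')
print(count_by_weekday)

# Sum per weekday; margins=True appends an 'All' row with the totals
signups_and_visitors_total = users.pivot_table(index='weekday',
                                               values=['signups', 'visitors'],
                                               aggfunc='sum',
                                               margins=True)
print(signups_and_visitors_total)
```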
- Group by the 'pclass' column and save the result as by_class.
- Aggregate the 'survived' column of by_class using .count(). Save the result as count_by_class.
- Print count_by_class. This has been done for you.
- Group titanic by the 'embarked' and 'pclass' columns. Save the result as by_mult.
- Aggregate the 'survived' column of by_mult using .count(). Save the result as count_mult.
- Print count_mult. This has been done for you, so hit 'Submit Answer' to view the result.
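A sketch of single- and multi-column groupby counting, using a tiny hypothetical subset of the titanic columns:

```python
import pandas as pd

# Hypothetical stand-in for the course's `titanic` DataFrame
titanic = pd.DataFrame({
    'pclass':   [1, 2, 3, 3, 1],
    'embarked': ['S', 'C', 'S', 'Q', 'C'],
    'survived': [1, 0, 0, 1, 1],
})

# Count 'survived' entries per passenger class
by_class = titanic.groupby('pclass')
count_by_class = by_class['survived'].count()
print(count_by_class)

# Grouping by two columns yields a MultiIndex on the result
by_mult = titanic.groupby(['embarked', 'pclass'])
count_mult = by_mult['survived'].count()
print(count_mult)
```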
- Read life_fname into a DataFrame called life and set the index to 'Country'.
- Read regions_fname into a DataFrame called regions and set the index to 'Country'.
- Group life by the region column of regions and store the result in life_by_region.
- Print the mean over the 2010 column of life_by_region.
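The course's CSV files aren't available here, so this sketch builds two small hypothetical frames inline. The point is that any Series aligned on the index can serve as a groupby key:

```python
import pandas as pd

# Hypothetical stand-ins for the life-expectancy and regions data
life = pd.DataFrame({'Country': ['A', 'B', 'C', 'D'],
                     '2010': [70.0, 60.0, 80.0, 50.0]}).set_index('Country')
regions = pd.DataFrame({'Country': ['A', 'B', 'C', 'D'],
                        'region': ['East', 'West', 'East', 'West']}).set_index('Country')

# Group `life` by the region column of another DataFrame (aligned on index)
life_by_region = life.groupby(regions['region'])
mean_2010 = life_by_region['2010'].mean()
print(mean_2010)
```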
- Group titanic by 'pclass' and save the result as by_class.
- Select the 'age' and 'fare' columns from by_class and save the result as by_class_sub.
- Aggregate by_class_sub using 'max' and 'median'. You'll have to pass 'max' and 'median' in the form of a list to .agg().
- Use .loc[] to print all of the rows and the column specification ('age','max'). This has been done for you.
- Use .loc[] to print all of the rows and the column specification ('fare','median').
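Passing a list to `.agg()` produces MultiIndex columns, one (column, aggregation) pair each. A sketch with hypothetical titanic rows:

```python
import pandas as pd

# Hypothetical stand-in for the course's `titanic` DataFrame
titanic = pd.DataFrame({
    'pclass': [1, 1, 2, 2, 3],
    'age':    [38.0, 35.0, 27.0, 14.0, 22.0],
    'fare':   [71.0, 53.0, 13.0, 31.0, 7.0],
})

by_class = titanic.groupby('pclass')
by_class_sub = by_class[['age', 'fare']]

# A list of aggregations yields columns like ('age', 'max'), ('fare', 'median')
aggregated = by_class_sub.agg(['max', 'median'])
print(aggregated.loc[:, ('age', 'max')])
print(aggregated.loc[:, ('fare', 'median')])
```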
- Read 'gapminder.csv' into a DataFrame with index_col=['Year','region','Country']. Sort the index.
- Group gapminder with a level of ['Year','region'] using its level parameter. Save the result as by_year_region.
- Define the function spread which returns the maximum and minimum of an input series. This has been done for you.
- Create a dictionary with 'population':'sum', 'child_mortality':'mean' and 'gdp':spread as aggregator. This has been done for you.
- Use the aggregator dictionary to aggregate by_year_region. Save the result as aggregated.
- Print the last 6 entries of aggregated. This has been done for you, so hit 'Submit Answer' to view the result.
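'gapminder.csv' isn't available here, so this sketch uses inline hypothetical rows; `spread` is written as max-minus-min, which is one plausible reading of "the maximum and minimum of an input series":

```python
import pandas as pd

def spread(series):
    # Difference between the max and min of a series (assumed definition)
    return series.max() - series.min()

# Hypothetical stand-in for the gapminder data
gapminder = pd.DataFrame({
    'Year':   [2010, 2010, 2010, 2010],
    'region': ['East', 'East', 'West', 'West'],
    'population': [100, 200, 50, 150],
    'child_mortality': [10.0, 20.0, 30.0, 40.0],
    'gdp':    [1000, 3000, 2000, 5000],
}).set_index(['Year', 'region']).sort_index()

# Group on index levels with the `level` parameter
by_year_region = gapminder.groupby(level=['Year', 'region'])

# Different aggregations per column via a dict, including a custom function
aggregator = {'population': 'sum',
              'child_mortality': 'mean',
              'gdp': spread}
aggregated = by_year_region.agg(aggregator)
print(aggregated.tail(6))
```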
- Read 'sales.csv' into a DataFrame with index_col='Date' and parse_dates=True.
- Create a groupby object with sales.index.strftime('%a') as input and assign it to by_day.
- Aggregate the 'Units' column of by_day with the .sum() method. Save the result as units_sum.
- Print units_sum. This has been done for you, so hit 'Submit Answer' to see the result.
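Grouping by a derived key: `strftime('%a')` maps each date in the index to its weekday abbreviation. A sketch with hypothetical dates in place of 'sales.csv':

```python
import pandas as pd

# Hypothetical stand-in for sales.csv (2015-02-02 is a Monday)
sales = pd.DataFrame(
    {'Units': [3, 4, 2, 5]},
    index=pd.to_datetime(['2015-02-02', '2015-02-03',
                          '2015-02-09', '2015-02-10']))

# Group by weekday name derived from the DatetimeIndex
by_day = sales.groupby(sales.index.strftime('%a'))
units_sum = by_day['Units'].sum()
print(units_sum)
```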
- Import zscore from scipy.stats.
- Group gapminder_2010 by 'region' and transform the ['life','fertility'] columns by zscore.
- Construct a boolean Series of the bitwise or between standardized['life'] < -3 and standardized['fertility'] > 3.
- Filter gapminder_2010 using .loc[] and the outliers Boolean Series. Save the result as gm_outliers.
- Print gm_outliers. This has been done for you, so hit 'Submit Answer' to see the results.
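A sketch of the group-wise standardization. Two deviations from the course, both assumptions for this toy: the z-score is written inline (with ddof=0 to match `scipy.stats.zscore`) rather than imported from scipy, and the cutoff is relaxed from ±3 to ±1.3 because with only three rows per region a z-score can never reach ±3:

```python
import pandas as pd

# Hypothetical stand-in for `gapminder_2010`
gapminder_2010 = pd.DataFrame({
    'region':    ['East', 'East', 'East', 'West', 'West', 'West'],
    'life':      [70.0, 72.0, 40.0, 68.0, 70.0, 72.0],
    'fertility': [2.0, 2.2, 6.0, 1.8, 2.0, 2.2],
})

# Inline z-score; ddof=0 matches scipy.stats.zscore's default
zscore = lambda s: (s - s.mean()) / s.std(ddof=0)

# .transform() standardizes each column within its own region
standardized = gapminder_2010.groupby('region')[['life', 'fertility']].transform(zscore)

# Flag rows that are extreme relative to their own region
outliers = (standardized['life'] < -1.3) | (standardized['fertility'] > 1.3)
gm_outliers = gapminder_2010.loc[outliers]
print(gm_outliers)
```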
- Group titanic by 'sex' and 'pclass'. Save the result as by_sex_class.
- Write a function called impute_median() that fills missing values with the median of a series. This has been done for you.
- Call .transform() with impute_median on the 'age' column of by_sex_class.
- Print the output of titanic.tail(10). This has been done for you - hit 'Submit Answer' to see how the missing values have now been imputed.
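Group-wise imputation as a sketch, with a tiny hypothetical titanic subset so each group has its own median:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the course's `titanic` DataFrame
titanic = pd.DataFrame({
    'sex':    ['male', 'male', 'male', 'female', 'female', 'female'],
    'pclass': [1, 1, 1, 3, 3, 3],
    'age':    [20.0, 40.0, np.nan, 10.0, 30.0, np.nan],
})

def impute_median(series):
    # Fill missing values with the median of the series
    return series.fillna(series.median())

# Each missing age is filled with the median of its (sex, pclass) group
by_sex_class = titanic.groupby(['sex', 'pclass'])
titanic['age'] = by_sex_class['age'].transform(impute_median)
print(titanic.tail(10))
```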
- Group gapminder_2010 by 'region'. Save the result as regional.
- Apply the provided disparity function on regional, and save the result as reg_disp.
- Use .loc[] to select ['United States','United Kingdom','China'] from reg_disp and print the results.
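The course's `disparity` function isn't shown in these notes, so the version below is entirely hypothetical; it just illustrates `.apply()` returning a DataFrame per group:

```python
import pandas as pd

# Hypothetical stand-in for `gapminder_2010`, indexed by country
gapminder_2010 = pd.DataFrame({
    'region': ['America', 'America', 'Asia'],
    'gdp':    [40000.0, 10000.0, 30000.0],
}, index=['United States', 'Mexico', 'China'])

def disparity(group):
    # Hypothetical: gap between each country's gdp and the best in its region
    return pd.DataFrame({'regional gdp gap': group['gdp'].max() - group['gdp']})

# group_keys=False keeps the original country index on the result,
# so .loc[] with country names works afterwards
regional = gapminder_2010.groupby('region', group_keys=False)
reg_disp = regional.apply(disparity)
print(reg_disp.loc[['United States', 'China']])
```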
- Group sales by 'Company'. Save the result as by_company.
- Compute and print the sum of the 'Units' column of by_company.
- Call .filter() on by_company with lambda g: g['Units'].sum() > 35 as input and print the result.
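`.filter()` keeps or drops whole groups based on a predicate over each group. A sketch with hypothetical sales rows:

```python
import pandas as pd

# Hypothetical stand-in for the course's `sales` DataFrame
sales = pd.DataFrame({
    'Company': ['Acme', 'Acme', 'Hooli', 'Hooli', 'Initech'],
    'Units':   [20, 30, 10, 15, 40],
})

by_company = sales.groupby('Company')
print(by_company['Units'].sum())

# Keep only the rows of companies whose total Units exceed 35
big_sellers = by_company.filter(lambda g: g['Units'].sum() > 35)
print(big_sellers)
```

Unlike `.agg()`, the result has the original row granularity: rows of passing groups survive unchanged, rows of failing groups are dropped.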
- Create a Boolean Series of titanic['age'] < 10 and call .map with {True:'under 10', False:'over 10'}.
- Group titanic by the under10 Series and then compute and print the mean of the 'survived' column.
- Group titanic by the under10 Series as well as the 'pclass' column and then compute and print the mean of the 'survived' column.
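A sketch of grouping by a mapped Boolean Series, with a hypothetical titanic subset:

```python
import pandas as pd

# Hypothetical stand-in for the course's `titanic` DataFrame
titanic = pd.DataFrame({
    'age':      [5, 30, 8, 40],
    'pclass':   [1, 1, 3, 3],
    'survived': [1, 0, 1, 0],
})

# Map the Boolean age test to readable labels
under10 = (titanic['age'] < 10).map({True: 'under 10', False: 'over 10'})

# Any aligned Series can serve as a groupby key...
survived_by_age = titanic.groupby(under10)['survived'].mean()
print(survived_by_age)

# ...and can be combined with a regular column
survived_by_age_class = titanic.groupby([under10, titanic['pclass']])['survived'].mean()
print(survived_by_age_class)
```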
- Extract the 'NOC' column from the DataFrame medals and assign the result to country_names. Notice that this Series has repeated entries for every medal (of any type) a country has won in any Edition of the Olympics.
- Create a Series medal_counts by applying .value_counts() to the Series country_names.
- Print the top 15 countries ranked by total number of medals won. This has been done for you, so hit 'Submit Answer' to see the result.
- Construct a pivot table counted from the DataFrame medals, aggregating by count. Use 'NOC' as the index, 'Athlete' for the values, and 'Medal' for the columns.
- Modify the DataFrame counted by adding a column counted['totals']. The new column 'totals' should contain the result of taking the sum along the columns (i.e., use .sum(axis='columns')).
- Overwrite the DataFrame counted by sorting it with the .sort_values() method. Specify the keyword argument ascending=False.
- Print the first 15 rows of counted using .head(15). This has been done for you, so hit 'Submit Answer' to see the result.
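A sketch covering the two medals exercises above. The Olympic `medals` DataFrame isn't reproduced in these notes, so the rows below are a hypothetical stand-in:

```python
import pandas as pd

# Hypothetical stand-in for the Olympic `medals` DataFrame
medals = pd.DataFrame({
    'NOC':     ['USA', 'USA', 'USA', 'URS', 'URS', 'GBR'],
    'Athlete': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Medal':   ['Gold', 'Gold', 'Silver', 'Gold', 'Bronze', 'Bronze'],
})

# Total medals per country, directly from the repeated NOC entries
medal_counts = medals['NOC'].value_counts()
print(medal_counts.head(15))

# Medal-type breakdown per country, plus a totals column
counted = medals.pivot_table(index='NOC', values='Athlete',
                             columns='Medal', aggfunc='count')
counted['totals'] = counted.sum(axis='columns')
counted = counted.sort_values('totals', ascending=False)
print(counted.head(15))
```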
- Group medals by 'NOC'.
- Compute the number of distinct sports in which each country won medals. To do this, select the 'Sport' column from country_grouped and apply .nunique().
- Sort Nsports in descending order with .sort_values() and ascending=False.
- Print the first 15 rows of Nsports. This has been done for you, so hit 'Submit Answer' to see the result.
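The nunique steps above, sketched with a hypothetical medals subset:

```python
import pandas as pd

# Hypothetical stand-in for the Olympic `medals` DataFrame
medals = pd.DataFrame({
    'NOC':   ['USA', 'USA', 'USA', 'URS', 'URS'],
    'Sport': ['Swimming', 'Athletics', 'Swimming', 'Hockey', 'Hockey'],
})

# Distinct sports per country, largest first
country_grouped = medals.groupby('NOC')
Nsports = country_grouped['Sport'].nunique().sort_values(ascending=False)
print(Nsports.head(15))
```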
- Create a Boolean Series called during_cold_war by extracting all rows from medals for which the 'Edition' is >= 1952 and <= 1988.
- Create a Boolean Series called is_usa_urs by extracting rows from medals for which 'NOC' is either 'USA' or 'URS'.
- Filter the medals DataFrame using during_cold_war and is_usa_urs to create a new DataFrame called cold_war_medals.
- Group cold_war_medals by 'NOC'.
- Create a Series Nsports from country_grouped using indexing & chained methods:
  - Extract the column 'Sport'.
  - Use .nunique() to get the number of unique elements in each group.
  - Apply .sort_values(ascending=False) to rearrange the Series.
- Print the final Series Nsports. This has been done for you, so hit 'Submit Answer' to see the result!
- Construct medals_won_by_country using medals.pivot_table().
  - The index should be the years ('Edition') & the columns should be country ('NOC').
  - The values should be 'Athlete' (which captures every medal regardless of kind) & the aggregation method should be 'count' (which captures the total number of medals won).
- Create cold_war_usa_usr_medals by slicing the pivot table medals_won_by_country. Your slice should contain the editions from years 1952:1988 and only the columns 'USA' & 'URS' from the pivot table.
- Create the Series most_medals by applying the .idxmax() method to cold_war_usa_usr_medals. Be sure to use axis='columns'.
- Print the result of applying .value_counts() to most_medals. The result reported gives the number of times each of the USA or the USSR won more Olympic medals in total than the other between 1952 and 1988.
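The Cold-War exercises above, combined into one sketch over a hypothetical handful of medal rows:

```python
import pandas as pd

# Hypothetical stand-in for the Olympic `medals` DataFrame
medals = pd.DataFrame({
    'Edition': [1948, 1952, 1952, 1952, 1988, 1988, 1992],
    'NOC':     ['USA', 'USA', 'USA', 'URS', 'URS', 'URS', 'USA'],
    'Athlete': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
})

# Boolean masks select the Cold-War editions and the two superpowers
during_cold_war = (medals['Edition'] >= 1952) & (medals['Edition'] <= 1988)
is_usa_urs = medals['NOC'].isin(['USA', 'URS'])
cold_war_medals = medals.loc[during_cold_war & is_usa_urs]

# Medals per edition per country, then the winner of each edition
medals_won = cold_war_medals.pivot_table(index='Edition', columns='NOC',
                                         values='Athlete', aggfunc='count')
most_medals = medals_won.idxmax(axis='columns')
print(most_medals.value_counts())
```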
- Create a DataFrame usa with data only for the USA.
- Group usa such that ['Edition', 'Medal'] is the index. Aggregate the count over 'Athlete'.
- Use .unstack() with level='Medal' to reshape the DataFrame usa_medals_by_year.
- Construct a line plot from the final DataFrame usa_medals_by_year. This has been done for you, so hit 'Submit Answer' to see the plot!
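The groupby/unstack reshaping above, sketched on hypothetical rows; the plot call itself is left as a comment so the sketch runs headless:

```python
import pandas as pd

# Hypothetical stand-in for the Olympic `medals` DataFrame
medals = pd.DataFrame({
    'NOC':     ['USA', 'USA', 'USA', 'FRA'],
    'Edition': [1984, 1984, 1988, 1984],
    'Medal':   ['Gold', 'Silver', 'Gold', 'Gold'],
    'Athlete': ['a', 'b', 'c', 'd'],
})

# Keep only the USA, count medals per (Edition, Medal), then
# pivot the 'Medal' level into columns with unstack
usa = medals[medals['NOC'] == 'USA']
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')
print(usa_medals_by_year)
# For the course plots: usa_medals_by_year.plot() or .plot.area()
```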
- Redefine the 'Medal' column of the DataFrame medals as an ordered categorical. To do this, use pd.Categorical() with three keyword arguments:
  - values=medals.Medal.
  - categories=['Bronze', 'Silver', 'Gold'].
  - ordered=True.
  - After this, you can verify that the type has changed using medals.info().
- Plot the final DataFrame usa_medals_by_year as an area plot. This has been done for you, so hit 'Submit Answer' to see how the plot has changed!
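An ordered categorical makes Bronze < Silver < Gold explicit, so sorting and comparisons follow medal rank rather than alphabetical order. A minimal sketch:

```python
import pandas as pd

# Hypothetical stand-in for the 'Medal' column of `medals`
medals = pd.DataFrame({'Medal': ['Gold', 'Bronze', 'Silver', 'Gold']})

# Redefine 'Medal' as an ordered categorical: Bronze < Silver < Gold
medals['Medal'] = pd.Categorical(values=medals['Medal'],
                                 categories=['Bronze', 'Silver', 'Gold'],
                                 ordered=True)
print(medals['Medal'].dtype)

# Sorting now respects medal rank, not the alphabet
print(medals.sort_values('Medal'))
```

This ordering is also what makes the area plot stack the medal types in rank order rather than alphabetically.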