Manipulating DataFrames with pandas (course notes)


More operations on DataFrames


  • Slice the row labels 'Perry' to 'Potter' and assign the output to p_counties.
  • Print the p_counties DataFrame. This has been done for you.
  • Slice the row labels 'Potter' to 'Perry' in reverse order. To do this for hypothetical row labels 'a' and 'b', you could use a step size of -1 like so: df.loc['b':'a':-1].
  • Print the p_counties_rev DataFrame. This has also been done for you, so hit 'Submit Answer' to see the result of your slicing!

# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election.loc['Perry':'Potter']


# Print the p_counties DataFrame
print(p_counties)


# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev = election.loc['Potter':'Perry':-1]


# Print the p_counties_rev DataFrame
print(p_counties_rev)
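
The election DataFrame above is supplied by the exercise environment. As a minimal, self-contained sketch (the county names and vote shares below are made up), label-based slicing with .loc includes both endpoints and can run in reverse with a step of -1:

import pandas as pd

# Hypothetical stand-in for the election DataFrame
demo = pd.DataFrame({'Obama': [60.0, 40.0, 35.0], 'Romney': [38.0, 58.0, 63.0]},
                    index=['Adams', 'Perry', 'Potter'])

# Label slices include both endpoints
print(demo.loc['Perry':'Potter'])

# The same slice in reverse, with a step of -1
print(demo.loc['Potter':'Perry':-1])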


Column indexing and selecting data from a pandas DataFrame

  • Slice the columns from the starting column to 'Obama' and assign the result to left_columns
  • Slice the columns from 'Obama' to 'winner' and assign the result to middle_columns
  • Slice the columns from 'Romney' to the end and assign the result to right_columns
  • The code to print the first 5 rows of left_columns, middle_columns, and right_columns has been written, so hit 'Submit Answer' to see the results!

# Slice the columns from the starting column to 'Obama': left_columns
left_columns = election.loc[:,:'Obama']


# Print the output of left_columns.head()
print(left_columns.head())


# Slice the columns from 'Obama' to 'winner': middle_columns
middle_columns = election.loc[:,'Obama':'winner']


# Print the output of middle_columns.head()
print(middle_columns.head())


# Slice the columns from 'Romney' to the end: right_columns
right_columns = election.loc[:,'Romney':]


# Print the output of right_columns.head()
print(right_columns.head())

  • Create the list of row labels ['Philadelphia', 'Centre', 'Fulton'] and assign it to rows.
  • Create the list of column labels ['winner', 'Obama', 'Romney'] and assign it to cols.
  • Create a new DataFrame by selecting with rows and cols in .loc[] and assign it to three_counties.
  • Print the three_counties DataFrame. This has been done for you, so hit 'Submit Answer' to see your new DataFrame.

# Create the list of row labels: rows
rows = ['Philadelphia', 'Centre', 'Fulton']


# Create the list of column labels: cols
cols = ['winner', 'Obama', 'Romney']


# Create the new DataFrame: three_counties
three_counties = election.loc[rows, cols]


# Print the three_counties DataFrame
print(three_counties)

  • Import numpy as np.
  • Create a boolean array for the condition where the 'margin' column is less than 1 and assign it to too_close.
  • Convert the entries in the 'winner' column where the result was too close to call to np.nan.
  • Print the output of election.info(). This has been done for you, so hit 'Submit Answer' to see the results.

# Import numpy
import numpy as np


# Create the boolean array: too_close
too_close = election.margin < 1


# Assign np.nan to the 'winner' column where the results were too close to call
# (use .loc rather than chained indexing so the assignment is applied reliably)
election.loc[too_close, 'winner'] = np.nan


# Print the output of election.info()
print(election.info())

  • Select the 'age' and 'cabin' columns of titanic and create a new DataFrame df.
  • Print the shape of df. This has been done for you.
  • Drop rows in df with how='any' and print the shape.
  • Drop rows in df with how='all' and print the shape.
  • Drop columns from the titanic DataFrame that have more than 1000 missing values by specifying the thresh and axis keyword arguments. Print the output of .info() from this.

# Select the 'age' and 'cabin' columns: df
df = titanic[['age','cabin']]


# Print the shape of df
print(df.shape)


# Drop rows in df with how='any' and print the shape
print(df.dropna(how='any').shape)


# Drop rows in df with how='all' and print the shape
print(df.dropna(how='all').shape)


# Call .dropna() with thresh=1000 and axis='columns' and print the output of .info() from titanic
print(titanic.dropna(thresh=1000, axis='columns').info())
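
Note that thresh counts the minimum number of non-missing values a row or column must have in order to be kept. A minimal sketch with a hypothetical two-column DataFrame:

import numpy as np
import pandas as pd

# Hypothetical frame: column 'b' has only one non-missing value
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [np.nan, np.nan, 4.0]})

# Keep only columns with at least 2 non-missing values ('b' is dropped)
print(demo.dropna(thresh=2, axis='columns'))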

  • Apply the to_celsius function over the ['Mean TemperatureF','Mean Dew PointF'] columns of the weather DataFrame.
  • Reassign the columns of df_celsius to ['Mean TemperatureC','Mean Dew PointC'].
  • Hit 'Submit Answer' to see the new DataFrame with the converted units.

# Write a function to convert degrees Fahrenheit to degrees Celsius: to_celsius
def to_celsius(F):
    return 5/9*(F - 32)


# Apply the function over 'Mean TemperatureF' and 'Mean Dew PointF': df_celsius
df_celsius = weather[['Mean TemperatureF','Mean Dew PointF']].apply(to_celsius)


# Reassign the columns of df_celsius
df_celsius.columns = ['Mean TemperatureC', 'Mean Dew PointC']


# Print the output of df_celsius.head()
print(df_celsius.head())
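
Because DataFrame arithmetic is already element-wise, the same conversion can be written without .apply(). A sketch (the temperature values below are made up; the column names follow the exercise):

import pandas as pd

demo_weather = pd.DataFrame({'Mean TemperatureF': [32, 68], 'Mean Dew PointF': [23, 50]})

# Vectorized conversion, equivalent to .apply(to_celsius)
demo_celsius = 5 / 9 * (demo_weather - 32)
demo_celsius.columns = ['Mean TemperatureC', 'Mean Dew PointC']
print(demo_celsius)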


  • Create a dictionary with the key:value pairs 'Obama':'blue' and 'Romney':'red'.
  • Use the .map() method on the 'winner' column using the red_vs_blue dictionary you created.
  • Print the output of election.head(). This has been done for you, so hit 'Submit Answer' to see the new column!

# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama': 'blue', 'Romney': 'red'}


# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election['winner'].map(red_vs_blue)


# Print the output of election.head()
print(election.head())


  • Import zscore from scipy.stats.
  • Call zscore with election['turnout'] as input.
  • Print the output of type(turnout_zscore). This has been done for you.
  • Assign turnout_zscore to a new column in election as 'turnout_zscore'.
  • Print the output of election.head(). This has been done for you, so hit 'Submit Answer' to view the result.

# Import zscore from scipy.stats
from scipy.stats import zscore 


# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election['turnout'])


# Print the type of turnout_zscore
print(type(turnout_zscore))


# Assign turnout_zscore to a new column: election['turnout_zscore']
election['turnout_zscore'] = turnout_zscore


# Print the output of election.head()
print(election.head())
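
scipy's zscore returns a NumPy array. A roughly equivalent pure-pandas sketch (using the population standard deviation, which zscore uses by default; the turnout values are made up):

import pandas as pd

turnout = pd.Series([68.6, 74.4, 61.0])
turnout_z = (turnout - turnout.mean()) / turnout.std(ddof=0)
print(turnout_z)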


Some operations on the index:

  • Create a list new_idx with the same elements as in sales.index, but with all characters capitalized.
  • Assign new_idx to sales.index.
  • Print the sales dataframe. This has been done for you, so hit 'Submit Answer' to see how the index changed.

# Create the list of new indexes: new_idx
new_idx = [idx.upper() for idx in sales.index]


# Assign new_idx to sales.index
sales.index = new_idx


# Print the sales DataFrame
print(sales)
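
Equivalently, the index can be upper-cased without a list comprehension by using the vectorized .str accessor on the Index. A self-contained sketch (the sales figures below are made up):

import pandas as pd

demo_sales = pd.DataFrame({'eggs': [47, 110], 'salt': [12.0, 50.0]}, index=['Jan', 'Feb'])
demo_sales.index = demo_sales.index.str.upper()
print(demo_sales)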

  • Assign the string 'MONTHS' to sales.index.name to create a name for the index.
  • Print the sales dataframe to see the index name you just created.
  • Now assign the string 'PRODUCTS' to sales.columns.name to give a name to the set of columns.
  • Print the sales dataframe again to see the columns name you just created.
# Assign the string 'MONTHS' to sales.index.name
sales.index.name = 'MONTHS'


# Print the sales DataFrame
print(sales)


# Assign the string 'PRODUCTS' to sales.columns.name 
sales.columns.name = 'PRODUCTS'


# Print the sales dataframe again
print(sales)

  • Create a MultiIndex by setting the index to be the columns ['state', 'month'].
  • Sort the MultiIndex using the .sort_index() method.
  • Print the sales DataFrame. This has been done for you, so hit 'Submit Answer' to verify that indeed you have an index with the fields state and month!

# Set the index to be the columns ['state', 'month']: sales
sales = sales.set_index(['state', 'month'])


# Sort the MultiIndex: sales
sales = sales.sort_index()


# Print the sales DataFrame
print(sales)

  • Set the index of sales to be the column 'state'.
  • Print the sales DataFrame to verify that indeed you have an index with state values.
  • Access the data from 'NY' and print it to verify that you obtain two rows.

# Set the index to the column 'state': sales
sales = sales.set_index(['state'])


# Print the sales DataFrame
print(sales)


# Access the data from 'NY'
print(sales.loc['NY'])

For reference, slicing an inner level of a MultiIndex requires a slice(None) placeholder for the outer level, for example:
stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')), :]
  • Look up data for the New York column ('NY') in month 1.
  • Look up data for the California and Texas columns ('CA', 'TX') in month 2.
  • Look up data for all states in month 2. Use (slice(None), 2) to extract all rows in month 2.
# Look up data for NY in month 1: NY_month1
NY_month1 = sales.loc[("NY", 1), :]

# Look up data for CA and TX in month 2: CA_TX_month2
CA_TX_month2 = sales.loc[(['CA','TX'], 2), :]

# Look up data for all states in month 2: all_month2
all_month2 = sales.loc[(slice(None), 2), :]
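
pd.IndexSlice is a more readable alternative to slice(None) for these lookups. A minimal sketch with a hypothetical two-level (state, month) index:

import pandas as pd

idx = pd.IndexSlice
demo = pd.DataFrame(
    {'eggs': [47, 110, 221, 77]},
    index=pd.MultiIndex.from_tuples(
        [('CA', 1), ('CA', 2), ('NY', 1), ('TX', 2)], names=['state', 'month']))
demo = demo.sort_index()   # MultiIndex slicing requires a sorted index

# All states, month 2 (equivalent to demo.loc[(slice(None), 2), :])
print(demo.loc[idx[:, 2], :])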
Pivot tables
  • Pivot the users DataFrame with the rows indexed by 'weekday', the columns indexed by 'city', and the values populated with 'visitors'.
  • Print the pivoted DataFrame. This has been done for you, so hit 'Submit Answer' to view the result.
# Pivot the users DataFrame: visitors_pivot
visitors_pivot = users.pivot(index='weekday', columns='city', values='visitors')

# Print the pivoted DataFrame
print(visitors_pivot)
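
The users DataFrame is supplied by the exercise environment. A self-contained sketch with made-up numbers shows what .pivot() produces:

import pandas as pd

demo_users = pd.DataFrame({
    'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
    'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
    'visitors': [139, 237, 326, 456],
    'signups': [7, 12, 3, 5]})

# Rows become weekdays, columns become cities, cells hold the visitor counts
print(demo_users.pivot(index='weekday', columns='city', values='visitors'))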
  • Pivot the users DataFrame with the 'signups' indexed by 'weekday' in the rows and 'city' in the columns.
  • Print the new DataFrame. This has been done for you.
  • Pivot the users DataFrame with both 'signups' and 'visitors' pivoted - that is, all the variables. This will happen automatically if you do not specify an argument for the values parameter of .pivot().
  • Print the pivoted DataFrame. This has been done for you, so hit 'Submit Answer' to see the result.
# Pivot users with signups indexed by weekday and city: signups_pivot
signups_pivot = users.pivot(index='weekday', columns='city', values='signups')

# Print signups_pivot
print(signups_pivot)

# Pivot users pivoted by both signups and visitors: pivot
pivot = users.pivot(index='weekday', columns='city')

# Print the pivoted DataFrame
print(pivot)
  • Define a DataFrame byweekday with the 'weekday' level of users unstacked.
  • Print the byweekday DataFrame to see the new data layout. This has been done for you.
  • Stack byweekday by 'weekday' and print it to check if you get the same layout as the original users DataFrame.
# Unstack users by 'weekday': byweekday
byweekday = users.unstack('weekday')

# Print the byweekday DataFrame
print(byweekday)

# Stack byweekday by 'weekday' and print it
print(byweekday.stack(level='weekday'))
  • Define a DataFrame newusers with the 'city' level stacked back into the index of bycity.
  • Swap the levels of the index of newusers.
  • Print newusers and verify that the index is not sorted. This has been done for you.
  • Sort the index of newusers.
  • Print newusers and verify that the index is now sorted. This has been done for you.
  • Assert that newusers equals users. This has been done for you, so hit 'Submit Answer' to see the result.
# Stack 'city' back into the index of bycity: newusers
newusers = bycity.stack(level='city')

# Swap the levels of the index of newusers: newusers
newusers = newusers.swaplevel(0, 1)

# Print newusers and verify that the index is not sorted
print(newusers)

# Sort the index of newusers: newusers
newusers = newusers.sort_index()

# Print newusers and verify that the index is now sorted
print(newusers)

# Verify that the new DataFrame is equal to the original
print(newusers.equals(users))
  • Reset the index of visitors_by_city_weekday with .reset_index().
  • Print visitors_by_city_weekday and verify that you have just a range index, 0, 1, 2, 3. This has been done for you.
  • Melt visitors_by_city_weekday to move the city names from the column labels to values in a single column called city.
  • Print visitors to check that the city values are in a single column now and that the dataframe is longer and skinnier.
# Reset the index: visitors_by_city_weekday
visitors_by_city_weekday = visitors_by_city_weekday.reset_index()

# Print visitors_by_city_weekday
print(visitors_by_city_weekday)

# Melt visitors_by_city_weekday: visitors
visitors = pd.melt(visitors_by_city_weekday, id_vars=['weekday'], value_name='visitors')

# Print visitors
print(visitors)
  • Define a DataFrame skinny where you melt the 'visitors' and 'signups' columns of users into a single column.
  • Print skinny to verify the results. Note the value column that had the cell values in users.
# Melt users: skinny
skinny = pd.melt(users, id_vars=['weekday', 'city'])

# Print skinny
print(skinny)
  • Set the index of users to ['city', 'weekday'].
  • Print the DataFrame users_idx to see the new index.
  • Obtain the key-value pairs corresponding to visitors and signups by melting users_idx with the keyword argument col_level=0.
# Set the new index: users_idx
users_idx = users.set_index(['city', 'weekday'])

# Print the users_idx DataFrame
print(users_idx)

# Obtain the key-value pairs: kv_pairs
kv_pairs = pd.melt(users_idx, col_level=0)

# Print the key-value pairs
print(kv_pairs)
Group-by-style aggregation with pivot tables
  • Define a DataFrame count_by_weekday1 that shows the count of each column with the parameter aggfunc='count'. The index here is 'weekday'.
  • Print count_by_weekday1. This has been done for you.
  • Replace aggfunc='count' with aggfunc=len and verify you obtain the same result.
# Use a pivot table to display the count of each column: count_by_weekday1
count_by_weekday1 = users.pivot_table(index='weekday', aggfunc='count')

# Print count_by_weekday1
print(count_by_weekday1)

# Replace aggfunc='count' with aggfunc=len: count_by_weekday2
count_by_weekday2 = users.pivot_table(index='weekday', aggfunc=len)

# Verify that the same result is obtained
print('==========================================')
print(count_by_weekday1.equals(count_by_weekday2))
  • Define a DataFrame signups_and_visitors that shows the breakdown of signups and visitors by day, as well as the totals.
    • You will need to use aggfunc=sum to do this.
  • Print signups_and_visitors. This has been done for you.
  • Now pass the additional argument margins=True to the .pivot_table() method to obtain the totals.
  • Print signups_and_visitors_total. This has been done for you, so hit 'Submit Answer' to see the result.
# Create the DataFrame with the appropriate pivot table: signups_and_visitors
signups_and_visitors = users.pivot_table(index='weekday', aggfunc=sum)

# Print signups_and_visitors
print(signups_and_visitors)

# Add in the margins: signups_and_visitors_total
signups_and_visitors_total = users.pivot_table(index='weekday', aggfunc=sum, margins=True)

# Print signups_and_visitors_total
print(signups_and_visitors_total)
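
A self-contained sketch (same made-up users-style data as above) showing how margins=True appends an 'All' row with the grand totals:

import pandas as pd

demo_users = pd.DataFrame({
    'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
    'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
    'visitors': [139, 237, 326, 456],
    'signups': [7, 12, 3, 5]})

# Totals per weekday, plus an 'All' row with the overall totals
print(demo_users.pivot_table(index='weekday', values=['signups', 'visitors'],
                             aggfunc='sum', margins=True))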
  • Group by the 'pclass' column and save the result as by_class.
  • Aggregate the 'survived' column of by_class using .count(). Save the result as count_by_class.
  • Print count_by_class. This has been done for you.
  • Group titanic by the 'embarked' and 'pclass' columns. Save the result as by_mult.
  • Aggregate the 'survived' column of by_mult using .count(). Save the result as count_mult.
  • Print count_mult. This has been done for you, so hit 'Submit Answer' to view the result.
# Group titanic by 'pclass'
by_class = titanic.groupby(['pclass'])

# Aggregate 'survived' column of by_class by count
count_by_class = by_class['survived'].count()

# Print count_by_class
print(count_by_class)

# Group titanic by 'embarked' and 'pclass'
by_mult = titanic.groupby(['embarked', 'pclass'])

# Aggregate 'survived' column of by_mult by count
count_mult = by_mult['survived'].count()

# Print count_mult
print(count_mult)
  • Read life_fname into a DataFrame called life and set the index to 'Country'.
  • Read regions_fname into a DataFrame called regions and set the index to 'Country'.
  • Group life by the region column of regions and store the result in life_by_region.
  • Print the mean over the 2010 column of life_by_region.
# Read life_fname into a DataFrame: life
life = pd.read_csv(life_fname, index_col='Country')

# Read regions_fname into a DataFrame: regions
regions = pd.read_csv(regions_fname, index_col='Country')

# Group life by regions['region']: life_by_region
life_by_region = life.groupby(regions['region'])

# Print the mean over the '2010' column of life_by_region
print(life_by_region['2010'].mean())
  • Group titanic by 'pclass' and save the result as by_class.
  • Select the 'age' and 'fare' columns from by_class and save the result as by_class_sub.
  • Aggregate by_class_sub using 'max' and 'median'. You'll have to pass 'max' and 'median' in the form of a list to .agg().
  • Use .loc[] to print all of the rows and the column specification ('age','max'). This has been done for you.
  • Use .loc[] to print all of the rows and the column specification ('fare','median').
# Group titanic by 'pclass': by_class
by_class = titanic.groupby(['pclass'])

# Select 'age' and 'fare'
by_class_sub = by_class[['age','fare']]

# Aggregate by_class_sub by 'max' and 'median': aggregated
aggregated = by_class_sub.agg(['max', 'median'])

# Print the maximum age in each class
print(aggregated.loc[:, ('age','max')])

# Print the median fare in each class
print(aggregated.loc[:, ('fare', 'median')])
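
Passing a list of functions to .agg() yields a MultiIndex on the columns, which is why the result is read with tuples like ('age','max'). Since pandas 0.25, named aggregation gives flat column names instead; a sketch with made-up rows:

import pandas as pd

demo = pd.DataFrame({'pclass': [1, 1, 2, 2],
                     'age': [38.0, 29.0, 21.0, 45.0],
                     'fare': [71.3, 26.6, 13.0, 26.0]})

# One flat column per (input column, function) pair
agg_named = demo.groupby('pclass').agg(age_max=('age', 'max'),
                                       fare_median=('fare', 'median'))
print(agg_named)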
  • Read 'gapminder.csv' into a DataFrame with index_col=['Year','region','Country']. Sort the index.
  • Group gapminder with a level of ['Year','region'] using its level parameter. Save the result as by_year_region.
  • Define the function spread which returns the maximum and minimum of an input series. This has been done for you.
  • Create a dictionary with 'population':'sum', 'child_mortality':'mean', and 'gdp':spread as aggregator. This has been done for you.
  • Use the aggregator dictionary to aggregate by_year_region. Save the result as aggregated.
  • Print the last 6 entries of aggregated. This has been done for you, so hit 'Submit Answer' to view the result.
# Read the CSV file into a DataFrame and sort the index: gapminder
gapminder = pd.read_csv('gapminder.csv', index_col=['Year', 'region', 'Country']).sort_index()

# Group gapminder by 'Year' and 'region': by_year_region
by_year_region = gapminder.groupby(level=['Year', 'region'])

# Define the function to compute spread: spread
def spread(series):
    return series.max() - series.min()

# Create the dictionary: aggregator
aggregator = {'population':'sum', 'child_mortality':'mean', 'gdp':spread}

# Aggregate by_year_region using the dictionary: aggregated
aggregated = by_year_region.agg(aggregator)

# Print the last 6 entries of aggregated
print(aggregated.tail(6))
  • Read 'sales.csv' into a DataFrame with index_col='Date' and parse_dates=True.
  • Create a groupby object with sales.index.strftime('%a') as input and assign it to by_day.
  • Aggregate the 'Units' column of by_day with the .sum() method. Save the result as units_sum.
  • Print units_sum. This has been done for you, so hit 'Submit Answer' to see the result.
# Read file: sales
sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)

# Create a groupby object: by_day
by_day = sales.groupby(sales.index.strftime('%a'))

# Create sum: units_sum
units_sum = by_day['Units'].sum()

# Print units_sum
print(units_sum)
The transform function and finding outliers
  • Import zscore from scipy.stats.
  • Group gapminder_2010 by 'region' and transform the ['life','fertility'] columns by zscore.
  • Construct a boolean Series of the bitwise or between standardized['life'] < -3 and standardized['fertility'] > 3.
  • Filter gapminder_2010 using .loc[] and the outliers Boolean Series. Save the result as gm_outliers.
  • Print gm_outliers. This has been done for you, so hit 'Submit Answer' to see the results.
# Import zscore
from scipy.stats import zscore

# Group gapminder_2010: standardized
standardized = gapminder_2010.groupby(['region'])[['life', 'fertility']].transform(zscore)

# Construct a Boolean Series to identify outliers: outliers
outliers = (standardized['life'] < -3) | (standardized['fertility'] > 3)

# Filter gapminder_2010 by the outliers: gm_outliers
gm_outliers = gapminder_2010.loc[outliers]

# Print gm_outliers
print(gm_outliers)
  • Group titanic by 'sex' and 'pclass'. Save the result as by_sex_class.
  • Write a function called impute_median() that fills missing values with the median of a series. This has been done for you.
  • Call .transform() with impute_median on the 'age' column of by_sex_class.
  • Print the output of titanic.tail(10). This has been done for you - hit 'Submit Answer' to see how the missing values have now been imputed.
# Create a groupby object: by_sex_class
by_sex_class = titanic.groupby(['sex', 'pclass'])

# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())

# Impute age and assign to titanic['age']
titanic['age'] = by_sex_class['age'].transform(impute_median)

# Print the output of titanic.tail(10)
print(titanic.tail(10))
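
.transform() returns a result aligned to the original index, which is why the imputed ages slot straight back into the titanic DataFrame. A self-contained sketch with made-up ages:

import numpy as np
import pandas as pd

demo = pd.DataFrame({'sex': ['female', 'female', 'male', 'male'],
                     'age': [30.0, np.nan, 25.0, np.nan]})

# Each NaN is replaced by the median age of that row's own group
demo['age'] = demo.groupby('sex')['age'].transform(lambda s: s.fillna(s.median()))
print(demo)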
  • Group gapminder_2010 by 'region'. Save the result as regional.
  • Apply the provided disparity function on regional, and save the result as reg_disp.
  • Use .loc[] to select ['United States','United Kingdom','China'] from reg_disp and print the results.
# Group gapminder_2010 by 'region': regional
regional = gapminder_2010.groupby(['region'])

# Apply the disparity function on regional: reg_disp
reg_disp = regional.apply(disparity)

# Print the disparity of 'United States', 'United Kingdom', and 'China'
print(reg_disp.loc[['United States','United Kingdom','China']])
  • Group sales by 'Company'. Save the result as by_company.
  • Compute and print the sum of the 'Units' column of by_company.
  • Call .filter() on by_company with lambda g:g['Units'].sum() > 35 as input and print the result.
# Read the CSV file into a DataFrame: sales
sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)

# Group sales by 'Company': by_company
by_company = sales.groupby(['Company'])

# Compute the sum of the 'Units' of by_company: by_com_sum
by_com_sum = by_company['Units'].sum()
print(by_com_sum)

# Filter 'Units' where the sum is > 35: by_com_filt
by_com_filt = by_company.filter(lambda g: g['Units'].sum() > 35)
print(by_com_filt)
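
.filter() keeps or drops entire groups based on a condition evaluated on each group. A minimal sketch with made-up companies and unit counts:

import pandas as pd

demo_sales = pd.DataFrame({'Company': ['Acme', 'Acme', 'Hooli', 'Hooli'],
                           'Units': [20, 18, 10, 12]})

# Keep only rows belonging to companies whose total Units exceed 35
print(demo_sales.groupby('Company').filter(lambda g: g['Units'].sum() > 35))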
  • Create a Boolean Series of titanic['age'] < 10 and call .map with {True:'under 10', False:'over 10'}.
  • Group titanic by the under10 Series and then compute and print the mean of the 'survived' column.
  • Group titanic by the under10 Series as well as the 'pclass' column and then compute and print the mean of the 'survived' column.
# Create the Boolean Series: under10
under10 = (titanic['age'] < 10).map({True:'under 10', False:'over 10'})

# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)['survived'].mean()
print(survived_mean_1)

# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10, 'pclass'])['survived'].mean()
print(survived_mean_2)
A worked example: exploring and manipulating a dataset
  • Extract the 'NOC' column from the DataFrame medals and assign the result to country_names. Notice that this Series has repeated entries for every medal (of any type) a country has won in any Edition of the Olympics.
  • Create a Series medal_counts by applying .value_counts() to the Series country_names.
  • Print the top 15 countries ranked by total number of medals won. This has been done for you, so hit 'Submit Answer' to see the result.
# Select the 'NOC' column of medals: country_names
country_names = medals['NOC']

# Count the number of medals won by each country: medal_counts
medal_counts = country_names.value_counts()

# Print top 15 countries ranked by medals
print(medal_counts.head(15))
  • Construct a pivot table counted from the DataFrame medals aggregating by count. Use 'NOC' as the index, 'Athlete' for the values, and 'Medal' for the columns.
  • Modify the DataFrame counted by adding a column counted['totals']. The new column 'totals' should contain the result of taking the sum along the columns (i.e., use .sum(axis='columns')).
  • Overwrite the DataFrame counted by sorting it with the .sort_values() method. Specify the keyword argument ascending=False.
  • Print the first 15 rows of counted using .head(15). This has been done for you, so hit 'Submit Answer' to see the result.
# Construct the pivot table: counted
counted = medals.pivot_table(index='NOC', values='Athlete', columns='Medal', aggfunc='count')

# Create the new column: counted['totals']
counted['totals'] = counted.sum(axis='columns')

# Sort counted by the 'totals' column
counted = counted.sort_values(['totals'], ascending=False)

# Print the top 15 rows of counted
print(counted.head(15))
  • Group medals by 'NOC'.
  • Compute the number of distinct sports in which each country won medals. To do this, select the 'Sport' column from country_grouped and apply .nunique().
  • Sort Nsports in descending order with .sort_values() and ascending=False.
  • Print the first 15 rows of Nsports. This has been done for you, so hit 'Submit Answer' to see the result.
# Group medals by 'NOC': country_grouped
country_grouped = medals.groupby('NOC')

# Compute the number of distinct sports in which each country won medals: Nsports
Nsports = country_grouped['Sport'].nunique()

# Sort the values of Nsports in descending order
Nsports = Nsports.sort_values(ascending=False)

# Print the top 15 rows of Nsports
print(Nsports.head(15))
  • Create a Boolean Series called during_cold_war by extracting all rows from medals for which the 'Edition' is >= 1952 and <= 1988.
  • Create a Boolean Series called is_usa_urs by extracting rows from medals for which 'NOC' is either 'USA' or 'URS'.
  • Filter the medals DataFrame using during_cold_war and is_usa_urs to create a new DataFrame called cold_war_medals.
  • Group cold_war_medals by 'NOC'.
  • Create a Series Nsports from country_grouped using indexing & chained methods:
    • Extract the column 'Sport'.
    • Use .nunique() to get the number of unique elements in each group;
    • Apply .sort_values(ascending=False) to rearrange the Series.
  • Print the final Series Nsports. This has been done for you, so hit 'Submit Answer' to see the result!
# Extract all rows for which the 'Edition' is between 1952 & 1988: during_cold_war
during_cold_war = (medals.Edition >= 1952) & (medals.Edition <= 1988)

# Extract rows for which 'NOC' is either 'USA' or 'URS': is_usa_urs
is_usa_urs = medals.NOC.isin(['USA', 'URS'])

# Use during_cold_war and is_usa_urs to create the DataFrame: cold_war_medals
cold_war_medals = medals.loc[during_cold_war & is_usa_urs]

# Group cold_war_medals by 'NOC'
country_grouped = cold_war_medals.groupby(['NOC'])

# Create Nsports
Nsports = country_grouped['Sport'].nunique().sort_values(ascending=False)

# Print Nsports
print(Nsports)
  • Construct medals_won_by_country using medals.pivot_table().
    • The index should be the years ('Edition') and the columns should be the countries ('NOC').
    • The values should be 'Athlete' (which captures every medal regardless of kind) and the aggregation method should be 'count' (which captures the total number of medals won).
  • Create cold_war_usa_usr_medals by slicing the pivot table medals_won_by_country. Your slice should contain the editions from years 1952:1988 and only the columns 'USA' & 'URS' from the pivot table.
  • Create the Series most_medals by applying the .idxmax() method to cold_war_usa_usr_medals. Be sure to use axis='columns'.
  • Print the result of applying .value_counts() to most_medals. The result reported gives the number of times each of the USA or the USSR won more Olympic medals in total than the other between 1952 and 1988.
# Create the pivot table: medals_won_by_country
medals_won_by_country = medals.pivot_table(index='Edition', columns='NOC', values='Athlete', aggfunc='count')

# Slice medals_won_by_country: cold_war_usa_usr_medals
cold_war_usa_usr_medals = medals_won_by_country.loc[1952:1988, ['USA','URS']]

# Create most_medals
most_medals = cold_war_usa_usr_medals.idxmax(axis='columns')

# Print most_medals.value_counts()
print(most_medals.value_counts())
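
.idxmax(axis='columns') returns, for each row, the label of the column holding that row's maximum. A minimal sketch with made-up medal totals:

import pandas as pd

demo = pd.DataFrame({'USA': [44, 30], 'URS': [22, 50]}, index=[1952, 1956])

# For each year, which column has the larger count? -> 1952: USA, 1956: URS
print(demo.idxmax(axis='columns'))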
  • Create a DataFrame usa with data only for the USA.
  • Group usa such that ['Edition', 'Medal'] is the index. Aggregate the count over 'Athlete'.
  • Use .unstack() with level='Medal' to reshape the DataFrame usa_medals_by_year.
  • Construct a line plot from the final DataFrame usa_medals_by_year. This has been done for you, so hit 'Submit Answer' to see the plot!
# Import pyplot (already available as plt in the exercise environment)
import matplotlib.pyplot as plt

# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by ['Edition', 'Medal'] and aggregate over 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')

# Plot the DataFrame usa_medals_by_year
usa_medals_by_year.plot()
plt.show()
  • Redefine the 'Medal' column of the DataFrame medals as an ordered categorical. To do this, use pd.Categorical() with three keyword arguments:
    • values = medals.Medal.
    • categories=['Bronze', 'Silver', 'Gold'].
    • ordered=True.
    • After this, you can verify that the type has changed using medals.info().
  • Plot the final DataFrame usa_medals_by_year as an area plot. This has been done for you, so hit 'Submit Answer' to see how the plot has changed!
# Import pyplot (already available as plt in the exercise environment)
import matplotlib.pyplot as plt

# Redefine 'Medal' as an ordered categorical
medals.Medal = pd.Categorical(values=medals.Medal, categories=['Bronze', 'Silver', 'Gold'], ordered=True)

# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by 'Edition' and 'Medal' and aggregate over 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')

# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area()
plt.show()
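
An ordered categorical controls sort order (and hence column order after unstacking), so medals follow Bronze < Silver < Gold rather than alphabetical order. A small sketch with made-up medal labels:

import pandas as pd

medal_col = pd.Series(['Gold', 'Bronze', 'Silver', 'Bronze'])
medal_cat = pd.Categorical(values=medal_col, categories=['Bronze', 'Silver', 'Gold'], ordered=True)

# Sorting now follows the declared category order
print(pd.Series(medal_cat).sort_values())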