Merging DataFrames with pandas


Sorting and manipulating a DataFrame's index


  • Read 'monthly_max_temp.csv' into a DataFrame called weather1 with 'Month' as the index.
  • Sort the index of weather1 in alphabetical order using the .sort_index() method and store the result in weather2.
  • Sort the index of weather1 in reverse alphabetical order by specifying the additional keyword argument ascending=False inside .sort_index().
  • Use the .sort_values() method to sort weather1 in increasing numerical order according to the values of the column 'Max TemperatureF'.

# Import pandas
import pandas as pd

# Read 'monthly_max_temp.csv' into a DataFrame: weather1
weather1 = pd.read_csv("monthly_max_temp.csv", index_col='Month')

# Print the head of weather1
print(weather1.head())

# Sort the index of weather1 in alphabetical order: weather2
weather2 = weather1.sort_index()

# Print the head of weather2
print(weather2.head())

# Sort the index of weather1 in reverse alphabetical order: weather3
weather3 = weather1.sort_index(ascending=False)

# Print the head of weather3
print(weather3.head())

# Sort weather1 numerically using the values of 'Max TemperatureF': weather4
weather4 = weather1.sort_values('Max TemperatureF')

# Print the head of weather4
print(weather4.head())


  • Reorder the rows of weather1 using the .reindex() method with the list year as the argument, which contains the abbreviations for each month.
  • Reorder the rows of weather1 just as you did above, this time chaining the .ffill() method to replace the null values with the last preceding non-null value.

# Import pandas
import pandas as pd

# Reindex weather1 using the list year: weather2
weather2 = weather1.reindex(year)

# Print weather2
print(weather2)

# Reindex weather1 using the list year with forward-fill: weather3
weather3 = weather1.reindex(year).ffill()

# Print weather3
print(weather3)
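
Both weather1 and the list year come from the exercise environment. A self-contained sketch of the same behavior, with hypothetical month data:

# Hypothetical stand-ins, for illustration only
import pandas as pd

year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
        'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
weather_demo = pd.DataFrame({'Max TemperatureF': [32, 68, 89, 60]},
                            index=['Jan', 'Apr', 'Jul', 'Oct'])

# Months missing from the index appear as NaN rows after reindexing;
# .ffill() copies the last preceding non-null value forward
print(weather_demo.reindex(year))
print(weather_demo.reindex(year).ffill())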


  • Create a new DataFrame common_names by reindexing names_1981 using the Index of the DataFrame names_1881 of older names.
  • Print the shape of the new common_names DataFrame. This has been done for you. It should be the same as that of names_1881.
  • Drop the rows of common_names that have null counts using the .dropna() method. These rows correspond to names that fell out of fashion between 1881 & 1981.
  • Print the shape of the reassigned common_names DataFrame. This has been done for you, so hit 'Submit Answer' to see the result!

# Import pandas
import pandas as pd

# Reindex names_1981 with index of names_1881: common_names
common_names = names_1981.reindex(names_1881.index)

# Print shape of common_names
print(common_names.shape)

# Drop rows with null counts: common_names
common_names = common_names.dropna()

# Print shape of new common_names
print(common_names.shape)


  • Create a new DataFrame temps_f by extracting the columns 'Min TemperatureF', 'Mean TemperatureF', and 'Max TemperatureF' from weather. To do this, pass the relevant columns as a list to weather[].
  • Create a new DataFrame temps_c from temps_f using the formula (temps_f - 32) * 5/9.
  • Rename the columns of temps_c to replace 'F' with 'C' using the .str.replace('F', 'C') method on temps_c.columns.
  • Print the first 5 rows of DataFrame temps_c. This has been done for you, so hit 'Submit Answer' to see the result!

# Extract selected columns from weather as new DataFrame: temps_f
temps_f = weather[['Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]

# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * 5 / 9

# Rename 'F' in column names with 'C': temps_c.columns
temps_c.columns = temps_c.columns.str.replace("F", "C")

# Print first 5 rows of temps_c
print(temps_c.head())

  • Read the file 'GDP.csv' into a DataFrame called gdp.
    • Use parse_dates=True and index_col='DATE'.
  • Create a DataFrame post2008 by slicing gdp such that it comprises all rows from 2008 onward.
  • Print the last 8 rows of the slice post2008. This has been done for you. This data has quarterly frequency so the indices are separated by three-month intervals.
  • Create the DataFrame yearly by resampling the slice post2008 by year. Remember, you need to chain .resample() (using the alias 'A' for annual frequency) with some kind of aggregation; you will use the aggregation method .last() to select the last element when resampling.
  • Compute the percentage growth of the resampled DataFrame yearly with .pct_change() * 100.

import pandas as pd

# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv('GDP.csv', parse_dates=True, index_col='DATE')

# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008':]

# Print the last 8 rows of post2008
print(post2008.tail(8))

# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()

# Print yearly
print(yearly)

# Compute percentage growth of yearly: yearly['growth']
yearly['growth'] = yearly.pct_change() * 100

# Print yearly again
print(yearly)
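
A version note: from pandas 2.2 the annual frequency alias 'A' is deprecated in favor of 'YE', so on current pandas the resampling step reads:

yearly = post2008.resample('YE').last()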


  • Read the DataFrames sp500 & exchange from the files 'sp500.csv' & 'exchange.csv' respectively.
    • Use parse_dates=True and index_col='Date'.
  • Extract the columns 'Open' & 'Close' from the DataFrame sp500 as a new DataFrame dollars and print the first 5 rows.
  • Construct a new DataFrame pounds by converting US dollars to British pounds. You'll use the .multiply() method of dollars with exchange['GBP/USD'] and axis='rows'.
  • Print the first 5 rows of the new DataFrame pounds. This has been done for you, so hit 'Submit Answer' to see the results!


# Import pandas
import pandas as pd

# Read 'sp500.csv' into a DataFrame: sp500
sp500 = pd.read_csv('sp500.csv', parse_dates=True, index_col='Date')

# Read 'exchange.csv' into a DataFrame: exchange
exchange = pd.read_csv("exchange.csv", parse_dates=True, index_col='Date')

# Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[['Open', 'Close']]

# Print the head of dollars
print(dollars.head())

# Convert dollars to pounds: pounds
pounds = dollars.multiply(exchange['GBP/USD'], axis='rows')

# Print the head of pounds
print(pounds.head())
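
The important detail is the axis='rows' broadcast: each column of dollars is multiplied elementwise by the exchange-rate Series, aligned on the shared date index. A minimal sketch with hypothetical numbers:

import pandas as pd

prices = pd.DataFrame({'Open': [100.0, 200.0], 'Close': [110.0, 190.0]},
                      index=pd.to_datetime(['2015-01-02', '2015-01-05']))
rate = pd.Series([0.65, 0.66], index=prices.index)  # hypothetical GBP/USD rates

# Every column of prices is scaled by the rate matching each row's date
print(prices.multiply(rate, axis='rows'))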


  • Read the files 'sales-jan-2015.csv', 'sales-feb-2015.csv', and 'sales-mar-2015.csv' into the DataFrames jan, feb, and mar respectively.
    • Use parse_dates=True and index_col='Date'.
  • Extract the 'Units' column of jan, feb, and mar to create the Series jan_units, feb_units, and mar_units respectively.
  • Construct the Series quarter1 by appending feb_units to jan_units and then appending mar_units to the result. Use chained calls to the .append() method to do this.
  • Verify that quarter1 has the individual Series stacked vertically. To do this:
    • Print the slice containing rows from jan 27, 2015 to feb 2, 2015.
    • Print the slice containing rows from feb 26, 2015 to mar 7, 2015.
  • Compute and print the total number of units sold from the Series quarter1. This has been done for you, so hit 'Submit Answer' to see the result!

# Import pandas
import pandas as pd

# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv('sales-jan-2015.csv', parse_dates=True, index_col='Date')

# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv('sales-feb-2015.csv', parse_dates=True, index_col='Date')

# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv('sales-mar-2015.csv', parse_dates=True, index_col='Date')

# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']

# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']

# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']

# Append feb_units and then mar_units to jan_units: quarter1
quarter1 = jan_units.append(feb_units).append(mar_units)

# Print the first slice from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])

# Print the second slice from quarter1
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

# Compute & print total sales in quarter1
print(quarter1.sum())
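
A note for modern pandas: Series.append() was removed in pandas 2.0, so on current versions the same vertical stacking is written with pd.concat(), exactly as the next exercise does:

quarter1 = pd.concat([jan_units, feb_units, mar_units])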

  • Create an empty list called units. This has been done for you.
  • Use a for loop to iterate over [jan, feb, mar]:
    • In each iteration of the loop, append the 'Units' column of each DataFrame to units.
  • Concatenate the Series contained in the list units into a longer Series called quarter1 using pd.concat().
    • Specify the keyword argument axis='rows' to stack the Series vertically.
  • Verify that quarter1 has the individual Series stacked vertically by printing slices. This has been done for you, so hit 'Submit Answer' to see the result!

# Initialize empty list: units
units = []

# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month['Units'])

# Concatenate the list: quarter1
quarter1 = pd.concat(units, axis='rows')

# Print slices from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])


  • Create a 'year' column in the DataFrames names_1881 and names_1981, with values of 1881 and 1981 respectively. Recall that assigning a scalar value to a DataFrame column broadcasts that value throughout.
  • Create a new DataFrame called combined_names by appending the rows of names_1981 underneath the rows of names_1881. Specify the keyword argument ignore_index=True to make a new RangeIndex of unique integers for each row.
  • Print the shapes of all three DataFrames. This has been done for you.
  • Extract all rows from combined_names that have the name 'Morgan'. To do this, use the .loc[] accessor with an appropriate filter. The relevant column of combined_names here is 'name'.
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981

# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = names_1881.append(names_1981, ignore_index=True)

# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)

# Print all rows that contain the name 'Morgan'
# (an alternative is substring matching: combined_names.loc[combined_names['name'].str.contains('Morgan')])
print(combined_names.loc[combined_names['name'] == 'Morgan'])
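
DataFrame.append() was likewise removed in pandas 2.0; the equivalent on current versions is:

combined_names = pd.concat([names_1881, names_1981], ignore_index=True)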

  • Iterate over medal_types in the for loop.
  • Inside the for loop:
    • Create file_name using string interpolation with the loop variable medal. This has been done for you. The expression "%s_top5.csv" % medal evaluates as a string with the value of medal replacing %s in the format string.
    • Create the list of column names called columns. This has been done for you.
    • Read file_name into a DataFrame called medal_df. Specify the keyword arguments header=0, index_col='Country', and names=columns to get the correct row and column Indexes.
    • Append medal_df to medals using the list .append() method.
  • Concatenate the list of DataFrames medals horizontally (using axis='columns') to create a single DataFrame called medals. Print it in its entirety.

# medal_types and the empty list medals are pre-defined in the exercise
# environment; shown here so the snippet is self-contained
medal_types = ['bronze', 'silver', 'gold']
medals = []

for medal in medal_types:

    # Create the file name: file_name
    file_name = "%s_top5.csv" % medal
    
    # Create list of column names: columns
    columns = ['Country', medal]
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, header=0, index_col='Country', names=columns)

    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals horizontally: medals
medals = pd.concat(medals, axis='columns')

# Print medals
print(medals)


  • Within the for loop:
    • Read file_name into a DataFrame called medal_df. Specify the index to be 'Country'.
    • Append medal_df to medals.
  • Concatenate the list of DataFrames medals into a single DataFrame called medals. Be sure to use the keyword argument keys=['bronze', 'silver', 'gold'] to create a vertically stacked DataFrame with a MultiIndex.
  • Print the new DataFrame medals. This has been done for you, so hit 'Submit Answer' to see the result!

# medals starts over as an empty list; medal_types is as above
medals = []

for medal in medal_types:

    file_name = "%s_top5.csv" % medal
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, index_col='Country')
    
    # Append medal_df to medals
    medals.append(medal_df)
    
# Concatenate medals: medals
medals = pd.concat(medals, keys=['bronze', 'silver', 'gold'])

# Print medals in entirety
print(medals)


  • Create a new DataFrame medals_sorted with the entries of medals sorted. Use .sort_index(level=0) to ensure the Index is sorted suitably.
  • Print the number of bronze medals won by Germany and all of the silver medal data. This has been done for you.
  • Create an alias for pd.IndexSlice called idx. A slicer pd.IndexSlice is required when slicing on the inner level of a MultiIndex.
  • Slice all the data on medals won by the United Kingdom. To do this, use the .loc[] accessor with idx[:,'United Kingdom'], :.

# Sort the entries of medals: medals_sorted
medals_sorted = medals.sort_index(level=0)

# Print the number of Bronze medals won by Germany
print(medals_sorted.loc[('bronze','Germany')])

# Print data about silver medals
print(medals_sorted.loc['silver'])

# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice

# Print all the data on medals won by the United Kingdom
print(medals_sorted.loc[idx[:, 'United Kingdom'], :])
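
pd.IndexSlice is a helper that builds the nested slice objects .loc[] expects for a MultiIndex; slicing on an inner level also needs a sorted index, which is why medals was sorted first. A tiny self-contained sketch with hypothetical medal counts:

import pandas as pd

demo = pd.DataFrame({'Total': [67.0, 47.0, 195.0]},
                    index=pd.MultiIndex.from_tuples(
                        [('bronze', 'Germany'),
                         ('bronze', 'United Kingdom'),
                         ('gold', 'United Kingdom')],
                        names=['medal', 'Country'])).sort_index(level=0)

idx = pd.IndexSlice
print(demo.loc[idx[:, 'United Kingdom'], :])  # both UK rows, across medal types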

  • Construct a new DataFrame february with MultiIndexed columns by concatenating the list dataframes.
    • Use axis=1 to stack the DataFrames horizontally and the keyword argument keys=['Hardware', 'Software', 'Service'] to construct a hierarchical Index from each DataFrame.
  • Print summary information from the new DataFrame february using the .info() method. This has been done for you.
  • Create an alias called idx for pd.IndexSlice.
  • Extract a slice called slice_2_8 from february (using .loc[] & idx) that comprises rows from Feb. 2, 2015 to Feb. 8, 2015 from columns under 'Company'.
  • Print the slice_2_8. This has been done for you, so hit 'Submit Answer' to see the sliced data!

# Concatenate dataframes: february
february = pd.concat(dataframes, axis=1, keys=['Hardware', 'Software', 'Service'])

# Print february.info()
print(february.info())

# Assign pd.IndexSlice: idx
idx = pd.IndexSlice

# Create the slice: slice_2_8
slice_2_8 = february.loc['Feb 2, 2015':'Feb 8, 2015', idx[:, 'Company']]


# Print slice_2_8
print(slice_2_8)


  • Create a list called month_list consisting of the tuples ('january', jan), ('february', feb), and ('march', mar).
  • Create an empty dictionary called month_dict.
  • Inside the for loop:
    • Group month_data by 'Company' and use .sum() to aggregate.
  • Construct a new DataFrame called sales by concatenating the DataFrames stored in month_dict.
  • Create an alias for pd.IndexSlice and print all sales by 'Mediacore'. This has been done for you, so hit 'Submit Answer' to see the result!

# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]


# Create an empty dictionary: month_dict
month_dict = dict()

for month_name, month_data in month_list:

    # Group month_data: month_dict[month_name]
    month_dict[month_name] = month_data.groupby('Company').sum()

# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)

# Print sales
print(sales)

# Print all sales by Mediacore
idx = pd.IndexSlice
print(sales.loc[idx[:, 'Mediacore'], :])
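
When pd.concat() receives a dict rather than a list, the dict keys ('january', 'february', 'march') automatically become the outer level of the resulting MultiIndex, which is what makes the idx[:, 'Mediacore'] slice above work.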


  • Make a new DataFrame china_annual by resampling the DataFrame china with .resample('A') (i.e., with annual frequency) and chaining two method calls:
    • Chain .pct_change(10) as an aggregation method to compute the percentage change with an offset of ten years.
    • Chain .dropna() to eliminate rows containing null values.
  • Make a new DataFrame us_annual by resampling the DataFrame us exactly as you resampled china.
  • Concatenate china_annual and us_annual to construct a DataFrame called gdp. Use join='inner' to perform an inner join and use axis=1 to concatenate horizontally.
  • Print the result of resampling gdp every decade (i.e., using .resample('10A')) and aggregating with the method .last(). This has been done for you, so hit 'Submit Answer' to see the result!

# Resample and tidy china: china_annual
china_annual = china.resample('A').pct_change(10).dropna()

# Resample and tidy us: us_annual
us_annual = us.resample('A').pct_change(10).dropna()

# Concatenate china_annual and us_annual: gdp
gdp = pd.concat([china_annual, us_annual], axis=1, join='inner')

# Resample gdp and print
print(gdp.resample('10A').last())
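
Note that chaining .pct_change() directly onto a Resampler was deprecated and later removed from pandas; on current versions you aggregate first. A sketch, assuming keeping the last observation per year is the intended aggregation:

china_annual = china.resample('A').last().pct_change(10).dropna()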


Merging DataFrames with merge and its variants

  • Using pd.merge(), merge the DataFrames revenue and managers on the 'city' column of each. Store the result as merge_by_city.
  • Print the DataFrame merge_by_city. This has been done for you.
  • Merge the DataFrames revenue and managers on the 'branch_id' column of each. Store the result as merge_by_id.
  • Print the DataFrame merge_by_id. This has been done for you, so hit 'Submit Answer' to see the result!
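
revenue and managers are supplied by the exercise environment. A hypothetical pair of frames that makes the merges below runnable (in the later exercises, managers carries the city names in a column named 'branch' instead of 'city'):

import pandas as pd

revenue = pd.DataFrame({'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                        'branch_id': [10, 20, 30, 47],
                        'revenue': [100, 83, 4, 200]})
managers = pd.DataFrame({'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
                         'branch_id': [10, 20, 47, 31],
                         'manager': ['Charles', 'Joel', 'Brett', 'Sally']})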

# Merge revenue with managers on 'city': merge_by_city
merge_by_city = pd.merge(revenue, managers, on='city')


# Print merge_by_city
print(merge_by_city)


# Merge revenue with managers on 'branch_id': merge_by_id
merge_by_id = pd.merge(revenue, managers, on='branch_id')


# Print merge_by_id
print(merge_by_id)

  • Merge the DataFrames revenue and managers into a single DataFrame called combined using the 'city' and 'branch' columns from the appropriate DataFrames.
    • In your call to pd.merge(), you will have to specify the parameters left_on and right_on appropriately.
  • Print the new DataFrame combined.

# Merge revenue & managers on 'city' & 'branch': combined
combined = pd.merge(revenue, managers, left_on='city', right_on='branch')


# Print combined
print(combined)

  • Create a column called 'state' in the DataFrame revenue, consisting of the list ['TX','CO','IL','CA'].
  • Create a column called 'state' in the DataFrame managers, consisting of the list ['TX','CO','CA','MO'].
  • Merge the DataFrames revenue and managers using three columns: 'branch_id', 'city', and 'state'. Pass them in as a list to the on parameter of pd.merge().

# Add 'state' column to revenue: revenue['state']
revenue['state'] = ['TX','CO','IL','CA']

# Add 'state' column to managers: managers['state']
managers['state'] = ['TX','CO','CA','MO']

# Merge revenue & managers on 'branch_id', 'city', & 'state': combined
combined = pd.merge(revenue, managers, on=['branch_id', 'city', 'state'])

# Print combined
print(combined)

  • Execute a right merge using pd.merge() with revenue and sales to yield a new DataFrame revenue_and_sales.
    • Use how='right' and on=['city', 'state'].
  • Print the new DataFrame revenue_and_sales. This has been done for you.
  • Execute a left merge with sales and managers to yield a new DataFrame sales_and_managers.
    • Use how='left', left_on=['city', 'state'], and right_on=['branch', 'state'].
  • Print the new DataFrame sales_and_managers. This has been done for you, so hit 'Submit Answer' to see the result!

# Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue, sales, how='right', on=['city', 'state'])

# Print revenue_and_sales
print(revenue_and_sales)

# Merge sales and managers: sales_and_managers
sales_and_managers = pd.merge(sales, managers, how='left', left_on=['city', 'state'], right_on=['branch', 'state'])

# Print sales_and_managers
print(sales_and_managers)

  • Perform an ordered merge on austin and houston using pd.merge_ordered(). Store the result as tx_weather.
  • Print tx_weather. You should notice that the rows are sorted by date, but it is not possible to tell which observation came from which city.
  • Perform another ordered merge on austin and houston.
    • This time, specify the keyword arguments on='date' and suffixes=['_aus','_hus'] so that the rows can be distinguished. Store the result as tx_weather_suff.
  • Print tx_weather_suff to examine its contents. This has been done for you.
  • Perform a third ordered merge on austin and houston.
    • This time, in addition to the on and suffixes parameters, specify the keyword argument fill_method='ffill' to use forward-filling to replace NaN entries with the most recent non-null entry, and hit 'Submit Answer' to examine the contents of the merged DataFrames!
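
austin and houston also come from the exercise environment; hypothetical stand-ins that make all three ordered merges runnable:

import pandas as pd

austin = pd.DataFrame({'date': pd.to_datetime(['2016-01-01', '2016-01-17', '2016-02-08']),
                       'ratings': ['Cloudy', 'Sunny', 'Cloudy']})
houston = pd.DataFrame({'date': pd.to_datetime(['2016-01-04', '2016-01-01', '2016-03-01']),
                        'ratings': ['Rainy', 'Cloudy', 'Sunny']})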

# Perform the first ordered merge: tx_weather
tx_weather = pd.merge_ordered(austin, houston)

# Print tx_weather
print(tx_weather)

# Perform the second ordered merge: tx_weather_suff
tx_weather_suff = pd.merge_ordered(austin, houston, on='date', suffixes=['_aus','_hus'])

# Print tx_weather_suff
print(tx_weather_suff)

# Perform the third ordered merge: tx_weather_ffill
tx_weather_ffill = pd.merge_ordered(austin, houston, on='date', suffixes=['_aus','_hus'], fill_method='ffill')

# Print tx_weather_ffill
print(tx_weather_ffill)


  • Merge auto and oil using pd.merge_asof() with left_on='yr' and right_on='Date'. Store the result as merged.
  • Print the tail of merged. This has been done for you.
  • Resample merged using 'A' (annual frequency), and on='Date'. Select [['mpg','Price']] and aggregate the mean. Store the result as yearly.
  • Hit 'Submit Answer' to examine the contents of yearly and yearly.corr(), which shows the Pearson correlation between the resampled 'Price' and 'mpg'.

# Merge auto and oil: merged
merged = pd.merge_asof(auto, oil, left_on='yr', right_on='Date')

# Print the tail of merged
print(merged.tail())

# Resample merged: yearly
yearly = merged.resample('A', on='Date')[['mpg', 'Price']].mean()

# Print yearly
print(yearly)

# print yearly.corr()
print(yearly.corr())
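
pd.merge_asof() performs a left join in which each left row is matched to the most recent right row whose key is less than or equal to the left key; both key columns must be sorted. A minimal sketch with hypothetical values:

import pandas as pd

left = pd.DataFrame({'yr': pd.to_datetime(['1970-01-01', '1971-01-01'])})
right = pd.DataFrame({'Date': pd.to_datetime(['1970-01-01', '1970-09-01']),
                      'Price': [3.35, 3.39]})

# Each 'yr' picks up the latest 'Price' observed on or before that date
print(pd.merge_asof(left, right, left_on='yr', right_on='Date'))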


  • Within the for loop:
    • Create the file path. This has been done for you.
    • Read file_path into a DataFrame. Assign the result to the year key of medals_dict.
    • Select only the columns 'Athlete', 'NOC', and 'Medal' from medals_dict[year].
    • Create a new column called 'Edition' in the DataFrame medals_dict[year] whose entries are all year.
  • Concatenate the dictionary of DataFrames medals_dict into a DataFrame called medals. Specify the keyword argument ignore_index=True to prevent repeated integer indices.
  • Print the first and last 5 rows of medals. This has been done for you, so hit 'Submit Answer' to see the result!

# Import pandas
import pandas as pd

# Create empty dictionary: medals_dict
medals_dict = {}

for year in editions['Edition']:

    # Create the file path: file_path
    file_path = 'summer_{:d}.csv'.format(year)
    
    # Load file_path into a DataFrame: medals_dict[year]
    medals_dict[year] = pd.read_csv(file_path)
    
    # Extract relevant columns: medals_dict[year]
    medals_dict[year] = medals_dict[year][['Athlete', 'NOC', 'Medal']]
    
    # Assign year to column 'Edition' of medals_dict
    medals_dict[year]['Edition'] = year
    
# Concatenate medals_dict: medals
medals = pd.concat(medals_dict, ignore_index=True)

# Print first and last 5 rows of medals
print(medals.head())
print(medals.tail())

  • Set the index of the DataFrame editions to be 'Edition' (using the method .set_index()). Save the result as totals.
  • Extract the 'Grand Total' column from totals and assign the result back to totals.
  • Divide the DataFrame medal_counts by totals along each row. You will have to use the .divide() method with the option axis='rows'. Assign the result to fractions.
  • Print first & last 5 rows of the DataFrame fractions. This has been done for you, so hit 'Submit Answer' to see the results!

# Set Index of editions: totals
totals = editions.set_index('Edition')

# Reassign totals['Grand Total']: totals
totals = totals['Grand Total']

# Divide medal_counts by totals: fractions
fractions = medal_counts.divide(totals, axis='rows')

# Print first & last 5 rows of fractions
print(fractions.head())
print(fractions.tail())

  • Create mean_fractions by chaining the methods .expanding().mean() to fractions.
  • Compute the percentage change in mean_fractions down each column by applying .pct_change() and multiplying by 100. Assign the result to fractions_change.
  • Reset the index of fractions_change using the .reset_index() method. This will make 'Edition' an ordinary column.
  • Print the first and last 5 rows of the DataFrame fractions_change. This has been done for you, so hit 'Submit Answer' to see the results!

# Apply the expanding mean: mean_fractions
mean_fractions = fractions.expanding().mean()


# Compute the percentage change: fractions_change
fractions_change = mean_fractions.pct_change() * 100


# Reset the index of fractions_change: fractions_change
fractions_change = fractions_change.reset_index('Edition')


# Print first & last 5 rows of fractions_change
print(fractions_change.head())
print(fractions_change.tail())
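
An expanding window differs from a rolling window in that it always starts at the first observation, so .expanding().mean() is the running mean of everything seen so far:

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
print(s.expanding().mean())  # 1.0, 1.5, 2.0, 2.5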
