pandas Foundations


Using NumPy's log10 with a DataFrame


  • Import numpy using the standard alias np.
  • Assign the numerical values in the DataFrame df to an array np_vals using the attribute values.
  • Pass np_vals into the NumPy method log10() and store the results in np_vals_log10.
  • Pass the entire df DataFrame into the NumPy method log10() and store the results in df_log10.
  • Call print() and type() on both np_vals_log10 and df_log10, and compare. This has been done for you.

# Import numpy
import numpy as np


# Create array of DataFrame values: np_vals
np_vals = df.values


# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)


# Create array of new DataFrame by passing df to np.log10(): df_log10
df_log10 = np.log10(df)


# Print original and new data containers
print(type(np_vals), type(np_vals_log10))
print(type(df), type(df_log10))
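The exercise assumes df is preloaded; a minimal self-contained sketch with made-up numbers shows the key point — a ufunc like np.log10 returns an ndarray for an array input but preserves the DataFrame type for a DataFrame input:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the preloaded df
df = pd.DataFrame({'a': [1.0, 10.0, 100.0], 'b': [10.0, 100.0, 1000.0]})

# .values yields a plain NumPy array, so log10 of it is also an array
np_vals = df.values
np_vals_log10 = np.log10(np_vals)

# Passing the DataFrame itself preserves the DataFrame type (and labels)
df_log10 = np.log10(df)

print(type(np_vals_log10))  # <class 'numpy.ndarray'>
print(type(df_log10))       # <class 'pandas.core.frame.DataFrame'>
```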


Creating a DataFrame from lists with zip


  • Zip the 2 lists list_keys and list_values together into one list of (key, value) tuples. Be sure to convert the zip object into a list, and store the result in zipped.
  • Inspect the contents of zipped using print(). This has been done for you.
  • Construct a dictionary using zipped. Store the result as data.
  • Construct a DataFrame using the dictionary. Store the result as df.

# Zip the 2 lists together into one list of (key,value) tuples: zipped
zipped = list(zip(list_keys, list_values))


# Inspect the list using print()
print(zipped)


# Build a dictionary with the zipped list: data
data = dict(zipped)


# Build and inspect a DataFrame from the dictionary: df
df = pd.DataFrame(data)
print(df)
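Since list_keys and list_values are preloaded in the exercise, here is a runnable sketch with hypothetical Olympic-medal-style lists:

```python
import pandas as pd

# Hypothetical stand-ins for the preloaded list_keys and list_values
list_keys = ['Country', 'Total']
list_values = [['United States', 'Soviet Union'], [1118, 473]]

# zip pairs each key with its value list; dict turns the pairs into columns
zipped = list(zip(list_keys, list_values))
data = dict(zipped)
df = pd.DataFrame(data)
print(df)
```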


Preprocessing while reading files with pandas


  • Use pd.read_csv() without using any keyword arguments to read file_messy into a pandas DataFrame df1.
  • Use .head() to print the first 5 rows of df1 and see how messy it is. Do this in the IPython Shell first so you can see how modifying read_csv() can clean up this mess.
  • Using the keyword arguments delimiter=' ', header=3, and comment='#', use pd.read_csv() again to read file_messy into a new DataFrame df2.
  • Print the output of df2.head() to verify the file was read correctly.
  • Use the DataFrame method .to_csv() to save the DataFrame df2 to the variable file_clean. Be sure to specify index=False.
  • Use the DataFrame method .to_excel() to save the DataFrame df2 to the file 'file_clean.xlsx'. Again, remember to specify index=False.

# Read the raw file as-is: df1
df1 = pd.read_csv(file_messy)


# Print the output of df1.head()
print(df1.head())


# Read in the file with the correct parameters: df2
df2 = pd.read_csv(file_messy, delimiter=" ", header=3, comment='#')


# Print the output of df2.head()
print(df2.head())


# Save the cleaned up DataFrame to a CSV file without the index
df2.to_csv(file_clean, index=False)


# Save the cleaned up DataFrame to an excel file without the index
df2.to_excel('file_clean.xlsx', index=False)


Plotting and saving figures with pandas


  • Create the plot with the DataFrame method df.plot(). Specify a color of 'red'.
    • Note: c and color are interchangeable as parameters here, but we ask you to be explicit and specify color.
  • Use plt.title() to give the plot a title of 'Temperature in Austin'.
  • Use plt.xlabel() to give the plot an x-axis label of 'Hours since midnight August 1, 2010'.
  • Use plt.ylabel() to give the plot a y-axis label of 'Temperature (degrees F)'.
  • Finally, display the plot using plt.show().

# Create a plot with color='red'
df.plot(color='red')


# Add a title
plt.title("Temperature in Austin")


# Specify the x-axis label
plt.xlabel("Hours since midnight August 1, 2010")


# Specify the y-axis label
plt.ylabel("Temperature (degrees F)")


# Display the plot
plt.show()

  • Plot all columns together on one figure by calling df.plot(), and noting the vertical scaling problem.
  • Plot all columns as subplots. To do so, you need to specify subplots=True inside .plot().
  • Plot a single column of dew point data. To do this, define a column list containing a single column name 'Dew Point (deg F)', and call df[column_list1].plot().
  • Plot two columns of data, 'Temperature (deg F)' and 'Dew Point (deg F)'. To do this, define a list containing those column names and pass it into df[], as df[column_list2].plot().
# Plot all columns (default)
df.plot()
plt.show()


# Plot all columns as subplots
df.plot(subplots=True)
plt.show()


# Plot just the Dew Point data
column_list1 = ['Dew Point (deg F)']
df[column_list1].plot()
plt.show()


# Plot the Dew Point and Temperature data, but not the Pressure data
column_list2 = ['Temperature (deg F)','Dew Point (deg F)']
df[column_list2].plot()
plt.show()

  • Create a list of y-axis column names called y_columns consisting of 'AAPL' and 'IBM'.
  • Generate a line plot with x='Month' and y=y_columns as inputs.
  • Give the plot a title of 'Monthly stock prices'.
  • Specify the y-axis label.
  • Display the plot.

# Create a list of y-axis column names: y_columns
y_columns = ['AAPL','IBM']


# Generate a line plot
df.plot(x='Month', y=y_columns)


# Add the title
plt.title('Monthly stock prices')


# Add the y-axis label
plt.ylabel('Price ($US)')


# Display the plot
plt.show()


  • Generate a scatter plot with 'hp' on the x-axis and 'mpg' on the y-axis. Specify s=sizes.
  • Add a title to the plot.
  • Specify the x-axis and y-axis labels.
# Generate a scatter plot
df.plot(kind='scatter', x='hp', y='mpg', s=sizes)


# Add the title
plt.title('Fuel efficiency vs Horse-power')


# Add the x-axis label
plt.xlabel('Horse-power')


# Add the y-axis label
plt.ylabel('Fuel efficiency (mpg)')


# Display the plot
plt.show()

  • Make a list called cols of the column names to be plotted: 'weight' and 'mpg'. You can then access it using df[cols].
  • Generate a box plot of the two columns in a single figure. To do this, specify subplots=True.

# Make a list of the column names to be plotted: cols
cols = ['weight', 'mpg']


# Generate the box plots
df[cols].plot(kind="box", subplots=True)


# Display the plot
plt.show()


  • Plot a PDF for the values in fraction with 30 bins between 0 and 30%. The range has been taken care of for you. ax=axes[0] means that this plot will appear in the first row.
  • Plot a CDF for the values in fraction with 30 bins between 0 and 30%. Again, the range has been specified for you. To make the CDF appear on the second row, you need to specify ax=axes[1].

# This formats the plots such that they appear on separate rows
fig, axes = plt.subplots(nrows=2, ncols=1)


# Plot the PDF
# Plot the PDF (normed=True in older matplotlib; it was removed in newer
# versions, where density=True is the equivalent)
df.fraction.plot(ax=axes[0], kind='hist', density=True, bins=30, range=(0,.3))


# Plot the CDF, then display both subplots with a single plt.show()
df.fraction.plot(ax=axes[1], kind='hist', density=True, cumulative=True, bins=30, range=(0,.3))
plt.show()


  • Print the minimum value of the 'Engineering' column.
  • Print the maximum value of the 'Engineering' column.
  • Construct the mean percentage per year with .mean(axis='columns'). Assign the result to mean.
  • Plot the average percentage per year. Since 'Year' is the index of df, it will appear on the x-axis of the plot. No keyword arguments are needed in your call to .plot().

# Print the minimum value of the Engineering column
print(df['Engineering'].min())


# Print the maximum value of the Engineering column
print(df['Engineering'].max())


# Construct the mean percentage per year: mean
mean = df.mean(axis='columns')


# Plot the average percentage per year
mean.plot()


# Display the plot
plt.show()


Exploring patterns in the data

  • Print the number of countries reported in 2015. To do this, use the .count() method on the '2015' column of df.
  • Print the 5th and 95th percentiles of df. To do this, use the .quantile() method with the list [0.05, 0.95].
  • Generate a box plot using the list of columns provided in years. This has already been done for you, so click on 'Submit Answer' to view the result!

# Print the number of countries reported in 2015
print(df['2015'].count())


# Print the 5th and 95th percentiles
print(df.quantile([0.05, 0.95]))


# Generate a box plot
years = ['1800','1850','1900','1950','2000']
df[years].plot(kind='box')
plt.show()
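The same .count() and .quantile() calls can be tried on a small hypothetical table (rows as countries, columns as years):

```python
import pandas as pd

# Hypothetical life-expectancy-style table; column names are years as strings
df = pd.DataFrame({'1900': [40.0, 45.0, 50.0, 55.0],
                   '2000': [70.0, 75.0, 78.0, 80.0]})

print(df['2000'].count())         # number of non-missing entries in 2000
print(df.quantile([0.05, 0.95]))  # 5th/95th percentiles, one column per year
```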


Working with time series data

  • Prepare a format string, time_format, using '%Y-%m-%d %H:%M' as the desired format.
  • Convert date_list into a datetime object by using the pd.to_datetime() function. Specify the format string you defined above and assign the result to my_datetimes.
  • Construct a pandas Series called time_series using pd.Series() with temperature_list and my_datetimes. Set the index of the Series to be my_datetimes.

# Prepare a format string: time_format
time_format = '%Y-%m-%d %H:%M'


# Convert date_list into a datetime object: my_datetimes
my_datetimes = pd.to_datetime(date_list, format=time_format)  


# Construct a pandas Series using temperature_list and my_datetimes: time_series
time_series = pd.Series(temperature_list, index=my_datetimes)
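With hypothetical stand-ins for date_list and temperature_list, the three steps run end to end:

```python
import pandas as pd

# Hypothetical stand-ins for the preloaded date_list and temperature_list
date_list = ['2010-08-01 00:00', '2010-08-01 01:00', '2010-08-01 02:00']
temperature_list = [79.0, 77.4, 76.4]

time_format = '%Y-%m-%d %H:%M'
my_datetimes = pd.to_datetime(date_list, format=time_format)
time_series = pd.Series(temperature_list, index=my_datetimes)
print(time_series)
```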


  • Create a new time series ts3 by reindexing ts2 with the index of ts1. To do this, call .reindex() on ts2 and pass in the index of ts1 (ts1.index).
  • Create another new time series, ts4, by calling the same .reindex() as above, but also specifying a fill method, using the keyword argument method="ffill" to forward-fill values.
  • Add ts1 + ts2. Assign the result to sum12.
  • Add ts1 + ts3. Assign the result to sum13.
  • Add ts1 + ts4. Assign the result to sum14.

# Reindex without fill method: ts3
ts3 = ts2.reindex(ts1.index)


# Reindex with fill method, using forward fill: ts4
ts4 = ts2.reindex(ts1.index, method='ffill')


# Combine ts1 + ts2: sum12
sum12 = ts1 + ts2


# Combine ts1 + ts3: sum13
sum13 = ts1 + ts3


# Combine ts1 + ts4: sum14
sum14 = ts1 + ts4
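A sketch with hypothetical daily (ts1) and weekday-only (ts2) series makes the NaN behavior visible — addition propagates NaN unless the reindex was forward-filled:

```python
import pandas as pd

# Hypothetical series: ts1 covers every day of a week, ts2 only weekdays
ts1 = pd.Series(range(7), index=pd.date_range('2016-07-01', periods=7, freq='D'))
ts2 = ts1[ts1.index.dayofweek < 5]

ts3 = ts2.reindex(ts1.index)                  # weekend slots become NaN
ts4 = ts2.reindex(ts1.index, method='ffill')  # weekends take Friday's value

sum13 = ts1 + ts3  # NaN on the weekend
sum14 = ts1 + ts4  # no NaN anywhere
```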


Data aggregation operations


  • Downsample the 'Temperature' column of df to 6 hour data using .resample('6h') and .mean(). Assign the result to df1.
  • Downsample the 'Temperature' column of df to daily data using .resample('D') and then count the number of data points in each day with .count(). Assign the result to df2.

# Downsample to 6 hour data and aggregate by mean: df1
df1 = df.Temperature.resample("6h").mean()

# Downsample to daily data and count the number of data points: df2
df2 = df.Temperature.resample("D").count()
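On a hypothetical two-day hourly series, the two resamples produce 6-hour means and per-day counts:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly temperature readings over two full days
idx = pd.date_range('2010-08-01', periods=48, freq='h')
df = pd.DataFrame({'Temperature': np.linspace(70, 90, 48)}, index=idx)

df1 = df.Temperature.resample('6h').mean()  # eight 6-hour bins
df2 = df.Temperature.resample('D').count()  # 24 readings on each day
```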

  • Use partial string indexing to extract temperature data for August 2010 into august.
  • Use the temperature data for August and downsample to find the daily maximum temperatures. Store the result in august_highs.
  • Use partial string indexing to extract temperature data for February 2010 into february.
  • Use the temperature data for February and downsample to find the daily minimum temperatures. Store the result in february_lows.

# Extract temperature data for August: august
august = df.Temperature.loc['2010-08']


# Downsample to obtain only the daily highest temperatures in August: august_highs
august_highs = august.resample("D").max()


# Extract temperature data for February: february
february = df.Temperature.loc['2010-02']


# Downsample to obtain the daily lowest temperatures in February: february_lows
february_lows = february.resample("D").min()

  • Use partial string indexing to extract temperature data from August 1 2010 to August 15 2010. Assign to unsmoothed.
  • Use .rolling() with a 24 hour window to smooth the mean temperature data. Assign the result to smoothed.
  • Use a dictionary to create a new DataFrame august with the time series smoothed and unsmoothed as columns.
  • Plot both the columns of august as line plots using the .plot() method.

# Extract data from 2010-Aug-01 to 2010-Aug-15: unsmoothed
unsmoothed = df['Temperature']['2010-08-01':'2010-08-15']


# Apply a rolling mean with a 24 hour window: smoothed
smoothed = unsmoothed.rolling(window=24).mean()


# Create a new DataFrame with columns smoothed and unsmoothed: august
august = pd.DataFrame({'smoothed':smoothed, 'unsmoothed':unsmoothed})


# Plot both smoothed and unsmoothed data using august.plot().
august.plot()
plt.show()
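On synthetic hourly data, .rolling(window=24) also shows the cost of smoothing: the first 23 points have no complete window behind them and come out NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series to smooth
idx = pd.date_range('2010-08-01', periods=72, freq='h')
unsmoothed = pd.Series(np.sin(np.arange(72) / 4.0), index=idx)

# Each point becomes the mean of the last 24 observations
smoothed = unsmoothed.rolling(window=24).mean()
```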


Smoothing time series data

  • Use partial string indexing to extract August 2010 temperature data, and assign to august.
  • Resample to daily frequency, saving the maximum daily temperatures, and assign the result to daily_highs.
  • As part of one long method chain, repeat the above resampling (or you can re-use daily_highs) and then combine it with .rolling() to apply a 7 day .mean() (with window=7 inside .rolling()) so as to smooth the daily highs. Assign the result to daily_highs_smoothed and print the result.

# Extract the August 2010 data: august
august = df['Temperature']["2010-08"]


# Resample to daily data, aggregating by max: daily_highs
daily_highs = august.resample("D").max()


# Use a rolling 7-day window with method chaining to smooth the daily high temperatures in August
daily_highs_smoothed = august.resample("D").max().rolling(window=7).mean()
print(daily_highs_smoothed)
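The chain can be exercised on hypothetical hourly August data; with window=7 the first six daily values have no full week behind them:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly August temperatures
rng = np.random.default_rng(42)
idx = pd.date_range('2010-08-01', periods=31 * 24, freq='h')
august = pd.Series(70 + 10 * rng.random(len(idx)), index=idx)

# Daily maxima, then a 7-day rolling mean of those maxima
daily_highs_smoothed = august.resample('D').max().rolling(window=7).mean()
```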

  • Use .str.strip() to strip extra whitespace from df.columns. Assign the result back to df.columns.
  • In the 'Destination Airport' column, extract all entries where Dallas ('DAL') is the destination airport. Use .str.contains('DAL') for this and store the result in dallas.
  • Resample dallas such that you get the total number of departures each day. Store the result in daily_departures.
  • Generate summary statistics for daily Dallas departures using .describe(). Store the result in stats.

# Strip extra whitespace from the column names: df.columns
df.columns = df.columns.str.strip()


# Extract data for which the destination airport is Dallas: dallas
dallas = df['Destination Airport'].str.contains('DAL')


# Compute the total number of Dallas departures each day: daily_departures
daily_departures = dallas.resample("D").sum()


# Generate the summary statistics for daily Dallas departures: stats
stats = daily_departures.describe()
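With a tiny hypothetical flight log, summing the Boolean series per day counts the DAL departures:

```python
import pandas as pd

# Hypothetical flight log with whitespace-padded column names
df = pd.DataFrame(
    {'  Destination Airport  ': ['DAL', 'LAX', 'DAL', 'DAL']},
    index=pd.to_datetime(['2015-07-01 06:00', '2015-07-01 09:00',
                          '2015-07-02 07:00', '2015-07-02 18:00']))

df.columns = df.columns.str.strip()
dallas = df['Destination Airport'].str.contains('DAL')

# Summing booleans per day counts True values, i.e. DAL departures
daily_departures = dallas.resample('D').sum()
stats = daily_departures.describe()
```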

  • Replace the index of ts2 with that of ts1, and then fill in the missing values of ts2 by using .interpolate(method='linear'). Save the result as ts2_interp.
  • Compute the difference between ts1 and ts2_interp. Take the absolute value of the difference with np.abs(), and assign the result to differences.
  • Generate and print summary statistics of the differences with .describe() and print()

# Reset the index of ts2 to ts1, and then use linear interpolation to fill in the NaNs: ts2_interp
ts2_interp = ts2.reindex(ts1.index).interpolate(method='linear')


# Compute the absolute difference of ts1 and ts2_interp: differences 
differences = np.abs(ts1 - ts2_interp)


# Generate and print summary statistics of the differences
print(differences.describe())
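On a hypothetical pair where ts2 keeps every other point of ts1, linear interpolation recovers ts1 exactly, so all differences are zero:

```python
import numpy as np
import pandas as pd

# Hypothetical series: ts2 keeps every other point of an hourly ts1
idx = pd.date_range('2016-07-01', periods=5, freq='h')
ts1 = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0], index=idx)
ts2 = ts1[::2]

# Reindex onto ts1's index, then fill the gaps linearly
ts2_interp = ts2.reindex(ts1.index).interpolate(method='linear')

differences = np.abs(ts1 - ts2_interp)
print(differences.describe())
```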

Handling time zones with pandas

  • Create a Boolean mask, mask, such that if the 'Destination Airport' column of df equals 'LAX', the result is True, and otherwise, it is False.
  • Use the mask to extract only the LAX rows. Assign the result to la.
  • Concatenate the two columns la['Date (MM/DD/YYYY)'] and la['Wheels-off Time'] with a ' ' space in between. Pass this to pd.to_datetime() to create a datetime array of all the times the LAX-bound flights left the ground.
  • Use Series.dt.tz_localize() to localize the time to 'US/Central'.
  • Use the .dt.tz_convert() method to convert datetimes from 'US/Central' to 'US/Pacific'.

# Build a Boolean mask to filter out all the 'LAX' departure flights: mask
mask = df['Destination Airport'] == 'LAX'


# Use the mask to subset the data: la
la = df[mask]


# Combine two columns of data to create a datetime series: times_tz_none 
times_tz_none = pd.to_datetime( la['Date (MM/DD/YYYY)'] + ' ' + la['Wheels-off Time'] )


# Localize the time to US/Central: times_tz_central
times_tz_central = times_tz_none.dt.tz_localize('US/Central')


# Convert the datetimes from US/Central to US/Pacific
times_tz_pacific = times_tz_central.dt.tz_convert("US/Pacific")
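A two-row hypothetical example shows that converting time zones shifts the wall-clock reading while keeping the same instants (Central is two hours ahead of Pacific in July):

```python
import pandas as pd

# Hypothetical naive departure times (no timezone attached yet)
times_tz_none = pd.Series(pd.to_datetime(['2015-07-01 06:30',
                                          '2015-07-01 09:15']))

# Attach US/Central, then express the same instants in US/Pacific
times_tz_central = times_tz_none.dt.tz_localize('US/Central')
times_tz_pacific = times_tz_central.dt.tz_convert('US/Pacific')
```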


  • Use pd.to_datetime() to convert the 'Date' column to a collection of datetime objects, and assign back to df.Date.
  • Set the index to this updated 'Date' column, using df.set_index() with the optional keyword argument inplace=True, so that you don't have to assign the result back to df.
  • Re-plot the DataFrame to see that the axis is now datetime aware. This code has been written for you.

# Plot the raw data before setting the datetime index
df.plot()
plt.show()


# Convert the 'Date' column into a collection of datetime objects: df.Date
df.Date = pd.to_datetime(df['Date'])


# Set the index to be the converted 'Date' column
df.set_index('Date', inplace=True)


# Re-plot the DataFrame to see that the axis is now datetime aware!
df.plot()
plt.show()


  • Downsample df_clean with daily frequency and aggregate by the mean. Store the result as daily_mean_2011.
  • Extract the 'dry_bulb_faren' column from daily_mean_2011 as a NumPy array using .values. Store the result as daily_temp_2011. Note: .values is an attribute, not a method, so you don't have to use ().
  • Downsample df_climate with daily frequency and aggregate by the mean. Store the result as daily_climate.
  • Extract the 'Temperature' column from daily_climate using the .reset_index() method. To do this, first reset the index of daily_climate, and then use bracket slicing to access 'Temperature'. Store the result as daily_temp_climate.

# Downsample df_clean by day and aggregate by mean: daily_mean_2011
daily_mean_2011 = df_clean.resample("D").mean()


# Extract the dry_bulb_faren column from daily_mean_2011 using .values: daily_temp_2011
daily_temp_2011 = daily_mean_2011.dry_bulb_faren.values


# Downsample df_climate by day and aggregate by mean: daily_climate
daily_climate = df_climate.resample("D").mean()


# Extract the Temperature column from daily_climate using .reset_index(): daily_temp_climate
daily_temp_climate = daily_climate.reset_index().Temperature


# Compute the difference between the two arrays and print the mean difference
difference = daily_temp_2011 - daily_temp_climate
print(difference.mean())


  • Use .loc[] to select sunny days and assign to sunny. If 'sky_condition' equals 'CLR', then the day is sunny.
  • Use .loc[] to select overcast days and assign to overcast. If 'sky_condition' contains 'OVC', then the day is overcast.
  • Resample sunny and overcast and aggregate by the maximum (.max()) daily ('D') temperature. Assign to sunny_daily_max and overcast_daily_max.
  • Print the difference between the mean of sunny_daily_max and overcast_daily_max. This has already been done for you, so click 'Submit Answer' to view the result!

# Select days that are sunny: sunny
sunny = df_clean.loc[df_clean['sky_condition'] == 'CLR']


# Select days that are overcast: overcast
overcast = df_clean.loc[df_clean['sky_condition'].str.contains('OVC')]


# Resample sunny and overcast, aggregating by maximum daily temperature
sunny_daily_max = sunny.resample('D').max()
overcast_daily_max = overcast.resample('D').max()


# Print the difference between the mean of sunny_daily_max and overcast_daily_max
print(sunny_daily_max.mean() - overcast_daily_max.mean())



Analyzing data with plots

  • Import matplotlib.pyplot as plt.
  • Select the 'visibility' and 'dry_bulb_faren' columns and resample them by week, aggregating the mean. Assign the result to weekly_mean.
  • Print the output of weekly_mean.corr().
  • Plot the weekly_mean dataframe with .plot(), specifying subplots=True.

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt


# Select the visibility and dry_bulb_faren columns and resample them: weekly_mean
weekly_mean = df_clean[['visibility','dry_bulb_faren']].resample("W").mean()


# Print the output of weekly_mean.corr()
print(weekly_mean.corr())


# Plot weekly_mean with subplots=True
weekly_mean.plot(subplots=True)
plt.show()


  • Create a Boolean Series for sunny days. Assign the result to sunny.
  • Resample sunny by day and compute the sum. Assign the result to sunny_hours.
  • Resample sunny by day and compute the count. Assign the result to total_hours.
  • Divide sunny_hours by total_hours. Assign to sunny_fraction.
  • Make a box plot of sunny_fraction.

# Create a Boolean Series for sunny days: sunny
sunny = df_clean.sky_condition.str.contains("CLR")


# Resample the Boolean Series by day and compute the sum: sunny_hours
sunny_hours = sunny.resample('D').sum()


# Resample the Boolean Series by day and compute the count: total_hours
total_hours = sunny.resample("D").count()


# Divide sunny_hours by total_hours: sunny_fraction
sunny_fraction = sunny_hours / total_hours


# Make a box plot of sunny_fraction
sunny_fraction.plot(kind='box')
plt.show()


  • From df_climate, extract the maximum temperature observed in August 2010. The relevant column here is 'Temperature'. You can select the rows corresponding to August 2010 in multiple ways. For example, df_climate.loc['2011-Feb'] selects all rows corresponding to February 2011, while df_climate.loc['2009-09', 'Pressure'] selects the rows corresponding to September 2009 from the 'Pressure' column.
  • From df_clean, select the August 2011 temperature data from the 'dry_bulb_faren' column. Resample this data by day and aggregate the maximum value. Store the result in august_2011.
  • Filter out days in august_2011 where the value exceeded august_max. Store the result in august_2011_high.
  • Construct a CDF of august_2011_high using 25 bins. Remember to specify the kind, density, and cumulative parameters in addition to bins.

# Extract the maximum temperature in August 2010 from df_climate: august_max
august_max = df_climate.loc['2010-08','Temperature'].max()
print(august_max)


# Resample the August 2011 temperatures in df_clean by day and aggregate the maximum value: august_2011
august_2011 = df_clean.loc['2011-08','dry_bulb_faren'].resample("D").max()


# Filter out days in august_2011 where the value exceeded august_max: august_2011_high
august_2011_high = august_2011[august_2011.values>august_max]


# Construct a CDF of august_2011_high
# (density=True replaces the normed argument removed from newer matplotlib)
august_2011_high.plot(kind='hist', bins=25, density=True, cumulative=True)


# Display the plot
plt.show()

