pandas Foundations
Using NumPy's log10 with a DataFrame
- Import numpy using the standard alias np.
- Assign the numerical values in the DataFrame df to an array np_vals using the attribute .values.
- Pass np_vals into the NumPy function np.log10() and store the result in np_vals_log10.
- Pass the entire DataFrame df into np.log10() and store the result in df_log10.
- Call print() and type() on both np_vals_log10 and df_log10, and compare. This has been done for you.
# Import numpy
import numpy as np
# Create array of DataFrame values: np_vals
np_vals = df.values
# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)
# Create array of new DataFrame by passing df to np.log10(): df_log10
df_log10 = np.log10(df)
# Print original and new data containers
print(type(np_vals), type(np_vals_log10))
print(type(df), type(df_log10))
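The snippet above assumes a preloaded DataFrame df. A minimal self-contained sketch with made-up numbers (the column names and values are hypothetical) shows the key point: a NumPy ufunc applied to a DataFrame returns a DataFrame, preserving index and columns, while applying it to .values returns a plain array.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the course's preloaded df
df = pd.DataFrame({'a': [1.0, 10.0], 'b': [100.0, 1000.0]})

# .values is an attribute, not a method: it returns a NumPy array
np_vals = df.values
np_vals_log10 = np.log10(np_vals)   # ndarray in, ndarray out

# Passing the DataFrame itself preserves the index and columns
df_log10 = np.log10(df)

print(type(np_vals_log10))  # <class 'numpy.ndarray'>
print(type(df_log10))       # <class 'pandas.core.frame.DataFrame'>
```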
Creating a DataFrame from lists with zip
- Zip the 2 lists list_keys and list_values together into one list of (key, value) tuples. Be sure to convert the zip object into a list, and store the result in zipped.
- Inspect the contents of zipped using print(). This has been done for you.
- Construct a dictionary using zipped. Store the result as data.
- Construct a DataFrame using the dictionary. Store the result as df.
# Zip the 2 lists together into one list of (key,value) tuples: zipped
zipped = list(zip(list_keys, list_values))
# Inspect the list using print()
print(zipped)
# Build a dictionary with the zipped list: data
data = dict(zipped)
# Build and inspect a DataFrame from the dictionary: df
df = pd.DataFrame(data)
print(df)
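In the exercise, list_keys and list_values are preloaded. A self-contained sketch with hypothetical stand-in lists makes the materialization point concrete: in Python 3, zip() returns a lazy iterator, so it must be wrapped in list() before it can be printed or reused.

```python
import pandas as pd

# Hypothetical stand-ins for the exercise's preloaded lists
list_keys = ['Country', 'Total']
list_values = [['United States', 'Soviet Union'], [1118, 473]]

# zip() returns a lazy iterator in Python 3, so materialize it with list()
zipped = list(zip(list_keys, list_values))

# Each (key, value) tuple becomes one dictionary entry, then one column
data = dict(zipped)
df = pd.DataFrame(data)
print(df)
```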
Preprocessing while reading files with pandas
- Use pd.read_csv() without any keyword arguments to read file_messy into a pandas DataFrame df1.
- Use .head() to print the first 5 rows of df1 and see how messy it is. Do this in the IPython Shell first so you can see how modifying read_csv() can clean up this mess.
- Using the keyword arguments delimiter=' ', header=3 and comment='#', use pd.read_csv() again to read file_messy into a new DataFrame df2.
- Print the output of df2.head() to verify the file was read correctly.
- Use the DataFrame method .to_csv() to save df2 to the file named by the variable file_clean. Be sure to specify index=False.
- Use the DataFrame method .to_excel() to save df2 to the file 'file_clean.xlsx'. Again, remember to specify index=False.
# Read the raw file as-is: df1
df1 = pd.read_csv(file_messy)
# Print the output of df1.head()
print(df1.head())
# Read in the file with the correct parameters: df2
df2 = pd.read_csv(file_messy, delimiter=" ", header=3, comment='#')
# Print the output of df2.head()
print(df2.head())
# Save the cleaned up DataFrame to a CSV file without the index
df2.to_csv(file_clean, index=False)
# Save the cleaned up DataFrame to an excel file without the index
df2.to_excel('file_clean.xlsx', index=False)
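Since file_messy isn't available outside the course, a sketch with an in-memory stand-in (the junk lines, header text, and values below are all invented) shows what each keyword does: header=3 treats the zero-indexed fourth line as column names and discards the lines before it, comment='#' drops commented lines, and delimiter=' ' splits fields on spaces.

```python
import io
import pandas as pd

# A hypothetical stand-in for file_messy: three junk lines, the real header
# on line 4 (header=3, zero-indexed), and a comment mixed into the data
messy = io.StringIO(
    "junk line one\n"
    "junk line two\n"
    "junk line three\n"
    "year month temp\n"
    "# a stray comment\n"
    "2010 Jan 32.5\n"
    "2010 Feb 41.0\n"
)

df2 = pd.read_csv(messy, delimiter=' ', header=3, comment='#')
print(df2)
```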
Plotting and saving figures with pandas
- Create the plot with the DataFrame method df.plot(). Specify a color of 'red'. Note: c and color are interchangeable as parameters here, but we ask you to be explicit and specify color.
- Use plt.title() to give the plot a title of 'Temperature in Austin'.
- Use plt.xlabel() to give the plot an x-axis label of 'Hours since midnight August 1, 2010'.
- Use plt.ylabel() to give the plot a y-axis label of 'Temperature (degrees F)'.
- Finally, display the plot using plt.show().
# Create a plot with color='red'
df.plot(color='red')
# Add a title
plt.title("Temperature in Austin")
# Specify the x-axis label
plt.xlabel("Hours since midnight August 1, 2010")
# Specify the y-axis label
plt.ylabel("Temperature (degrees F)")
# Display the plot
plt.show()
- Plot all columns together on one figure by calling df.plot(), and note the vertical scaling problem.
- Plot all columns as subplots. To do so, specify subplots=True inside .plot().
- Plot a single column of dew point data. To do this, define a column list containing the single column name 'Dew Point (deg F)', and call df[column_list1].plot().
- Plot two columns of data, 'Temperature (deg F)' and 'Dew Point (deg F)'. To do this, define a list containing those column names and pass it into df[], as df[column_list2].plot().
# Plot all columns together (note the vertical scaling problem)
df.plot()
plt.show()
# Plot all columns as subplots
df.plot(subplots=True)
plt.show()
# Plot just the Dew Point data
column_list1 = ['Dew Point (deg F)']
df[column_list1].plot()
plt.show()
# Plot the Dew Point and Temperature data, but not the Pressure data
column_list2 = ['Temperature (deg F)','Dew Point (deg F)']
df[column_list2].plot()
plt.show()
- Create a list of y-axis column names called y_columns consisting of 'AAPL' and 'IBM'.
- Generate a line plot with x='Month' and y=y_columns as inputs.
- Give the plot a title of 'Monthly stock prices'.
- Specify the y-axis label.
- Display the plot.
# Create a list of y-axis column names: y_columns
y_columns = ['AAPL','IBM']
# Generate a line plot
df.plot(x='Month', y=y_columns)
# Add the title
plt.title('Monthly stock prices')
# Add the y-axis label
plt.ylabel('Price ($US)')
# Display the plot
plt.show()
- Generate a scatter plot with 'hp' on the x-axis and 'mpg' on the y-axis. Specify s=sizes.
- Add a title to the plot.
- Specify the x-axis and y-axis labels.
df.plot(kind='scatter', x='hp', y='mpg', s=sizes)
# Add the title
plt.title('Fuel efficiency vs Horse-power')
# Add the x-axis label
plt.xlabel('Horse-power')
# Add the y-axis label
plt.ylabel('Fuel efficiency (mpg)')
# Display the plot
plt.show()
- Make a list called cols of the column names to be plotted: 'weight' and 'mpg'. You can then access them using df[cols].
- Generate box plots of the two columns in a single figure. To do this, specify subplots=True.
# Make a list of the column names to be plotted: cols
cols = ['weight', 'mpg']
# Generate the box plots
df[cols].plot(kind="box", subplots=True)
# Display the plot
plt.show()
- Plot a PDF for the values in fraction with 30 bins between 0 and 30%. The range has been taken care of for you. ax=axes[0] means that this plot will appear in the first row.
- Plot a CDF for the values in fraction with 30 bins between 0 and 30%. Again, the range has been specified for you. To make the CDF appear on the second row, specify ax=axes[1].
# This formats the plots such that they appear on separate rows
fig, axes = plt.subplots(nrows=2, ncols=1)
# Plot the PDF (density= replaces the deprecated normed= argument)
df.fraction.plot(ax=axes[0], kind='hist', density=True, bins=30, range=(0,.3))
# Plot the CDF
df.fraction.plot(ax=axes[1], kind='hist', density=True, cumulative=True, bins=30, range=(0,.3))
plt.show()
- Print the minimum value of the 'Engineering' column.
- Print the maximum value of the 'Engineering' column.
- Construct the mean percentage per year with .mean(axis='columns'). Assign the result to mean.
- Plot the average percentage per year. Since 'Year' is the index of df, it will appear on the x-axis of the plot. No keyword arguments are needed in your call to .plot().
# Print the minimum value of the Engineering column
print(df['Engineering'].min())
# Print the maximum value of the Engineering column
print(df['Engineering'].max())
# Construct the mean percentage per year: mean
mean = df.mean(axis='columns')
# Plot the average percentage per year
mean.plot()
# Display the plot
plt.show()
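A self-contained sketch of the min/max/row-mean steps, using an invented two-column table indexed by year (the column names and percentages are hypothetical). The key detail is axis='columns': it averages across each row, producing one mean per year rather than one mean per column.

```python
import pandas as pd

# Hypothetical percentages of degrees awarded to women, indexed by year
df = pd.DataFrame(
    {'Engineering': [0.8, 1.0, 1.2], 'Physical Sciences': [13.8, 14.9, 13.8]},
    index=[1970, 1971, 1972],
)
df.index.name = 'Year'

print(df['Engineering'].min())   # 0.8
print(df['Engineering'].max())   # 1.2

# axis='columns' averages across each row: one mean per year
mean = df.mean(axis='columns')
print(mean)
```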
Exploring patterns in the data
- Print the number of countries reported in 2015. To do this, use the .count() method on the '2015' column of df.
- Print the 5th and 95th percentiles of df. To do this, use the .quantile() method with the list [0.05, 0.95].
- Generate a box plot using the list of columns provided in years. This has already been done for you, so click on 'Submit Answer' to view the result!
# Print the number of countries reported in 2015
print(df['2015'].count())
# Print the 5th and 95th percentiles
print(df.quantile([0.05, 0.95]))
# Generate a box plot
years = ['1800','1850','1900','1950','2000']
df[years].plot(kind='box')
plt.show()
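A small sketch with invented life-expectancy-style data illustrates the two summary calls: .count() excludes NaN, so it gives the number of countries actually reported, and .quantile() with a list returns one row per requested percentile.

```python
import numpy as np
import pandas as pd

# Hypothetical table: one row per country, NaN means "not reported"
df = pd.DataFrame({
    '1900': [48.0, 44.0, np.nan, 39.0],
    '2015': [81.0, 76.0, 72.0, np.nan],
})

# .count() skips NaN: only three countries reported a 2015 value
print(df['2015'].count())

# One row per percentile, one column per original column
print(df.quantile([0.05, 0.95]))
```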
Working with time series data
- Prepare a format string, time_format, using '%Y-%m-%d %H:%M' as the desired format.
- Convert date_list into datetime objects by using the pd.to_datetime() function. Specify the format string you defined above and assign the result to my_datetimes.
- Construct a pandas Series called time_series using pd.Series() with temperature_list and my_datetimes. Set the index of the Series to be my_datetimes.
# Prepare a format string: time_format
time_format = '%Y-%m-%d %H:%M'
# Convert date_list into a datetime object: my_datetimes
my_datetimes = pd.to_datetime(date_list, format=time_format)
# Construct a pandas Series using temperature_list and my_datetimes: time_series
time_series = pd.Series(temperature_list, index=my_datetimes)
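With date_list and temperature_list preloaded in the exercise, a runnable sketch needs stand-ins (the timestamps and temperatures below are invented). Supplying format= skips pandas' format inference, which is both faster and unambiguous for fixed-layout strings.

```python
import pandas as pd

# Hypothetical stand-ins for the preloaded date_list and temperature_list
date_list = ['2010-08-01 00:00', '2010-08-01 01:00', '2010-08-01 02:00']
temperature_list = [79.0, 77.4, 76.4]

time_format = '%Y-%m-%d %H:%M'

# format= skips inference: every string must match the layout exactly
my_datetimes = pd.to_datetime(date_list, format=time_format)
time_series = pd.Series(temperature_list, index=my_datetimes)
print(time_series)
```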
- Create a new time series ts3 by reindexing ts2 with the index of ts1. To do this, call .reindex() on ts2 and pass in the index of ts1 (ts1.index).
- Create another new time series, ts4, by calling the same .reindex() as above, but also specifying a fill method, using the keyword argument method='ffill' to forward-fill values.
- Add ts1 + ts2. Assign the result to sum12.
- Add ts1 + ts3. Assign the result to sum13.
- Add ts1 + ts4. Assign the result to sum14.
# Reindex without fill method: ts3
ts3 = ts2.reindex(ts1.index)
# Reindex with fill method, using forward fill: ts4
ts4 = ts2.reindex(ts1.index, method='ffill')
# Combine ts1 + ts2: sum12
sum12 = ts1 + ts2
# Combine ts1 + ts3: sum13
sum13 = ts1 + ts3
# Combine ts1 + ts4: sum14
sum14 = ts1 + ts4
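The exercise's ts1 and ts2 are preloaded; a sketch with invented hourly series (ts2 keeps only every other hour) shows why the fill method matters: without it, reindexing leaves NaN at the missing labels, and those NaNs propagate through addition.

```python
import pandas as pd

# ts1 is hourly; ts2 only has every other hour (hypothetical data)
idx1 = pd.date_range('2016-07-01', periods=6, freq='h')
ts1 = pd.Series(range(6), index=idx1)
ts2 = ts1[::2]  # keeps hours 0, 2 and 4

# Without a fill method, missing labels become NaN
ts3 = ts2.reindex(ts1.index)
# method='ffill' propagates the last valid value forward instead
ts4 = ts2.reindex(ts1.index, method='ffill')

sum12 = ts1 + ts2   # NaN wherever the indexes don't overlap
sum13 = ts1 + ts3   # same NaNs: ts3 still has them
sum14 = ts1 + ts4   # fully populated thanks to the forward fill
```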
Aggregating data
- Downsample the 'Temperature' column of df to 6-hour data using .resample('6h') and .mean(). Assign the result to df1.
- Downsample the 'Temperature' column of df to daily data using .resample('D') and then count the number of data points in each day with .count(). Assign the result to df2.
# Downsample to 6-hour means: df1
df1 = df.Temperature.resample('6h').mean()
# Downsample to daily counts: df2
df2 = df.Temperature.resample('D').count()
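A runnable sketch with two days of invented hourly readings: resampling to '6h' yields one mean per 6-hour bin, while resampling to 'D' with .count() reports how many readings fell in each day.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly temperature readings over two days
idx = pd.date_range('2010-08-01', periods=48, freq='h')
df = pd.DataFrame({'Temperature': np.linspace(70, 90, 48)}, index=idx)

# 48 hourly readings collapse into 8 six-hour means
df1 = df.Temperature.resample('6h').mean()
# .count() after daily resampling gives readings per day
df2 = df.Temperature.resample('D').count()
print(df2)
```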
- Use partial string indexing to extract temperature data for August 2010 into august.
- Use the August temperature data and downsample to find the daily maximum temperatures. Store the result in august_highs.
- Use partial string indexing to extract temperature data for February 2010 into february.
- Use the February temperature data and downsample to find the daily minimum temperatures. Store the result in february_lows.
# Extract August 2010 temperatures using partial string indexing: august
august = df.Temperature.loc['2010-08']
# Downsample to obtain only the daily highest temperatures in August: august_highs
august_highs = august.resample("D").max()
# Extract temperature data for February: february
february = df.Temperature.loc['2010-02']
# Downsample to obtain the daily lowest temperatures in February: february_lows
february_lows = february.resample("D").min()
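A self-contained sketch with a year of invented hourly data: the string '2010-08' in .loc[] is partial string indexing, selecting every timestamp that falls inside that month, after which the daily resample aggregates within each day.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data covering all of 2010
idx = pd.date_range('2010-01-01', '2010-12-31 23:00', freq='h')
rng = np.random.default_rng(0)
df = pd.DataFrame({'Temperature': rng.uniform(20, 100, len(idx))}, index=idx)

# Partial string indexing: '2010-08' selects every row in that month
august = df.Temperature.loc['2010-08']
august_highs = august.resample('D').max()

february = df.Temperature.loc['2010-02']
february_lows = february.resample('D').min()
```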
- Use partial string indexing to extract temperature data from August 1, 2010 to August 15, 2010. Assign to unsmoothed.
- Use .rolling() with a 24-hour window to smooth the mean temperature data. Assign the result to smoothed.
- Use a dictionary to create a new DataFrame august with the time series smoothed and unsmoothed as columns.
- Plot both columns of august as line plots using the .plot() method.
# Extract data from 2010-Aug-01 to 2010-Aug-15: unsmoothed
unsmoothed = df['Temperature']['2010-08-01':'2010-08-15']
# Apply a rolling mean with a 24 hour window: smoothed
smoothed = unsmoothed.rolling(window=24).mean()
# Create a new DataFrame with columns smoothed and unsmoothed: august
august = pd.DataFrame({'smoothed':smoothed, 'unsmoothed':unsmoothed})
# Plot both smoothed and unsmoothed data using august.plot().
august.plot()
plt.show()
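A sketch of the rolling-mean step with invented hourly values: window=24 counts observations, not wall-clock hours, so the first 23 entries have no complete window and come out as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series for the first half of August
idx = pd.date_range('2010-08-01', '2010-08-15 23:00', freq='h')
unsmoothed = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# A 24-observation window: the first 23 entries have no full window -> NaN
smoothed = unsmoothed.rolling(window=24).mean()

# Aligning both series by index into one DataFrame
august = pd.DataFrame({'smoothed': smoothed, 'unsmoothed': unsmoothed})
```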
Smoothing time series data
- Use partial string indexing to extract August 2010 temperature data, and assign to august.
- Resample to daily frequency, saving the maximum daily temperatures, and assign the result to daily_highs.
- As part of one long method chain, repeat the above resampling (or re-use daily_highs) and combine it with .rolling() to apply a 7-day .mean() (with window=7 inside .rolling()) so as to smooth the daily highs. Assign the result to daily_highs_smoothed and print the result.
# Extract August 2010 temperature data: august
august = df['Temperature']['2010-08']
# Resample to daily data, aggregating by max: daily_highs
daily_highs = august.resample("D").max()
# Use a rolling 7-day window with method chaining to smooth the daily high temperatures in August
daily_highs_smoothed = august.resample("D").max().rolling(window=7).mean()
print(daily_highs_smoothed)
- Use .str.strip() to strip extra whitespace from df.columns. Assign the result back to df.columns.
- In the 'Destination Airport' column, extract all entries where Dallas ('DAL') is the destination airport. Use .str.contains('DAL') for this and store the result in dallas.
- Resample dallas such that you get the total number of departures each day. Store the result in daily_departures.
- Generate summary statistics for daily Dallas departures using .describe(). Store the result in stats.
# Strip extra whitespace from the column names
df.columns = df.columns.str.strip()
# Extract data for which the destination airport is Dallas: dallas
dallas = df['Destination Airport'].str.contains('DAL')
# Compute the total number of Dallas departures each day: daily_departures
daily_departures = dallas.resample("D").sum()
# Generate the summary statistics for daily Dallas departures: stats
stats = daily_departures.describe()
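A self-contained sketch with a few invented flight records (the padded column name and airport codes are made up): summing a Boolean Series counts its True values, so resampling the mask by day and summing counts Dallas departures per day.

```python
import pandas as pd

# Hypothetical flight records; note the whitespace-padded column name
idx = pd.to_datetime(['2015-07-01 06:00', '2015-07-01 12:00',
                      '2015-07-02 09:00'])
df = pd.DataFrame({'  Destination Airport  ': ['DAL', 'LAX', 'DAL']},
                  index=idx)

# Strip stray whitespace from every column label
df.columns = df.columns.str.strip()

# Boolean Series: True where the destination is Dallas
dallas = df['Destination Airport'].str.contains('DAL')

# Summing booleans counts the True values: departures per day
daily_departures = dallas.resample('D').sum()
stats = daily_departures.describe()
```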
- Replace the index of ts2 with that of ts1, and then fill in the missing values of ts2 by using .interpolate(method='linear'). Save the result as ts2_interp.
- Compute the difference between ts1 and ts2_interp. Take the absolute value of the difference with np.abs(), and assign the result to differences.
- Generate and print summary statistics of the differences with .describe() and print().
# Reindex ts2 to ts1's index, then fill the gaps linearly: ts2_interp
ts2_interp = ts2.reindex(ts1.index).interpolate(method='linear')
# Compute the absolute difference of ts1 and ts2_interp: differences
differences = np.abs(ts1 - ts2_interp)
# Generate and print summary statistics of the differences
print(differences.describe())
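With invented stand-ins for ts1 and ts2 (ts2 is missing the in-between hours), this sketch shows the chain end to end: reindexing introduces NaNs, linear interpolation fills them from the neighboring values, and np.abs() gives element-wise absolute differences.

```python
import numpy as np
import pandas as pd

# ts1 is hourly; ts2 is missing the in-between hours (hypothetical data)
idx = pd.date_range('2016-07-01', periods=5, freq='h')
ts1 = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0], index=idx)
ts2 = pd.Series([0.0, 2.5, 4.5], index=idx[::2])

# Reindex to ts1's index, then fill the NaNs by linear interpolation
ts2_interp = ts2.reindex(ts1.index).interpolate(method='linear')

# Element-wise absolute difference between the two series
differences = np.abs(ts1 - ts2_interp)
print(differences.describe())
```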
Handling time zones with pandas
- Create a Boolean mask, mask, such that if the 'Destination Airport' column of df equals 'LAX', the result is True, and otherwise False.
- Use the mask to extract only the LAX rows. Assign the result to la.
- Concatenate the two columns la['Date (MM/DD/YYYY)'] and la['Wheels-off Time'] with a ' ' space in between. Pass this to pd.to_datetime() to create a datetime Series of all the times the LAX-bound flights left the ground.
- Use Series.dt.tz_localize() to localize the time to 'US/Central'.
- Use the .dt.tz_convert() method to convert the datetimes from 'US/Central' to 'US/Pacific'.
# Build a Boolean mask to filter out all the 'LAX' departure flights: mask
mask = df['Destination Airport'] == 'LAX'
# Use the mask to subset the data: la
la = df[mask]
# Combine two columns of data to create a datetime series: times_tz_none
times_tz_none = pd.to_datetime( la['Date (MM/DD/YYYY)'] + ' ' + la['Wheels-off Time'] )
# Localize the time to US/Central: times_tz_central
times_tz_central = times_tz_none.dt.tz_localize('US/Central')
# Convert the datetimes from US/Central to US/Pacific
times_tz_pacific = times_tz_central.dt.tz_convert("US/Pacific")
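A runnable sketch with two invented flight rows shows the two-step timezone pattern: tz_localize() attaches a zone to naive datetimes (it doesn't shift them), and tz_convert() then re-expresses the same instants in another zone.

```python
import pandas as pd

# Hypothetical departure records for two airports
df = pd.DataFrame({
    'Date (MM/DD/YYYY)': ['07/01/2015', '07/01/2015'],
    'Wheels-off Time': ['06:30', '13:15'],
    'Destination Airport': ['LAX', 'DAL'],
})

mask = df['Destination Airport'] == 'LAX'
la = df[mask]

# Build naive datetimes first, then attach and convert a timezone
times_tz_none = pd.to_datetime(la['Date (MM/DD/YYYY)'] + ' '
                               + la['Wheels-off Time'])
times_tz_central = times_tz_none.dt.tz_localize('US/Central')
times_tz_pacific = times_tz_central.dt.tz_convert('US/Pacific')
print(times_tz_pacific)  # 06:30 Central becomes 04:30 Pacific
```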
- Use pd.to_datetime() to convert the 'Date' column to a collection of datetime objects, and assign back to df.Date.
- Set the index to this updated 'Date' column, using df.set_index() with the optional keyword argument inplace=True, so that you don't have to assign the result back to df.
- Re-plot the DataFrame to see that the axis is now datetime aware. This code has been written for you.
# Plot the raw data before setting the datetime index
df.plot()
plt.show()
# Convert the 'Date' column into a collection of datetime objects: df.Date
df.Date = pd.to_datetime(df['Date'])
# Set the index to be the converted 'Date' column
df.set_index('Date', inplace=True)
# Re-plot the DataFrame to see that the axis is now datetime aware!
df.plot()
plt.show()
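A minimal sketch with invented date strings and prices: converting the column with pd.to_datetime() before set_index() is what makes the resulting index datetime-aware, which in turn enables partial string indexing and date-formatted plot axes.

```python
import pandas as pd

# Hypothetical data with the date stored as plain strings
df = pd.DataFrame({
    'Date': ['2015-01-01', '2015-01-02', '2015-01-03'],
    'Close': [101.5, 102.0, 100.8],
})

# Convert the strings to datetimes, then promote the column to the index
df.Date = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
print(df.index.dtype)  # datetime64[ns]
```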
- Downsample df_clean with daily frequency and aggregate by the mean. Store the result as daily_mean_2011.
- Extract the 'dry_bulb_faren' column from daily_mean_2011 as a NumPy array using .values. Store the result as daily_temp_2011. Note: .values is an attribute, not a method, so you don't have to use ().
- Downsample df_climate with daily frequency and aggregate by the mean. Store the result as daily_climate.
- Extract the 'Temperature' column from daily_climate using the .reset_index() method. To do this, first reset the index of daily_climate, and then use bracket slicing to access 'Temperature'. Store the result as daily_temp_climate.
# Downsample df_clean by day and aggregate by mean: daily_mean_2011
daily_mean_2011 = df_clean.resample("D").mean()
# Extract the dry_bulb_faren column from daily_mean_2011 using .values: daily_temp_2011
daily_temp_2011 = daily_mean_2011.dry_bulb_faren.values
# Downsample df_climate by day and aggregate by mean: daily_climate
daily_climate = df_climate.resample("D").mean()
# Extract the Temperature column from daily_climate using .reset_index(): daily_temp_climate
daily_temp_climate = daily_climate.reset_index().Temperature
# Compute the difference between the two arrays and print the mean difference
difference = daily_temp_2011 - daily_temp_climate
print(difference.mean())
- Use .loc[] to select sunny days and assign to sunny. If 'sky_condition' equals 'CLR', then the day is sunny.
- Use .loc[] to select overcast days and assign to overcast. If 'sky_condition' contains 'OVC', then the day is overcast.
- Resample sunny and overcast and aggregate by the maximum (.max()) daily ('D') temperature. Assign to sunny_daily_max and overcast_daily_max.
- Print the difference between the mean of sunny_daily_max and overcast_daily_max. This has already been done for you, so click 'Submit Answer' to view the result!
# Select days that are sunny (sky_condition equals 'CLR'): sunny
sunny = df_clean.loc[df_clean['sky_condition'] == 'CLR']
# Select days that are overcast: overcast
overcast = df_clean.loc[df_clean['sky_condition'].str.contains('OVC')]
# Resample sunny and overcast, aggregating by maximum daily temperature
sunny_daily_max = sunny.resample('D').max()
overcast_daily_max = overcast.resample('D').max()
# Print the difference between the mean of sunny_daily_max and overcast_daily_max
print(sunny_daily_max.mean() - overcast_daily_max.mean())
Analysis with plots
- Import matplotlib.pyplot as plt.
- Select the 'visibility' and 'dry_bulb_faren' columns and resample them by week, aggregating the mean. Assign the result to weekly_mean.
- Print the output of weekly_mean.corr().
- Plot the weekly_mean DataFrame with .plot(), specifying subplots=True.
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Select the visibility and dry_bulb_faren columns and resample them: weekly_mean
weekly_mean = df_clean[['visibility','dry_bulb_faren']].resample("W").mean()
# Print the output of weekly_mean.corr()
print(weekly_mean.corr())
# Plot weekly_mean with subplots=True
weekly_mean.plot(subplots=True)
plt.show()
- Create a Boolean Series for sunny days. Assign the result to sunny.
- Resample sunny by day and compute the sum. Assign the result to sunny_hours.
- Resample sunny by day and compute the count. Assign the result to total_hours.
- Divide sunny_hours by total_hours. Assign to sunny_fraction.
- Make a box plot of sunny_fraction.
# Create a Boolean Series for sunny days: sunny
sunny = df_clean.sky_condition.str.contains("CLR")
# Resample the Boolean Series by day and compute the sum: sunny_hours
sunny_hours = sunny.resample('D').sum()
# Resample the Boolean Series by day and compute the count: total_hours
total_hours = sunny.resample("D").count()
# Divide sunny_hours by total_hours: sunny_fraction
sunny_fraction = sunny_hours / total_hours
# Make a box plot of sunny_fraction
sunny_fraction.plot(kind='box')
plt.show()
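A self-contained sketch of the fraction trick with two days of invented sky-condition strings: on a Boolean Series, .sum() counts the True hours while .count() counts all hours, so their ratio is the daily fraction of sunny hours.

```python
import pandas as pd

# Hypothetical hourly sky conditions over two days (1 sunny hour in 4)
idx = pd.date_range('2011-01-01', periods=48, freq='h')
sky = ['CLR' if i % 4 == 0 else 'OVC' for i in range(48)]
df_clean = pd.DataFrame({'sky_condition': sky}, index=idx)

sunny = df_clean.sky_condition.str.contains('CLR')

# sum() counts True hours; count() counts all hours; the ratio is a fraction
sunny_hours = sunny.resample('D').sum()
total_hours = sunny.resample('D').count()
sunny_fraction = sunny_hours / total_hours
print(sunny_fraction)
```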
- From df_climate, extract the maximum temperature observed in August 2010. The relevant column here is 'Temperature'. You can select the rows corresponding to August 2010 in multiple ways. For example, df_climate.loc['2011-Feb'] selects all rows corresponding to February 2011, while df_climate.loc['2009-09', 'Pressure'] selects the rows corresponding to September 2009 from the 'Pressure' column.
- From df_clean, select the August 2011 temperature data from the 'dry_bulb_faren' column. Resample this data by day and aggregate the maximum value. Store the result in august_2011.
- Filter out days in august_2011 where the value exceeded august_max. Store the result in august_2011_high.
- Construct a CDF of august_2011_high using 25 bins. Remember to specify the kind, density (formerly normed), and cumulative parameters in addition to bins.
# Extract the maximum temperature in August 2010 from df_climate: august_max
august_max = df_climate.loc['2010-08','Temperature'].max()
print(august_max)
# Resample the August 2011 temperatures in df_clean by day and aggregate the maximum value: august_2011
august_2011 = df_clean.loc['2011-08','dry_bulb_faren'].resample("D").max()
# Filter out days in august_2011 where the value exceeded august_max: august_2011_high
august_2011_high = august_2011[august_2011.values>august_max]
# Construct a CDF of august_2011_high
august_2011_high.plot(kind='hist', bins=25, density=True, cumulative=True)
# Display the plot
plt.show()