Machine Learning Experiment (1): Determining the Main Causes of Household Power Consumption with Machine Learning (K-means)


Notice: All rights reserved. To reprint, please contact the author and credit the source: http://blog.csdn.net/u013719780?viewmode=contents


About the author: 风雪夜归子 (Allen), a machine learning engineer who enjoys digging into Machine Learning techniques, is keenly interested in Deep Learning and Artificial Intelligence, and follows the Kaggle data mining competition platform. Readers interested in data, Machine Learning, and Artificial Intelligence are welcome to get in touch. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents



Determining the Main Causes of Household Power Consumption with Machine Learning (K-means)

This post performs some basic analysis of a household's power-consumption data.

The post has two parts:

Part One: some simple cleaning and analysis of the data;

Part Two: using an unsupervised machine learning algorithm, K-means, to determine the main causes of the household's power consumption during particular time windows.

First, import the required packages and read the datasets. The code is as follows:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
sensor_data = pd.read_csv('merged-sensor-files.csv',
                          names=["MTU", "Time", "Power", "Cost", "Voltage"], header=0)
weather_data = pd.read_json('weather.json', typ='series')
In [3]:
import json
f = open('weather.json')
json_data = json.load(f)

Time = []
Temperature = []
for time, temperature in json_data.items():
    Time.append(int(time))
    Temperature.append(float(temperature))

temperature = pd.DataFrame({'Time': Time, 'Temperature': Temperature})
temperature
Out[3]:
    Temperature        Time
0          84.4  1431468000
1          83.3  1431450000
2          70.7  1431403200
3          72.1  1431432000
4          84.2  1431464400
5          80.9  1431446400
6          68.6  1431424800
7          81.1  1431475200
8          80.7  1431442800
9          69.2  1431417600
10         76.2  1431435600
11         68.8  1431414000
12         72.1  1431396000
13         68.7  1431428400
14         80.1  1431439200
15         83.0  1431471600
16         69.0  1431410400
17         75.4  1431388800
18         71.0  1431399600
19         69.6  1431406800
20         67.9  1431421200
21         85.1  1431457200
22         87.0  1431460800
23         73.2  1431392400
24         84.5  1431453600
In [4]:
import json
f = open('weather.json').read()
Time = []
Temperature = []
for line in f.split(','):
    time, temperature = line.split(':')
    time = time.replace('"', '')
    time = time.replace('{', '')
    temperature = temperature.replace('"', '')
    temperature = temperature.replace('}', '')
    #print time, temperature
    Time.append(int(time))
    Temperature.append(float(temperature))
In [5]:
# A quick look at the datasets
sensor_data.head(5)
Out[5]:
    MTU                 Time  Power  Cost  Voltage
0  MTU1  05/11/2015 19:59:06  4.102  0.62    122.4
1  MTU1  05/11/2015 19:59:05  4.089  0.62    122.3
2  MTU1  05/11/2015 19:59:04  4.089  0.62    122.3
3  MTU1  05/11/2015 19:59:06  4.089  0.62    122.3
4  MTU1  05/11/2015 19:59:04  4.097  0.62    122.4
In [6]:
sensor_data.describe()
Out[6]:
          MTU   Time  Power   Cost  Voltage
count   88914  88914  88914  88914    88914
unique      2  72359   2495     88       48
top      MTU1   Time  0.136   0.05    123.1
freq    88891     23   6544  19476     5063
In [7]:
sensor_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88914 entries, 0 to 88913
Data columns (total 5 columns):
MTU        88914 non-null object
Time       88914 non-null object
Power      88914 non-null object
Cost       88914 non-null object
Voltage    88914 non-null object
dtypes: object(5)
memory usage: 3.4+ MB


TASK 1: Data Analysis

Data Cleaning

Inspecting merged-sensor-files.csv, we find that some rows are faulty: their values are repeated header strings (e.g. " Power"), presumably left over from merging the individual sensor files.

In [8]:
sensor_data.dtypes
Out[8]:
MTU        object
Time       object
Power      object
Cost       object
Voltage    object
dtype: object

Next, locate the faulty rows.

In [9]:
# Get the indexes of the inconsistent rows
faulty_row_idx = sensor_data[sensor_data["Power"] == " Power"].index.tolist()
faulty_row_idx
Out[9]:
[3784, 7582, 11385, 15004, 18773, 22363, 26049, 29795, 33554, 37193, 40951, 44563, 48227, 51934, 55660, 59431, 63041, 66706, 70468, 74305, 77951, 81617, 85327]

Delete the faulty rows.

In [10]:
sensor_data.drop(faulty_row_idx, inplace=True)
sensor_data[sensor_data["Power"] == " Power"].index.tolist()
Out[10]:
[]

The output above confirms that the faulty rows have been removed successfully.
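As an aside (not the notebook's approach), the same cleanup works without knowing the stray value in advance: coerce the columns to numeric and drop the rows that fail. The tiny frame below is made-up data for illustration:

```python
import pandas as pd

# Made-up frame mimicking a merged CSV where a header row slipped into the data.
df = pd.DataFrame({"Power":   ["4.102", " Power", "0.136"],
                   "Voltage": ["122.4", " Voltage", "123.1"]})

# Coerce to numeric; non-numeric cells (the stray header) become NaN, then drop them.
for col in ["Power", "Voltage"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
df = df.dropna().reset_index(drop=True)
print(df)
```

This catches every non-numeric row in one pass, whatever string it contains.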

We have cleaned up sensor_data; the columns can now be converted to more appropriate data types.

In [11]:
sensor_data[["Power", "Cost", "Voltage"]] = sensor_data[["Power", "Cost", "Voltage"]].astype(float)
sensor_data["Time"] = pd.to_datetime(sensor_data["Time"])
sensor_data['Hour'] = pd.DatetimeIndex(sensor_data["Time"]).hour
sensor_data.dtypes
Out[11]:
MTU                object
Time       datetime64[ns]
Power             float64
Cost              float64
Voltage           float64
Hour                int32
dtype: object

This is better: every column now has a clearly defined dtype. The next step is to convert the weather_data Series to a DataFrame so that it is easier to work with.

In [12]:
temperature_data = weather_data.to_frame()
temperature_data.reset_index(level=0, inplace=True)
temperature_data.columns = ["Time", "Temperature"]
temperature_data.dtypes
temperature_data['Temperature'] = Temperature
temperature_data["Hour"] = pd.DatetimeIndex(temperature_data["Time"]).hour
temperature_data[["Temperature"]] = temperature_data[["Temperature"]].astype(float)
temperature_data
Out[12]:
                   Time  Temperature  Hour
0   2015-05-12 00:00:00         75.4     0
1   2015-05-12 01:00:00         73.2     1
2   2015-05-12 02:00:00         72.1     2
3   2015-05-12 03:00:00         71.0     3
4   2015-05-12 04:00:00         70.7     4
5   2015-05-12 05:00:00         69.6     5
6   2015-05-12 06:00:00         69.0     6
7   2015-05-12 07:00:00         68.8     7
8   2015-05-12 08:00:00         69.2     8
9   2015-05-12 09:00:00         67.9     9
10  2015-05-12 10:00:00         68.6    10
11  2015-05-12 11:00:00         68.7    11
12  2015-05-12 12:00:00         72.1    12
13  2015-05-12 13:00:00         76.2    13
14  2015-05-12 14:00:00         80.1    14
15  2015-05-12 15:00:00         80.7    15
16  2015-05-12 16:00:00         80.9    16
17  2015-05-12 17:00:00         83.3    17
18  2015-05-12 18:00:00         84.5    18
19  2015-05-12 19:00:00         85.1    19
20  2015-05-12 20:00:00         87.0    20
21  2015-05-12 21:00:00         84.2    21
22  2015-05-12 22:00:00         84.4    22
23  2015-05-12 23:00:00         83.0    23
24  2015-05-13 00:00:00         81.1     0

In [14]:
sensor_data.describe()
Out[14]:
              Power          Cost       Voltage          Hour
count  88891.000000  88891.000000  88891.000000  88891.000000
mean       1.315980      0.202427    123.127744     11.531865
std        1.682181      0.252357      0.838768      6.921775
min        0.113000      0.020000    121.000000      0.000000
25%        0.255000      0.040000    122.600000      6.000000
50%        0.367000      0.060000    123.100000     12.000000
75%        1.765000      0.270000    123.700000     18.000000
max        6.547000      0.990000    125.600000     23.000000
In [15]:
temperature_data.describe()
Out[15]:
       Temperature      Hour
count    25.000000  25.00000
mean     76.272000  11.04000
std       6.635355   7.29429
min      67.900000   0.00000
25%      69.600000   5.00000
50%      75.400000  11.00000
75%      83.000000  17.00000
max      87.000000  23.00000

From the statistics above, the mean, minimum, and maximum power draw are 1.316 kW, 0.113 kW, and 6.547 kW respectively. To understand the data better, we plot power and temperature against time. Before plotting, the data needs to be grouped by the 'Hour' column:

In [16]:
grouped_sensor_data = sensor_data.groupby(["Hour"], as_index=False).mean()
grouped_sensor_data
Out[16]:
    Hour     Power      Cost     Voltage
0      0  0.173790  0.029468  124.723879
1      1  0.179594  0.033805  124.522469
2      2  0.185763  0.037013  123.929979
3      3  0.184510  0.036815  124.174454
4      4  0.181104  0.036366  123.847801
5      5  0.184242  0.036693  122.790974
6      6  0.672423  0.106142  123.375132
7      7  0.977755  0.150614  123.722441
8      8  0.382392  0.060904  122.997544
9      9  0.168447  0.027770  122.675906
10    10  0.373942  0.058812  122.986207
11    11  0.383065  0.059837  123.500554
12    12  0.378432  0.059604  122.783133
13    13  0.380076  0.059766  122.991571
14    14  0.378020  0.059666  122.815359
15    15  0.376586  0.059619  122.464499
16    16  4.365774  0.659342  121.766840
17    17  4.318118  0.652923  121.851496
18    18  4.779928  0.721469  122.301059
19    19  4.250034  0.642619  122.103700
20    20  1.967120  0.300640  122.770635
21    21  1.579896  0.242180  123.086060
22    22  2.542672  0.387109  123.542620
23    23  2.269941  0.346457  123.415791
In [17]:
grouped_temperature_data = temperature_data.groupby(["Hour"], as_index=False).mean()
grouped_temperature_data
Out[17]:
    Hour  Temperature
0      0        78.25
1      1        73.20
2      2        72.10
3      3        71.00
4      4        70.70
5      5        69.60
6      6        69.00
7      7        68.80
8      8        69.20
9      9        67.90
10    10        68.60
11    11        68.70
12    12        72.10
13    13        76.20
14    14        80.10
15    15        80.70
16    16        80.90
17    17        83.30
18    18        84.50
19    19        85.10
20    20        87.00
21    21        84.20
22    22        84.40
23    23        83.00

Basic Visualizations:

In [18]:
%pylab inline
plt.style.use('ggplot')
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['f']
`%matplotlib` prevents importing * from pylab and numpy
In [19]:
fig = plt.figure(figsize=(13, 7))
plt.hist(sensor_data.Power, bins=50)
fig.suptitle('Power Histogram', fontsize=20)
plt.xlabel('Power', fontsize=16)
plt.ylabel('Count', fontsize=16)
Out[19]:
<matplotlib.text.Text at 0x115220310>


The histogram shows that power draw is low most of the time, with some periods of heavy usage reaching the 3.5 kW - 5 kW range. Next, plot the distribution of power over the hours of the day.

In [20]:
fig = plt.figure(figsize=(13, 7))
plt.bar(grouped_sensor_data.Hour, grouped_sensor_data.Power)
fig.suptitle('Power Distribution with Hours', fontsize=20)
plt.xlabel('Hour', fontsize=16)
plt.ylabel('Power', fontsize=16)
plt.xticks(range(0, 24))
plt.show()


The bar chart above supports the following inferences:

  • Demand peaks during the evening hours, presumably because most of the appliances (AC, heater, TV, oven, washing machine, and so on) are in use then.
  • The sleeping hours (0000 - 0500) and office hours (0900 - 1600) show very low demand, since most appliances are switched off during those windows.
  • Demand rises slightly between 0600 and 0900, which may be a few appliances becoming active.

Steady states:

  • Between 0000 and 0500, demand is low and stable, in the range 0.17 kW - 0.18 kW.
  • Another steady state spans 1000 - 1500, with demand between 0.373 kW and 0.376 kW.
  • The highest steady period is 1600 - 1900, with demand between 4.36 kW and 4.25 kW.

Demand changes abruptly around 0700 and 1800; this could come from random events, the use of particular appliances, or anomalous data.

There is also a small dip around 0900, where demand drops from 0.38 kW to 0.16 kW and then climbs back to 0.37 kW. A similar pattern appears around 2100.

Let's further plot Power against Temperature and see whether there is any correlation.

In [21]:
fig = plt.figure(figsize=(13, 7))
plt.bar(grouped_temperature_data.Temperature, grouped_sensor_data.Power)
fig.suptitle('Power Distribution with Temperature', fontsize=20)
plt.xlabel('Temperature in Fahrenheit', fontsize=16)
plt.ylabel('Power', fontsize=16)
plt.show()

Temperature and power demand appear to be directly related, which is easy to understand: the dataset was collected in May, and the cooling appliances are switched on during the (evening) peak.


Task 2: Machine Learning

To work on one complete dataset, merge grouped_sensor_data and grouped_temperature_data.

In [22]:
merged_data = grouped_sensor_data.merge(grouped_temperature_data)
merged_data
Out[22]:
    Hour     Power      Cost     Voltage  Temperature
0      0  0.173790  0.029468  124.723879        78.25
1      1  0.179594  0.033805  124.522469        73.20
2      2  0.185763  0.037013  123.929979        72.10
3      3  0.184510  0.036815  124.174454        71.00
4      4  0.181104  0.036366  123.847801        70.70
5      5  0.184242  0.036693  122.790974        69.60
6      6  0.672423  0.106142  123.375132        69.00
7      7  0.977755  0.150614  123.722441        68.80
8      8  0.382392  0.060904  122.997544        69.20
9      9  0.168447  0.027770  122.675906        67.90
10    10  0.373942  0.058812  122.986207        68.60
11    11  0.383065  0.059837  123.500554        68.70
12    12  0.378432  0.059604  122.783133        72.10
13    13  0.380076  0.059766  122.991571        76.20
14    14  0.378020  0.059666  122.815359        80.10
15    15  0.376586  0.059619  122.464499        80.70
16    16  4.365774  0.659342  121.766840        80.90
17    17  4.318118  0.652923  121.851496        83.30
18    18  4.779928  0.721469  122.301059        84.50
19    19  4.250034  0.642619  122.103700        85.10
20    20  1.967120  0.300640  122.770635        87.00
21    21  1.579896  0.242180  123.086060        84.20
22    22  2.542672  0.387109  123.542620        84.40
23    23  2.269941  0.346457  123.415791        83.00

In the earlier visualizations we saw that power demand is lower when the temperature is low, which points mainly at the cooling appliances. The appliance groups of interest are:

  • Cooling Systems
  • TV
  • Geyser
  • Lights
  • Oven
  • Home Security Systems

Next, we use the merged dataset to determine whether these appliances are switched on.

AC, Refrigerator and Other Cooling Systems:


The "Power Distribution with Temperature" plot clearly shows that demand jumps as the temperature rises, which implies the home's cooling appliances are switched on.

TV:

During the evening hours (1600 - 2300), the TV may be another contributor to the increased demand; it shows up fairly clearly in the Power feature.

Geyser, Oven:

The slight rise in demand during the morning hours is likely related to appliances such as these running.

Lights:

Lights have a relatively small effect on demand (assuming the house owner uses energy-saving bulbs).

Home Security Systems:

The slight rise during working hours may come from home-security devices and other automated appliances.

Now we apply the K-Means clustering algorithm, using the features Hour, Power, and Temperature from the raw dataset. First, we need to merge the sensor_data dataframe with grouped_temperature_data.

In [23]:
data = sensor_data.merge(grouped_temperature_data)
data.drop(["Time", "MTU", "Cost", "Voltage"], axis=1, inplace=True)
data.head()
Out[23]:
   Power  Hour  Temperature
0  4.102    19         85.1
1  4.089    19         85.1
2  4.089    19         85.1
3  4.089    19         85.1
4  4.097    19         85.1
In [24]:
from sklearn.cluster import KMeans
from sklearn.cross_validation import train_test_split  # in sklearn >= 0.18 this lives in sklearn.model_selection
In [25]:
np.random.seed(1234)
train_data, test_data = train_test_split(data, test_size=0.25, random_state=42)
In [26]:
train_data.shape
Out[26]:
(66668, 3)
In [27]:
test_data.shape
Out[27]:
(22223, 3)
In [28]:
kmeans = KMeans(n_clusters=4, n_jobs=4)
kmeans_fit = kmeans.fit(train_data)
/Applications/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/hashing.py:197: DeprecationWarning: Changing the shape of non-C contiguous array by descriptor assignment is deprecated. To maintain the Fortran contiguity of a multidimensional Fortran array, use 'a.T.view(...).T' instead
  obj_bytes_view = obj.view(self.np.uint8)
In [29]:
predict = kmeans_fit.predict(test_data)
In [30]:
test_data["Cluster"] = predict
test_data.head(20)
/Applications/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
Out[30]:
       Power  Hour  Temperature  Cluster
52595  0.114     8         69.2        1
86044  4.255    17         83.3        0
6091   3.559    20         87.0        0
60185  0.453    11         68.7        1
37054  0.136     4         70.7        2
59216  0.312    10         68.6        1
61848  0.453    11         68.7        1
278    4.162    19         85.1        0
30829  0.136     3         71.0        2
8751   0.955    21         84.2        0
35134  0.276     4         70.7        2
31476  0.278     3         71.0        2
55854  0.456    10         68.6        1
54992  0.370     8         69.2        1
78259  0.307    15         80.7        3
62724  0.313    11         68.7        1
54132  0.260     8         69.2        1
44204  0.114     9         67.9        1
7834   1.094    21         84.2        0
25231  0.125     1         73.2        2
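One caveat before reading too much into the clusters: K-means uses Euclidean distance, and Power (roughly 0.1 - 6.5 kW), Hour (0 - 23), and Temperature (roughly 68 - 87 F) live on very different scales, so the unscaled features let Hour and Temperature dominate the distance. A standardization step would give each feature equal weight. The sketch below is an editorial aside (not in the original notebook) and runs on synthetic data shaped like the feature matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for the Power/Hour/Temperature matrix used in the notebook.
rng = np.random.RandomState(0)
X = np.column_stack([
    rng.uniform(0.1, 6.5, 300),    # Power (kW)
    rng.randint(0, 24, 300),       # Hour
    rng.uniform(67.9, 87.0, 300),  # Temperature (F)
])

# Standardize each column to zero mean / unit variance before clustering.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(labels))  # cluster sizes
```

Whether scaling improves the appliance story here is an empirical question, but it is the usual default for distance-based clustering.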

This looks like a reasonable clustering. Let's take it a step further and assign labels to the clusters, as a sanity check on the model. Based on the patterns seen above, the predicted labels can plausibly be mapped to the following categories:

  • 0 - Cooling Systems
  • 1 - Oven, Geyser
  • 2 - Night Lights
  • 3 - Home Security Systems
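Note that the cluster IDs K-means assigns are arbitrary and can change between runs, so a mapping like the one above should be checked against the fitted cluster centers rather than assumed. A sketch of that check (an aside, on made-up data shaped like the Power/Hour/Temperature features; the four usage regimes below just mirror the patterns described earlier):

```python
import numpy as np
from sklearn.cluster import KMeans

# Four made-up usage regimes mirroring the patterns in the analysis above.
rng = np.random.RandomState(42)
night   = np.column_stack([rng.uniform(0.17, 0.19, 200), rng.randint(0, 6, 200),   rng.uniform(69, 73, 200)])
morning = np.column_stack([rng.uniform(0.60, 1.00, 200), rng.randint(6, 10, 200),  rng.uniform(67, 70, 200)])
office  = np.column_stack([rng.uniform(0.35, 0.40, 200), rng.randint(10, 16, 200), rng.uniform(68, 81, 200)])
evening = np.column_stack([rng.uniform(4.00, 4.80, 200), rng.randint(16, 20, 200), rng.uniform(80, 87, 200)])
X = np.vstack([night, morning, office, evening])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)
# Read off which regime each cluster is from its center, instead of trusting raw IDs.
for cid, (p, h, t) in enumerate(kmeans.cluster_centers_):
    print("cluster %d: power=%.2f kW, hour=%.1f, temp=%.1f F" % (cid, p, h, t))
```

The center with high mean power and evening hours is the cooling cluster, whatever numeric ID it received.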

Next, we merge the labels and the predictions into a single dataframe.

In [31]:
label_df = pd.DataFrame({"Cluster": [0, 1, 2, 3],
                         "Appliances": ["Cooling System", "Oven, Geyser",
                                        "Night Lights", "Home Security Systems"]})
label_df
Out[31]:
              Appliances  Cluster
0         Cooling System        0
1           Oven, Geyser        1
2           Night Lights        2
3  Home Security Systems        3
In [32]:
result = test_data.merge(label_df)
result.head()
Out[32]:
   Power  Hour  Temperature  Cluster    Appliances
0  0.114     8         69.2        1  Oven, Geyser
1  0.453    11         68.7        1  Oven, Geyser
2  0.312    10         68.6        1  Oven, Geyser
3  0.453    11         68.7        1  Oven, Geyser
4  0.456    10         68.6        1  Oven, Geyser
In [33]:
result.tail()
Out[33]:
       Power  Hour  Temperature  Cluster             Appliances
22218  0.306    15         80.7        3  Home Security Systems
22219  0.450    13         76.2        3  Home Security Systems
22220  4.426    16         80.9        3  Home Security Systems
22221  0.452    15         80.7        3  Home Security Systems
22222  0.307    15         80.7        3  Home Security Systems



The result dataframe suggests that the Oven or Geyser is quite likely to be in use at 8, 9, and 10 o'clock; during office hours (1000 - 1600), on the other hand, the security systems are the most likely active load.

Throughout this analysis we only used data grouped by hour. With more data, we could build richer features, for example grouping by day, week, or month.
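With a longer collection window, the same groupby pattern extends directly to those coarser time features. A small sketch on a made-up two-day slice shaped like the cleaned sensor_data:

```python
import pandas as pd

# Made-up two-day slice in the shape of the cleaned sensor_data.
df = pd.DataFrame({
    "Time": pd.to_datetime(["05/11/2015 19:59:06", "05/11/2015 23:10:00",
                            "05/12/2015 07:30:00", "05/12/2015 19:45:00"]),
    "Power": [4.102, 2.300, 0.900, 4.500],
})
df["Day"]       = df["Time"].dt.date       # calendar-day feature
df["DayOfWeek"] = df["Time"].dt.dayofweek  # 0 = Monday

daily_mean = df.groupby("Day", as_index=False)["Power"].mean()
print(daily_mean)
```

The same `.dt` accessor yields week- and month-level features (`dt.month`, day-of-week as above) for the larger groupings mentioned.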


We should also account for season and temperature, since appliance usage differs across seasons; good features make the model's predictions more accurate.


This could also support a classification task, since we already know roughly how much power certain appliances draw.
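If labeled examples of appliance usage were available, the clustering step could indeed become a supervised classifier. A minimal sketch with a 1-nearest-neighbor model; the readings and labels below are invented for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labeled readings: (Power, Hour, Temperature) -> appliance group.
X = np.array([
    [4.30, 18, 84.5],  # cooling
    [4.80, 19, 85.1],  # cooling
    [0.90,  7, 68.8],  # oven / geyser
    [0.70,  6, 69.0],  # oven / geyser
    [0.17,  2, 72.1],  # night lights
    [0.18,  4, 70.7],  # night lights
    [0.38, 12, 72.1],  # security system
    [0.37, 14, 80.1],  # security system
])
y = ["cooling", "cooling", "oven_geyser", "oven_geyser",
     "night_lights", "night_lights", "security", "security"]

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[4.5, 17, 83.0]]))  # -> ['cooling']
```

Unlike clustering, this assigns human-meaningful labels directly, at the cost of needing labeled training data.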

