Tokenizing Web Attack Data from the CSIC 2010 Dataset


The CSIC 2010 dataset (http://www.isi.csic.es/dataset/) contains tens of thousands of automatically generated web requests and is mainly used to test web attack protection systems. It was produced by the Information Security Institute of the Spanish National Research Council (CSIC).
CSIC 2010 consists of HTTP traffic generated against an e-commerce web application in which users can buy items through a shopping cart and register by providing some personal information. Since it is a Spanish-language web application, the dataset contains some Latin characters.
The dataset was generated automatically and contains 36,000 normal requests and more than 25,000 anomalous ones. Each HTTP request is labeled as normal or anomalous, and the anomalous traffic covers a variety of attacks, such as SQL injection, buffer overflow, information gathering, file disclosure, CRLF injection, cross-site scripting, server-side include, and parameter tampering.

The raw CSIC 2010 data is formatted as follows: each request is stored as a complete HTTP message, i.e. a request line (method, URL, and HTTP version), a block of headers, and, for POST and PUT requests, a body.
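Because the files have this fixed layout, the extraction script below works with hard-coded offsets. A minimal sketch of the request-line slicing (the sample line is illustrative, not taken verbatim from the dataset):

```python
# Illustrative request line in the style of the CSIC 2010 raw files.
line = 'GET http://localhost:8080/tienda1/index.jsp HTTP/1.1\n'

# The script's slice line[4:len(line)-10] drops the 'GET ' prefix
# and the trailing ' HTTP/1.1\n' (10 characters), keeping only the URL.
url = line[4:len(line) - 10]
print(url)  # http://localhost:8080/tienda1/index.jsp
```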

Extracting and Tokenizing the Request Data

From the raw request data, only the GET, POST, and PUT requests are extracted for detection. Each extracted request string is then split according to the structure of an HTTP request: the URL is first percent-decoded, and the string is split on parameter separators, key-value delimiters, and other special symbols.
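As a minimal sketch of this decode-then-split idea (the `tokenize` helper and the sample URL are illustrative, not part of the original script):

```python
import urllib.parse as ps

def tokenize(s, symbols='?&=(){}<>/\\."\';@~'):
    # Pad every delimiter with spaces, then split on whitespace
    # so each symbol becomes its own token.
    for ch in symbols:
        s = s.replace(ch, ' ' + ch + ' ')
    return s.split()

# unquote_plus decodes percent-escapes and turns '+' into a space.
decoded = ps.unquote_plus('tienda1/publico/vaciar.jsp?B2=Vaciar+carrito')
print(tokenize(decoded))
# ['tienda1', '/', 'publico', '/', 'vaciar', '.', 'jsp', '?', 'B2', '=', 'Vaciar', 'carrito']
```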

```python
# -*- coding: utf-8 -*-
"""
@date: 2017-07-21
@author: Jiabao Wang
@description: Data preparation for deep learning methods on the CSIC2010 dataset.
"""
import urllib.parse as ps

# Delimiters that should become stand-alone tokens.
SPLIT_SYMBOLS = ['?', '&', '=', '(', ')', '{', '}', '<', '>',
                 '/', '\\', '.', '"', '\'', ';', '@', '~']


def string_spliting(str_input, symbol):
    # Surround every occurrence of `symbol` with spaces.
    str_words = str_input.split(symbol)
    return (' ' + symbol + ' ').join(str_words)


def string_words_spliting(str_input):
    # Apply the splitting for every delimiter in turn.
    str_ret = str_input
    for symbol in SPLIT_SYMBOLS:
        str_ret = string_spliting(str_ret, symbol)
    return str_ret


def http_request_extraction(fread, fwrite):
    # Extract and tokenize the GET/POST/PUT request strings.
    text = fread.readlines()
    lines = len(text)
    i = 0
    while i < lines:
        line = text[i]
        n = len(line)
        if line.startswith('GET'):
            # Drop the 'GET ' prefix and the trailing ' HTTP/1.1\n'.
            cmdGET = line[4:n - 10]
            cmdStr = ps.unquote_plus(cmdGET) + '\n'
            cmdStr = string_words_spliting(cmdStr).lstrip()
            fwrite.write(cmdStr.encode('utf-8'))
            # Skip the header block that follows the request line.
            i = i + 12
        if line.startswith('POST'):
            # In these files the request body sits 14 lines below the
            # request line; append it as if it were a query string.
            cmdPOST = line[5:n - 10] + '?' + text[i + 14][:-1]
            cmdStr = ps.unquote_plus(cmdPOST) + '\n'
            cmdStr = string_words_spliting(cmdStr).lstrip()
            fwrite.write(cmdStr.encode('utf-8'))
        if line.startswith('PUT'):
            cmdPUT = line[4:n - 10] + '?' + text[i + 14][:-1]
            cmdStr = ps.unquote_plus(cmdPUT) + '\n'
            cmdStr = string_words_spliting(cmdStr).lstrip()
            fwrite.write(cmdStr.encode('utf-8'))
        i = i + 1


fwriteNTest = open('data/normalTrafficTestFeature.txt', 'wb')
freadNTest = open('data/normalTrafficTest/normalTrafficTest.txt',
                  encoding='utf-8')
http_request_extraction(freadNTest, fwriteNTest)
fwriteNTest.close()

fwriteNTrain = open('data/normalTrafficTrainingFeature.txt', 'wb')
freadNTrain = open('data/normalTrafficTraining/normalTrafficTraining.txt',
                   encoding='utf-8')
http_request_extraction(freadNTrain, fwriteNTrain)
fwriteNTrain.close()

fwriteATest = open('data/anomalousTrafficTestFeature.txt', 'wb')
freadATest = open('data/anomalousTrafficTest/anomalousTrafficTest.txt',
                  encoding='utf-8')
http_request_extraction(freadATest, fwriteATest)
fwriteATest.close()

# Concatenate all three sets into a single feature file.
fwriteAll = open('data/TrafficFeatureAll.txt', 'wb')
freadNTest = open('data/normalTrafficTest/normalTrafficTest.txt',
                  encoding='utf-8')
freadNTrain = open('data/normalTrafficTraining/normalTrafficTraining.txt',
                   encoding='utf-8')
freadATest = open('data/anomalousTrafficTest/anomalousTrafficTest.txt',
                  encoding='utf-8')
http_request_extraction(freadNTest, fwriteAll)
http_request_extraction(freadNTrain, fwriteAll)
http_request_extraction(freadATest, fwriteAll)
fwriteAll.close()
```
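The tokenized feature files can then feed a downstream model. As one possible next step (not part of the original script; `build_vocab` is a hypothetical helper), a token vocabulary could be built from the output:

```python
from collections import Counter

def build_vocab(lines, min_count=1):
    # Count whitespace-separated tokens across all tokenized requests.
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    # Keep tokens seen at least min_count times, most frequent first;
    # index 0 is reserved for padding/unknown tokens.
    items = [(t, c) for t, c in counter.most_common() if c >= min_count]
    return {tok: i + 1 for i, (tok, _) in enumerate(items)}

# Usage sketch, assuming the combined feature file produced above:
# with open('data/TrafficFeatureAll.txt', encoding='utf-8') as f:
#     vocab = build_vocab(f)
```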

The result: each line of the output file contains one decoded request, with its path segments, parameter names, values, and special symbols separated into individual space-delimited tokens.
