HOW TO NORMALISE FEATURE VECTORS
Source: Internet · Edited by: 程序博客网 · Time: 2024/05/18 01:13
I was trying to create a sample file for training a neural network and ran into a common problem: the feature values are on wildly different scales. In this example I’m working with real-world demographic values for countries. For example, a feature for GDP per person in a country ranges from 551.27 to 88286.0, whereas estimates for corruption range from -1.56 to 2.42. This can cause problems for machine learning algorithms, as they can end up treating features with bigger values as more important signals.
To handle this issue, we want to scale all the feature values into roughly the same range. We can do this by taking each feature value, subtracting the mean of that feature (thereby shifting the mean to 0), and dividing by the feature’s standard deviation (scaling the spread to 1). This is a piece of code I’ve implemented a number of times for various projects, so it’s time to write a nice reusable script. Hopefully it can be helpful for others as well. I chose to do this in Python, as it’s easier to run than C++ or Java (it doesn’t need to be compiled) and has better support for real-valued numbers than bash scripting.
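As a quick illustration of the calculation, here is a minimal sketch on a handful of values. Only the two endpoints (551.27 and 88286.0) come from the GDP range mentioned above; the values in between are invented for the example, and the function name `standardise` is my own. It uses the population standard deviation from Python's standard library, which is the same convention as numpy's default.

```python
import statistics

def standardise(values):
    """Return (x - mean) / std for each x, using the population std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]

# Illustrative GDP-like values; only the minimum and maximum come from the text.
gdp = [551.27, 4200.0, 15000.0, 43000.0, 88286.0]
scaled = standardise(gdp)
print(scaled)

# After scaling, the mean is approximately 0 and the std approximately 1.
print(statistics.mean(scaled), statistics.pstdev(scaled))
```

The scaled values keep their original ordering, they are just expressed as a number of standard deviations away from the mean.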
Each line in the input file is assumed to be a feature vector, with values separated by whitespace. The first element is an integer class label that will be left untouched. This is followed by a number of floating point feature values which will be normalised. For example:
1 0.563 13498174.2 -21.3
0 0.114 42234434.3 15.67
We’re assuming dense vectors, meaning that each line has an equal number of features.
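To make the format concrete, the following sketch parses the two example lines into a label and a list of features, and checks the dense-vector assumption. The helper names `parse_line` and `parse_dense` are hypothetical, and raising an error on ragged input is my own choice rather than something the script does.

```python
def parse_line(line):
    """Split one line into (int label, list of float features)."""
    parts = line.split()
    return int(parts[0]), [float(v) for v in parts[1:]]

def parse_dense(lines):
    """Parse non-blank lines; raise ValueError if feature counts differ."""
    rows = [parse_line(l) for l in lines if l.strip()]
    widths = {len(features) for _, features in rows}
    if len(widths) > 1:
        raise ValueError("rows have differing feature counts: %s" % sorted(widths))
    return rows

sample = ["1 0.563 13498174.2 -21.3", "0 0.114 42234434.3 15.67"]
rows = parse_dense(sample)
for label, features in rows:
    print(label, features)
```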
To execute it, simply use:
python feature-normaliser.py < in.txt > out.txt
The complete script that will normalise feature vectors is here:
import sys
import fileinput
import numpy

# data[col] holds every value of column col; column 0 is the class label.
data = []
linecount = 0
index = 0
for line in fileinput.input():
    if line.strip():  # skip blank lines
        index = 0
        for value in line.split():
            if linecount == 0:
                data.append([])
            if index == 0:
                data[index].append(int(value))
            else:
                data[index].append(float(value))
            index += 1
        linecount += 1

# Compute the mean and standard deviation of each column once.
means = [numpy.mean(column) for column in data]
stds = [numpy.std(column) for column in data]

for row in range(0, linecount):
    for col in range(0, index):
        if col == 0:
            # The class label is written out unchanged.
            sys.stdout.write(str(data[col][row]))
        else:
            val = (data[col][row] - means[col]) / stds[col]
            sys.stdout.write("\t" + str(val))
    sys.stdout.write("\n")
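For larger files, the same calculation can be done on whole columns at once. The following is a sketch of one possible vectorised variant, not part of the script above: the function name `normalise_matrix` is hypothetical, and replacing a zero standard deviation with 1 (so that constant features come out as 0 instead of causing a division by zero) is an assumption of mine.

```python
import numpy

def normalise_matrix(rows):
    """rows: list of [label, f1, f2, ...]. Returns (labels, scaled features)."""
    arr = numpy.asarray(rows, dtype=float)
    labels = arr[:, 0].astype(int)
    features = arr[:, 1:]
    mean = features.mean(axis=0)  # one mean per feature column
    std = features.std(axis=0)    # population std, numpy's default
    # Assumption: a constant column has std 0; use 1 so it scales to 0.
    std[std == 0] = 1.0
    return labels, (features - mean) / std

labels, scaled = normalise_matrix([
    [1, 0.563, 13498174.2, -21.3],
    [0, 0.114, 42234434.3, 15.67],
])
for label, feats in zip(labels, scaled):
    print(str(label) + "\t" + "\t".join(str(v) for v in feats))
```

With only two rows, each feature ends up at roughly +1 or -1, since the two values sit one standard deviation on either side of their mean.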