Your Prediction Gets As Good As Your Data
I have often seen software engineers and data scientists assume that they can keep increasing their prediction accuracy by improving their machine learning algorithm. Here, I want to approach the classification problem from a different angle: I suggest that data scientists analyze the distribution of their data to measure how much information it contains. This approach gives an upper bound on how far the accuracy of a predictive algorithm can be improved, and helps make sure optimization efforts are not wasted.
Information and Entropy
In information theory, mathematicians have developed useful measures such as entropy to quantify the information level in data. Consider a biased coin with a head probability of 1%. Someone who flips this coin gains more information from seeing a head (the rare event) than from seeing a tail (the more likely event). One can formulate the information level of a random event as the negative logarithm of its probability:
$$I(x) = -\log_2 P(x)$$
This captures the described intuition. Mathematicians also formulated another measure, called entropy, which captures the average information in a random process in bits. The entropy of a discrete random variable X is:
$$H(X) = -\sum_{x} P(x) \log_2 P(x)$$
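As a quick numeric check of the rare-event intuition, here is a minimal sketch that evaluates the negative-logarithm measure for the 1% coin. The function name `self_information_bits` is an illustrative choice, and base-2 logarithms are assumed so that the result is in bits.

```python
import math


def self_information_bits(probability):
    """Information content, in bits, of an event with the given probability.

    Computed as log2(1 / p), which equals -log2(p).
    """
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log2(1.0 / probability)


# The biased coin from the text: P(H) = 1%, P(T) = 99%.
p_head = 0.01
print(f"head: {self_information_bits(p_head):.3f} bits")
print(f"tail: {self_information_bits(1.0 - p_head):.3f} bits")
```

The rare head carries several bits of information, while the near-certain tail carries almost none.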
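For the first example, let's assume we have a coin with P(H)=0% and P(T)=100%. Using the convention that 0 log 0 = 0, we can compute the entropy of the coin as follows:
$$H = -(0 \cdot \log_2 0 + 1 \cdot \log_2 1) = 0 \text{ bits}$$
For the second example, let's consider a coin where P(H)=1% and P(T)=1-P(H)=99%. Plugging in the numbers, one finds that the entropy of such a coin is:
$$H = -(0.01 \log_2 0.01 + 0.99 \log_2 0.99) \approx 0.08 \text{ bits}$$
Finally, if the coin has P(H) = P(T) = 0.5 (i.e. a fair coin), its entropy is calculated as follows:
$$H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \text{ bit}$$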
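The three coin examples can be reproduced with a few lines of Python. This is a minimal sketch: `entropy_bits` and the `coins` dictionary are illustrative names, and zero-probability outcomes are skipped on the assumption that 0 log 0 is treated as 0.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete distribution.

    Each term p * log2(1 / p) equals -p * log2(p). Outcomes with p == 0
    are skipped, following the convention 0 * log2(0) = 0.
    """
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)


# The three coins from the examples, given as [P(H), P(T)].
coins = {
    "always tails": [0.0, 1.0],
    "rare heads": [0.01, 0.99],
    "fair": [0.5, 0.5],
}

for name, pmf in coins.items():
    print(f"{name}: {entropy_bits(pmf):.4f} bits")
```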
Entropy and Predictability
So, what do these examples tell us? If we have a coin with a head probability of zero, the coin's entropy is zero, meaning that the average information in the coin is zero. This makes sense because flipping such a coin always comes up tails. Thus, the prediction accuracy is 100%. In other words, when the entropy is zero, we have the maximum predictability.
In the second example, the head probability is not zero but still very close to zero, which again makes the coin very predictable, with a low entropy.
Finally, in the last example we have a 50/50 chance of seeing a head or a tail, which maximizes the entropy and consequently minimizes the predictability. In other words, one can show that a fair coin has the maximum entropy of 1 bit, making any prediction no better than a random guess.
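The inverse relationship can be illustrated by comparing a coin's entropy with the accuracy of the best possible guess. The sketch below assumes the predictor has no information other than the coin's bias, so its best strategy is to always guess the more likely side. The function names and the list of biases are illustrative.

```python
import math


def binary_entropy_bits(p_head):
    """Entropy, in bits, of a coin with head probability p_head."""
    return sum(
        p * math.log2(1.0 / p) for p in (p_head, 1.0 - p_head) if p > 0
    )


def best_guess_accuracy(p_head):
    """Accuracy of always guessing the more likely side of the coin."""
    return max(p_head, 1.0 - p_head)


# As the bias moves toward 50/50, entropy rises and accuracy falls.
for p_head in (0.0, 0.01, 0.1, 0.3, 0.5):
    print(
        f"P(H)={p_head:.2f}  "
        f"entropy={binary_entropy_bits(p_head):.3f} bits  "
        f"best accuracy={best_guess_accuracy(p_head):.0%}"
    )
```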
Kullback–Leibler Divergence
As a last example, we show how ideas from information theory can be borrowed to measure the distance between two probability distributions. Let's assume we are modeling two random processes by their pmfs: P(.) and Q(.). One can build on the entropy measure to compute the distance between the two pmfs as follows:
$$D_{KL}(P \| Q) = \sum_{x} P(x) \log_2 \frac{P(x)}{Q(x)}$$
This function is known as the KL divergence, and it measures how far the distribution Q is from P. Note that it is not symmetric: the divergence of Q from P generally differs from the divergence of P from Q, so it is not a distance in the strict metric sense. The KL divergence can be very useful in various applications, such as NLP problems where we want to measure the distance between the word distributions of two documents (e.g. modelled as bags of words).
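The bag-of-words use case might look like the following minimal sketch. The two sample documents, the function names, and the add-one smoothing are assumptions made for illustration; smoothing is included only so that Q(x) is never zero, which would make the divergence undefined.

```python
import math
from collections import Counter


def word_distribution(text, vocabulary, smoothing=1.0):
    """Smoothed bag-of-words pmf of `text` over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}


def kl_divergence_bits(p, q):
    """KL divergence D(P || Q) in bits; p and q share the same keys."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)


# Two toy documents (hypothetical data).
doc_a = "the cat sat on the mat"
doc_b = "the dog chased the cat"
vocabulary = sorted(set(doc_a.lower().split()) | set(doc_b.lower().split()))

p = word_distribution(doc_a, vocabulary)
q = word_distribution(doc_b, vocabulary)

print(f"D(P || Q) = {kl_divergence_bits(p, q):.4f} bits")
print(f"D(Q || P) = {kl_divergence_bits(q, p):.4f} bits")
```

The two printed values differ, which reflects that the divergence depends on which distribution is taken as the reference.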
Wrap-up
In this post, we showed that entropy, a measure from information theory, provides a way to quantify how much information exists in a given dataset. We also highlighted the inverse relationship between entropy and predictability. This suggests that one can use entropy to calculate an upper bound on the accuracy achievable for the prediction problem at hand.
Source: http://www.aioptify.com/informationbound.php