Performing summary statistics and plots - Python Data Science Cookbook


Source: a recipe from the Python Data Science Cookbook.

The primary purpose of using summary statistics is to get a good understanding of the location and dispersion of the data. By summary statistics, we refer to the mean, median, and standard deviation. These quantities are quite easy to calculate; however, one should be careful when using them. If the underlying data is not unimodal, that is, if it has multiple peaks, these quantities may not be of much use.
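To see why, the following minimal sketch (not from the book; the two-peaked sample is synthetic) computes these quantities for a bimodal sample: the mean and median land between the two peaks, where hardly any observations actually lie, and the standard deviation mostly reflects the gap between the peaks rather than the spread of either one.

    # Minimal sketch (not from the book): summary statistics on a synthetic bimodal sample.
    import numpy as np

    np.random.seed(0)
    # Two clearly separated peaks, one around 2 and one around 10
    bimodal = np.concatenate([np.random.normal(2, 0.5, 500),
                              np.random.normal(10, 0.5, 500)])

    print("mean    = %0.2f" % np.mean(bimodal))    # close to 6, where almost no points lie
    print("median  = %0.2f" % np.median(bimodal))  # also lands between the two peaks
    print("std-dev = %0.2f" % np.std(bimodal))     # dominated by the gap between the peaks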

Note

  1. If the given data is unimodal, that is, having only one peak, the mean, which gives the location, and the standard deviation, which gives the dispersion, are valuable metrics.
  2. Compared to the regular mean, a trimmed mean is less sensitive to outliers (see the sketch after this list). SciPy provides us with a trim_mean function. We will demonstrate the trimmed mean calculation in step 2.
  3. The mean is very sensitive to outliers; variance also uses the mean, and hence it is prone to the same issues as the mean. We can use other measures of dispersion to avoid this trap. One such measure is the absolute average deviation: instead of taking the square of the difference between the individual values and the mean and dividing it by the number of instances, we take the absolute value of the difference between the individual values and the mean and divide it by the number of instances. In step 5, we will define a function for this:

    def mad(x, axis=None):
        mean = np.mean(x, axis=axis)
        return np.sum(np.abs(x - mean)) / (1.0 * len(x))
  4. With data having many outliers, there is another set of metrics that come in handy: the median and percentiles. Traditionally, the median is defined as a value from the dataset such that half of all the points in the dataset are smaller and the other half are larger than the median value.
    Interpreting the percentiles:
    25% of the points in the dataset are below 13.00 (25th percentile value).
    50% of the points in the dataset are below 18.50 (50th percentile value).
    75% of the points in the dataset are below 25.25 (75th percentile value).
    A point to note is that the 50th percentile is the median. Percentiles give us a good idea of the range of our values.
    The median is a measure of the location of the data distribution. Using percentiles, we can get a metric for the dispersion of the data, the interquartile range. The interquartile range is the distance between the 75th percentile and the 25th percentile.
  5. Similar to the mean absolute deviation explained previously, we also have the median absolute deviation:

    def mdad(x, axis=None):
        median = np.median(x, axis=axis)
        return np.median(np.abs(x - median))
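Before moving to the recipe itself, here is a small self-contained sketch (not from the book) that makes the robustness argument from points 2 to 5 concrete: a single extreme outlier drags the mean and inflates the standard deviation, the mean absolute deviation is affected less severely than the standard deviation, and the trimmed mean, median, median absolute deviation, and interquartile range barely move.

    # Minimal sketch (not from the book): effect of a single extreme outlier on these metrics.
    import numpy as np
    from scipy.stats import trim_mean

    def mad(x, axis=None):
        mean = np.mean(x, axis=axis)
        return np.sum(np.abs(x - mean)) / (1.0 * len(x))

    def mdad(x, axis=None):
        median = np.median(x, axis=axis)
        return np.median(np.abs(x - median))

    clean = np.arange(1.0, 11.0)        # 1, 2, ..., 10
    dirty = np.append(clean, 1000.0)    # the same sample plus one extreme outlier

    for label, sample in (("clean", clean), ("with outlier", dirty)):
        iqr = np.percentile(sample, 75) - np.percentile(sample, 25)
        print("%-12s mean=%0.2f trimmed=%0.2f median=%0.2f std=%0.2f mad=%0.2f mdad=%0.2f iqr=%0.2f"
              % (label, np.mean(sample), trim_mean(sample, 0.1), np.median(sample),
                 np.std(sample), mad(sample), mdad(sample), iqr))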
Source code:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
@author: snaildove
"""
# Load libraries
from sklearn.datasets import load_iris
import numpy as np
from scipy.stats import trim_mean

# Load iris data
data = load_iris()
x = data['data']
y = data['target']
col_names = data['feature_names']

# Let's now demonstrate how to calculate the mean, trimmed mean, and range values:
# 1. Calculate and print the mean value of each column in the Iris dataset
print "col name,mean value"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f"%(col_name,np.mean(x[:,i]))
print

# 2. Trimmed mean calculation
p = 0.1 # 10% trimmed mean
print "col name,trimmed mean value"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f"%(col_name,trim_mean(x[:,i],p))
print

# 3. Data dispersion: calculate and display the range values
print "col_names,max,min,range"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f,%0.2f,%0.2f"%(col_name,max(x[:,i]),min(x[:,i]),max(x[:,i])-min(x[:,i]))
print

# Finally, we will show the variance, standard deviation, mean absolute deviation, and
# median absolute deviation calculations:
# 4. Data dispersion: variance and standard deviation
print "col_names,variance,std-dev"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f,%0.2f"%(col_name,np.var(x[:,i]),np.std(x[:,i]))
print

# 5. Mean absolute deviation calculation
def mad(x,axis=None):
    mean = np.mean(x,axis=axis)
    return np.sum(np.abs(x-mean))/(1.0 * len(x))

print "col_names,mad"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f"%(col_name,mad(x[:,i]))
print

# 6. Median absolute deviation and interquartile range calculation
def mdad(x,axis=None):
    median = np.median(x,axis=axis)
    return np.median(np.abs(x-median))

print "col_names,median,median abs dev,inter quartile range"
for i,col_name in enumerate(col_names):
    iqr = np.percentile(x[:,i],75) - np.percentile(x[:,i],25)
    print "%s,%0.2f,%0.2f,%0.2f"%(col_name,np.median(x[:,i]),mdad(x[:,i]),iqr)
print
Output:

col name,mean value
sepal length (cm),5.84
sepal width (cm),3.05
petal length (cm),3.76
petal width (cm),1.20

col name,trimmed mean value
sepal length (cm),5.81
sepal width (cm),3.04
petal length (cm),3.76
petal width (cm),1.18

col_names,max,min,range
sepal length (cm),7.90,4.30,3.60
sepal width (cm),4.40,2.00,2.40
petal length (cm),6.90,1.00,5.90
petal width (cm),2.50,0.10,2.40

col_names,variance,std-dev
sepal length (cm),0.68,0.83
sepal width (cm),0.19,0.43
petal length (cm),3.09,1.76
petal width (cm),0.58,0.76

col_names,mad
sepal length (cm),0.69
sepal width (cm),0.33
petal length (cm),1.56
petal width (cm),0.66

col_names,median,median abs dev,inter quartile range
sepal length (cm),5.80,0.70,1.30
sepal width (cm),3.00,0.25,0.50
petal length (cm),4.35,1.25,3.50
petal width (cm),1.30,0.70,1.50
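As a follow-up, the same column-wise figures can be obtained without the explicit loops by passing an axis argument to the NumPy and SciPy routines. A minimal sketch (not from the book), reloading the iris matrix for self-containment:

    # Minimal sketch (not from the book): vectorized, column-wise versions of the same statistics.
    import numpy as np
    from scipy.stats import trim_mean
    from sklearn.datasets import load_iris

    x = load_iris()['data']

    col_means = np.mean(x, axis=0)                          # per-column mean
    col_trimmed = trim_mean(x, 0.1, axis=0)                 # per-column 10% trimmed mean
    q25, q50, q75 = np.percentile(x, [25, 50, 75], axis=0)  # per-column quartiles
    col_iqr = q75 - q25                                     # per-column interquartile range

    print(col_means)
    print(col_trimmed)
    print(q50)      # per-column median
    print(col_iqr)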
