Introductory Statistics, Basic Concepts - Descriptive Statistics: Charts and Graphs (Personal Notes)

Source: Internet. Edited by: 程序博客网. Time: 2024/04/29 22:23

Graphically, the center of a distribution is located at the median of the distribution.

The spread of a distribution refers to the variability of the data.
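As a small illustration of center and spread, here is a minimal Python sketch. The sample `scores` is hypothetical, invented only for this example, and the range is used as just one simple measure of variability; the notes themselves do not prescribe a particular measure.

```python
from statistics import median

# Hypothetical sample of test scores, invented for illustration only.
scores = [52, 61, 64, 70, 73, 75, 81, 88, 94]

# Center: the median is the middle value of the sorted data.
center = median(scores)

# Spread: the range (max minus min) is one simple measure of variability.
spread = max(scores) - min(scores)

print("center (median):", center)
print("spread (range):", spread)
```

A wider range means the observations cover more of the axis; a smaller one means they cluster more tightly around the center.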


Shape

The shape of a distribution is described by the following characteristics.

  • Symmetry. When it is graphed, a symmetric distribution can be divided at the center so that each half is a mirror image of the other.
  • Number of peaks. Distributions can have few or many peaks. Distributions with one clear peak are called unimodal, and distributions with two clear peaks are called bimodal. When a symmetric distribution has a single peak at the center, it is referred to as bell-shaped.
  • Skewness. When they are displayed graphically, some distributions have many more observations on one side of the graph than the other. Distributions with fewer observations on the right (toward higher values) are said to be skewed right; and distributions with fewer observations on the left (toward lower values) are said to be skewed left.
  • Uniform. When the observations in a set of data are equally spread across the range of the distribution, the distribution is called a uniform distribution. A uniform distribution has no clear peaks.
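Skewness can also be checked numerically. The sketch below uses the moment coefficient of skewness, a common measure that these notes do not mention, so treat it as an assumption brought in for illustration. A positive value indicates a tail toward higher values (skewed right), a negative value a tail toward lower values (skewed left), and a value near zero suggests symmetry. All three samples are hypothetical.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5.

    Assumes the data are not all identical (otherwise m2 is zero).
    """
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical samples, invented for illustration only.
symmetric = [1, 2, 3, 4, 5]
skewed_right = [1, 1, 1, 2, 2, 3, 4, 9]   # few observations at high values
skewed_left = [1, 6, 7, 8, 8, 9, 9, 9]    # few observations at low values

for name, sample in [("symmetric", symmetric),
                     ("skewed right", skewed_right),
                     ("skewed left", skewed_left)]:
    print(name, round(skewness(sample), 3))
```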

Here are some examples of distributions and shapes.

[Figure: six example distributions, each plotted over the values 0 to 9. Top row: symmetric, unimodal, bell-shaped; skewed right; non-symmetric, bimodal. Bottom row: uniform; skewed left; symmetric, bimodal.]


Dotplot Overview:
1. Each dot represents a specific number of observations from a set of data.
2. The dots are stacked in a column over a category, so that the height of the column represents the relative or absolute frequency of observations in that category.
3. The pattern of data in a dotplot can be described in terms of symmetry and skewness only if the categories are quantitative. If the categories are qualitative, the dotplot cannot be described in those terms.

In the image above, each dot represents one person who likes a certain color.
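A dotplot of this kind can be sketched in plain text. The example below is only an illustration: the list `favorite_colors` is hypothetical, the helper name `dotplot_rows` is made up for this sketch, and the plot is drawn on its side (one row per category) to keep the code short, with one `*` per person.

```python
from collections import Counter

def dotplot_rows(observations):
    """Return one text row per category, with one '*' per observation."""
    counts = Counter(observations)
    width = max(len(str(category)) for category in counts)
    return [f"{str(category).ljust(width)} | {'*' * n}"
            for category, n in sorted(counts.items())]

# Hypothetical survey answers, invented for illustration only.
favorite_colors = ["red", "blue", "red", "green", "red", "blue"]

rows = dotplot_rows(favorite_colors)
for row in rows:
    print(row)
```

Because the categories here are qualitative (colors), the order of the rows is arbitrary, which is why symmetry and skewness are not meaningful for this plot.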

Bar chart Overview:

1. The columns are positioned over a label that represents a categorical variable.
2. The height of the column indicates the size of the group defined by the column label.


Histogram Overview:

1. The columns are positioned over a label that represents a quantitative variable.
2. The height of the column indicates the size of the group defined by the column label.
3. The column label can be a single value or a range of values.

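The difference between a bar chart and a histogram:
The columns of a bar chart represent categories, whereas the columns of a histogram represent numbers. For the same reason, skewness cannot be used to describe a bar chart, but it can be used to describe a histogram.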
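To make the histogram side of this comparison concrete, the sketch below groups integer test scores into equal-width ranges and counts each group. The function name `histogram_counts`, its parameters, and the sample scores are all hypothetical; the bin edges 41-50, 51-60, and so on are borrowed from the test-score example used in these notes.

```python
def histogram_counts(values, low, width, n_bins):
    """Count integer values in n_bins equal-width bins starting at `low`.

    Bin i covers low + i*width through low + (i+1)*width - 1 inclusive.
    Values outside every bin are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        index = (v - low) // width
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Hypothetical scores, invented for illustration only.
scores = [41, 50, 51, 75, 100, 100]
low, width, n_bins = 41, 10, 6

counts = histogram_counts(scores, low, width, n_bins)
labels = [f"{low + i * width}-{low + (i + 1) * width - 1}" for i in range(n_bins)]

for label, count in zip(labels, counts):
    print(f"{label:>6} | {'#' * count}")
```

Because the bins sit on a numeric axis in a fixed order, the resulting column heights can be described as symmetric or skewed, which is not possible for the unordered categories of a bar chart.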

Stemplot (stem-and-leaf plot)

Boxplot:
Very simple; see the figure below for the explanation.


Frequency vs. Cumulative Frequency

In a data set, the cumulative frequency for a value x is the total number of scores that are less than or equal to x. The charts below illustrate the difference between frequency and cumulative frequency. Both charts show scores for a test administered to 300 students.

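[Figure: two column charts of test scores for 300 students. Left: frequency (Y axis marked from 20 to 100) for the score groups 41-50, 51-60, 61-70, 71-80, 81-90, and 91-100. Right: cumulative frequency (Y axis marked from 60 to 300) at scores of 50, 60, 70, 80, 90, and 100.]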

In the chart on the left, column height shows frequency - the number of students in each test score grouping. For example, about 30 students received a test score between 51 and 60.

In the chart on the right, column height shows cumulative frequency - the number of students up to and including each test score. The chart on the right is a cumulative frequency chart. It shows that 30 students received a test score of at most 50; 60 students received a score of at most 60; 120 students received a score of at most 70; and so on.
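The relationship between the two charts is a running total. In the minimal Python sketch below, the first three counts follow from the numbers quoted in the text (30, 60, and 120 students at or below 50, 60, and 70). The last three counts are hypothetical, chosen only so that the total comes to 300, because the original chart values are not recoverable from these notes.

```python
from itertools import accumulate

labels = ["41-50", "51-60", "61-70", "71-80", "81-90", "91-100"]

# First three counts are implied by the text; the last three are
# hypothetical and chosen only so that the total is 300.
frequency = [30, 30, 60, 90, 60, 30]

# Cumulative frequency: running total of the counts so far.
cumulative = list(accumulate(frequency))

for label, f, c in zip(labels, frequency, cumulative):
    print(f"{label:>6}  frequency={f:3d}  cumulative={c:3d}")
```

The last cumulative value always equals the total number of observations, here 300.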

Absolute vs. Relative Frequency

[Figure: column chart of cumulative percentage (Y axis marked from 20 to 100) at test scores of 50, 60, 70, 80, 90, and 100.]

Frequency counts can be measured in terms of absolute numbers or relative numbers (e.g., proportions or percentages). The chart to the right duplicates the cumulative frequency chart above, except that it expresses the counts in terms of percentages rather than absolute numbers.

Note that the columns in the chart have the same shape, whether the Y axis is labeled with actual frequency counts or with percentages. If we had used proportions instead of percentages, the shape would remain the same.
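Converting absolute counts to relative ones is a single division by the total. The sketch below assumes hypothetical cumulative counts for 300 students (illustrative values, not read from the original chart) and shows that the ratio between any two columns is unchanged by the rescaling, which is why the shape stays the same.

```python
# Hypothetical cumulative counts for 300 students (illustrative only).
cumulative = [30, 60, 120, 210, 270, 300]
total = cumulative[-1]

percentages = [100 * c / total for c in cumulative]
proportions = [c / total for c in cumulative]

print("percentages:", percentages)
print("proportions:", proportions)

# Rescaling divides every column by the same constant, so the ratio
# between two columns is the same in counts and in percentages.
print(cumulative[2] / cumulative[0], percentages[2] / percentages[0])
```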

Discrete vs. Continuous Variables

[Figure: cumulative percentage plot drawn as a continuous curve over test score.]

Each of the previous cumulative charts has used a discrete variable on the X axis (i.e., the horizontal axis). The chart to the right duplicates the previous cumulative charts, except that it uses a continuous variable for the test scores on the X axis.

Let's work through an example to understand how to read this cumulative frequency plot. Specifically, let's find the median. Follow the grid line to the right from the Y axis at 50%. This line intersects the curve over the X axis at a test score of about 73. This means that half of the students received a test score of at most 73, and half received a test score of at least 73. Thus, the median is 73.

You can use the same process to find the cumulative percentage associated with any other test score. For example, what percentage of students received a test score of 64 or less? From the graph, you can see that about 25% of students received a score of 64 or less.
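Reading a value off a cumulative plot amounts to interpolating along the curve. Here is a minimal Python sketch, assuming the curve is piecewise linear. The points in `curve` are hypothetical: they were chosen only so that they agree with the two readings quoted above (50% at a score of about 73, and about 25% at a score of 64). The helper name `interpolate` is also made up for this sketch.

```python
def interpolate(points, x):
    """Piecewise-linear interpolation through (x, y) points sorted by x."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (x - x0) * (y1 - y0) / (x1 - x0)
    raise ValueError("x is outside the range of the curve")

# Hypothetical (score, cumulative percentage) points on the curve.
curve = [(40, 0), (64, 25), (73, 50), (82, 75), (100, 100)]

# Swapping the coordinates lets us go from a percentage back to a score.
inverse = [(pct, score) for score, pct in curve]

median_score = interpolate(inverse, 50)   # start at 50% on the Y axis
pct_at_64 = interpolate(curve, 64)        # start at a score of 64 on the X axis

print("median score:", median_score)
print("percentage at or below 64:", pct_at_64)
```

Going from the Y axis to the X axis answers "which score sits at this percentage?", while going from the X axis to the Y axis answers "what percentage scored at or below this value?".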

Test Your Understanding

Problem 1

Below, the cumulative frequency plot shows height (in inches) of college basketball players.

[Figure: cumulative frequency plot of basketball player height.]

What is the interquartile range?

Q1 is the value at a cumulative percentage of 25%, which corresponds to a height of 71 inches.
Q3 is the value at a cumulative percentage of 75%, which corresponds to a height of 77 inches.
So the interquartile range is Q3 - Q1 = 77 - 71 = 6 inches.
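The same calculation can be done on raw data rather than read from a plot. The sketch below uses `statistics.quantiles` from the Python standard library with `method="inclusive"`. The list `heights` is hypothetical, chosen only so that its quartiles match the values read from the plot above; different quantile methods can give slightly different quartiles on other data.

```python
from statistics import quantiles

# Hypothetical player heights in inches, invented so that the quartiles
# match the values read from the plot (71 and 77).
heights = [69, 70, 71, 72, 74, 76, 77, 78, 80]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```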


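Scatterplot
A scatterplot consists of a horizontal axis, a vertical axis, and a series of dots. Each dot represents one observation (a pair of values) from the data set.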



Patterns of data in scatterplots: the scatterplot is used to analyze bivariate data (when we conduct a study that examines the relationship between two variables, we are working with bivariate data). Such patterns are described by three attributes: linearity, slope, and strength.

1. Linearity refers to whether a data pattern is linear (straight) or non-linear (curved).
2. Slope refers to the direction of change in variable Y when variable X gets bigger. If Y also gets bigger, the slope is positive; if Y gets smaller, the slope is negative.
3. Strength refers to the degree of scatter in the plot. If the dots are widely spread, the relationship between the variables is weak; if the dots are concentrated around a line, the relationship is strong.
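Slope and strength can be summarized in one number. The sketch below computes the Pearson correlation coefficient, a standard measure that these notes do not mention and that is brought in here as an assumption. Its sign gives the direction of the slope, and its magnitude (between 0 and 1) gives the strength of a linear relationship. It says nothing reliable about curved patterns, so linearity still has to be judged from the plot. All data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical bivariate data, invented for illustration only.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # dots exactly on a rising line
falling = [10, 8, 6, 4, 2]   # dots exactly on a falling line
noisy = [3, 1, 4, 1, 5]      # dots widely scattered

for name, y in [("rising", rising), ("falling", falling), ("noisy", noisy)]:
    print(name, round(pearson_r(x, y), 3))
```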



