Weather Data Analysis Example: Part 1
http://theanalyticalminds.blogspot.jp/2015/02/part-1-introduction.html
Part 1: Introduction
The ultimate goal of every data scientist is to extract as much valuable information as possible from a given data set. We want to be able to predict the future based on the past, to discover deep and hidden patterns in the data, and to expand the current base of knowledge in some specific domain. With this in mind, several machine learning algorithms, from Neural Networks to Support Vector Machines, from Naïve Bayes to Random Forests, have been developed over the years. In many situations, when correctly applied, these can provide greater insight into the data than any human could obtain from analytical skill alone, however clever that person might be.
Daily experience shows, however, that the process of analysing data seldom goes past Exploratory Data Analysis (EDA). In fact, more often than not, the classical EDA approach of summarising and plotting is sufficient for the analyst to build a solid intuition of what the data is trying to convey, and sometimes to raise new questions that might be addressed afterwards, if a model is to be developed.
Unfortunately, in many cases we cannot perform EDA right away. Raw data is often messy, unstructured, badly coded, inconsistent, or just plain wrong. It is often said that the analyst spends 80% of the time preparing the data, and only 20% actually doing analysis and modelling. This initial data wrangling consists of handling missing values, removing duplicates, transforming variables, formatting variable types, recoding values, detecting outliers to ascertain data integrity, among other tasks. Our goal is to have what is called tidy data by the end of the process.
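To make these wrangling steps more concrete, here is a minimal sketch in R, the language used throughout this series. The data frame `raw`, its columns and its values are made up for illustration and are not taken from the weather data set; the outlier threshold is arbitrary.

```r
# Toy data frame with made-up values (NOT the weather data set)
raw <- data.frame(
  id   = c(1, 2, 2, 3, 4),
  temp = c("12.5", "14.0", "14.0", NA, "55.0"),  # numbers stored as text
  wind = c("N", "se", "se", "SW", "N"),          # inconsistent coding
  stringsAsFactors = FALSE
)

# Remove exact duplicate rows
clean <- raw[!duplicated(raw), ]

# Format variable types: temperature was read as text
clean$temp <- as.numeric(clean$temp)

# Recode values: make the wind direction labels consistent
clean$wind <- factor(toupper(clean$wind))

# Handle missing values: start by counting them per column
missing_per_column <- colSums(is.na(clean))

# Detect outliers: flag implausible temperatures (arbitrary threshold)
clean$temp.suspect <- !is.na(clean$temp) & clean$temp > 45
```

Each step uses base R only. In practice the right treatment for missing values and outliers depends on the data, so the code above only shows how to find them, not what to do with them.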
Data analysis using R
In this series of intermediate-level tutorials, I will guide you through the process of analysing an actual data set. We will start by preparing the data and then use a few EDA techniques to get a grasp of what’s in it. To accomplish this, we will be using the lingua franca of data science, the R programming language. There are many advantages and disadvantages to using R, and I summarise below the ones I personally deem the most relevant.
Pros
- There is a big community of R users – as of 2014, there were more than 5 000 user-contributed packages in the main repository (CRAN) alone, and around 150 000 R functions (software popularity). If you need a function for some specific purpose, it is highly likely that someone else has already written it. If you have any doubt about R syntax, you’ll probably find the answer online easily (Stack Overflow);
- The superb graphics capabilities offered by the ggplot2 package – this plotting system implements the grammar of graphics, a new way of thinking about the visual representation of data. If you have to pick a single topic to learn in R, ggplot2 is arguably the best option. It is a language in itself that will allow you to produce high-quality plots in a short amount of time;
- It’s free! (This is not unique to R, though).
Cons
- R has a steep learning curve – the paradigm is different from mainstream languages and even from other statistical packages. It is highly interactive: to complete an analysis you often call one function, take the result and feed it to the next function, and so on, in a cycle that can be quite extensive;
- Some of the syntax is far from intuitive and even cumbersome – for example, using the seemingly obvious sort() function to sort a data frame yields nasty results; you need to use the order() function instead, and not directly, but in a rather convoluted way. It must be said, however, that several packages have been developed to make the analysis much easier, namely plyr/dplyr to manipulate (filter, transform, summarise) data, lubridate and stringr to lessen the burden when dealing with dates and strings, respectively, and sqldf to run SQL statements directly over R data frames, among a few others;
- R can be a bit slow, but that matters more when developing algorithms than when performing interactive EDA. In fact, the bottleneck of the process is the analyst. The time we spend thinking about what information to extract and how we want to visualise it is several orders of magnitude greater than the time it takes R to actually draw the graphics.
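The sort() versus order() point mentioned among the cons is easier to see with a small example. The sketch below uses a toy data frame `df` with made-up values, and base R only.

```r
# Toy data frame with made-up values
df <- data.frame(
  day  = c(1, 2, 3, 4),
  rain = c(5.2, 0.0, 12.7, 0.3)
)

# sort() is meant for vectors: it works on a single column, but it
# returns only the sorted values and loses the link to the other columns
sorted_rain <- sort(df$rain)

# order() returns the row positions in sorted order;
# use them to index the rows of the data frame
by_rain <- df[order(df$rain), ]

# The same idea, in descending order
by_rain_desc <- df[order(df$rain, decreasing = TRUE), ]
```

Calling sort(df) on the whole data frame is deliberately left out: depending on the R version it either fails or gives an unhelpful result, which is exactly the trap described above.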
Without further ado, let’s have a look at the data set we will be using in this series of tutorials.
The weather data set
The data set consists of daily records of several meteorological parameters, measured in the city of Porto over the year of 2014. We have, then, 365 observations for each of the following 14 variables:
day.count – number of days passed since the beginning of the year
day – day of the month
month – month of the year
season – season of the year
l.temp, h.temp, ave.temp – lowest, highest and average temperature for the day (in ºC)
l.temp.time, h.temp.time – hour of the day when l.temp and h.temp occurred
rain – amount of precipitation (in mm)
ave.wind – average wind speed for the day (in km/h)
gust.wind – maximum wind speed for the day (in km/h)
gust.wind.time – hour of the day when gust.wind occurred
dir.wind – dominant wind direction for the day
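As a quick sanity check, the description above can be turned into a few lines of R. This is only a sketch: the helper `check_weather()` and the file name in the usage comment are hypothetical, and how the data is actually loaded is left for the next part.

```r
# Variable names exactly as listed above
expected_vars <- c(
  "day.count", "day", "month", "season",
  "l.temp", "h.temp", "ave.temp",
  "l.temp.time", "h.temp.time",
  "rain", "ave.wind", "gust.wind", "gust.wind.time", "dir.wind"
)

# Hypothetical helper: does a data frame match the description?
check_weather <- function(df, n_days = 365) {
  missing_vars <- setdiff(expected_vars, names(df))
  list(
    has_all_vars = length(missing_vars) == 0,  # all 14 variables present
    missing_vars = missing_vars,               # names that were not found
    has_all_days = nrow(df) == n_days          # one row per day of 2014
  )
}

# Usage, assuming the data was saved as a CSV file (hypothetical file name):
# weather <- read.csv("weather_2014.csv")
# check_weather(weather)
```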
Now, let’s move on to Part 2 of this tutorial, where we will start by inspecting and preparing the data, so that we can then proceed to perform EDA.