joy of data
Source: Internet. Posted under: Linux kernel beginner books. Editor: 程序博客网. Time: 2024/05/21 08:52
http://www.joyofdata.de/blog/
A very good website about data analysis and related topics.
In this tutorial I am going to show you how to set up CUDA 7, cuDNN, caffe and DIGITS on a g2.2xlarge EC2 instance (running Ubuntu 14.04 64 bit) and how to get started with DIGITS. To illustrate DIGITS’ application I use a current Kaggle competition about detecting diabetic retinopathy and its state from fluorescein angiography.
Convolutional Deep Neural Networks for Image Classification
For classification or regression on images you have two choices:
- Engineering features and, based on them, translating an image into a vector
- Relying on a convolutional DNN to figure out the features
Deep Neural Networks are computationally quite demanding. This is the case for two reasons:
- The input data is much larger: even a small image resolution of 256 x 256 RGB pixels implies 196’608 input neurons (256 x 256 x 3). If you engineer your features intelligently, then 1000 neurons would already be a lot.
- Saddling the network with the burden of figuring out the relevant features also requires a more sophisticated network structure and more layers.
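The size difference from the first point can be checked with a few lines of arithmetic. This is only a sketch: the figure of 1000 engineered features is the illustrative number from the text, not a measured value.

```python
# Number of input neurons for a raw 256 x 256 RGB image fed directly into a network.
width, height, channels = 256, 256, 3
raw_inputs = width * height * channels

# Illustrative size of a hand-engineered feature vector (the text calls 1000 "a lot").
engineered_inputs = 1000

print(raw_inputs)                      # 196608
print(raw_inputs / engineered_inputs)  # 196.608
```

So the raw image has roughly 200 times as many inputs as an already generous engineered feature vector.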
Luckily, many of the floating point matrix operations involved have been unintentionally addressed by your graphics card’s GPU.
NVIDIA DIGITS and caffe
There are three major GPU-utilizing Deep Learning frameworks available – Theano, Torch and caffe. NVIDIA DIGITS is a web server providing a convenient web interface for training and testing Deep Neural Networks based on caffe. I intend to cover in a future article how to work with caffe. Here I will show you how to set up CUDA
First of all you need an AWS account and a g2.2xlarge instance up and running. That is mostly self-explanatory – for the command line parts (and some tips) you might want to have a look at my previous tutorial “Guide to EC2 from the Command Line“. Make sure to add an inbound rule for port 5000 for your IP, because this is where the DIGITS server is made available.
Installing CUDA 7
The main source for this step is Markus Beissinger’s blog post on setting up Theano.
Installing cuDNN
To further speed up the calculations relevant for deep learning it is a good idea to set up the cuDNN library. For that purpose you will have to get an NVIDIA developer account and join the CUDA registered developer program. The last step requires NVIDIA to unlock your account and that might take one or two days. But you can also get started without the cuDNN library. As soon as you have the okay from them – download cuDNN and upload it to your instance.
Installing caffe
The main source for this and the following step is the readme of the DIGITS project.
Installing DIGITS
Starting and Configuring DIGITS
The first time you start DIGITS it will ask you a number of questions for the purpose of its configuration. Those settings are pretty much self-explanatory and you can change them afterwards in ~/.digits/digits.cfg. You might want to consider locating your job directory (jobs_dir) on an EBS volume – the data set of about 140’000 PNGs in the example I feature here consumes about 10 GB of space and the trained model (with all its snapshots) accounts for about 1 GB.
Troubleshooting DIGITS
When you start DIGITS for the first time you might run into a number of errors and warnings. Here’s my take on them.
Getting Started with DIGITS
First you have to create the data set on which you want to train a model. You have to provide at least one large set of pictures for the training and optionally two smaller sets for validation and testing. You can either separate those sets (and their correct labels) by means of different folders or – what I’d recommend – by providing corresponding CSVs. Those CSVs are supposed to feature two unnamed, tab-separated columns. The first column keeps the full path of the image (don’t use ~ for home, but its full path equivalent) and the second column keeps a 0-based index referencing the correct class. You will also have to provide a text file holding the different classes – one per line. For example, if you have two classes “pos” (1st line) and “neg” (2nd line), then an image belonging to class “pos” would have to have a class index of 0 associated with it. Loading might take a while. Loading my 140’000 PNGs with 256×256 resolution took about one hour.
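The two files described above could be generated along these lines. This is a minimal sketch, not the script used for this article: the function name `write_digits_lists`, the file names `labels.txt` and `train.csv`, and the sample image paths are all made up for illustration.

```python
import csv
import os
import tempfile


def write_digits_lists(samples, classes, out_dir):
    """Write a class file (one class per line) and a tab-separated image list.

    samples: iterable of (image_path, class_name) pairs.
    classes: class names in the order that defines their 0-based index.
    File names below are hypothetical; use whatever names you like.
    """
    labels_path = os.path.join(out_dir, "labels.txt")
    list_path = os.path.join(out_dir, "train.csv")

    with open(labels_path, "w", newline="") as f:
        f.write("\n".join(classes) + "\n")

    index = {name: i for i, name in enumerate(classes)}
    with open(list_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for path, class_name in samples:
            # Expand "~" and make the path absolute, as the text advises.
            full_path = os.path.abspath(os.path.expanduser(path))
            writer.writerow([full_path, index[class_name]])
    return labels_path, list_path


out_dir = tempfile.mkdtemp()
samples = [("/data/dr/img_0001.png", "pos"), ("/data/dr/img_0002.png", "neg")]
labels_path, list_path = write_digits_lists(samples, ["pos", "neg"], out_dir)

with open(list_path) as f:
    rows = f.read().splitlines()
print(rows)
```

With “pos” on the first line of the class file, every “pos” image ends up with index 0 in the second column, matching the example in the text.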
Setting up the model you intend to train is even easier provided you stick with the suggested defaults – just choose the data set you want to use, a network and you’re ready to go! Training a GoogLeNet for 30 epochs on the described data set took about one day and 6 hours. This is why you should make sure that …
- … your bidding for a Spot instance is not too low – or you risk it being terminated
- … you start the server in a tmux session. Otherwise, if you lose the connection – maybe because your IP changes overnight – the server process will be killed
Tackling the Diabetic Retinopathy Kaggle challenge
The provided training set consists of about 35 thousand high-resolution images – zipped and split across five files. The whole zip archive is about 33 GB large. I downloaded the five components directly onto an EBS volume using lynx, because you can just regularly log on and initiate the download. The download speed on the g2.2xlarge instance was incredible, by the way – you are granted up to 100 MB per second. I started all five downloads in parallel – each going at 6 MB per second. And yes, it’s megabyte – not megabit (the unit DSL providers use).
The visible indicators of diabetic retinopathy are, as I understand it, mostly leaking (aneurysms) and pathologically growing blood vessels. I figure those features are mirror and rotation invariant. So to increase the available training set I created four versions:
- (A): As is but resized to 256×256 pixels and saved as PNG
- (R): 180 degree rotation of (A)
- Vertical mirroring of (A)
- Vertical mirroring of (R)
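The four versions can be sketched on a tiny grid of numbers standing in for an image. This is not the bash script used for this article. It assumes that “vertical mirroring” means flipping top to bottom, and the function and key names are made up for illustration.

```python
def rotate_180(img):
    """Rotate a row-major pixel grid by 180 degrees."""
    return [row[::-1] for row in img[::-1]]


def mirror_vertical(img):
    """Flip top to bottom (assumed meaning of 'vertical mirroring')."""
    return [row[:] for row in img[::-1]]


def four_versions(img):
    """Return the four augmented variants of one image."""
    a = [row[:] for row in img]
    r = rotate_180(a)
    return {
        "A": a,
        "R": r,
        "A_mirrored": mirror_vertical(a),
        "R_mirrored": mirror_vertical(r),
    }


image = [[1, 2, 3],
         [4, 5, 6]]
versions = four_versions(image)
for name, pixels in versions.items():
    print(name, pixels)
```

Under this assumption, mirroring (R) top to bottom gives the same result as mirroring (A) left to right, so the four versions cover both mirror axes and the 180 degree rotation.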
Because the task at hand is obviously not a classification but a regression, I abstained from attempting to learn a classification into no DR and the four stages of DR. I labelled all DR cases as “positive” and the no-DR cases as “negative”. This would have to be done for all four possible splits ({0} vs {1,…,4}, …, {0,…,3} vs {4}) and those predictions would finally be regressed against the actual stage.
You may find the bash script for this transformation at “bash commands for the processing”.
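The four splits amount to four binary labels per image, one per threshold between stages. A minimal sketch of that encoding follows. The function name `binary_targets` is made up, and the final regression step is not shown.

```python
def binary_targets(stage, n_stages=5):
    """Encode a DR stage (0..4) as one binary label per split.

    Label k is 1 if the stage lies above threshold k, i.e. on the
    right-hand side of the split {0,...,k} vs {k+1,...,4}.
    """
    return [1 if stage > k else 0 for k in range(n_stages - 1)]


for stage in range(5):
    print(stage, binary_targets(stage))
```

The first label is the positive/negative split used in this article ({0} vs {1,…,4}). As a sanity check on the encoding, the labels of each image sum to its stage.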
The Result
Well … on one hand I would have liked to see a higher accuracy – on the other hand I can barely (if at all) make out the difference between some healthy cases and some extreme stage four cases. As 73.95% is the share of negative cases, this is also where the accuracy of the network started out. In the course of 30 epochs it improved by about 8 p.p. to 81.8%.
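To put the numbers in perspective: a network that always predicts “negative” already reaches the share of negative cases as its accuracy. The sketch below uses only the two percentages quoted above. The relative error reduction is an extra way of reading them, not a figure reported by DIGITS.

```python
# Accuracy of always predicting the majority class ("negative"), in percent.
baseline_accuracy = 73.95
# Accuracy after 30 epochs, in percent.
final_accuracy = 81.8

gain_pp = final_accuracy - baseline_accuracy
# Share of the baseline's errors that the trained network removes.
relative_error_reduction = gain_pp / (100 - baseline_accuracy)

print(round(gain_pp, 2))                   # 7.85
print(round(relative_error_reduction, 3))  # 0.301
```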
Any Questions?
I highly recommend the DIGITS Google Group for your questions on features and issues. The developers of DIGITS are very helpful and open to suggestions.