Machine Learning Portfolio in Manufacturing Intelligence Practice

Source: Internet | Published by: 太湖超级计算机知乎 | Editor: 程序博客网 | Time: 2024/06/08 16:13

Machine learning processes

Step1: Define your problem
Step2: Prepare your data
Step3: Spot-check algorithms
Step4: Improve results
Step5: Present results

Machine learning practice tools

Weka (open-source UI platform)
No coding and no deep mathematics required, and it is quick to get started.
Azure Machine Learning Studio
You can apply for a free account and use it for practice.

Small in-memory datasets for practice

We choose some datasets from the UCI Machine Learning Repository and use the Weka platform to practice the basic machine learning process.
Dataset:
iris.data (the iris flowers classification dataset) from UCI
Note: Weka's native format is .arff, so we can load the .csv file into Weka and then save it as .arff.
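
The conversion in the note above is normally done inside Weka itself. To show what the resulting .arff file contains, here is a minimal Python sketch that writes the same structure by hand. The function name csv_to_arff is hypothetical, the three sample rows are a small excerpt rather than the full file, and the sketch assumes a headerless CSV whose last column is the nominal class, so the attribute names must be supplied by the caller.

```python
import csv
import io


def csv_to_arff(csv_text, relation, attribute_names):
    """Convert headerless CSV text to ARFF text.

    Assumption: every column except the last is numeric, and the last
    column is a nominal class attribute.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    # Collect the distinct class labels for the nominal declaration.
    class_values = sorted({row[-1] for row in rows})
    nominal = "{" + ",".join(class_values) + "}"

    lines = ["@RELATION " + relation, ""]
    for name in attribute_names[:-1]:
        lines.append("@ATTRIBUTE " + name + " NUMERIC")
    lines.append("@ATTRIBUTE " + attribute_names[-1] + " " + nominal)
    lines.append("")
    lines.append("@DATA")
    lines.extend(",".join(row) for row in rows)
    return "\n".join(lines) + "\n"


# A three-row excerpt in the iris.data layout (one row per species).
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

arff = csv_to_arff(
    SAMPLE,
    "iris",
    ["sepallength", "sepalwidth", "petallength", "petalwidth", "class"],
)
print(arff)
```

The header declares each attribute and its type, and everything after @DATA is the original comma-separated rows, which is why the conversion loses no information.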

Modeling steps overall

Problem definition

We will use the iris flowers classification dataset.
Each instance in the iris dataset describes the measurements of one iris flower.
The task is to predict which of the 3 iris species an observation belongs to.

Prepare your data

Load dataset
In Weka, we can use the Explorer interface to load the dataset in .arff format and inspect the data.
Analyze the dataset
Review the distribution of each attribute and the interactions between attributes, which may shed light on specific data transforms and specific modeling techniques that we could use.
- Summary statistics
- We notice a few things (a basic description of the dataset from a business view):
  1. The dataset is called iris.
  2. There are 150 instances. If we later use 10-fold cross-validation to evaluate the algorithms, each fold will contain 15 instances, which is quite small. We may want to think about using 5 folds of 30 instances instead.
  3. There are 5 attributes: 4 inputs and 1 output variable.
  4. There are a small number of attributes and we could investigate further using feature selection methods.
  5. Click on each attribute in the Attributes pane and review the summary statistics in the Selected attribute pane.
- We can notice a few facts about our data (a deeper, technical description of the dataset):
  1. There are no missing values for any of the attributes.
  2. All inputs are numeric and have values in the same range, between about 0 and about 8.
  3. The last attribute is the output variable, called class. It is nominal and has three values.
  4. The classes are balanced, meaning that there is an equal number of instances in each class.
  5. If they were not balanced, we might want to think about balancing them.
    We may see some benefit from either normalizing or standardizing the data.
- Attribute distributions (a further, deeper description of the attributes)
  We can use the Visualize All button to review the graphical distribution of each attribute.
  1. We can see overlapping but differing distributions for each of the class values on each of the attributes. This is a good sign, as we can probably separate the classes.
  2. It looks like sepalwidth has a Gaussian-like distribution. If we had a lot more data, perhaps it would be even more Gaussian.
  3. It looks like the other 3 input attributes have nearly-Gaussian distributions with a skew or a large number of observations at the low end of the distribution. Again, it makes me think that the data may be Gaussian if we had an order of magnitude more examples.
  4. We also get a visual indication that the classes are balanced.
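
The checks above are all done by clicking through the Weka Explorer. As a rough cross-check outside Weka, the same questions (missing values, value ranges, class balance, fold sizes, and the effect of normalizing or standardizing) can be sketched in plain Python. This is only an illustrative sketch: all function names are hypothetical, SAMPLE_ROWS is a six-row excerpt of the dataset rather than the full 150 instances, and the rescaling helpers show the general idea, not Weka's exact filters.

```python
from collections import Counter
from statistics import mean, stdev

ATTRIBUTES = ["sepallength", "sepalwidth", "petallength", "petalwidth"]

# Six-row excerpt (two per species); load the full file for real analysis.
SAMPLE_ROWS = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
    (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
]


def summarize(rows):
    """Per-attribute missing count, min, max, mean and sample std dev."""
    summary = {}
    for i, name in enumerate(ATTRIBUTES):
        column = [row[i] for row in rows]
        present = [v for v in column if v is not None]
        summary[name] = {
            "missing": len(column) - len(present),
            "min": min(present),
            "max": max(present),
            "mean": mean(present),
            "stdev": stdev(present),
        }
    return summary


def class_counts(rows):
    """Number of instances per class label (last column)."""
    return Counter(row[-1] for row in rows)


def fold_sizes(n_instances, k):
    """Sizes of k cross-validation folds, as equal as possible."""
    base, extra = divmod(n_instances, k)
    return [base + 1 if i < extra else base for i in range(k)]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


for name, stats in summarize(SAMPLE_ROWS).items():
    print(name, stats)
print(class_counts(SAMPLE_ROWS))
print(fold_sizes(150, 10), fold_sizes(150, 5))
```

With 150 instances, fold_sizes reproduces the arithmetic from the summary notes: 10 folds give 15 instances each and 5 folds give 30 each.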
