Machine Learning Portfolio in Manufacturing Intelligence Practice

Source: Internet  Editor: 程序博客网  Time: 2024/06/08 16:13

Machine learning processes

  • Step 1: Define your problem
  • Step 2: Prepare your data
  • Step 3: Spot-check algorithms
  • Step 4: Improve results
  • Step 5: Present results

Machine learning practice tools

  • Weka (open-source platform with a graphical UI)
    No coding or deep mathematics required, and it is quick to get started.
  • Azure Machine Learning Studio
    You can apply for a free account to practice.

Small in-memory dataset for practice

  • We choose a dataset from the UCI Machine Learning Repository and use the Weka platform to practice the basic machine learning process.

  • Dataset:
    iris.data (the iris flowers classification dataset) from UCI
    Note: Weka's native file format is .arff, so we can load the .csv file into Weka and then save it as .arff.
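
For readers who prefer to do the .csv to .arff conversion outside Weka, here is a minimal sketch in Python. It assumes the input is headerless, comma-separated text like iris.data, with four numeric columns followed by the class label. The attribute names and the helper name csv_to_arff are illustrative assumptions, not part of the original dataset file, and the three sample rows are only an excerpt.

```python
import csv
import io

# Assumption: iris.data has no header row, so these attribute names are supplied by hand.
ATTRIBUTES = ["sepallength", "sepalwidth", "petallength", "petalwidth"]


def csv_to_arff(csv_text, relation="iris"):
    """Convert headerless iris-style CSV text into ARFF text (hypothetical helper)."""
    # csv.reader yields an empty list for blank lines, so filter those out.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    # The nominal class attribute lists every distinct label found in the last column.
    classes = sorted({row[-1] for row in rows})

    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in ATTRIBUTES]
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# A three-row excerpt of iris.data, one row per class, used only for illustration.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

arff_text = csv_to_arff(SAMPLE)
print(arff_text)
```

The resulting text can be saved with an .arff extension and opened in the Weka Explorer. Loading the .csv in Weka and saving it as .arff, as described above, achieves the same result without any code.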

Modeling steps overall

Problem definition

We will use the Iris flowers classification dataset.
Each instance in the iris dataset describes measurements of an iris flower.
The task is to predict which of the 3 iris species an observation belongs to.

Prepare your data

  • Load dataset
    In Weka, we can use the Explorer interface to load the dataset in .arff format and check the data.
  • Analyze the dataset
    Review the distribution of each attribute and the interactions between attributes, which may shed light on specific data transforms and modeling techniques that we could use.
    • Summary statistics
    • We notice a few things (a basic description of the dataset from a business view):
      1. The dataset is called iris.
      2. There are 150 instances. If we use 10-fold cross-validation later to evaluate the algorithms, each fold will contain 15 instances, which is quite small. We may want to think about using 5 folds of 30 instances instead.
      3. There are 5 attributes, 4 inputs and 1 output variable.
      4. There are a small number of attributes and we could investigate further using feature selection methods.
      5. Click on each attribute in the Attributes pane and review the summary statistics in the Selected attribute pane.
    • We can notice a few facts about our data (a deeper description of the dataset from a technical view):
      1. There are no missing values for any of the attributes.
      2. All inputs are numeric and have values in the same range between about 0 and about 8.
      3. The last attribute is the output variable called class, it is nominal and has three values.
      4. The classes are balanced, meaning that there is an equal number of instances in each class.
      5. If they were not balanced, we may want to think about balancing them.
      6. We may see some benefit from either normalizing or standardizing the data.
    • Attribute distributions (a further, deeper description of the attributes)
      We can use the Visualize All button to review the graphical distribution of each attribute.
      1. We can see overlapping but differing distributions for each of the class values on each of the attributes. This is a good sign, as we can probably separate the classes.
      2. It looks like sepalwidth has a Gaussian-like distribution. If we had a lot more data, perhaps it would be even more Gaussian.
      3. It looks like the other 3 input attributes have nearly-Gaussian distributions with a skew or a large number of observations at the low end of the distribution. Again, it makes me think that the data may be Gaussian if we had an order of magnitude more examples.
      4. We also get a visual indication that the classes are balanced.
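
The checks in this section (summary statistics, missing values, class balance, fold sizes, and standardization) can also be reproduced outside the Weka UI. The following is a rough sketch in plain Python, not what Weka does internally. The six rows are only an excerpt of iris.data, two per class, so the printed statistics are illustrative and not those of the full 150-instance dataset. The function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
from collections import Counter
from statistics import mean, pstdev

# Attribute names as they appear in Weka; the rows are an excerpt, not the full dataset.
NAMES = ["sepallength", "sepalwidth", "petallength", "petalwidth"]
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
    (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
]


def summarize(rows):
    """Return min, max, mean and standard deviation for each input attribute."""
    stats = {}
    for i, name in enumerate(NAMES):
        column = [row[i] for row in rows]
        stats[name] = {
            "min": min(column),
            "max": max(column),
            "mean": mean(column),
            "std": pstdev(column),  # assumption: population standard deviation
        }
    return stats


def missing_count(rows):
    """Count missing cells; None and '?' (the ARFF missing marker) both count."""
    return sum(1 for row in rows for value in row if value is None or value == "?")


def class_counts(rows):
    """Count instances per class label (the last column)."""
    return Counter(row[-1] for row in rows)


def fold_size(n_instances, k):
    """Approximate number of instances in each fold of k-fold cross-validation."""
    return n_instances // k


def standardize(rows):
    """Rescale each input attribute to zero mean and unit standard deviation."""
    stats = summarize(rows)
    result = []
    for row in rows:
        scaled = tuple(
            # Guard against a constant column, whose standard deviation is 0.
            (row[i] - stats[name]["mean"]) / (stats[name]["std"] or 1.0)
            for i, name in enumerate(NAMES)
        )
        result.append(scaled + (row[-1],))
    return result


for name, s in summarize(ROWS).items():
    print(name, s)
print("missing cells:", missing_count(ROWS))
print("class counts:", dict(class_counts(ROWS)))
print("fold sizes for 150 instances:", fold_size(150, 10), "and", fold_size(150, 5))
print("first standardized row:", standardize(ROWS)[0])
```

With the full dataset loaded in place of ROWS, the class counts would show whether the classes are balanced, and fold_size(150, 10) and fold_size(150, 5) give the 15 and 30 instances per fold discussed above.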