A.I, Data and Software Engineering

Machine learning quick note


Machine learning is a term for the use of statistical techniques to give computer systems the ability to “learn” (i.e., progressively improve performance on a specific task) from data, without being explicitly programmed.

You can think of machine learning as the brains behind AI technologies, while the AI technologies carry out the actions. More technically, machine learning is the process of applying algorithmic analytical models to data over many iterations to discover hidden patterns or trends that are useful for making predictions. Typical applications include sales forecasting, customer segment analysis, insurance claim fraud detection, and hedge fund classification.

Basic machine learning path

As several experts working in the AI field suggest, beginners should not rush after buzzwords, which can be overwhelming. Instead, we should first understand the foundations, since most of the fancy terms are either variations or combinations of a few basic methods.

Fig 1: Basic machine learning path

The picture shows some basic types of machine learning algorithms, including regression, Bayesian methods, dimension reduction, and clustering. Many others are not listed, e.g. decision trees, support vector machines, genetic algorithms, and more.

Since this is a quick note, we’ll leave those for you, curious readers, to discover. Below, we name some terms commonly used in machine learning.

Basic data terminologies

  • Data: a set of values of subjects with respect to qualitative or quantitative variables.
  • Dataset (data set): a collection of data.
  • Training set: a subset of a dataset, normally ~75-90% of it. It is used to fit the parameters of a model, e.g. a classifier.
  • Validation set: a subset of a dataset, separate from the training set, used to tune the hyper-parameters.
  • Test set: a subset of the dataset that is independent of the training set. It is used to evaluate a trained model.
  • Database: a container for datasets.
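
To make the three subsets above concrete, here is a minimal sketch of how a dataset might be split. The function name `split_dataset`, the 80/10/10 proportions, and the fixed random seed are assumptions chosen for illustration; they are not prescribed by any particular library.

```python
import random

def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle records and split them into training, validation, and test sets.

    The fractions are illustrative defaults; whatever remains after the
    training and validation portions becomes the test set.
    """
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a non-empty test set")
    shuffled = list(records)                 # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)    # seeded shuffle for a reproducible split
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# A toy dataset of 100 records (here just the integers 0..99).
dataset = list(range(100))
train_set, val_set, test_set = split_dataset(dataset)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Shuffling before slicing keeps the three subsets disjoint while avoiding any ordering bias in the original data.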

Basic data set operations

  • Data visualization: the graphic representation of data.
  • Data cleansing: the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.
  • Outlier detection: identifying data points that differ significantly from the others.
  • Data reduction: the transformation of digital information derived empirically or experimentally into a corrected, ordered, and simplified form.
  • Missing data detection: identifying missing values.
  • Data editing: the process of reviewing and adjusting collected survey data.
  • Data wrangling (munging): the process of transforming and mapping data from one “raw” form into another format, with the intent of making it more appropriate and valuable for a given purpose, such as analysis.
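
As a rough illustration of two of the operations above, the sketch below detects missing values and flags outliers with a simple z-score rule. The helper names, the use of `None` to mark a missing value, the sample heights, and the threshold of 2.0 are all assumptions made for this example; real projects often use other conventions and more robust methods.

```python
import statistics

def find_missing(values):
    """Return the indices of missing entries (represented here as None)."""
    return [i for i, v in enumerate(values) if v is None]

def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)   # population standard deviation
    if stdev == 0:
        return []                       # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical raw measurements: two missing entries and one implausible height.
raw_heights_cm = [170, 165, None, 172, 168, 169, 171, 167, 320, None, 166]

missing_idx = find_missing(raw_heights_cm)
cleaned = [v for v in raw_heights_cm if v is not None]   # drop missing values
outliers = find_outliers(cleaned)
without_outliers = [v for v in cleaned if v not in outliers]

print(missing_idx)  # [2, 9]
print(outliers)     # [320]
```

Dropping records is only one way to cleanse data; depending on the task, missing values may instead be filled in and outliers inspected by hand.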

Basic machine learning terms

  • Model: a mathematical representation of a real-world process, learned from data.
  • Attribute: one particular “type of data” in an observation. E.g. weight, height, age, and gender are attributes of a person.
  • Feature: depends on the context. In many cases, it means the same as attribute.
  • Dimension: the number of attributes (in most cases) or the number of features.
  • Label: the final output assigned to an observation. In classification, the output classes are the labels.
  • Target: the variable of a dataset that we want to predict or understand better. E.g., when predicting house price from location and environment as inputs, the target is the price.
  • Training: the process of adjusting a model’s parameters by passing training data to the learning algorithm, so that the model’s outputs match the targets as closely as possible.
  • Parameters: configuration variables internal to the model that can be estimated from the training data.
  • Hyper-parameters: settings of a model that are not estimated from the training data. They are set and tuned using a combination of heuristics and the experience and domain knowledge of the data scientist.
  • Regression: used when the output is real-valued (continuous), e.g. predicting house values.
  • Classification: categorizing data into predefined classes. For example, an email can be either ‘spam’ or ‘not spam’.
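
Several of the terms above can be seen together in a small regression sketch. The example fits a straight line with plain gradient descent. The function name `train_linear_model`, the toy house data, and the chosen learning rate and epoch count are assumptions for illustration only. Here `w` and `b` are the parameters (estimated from the training data), while `learning_rate` and `epochs` are hyper-parameters (chosen by hand).

```python
def train_linear_model(features, targets, learning_rate=0.05, epochs=2000):
    """Fit the model y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0                       # parameters, learned during training
    n = len(features)
    for _ in range(epochs):               # epochs is a hyper-parameter
        # Prediction error for every training example under the current parameters.
        errors = [(w * x + b) - y for x, y in zip(features, targets)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, features))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient; the step size is the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical training set: one feature (house size in arbitrary units)
# and the target (price in arbitrary units), generated as price = 2 * size + 1.
sizes = [1.0, 2.0, 3.0, 4.0, 5.0]
prices = [3.0, 5.0, 7.0, 9.0, 11.0]

w, b = train_linear_model(sizes, prices)
predicted_price = w * 6.0 + b             # regression: a real-valued prediction
print(round(w, 3), round(b, 3), round(predicted_price, 3))  # 2.0 1.0 13.0
```

Because the output is a continuous value, this is a regression task; a classification task would instead map each input to one of a fixed set of labels.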


PetaMinds focuses on developing the coolest topics in data science, A.I, and programming, and making them digestible so that everyone can learn and create amazing applications in a short time.