**Machine Learning** is about learning from existing data to make predictions about the future. It’s based on creating models from input data sets for data-driven decision making.

There are several types of machine learning:

- Supervised learning
- Unsupervised learning
- Semi-supervised learning
- Reinforcement learning

**Supervised learning**: This technique is used to predict an outcome by training the program using an existing set of training data (labeled data). Then we use the program to predict the label for a new unlabeled data set.

Supervised machine learning has two sub-categories: regression and classification.

**Unsupervised learning:** This is used to find hidden patterns and correlations within raw data. No labeled training data is used in this model; the technique works on unlabeled data.

Algorithms like k-means and Principal Component Analysis (PCA) fall into this category.

**Semi-supervised Learning**: This technique uses both supervised and unsupervised learning models for predictive analytics. It uses labeled and unlabeled data sets for training. It typically involves using a small amount of labeled data with a large amount of unlabeled data. It can be used for machine learning methods like classification and regression.

**Reinforcement learning**: The Reinforcement Learning technique is used to learn how to maximize a numerical reward goal by trying different actions and discovering which actions result in the maximum reward.
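As a toy illustration of this trial-and-error loop, the following sketch runs an epsilon-greedy agent on a hypothetical two-armed bandit; the payout probabilities and exploration rate are made up for the example:

```python
import random

random.seed(0)

# Hypothetical two-armed bandit: each arm pays reward 1 with a fixed,
# unknown probability. The agent must discover which arm is better.
true_payout = [0.3, 0.7]

counts = [0, 0]      # times each arm was tried
values = [0.0, 0.0]  # running average reward per arm
epsilon = 0.1        # exploration rate

for step in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(2)        # explore: try a random action
    else:
        arm = values.index(max(values))  # exploit: pick the best action so far
    reward = 1 if random.random() < true_payout[arm] else 0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

best_arm = values.index(max(values))
```

After enough trials the agent's value estimates converge toward the true payout rates, and it spends most of its steps on the higher-reward action.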

| ML Model | Examples |
| --- | --- |
| Supervised learning | Fraud detection |
| Unsupervised learning | Social network applications, language prediction |
| Semi-supervised learning | Image categorization, voice recognition |
| Reinforcement learning | Artificial Intelligence (AI) applications |

From <https://www.infoq.com/articles/apache-spark-machine-learning>

**Ensemble methods** are techniques that create multiple models and then combine them to produce improved results. Ensemble methods usually produce more accurate solutions than a single model would. This has been the case in a number of machine learning competitions, where the winning solutions used ensemble methods.

From <https://www.toptal.com/machine-learning/ensemble-methods-machine-learning>

**Classification**

**Classification** is concerned with building models that separate data into distinct classes. These models are built by inputting a set of training data for which the classes are pre-labelled in order for the algorithm to learn from. The model is then used by inputting a different dataset for which the classes are withheld, allowing the model to predict their class membership based on what it has learned from the training set. Well-known classification schemes include **decision trees** and **support vector machines**. As this type of algorithm requires explicit class labelling, classification is a form of **supervised learning**.

**Regression**

Regression is very closely related to classification. While classification is concerned with the prediction of discrete classes, regression is applied when the “class” to be predicted consists of continuous numerical values. **Linear regression** is an example of a regression technique.
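As a minimal illustration (with made-up, noise-free points), ordinary least squares for a single feature can be computed in closed form:

```python
# Toy data: y = 2x + 1 exactly, so the fit recovers the true line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares for one feature:
# slope = cov(x, y) / var(x), intercept = mean_y - slope * mean_x
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

predict = lambda x: intercept + slope * x
```

Here the fitted model predicts a continuous value for any new input, e.g. `predict(5.0)`.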

**Loss Function**

In mathematical optimization, statistics, decision theory and machine learning, a **loss function** or **cost function** is a function that maps an event or values of one or more variables onto a real number intuitively representing some “cost” associated with the event.

An optimization problem seeks to minimize a loss function. An **objective function** is either a loss function or its negative (sometimes called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized.

In statistics, statistical decision theory, and economics, a loss function maps an event (an element of a sample space) to a real number representing the economic cost or opportunity cost associated with that event.

In general, a loss function consists of a loss term and a regularization term. A good introductory reference:

http://www.ics.uci.edu/~dramanan/teaching/ics273a_winter08/lectures/lecture14.pdf (titled “Loss functions; a unifying view”).

**Loss terms**

- For regression problems, common choices are squared loss (for linear regression) and absolute loss.
- For classification problems, common choices are hinge loss (for soft-margin SVM) and log loss (for logistic regression).

Notes:

- Hinge loss can be further divided into hinge loss (also called L1 loss) and squared hinge loss (also called L2 loss). Liblinear, released by Chih-Jen Lin of National Taiwan University, implements both. Note that L1/L2 loss here is different from the L1/L2 regularization below.

**Regularization terms**

- The most common are L1-regularization and L2-regularization. The reference above gives a detailed summary of these as well.

From <http://blog.csdn.net/yhdzw/article/details/39291493>
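The loss and regularization terms above can be sketched in a few lines; the function names and example weights below are my own, chosen for illustration:

```python
import math

# Regression losses: compare a prediction to the true value.
def squared_loss(y, y_hat):   # used by linear regression
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    return abs(y - y_hat)

# Classification losses: y is +1/-1, score is the raw model output.
def hinge_loss(y, score):     # soft-margin SVM
    return max(0.0, 1.0 - y * score)

def log_loss(y, score):       # logistic regression
    return math.log(1.0 + math.exp(-y * score))

# Regularization terms on the weight vector.
def l1_penalty(w):
    return sum(abs(wi) for wi in w)

def l2_penalty(w):
    return sum(wi * wi for wi in w)

# Objective = loss term + regularization term, e.g. for one example:
w = [0.5, -1.0]
objective = hinge_loss(+1, 0.3) + 0.1 * l2_penalty(w)
```

Note how hinge loss is zero for confidently correct scores (`hinge_loss(+1, 2.0) == 0`) but grows linearly for scores inside or beyond the margin.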

**Overfitting**

In statistics and machine learning, one of the most common tasks is to fit a “model” to a set of training data, so as to be able to make reliable predictions on general untrained data. In **overfitting**, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.

The possibility of overfitting exists because the criterion used for training the model is not the same as the criterion used to judge its efficacy. In particular, a model is typically trained by maximizing its performance on some set of training data.

Overfitting occurs when a model begins to “memorize” the training data rather than “learning” to generalize from a trend.
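The effect can be shown with a small numeric toy: a degree-(n−1) interpolating polynomial drives training error to zero, yet predicts a held-out point far worse than a plain least-squares line. All data below is fabricated for the demo; the true trend is y = x with noisy labels:

```python
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.0, 1.8, 1.5, 3.8, 3.4]  # noisy observations of y = x

def interpolate(xs, ys, x):
    """Degree-(n-1) Lagrange interpolation: passes through every training
    point exactly, i.e. it memorizes the data, noise included."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# A straight-line fit (ordinary least squares) as the simpler model.
n = len(train_x)
mx, my = sum(train_x) / n, sum(train_y) / n
slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) \
        / sum((x - mx) ** 2 for x in train_x)
intercept = my - slope * mx
line = lambda x: intercept + slope * x

# The interpolator has zero training error...
train_error = max(abs(interpolate(train_x, train_y, x) - y)
                  for x, y in zip(train_x, train_y))

# ...but on a held-out point (true value 3.5) it overreacts to the noise.
overfit_error = abs(interpolate(train_x, train_y, 3.5) - 3.5)
line_error = abs(line(3.5) - 3.5)
```

The complex model wins on the training set and loses badly off it, which is exactly the overfitting trade-off described above.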

**Clustering**

Clustering is used for analyzing data which does not include pre-labeled classes, or even a class attribute at all. Data instances are grouped together using the concept of “maximizing the intraclass similarity and minimizing the interclass similarity,” as concisely described by Han, Kamber & Pei. This translates to the clustering algorithm identifying and grouping instances which are very similar, as opposed to ungrouped instances which are much less-similar to one another. **k-means** clustering is perhaps the most well-known example of a clustering algorithm. As clustering does not require the pre-labeling of instance classes, it is a form of **unsupervised learning**, meaning that it learns by observation as opposed to learning by example.
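A minimal sketch of k-means (Lloyd's algorithm) on made-up one-dimensional data, with two clusters and deliberately poor starting centroids:

```python
# Two obvious groups of points around 1.0 and around 8.0 (made-up data).
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids = [0.0, 10.0]  # deliberately bad initial guesses

for _ in range(10):  # a few iterations are plenty here
    # Assignment step: each point joins its nearest centroid's cluster.
    clusters = [[], []]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: each centroid moves to the mean of its cluster.
    centroids = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]
```

The two steps maximize intraclass similarity (points join the closest centroid) and the centroids settle on the cluster means, here near 1.0 and 8.07.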

**Association**

Association is most easily explained by introducing market basket analysis, a typical task for which it is well-known. Market basket analysis attempts to identify associations between the various items that have been chosen by a particular shopper and placed in their market basket, be it real or virtual, and assigns support and confidence measures for comparison. The value of this lies in cross-marketing and customer behavior analysis. Association is a generalization of market basket analysis, and is similar to classification except that any attribute can be predicted in association. **Apriori** enjoys success as the most well-known example of an association algorithm. Association is another example of **unsupervised learning**.

**Decision trees**

Decision trees are top-down, recursive, divide-and-conquer classifiers. Decision trees are generally composed of 2 main tasks: tree induction and tree pruning. Tree induction is the task of taking a set of pre-classified instances as input, deciding which attributes are best to split on, splitting the dataset, and recursing on the resulting split datasets until all training instances are categorized. While building our tree, the goal is to split on the attributes which create the purest child nodes possible, which would keep to a minimum the number of splits that would need to be made in order to classify all instances in our dataset. This purity is measured by the concept of information, which relates to how much would need to be known about a previously-unseen instance in order for it to be properly classified.

A completed decision tree model can be overly-complex, contain unnecessary structure, and be difficult to interpret. Tree pruning is the process of removing the unnecessary structure from a decision tree in order to make it more efficient, more easily-readable for humans, and more accurate as well. This increased accuracy is due to pruning’s ability to reduce overfitting.

**Support vector machines (SVMs)**

SVMs are able to classify both linear and nonlinear data. SVMs work by transforming the training dataset into a higher dimension, which is then inspected for the optimal separation boundary, or boundaries, between classes. In SVMs, these boundaries are referred to as hyperplanes, which are identified by locating support vectors, or the instances that most essentially define classes, and their margins, which are the lines parallel to the hyperplane defined by the shortest distance between a hyperplane and its support vectors.

The grand idea with SVMs is that, with a high enough number of dimensions, a hyperplane separating 2 classes can always be found, thereby delineating dataset member classes. When repeated a sufficient number of times, enough hyperplanes can be generated to separate all classes in n-dimension space.

From <http://www.kdnuggets.com/2016/05/machine-learning-key-terms-explained.html/2>

**Kernel Function**


A support vector machine [1] maps the input space into a high-dimensional feature space via some nonlinear transformation φ(x). The dimensionality of the feature space may be very high. If solving the SVM only requires inner products, and there exists a function K(x, x′) in the low-dimensional input space that equals the inner product in the high-dimensional space, i.e. K(x, x′) = <φ(x) ⋅ φ(x′)>, then the SVM never needs to compute the complex nonlinear transformation explicitly: the function K(x, x′) yields the inner product of the transformed vectors directly, greatly simplifying the computation. Such a function K(x, x′) is called a kernel function.

Kernel functions include the linear kernel, the polynomial kernel, the Gaussian kernel, and others. The Gaussian kernel is the most commonly used; it can map data into an infinite-dimensional space. It is also called the Radial Basis Function (RBF) kernel: a radially symmetric scalar function, usually defined as a monotonic function of the Euclidean distance between any point x and some center xc, written k(||x − xc||). Its effect is local: the function value is small when x is far from xc.
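The Gaussian (RBF) kernel just described is a one-liner; note the local behavior, near 1 close to the center and vanishing far from it:

```python
import math

def rbf_kernel(x, xc, sigma=1.0):
    """Gaussian / RBF kernel: k(x, xc) = exp(-||x - xc||^2 / (2 sigma^2)),
    a monotonically decreasing function of the Euclidean distance ||x - xc||."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xc))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

near = rbf_kernel((0.0, 0.0), (0.1, 0.0))  # close to the center -> close to 1
far = rbf_kernel((0.0, 0.0), (3.0, 4.0))   # far from the center -> close to 0
```

`sigma` controls how quickly the kernel's influence decays with distance.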

**Discriminative Model vs. Generative Model**

Let’s say you have input data x and you want to classify the data into labels y. A **generative model** learns the **joint probability distribution p(x,y)** and a **discriminative model** learns the **conditional probability distribution p(y|x)** – which you should read as ‘the probability of y given x’.

Here’s a really simple example. Suppose you have the following data in the form (x,y):

(1,0), (1,0), (2,0), (2, 1)

p(x,y) is:

|     | y=0 | y=1 |
| --- | --- | --- |
| x=1 | 1/2 | 0   |
| x=2 | 1/4 | 1/4 |

p(y|x) is:

|     | y=0 | y=1 |
| --- | --- | --- |
| x=1 | 1   | 0   |
| x=2 | 1/2 | 1/2 |

If you take a few minutes to stare at those two matrices, you will understand the difference between the two probability distributions.

The distribution p(y|x) is the **natural** distribution for classifying a given example x into a class y, which is why algorithms that model this directly are called **discriminative** algorithms. Generative algorithms model p(x,y), which can be transformed into p(y|x) by applying Bayes’ rule and then used for classification. However, the distribution p(x,y) can also be used for other purposes. For example, you could use p(x,y) to **generate** likely (x,y) pairs.

From the description above you might be thinking that generative models are more generally useful and therefore better, but it’s not as simple as that. This paper is a very popular reference on the subject of discriminative vs. generative classifiers, but it’s pretty heavy going. **The overall gist is that discriminative models generally outperform generative models in classification tasks.**
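The two distributions above can be computed directly from the four sample points by counting; a small sketch using exact fractions:

```python
from collections import Counter
from fractions import Fraction

data = [(1, 0), (1, 0), (2, 0), (2, 1)]  # the (x, y) sample above
n = len(data)

# Generative view: joint distribution p(x, y), estimated by counting pairs.
joint = {pair: Fraction(c, n) for pair, c in Counter(data).items()}

# Discriminative view: conditional distribution p(y | x) = p(x, y) / p(x),
# i.e. counts normalized within each value of x.
x_counts = Counter(x for x, _ in data)
conditional = {(x, y): Fraction(c, x_counts[x])
               for (x, y), c in Counter(data).items()}
```

`joint[(1, 0)]` reproduces the 1/2 entry of the first matrix, while `conditional[(2, 1)]` reproduces the 1/2 entry of the second.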

Another explanation, excerpted below:

- A discriminative model, also called a conditional model or conditional probability model, estimates the conditional distribution p(class|context).
- A generative model estimates the joint probability distribution, p(class, context) = p(class|context) * p(context).

From <http://blog.163.com/mageng11@126/blog/static/140808374201011411420710/>

**Algorithms**

| ML Model | Problems | Algorithms |
| --- | --- | --- |
| Supervised Learning | Classification, Regression, Anomaly Detection | Logistic Regression, Back Propagation Neural Network |
| Unsupervised Learning | Clustering, Dimensionality reduction | k-Means, Apriori algorithm |
| Semi-Supervised Learning | Classification, Regression | Self training, Semi-supervised Support Vector Machines (S3VMs) |

**Steps in a Machine Learning Program**

When working on machine learning projects, data preparation, cleansing, and analysis are just as important as the learning models and algorithms used to solve the business problems.

The following steps are performed in a typical machine learning program.

- Featurization
- Training
- Model Evaluation

Figure 1 below shows the process flow of a typical machine learning solution.

From <https://www.infoq.com/articles/apache-spark-machine-learning>

**Benefits and Weaknesses of Various Binary Classification Metrics**

**Accuracy**

- Definition – Proportion of instances you predict correctly.
- Strengths – Very intuitive and easy to explain.
- Weaknesses – Works poorly when the signal in the data is weak compared to the signal from the class imbalance. Also, you cannot express your uncertainty about a certain prediction.

**Area under the curve (AUC)**

- Definition (intuitive) – Given a random positive instance and a random negative instance, the probability that you can distinguish between them.
- Definition (direct) – The area under the ROC curve.
- Strengths – Works well when you want to be able to test your ability to distinguish the two classes.
- Weaknesses – You may not be able to interpret your predictions as probabilities if you use AUC, since AUC only cares about the rankings of your prediction scores and not their actual value. Thus you may not be able to express your uncertainty about a prediction, or even the probability that an item is successful.

**LogLoss / Deviance**

- Strengths – Your estimates can be interpreted as probabilities.
- Weaknesses – If you have a lot of predictions that are near the boundaries, your error metric may be very sensitive to false positives or false negatives.

**F-score, Mean Average Precision, Cohen’s Kappa**

These are more esoteric and not as often used for general binary classification tasks. You may see them in specific subfields (e.g. F-score in NLP and precision metrics in information retrieval).

From <https://www.quora.com/What-are-benefits-and-weaknesses-of-various-binary-classification-metrics>
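The first three metrics can be sketched in a few lines on made-up labels and predicted probabilities; AUC here uses the pairwise “intuitive” definition above rather than integrating the ROC curve, though the two agree:

```python
import math

# Made-up binary labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.4, 0.8, 0.3, 0.2]

# Accuracy: proportion of correct predictions at a 0.5 threshold.
accuracy = sum((p >= 0.5) == bool(y) for y, p in zip(y_true, y_prob)) / len(y_true)

# AUC (intuitive definition): probability that a random positive instance
# is scored higher than a random negative one; ties count one half.
pos = [p for y, p in zip(y_true, y_prob) if y == 1]
neg = [p for y, p in zip(y_true, y_prob) if y == 0]
auc = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg) / (len(pos) * len(neg))

# Log loss: rewards calibrated probabilities, punishes confident mistakes.
log_loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob)) / len(y_true)
```

Note that AUC only looks at the ranking of the scores, which is exactly why it cannot be read as a probability estimate for any single prediction.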

**Collaborative filtering**

Collaborative filtering is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix. spark.mllib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. spark.mllib uses the alternating least squares (ALS) algorithm to learn these latent factors. The implementation in spark.mllib has the following parameters:

- numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
- rank is the number of latent factors in the model.
- iterations is the number of iterations of ALS to run. ALS typically converges to a reasonable solution in 20 iterations or less.
- lambda specifies the regularization parameter in ALS.
- implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
- alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.

From <http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#collaborative-filtering>
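As an illustrative sketch (not the Spark implementation), a rank-1 alternating least squares loop on a tiny made-up dense ratings matrix shows the closed-form alternation; `lambda_` plays the role of the regularization parameter listed above:

```python
# Tiny made-up ratings matrix: rows = users, columns = items.
R = [[5.0, 4.0, 1.0],
     [4.0, 5.0, 1.0],
     [1.0, 1.0, 5.0]]
lambda_ = 0.01   # regularization parameter, as in ALS above
u = [1.0] * 3    # user factors (rank 1)
v = [1.0] * 3    # item factors (rank 1)

def sq_error(R, u, v):
    """Squared reconstruction error of the rank-1 model u_i * v_j."""
    return sum((R[i][j] - u[i] * v[j]) ** 2
               for i in range(len(u)) for j in range(len(v)))

before = sq_error(R, u, v)
for _ in range(20):  # ALS typically converges in a handful of iterations
    # Fix v, solve each user factor in closed form...
    u = [sum(R[i][j] * v[j] for j in range(3)) /
         (sum(vj * vj for vj in v) + lambda_) for i in range(3)]
    # ...then fix u and solve each item factor the same way.
    v = [sum(R[i][j] * u[i] for i in range(3)) /
         (sum(ui * ui for ui in u) + lambda_) for j in range(3)]
after = sq_error(R, u, v)
```

Each half-step is an ordinary regularized least-squares solve, which is what makes the alternation cheap and parallelizable.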

**Dimensionality Reduction**

In machine learning, dimensionality reduction refers to mapping data points from the original high-dimensional space into a lower-dimensional space via some mapping. Its essence is learning a mapping function f: x -> y, where x is the original representation of a data point (most often a vector) and y is its low-dimensional representation after mapping; usually the dimension of y is smaller than that of x (although increasing the dimension is also possible). f may be explicit or implicit, linear or nonlinear.

Most current dimensionality-reduction algorithms handle data in vector form, though some handle data represented as higher-order tensors. The reason for using a reduced representation is that the original high-dimensional space contains redundant information and noise, which causes errors and reduces accuracy in practical applications such as image recognition. Through dimensionality reduction we hope to reduce the error caused by redundancy and improve the accuracy of recognition (or other applications), or to discover the intrinsic structural features of the data.

In many pipelines, dimensionality reduction is part of data preprocessing, as with PCA. In fact, some algorithms are hard to make work well without such preprocessing.

**Principal Component Analysis (PCA)**

PCA is the most commonly used linear dimensionality-reduction method. Its goal is to map high-dimensional data into a lower-dimensional space through a linear projection, maximizing the variance of the data along the projected dimensions, so that fewer dimensions are used while retaining as much of the original data’s character as possible.

Intuitively, if all points were mapped onto a single point, almost all information (such as the distances between points) would be lost; if instead the variance after projection is as large as possible, the points spread out and more information is preserved. It can be shown that PCA is the linear dimensionality-reduction method that loses the least information from the original data. (In effect it stays closest to the original data, but PCA does not attempt to explore the data’s internal structure.)
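A from-scratch sketch of this maximum-variance projection on made-up 2-D points: the first principal component is found via power iteration on the 2×2 covariance matrix, then the data is projected from two dimensions down to one:

```python
import math

# Made-up 2-D points, strongly correlated along the y = x direction.
pts = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
       (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

n = len(pts)
mx = sum(p[0] for p in pts) / n
my = sum(p[1] for p in pts) / n
centered = [(x - mx, y - my) for x, y in pts]

# 2x2 covariance matrix of the centered data.
cxx = sum(x * x for x, _ in centered) / n
cyy = sum(y * y for _, y in centered) / n
cxy = sum(x * y for x, y in centered) / n

# Power iteration: repeatedly multiplying by the covariance matrix
# converges to the direction of maximum variance (first principal component).
w = (1.0, 0.0)
for _ in range(50):
    wx = cxx * w[0] + cxy * w[1]
    wy = cxy * w[0] + cyy * w[1]
    norm = math.hypot(wx, wy)
    w = (wx / norm, wy / norm)

# Project each point onto the principal direction (dimension 2 -> 1).
projected = [x * w[0] + y * w[1] for x, y in centered]
```

For this data the dominant direction comes out near (0.68, 0.74), i.e. roughly the diagonal along which the points spread.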

**LDA**

Linear Discriminant Analysis (also called Fisher Linear Discriminant) is a supervised linear dimensionality-reduction algorithm. Unlike PCA, which preserves data information, LDA aims to make the reduced data points as easy to discriminate as possible.

Suppose the original data is represented as X (an m*n matrix, where m is the dimensionality and n is the number of samples).

Since the method is linear, we seek a projection vector a such that the projected data a'X has the following two properties:

1. Data points of the same class are as close as possible (within-class).

2. Data points of different classes are as far apart as possible (between-class).

**Locally Linear Embedding (LLE)**

LLE is a nonlinear dimensionality-reduction algorithm that preserves the original manifold structure of the data after reduction. LLE is one of the classic works of manifold learning; many later manifold-learning and dimensionality-reduction methods are closely related to it.

**Laplacian Eigenmaps**

Continuing with classic dimensionality-reduction algorithms: after PCA, LDA, and LLE, here is Laplacian Eigenmaps. It is not that each algorithm is better than the previous ones; rather, each looks at the problem from a different angle, so the approach differs. The ideas behind these algorithms are simple yet effective in certain respects, and they are in fact the source of ideas for some newer algorithms.

Laplacian Eigenmaps [1] views the problem from an angle similar to LLE, also building relationships between data points from a local perspective.

The intuition is that points related to each other (connected in the graph) should stay as close as possible after dimensionality reduction. Laplacian Eigenmaps can reflect the intrinsic manifold structure of the data.

From <http://www.36dsj.com/archives/26723>

**Random Forests and Gradient-Boosted Trees (GBTs)**


In Spark 1.2, MLlib introduced Random Forests and Gradient-Boosted Trees (GBTs). For classification and regression, these two algorithms are well proven and among the most widely deployed. Random Forests and GBTs are ensemble learning algorithms, which build stronger models by combining multiple decision trees.

The main difference between the two algorithms is the order in which the component trees are trained.

In Random Forests, each component tree is trained independently on a random sample of the data. Compared to using a single decision tree, this randomness helps produce a more robust model and avoids overfitting the training data.

GBTs train one tree at a time; each new tree is added to correct the errors of the model trained so far. As more trees are added, the model becomes increasingly expressive.

In short, both methods are weighted ensembles of decision trees. The ensemble model makes predictions by combining the results of the individual trees. The figure below is a very simple example built on three trees.

Comparing the two: Random Forests are undoubtedly faster to train, but they often need deeper trees to reach the same error. GBTs reduce the error further with each iteration, but with too many iterations they are prone to overfitting.

From <http://www.csdn.net/article/2015-03-11/2824178>

**Ensemble Methods**


**Voting and Averaging Based Ensemble Methods**

Voting and averaging are two of the easiest ensemble methods. They are both easy to understand and implement. Voting is used for classification and averaging is used for regression.

In both methods, the first step is to create multiple classification/regression models using some training dataset. Each base model can be created using different splits of the same training dataset and the same algorithm, using the same dataset with different algorithms, or by any other method. The following Python-esque pseudocode shows the use of the same training dataset with different algorithms.

```python
train = load_csv("train.csv")
target = train["target"]
train = train.drop("target")
test = load_csv("test.csv")

algorithms = [logistic_regression, decision_tree_classification, ...]  # for classification
algorithms = [linear_regression, decision_tree_regressor, ...]  # for regression

# one row per test instance, one column of predictions per algorithm
predictions = matrix(row_length=len(test), column_length=len(algorithms))

for i, algorithm in enumerate(algorithms):
    predictions[:, i] = algorithm.fit(train, target).predict(test)
```

According to the above pseudocode, we created predictions for each model and saved them in a matrix called predictions where each column contains predictions from one model.

**Majority Voting**

Every model makes a prediction (votes) for each test instance and the final output prediction is the one that receives more than half of the votes. If none of the predictions get more than half of the votes, we may say that the ensemble method could not make a stable prediction for this instance. Although this is a widely used technique, you may try the most voted prediction (even if that is less than half of the votes) as the final prediction. In some articles, you may see this method being called “plurality voting”.
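A minimal sketch of the rules just described, with hypothetical labels; `majority_vote` returns `None` when no label reaches a strict majority, and `plurality_vote` is the relaxed variant:

```python
from collections import Counter

def majority_vote(predictions):
    """Label with more than half the votes, else None (no stable prediction)."""
    votes = Counter(predictions)
    label, count = votes.most_common(1)[0]
    return label if count > len(predictions) / 2 else None

def plurality_vote(predictions):
    """Most-voted label, even if it has less than half the votes."""
    return Counter(predictions).most_common(1)[0][0]
```

For example, three base models voting `["cat", "cat", "dog"]` yield `"cat"` under both rules, while `["cat", "dog", "bird"]` yields `None` under strict majority voting.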

**Weighted Voting**

Unlike majority voting, where each model has the same rights, we can increase the importance of one or more models. In weighted voting you count the prediction of the better models multiple times. Finding a reasonable set of weights is up to you.

**Simple Averaging**

In the simple averaging method, for every instance of the test dataset the average of the predictions is calculated. This method often reduces overfit and creates a smoother regression model. The following pseudocode shows the simple averaging method:

```python
final_predictions = []

for row_number in range(len(predictions)):
    final_predictions.append(
        mean(predictions[row_number, :])
    )
```

**Weighted Averaging**

Weighted averaging is a slightly modified version of simple averaging, where the prediction of each model is multiplied by its weight and then the average is calculated. The following pseudocode shows weighted averaging:

```python
weights = [..., ..., ...]  # length is equal to len(algorithms)

final_predictions = []

for row_number in range(len(predictions)):
    final_predictions.append(
        mean(predictions[row_number, :] * weights)
    )
```

**SVM**

**Mathematical background**

The support vector machine (SVM) was proposed by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. SVM addressed problems in machine learning at the time such as the “curse of dimensionality” and “overfitting”. In machine learning it can be used for both classification and regression (see reference 1 for more information).

SVM regression can solve problems such as stock-price regression, but SVM is still quite limited in regression and is mostly discussed together with classification, so this section focuses on SVM classification.

From <http://blog.csdn.net/legotime/article/details/51836019>

**Recommendation Engines**

Recommendation engines use the attributes of an item or a user, or the behavior of a user or their peers, to make predictions. There are different factors that drive an effective recommendation engine model. Some of these factors are listed below:

- Peer based
- Customer behavior
- Corporate deals or offers
- Item clustering
- Market/Store factors

Recommendation engine solutions are implemented by leveraging two algorithms, content-based filtering and collaborative filtering.

Content-based filtering: This is based on how similar a particular item is to other items based on usage and ratings. The model uses the content attributes of items (such as categories, tags, descriptions and other data) to generate a matrix of each item to other items and calculates similarity based on the ratings provided. Then the most similar items are listed together with a similarity score. Items with the highest score are most similar.

Movie recommendation is a good example of this model. It recommends that “Users who liked a particular movie liked these other movies as well”.

These models don’t take into account the overall behavior of other users, so they don’t provide personalized recommendations compared to other models like collaborative filtering.
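The item-to-item similarity computation can be sketched with hypothetical content attributes; the movie names and tag-weight vectors below are made up for illustration, with cosine similarity as the scoring function:

```python
import math

# Hypothetical items described by content-attribute vectors,
# e.g. weights for the tags (action, comedy, romance).
items = {
    "Movie A": [1.0, 0.0, 0.2],
    "Movie B": [0.9, 0.1, 0.3],
    "Movie C": [0.0, 1.0, 0.8],
}

def cosine(u, v):
    """Cosine similarity between two attribute vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Rank the other items by similarity to "Movie A"; highest score first.
scores = sorted(((cosine(items["Movie A"], v), name)
                 for name, v in items.items() if name != "Movie A"),
                reverse=True)
```

Items with the highest score are the most similar, so here "Movie B" would be recommended to viewers of "Movie A".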

Collaborative Filtering: On the other hand, the collaborative filtering model is based on making predictions to find a specific item or user based on similarity with other items or users. The filter applies weights based on the “peer user” preferences. The assumption is that users who display a similar profile or behavior have similar preferences for items.

An example of this model is the recommendations on ecommerce websites like Amazon. When you search for an item on the website you would see something like “Customers Who Bought This Item Also Bought.”

Items with the highest recommendation score are the most relevant to the user in context.

Collaborative filtering based solutions perform better compared to other models. Spark MLlib implements a collaborative filtering algorithm called Alternating Least Squares (ALS). There are two types of feedback in collaborative filtering, called explicit and implicit feedback. Explicit feedback is based on the direct preferences given by the user for an item (like a movie). Explicit feedback is nice, but it is often skewed, because users who strongly like or dislike a product are the ones who review it. We don’t get the opinion of the many people toward the center of the bell-shaped curve of data points.

Examples of implicit feedback are a user’s views, clicks, likes, etc. Implicit feedback data is used a lot in industry for predictive analytics because this type of data is easy to gather.

There are also model-based methods for recommendation engines. These often incorporate methods from collaborative and content-based filtering. The model-based approach gets the best of both worlds: the power and performance of collaborative filtering and the flexibility and adaptability of content-based filtering. Deep learning techniques are good examples of this model.

You can also integrate other algorithms like K-Means into the recommendation engine solution to get more refined predictions. The K-Means algorithm works by partitioning “n” observations into “k” clusters in which each observation belongs to the cluster with the nearest mean. Using the K-Means technique, we can find similar items or users based on their attributes.

Figure 2 below shows different components of a recommendation engine, user factors, other factors like market data and different algorithms.

From <https://www.infoq.com/articles/apache-spark-machine-learning>

**Customer personas (用户画像)**

Think customer personas – those detailed representations of the different segments of your target audience.

Fueled by data-driven research that maps out the who behind the buying decisions of your products or services, customer personas can help inform everything from more effective copy to product development.

In essence, personas are fictional representations of segments of buyers based on real data reflecting their behaviors. Their purpose is to put the people behind company decision making in the shoes of the customer.

From <http://conversionxl.com/creating-customer-personas-using-data-driven-research/?hvid=2AD8ar>