Week 9 11
Application-level and advanced high-level topics in Machine Learning
Anomaly Detection System
Gaussian Distribution
Anomaly detection uses gaussian distribution - probability density function formula
The hypothesis model uses density estimation, the product of all density functions , to detect frauds
The concept here is deeply incorporated into another concept likelihood
- the anonymous data points usually have low likelihood
Likewise, anonymous data points usually have low value in density estimation function
Feature engineering
Instead of , one can do log transform or change the degree of feature to form gaussian-like shape to make model happy e.g.
, or
Sometimes is not that comparable (say, both large) for normal and anonymous data.
- To solve this problem, we can define new features e.g. which can help capture unusually large or small values (outliers)
Multivariate Gaussian Distribution
Motivation: what if normal data points cluster don't follow standard gaussian distribution shape - normal data are not within perfect circle but oval instead even when features are normalized (when feature engineering can't help a lot).
e.g. normal (red) vs. anonymous (green) - we can't draw circle (pink) bound to separate two classifications but need to draw oval (blue) bound below
Model hypothesis:
Similar to normal gaussian distribution step, to detect anonymous data using multivariate gaussian distribution, we plug data into model and use a threshold
Multivariate vs. single variate
So multivariate gaussian model basically is a more general form with flexible covariance matrix , where original normal gaussian model requires to be diagonal
Large Scale Machine Learning
Different names for different kind of gradient descents
batch/mini-batch/stochastic gradient descent
Online Learning
In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once.