Deep Learning Review

PS: If you want to see the LaTeX formulas, you can visit my webpage:
http://henryshe.cn/DL

From "Pexels"

Author: Henry SHE

Date: 5/4/2018

This semester I took the course “Topics in Computer Science - Deep Learning”. It is actually an introductory course and does not go into much detail, but it gives a general idea of the modern techniques in this area.

Basics of Machine Learning

Types of learning:

  • Supervised Learning
    • Given labeled data, train the algorithm by comparing the predicted results with the actual results, then update the weights accordingly.
  • Unsupervised Learning
    • Given an unlabeled dataset, let the algorithm learn the structure of the data.
  • Reinforcement Learning
    • An agent chooses actions, trying to maximize the reward.

Machine Learning Algorithm (MLA)

An MLA has:

  • Hypothesis set $$H$$
  • Target Function $$f$$
  • Dataset (you need to split the data into training data and testing data, e.g. in an 8:2 ratio; see the sketch after this list)
  • Distribution $$D$$ (the true data distribution; by comparing the true values with the predicted values we calculate the loss function, i.e. how large the error/distance is)
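As a rough illustration of the 8:2 split mentioned above, here is a minimal NumPy sketch (the array names and data are placeholders I chose, not from the course):

```python
import numpy as np

# Minimal sketch of an 8:2 train/test split.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 examples, 3 features (made-up data)
y = rng.integers(0, 2, size=100)       # made-up binary labels

idx = rng.permutation(len(X))          # shuffle before splitting
split = int(0.8 * len(X))              # 80% of the data for training
X_train, y_train = X[idx[:split]], y[idx[:split]]
X_test,  y_test  = X[idx[split:]], y[idx[split:]]
print(X_train.shape, X_test.shape)     # (80, 3) (20, 3)
```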

Linear Regression:

Important Concepts:

  1. $$Error$$ (Mean-square Error and Half Mean-square Error)

Data Sets:
$$
x=\begin{bmatrix}
x_{1} \\
x_{2} \\
x_{3} \\
\vdots \\
x_{M}
\end{bmatrix}
$$

$$
\theta =\begin{bmatrix}
\theta _{0} & \theta _{1} & \theta _{2} & \theta _{3} & \ldots & \theta _{N}
\end{bmatrix}
$$

$$
X=\begin{bmatrix}
1 & x_{1}, \
1 & x_{2} \
1 & x_{3} \
\vdots & \vdots \
1 & x_{M}
\end{bmatrix}
$$
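As a small sketch of how these matrices are used (a single-feature case with made-up numbers), the design matrix $X$ prepends a column of ones for $\theta_0$, and the linear hypothesis is computed as $X\theta^{T}$:

```python
import numpy as np

# Build the design matrix X (each row is [1, x_i]) and compute the linear
# hypothesis h = X @ theta. The data values are made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])          # raw inputs x_1 .. x_M (M = 4)
X = np.column_stack([np.ones_like(x), x])   # prepend the bias column of ones
theta = np.array([0.5, 2.0])                # [theta_0, theta_1]

h = X @ theta                               # predictions, one per example
print(h)                                    # [2.5 4.5 6.5 8.5]
```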

Gradient Descent:

Important Parameters:

  • Learning Rate $\alpha$
  • Cost Function $J(\theta)$
    • Target values $y$ (used to calculate the cost)

Feature Scaling / Mean Normalization:
$$
x_{j}\leftarrow \dfrac {x_{j}-\text{Mean}_{j}} {\text{Max}_{j}-\text{Min}_{j}}
$$
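A minimal sketch of this mean normalization applied column-wise to a made-up feature matrix:

```python
import numpy as np

# Mean normalization: subtract each feature's mean, divide by its range.
X = np.array([[2000.0, 3.0],
              [1200.0, 2.0],
              [1800.0, 4.0]])                # made-up features (e.g. size, rooms)

mean = X.mean(axis=0)
value_range = X.max(axis=0) - X.min(axis=0)  # Max_j - Min_j
X_scaled = (X - mean) / value_range
print(X_scaled)                              # each column now has zero mean and range 1
```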

Logistic Regression:

Update Rules:
$$
\dfrac {\partial J\left( \theta \right) }{\partial \theta _{j}}=-\dfrac {1}{M}\sum ^{M}_{i=1}\left( y_{i}-h\left( x_{i}\right) \right) x^{(j)}_{i}
$$

Update theta:
$$
\theta_{j} \leftarrow \theta_{j}-\alpha\dfrac {\partial J\left( \theta \right) }{\partial \theta _{j}}
$$
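Putting the gradient and the update rule together, here is a sketch of one full-batch gradient step for logistic regression (the data, learning rate, and sigmoid hypothesis are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data: the first column of X is the bias feature (all ones).
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0])
theta = np.zeros(2)
alpha = 0.1                           # learning rate

M = len(y)
h = sigmoid(X @ theta)                # h(x_i) for every example
grad = -(1.0 / M) * X.T @ (y - h)     # dJ/dtheta_j, matching the formula above
theta = theta - alpha * grad          # theta_j <- theta_j - alpha * dJ/dtheta_j
print(theta)
```

SGD (next section) applies the same update, but uses one example (or a small mini-batch) at a time instead of the full sum over all $M$ examples.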

SGD - Stochastic Gradient Descent

Neural Networks - Perceptron

The perceptron simulates a human neuron's perception.

Basic Process: (important)

  1. Input $x$ (use the training data as input (neurons))

  2. Weight $w$ (Initialized with random numbers)

  3. Bias $b$

  4. Sum them up

  5. Activation Function (below is the common activation functions)

    1. $Sigmoid$ : $\sigma (x) = \frac{1}{1+e^{-x}}$
    2. $TanH$: $\tanh (x) = 2\sigma (2x)-1$
    3. $ReLU$: $ReLU = max(0,x)$
    4. $Softmax$ (usually used in the output layer for classification, e.g. in CNNs)
  6. Cost Function (loss function) - you will need this in backpropagation, and then update the weights accordingly (a sketch of the forward pass and the update rules follows this list).

    1. Update Rules:
      1. Vanilla Update ($x \leftarrow x - \alpha \cdot dx$)
      2. Momentum Update (has a velocity $v$)
      3. Adam (has $\beta_1$ and $\beta_2$)

    Note that a neural network can represent any function.
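Putting steps 1-5 together, here is a minimal forward-pass sketch (the input values, weights, and bias are made-up numbers, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])   # 1. input neurons (made-up values)
w = rng.normal(size=3)           # 2. weights, initialized with random numbers
b = 0.1                          # 3. bias
z = w @ x + b                    # 4. sum them up
a = sigmoid(z)                   # 5. activation function (sigmoid here)
print(a)
```

And the three update rules from step 6, sketched for a generic parameter `x` with gradient `dx` (the hyperparameter values are common defaults I assumed, not the course's):

```python
import numpy as np

alpha = 0.01                            # learning rate
x = np.zeros(3)                         # some parameter
dx = np.array([0.2, -0.1, 0.05])        # its gradient from backpropagation

# 1. Vanilla update
x_vanilla = x - alpha * dx

# 2. Momentum update: keep a velocity v
mu, v = 0.9, np.zeros_like(x)
v = mu * v - alpha * dx
x_momentum = x + v

# 3. Adam: first/second moment estimates with beta1 and beta2
beta1, beta2, eps, t = 0.9, 0.999, 1e-8, 1
m, s = np.zeros_like(x), np.zeros_like(x)
m = beta1 * m + (1 - beta1) * dx
s = beta2 * s + (1 - beta2) * dx ** 2
m_hat, s_hat = m / (1 - beta1 ** t), s / (1 - beta2 ** t)
x_adam = x - alpha * m_hat / (np.sqrt(s_hat) + eps)
```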

Solving the overfitting problem:

Neural Networks - Forward Pass and Backpropagation

Backpropagation:

Basic Concept from Calculus: the Chain Rule

Chain Rule 1:

$$ \frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\cdot \frac{\partial y}{\partial x}$$

Chain Rule 2:

$$\frac{\partial z}{\partial s} = (\frac{\partial z}{\partial x}\cdot \frac{\partial x}{\partial s})+(\frac{\partial z}{\partial y}\cdot \frac{\partial y}{\partial s})$$

We need to use the chain rule to calculate the derivative of the loss with respect to each weight, then do the backpropagation and update those weights.
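A tiny worked example of Chain Rule 2, with functions I picked for illustration: let $z = x \cdot y$, where $x = s + 2$ and $y = 3s$. Then

$$
\frac{\partial z}{\partial s} = \frac{\partial z}{\partial x}\cdot\frac{\partial x}{\partial s}+\frac{\partial z}{\partial y}\cdot\frac{\partial y}{\partial s} = y\cdot 1 + x\cdot 3 = 3s + 3\left( s+2\right) = 6s+6,
$$

which matches differentiating $z = 3s\left( s+2\right) = 3s^{2}+6s$ directly.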

Training the Neural Networks

$$
\dfrac {\partial x}{\partial y}=\dfrac {\partial y}{\partial x}\sum ^{\infty }_{i}\left( x_{i}+y_{i}\right)
$$

Output performance:

  • Learning Rate Problem $\alpha$

    • Too large: overshooting, cannot converge
    • Too small: may get trapped in a local minimum, needs more iterations
  • Underfitting (high bias), Solutions:

    • Add more features
    • Decrease $\lambda$ (Fix “High bias“ problem)
  • Overfitting (high variance, large weights $\theta$), Solutions:

    • Get more training data (especially when # of features > training data size)

    • Fewer features

    • Increase $\lambda$

    • Dropout (randomly drop a fraction of the neurons, e.g. 5%)

    • Early Stop

      • Cross Validation (use held-out data to evaluate the neural network, and stop before it gets worse)
    • Weight Regularization (a sketch follows after this list)

      • L1 Regularization

      $$
      \dfrac {1}{M}\sum _{i}J\left( h_{\theta }\left( x_{i}\right) ,y_{i}\right) +\lambda \sum _{j}\left| \theta _{j}\right|
      $$

      • L2 Regularization

      $$
      \dfrac {1}{M}\sum _{i}J\left( h_{\theta }\left( x_{i}\right) ,y_{i}\right) +\lambda \sum _{j}\left( \theta _{j}\right)^{2}
      $$
  • Improvement of your Neural Network

    • Get more training data
    • Invent more data (data augmentation)
    • Rescale your data (to suit the activation function)
    • Transform the data
      • Apply a log transform or something else that normalizes the data
    • Feature Selection
  • Gradient Vanishing (often occurs in RNNs)

    • If we use the sigmoid function as the activation function, the largest value of the sigmoid's derivative is only 0.25, so when we do backpropagation, especially with many hidden layers, the earlier layers cannot be updated effectively.

    • Solutions: similar to those listed under Gradient Exploding below (e.g. use ReLU or an LSTM/GRU)

  • Gradient Exploding

    • Re-design the architecture of your network
    • Use ReLU as the activation function
    • Use an LSTM/GRU network
    • Gradient Clipping (set a threshold; a sketch follows below)
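As a rough sketch of two of the remedies above, L2 weight regularization and gradient clipping with a threshold (the data-term gradient, the weights, and all hyperparameter values are made-up assumptions):

```python
import numpy as np

theta = np.array([0.5, -2.0, 3.0])        # current weights (made up)
grad_data = np.array([1.0, -4.0, 8.0])    # gradient of the data term, assumed given
lam = 0.01                                # regularization strength lambda
alpha = 0.1                               # learning rate

# L2 regularization: the penalty lambda * sum(theta_j^2) adds 2*lambda*theta
# to the gradient, pushing the weights toward smaller values.
grad = grad_data + 2 * lam * theta

# Gradient clipping: if the gradient norm exceeds a threshold, rescale it.
threshold = 5.0
norm = np.linalg.norm(grad)
if norm > threshold:
    grad = grad * (threshold / norm)

theta = theta - alpha * grad              # one regularized, clipped update
print(theta)
```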

RNN - Recurrent Neural Network

$$
\dfrac {\partial J}{\partial W}=\sum _{t}\dfrac {\partial J_{t}}{\partial W}
$$
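The formula says that the gradient with respect to the shared weight matrix $W$ is the sum of the gradients contributed by each time step $t$ (backpropagation through time). A minimal sketch of that accumulation, using placeholder per-step gradients rather than values computed from a real RNN:

```python
import numpy as np

# Per-time-step gradients dJ_t/dW (placeholder values for illustration).
dJt_dW = [np.array([[0.1, -0.2], [0.0, 0.3]]),    # t = 1
          np.array([[0.05, 0.1], [0.2, -0.1]]),   # t = 2
          np.array([[-0.1, 0.0], [0.1, 0.1]])]    # t = 3

dJ_dW = sum(dJt_dW)    # dJ/dW = sum over t of dJ_t/dW
print(dJ_dW)
```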

Machine Translation

CNN - Convolutional Neural Network

Other Topic - RL (Reinforcement Learning)