Linear Regression
Linear regression with one variable.
Hypothesis: h_θ(x) = θ₀ + θ₁·x
Choose θ₀ and θ₁ so that h_θ(x) is close to y for our training examples (x, y).
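A minimal Octave sketch of evaluating this hypothesis; the parameter values and the input below are made-up numbers for illustration:

```octave
theta0 = 1; theta1 = 2;        % example parameter values (assumed)
x = 3;                         % one input value
h = theta0 + theta1 * x;       % h_theta(x) = theta0 + theta1 * x, here 7
```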
Cost Function
m = number of training examples
h = the hypothesis
y = actual value
Goal: minimize J(θ₀, θ₁)
The cost function looks like this:
J(θ₀, θ₁) = (1 / 2m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²
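A minimal Octave sketch of this cost, assuming X already has a leading column of ones and y, theta are column vectors (the name compute_cost is mine):

```octave
function J = compute_cost(X, y, theta)
  % Squared-error cost J(theta) = (1 / 2m) * sum((h - y) .^ 2)
  m = length(y);          % number of training examples
  h = X * theta;          % hypothesis for every example at once
  J = (1 / (2 * m)) * sum((h - y) .^ 2);
end
```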

Gradient Descent
Start with some initial θ₀, θ₁. Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁), until we hopefully end up at a minimum. The update rule is
θ_j := θ_j − α · ∂J(θ₀, θ₁)/∂θ_j   (all θ_j updated simultaneously)
α is the learning rate. If α is too small, gradient descent can be slow; if α is too large, gradient descent can overshoot the minimum and may fail to converge.
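A minimal Octave sketch of batch gradient descent for this cost, with the same X, y, theta conventions as above; gradient_descent is my name, and it reuses the compute_cost sketch from earlier:

```octave
function [theta, J_history] = gradient_descent(X, y, theta, alpha, num_iters)
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    % Vectorized simultaneous update: theta_j := theta_j - alpha * dJ/dtheta_j
    theta = theta - (alpha / m) * (X' * (X * theta - y));
    J_history(iter) = compute_cost(X, y, theta);  % track J to check convergence
  end
end
```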

Feature Scaling
Make sure features are on a similar scale, or else gradient descent will take a long time.
Scale every feature into approximately a [-1, 1] range ([-3, 3] may also work).
Mean Normalization
Replace x_i with (x_i − μ_i) / s_i, where:
μ_i is the average value of feature x_i over the training set
s_i is the range (max − min), or the standard deviation.
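A minimal Octave sketch of mean normalization, assuming X holds the raw features (one example per row, without the column of ones) and using the standard deviation as s_i:

```octave
mu = mean(X);              % mu_i: average value of each feature
s = std(X);                % s_i: standard deviation (the range max - min also works)
X_norm = (X - mu) ./ s;    % replace x_i with (x_i - mu_i) / s_i
```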
Learning Rate α
- if α is too small: slow convergence
- if α is too large: J may not decrease on every iteration and may fail to converge (it can overshoot the minimum).
Hence, try a range of values for α:
0.001 -> 0.003 -> 0.01 -> 0.03 -> 0.1 -> 0.3 -> …
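A minimal Octave sketch of comparing these candidate values, assuming the gradient_descent sketch above and a prepared X, y are in the workspace; for a good α, the plotted J should drop on every iteration:

```octave
alphas = [0.001 0.003 0.01 0.03 0.1 0.3];
for k = 1:length(alphas)
  theta_init = zeros(size(X, 2), 1);
  [theta, J_history] = gradient_descent(X, y, theta_init, alphas(k), 400);
  plot(1:numel(J_history), J_history);   % cost vs. iteration for this alpha
  hold on;
end
```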
Normal Equation
The normal equation is a method to solve for θ analytically.
Octave code: pinv(X' * X) * X' * y
If you use the normal equation, you don't need to do feature scaling.
Computing the inverse of XᵀX is expensive: roughly O(n³), where n is the number of features. Therefore, if n is large (say, greater than 10,000), you should consider gradient descent instead.
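A minimal end-to-end sketch, assuming X holds the raw features and y the targets; the only substantive step is the pinv line quoted above:

```octave
X = [ones(size(X, 1), 1), X];    % add the leading column of ones
theta = pinv(X' * X) * X' * y;   % solve for theta analytically: no iterations, no alpha
```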