Feature Extraction

Recall Supervised Classification

Supervised Classification diagram
  • the chosen vector space representation must fit the chosen model and how we make a classification
  • earlier, we took the image pixel by pixel and wrote it into the vector space
Vector space representation

Problem: If we take a slightly different camera angle, it is still a coffee mug, but it becomes a completely different point in the vector space

→ Classification could be wrong

Different solutions:

  • if we had a picture from every possible camera angle, we could classify more reliably → but what if the size of the coffee mug also differs?
  • feature engineering: a “preparation” step, so that our model can classify correctly

Feature Engineering

Solution: Shallow Learning

Unstructured data —> Feature Engineering (mathematical operator) —> “mathematical fingerprint” (feature space)

Properties:

  • preserves the information that samples of a semantic class have in common and throws away the rest
  • the mathematical fingerprint doesn’t change much when e.g. the camera angle varies (called invariant or robust)
  • reduces the dimensionality of the problem

Drawbacks:

  • good feature extraction is hard to find (high manual workload)
  • feature extraction is very data specific
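A minimal Python sketch of such a hand-crafted fingerprint (illustrative only, not from the lecture): the normalized intensity histogram of a grayscale image is roughly invariant to rotation and translation, but it discards all spatial layout.

import numpy as np

def intensity_histogram_feature(image, n_bins=16):
    """Map a grayscale image (2D NumPy array, values in [0, 1]) to a
    fixed-length 'fingerprint': its normalized intensity histogram.
    The result barely changes under rotation or translation of the
    object, but all spatial structure is thrown away."""
    hist, _ = np.histogram(image, bins=n_bins, range=(0.0, 1.0))
    return hist / hist.sum()  # normalize so the image size does not matter

# Hypothetical usage: two views of the same object give the same fingerprint
view_a = np.random.rand(64, 64)   # placeholder for a real photo
view_b = np.rot90(view_a)         # "different camera angle"
print(np.allclose(intensity_histogram_feature(view_a),
                  intensity_histogram_feature(view_b)))  # True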

Invariant Theory

Invariant Theory
  • transformations can be anything; they don’t even have to be mathematical

→ we want to define invariance:

Invariance definition
  • given are two data points that are equivalent under a transformation
  • feature extraction T
  • we also need completeness (the transformation can in principle be inverted), otherwise the trivial solution T(x) = 0 would be valid (we don’t need to know how “much” they were transformed, only which class they are assigned to)
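One way to write this down (a sketch in my own notation, not copied from the slides): with T the feature extraction and g any admissible transformation,

T(g(x)) = T(x) for all admissible g   (invariance)
T(x) = T(x’) only if x and x’ are equivalent under some admissible g   (completeness)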

→ in practice, this is nearly impossible to achieve

→ we “weaken” the invariance

separability: we don’t demand invariance for all possible inputs, only for our training samples

Separability
Separability example

Discussion

Feature extraction vs learning algorithms:

I. if we had perfect features, learning would be trivial
II. if we had perfect classifiers, we would not need features

Features are usually used to introduce prior knowledge about the structure of the data and variances to the learning algorithm.

There is no single perfect feature; most often you need a combination of different approaches.

In practice:

  • Good features are hard to find
  • Often based on complex mathematical functions
  • Depend on the application (domain knowledge needed!)

Generic Approaches:

Invariance by differentiation: put properties into relation to each other / normalization
Invariance by integration: average properties over the transformations
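A minimal sketch of invariance by integration (assuming SciPy is available; the chosen feature and the function name are illustrative, not from the lecture): average a simple feature over many rotated copies of the input, so the result depends only weakly on the original orientation.

import numpy as np
from scipy.ndimage import rotate

def rotation_averaged_feature(image, n_angles=36):
    """Invariance by integration: average a simple feature (here the mean
    gradient magnitude) over rotated copies of the image. The average
    changes only slightly when the input is rotated beforehand."""
    feats = []
    for angle in np.linspace(0, 360, n_angles, endpoint=False):
        rotated = rotate(image, angle, reshape=False, mode='nearest')
        gy, gx = np.gradient(rotated)
        feats.append(np.mean(np.hypot(gx, gy)))
    return np.mean(feats)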

Motivation

  • Very high dimensional representations require a lot of data to fill this huge space (curse of dimensionality)
  • Danger of overfitting is higher if space is only sparsely sampled

We would like to “compress” our data (with a controllable loss of information) to a lower dimensional representation

Feature Reduction: PCA

Recall:

  • Covariance matrix (Week 3)
  • Eigenvalue decomposition (Week 2)

Combining both concepts for dimension reduction via Principal Component Analysis (PCA)

Feature Reduction: PCA

PCA Algorithm in a nutshell

  1. Compute the covariance matrix of the data

K_xx = E[(X − E[X]) (X − E[X])^T]

where E[X] is the expected value (mean)

  2. Compute eigenvectors and eigenvalues of K_xx
  3. Sort the eigenvalues
  4. Select a cut-off value
  5. New basis: project onto the selected eigenvectors → new dimension: number of eigenvalues selected
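A minimal NumPy sketch of these five steps (illustrative only, not the lecture’s reference implementation):

import numpy as np

def pca_reduce(X, n_components):
    """Project data X (n_samples x n_features) onto its first
    n_components principal components."""
    X_centered = X - X.mean(axis=0)              # subtract the mean E[X]
    K = np.cov(X_centered, rowvar=False)         # 1. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(K)         # 2. eigen decomposition (K is symmetric)
    order = np.argsort(eigvals)[::-1]            # 3. sort eigenvalues, largest first
    basis = eigvecs[:, order[:n_components]]     # 4. cut-off: keep the top components
    return X_centered @ basis                    # 5. project onto the new basis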

Deep Learning

Intro

  • On an implementation level, Deep Learning models are very large neural networks
  • also called the 3rd generation of neural networks
  • we are still in Machine Learning; there still is no strong AI
History of neural networks
  • Modern architectures have evolved very far from the original perceptron → in a pure Deep Learning lecture, we would go from left to right through this history, but we focus on the modern models

Deep Learning as a blackbox

  • Deep Learning is a subset of ML algorithms
  • our blackbox model is still valid
Blackbox model

What is different?

Before:

  • most ML algorithms are limited to narrow input/output domains, e.g. an image must be turned into a vector, and the output is always a scalar
  • pre- and post-processing are needed to solve more complex problems!

Now:

  • the capacity of the learned mapping is higher → natively allows structured (tensor) input and output
  • “End to End” learning: feature extraction is difficult, so can we make this step learnable? (Shallow Learning does, but it still has 2 steps, even if they are connected) → “End to End” learning instead learns the decision space and the decision function in one optimization problem
Content creation example
End to End learning

Opening the blackbox model

(brief intro)

Opening the blackbox
Neural network layers
  • we don’t have one big function; instead we partition the problem into smaller functions (layers)
  • only constraint: (partially) differentiable (needed for training)
  • depending on application and data, some layers have learnable parameters, others are static
  • input and output are tensors
  • later this simple assignment of layers to processing steps no longer holds, because the steps are mixed

Simplest Example of DNN: Multi Layer Perceptron (MLP)

Multi Layer Perceptron
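A minimal Keras sketch of such an MLP (layer sizes, input size and the 10-class output are placeholders, not taken from the lecture):

from tensorflow import keras
from tensorflow.keras import layers

# A small fully connected network: every layer is a dense matrix
# multiplication followed by a non-linear activation.
mlp = keras.Sequential([
    layers.Input(shape=(784,)),              # e.g. a flattened 28x28 image
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),  # class probabilities
])
mlp.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])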

More complex deep DNN example:

Complex DNN example

Training of Neural Networks

Training of DNNs is an optimization problem: (example for supervised learning)

1. Run data samples through graph → “forward feed”

2. Get result y’

3. Define differentiable “loss function” to measure “difference” between prediction and true label

4. Optimization objective: minimize loss

Optimization objective

Optimization Problem:

Optimization problem
  • J is our Loss function
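Written out (my notation, keeping J for the loss): find the network weights W that minimize the loss over the training data,

W* = argmin over W of J(Y, f_W(X))

where f_W(X) is the prediction of the network with weights W.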

Problem: non-convex optimization problem in a very high dimensional space

→ we don’t know if the solution is the best one (it could be a local minimum)

→ NP-hard problem

→ numerical and stochastic solution approaches

How to get the nested gradients?

→ With Back Propagation algorithm

  1. feed forward and compute the activations
  2. compute the error gradient between the true Y and the predicted Y’
Error gradient computation
  3. compute the derivatives layer by layer

→ chain rule (we calculate each gradient backwards with the chain rule)

  4. update all weights with a small step in the gradient direction
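Sketched for a network y’ = f_L(f_{L-1}(… f_1(x) …)) (notation is mine, not from the slides): the gradient for the weights W_l of layer l is obtained by multiplying the local derivatives from the output backwards,

∂J/∂W_l = ∂J/∂y’ · ∂f_L/∂f_{L-1} · … · ∂f_{l+1}/∂f_l · ∂f_l/∂W_l

so each layer only needs its own local derivative and the gradient handed down from the layer above.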

BUT: Using all of the training data to compute the gradient is computationally infeasible

Instead: Monte Carlo approach: select a random batch (a small subset of the training data) per iteration to compute the gradient → SGD

Stochastic Gradient Descent (SGD)

we use randomness to escape from possible local minima and to estimate gradients without running over all the data

→ non-smooth convergence

→ many iterations needed (also called epochs)

→ additional parameters are needed

  • Step size (learning rate)
  • Batch size
SGD visualization
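A minimal NumPy sketch of the SGD loop (a linear least-squares model is used only to keep the example short; the learning rate and batch size are exactly the extra parameters listed above):

import numpy as np

def sgd(X, y, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch SGD for a linear model y ~ X @ w. Each iteration
    estimates the gradient from a random batch only."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for _ in range(len(X) // batch_size):
            idx = rng.choice(len(X), size=batch_size, replace=False)
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size  # gradient of the squared loss on the batch
            w -= lr * grad                                # small step in the negative gradient direction
    return w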

Pitfalls of DNN training

  1. Optimization algorithms have many hyperparameters (that have a great effect on the outcome)
  2. Optimization is not deterministic (local minima, random initialization)
  3. Optimization is very compute-intensive
  4. DNN training is likely to overfit

→ regularization

Regularization
  5. Needs a lot of training data to learn the underlying distributions
  • not only many data points, but also good sampling
  • data annotation (for supervised DL)

→ especially for points 3 to 5, progress in hardware and data availability are the core drivers

Basic Types of Deep Neural Network Architectures

Basic Architecture of the Multi Layer Perceptron (2nd Generation Neural Networks):

Multi Layer Perceptron

Where are the Neurons?

Neurons in MLP

→ How do we implement the feature extraction?

→ we have to look at the operators

Convolutional Neural Networks (CNN)

  • Most prominent type of DNN (LSTMs are catching up). Responsible for many practical breakthroughs
  • Able to capture locality at multiple scales. Works well for data with locally embedded structures, e.g. images, videos, audio, graphs … even text
  • Learns Filters which are applied via convolution (End-to-end optimization)

Typical Design: (AlexNet)

AlexNet design
  • New Conv operators
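A small Keras CNN in the same spirit (a toy stack, not AlexNet itself; input size, filter counts and class count are placeholders):

from tensorflow import keras
from tensorflow.keras import layers

cnn = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),                      # small RGB image
    layers.Conv2D(16, kernel_size=3, activation="relu"),  # learned filter bank
    layers.MaxPooling2D(),                                # downsample: capture a larger scale
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),               # classifier on top of the features
])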

Excursus: Convolution

  • is a mathematical operation between two functions

(here over time series [t])

(f ∗ g)(t) = ∫ f(τ) · g(t − τ) dτ

(wikipedia example animation here)

→ we have discrete functions, often multi-dimensional, e.g. an image in 2D

→ we have to adapt the definition

Convolutional Filters

Convolutional filters
  • f is the image
  • g is the filter
  • image is much larger than the filter

→ We compute the “sum of an element-wise multiplication over a sliding window”

  • i.e. the filter is moved over the image

Example:

Convolution example
  • we don’t hand-pick a filter (convolution kernel); we learn which filter fits our problem best
  • we have a “filter bank” and choose the filters that fit our problem best

Convolution stride
  • Stride: the step size with which the filter is moved
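A minimal NumPy sketch of this sliding-window operation including the stride (written as cross-correlation, which is what most DL frameworks actually compute under the name “convolution”):

import numpy as np

def conv2d(image, kernel, stride=1):
    """'Sum of an element-wise multiplication over a sliding window':
    slide the filter over the image and record one number per position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + kh,
                           j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)  # element-wise multiply, then sum
    return out

# Example: a 3x3 edge-like filter applied to a toy image with stride 2
edge_filter = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]])
print(conv2d(np.random.rand(8, 8), edge_filter, stride=2).shape)  # (3, 3)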

Capturing image statistics of natural scenes

Natural image statistics
  • at the lowest level, images are made up of edges

Recurrent Neural Networks

  • Networks for sequence learning
  • typical applications: text analysis, translation (sequence to sequence), audio, sensor data → time series, well logs, …
Recurrent Neural Networks
  • s1, s2, … sn are the input sequence
  • h1, h2, … hn are “hidden” vectors encoding the current state in the sequence → hn can be used e.g. as input for classification
  • the recurrent DNN is usually an LSTM or GRU, but more complex DNNs are possible
  • the RNN predicts the next sequence element
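A minimal Keras sketch of such a sequence model (vocabulary size, embedding width and the two-class output are placeholders, not from the lecture):

from tensorflow import keras
from tensorflow.keras import layers

rnn = keras.Sequential([
    layers.Input(shape=(None,), dtype="int32"),         # variable-length token sequence s1..sn
    layers.Embedding(input_dim=10000, output_dim=64),   # map tokens to vectors
    layers.LSTM(64),                                    # final hidden state hn summarizes the sequence
    layers.Dense(2, activation="softmax"),              # e.g. classification from hn
])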

Generative Adversarial Neural Networks (GAN)

Generative Adversarial Networks

In a Nutshell, Generative Adversarial Nets are:

  • no classification or regression; instead we reproduce the data distribution (probabilities)
  • groups of DNNs (at least two)
  • working against each other → the Generator gets a negative gradient whenever the Discriminator can tell whether the image was real or fake
  • minimum parts:
    • Discriminator network
    • Generator network
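For reference, the standard minimax objective behind this setup (Goodfellow et al., 2014; not written out in the notes):

min over G, max over D of E_x[log D(x)] + E_z[log(1 − D(G(z)))]

where D(x) is the probability the Discriminator assigns to x being real and G(z) is a sample the Generator produces from noise z.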

Code Exercises

(Links to Github)

Intro_to_keras_for_engineers_week7.ipynb

Week7_CNNs_solution.ipynb