# Introduction to Deep Learning

In this article, I will give you a very simple introduction to the basics of deep learning, regardless of the language, library, or framework you may choose thereafter.

## Introduction

Trying to explain deep learning with a good level of understanding may take quite a while, so that’s not the purpose of this article.

The purpose is to help beginners understand the basic concepts of this field. Nevertheless, even experts may find something useful in the following content.

At the risk of being extremely simple (you experts please forgive me), I will try to give you some basic information. If nothing else, this may just trigger a willingness to study the subject more deeply for some of you.

## Some History

Deep learning is essentially a new and trendy name for a subject that has been around for quite some time under the name of Neural Networks.

When I started studying (and loving) this field in the early 90s, the subject was already well-known. In fact, the first steps were made in the 1940s (McCulloch and Pitts), but progress in this area has gone through many ups and downs since then. Today, the field is enjoying huge success, with deep learning running on smartphones, cars, and many other devices.

So, what is a neural network and what can you do with it?

Ok, let’s focus for a moment on the classic approach to computer science: the programmer designs an algorithm that, for a given input, generates an output.

He or she accurately designs all the logic of the function f(x) so that:

y = f(x)

where x and y are the input and the output respectively.

However, sometimes designing f(x) may not be so easy. Imagine, for example, that x is an image of a face and y is the name of the corresponding person. This task is incredibly easy for a natural brain, yet very difficult for a computer algorithm to perform!

That’s where deep learning and neural networks come into play. The basic principle is: stop trying to design the f() algorithm and try to mimic the brain.

Ok, so how does the brain behave? It trains itself with a virtually infinite number of (x, y) sample pairs (the training set), and through a step-by-step process, the f(x) function shapes itself automatically. It’s not designed by anyone but just emerges from an endless trial-and-error refining mechanism.

Think of a child watching familiar people around him or her daily: billions of snapshots, taken from different positions, perspectives, light conditions, and every time making an association, every time correcting and sharpening the natural neural network underneath.

Artificial neural networks are a model of the natural neural networks made of neurons and synapses in the brain.

## Typical Neural Network Architecture

To keep things simple (and survive with the mathematics and computational power of today’s machines), a neural network may be designed as a set of layers, each one containing nodes (the artificial counterpart of a brain neuron), where each node in a layer is connected to every node in the next layer. Each node has a state represented by a floating-point number between two limits, generally 0 and 1. When this state is near its minimum value, the node is considered inactive (off), while when it’s near the maximum, the node is considered active (on). You can think of it as a light bulb that is not strictly tied to a binary state but can also take any intermediate value between the two limits.

Each connection has a weight, so an active node in the previous layer may contribute more or less to the activity of the node in the next layer (excitatory connection), while an inactive node will not propagate any contribution.

The weight of a connection may also be negative, meaning that the node in the previous layer is contributing (more or less) to the inactivity of the node in the next layer (inhibitory connection).

For the sake of simplicity, let’s describe a subset of a network where three nodes in the previous layer are connected with a node in the next layer. Again, to put it simply, let’s say the first two nodes in the previous layer are at their maximum value of activation (1), while the third is at its minimum value (0). In the figure above, the first two nodes in the previous layer are active (on) and therefore give some contribution to the state of the node in the next layer, while the third is inactive (off), so it will not contribute in any way (regardless of its connection weight).

The first node has a strong (thick) positive (green) connection weight, which means that its contribution to activation is high. The second has a weak (thin) negative (red) connection weight; therefore, it is contributing to inhibit the connected node.

In the end, we have a weighted sum of all the contributions from the incoming connected nodes from the previous layer:

$$\sum_i w_{ij}\, a_i$$

where $a_i$ is the activation state of node $i$ and $w_{ij}$ is the connection weight that connects node $i$ with node $j$.
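As a rough illustration of the weighted sum for the three-node example, here is a minimal Python sketch. The weight values are made up for illustration (the article does not give numbers); only the activation states 1, 1, and 0 come from the text.

```python
# Activation states of the three nodes in the previous layer (on, on, off).
activations = [1.0, 1.0, 0.0]

# Hypothetical weights: strong positive, weak negative, and an arbitrary
# weight for the third (inactive) node.
weights = [0.9, -0.2, 0.5]


def weighted_sum(activations, weights):
    """Sum of a_i * w_ij over all incoming connections to node j."""
    return sum(a * w for a, w in zip(activations, weights))


total = weighted_sum(activations, weights)

# The third node is off, so changing its weight does not change the sum.
same_total = weighted_sum(activations, [0.9, -0.2, 100.0])
print(total, same_total)
```

Because the third node is inactive, its weight has no effect on the result, which matches the description above.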

So, given this weighted sum number, how can we tell if the node in the next layer will or will not be activated? Is the rule as simple as “if the sum is positive it will be activated, while if negative it will not”?
Well, it may be this way, but in general, it depends on which Activation Function (along with which threshold value) you choose for a node.

Think about it: this final number can be anything in the range of real numbers, while we need to use it to set the state of a node with a more limited range (let’s say from 0 to 1). We then need to map the first range into the second, squashing an arbitrary (negative or positive) number into the 0..1 range.

A very common activation function that performs this task is the sigmoid function. In this graph, the threshold (the x value for which the y value hits the middle of the range, i.e. 0.5) is zero, but generally, it may be any value (negative or positive, causing the sigmoid to be shifted to the left or to the right).

A low threshold allows a node to be activated with a lower weighted sum, while a high threshold means the node is activated only when this sum is high.
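To make the effect of the threshold concrete, here is a small Python sketch of the standard logistic sigmoid. The `threshold` parameter and the sample values are assumptions made for this illustration; the shift is modeled simply by subtracting the threshold from the weighted sum before squashing it.

```python
import math


def sigmoid(z, threshold=0.0):
    """Squash any real number into the open range (0, 1).

    The output is exactly 0.5 when z equals the threshold.
    """
    return 1.0 / (1.0 + math.exp(-(z - threshold)))


weighted_sum = 0.7  # hypothetical weighted sum arriving at the node

low = sigmoid(weighted_sum, threshold=-2.0)   # low threshold: node mostly on
high = sigmoid(weighted_sum, threshold=2.0)   # high threshold: node mostly off
print(low, high)
```

The same weighted sum turns the node almost fully on with the low threshold and leaves it mostly off with the high one.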

This threshold value can be implemented by considering an additional dummy node in the previous layer, with a constant activation value of 1. In this case, in fact, the connection weight of this dummy node can act as the threshold value, and the sum formula above can be considered inclusive of the threshold itself.
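The dummy-node trick can be checked numerically. The sketch below uses made-up weights and a made-up threshold; with the sign convention assumed here (the threshold is subtracted from the sum), the weight of the dummy node is the negative of the threshold.

```python
import math


def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


activations = [1.0, 1.0, 0.0]
weights = [0.9, -0.2, 0.5]   # hypothetical values
threshold = 0.3              # hypothetical value

# Explicit threshold: shift the weighted sum before squashing it.
explicit = sigmoid(sum(a * w for a, w in zip(activations, weights)) - threshold)

# Dummy node: append a constant activation of 1 whose weight is -threshold,
# so the plain weighted sum already includes the shift.
ext_activations = activations + [1.0]
ext_weights = weights + [-threshold]
folded = sigmoid(sum(a * w for a, w in zip(ext_activations, ext_weights)))

print(explicit, folded)
```

Both computations give the same activation, so the threshold can be treated as just one more weight to be learned.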

Ultimately, the state of a network is represented by the set of values of all its weights (in its broad sense, inclusive of thresholds).

A given state, or set of weight values, may give bad results or a big error, while another state may instead give good results, or in other words, small errors.

So, moving in the N-dimensional state space leads to small or big errors. This function, which maps the weights domain to the error value, is the Loss Function. Our mind cannot easily imagine such a function in an N+1 space. However, we can get a general idea for the special case where N = 2: read this article and you’ll see.
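For the N = 2 case, the idea of a loss function can be sketched with a deliberately tiny model. Note that this is a plain linear model with two weights and a mean squared error, chosen here only because it is easy to follow; the sample data and the grid of candidate weights are invented for the example.

```python
# Hypothetical training set produced by y = 2*x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]


def loss(w1, w2):
    """Mean squared error of the model y = w1*x + w2 over the samples."""
    return sum((w1 * x + w2 - y) ** 2 for x, y in samples) / len(samples)


# Evaluate the loss on a coarse grid of the 2-dimensional weight space.
grid = [
    (w1, w2, loss(w1, w2))
    for w1 in (0.0, 1.0, 2.0, 3.0)
    for w2 in (0.0, 1.0, 2.0)
]

# The grid point with the smallest error.
best = min(grid, key=lambda row: row[2])
print(best)
```

Each point (w1, w2) of the weight space gets one error value, which is the surface that training tries to descend.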

Training a neural network consists of finding a good minimum of the loss function. Why a good minimum instead of the global minimum? Well, because this function is generally not convex and can have many local minima, so you can only wander around the weights domain with the help of some Gradient Descent technique and hope not to:

• take steps that are too big, which may make you jump over a good minimum without being aware of it
• take steps that are too small, which may leave you stuck in a not-so-good local minimum
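The two risks above can be seen on a one-dimensional toy function. The polynomial below is an arbitrary choice made for this sketch (it is not a real network loss); it has a shallow minimum near w = 1.13 and a deeper one near w = -1.30.

```python
def loss(w):
    """Toy non-convex loss with two minima (chosen for illustration)."""
    return w**4 - 3 * w**2 + w


def gradient(w):
    """Derivative of the toy loss."""
    return 4 * w**3 - 6 * w + 1


def descend(w, step_size, steps=1000):
    """Plain gradient descent: repeatedly step against the gradient."""
    for _ in range(steps):
        w -= step_size * gradient(w)
    return w


stuck = descend(2.0, step_size=0.01)    # small steps: settles in the nearby shallow minimum
deeper = descend(-2.0, step_size=0.01)  # same steps, other start: finds the deeper minimum
jumped = descend(2.0, step_size=0.1)    # bigger steps: jumps over the shallow minimum

print(stuck, deeper, jumped)
```

With small steps, the result depends entirely on where you start. With the larger step, the first move jumps right over the nearby minimum; here it happens to end up in the deeper one, but nothing guarantees that in general.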

Not an easy task, huh? That’s why this is the overall main problem with deep learning and why the training phase may take hours, days, or weeks. It’s why your hardware is crucial for this task and why you often have to stop the training and think about different approaches and configuration parameter values and start it all over again!

But let’s get back to the general structure of the network, which is a stack of layers. The first layer is the input (x), while the last layer is the output (y).

The layers in the middle can be zero, one, or many. They are called hidden layers, and the term “deep” in deep learning refers exactly to the fact that the network can have many hidden layers and therefore potentially be able to find more features correlating input and output during the training.
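The stack of layers can be sketched as a simple loop: the output of one layer becomes the input of the next. The 2-3-1 layout, the weight values, and the function names below are all hypothetical, and each layer is assumed to use the sigmoid activation with a bias (threshold) term per node.

```python
import math


def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Compute one layer; weights[j][i] connects input node i to output node j."""
    return [
        sigmoid(sum(w * a for w, a in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def network_forward(x, layers):
    """Propagate the input x through every layer in order."""
    activations = x
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 3 hidden nodes -> 1 output.
layers = [
    ([[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1]),  # hidden layer
    ([[0.7, -0.5, 0.2]], [0.05]),                                # output layer
]

y = network_forward([1.0, 0.0], layers)
print(y)
```

Adding more hidden layers only means adding more entries to the `layers` list; the propagation loop stays the same.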

A note: in the 1990s, you would have heard of multi-layer networks instead of deep networks, but that’s the same thing. It’s just that now, it has become clearer that the farther a layer is from the input (the deeper it is), the more abstract the features it may capture.

## The Learning Process

At the beginning of the learning process, the weights are set randomly, so a given input set in the first layer will propagate and generate a random (calculated) output. This output is then compared to the desired output for the input presented; the difference is a measure of the error of the network (loss function).

This error is then used to apply an adjustment in the connection weights that generated it, and this process starts from the output layer and goes step-by-step backward to the first layer.

The size of the applied adjustment can be small or big and is generally controlled by a factor called the learning rate.

This algorithm is called backpropagation and became popular in 1986 after the research of Rumelhart, Hinton, and Williams.
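The whole learning process described above (random initial weights, forward propagation, error measurement, backward weight adjustment scaled by the learning rate) can be sketched in plain Python. This is a minimal sketch under several assumptions that are not in the article: a single hidden layer, sigmoid activations, a squared-error loss, one weight update per sample, and the logical OR function as a toy training set. All names and default values are illustrative.

```python
import math
import random


def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, hidden=2, learning_rate=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Random initial weights; the last entry of each row is the bias weight.
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(hidden + 1)]

    def forward(x):
        h = [sigmoid(sum(w * a for w, a in zip(row, x + [1.0]))) for row in w_hidden]
        y = sigmoid(sum(w * a for w, a in zip(w_out, h + [1.0])))
        return h, y

    def mean_loss():
        return sum((forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)

    initial = mean_loss()
    for _ in range(epochs):
        for x, t in samples:
            h, y = forward(x)
            # Output layer: error times the sigmoid derivative y * (1 - y).
            delta_out = (y - t) * y * (1.0 - y)
            # Hidden layer: propagate delta_out backward through w_out.
            delta_hidden = [
                delta_out * w_out[j] * h[j] * (1.0 - h[j]) for j in range(hidden)
            ]
            # Adjust the weights against the gradient, scaled by the learning rate.
            for j, a in enumerate(h + [1.0]):
                w_out[j] -= learning_rate * delta_out * a
            for j in range(hidden):
                for i, a in enumerate(x + [1.0]):
                    w_hidden[j][i] -= learning_rate * delta_hidden[j] * a
    return initial, mean_loss(), forward


# Toy training set: the logical OR of two inputs.
samples = [
    ([0.0, 0.0], 0.0),
    ([0.0, 1.0], 1.0),
    ([1.0, 0.0], 1.0),
    ([1.0, 1.0], 1.0),
]
initial_loss, final_loss, predict = train(samples)
print(initial_loss, final_loss)
```

The error is computed at the output first and then passed backward to the hidden layer, which is where the name backpropagation comes from. Real frameworks do the same thing with automatic differentiation and far more efficient numerics.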

Keep in mind the name in the middle: Geoffrey Hinton. He is often referred to as the “Godfather of Deep Learning” and is a tireless, enlightened scientist. For example, he is now working on a new paradigm called Capsule Neural Networks, which sounds like another great revolution in the field!

The goal of backpropagation is to gradually reduce the overall error of the network by making appropriate corrections to the weights at each iteration through the training set. Again, consider that this process of reducing the error is the hard part, since there is no guarantee that the weight adjustments always go in the right direction for a good minimum.

The problem boils down to finding a minimum on an n-dimensional surface while stepping around with a blindfold: you can find a local minimum and never know if you could do better.

If the learning rate is too small, the process may be too slow, and the network may stagnate at a local minimum. On the other hand, a big learning rate may result in skipping over the global minimum and making the algorithm diverge.

In fact, quite often, the problem during the training phase is that the process of reducing the error does not converge, and the error grows instead of shrinking!
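The effect of the learning rate is easy to reproduce on the simplest possible loss, w squared, whose gradient is 2w and whose minimum is at w = 0. The function name, the starting point, and the three learning rates below are arbitrary choices for this sketch.

```python
def run(learning_rate, w=5.0, steps=50):
    """Gradient descent on the loss w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w


too_small = run(0.001)   # barely moves away from the starting point
reasonable = run(0.1)    # gets very close to the minimum at 0
too_big = run(1.1)       # overshoots more at every step and diverges

print(too_small, reasonable, too_big)
```

With the largest rate, each step lands farther from the minimum than the previous one, so the error grows instead of shrinking.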

## Today

Why is this field having such great success now?

Mainly for two reasons:

1. The availability of a huge amount of data (from smartphones, devices, IoT sensors, and the internet in general) needed for the training
2. The computational power of modern computers, which drastically shortens the training phase (although training phases of days or weeks are still not uncommon!)

Want to go deeper into the field? Here are a couple of good books: