### How does it work?

Computer programs that use deep learning go through much the same process. Each algorithm in the hierarchy applies a nonlinear transformation to its input and uses what it learns to create a statistical model as output. Iterations continue until the output reaches an acceptable level of accuracy. The number of processing layers through which data must pass is what inspired the label *deep*.

**The main concept in deep learning algorithms is automating the extraction of representations (abstractions) from the data**. Deep learning algorithms use large amounts of unlabeled data to automatically extract complex representations. These algorithms are largely motivated by the field of artificial intelligence, which has the general goal of emulating the human brain’s ability to observe, analyze, learn, and make decisions, especially for extremely complex problems. Work on these challenges has been a key inspiration behind deep learning algorithms, which strive to emulate the hierarchical learning approach of the human brain. Models based on shallow learning architectures, such as decision trees, support vector machines, and case-based reasoning, may fall short when attempting to extract useful information from complex structures and relationships in the input corpus. In contrast, deep learning architectures can generalize in non-local and global ways, learning patterns and relationships beyond immediate neighbors in the data. Deep learning is therefore an important step toward artificial intelligence: it not only provides complex representations of data that are suitable for AI tasks, but also makes machines less dependent on human-engineered knowledge, a long-standing goal of AI, because it extracts representations directly from unlabeled data without human intervention.

**A key concept underlying deep learning methods is distributed representations of the data, in which a large number of possible configurations of the abstract features of the input data are feasible, allowing for a compact representation of each sample and leading to richer generalization.** The number of possible configurations grows exponentially with the number of extracted abstract features. The observed data was generated through interactions of several known and unknown factors, so when a data pattern is obtained through some configuration of learned factors, additional (unseen) data patterns can likely be described through new configurations of the same learned factors. Compared to learning based on local generalizations, the number of patterns that can be obtained using a distributed representation scales quickly with the number of learned factors.
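To make the exponential scaling concrete, here is a minimal sketch that compares a local (one-hot) code with a distributed code. It assumes binary on/off features, which is a simplification introduced here for illustration; the function names are hypothetical.

```python
from itertools import product


def local_patterns(num_units: int) -> int:
    """One-hot (local) code: exactly one unit is active per pattern."""
    return num_units


def distributed_patterns(num_features: int) -> int:
    """Binary distributed code: every on/off combination is a distinct pattern."""
    return 2 ** num_features


# Enumerate all distributed codes for 3 binary features to make the count concrete.
codes = list(product([0, 1], repeat=3))
print(len(codes))  # 8

# Compare how many patterns each scheme can describe as the feature count grows.
for n in (3, 10, 20):
    print(n, local_patterns(n), distributed_patterns(n))
```

With 20 units, the local code can distinguish 20 patterns, while the binary distributed code can distinguish 2^20 (over a million) configurations of the same 20 features.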

**Deep learning algorithms are deep architectures of consecutive layers.** Each layer applies a nonlinear transformation to its input and provides a representation as its output. The objective is to learn a complicated and abstract representation of the data in a hierarchical manner by passing the data through multiple transformation layers. The sensory data (for example, pixels in an image) is fed to the first layer. The output of each layer is then provided as input to the next layer.
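The layer-to-layer flow described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the layer sizes, the random (untrained) weights, and the choice of `tanh` as the nonlinearity are all assumptions, and `init_layers` and `forward` are hypothetical helper names.

```python
import numpy as np


def init_layers(sizes, seed=0):
    """Create one (weights, bias) pair per layer with small random weights."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(x, layers):
    """Return the representation produced by every layer, input first."""
    representations = [x]
    for weights, bias in layers:
        # Affine map followed by a nonlinearity (tanh is an illustrative choice).
        x = np.tanh(x @ weights + bias)
        representations.append(x)
    return representations


# Hypothetical sizes: 64 "pixels" in, then three layers of 32, 16, and 8 units.
layers = init_layers([64, 32, 16, 8])
batch = np.random.default_rng(1).random((5, 64))  # 5 synthetic "images"
reps = forward(batch, layers)
print([r.shape for r in reps])  # [(5, 64), (5, 32), (5, 16), (5, 8)]
```

Each entry in `reps` is the input to the following layer, so the last entry is the most abstract representation of the batch.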

**Stacking up the nonlinear transformation layers is the basic idea in deep learning algorithms**. The more layers the data goes through in the deep architecture, the more complicated the nonlinear transformations that are constructed. These transformations represent the data, so deep learning can be considered a special case of representation learning, in which representations of the data are learned in a deep architecture with multiple levels of representation. The final representation is a highly nonlinear function of the input data.

Learning the parameters of a deep architecture, such as a neural network with many hidden layers, is a difficult optimization task. In 2006, Hinton proposed learning deep architectures in an unsupervised, greedy, layer-wise manner. First, the sensory data is fed as training data to the first layer. The first layer is trained on this data, and its output (the first level of learned representations) is provided as training data to the second layer. This process is repeated until the desired number of layers is obtained. At this point, the deep network is trained. The representations learned in the last layer can be used for different tasks. If the task is classification, another supervised layer is usually put on top of the last layer, and its parameters are set either randomly or by training on labeled data while keeping the rest of the network fixed. In the end, the whole network is fine-tuned by providing labeled data to it.
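The greedy layer-wise procedure can be sketched as follows. Note the substitutions: this sketch trains each layer as a small autoencoder with plain gradient descent, not the restricted Boltzmann machines used in Hinton's 2006 work, and it runs on synthetic random data. The function names, layer sizes, learning rate, and epoch count are all assumptions chosen for illustration.

```python
import numpy as np


def train_autoencoder_layer(data, n_hidden, epochs=200, lr=0.1, seed=0):
    """Train one autoencoder on `data`; return encoder params and loss history."""
    rng = np.random.default_rng(seed)
    n_in = data.shape[1]
    w_enc = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
    b_enc = np.zeros(n_hidden)
    w_dec = rng.normal(0.0, 0.1, size=(n_hidden, n_in))
    b_dec = np.zeros(n_in)
    losses = []
    for _ in range(epochs):
        hidden = np.tanh(data @ w_enc + b_enc)  # encode
        recon = hidden @ w_dec + b_dec          # linear decode
        error = recon - data
        losses.append(float(np.mean(error ** 2)))
        # Backpropagate the mean squared reconstruction error.
        grad_recon = 2.0 * error / error.size
        grad_w_dec = hidden.T @ grad_recon
        grad_b_dec = grad_recon.sum(axis=0)
        grad_pre = (grad_recon @ w_dec.T) * (1.0 - hidden ** 2)  # tanh derivative
        grad_w_enc = data.T @ grad_pre
        grad_b_enc = grad_pre.sum(axis=0)
        # Plain gradient-descent update.
        w_enc -= lr * grad_w_enc
        b_enc -= lr * grad_b_enc
        w_dec -= lr * grad_w_dec
        b_dec -= lr * grad_b_dec
    return (w_enc, b_enc), losses


def greedy_pretrain(data, hidden_sizes):
    """Train layers one at a time, feeding each layer's output to the next."""
    encoders, histories = [], []
    layer_input = data
    for depth, n_hidden in enumerate(hidden_sizes):
        (w, b), losses = train_autoencoder_layer(layer_input, n_hidden, seed=depth)
        encoders.append((w, b))
        histories.append(losses)
        # The learned representation becomes the next layer's training data.
        layer_input = np.tanh(layer_input @ w + b)
    return encoders, layer_input, histories


rng = np.random.default_rng(42)
unlabeled = rng.random((100, 20))  # synthetic stand-in for sensory data
encoders, top_repr, histories = greedy_pretrain(unlabeled, hidden_sizes=[12, 6])
print(top_repr.shape)  # (100, 6)
for losses in histories:
    print(round(losses[0], 4), round(losses[-1], 4))  # first vs. last reconstruction loss
```

The supervised steps are not shown here. In a fuller version, a classifier layer would be trained on `top_repr` with labels, and the whole stack would then be fine-tuned end to end.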