The journey from Vanilla Neural network to BERT architecture (NLU)

Aaina Bajaj
5 min readOct 19, 2020



Natural language understanding (NLU) is the ability of machines to understand human language.

NLU refers to how unstructured data is rearranged so that machines may “understand” and analyze it.

On the other hand, we have BERTOne of the most path-breaking developments in the field of NLU; a revolutionary NLP model that is superlative when compared with traditional NLP models.

Below, I have tried to briefly describe the journey from Vanilla Neural network to BERT architecture to achieve real-time NLU.

The above figure shows lots of activities. As we will explore more about it we will see, Every succeeding architecture has tried to overcome the problems of previous architecture and has tried to understand the language more accurately.

Let’s explore them(for examples, we have taken a basic QA system as a scenario):

Vanilla Neural Network

An artificial neural network consists of a collection of simulated neurons. Each neuron is a node that is connected to other nodes via links that correspond to biological axon-synapse-dendrite connections. Each link has a weight, which determines the strength of one node’s influence on another.

But, A vanilla neural network takes in a fixed size vector as input which limits its usage in situations that involve a ‘series’ type input with no predetermined size. (e.g.: In case of QA System, Question can only be of 30 words)

Recurrent Neural Network

Recurrent Neural Networks (RNNs) add an interesting twist to basic neural networks. RNNs are designed to take a series of inputs with no predetermined limit on size. (i.e.: Question can be of any length).

But Recurrent Neural Networks also suffers from short-term memory. If a sequence is long enough, they’ll have a hard time carrying information from earlier time steps to later ones. So if you are trying to process a paragraph of text to do predictions, RNN’s may leave out important information from the beginning.


LSTM ’s and GRU’s were created as the solution to short-term memory. They have internal mechanisms called gates that can regulate the flow of information.

Encoder-Decoder Sequence to Sequence LSTM based RNNs

Encoder-Decoder or Sequence to Sequence RNN is used a lot in translation services. The basic idea is that there are two RNNs, one an encoder that keeps updating its hidden state and produces a final single “Context” output. This is then fed to the decoder, which translates this context to a sequence of outputs. Another key difference in this arrangement is that the length of the input sequence and the length of the output sequence need not necessarily be the same.

Click below for visual presentation:


The main issue with this encoder-decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector/context vector. This turned out to be a bottleneck for these types of models. It made it challenging for the models to deal with long sentences.

Click below for visual presentation:


How to overcome the above limitation:


A solution was proposed in Bahdanau et al., 2014 and Luong et al., 2015. These papers introduced and refined a technique called “Attention”. Instead of passing the last hidden state of the encoding stage, the encoder passes all the hidden states to the decoder

Click below for visual presentation:


“Transformer”: Multi-Head Self Attention

In the paper “Attention Is All You Need”, Google introduced the Transformer, a neural network architecture based on a self-attention mechanism that believed to be particularly well suited for language understanding.

The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality

The attention mechanism in the Transformer is interpreted as a way of computing the relevance of a set of values(information)based on some keys and queries


Unlike the unidirectional language model, BERT uses a bidirectional Transformer. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question-answering

BERT Structure for QA System:


As the complexity of the model is increasing, we need more data to train it. In the case of BERT, Google has already trained it on millions of data points and has released all the checkpoints publicly. We can use those checkpoints at the initial level to understand the language. Then we can fine-tune that model as per our requirement with our data to make it more accurate.


BERT Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Google Blog Post: Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing

Paper: Attention is all you need

Jay Alammar’s Posts:

BERT — — The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)

Transformer — — The Illustrated Transformer

Attention — — Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)