LSTM (Long Short-Term Memory)

LSTM networks are an extension of recurrent neural networks (RNNs). Unlike plain RNNs, they can retain information over long periods of time. In an RNN there is no fine-grained control over which part of the context should be carried forward and which part 'forgotten'. The other issues with RNNs arise during training through backpropagation and are known as exploding and vanishing gradients.

The basic difference between the architectures of RNNs and LSTMs is that the hidden layer of an LSTM is a gated unit consisting of four layers that interact with one another: three logistic sigmoid gates and one tanh layer. A standard RNN, by contrast, has only a single tanh layer.
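Those four internal layers show up directly in the parameter count: an LSTM cell carries four weight matrices where a plain RNN cell carries one. A minimal sketch in numpy (the sizes and variable names here are arbitrary, chosen only for illustration):

```python
import numpy as np

n_input, n_hidden = 8, 16  # example sizes, chosen arbitrarily

# An LSTM cell holds four weight matrices, one per internal layer:
# forget gate (sigmoid), input gate (sigmoid), output gate (sigmoid),
# and the candidate layer (tanh). Each sees [h_{t-1}, x_t] concatenated.
W_f, W_i, W_o, W_c = (np.zeros((n_hidden, n_hidden + n_input)) for _ in range(4))

# A plain RNN cell needs only the single tanh layer's matrix:
W_rnn = np.zeros((n_hidden, n_hidden + n_input))

print(4 * W_rnn.size, "LSTM weights vs", W_rnn.size, "RNN weights (biases omitted)")
```

So for the same hidden size, the LSTM has roughly four times the weights of the plain RNN, one set per layer.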

Gates were introduced to limit the information that passes through the cell. They determine which part of the information the next cell will need and which part is to be discarded.

The output of each gate lies in the range 0-1, where '0' means 'reject all' and '1' means 'include all'.
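That 0-1 range comes from the logistic sigmoid, which squashes any real number into (0, 1). A quick sketch in plain Python (the `sigmoid` helper is our own):

```python
import math

def sigmoid(x):
    # logistic sigmoid: squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

# strongly negative input -> gate output near 0 ("reject all")
print(sigmoid(-10))  # close to 0
# strongly positive input -> gate output near 1 ("include all")
print(sigmoid(10))   # close to 1
```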

The first gate is the forget gate. If it is fully shut, no old memory is kept; if it is fully open, all old memory passes through. The second is the input gate, which controls exactly how much new memory comes in. Finally, the output gate decides what the next hidden state should be. The hidden state is also used for predictions.

First, we pass the previous hidden state and the current input into a sigmoid function. Then we pass the newly updated cell state to the tanh function and multiply the tanh output with the sigmoid output to decide which information the hidden state should carry. The output is the hidden state. The new cell state and the new hidden state are then carried over to the next time step.
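Putting the three gates together, one time step of an LSTM cell can be sketched in plain numpy. All names, sizes, and the parameter layout here are our own choices, with random untrained weights standing in for a real model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns (h_t, c_t)."""
    W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c = params
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]

    f = sigmoid(W_f @ z + b_f)             # forget gate: how much old memory to keep
    i = sigmoid(W_i @ z + b_i)             # input gate: how much new memory to admit
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate cell state
    c_t = f * c_prev + i * c_tilde         # new cell state

    o = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
params = [rng.standard_normal((n_hid, n_hid + n_in)) for _ in range(4)] + \
         [np.zeros(n_hid) for _ in range(4)]
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c, params)
print(h.shape, c.shape)  # (3,) (3,)
```

Note how the hidden state is bounded (every entry of `h_t` has magnitude below 1, since it is a tanh output scaled by a sigmoid), while the cell state `c_t` can accumulate beyond that range across time steps.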

There are various applications of LSTMs. Some of them are:

  1. Speech Recognition (Input is audio and output is text) – as done by Google Assistant, Microsoft Cortana, Apple Siri
  2. Machine Translation (Input is text and output is also text) – as done by Google Translate
  3. Image Captioning (Input is image and output is text)
  4. Sentiment Analysis (Input is text and output is rating)
  5. Music Generation/Synthesis (input is music notes and output is music)
  6. Video Activity Recognition (input is video and output is type of activity)
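These applications differ mainly in how the sequence is consumed. Sentiment analysis, for instance, is "many-to-one": the LSTM reads the whole text, token by token, and only the final hidden state feeds the rating. A shape-level sketch of that pattern in numpy (everything here is illustrative, with random untrained weights and random vectors standing in for word embeddings):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(z, W, b, n_hid):
    # W stacks the four gate matrices row-wise: [forget; input; output; candidate]
    a = W @ z + b
    f, i, o = (sigmoid(a[k * n_hid:(k + 1) * n_hid]) for k in range(3))
    c_tilde = np.tanh(a[3 * n_hid:])
    return f, i, o, c_tilde

# Toy "sentiment" run: 5 tokens, each a 4-dim embedding (random, untrained)
rng = np.random.default_rng(1)
n_in, n_hid, T = 4, 8, 5
W = rng.standard_normal((4 * n_hid, n_hid + n_in)) * 0.1
b = np.zeros(4 * n_hid)
w_out = rng.standard_normal(n_hid)   # readout vector for the rating

h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(T):                   # many-to-one: consume every token...
    x_t = rng.standard_normal(n_in)
    f, i, o, c_tilde = lstm_gates(np.concatenate([h, x_t]), W, b, n_hid)
    c = f * c + i * c_tilde
    h = o * np.tanh(c)

rating = sigmoid(w_out @ h)          # ...and rate from the final state only
print(f"rating in (0, 1): {rating:.3f}")
```

Machine translation and music generation, by contrast, are "many-to-many": the model emits an output at (or after) every time step rather than only at the end.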
