LSTMs are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber in 1997, and were refined and popularized in subsequent work. They work tremendously well on a large variety of problems, from time series forecasting to other kinds of sequential prediction tasks, and are now widely used. LSTMs are explicitly designed to avoid the long-term dependency problem: remembering information for long periods of time is their default behavior.
At each time step, an LSTM looks at some input $x_t$ and outputs a value $h_t$. A loop allows information to be passed from one step of the network to the next. In theory, traditional RNNs are capable of handling these kinds of long-term dependencies: a human could carefully pick parameters for them to solve toy problems of this form. In practice, however, RNNs don't seem to be able to learn them.
LSTMs have much in common with standard recurrent networks, but they have several crucial differences. The LSTM model introduces an intermediate type of storage with a structure called a memory cell. A memory cell is a composite kind of unit built from simpler nodes, the most important of which are multiplicative nodes.
The multiplicative nodes in an LSTM memory cell function like gates that control the flow of information through the network, enabling it to decide how much of each signal should pass through. These multiplicative interactions occur in three main places: an LSTM cell contains three gates (forget, input, and output) that together maintain one cell state.
The forget gate multiplies the previous cell state by a value between 0 and 1, deciding how much information from the previous time step to retain or discard:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

where $\sigma$ is the logistic sigmoid and $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden state and the current input.
The input gate uses a sigmoid activation to produce a value between 0 and 1 that scales the candidate cell state, determining how much new information from the current input to add to the memory:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
A tanh layer then generates the new candidate cell state itself:

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
The cell state, which carries the long-term memory, is updated by combining the retained old state with the gated candidate:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
Finally, the output gate multiplies the updated cell state (after passing it through a tanh) by a gating value to determine what part of the internal state becomes visible as the output and the new hidden state $h_t$:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$
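To make the dataflow concrete, here is a minimal NumPy sketch of a single LSTM step. Variable names mirror the equations above; the toy dimensions and random initialization are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step, following the equations above."""
    z = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])  # candidate state
    C_t = f_t * C_prev + i_t * C_tilde                    # cell state update
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate
    h_t = o_t * np.tanh(C_t)                              # new hidden state
    return h_t, C_t

# Toy dimensions, chosen only for illustration.
n_input, n_hidden = 3, 4
rng = np.random.default_rng(0)
params = {
    f"W_{g}": rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_input))
    for g in "fiCo"
}
params.update({f"b_{g}": np.zeros(n_hidden) for g in "fiCo"})

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(5, n_input)):  # a length-5 toy sequence
    h, C = lstm_step(x, h, C, params)
print(h.shape, C.shape)                  # (4,) (4,)
```

Note how the only operations touching the cell state $C_t$ are an elementwise multiply and an add; this additive path is what lets gradients flow across many time steps.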
The multiplicative operations at each gate enable the LSTM to regulate information flow dynamically, protecting long-term dependencies from being overwritten or from vanishing.
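In practice you rarely write the gates by hand; deep learning frameworks bundle them into a single layer. As a quick sanity check of the shapes involved, here is the same toy configuration using PyTorch's `nn.LSTM` (whose gating follows the same equations):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=4)  # one layer, matching the toy sizes above
x = torch.randn(5, 1, 3)                     # (seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)                 # output holds h_t for every step
print(output.shape, h_n.shape, c_n.shape)
# torch.Size([5, 1, 4]) torch.Size([1, 1, 4]) torch.Size([1, 1, 4])
```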