Long Short-Term Memory (LSTM): An Overview

Urvashi
8 min read · Dec 8, 2021

LSTM is a special kind of recurrent neural network capable of handling long-term dependencies. It lets the network remember the information it needs in order to keep hold of context, and forget the information that is no longer relevant.

Let’s say that while watching a video you remember the previous scene, or while reading a book you know what happened in an earlier chapter. RNNs work similarly: they remember previous information and use it to process the current input. The shortcoming of RNNs is that they cannot capture long-term dependencies because of the vanishing gradient problem. For example, suppose we look at the sequence ‘How are yo_’ and need to predict the next letter. Just by looking at the letters individually, it is not obvious what the next letter would be. But if we move through the sequence and consider all the letters together, we can establish a context and clearly say that the next letter would be ‘u’.

LSTM is a type of Recurrent Neural Network (RNN). An RNN has a node that receives some input, processes it, and produces an output. These nodes are recurrent, i.e. they loop around, so the output of a given step is provided alongside the input of the next step. This loop is what allows an RNN to remember the previous steps in a sequence.

A Recurrent Network with a loop
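To make the recurrence concrete, here is a minimal NumPy sketch of a single vanilla RNN step. The weight names and toy dimensions are illustrative assumptions, not part of the forecasting example later in this post.

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # The new hidden state mixes the current input with the previous hidden state,
    # which is how the network carries information forward through the sequence.
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

# Toy dimensions: 4 input features, 8 hidden units (illustrative values)
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(4, 8)), rng.normal(size=(8, 8)), np.zeros(8)

h = np.zeros(8)                      # initial hidden state
for x_t in rng.normal(size=(5, 4)):  # a toy sequence of 5 time steps
    h = rnn_step(x_t, h, W_x, W_h, b)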

RNN suffers from a long-term dependency problem: as more and more information piles up over time, the RNN becomes less effective at learning new things. LSTM addresses this by adding an internal state to the RNN node. An LSTM cell has three gates: the forget gate, the input gate, and the output gate. The forget gate decides which information in the internal LSTM state is no longer contextually relevant and should be discarded. The input gate decides what new information should be added to, or updated in, the internal state. The output gate decides which part of the stored information should be exposed as output at a given step. Each gate takes a value between 0 and 1, where 0 means the gate is effectively closed and nothing gets through, and 1 means the gate is fully open.
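As a rough illustration of how these gates interact, here is a minimal NumPy sketch of one LSTM step. The weight layout and toy dimensions are simplifying assumptions for illustration, not the exact implementation used by Keras.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # Each gate looks at the previous hidden state and the current input.
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W['f'] @ z + b['f'])        # forget gate: what to drop from the cell state
    i = sigmoid(W['i'] @ z + b['i'])        # input gate: what new information to write
    o = sigmoid(W['o'] @ z + b['o'])        # output gate: what to expose as the hidden state
    c_tilde = np.tanh(W['c'] @ z + b['c'])  # candidate values for the cell state
    c = f * c_prev + i * c_tilde            # updated internal (cell) state
    h = o * np.tanh(c)                      # updated hidden state / output
    return h, c

# Toy usage with 4 input features and 8 hidden units (illustrative only)
n_in, n_h = 4, 8
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(n_h, n_h + n_in)) for k in 'fioc'}
b = {k: np.zeros(n_h) for k in 'fioc'}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), W, b)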

LSTMs can be used for machine translation, Q&A chatbots, and other problems where we have a time sequence with long-term dependencies. A slight variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state and makes some other changes. The resulting model is simpler than standard LSTM models and has been growing increasingly popular.
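In Keras, trying a GRU in place of an LSTM is roughly a one-line change. A minimal sketch, assuming the same layer sizes used in the model later in this post:

from keras.models import Sequential
from keras.layers import GRU, Dense

model = Sequential()
model.add(GRU(100, activation='relu', input_shape=(12, 1)))  # GRU layer instead of LSTM
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')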

Following is an example of time series forecasting with an RNN (LSTM). The code below uses a CSV dataset of monthly milk production from the source: Basic Animal Husbandry Statistics, DAHD&F, GoI.

Importing Libraries

Let’s start the implementation by importing the required libraries in Google Colaboratory, as shown below:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Analyzing Dataset

The next step is to import the dataset into Google Colab. Download the CSV file into the runtime, then read it with pandas:

df = pd.read_csv('monthly_production.csv', index_col='Date', parse_dates=True)

parse_dates=True makes sure that pandas recognizes the index as dates, so the data is treated as a time series.
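As an optional sanity check (not part of the original walkthrough), you can confirm that pandas parsed the index as dates:

print(type(df.index))  # should show a pandas DatetimeIndex if the dates were parsed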

df.index.freq = 'MS'

The index frequency 'MS' (month start) indicates that we are dealing with monthly data; pandas can often infer this automatically, but setting it explicitly avoids ambiguity.

df.head()

The data is monthly, starting in 1962.

df.plot(figsize=(12,6))

From the above plot, it can be inferred that there is some seasonality in the data: a repeating pattern, along with a general upward trend over time.

To look at components such as the seasonal pattern and the trend, we use statsmodels and import the seasonal_decompose function, which decomposes the time series into its separate parts.

from statsmodels.tsa.seasonal import seasonal_decompose

results = seasonal_decompose(df['Production'])
results.plot();

In the above plot, the trend is isolated with the seasonal pattern removed, and we can see a general increase over time. The seasonal panel shows only the seasonality, obtained by removing the trend from the original series, so the repeating pattern can be observed clearly. The residual panel shows what cannot be explained by the trend or the seasonal pattern, i.e. the noise in the dataset. RNNs can learn such complex patterns whether or not the data is stationary.

len(df)

168

The length of the dataset is 168. For the training set, I used all the data except the last 12 months, i.e. 168 - 12 = 156 rows, and kept the remaining 12 months for the test set.

train = df.iloc[:156]
test = df.iloc[156:]

Data Preprocessing

The dataset is preprocessed with a MinMaxScaler, which scales the values to the range 0 to 1, so that the model does not have to deal with values of widely different magnitudes.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

scaler.fit(train)
scaled_train = scaler.transform(train)
scaled_test = scaler.transform(test)

scaled_train[:10]

We can see in the above output that the values have been scaled to between 0 and 1.

Now we format the data for the neural network by turning it into short sequences. Here, I give the model the values of 3 months as one input window. The TimeseriesGenerator creates such batches: each batch contains 3 consecutive values as the input and the value immediately after them as the target, and the window then slides forward one step for the next batch.

from keras.preprocessing.sequence import TimeseriesGenerator

# define generator
n_input = 3
n_features = 1
generator = TimeseriesGenerator(scaled_train, scaled_train, length=n_input, batch_size=1)

X,y = generator[0]
print(f'Given the Array: \n{X.flatten()}')
print(f'Predict this y: \n {y}')

X.shape

(1,3,1)

The first dimension is the batch size (1), the second is the 3 time steps, and the third is the number of features, which is also 1.

Note: We can create batches for any number of inputs.

# We do the same thing, but now for 12 months instead of 3 months
n_input = 12
generator = TimeseriesGenerator(scaled_train, scaled_train, length=n_input, batch_size=1)

Modeling

A Sequential model is used so that the layers are stacked one after the other.

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# define model
model = Sequential()
model.add(LSTM(100, activation='relu', input_shape=(n_input, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

I have added an LSTM layer with 100 units and the ‘relu’ activation function, and compiled the model with the ‘adam’ optimizer and mean squared error as the loss function. The model architecture can be inspected by printing the model summary, as shown below.

model.summary()

The next step is to fit the model with the generator, which already yields the inputs and outputs; the number of epochs used here is 50.

# fit model
model.fit(generator, epochs=50)

After training, I plotted the loss per epoch, which Keras records in the model’s training history.

loss_per_epoch = model.history.history['loss']
plt.plot(range(len(loss_per_epoch)),loss_per_epoch)

Loss per epoch plot

From the plot, we can see that the loss decreases with every epoch, with no significant change after about epoch 35.

Now I use the last 12 months of the training set as the input window to predict the first value of the test set; each new prediction will then be rolled into the window to make the next one. To start with, the model predicts the first value of the test set.

last_train_batch = scaled_train[-12:]

The data needs to be reshaped into the format (1, n_input, n_features), i.e. (1, 12, 1) here, because that is the shape the model was trained on.

last_train_batch = last_train_batch.reshape((1, n_input, n_features))

model.predict(last_train_batch)

scaled_test[0]

From the two outputs above, the original value was 0.67 and the model predicted 0.60, which is close to the original value. Now we will make predictions on the whole test set.

test_predictions = []

first_eval_batch = scaled_train[-n_input:]
current_batch = first_eval_batch.reshape((1, n_input, n_features))

for i in range(len(test)):

    # get the prediction for the current batch
    current_pred = model.predict(current_batch)[0]

    # append the prediction to the list of forecasts
    test_predictions.append(current_pred)

    # slide the window: drop the oldest value and append the new prediction
    current_batch = np.append(current_batch[:, 1:, :], [[current_pred]], axis=1)

Now we can print the test predictions.

test_predictions

The scaled values now need to be transformed back to their original magnitudes with an inverse transform, and the results are added as a column of the test DataFrame.

true_predictions = scaler.inverse_transform(test_predictions)

test['Predictions'] = true_predictions

test.plot(figsize=(14,5))

From the above plot, it can be seen that the predictions follow the test values quite closely, which suggests the model performs well. We can also calculate the root mean squared error using sklearn and the math library, passing in the original test values and the predictions.

from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(test['Production'], test['Predictions']))
print(rmse)

My root mean squared error is 24, which is a reasonably good score for this problem. Thus, it can be concluded that the model predicts the data relatively accurately.
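As an optional follow-up, the RMSE can be put in context by comparing it with the scale of the target column; a small sketch, assuming the same test DataFrame as above:

# express the error relative to the mean production level in the test period
relative_error = rmse / test['Production'].mean()
print(f'RMSE is about {relative_error:.1%} of the mean test value')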
