System Identification with Machine Learning Models - Part 1
Simple prediction models that can be used for data augmentation
- Bike Share Daily Dataset
- Loading the data
- Any missing data?
- Visualization of time series
- Cross Correlation Matrix
- Split the data
- Normalize the data
- Feature selection
- Denormalizing function
- Models
- Gradient Boosting
- Random Forest
- DNN
- Comparing the models
Recently, I have been looking at ways to create synthetic datasets, specifically from time series, given a real measured dataset and using Machine Learning models. The basic idea is to use ML models for system identification, that is, to characterize the dynamics of a system using measurements of its inputs and outputs.
One of the most popular non-linear system identification approaches is based on the application of Artificial Neural Networks (ANNs) (Gupta et al., 2003; Mueller and Lemke, 2000; Nelles, 2001).
Typically, in a dynamic system the values of the output variables depend not only on the instantaneous values of the input variables but also on the past behaviour of the system.
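To make that concrete, a dynamic model would need lagged copies of the inputs (and possibly of the outputs) as extra features. Below is a minimal pandas sketch of how such lags could be built; the column names are only placeholders and this is not used in this first notebook.
# Sketch: build lagged features for a dynamic model (not used in part 1)
def add_lags(data, columns, n_lags=2):
    out = data.copy()
    for col in columns:
        for lag in range(1, n_lags + 1):
            # value of `col` observed `lag` steps in the past
            out[f'{col}_lag{lag}'] = out[col].shift(lag)
    return out.dropna()  # the first n_lags rows have no full history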
In this notebook (part 1 of upcoming notebook posts), I'll start with simple ML models that only take into account the instantaneous values of the input variables. For this purpose, I searched through Kaggle and picked a time series dataset that has seasonal and environmental input variables. The dataset is Bike Share Daily Data from Hadi Fanaee-T, Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto.
Bike Share Daily Dataset
Here is the description of the dataset according to their Kaggle page:
Bike-sharing rental process is highly correlated to the environmental and seasonal settings. For instance, weather conditions, precipitation, day of week, season, hour of the day, etc. can affect the rental behaviors. The core data set is related to the two-year historical log corresponding to years 2011 and 2012 from Capital Bikeshare system, Washington D.C., USA which is publicly available in http://capitalbikeshare.com/system-data. We aggregated the data on two hourly and daily basis and then extracted and added the corresponding weather and seasonal information. Weather information are extracted from http://www.freemeteo.com.
And here is the description of the fields in the dataset:
- instant: record index
- dteday : date
- season : season (1:spring, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- holiday : whether day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
- weathersit :
- 1: Clear, Few clouds, Partly cloudy, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided to 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)
- hum: Normalized humidity. The values are divided to 100 (max)
- windspeed: Normalized wind speed. The values are divided to 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
In this notebook, we will train a model that characterizes the dynamics of bike rental behaviour given the environmental and seasonal settings. We can look at this as a prediction problem where we try to predict the bike user counts (casual and registered) given the input data.
After a short EDA and data preparation, we'll be using Gradient Boosting, Random Forest, and a simple DNN for this prediction and modeling problem.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_absolute_error
import tensorflow as tf
According to the description of the dataset, the environmental features (the four columns temp, atemp, hum, and windspeed) are normalized. Since we will standardize all continuous-valued features later, we can first reverse this normalization using the scale factors given by the dataset owner:
- temp : Normalized temperature in Celsius. The values are divided to 41 (max)
- hum: Normalized humidity. The values are divided to 100 (max)
- windspeed: Normalized wind speed. The values are divided to 67 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)
df = pd.read_csv('bike_sharing_daily.csv', index_col ='dteday')
df.drop('instant', axis=1, inplace=True)
df['temp'] = df['temp']*41
df['atemp'] = df['atemp']*50
df['hum'] = df['hum']*100
df['windspeed'] = df['windspeed']*67  # per the description, windspeed was divided by 67, not 100
df
df.describe().T
df.isnull().values.any()
df[['temp', 'hum', 'windspeed']].plot(figsize=(10, 5))
df[['casual', 'registered']].plot(figsize=(10, 5))
plt.subplots(figsize=(8, 6))
sns.heatmap(df.corr())
The temp and atemp (feels-like temperature) columns look highly correlated, so we can probably discard atemp here.
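To check this numerically, we can print the correlation coefficient between the two columns:
# Numeric check of how strongly temp and atemp are correlated
print(df[['temp', 'atemp']].corr())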
n = len(df)
train_df = df[:int(0.8*n)].copy()
test_df = df[int(0.8*n):].copy()
non_cat = ['temp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']
train_mean = train_df[non_cat].mean()
train_std = train_df[non_cat].std()
train_df.loc[:, non_cat] = (train_df.loc[:, non_cat] - train_mean) / train_std
test_df.loc[:, non_cat] = (test_df.loc[:,non_cat] - train_mean) / train_std
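As a side note, the same standardization could instead be done with scikit-learn's StandardScaler, fit on the training split only so that no test statistics leak into the preprocessing. A minimal sketch is below; it assumes the frames still hold the raw, un-normalized values, and note that StandardScaler uses the population standard deviation, so its output differs very slightly from the pandas .std() used above.
# Alternative standardization with scikit-learn (sketch only -- the frames above
# are already normalized in place, so this is not meant to be run as-is)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_df[non_cat])  # fit on training data only
test_scaled = scaler.transform(test_df[non_cat])        # reuse training statistics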
Let's plot the probability density of the data using violin plots to make sure our normalization looks good.
df_std = (df[non_cat] - train_mean) / train_std
df_melt = df_std.melt(var_name='Column', value_name='Normalized')
plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Column', y='Normalized', data=df_melt)
_ = ax.set_xticklabels(df[non_cat].keys(), rotation=90)
Feature selection
There are two types of input data:
- Seasonal data
- Weather data
and the output of the model is the bike user count which can be found in three columns of:
- casual
- registered
- cnt (sum of casual and registered)
We can drop 'cnt' as it's just the sum of the other two counts. I experimented with using only the seasonal data or only the weather data as the model input, but the performance was better when all input features were used.
# separate input and output in each dataset
input_columns = ['season', 'yr', 'mnth', 'weekday', 'workingday','temp', 'hum', 'windspeed']
output_columns = ['casual', 'registered']
train_in = train_df[input_columns].to_numpy()
test_in = test_df[input_columns].to_numpy()
train_out = train_df[output_columns].to_numpy()
test_out = test_df[output_columns].to_numpy()
print(f'train_in shape: {train_in.shape}')
print(f'test_in shape: {test_in.shape}')
print(f'train_out shape: {train_out.shape}')
print(f'test_out shape: {test_out.shape}')
Before evaluating the models, we invert the normalization and compare the predictions and true values in their actual scale. We use the following function to invert the normalization.
# mean and std to be used for inverting the normalization
_mean = train_mean[output_columns].values
_std = train_std[output_columns].values
# Denormalize the data
def denorm(data, mean, std):
    data_denorm = data * std + mean
    return data_denorm
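As a quick, optional sanity check, applying denorm to the normalized training targets should reproduce the raw counts from the original data:
# Optional sanity check: the round trip should give back the raw casual/registered counts
check = denorm(train_df[output_columns].to_numpy(), _mean, _std)
print(check[:3].round(1))  # compare with the first rows of df[['casual', 'registered']]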
Let's define a function to evaluate the models.
def evaluate_model(model, train_in, train_out, test_in, test_out):
    print('\n***** Training performance: *****')
    train_pred = model.predict(train_in)
    train_pred_denorm = denorm(train_pred, _mean, _std)
    train_out_denorm = denorm(train_out, _mean, _std)
    train_mae = round(mean_absolute_error(train_out_denorm, train_pred_denorm), 2)
    print('MAE = ', train_mae)

    print('\n***** Testing performance: *****')
    test_pred = model.predict(test_in)
    test_pred_denorm = denorm(test_pred, _mean, _std)
    test_out_denorm = denorm(test_out, _mean, _std)
    test_mae = round(mean_absolute_error(test_out_denorm, test_pred_denorm), 2)
    print('MAE = ', test_mae)

    # plot predictions vs. ground truth on the test set for each output column
    for i, col in enumerate(output_columns):
        plt.figure(figsize=(10, 5))
        plt.plot(test_pred_denorm[:, i])
        plt.plot(test_out_denorm[:, i])
        plt.legend(['predict', 'true'])
        plt.title(f'Predict vs. ground truth for {col}')

    return train_mae, test_mae
# Creating a MultiOutputRegressor object with GradientBoostingRegressor as the estimator
# This assumes that the outputs are independent of each other, which might not be a correct assumption.
gbr = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
print('\n====== GradientBoostingRegressor =====')
gbr.fit(train_in, train_out)
gbr_train_mae, gbr_test_mae = evaluate_model(gbr, train_in, train_out, test_in, test_out)
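As noted in the comment above, MultiOutputRegressor treats the two outputs as independent. If we wanted to account for a possible dependence between the casual and registered counts, scikit-learn's RegressorChain could be tried instead; a minimal sketch, not evaluated here:
# Sketch: chain the outputs so the second regressor also sees the first one's prediction
from sklearn.multioutput import RegressorChain

gbr_chain = RegressorChain(GradientBoostingRegressor(random_state=0))
gbr_chain.fit(train_in, train_out)
# it could then be evaluated the same way:
# evaluate_model(gbr_chain, train_in, train_out, test_in, test_out)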
# note: newer versions of scikit-learn use criterion='absolute_error' instead of 'mae'
rfr = RandomForestRegressor(n_estimators=100, criterion='mae', random_state=0)
print('\n====== RandomForestRegressor =====')
rfr.fit(train_in, train_out)
rfr_train_mae, rfr_test_mae = evaluate_model(rfr, train_in, train_out, test_in, test_out)
nn = tf.keras.Sequential([
    tf.keras.layers.Dense(32, input_shape=(len(input_columns),)),
    tf.keras.layers.Dense(len(output_columns))
])
nn.summary()  # summary() prints the architecture itself, so no print() wrapper is needed
MAX_EPOCHS = 100
def compile_and_fit(model, train_in, train_out, test_in, test_out, patience=5):
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                      patience=patience,
                                                      mode='min')
    model.compile(loss=tf.losses.MeanSquaredError(),
                  optimizer=tf.optimizers.Adam(),
                  metrics=[tf.metrics.MeanAbsoluteError()])
    history = model.fit(train_in, train_out, epochs=MAX_EPOCHS,
                        validation_data=(test_in, test_out),
                        callbacks=[early_stopping],
                        batch_size=32, verbose=0)
    return history
tf.keras.backend.clear_session()
history = compile_and_fit(nn, train_in, train_out, test_in, test_out)
plt.figure()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['loss', 'val_loss'])
plt.figure()
plt.plot(history.history['mean_absolute_error'])
plt.plot(history.history['val_mean_absolute_error'])
plt.legend(['mean_absolute_error', 'val_mean_absolute_error'])
nn_train_mae, nn_test_mae = evaluate_model(nn, train_in, train_out, test_in, test_out)
print('Model: Train MAE - Test MAE')
print(f'Gradient Boosting: {gbr_train_mae} - {gbr_test_mae }')
print(f'Random Forest: {rfr_train_mae} - {rfr_test_mae }')
print(f'DNN: {nn_train_mae} - {nn_test_mae }')
It looks like the best result based on test MAE comes from the Gradient Boosting model. Perhaps the other models, especially the DNN, can be tuned to achieve at least the same level of performance. Please let me know if you have any suggestions to improve the models. Thanks for your feedback.
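One untested idea for the DNN: the network above has no activation functions, so it is effectively a linear model. A variant with nonlinearities and a second hidden layer, sketched below, might be a reasonable starting point for tuning (not trained or evaluated here).
# Sketch of a possible DNN variant with nonlinear activations (not tuned or evaluated)
nn_v2 = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(len(input_columns),)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(len(output_columns))
])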