System Identification with Machine Learning Models - Part 1
Simple prediction models that can be used for data augmentation
- Bike Share Daily Dataset
- Loading the data
- Any missing data?
- Visualization of time series
- Cross Correlation Matrix
- Split the data
- Normalize the data
- Feature selection
- Denormalizing function
- Models
- Gradient Boosting
- Random Forest
- DNN
- Comparing the models
Recently, I have been looking at ways to create synthetic datasets, specifically from time series, given a real measured dataset and using Machine Learning models. The basic idea is to use ML models for system identification, that is, to characterize the dynamics of a system using measurements of its inputs and outputs.
One of the most popular non-linear system identification approaches is based on the application of Artificial Neural Networks (ANNs) (Gupta et al., 2003; Mueller and Lemke, 2000; Nelles, 2001).
Typically, in a dynamic system the values of the output variables depend not only on the instantaneous values of the input variables but also on the past behaviour of the system.
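To make that concrete, a dynamic model would need lagged copies of the inputs (and possibly of the outputs) as extra features. Below is a minimal pandas sketch of how such lags could be built; the column names are only placeholders and this is not used in this first notebook.
# Sketch: build lagged features for a dynamic model (not used in part 1)
def add_lags(data, columns, n_lags=2):
    out = data.copy()
    for col in columns:
        for lag in range(1, n_lags + 1):
            # value of `col` observed `lag` steps in the past
            out[f'{col}_lag{lag}'] = out[col].shift(lag)
    return out.dropna()  # the first n_lags rows have no full history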
In this notebook (part 1 of upcoming notebook posts), I'll start with simple ML models that only take into account the instantaneous values of the input variables. For this purpose, I searched through Kaggle and picked a time series dataset that has seasonal and environmental input variables. The dataset is Bike Share Daily Data from Hadi Fanaee-T, Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto.
Bike Share Daily Dataset
Here is the description of the dataset according to their Kaggle page:
Bike-sharing rental process is highly correlated to the environmental and seasonal settings. For instance, weather conditions, precipitation, day of week, season, hour of the day, etc. can affect the rental behaviors. The core data set is related to the two-year historical log corresponding to years 2011 and 2012 from Capital Bikeshare system, Washington D.C., USA which is publicly available in http://capitalbikeshare.com/system-data. We aggregated the data on two hourly and daily basis and then extracted and added the corresponding weather and seasonal information. Weather information are extracted from http://www.freemeteo.com.
And here is the description of the fields in the dataset:
- instant: record index
- dteday : date
- season : season (1:spring, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- holiday : whether day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
- weathersit :
- 1: Clear, Few clouds, Partly cloudy, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided to 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)
- hum: Normalized humidity. The values are divided to 100 (max)
- windspeed: Normalized wind speed. The values are divided to 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
In this notebook, we will train a model that characterizes the dynamics of bike rental behaviour given the environmental and seasonal settings. We can look at this as a prediction problem where we try to predict the bike user counts (casual and registered) given the input data.
After a short EDA and data preparation, we'll be using Gradient Boosting, Random Forest, and a simple DNN for this prediction and modeling problem.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_absolute_error
import tensorflow as tf
According to the description of the dataset, the environmental features (the four columns temp, atemp, hum, and windspeed) are normalized. Since we will standardize all continuous-valued features later, we can first reverse this normalization using the scale factors given by the dataset owner:
- temp : Normalized temperature in Celsius. The values are divided to 41 (max)
- hum: Normalized humidity. The values are divided to 100 (max)
- windspeed: Normalized wind speed. The values are divided to 67 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)
df = pd.read_csv('bike_sharing_daily.csv', index_col ='dteday')
df.drop('instant', axis=1, inplace=True)
df['temp'] = df['temp']*41
df['atemp'] = df['atemp']*50
df['hum'] = df['hum']*100
df['windspeed'] = df['windspeed']*67  # per the description, windspeed was divided by 67, not 100
df
df.describe().T
df.isnull().values.any()
df[['temp', 'hum', 'windspeed']].plot(figsize=(10, 5))
df[['casual', 'registered']].plot(figsize=(10, 5))
plt.subplots(figsize=(8, 6))
sns.heatmap(df.corr())
The temp and atemp (feels-like temperature) columns look highly correlated, so we can probably discard atemp here.
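To check this numerically, we can print the correlation coefficient between the two columns:
# Numeric check of how strongly temp and atemp are correlated
print(df[['temp', 'atemp']].corr())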
n = len(df)
train_df = df[:int(0.8*n)].copy()
test_df = df[int(0.8*n):].copy()
non_cat = ['temp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']
train_mean = train_df[non_cat].mean()
train_std = train_df[non_cat].std()
train_df.loc[:, non_cat] = (train_df.loc[:, non_cat] - train_mean) / train_std
test_df.loc[:, non_cat] = (test_df.loc[:,non_cat] - train_mean) / train_std
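As a side note, the same standardization could instead be done with scikit-learn's StandardScaler, fit on the training split only so that no test statistics leak into the preprocessing. A minimal sketch is below; it assumes the frames still hold the raw, un-normalized values, and note that StandardScaler uses the population standard deviation, so its output differs very slightly from the pandas .std() used above.
# Alternative standardization with scikit-learn (sketch only -- the frames above
# are already normalized in place, so this is not meant to be run as-is)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_df[non_cat])  # fit on training data only
test_scaled = scaler.transform(test_df[non_cat])        # reuse training statistics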
Let's plot the probability density of the data using violin plots to make sure our normalization looks good.
df_std = (df[non_cat] - train_mean) / train_std
df_melt = df_std.melt(var_name='Column', value_name='Normalized')
plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Column', y='Normalized', data=df_melt)
_ = ax.set_xticklabels(df[non_cat].keys(), rotation=90)
Feature selection
There are two types of input data:
- Seasonal data
- Weather data
and the output of the model is the bike user count which can be found in three columns of:
- casual
- registered
- cnt (sum of casual and registered)
We can drop 'cnt' as it's just the sum of the other two counts. I experimented with using only the seasonal data or only the weather data as the model input, but the performance was better when all input features were used.
# separate input and output in each dataset
input_columns = ['season', 'yr', 'mnth', 'weekday', 'workingday','temp', 'hum', 'windspeed']
output_columns = ['casual', 'registered']
train_in = train_df[input_columns].to_numpy()
test_in = test_df[input_columns].to_numpy()
train_out = train_df[output_columns].to_numpy()
test_out = test_df[output_columns].to_numpy()
print(f'train_in shape: {train_in.shape}')
print(f'test_in shape: {test_in.shape}')
print(f'train_out shape: {train_out.shape}')
print(f'test_out shape: {test_out.shape}')
Before evaluating the models, we invert the normalization and compare the predictions and true values in their actual scale. We use the following function to invert the normalization.
# mean and std to be used for inverting the normalization
_mean = train_mean[output_columns].values
_std = train_std[output_columns].values
# Denormalize the data
def denorm(data, mean, std):
    data_denorm = data * std + mean
    return data_denorm
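As a quick, optional sanity check, applying denorm to the normalized training targets should reproduce the raw counts from the original data:
# Optional sanity check: the round trip should give back the raw casual/registered counts
check = denorm(train_df[output_columns].to_numpy(), _mean, _std)
print(check[:3].round(1))  # compare with the first rows of df[['casual', 'registered']]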
Let's define a function to evaluate the models.
def evaluate_model(model, train_in, train_out, test_in, test_out):
    print('\n***** Training performance: *****')
    train_pred = model.predict(train_in)
    train_pred_denorm = denorm(train_pred, _mean, _std)
    train_out_denorm = denorm(train_out, _mean, _std)
    train_mae = round(mean_absolute_error(train_out_denorm, train_pred_denorm), 2)
    print('MAE = ', train_mae)

    print('\n***** Testing performance: *****')
    test_pred = model.predict(test_in)
    test_pred_denorm = denorm(test_pred, _mean, _std)
    test_out_denorm = denorm(test_out, _mean, _std)
    test_mae = round(mean_absolute_error(test_out_denorm, test_pred_denorm), 2)
    print('MAE = ', test_mae)

    # plot predictions vs. ground truth on the test set for each output column
    for i, col in enumerate(output_columns):
        plt.figure(figsize=(10, 5))
        plt.plot(test_pred_denorm[:, i])
        plt.plot(test_out_denorm[:, i])
        plt.legend(['predict', 'true'])
        plt.title(f'Predict vs. ground truth for {col}')

    return train_mae, test_mae
# Creating a MultiOutputRegressor object with GradientBoostingRegressor as the estimator
# This assumes that the outputs are independent of each other, which might not be a correct assumption.
gbr = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
print('\n====== GradientBoostingRegressor =====')
gbr.fit(train_in, train_out)
gbr_train_mae, gbr_test_mae = evaluate_model(gbr, train_in, train_out, test_in, test_out)
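As noted in the comment above, MultiOutputRegressor treats the two outputs as independent. If we wanted to account for a possible dependence between the casual and registered counts, scikit-learn's RegressorChain could be tried instead; a minimal sketch, not evaluated here:
# Sketch: chain the outputs so the second regressor also sees the first one's prediction
from sklearn.multioutput import RegressorChain

gbr_chain = RegressorChain(GradientBoostingRegressor(random_state=0))
gbr_chain.fit(train_in, train_out)
# it could then be evaluated the same way:
# evaluate_model(gbr_chain, train_in, train_out, test_in, test_out)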
# note: newer versions of scikit-learn use criterion='absolute_error' instead of 'mae'
rfr = RandomForestRegressor(n_estimators=100, criterion='mae', random_state=0)
print('\n====== RandomForestRegressor =====')
rfr.fit(train_in, train_out)
rfr_train_mae, rfr_test_mae = evaluate_model(rfr, train_in, train_out, test_in, test_out)
nn = tf.keras.Sequential([
    tf.keras.layers.Dense(32, input_shape=(len(input_columns),)),
    tf.keras.layers.Dense(len(output_columns))
])
nn.summary()  # summary() prints the architecture itself, so no print() wrapper is needed
MAX_EPOCHS = 100
def compile_and_fit(model, train_in, train_out, test_in, test_out, patience=5):
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                      patience=patience,
                                                      mode='min')
    model.compile(loss=tf.losses.MeanSquaredError(),
                  optimizer=tf.optimizers.Adam(),
                  metrics=[tf.metrics.MeanAbsoluteError()])
    history = model.fit(train_in, train_out, epochs=MAX_EPOCHS,
                        validation_data=(test_in, test_out),
                        callbacks=[early_stopping],
                        batch_size=32, verbose=0)
    return history
tf.keras.backend.clear_session()
history = compile_and_fit(nn, train_in, train_out, test_in, test_out)
plt.figure()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['loss', 'val_loss'])
plt.figure()
plt.plot(history.history['mean_absolute_error'])
plt.plot(history.history['val_mean_absolute_error'])
plt.legend(['mean_absolute_error', 'val_mean_absolute_error'])
nn_train_mae, nn_test_mae = evaluate_model(nn, train_in, train_out, test_in, test_out)
print('Model: Train MAE - Test MAE')
print(f'Gradient Boosting: {gbr_train_mae} - {gbr_test_mae }')
print(f'Random Forest: {rfr_train_mae} - {rfr_test_mae }')
print(f'DNN: {nn_train_mae} - {nn_test_mae }')
It looks like the best result based on test MAE comes from the Gradient Boosting model. Perhaps the other models, especially the DNN, can be tuned to achieve at least the same level of performance. Please let me know if you have any suggestions to improve the models. Thanks for your feedback.
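One untested idea for the DNN: the network above has no activation functions, so it is effectively a linear model. A variant with nonlinearities and a second hidden layer, sketched below, might be a reasonable starting point for tuning (not trained or evaluated here).
# Sketch of a possible DNN variant with nonlinear activations (not tuned or evaluated)
nn_v2 = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(len(input_columns),)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(len(output_columns))
])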