Build a Machine Learning Model for a Weather dataset

Sachin Mamoru
14 min read · May 28, 2021
Photo by Glenn Carstens-Peters on Unsplash

Problem in brief

Use case: Is there a relationship between humidity and temperature? What about between humidity and apparent temperature? Can you predict the apparent temperature given the humidity?

Here, our problem is to determine the relationship between humidity and apparent temperature.

For this analysis, we are using a weather history dataset from Kaggle.

Apparent temperature is the temperature equivalent perceived by humans, caused by the combined effects of air temperature, relative humidity and wind speed. The measure is most commonly applied to the perceived outdoor temperature.

Given this definition, we cannot directly claim that humidity and apparent temperature are strongly related, because apparent temperature is affected by other factors as well.

The dataset has 12 columns, so we cannot blindly say whether humidity and apparent temperature are related. For that, we need a proper analysis.

Note: All the coding was done in a Google Colab notebook (Resources Section)

Preprocess the dataset as specified in the data mining process

First, we need to visualize and understand the dataset.

import pandas as pd

weather_df = pd.read_csv('../data/weatherHistory.csv')
weather_df.head(5)

Using the shape and dtypes attributes, you can find the number of rows and columns and the data type of each column.
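As a small illustration of that inspection step, here is a sketch on a made-up two-column frame. The name toy_df and its values are assumptions for the example and are not taken from the Kaggle data.

```python
import pandas as pd

# Toy frame standing in for the weather data (values are made up).
toy_df = pd.DataFrame({
    "Humidity": [0.89, 0.86, 0.83],
    "Precip Type": ["rain", "rain", "snow"],
})

# shape is an attribute holding (number of rows, number of columns)
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# dtypes is also an attribute: one dtype per column
print(toy_df.dtypes)
```

Text columns such as "Precip Type" show up as non-numeric here, which is a quick way to spot the columns that will need encoding later.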

Real-world data is usually incomplete (lacking attribute values, lacking attributes of interest, or holding only aggregate data), noisy (containing errors or outliers), and inconsistent (holding discrepancies in codes or names).

Therefore, preprocessing is a must for real-world data.

First, we need to check the values of each column and get a rough idea of the dataset. For that, we can use the describe method.

weather_df.describe(include="all")

Here, you can see the row count is 96453, and the “Formatted Date” column has the same number of unique values. A column with that many unique values is not useful for recognizing patterns, so we remove it.

You can also see that the “Loud Cover” column only has the value 0, so we can drop that column as well.

The “Summary” and “Daily Summary” columns carry very similar information, so I removed the “Daily Summary” column.

Handle Missing Values

In a real-world dataset, there can be several missing values. So we need to handle them before going into further analysis. We have two options concerning missing values.

  1. Deletion
  2. Imputation

Imputation means filling a missing value with a substitute such as the mean or median of the column. We need to be careful with it, because it can bias the model.

If the number of missing cases is very small compared to the size of the dataset, it is better to remove them.

print(weather_df.isnull().sum() * 100 / len(weather_df))

Since the “Precip Type” column has a very low missing-value ratio (0.536%), it is better to remove those rows than to impute them.

new_df = weather_df.dropna(axis = 0, how ='any')
new_df = new_df.reset_index(drop=True)

When we remove rows from a dataframe, the remaining rows keep their original index labels. For instance, if we remove the row with index 5, the new dataframe has the indexes ..3, 4, 6, 7, etc. To avoid this, we use the reset_index method, which renumbers the rows from 0. Passing drop=True discards the old index instead of keeping it as a column.
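The effect of dropping rows and resetting the index can be seen on a tiny example. This is a sketch on made-up data; toy_df, dropped and cleaned are names chosen for the illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in "Precip Type".
toy_df = pd.DataFrame({
    "Humidity": [0.9, 0.8, 0.7, 0.6],
    "Precip Type": ["rain", np.nan, "snow", "rain"],
})

# Percentage of missing values per column
missing_pct = toy_df.isnull().sum() * 100 / len(toy_df)
print(missing_pct)

# Dropping the incomplete row leaves a gap in the index labels
dropped = toy_df.dropna(axis=0, how="any")
print(list(dropped.index))   # [0, 2, 3]

# reset_index(drop=True) renumbers the rows from 0
cleaned = dropped.reset_index(drop=True)
print(list(cleaned.index))   # [0, 1, 2]
```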

Remove Duplicates

Let’s find out how many duplicates there are in the dataset.

print("Number of duplicates : " + str(new_df.duplicated().sum()))

Let’s remove them.

new_df = new_df.drop_duplicates()
new_df = new_df.reset_index(drop=True)

Handle Outliers

An outlier is any data point that is clearly deviating from the rest of the dataset.

Types of Outliers

  • Global Outliers: A data point is considered a global outlier if its values are far outside the entirety of the dataset.
  • Contextual (Conditional) Outliers: if an individual data instance is anomalous in a specific context or condition, then it is termed as a contextual outlier.
  • Collective Outliers: a collection of data points is anomalous with respect to the entire dataset, even though the individual values themselves are not anomalous.

There are two basic methods to identify outliers.

  • Percentile
  • Box Plot

Here we are using the box plot method.

As you can see on the diagram, there are several outliers in the dataset. We need to properly identify them and remove them.

There are three basic ways to handle outliers.

  • Remove all the outliers
  • Replace Outlier Values with a suitable value — Replace them with min or max quantile value
  • Using IQR — We can simply remove the data above and below the limits or we can replace them with the limit value.
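The IQR option from the list above can be sketched as follows. This is only an illustration on made-up pressure readings: the 1.5 × IQR fences are the common convention and an assumption here, not necessarily the limits used in the notebook.

```python
import pandas as pd

# Made-up pressure readings with one impossible value (0.0).
values = pd.Series([1008.0, 1012.0, 1015.0, 1016.0, 1018.0, 1021.0, 0.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR below Q1 and above Q3 (an assumption)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Option A: remove everything outside the fences
removed = values[(values >= lower) & (values <= upper)]

# Option B: keep every row but replace outliers with the fence value
capped = values.clip(lower=lower, upper=upper)

print(len(values), len(removed))
print(capped.min(), capped.max())
```

Removing rows shrinks the dataset, while capping keeps the row count unchanged, which matters when other columns in the same rows are still useful.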

For demonstration purposes, I will remove outliers from the “Pressure (millibars)” column. For the other outliers, you can find the full code in the notebook. Here, pressure has outliers with the value 0. In the real world, atmospheric pressure cannot be 0, so we need to remove those rows.

import warnings
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings("ignore")
fig, axes = plt.subplots(1, 2)
plt.tight_layout(pad=0.2)

# Shape before removing the outliers
print("Before Shape:", new_df.shape)

# Remove the rows where the pressure is 0
df_pressure = new_df.loc[~(new_df['Pressure (millibars)'] == 0)]
print("After Shape:", df_pressure.shape)

# Visualization: vertical box plots before and after the removal
sns.boxplot(y=new_df["Pressure (millibars)"], ax=axes[0])
axes[0].title.set_text("Before")
sns.boxplot(y=df_pressure["Pressure (millibars)"], ax=axes[1])
axes[1].title.set_text("After")
plt.show()

Train-Test Split

If we apply the transformation steps to the whole dataset and split it into train and test sets only just before training the model, we face a serious problem. It is referred to as data leakage: information from the hold-out test set leaks into the data used to train the model. This results in an inaccurate evaluation of how the model performs on fresh data.

… leakage means that information is revealed to the model that gives it an unrealistic advantage to make better predictions. This could happen when test data is leaked into the training set, or when data from the future is leaked to the past. Any time that a model is given information that it shouldn’t have access to when it is making predictions in real time in production, there is leakage.

— Page 93, Feature Engineering for Machine Learning, 2018.

The solution is to split the dataset into train and test sets now, and only then perform the transformation steps, fitting them on the training set.

First, we need to separate our feature set and the target column (“Apparent Temperature (C)”).

df_features= df2.drop('Apparent Temperature (C)', axis=1)
df_target = pd.DataFrame(df2['Apparent Temperature (C)'], columns=["Apparent Temperature (C)"])

Let’s split the dataset into training and testing sets with an 80/20 ratio.

x_train, x_test, y_train, y_test = train_test_split(df_features, df_target, test_size = 0.2, random_state = 101)

Data Transformation — Handle Skewness

Skewness measures how asymmetric the probability distribution of a random variable is, that is, how far it departs from a symmetric shape such as the normal distribution. A distribution with its longer tail on the right side is right-skewed, and one with its longer tail on the left side is left-skewed.

source — https://www.statisticshowto.com/probability-and-statistics/skewed-distribution/

To detect skewness, we plot Q-Q plots and histograms.

stats.probplot(df2["Apparent Temperature (C)"], dist="norm", plot=plt)
plt.show()
df2["Apparent Temperature (C)"].hist()
Q-Q plots and histogram of “Apparent Temperature (C)”

Q-Q plot — Most of the data points lie on the line.

Histogram — Nearly symmetric distribution.

Transformation — No need.

Now let’s analyze the Q-Q plots and histogram of the “Wind Speed (km/h)”.

Q-Q plots and histogram of the “Wind Speed (km/h)”

According to the histogram, the “Wind Speed (km/h)” column is right-skewed. To make the distribution more symmetric, we would normally apply a logarithmic transformation. But this column contains 0 values, and the logarithm of 0 is minus infinity. Therefore, we applied a square root transformation instead.

# right-skewed data
# columns that need the transformation
columns = ['Wind Speed (km/h)']

# create the function transformer object with a square root transformation
square_root_transformation = FunctionTransformer(np.sqrt, validate=True)
# apply the transformation to the train data and write it back in place,
# so the values stay aligned with the shuffled index of x_train
x_train[columns] = square_root_transformation.transform(x_train[columns])
“Wind Speed (km/h)” column after transformation
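To see why the square root is a safer choice than the logarithm for a column with zeros, here is a small sketch on synthetic data. The exponential sample is an assumption standing in for a right-skewed column; it is not the real wind speed data. np.log1p is shown only as another common zero-safe alternative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed, non-negative data that contains exact zeros.
wind = pd.Series(np.concatenate([rng.exponential(scale=10.0, size=1000), [0.0, 0.0]]))

# log(0) is -inf, so the plain logarithm breaks on the zero entries
with np.errstate(divide="ignore"):
    log_values = np.log(wind)

sqrt_values = np.sqrt(wind)     # sqrt(0) is 0, so every value stays finite
log1p_values = np.log1p(wind)   # log(1 + 0) is 0, also finite

print("contains -inf after log:", np.isinf(log_values).any())
print("skewness before:", wind.skew())
print("skewness after sqrt:", sqrt_values.skew())
```

A skewness value closer to 0 after the transformation indicates a more symmetric distribution.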

Let’s take the “Humidity” column now.

Q-Q plots and histogram of the “Humidity”

Since the “Humidity” column is left-skewed, we can apply an exponential or power transformation. Here we use a cube (x³) transformation.

# left-skewed data
# columns that need the transformation
columns = ['Humidity']

# create the function transformer object with a cube (power) transformation
exponential_transformer = FunctionTransformer(lambda x: x ** 3, validate=True)

# apply the transformation to the train data and write it back in place,
# so the values stay aligned with the shuffled index of x_train
x_train[columns] = exponential_transformer.transform(x_train[columns])
“Humidity” column after transformation

Likewise, we can apply the necessary transformation to the other columns to make their distributions more symmetric.

Feature Coding Techniques

In a real-world dataset, not every column is numeric. Some columns are non-numeric (categorical). However, most machine learning algorithms require numerical input and output variables, so we need to convert these columns into numeric columns.

There are two methods to do that.

  • One-hot Encoding — Representation of categorical variables as binary vectors.
  • Integer (Label) Encoding — Converting the labels into the numeric form to transform them into a machine-readable form.
source — https://medium.com/@michaeldelsole/what-is-one-hot-encoding-and-how-to-do-it-f0ae272f1179

The problem with label encoding is that it can lead the model to find trends that do not exist. When we encode the categorical values as 1, 2, 3 and so on, the model treats them as ordered, as if the value 2 were greater than the value 1. In this example, it would treat “Chicken” as greater than “Apple”. That makes label encoding a poor choice for linear models when the categories have no natural order, so we apply the One Hot Encoding method.

But as you can see in the example, one-hot encoding creates a new column (dimension) for each category. This can lead to a huge number of columns and ultimately to the curse of dimensionality. If that happens, we can later apply dimension reduction techniques such as PCA or SVD to reduce the number of dimensions.
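The difference between the two encodings is easy to see on a few rows. This sketch uses pandas helpers on a made-up “Summary” column; it is an illustration and not the encoder used in the article.

```python
import pandas as pd

# Made-up categorical column with three distinct categories.
toy = pd.DataFrame({"Summary": ["Clear", "Foggy", "Overcast", "Clear"]})

# Label encoding: one integer per category, which implies an order
label_codes = toy["Summary"].astype("category").cat.codes
print(list(label_codes))   # [0, 1, 2, 0]

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(toy["Summary"], prefix="Summary")
print(list(one_hot.columns))
print(one_hot.shape)       # (4, 3)
```

With three categories the one-hot version needs three columns instead of one, which is the growth in dimensionality described above.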

I have applied encoding techniques to the two categorical columns in the dataset, which are “Precip Type” and “Summary”.

Now I’ll apply One Hot Encoding for the “Summary” column.

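# create an instance of the one-hot encoder
onehot_encoder = OneHotEncoder(handle_unknown='ignore')

# fit the encoder on the training data only
onehot_encoder.fit(x_train[['Summary']])
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
column_names = onehot_encoder.get_feature_names_out(['Summary'])

# transform the training and testing data, keeping each frame's index so the join aligns
onehot_encoder_train_df = pd.DataFrame(onehot_encoder.transform(x_train[['Summary']]).toarray(), columns=column_names, index=x_train.index)
onehot_encoder_test_df = pd.DataFrame(onehot_encoder.transform(x_test[['Summary']]).toarray(), columns=column_names, index=x_test.index)

x_train = x_train.join(onehot_encoder_train_df)
x_test = x_test.join(onehot_encoder_test_df)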

Then I’ll apply Label Encoding for the “Precip Type” column since it has only two classes (unique values).

# label encoding for Precip Type - Training Data
x_train['Precip Type']=x_train['Precip Type'].astype('category')
x_train['Precip Type']=x_train['Precip Type'].cat.codes

Let’s now see the sample of the dataset.

x_train.sample(n = 5)
Dataset after encoding

Standardize the features

Feature standardization ensures that the values of each feature in the dataset have zero mean and unit variance.

Standardization

We need to apply the standardization for our dataset now.

The important part is that we need to set aside the earlier encoded columns, because we don’t standardize them.

feature_columns = ['Temperature (C)', 'Wind Bearing (degrees)', 'Humidity','Wind Speed (km/h)','Visibility (km)','Pressure (millibars)']
# for feature data
x_train_standardize = x_train[feature_columns].copy()
Before Standardization

Now let’s perform standardization.

# Create the scaler object
scaler = StandardScaler()
# Fit the data to scaler
scaler.fit(x_train_standardize)
x_train_scaled = scaler.transform(x_train_standardize)
df_standardized_x_train = pd.DataFrame(x_train_scaled, columns = x_train_standardize.columns)
After Standardization

Here you can see the x-axis is now scaled with the value 0 in the center. We can standardize the test features with the same scaler that was fitted on the training data, and standardize the target in the same way.
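To stay consistent with the data leakage point made earlier, the test set should be transformed with the statistics learned from the training set, not refitted. Here is a minimal sketch on synthetic data; the array X and the split sizes are assumptions for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic feature matrix: 200 rows, 2 numeric features.
X = rng.normal(loc=20.0, scale=5.0, size=(200, 2))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=101)

scaler = StandardScaler()
# Mean and standard deviation are learned from the training rows only
X_train_scaled = scaler.fit_transform(X_train)
# The same statistics are reused for the test rows; the scaler is not refitted
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

The scaled test set will generally not have exactly zero mean and unit variance, and that is expected, because its statistics were never seen by the scaler.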

Feature Discretization

A discretization transform will map numerical variables onto discrete values. Binning or discretization is the method of transforming a quantitative variable into a set of two or more qualitative bins.

Discretization Approach

  • Supervised Approach — Discretization with decision trees
  • Unsupervised Approach — Equal-width discretization, K-means discretization
  • Custom Discretization

Normally, we perform discretization when a variable has a very wide range with very low frequencies of values. In this dataset, however, the columns have relatively high frequencies, so I did not perform discretization.

If you do need to perform discretization, here is a code sample.

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

discretize_columns = ['column_name']
train_discretize_data = pd.DataFrame(x_train, columns=discretize_columns)

# fit the discretizer to the data
discretizer = KBinsDiscretizer(n_bins=8, encode='ordinal', strategy='uniform')
discretizer.fit(train_discretize_data)

train_discretized = discretizer.transform(train_discretize_data)
# keep the index of x_train so the assignment below aligns row by row
df_train_discretized = pd.DataFrame(train_discretized, columns=discretize_columns, index=x_train.index)
x_train[discretize_columns] = df_train_discretized

df_train_discretized.hist()

Perform Feature Engineering

  • Dimension reduction

Here we are going to find the most significant features needed to train the model. We need to identify the relationships between the features. If we train the model with irrelevant features, it will negatively impact our model.

Take a practical scenario like customer churn prediction in the banking industry, where there are many features to consider, sometimes 100 or more. This is the problem of the curse of dimensionality. To keep only the most relevant information, we can use PCA (Principal Component Analysis) or SVD (Singular Value Decomposition) for feature reduction.

When we do dimensionality reduction, there are two main things that we need to focus on.

  1. The original data should be approximately reconstructable.
  2. Distances between data points should be preserved.

  • Significant Features

To identify the significant features we can use the Correlation Matrix of the transformed dataset.

# correlation matrix without target
correlation_mat = x_train.iloc[:,:7].corr()
plt.figure(figsize=(10,8))
sns.heatmap(correlation_mat, annot = True)
plt.title("Correlation matrix for features")
plt.xlabel("weather features")
plt.ylabel("weather features")
plt.show()

According to the matrix, there are several comparatively high correlations.

  1. Temperature and Humidity
  2. Temperature and Precip Type
  3. Temperature and Visibility
  4. Humidity and Visibility

Any two highly correlated features can be redundant, so we could drop one of them. But here we don’t know the actual relationship between these features. Removing a feature is highly subjective, and we need proper domain knowledge to do so.
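Reading pairs off a heatmap by eye gets harder as the number of features grows. One possible way to list them programmatically is sketched below on synthetic data; the column names a, b, c and the 0.8 threshold are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),  # nearly a copy of "a"
    "c": rng.normal(size=500),                 # independent of both
})

# Absolute correlations, so strong negative relationships are caught too
corr = toy.corr().abs()
threshold = 0.8  # hypothetical cut-off, tune it for your data

# Walk the upper triangle only, so each pair is reported once
cols = corr.columns
high_pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]
print(high_pairs)
```

The list only flags candidates. Whether one feature of a pair can really be dropped still depends on domain knowledge and on its relationship with the target.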

Here we can see that “Apparent Temperature” and “Humidity” have a high correlation between them. So we can identify humidity as a significant feature in the dataset.

From these two diagrams, we can conclude that some features are related to other features and are also highly related to the target column. So we cannot remove them blindly just because they have a high correlation with another feature.

  • PCA — Principal Component Analysis

PCA is a dimension reduction technique that projects the data onto new axes, the eigenvectors of the data’s covariance matrix, ordered by how much variance they capture. It does not select a subset of the original features. Each principal component is a linear combination of all of them, so dropping the low-variance components keeps most of the information in a small number of new dimensions.

First, we need to apply PCA to our dataset and inspect the explained variance ratios of the PCA object. From them, we can find how many dimensions should be kept for the model to be trained.

pca = PCA()
pca.fit(x_train)
# check how many components should remain
pca.explained_variance_ratio_
Variance Ratio

The explained variance ratios are sorted in descending order, so we keep the leading components and remove the rest. Here we keep as many components as are needed to explain at least 95% of the variance.

Accordingly, I’ll keep 8 components (dimensions) and remove the rest.
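The number of components for a 95% target can also be computed instead of read off the array. Below is a sketch on synthetic data in which 10 features are driven by 3 hidden factors; the data and the variable names are assumptions for the illustration. scikit-learn's PCA also accepts a float between 0 and 1 as n_components and then picks the count itself.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 10 observed features generated from 3 latent factors plus small noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# Fit a full PCA and accumulate the explained variance ratios
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_components = int(np.argmax(cumulative >= 0.95) + 1)
print(n_components, cumulative[n_components - 1])

# Alternative: let PCA choose the count for a 95% variance target
pca_95 = PCA(n_components=0.95).fit(X)
print(pca_95.n_components_)
```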

pca = PCA(n_components=8)
pca.fit(x_train)
x_train_pca = pca.transform(x_train)

So we were able to reduce from 33 dimensions to 8 dimensions by preserving most of the information of the dataset.

Train model

Here we are using the Multiple Linear Regression model to train on our dataset, because we need to predict one response variable from multiple explanatory variables (features). The model fits a linear equation to the observed data, choosing the coefficients that minimize the sum of squared differences between the predicted and the actual values.

source — https://www.scribbr.com/statistics/multiple-linear-regression/
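Written out, the fitted equation and the least squares criterion look as follows. The notation is an assumption introduced for this sketch: p is the number of features (here, the PCA components), n is the number of training rows, and x_{ij} is the value of feature j in row i.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p,
\qquad
(\hat{\beta}_0, \dots, \hat{\beta}_p)
  = \arg\min_{\beta_0, \dots, \beta_p}
    \sum_{i=1}^{n} \bigl( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \bigr)^2
```

The coefficients β_1 to β_p correspond to the weight factors that a fitted LinearRegression object exposes as coef_, and β_0 corresponds to intercept_.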

Let’s apply the Multiple Linear Regression model to train our dataset.

lm = linear_model.LinearRegression()
model = lm.fit(x_train_pca,y_train)

Prediction

Let’s check the predicted values for our test dataset.

predictions = lm.predict(x_test_pca)
y_hat = pd.DataFrame(predictions, columns=["Predicted Apparent Temperature (C)"])
y_test values and y_hat values

Let’s visualize this in a plot.

import matplotlib.pyplot as plt
# plot by position; y_test still carries the shuffled index from the split
plt.plot(y_hat[:150].values, label = "Pred")
plt.plot(y_test[:150].values, label = "Actual")
plt.xlabel('x - axis')
# Set the y axis label of the current axis.
plt.ylabel('y - axis')
# Set a title of the current axes.
plt.title('Predictions vs Actual')
# show a legend on the plot
plt.legend()
# Display a figure.
plt.show()
Predictions vs Actual

Model Evaluation

When we predict values using our model, there can be an error between the actual and the predicted value. As discussed earlier, the regression line is fitted to minimize the deviations from the data points. So here we take that deviation as the error value and use it to measure the model’s accuracy.

There are several measures available.
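For reference, the three measures used in this section are commonly defined as below. The symbols are assumptions of this sketch: y_i is an actual value, ŷ_i the corresponding prediction, ȳ the mean of the actual values, and n the number of test rows.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

RMSE is in the same unit as the target (degrees Celsius here), while R² is unitless and equals 1 for a perfect fit.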

Let’s check the MSE and RMSE for our model.

#Mean Squared Error
from sklearn.metrics import mean_squared_error
mse_for_model=mean_squared_error(y_test, y_hat)
print("Mean squared error : "+str(mse_for_model))

#Root Mean Squared Error
from math import sqrt
rmsq_for_model = sqrt(mean_squared_error(y_test, y_hat))
print("Root mean squared error : "+str(rmsq_for_model))
Error calculations for the model

Let’s check the coefficient of determination R² of the prediction.

score_pca=lm.score(x_test_pca,y_test)
print("Coefficient of determination R² of the prediction : "+str(score_pca * 100)+"%")

Now let’s analyze the weight factors.

#Evaluating the Weight factors of the model
print(lm.coef_)

As you can see, the weight factors are relatively small. Very large weight factors can be a sign that the model is overfitted. If that happens, we need to recheck our preprocessing steps for any mistakes.

Let’s perform K-fold cross-validation to evaluate the overall accuracy score.

# Necessary imports:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
# Recombine train and test data (DataFrame.append was removed in pandas 2.0, so use pd.concat)
x = pd.concat([pd.DataFrame(x_train_pca), pd.DataFrame(x_test_pca)]).reset_index(drop=True)
y = pd.concat([y_train, y_test]).reset_index(drop=True)
# Perform 6-fold cross validation
scores = cross_val_score(model, x, y, cv=6)
print("Cross-validated scores:", scores)
predictions = cross_val_predict(model, x, y, cv=6)
# r2_score returns the coefficient of determination (R²)
accuracy = metrics.r2_score(y, predictions)
print("Cross-Predicted Accuracy:", accuracy)

Our model achieved a cross-validated R² score of over 98.8%.
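One caveat is that the scaler and PCA above were fitted before cross-validation, so each fold has indirectly seen statistics from the other folds. A possible alternative, sketched here on synthetic data, is to wrap the steps in a scikit-learn pipeline so they are refitted inside every fold. The data, the coefficients and the pipeline layout are assumptions for the illustration and do not reproduce the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic regression problem: 300 rows, 6 features, linear target with small noise.
X = rng.normal(size=(300, 6))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0, 3.0]) + 0.1 * rng.normal(size=300)

# Scaling, PCA and the regression are fitted together on each training fold
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())

# For a regressor, the default score of cross_val_score is R²
scores = cross_val_score(pipeline, X, y, cv=6)
print("Cross-validated R² scores:", scores)
print("Mean R²:", scores.mean())
```

Because every preprocessing step lives inside the pipeline, the held-out fold never influences the scaler or the PCA directions, which gives a more honest estimate.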

Resources

  • Complete Co-lab notebook. (With preloaded data set)
  • Weather History Dataset — Kaggle

Conclusion

We have covered preprocessing, feature engineering, and PCA, up to model creation, prediction, and accuracy testing, along with the core concepts. In the end, the model achieved an R² score of 98.8%.

I would like to express my special appreciation and thanks to Dr. Subha Fernando (Senior Lecturer at the University of Moratuwa) for inspiring me to write this article.

Thank you very much.
