Modelling Metabolic Equivalents (METs) using motion, heart rate, and demographic features¶
First, let us outline the necessary domain knowledge:
One metabolic equivalent (MET) is defined as the amount of oxygen consumed while sitting at rest and is equal to 3.5ml of oxygen per kg of body weight per minute. (Jetté et al., 1990)
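In other words, a measured rate of oxygen uptake can be converted to METs by dividing by this resting rate:
$$
\text{METs} = \frac{\dot{V}\text{O}_2 \;\left(\text{ml}\cdot\text{kg}^{-1}\cdot\text{min}^{-1}\right)}{3.5}
$$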
Second, let us discuss the data we are going to be using:
The dataset is composed of 285 features, the vast majority of which are entirely opaque. That opacity creates a potential trade-off between explainability and accuracy. However, it also makes those features natural candidates for dimensionality reduction, with no loss of explainability, precisely because they were never interpretable to begin with.
Importing and cleaning the data¶
Upon importing the data it is clear that a significant amount of data points are missing, represented by the value -99999. A script has been written to remove features that are sparser than is deemed acceptable. In this instance, columns missing more than 20% of their values (i.e. less than 80% complete) are dropped, although this threshold is somewhat arbitrary. For columns that have missing data but fall within the threshold, the script imputes the missing values: numeric data is mean-imputed and non-numeric data is mode-imputed.
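The scripts module referenced below is project-specific and its source is not shown here; a minimal sketch of the kind of logic scripts.cleaning is assumed to apply (not the actual implementation) might look like this:

import pandas as pd
import numpy as np

def cleaning_sketch(df, missing=-99999, drop_threshold=0.2):
    """Assumed behaviour: drop overly sparse columns, impute the rest."""
    df = df.replace(missing, np.nan)                  # treat the sentinel value as missing
    missing_frac = df.isna().mean()                   # fraction of missing values per column
    df = df.loc[:, missing_frac <= drop_threshold]    # drop columns above the threshold
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].mean())     # mean-impute numeric columns
            else:
                df[col] = df[col].fillna(df[col].mode()[0])  # mode-impute the rest
    return df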
import pandas as pd
import scripts  # project-specific helpers (source not shown)

df = pd.read_csv('training_data.csv')
# drop overly sparse columns (> 20% missing) and impute the remainder
cleaned_df = scripts.cleaning(df, missing=-99999, drop_threshold=0.2)
Your original data was missing 12085 values across 34 columns. These columns are: Index(['accel__orientMag__acf__localMaxsNeg__mean', 'accel__prjStream__argLocalMinsNeg_12f1_mv__range_0_100', 'accel__prjStream__localMinsNeg_12f1_mv__areaUnderRangeABS_mid_100', 'accel__prjStream__localMaxsNeg_12f1_mv__last', 'accel__prjStream__acf__argLocalMaxsNeg__diffEdgePad__mean', 'accel__prjStream__localMinsNeg_11f1_mv__max', 'accel__prjStream__argLocalMinsNeg_13f1_mv__range_0_100', 'accel__prjStream__localMinsNeg_15f1_mv__mean', 'accel__prjStream__localMaxsNeg_11f1_mv__mean', 'accel__prjStream__acf__localMaxsNeg__max', 'accel__prjStream__localMinsNeg_12f1_mv__last', 'accel__prjStream__localMinsNeg_15f1_mv__last', 'accel__orientMag__acf__argLocalMaxsNeg__diffEdgePad__mean', 'accel__prjStream__localMaxsNeg_11f1_mv__first', 'accel__orientMag__acf__localMaxsNeg__max', 'accel__prjStream__acf__localMaxsNeg__mean', 'accel__prjStream__argLocalMinsNeg_14f1_mv__range_0_100', 'accel__prjStream__localMinsNeg_12f1_mv__length', 'accel__prjStream__localMaxsNeg__peakinessNeg_12f1__scaledRange_95_100', 'accel__prjStream__localMaxsNeg_14f1_mv__areaUnderRangeABS_mid_100', 'accel__prjStream__argLocalMaxsNeg_15f1_mv__range_0_mid', 'accel__prjStream__localMinsNeg_11f1_mv__first', 'accel__prjStream__localMinsNeg_11f1_mv__mean', 'accel__orientMag__acf__argLocalMaxsNeg_1_0__max', 'accel__prjStream__localMinsNeg_11f1_mv__mid', 'accel__prjStream__acf__argLocalMaxsNeg_1_0__max', 'accel__prjStream__acf__argLocalMaxsNeg_1_0__min', 'accel__prjStream__acf__localMaxsNeg__varSmall', 'accel__prjStream__localMinsNeg_15f1_mv__mid', 'accel__prjStream__localMinsNeg_13f1_mv__mean', 'accel__prjStream__localMaxsNeg__peakinessNeg_12f1__max', 'accel__prjStream__localMaxsNeg_13f1_mv__first', 'accel__prjStream__argLocalMaxsNeg_11f1_mv__range_0_mid', 'accel__prjStream__localMaxsNeg_12f1_mv__mean'], dtype='object') As a result of cleaning your data's shape has changed from: (943, 285) to (943, 258). Columns: ['accel__prjStream__localMinsNeg_11f1_mv__max', 'accel__prjStream__localMaxsNeg_11f1_mv__mean', 'accel__prjStream__localMaxsNeg_11f1_mv__first', 'accel__prjStream__localMinsNeg_11f1_mv__first', 'accel__prjStream__localMinsNeg_11f1_mv__mean', 'accel__prjStream__localMinsNeg_11f1_mv__mid', 'accel__prjStream__argLocalMaxsNeg_11f1_mv__range_0_mid'] have undergone imputation.
Splitting the data on uid¶
Another script splits the data into train and test sets before any analysis is done, to allow rigorous model comparison and prevent overfitting. The script splits by the column uid, so that the model can be tested not just on unseen data, but on unseen individuals. While no information is given on how this model might be used, it seems a safe assumption that it would be applied to unseen individuals rather than to the individuals it was trained on. The test set is roughly 20% of the initial dataset, give or take a little due to the grouping. However, it contains 10 unique individuals, just shy of half the individuals in the data. This hopefully provides the most challenging test set possible, which again addresses concerns about overfitting given the limited quantity of data available.
X_train, X_test, y_train, y_test = scripts.splitting(cleaned_df,'mets',0.2,'uid')
Columns uid contained 22 unique values. After splitting, 12 are in the training data and 10 are in the testing data. The training data has a length of 730 and the testing data has a length of 213.
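scripts.splitting is likewise project code; a group-aware split of this kind could be sketched with sklearn's GroupShuffleSplit (an assumed equivalent, not the actual implementation):

from sklearn.model_selection import GroupShuffleSplit

def splitting_sketch(df, target, test_size, group_col, seed=0):
    """Assumed behaviour: hold out whole individuals (groups) for testing."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df[group_col]))
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    return (train.drop(columns=target), test.drop(columns=target),
            train[target], test[target])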
Visualising and fitting PCA¶
Now that the data is split, it can be safely inspected without any danger of introducing bias or overfitting. However, with 258 features still present, it would be very difficult to learn anything from plots. The most obvious targets for dimensionality reduction are the accel features, closely followed by the hr features. The accel features are entirely opaque and make up the bulk of the total features. The hr features are not opaque; however, they are all designed to communicate one thing: the individual's heart rate during the epoch. As a result, they are likely to be easily and significantly reduced.
import numpy as np
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA

# fit full PCAs to the hr block (columns 9:24) and the accel block (columns 24:)
hr_pca = PCA().fit(X_train.iloc[:, 9:24])
accel_pca = PCA().fit(X_train.iloc[:, 24:])

# scree plots of the first ten components of each block
fig, axs = plt.subplots(1, 2)
fig.set_size_inches(10, 5)
axs[0].bar(np.arange(0, 10), hr_pca.explained_variance_ratio_[:10])
axs[1].bar(np.arange(0, 10), accel_pca.explained_variance_ratio_[:10])
axs[0].set_title('HR')
axs[1].set_title('Accel')
plt.show()
The scree plots above show that the vast majority of the information within the hr columns can be conveyed in one component. For the accel columns, the story is a little different. We have gone with three components; this is somewhat arbitrary, and a case could certainly be made for either two or four. With the chosen components, more than 90% of the information is retained.
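The choice can be sanity-checked by looking at the cumulative explained variance of the fitted PCAs (a small check, not part of the original notebook):

# cumulative variance explained by the first few components of each block
print(np.cumsum(hr_pca.explained_variance_ratio_[:3]))
print(np.cumsum(accel_pca.explained_variance_ratio_[:5]))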
from sklearn.decomposition import PCA

hr_pca = PCA(1, random_state=0)
accel_pca = PCA(3, random_state=0)

def pca_transform(df):
    global hr_pca
    global accel_pca
    # crude fit/transform switch: only the training set has more than 700 rows,
    # so the PCAs are fitted on it and any smaller (test) set is only transformed
    if len(df) > 700:
        hr = hr_pca.fit_transform(df.iloc[:, 9:24])
        accel = accel_pca.fit_transform(df.iloc[:, 24:])
    else:
        hr = hr_pca.transform(df.iloc[:, 9:24])
        accel = accel_pca.transform(df.iloc[:, 24:])
    # keep the first nine (interpretable) columns and append the PCA components
    df = df.iloc[:, :9].reset_index()
    frames = [df, pd.DataFrame(hr), pd.DataFrame(accel)]
    df = pd.concat(frames, axis=1).drop(columns='index')
    df.columns = ['uid', 'steps', 'sex', 'age', 'height', 'weight', 'stride_length',
                  'mph', 'speed_mets', 'hr_pca', 'accel_pca_1', 'accel_pca_2', 'accel_pca_3']
    return df
X_train = pca_transform(X_train)
Engineering a work feature and dummy-coding sex¶
Before the features are plotted against the target, a new feature, work, is engineered, and the feature sex is dummy-coded to make it compatible with most modelling libraries. The work feature is engineered in line with the physics textbook formula:
$$
\text{Work} = \text{Weight} \times \text{Distance}
$$
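Here the distance covered during an epoch is approximated from the step count, which is what the function below computes:
$$
\text{Distance} \approx \text{Steps} \times \text{Stride length}
$$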
def work_and_sex(df):
    # work = weight * distance, with distance approximated as steps * stride_length
    df['work'] = df['weight'] * df['steps'] * df['stride_length']
    # dummy-code sex: 1 for Female, 0 otherwise
    df['sex'] = (df['sex'] == 'Female').astype('int')
    return df
X_train = work_and_sex(X_train)
Bivariate plots¶
The features are scatter-plotted against the target, mets. Some appear uncorrelated, some appear linearly correlated, and others appear to have a more complicated relationship to mets.
fig, axs = plt.subplots(4, 4)
# scatter each of the 14 features against the target
for ax in range(0, 14):
    axs[int(ax / 4), ax % 4].scatter(X_train.iloc[:, ax], y_train, alpha=0.2)
    axs[int(ax / 4), ax % 4].title.set_text(X_train.columns[ax])
fig.set_size_inches(14, 14)
fig.tight_layout()
plt.show()
Adding polynomial features¶
Two features, steps and speed_mets, appear to require polynomial terms in order to capture their non-linear relationship to mets. In the case of steps, this is an attempt to capture what appears to be a logarithmic relationship to the target. In the case of speed_mets, it is an attempt to capture the plateau as speed_mets reaches around 6.
def poly(df):
    # quadratic terms for the two visibly non-linear features
    df['steps^2'] = df['steps'] ** 2
    df['speed_mets^2'] = df['speed_mets'] ** 2
    return df
X_train = poly(X_train)
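Given that the steps relationship looks more logarithmic than quadratic, a log transform is a plausible alternative worth comparing. The line below is only a hypothetical illustration and is not added to the feature set used by the models that follow:

# hypothetical alternative feature (not applied): log of steps, valid since steps >= 0
log_steps_preview = np.log1p(X_train['steps'])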
Wrapping it all in a sklearn.Pipeline¶
Having written all the functions needed to transform the raw data into model-ready training data, they are wrapped up in a sklearn.Pipeline so that exactly the same cleaning, pre-processing, and feature engineering can be applied to the test set. A unit test (the assert below) then checks that the pipeline does exactly the same thing as the manual processing steps taken above.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# wrap the manual processing steps as transformers
work_transformer = FunctionTransformer(work_and_sex)
polynomial_features_transformer = FunctionTransformer(poly)
pca_transformer = FunctionTransformer(pca_transform)

pipeline = Pipeline(steps=[('pca', pca_transformer),
                           ('add_work', work_transformer),
                           ('add_polynomial_features', polynomial_features_transformer)])

# re-split the raw data and confirm the pipeline reproduces the manual result
X_train_, X_test_, y_train_, y_test_ = scripts.splitting(cleaned_df, 'mets', 0.2, 'uid')
X_train_ = pipeline.fit_transform(X_train_)
assert np.all(X_train == X_train_)
Columns uid contained 22 unique values. After splitting, 12 are in the training data and 10 are in the testing data. The training data has a length of 730 and the testing data has a length of 213.
First pass prediction¶
A first pass at modelling is made using sklearn's LinearRegression to provide a benchmark against which later models will be compared. There is no hyperparameter tuning, as the model is too basic to have any parameters worth tuning. The benchmark model achieves a mean absolute error of 0.71 and an R² of 0.45.
X_train, X_test, y_train, y_test = scripts.splitting(cleaned_df,'mets',0.2,'uid')
X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)
Columns uid contained 22 unique values. After splitting, 12 are in the training data and 10 are in the testing data. The training data has a length of 730 and the testing data has a length of 213.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

reg = LinearRegression()
reg.fit(X_train.drop('uid', axis=1), y_train)    # uid is an identifier, not a feature
y_pred = reg.predict(X_test.drop('uid', axis=1))

# benchmark metrics: mean absolute error and R²
errors = y_test - y_pred
mae = np.mean(abs(errors))
mae, r2_score(y_test, y_pred)
(0.7059383245515215, 0.45172973983367504)
XGBoost with hyper parameter tuning¶
Finally, XGBRegressor is selected as the model of choice for its speed and performance. GroupKFold cross-validation is used in conjunction with RandomizedSearchCV for hyperparameter tuning, again splitting by uid. It should be noted that the eval_metric for this process was mean absolute error, the performance metric used for the benchmark. This results in the following hyperparameters:
- 'subsample': 0.75
- 'reg_lambda': 0
- 'reg_alpha': 0.5
- 'n_estimators': 500
- 'min_child_weight': 1
- 'max_depth': 3
- 'learning_rate': 0.01
- 'gamma': 0.2
- 'colsample_bytree': 0.5
The non-zero reg_alpha means that the model is applying L1 regularisation, a form of feature selection. However, the alpha was low enough that it only entirely eliminated two features.
The result of this tuning is a more accurate model, which achieves a mean absolute error of 0.69 and an R² of 0.54. While these results are better than the benchmark, they are not as good as one might have hoped going into this process.
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV, GroupKFold

# mean absolute error as the training eval metric, matching the benchmark metric
model = xgb.XGBRegressor(eval_metric='mae')

param_grid = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 6, 9],
    'subsample': [0.5, 0.75, 1],
    'colsample_bytree': [0.5, 0.75, 1],
    'min_child_weight': [1, 3, 5],
    'gamma': [0, 0.1, 0.2],
    'reg_alpha': [0, 0.1, 0.5],
    'reg_lambda': [1, 0.5, 0]
}

# leave-one-individual-out cross-validation: one fold per unique uid
group_kfold = GroupKFold(n_splits=len(X_train['uid'].unique()))
search = RandomizedSearchCV(model, param_grid, cv=group_kfold, random_state=42)
search.fit(X_train.drop('uid', axis=1), y_train, groups=X_train['uid'])
print("The best hyperparameters are ", search.best_params_)
The best hyperparameters are {'subsample': 0.75, 'reg_lambda': 0, 'reg_alpha': 0.5, 'n_estimators': 500, 'min_child_weight': 1, 'max_depth': 3, 'learning_rate': 0.01, 'gamma': 0.2, 'colsample_bytree': 0.5}
# refit on the training set using all of the tuned hyperparameters
model = xgb.XGBRegressor(**search.best_params_, eval_metric='mae')
model.fit(X_train.drop('uid', axis=1), y_train)
y_pred = model.predict(X_test.drop('uid', axis=1))

errors = y_test - y_pred
mae = np.mean(abs(errors))
mae, r2_score(y_test, y_pred)
(0.6897050388393958, 0.5404554486038424)
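To verify the earlier claim that the L1 penalty entirely eliminated two features, the zeroed importances can be counted directly (a small check that was not part of the original analysis):

# features whose importance the fitted booster reduced to exactly zero
zeroed = X_train.drop('uid', axis=1).columns[model.feature_importances_ == 0]
print(len(zeroed), list(zeroed))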
Final Model¶
import matplotlib.ticker as mtick

# feature importances of the tuned model
plt.barh(X_train.drop('uid', axis=1).columns, model.feature_importances_, color='green')
plt.title('Model Feature Importance')
plt.gca().xaxis.set_major_formatter(mtick.PercentFormatter(xmax=1))
plt.show()

# predicted vs true METs across the test set
plt.plot(np.arange(len(y_pred)), y_pred, label='XGB Predicted METs', color='cyan')
plt.plot(np.arange(len(y_test)), y_test, label='True METs', color='black')
plt.title('Predicted METs vs True METs')
plt.legend()
plt.show()
The feature importance plot above shows that hr_pca, accel_pca_1, and work were all useful features. That said, by far the two most important features were speed_mets and mph.
The second plot shows predicted METs against true METs. It is not the most interpretable graph, but no clear bias is visible, i.e. the model does not consistently under-predict or over-predict METs. If anything it leans slightly towards over-predicting, but mostly the graph shows that the XGB model has failed to capture the full variance of the data.
The model's overall lacklustre performance could be due to a number of things:
For one, the dataset was quite small, containing data from only 22 individuals. Ideally one would train a model on a much larger dataset, which would also allow for more effective cross-validation.
Additionally, greater compute would have allowed a more comprehensive GridSearchCV (a sketch of such a search is given below), although only marginal improvements in accuracy should be expected from this.
Finally, more domain knowledge might have allowed a greater level of thoughtful feature engineering.
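For completeness, a narrowed exhaustive search of that kind might look like the following. This is only a sketch: the grid values are assumptions centred on the random-search result, and the fit is left commented out because of its cost.

from sklearn.model_selection import GridSearchCV

# hypothetical narrowed grid centred on the best random-search values
narrow_grid = {
    'n_estimators': [250, 500, 750],
    'learning_rate': [0.005, 0.01, 0.02],
    'max_depth': [2, 3, 4],
    'reg_alpha': [0.25, 0.5, 0.75],
}
grid_search = GridSearchCV(xgb.XGBRegressor(eval_metric='mae'),
                           narrow_grid,
                           cv=group_kfold,
                           scoring='neg_mean_absolute_error')
# expensive: run only with more compute available
# grid_search.fit(X_train.drop('uid', axis=1), y_train, groups=X_train['uid'])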