
Predicting Fuel Efficiency with Machine Learning



From Linear Regression to a Tuned XGBoost Pipeline

Fuel efficiency is one of the most visible performance metrics in the automotive world. It directly affects operating costs, environmental impact, and the design trade-offs manufacturers make long before a car reaches the market. Yet predicting miles per gallon (MPG) accurately from early design and market specifications is not trivial: the relationships among engine configuration, vehicle size, power, drivetrain, and efficiency are highly nonlinear and rich in interactions.


This project explores how far we can push predictive accuracy using a structured modeling approach. Starting from a simple linear baseline and progressing through eleven increasingly sophisticated models, the final result is a tuned XGBoost pipeline that predicts MPG with a mean absolute percentage error of about 5.4%, roughly a 33% improvement over the baseline linear regression.

Problem Framing and Dataset

The task is to predict fuel economy (miles per gallon) for 300 unseen passenger vehicles using a labeled dataset of 1,500 cars. The data is provided as an Excel workbook with four sheets: a legend, a training set, a test set, and a submission sheet for final predictions.


Each observation represents a passenger car with 24 predictors and one response variable, MPG. The predictors cover both engineering and market characteristics:

  • Technical variables such as weight, engine displacement, horsepower, length, width, wheelbase, cylinder count, drivetrain, transmission type, and accessories.

  • Market variables such as price, markup, and sales volume.

  • Categorical attributes such as vehicle origin, domestic or European market flags, and drivetrain indicators.


In the test set, MPG is hidden and marked with a question mark. The goal is to build a model that generalizes well enough to accurately forecast these missing values.

Evaluation Metric and Baseline


Model performance is evaluated using mean absolute percentage error (MAPE). This metric measures, on average, the percentage deviation of predictions from true MPG values. It is easy to interpret and particularly suitable when absolute scale matters less than relative error.
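In other words, MAPE averages |actual − predicted| / actual across vehicles. A minimal sketch of the computation (scikit-learn's mean_absolute_percentage_error is an equivalent built-in; the example values are illustrative):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, expressed as a percentage."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Example: predictions of 20 and 30 mpg against true values of 22 and 28 mpg
print(mape([22, 28], [20, 30]))  # ~8.1
```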


As a baseline, I fitted a multiple linear regression after removing obvious non-informative identifiers. This model achieved a validation MAPE of approximately 7.9%, which already captures some of the main trends in the data but leaves significant room for improvement.
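A minimal sketch of that baseline, assuming the training sheet is loaded into a DataFrame df with an MPG column, identifiers already dropped, and categoricals one-hot encoded (column names here are assumptions, not necessarily the exact ones in the workbook):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

X = pd.get_dummies(df.drop(columns=["MPG"]))   # one-hot encode categoricals
y = df["MPG"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

baseline = LinearRegression().fit(X_train, y_train)
val_mape = mean_absolute_percentage_error(y_val, baseline.predict(X_val))
print(f"Validation MAPE: {val_mape:.1%}")
```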

Exploratory Data Analysis

Before modeling, it is essential to understand the structure of the target variable and its relationship with key predictors.


Distribution of MPG

MPG in the training data ranges from roughly 10 to 47, with a mean around 22 mpg. Most vehicles fall between 18 and 25 mpg, which indicates that the dataset is centered on typical passenger cars rather than extreme economy vehicles.



MPG and Vehicle Size

A scatterplot of MPG versus vehicle weight shows a strong negative relationship. Heavier cars consistently achieve lower fuel efficiency. Similar patterns appear when comparing MPG with displacement, horsepower, and body dimensions.



Grouping by cylinder count reveals another clear pattern: three- and four-cylinder cars are the most efficient, while six- and eight-cylinder vehicles are substantially less efficient on average. This aligns with basic engineering intuition and sets the stage for feature engineering later.



Early Models: Linear and Regularized Approaches (Models 1–4)


The first set of models focused on linear structure and regularization.

  • Model 1: Multiple linear regression baseline; MAPE ≈ 7.9%.

  • Model 2: Linear regression with log-transformed weight and displacement; MAPE worsened to about 9.0%.

  • Model 3: Ridge regression with many predictors; MAPE increased further to around 11.4%.

  • Model 4: Ridge regression with engineered features; a slight recovery, but still around 8.5%.


The takeaway from these experiments is straightforward: linear and regularized linear models hit a performance ceiling. Even with feature engineering, they fail to capture the nonlinear interactions that dominate fuel efficiency.

Feature Engineering as Applied Vehicle Physics

Feature engineering was a turning point in this project. Instead of relying only on raw variables, I translated engineering intuition into numeric features:

  • Power and size ratios such as horsepower per weight, displacement per cylinder, and weight per cylinder.

  • Log-transformed variables for weight, displacement, horsepower, and price.

  • Size and density proxies combining length, width, and wheelbase.

  • Interaction terms like cylinders multiplied by displacement.

  • Temporal features capturing vehicle age and binned model years.


The goal was to encode physical relationships explicitly so nonlinear models could exploit them more effectively.
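As a sketch, here is how a few of these features can be built in pandas (the raw column names are assumptions about the dataset, not the exact ones used):

```python
import numpy as np
import pandas as pd

def engineer_features(df):
    """Add physics-inspired features; column names are illustrative."""
    out = df.copy()
    out["hp_per_weight"]    = out["Horsepower"] / out["Weight"]
    out["disp_per_cyl"]     = out["Displacement"] / out["Cylinders"]
    out["weight_per_cyl"]   = out["Weight"] / out["Cylinders"]
    out["log_weight"]       = np.log(out["Weight"])
    out["log_displacement"] = np.log(out["Displacement"])
    out["cyl_x_disp"]       = out["Cylinders"] * out["Displacement"]   # interaction term
    out["footprint"]        = out["Length"] * out["Width"]             # size proxy
    out["year_bin"]         = pd.cut(out["Year"], bins=5, labels=False)  # binned model years
    return out
```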

Tree-Based Models and Gradient Boosting (Models 5–10)

Moving to tree-based gradient boosting dramatically improved performance.

  • Model 5: Basic XGBoost with engineered features reduced MAPE to about 5.6%.

  • Model 6: XGBoost with a richer engineered feature set achieved around 5.4%.

  • Model 7: Ensemble of XGBoost models offered no real gain.

  • Model 8: CatBoost with full features reached about 5.9%.

  • Models 9–10: Deeper XGBoost with early stopping and extra features remained in the 5.6–5.8% range.



These results showed that a well-tuned but relatively simple XGBoost pipeline outperformed more complex ensembles and alternative boosting algorithms.
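A minimal sketch of the basic XGBoost setup in the spirit of Models 5–6, reusing the engineer_features helper above (the hyperparameters here are placeholders, not the tuned values):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X_fe = pd.get_dummies(engineer_features(df.drop(columns=["MPG"])))
y = df["MPG"]

model = XGBRegressor(
    n_estimators=500,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    random_state=42,
)
scores = cross_val_score(model, X_fe, y, cv=10,
                         scoring="neg_mean_absolute_percentage_error")
print(f"CV MAPE: {-scores.mean():.1%}")
```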

Final Model: Tuned XGBoost Pipeline (Model 11)

The final model formalizes the best ideas into a clean, reproducible pipeline.

  • 44 engineered features including ratios, log transforms, interaction terms, and temporal bins.

  • Preprocessing handled via a ColumnTransformer with OrdinalEncoder for categorical variables.

  • Hyperparameters tuned using RandomizedSearchCV with repeated 10-fold cross-validation.


The best configuration used depth-6 trees, a learning rate of 0.03, strong subsampling, and moderate L1 regularization.
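A condensed sketch of that pipeline follows. The categorical column list and parameter ranges are illustrative assumptions; the tuned values quoted above are the ones the search settled on:

```python
from scipy.stats import randint, uniform
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBRegressor

X = engineer_features(df.drop(columns=["MPG"]))   # 44 engineered features in the project
y = df["MPG"]

cat_cols = ["Origin", "Drivetrain", "Transmission"]   # illustrative names

preprocess = ColumnTransformer(
    [("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), cat_cols)],
    remainder="passthrough",   # numeric features pass through unchanged
)

pipe = Pipeline([("prep", preprocess), ("model", XGBRegressor(random_state=42))])

param_dist = {
    "model__max_depth":        randint(3, 9),
    "model__learning_rate":    uniform(0.01, 0.09),   # samples from [0.01, 0.10]
    "model__n_estimators":     randint(300, 1500),
    "model__subsample":        uniform(0.6, 0.4),
    "model__colsample_bytree": uniform(0.6, 0.4),
    "model__reg_alpha":        uniform(0.0, 1.0),     # L1 regularization
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_dist,
    n_iter=100,
    scoring="neg_mean_absolute_percentage_error",
    cv=RepeatedKFold(n_splits=10, n_repeats=3, random_state=42),
    n_jobs=-1,
    random_state=42,
)
search.fit(X, y)
print(f"Best CV MAPE: {-search.best_score_:.2%}")
```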

The final performance:

  • 10-fold CV MAPE: 5.41% ± 0.52%

  • Approximately 33% lower error than the linear regression baseline.



Feature Importance and Interpretation

Feature importance analysis confirms that the model relies on physically meaningful signals.


The most important feature is the interaction between cylinder count and displacement, followed by cylinders and displacement individually. Log-transformed weight and displacement, year bins, and origin also contribute meaningfully.
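One way to pull these importances out of the fitted pipeline, sketched against the search object from the previous snippet (get_feature_names_out prefixes passthrough columns, so the printed names may differ slightly):

```python
import pandas as pd

best = search.best_estimator_
feature_names = best.named_steps["prep"].get_feature_names_out()
importances = best.named_steps["model"].feature_importances_

top = (pd.Series(importances, index=feature_names)
         .sort_values(ascending=False)
         .head(10))
print(top)   # cyl_x_disp, Cylinders, Displacement, log_weight, ... in this project
```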



This reinforces a consistent story: engine size, cylinder configuration, and overall vehicle scale are the dominant drivers of fuel efficiency.

Diagnostics, Predictions, and Practical Implications

A diagnostic plot of predicted versus actual MPG for the training data shows points lying close to the 45-degree line, with no strong residual patterns. This suggests the model captures the main structure of the data without obvious systematic bias.
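The diagnostic itself is a short matplotlib sketch, again assuming the fitted search object and training data from above:

```python
import matplotlib.pyplot as plt

y_pred = search.best_estimator_.predict(X)

plt.scatter(y, y_pred, s=12, alpha=0.5)
lims = [y.min(), y.max()]
plt.plot(lims, lims, "r--", label="45-degree line")   # perfect-prediction reference
plt.xlabel("Actual MPG")
plt.ylabel("Predicted MPG")
plt.legend()
plt.show()
```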

Using the final pipeline, MPG predictions were generated for all 300 test vehicles and submitted for scoring.


From a practical standpoint, the results highlight several insights:

  • Reducing weight and optimizing engine displacement relative to cylinder count yields large efficiency gains.

  • Increasing horsepower without corresponding weight reduction is costly in terms of MPG.

  • For forecasting tasks, gradient boosting with domain-informed feature engineering offers a strong balance between accuracy and interpretability.

Closing Perspective

This project demonstrates how a structured modeling process can transform a noisy spreadsheet of car specifications into a reliable prediction engine. Starting from a simple linear baseline and iteratively improving through feature engineering and nonlinear models, the final tuned XGBoost pipeline achieves robust performance without unnecessary complexity.


For future work, richer aerodynamic variables, loss functions that directly optimize MAPE, and validation on newer model years could push accuracy further. Within the constraints of the current dataset, however, the final model represents a strong and practical solution for predicting automobile fuel efficiency.



Recognition

This project was recognized as the top-performing submission in ISDS 7103 and earned first place in the course Hall of Fame. The final model achieved one of the lowest prediction errors in the class, reflecting both technical rigor and a disciplined modeling approach. This recognition capped off a semester focused on turning theory into practical, decision-ready machine learning systems.



Andrey Fateev


