
I tried to use ChatGPT’s architecture to predict electricity consumption — here’s what happened

Like many beginner data scientists, I was convinced that recent technologies inevitably outperformed older ones. Transformers, the neural networks powering ChatGPT that have been revolutionizing artificial intelligence since 2017, should logically crush LSTMs, an architecture invented in 1997. Twenty years apart, billions of dollars in investment, thousands of scientific publications — the match seemed decided before it started. In my case, things didn’t go as expected.

The challenge: anticipating consumption to optimize energy purchases

For my very first deep learning project, I worked on a concrete use case: predicting a household’s electricity consumption 24 hours ahead. Behind this technical problem lies a major economic challenge. Smart grid operators buy electricity daily on the European spot market (EPEX) for next-day delivery. A forecast that’s too high means waste. Too low means penalties. In my project’s fictional scenario, these errors represented 62 million euros in annual losses.
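The asymmetry between over- and under-forecasting can be made concrete with a toy cost function. The spot price and penalty multiplier below are invented for illustration, not real EPEX figures:

```python
# Hypothetical illustration of day-ahead imbalance costs. The spot price and
# penalty factor are made-up example values, not actual market figures.
def imbalance_cost(forecast_kwh, actual_kwh, spot_price=0.10, penalty_factor=1.5):
    """Cost of a forecast error, in euros.

    Over-forecasting wastes energy bought at the spot price; under-forecasting
    is settled at a penalized price (spot_price * penalty_factor).
    """
    error = forecast_kwh - actual_kwh
    if error >= 0:  # bought too much: the surplus is wasted purchase
        return error * spot_price
    return -error * spot_price * penalty_factor  # bought too little: penalty

print(imbalance_cost(110, 100))  # over by 10 kWh
print(imbalance_cost(90, 100))   # under by 10 kWh, costs more
```

The penalty factor makes the same 10 kWh error cost more when you under-buy than when you over-buy, which is why accurate forecasts matter more than symmetric error metrics alone suggest.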

I used a real dataset from UCI: four years of minute-by-minute measurements from a household in Sceaux, a Paris suburb. After hourly aggregation and feature engineering — weather variables, cyclical encoding of hours, consumption lags — I had 34,000 hours of data and 35 predictive variables.
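The cyclical encoding mentioned above maps the hour of day onto a circle, so that 23h and 0h end up close together instead of 23 units apart. A minimal sketch with NumPy (the exact columns in my pipeline may differ):

```python
import numpy as np

# Cyclical encoding of the hour of day: project hours 0-23 onto the unit
# circle so the model sees 23h and 0h as neighbours.
hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

# Euclidean distance between 23h and 0h in the encoded space:
dist = np.hypot(hour_sin[23] - hour_sin[0], hour_cos[23] - hour_cos[0])
print(round(dist, 3))  # small, despite the 23-unit gap in the raw encoding
```

The same trick applies to day of week or month of year; only the period (24, 7, 12) changes.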

The showdown: LSTM versus Transformer

I built two models. First, the LSTM: two layers, 38,000 parameters, a classic but proven architecture. Then the Transformer: attention mechanism, positional encoding, 70,000 parameters — the heavy artillery.
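Parameter counts like the 38,000 above follow from the standard LSTM formula: each layer has 4 gates, each with an input weight matrix, a recurrent weight matrix, and a bias vector. The hidden size below is a hypothetical example, not my actual configuration, and frameworks count biases slightly differently (PyTorch stores two bias vectors per gate):

```python
# Parameter count of one LSTM layer with a single bias vector per gate
# (Keras-style counting). The hidden size is a hypothetical example.
def lstm_layer_params(input_size, hidden_size):
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

n_features = 35   # predictive variables from the feature engineering step
hidden = 50       # hypothetical hidden size, not the model's real config

layer1 = lstm_layer_params(n_features, hidden)  # first layer sees the features
layer2 = lstm_layer_params(hidden, hidden)      # second layer sees layer 1's output
head = hidden + 1                               # linear output: weights + bias
total = layer1 + layer2 + head
print(total)  # on the order of a few tens of thousands of parameters
```

This is why the two-layer LSTM stays so small: its cost grows with the hidden size squared, while a Transformer adds attention projections and feed-forward blocks on top.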

Raw results seemed to favor the Transformer. Its mean absolute error (MAE) reached 0.4086 kW versus 0.4145 kW for the LSTM. A 1.4% advantage. Victory? Not so fast.
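The 1.4% figure is just the relative MAE gap, easy to verify from the two numbers above:

```python
# Checking the headline numbers: relative MAE advantage of the Transformer.
mae_transformer = 0.4086  # kW
mae_lstm = 0.4145         # kW

advantage = (mae_lstm - mae_transformer) / mae_lstm
print(f"{advantage:.1%}")
```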

The metric that changes everything: overfitting

Digging deeper into the results, I discovered a warning sign. The gap between training and validation performance — what we call the overfitting percentage, (validation_loss − train_loss) / train_loss — reached 6.62% for the Transformer versus only 1.63% for the LSTM. In other words, the Transformer tended to memorize the training data rather than learn the true underlying patterns.
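As a quick illustration of the overfitting percentage, here is the formula in code. The loss values are hypothetical placeholders; only the resulting percentages match the article:

```python
# Overfitting percentage: relative gap between validation and training loss.
def overfitting_pct(train_loss, validation_loss):
    return (validation_loss - train_loss) / train_loss * 100

# Hypothetical train loss of 0.30, with validation losses chosen to
# reproduce the article's two gaps:
print(round(overfitting_pct(0.30, 0.30 * 1.0662), 2))  # Transformer-like gap
print(round(overfitting_pct(0.30, 0.30 * 1.0163), 2))  # LSTM-like gap
```

Note that the absolute loss values cancel out: only the ratio between validation and training loss matters, which makes the metric comparable across models.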

In production, facing unseen data, this behavior can be catastrophic. A model that doesn’t generalize well is a dangerous model.

Moreover, the LSTM trained in under 2 minutes versus over 7 for the Transformer. Simpler, faster, more robust: the choice was clear.

What I take away from this

In this specific context — a single household, four years of data, relatively regular consumption patterns — the LSTM proved more suitable. The Transformer probably needs larger datasets or more complex sequences to reach its full potential. This is actually what the scientific literature suggests: attention shines on long dependencies and large data volumes, less so on short, structured time series.

In the same fictional scenario, my final LSTM model saves approximately 28 million euros per year by reducing forecast errors by 45%. I’m proud of this result for a first deep learning project.

A call to experts

That said, I’m still a beginner in this field. If you’re an experienced professional and see areas for improvement — on the Transformer architecture, hyperparameters, or training strategy — I’d love to discuss. The complete notebook and detailed methodology are available here. Feel free to reach out by email (giuliagovernatori@hotmail.com) or via my LinkedIn post (link).