Machine learning

Feature Engineering Is Still King: Why Raw Data Almost Never Works

khaled June 8, 2023 4 mins read

The narrative that deep learning eliminates the need for feature engineering is seductive and mostly wrong. Outside of image classification, speech recognition, and a handful of other modalities where neural networks genuinely learn from raw pixels or waveforms, the majority of production ML models — tabular data, time-series forecasting, churn prediction, fraud detection, recommendation systems — perform dramatically better with engineered features than with raw inputs. Understanding what feature engineering actually is, and why it matters, is one of the highest-leverage skills in applied machine learning.

What Feature Engineering Actually Is

Feature engineering is the process of transforming raw data into representations that better expose the underlying patterns the model needs to learn. It is not just adding columns to a dataframe — it is encoding domain knowledge into features that make the model's job easier.

Raw data from a user activity log might include timestamps and event types. Engineered features might include: time since last login, number of logins in the past 7 days, ratio of purchase events to browse events, time of day distribution, weekend vs weekday activity. These features encode behavioural patterns that a model would struggle to discover from raw timestamps and event strings.
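
To make that concrete, here is a minimal pandas sketch of turning such a log into behavioural features. The column names (user_id, event_time, event_type) and the snapshot date are assumptions for illustration, not a fixed schema.

```python
import pandas as pd

# Toy raw event log; in practice this comes from the activity store.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime([
        "2023-05-01 10:00", "2023-05-20 21:30", "2023-06-01 09:15",
        "2023-05-28 14:00", "2023-06-02 18:45"]),
    "event_type": ["browse", "purchase", "browse", "browse", "purchase"],
})
snapshot = pd.Timestamp("2023-06-08")  # feature computation date

grouped = events.groupby("user_id")
features = pd.DataFrame({
    # Recency: days since the user's most recent event
    "days_since_last_event": (snapshot - grouped["event_time"].max()).dt.days,
    # Frequency: events in the trailing 7 days
    "events_last_7d": events[events["event_time"] >= snapshot - pd.Timedelta(days=7)]
        .groupby("user_id").size(),
    # Behaviour mix: share of events that are purchases
    "purchase_ratio": grouped["event_type"].apply(lambda s: (s == "purchase").mean()),
}).fillna({"events_last_7d": 0})
print(features)
```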

Why Raw Data Fails

Scale and sparsity: raw high-cardinality categoricals (product IDs, user IDs, location names) produce sparse one-hot encodings that are difficult to learn from. Replacing them with count encodings, target encodings, or learned embeddings provides dense, meaningful representations.
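
As a small illustration, count (frequency) encoding takes only a few lines of pandas; the product_id column is hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"product_id": ["A", "A", "B", "C", "A", "B"]})
test = pd.DataFrame({"product_id": ["B", "D"]})

# Map each category to how often it appears in the training data.
counts = train["product_id"].value_counts()
train["product_id_count"] = train["product_id"].map(counts)
# Categories unseen in training fall back to 0.
test["product_id_count"] = test["product_id"].map(counts).fillna(0)
```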

Non-linearity: linear models learn only linear decision boundaries, and even gradient boosted trees, which split on one feature at a time, struggle to capture smooth multiplicative interactions. Raw features that interact non-linearly (age × income as a credit risk signal) often need to be constructed explicitly. Deep networks can learn such interactions in principle but need abundant data; explicitly constructed interaction features work at any dataset size.

Domain knowledge encoding: a churn model that knows "number of days since last support ticket" captures a domain relationship that a model would otherwise need millions of examples to discover from raw support ticket timestamps. Encoding known domain relationships as features shortcuts the learning process.

Distribution shift at inference: raw feature distributions shift between training and serving. Engineered features that are normalised, clipped, and bounded are more robust to distribution shift than raw values.
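
One common way to do this, sketched below under the assumption of a single numeric amount feature: fit clip bounds and scaling statistics on the training data only, then apply them unchanged at serving time.

```python
import numpy as np

rng = np.random.default_rng(0)
train_amounts = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)

# Bounds and scale are fit on training data only.
lo, hi = np.quantile(train_amounts, [0.01, 0.99])
mean, std = train_amounts.mean(), train_amounts.std()

def transform(amount):
    """Clip to the training 1st/99th percentiles, then standardise."""
    return (np.clip(amount, lo, hi) - mean) / std

# An extreme serving-time value stays bounded inside the training range.
print(transform(np.array([5.0, 1e6])))
```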

High-Impact Feature Engineering Techniques

Temporal Features

From datetime columns, extract: hour of day, day of week, day of month, week of year, is_weekend, is_holiday, time_since_last_event. Cyclical encoding (sin/cos transformation of hour and day of week) handles the wrap-around structure of time.
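
A possible implementation with pandas and NumPy might look like the following; the ts column name is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2023-06-08 23:30", "2023-06-09 00:10"])})

# Calendar features
df["hour"] = df["ts"].dt.hour
df["day_of_week"] = df["ts"].dt.dayofweek
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)

# Cyclical encoding: map hour and weekday onto a circle so that
# 23:00 and 00:00 end up close together in feature space.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
df["dow_sin"] = np.sin(2 * np.pi * df["day_of_week"] / 7)
df["dow_cos"] = np.cos(2 * np.pi * df["day_of_week"] / 7)
```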

Rolling Aggregates

For time-series and event data, compute rolling statistics over windows: rolling mean, standard deviation, min, and max over 7-, 30-, and 90-day windows. The ratio between a short and a long window (for example, 7-day mean over 90-day mean) captures trend direction.
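
A sketch with pandas rolling windows, assuming a daily-indexed series with a numeric value column:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=120, freq="D")
df = pd.DataFrame({"value": np.random.default_rng(0).normal(100, 10, len(idx))},
                  index=idx)

# Rolling statistics over short, medium, and long windows.
for window in (7, 30, 90):
    df[f"mean_{window}d"] = df["value"].rolling(window).mean()
    df[f"std_{window}d"] = df["value"].rolling(window).std()

# Trend direction: short-window level relative to long-window level.
df["trend_7_over_90"] = df["mean_7d"] / df["mean_90d"]
```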

Target Encoding

Replace high-cardinality categoricals with the mean of the target variable for each category value, computed on the training set. This is powerful but requires cross-fitting to prevent target leakage.
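
One common way to cross-fit is out-of-fold encoding, sketched below with scikit-learn's KFold: each row's encoding is computed from folds that do not contain that row. The smoothing term and column names are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(df, cat_col, target_col, n_splits=5, smoothing=10.0):
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, valid_idx in kf.split(df):
        fold = df.iloc[train_idx]
        stats = fold.groupby(cat_col)[target_col].agg(["mean", "count"])
        # Smooth rare categories toward the global mean.
        smooth = ((stats["count"] * stats["mean"] + smoothing * global_mean)
                  / (stats["count"] + smoothing))
        encoded.iloc[valid_idx] = (df.iloc[valid_idx][cat_col]
                                   .map(smooth).fillna(global_mean).values)
    return encoded

df = pd.DataFrame({"city": list("AABBBCCCCD"),
                   "churned": [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]})
df["city_te"] = target_encode(df, "city", "churned")
```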

Interaction Features

Explicitly construct cross-products of features that have domain-justified interactions. For a fraud model: transaction_amount / avg_user_transaction_amount and transaction_hour_distance_from_user_typical_hour.
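
A hedged sketch of those two features, assuming a precomputed per-user profile table with average amount and typical hour; the exact schema is illustrative.

```python
import numpy as np
import pandas as pd

tx = pd.DataFrame({
    "user_id": [1, 1, 2],
    "amount": [20.0, 900.0, 55.0],
    "hour": [14, 3, 19],
})
# Per-user profiles, assumed precomputed from historical transactions.
profile = pd.DataFrame({
    "user_id": [1, 2],
    "avg_amount": [25.0, 60.0],
    "typical_hour": [15, 20],
}).set_index("user_id")

tx = tx.join(profile, on="user_id")
# How unusual is this amount for this user?
tx["amount_vs_user_avg"] = tx["amount"] / tx["avg_amount"]
# Circular distance (in hours) between transaction hour and the user's typical hour.
raw = (tx["hour"] - tx["typical_hour"]).abs()
tx["hour_distance_from_typical"] = np.minimum(raw, 24 - raw)
```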

Text Features Without Deep Learning

For tabular datasets with text fields, TF-IDF on key text columns often outperforms raw inclusion of the text. For short text (product names, job titles), count of specific keyword patterns may be more informative than unigram frequencies.
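
For example, with scikit-learn's TfidfVectorizer; the job_title column and the max_features cap are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({"job_title": ["senior data engineer", "data scientist",
                                 "sales manager", "senior sales engineer"]})

vectorizer = TfidfVectorizer(max_features=100, ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(df["job_title"])  # fit on training text only

text_features = pd.DataFrame(tfidf.toarray(),
                             columns=vectorizer.get_feature_names_out(),
                             index=df.index)
df = pd.concat([df, text_features], axis=1)
```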

Feature Selection: Removing What Does Not Help

More features are not always better: irrelevant features add noise and slow training. Common techniques, with a combined sketch after the list:

  • Mutual information: measures the statistical dependence between each feature and the target; efficient for filtering low-information features
  • Permutation importance: shuffle each feature and measure performance drop; direct measure of a trained model's reliance on each feature
  • Correlation filtering: remove features that are near-duplicates of other features (Pearson |r| > 0.95)
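
The sketch below combines all three filters on a synthetic classification dataset, using scikit-learn's mutual_info_classif and permutation_importance plus a pandas correlation matrix.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=4, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

# 1. Mutual information: cheap filter for low-information features.
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)

# 2. Permutation importance: how much a trained model relies on each feature.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
perm = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)
perm_scores = pd.Series(perm.importances_mean, index=X.columns)

# 3. Correlation filter: drop near-duplicate columns (|r| > 0.95).
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]

print(mi.sort_values(ascending=False).head())
print(perm_scores.sort_values(ascending=False).head())
print(to_drop)
```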

Conclusion

Feature engineering is the art of converting domain knowledge and data intuition into model inputs that make learning tractable. It is irreplaceable for tabular ML, transformative for time-series and event data, and still relevant for domains where deep learning operates on structured inputs. The practitioners who invest in feature engineering consistently outperform those who rely on raw data and model complexity alone.

Keywords: feature engineering, machine learning, tabular ML, target encoding, rolling features, feature selection, data preprocessing, ML pipeline, data science