Feature Selection in the Era of Generative AI
- lx2158
- Nov 2, 2025
- 25 min read
1. Introduction: The Foundational Status and Evolutionary Challenges of Feature Engineering
In both the theory and practice of machine learning, model performance does not arise from thin air: it is deeply rooted in the quality and representation of the input data. Features are the inputs to machine learning algorithms. They are the means by which models perceive and understand complex real-world phenomena, and the starting point for all subsequent learning and reasoning. Across academic traditions and application fields, these input variables go by different names: independent variables, emphasizing their explanatory role in predictive models; covariates, common in statistical modeling and causal inference; or simply X in formal mathematical notation.
The versatility of features is reflected in how they span the major machine learning paradigms. They can be used for supervised learning, where the goal is to learn a mapping from input to output; for unsupervised learning, aimed at discovering hidden structures, patterns, or probability distributions within data; or for optimization, as in a reinforcement learning framework, where features represent environmental states so that agents can learn optimal strategies to maximize cumulative rewards.
To illustrate the critical role of features in modern complex systems, consider an application from quantitative finance. At QTS (an advanced quantitative trading system), we use over 100 features as inputs, covering dimensions such as market microstructure signals, macroeconomic indicators, company fundamentals, and alternative data. The purpose of these high-dimensional inputs is dynamic portfolio management: specifically, dynamically calibrating the allocation between our Tail Reaper strategy (a strategy focused on managing extreme tail risk) and E-mini S&P 500 futures (one of the world's most liquid stock index futures). This illustrates how much models rely on rich, high-quality feature inputs in a high-dimensional, non-linear, and non-stationary financial environment.
However, although the era of big data makes acquiring massive amounts of features possible, this has not simplified the modeling process but instead introduced new complexities such as the "Curse of Dimensionality." A core challenge lies in the validity and information content of features. In general, the modeler does not know beforehand which features are useful for a particular application, or whether they are redundant. In the initial stage of model building, we face knowledge uncertainty regarding the feature space. Complex multicollinearity may exist between features, or certain features may simply be noise, irrelevant to the target task.
Blindly including all available features in a model often leads to serious consequences. Using all features can lead to overfitting and poor out-of-sample performance. Overfitting occurs when the model complexity far exceeds what the data can support, causing the model to learn specific noise in the training data rather than the underlying Data Generating Process (DGP). Such models perform excellently in-sample but have extremely poor generalization ability when facing new data. In fields like finance that demand extremely high robustness in prediction, out-of-sample performance is the ultimate standard for measuring model validity, and the cost of overfitting is immense.
The figure below (Figure 1) demonstrates the relationship between model complexity (e.g., the increase in the number of features) and model performance, visually illustrating the overfitting phenomenon (bias-variance trade-off) and the necessity of feature selection.
Figure 1: The relationship between model complexity, overfitting, and out-of-sample performance

Besides degrading generalization, high-dimensional feature spaces can also trigger severe computational problems, most notably numerical instability and singularity during matrix inversion. The solution processes of many classical algorithms (such as Ordinary Least Squares regression and Gaussian Processes) and modern optimization algorithms (such as Interior Point Methods) rely on inverting the feature covariance matrix or the Hessian matrix. When the number of features is huge and redundancy exists, this matrix may approach singularity (i.e., its determinant is close to zero), which makes parameter estimation extremely sensitive to minor perturbations, yields numerically unstable solutions, or even prevents the algorithm from converging.
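To make this concrete, here is a minimal NumPy sketch (not from the original article) showing how two nearly collinear features drive the condition number of X'X sky-high, making the OLS solution numerically fragile:

```python
# A minimal sketch of how near-singular matrices destabilize OLS: two almost
# perfectly collinear features make (X'X)^-1 ill-conditioned, so tiny data
# perturbations produce wildly different coefficient estimates.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)   # almost perfectly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)

XtX = X.T @ X
print("condition number:", np.linalg.cond(XtX))  # huge => near-singular

beta = np.linalg.solve(XtX, X.T @ y)
print("OLS coefficients:", beta)  # unstable: rerun with a new seed and they change drastically
```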
Hence the need for a process called "Feature Selection." Feature selection aims to identify an optimal subset from the original feature set that can reduce computational complexity, lower the risk of overfitting, and improve model interpretability while maintaining or improving model performance. It is a key step in building robust, efficient machine learning models, and its importance becomes increasingly prominent as data dimensions increase.
2. Paradigm Shift: From Discriminative AI to Generative AI
To understand the evolution of feature selection methodology, we must first examine how the dominant modeling paradigm has shifted from focusing on conditional probabilities to focusing on the data distribution itself.
2.1 Traditional Discriminative AI and Its Feature Selection Methods
In traditional "Discriminative" AI, the core goal of the model is to learn the decision boundary or conditional relationship between input and output. Specifically, we try to model the probability P(Y|X). In this framework, Y is the variable we try to predict or explain, variously called the dependent variable, target, label, or response. Discriminative models focus directly on predicting Y given the input X, emphasizing finding the optimal classification boundary or regression function.
This paradigm has long dominated specific fields, such as financial market forecasting. Quantitative traders often use tree-based models, such as Gradient Boosting Trees (GBT). GBT builds powerful non-linear models by iteratively integrating weak learners (decision trees), possessing high prediction accuracy and good adaptability to heterogeneous data. To determine which features are most critical in such high-capacity models and to improve model interpretability, researchers have developed a series of specialized feature importance assessment techniques. They usually use feature selection algorithms such as MDA (Mean Decrease Accuracy), SHAP (SHapley Additive exPlanations), and LIME (Local Interpretable Model-agnostic Explanations) to select the subset of features that can be used to model P(Y|X). MDA evaluates the marginal contribution of features to prediction accuracy through permutation tests; SHAP provides a feature attribution method with theoretical guarantees based on cooperative game theory; and LIME explains individual predictions through local linear approximation.
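As a concrete illustration of MDA-style importance, here is a hedged sketch using scikit-learn's permutation_importance on a gradient boosting model; the synthetic dataset is purely illustrative:

```python
# A sketch of MDA (Mean Decrease Accuracy) via permutation tests: permute each
# feature on held-out data and measure the resulting drop in accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# The mean drop in held-out accuracy when a feature is shuffled is its importance.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranked = np.argsort(result.importances_mean)[::-1]
print("Top 5 features by mean decrease in accuracy:", ranked[:5])
```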
2.2 The Rise of Generative AI (GenAI) and the Reshaping of Feature Roles
However, with the rapid development of deep learning technologies, especially the breakthroughs in Large Language Models (LLMs) and Diffusion Models, artificial intelligence has entered the era of Generative AI (GenAI). Under this new paradigm, the role and handling of features have undergone a fundamental transformation. In Generative AI (GenAI), features play an even more central role than the target variable Y. The goal of generative models is to learn the generation process and intrinsic structure of the data itself, i.e., to model the distribution of input data P(X).
This paradigm shift means that we invest vast computational resources to build complex Deep Neural Networks (DNNs). These networks are built just to model the probability distribution of X, irrespective of what Y they may later be used to predict. The model attempts to capture the joint distribution, complex dependencies, and low-dimensional manifold structures of the feature space.
This modeling approach centered on data distribution has spawned the powerful learning paradigm of "Pre-training" and "Fine-tuning." Typically, we pretrain a DNN with one objective (e.g., how well it models the distribution of X, usually measured by maximizing data likelihood or its variational lower bound), learning universal, transferable feature representations (Representation Learning). Then, we can use it for another objective (e.g., optimizing some reward using deep reinforcement learning). The model first learns basic knowledge on a large-scale unlabeled dataset, then transfers this knowledge to specific downstream tasks.
Discriminative AI focuses on modeling conditional probability P(Y|X) (learning decision boundaries), while Generative AI focuses on modeling data distribution P(X) or P(X, Y) (learning the data generation process). The figure below (Figure 2) visually contrasts these two paradigms.
Figure 2: Schematic comparison of Discriminative Models vs. Generative Models

2.3 New Demands for Feature Selection in the GenAI Era and Limitations of Traditional Methods
This paradigm shift poses severe challenges to traditional feature selection techniques, making them appear inadequate in the new context.
First is the issue of label dependency. The pre-training phase is usually conducted in an unsupervised or self-supervised setting, where the target variable Y is often missing. We can no longer use MDA, SHAP, or LIME when Y is yet undefined. These techniques were designed to evaluate the importance of features for predicting a specific target Y. MDA relies on changes in model prediction accuracy on labeled data; SHAP and LIME also require model outputs to calculate feature contributions. When there is a lack of a clear Y, these discriminative feature selection methods lose their basis for application.
But more importantly, the limitation of traditional feature selection methods lies in their static and global nature. Such traditional feature selection techniques are global: they evaluate the average importance of features based on the entire dataset to derive a fixed optimal subset. They do not allow for sample-specific feature selection. In the global selection paradigm, once the subset of features is selected, it is used for every inference, regardless of the specific content and context of the current input. This "one-size-fits-all" approach limits the adaptability and flexibility of the model.
In complex real-world systems, the importance of features is often highly Context-dependent. For example, in Natural Language Processing, the importance of a word depends on the sentence it is in; in financial markets, the importance of a certain macroeconomic indicator may be distinctly different under different Market Regimes. Models need the ability to dynamically adjust their focus on features based on the current sample, which is unachievable by traditional global feature selection methods.
3. Advanced Feature Selection Methods in the Era of Generative AI
Facing the limitations of traditional methods, we need new tools and paradigms for effective feature selection and representation learning in the GenAI era. These new methods typically internalize feature selection as part of the model architecture rather than treating it as an independent preprocessing step. Here, we will discuss two powerful and well-known methods in deep learning/GenAI that can be used for feature selection: Transformer and Variational Autoencoder (VAE).
These two architectures represent advanced paradigms for handling features in modern deep learning. They align highly with the core ideas of GenAI and possess several key advantages:
Unsupervised and Self-supervised Learning Capabilities: They can be used to pretrain DNNs on large unlabeled datasets for different downstream applications. This allows us to fully utilize massive amounts of unlabeled data to learn rich feature representations, alleviating the bottleneck of scarce labeled data in supervised learning.
Sample-Specific Dynamism: They allow for sample-specific feature selection. By introducing attention mechanisms (in Transformers) or probabilistic latent variables (in VAEs), these models can dynamically adjust feature weights or select feature subsets based on the characteristics of each input sample.
End-to-End Joint Training: They can be jointly trained with the DNN parameters using a single objective function. The feature selection mechanism and model parameters are optimized together via backpropagation, achieving end-to-end learning.
3.1 Transformer Architecture and Dynamic Feature Selection
Since its proposal, the Transformer architecture has demonstrated revolutionary influence in multiple fields such as NLP, computer vision, and time series analysis, becoming the cornerstone of current GenAI models. I discussed how the Transformer is built, and its theoretical foundations, in a series of blog posts. Its core innovation is the Attention Mechanism, specifically Self-Attention, which provides an elegant and powerful implementation of dynamic feature selection.
3.1.1 Self-Attention Mechanism and Feature Importance
The self-attention mechanism allows the model to dynamically measure the correlation between each element and all other elements when processing an input sequence (or feature set). This mechanism enables the model to capture long-range dependencies and build a representation of each feature based on global context information.
In a self-attention Transformer, the model learns how to calculate Query, Key, and Value vectors, and generates an Attention Matrix through operations like dot products. Each element of this matrix represents the interaction strength between a pair of input features. A key insight is that the sum of the attention scores of a column of the attention matrix gives the feature importance score of the feature corresponding to that column. Intuitively, if a feature (corresponding to a column) is frequently "attended to" (i.e., has high total attention scores) when computing the context representation of all other features, then that feature likely contains global information crucial to the entire input. This provides an intrinsic, data-driven measure of feature importance.
Furthermore, we utilize these attention scores to aggregate input features. If we multiply the attention matrix with the input features (or some linearly transformed version of them), we get the "context vector" Z. This is our transformed feature vector, which is no longer a simple combination of original features but a representation fused with global context information, where every feature is weighted by its importance (attention) score. Each element in Z is a weighted average of the entire input sequence, with weights dynamically determined by the self-attention mechanism.
Crucially, note that these attention scores depend on the values of the features themselves (since they are calculated via similarity measures between features). So they are sample-specific. This is a fundamental difference between Transformers and traditional global feature selection methods (like filter-based ANOVA or regularization-based Lasso regression). For different input samples, due to differences in feature values, the generated attention matrices will differ, leading to different feature importance assessments and different context vectors Z. This dynamic adaptability allows Transformers to capture complex patterns and context dependencies in data more finely.
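The following NumPy sketch (with assumed toy dimensions, not the article's actual model) shows the whole pipeline in miniature: softmax attention scores computed from the features themselves, column sums as per-sample feature importance, and the context vector Z as the attention-weighted combination of values:

```python
# A minimal sketch of self-attention as sample-specific feature scoring.
import numpy as np

rng = np.random.default_rng(0)
n_feat, d_model = 14, 16                 # e.g., 14 input features, 16-dim embeddings
X = rng.normal(size=(n_feat, d_model))   # one sample's embedded features

Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d_model)      # scaled dot-product attention
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)    # softmax: each row sums to 1

importance = A.sum(axis=0)               # column sums -> importance of each feature
Z = A @ V                                # context vectors, one per input feature
print("feature importance:", np.round(importance, 2))
```

Because A is computed from X itself, a different sample yields a different attention matrix, a different importance vector, and a different Z.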

3.1.2 Training Paradigm: The Revolution of Pre-training and Fine-tuning
The learning process of the Transformer parameters (including the weight matrices of the attention layers and the parameters of the feed-forward networks) is achieved by optimizing a specific objective function. In a supervised learning setting, the way to train the Transformer parameters is to use it to achieve some objective, e.g., maximizing the log-likelihood of a classification task. In this case, the context vector Z is fed into a downstream network to predict the label Y. Through backpropagation, gradient information guides the attention mechanism to learn how to select and combine features so as to minimize prediction error.
However, the true power of the Transformer lies in its application in unsupervised (or self-supervised) pre-training, which is also one of the reasons it serves as a core component of GenAI. One can also pretrain the Transformer in an unsupervised setting (i.e., without labels). In this case, the goal is no longer to predict external labels but to understand the structure of the data itself. A common pre-training method adopts a structure similar to an Autoencoder: just use Z to reconstruct (instead of predict) the original features, using the Mean Squared Error (MSE) as the loss function. This reconstruction task forces the Transformer to learn a compressed representation Z capable of capturing all critical information of the original features. Another popular pre-training method is Masked Modeling, as adopted by BERT, learning representations by predicting masked parts of the input.
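A hedged PyTorch sketch of this autoencoder-style pre-training (the layer sizes and training loop are illustrative assumptions, not the article's exact setup):

```python
# Unsupervised pre-training sketch: the Transformer's context vectors Z are
# asked to reconstruct the embedded inputs under an MSE loss, with no labels.
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, d_model = 32, 14, 64

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
recon_head = nn.Linear(d_model, d_model)     # reconstruction head

x = torch.randn(batch, seq_len, d_model)     # stand-in for embedded features
opt = torch.optim.Adam(list(encoder.parameters()) + list(recon_head.parameters()), lr=1e-3)

for step in range(100):                      # unsupervised pre-training loop
    z = encoder(x)                           # context vectors Z
    loss = nn.functional.mse_loss(recon_head(z), x)   # reconstruct the inputs
    opt.zero_grad(); loss.backward(); opt.step()
```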
But once trained, the Transformer can be used unmodified for other downstream tasks, such as regression or optimization. For example, the pre-trained Transformer can be used as a feature extractor, feeding Z into a simple classifier.
We can also choose to fine-tune it by adjusting its parameters (hopefully only slightly) to optimize other objectives. The fine-tuning process leverages the knowledge learned during pre-training and specializes it for a specific task, typically achieving excellent performance with less labeled data and computational resources.
This paradigm of pre-training and fine-tuning is one reason why GenAI is so much more powerful than traditional discriminative AI: it achieves effective Transfer Learning, breaking down barriers between different tasks and datasets. Specifically, we can pretrain the GenAI model on a much larger (possibly unlabeled) dataset that is related to the one we are trying to predict. Traditional discriminative models are usually confined to labeled datasets for specific tasks, while GenAI models can draw knowledge from broader data sources.
The figure below (Figure 3) details the two-stage learning flow in GenAI. Stage 1 (Pre-training) utilizes large-scale related datasets for unsupervised learning to learn the universal representation P(X). Stage 2 (Fine-tuning) transfers the knowledge learned from pre-training to a small-scale dataset for a specific target, optimizing for the downstream task P(Y|X).
Figure 3: Flowchart of the Pre-training and Fine-tuning paradigm in Generative AI

3.1.3 Application Case: Overcoming Data Scarcity in Financial Machine Learning
In fields with data scarcity, this advantage is particularly prominent. Financial machine learning has long faced challenges such as low Signal-to-Noise Ratio, limited historical data, and market Non-stationarity, leading to the perennial problem of data scarcity. The pre-training paradigm offers a potential solution.
For example, if we are creating features to predict AAPL (Apple Inc. stock) returns, we usually have limited historical data for AAPL. However, we can leverage broader market data: we can first pretrain the Transformer on the features of MSFT, GOOG, etc. Although these companies belong to different sub-sectors, their stock price movements may be influenced by similar macroeconomic factors, industry trends, and market sentiment, so the pre-training process enables the model to learn these universal market dynamics. We can then fine-tune the parameters to predict AAPL's returns (see Section 10.5 of our book on fine-tuning). In this way, the model utilizes wider related data, which may allow us to overcome the perennial problem of data scarcity in financial machine learning.
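Continuing the pre-training sketch above, here is a hedged sketch of the fine-tuning step: `encoder` is the Transformer pre-trained on other stocks' features, while `aapl_x` and `aapl_y` are stand-ins for the scarce AAPL data; the low encoder learning rate keeps the pre-trained parameters changing only slightly:

```python
# Fine-tuning sketch: reuse the pre-trained encoder from the previous block,
# fit a small return-prediction head on scarce target data.
import torch
import torch.nn as nn

aapl_x = torch.randn(64, 14, 64)   # stand-in for scarce AAPL feature sequences
aapl_y = torch.randn(64, 1)        # stand-in for AAPL next-day returns

ret_head = nn.Linear(64, 1)
opt = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-5},   # barely nudge the encoder
    {"params": ret_head.parameters(), "lr": 1e-3},  # train the new head normally
])

for step in range(50):
    z = encoder(aapl_x)                 # (batch, seq_len, 64) context vectors
    pred = ret_head(z.mean(dim=1))      # pool over the sequence, predict returns
    loss = nn.functional.mse_loss(pred, aapl_y)
    opt.zero_grad(); loss.backward(); opt.step()
```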
3.1.4 Cross-Attention: Fusing Heterogeneous Features
Besides these delightful advantages for feature selection, the Transformer architecture also offers flexibility in handling heterogeneous data sources. The cross-attention Transformer allows us to mix different types or sources of features: the cross-attention mechanism lets the model process one sequence in the context of another, thereby achieving directed fusion of information.
In many practical applications, we need to consider data of different modalities or frequencies simultaneously. In quantitative finance, we usually distinguish between time series features and cross-sectional features. (See our blog post on the difference between time series and cross-sectional features.) Time series features (e.g., VIX, interest rates, HML factor…) typically describe the macroeconomic environment or overall market conditions and have temporal continuity. Cross-sectional features (i.e., stock-specific features such as P/E, B/M, dividend yields…) describe the fundamentals or market performance of individual assets at a specific point in time.
The cross-attention mechanism allows us to mix time series features and cross-sectional features, or mix cross-sectional features from different instruments (e.g., NVDA, GOOG…), capturing the interaction effects between them.
In cross-attention, the Query comes from one data source (Sequence A), while the Key and Value come from another data source (Sequence B). The model learns how to use information from Sequence B to enhance the representation of Sequence A.
Specifically, in financial applications we can build a cross-attention Transformer to fuse macro context with micro fundamentals: cross-sectional features are used as the "query", and time series features are used as the "key/value". In this setting, the model attempts to answer the question: under the current macroeconomic background (time series features as context), how should we focus on and interpret the fundamental information of individual stocks (cross-sectional features)?
In this setting, the interpretation of attention scores becomes very insightful: the sum of the attention scores in a column represents the importance of a time series feature across all cross-sectional features in, say, predicting that stock's returns. For example, if the column corresponding to interest rate features has a high sum of attention scores, it means that in the current input sample, the interest rate is a key macro variable regulating the effects of fundamental factors.
The context vector Z calculated via cross-attention also has a unique structure and significance. The rows of the context vector Z are the same in number as the cross-sectional features (i.e., maintaining the dimension of individual stocks), but they are conditioned by the time series features at that time snapshot. This means that the feature representation of each stock no longer depends solely on its own fundamentals but is re-evaluated and weighted within the current macroeconomic environment. This is an elegant way to apply macroeconomic context to stock-specific fundamentals. It provides a dynamic, data-driven mechanism to fuse top-down (macro analysis) and bottom-up (fundamental analysis) perspectives, which is difficult to achieve effectively with traditional factor models.
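A minimal PyTorch sketch of this fusion (dimensions and feature names are illustrative assumptions): the cross-sectional features form the Query, the time series features form the Key/Value, and column sums of the attention weights score the macro features:

```python
# Cross-attention sketch: stock-specific (cross-sectional) features attend to
# macro (time series) features, yielding one macro-conditioned row per stock.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_stocks, n_macro = 32, 50, 10

cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

stock_feats = torch.randn(1, n_stocks, d_model)  # e.g., embedded P/E, B/M per stock
macro_feats = torch.randn(1, n_macro, d_model)   # e.g., embedded VIX, rates, HML

# Query = stocks; Key/Value = macro context.
Z, attn = cross_attn(query=stock_feats, key=macro_feats, value=macro_feats)

print(Z.shape)   # (1, n_stocks, d_model): one row per stock, conditioned on macro
macro_importance = attn[0].sum(dim=0)   # column sums: importance of each macro feature
print(macro_importance.shape)           # (n_macro,)
```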
The figure below (Figure 4) demonstrates how to use a cross-attention Transformer to fuse cross-sectional features (as Query) and time series features (as Key/Value). This mechanism allows the model to dynamically regulate attention to specific stock fundamentals (cross-sectional features) based on the macroeconomic background (time series features).
Figure 4: Architecture diagram of fusing cross-sectional features with time series features using Cross-Attention

3.2 Empirical Case: Predicting SPX Returns
To illustrate the effect of Transformers in feature selection and transformation, and to reveal potential challenges, let’s look at an example of predicting SPX returns. We first establish a benchmark. Suppose we just buy and hold SPX—impossible since it is an index and not an ETF, but this is just for illustrative purposes. (In the same spirit, we will ignore transaction costs in all our examples.) This simple passive strategy provides a baseline for measuring the performance of active management strategies.
We divide the dataset into two periods for backtesting. SPX has a Sharpe ratio of 0.39 between 2005-2017, and 0.8 between 2017-2025. We will designate the first period as the training set and the second period as the test set for the following ML tasks.
3.2.1 Experiment 1: Simple Multi-Layer Perceptron (MLP)
We first use traditional machine learning methods as a comparison. Using 14 technical features created with the TA-LIB library (such as Moving Averages, RSI, etc.), we train a simple MLP neural network (with 1 hidden layer of 2 nodes) to predict SPX's next-day returns. The trading strategy is: go long if the predicted return is positive, and hold cash otherwise.
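A hedged sketch of this baseline (the `spx` DataFrame, the specific indicators, and the split are assumptions; the article's exact 14 features are not reproduced):

```python
# Baseline sketch: TA-Lib technical features feed a tiny MLP that predicts
# next-day returns; we go long when the prediction is positive.
import numpy as np
import pandas as pd
import talib
from sklearn.neural_network import MLPRegressor

# `spx` is an assumed DataFrame of daily SPX OHLCV data indexed by date.
close = spx["close"].values
feats = pd.DataFrame({
    "rsi": talib.RSI(close, timeperiod=14),
    "sma_ratio": close / talib.SMA(close, timeperiod=20),
    "mom": talib.MOM(close, timeperiod=10),
    # ...plus the remaining technical features, 14 in total
})
y = pd.Series(close).pct_change().shift(-1)   # next-day return (the target)

mask = feats.notna().all(axis=1) & y.notna()
X, r = feats[mask].values, y[mask].values
split = int(len(X) * 0.6)                     # first period = training set

mlp = MLPRegressor(hidden_layer_sizes=(2,), max_iter=2000, random_state=0)
mlp.fit(X[:split], r[:split])

# Strategy: long when the predicted return is positive, otherwise hold cash.
position = (mlp.predict(X[split:]) > 0).astype(float)
strat_ret = position * r[split:]
print("test Sharpe:", np.sqrt(252) * strat_ret.mean() / strat_ret.std())
```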
The experimental results show that the Sharpe ratios on the (training, test) sets are (0.4, 1.1), so the MLP outperforms the buy-and-hold strategy (training set 0.39, test set 0.8). This indicates that even a simple non-linear model, combined with traditional technical analysis features, can capture some predictability in the market.
3.2.2 Experiment 2: Self-Attention Transformer with MLP
Next, we introduce the Transformer to process the features. We create a self-attention Transformer with a 64-dimensional "embedding" and proper normalization (such as Layer Normalization), add lagged features with lags of 1, 2, 3, and 4 days to enhance the model's temporal awareness, and feed the resulting context vector into the same MLP. In this setting, the Transformer acts as a dynamic feature processor and interaction learner.
The experimental results show that the Sharpe ratios become (0.7, 0.6). On the training set (0.7), the Transformer significantly improved the model's fitting ability compared to the MLP's 0.4. On the test set (0.6), however, it underperforms both the buy-and-hold benchmark (0.8) and the MLP using the raw, untransformed features (1.1).
This teaches us a lesson: the Transformer sounds great, but its parameter count is usually far larger than that of a simple MLP, and without sufficient data, overfitting is a real danger. In this example, the substantial increase in training-set performance coupled with the decline in test-set performance is a classic signal of overfitting. Daily data from 2005-2017 may simply not be sufficient to train a complex Transformer model effectively.
Furthermore, model performance is highly dependent on architecture design and hyperparameter settings. The Transformer has many hyperparameters we can optimize, such as the embedding dimension, the normalization method, the number of attention heads (multi-head attention allows the model to attend to different information in different representation subspaces), and positional encoding (especially important for temporal data). Careful tuning of these hyperparameters is crucial for unleashing the best performance of the Transformer. You can read about all these variations in my blog posts referenced earlier.
3.2.3 Experiment 3: Cross-Attention Transformer
Finally, we examine the effect of the cross-attention mechanism in fusing heterogeneous features. Suppose we add just VIX (the Volatility Index, an important time series feature reflecting market panic levels) as the time series feature serving as the key/value input, with the original technical features as the query input. Theoretically, cross-attention should be able to use the information provided by VIX to regulate the weights of the technical features, for instance by changing the effectiveness of certain technical indicators in high-volatility environments.
The experimental results show that the cross-attention Transformer gives Sharpe ratios of (0.3, 0.6). No improvement: both training and test set performances are relatively poor.
This result highlights that effective utilization of complex models requires thoughtful feature engineering and sufficient data support. Of course, this is a toy example. Using only one time series feature (VIX) may not be sufficient to provide a rich macroeconomic context. In practice, we should use the hundreds of time series features (both macroeconomic and market-based) created jointly by QTS and Predictnow.ai as the key/value inputs. Only when the key/value inputs are rich and diverse enough can the cross-attention mechanism effectively learn how to use this information to enhance the representation of query inputs.
4. Variational Autoencoder (VAE): Probabilistic Feature Selection and Latent Representation Learning
After delving into how the Transformer architecture achieves dynamic, sample-specific feature selection and fusion through self-attention and cross-attention mechanisms, we will turn to another powerful feature learning paradigm in the field of Generative AI—methods based on probabilistic graphical models and variational inference. The core of such methods lies in learning the Latent Structure of data distribution. Its representative architecture is shown as follows:
X → Probabilistic Encoder → μ, σ
μ, σ, ϵ → Z → Probabilistic Decoder → X'
This flowchart visually demonstrates the core mechanism of the Variational Autoencoder (VAE), revealing its fundamental difference from traditional deterministic autoencoders. It is a Probabilistic process. Specifically, input features X are first passed through a "Probabilistic Encoder" (usually a deep neural network). This encoder does not deterministically map X to a point Z in the latent space, but maps it to the parameters of a probability distribution. In a standard VAE, this distribution is usually assumed to be Gaussian, so the encoder outputs the mean (μ) and standard deviation (σ) of this distribution.
Subsequently, to generate the latent variable Z, we need to sample from this distribution. To ensure the entire network can be trained end-to-end via backpropagation, a key technique—the Reparameterization Trick—is introduced. We introduce an external noise source ϵ (usually sampled from a standard Gaussian distribution), and then obtain Z through the deterministic transformation Z = μ + σ ⊙ ϵ. This step corresponds to μ, σ, ϵ -> Z in the flowchart. It cleverly shifts the randomness from the variable Z itself to the external input ϵ, thereby allowing gradients to flow through the encoder network.
Finally, the sampled latent variable Z is input into the "Probabilistic Decoder" (another deep neural network), which attempts to reconstruct the original input, generating X'. The decoder defines the conditional probability distribution P(X|Z) for generating observed data from the latent representation.
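A minimal PyTorch sketch of this architecture (layer sizes are illustrative assumptions): a probabilistic encoder producing μ and log σ², the reparameterization trick Z = μ + σ ⊙ ϵ, and a probabilistic decoder reconstructing X':

```python
# VAE sketch: encoder outputs the parameters of q(z|X); sampling uses the
# reparameterization trick so gradients can flow through the encoder.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d_in=14, d_latent=8, d_hidden=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mu = nn.Linear(d_hidden, d_latent)       # mean of q(z|X)
        self.logvar = nn.Linear(d_hidden, d_latent)   # log-variance of q(z|X)
        self.dec = nn.Sequential(
            nn.Linear(d_latent, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_in)
        )

    def reparameterize(self, mu, logvar):
        eps = torch.randn_like(mu)                    # external noise source
        return mu + torch.exp(0.5 * logvar) * eps     # Z = mu + sigma * eps

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar                # X', plus posterior params
```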
Variational Autoencoder (VAE) is another feature selection method. By learning a low-dimensional latent representation capable of capturing the key Factors of Variation in the data, it achieves non-linear feature compression and selection. Unlike Transformers which focus on interactions and context dependencies between features, VAE provides a framework for understanding the data Manifold from a probabilistic generative perspective.
4.1 VAE as a General Latent Variable Model
From the perspective of statistical modeling development, it can also be considered a more generalized version of the familiar PCA, Gaussian Mixture Model (GMM), or Hidden Markov Model (HMM). PCA is a linear dimensionality reduction technique, and its probabilistic version (Probabilistic PCA) can be viewed as a latent variable model with a linear Gaussian decoder. GMM assumes data is a mixture of a finite number of Gaussian distributions, where its latent variable is a discrete class label. HMM focuses on temporal data, assuming observed sequences are generated by a hidden Markov chain. VAE generalizes the ideas of these models to highly non-linear and complex scenarios by leveraging the powerful function approximation capabilities of deep neural networks, able to learn non-linear manifold structures that PCA cannot capture, and continuous latent spaces richer than GMM.
All these are latent variable models. The core idea of latent variable models is to assume that high-dimensional observable data is actually driven by low-dimensional, unobservable latent factors. They transform the observable features X into a smaller set of unobservable "latent" variables z that can generate the observable features with a simpler probability distribution (e.g. Gaussian). By learning the mapping from a simple distribution to a complex data distribution, we are able to understand the intrinsic structure of the data and perform effective feature extraction.
The figure below (Figure 5) shows the detailed architecture of the Variational Autoencoder (VAE). Input features X are mapped to the parameters of the latent distribution (μ and σ) via the probabilistic encoder. Then, the latent variable Z is obtained by sampling using the reparameterization trick (introducing external noise ε). Finally, the probabilistic decoder reconstructs Z into X'.
Figure 5: Detailed Architecture Diagram of Variational Autoencoder (VAE)

4.2 Mathematical Framework of VAE: Marginal Likelihood and Inference
In other words, if z is the latent variable, the latent variable model defines a Generative Process, and it can adapt to different data types and assumptions. For example, z can be a simple binary categorical variable z ∈ {0, 1} (following a Bernoulli distribution), corresponding to component selection in mixture models; or a continuous variable (following a Gaussian distribution), as usually assumed in VAEs. The Marginal Likelihood of the observed data X can then be obtained by integrating (or summing) over the latent variable z:
p(X) = ∫ p(X,z)dz = ∫ p(X|z)p(z)dz
This formula is the core of all latent variable models. It states that the probability distribution of observed data p(X) is obtained by marginalizing out the latent variable z from the joint distribution p(X,z). The joint distribution can be further decomposed into the product of the likelihood function p(X|z) (the decoder) and the prior distribution p(z) (prior knowledge of the latent space). When training generative models, our goal is usually to maximize this marginal likelihood p(X).
This is also reminiscent of the context vector Z output by the Transformer. Although the computational mechanisms and theoretical bases differ (Transformer relies on deterministic attention weighting, while VAE relies on probabilistic inference and sampling), they share similarities in functionality. In both cases, z (in VAE) and Z (in Transformer) represent high-level abstractions or compressed representations of the input data. In both cases, z tells you about the probability distribution of the AI system’s final output. They act as information bottlenecks, containing the core information needed for generation or prediction tasks.
In the case of the VAE, we have an unsupervised learning task whose goal is to learn the distribution of the data itself and to be able to reconstruct the data, so the output is the same as the input X. The model attempts to maximize reconstruction accuracy while maintaining the regularity of the latent space.
4.2.1 Decoder: The Generative Process
The decoder part of the VAE is responsible for implementing the generative process: given z, we obtain X via P(X|z). The decoder is typically a deep neural network that maps the low-dimensional z back to the high-dimensional data space, defining the parameters of the likelihood function P(X|z).
4.2.2 Encoder: The Inference Process and Variational Inference
But how do we get the distribution of z given X? This is the Inference Process, i.e., inferring the posterior distribution of latent variables p(z|X) based on observed data. According to Bayes' theorem, p(z|X) = p(X|z)p(z) / p(X). However, calculating the denominator p(X) (the marginal likelihood integral mentioned earlier) is usually computationally Intractable, especially when p(X|z) is defined by complex non-linear neural networks.
Finding p(z|X) is the job of the encoder. In the VAE framework, we use Variational Inference to solve this intractability problem. We introduce a learnable approximate posterior distribution q(z|X) (i.e., the encoder network) to approximate the true posterior distribution p(z|X). This practice of using a neural network to parameterize the parameters of the variational distribution is called Amortized Inference.
In practice, assumptions are usually made about the forms of these distributions to facilitate computation: P(X|z) and p(z|X) (more accurately, its approximation q(z|X)) are both taken to be Gaussian. This choice is both computationally convenient and theoretically reasonable. The key is that their parameters are given by two separate DNNs: the encoder DNN outputs the parameters of q(z|X) (mean and variance), and the decoder DNN outputs the parameters of P(X|z).
4.3 Training VAE: Evidence Lower Bound (ELBO)
Combining the two DNNs constitutes the VAE. The goal of training is to optimize the parameters of the encoder and decoder simultaneously, with both sets of parameters trained jointly by maximizing the log-likelihood of X (equivalently, minimizing the negative log-likelihood).
This is easier said than done, since directly optimizing the marginal log-likelihood log p(X) is intractable. The VAE instead trains by optimizing a surrogate objective, the Evidence Lower Bound (ELBO), which is a lower bound on the marginal log-likelihood:
ELBO = E_{q(z|X)}[log p(X|z)] − KL(q(z|X) || p(z))
Maximizing the ELBO is equivalent to maximizing Reconstruction Accuracy (the first term) while minimizing the Kullback-Leibler divergence between the approximate posterior q(z|X) and the prior p(z) (the second term), which acts as a regularizer. The training process also uses the reparameterization trick to pass gradients through the random sampling step. The details can be found in Chapter 6 of our book. I will also delve deeper into VAE in my seminar at Imperial College London, and will elaborate on the relationships between GMM, HMM, and VAE and their training methods in a future blog post.
(If you are wondering, we also need to specify the prior p(z). To simplify computation and encourage the model to learn a regular, smooth latent space, it is usually assumed to be a simple fixed Gaussian distribution p(z) = N(0, 1), i.e., the standard normal distribution (zero mean, unit variance). The KL divergence regularization term penalizes latent codes that deviate too far from this prior.)
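A hedged sketch of the resulting loss for the VAE module above, combining MSE reconstruction with the closed-form KL divergence from a Gaussian q(z|X) to the standard normal prior (the beta weight is an assumption for illustrating Beta-VAE-style tuning):

```python
# Negative ELBO for a Gaussian VAE: reconstruction error plus KL(q(z|X) || N(0,I)).
import torch
import torch.nn.functional as F

def neg_elbo(x, x_recon, mu, logvar, beta=1.0):
    recon = F.mse_loss(x_recon, x, reduction="sum")               # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # closed-form KL
    return recon + beta * kl   # beta=1 is the standard ELBO; beta>1 gives Beta-VAE

# Usage with the VAE sketch above (x: a batch of feature vectors):
# x_recon, mu, logvar = vae(x)
# loss = neg_elbo(x, x_recon, mu, logvar)
# loss.backward()
```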
4.4 Applications of VAE: Pre-training, Fine-tuning, and Semi-supervised Learning
4.4.1 Latent Variables as Feature Inputs
Once trained in this unsupervised manner, the latent space learned by the VAE captures the important structure of the data, and the latent variable z can be used as the transformed feature vector. These features are usually lower-dimensional and informationally denser than the original features X, and better reveal the generative factors behind the data. They can be used for downstream supervised learning or optimization applications. For example, in a classification task, we can use z as input to train a classifier instead of using the raw high-dimensional features X.
This application is very much like the context vector Z from the Transformer. Both provide effective compression and transformation of original data. Indeed, Transformer can be thought of as an encoder, which learns how to encode the input sequence into a context-aware representation via attention mechanisms. The VAE's encoder learns a representation of the latent space via probabilistic inference.
4.4.2 Transfer Learning and Data Efficiency
Furthermore, like the Transformer, the VAE fits perfectly with the pre-training and fine-tuning paradigm, enabling it to utilize large-scale unlabeled data. We can pretrain the encoder on a large unlabeled dataset to learn universal data representations; for example, one can pretrain a VAE on the historical data of the entire stock market to learn general patterns of market dynamics. We can then fine-tune it on a small labeled dataset for supervised learning. This allows the VAE to overcome the scarcity of labeled data for specific tasks.
4.4.3 Breakthroughs in Semi-supervised Learning
Alternatively, we can apply "semi-supervised" learning (Kingma and Welling 2019), where a mixture of unlabeled and labeled data is used to train the VAE. Semi-supervised learning is extremely valuable in scenarios where labeled data is expensive but unlabeled data is abundant. In a Semi-Supervised VAE (SS-VAE), the model's objective function includes both a generative term (using all data to learn the data distribution) and a discriminative term (using labeled data for prediction). By sharing the latent space, the model can leverage unlabeled data to better understand the distribution structure of the data, thereby improving classification performance on the labeled data.
The effectiveness of this method has been astoundingly verified in certain fields. When semi-supervised learning was applied to image classification (e.g., on the MNIST dataset), merely 10 labels per class were required to attain >99% classification accuracy. Amazing! This demonstrates the powerful ability of generative models to extract structural information from unlabeled data to assist discriminative tasks. It holds tremendous potential for many applications where labeled training data are limited, such as financial market forecasting, medical image analysis, or industrial anomaly detection.
4.5 Empirical Case: VAE in SPX Return Prediction
Continuing the SPX next-day return prediction example above, we now evaluate the performance of the VAE as a feature extractor. We construct a VAE using the same 14-dimensional feature vector (technical indicators) as input, an 8-dimensional latent vector (reducing dimensionality from 14 to 8), and 2 ReLU layers each for the encoder and decoder. We first train this VAE unsupervisedly on the training set (2005-2017), then use the trained encoder to extract the latent variable z and feed it to the downstream prediction task.
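A hedged sketch of this pipeline, reusing the VAE module and neg_elbo loss sketched earlier (the training tensor is a stand-in for the actual 2005-2017 feature matrix):

```python
# Train the VAE unsupervisedly on the technical features, then use the
# encoder's posterior mean as an 8-dim feature vector for the downstream MLP.
import torch

vae = VAE(d_in=14, d_latent=8)
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)

x_train = torch.randn(3000, 14)          # stand-in for the training feature matrix

for epoch in range(100):                 # unsupervised training on features only
    x_recon, mu, logvar = vae(x_train)
    loss = neg_elbo(x_train, x_recon, mu, logvar)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                    # extract latent features for downstream use
    z_train = vae.mu(vae.enc(x_train))   # posterior mean as the transformed feature
# z_train (n_samples, 8) now feeds the same MLP predictor as before.
```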
The experimental results show that the Sharpe ratios on the (training, test) sets are (0.2, 0.4). This result is not ideal: it is much worse than the "buy-and-hold" benchmark (training set 0.39, test set 0.8), and worse than the MLP and Transformer models that use the raw features.
The figure below (Figure 6) compares the performance (Sharpe ratio) of different models on the SPX return prediction task. Experimental results show that the simple MLP performs best on the test set, while complex Transformer and VAE models exhibit overfitting on limited data (e.g., Self-Attention Transformer has a training set Sharpe ratio of 0.7, but the test set is only 0.6).
Figure 6: Comparison of Sharpe Ratios for Different Models on SPX Return Prediction Task

4.5.1 Result Analysis and Lessons
Although disappointing, this result provides important insights. First, the VAE is a complex probabilistic model whose performance is very sensitive to architecture choices and the training process, so extensive hyperparameter and architecture optimization is naturally needed. For example, the dimension of the latent space, the depth and width of the networks, and the weight between the reconstruction term and the KL divergence term in the ELBO (e.g., in Beta-VAE) all need careful tuning.
More importantly, this experiment highlights the importance of data volume for training complex generative models. We did not pretrain the VAE on any additional data, so the model could only learn representations from limited SPX history, and it is not surprising that the SPX data alone is insufficient to yield good performance. Financial data typically has a low signal-to-noise ratio and is non-stationary, which makes training deep generative models even more challenging. The true potential of the VAE may only be unleashed by pre-training on large-scale related datasets.
5. Conclusion: The Reshaping of Feature Engineering Paradigms in the Era of Generative AI
Transformer and Variational Autoencoder illustrate how feature transformation and selection are central to GenAI. In the GenAI paradigm, feature engineering is no longer an independent preprocessing step but is deeply integrated with the model architecture and training process, rather than an afterthought as in traditional discriminative models, which typically adopt global, static methods.
They also highlight the flexibility of the generative approach. The core of this flexibility lies in transfer learning and continuous learning: the ability to pretrain models on large unlabeled datasets, enabling them to learn universal knowledge and representations from massive data, and to incrementally fine-tune them as new data become available, allowing models to adapt efficiently to specific downstream tasks and changing environments.
Their potential in financial applications is only now beginning to be tapped by some of the most elite quantitative trading teams. With the deepening understanding of these models, the improvement of computing power, and the accumulation of larger-scale financial datasets, GenAI-based feature selection and representation learning methods are expected to reshape research and practice in quantitative finance in the future.



