Combining Alt-Data Signals Without Overfitting

Alphanume Team · June 3, 2026

Orthogonality, sample size, and honest validation — the disciplines that separate a durable multi-signal book from an overfit backtest.

The appeal of combining alpha signals is obvious: diversification across uncorrelated edges should improve a portfolio's risk-adjusted return without requiring any individual signal to be stronger. The danger is equally obvious once you have run enough backtests. Each signal added to the stack is another degree of freedom. Add enough of them, select the combination that looks best historically, and you have constructed an elaborate curve-fit that will decay immediately in live trading. The discipline of combining alternative data with price, volume, and fundamental factors requires confronting this problem directly — not treating it as a detail to handle after the strategy is built.

The sections below work through the core principles in the order they should be applied: why fewer economically-motivated signals beat many fitted ones, how to measure whether a new signal genuinely adds information, which combination methods hold up out-of-sample, how to account for the multiple-testing problem embedded in any signal-selection process, and what ongoing monitoring looks like once a strategy is live.

The case for fewer, motivated signals

A common pattern in quantitative research is to begin with a handful of signals, observe that the backtest Sharpe is acceptable but not exceptional, and then iterate — adding signals, applying filters, introducing interaction terms — until the historical performance looks compelling. Each iteration feels like improvement. It is not. What is actually happening is that the researcher is finding the parameters that explain the past sample, not the parameters that will generalize to the future.

The antidote is to begin with a strong prior. Before touching the data, articulate why a signal should predict returns. What is the information advantage? Who is on the other side of the trade, and why are they willing to be there? A signal that answers those questions confidently — because it captures a behavioral bias, an institutional constraint, or a genuine information asymmetry — is far more likely to survive out-of-sample than one that emerged from scanning hundreds of combinations for the best historical fit.

This applies especially to alternative data signals. The Wikipedia Views dataset, for example, captures retail attention in a way that is economically interpretable: elevated page-view traffic for a company or commodity ahead of a known catalyst reflects the same attention-driven trading dynamics documented in the academic literature on investor attention. That motivation exists before the backtest is run. It constrains the researcher from arbitrarily combining Wikipedia traffic with dozens of other datasets until something looks good.

A practical ceiling of three to five signals per strategy is not a rule, but it is a useful forcing function. It requires prioritization and justification rather than accumulation.

Measuring orthogonality before combining alpha signals

The promise of combining alpha signals is diversification — signals that capture different sources of return add more than signals that largely replicate each other. The empirical test of this promise is orthogonality: does a new signal carry information that is not already present in price and volume, existing factors, and the signals already in the book?

Two measurements are necessary. The first is pairwise correlation of signal values in the cross-section. High correlation between a new signal and an existing one means they are largely measuring the same thing; combining them adds complexity without adding edge. An absolute correlation above roughly 0.6 in the cross-section should prompt skepticism about whether the new signal earns its place.
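As a rough illustration of the first measurement, the sketch below computes the correlation between two signals on each date and averages across dates. The function name and the (dates x assets) array layout are assumptions made for this example, not part of any particular library or dataset.

```python
import numpy as np

def mean_cross_sectional_corr(sig_a, sig_b):
    """Average over dates of the cross-sectional Pearson correlation.

    sig_a, sig_b: aligned arrays of shape (n_dates, n_assets); NaN marks
    a missing value. (Layout is an assumption for this sketch.)
    """
    a = np.asarray(sig_a, dtype=float)
    b = np.asarray(sig_b, dtype=float)
    corrs = []
    for row_a, row_b in zip(a, b):
        ok = ~(np.isnan(row_a) | np.isnan(row_b))
        if ok.sum() < 3:
            continue  # too few names on this date to say anything
        corrs.append(np.corrcoef(row_a[ok], row_b[ok])[0, 1])
    return float(np.mean(corrs))
```

Comparing the absolute value of the result against the rough 0.6 threshold gives a first-pass redundancy screen; inspecting the per-date values rather than only the mean shows whether the overlap is persistent or episodic.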

The second and more important test is the incremental information coefficient (IC), where IC is the cross-sectional correlation between a score and subsequent returns. Compute the IC of the existing signal combination against forward returns. Then add the new signal and compute the IC again. If the incremental IC is close to zero, the new signal is not contributing. If it is meaningfully positive, and if that increment is stable across subperiods rather than driven by a single favorable window, the signal earns consideration. Neither test should be run on the same data used to develop the signal. A clean holdout period is required.
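One way to make the incremental IC test concrete is sketched below. It assumes a rank (Spearman-style) IC and an equal-weighted z-score composite as the combination rule; both are choices made for this example, and all function names are hypothetical.

```python
import numpy as np

def _ranks(x):
    # Ordinal ranks; ties are broken by position, acceptable for continuous signals.
    return np.argsort(np.argsort(x)).astype(float)

def rank_ic(score, fwd_ret):
    """IC for one date: correlation between ranks of score and forward return."""
    return float(np.corrcoef(_ranks(score), _ranks(fwd_ret))[0, 1])

def _zscore(x):
    return (x - x.mean()) / x.std()

def incremental_ic(existing, new, fwd_ret):
    """Per-date IC gain from adding `new` to an equal-weighted composite.

    existing: (n_signals, n_dates, n_assets)
    new, fwd_ret: (n_dates, n_assets), all from the holdout period.
    """
    existing = np.asarray(existing, dtype=float)
    new = np.asarray(new, dtype=float)
    fwd_ret = np.asarray(fwd_ret, dtype=float)
    gains = []
    for t in range(fwd_ret.shape[0]):
        z_old = [_zscore(s[t]) for s in existing]
        base = np.mean(z_old, axis=0)
        combined = np.mean(z_old + [_zscore(new[t])], axis=0)
        gains.append(rank_ic(combined, fwd_ret[t]) - rank_ic(base, fwd_ret[t]))
    return np.array(gains)
```

Returning the per-date gains rather than a single average makes the subperiod check straightforward: split the array by year and confirm the mean gain has the same sign in each slice.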

The correlation test should also extend to the factor exposures the signal loads on implicitly. A signal that appears to predict returns but is highly correlated with the size factor or the value factor may simply be a noisy replication of those factors. Residualizing against known factors before computing IC is good practice.
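A minimal sketch of residualization for a single date follows, assuming ordinary least squares with an intercept. The factor exposure matrix (for example, size and value loadings) is a hypothetical input that the reader would supply.

```python
import numpy as np

def residualize(signal, factors):
    """Return the part of `signal` not explained linearly by `factors`.

    signal:  (n_assets,) raw signal values on one date
    factors: (n_assets, k) factor exposures on the same date (assumed input)
    """
    signal = np.asarray(signal, dtype=float)
    factors = np.asarray(factors, dtype=float).reshape(len(signal), -1)
    X = np.column_stack([np.ones(len(signal)), factors])  # add intercept
    beta, *_ = np.linalg.lstsq(X, signal, rcond=None)
    return signal - X @ beta
```

The residual is uncorrelated with each factor in-sample by construction, so any IC it retains cannot be attributed to a linear loading on those factors.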

Combination methods: simplicity as a feature

Once a set of signals passes the orthogonality and motivation screens, they need to be combined into a composite score. The methods available range from simple (equal-weighted z-score averaging) to complex (ridge or lasso regression, neural networks, gradient boosting). A recurring finding in quantitative practice is that simple methods match or outperform complex ones out-of-sample, sometimes by a wide margin.

The reason is straightforward. More complex methods have more parameters. More parameters require more data to estimate reliably. In most equity signal research, there is not enough independent data, meaning not enough non-overlapping holding periods, to estimate complex weighting schemes without overfitting. Ten years of monthly signal data is only 120 rebalance dates, and a neural network fitted on it may have more parameters than there are independent observations to constrain them.

Z-score averaging — standardize each signal to zero mean and unit variance in the cross-section, then average — imposes equal weighting by construction. It has no free parameters beyond the signals themselves. When tested out-of-sample against regression-weighted combinations estimated on the same training data, the equal-weighted approach frequently matches or beats the fitted weights because it does not absorb noise from the estimation step.
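The z-score averaging described above fits in a few lines. This sketch handles one date at a time and assumes every signal is already oriented so that higher values mean higher expected return; the function name is illustrative.

```python
import numpy as np

def zscore_composite(signals):
    """Equal-weighted composite of cross-sectionally standardized signals.

    signals: (n_signals, n_assets) for a single date, each row oriented so
    that higher means more attractive (an assumption of this sketch).
    """
    s = np.asarray(signals, dtype=float)
    z = (s - s.mean(axis=1, keepdims=True)) / s.std(axis=1, keepdims=True)
    return z.mean(axis=0)
```

Because each row is standardized before averaging, a signal quoted in large units cannot dominate the composite, and nothing in the function is estimated from return data.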

Regression or factor-model weighting is appropriate when the sample is genuinely large — thousands of instruments across many years — and when the weighting scheme is simple enough to be estimated stably. In those cases, a penalized regression that shrinks weights toward equal weighting captures the best of both approaches. The key constraint is that the weighting model must be fitted on training data and evaluated on an unseen period, not on the full sample.
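One possible formulation of shrinkage toward equal weighting is a ridge penalty on the distance from a target weight vector, minimizing ||y - Zw||^2 + lam * ||w - target||^2. The sketch below is an assumption about how such a scheme could look, not a description of any specific production method. The scale of the target should be chosen to be consistent with the scale of y; the default of 1/k per signal is only a placeholder.

```python
import numpy as np

def shrunk_weights(Z, y, lam, target=None):
    """Ridge-style weights shrunk toward `target` instead of toward zero.

    Z: (n_obs, k) standardized training-period signals
    y: (n_obs,) training-period forward returns
    lam: penalty; 0 gives OLS, a very large value returns `target`
    """
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    k = Z.shape[1]
    if target is None:
        target = np.full(k, 1.0 / k)  # placeholder equal weights
    A = Z.T @ Z + lam * np.eye(k)
    b = Z.T @ y + lam * np.asarray(target, dtype=float)
    return np.linalg.solve(A, b)
```

The penalty lam is itself a tuned parameter, so it should be chosen on training data only and counted as a trial when the result is evaluated.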

The multiple-testing problem

Every backtest is a hypothesis test. A strategy that appears to have a Sharpe ratio of 1.2 over a historical window may have that Sharpe because it is genuinely a good strategy — or because the researcher tested twenty combinations and selected the best-looking one. The latter produces a Sharpe ratio that is inflated by selection bias, a form of data snooping that is endemic to quantitative research.

The deflated Sharpe ratio, introduced in the academic literature by Bailey and López de Prado, adjusts the observed Sharpe by the number of trials conducted, the length of the track record, and the non-normality of returns. The adjustment can be substantial: a strategy that looks like it has a Sharpe of 1.0 after testing fifty combinations on five years of data may have a deflated Sharpe below 0.3 — meaning the evidence for genuine edge is much weaker than the raw number suggests.
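For readers who want to see the mechanics, here is a sketch of the deflated Sharpe ratio as it is usually stated from Bailey and López de Prado: the observed per-period Sharpe is compared against the expected maximum Sharpe across N trials under the null, and the result is a probability. The variance of Sharpe estimates across trials (sr_variance) has to come from the research log and is treated here as a given input; the function names are illustrative.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329

def expected_max_sharpe(n_trials, sr_variance):
    """Expected best Sharpe among n_trials (>= 2) strategies with no true edge."""
    nd = NormalDist()
    return sqrt(sr_variance) * (
        (1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * nd.inv_cdf(1 - 1 / (n_trials * e))
    )

def deflated_sharpe(sr, n_obs, n_trials, sr_variance, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds the selection-bias benchmark.

    sr: observed Sharpe per period (not annualized)
    n_obs: number of return observations
    skew, kurt: sample skewness and kurtosis of returns (3.0 = normal)
    """
    sr0 = expected_max_sharpe(n_trials, sr_variance)
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf((sr - sr0) * sqrt(n_obs - 1) / denom)
```

Holding the observed Sharpe and track-record length fixed, raising n_trials lowers the output, which is the deflation the text describes.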

Bonferroni correction and false-discovery-rate methods offer related adjustments. If you test one hundred signal combinations at a 5% significance threshold and none has real edge, you still expect five false positives by chance alone. A Bonferroni correction tightens the threshold for each individual test by dividing it by the number of tests, so one hundred tests at an overall 5% level require p-values below 0.0005. False-discovery-rate control, via the Benjamini-Hochberg procedure, is less conservative and more appropriate when many tests are run. Both approaches enforce the discipline of accounting for how many things were tried before selecting the winner.
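Both corrections are short enough to write out. The sketch below takes a list of p-values, one per tested combination, and returns which ones survive; the function names are chosen for this example.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Reject only where p <= alpha / m, with m the number of tests."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject
```

On the same inputs Benjamini-Hochberg rejects at least as many hypotheses as Bonferroni, which is the sense in which it is less conservative.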

The most honest constraint is to maintain a research log. Record every signal and every combination tested, not just the ones that looked good. The log makes the number of trials explicit and prevents the selective amnesia that makes backtests look better than they are.

Walk-forward and holdout discipline

No amount of in-sample statistical adjustment replaces genuine out-of-sample testing. The gold standard is a true holdout: data that was set aside before any signal development began and that is evaluated only once, after all signal-selection and weighting decisions have been finalized. Using the holdout for any iterative refinement converts it into in-sample data.

Walk-forward testing is a practical alternative when the dataset is too short to afford a large static holdout. Fit the signal combination on a rolling training window — say, three years — and generate predictions on the next month, then roll forward. The out-of-sample predictions accumulate across the walk-forward periods and can be evaluated as a time series of performance. The critical requirement is that nothing from the test period — no price data, no return data — enters the signal estimation at each step.
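The rolling scheme can be expressed as a generator of index ranges. This is a sketch with hypothetical names; periods are assumed to be indexed 0, 1, 2, and so on in time order, for example one index per month.

```python
def walk_forward_splits(n_periods, train_len, test_len=1):
    """Yield (train, test) index ranges where test always follows train.

    The window rolls forward by test_len each step, so every test period
    is predicted exactly once and never appears in its own training set.
    """
    start = 0
    while start + train_len + test_len <= n_periods:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += test_len
```

With monthly data, train_len=36 and test_len=1 correspond to the three-year window and one-month step mentioned above. Any standardization or weight fitting belongs inside the loop, computed from the train range only.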

Transaction costs and market impact must be incorporated before any out-of-sample performance is interpreted. A signal combination that turns over 50% of the portfolio monthly and generates a gross Sharpe of 1.5 may have a net Sharpe near zero after realistic cost assumptions. Capacity is a related constraint: a signal derived from a small-cap dataset cannot necessarily support the same dollar volume as a large-cap one, and the analysis should include an estimate of the portfolio size at which market impact degrades performance meaningfully.
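A back-of-envelope version of the cost adjustment is sketched below. It assumes a flat cost per dollar traded, that replacing a fraction of the book means selling and buying that fraction, and that costs reduce return without changing volatility. All parameter names and the example values in the usage note are hypothetical.

```python
def net_sharpe(gross_sharpe, ann_vol, monthly_turnover, cost_bps):
    """Approximate annualized Sharpe after linear trading costs.

    gross_sharpe: annualized Sharpe before costs
    ann_vol: annualized volatility of the strategy (e.g. 0.10)
    monthly_turnover: fraction of the portfolio replaced each month
    cost_bps: all-in cost per dollar traded, in basis points (assumed flat)
    """
    gross_ret = gross_sharpe * ann_vol
    traded_per_year = 2.0 * monthly_turnover * 12.0  # sells plus buys
    cost_drag = traded_per_year * cost_bps / 10_000.0
    return (gross_ret - cost_drag) / ann_vol
```

For instance, net_sharpe(1.5, 0.10, 0.5, 20) treats a 10% volatility book with 20 bps costs. Lower-volatility books and higher per-trade costs, as in small caps, erode the ratio much faster, and a flat cost understates impact as capital grows.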

Point-in-time data integrity applies across every input. Each signal must be computed using only the data that would have been available on the date the signal is generated — no restated financials, no corrected datasets backfilled by vendors. This is particularly important for fundamental data, where restatements are common, and for any alternative data source where historical delivery may differ from what was actually available in real time. The Wikipedia Views dataset carries timestamped daily counts that reflect what was observable at each date — this kind of point-in-time structure is what a rigorous combination framework requires from every input.
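The point-in-time rule amounts to an as-of lookup keyed on when a value became available, not on the period it describes. A minimal sketch, assuming each record carries its own availability date (the record layout and the figures in the usage note are hypothetical):

```python
from bisect import bisect_right
from datetime import date

def as_of(records, query_date):
    """Latest value that was available on or before query_date.

    records: list of (available_date, value) sorted by available_date.
    Returns None if nothing had been published yet.
    """
    dates = [d for d, _ in records]
    i = bisect_right(dates, query_date)
    return records[i - 1][1] if i else None
```

If a figure is first reported as 1.10 on 2024-02-15 and restated to 1.25 on 2024-05-20, a backtest date of 2024-03-31 should see 1.10. Storing only the restated value would leak information from the future into the signal.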

Signal decay and live monitoring

A signal combination that passes all of the above tests still requires ongoing monitoring in production. Signals are not static — the information advantage embedded in a factor erodes as more participants discover and trade it. Signal decay is the empirical process of this erosion: the IC of a signal against forward returns declines over time as it becomes more widely known and as the market structure it exploits changes.

Monitoring live IC, computed monthly or quarterly on the out-of-sample data accumulated since the strategy went live, is the primary tool for detecting decay. A persistent decline in rolling IC, not explained by a single bad period, is a cue to re-examine whether the edge still exists. Acting on a single bad quarter is an overreaction; ignoring a two-year declining trend is negligence.
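One simple way to separate a trend from a single bad period is to smooth the live IC series with a rolling mean and then fit a straight line to the smoothed values. The sketch below does that in plain Python; the window and the slope threshold are hypothetical settings that would need to be fixed in advance for a given strategy.

```python
def rolling_mean(xs, window):
    """Trailing mean over `window` observations, starting once a full window exists."""
    return [sum(xs[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(xs))]

def trend_slope(ys):
    """Least-squares slope of ys against 0, 1, 2, ... (needs at least 2 points)."""
    n = len(ys)
    x_bar = (n - 1) / 2
    y_bar = sum(ys) / n
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(ys))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den

def needs_review(live_ic, window, slope_limit):
    """True when smoothed live IC is falling faster than slope_limit per period.

    slope_limit should be negative, e.g. -0.0005 per month (hypothetical).
    """
    smoothed = rolling_mean(live_ic, window)
    return len(smoothed) >= 2 and trend_slope(smoothed) < slope_limit
```

Smoothing first means one poor month moves the slope very little, while a steady decline over many months accumulates into a clear negative value.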

Correlation between signals should also be monitored live. Signals that were largely orthogonal during development can become more correlated as they attract crowded positioning. When pairwise correlations between strategy signals rise materially, the diversification benefit shrinks and the composite's risk properties change.

A checklist for adding a new alt-data signal

Before incorporating any new alternative data source into an existing book, the following questions should have explicit answers:

What is the economic mechanism? The signal should predict returns for a stated reason that does not depend on the backtest.
Is the data point-in-time? Every historical value must reflect what was available on that date, with no backfill.
What is the incremental IC beyond existing signals? Compute this on a holdout period that was not used in signal development.
What is the pairwise correlation with existing signals? High cross-sectional correlation is a reason to reject or replace, not to add on top.
How many trials were run to find this signal? Apply a multiple-testing adjustment to the reported Sharpe before treating it as meaningful.
What are the realistic transaction costs? Net performance, not gross, is the relevant metric.
What is the expected holding period, and does the data frequency support it? Daily data can support a monthly strategy at no cost; using monthly data to support a daily strategy introduces stale-signal risk.
What is the plan for monitoring decay? Define in advance what rolling IC trend would trigger a review.

The checklist is not a guarantee against overfitting — no checklist is. It is a forcing function that makes the assumptions explicit and creates an audit trail. When a strategy underperforms, explicit assumptions can be examined and updated. Implicit ones cannot.

The most durable multi-signal books are not the ones with the most signals. They are the ones where every signal present has earned its place through economic motivation and rigorous out-of-sample evidence — and where the researchers who built them were honest about how many things they tried before finding what worked.