What Is Alternative Data in Trading?
Alphanume Team · June 10, 2026
The categories, sources, and pitfalls of non-traditional datasets — and why most of the work is plumbing, not alpha.
"Alternative data" has become one of the most overloaded phrases in quantitative finance. At its core, the term refers to any dataset, other than standard price and volume data and the traditional fundamental disclosures (earnings, revenue, balance-sheet filings) that every participant already has access to, that is used to generate an investment edge. The category is broad by design: satellite imagery of parking lots, anonymized credit-card transaction flows, job postings, shipping manifests, web-scraped pricing data, and Wikipedia page-view counts all qualify. What unites them is that they are non-traditional, often unstructured, and require significant engineering work before they yield anything resembling a signal. The Wikipedia Views dataset is a representative example: public attention data derived from page-view counts, useful as a measure of retail investor or public interest in a company before that interest appears in price.
The popularity of alternative data has grown in proportion to the difficulty of generating alpha from traditional sources. As more capital pursues the same fundamental factors, the marginal value of a clean balance-sheet screen declines. The hunt moves to datasets that are harder to acquire, harder to clean, and harder to interpret — which means the edges, where they exist at all, tend to be smaller, shorter-lived, and more expensive to maintain than the marketing around them suggests.
A taxonomy of alternative data sources
The alternative data universe is heterogeneous enough that a taxonomy helps. Most datasets fall into one of six broad categories, each with distinct data-quality considerations and typical use cases.
Web and attention data captures search interest, social media activity, and platform engagement. Google Trends indices, Reddit comment volumes, and Wikipedia page-view counts sit here. These signals tend to be noisy and require careful normalization — absolute counts are less useful than changes relative to historical baselines, and seasonality is a persistent confounder.
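The point about baselines can be made concrete with a small sketch. The function below scores each observation against a trailing window of its own history rather than using the raw count. The function name, the 7-day window, and the page-view numbers are all hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev

def baseline_zscores(counts, window=28):
    """Z-score each observation against the trailing `window` observations.

    The baseline excludes the current observation, so a spike cannot
    inflate its own mean and standard deviation. `window` must be >= 2.
    """
    scores = []
    for i, value in enumerate(counts):
        history = counts[max(0, i - window):i]
        if len(history) < window:
            scores.append(None)  # not enough history yet
            continue
        mu, sigma = mean(history), stdev(history)
        scores.append((value - mu) / sigma if sigma > 0 else None)
    return scores

# Hypothetical daily page views: a stable baseline followed by a spike.
views = [100, 104, 98, 101, 97, 103, 99, 250]
print(baseline_zscores(views, window=7))
```

The first seven entries come back as None because the window is not yet full; the final entry is a large positive score. This sketch does nothing about seasonality. One option, not shown here, is to build the baseline only from observations on the same weekday or in the same calendar month.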
Transaction data includes anonymized credit-card and debit-card spending aggregated by merchant, industry, or geography, along with e-receipt data from consumer apps. This category is among the most commercially valuable because it provides a near-real-time read on consumer spending before companies report revenue. It is also among the most expensive and most legally scrutinized, given privacy-regulation exposure.
Geolocation and foot-traffic data is derived from mobile device pings, aggregated and anonymized to show how many devices visited a given location in a given period. Retail store visits, restaurant foot traffic, and logistics-hub activity are common applications. The signal degrades when GPS accuracy varies across devices or when anonymization methods change across vendor data pulls.
Satellite and aerial imagery covers a range of applications: parking-lot occupancy at retail locations, crude-oil tank fill levels from shadow analysis, agricultural yield estimation from spectral analysis. Imagery data requires computer-vision processing before it becomes numeric, which adds model risk on top of data risk.
Web-scraped data includes product pricing from e-commerce sites, job postings as a proxy for hiring intent and business momentum, and app-store rankings. These datasets are operationally fragile — website structures change, scraping access is frequently blocked or throttled, and the underlying population of sources shifts over time in ways that break time-series comparability.
Regulatory and filing-derived data extracts signals from public filings that are technically available to everyone but rarely processed systematically. SEC filing intensity — the frequency, timing, and volume of a company's regulatory submissions — is one example. Insider transaction patterns, ownership concentration changes from 13F filings, and textual sentiment scores from 10-K risk-factor language are others. The advantage of this category is that the data is public, timestamped, and not subject to privacy-law exposure.
The alt-data lifecycle: raw to signal
Acquiring a dataset is the smallest part of the problem. The lifecycle from raw vendor feed to deployable signal involves four stages, and the middle two consume most of the effort.
Raw ingestion involves receiving data in whatever format the vendor provides — often messy, inconsistently encoded, and poorly documented. Cleaning involves detecting and handling duplicates, missing observations, encoding errors, and vendor-specific artifacts. This stage alone routinely requires more engineering time than building the signal model that follows it.
Point-in-time alignment is where most practitioners underestimate the difficulty. A dataset is point-in-time correct if, for any historical date in a backtest, the only values used are those that were actually available as of that date — not values that were later revised, restated, or backfilled by the vendor. Transaction data vendors frequently revise their panel coverage as new merchants join the network; if the backtest uses the current panel to simulate 2018 signals, the model has seen data that did not exist in 2018. This is a form of look-ahead bias that is structurally baked into many commercial datasets and is difficult to detect without access to vendor snapshot histories.
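One way to picture point-in-time correctness is a store that keeps every delivered version of a value together with the date it became known, and answers queries "as of" a backtest date. The sketch below assumes vintage files are available from the vendor; the class name, dates, and values are hypothetical.

```python
from bisect import bisect_right
from datetime import date

class VintageSeries:
    """Every delivered version of one observation, keyed by the date on
    which that version became known (hypothetical structure)."""

    def __init__(self):
        self._known_dates = []  # knowledge dates, kept sorted
        self._values = []       # value delivered on each knowledge date

    def record(self, known_on, value):
        i = bisect_right(self._known_dates, known_on)
        self._known_dates.insert(i, known_on)
        self._values.insert(i, value)

    def as_of(self, when):
        """Return the latest value known on or before `when`, else None."""
        i = bisect_right(self._known_dates, when)
        return self._values[i - 1] if i else None

# Hypothetical: March 2018 spending, first delivered in April 2018,
# then restated in 2021 after the vendor expanded its merchant panel.
march_2018_spend = VintageSeries()
march_2018_spend.record(date(2018, 4, 14), 1.02)
march_2018_spend.record(date(2021, 6, 1), 1.31)

print(march_2018_spend.as_of(date(2018, 3, 31)))   # None: nothing known yet
print(march_2018_spend.as_of(date(2018, 12, 31)))  # 1.02: original delivery
print(march_2018_spend.as_of(date(2025, 1, 1)))    # 1.31: restated value
```

A backtest that reads only the current file would use 1.31 for 2018, which is exactly the look-ahead described above. The as-of query returns what was actually deliverable on the simulated date.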
Signal construction is the final stage — aggregating the cleaned, point-in-time-aligned observations into a numeric factor, normalizing it against an appropriate cross-sectional or time-series benchmark, and estimating its predictive relationship with future returns. This is where the modeling work lives, but it cannot be trusted until the prior stages are clean.
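As a minimal sketch of the normalization and evaluation steps, the code below standardizes one date's raw values across names and then measures a rank correlation between the factor and next-period returns. The function names and inputs are hypothetical, and the tie handling in `ranks` is a deliberate simplification; a production version would use average ranks and many dates rather than one.

```python
from statistics import mean, pstdev

def cross_sectional_zscore(raw):
    """Standardize one date's raw values across names (dict: ticker -> value)."""
    values = list(raw.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {name: 0.0 for name in raw}
    return {name: (value - mu) / sigma for name, value in raw.items()}

def ranks(values):
    """Ranks 1..n, ascending. Ties are broken by position (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def rank_ic(factor, forward_returns):
    """Rank correlation between a factor and next-period returns.

    Only names present in both dicts are used; needs at least two names.
    """
    names = sorted(set(factor) & set(forward_returns))
    fx = ranks([factor[n] for n in names])
    fy = ranks([forward_returns[n] for n in names])
    mx, my = mean(fx), mean(fy)
    cov = sum((a - mx) * (b - my) for a, b in zip(fx, fy))
    var_x = sum((a - mx) ** 2 for a in fx)
    var_y = sum((b - my) ** 2 for b in fy)
    return cov / (var_x * var_y) ** 0.5
```

The forward returns passed to `rank_ic` must start after the date on which the factor values were actually available; otherwise the earlier stages' point-in-time work is undone at the last step.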
The universal pitfalls
Alternative datasets carry a set of failure modes that recur across categories and vendors. Understanding them before committing to a dataset is more valuable than any backtested Sharpe ratio.
Survivorship bias affects datasets that only capture currently active entities. A web-scraped job-posting history that only includes companies still in operation today will look more predictive than it was in real time, because the companies that went bankrupt — and whose declining job postings might have been a useful signal — have been retrospectively excluded.
Backfill bias is specific to vendor data construction. When a vendor builds an initial historical database after the fact, they fill it with data from sources that were available retrospectively — which is not the same as the data that would have been available in real time. Transaction data panels, analyst coverage expansions, and satellite image archives are all vulnerable to this.
Short history is simply the constraint that most alternative datasets have existed for a decade or less, often far less. A dataset with five years of history contains perhaps two or three distinct market regimes — not enough to separate a genuine predictive relationship from an in-sample coincidence. This makes overfitting nearly unavoidable for researchers who are not disciplined about out-of-sample testing.
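A rough piece of arithmetic shows why short histories are so limiting. For a return series, the t-statistic of the mean return equals the annualized Sharpe ratio times the square root of the number of years, ignoring autocorrelation and fat tails. The sketch below applies that identity; the 0.5 Sharpe and the threshold of 2 are illustrative assumptions, not figures from any dataset.

```python
from math import sqrt

def sharpe_t_stat(annual_sharpe, years):
    """t-statistic of the mean return implied by an annualized Sharpe ratio."""
    return annual_sharpe * sqrt(years)

def years_needed(annual_sharpe, t_threshold=2.0):
    """History length at which the t-statistic reaches `t_threshold`."""
    return (t_threshold / annual_sharpe) ** 2

print(round(sharpe_t_stat(0.5, 5), 2))  # 1.12
print(years_needed(0.5))                # 16.0
```

Under these assumptions, a signal with a true Sharpe of 0.5 is statistically indistinguishable from noise after five years, and would need about sixteen years to clear a conventional threshold. This is before any adjustment for the number of signal variants that were tried.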
Publication and reporting lags create look-ahead contamination even when data is technically available. Credit-card data may be delivered with a two-week lag; satellite imagery may take days to process. Backtests that assume same-day availability overstate signal quality in proportion to how much prices move in the lag window.
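A simple guard against this in a backtest is to re-key every observation by the date it could first have been traded, not the date it describes. The sketch below does that with a fixed calendar-day lag; the function names and the 14-day lag are hypothetical, and a real feed would use the vendor's recorded delivery timestamps where they exist.

```python
from datetime import date, timedelta

def tradable_signal(observations, lag_days):
    """Re-key observations by the first date on which they could be traded.

    `observations` maps the date a value describes to the value itself.
    `lag_days` is the assumed delivery-plus-processing delay in calendar days.
    """
    return {obs_date + timedelta(days=lag_days): value
            for obs_date, value in observations.items()}

def value_known_on(available, when):
    """Latest value whose availability date is on or before `when`."""
    usable = [d for d in available if d <= when]
    return available[max(usable)] if usable else None

# Hypothetical weekly spending index delivered with a two-week lag.
observed = {date(2024, 3, 1): 0.8, date(2024, 3, 8): 1.4}
available = tradable_signal(observed, lag_days=14)

print(value_known_on(available, date(2024, 3, 10)))  # None
print(value_known_on(available, date(2024, 3, 15)))  # 0.8
print(value_known_on(available, date(2024, 3, 22)))  # 1.4
```

Running the same backtest with and without the lag gives a direct estimate of how much of the measured performance came from information that was not yet deliverable.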
Crowding is a market-structure effect. When many systematic funds buy the same dataset from the same vendor and construct similar signals, the resulting trades tend to be correlated. When a drawdown or a redemption forces one fund to unwind, the selling can cascade through other funds running similar positions. The edge erodes fastest in exactly the conditions when capital is most at risk.
Privacy and MNPI risk is a legal consideration that is easy to underweight when a dataset looks attractive on a backtest. Transaction data derived from financial intermediaries may carry restrictions on how it can be used. Geolocation data faces regulatory exposure in multiple jurisdictions. And datasets that provide information about specific companies' operations before public disclosure may cross into material non-public information territory under securities law. Legal review before deployment is not optional.
How to evaluate a dataset
Evaluating a new alternative dataset against a structured framework, before investing in data-engineering infrastructure, reduces the chance of discovering fatal flaws only after months of work.
Coverage and representativeness. What fraction of your investment universe is covered? A dataset that covers 60% of large-cap US equities but only 15% of mid-caps creates a selection bias in any backtest that ignores the uncovered names. Ask the vendor how coverage has changed over the history of the dataset — expansion of coverage looks like alpha in a backtest.
History length and regime diversity. A minimum of eight to ten years of history, spanning at least one recession and one sharp dislocation, is a reasonable threshold for beginning to trust out-of-sample results. Datasets with fewer than five years of history should be treated as hypotheses to confirm prospectively, not alpha to deploy immediately.
Point-in-time integrity. Request documentation of the vendor's snapshotting methodology. Can they provide vintage files — the exact data as delivered on specific historical dates? If not, backfills and restatements cannot be distinguished from the original signal, and the backtest is not trustworthy.
Orthogonality to known factors. A dataset that is highly correlated with existing well-documented factors — momentum, value, low-volatility — provides little incremental information once those factors are controlled for. The marginal information content relative to cheap factor exposures is what justifies the cost and operational complexity of the alternative dataset.
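A first-pass orthogonality check is to regress the candidate signal on a known factor and keep only the residual. The sketch below handles a single factor with ordinary least squares; the function name and inputs are hypothetical, and a fuller version would control for several factors at once.

```python
from statistics import mean

def residualize(signal, factor):
    """Remove the part of `signal` explained linearly by a single `factor`.

    Both inputs are equal-length lists aligned by name. Returns the OLS
    residuals and the slope (beta) of signal on factor. The factor must
    not be constant across names.
    """
    ms, mf = mean(signal), mean(factor)
    var_f = sum((f - mf) ** 2 for f in factor)
    beta = sum((f - mf) * (s - ms) for f, s in zip(factor, signal)) / var_f
    residuals = [(s - ms) - beta * (f - mf) for f, s in zip(factor, signal)]
    return residuals, beta

# Hypothetical: the alt-data signal is mostly momentum plus a small extra term.
momentum = [1.0, 2.0, 3.0, 4.0]
alt_signal = [2.5, 3.5, 5.5, 8.5]
residuals, beta = residualize(alt_signal, momentum)
print(beta, residuals)
```

If the residual carries no predictive power once the factor is removed, the dataset is an expensive way of buying an exposure that is available cheaply elsewhere.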
Signal decay and the cost of maintenance
Even a dataset that passes all of the above criteria will generate a signal that decays. The mechanism is straightforward: as more market participants identify and trade on the same pattern, prices adjust faster to incorporate the information, and the predictive window shrinks. What was a five-day predictive horizon in 2015 may be a one-day horizon in 2025, once the same data has been commoditized and is processed more quickly by more participants.
Decay is not a reason to avoid alternative data, but it requires a realistic model of the maintenance burden. A signal that worked on a monthly rebalance three years ago may now require weekly or daily rebalancing to capture what remains of the edge — with proportionally higher transaction costs. The expected lifetime of a signal, and the monitoring required to detect decay before it renders the strategy unprofitable, should be part of the investment thesis before the data engineering begins.
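The cost side of this trade-off is simple arithmetic. The sketch below subtracts annual trading cost from a gross edge, assuming that turnover per rebalance stays constant as frequency rises, which is the "proportional" case described above. Every number here is hypothetical: a 4% gross annual edge, 50% turnover per rebalance, and 10 basis points of cost per unit traded.

```python
def net_annual_edge(gross_edge, rebalances_per_year, turnover, cost_per_trade):
    """Gross annual edge minus annual trading cost.

    Annual cost = rebalances per year * turnover per rebalance * cost per
    unit of turnover. All inputs are fractions (0.04 means 4%).
    """
    return gross_edge - rebalances_per_year * turnover * cost_per_trade

for label, n in [("monthly", 12), ("weekly", 52), ("daily", 252)]:
    edge = net_annual_edge(0.04, n, turnover=0.5, cost_per_trade=0.001)
    print(f"{label}: {edge:.3%}")
```

With these inputs the net edge is roughly 3.4% at a monthly rebalance, 1.4% weekly, and negative at a daily rebalance. A signal whose horizon has shrunk can therefore become unprofitable even when its gross predictive power is unchanged.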
The right response to decay is not to find a new dataset and repeat the cycle, but to build a portfolio of signals that are imperfectly correlated with each other — so that as one decays, others compensate. The methodology for doing this is covered in the discussion of combining alt-data signals, which addresses the weighting and correlation considerations that determine whether a multi-signal portfolio is actually more robust than its components or simply more complex.
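The benefit of imperfect correlation can be illustrated with a standard identity, which is not the methodology referenced above but a simplified special case. If N signals have the same Sharpe ratio and volatility, are equally weighted, and share a common pairwise correlation, the combined Sharpe ratio is the single-signal Sharpe times the square root of N / (1 + (N - 1) * correlation). The function name and the inputs are hypothetical.

```python
from math import sqrt

def combined_sharpe(single_sharpe, n_signals, avg_correlation):
    """Sharpe ratio of an equal-weight blend of identical-quality signals.

    Assumes equal Sharpe, equal volatility, and one common pairwise
    correlation strictly greater than -1 / (n_signals - 1).
    """
    return single_sharpe * sqrt(n_signals / (1 + (n_signals - 1) * avg_correlation))

for rho in (0.0, 0.5, 1.0):
    print(rho, round(combined_sharpe(0.5, 4, rho), 2))
```

Four uncorrelated signals double the Sharpe ratio from 0.5 to 1.0, four signals correlated at 0.5 lift it only to about 0.63, and four perfectly correlated signals add nothing. Correlation among signals, not their count, determines how much robustness the portfolio gains.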
The realistic framing
Alternative data is not a category of datasets that unlocks persistent outperformance. It is a category of datasets that, after significant engineering investment, may yield small, decaying, expensive-to-maintain edges — some of which are real, some of which are artifacts of survivorship, backfill, or in-sample overfitting. The fraction of alternative datasets that generate genuine out-of-sample alpha, net of data costs and transaction costs, is substantially smaller than the vendor landscape implies.
The practitioners who extract durable value from alternative data share a common discipline: they treat data quality and point-in-time integrity as prior constraints, not as details to address after the backtest looks good. They evaluate signals out-of-sample before deployment. They monitor decay and are willing to retire strategies that no longer clear the cost hurdle. And they are specific about what problem a given dataset solves — attention data, transaction data, satellite imagery, and filing-derived signals each have distinct information content and distinct failure modes, and conflating them under the label "alternative data" obscures as much as it reveals.
The category is worth engaging with seriously. It rewards rigor and punishes shortcuts — which is, ultimately, what makes it interesting.