Kernel Regularized Least Squares — an Explainer

How KRLS works, when to use it, and how to run it in R and Stata.

This is a self-contained tutorial on Kernel Regularized Least Squares (KRLS) for users coming from either R or Stata. The R commands use the KRLS package; the Stata commands use the krls command. Both implement the same algorithm from Hainmueller and Hazlett (2014) and Ferwerda, Hainmueller, and Hazlett (2017).

Back to the Kernel ML Methods project page for the package landing pages and the most recent release notes. Source: j-hai/KRLS (R) · j-hai/krls-stata (Stata).

The problem KRLS solves

You want to estimate the conditional expectation \(E[Y \mid X]\) without committing to a specific functional form. The standard options each have their own failure modes:

  • OLS assumes additivity and linearity. If either is wrong, the coefficients are biased and the marginal effects (the things you actually care about) are wrong everywhere.
  • GLM with interactions and polynomials can pick up non-linearities, but you have to specify them in advance. With \(p\) predictors, the cost of “just include the right interactions” grows fast.
  • Random forests / gradient boosting are flexible but not interpretable: you get predictions, not coefficients, and pointwise marginal effects are not first-class output.
  • GAMs assume additivity, which can be too restrictive for social-science data with rich interactions.

KRLS is a kernel-based method that:

  1. Makes no assumption about the functional form (no linearity, no additivity, no specific interactions).
  2. Returns pointwise marginal effects — one per observation — so you can see how the partial derivative \(\partial \hat E[Y \mid X] / \partial X_j\) varies across the covariate space.
  3. Reports closed-form analytical standard errors for the average marginal effects, together with the full distribution of pointwise derivatives so heterogeneity is visible alongside the summary statistic.

In short: a flexible regression that still tells you a coefficient.

The setup

KRLS represents the unknown regression function as a weighted sum of Gaussian kernels — one centered on each observation:

\[\hat f(x) \;=\; \sum_{i=1}^n c_i \,K(x, x_i) \qquad K(x, x_i) \;=\; \exp\!\left(-\frac{\|x - x_i\|^2}{\sigma}\right)\]

The sigma argument to krls() is the kernel denominator directly (not its square), and the default is \(\sigma = p\) where \(p\) is the number of predictors.

The coefficients \(c\) minimize a Tikhonov-regularized least-squares objective,

\[\min_c \;\bigl\| Y - K c \bigr\|^2 + \lambda \, c^\top K c,\]

where \(K\) is the \(n \times n\) kernel matrix and \(\lambda\) is a regularization parameter chosen by leave-one-out cross-validation. The closed-form solution is \(\hat c = (K + \lambda I)^{-1} Y\), and the implied \(\hat f\) lies in the reproducing kernel Hilbert space (RKHS) spanned by the Gaussian kernels.

The pointwise derivative \(\partial \hat f / \partial X_j\) at any data point comes out of \(\hat c\) in closed form. This is the headline output: not a single coefficient per predictor, but a distribution of marginal effects across observations.

Worked example

A simple simulated example showing what KRLS gives you that OLS doesn’t.

In R

library(KRLS)

set.seed(1)
n <- 200
# x1 on an evenly spaced grid over a full sine period so the
# *empirical* average of cos(x1) is essentially zero by construction
# -- the average derivative inherits the symmetry, and the
# heterogeneity in pointwise derivatives is the story.
x1 <- seq(-pi, pi, length.out = n)
x2 <- runif(n, -pi, pi)
# Smooth, additive true surface
y  <- sin(x1) + 0.5 * x2 + rnorm(n, sd = 0.3)
X  <- cbind(x1, x2)

# Fit
fit <- krls(X = X, y = y)

# Pointwise marginal effects: one row per observation, one column per X
head(fit$derivatives)

# Average marginal effects (with analytical SEs)
summary(fit)

# Plot the distribution of marginal effects across observations
plot(fit)

For each predictor, summary(fit) reports:

  • The average marginal effect (mean of pointwise derivatives).
  • A standard error computed from the analytical vcov of \(\hat c\).
  • Quartiles of the pointwise derivatives so you can see whether the effect is constant (boring) or varies across the covariate space (interesting).

In this example the AME of x1 is near zero — the derivative \(\sin'(x_1) = \cos(x_1)\) integrates to zero over a full sine period, since \(\int_{-\pi}^{\pi}\cos(t)\,dt = \sin(\pi) - \sin(-\pi) = 0\). But the pointwise effects span the full range of \(\cos\), from \(-1\) to \(+1\) across the observed values of \(x_1\). The AME of x2 recovers the true slope of \(0.5\). The distribution of \(x_1\)’s marginal effects is exactly the heterogeneity an OLS coefficient on \(x_1\) would compress into a single near-zero slope, while OLS on \(x_2\) would recover the same constant.

For samples where the \(n \times n\) kernel matrix is uncomfortable — roughly past \(n \approx 5{,}000\) on a typical laptop — swap in the Nyström approximation:

fit_big <- krls(X = X, y = y, approx = "nystrom")

The interface, fit object, and outputs are the same. Default landmark count is min(500, ceiling(sqrt(n))); pass nystrom_m, landmarks, or landmark_method to override. Predictions, average marginal effects, and their standard errors are all computed from the same fit. For applications that hinge on the historical exact KRLS inference, see the note in Common pitfalls below.

Pointwise marginal effects. KRLS returns one derivative per observation. The average marginal effect summarizes the distribution, while the OLS coefficient compresses all heterogeneity into a single slope.

In Stata

use mydata.dta, clear

* Fit with pointwise derivatives saved as d_x1, d_x2 in the
* current dataset (use `sderiv(name)` to save to a separate file).
krls y x1 x2, deriv

* Histograms of the pointwise derivatives.
krls y x1 x2, graph

* Fitted values, residuals, and SEs go through kpredict.
* (Fitted values are the default; pass `se` or `residuals` for
* the other variants.)
kpredict yhat
kpredict yhat_se, se

The Stata command shares the same algorithm and output structure as the R package; flags and defaults are documented in help krls.

Reading the output

Three things to look at:

1. Average marginal effects (AMEs). The analogue of an OLS coefficient. If the AME is large and the pointwise derivatives are all close to it, the relationship is roughly linear and KRLS gives you the same conclusion as OLS — but with a check that linearity held.

2. Heterogeneity in pointwise derivatives. This is the unique KRLS contribution. If pointwise derivatives are far from the AME for many observations, the marginal effect varies meaningfully across the covariate space. That’s substantive: the average effect masks heterogeneity. plot(fit) shows the distribution.

3. Pre/post-fit diagnostics.

fit$R2             # in-sample R²
fit$Looe           # leave-one-out CV loss at the chosen lambda
fit$lambda         # regularization parameter (chosen via LOOCV by default)
fit$lambda_method  # "loo" or "gcv" -- which objective was minimized
fit$sigma          # kernel bandwidth (default p; set sigma= for sensitivity)

If \(R^2\) is very high but the LOO loss is also high, you may be overfitting — try a larger \(\lambda\) via lambdasearch() or constrain the kernel bandwidth.

Minimum reporting checklist

When reporting a KRLS fit, include at least:

  • average marginal effects with analytical standard errors;
  • quartiles or a plot of the pointwise marginal effects;
  • in-sample \(R^2\) and leave-one-out loss;
  • the selected regularization parameter and kernel bandwidth;
  • sensitivity checks for bandwidth or predictor scaling when results hinge on heterogeneity.

Choosing among alternatives

KRLS is in the family of nonparametric regression methods. Pick by what you need:

  • Want pointwise marginal effects with closed-form SEs? KRLS is designed for this.
  • Want fast prediction on big data? Random forests or gradient boosting (ranger, xgboost) scale to millions of rows where KRLS does not. They give you SHAP values for interpretation, which approximate KRLS’s pointwise derivatives but with bootstrap-based uncertainty.
  • Have only a handful of nonlinear features in mind? GLM with splines (mgcv) is a lighter tool that’s faster and usually sufficient.
  • Care most about the average effect, not heterogeneity? OLS with regression adjustment (and a check that linearity is reasonable) gets you there with a tiny fraction of the compute.

Common pitfalls

  • n³ memory and runtime for the exact path. By default KRLS solves an \(n \times n\) linear system. For \(n > \sim 5{,}000\), the cost becomes prohibitive on a single machine. Since version 1.4-0 the package offers an explicit low-rank alternative — krls(..., approx = "nystrom") — that anchors the model on \(m \ll n\) landmark points and solves the ridge problem in \(m\)-dimensional space, with time \(\mathcal O(n m^2 + m^3)\) and memory \(\mathcal O(n m)\). The Nyström path is the right default for most applied work at moderate-to-large \(n\): predictions, average marginal effects, and analytical standard errors are all available and the calibration matches a parametric bootstrap. Strictly speaking these are conditional SEs — they condition on the selected landmarks, the chosen \(\lambda\), and the feature map — so use the exact path when a paper explicitly hinges on the historical KRLS inference. See vignette("krls-nystrom-scaling") for the timing comparison, the landmark-reuse pattern, and the choice between LOO and GCV for \(\lambda\).
  • Bandwidth choice. The default is \(\sigma = p\) where \(p\) is the number of predictors (the kernel denominator, not its square). This works well in many cases but is worth checking. Smaller \(\sigma\) makes the kernel sharper (higher variance, lower bias); larger \(\sigma\) smooths more. Refit with sigma = ... and compare \(R^2\) + LOO loss.
  • Standardization matters. Gaussian kernels are scale-sensitive. Standardize predictors before fitting, or pass them as a matrix with comparable scales. The package does this internally by default.
  • Marginal effects average over the sample, not the population. The AME of a covariate is the mean pointwise derivative across your data points. If your sample is selected, the AME reflects that selection.
  • Categorical predictors. KRLS expects numeric predictors. For categories, expand to dummies before passing to krls(). The package will return one marginal effect per dummy.

When not to use KRLS

  • You need exact KRLS inference at very large \(n\). The Nyström path (approx = "nystrom") handles large samples well for nearly every applied use case. The exception is when the analysis explicitly relies on the exact analytical inference from Hainmueller and Hazlett (2014); for that, the \(n \approx 5{,}000\) ceiling on the exact path still applies.
  • You only care about prediction. If you don’t need pointwise marginal effects, gradient boosting will be faster and often more accurate.
  • You need extrapolation outside the data. KRLS is non-parametric in the data range; outside it, predictions revert toward zero (in centered coordinates). Don’t ask it to extrapolate.

References

See also

  • Kernel ML Methods project page — package landing page with the latest release notes.
  • The companion Stata package: krls (see https://github.com/j-hai/krls-stata).
  • vignette("krls-nystrom-scaling", package = "KRLS") — how the built-in Nyström approximation scales beyond the \(\sim 5{,}000\) ceiling, including the landmark-reuse pattern via get_landmarks() and the LOO-vs-GCV trade-off for the regularization parameter.