
5.3 Local Regression

Kernel Smoothing

We’ve encountered data smoothing in the form of a moving average. Figure 1 and Figure 2 present a synthetic white noise process and the same process after smoothing with a moving average of length 10.

Figure 1: Simulated white noise process.

Figure 2: Smoothed version of white noise after applying the filter $\big[\frac{1}{10}, \frac{1}{10}, \dots, \frac{1}{10}\big]$.

We can think of the smoothing process depicted in Figure 2 as consisting of sliding a kernel (also known as a filter) across the data. In this case, the kernel is defined as

$$\Bigg[\frac{1}{10}, \frac{1}{10}, \frac{1}{10}, \frac{1}{10}, \frac{1}{10}, \frac{1}{10}, \frac{1}{10}, \frac{1}{10}, \frac{1}{10}, \frac{1}{10}\Bigg]$$

which we can envision as being slid across the data. At each point, we take the inner (or dot) product of the data with the kernel. For the kernel in Eq. (1), this gives the unweighted average of 10 points. In later chapters we will learn that this sliding operation is a convolution, which corresponds to multiplication in the frequency domain.
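As a concrete sketch of this sliding inner product (the variable names are our own), NumPy's convolve performs exactly this operation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200)        # simulated white noise

# boxcar kernel of length 10; for a symmetric kernel,
# convolution equals the sliding inner product described above
kernel = np.full(10, 1 / 10)
smoothed = np.convolve(x, kernel, mode="valid")

# each output point is the unweighted average of 10 consecutive inputs
```

With mode="valid", only positions where the kernel fully overlaps the data are kept, so the output is shorter than the input by 9 points.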

The box (or boxcar) kernel used in Figure 2 is only one of many possible choices. We will learn in future chapters that a box filter has substantial shortcomings in the frequency domain that are mitigated (though not eliminated) by use of tapered kernels, such as kernels based on Gaussians, the cosine function, or the tricube function introduced in the next section.

Tricube Function

For this chapter, the most important kernel method is generated from the tricube function $W$:

$$W(x_0, x_i) \stackrel{\triangle}{=} \begin{cases} (1-|u|^3)^3 & |u| \leq 1 \\ 0 & |u| > 1 \end{cases}$$

where $u$ is the distance between the focal point $x_0$ and the point $x_i$, calculated as $u=\frac{x_0-x_i}{\max_j{(|x_0-x_j|)}}$[1]. The maximum $j$ is determined by an adjustable parameter giving the length of the kernel.

For example, for $\max_j{(|x_0-x_j|)}=5$, the tricube function $W$ generates a sharply peaked kernel with $2j-1=9$ non-zero values out of $2j+1=11$ total values

$$\big[0, 0.116, 0.482, 0.820, 0.976, 1, 0.976, 0.820, 0.482, 0.116, 0\big]$$

with all points outside the window weighted as 0 (note that for the first and last elements $u=\frac{\pm 5}{5}$, resulting in zero for $W$). In actual implementation we would simply use a kernel of length $2j-1$ (9, in our example) instead of a kernel of length $2j+1$ (11, in our example) with zeros in the first and last positions. We explicitly included the zeros in Eq. (3) to emphasize that the kernel falls to zero at $|u|=1$.
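The values in Eq. (3) can be reproduced directly from the definition of $W$; a minimal sketch (the helper name tricube is our own):

```python
import numpy as np

def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 for |u| <= 1, else 0."""
    u = np.abs(u)
    return np.where(u <= 1, (1 - u**3) ** 3, 0.0)

# distances -5..5 scaled by max_j |x_0 - x_j| = 5
u = np.arange(-5, 6) / 5
weights = tricube(u)
# rounded: 0, 0.116, 0.482, 0.820, 0.976, 1, 0.976, 0.820, 0.482, 0.116, 0
```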

Weighting Points Near the Edges

In order to avoid shortening the time series (as occurs with moving averages), any point less than $\max_j{(|x_0-x_j|)}$ from either end uses an asymmetric selection of points, maintaining the same kernel size at the expense of a larger value of $\max_j{(|x_0-x_j|)}$. For example, continuing with $\max_j{(|x_0-x_j|)}=5$ from before, the asymmetric window used for the first point in the series will now have $\max_j{(|x_0-x_j|)}=9$, with values given as:

$$\big[1, 0.996, 0.967, 0.893, 0.759, 0.569, 0.348, 0.148, 0.0264, 0\big]$$
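The asymmetric edge window follows from the same tricube function, with the scaling constant grown to 9; a sketch (helper name is our own):

```python
import numpy as np

def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 for |u| <= 1, else 0."""
    u = np.abs(u)
    return np.where(u <= 1, (1 - u**3) ** 3, 0.0)

# first point of the series: its neighbors lie at distances 0..9,
# so the scaling constant max_j |x_0 - x_j| becomes 9
u = np.arange(0, 10) / 9
weights = tricube(u)
# rounded: 1, 0.996, 0.967, 0.893, 0.759, 0.569, 0.348, 0.148, 0.0264, 0
```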

LOWESS Algorithm

Local regression is perhaps best thought of as a hybrid between kernel smoothing and linear regression (or more broadly polynomial regression). The most common family of methods is LOWESS (locally weighted scatterplot smoothing), a type of LOESS (locally estimated scatterplot smoothing)[2]. The LOWESS algorithm is generally credited as having been formally introduced in Cleveland (1979) and Cleveland (1981) (though much of the concept had been proposed independently by Savitzky & Golay (1964) about fifteen years beforehand).

LOWESS Demo

Before diving into the underlying algorithm, let’s get a feeling for how LOWESS behaves. The following interactive figure plots the function

$$y = \sin{(x)} + \frac{1}{2}\sin{(3x)} + w$$

for some noise $w$ in gray dots. For reference, the original (presumably unknown) clean signal without noise that we are attempting to reconstruct is denoted by a gray dashed line. The toggle button LOWESS CURVE turns the fitted LOWESS curve on or off. The button LOCAL FITS toggles a selection of the individual local regression lines (more on those soon). Finally, the slider SPAN (f) tells the algorithm what fraction of the overall data to consider for each local fit. Note that at 1.00 the fitted curve is nearly a flat linear regression line, with long local regression lines. As you shrink the fraction, the LOWESS fit becomes more jagged and the local regression lines shorten (each local line is calculated from a smaller portion of the data).

If the above fails to render correctly in your browser you can also open the demo as a new browser window using the Open Demo in a New Tab ↗ button at the top of the frame. Note that you may need to enable popups for this to work.

Local Regression

The LOWESS algorithm begins by dividing the data into overlapping windows. The size of each window is governed by the parameter $f$, the fraction of the total observations $n$. In statsmodels.nonparametric.smoothers_lowess.lowess, the fraction is set with the argument frac (default frac=2.0/3.0). In the demo above, $f$ can be smoothly varied from 0.05 to 1. Once $f$ is chosen, we create $n$ windows of length $\lceil f\times n \rceil$ centered at each data point, where $\lceil \cdot \rceil$ denotes the ceiling function, the smallest integer greater than or equal to its argument. For each window about a given central point (denoted as $x_{center}$), we calculate a weighted linear regression defined as

$$\min_{\beta_0, \beta_1} \sum_{i=1}^{\lceil f\times n \rceil}W(x_{center}, x_i)\,(y_i-\beta_0(x_{center})-\beta_1(x_{center})x_i)^2$$

where $W(x_{center}, x_i)$ is a weight function, most commonly the tricube function, that weights the contribution of each $x_i$ based on its distance from $x_{center}$, and where we have expressed the coefficients as $\beta_0(x_{center})$ and $\beta_1(x_{center})$ to emphasize that distinct $\beta$'s are calculated for each data point. For points closer than $\frac{f\times n}{2}$ to either end, the window no longer places the point at its center, and uses the modification to $W$ discussed above.
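A single local fit of this form can be sketched with NumPy; this is an illustrative sketch with our own variable names, not the statsmodels implementation. Note that np.polyfit applies its w argument inside the square, minimizing $\sum (w_i r_i)^2$, so we pass the square root of the tricube weights to realize the weighted sum above:

```python
import numpy as np

def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 for |u| <= 1, else 0."""
    u = np.abs(u)
    return np.where(u <= 1, (1 - u**3) ** 3, 0.0)

rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 101)
y = np.sin(x) + 0.5 * np.sin(3 * x) + 0.3 * rng.standard_normal(x.size)

# one window about a focal point in the middle of the series
center = 50
half = 10                            # window of 21 points
xw = x[center - half : center + half + 1]
yw = y[center - half : center + half + 1]

# tricube weights based on distance from the focal point
W = tricube((xw - x[center]) / np.max(np.abs(xw - x[center])))

# np.polyfit minimizes sum((w_i * residual_i)^2), so passing
# sqrt(W) realizes the weighted least squares of the text
b1, b0 = np.polyfit(xw, yw, deg=1, w=np.sqrt(W))
y_hat_center = b0 + b1 * x[center]   # fitted value at the focal point
```

Repeating this fit for every window, and keeping only the fitted value at each focal point, yields the LOWESS curve.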

LOWESS can be extended from linear regression to polynomial regression using expressions such as

$$\min_{\beta_0, \beta_1,\beta_2} \sum_{i=1}^{\lceil f\times n \rceil}W(x_{center}, x_i)\,(y_i-\beta_0(x_{center})-\beta_1(x_{center})x_i-\beta_2(x_{center})x_i^2)^2,$$

though we will not need to go beyond Eq. (6) for our purposes.

Having calculated $n$ distinct linear regressions, we use each linear regression about a given $x_{center}$ to calculate the $\hat{y}_{center}$ at that point, leaving us with $n$ $\hat{y}$ values (one per data point). The $\hat{y}$ points are then connected into a single curve; note in the demo above that the fitted curve becomes notably more jagged as $f$ is decreased.
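The whole procedure is available in statsmodels; a minimal usage sketch (the data-generating code is our own, echoing the demo's signal):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(2)
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x) + 0.5 * np.sin(3 * x) + 0.3 * rng.standard_normal(x.size)

# frac plays the role of f; return_sorted=False returns the
# y-hat values in the order of the input points
fitted = lowess(y, x, frac=0.2, return_sorted=False)
```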

Robust Smoothing

In the demo above, we calculated the LOWESS curve solely by implementing Eq. (6). This method works well for data that has relatively few outliers. Cleveland (1979) describes an optional second stage of smoothing using the bisquare function $B(x)$ defined as

$$B(x) \stackrel{\triangle}{=} \begin{cases} (1-x^2)^2 & |x| \leq 1 \\ 0 & |x| > 1 \end{cases}$$

where, as in Eq. (2), $B(x)$ falls to zero at $x=\pm1$. After the initial LOWESS curve calculation using Eq. (6), the algorithm then goes through a predetermined number of iterations of the following procedure:

  1. Calculate the residual between the predicted and observed points

     $$e_i = y_i - \hat{y}_i.$$

  2. Determine new weightings $\delta$ for each point by first calculating the median $s$ of the $|e_i|$'s and then applying the formula

     $$\delta_k \stackrel{\triangle}{=} B\Big(\frac{e_k}{6s}\Big).$$

  3. Recalculate the local regressions using Eq. (6) with $\delta \times W$ (the bisquare-derived weights times the tricube weights) in place of $W$.
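Steps 1 and 2 above are straightforward to sketch; the residual values here are invented for illustration:

```python
import numpy as np

def bisquare(x):
    """B(x) = (1 - x^2)^2 for |x| <= 1, else 0."""
    x = np.abs(x)
    return np.where(x <= 1, (1 - x**2) ** 2, 0.0)

# residuals from a hypothetical initial LOWESS pass;
# values invented for illustration, with one large outlier
e = np.array([0.1, -0.2, 0.05, 3.0, -0.15])

s = np.median(np.abs(e))             # median absolute residual
delta = bisquare(e / (6 * s))        # robustness weights (step 2)

# the outlier at e = 3.0 receives weight 0 and drops out
# of the recalculated local regressions (step 3)
```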

Cleveland refers to this algorithm as “robust locally weighted regression.” In statsmodels, the number of iterations of the above procedure is controlled by the argument it (default it=3). Setting it=0 will result in the algorithm only using Eq. (6) in its original form.
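The effect of the it argument can be seen on Cauchy-noise data similar to Figure 3; a usage sketch (the data generation is our own):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(3)
x = np.linspace(0, 2 * np.pi, 300)
signal = np.sin(x)
y = signal + rng.standard_cauchy(x.size)   # heavy-tailed noise

plain = lowess(y, x, frac=0.3, it=0, return_sorted=False)   # no robust pass
robust = lowess(y, x, frac=0.3, it=3, return_sorted=False)  # default robustness
```

With heavy-tailed noise, the it=0 fit tends to chase individual outliers, while the robust iterations progressively downweight them.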

In general, robust smoothing is primarily required for datasets with large outliers, such as those produced by Cauchy distributed noise. Figure 3 demonstrates LOWESS smoothing with different numbers of iterations on a synthetic dataset with Cauchy distributed noise. Note in particular that for 0 iterations (non-robust LOWESS) the fitted LOWESS line “chases” the outliers out of the plot. Once we reach 3 iterations, the LOWESS line almost perfectly captures the original signal.

Figure 3: LOWESS regression using 0, 1, 3, and 6 iterations of robust smoothing on a synthetic dataset defined as $y=\sin{(x)}+w$ with $w$ Cauchy distributed.

Footnotes
  1. The quantity $\max_j{(|x_0-x_j|)}$ is often referred to as the bandwidth. We will reserve the term bandwidth for use in the context of the frequency domain.

  2. Some sources such as NIST treat the terms LOESS and LOWESS as being interchangeable. Cleveland himself in Cleveland & Devlin (1988) appears to have used the term LOESS as a multivariate generalization of the previous univariate LOWESS.

References
  1. Cleveland, W. S. (1979). Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association, 74(368), 829–836. 10.1080/01621459.1979.10481038
  2. Cleveland, W. S. (1981). LOWESS: A Program for Smoothing Scatterplots by Robust Locally Weighted Regression. The American Statistician, 35(1), 54. 10.2307/2683591
  3. Savitzky, Abraham., & Golay, M. J. E. (1964). Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Analytical Chemistry, 36(8), 1627–1639. 10.1021/ac60214a047
  4. Cleveland, W. S., & Devlin, S. J. (1988). Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. Journal of the American Statistical Association, 83(403), 596–610. 10.1080/01621459.1988.10478639