Edit this page

M4 line simplification

implemented in DuckDB

An implementation of M4, a line simplification algorithm for time series, using DuckDB’s UNNEST operator. Data: Max-Planck-Institut für Biogeochemie, Jena

( / points.)

WITH pts AS (
  SELECT '{x}' AS u
       , '{y}' AS v
       , TIME_BUCKET(INTERVAL '{k}', '{x}') AS k
    FROM climate
)
FROM (
  SELECT k,
        UNNEST([
          {u: MIN(u), v: ARG_MIN(v, u)}
        , {u: ARG_MIN(u, v), v: MIN(v)}
        , {u: ARG_MAX(u, v), v: MAX(v)}
        , {u: MAX(u), v: ARG_MAX(v, u)}
       ], max_depth := 2)
    FROM pts
  GROUP BY k
) ORDER BY u;

Compared to the original implementation, this guarantees exactly 4 points per time bucket, and keeps the definition of the bucket in one place.

It’s similar to AM4 (implemented below), using ARG_MIN and ARG_MAX to locate the extrema, as in Mosaic for reference. Even though it has 4 SELECT statements, AM4 is probably faster than the UNNEST version.

WITH pts AS (
  SELECT '{x}' AS u, '{y}' AS v, TIME_BUCKET(INTERVAL '{k}', '{x}') AS k
    FROM climate
)

  SELECT k, MIN(u) AS u, ARG_MIN(v, u) AS v
    FROM pts
   GROUP BY k
UNION
  SELECT k, ARG_MIN(u, v) AS u, MIN(v) AS v
    FROM pts
   GROUP BY k
UNION
  SELECT k, ARG_MAX(u, v) AS u, MAX(v) AS v
    FROM pts
   GROUP BY k
UNION
  SELECT k, MAX(u) AS u, ARG_MAX(v, u) AS v
    FROM pts
   GROUP BY k

ORDER BY u;