M4 line simplification
implemented in DuckDB
An implementation of M4, a line simplification algorithm for time series, using DuckDB’s UNNEST operator. Data: Max-Planck-Institut für Biogeochemie, Jena
(
WITH pts AS (
SELECT '{x}' AS u
, '{y}' AS v
, TIME_BUCKET(INTERVAL '{k}', '{x}') AS k
FROM climate
)
FROM (
SELECT k,
UNNEST([
{u: MIN(u), v: ARG_MIN(v, u)}
, {u: ARG_MIN(u, v), v: MIN(v)}
, {u: ARG_MAX(u, v), v: MAX(v)}
, {u: MAX(u), v: ARG_MAX(v, u)}
], max_depth := 2)
FROM pts
GROUP BY k
) ORDER BY u;
Compared to the original implementation, this guarantees exactly 4 points per time bucket, and keeps the definition of the bucket in one place.
It’s similar to AM4
(implemented below), using ARG_MIN
and ARG_MAX
to locate the extrema, as in Mosaic for reference. Even though it has 4 SELECT
statements, AM4
is probably faster than the UNNEST
version.
WITH pts AS (
SELECT '{x}' AS u, '{y}' AS v, TIME_BUCKET(INTERVAL '{k}', '{x}') AS k
FROM climate
)
SELECT k, MIN(u) AS u, ARG_MIN(v, u) AS v
FROM pts
GROUP BY k
UNION
SELECT k, ARG_MIN(u, v) AS u, MIN(v) AS v
FROM pts
GROUP BY k
UNION
SELECT k, ARG_MAX(u, v) AS u, MAX(v) AS v
FROM pts
GROUP BY k
UNION
SELECT k, MAX(u) AS u, ARG_MAX(v, u) AS v
FROM pts
GROUP BY k
ORDER BY u;