Kleiner Hartigan Trees

TL;DR

A K-H tree (Kleiner–Hartigan tree) hierarchically partitions data by repeatedly splitting subsets at the median along chosen dimensions.
It supports any number of dimensions and handles outliers by placing them in their own subsets.
Construction requires computing medians for splits, which can be computationally intensive, but search and sorting become efficient once built.

Definition

Kleiner-Hartigan trees, also known as K-H trees, are a type of data structure used to represent data sets in a hierarchical manner. They divide a data set into progressively smaller subsets by splitting on the median value of the data along a selected dimension, producing a binary hierarchical structure that continues until subsets contain single points.

Explanation

K-H trees partition a dataset by selecting a dimension to split on, computing the median value of that dimension across the current subset, and dividing the subset into two child subsets: one containing points with coordinate values less than or equal to the median and the other containing points with coordinate values greater than the median. Each resulting subset is then processed recursively, choosing dimensions and medians for further splits until leaf subsets contain single points.

This structure enables efficient search and sorting by traversing the tree from the root and using median comparisons at each node to select which child to visit next. Compared with structures like quadtrees, K-H trees do not require a predetermined number of fixed partitions per split and can flexibly partition across as many dimensions as needed. K-H trees also isolate outliers by placing them in their own subsets rather than dedicating a larger fixed region to a single outlier.

Examples

2-dimensional points example

Consider the dataset of 2-dimensional points: (1,5), (3,4), (5,2), and (4,3).

Choose a dimension to split on (for example, the x-dimension).
Compute the median of the x-coordinates; in this case the median is 4.
Divide the dataset into two subsets:
- Subset A: points with x-coordinates less than or equal to 4.
- Subset B: points with x-coordinates greater than 4.
Treat each subset as a separate dataset and repeat the process. For example, Subset A might be further split by the median of the y-coordinates of its points.
Continue dividing until each subset contains only a single point.

Use cases

Efficient searching and sorting of spatial or multidimensional data once the tree is constructed.
Representing hierarchical partitions of datasets across any number of dimensions.
Handling outliers by isolating them into individual subsets.

Notes or pitfalls

Constructing a K-H tree can be computationally intensive because the median must be computed for each split and each dimension as the tree is built.
While construction time can be high, query operations (searching/sorting) are typically faster after the tree is constructed.

Quadtrees
Binary trees
K-H trees (abbreviation for Kleiner-Hartigan trees)