本文理解翻译自：http://en.wikipedia.org/wiki/K-d_tree

k-d树(k-d tree)

来自维基百科，自由的百科全书

简介

k-d树是二叉树的一种，树中每个节点都是一个多维(k-dimension)的数据点。每个非叶子节点都可以看做是隐含的分割超平面，该平面将空间分成两部分（也叫半空间）。超平面左边的点由k-d树的左子树表示，右边的点由右子树表示。选择超平面的方式为：树中的每个节点对应K个维度中的一维，超平面会垂直这个维度的坐标轴。例如，如果选择X轴做分割，那么所有X值小当前树节点的点都在当前树节点的左子树中，所有X值大于当前树节点的点都在当前树节点的右子树中。在这种情况下，超平面是通过点的x值来设置，它的法向量(normal)就是单位X轴。^[1]

k-d树的操作

建树

有很多方法可以用来选择坐标轴分割平面，所以有许多不同的构建k-d树方法。比较权威的k-d建树方法有下面几个约束：^[2]

在建树过程中，循环使用每个坐标来选择分割平面。（例如，在一个3D树中，根节点使用x对应的平面，根节点的儿子选择y对应的平面，根节点的孙子选择z对应的平面，曾孙使用x对应的平面，玄孙使用y对应的平面，如此往复。）
选择了分割面之后，将所有点对应维度的中值（中位数）所在点作为当前的树节点。

这种方法可以创建一个平衡k-d树，平衡的意思是说每个叶节点到根节点的距离大致相同。但是，平衡树并不是对所有应用都是最优的。
另外要注意的是选择中值也不是必须的。这种情况下，不保证树的平衡。一个简单的启发方法用来避免编写复杂的median-finding(O(N))算法^[3]^[4] ，或者使用 Heapsort or Mergesort排序(O(nlogn)，具体做法是随机挑选指定数量（小于n）的点取中值用于分割平面。实践中，这个技术通常能够产生很平衡的k-d树。
给定一个长度为n的点链表，下面的算法使用中值选择排序来创建平衡k-d树。

function kdtree (list of points pointList, int depth)
{
    // Select axis based on depth so that axis cycles through all valid values
    var int axis := depth mod k;
        
    // Sort point list and choose median as pivot element
    select median by axis from pointList;
        
    // Create node and construct subtrees
    var tree_node node;
    node.location := median;
    node.leftChild := kdtree(points in pointList before median, depth+1);
    node.rightChild := kdtree(points in pointList after median, depth+1);
    return node;
}

通常在中值“之后”的点应该只包括严格大于中值的点。对于中值对应的点，it is possible to define a “superkey” function that compares the points in all dimensions。在某些情况下，等于中值的点放在某一边也是可以的，例如，将点分成“小于“子集和“大于等于”子集。
上面的算法用Python实现的例程如下：

from collections import namedtuple
from operator import itemgetter
from pprint import pformat

class Node(namedtuple('Node', 'location left_child right_child')):
    def __repr__(self):
        return pformat(tuple(self))

def kdtree(point_list, depth=0):
    try:
        k = len(point_list[0]) # assumes all points have the same dimension
    except IndexError as e: # if not point_list:
        return None
    # Select axis based on depth so that axis cycles through all valid values
    axis = depth % k
 
    # Sort point list and choose median as pivot element
    point_list.sort(key=itemgetter(axis))
    median = len(point_list) // 2 # choose median
 
    # Create node and construct subtrees
    return Node(
        location=point_list[median],
        left_child=kdtree(point_list[:median], depth + 1),
        right_child=kdtree(point_list[median + 1:], depth + 1)
    )

def main():
    """Example usage"""
    point_list = [(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)]
    tree = kdtree(point_list)
    print(tree)

if __name__ == '__main__':
    main()

输出如下：

((7, 2),
 ((5, 4), ((2, 3), None, None), ((4, 7), None, None)),
 ((9, 6), ((8, 1), None, None), None))

生成的树如下图所示：

上面的k-d树分解了点集:(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)。

上图是k-d树结果。

上面的算法为每个节点做了一个不变的条件规定：所有左子树的节点在分割面的一边，所有右子树的节点在分割面的另一边。分类面上的点可以在任意一边。当前节点存储分类面上的值（代码中是node.location)。
创建平衡k-d树的另外一种算法是：建树之前对数据预排序。然后在建树的过程中维护这个顺序，从而消除了每层分支时查找中值的开销。在三维计算机器图形中，有两个这样的算法通过构造平衡k-d树来排序三角形，从而提升光纤追踪的性能。这些算法在建树之前预排序n个三角形，然后以O(n logn)的最好时间复杂度建树；但是，这些算法的最坏时间复杂度很难预测，因为它依赖于计算机图形中三角形的特定排列。^[5]^[6] 与这些算法相比，有一个算法能够通过排序点集以O(kn logn)的最坏时间复杂度排序建树。这个算法首先使用Heapsort 或 Mergesort 以O(n logn)的时间复杂度在每个维度(共k个）上预排序n个点，然后在建树过程中维护这k维的顺序，因此能够获得最坏时间复杂度O(kn logn)。

添加元素

This section requires expansion.(November 2008)

向k-d树添加节点跟其他搜索树添加节点的方式一样。首先，遍历树，从根节点开始，将待插入点跟当前节点比较确定是在那个分割面，从而选择继续遍历左儿子节点还是右儿子节点。一旦找到了可以添加到下面的节点，将新的待插入点作为左儿子或者右儿子节点添加到树中，“左”还是“右”取决于该节点跟分类面的关系。
按这种方式添加节点可能会导致树失去平衡，从而降低树的性能。树性能的降低比例取决于树之前的空间分布，以及添加的节点数和树原大小的关系。如果输变得很不平衡，就需要做均衡了，从而恢复依赖于树平衡的查询性能，例如最近邻居查询。

删除节点

This section requires expansion.(February 2011)

从已有k-d树中删除节点，且不破坏限制条件，最简单的方法是将待删除节点及其子树做成集合，并重新建立子树。
另外一个方法是为待删除点找一个替代点。^[8] 首先，找到包含待删除点的节点R；如果R是叶子节点，不需要替换；如果是其他情况，从以R为根的子树中找到一个替代点，设为p；交换R和p；然后，递归删除p。
找到一个可替换点的方法：假设节点R通过x轴来区分，并且R有一个右儿子，找到这个右儿子及其子树中x值最小的点，即为可替换点。反之，找到右儿子及其子树中x值最大的点，即为可替换点。

平衡

k-d树的平衡需要非常小心，因为k-d树通过多个维度来排序，所以tree rotation这样的技术不能用来做平衡，原因是这个技术可能破坏k-d树的限制条件。
Several variants of balanced k-d trees exist. They include divided k-d tree, pseudo k-d tree, k-d B-tree, hB-tree and Bkd-tree. Many of these variants are adaptive k-d trees.
k-d树有几种变体，包括：divided k-d tree、pseudo k-d tree、 k-d B-tree、Bkd-tree。这写变体里面有许多是自适应k-d树。

近邻搜索

上图是二维k-d树中的NN搜索的动画

近邻搜索(NN)算法旨在从树中找到离给定的点最近的点。这个检索可以通过k-d树的特性快速缩减大部分搜索空间而高效实现。
从k-d树中查找最近邻点按下列步骤进行：

从跟节点开始，递归向下遍历，这个添加节点是一样的（例如：向左还是向右取决于待查点在分割维度上比当前点小还是大）。
一旦找到叶子节点，就将该节点保存为“当前最佳”。
回溯，对每个节点执行下列步骤：
1. 如果当前节点比“当前最佳”更接近待查节点，更新该节点为“当前最佳”
2. 检查在分割面的另一边是否有比“当前最佳”离待查点更近的节点。从概念上来说，以待查点为中心、以当前最近距离为半径画一个超球面，看这个超球面是否穿过了分割平面。因为平面都是坐标轴对应的，所以只需要简单比较待查点和当前点的在分裂面上的那个维度的差值是否比当前最佳距离小。
  1. 如果超球面穿越的分割面，那么分割面的另外一侧可能有最近点，所以需要递归遍历树的另外的分支，从而寻找更近的点。
  2. 如果超球面没有穿过分割面，继续遍历其他节点，但是分割面另外一边的整个分支会被剪掉。
当算法最后回溯到根节点的时候，检索完成。

通常，算法使用平方距离来做比较，而不是计算（更耗时的）平方根。另外，可以通过维持当前最好的平方距离来节省计算量。
在随机分布的数据点上，查找最近点是一个O(log n)操作，这个分析比较麻烦。但是有算法声明可以保证O(log n)的时间复杂度。^[9] 在高维度空间，维数灾难会导致算法需要访问远多于低维空间的分支。在实践中，如果点数比维数大不了多少，算法只能略好于线性遍历所有的点。
这个算法也可以通过简单地修改做多种扩展。比如，可用于计算k个最近邻点，这个时候需要保存k个当前最佳而不是一个。分支能够剪掉的条件是：k个点都找到，并且分支中没有比这k个最佳更近的点。
还可以做近似是算法更快。例如：近似最近点查找可以通过指定检查点的上限来实现，也可以基于实时时钟（硬件实现更合适）终止检索过程。【如果是查找已经在树中的最近邻点，只需要看节点的距离是否为0就可以了，这有个缺点就是可能会丢弃重复、但是和待查点一致的节点。Nearest neighbour for points that are in the tree already can be achieved by not updating the refinement for nodes that give zero distance as the result, this has the downside of discarding points that are not unique, but are co-located with the original search point.（这一句理解不够，把原文放这了）】
近似的近邻查找在实时程序中比较有价值，例如机器人的显著性能提升就是通过非穷举搜索来获得的。一种实现是：best-bin-first search。

范围搜索

范围搜索指的是使用范围参数来做检索。例如，如果一个k-d树存储的是收入和年龄的数值，那么一个范围搜索可能是：查找树中年龄在20到50，收入在50000到80000的节点。应为k-d树在树的每一层对域的范围做了分割，所以可以高效执行范围查询。
Analyses of binary search trees has found that the worst case time for range search in a k-dimensional KD tree containing N nodes is given by the following equation.^[10] 二叉搜索树的分析表明：在包含N个节点的k-d树中做范围查找，最坏时间复杂度如下：

高维数据

k-d trees are not suitable for efficiently finding the nearest neighbour in high-dimensional spaces. As a general rule, if the dimensionality is k, the number of points in the data, N, should be N >> 2^k. Otherwise, when k-d trees are used with high-dimensional data, most of the points in the tree will be evaluated and the efficiency is no better than exhaustive search,^[11] and approximate nearest-neighbour methods should be used instead.
在高维空间，k-d树是不适合做高效的近邻查询。通常原则是，如果维度是k, 数据点数是N，需要满足N >> 2^k。否则，当k-d树用在高维度数据上，查找时绝大多数节点需要做评估，所以性能不一定比穷举搜索好^[11]，应该替换为一个近似的近邻查询。

复杂度

从n个节点创建一个静态的k-d树有下列最坏时间复杂度：
- O(n log² n) 在建树过程中，每层使用像Heapsort 或 Mergesort 这样的O(n logn)算法查找中值。
- O(n log n) 使用线性查找均值算法median of medians ^[3]^[4] 。
- O(kn log n) 对n个节点每一维做了与排序，使用Heapsort or Mergesort这样的时间复杂度为O(nlogn)的排序方法。 ^[7]

插入一个新节点到平衡k-d树, 时间复杂度为O(log n)
从平衡k-d树中删除一个节点, 时间复杂度为O(log n)
在平衡k-d树中，做坐标轴平行的范围查询的时间复杂度是O(n^1-1/k +m)，其中m是要返回的节点数，k是k-d树的维度。
在用随机分布的数据构造的平衡二叉树上，查找一个最近邻的平均时间复杂度是：O(log n) 。

翻译待续…..

Variations

Volumetric objects

Instead of points, a k-d tree can also contain rectangles or hyperrectangles.^[12]^[13] Thus range search becomes the problem of returning all rectangles intersecting the search rectangle. The tree is constructed the usual way with all the rectangles at the leaves. In an orthogonal range search, the opposite coordinate is used when comparing against the median. For example, if the current level is split along x_high, we check the x_low coordinate of the search rectangle. If the median is less than the x_low coordinate of the search rectangle, then no rectangle in the left branch can ever intersect with the search rectangle and so can be pruned. Otherwise both branches should be traversed. See also interval tree, which is a 1-dimensional special case.

Points only in leaves

It is also possible to define a k-d tree with points stored solely in leaves.^[2] This form of k-d tree allows a variety of split mechanics other than the standard median split. The midpoint splitting rule^[14] selects on the middle of the longest axis of the space being searched, regardless of the distribution of points. This guarantees that the aspect ratio will be at most 2:1, but the depth is dependent on the distribution of points. A variation, called sliding-midpoint, only splits on the middle if there are points on both sides of the split. Otherwise, it splits on point nearest to the middle. Maneewongvatana and Mount show that this offers “good enough” performance on common data sets. Using sliding-midpoint, an approximate nearest neighbour query can be answered in $O \left ( \frac{ 1 }{ { \epsilon\ }^d } \log n \right )$ . Approximate range counting can be answered in $O \left ( \log n + { \left ( \frac{1}{ \epsilon\ } \right ) }^d \right )$ with this method.

References

Bentley, J. L. (1975). “Multidimensional binary search trees used for associative searching”. Communications of the ACM 18 (9): 509. doi:10.1145/361002.361007. edit
“Orthogonal Range Searching”. Computational Geometry. 2008. p. 95. doi:10.1007/978-3-540-77974-2_5. ISBN 978-3-540-77973-5. edit
Blum, M.; Floyd, R. W.; Pratt, V. R.; Rivest, R. L.; Tarjan, R. E. (August 1973). “Time bounds for selection” (PDF). Journal of Computer and System Sciences 7 (4): 448–461. doi:10.1016/S0022-0000(73)80033-9. edit
Cormen, Thomas H.; Leiserson, Charles E., Rivest, Ronald L.. Introduction to Algorithms. MIT Press and McGraw-Hill. Chapter 10.
Wald I, Havran V (September 2006). “On building fast kd-trees for ray tracing, and on doing that in O(N log N)” (PDF). In: Proceedings of the 2006 IEEE Symposium on Interactive Ray Tracing: 61–69. doi:10.1109/RT.2006.280216.
Havran V, Bittner J (2002). “On improving k-d trees for ray shooting” (PDF). In: Proceedings of the WSCG: 209–216.
Brown RA (2015). “Building a balanced k-d tree in $O(kn log n)$ time”. Journal of Computer Graphics Techniques 4 (1): 50–68.
Chandran, Sharat. Introduction to kd-trees. University of Maryland Department of Computer Science.
Freidman, J. H.; Bentley, J. L.; Finkel, R. A. (1977). “An Algorithm for Finding Best Matches in Logarithmic Expected Time”. ACM Transactions on Mathematical Software 3 (3): 209. doi:10.1145/355744.355745. edit
Lee, D. T.; Wong, C. K. (1977). “Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees”. Acta Informatica 9. doi:10.1007/BF00263763. edit
Jacob E. Goodman, Joseph O’Rourke and Piotr Indyk (Ed.) (2004). “Chapter 39 : Nearest neighbours in high-dimensional spaces”. Handbook of Discrete and Computational Geometry (2nd ed.). CRC Press.
Rosenberg, J. B. (1985). “Geographical Data Structures Compared: A Study of Data Structures Supporting Region Queries”. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 4: 53. doi:10.1109/TCAD.1985.1270098. edit
Houthuys, P. (1987). “Box Sort, a multidimensional binary sorting method for rectangular boxes, used for quick range searching”. The Visual Computer 3 (4): 236. doi:10.1007/BF01952830. edit
S. Maneewongvatana and D. M. Mount. It’s okay to be skinny, if your friends are fat. 4th Annual CGC Workshop on Computational Geometry, 1999.

External links

libkdtree++, an open-source STL-like implementation of k-d trees in C++.
A tutorial on KD Trees
FLANN and its fork nanoflann, efficient C++ implementations of k-d tree algorithms.
Spatial C++ Library, a generic implementation of k-d tree as multi-dimensional containers, algorithms, in C++.
kdtree A simple C library for working with KD-Trees
K-D Tree Demo, Java applet
libANN Approximate Nearest Neighbour Library includes a k-d tree implementation
Caltech Large Scale Image Search Toolbox: a Matlab toolbox implementing randomized k-d tree for fast approximate nearest neighbour search, in addition to LSH, Hierarchical K-Means, and Inverted File search algorithms.
Heuristic Ray Shooting Algorithms, pp. 11 and after
Into contains open source implementations of exact and approximate (k)NN search methods using k-d trees in C++.
Math::Vector::Real::kdTree Perl implementation of k-d trees.