CRUSH算法基本原理和相关的数据结构

17-11-30 来源：[db:作者]

收藏我要投稿

CRUSH 算法解决了PG副本如何分布在集群OSD上的问题，本文先介绍CRUSH算法基本原理和相关的数据结构，主要是CRUSH map 中的内容如 bucket 、placement rule等，以及他们的源码初探，在下一节中将介绍CRUSH的算法实现。

数据分布算法

存储系统的数据分布算法要解决数据如何分布到各个节点的问题，随着大规模分布式存储系统(PB级的数据和成百上千台存储设备)的出现，数据分布算法面临了几个挑战：

数据分布和负载的均衡。数据分布的均衡是指算法要能够将数据均匀的存储到各个节点和磁盘上。负载均衡指的是数据的访问的等操作的负载在各个节点之间负载均衡。扩展性和高可用性。系统方便的增加或者删除存储设备。当增删设备之后，数据恢复和数据迁移，将会使得数据重新均衡，算法需要是这样的移动、迁移的数据量尽可能的少。支持大规模集群。为了支持大规模的存储集群，要求数据分布算法维护的元数据要相对较小，且计算量不能太大。

传统的存储使用数据中心式的元数据/索引表来保存有关客户端的数据如何存放的问题。而Ceph使用CRUSH按需计算出元数据，因此它消除了对中心式的服务器/网关的需求（去中心化）。

Ceph设计了CRUSH(一个可扩展的伪随机数据分布算法)，用在分布式对象存储系统上，可以有效映射数据对象到存储设备上(不需要中心设备)。因为大型系统的结构式动态变化的，CRUSH能够处理存储设备的添加和移除，并最小化由于存储设备的的添加和移动而导致的数据迁移。

CRUSH有两个关键优点：

任何组件都可以独立计算出每个object所在的位置(去中心化)。只需要很少的元数据(cluster map)，只要当删除添加设备时，这些元数据才需要改变。

CRUSH 算法的基本原理

CRUSH算法全称为可控的、可扩展的、分布式的副本数据放置算法。

CRUSH算法解决的是PG如何映射到OSD列表中的问题，也就是给定一个输入x，CRUSH 算法将输出一个确定的有序的储存目标向量 ?R 。其中x为要计算的PG的pg _id，存储向量列表就是OSD列表，即选择的OSD，这些OSD通常都会在不同的故障域。

CRUSH（x）-> (OSD1,OSD2,...,OSDn)

当输入x，CRUSH利用强大的多重整数hash函数根据Hierachical Cluster Map(集群的拓扑结构)、Placement Rules（定位规则、选择策略）、以及x计算出独立的完全确定可靠的映射关系。下面就介绍下层级的Cluster Map 和 Placement Rules。

Hierachical Cluster Map

层级化的Cluster Map 定义了OSD集群剧透层级的静态拓扑结结构。OSD的层级让CRUSH算法在选择OSD时实现了机架感知的能力，通过规则定义，使得副本分布在不同机架、不同机房中，提高的安全性。CRUSH Map文件里面到底有什么呢？主要包含四个部分。

struct crush_map {
        /*! An array of crush_bucket pointers of size __max_buckets__.
         * An element of the array may be NULL if the bucket was removed with
         * crush_remove_bucket(). The buckets must be added with crush_add_bucket().
         * The bucket found at __buckets[i]__ must have a crush_bucket.id == -1-i.
         */
    struct crush_bucket **buckets;
        /*! An array of crush_rule pointers of size __max_rules__.
         * An element of the array may be NULL if the rule was removed (there is
         * no API to do so but there may be one in the future). The rules must be added
         * with crush_add_rule().
         */
    struct crush_rule **rules;
        __s32 max_buckets; /*!< the size of __buckets__ */
    __u32 max_rules; /*!< the size of __rules__ */
        /*! The value of the highest item stored in the crush_map + 1
         */
    __s32 max_devices;

    /*! Backward compatibility tunable. It implements a bad solution
         * and must always be set to 0 except for backward compatibility
         * purposes
         */
    __u32 choose_local_tries;
    /*! Backward compatibility tunable. It implements a bad solution
         * and must always be set to 0 except for backward compatibility
         * purposes
         */
    __u32 choose_local_fallback_tries;
    /*! Tunable. The default value when the CHOOSE_TRIES or
         * CHOOSELEAF_TRIES steps are omitted in a rule. See the
         * documentation for crush_rule_set_step() for more
         * information
         */
    __u32 choose_total_tries;
    /*! Backward compatibility tunable. It should always be set
         *  to 1 except for backward compatibility. Implemented in 2012
         *  it was generalized late 2013 and is mostly unused except
         *  in one border case, reason why it must be set to 1.
         *
         *  Attempt chooseleaf inner descent once for firstn mode; on
         *  reject retry outer descent.  Note that this does *not*
         *  apply to a collision: in that case we will retry as we
         *  used to.
         */
    __u32 chooseleaf_descend_once;
    /*! Backward compatibility tunable. It is a fix for bad
         *  mappings implemented in 2014 at
         *  https://github.com/ceph/ceph/pull/1185. It should always
         *  be set to 1 except for backward compatibility.
         *
         *  If non-zero, feed r into chooseleaf, bit-shifted right by
     *  (r-1) bits.  a value of 1 is best for new clusters.  for
     *  legacy clusters that want to limit reshuffling, a value of
     *  3 or 4 will make the mappings line up a bit better with
     *  previous mappings.
         */
    __u8 chooseleaf_vary_r;

    /*! Backward compatibility tunable. It is an improvement that
         *  avoids unnecessary mapping changes, implemented at
         *  https://github.com/ceph/ceph/pull/6572 and explained in
         *  this post: "chooseleaf may cause some unnecessary pg
         *  migrations" in October 2015
         *  https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg26075.html
         *  It should always be set to 1 except for backward compatibility.
         */
    __u8 chooseleaf_stable;

        /*! @cond INTERNAL */
    /* This value is calculated after decode or construction by
       the builder. It is exposed here (rather than having a
       'build CRUSH working space' function) so that callers can
       reserve a static buffer, allocate space on the stack, or
       otherwise avoid calling into the heap allocator if they
       want to. The size of the working space depends on the map,
       while the size of the scratch vector passed to the mapper
       depends on the size of the desired result set.

       Nothing stops the caller from allocating both in one swell
       foop and passing in two points, though. */
    size_t working_size;

#ifndef __KERNEL__
    /*! @endcond */
    /*! Backward compatibility tunable. It is a fix for the straw
         *  scaler values for the straw algorithm which is deprecated
         *  (straw2 replaces it) implemented at
         *  https://github.com/ceph/ceph/pull/3057. It should always
         *  be set to 1 except for backward compatibility.
         *
     */
    __u8 straw_calc_version;

        /*! @cond INTERNAL */
    /*
     * allowed bucket algs is a bitmask, here the bit positions
     * are CRUSH_BUCKET_*.  note that these are *bits* and
     * CRUSH_BUCKET_* values are not, so we need to or together (1
     * << CRUSH_BUCKET_WHATEVER).  The 0th bit is not used to
     * minimize confusion (bucket type values start at 1).
     */
    __u32 allowed_bucket_algs;

    __u32 *choose_tries;
#endif
    /*! @endcond */
};

Devices（设备）

CRUSH Map文件的devices部分保存了Ceph集群中的所有OSD设备。OSD是和ceph-osd守护进程一一对应的物理磁盘。要将PG映射到OSD，CRUSH需要OSD设备列表，并且它位于CRUSH的开头部分。

Bucket types （bucket 类型）

CRUSH map 文件的types部分定义了在CRUSH层次结构中用到的bucket类型。Bucket由物理位置（例如，行、机架、机箱、主机等）的分层聚合以及它们被重新分配的权重（weights）组成。它们使用节点和叶子（leaves）两种层次，节点bucket表示物理位置，能够和层次结构中的其他节点和叶子bucket相聚合。叶子bucket表示cephosd守护进程和它们的底层物理设备。默认的bucket类型有一下几个，并且CRUSH支持自定义bucket类型。

数值	BUCKet	描述
0	OSD	一个OSD守护进程（比如osd.1）
1	Host	一个包含若干OSD的节点
2	Rack	一个吧包含若干节点的机架
3	Row	跨一组机架的一行
4	Room	一个包含若干机架和节点的房间
5	Data center	一个包含若干房间的数据中心
6	Root	bucket分层结构的根

源代码的结构定义

struct crush_bucket {
    __s32 id;        /*!< bucket identifier, < 0 and unique within a crush_map */
    __u16 type;      /*!< > 0 bucket type, defined by the caller */
    __u8 alg;        /*!< the item selection ::crush_algorithm */
        /*! @cond INTERNAL */
    __u8 hash;       /* which hash function to use, CRUSH_HASH_* */
    /*! @endcond */
    __u32 weight;    /*!< 16.16 fixed point cumulated children weight */
    __u32 size;      /*!< size of the __items__ array */
        __s32 *items;    /*!< array of children: < 0 are buckets, >= 0 items */
};

Bucket instances(bucket实例)

CRUSH map 文件的bucket部分定义了在CRUSH层次结构中的bucket实例。定义bucket之后，需要给主机（host）声明实例。声明实例时，必须指定bucket类型、一个唯一的名称、一个负整数表示的唯一的ID、和主机设备容量相关的权重（weight）、bucket算法（默认straw），以及哈希算法（默认是0，使用rjenkins1) 。一个bucket下有一个或者多个items，items则包含bucket（可以使用递归定义来理解）或者OSD（达到最底层的节点）。每个items都要有权重，表示该items的相对权重，用于选择bucket或者OSD。

 [bucket-type] [bucket-name]
 id [a unique negative numeric ID]
 weight [the relative capacity the item]
 alg [the buck type:uniform|list|tree|straw|straw2]
 hash [the hash type:0 by defalut]
 item [item] weight [weight]

 列子：
 host test{
   id -1
   #weight 2.000
   alg straw
   hash 0

   item osd.1 weight 1.00
   item osd.2 weight 2.00
   item osd.3 weight 1.00
 }

这里说一说alg参数及其相关内容，其它参数都很好理解不过多介绍。 alg 可以选择多个bucket随机选择算法，每个算法都是在是性能和重组效率之间的一种妥协。下面介绍几种bucket类型的选择items的算法。该算法主要解决的是如何从Bucket中选择出需要的子items。

首先声明下一下面描述中的Hash 函数

hash(x,r,i) // x 为要计算的PG ID 、r 为选择的副本序号、i 为bucket的id号

Uniform

Uniform：当每个items具有相同的权重，且items 很少添加或者删除，可以使用该算法。使用的伪随机排列算法。
参看源代码，位于 ceph/src/crush/mapper.c 中的140行

/*
 * bucket choose methods
 *
 * For each bucket algorithm, we have a "choose" method that, given a
 * crush input @x and replica position (usually, position in output set) @r,
 * will produce an item in the bucket.
 */

/*
 * Choose based on a random permutation of the bucket.
 *
 * We used to use some prime number arithmetic to do this, but it
 * wasn't very random, and had some other bad behaviors.  Instead, we
 * calculate an actual random permutation of the bucket members.
 * Since this is expensive, we optimize for the r=0 case, which
 * captures the vast majority of calls.
 */
static int bucket_perm_choose(const struct crush_bucket *bucket,
                  struct crush_work_bucket *work,
                  int x, int r)
{
    ...
    ...
    ...
}

/* uniform */
static int bucket_uniform_choose(const struct crush_bucket_uniform *bucket,
                 struct crush_work_bucket *work, int x, int r)
{
    return bucket_perm_choose(&bucket->h, work, x, r);
}

List

List ：List型的bucket将她们的内容聚合成链表，可以包含任意权重的设备。集群扩展时，新设备加到表头，数据迁移很少。但是移除设备时，会产生很多数据移动。具体查找算法如下：
1）从表头 item开始，wh为当前item的权重，Ws为剩余权重。
2）根据Hash（x，r，i）函数得到一个[0~1]的值，若该值在[0~Wh/Ws]之间，则选择当前item ，并返回该item的ID)
3）否则item = item->next。即1)2)3) 是递归的。
时间复杂度为O(n)

参看源代码，位于 ceph/src/crush/mapper.c  中的140行

/* list */
static int bucket_list_choose(const struct crush_bucket_list *bucket,
                  int x, int r)
{
    int i;

    for (i = bucket->h.size-1; i >= 0; i--) {
        __u64 w = crush_hash32_4(bucket->h.hash, x, bucket->h.items[i],
                     r, bucket->h.id);
        w &= 0xffff;
        dprintk("list_choose i=%d x=%d r=%d item %d weight %x "
            "sw %x rand %llx",
            i, x, r, bucket->h.items[i], bucket->item_weights[i],
            bucket->sum_weights[i], w);
        w *= bucket->sum_weights[i];
        w = w >> 16;
        /*dprintk(" scaled %llx\n", w);*/
        if (w < bucket->item_weights[i]) {
            return bucket->h.items[i];
        }
    }

    dprintk("bad list sums for bucket %d\n", bucket->h.id);
    return bucket->h.items[0];
}

`Tree Bucket`

树型bucket将项目保存在一颗加权的二叉树中，项目位于叶子节点上。每个根节点都知道它的左子树和右子树的总权重。当bucket中包含大量的item时，效率会比List型的高，时间复杂度为 O(log n)。具体查找算法如下： 1）从Tree bucket的root item （interior node）开始遍历。 2）得到左子书的权重Wl，以及当前节点的权重Wn，计算Hash（x，r，i）得到一个值v。 - 若值v在[0~Wl/Wn]之间则，则左子树走，递归寻找item - 否则往右子树走，递归选择item - 继续遍历子树，直到叶子节点，该叶子节点为选择出的结果。

/* (binary) tree */
static int height(int n)
{
    int h = 0;
    while ((n & 1) == 0) {
        h++;
        n = n >> 1;
    }
    return h;
}

static int left(int x)
{
    int h = height(x);
    return x - (1 << (h-1));
}

static int right(int x)
{
    int h = height(x);
    return x + (1 << (h-1));
}

static int terminal(int x)
{
    return x & 1;
}

static int bucket_tree_choose(const struct crush_bucket_tree *bucket,
                  int x, int r)
{
    int n;
    __u32 w;
    __u64 t;

    /* start at root */
    n = bucket->num_nodes >> 1;

    while (!terminal(n)) {
        int l;
        /* pick point in [0, w) */
        w = bucket->node_weights[n];
        t = (__u64)crush_hash32_4(bucket->h.hash, x, n, r,
                      bucket->h.id) * (__u64)w;
        t = t >> 32;

        /* descend to the left or right? */
        l = left(n);
        if (t < bucket->node_weights[l])
            n = l;
        else
            n = right(n);
    }

    return bucket->h.items[n >> 1];
}

`Straw Bucket`

List buckets和Tree buckets的结构决定了只有有限的哈希值需要计算并与权重进行比较以确定bucket中的项。这样做的话，他们采用了分而治之的方式，要么给特定项以优先权（比如那些在列表开头的项），要么消除完全考虑整个子树的必要。尽管这样提高了副本定位过程的效率，但当向buckets中增加项、删除项或重新计算某一项的权重以改变其内容时，其重组的过程是次最优的。

Straw类型bucket允许所有项通过类似抽签的方式来与其他项公平“竞争”。定位副本时，bucket中的每一项都对应一个随机长度的straw，且拥有最长长度的straw会获得胜利（被选中）。每一个straw的长度都是由固定区间内基于CRUSH输入 x,，副本数目r,，以及bucket项 i，的哈希值计算得到的一个值。每一个straw长度都乘以根据该项权重的立方获得的一个系数 f(wi)，这样拥有最大权重的项更容易被选中。尽管straw类型bucket定位过程要比List buckets和Tree buckets慢，但是straw类型的bucket在修改时最近邻项之间数据的移动（重组过程）是最优的。

/* straw */

static int bucket_straw_choose(const struct crush_bucket_straw *bucket,
                   int x, int r)
{
    __u32 i;
    int high = 0;
    __u64 high_draw = 0;
    __u64 draw;

    for (i = 0; i < bucket->h.size; i++) {
        draw = crush_hash32_3(bucket->h.hash, x, bucket->h.items[i], r);
        draw &= 0xffff;
        draw *= bucket->straws[i];
        if (i == 0 || draw > high_draw) {
            high = i;
            high_draw = draw;
        }
    }
    return bucket->h.items[high];
}

`Straw2 Bucket`

Straw bucket 的改进，可以减少数据的迁移量。例如，增加一个设备给项目C从而改变它的权重后，或者删除项目C以后，数据只会移动到它上面或者从它上面移动到其他地方，而不会在bucket内的其它项目之间出现数据移动。

`Bucket选择算法比较`

`是`	`时间复杂度`	`item添加难度`	`item删除难度`
`uniform`	`O(1)`	`poor`	`poor`
`list`	`O(n)`	`optimal`	`poor`
`tree`	`O(log n)`	`good`	`good`
`straw`	`O(n)`	`better`	`better`
`tree`	`O(n)`	`optimal`	`optimal`

`Placement Rules`

Cluster Map 包含了若干CRUSH rules 来决定存储池内的数据存放的方式。他们指定了replication 和 placement策略，该策略CRUSH将数据保存在Ceph集群中。默认的CRUSH map 包含了适用于默认存储池rbd的一条rule。语法如下：

rule {
    ruleset 
        type [replicated|erasure]
        min_size 
        max_size 
        step take 
        step [choose|chooseleaf] [firstn] 
            
        step emit
}

ruleset : 一个整数值，它指定了这条规则所属的规则集。 type：一个字符串值，它制定了存储池类型是 复制型（replicated）还是纠删码类型（erasure coded） min_size：存储池副本数小于该值，CRUSH将不会为存储池使用这条规则 max_size：存储池副本数大于该值，CRUSH将不会为存储池使用这条规则 step take：获取一个bucket名称，并开始遍历其树。 step choose firstn {num} type {bucket-type}：选择某类型的若干（N）bucket，这里数字N通常是存储池的副本数

若 num == 0 ，选择N个bucket 若num > 0 ,并且小于 N ，选择num个bucket。 若num < 0，选择N+num个bucket（N - abs(num)）

step chooseleaf firstn {num} type {bucket-type}：首先选择指定bucket 类型的一组bucket，然后从每个bucket的子树中选择叶子节点。改组中bucket的数目N通常是该存储池的副本份数。

若 num == 0 ，选择N个bucket 若num > 0 ,并且小于 N ，选择num个bucket。 若num < 0，选择N+num个bucket（N - abs(num)）

step emit：它首先弹出当前值，并清空栈。他会被典型的应用于rule结尾，也可以用于组织同一条rule的不同的树。

`Placement Rules 例子`

如图所示的CRUSH map：顶层是一个root bucket，每个root 下有四个row类型bucket。每个row下面有4个cabinet，每个cabinet下有若干个OSD设备（图中有4个host，每个host有若干个OSD设备，但是在本crushmap 中并没有设置host这一级的bucket，而是直接把4个host的所有OSD都设定为cabinet）：

现在定义规则为：

rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take root
    step choose fisrtn 1 type row
    step choose firstn 3 type cabinet
    step choose firstn 1 type osd
    step emit
}

根据规则算法执行过程如下： 1）选中root 类型的bucket 作为一下步骤的输入。 2）从root类型的bucket中选择一个 row类的子bucket，其选择算法在root的定义中设置，一般默认为straw算法。 3）从上一步输出的row中，选择三个cabinet，其选择的算法在row中定义。 4）从上一步的三个cabinet中，分别选择一个OSD，并输出。

根据rule sets ，选择出的三个OSd都在一个row上的三个cabinet中。

点击复制链接与好友分享!回本站首页