IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    代数方法求PageRank

    SparkandShine发表于 2017-02-14 12:33:28
    love 0

    本文结合实例介绍如何用代数方法求PageRank。

    1. PageRank

    博文《网页排序算法PageRank》介绍了PageRank,计算PageRank可以用迭代的方法也可以用代数的方法,其背后的数学基本运算是一样的,即:

    $$
    PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in B(p_i)} \frac{PR(p_j)}{L(p_j)}
    $$

    下文结合图1介绍如何用代数方法求PageRank。

    Wikipedia-PageRanks-Example
    Fig. 1: PageRanks for a simple network (image from Wikipedia).

    为了便于讨论,将图1下方的节点分别标上G, H, I, J, K,如下图所示:

    wikipedia_pagerank_example
    Fig. 2: Draw Fig. 1 in NetworkX.

    2. 代数方法

    根据1中的等式,把所有节点都放在一块,可以得到:

    $$
    \begin{bmatrix}
    PR(p_1) \
    PR(p_2) \
    \vdots \
    PR(p_N)
    \end{bmatrix} =
    \begin{bmatrix}
    {(1-d)/ N} \
    {(1-d) / N} \
    \vdots \
    {(1-d) / N}
    \end{bmatrix}
    + d
    \begin{bmatrix}
    \ell(p_1,p_1) & \ell(p_1,p_2) & \cdots & \ell(p_1,p_N) \
    \ell(p_2,p_1) & \ddots & & \vdots \
    \vdots & & \ell(p_i,p_j) & \
    \ell(p_N,p_1) & \cdots & & \ell(p_N,p_N)
    \end{bmatrix}
    \cdot
    \begin{bmatrix}
    PR(p_1) \
    PR(p_2) \
    \vdots \
    PR(p_N)
    \end{bmatrix}
    $$

    上述等式可以缩写为:

    $$
    \mathbf{R} = \frac{1-d}{N} \mathbf{1} + d \mathcal{M}\mathbf{R}
    $$

    其中,\(\mathbf{1}\)为\(N\)维的列向量,所有元素皆为1。以图1为例,该列向量为,

    N = len(G.nodes())      # N = 11
    column_vector = np.ones((N, 1), dtype=np.int)
    
    [[1]
     [1]
     [1]
     [1]
     [1]
     [1]
     [1]
     [1]
     [1]
     [1]
     [1]]
    

    2.1 Adjacency function

    邻接函数(adjacency function)\(\ell(p_1,p_2)\)组成了矩阵\(\mathcal{M}\),

    $$
    \mathcal{M}_{ij} =
    \ell(p_i,p_j) =
    \begin{cases}
    1 /L(p_j) , & \mbox{if }j\mbox{ links to }i. L(p_j)\mbox{是指从}p_j\mbox{链出去的网页数目} \
    0, & \mbox{otherwise}
    \end{cases}
    $$

    这样矩阵每一行乘以\(\mathbf{R}\),就得到了新的PR值,比如第二行(图1的节点B),

    $$
    \begin{split}
    M_{2j} & = \ell(p_2,p_2) \cdot PR(p_2) + \ell(p_2,p_2) \cdot PR(p_2) + \cdots + \ell(p_2,p_N) \cdot PR(p_2) \
    &= 0 \mbox{ (A')} + 0 \mbox{ (B’)} + 1 \mbox{ (C')} + \frac{1}{2} \mbox{ (D’)} + \frac{1}{3} \mbox{ (E')} + \frac{1}{2} \mbox{ (F’)} \
    & + \frac{1}{2} \mbox{ (G')} + \frac{1}{2} \mbox{ (H’)} + \frac{1}{2} \mbox{ (I')} + 0 \mbox{ (J’)} + 0 \mbox{ (`K’)}
    \end{split}
    $$

    以节点G为例,G给B和E投票,所以B得到1/2。

    2.2 矩阵\(\mathcal{M}\)

    事实上,\(\mathcal{M}\)可以被看成normalized的图邻接矩阵,即:

    $$
    \mathcal{M} = (K^{-1} A)^T
    $$

    其中,\(\mathcal{A}\)为图的邻接矩阵,以图1为例,

    # Get adjacency matrix
    nodelist = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']  # sorted(G.nodes())
    A = nx.to_numpy_matrix(G, nodelist)
    
      'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K'
    [[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
     [ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
     [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
     [ 1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
     [ 0.  1.  0.  1.  0.  1.  0.  0.  0.  0.  0.]
     [ 0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
     [ 0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
     [ 0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
     [ 0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
     [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
     [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.]]
    

    \(\mathcal{A}\)是对角矩阵,对角线上的元素是对应节点的出度。

    nodelist = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']  # sorted(G.nodes())
    list_outdegree = map(operator.itemgetter(1), sorted(G.out_degree().items()))
    K = np.diag(list_outdegree)
    
    [[0 0 0 0 0 0 0 0 0 0 0]
     [0 1 0 0 0 0 0 0 0 0 0]
     [0 0 1 0 0 0 0 0 0 0 0]
     [0 0 0 2 0 0 0 0 0 0 0]
     [0 0 0 0 3 0 0 0 0 0 0]
     [0 0 0 0 0 2 0 0 0 0 0]
     [0 0 0 0 0 0 2 0 0 0 0]
     [0 0 0 0 0 0 0 2 0 0 0]
     [0 0 0 0 0 0 0 0 2 0 0]
     [0 0 0 0 0 0 0 0 0 1 0]
     [0 0 0 0 0 0 0 0 0 0 1]]
    

    \(\mathbf{K}\)的逆矩阵\(\mathbf{K^{-1}}\)为,

    K_inv = np.linalg.pinv(K)
    
    [[ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
     [ 0.    1.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
     [ 0.    0.    1.    0.    0.    0.    0.    0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.5   0.    0.    0.    0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.33  0.    0.    0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.5   0.    0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.5   0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.5   0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.    0.5   0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    1.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    1.  ]]
    

    那么,根据公式\(\mathcal{M} = (K^{-1} A)^T\)就可以求得\(\mathcal{M}\),如下,

    M = (K_inv * A).transpose()
    
    [[ 0.    0.    0.    0.5   0.    0.    0.    0.    0.    0.    0.  ]
     [ 0.    0.    1.    0.5   0.33  0.5   0.5   0.5   0.5   0.    0.  ]
     [ 0.    1.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.33  0.    0.    0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.5   0.5   0.5   0.5   1.    1.  ]
     [ 0.    0.    0.    0.    0.33  0.    0.    0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]]
    

    2.3 求解

    \(\mathbf{R}\)是2.1等式的特征向量(eigenvector),求解等式得:

    $$
    \mathbf{R} = (\mathbf{I}-d \mathcal{M})^{-1} \frac{1-d}{N} \mathbf{1}
    $$

    其中\(\mathbf{I}\)是单位矩阵。

    d = 0.85
    I = np.identity(N)
    R = np.linalg.pinv(I - d*M) * (1-d)/N * column_vector 
    
    [[ 0.028]
     [ 0.324]
     [ 0.289]
     [ 0.033]
     [ 0.068]
     [ 0.033]
     [ 0.014]
     [ 0.014]
     [ 0.014]
     [ 0.014]
     [ 0.014]]
    

    咦,结果怎么跟图1不一样,在StackOverflow提了一个问,Incorrect PageRank calculation result。

    本文涉及到的源代码已经分享到我的GitHub,pagerank目录下的pagerank_algebraic.py.

    3. Python源代码

    NetworkX实现了PageRank的代数计算方法nx.pagerank_numpy,源代码在这里。

    def pagerank_numpy(G, alpha=0.85, personalization=None, weight='weight', dangling=None):
        """Return the PageRank of the nodes in the graph.
        """
    
        if len(G) == 0:
            return {}
    
        M = google_matrix(G, alpha, personalization=personalization,
                          weight=weight, dangling=dangling)
    
        # use numpy LAPACK solver
        eigenvalues, eigenvectors = np.linalg.eig(M.T)
        ind = eigenvalues.argsort()
    
        # eigenvector of largest eigenvalue at ind[-1], normalized
        largest = np.array(eigenvectors[:, ind[-1]]).flatten().real
        norm = float(largest.sum())
    
        return dict(zip(G, map(float, largest / norm)))
    


沪ICP备19023445号-2号
友情链接