IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    Simple regression analysis using Python

    SparkandShine发表于 2016-07-26 17:19:42
    love 0

    This article introduces simple regression analysis illustrated with a simple example and shows how to calculate it using Python.

    1. Simple regression analysis

    1.1 Introduction

    In layman’s terms, given a bunch of data points \({(x_i, y_i), i = 1, …, n}\), simple regression analysis is to find a straight line that fits them the best. (two dimensions)

    $$
    y=\alpha +\beta \cdot x
    $$

    where \(\alpha\) and \(\beta\) are so-called intercept and slop respectively. Such a equation can be used for predicting the unknown value.

    Before diving into details, let’s take an example. There are 15 data points, the height and mass (weight), from a sample of American women of age 30–39.

    # Height (m)
    x = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
    
    # Mass (kg)
    y = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]
    

    We make a scatter plot for the above data points.

    simple_regression_analysis
    Fig. 1: An example illustrates simple regression analysis

    Clearly, no matter how we plot the straight line, some points have a distance to that line. This is known as the error term, denoted by \(\varepsilon\). Thus, each data point \((x_i, y_j)\) can be precisely described by,

    $$
    y_i = \alpha + \beta \cdot x_i + \varepsilon_i
    $$

    So very naturally the question arises: how do we find the ‘best’ line? One possible solution is to find a line that minimizes the sum of squared errors. Such a approach is so-called ordinary least squares (OLS). Formally, find \(\alpha\) and \(\beta\) such that,

    $$
    \min_{\alpha ,\,\beta}Q(\alpha,\beta) =\sum_{i=1}^{n}\varepsilon_{i}^{\,2}=\sum_{i=1}^{n}(y_{i}-\alpha -\beta x_{i})^{2}
    $$

    We can derive \(\alpha\) and \(\beta\) by using either calculus, the geometry of inner product spaces, or simply expanding to a quadratic expression.

    1.2 Calculate using Python

    scipy.stats.linregress is used to calculate a linear least-squares regression for two sets of measurements.

    import scipy.stats
    
    # Calculate a linear least-squares regression for two sets of measurements.
    slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y=None)
    
    # Returns
        slope       # float, slope of the regression line
        intercept   # float, intercept of the regression line
        r_value     # float, correlation coefficient
        p_value     # float, two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero.
        std_err     # float, standard error of the estimate
    

    Apply scipy.stats.linregress to the above example, we have,

    slop            61.2721865421
    intercept       -39.0619559188
    R-squared       0.989196922446
    p-value         3.60351533955e-14
    standard error  1.77592275222
    

    As described earlier, slope is denoted by \(\beta\) and intercept is denoted by \(\alpha\). We are able to derive an equation. It can be used to predict the height with the known mass and vice versa.

    $$
    y = -39.06 + 61.27 \cdot x
    $$

    R value

    Pearson product-moment correlation coefficient, also known as r, R, or Pearson’s r, a measure of the strength and direction of the linear relationship between two variables that is defined as the (sample) covariance of the variables divided by the product of their (sample) standard deviations.

    R squared

    In statistics, the coefficient of determination, denoted \(R^2\) or \(r^2\) and pronounced “R squared”, is a number that indicates the proportion of the variance in the dependent variable that is predictable from the independent variable.

    P value

    In frequentist statistics, the p-value is a function of the observed sample results (a test statistic) relative to a statistical model, which measures how extreme the observation is.

    Standard error

    In regression analysis, the term “standard error” is also used in the phrase standard error of the regression to mean the ordinary least squares estimate of the standard deviation of the underlying errors.

    2. The source code

    The source code of the above example is hosted on my GitHub, here.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-#
    from __future__ import division
    
    import scipy.stats
    import matplotlib.pyplot as plt
    import numpy as np
    
    def main():
    
        '''
        This example concerns the data set from the ordinary least squares article. 
        This data set gives average masses for women as a function of their height 
        in a sample of American women of age 30–39.
        '''
    
        x = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
        y = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]
    
    
        # Step 1: regression analysis
        slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x,y)
    
        print "slop\t", slope
        print "intercept\t", intercept
        print "R-squared\t", r_value**2
        print "p-value\t", p_value  
        print "standard error\t", std_err
    
        # Step 2: plot the graph
        fig, ax = plt.subplots()
    
        # plot scatter
        ax.plot(x, y, 'bx', markersize=10)
    
        # plot the straight line
        new_x = np.linspace(min(x), max(x), 1000)
        new_y = slope * new_x + intercept
    
        ax.plot(new_x, new_y, 'r', linewidth=2)
        ax.set_xlabel('Height (m)')
        ax.set_ylabel('Mass (kg)')
        plt.grid()
    
        out_file = 'simple_regression_analysis.png'
        plt.savefig(out_file, bbox_inches='tight')
    
        plt.show()
    
    if __name__ == '__main__':
        main()
    

    References:
    [1] Wikipedia: Simple linear regression
    [2] Wikipedia: Coefficient of determination



沪ICP备19023445号-2号
友情链接