
    Learning inferential statistics using R

    Posted by Jyosna Philip on 2023-12-16 19:55:31
    [This article was first published on R-posts.com, and kindly contributed to R-bloggers.]
    Imagine you need to find the average height of 20-year-olds. One way is to measure every person individually, but that is a great deal of work. Luckily, there is a better way: inferential statistics lets us use a sample to draw conclusions about the population. In other words, we can measure a small group of people and use their characteristics to estimate the characteristics of the entire group.
    To see how this works in practice, let’s take a look at a dataset from Kaggle. This platform provides a wealth of datasets from various fields, each offering unique challenges for R users. Here, we’ll be using a dataset on cardiovascular diseases compiled by Jocelyn Dumlao.
    This dataset originates from a multispecialty hospital in India. It contains 1000 rows and 14 columns of health-related information that can support the early detection of disease.
    Let us see how to import this into RStudio. Assuming the dataset is in .csv format, it can be read with read.csv, which ships with base R, so no extra library is needed (the ‘readr’ package offers read_csv as an alternative). Replace “File path” with the path of your downloaded dataset.
    cardio <- read.csv("File path")
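Before going further, it can help to confirm that a file was read as expected. The sketch below is not part of the original analysis: it writes a tiny made-up CSV to a temporary file (the column selection and the two rows of values are invented purely for illustration), reads it back with read.csv, and inspects the column types with str.

```r
# Hypothetical two-row file standing in for the downloaded dataset;
# the values are invented for illustration only.
tmp_path <- tempfile(fileext = ".csv")
writeLines(c("patientid,age,serumcholestrol",
             "1,53,210",
             "2,40,185"), tmp_path)

demo <- read.csv(tmp_path)   # read.csv ships with base R (package 'utils')
str(demo)                    # prints column names and types

# Fail early if the target column was not read as a number.
stopifnot(is.numeric(demo$serumcholestrol))
```

The same str and stopifnot checks could be applied to cardio once the real file has been read.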
    
    Just type in the name of the variable you used to import the dataset so that you can view the entire dataset in RStudio.
    cardio


    The first 6 rows of the dataset can be viewed using the ‘head’ function.
    top_6=head(cardio)
    top_6
    

    Similarly, the last 6 rows of the dataset can be viewed using the ‘tail’ function.
    bottom_6=tail(cardio)
    bottom_6
    

    The dimension of the dataset (number of rows and columns) can be found out using the ‘dim’ function.
    dimension=dim(cardio)
    dimension
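As a small variation on the three functions above, head and tail accept an n argument, and nrow and ncol return the two components of dim separately. The sketch uses a made-up stand-in data frame (its columns and values are assumptions for illustration); with the real data you would pass cardio instead.

```r
# Stand-in data frame with invented values; replace with `cardio` in practice.
demo <- data.frame(id = 1:10, serumcholestrol = seq(200, 290, by = 10))

head(demo, n = 3)     # first 3 rows instead of the default 6
tail(demo, n = 3)     # last 3 rows
n_rows <- nrow(demo)  # same as dim(demo)[1]
n_cols <- ncol(demo)  # same as dim(demo)[2]
c(rows = n_rows, columns = n_cols)
```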
    

    Here we treat the entire dataset as the population, so the population parameters can be computed directly. The mean of a target variable in the population is calculated with the ‘mean’ function. Below, we choose serumcholestrol as the target variable.
    mean_chol=mean(cardio$serumcholestrol)
    mean_chol
    

    So, the average serumcholestrol level in the patient population taken from the hospital is 311.447.
    R also has a function, ‘sd’, to calculate the standard deviation.

    std_chol=sd(cardio$serumcholestrol)
    std_chol


    This value tells us that serumcholestrol values typically deviate from the mean by about 132.4438.
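To make the standard deviation less of a black box, here is a minimal sketch that computes it by hand and compares the result with sd. The six values are invented for illustration and are not taken from the dataset.

```r
# Invented cholesterol-like values, for illustration only.
x <- c(180, 220, 250, 310, 340, 400)

n <- length(x)
deviations <- x - mean(x)                       # distance of each value from the mean
manual_sd <- sqrt(sum(deviations^2) / (n - 1))  # sample SD with the n - 1 denominator

manual_sd
all.equal(manual_sd, sd(x))                     # TRUE: sd() uses the same formula
```

Note that sd divides by n - 1 rather than n, which matters little for 1000 rows but is worth knowing when the whole dataset is treated as the population.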
    Next, we take a random sample of size 100 from our target variable, serumcholestrol. To sample with replacement, set the third argument (replace) to TRUE. Here, we sample without replacement.

    sample_1=sample(cardio$serumcholestrol,100,FALSE)
    sample_1
    
    mean_sample_chol=mean(sample_1)
    mean_sample_chol
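Because sample draws at random, the sample and its mean change on every run. A common way to make a draw repeatable is set.seed. The sketch below uses a simulated stand-in for the cholesterol column (the values and the seeds are arbitrary assumptions) and also checks that sampling without replacement never picks the same position twice.

```r
# Simulated stand-in for cardio$serumcholestrol (1000 invented values).
set.seed(42)                       # arbitrary seed, chosen only for reproducibility
population <- round(runif(1000, min = 100, max = 600))

set.seed(1)
draw_a <- sample(population, size = 100, replace = FALSE)
set.seed(1)
draw_b <- sample(population, size = 100, replace = FALSE)
identical(draw_a, draw_b)          # TRUE: same seed, same sample

# Sample row positions: without replacement no position repeats,
# with replacement a position may appear more than once.
idx_without <- sample(seq_along(population), 100, replace = FALSE)
idx_with    <- sample(seq_along(population), 100, replace = TRUE)
anyDuplicated(idx_without) == 0    # TRUE
```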

    The mean of the sample that we selected is 317.51; because the sample is random, your value will differ from run to run. This mean can be used to calculate a test statistic, which in turn is used to decide whether to reject or fail to reject the null hypothesis.
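As one possible illustration of such a test statistic, the sketch below computes a one-sample t statistic by hand, t = (sample mean - mu0) / (s / sqrt(n)), and compares it with the built-in t.test. The data are simulated and the null value mu0 = 300 is a hypothetical choice, not something taken from the article.

```r
set.seed(123)                                    # arbitrary seed for reproducibility
sample_demo <- rnorm(100, mean = 310, sd = 130)  # simulated stand-in for sample_1
mu0 <- 300                                       # hypothetical null value, not from the article

n <- length(sample_demo)
t_manual <- (mean(sample_demo) - mu0) / (sd(sample_demo) / sqrt(n))

test <- t.test(sample_demo, mu = mu0)            # built-in one-sample t-test
t_manual
unname(test$statistic)                           # matches t_manual
test$p.value                                     # small values count as evidence against the null
```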
    Calculating the standard error of the sample
    The standard deviation describes the spread of the data around the mean. The standard deviation of a sampling distribution is called the standard error; for the sample mean, it is estimated as the sample standard deviation divided by the square root of the sample size.
    sd_sample_1=sd(sample_1)
    sd_sample_1
    std_error=sd_sample_1/sqrt(length(sample_1))
    std_error
    The mean and standard deviation of the sample are close to the population mean and standard deviation. Next, we build the sampling distribution of the mean and plot it as a histogram, with cholesterol levels on the x-axis and frequency on the y-axis.
    To get a sampling distribution, we draw a sample of size 100 and record its mean, 1000 times over. This is done using the replicate function, which evaluates an expression a given number of times.
    samp_dist_1=replicate(1000,mean(sample(cardio$serumcholestrol,100,replace=TRUE)))
    samp_dist_1
    hist(samp_dist_1,main="Sampling distribution of serum cholesterol",xlab = "Mean Cholesterol Level",ylab = "Frequency", col = "skyblue")
    The resulting histogram resembles a normal distribution: values near the mean occur more frequently than values far from it. Now let's calculate the variance of the sampling distribution using the var function.
    variance_sample_1=var(samp_dist_1)
    variance_sample_1
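A useful sanity check is that the variance of the sampling distribution of the mean should be close to the population variance divided by the sample size. The sketch below checks this on a simulated population (the normal distribution, its parameters, and the seed are assumptions standing in for the real cholesterol column).

```r
set.seed(2023)                                   # arbitrary seed
population <- rnorm(1000, mean = 311, sd = 132)  # simulated stand-in for the cholesterol column
n <- 100

samp_dist <- replicate(1000, mean(sample(population, n, replace = TRUE)))

empirical_var <- var(samp_dist)
# Sampling with replacement from a finite population:
# use the population variance with the N denominator.
pop_var <- mean((population - mean(population))^2)
theoretical_var <- pop_var / n

c(empirical = empirical_var, theoretical = theoretical_var)
sqrt(theoretical_var)                            # theoretical standard error of the mean
```

Applied to the article's population standard deviation of 132.4438, the same formula gives roughly 175 for samples of size 100 and roughly 88 for samples of size 200.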

    Now let us see how increasing the sample size affects the variance of the sampling distribution.
    Increasing the sample size to 200
    sample_2=sample(cardio$serumcholestrol,200,FALSE)
    sample_2
    Calculating the mean of sample 2
    mean_sample_chol=mean(sample_2)
    mean_sample_chol
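    The mean of sample 2, with sample size 200, is 308.875.
    Calculating the standard error of sample 2
    sd_sample_2=sd(sample_2)
    sd_sample_2
    std_error=sd_sample_2/sqrt(length(sample_2))
    std_error

    The standard deviation of sample 2 is 135.9615, which gives a standard error of about 9.61 (135.9615 divided by the square root of 200).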
    We repeat the previous steps to obtain a sampling distribution.
    samp_dist_2=replicate(1000,mean(sample(cardio$serumcholestrol,200,replace=TRUE)))
    samp_dist_2
    Now we plot it as before, with the sample means on the x-axis and their frequency on the y-axis.
    hist(samp_dist_2,main="Sampling distribution of serum cholesterol",xlab = "Mean Cholesterol Level",ylab = "Frequency", col = "skyblue")
    variance_sample_2=var(samp_dist_2)
    variance_sample_2
    The variance of the sampling distribution with sample size 200 is 84.513, which is smaller than the variance of the sampling distribution with sample size 100. Hence we can conclude that as the sample size increases, the variance and the standard error of the sample mean decrease. In other words, precision increases with sample size.
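The same comparison can be run over several sample sizes at once. This is a minimal sketch on a simulated population (the distribution, its parameters, the seed, and the chosen sizes are all assumptions for illustration); with the real data, cardio$serumcholestrol would take the place of population.

```r
set.seed(7)                                      # arbitrary seed
population <- rnorm(1000, mean = 311, sd = 132)  # simulated stand-in, not the real data

sizes <- c(50, 100, 200, 400)
variances <- sapply(sizes, function(n) {
  # Variance of 1000 sample means, each from a sample of size n
  var(replicate(1000, mean(sample(population, n, replace = TRUE))))
})

data.frame(sample_size = sizes,
           variance    = variances,
           std_error   = sqrt(variances))
```

Each doubling of the sample size should roughly halve the variance, so the standard error shrinks by a factor of about the square root of 2.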

    Authors: Aadith Joseph Mathew, Amrutha Paalathara, Devika S Vinod, Jyosna Philip
