Introduction to Statistical Analysis with R
Statistical analysis with R is common practice among statisticians, data analysts, and data scientists who work with statistical data. R is a popular open-source programming language with extensive support for statistical analysis through both built-in and external packages. It natively supports basic statistical calculations for exploratory data analysis as well as advanced statistics for predictive data analysis. Statistical analysis with R is an important part of identifying data patterns based on statistical rules and business constraints. R is often preferred for statistical analysis because of its simple syntax and the flexibility that advanced packages offer.
How to Perform Statistical Analysis with R language?
Let us now discuss how to perform Statistical Analysis with R language.
- Before starting statistical data analysis with R, the business requirement must be clear so that you know which patterns to look for in the available data.
- R needs to be installed on the system.
- R can be installed on Windows, Linux, and macOS.
- The installable file for R can be downloaded from https://cran.r-project.org/
- Next, an IDE such as RStudio needs to be installed on the system.
- RStudio provides a GUI along with features such as syntax highlighting, debugging, package management, and workspace management.
- R Studio can be downloaded and installed from https://www.rstudio.com/
- Once RStudio is installed, it can be used directly to develop R scripts, which run on the installed version of R.
- Once the environment is ready, the next step is to import the data set into the R workspace.
- For example, we will import a .csv file into RStudio for statistical analysis.
- We will download an open-source data set from https://www.kaggle.com/ for this demonstration.
- The data file we will use is 'cbb.csv', which is a college basketball dataset.
A Practical Approach to Statistical Analysis with R
- This section works hands-on in RStudio with the college basketball dataset.
- The first step is to set the working directory, which will be used as the default location to read and write datasets.
- setwd() is used in R to set the working directory.
- getwd() is used to check the current working directory.
- Following is a screenshot of RStudio with the setwd() and getwd() functions.
- Next, we will import the data set using the read.csv() command and assign it to a data frame called sampleData, using the following syntax:
- sampleData = read.csv("cbb.csv")
- To check that the dataset was imported correctly and to review the first few lines of data, use the head() command in R:
head(sampleData)
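The working-directory and import steps above can be sketched end to end as follows. Because cbb.csv is not bundled with this article, the sketch writes a small stand-in file first; the file name cbb_demo.csv, the TEAM column, and every value in it are invented for illustration and are not taken from the Kaggle dataset.

```r
# Hypothetical stand-in for cbb.csv: a few invented rows with G and W columns
demo <- data.frame(
  TEAM = c("A", "B", "C", "D", "E", "F"),
  G    = c(31, 33, 30, 35, 29, 32),
  W    = c(20, 25, 12, 30, 9, 20)
)

# Write the stand-in to a temporary folder so no real files are touched
work_dir <- tempdir()
write.csv(demo, file.path(work_dir, "cbb_demo.csv"), row.names = FALSE)

# setwd() changes the working directory and returns the previous one
old_dir <- setwd(work_dir)
print(getwd())

# Relative file names are now resolved against the working directory
sampleData <- read.csv("cbb_demo.csv")
print(head(sampleData, 3))   # first three rows
str(sampleData)              # column names and types

# Restore the original working directory
setwd(old_dir)
```

With the real file, the same pattern would reduce to setwd() on the download folder followed by read.csv("cbb.csv").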
- Next, we will use the summary() command for a basic statistical analysis, which shows the minimum, first quartile, median, mean, third quartile, and maximum of each quantitative variable in the data set.
- The summary of the basketball data set shows that the variable G has a minimum value of 24.00, a maximum value of 40.00, a median of 31.00, and a mean of 31.52.
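As a rough illustration of what summary() reports, the sketch below runs it on a small invented data frame (the values are hypothetical and will not reproduce the figures quoted for the real G variable) and then recomputes the same statistics with the individual base R functions.

```r
# Invented values standing in for the G (games) and W (wins) columns
sampleData <- data.frame(
  G = c(31, 33, 30, 35, 29, 32),
  W = c(20, 25, 12, 30, 9, 20)
)

# summary() on a data frame summarises every column
print(summary(sampleData))

# summary() on one numeric column returns six named statistics
g_summary <- summary(sampleData$G)
print(names(g_summary))

# The same statistics are available one at a time
g_min    <- min(sampleData$G)
g_max    <- max(sampleData$G)
g_median <- median(sampleData$G)
g_mean   <- mean(sampleData$G)
g_q      <- quantile(sampleData$G, probs = c(0.25, 0.75))  # 1st and 3rd quartiles
g_iqr    <- IQR(sampleData$G)                              # 3rd minus 1st quartile
print(c(min = g_min, median = g_median, mean = g_mean, max = g_max, IQR = g_iqr))
```

Note that summary() prints the two quartiles rather than the interquartile range itself; IQR() gives the difference directly.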
- Next, we will discuss univariate data analysis.
- An R data frame stores the data set as named columns, one per variable.
- A particular variable can be accessed from the data frame using the $ symbol.
- For example, to view the statistical summary of the W variable, we will use summary(sampleData$W)
- The data can be plotted as a histogram using the hist() command, for example hist(sampleData$W), to view the overall data distribution.
- We can use the table() function to create a frequency table, which shows how often each value occurs in the variable, using table(sampleData$W)
- The frequency table shows that the value 20 has the maximum frequency in the data. This function is very useful when analyzing categorical variables.
- We can also plot this frequency table using the plot() function in R, for example plot(table(sampleData$W))
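The univariate steps above might look like this when put together. The data frame is again an invented stand-in for the real dataset, and the plots are sent to a temporary PDF only so that the sketch also runs outside RStudio; in RStudio the pdf() and dev.off() lines can be dropped to see the plots in the Plots pane.

```r
# Invented values standing in for the W (wins) column of the real dataset
sampleData <- data.frame(W = c(20, 25, 12, 30, 9, 20))

# Access a single variable with $ and summarise it
print(summary(sampleData$W))

# Frequency table: how many times each distinct value occurs
w_freq <- table(sampleData$W)
print(w_freq)

# The most frequent value (table names are character strings)
most_common <- names(which.max(w_freq))
print(most_common)

# Send plots to a temporary PDF so the sketch works non-interactively
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
hist(sampleData$W, main = "Distribution of W", xlab = "W")   # histogram
plot(w_freq, main = "Frequency of W values",
     xlab = "W", ylab = "Count")                             # frequency plot
invisible(dev.off())
```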
- Next, we will discuss bivariate statistical analysis with R.
- This statistical analysis is a comparison between two variables present in the data set.
- It helps to identify the correlation and patterns between the two variables.
- The symbol '~' is used in R formulas to relate one variable to another for bivariate analysis.
- In this example, we create a scatter plot for the G and W variables using the plot() function with a formula, for example plot(W ~ G, data = sampleData)
- This scatter plot represents the graph for the bivariate analysis.
- Apart from the scatter plot, several other plots, such as histograms, line plots, and boxplots, are used for bivariate data analysis.
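A minimal sketch of the bivariate step, assuming W is placed on the vertical axis and G on the horizontal axis (the article does not say which orientation it used). The values are invented, and cor() is added here as one common way to put a number on the relationship the scatter plot shows.

```r
# Invented values standing in for the G and W columns of the real dataset
sampleData <- data.frame(
  G = c(31, 33, 30, 35, 29, 32),
  W = c(20, 25, 12, 30, 9, 20)
)

# Pearson correlation between the two variables (between -1 and 1)
r <- cor(sampleData$G, sampleData$W)
print(r)

# Send the plot to a temporary PDF so the sketch works non-interactively
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

# Formula interface: left of ~ is the y axis, right of ~ is the x axis
plot(W ~ G, data = sampleData,
     main = "Wins against games played",
     xlab = "G (games)", ylab = "W (wins)")

invisible(dev.off())
```

The formula form keeps the variable names out of the axis-order guesswork: reading `W ~ G` as "W as a function of G" tells you which variable ends up on which axis.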
- Next, we will discuss the t-test, a statistical hypothesis test, using R.
- The t.test() function is used in R to run the t-test.
- We will use the G variable of the data frame sampleData for the t-test.
- t.test(sampleData$G) is the syntax we will apply in the RStudio console.
- The t-test reports the statistical inferences and the confidence interval as outcomes.
- The p-value is the probability of observing data at least as extreme as the sample if the null hypothesis were true. The percentage value is the confidence level of the reported interval.
- In this t-test, the p-value is < 2.2e-16 and the confidence level is 95%. It also shows the mean value of 31.52205.
- With such a small p-value, this t-test rejects the null hypothesis (by default, that the true mean is 0) in favor of the alternative hypothesis.
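The t-test step can be sketched as below on invented values, so the numbers will differ from those reported for the real dataset. The second call, which tests against a mean of 30, is a hypothetical addition to show that the default null value of 0 can be changed with the mu argument.

```r
# Invented values standing in for the G column of the real dataset
sampleData <- data.frame(G = c(31, 33, 30, 35, 29, 32))

# One-sample t-test with the defaults: H0 is "true mean = 0", 95% confidence
res <- t.test(sampleData$G)
print(res)

# The result is a list, so individual pieces can be pulled out
p_value    <- res$p.value                        # p-value of the test
ci         <- res$conf.int                       # lower and upper bounds
conf_level <- attr(res$conf.int, "conf.level")   # confidence level of the interval
estimate   <- unname(res$estimate)               # sample mean
print(c(p_value = p_value, mean = estimate, conf_level = conf_level))

# Hypothetical variation: test against a more meaningful null value
res_30 <- t.test(sampleData$G, mu = 30, conf.level = 0.95)
print(res_30$p.value)
```

Testing a count of games against a mean of 0 will almost always reject the null hypothesis, which is why a tiny p-value in that setup says little; choosing mu to match a real question gives a more informative test.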
Importance of Statistical Analysis with R language
- R is a reliable programming language for Statistical Analysis.
- It has a wide range of statistical library support, such as t-tests, linear regression, logistic regression, and time-series analysis.
- R comes with very good data visualization features, supporting plotting and graphs through graphical packages like ggplot2.
- It is a scripting language, which helps statisticians and data scientists develop code and test individual statistical models for efficient data analysis.
- Code written in R for statistical analysis is easy to interpret and to share with coworkers and other stakeholders in the organization.
- Being a popular and well-structured language, R has many reusable code components and libraries available to get started with statistical analysis of an input dataset.
- R includes various built-in datasets for learning and for creating a proof of concept before using actual business data for statistical analysis.
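As a small illustration of the last point, the sketch below uses mtcars, one of the datasets that ships with base R, to try out the same kind of summary without downloading anything. The choice of mtcars and of its mpg column is only an example; it is not a dataset used elsewhere in this article.

```r
# mtcars ships with R in the 'datasets' package, which is loaded by default
data(mtcars)

# Inspect its size and structure
print(dim(mtcars))
str(mtcars)

# The same summary workflow applies to any numeric column
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)
```

Running data() with no arguments lists the other built-in datasets available in the current session.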
Statistical analysis is an integral phase of data science projects. R's native support for statistical computation and its wide community support set it apart from competitors such as Python, SAS, IBM SPSS Statistics, MATLAB, Minitab, and Microsoft Excel. Statistical analysis using R continues to evolve with version upgrades.