We recommend installing a GUI for R, such as RStudio or Tinn-R. This will provide a nice interface for working with R.
Create vectors of the data that you want to run an ANOVA on. In this case, the response variable (y) is the selling price (in thousands of dollars) and x identifies the salesperson; each of the three salespeople sold 4 robots. Note in this example that it is necessary to enter x as strings instead of numbers. R reads these strings like humans read words: the label assigned to each salesperson is arbitrary and could be anything. For instance, as humans we recognize that the problem would be unchanged if we named the sales people "Rebecca", "Rachel", and "Raymond" instead of "1", "2", and "3", but R needs to be told explicitly to treat the labels as categories and not as numbers.
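The original data values are not listed here, so the vectors below are illustrative, chosen to be consistent with the summary and ANOVA output shown in this section. factor() tells R to treat the salesperson labels as categories rather than numbers:

```r
# Hypothetical selling prices (thousands of dollars); 4 robots per salesperson
SellingPrices <- c(10, 11, 13, 15,   # salesperson 1
                   12, 12, 13, 14,   # salesperson 2
                   11, 14, 15, 16)   # salesperson 3
# factor() marks the labels as categorical
Salesperson <- factor(rep(c("1", "2", "3"), each = 4))
```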
Next make a data frame for the variables which enables R to read them as one set of data instead of two independent columns.
> values = data.frame(SellingPrices, Salesperson)
> summary(values)
## | SellingPrices | Salesperson |
## | Min. :10.00 | 1:4 |
## | 1st Qu.:11.75 | 2:4 |
## | Median :13.00 | 3:4 |
## | Mean :13.00 | |
## | 3rd Qu.:14.25 | |
## | Max. :16.00 | |
Use aov(y ~ x, data = dataframe) to run a one-way ANOVA. Then use the summary() function to look up key outputs from the ANOVA.
> SellingPrice.aov <- aov(SellingPrices ~ Salesperson, data = values)
> summary(SellingPrice.aov)
## | Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
## | Salesperson | 2 | 6.5 | 3.25 | 0.929 | 0.43 |
## | Residuals | 9 | 31.5 | 3.50 | | |
Read more about making your own ANOVAs here.
Into the console type dbinom(successes, number of trials, probability of success). The result is shown in row [1].
Note: Here we have entered 0:4 for successes to calculate the entire distribution at once. The first probability (0.4822530864) in the output corresponds to 0 successes, the second (0.3858024691) corresponds to 1 success, and so on. Individual probabilities can be found by instead entering a single number here, such as "dbinom(0, 4, 1/6)".
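Using the numbers from the note above (4 trials with success probability 1/6), the statement looks like:

```r
# P(X = k) for k = 0..4 successes in 4 trials with p = 1/6
dbinom(0:4, size = 4, prob = 1/6)
# 0.48225..., 0.38580..., 0.11574..., 0.01543..., 0.00077...
```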
Into the console type pbinom(successes, number of trials, probability of success). The result is shown in row [1].
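Continuing the same example (4 trials, p = 1/6), the cumulative probability of at most 1 success is:

```r
# P(X <= 1) = P(X = 0) + P(X = 1)
pbinom(1, size = 4, prob = 1/6)
# [1] 0.8680556
```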
Into the console type qchisq(1-a, degrees of freedom). The chi-square critical value corresponding to probability a in the right tail is returned. The result is shown in row [1].
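For example, with a = 0.05 and 9 degrees of freedom (values chosen here purely for illustration):

```r
# Critical value with 0.05 in the right tail, df = 9
qchisq(1 - 0.05, df = 9)
# [1] 16.91898
```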
To find the corresponding p-value for a left-tailed (cdf) χ² test statistic, use pchisq(x, degrees of freedom, lower.tail = TRUE).
Read more about the chi-square probability distribution here.
To find the corresponding p-value for a right-tailed χ² test statistic, use pchisq(x, degrees of freedom, lower.tail = FALSE).
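As an illustration, with a hypothetical test statistic and degrees of freedom:

```r
x  <- 5.2   # hypothetical chi-square test statistic
df <- 3     # hypothetical degrees of freedom
pchisq(x, df, lower.tail = TRUE)    # left-tailed probability
pchisq(x, df, lower.tail = FALSE)   # right-tailed p-value
```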
Read more about the chi-square probability distribution here.
To make a proportion confidence interval you will use the binom.test(x, n) function, where x is the number of cases and n is the total sample size. Example: you take a sample of 10 people, and 5 of them are female. The confidence level defaults to 95%.
You can change the confidence level using the parameter conf.level.
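For the example above (5 females in a sample of 10):

```r
# 95% confidence interval for the proportion (the default)
binom.test(5, 10)
# the same interval at a 90% confidence level
binom.test(5, 10, conf.level = 0.90)
```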
To make a t-interval you will need your data saved in an array.
You can perform the t-interval calculation using the function t.test(). The function will automatically calculate the necessary sample statistics. Confidence level defaults to 95%.
You can change the confidence level using the parameter conf.level.
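A short sketch with made-up data:

```r
# Hypothetical sample data
x <- c(12.1, 11.8, 12.4, 12.0, 12.6, 11.9, 12.3)
t.test(x)                     # 95% confidence interval (default)
t.test(x, conf.level = 0.99)  # 99% confidence interval
```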
To make a z-interval you will use R to make the calculation by hand.
To make a z-interval you will need your data saved in an array.
Save each of the following parameters in its own variable:
Sample mean
Sample standard deviation
Sample size
Now determine the critical value using the qnorm() function. Decide on your confidence level; the typical options are 90%, 95%, or 99%. Subtract that percentage from 100%, cut the result in half, and convert it to a decimal. For example, for a 95% confidence interval, take half of 5%, which is 2.5% (0.025). Note that qnorm(0.025) returns a negative value, so use its absolute value (or, equivalently, qnorm(0.975)) as the critical value.
Calculate the Margin of Error by multiplying the critical value times standard deviation, then dividing by square root of sample size.
Calculate the confidence interval by adding and subtracting the margin of error from the sample mean.
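Putting the steps above together; the sample statistics here are made-up numbers:

```r
# Hypothetical sample statistics
x.bar <- 52.3    # sample mean
s     <- 4.8     # sample standard deviation
n     <- 40      # sample size

# Critical value for a 95% confidence level
z <- qnorm(1 - 0.025)   # same as abs(qnorm(0.025)); about 1.96

# Margin of error
me <- z * s / sqrt(n)

# Confidence interval: (lower, upper)
c(x.bar - me, x.bar + me)
```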
To make a Two Sample t-interval you will need each sample's data saved in a separate array.
You can perform the t-interval calculation using the function t.test(). The function will automatically calculate the necessary sample statistics. Confidence level defaults to 95%.
You can change the confidence level using the parameter conf.level.
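A sketch with two hypothetical samples:

```r
# Hypothetical samples
x <- c(20.4, 21.1, 19.8, 20.9, 21.5, 20.2)
y <- c(18.9, 19.7, 20.1, 19.2, 19.8, 18.5)
t.test(x, y)                     # 95% interval for the difference in means
t.test(x, y, conf.level = 0.90)  # 90% interval
```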
To make a two sample z-interval you will use R to make the calculation by hand.
To make a two sample z-interval you will need each sample's data saved in a separate array.
Sample mean 1 and 2
Sample variance 1 and 2
Sample size 1 and 2
Now determine the critical value using the qnorm() function. Decide on your confidence level; the typical options are 90%, 95%, or 99%. Subtract that percentage from 100%, cut the result in half, and convert it to a decimal. For example, for a 95% confidence interval, take half of 5%, which is 2.5% (0.025). Note that qnorm(0.025) returns a negative value, so use its absolute value (or, equivalently, qnorm(0.975)) as the critical value.
Calculate the Margin of Error by multiplying the critical value times the square root of the sum of each variance divided by its sample size.
Calculate the sample difference by subtracting the two sample means.
Calculate the confidence interval by adding and subtracting the margin of error from the sample difference.
The number of combinations can be found by using combn(). Input the number of objects first (36), followed by the number of objects taken at a time (5). The ncol() function counts the number of combinations in this case.
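For 36 objects taken 5 at a time:

```r
# Number of combinations of 36 objects taken 5 at a time
ncol(combn(36, 5))
# [1] 376992
# the same value comes directly from choose(36, 5)
```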
Read more about programming your combinations here.
Use the factorial() function to find the factorial.
Read more about how to use the factorial function here.
Since there is no simple command for a permutation like there is for combinations, it is easiest to calculate a permutation by using what we know about combinations. A permutation is simply a combination multiplied by k!, the factorial of the number selected at a time. Without the number-of-columns function (ncol), we would receive a matrix listing all of the combinations themselves.
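For example, the number of permutations of 36 objects taken 5 at a time:

```r
# P(36, 5) = C(36, 5) * 5!
ncol(combn(36, 5)) * factorial(5)
# [1] 45239040
```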
To sort data by ascending or descending order, use the sort() function. In this example, we will sort the ages of 25 employees at a clothing department store (Example 3.4.2).
Create a list of all the ages to be included. This list (or vector) is called "Ages" here. Alternatively, you could extract a column or row of data from an Excel file to sort it.
Use sort(x, decreasing=FALSE) to sort in ascending order (smallest to largest). Set decreasing=TRUE for descending order (largest to smallest).
Read more about sorting your data here.
There are a variety of functions and packages that can be used to calculate descriptive statistics. In addition to the base functions in R, packages such as "mosaic" can be used to calculate descriptive statistics.
One of the easiest ways to see the mean, median, maximum, and minimum of a data set is to use the summary() function. Note that there is no simple function to find the mode of a data set.
> summary(list)
## | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
## | 4.00 | 6.25 | 8.50 | 9.00 | 11.25 | 15.00 |
The standard deviation, standard error, variance, range, and sum can also be calculated easily.
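The original vector is not listed, but c(4, 7, 10, 15) reproduces the summary() output above; using it (note that R has no built-in standard-error function, so it is assembled from sd() and length()):

```r
list <- c(4, 7, 10, 15)        # data consistent with the summary above
sd(list)                       # standard deviation
sd(list) / sqrt(length(list))  # standard error of the mean
var(list)                      # variance
range(list)                    # smallest and largest values
sum(list)                      # sum
```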
Alternatively, you can install the "mosaic" package in R to use the favstats() function. This will show you the minimum, maximum, mean, median, standard deviation, and count, among other basic descriptors.
> install.packages("mosaic")
> library("mosaic")
> favstats(list)
## | min. | Q1 | median | Q3 | max | mean | sd | n | missing |
## | 4 | 6.25 | 8.5 | 11.25 | 15 | 9 | 4.690416 | 4 | 0 |
Once you find the test statistic F and the degrees of freedom, then you can plug your values into the function to find the P-value.
F-statistic: 0.9286
Degrees of freedom: 2,9
P(F>0.9286): lower.tail=FALSE
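These values plug into pf() as follows:

```r
# Right-tail probability: P(F > 0.9286) with df1 = 2, df2 = 9
pf(0.9286, df1 = 2, df2 = 9, lower.tail = FALSE)
```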
Read more about F-Probability (cdf) here.
Create vectors (lines of your data) for x and y. Here, x is the Lack of Parental Involvement and y is the Percentage of Frequency Distribution for the responses.
Plot the bar chart using barplot(). Percents provides the heights of the bars, and the bars are named by the responses for Lack of Parental Involvement. xlab and ylab label each axis, and ylim sets the y-axis to show response percentages from 0 to 60%.
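A sketch with hypothetical response categories and percentages (the actual survey values are not reproduced here):

```r
Responses <- c("None", "Minor", "Moderate", "Serious")  # hypothetical categories
Percents  <- c(10, 25, 55, 10)                          # hypothetical percentages
barplot(Percents, names.arg = Responses,
        xlab = "Lack of Parental Involvement",
        ylab = "Percentage of Frequency Distribution",
        ylim = c(0, 60))
```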
Read more about bar charts here.
Write your data as lists of numbers (vectors). For this example, the highest and lowest 5 wins per season were included for the Braves, Cubs, Dodgers, and Yankees baseball teams from the years 1967-2010.
For a boxplot with multiple columns, it is necessary to create a data frame which puts each data list as a column with 10 rows.
> BaseballTeams <- data.frame(Braves,Cubs,Dodgers,Yankees)
> BaseballTeams
## | Braves | Cubs | Dodgers | Yankees | |
## | 1 | 50 | 38 | 58 | 59 |
## | 2 | 54 | 49 | 63 | 67 |
## | 3 | 61 | 61 | 63 | 70 |
## | 4 | 63 | 64 | 71 | 71 |
## | 5 | 65 | 65 | 73 | 72 |
## | 6 | 106 | 103 | 102 | 114 |
## | 7 | 104 | 97 | 98 | 103 |
## | 8 | 103 | 97 | 95 | 103 |
## | 9 | 101 | 96 | 95 | 103 |
## | 10 | 101 | 93 | 95 | 101 |
Plot the boxplot using the data frame and appropriate labels.
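Using the data from the table above (the title and axis labels here are illustrative):

```r
# Wins per season, highest and lowest 5 seasons per team
Braves  <- c(50, 54, 61, 63, 65, 106, 104, 103, 101, 101)
Cubs    <- c(38, 49, 61, 64, 65, 103, 97, 97, 96, 93)
Dodgers <- c(58, 63, 63, 71, 73, 102, 98, 95, 95, 95)
Yankees <- c(59, 67, 70, 71, 72, 114, 103, 103, 103, 101)
BaseballTeams <- data.frame(Braves, Cubs, Dodgers, Yankees)
boxplot(BaseballTeams, main = "Wins per Season, 1967-2010", ylab = "Wins")
```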
Read more about making your own boxplots here.
Open RStudio.
Create a new project by clicking "File" at the top of the window, and then selecting "New Project".
In the pop-up window, click "New Directory" and then "Empty Project". Name it "Choropleth Map", and save it wherever you prefer.
Upon creating your project, a section of your screen will show up with some text on it. Below the text there will be a ">" symbol and the cursor should show up to the right of the symbol. This is called the R console, and this is where we will write our statements.
First, we will install all the necessary packages needed to create a choropleth map. Enter the following statements in the R console, pressing Enter after each statement.
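The original statements are not reproduced here; assuming the workflow is built on the choroplethr package (and its companion map-data package, choroplethrMaps), they would be:

```r
# Install once per machine
install.packages("choroplethr")
install.packages("choroplethrMaps")

# Load into the current R session (repeat each session)
library(choroplethr)
library(choroplethrMaps)
```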
Note: the install.packages() statements only need to be run once and the specified package will be installed permanently. However, installing a package is different from loading a package into R. We need to load packages into R using the library() function every time we open a new R session.
Now that we have all of the necessary tools installed and loaded, it is time to load in our data. For county level data, the choropleth map package in R requires data to be in comma separated value format (.csv) and organized in the following way:
A | B | |
---|---|---|
1 | region | value |
2 | 1001 | 8437 |
3 | 1003 | 39710 |
4 | 1005 | 2354 |
5 | 1007 | 1664 |
6 | 1009 | 5080 |
7 | 1011 | 1031 |
8 | 1013 | 2032 |
9 | 1015 | 13818 |
10 | 1017 | 2759 |
The first column must have a header titled "region" containing the geographic indicator (FIPS/county codes), and the second column must have a header titled "value" that contains the value of the variable of interest associated with the geographic indicator. Make sure that there is no extra formatting in either column (no commas, symbols, text, etc.) Save the correctly formatted .csv file in the same location that you created your new R project directory. You should see the file name appear in the "Files" pane.
For this example, we will use the US County Data found on the web resource.
Once the .csv is saved in the correct location, we can load it into R via the R console. Navigate to the console, type in the following statement, and press Enter.
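The statement has this shape; the file name countyData.csv is a placeholder for whatever you named your own file:

```r
# Assign the .csv contents to a variable named mapData
mapData <- read.csv(file = "countyData.csv", header = TRUE, sep = ",")
```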
Note: The "<-" symbol denotes assignment. We are assigning the data from the .csv file to a variable named "mapData" so that we can easily access it for future use.
Once loaded, you should see an item in the Data panel with the name "mapData". By clicking on the item in the data panel, we can view the data that was loaded into R. The data should only have 2 variables (region, value). Now, we will create the choropleth map using a single R statement. Type the following into the console and press Enter.
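Assuming the choroplethr package loaded earlier, the statement is:

```r
# Draw a county-level choropleth from the region/value data frame
county_choropleth(mapData)
```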
Depending on which FIPS codes are/aren't included in your dataset, you may get a warning message, but if there are no actual errors, the graph should still be generated like the following.
After a couple of seconds, R will generate a map of the United States with the county regions shaded according to the associated value. You can specify a title for the map and a title for the legend by typing the statement with some additional inputs, or parameters.
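For example, with illustrative titles:

```r
county_choropleth(mapData,
                  title  = "Hypothetical County-Level Values",
                  legend = "Value")
```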
The previous command will generate the following plot.
The plot can be saved as an image by clicking the "Export" button above the graph and selecting "Save as Image…"
Create a list of values (a vector) for your histogram. In this case, we are using the heart rate of 50 students.
Create a histogram with breaks using the number at the beginning of the interval (56.5 is an example here). This can be accomplished by using hist(). To create a histogram where the frequency of values is on the y-axis, make sure that the interval breaks are equally spaced.
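A sketch with hypothetical heart-rate values (the full 50-student data set is not reproduced here):

```r
# Hypothetical heart rates (beats per minute)
HeartRate <- c(60, 64, 68, 71, 73, 75, 78, 80, 83, 87)
# Equally spaced breaks starting at 56.5, each interval 7 wide
hist(HeartRate, breaks = seq(56.5, 91.5, by = 7),
     xlab = "Heart Rate (bpm)", main = "Heart Rates of Students")
```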
Read more about making your own histograms here.
Define your data variable by loading a datafile or entering a set of single variable data as a vector (in this case the data set was small so we defined it by hand).
Then, use qqnorm and qqline to create the plot and draw the trendline.
Note: If you left off ", datax=TRUE" the plot would be drawn with the sample quantiles on the y-axis instead. Our materials typically show the data on the x-axis so we have adjusted this argument, but regardless of which axis has the data, you are looking for the points to follow a line.
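With a small hypothetical data vector, the two calls look like:

```r
x <- c(5.2, 4.8, 6.1, 5.5, 5.9, 4.6, 5.0, 5.7)  # hypothetical data
qqnorm(x, datax = TRUE)   # sample data on the x-axis
qqline(x, datax = TRUE)   # reference line through the quartiles
```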
For this example, we are using the High School Completion and Crime Rate data from Hawkes Stat. Delete the title (High School Completion and Crime Rate 2014) of the dataset from the top while saving it to your computer.
Load the data set into R. To do this, type the name you want to save the data set as (in this case school_and_crime). To read the data into R, use read.csv(file="", header=TRUE, sep=","). Inside the quotation marks, put the path to the file. You should then be able to view your data set in the Global Environment. More information about doing this can be found here.
Next you can plot your data. The $ are used to call a particular column of data from your file. In this case, the Crime Rate Data column is read as Crime.Rate..per.100.000 by R and the High School Completion column is read by R as High.School.Completion. Finally, label your axes and title.
plot(school_and_crime$Crime.Rate..per.100.000,school_and_crime$High.School.Completion, xlab="Crime Rate (per 100,000)", ylab="Completion Rate",main="High School Completion Rate and Crime Rate",ylim=c(65,95))
Read more about programming scatterplots here.
Unless you choose to install a package in R, you will have to create your own z-test. There are a number of ways to accomplish this, but one way is to make a function that calculates the z-score and a separate command to calculate the P-value. In this case we use the parameters $\bar{x}$ (x.bar), $\mu$ (mu), $\sigma$ (sd), and sample size (n) for our z-test. To calculate the z-score, we use the equation:
$z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.
Now plug in the values for the z-score function. Saving it as a new output will be useful for calculating the P-value and other test statistics.
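One way to write the function is below. The plugged-in values are hypothetical, chosen so the z-score comes out to 2.53, the value used in this example:

```r
# z-score for a one-sample z-test
z.test <- function(x.bar, mu, sd, n) {
  (x.bar - mu) / (sd / sqrt(n))
}
# hypothetical values: x.bar = 52.53, mu = 50, sd = 5, n = 25
z_output <- z.test(52.53, 50, 5, 25)
z_output
# [1] 2.53
```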
Calculate the P-value for P(z ≥ 2.53) = P(z ≤ -2.53). As always, be careful to evaluate the P-value correctly depending on whether you want the upper-, lower-, or two-tailed probability. Then decide whether the P-value convinces you to reject or fail to reject the null hypothesis.
>p=pnorm(-z_output)
>p
## | [1] | 0.005706018 |
alpha = 0.01
if (alpha > p) {
print("Reject null hypothesis")
} else {
print ("Fail to reject the null hypothesis")
}
[1] "Reject null hypothesis"
Read more about z-tests here.
There is a t.test() option in R, but without a vector or list of data it is necessary to create your own function. There are several ways to accomplish this, but one way is to make a function that calculates the t-score and a separate command to calculate the P-value. In this case we use the parameters $\bar{x}$ (x.bar), $\mu_0$ (mu), $s$ (s), and sample size (n) for our t-test. To calculate the t-score, we use the equation:
$t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$.
Now plug in the values for the t-score function. Saving it as a new output will be useful for calculating the p-value and other test statistics.
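One way to write the function is below. The plugged-in values are hypothetical, chosen with n = 20 so that the degrees of freedom (df = 19) line up with this example:

```r
# t-score for a one-sample t-test
t.score <- function(x.bar, mu, s, n) {
  (x.bar - mu) / (s / sqrt(n))
}
# hypothetical values: x.bar = 8.5, mu = 10, s = 2, n = 20
t_output <- t.score(8.5, 10, 2, 20)
t_output
# [1] -3.354102
```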
Calculate the P-value. As always, be careful to evaluate the P-value correctly depending on whether you want the upper-, lower-, or two-tailed probability. Here the test statistic is negative, so the two-tailed P-value is twice the lower-tail probability. Then decide whether the P-value convinces you to reject or fail to reject the null hypothesis.
>alpha = 0.01
>p=2*pt(t_output, df=19)
>p
## | [1] | 0.003332838 |
if (alpha > p) {
print("Reject null hypothesis")
} else {
print ("Fail to reject the null hypothesis")
}
[1] "Reject null hypothesis"
Read more about t-tests here.
Use the function pnorm(z/x, mean = mu, sd = standard deviation, lower.tail = TRUE).
z/x: provide the z-score or x value
mu: assumed to be 0 if left off
sd: assumed to be 1 if left off
lower.tail: TRUE if left off; include lower.tail = FALSE if you need the probability of observing a value above the x or z you provided.
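For example:

```r
pnorm(1.96)                       # P(Z <= 1.96) for the standard normal
# [1] 0.9750021
pnorm(1.96, lower.tail = FALSE)   # P(Z > 1.96)
# [1] 0.0249979
pnorm(110, mean = 100, sd = 15)   # P(X <= 110) for X ~ N(100, 15)
```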
Enter ppois(1, lambda=mean). The probability is shown in output row [1].
> ppois(1,lambda=0.5)
[1] 0.909796
Enter dpois(x, lambda=mean). The probability is shown in output row [1].
> dpois(0, lambda = .5)
[1] 0.6065307
To find the confidence interval for the slope and y-intercept of a linear regression, run your regression using the lm() function, then use the confint() function. Inside of this you give the model, and the confidence level desired.
> Y = c(12, 11, 12, 12, 13, 16, 13, 18, 11, 14)
> X = c(50, 51, 62, 45, 63, 76, 53, 68, 51, 74)
> model = lm(Y~X)
> confint(model,level=0.95)
2.5 % | 97.5 % | |
(Intercept) | -2.74485180 | 10.9761507 |
X | 0.03920706 | 0.2671791 |
Read more about confidence intervals here.
Revisiting the scatterplot we made previously (see Graphs > Scatterplot for information on the data set), we can calculate its correlation using cor(x, y). The output indicates a moderately negative correlation, which makes sense given the scatterplot.
> cor(school_and_crime$Crime.Rate..per.100.000, school_and_crime$High.School.Completion)
[1] | -0.4262846 |
To find the confidence interval for the mean value of y given x, run your regression, then use the predict() function. Inside of this you give the model, the data to predict on as shown below, the type of interval, and the confidence level desired. For a simple linear regression, still use the newdata=list() notation.
> daughter <- c(65, 65, 61, 69, 67, 59, 69, 70, 68, 70, 70, 65, 70)
> mother <- c(64, 66, 62, 70, 70, 58, 66, 66, 64, 67, 65, 66, 68)
> father <- c(73, 70, 72, 72, 72, 63, 75, 75, 72, 69, 77, 70, 74)
> m1 <- lm(daughter~mother+father)
> predict(m1,newdata=list(mother=64, father=74),interval="confidence",level=0.95)
fit | lwr | upr | |
1 | 66.82968 | 64.7847 | 68.87465 |
To find the predicted value of y given x, change the interval type to prediction.
> daughter <- c(65, 65, 61, 69, 67, 59, 69, 70, 68, 70, 70, 65, 70)
fit | lwr | upr | |
1 | 66.82968 | 61.5329 | 72.12646 |
Read more about predictions here.
Create a list of values (vectors) for the x and y variables. Age will go on the x-axis and AskingPrice on the y-axis for this example.
> Age <- c(1,1,2,2,2,3,3,4,4,5,5,6,6,6)
> AskingPrice <- c(17850,18000,15195,16995,15625,14935,14879,14460,13586,13050,13495,9150,9950,10995)
Create a regression line using the command lm(y ~ x) for linear model.
> RegressionLine <- lm(AskingPrice ~ Age)
Plot the points with the fitted linear model. The lines() function can be used, and the points should be sorted by the x-variable before being fit to the regression line.
> plot(Age, AskingPrice,xlab="Age (Years)",ylab="AskingPrice",main="Asking Price versus Age (Years)", lines(sort(Age),fitted(RegressionLine)))
To get a summary of the linear regression line, use the summary() function.
> summary(RegressionLine)
lm(formula = AskingPrice ~ Age)
Residuals:
Min | 1Q | Median | 3Q | Max |
-1574.94 | -582.30 | 50.25 | 533.37 | 1357.83 |
Coefficients:
| Estimate | Std. Error | t value | Pr(>|t|) |
(Intercept) | 19198.3 | 524.9 | 36.58 | 1.12e-13 |
Age | -1412.2 | 131.8 | -10.71 | 1.69e-07 |
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 868.7 on 12 degrees of freedom
Multiple R-squared: 0.9053, Adjusted R-squared: 0.8975
F-statistic: 114.8 on 1 and 12 DF, p-value: 1.692e-07
Read more about simple linear regression and plotting linear regression points here: Simple Linear Regression | The Default Scatterplot Function.
See examples below. Enter the values you would like to sample from in an array named as you like. sample(x) samples from the given array without replacement, generating a sample with as many values as the array contains. To sample with replacement, specify replace=TRUE; to draw a sample of a particular size, pass the size as the second argument.
Example 1
> x<-c(1,2,3,5,6,7)
> sample(x)
[1] 1 7 2 3 5 6
> sample(x, replace=TRUE)
[1] 3 5 5 3 2 1
> sample(x,2)
[1] 1 5
Example 2
> x<-1:5
> sample(x)
[1] 1 3 2 4 5
> sample(x,4,replace=TRUE)
[1] 4 4 2 2
Enter qt(probability, df = degrees of freedom). The t-value is shown in output row [1].
> qt(0.975, df=18)
[1] 2.100922
To find the probability of successes P(X=0), P(X=1), and P(X=2), use the dhyper(number of successes in the sample of size n, number of possible successes, number of possible failures, number of draws) function. This can equivalently be written as dhyper(x, k, N-k, n).
> dhyper(0,2,28,16)
## [1] 0.2091954
> dhyper(1,2,28,16)
## [1] 0.5149425
> dhyper(2,2,28,16)
## [1] 0.2758621
Read more about hypergeometric distributions here.