Exploratory Data Analysis Tutorial: E. Devenish-Nelson, University of Edinburgh

Exploring Data and Descriptive Statistics

This tutorial assumes you have some experience with R and with basic statistical analysis.

Here, we use R’s built-in dataset ‘iris’, which you can load with data(iris). Always check your data after reading it into R.
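As a minimal sketch of what "checking your data" might look like, the following uses base R functions to inspect the structure, the first few rows and the number of missing values. Which checks matter will depend on your own dataset; these are only common starting points.

```r
# Load the built-in iris dataset into the workspace
data(iris)

# Show the structure: number of rows, column names and column types
str(iris)

# Show the first six rows
head(iris)

# Count the missing values (NA) in each column
colSums(is.na(iris))
```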

Summary Statistics

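The function summary provides useful summary statistics for the data. For each numeric variable it reports the ‘five number summary’ (minimum value, first quartile, median, third quartile, and maximum value) together with the mean. For a factor, such as Species, it reports the number of observations in each level.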

As with all functions, you can use help, ?summary, to find out more about this function. You can type this in the box and run the code to open a help page in your browser. There is also a lot of help online, such as on Stack Overflow, so do make use of Google!

Write the code to produce a summary output of the iris dataset:
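One possible answer is sketched below. The second and third calls are additions for illustration: they apply summary and fivenum to a single variable, so you can compare their output. Note that fivenum uses Tukey's hinges, which can differ slightly from the quartiles reported by summary.

```r
# Summary statistics for every column of the dataset
summary(iris)

# Summary statistics for a single numeric variable
summary(iris$Sepal.Length)

# Tukey's five number summary for the same variable
fivenum(iris$Sepal.Length)
```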

The mean, the sum of all values of a variable divided by the number of values, is a key statistic of interest in EDA. The mean function calculates it.

Write the R code to calculate the mean of the variable called ‘sepal length’:

mean(iris$Sepal.Length)

On its own the mean doesn’t tell us anything about the variation in the values of the variable. For that, we need to calculate the standard deviation. Do this for the variable called ‘sepal length’, using the sd function:

sd(iris$Sepal.Length)

Sometimes we want to apply a function to subsets of the data, for example to compare a variable between groups. A simple way to do this is to produce a table using tapply. The first argument is the variable of interest, the second argument is the grouping variable, and the third argument is the function to apply to each group.

Tip: use the $ operator to extract columns by name (see ?Extract)

Write the code using tapply to report the mean sepal length by species:

tapply(iris$Sepal.Length, iris$Species, mean)
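The same pattern works with any function that returns a single value per group. As a sketch, the code below combines the group means and standard deviations into one table. The object names species_means and species_sds are chosen here for illustration. The aggregate call is an alternative from base R that returns a data frame instead.

```r
# Mean and standard deviation of sepal length for each species
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)
species_sds   <- tapply(iris$Sepal.Length, iris$Species, sd)

# Bind the two results into one table, rounded to two decimal places
round(cbind(mean = species_means, sd = species_sds), 2)

# Alternative: aggregate uses a formula and returns a data frame
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
```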

Box and Whisker Plots

Visualisation is an important part of exploratory data analysis. R is a very powerful graphics tool so it’s worth exploring this in more depth when you come to analyse and present your own data.

Box and whisker plots are a very useful visualisation for EDA, using the function boxplot. They are a standardized way of visualising the distribution of data based on the ‘five number summary’ and can show us whether there are outliers (by default, values lying more than 1.5 times the interquartile range beyond the box, which are drawn as individual points).

Make a boxplot of the iris dataset:

boxplot(iris)

Since the data also vary by species it’s useful to plot them by species. It’s possible to do this directly within the boxplot function, by specifying which data to plot using a formula.

Write code to make a boxplot that shows sepal length by species. You may need to look up the help for boxplot.

boxplot(iris$Sepal.Length ~ iris$Species, xlab="Species", ylab="Sepal length")
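The boxplot function also returns the numbers it used to draw the plot, which can help when reading it. The sketch below stores that result in an object, here called bp for illustration, and uses the data argument so that the columns can be named without the iris$ prefix.

```r
# Draw the boxplot and keep the statistics that boxplot returns invisibly
bp <- boxplot(Sepal.Length ~ Species, data = iris,
              xlab = "Species", ylab = "Sepal length")

# One column per species; rows are lower whisker, lower hinge,
# median, upper hinge and upper whisker
bp$stats

# Values drawn as individual points (potential outliers)
bp$out

# Index of the group each of those values belongs to
bp$group
```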

Quiz

Data distributions

Histograms

Statistical tests make assumptions about the distribution of the underlying data. Thus, understanding the distribution of your data is essential for choosing appropriate statistical tests. Histograms are very useful for visualising the frequency distribution of your data, where you can check for features including skewness, symmetry and multi-modality.

We expect a bell-shaped pattern for a normal distribution. We can check this by making a histogram using the hist function.

Make a histogram of sepal width, with labeled axes.

hist(iris$Sepal.Width, xlab="Sepal Width", ylab="Frequency", main="")

It can be useful to manually set the number of bars of your histogram to aid interpretation.

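Now make a histogram of petal length, using the breaks argument to request 10 bars. Note that when breaks is a single number, R treats it as a suggestion and may draw a slightly different number of bars.

hist(iris$Petal.Length, xlab="Petal Length", ylab="Frequency", main="", breaks=10)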
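When breaks is given as a single number, hist passes it to pretty to choose round break points, so the number of bars drawn may not match the number requested. A sketch of how to check this, and of how to force an exact number of bars by supplying the break points yourself, is shown below. The object names h and exact_breaks are illustrative.

```r
# Compute the histogram without drawing it, requesting about 10 bars
h <- hist(iris$Petal.Length, breaks = 10, plot = FALSE)

# Number of bars R actually chose, and where the breaks fall
length(h$counts)
h$breaks

# To get exactly 10 bars, supply 11 equally spaced break points
exact_breaks <- seq(min(iris$Petal.Length), max(iris$Petal.Length),
                    length.out = 11)
hist(iris$Petal.Length, breaks = exact_breaks,
     xlab = "Petal Length", ylab = "Frequency", main = "")
```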

QQplots

We can use a quantile-quantile plot to visualise these data slightly differently. QQplots plot the quantiles from a theoretical reference distribution against the quantiles from your data. Focusing on the normal distribution again, if our data are normally distributed then the points on the plot will lie roughly along the central reference line.

Use the qqnorm function to produce a QQplot for sepal width:

qqnorm(iris$Sepal.Width)

Copy the code for the qqnorm plot and on the next line, use qqline to add a straight line for ease of interpretation:

qqline(iris$Sepal.Width)
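To get a feel for how to read these plots, it can help to compare data that are known to be normal with data that are known to be skewed. The sketch below uses simulated samples, which are not part of the iris data, drawn with rnorm and rexp. The seed and sample size are arbitrary choices.

```r
# Simulate one normal and one right-skewed sample of 150 values each
set.seed(42)
normal_sample <- rnorm(150)
skewed_sample <- rexp(150)

# Draw the two QQ plots side by side, then restore the plot settings
old_par <- par(mfrow = c(1, 2))
qqnorm(normal_sample, main = "Normal sample")
qqline(normal_sample)
qqnorm(skewed_sample, main = "Right-skewed sample")
qqline(skewed_sample)
par(old_par)
```

Points from the normal sample should lie close to the line, while the skewed sample should curve away from it, most visibly in the upper tail.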

Testing for normality

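Several statistical tests exist to test for deviation from the normal distribution. One such method is the Shapiro-Wilk test. The null hypothesis of this test is that the data are drawn from a normal distribution. If the test reports a p-value smaller than 0.05, we reject this hypothesis and conclude that the data deviate from a normal distribution. A larger p-value means we have no evidence against normality; it does not prove that the data are normal.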

Run this test for sepal width, using the function shapiro.test:

shapiro.test(iris$Sepal.Width)

Note that you have tested the data for all species together; for your analysis you might need to test each species separately.
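One way to run the test separately for each species is sketched below, using split to divide the values by group and sapply to apply the test to each group. The object names widths_by_species and p_values are illustrative.

```r
# Split sepal width into a list with one vector per species
widths_by_species <- split(iris$Sepal.Width, iris$Species)

# Run the Shapiro-Wilk test on each vector and keep the p-values
p_values <- sapply(widths_by_species,
                   function(w) shapiro.test(w)$p.value)
p_values
```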

Quiz

Data Patterns

Pairs plot

We often want to explore patterns between variables in our data, such as when we need to identify collinearity. We can do this visually using multiple scatterplots.

The pairs function allows us to do this easily. Investigate patterns in the iris dataset using this function:

pairs(iris)
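Because Species is a factor, pairs(iris) plots it as integer codes alongside the measurements. A possible variation, sketched below, plots only the four numeric columns and uses the species to colour the points instead. The colour and symbol choices are illustrative.

```r
# Scatterplot matrix of the four measurements, coloured by species
# (as.integer turns the three factor levels into colour numbers 1 to 3)
pairs(iris[, 1:4], col = as.integer(iris$Species), pch = 19)
```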

Scatterplots

You may want to compare just two of these variables, using a scatterplot.

Plot petal length on the x axis against sepal length on the y axis, using the plot function:

plot(iris$Petal.Length, iris$Sepal.Length,
     xlab='Petal length', ylab='Sepal length')

Testing for correlations

Now, we can also test for a correlation between two variables. A simple way to do this is to calculate the Pearson correlation coefficient, which ranges from -1 to +1. A common rule of thumb is that if the absolute value of the coefficient is greater than 0.7 (that is, above 0.7 or below -0.7), then the two variables are strongly correlated.

Using the cor function, calculate the correlation between sepal length and sepal width. Note that the default method for this function is Pearson, but other methods can be chosen by changing the method argument.

cor(iris$Sepal.Length, iris$Sepal.Width)
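The cor function returns only the coefficient. If you also want a p-value and a confidence interval, base R provides cor.test. The sketch below shows both, along with a correlation matrix for all four numeric columns. The object names r and ct are illustrative.

```r
# Pearson correlation coefficient between sepal length and sepal width
r <- cor(iris$Sepal.Length, iris$Sepal.Width)
r

# The same coefficient with a significance test and confidence interval
ct <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = "pearson")
ct$estimate
ct$p.value

# Correlation matrix for all four measurements
round(cor(iris[, 1:4]), 2)
```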

Quiz

Saving your output

You can save your graphics from R directly to your working directory, in several file types (pdf, png, bmp, jpg, tiff).

If you don’t know where you have set your working directory, you can check it using getwd():
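For example, the following sketch prints the current working directory and lists the files already in it, so you know where your saved plot will appear:

```r
# Print the path of the current working directory
getwd()

# List the files currently in that directory
list.files()
```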

Modify this code to save one of your plots, by adding the code for the plot in between the two lines. Note that you won’t see your plot on screen, as it is written directly to a file in your working directory.

tiff(filename='Plot.tif')


dev.off()
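A complete sketch of the same pattern, using the pdf device and the species boxplot from earlier, might look like this. The file name Sepal_boxplot.pdf and the plot dimensions are illustrative choices; other devices such as png or tiff follow the same open, plot, close sequence.

```r
# Open a pdf file in the working directory (width and height in inches)
pdf(file = "Sepal_boxplot.pdf", width = 6, height = 5)

# Draw the plot; it is written to the file, not shown on screen
boxplot(Sepal.Length ~ Species, data = iris,
        xlab = "Species", ylab = "Sepal length")

# Close the device so that the file is finished and saved
dev.off()

# Confirm that the file now exists
file.exists("Sepal_boxplot.pdf")
```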