BASIC CONCEPTS OF STATISTICS AND DATA ANALYSIS

Instructor: Richard L. Smith

This page was last updated May 7, 2009.

** Final Exam with Solutions **

** Summary Scores and Final Grades **

The score on the final exam is given in the "FIN" column (max score 120).

Final grades were based on the aggregated total of all scores ("FINTOT" column), in which the scores are combined in the proportions announced at the start of the semester: 25 points for HW (best 8 of 11 HWs), 20 points for each of the two midterms, and 35 points for the final exam.

Grading scheme:

92 to 100: A

89 to 91.9: A-

86 to 88.9: B+

82 to 85.9: B

79 to 81.9: B-

76 to 78.9: C+

72 to 75.9: C

65 to 71.9: C-

50 to 64.9: D
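
As a rough illustration of how the weights and the grading scheme above fit together, here is a minimal Python sketch. The final exam maximum of 120 comes from the "FIN" column note, and the midterm maximum of 100 is inferred from the "divided by 5" remark in the provisional totals further down this page. The per-homework maximum of 20, the function names and the sample scores are all hypothetical, and any rounding applied to the real totals is not modelled.

```python
def weighted_total(hw_scores, mt1, mt2, final,
                   hw_max=20.0, mt_max=100.0, final_max=120.0):
    """Combine raw scores into a 0-100 total using the announced weights."""
    best8 = sorted(hw_scores, reverse=True)[:8]           # best 8 of 11 homeworks
    hw_part = 25.0 * sum(best8) / (8 * hw_max)            # homework: 25 points
    mt_part = 20.0 * mt1 / mt_max + 20.0 * mt2 / mt_max   # midterms: 20 points each
    final_part = 35.0 * final / final_max                 # final exam: 35 points
    return hw_part + mt_part + final_part


# Lower bound of each band, taken from the grading scheme above.
CUTOFFS = [(92, "A"), (89, "A-"), (86, "B+"), (82, "B"), (79, "B-"),
           (76, "C+"), (72, "C"), (65, "C-"), (50, "D")]


def letter_grade(total):
    """Return the first band whose lower bound the total reaches."""
    for lower, grade in CUTOFFS:
        if total >= lower:
            return grade
    return None  # the scheme above lists no grade below 50


# Hypothetical student: 11 homework scores out of an assumed 20 each.
hw = [18, 20, 15, 0, 19, 17, 20, 16, 12, 20, 18]
total = weighted_total(hw, mt1=85, mt2=78, final=96)
print(round(total, 1), letter_grade(total))
```

Matching on lower bounds only means a total that falls between two listed ranges is still assigned to the lower band.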

** REVIEW SESSIONS **

First review session: Tuesday, April 28, 5:30-6:30 pm in Hanes 107 or
Hanes 120 (I have a graduate class in Hanes 107 up to 5:20; if that should
overrun, please wait for us to finish).

Second review session: Friday May 1, 3:30-4:50 pm, Gardner 08.

** Class of 04/21/09 (and 04/23/09)**

** Arrangements for final exam**

The final exam is at noon on Saturday, May 2, in the same room as
the regular class (Hanes 120).

The rules are the same as in the two midterms: it is an open-book exam,
and you are allowed to bring the course text, all class materials,
and your own personal notes. You should also bring a calculator and
one or two blue books.

It is a three-hour exam. The material covered will be everything that
has been covered in class during the whole course, though you may
expect a greater emphasis on topics that we have covered since
midterm 2.

As with all exams, it is subject to the university's Honor Code.

** Final exam from Fall 2006 **

HW11 (last HW of course), due 04/23/09: Chapter 9, questions 9.8, 9.16, 9.62, 9.70.

** Class of 04/16/09**

** Class of 04/14/09**

HW10, due 04/16/09:
Chapter 8, questions 8.32, 8.36, 8.54, 8.80

* Note: * This comment applies specifically to 8.36, but you
can take it as a general instruction that applies to all the questions.

Where the question refers to an example worked out in the text,
you can quote anything from the text as given. So in 8.36, where
you are asked to calculate the effect on the hypothesis test of changing a
suspected outlier, you can treat all results from the original dataset
that are worked out in the text as given, and quote them freely (though
if you do that, make clear that you are quoting from the text).

HW9, due 04/09/09: Chapter 7, questions 7.46, 7.48. Chapter 8, questions 8.14, 8.20.

** Class of 04/09/09** (I actually got up to page 11 in class,
but have included pages 12-16 because they are relevant for one question
on HW10. I will go through these pages in detail on Tuesday.)

** Class of 04/07/09**

** Class of 04/02/09**

** Class of 03/31/09**

** Midterm 2 with solutions **

** Class of 03/26/09**

HW8, due 04/02/09: Chapter 7, questions 7.20, 7.28, 7.66, 7.76.

Comments on 7.66: The question asks you to look up some data on the Web, but you should not have any trouble finding it if you follow the web link given and search for the variable FEHELP in the year 1998. However, interpreting it may be a little more difficult: I believe they want you to combine the categories labelled "Strongly agree" and "Agree" and omit the cases for which no data are available. Even then, I think there is one tiny error in the question, though one which has no material impact on the final answer.

** Summary Scores up to HW7 (updated March 25)**

This contains provisional totals computed as follows. For HW1-7,
I deleted the worst two scores and rescaled the rest to a maximum
of 15 points. For MT1 and MT2, each total was divided by 5 so that
it becomes a total of 20 for each exam. Thus the scores in the "SUBTOT"
column are out of 55 total.
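
The provisional calculation just described can be sketched in Python as follows. The per-homework maximum of 20, the function name and the sample scores are hypothetical; only the drop-two, rescale-to-15 and divide-by-5 steps come from the description above.

```python
def provisional_subtotal(hw_scores, mt1, mt2, hw_max=20.0):
    """Provisional total out of 55 from HW1-7 and the two midterms."""
    kept = sorted(hw_scores)[2:]                         # delete the worst two scores
    hw_part = 15.0 * sum(kept) / (len(kept) * hw_max)    # rescale the rest to 15 points
    return hw_part + mt1 / 5.0 + mt2 / 5.0               # each midterm becomes 20 points


hw = [18, 20, 15, 0, 19, 17, 20]   # hypothetical HW1-7 scores
subtotal = provisional_subtotal(hw, mt1=85, mt2=78)
print(round(subtotal, 1))
```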

Rough breakdown into grades:

A: 49.5 or better

B: 43 to 49.4

C: 35 to 42.9

D: 30 to 34.9

F: 0 to 29.9

Mean 72.5; SD 16.1; 5-number summary (32, 60.75, 74.5, 86, 97)

** Class of 03/05/09** (Note: we got up to slide 21 in class.
The rest will be covered on Tuesday March 17.)

Midterm 2 has now been fixed for Thursday, March 19.

This will be held under the same conditions as Midterm 1:
an in-class exam, open book, bring your own blue book and calculator.

The material covered is Chapters 5 and 6.

There will be a review session the night of Monday, March 16
(Hanes 120, 5:15-6:30 pm).

** Midterm 2 from Fall 2006**

** Homework 7, due 03/19/09: **
6.32, 6.68, 6.76, 6.82

Hint: In 6.82, the statement "20 gallons per week is the third quartile"
is equivalent to saying "75% of the population uses 20 gallons or less".
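
To see the equivalence in the hint numerically, here is a small Python sketch. It assumes, purely for illustration, a normal distribution with made-up parameters; these are not the values from question 6.82.

```python
from statistics import NormalDist

# Hypothetical distribution of weekly usage (parameters invented for illustration).
usage = NormalDist(mu=16.0, sigma=6.0)

q3 = usage.inv_cdf(0.75)             # third quartile: the value with 75% at or below it
share_at_or_below = usage.cdf(q3)    # proportion of the population using q3 or less

print(round(q3, 2), round(share_at_or_below, 2))
```

The two calls are inverses of each other, which is exactly what the hint says: naming a value as the third quartile is the same as saying that the cumulative proportion at that value is 0.75.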

** Homework 6, due 03/05/09: **
5.22, 5.32, 5.52, 5.80

1. 5.52 refers to question 5.51 so you will have to at least read that question, but I'm not asking you to answer 5.51 - only 5.52.

2. In 5.80(c), calculate the probability that someone has a heart attack,
given they tested positive for CK. (The question doesn't ask you in so many
words to do that, but I think that's what they meant.)
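
The kind of conditional probability meant here can be sketched in Python as below. The counts are hypothetical and are not the figures from question 5.80; the point is only that the denominator is everyone who tested positive, not everyone who had a heart attack.

```python
# Hypothetical two-way table of counts: (heart attack?, CK test result).
counts = {
    ("attack", "positive"): 30,
    ("attack", "negative"): 10,
    ("no attack", "positive"): 20,
    ("no attack", "negative"): 140,
}

# Restrict attention to the people who tested positive ...
positive_total = counts[("attack", "positive")] + counts[("no attack", "positive")]

# ... and ask what fraction of them had a heart attack.
p_attack_given_positive = counts[("attack", "positive")] / positive_total

print(p_attack_given_positive)
```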

** Class of 03/03/09** (includes normal tables)

** Class of 02/26/09**

** Midterm 1 with solutions **

** Class of 02/24/09**

** Class of 02/19/09**

Mean=83.75, SD=11.07.

Min=37, Q1=78, Median=87.5, Q3=92, Max=99

Homework 5: Due 2/26/09: 4.78 (double question - worth 10 points), 5.14, 5.18.

Notes about 4.78 (asking you to read a medical journal and summarize the experimental design or sampling methodology): Your answer should be about a page long, and should include the reference (title of paper, name of lead author, the journal, volume and page number) and a brief description of what the paper was about, as well as answering questions (a) through (d). Do not hand in the article itself. For the selection of journal: www.bmj.com is a good one because this is open access (anyone can read any article in full without subscribing). Other possible journals are JAMA or the American Journal of Epidemiology (find the home page via Google or, better, search the UNC library online) though I will accept anything that is in a professional journal (not a newspaper article or popular magazine).

There will be an optional review session for the midterm exam from 5:15-6:30 pm, Monday February 16, Hanes 120.

** MIDTERM EXAM. **

The first midterm will take place in class, Tuesday February 17. This is OPEN BOOK - course text and personal course notes are allowed. Bring a blue book and calculator. Computers are not allowed in the exam.

The material covered will be Chapters 1-4 of the text. As a guide to what
to expect,
** here **
is my exam from Fall 2006. This used the same text and covered the same
material as we have in this class. The only difference is that this time,
to save photocopy costs, I will ask you to write your answers in a blue book
rather than directly on the exam.

HW4: due 02/12/09.

This consists of two conventional problems plus the regression exercise described below. The conventional problems count 10 points.

Conventional problems: 4.14, 4.32 from the course text.

Regression problem:

Refer to
**this spreadsheet**
for the data on student heights as well as the question about
how much they spent on their last haircut. For convenience, I have
amended the original spreadsheet by editing out those entries for
which someone did not answer these two questions.

I want you to investigate whether someone's height has any influence on how much money they are willing to spend on a haircut. To this end:

(a) Compute a regression in which the Y variable is "Haircut" and the X variable is "height". What is the regression equation? (Use Excel for this.)

(b) Suppose someone has a height of 63 inches. Predict how much they spent on their last haircut.

(c) Calculate the mean and standard deviation of each variable, and the correlation coefficient r (again, it's fine to use Excel for these calculations).

(d) Use the formulas given in class (or p. 117 of the text) to show how to derive the intercept and slope of the regression from the statistics you calculated in (c). (The answer should match what you found in (a), but I want to see your working.)

(e) Draw a scatterplot in which the Y variable is "Haircut" and the X variable is "height". Then add in the trend line BY HAND. (For the scatterplot itself, you can either use Excel or draw the points by hand. However, for the trend line, I want you to do this by hand. This is intended to give you practice in how to find the trend line if you have to do it for yourself.)

(f) Now repeat part (a) of this exercise but separating the class into men and women. (There is no need to go through the rest of the calculations in (b) through (e) again.) How do the results change?

(g) Briefly summarize the results of this whole exercise. Do you believe there is a genuine relationship between someone's height and how much they are willing to spend on a haircut?

[Excel hint: if you want to extract a subset of the observations, for example all the female students, you could first "Sort" the data by the "Gender" column. Then all the Ms appear together in a block and so do all the Fs and it is easy to extract the portion you want.]
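
For anyone who wants to check parts (b) through (d) outside Excel, here is a minimal Python sketch. It uses the standard least-squares relations, slope = r * s_y / s_x and intercept = y_bar - slope * x_bar, which are assumed here to be the formulas referred to in part (d). The six data points are hypothetical and are not taken from the class spreadsheet.

```python
from statistics import mean, stdev

# Hypothetical data: heights in inches and amount spent on the last haircut.
height = [62, 64, 66, 68, 70, 72]
haircut = [40, 35, 30, 20, 18, 15]

n = len(height)
x_bar, y_bar = mean(height), mean(haircut)
s_x, s_y = stdev(height), stdev(haircut)      # sample standard deviations (n - 1)

# Correlation coefficient r from the deviations about the means.
r = sum((x - x_bar) * (y - y_bar)
        for x, y in zip(height, haircut)) / ((n - 1) * s_x * s_y)

slope = r * s_y / s_x                  # slope from r and the two SDs
intercept = y_bar - slope * x_bar      # the line passes through (x_bar, y_bar)

predicted_at_63 = intercept + slope * 63   # prediction as in part (b)

print(round(r, 3), round(slope, 3), round(intercept, 2), round(predicted_at_63, 2))
```

The slope obtained this way should agree with the one from a fitted trend line on the same data, which is the check part (d) asks for.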

**Class notes 1/29/09**

(This covers a little more material than I actually did in class,
but since I got almost to the end, I decided to post the whole
thing at once. In Tuesday's class, 02/03/09, I'll review pages 16/17
and then complete this material before going on to Chapter 4.)

HW3, due 2/05/09: 2.96, 2.104, 2.112, 3.40.

Note: question 2.112 asks you to download some data on baseball home runs from the course CD, and use the techniques of this course to determine which player was the best. Obviously there is no unique correct answer to this question and you should feel free to express your personal opinion regarding which characteristics you feel are most important. However, what is important is that you back up your opinion by using appropriate statistical techniques. The question will be graded not by which player you select, but by how well you back up your answer with suitable statistics.

Links to datasets:

**Class survey (updated)**

** Baseball example**

** Charleston and Mount Airy Temperatures **

** Presentation for Jan 22 and Jan 27**

Notes on class of Tuesday 1/20/09:

Part of Tuesday's class was devoted to producing pie charts, bar
charts and histograms using Excel, and also some of the statistical
functions. I have a handout
**here**
that covers the ** old ** version of Excel - many of the commands
are basically the same in the new version, but they are in different places
in the display. In particular, there is now no "data analysis add-on";
instead, many of the statistical functions are found by going to
"Formulas" then "More Functions" then "Statistical". Also from the
home page, click "Insert" to get a header that includes many of the
graphical display tools as well as "Pivot Table" to create tables.
I may try to update the notes as I go along, but with each new version
of Excel, it becomes harder to explain what's going on in simple
verbal terms - you really have to practice yourself to understand
how the different features work (that's true for me as well!)

Apart from that, we covered measures of center and spread - the mean, median, mode, quartiles, inter-quartile range and standard deviation. Please go through the calculation of standard deviation on page 59 of the text. Then, as a small exercise (not part of the assigned homework), I'm asking you to repeat the calculation using the eight numbers of the CO2 emissions example: 0.2, 0.7, 1.1, 1.2, 1.8, 2.3, 9.8, 19.7. (The answer is that the standard deviation is 6.83. What I'm asking you to do is to show that you can derive that yourself by direct calculation.)
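
The direct calculation for the CO2 numbers can also be laid out step by step in Python, as sketched below. It assumes the sample standard deviation (dividing by n - 1), which is the version that reproduces the 6.83 quoted above.

```python
from math import sqrt

data = [0.2, 0.7, 1.1, 1.2, 1.8, 2.3, 9.8, 19.7]   # CO2 emissions example

n = len(data)
mean = sum(data) / n                                # step 1: the mean
squared_devs = [(x - mean) ** 2 for x in data]      # step 2: squared deviations
variance = sum(squared_devs) / (n - 1)              # step 3: divide by n - 1
sd = sqrt(variance)                                 # step 4: square root

print(round(mean, 2), round(sd, 2))   # 4.6 6.83
```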

In the next class (Thursday Jan 22), we will cover outliers and boxplots to conclude chapter 2, and then start on chapter 3. I am not in town myself and class will be taken by Vangelis Evangelou. However, this will follow the usual schedule and you are expected to attend. Also a reminder that the first homework assignment is due Thursday.

Homework 2, due 1/29/09:

Chapter 2, question 2.20 (page 46), 2.82, 2.84, 2.94 (pages 81-83).

Notes:

Question 2.20 asks you to use software based on the "sugar" data on the CD (amount of sugar in popular brands of cereal). You should all have the book and hence the CD by now, but in case anyone has a problem with that, I append the data below. You do not have to use software - I will accept freehand sketches - but I encourage you to explore the use of Excel at least for the histogram part of this. If you are familiar with other types of software for producing these graphs, that's OK as well (but in that case, please specify what software you used).

The data: 7 5 14 12 1 13 2 12 10 3 11 13 3 10 11 6 10 15 3 3
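
If you use something other than Excel, a quick text histogram of these numbers can be produced with a few lines of Python, as in the sketch below. The bin width of 4 is an arbitrary choice made for illustration; the question does not prescribe one.

```python
sugar = [7, 5, 14, 12, 1, 13, 2, 12, 10, 3, 11, 13, 3, 10, 11, 6, 10, 15, 3, 3]

bin_width = 4   # assumed bin width; other choices give a differently shaped plot

# Tally how many values fall in each bin, keyed by the bin's lower edge.
counts = {}
for value in sugar:
    lower = (value // bin_width) * bin_width
    counts[lower] = counts.get(lower, 0) + 1

# Print one row of stars per bin, from the lowest bin upward.
for lower in sorted(counts):
    upper = lower + bin_width - 1
    print(f"{lower:2d}-{upper:2d} | {'*' * counts[lower]}")
```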

In question 2.82(c), whichever of the eight descriptive methods you recommend, you should calculate or draw the relevant descriptive statistics (in other words, don't just say which method you would use, you should do it as well).

Question 2.94 asks about mean and standard deviation, which was covered in class on Tuesday. Note: to answer this question, it is not sufficient to enter the numbers in your calculator or in Excel and just quote the answer. I want to see the actual working, like the example on page 59. However, you are allowed to use a calculator or Excel to help you.

** Older material: **

** Presentation Jan 15 (updated Jan 19) **

** Presentation Jan 13 **

** File of responses to questionnaire **

** Links to newspaper articles and other websites mentioned during the course: **

** USA Today on travel risk **

** PLEASE NOTE: The class is in HANES 120, not Bingham 103 as announced earlier. **

** Tuesdays and Thursdays, 12:30-1:45. **

** If you have any questions, feel free to email me (rls at email.unc.edu) **
