| Hwk No. | Reading | Exercises from textbook | Excel exercises | Assigned | Due / Collected | Returned |
| 1 | Section 1.1 | Excel Primer, Ex. 1 | 8/26 | 8/28 | 9/4 | |
| 2 | Section 1.1 | 1.6, 1.12, 1.14 | Excel Primer, Ex. 2 | 8/28 | 9/2 | 9/9 |
| 3 | Section 1.2 | 1.18, 1.40 (Excel graph) | Excel Primer, Ex. 3 | 9/2 | 9/4 | 9/11 |
| 4 | Section 1.3 | 1.44, 1.47, 1.64,1.76 | 9/4 | 9/9 | 9/16 | |
| 5 | Chapter 1 for quiz; Chapter 2 Intro and Section 2.1 | 1.78, 1.80, 1.88, 1.94 | 9/9 | 9/11 | ||
| 6 | Section 2.1 | 1.102, 1.106; 2.4 | Excel Primer, problem 4; the exercise below | 9/11 | 9/16 | |
| 7 | Section 2.2, 2.3 | 2.10, 2.18, 2.22, 2.28 plus extra | 9/16 | 9/23 | ||
| 8 | Section 2.2, 2.3 | 1.132 (in Excel; see hints below); 2.32 | Excel Primer, Ex. 5; an approximation for normal probabilities -- see below | 9/19 | 9/23 | |
| 9 | Section 2.3, 2.4 | finish 1.132 and the approximations problem; 2.44 | 9/23 | 9/25 | ||
| 9 extra credit | 2.44 extra, 2.56 extra | 9/23 | 10/2 | |||
| 10 | Sections 2.3-2.5 | 2.46, 2.62, 2.66 | use Excel for scatterplots/regression | 9/30 | 10/02 | 10/14 |
| 11 | 2.5, 3.1, 3.2 | 2.82, 2.84, 2.86 | 10/02 | 10/07 | 10/14 | |
| 12 | Sections 3.1-3.3 | 3.4, 3.8, 3.12, 3.18 | 10/7 | 10/9 | 10/16 | |
| 13 | Sections 3.3, 3.4 | 3.28, 3.32, 3.38, Generous Motors example | 10/9 | 10/14 | ||
| 14 | 3.4, 4.1 | 3.44, 3.50, 3.51, 3.54 | 10/14 | 10/16 | 10/21 | |
| 15 | 3.3, optional Utts Ch. 2 and 3 | Analyze a news article | 10/14 | 10/21 | 10/28 | |
| 15 extra credit | Break the code by matching probabilities | 10/20 | 10/28 | |||
| 16 | Sections 4.2, 4.5 | 4.10, 4.13, 4.14, 4.20 | 10/28 | 10/30 | 11/4 | |
| 16 extra | A deck of 52 cards is shuffled, and two cards are dealt. Verify by a direct calculation through a tree diagram that the events "The first card is a spade" and "The second card is an ace" are independent. | 10/28 | 11/4 | |||
| 17 | Sec 4.3, 4.4, 3.4 | 4.44, 4.58, 4.61 | 10/30 | 11/4 | ||
| 18 | Sec. 3.4, 4.4, 5.1 | 5.2, 5.6, 4.74 (use the rules rather than doing the distributions anew!) | 11/11 | 11/13 | ||
| 19 | Sec. 5.1, 5.2 | 5.18, 5.20, 5.26, | Sampling distributions | 11/13 | 11/18 | |
| 20 | Sec 5.1, 5.2, 6.1 | 5.30, 5.36, 5.48, 5.49 | 3.72 | 11/18 | 11/20 | |
| 15 rev | The new code | 11/19 | 11/25 | |||
| 21 | Sec 6.1, 6.2 | 6.4, 6.6, 6.14, 6.15 | 11/20 | 11/25 | ||
| 22 | Sec. 6.2, 6.3 | 6.32, 6.42, 6.50 | 12/2 | 12/4 | ||
| 23 | Test for independence | 12/3 | 12/8 |
We shall simulate a bunch of Bernoulli
trials to see how those two characteristics behave when we have truly
independent sequences of zeros and ones. We shall also take for granted
that the probability of a success (i.e., of 1) is a half.
Q: Can you test for that?
(i) To set up the simulations, here's what I typed:
The numbers run from 1 to 40, the total number of digits in the above series.
I also squeezed those columns by selecting them and dragging the width of any
column.
(ii) The next step would be to generate a single series of zeros and ones.
The random number generator built into Excel is RAND(), and it gives a number
with a distribution uniform over the interval (0,1). (Look up the help if
you need to.) To convert this into 0 or 1, we can do this:
The INT() function takes the integer part, i.e., chops the fraction. If we
start with a uniform distribution over (0,1), then mutliplying by two gives
a uniform over (0,2), and taking an integer part of it gives 0 whenever the
number is between 0 and 1 (and that happens half of the time), and 1 whenever
the number is between 1 and 2 (and that happens the other half of the time).
I then drag the formula across the fourty cells representing fourty trials.
(iii) Now, I need to figure out what my runs are, in the next row.
In a run, two consequent cells
should have the same numbers, otherwise, I would need to start counting
the new run. Here's my idea: I'll start the run with 1, and keep adding 1
until I hit a trial that resulted in something different. I shall then
reset my counter back to one:
If the two consecutive numbers are the same (E2=D2), then the IF() Excel's
operator will take whatever is written next (D3+1) and put into the
corresponding cell. D3 was the previous count, i.e., how long has been the
current run. Of course I would have to start from 1 to have that work out.
If the numbers in cells E2 and D2 are different, then I shall reset the
counter of the length of the run to 1.
I then drag that formula over the remaining 38 trials.
See if you can trace the logic of the computation with an example above.
(I've hidden the first three columns for a moment.)
(iv) OK, now all we need to do is to make some summaries over the second row:
to compute the number of runs, and to find the length of the longest run.
You can do the first one with COUNTIF() function (hint: recall that every
run starts with 1), and the second one is of course the MAX() over the row.
(v) Now we need to repeat the process a number of times
(I'd say 1000 -- it basically makes no difference, you just select the relevant
range and drag it down; Excel shows you how many repetitions it has made).
Trick: I found out that Excel also want to change the number 1 that we put in
the beginning of the run statistics (cell D3). To override that, I changed
D3 to be $D$1:
(vi) The next step would be to figure out the distributions
of the test statistics that we are interested in. It might be rather messy
to put them onto the same sheet, so what I shall do is to provide
those summaries on a separate Excel sheet.
Q: Why are the absolute references needed here?
Excel is nice enough to allow selecting a range from a different sheet
when typing in a formula in the cell. That's certainly useful here.
Make sure you check the sum of all frequencies by computing the totals
(=SUM(...) ) across your columns! You may find out that the range
of plausible # of runs or the longest run is in fact larger than
what one might have expected.
(vii)
Let us go back to the series we want to test. You can repeat steps (iii) and
(iv) to obtain the test statistics for the series in question.
Now, having obtained the distribution of the two statistics of interest
by simulation, test the hypothesis "The series is that of Bernoulli trials
with probability 1/2" vs. "The series some dependence in it".
Submission format: you would need to submit the answers
to questions scattered in the text marked with Q. Also, submit
one page with the distributions of the test statistics. Also,
submit your results of tests (1) on the proportion being 1/2 -- this
is a standard test that you can do using the textbook formulas;
(2) on the independence using your simulated distribution. Report
p-values for all your tests.
The problem is not necessarily as scary and long as it would seem.
The way to organize your computations efficiently in Excel is
to list all 15 SRS samples (say the first number is in column A, the second
is in column B), then putting their mean into the third column,
and then fill in the next three columns with corresponding probabilities
under the three sampling schemes (put the name of each scheme on the top),
with a bunch of zeroes if the corresponding SRS samples
cannot be taken under an alternative scheme. Then you can compute
products "probability * value" and "probability * (value-mean) squared" in the
next few columns, so that summing along those would give you
the means and the variances for the sampling distributions. Make sure
you have relative and absolute references right wherever needed!
Submit the page with your computations
(if it is too wide, you can check "Fit to 1 page wide by 1 page tall"
in Print Preview -> Setup -> Page dialogue), and your answers to questions
(a)-(d) spelled out.
Provide a photocopy (or a printout from the web) of the piece of news
you are discussing.
2.44.
Export the relevant chunk of data into Excel
(File -> Open -> menBA.txt; the Text Import Wizard will pop up, and
you would need to follow the Wizard; the defaults should work reasonably
well.)
Reproduce the regression equation result obtained in the text book.
Continue the analysis by running multiple regression using age and
the square of age as two explanatory variables. Make a plot of your
data with the predicted line. Do you think adding the square term
improves the prediction?
2.56.
Do the multiple regression using all of the IQ, self-concept, and gender
to predict the GPA. What does the coefficient of the gender tell you?
Looking at your multiple regression, which of the variables seem to be
the most important one in explaining GPA? (To answer that, try rethinking
the regression on the standardized scales, i.e., rescaling each of the
explanatory variables to have mean zero and standard deviation of 1. Then
the regression coefficients will tell you my how much, keeping everything
else constant, a unit increase of the explanatory variable increases or
decreases the response. Then the most important variable would be the one
with the highest absolute value of the standardized coefficient.)
1.132. The Excel function that generates random numbers
between 0 and 1 is called RAND(). To convert those numbers
into the normal distribution, you would need to use the function
NORMINV() with the first argument being your random number.
Approximate normal probabilities
Sometimes a fairly simple approximation can be used to compute
the tail probabilities for the normal distribution (i.e. the area
to the left of a negative x, or the area to the right of
positive x). If F(x) is the area to the left of x, then
F(x) is approximately f(x)/|x| for x<0,
and F(x) is approximately 1-f(x)/x for x>0, where f(x) is the
standard normal density (p. 69, mu=0, sigma=1).
Your assignment will be to assess the quality of this approximation.
Follow the steps below.
(a) Open a new sheet in Excel.
The first column will be the values for which you will compute
both the "exact" and approximate values. Title the column something
like "x" and type 1, 2, ... 7 in column A.
(b)
The quality of approximation is usually assessed in two ways:
the absolute accuracy and the relative accuracy. The absolute
accuracy is the absolute value (Excel function: ABS() )
of the difference between the
"true" number and an approximation to it. The relative accuracy
is the absolute accuracy divided by the "true" number.
Q.: Which you think is more important, the absolute or the
relative accuracy?
The first line of your Excel sheet would then look something
like this:
(c) Fill in the whole table in Excel. Are there any
obvious patterns in accuracy, either absolute, relative, or both?
(d) The relative accuracy tells you how many correct decimal
figures your approximation gives. The relative accuracy of 10% means
that only the first digit is accurate. Q: For which values of x
is the approximation at least one digit accurate? Mark
them on the Excel sheet. The relative accuracy of 1% means
that there are two accurate digits given by the approximation.
Q: For which values of x is the approximation accurate to
two digits? Mark them on the Excel sheet.
Hand in Excel sheet and answers to the questions.
Test for independence
As you would have remembered from Chapter 4, independence might be a
tricky issue, and sometimes it takes quite a bit of work to verify
if events are independent or not. In this exercise, we shall try to see
if the given sequences seem to be coming from Bernoulli trials with
independent outcomes, or there is a dependence structure to them.
Here are three sequences: one from a sequence of independent trials,
one is just hand written as I thought was random, and the third one
is from a random process that I know has some dependence structure.
Here are the three sequences to be analyzed, in mixed order:
1011001101010101010010100001101110010010
0101110101000100101011101110010101010110
0100000011001100001110111011010100010100
What we shall look at are runs, i.e. the series of the same
number 0 or 1. The test statistic we shall be using are the number of
such runs (for the first series, it is 27), and the length of the longest
run (for the first series, it is 4). Why would that make sense? Here are
two extreme examples of series that have the same number of ones and zeros
(equal to the half of the total length), but demonstrate very strong
dependence:
0000000000000000000011111111111111111111
0101010101010101010101010101010101010101
In the first series, the two consecutive numbers are very likely to
be the same (except for the break in the middle). It has only two runs,
and the longest is 20. In the second series, each consecutive number is
different from the previous one, so there are fourty runs, with the longest
being 1. The random behavior lies somewhere in between, and in this exercise
we shall try to see how random the sequence of 0 and 1 may be.
Q: What is the probability of observing any of those patterned sequences?









SRS vs. systematic vs. stratified
Suppose you have the following population of the number of students
in several classes: 15, 39, 36, 12, 22, 28. You are taking a sample of size
two from that distribution, compute the mean, and is interested in
that mean's properties.
(a) Write out the possible samples under the simple random
sampling scheme. There should be "6 choose 2" = 15 different samples.
Find the mean and the variance of the sampling distribution by computing
the mean for each possible sample, and then proceeding in the way
similar to what I did in class.
(b) Next, let's analyze the systematic random sample mechanism
properties. There are only three possible samples: the first and the
fourth figures in the list above; the second and the fifth; and
the third and the sixth. Derive the distribution of the sample mean
and find its mean and variance. (Note also that this design is
sensitive to the ordering of the values, which is certainly an
undesirable feature.)
(c) Now, let's suppose we knew from the outset that the first,
the fourth and the fifth classes would be small, and the second, the third
and the sixth would be large. We would then make two strata of three
classes each, and take one observation from each stratum. We shall
then have 3*3 = 9 equally possible samples, and 6 samples from the
above list of SRS samples would take two values from one stratum,
and none from the other. Derive the distribution of x-bar and
find its mean and variance.
(d) What one would want from the sample statistic is that
it is a good estimator of the corresponding population quantity,
so we would want the sample mean to be closest to the population mean.
Does the mean of the distribution of x-bar equal the population mean
for all three designs? Which design has the smallest variance?
All in all, which design has the best statistical properties?
(Let us note in the parentheses that the systematic design is probably
the cheapest way to go, while the stratified design is the most
complicated since it requires having some prior knowledge. The sampling
statisticians have their ways to incorporate the questions of costs
into consideration so as to balance the simultaneous minimization
of the cost and the variance of a survey.)
Is that a good piece of news?
Find a news article that presents the results of a statistical
study. Was it an experimental study or an observational study?
Comment either on the forms of control (for an experiment) or
the sampling procedures (for the sample survey) of the study.
Address the seven critical components mentioned in class.
Do you think the article does a fair job describing the results
of the study? Anyting omitted? Anything added that was not there?
The most efficient way to organize your simulations would be
to have 20 columns corresponding to the number of times you repeat
the sampling, and 25 rows corresponding to the number of observations
you would need to generate. You can compute x-bar and s in the end
of those columns. Keep in mind that Excel keeps updating the
data sheet any time you change it, so it will regenerate the
random numbers each time you enter something new to your sheet.
Once you have created the 20 by 25 table with means and standard
deviations in the bottom, you can select it, Edit -> Copy it to
the Clipboard, and then Edit -> Paste Special -> Paste Values
so as to fix your simulation results.
Now go on and construct the summaries and provide your comments
as requested in the textbook. Hand in the sheet with your normal
quantile plots and discussion of the results.
The next column B will be the area to the left of your x,
and that is given by the Excel function NORMSDIST().
Column C will be the area to the right of x, so it is
1 minus the value in column B.
In the next column, you would need to compute the standard normal
density for a given x. The "e to the power" function is EXP(), the
square root if SQRT(), and the value of the pi number is given
by PI(). Then in column E you will compute your approximation
f(x)/x.
So, in your next column, compute the absolute accuracy, and
in the last column, the relative accuracy.
