Stat 31.4 homeworks

Homeworks are 20% of the total grade. See the relevant section of the syllabus on how the homeworks should be written up. We should be able to be through with the grading in a week's time, so the graded homeworks are returned to you a week after they were collected.
Hwk No. Reading  Exercises from textbook  Excel exercises  Assigned  Due / Collected  Returned
1 Section 1.1   Excel Primer, Ex. 1  8/26  8/28  9/4
2 Section 1.1 1.6, 1.12, 1.14 Excel Primer, Ex. 2 8/28 9/2 9/9
3 Section 1.2 1.18, 1.40 (Excel graph) Excel Primer, Ex. 3 9/2 9/4 9/11
4 Section 1.3 1.44, 1.47, 1.64,1.76   9/4 9/9 9/16
5 Chapter 1 for quiz; Chapter 2 Intro and Section 2.1 1.78, 1.80, 1.88, 1.94   9/9 9/11  
6 Section 2.1 1.102, 1.106; 2.4 Excel Primer, problem 4; the exercise below 9/11 9/16  
7 Section 2.2, 2.3 2.10, 2.18, 2.22, 2.28 plus extra   9/16 9/23  
8 Section 2.2, 2.3 1.132 (in Excel; see hints below); 2.32 Excel Primer, Ex. 5; an approximation for normal probabilities -- see below 9/19 9/23  
9 Section 2.3, 2.4 finish 1.132 and the approximations problem; 2.44   9/23 9/25  
9 extra credit   2.44 extra, 2.56 extra   9/23 10/2  
10 Sections 2.3-2.5 2.46, 2.62, 2.66 use Excel for scatterplots/regression 9/30 10/02 10/14
11 2.5, 3.1, 3.2 2.82, 2.84, 2.86   10/02 10/07 10/14
12 Sections 3.1-3.3 3.4, 3.8, 3.12, 3.18   10/7 10/9 10/16
13 Sections 3.3, 3.4 3.28, 3.32, 3.38, Generous Motors example   10/9 10/14  
14 3.4, 4.1 3.44, 3.50, 3.51, 3.54   10/14 10/16 10/21
15 3.3, optional Utts Ch. 2 and 3 Analyze a news article   10/14 10/21 10/28
15 extra credit   Break the code by matching probabilities   10/20 10/28  
16 Sections 4.2, 4.5 4.10, 4.13, 4.14, 4.20   10/28 10/30 11/4
16 extra   A deck of 52 cards is shuffled, and two cards are dealt. Verify by a direct calculation through a tree diagram that the events "The first card is a spade" and "The second card is an ace" are independent. 10/28 11/4  
17 Sec 4.3, 4.4, 3.4 4.44, 4.58, 4.61   10/30 11/4  
18 Sec. 3.4, 4.4, 5.1 5.2, 5.6, 4.74 (use the rules rather than doing the distributions anew!)   11/11 11/13  
19 Sec. 5.1, 5.2 5.18, 5.20, 5.26, Sampling distributions 11/13 11/18  
20 Sec 5.1, 5.2, 6.1 5.30, 5.36, 5.48, 5.49 3.72 11/18 11/20  
15 rev The new code     11/19 11/25  
21 Sec 6.1, 6.2 6.4, 6.6, 6.14, 6.15   11/20 11/25  
22 Sec. 6.2, 6.3 6.32, 6.42, 6.50   12/2 12/4  
23     Test for independence 12/3 12/8  


Test for independence

As you would have remembered from Chapter 4, independence might be a tricky issue, and sometimes it takes quite a bit of work to verify if events are independent or not. In this exercise, we shall try to see if the given sequences seem to be coming from Bernoulli trials with independent outcomes, or there is a dependence structure to them. Here are three sequences: one from a sequence of independent trials, one is just hand written as I thought was random, and the third one is from a random process that I know has some dependence structure. Here are the three sequences to be analyzed, in mixed order:

1011001101010101010010100001101110010010
0101110101000100101011101110010101010110
0100000011001100001110111011010100010100
What we shall look at are runs, i.e. the series of the same number 0 or 1. The test statistic we shall be using are the number of such runs (for the first series, it is 27), and the length of the longest run (for the first series, it is 4). Why would that make sense? Here are two extreme examples of series that have the same number of ones and zeros (equal to the half of the total length), but demonstrate very strong dependence:
0000000000000000000011111111111111111111
0101010101010101010101010101010101010101
In the first series, the two consecutive numbers are very likely to be the same (except for the break in the middle). It has only two runs, and the longest is 20. In the second series, each consecutive number is different from the previous one, so there are fourty runs, with the longest being 1. The random behavior lies somewhere in between, and in this exercise we shall try to see how random the sequence of 0 and 1 may be. Q: What is the probability of observing any of those patterned sequences?

We shall simulate a bunch of Bernoulli trials to see how those two characteristics behave when we have truly independent sequences of zeros and ones. We shall also take for granted that the probability of a success (i.e., of 1) is a half. Q: Can you test for that?

(i) To set up the simulations, here's what I typed:

The numbers run from 1 to 40, the total number of digits in the above series. I also squeezed those columns by selecting them and dragging the width of any column.

(ii) The next step would be to generate a single series of zeros and ones. The random number generator built into Excel is RAND(), and it gives a number with a distribution uniform over the interval (0,1). (Look up the help if you need to.) To convert this into 0 or 1, we can do this:

The INT() function takes the integer part, i.e., chops the fraction. If we start with a uniform distribution over (0,1), then mutliplying by two gives a uniform over (0,2), and taking an integer part of it gives 0 whenever the number is between 0 and 1 (and that happens half of the time), and 1 whenever the number is between 1 and 2 (and that happens the other half of the time). I then drag the formula across the fourty cells representing fourty trials.

(iii) Now, I need to figure out what my runs are, in the next row. In a run, two consequent cells should have the same numbers, otherwise, I would need to start counting the new run. Here's my idea: I'll start the run with 1, and keep adding 1 until I hit a trial that resulted in something different. I shall then reset my counter back to one:

If the two consecutive numbers are the same (E2=D2), then the IF() Excel's operator will take whatever is written next (D3+1) and put into the corresponding cell. D3 was the previous count, i.e., how long has been the current run. Of course I would have to start from 1 to have that work out. If the numbers in cells E2 and D2 are different, then I shall reset the counter of the length of the run to 1. I then drag that formula over the remaining 38 trials.

See if you can trace the logic of the computation with an example above. (I've hidden the first three columns for a moment.)

(iv) OK, now all we need to do is to make some summaries over the second row: to compute the number of runs, and to find the length of the longest run. You can do the first one with COUNTIF() function (hint: recall that every run starts with 1), and the second one is of course the MAX() over the row.

(v) Now we need to repeat the process a number of times (I'd say 1000 -- it basically makes no difference, you just select the relevant range and drag it down; Excel shows you how many repetitions it has made). Trick: I found out that Excel also want to change the number 1 that we put in the beginning of the run statistics (cell D3). To override that, I changed D3 to be $D$1:

(vi) The next step would be to figure out the distributions of the test statistics that we are interested in. It might be rather messy to put them onto the same sheet, so what I shall do is to provide those summaries on a separate Excel sheet.

Q: Why are the absolute references needed here?

Excel is nice enough to allow selecting a range from a different sheet when typing in a formula in the cell. That's certainly useful here.

Make sure you check the sum of all frequencies by computing the totals (=SUM(...) ) across your columns! You may find out that the range of plausible # of runs or the longest run is in fact larger than what one might have expected.

(vii) Let us go back to the series we want to test. You can repeat steps (iii) and (iv) to obtain the test statistics for the series in question. Now, having obtained the distribution of the two statistics of interest by simulation, test the hypothesis "The series is that of Bernoulli trials with probability 1/2" vs. "The series some dependence in it".

Submission format: you would need to submit the answers to questions scattered in the text marked with Q. Also, submit one page with the distributions of the test statistics. Also, submit your results of tests (1) on the proportion being 1/2 -- this is a standard test that you can do using the textbook formulas; (2) on the independence using your simulated distribution. Report p-values for all your tests.


SRS vs. systematic vs. stratified

Suppose you have the following population of the number of students in several classes: 15, 39, 36, 12, 22, 28. You are taking a sample of size two from that distribution, compute the mean, and is interested in that mean's properties.
(a) Write out the possible samples under the simple random sampling scheme. There should be "6 choose 2" = 15 different samples. Find the mean and the variance of the sampling distribution by computing the mean for each possible sample, and then proceeding in the way similar to what I did in class.
(b) Next, let's analyze the systematic random sample mechanism properties. There are only three possible samples: the first and the fourth figures in the list above; the second and the fifth; and the third and the sixth. Derive the distribution of the sample mean and find its mean and variance. (Note also that this design is sensitive to the ordering of the values, which is certainly an undesirable feature.)
(c) Now, let's suppose we knew from the outset that the first, the fourth and the fifth classes would be small, and the second, the third and the sixth would be large. We would then make two strata of three classes each, and take one observation from each stratum. We shall then have 3*3 = 9 equally possible samples, and 6 samples from the above list of SRS samples would take two values from one stratum, and none from the other. Derive the distribution of x-bar and find its mean and variance.
(d) What one would want from the sample statistic is that it is a good estimator of the corresponding population quantity, so we would want the sample mean to be closest to the population mean. Does the mean of the distribution of x-bar equal the population mean for all three designs? Which design has the smallest variance? All in all, which design has the best statistical properties? (Let us note in the parentheses that the systematic design is probably the cheapest way to go, while the stratified design is the most complicated since it requires having some prior knowledge. The sampling statisticians have their ways to incorporate the questions of costs into consideration so as to balance the simultaneous minimization of the cost and the variance of a survey.)

The problem is not necessarily as scary and long as it would seem. The way to organize your computations efficiently in Excel is to list all 15 SRS samples (say the first number is in column A, the second is in column B), then putting their mean into the third column, and then fill in the next three columns with corresponding probabilities under the three sampling schemes (put the name of each scheme on the top), with a bunch of zeroes if the corresponding SRS samples cannot be taken under an alternative scheme. Then you can compute products "probability * value" and "probability * (value-mean) squared" in the next few columns, so that summing along those would give you the means and the variances for the sampling distributions. Make sure you have relative and absolute references right wherever needed!

Submit the page with your computations (if it is too wide, you can check "Fit to 1 page wide by 1 page tall" in Print Preview -> Setup -> Page dialogue), and your answers to questions (a)-(d) spelled out.


Is that a good piece of news?

Find a news article that presents the results of a statistical study. Was it an experimental study or an observational study? Comment either on the forms of control (for an experiment) or the sampling procedures (for the sample survey) of the study. Address the seven critical components mentioned in class. Do you think the article does a fair job describing the results of the study? Anyting omitted? Anything added that was not there?

Provide a photocopy (or a printout from the web) of the piece of news you are discussing.


2.44. Export the relevant chunk of data into Excel (File -> Open -> menBA.txt; the Text Import Wizard will pop up, and you would need to follow the Wizard; the defaults should work reasonably well.) Reproduce the regression equation result obtained in the text book. Continue the analysis by running multiple regression using age and the square of age as two explanatory variables. Make a plot of your data with the predicted line. Do you think adding the square term improves the prediction?


2.56. Do the multiple regression using all of the IQ, self-concept, and gender to predict the GPA. What does the coefficient of the gender tell you? Looking at your multiple regression, which of the variables seem to be the most important one in explaining GPA? (To answer that, try rethinking the regression on the standardized scales, i.e., rescaling each of the explanatory variables to have mean zero and standard deviation of 1. Then the regression coefficients will tell you my how much, keeping everything else constant, a unit increase of the explanatory variable increases or decreases the response. Then the most important variable would be the one with the highest absolute value of the standardized coefficient.)


1.132. The Excel function that generates random numbers between 0 and 1 is called RAND(). To convert those numbers into the normal distribution, you would need to use the function NORMINV() with the first argument being your random number.
The most efficient way to organize your simulations would be to have 20 columns corresponding to the number of times you repeat the sampling, and 25 rows corresponding to the number of observations you would need to generate. You can compute x-bar and s in the end of those columns. Keep in mind that Excel keeps updating the data sheet any time you change it, so it will regenerate the random numbers each time you enter something new to your sheet. Once you have created the 20 by 25 table with means and standard deviations in the bottom, you can select it, Edit -> Copy it to the Clipboard, and then Edit -> Paste Special -> Paste Values so as to fix your simulation results.
Now go on and construct the summaries and provide your comments as requested in the textbook. Hand in the sheet with your normal quantile plots and discussion of the results.


Approximate normal probabilities

Sometimes a fairly simple approximation can be used to compute the tail probabilities for the normal distribution (i.e. the area to the left of a negative x, or the area to the right of positive x). If F(x) is the area to the left of x, then F(x) is approximately f(x)/|x| for x<0, and F(x) is approximately 1-f(x)/x for x>0, where f(x) is the standard normal density (p. 69, mu=0, sigma=1).

Your assignment will be to assess the quality of this approximation. Follow the steps below.

(a) Open a new sheet in Excel. The first column will be the values for which you will compute both the "exact" and approximate values. Title the column something like "x" and type 1, 2, ... 7 in column A.
The next column B will be the area to the left of your x, and that is given by the Excel function NORMSDIST(). Column C will be the area to the right of x, so it is 1 minus the value in column B.
In the next column, you would need to compute the standard normal density for a given x. The "e to the power" function is EXP(), the square root if SQRT(), and the value of the pi number is given by PI(). Then in column E you will compute your approximation f(x)/x.

(b) The quality of approximation is usually assessed in two ways: the absolute accuracy and the relative accuracy. The absolute accuracy is the absolute value (Excel function: ABS() ) of the difference between the "true" number and an approximation to it. The relative accuracy is the absolute accuracy divided by the "true" number. Q.: Which you think is more important, the absolute or the relative accuracy?
So, in your next column, compute the absolute accuracy, and in the last column, the relative accuracy.

The first line of your Excel sheet would then look something like this:

(c) Fill in the whole table in Excel. Are there any obvious patterns in accuracy, either absolute, relative, or both?

(d) The relative accuracy tells you how many correct decimal figures your approximation gives. The relative accuracy of 10% means that only the first digit is accurate. Q: For which values of x is the approximation at least one digit accurate? Mark them on the Excel sheet. The relative accuracy of 1% means that there are two accurate digits given by the approximation. Q: For which values of x is the approximation accurate to two digits? Mark them on the Excel sheet.

Hand in Excel sheet and answers to the questions.


2.28 (c) Why, in your opinion, it is important to have uncorrelated assets in the investment portfolio?


Excel Excercise due Sep 16

Create the standard normal table in Excel for z's between -4 and -2.5 with at least three accurate digits for all entries. Format the table in the same way it is laid out in the book, i.e. with a column for numbers -4.0, -3.9, ..., -2.5 and a row with 0.00, 0.01, ..., 0.09 in the heading. (You do not need to worry about the color and borders of course.)

Hand in one page containing the "improved" entries of the table. On this page, mark / highlight / circle / show otherwise the following entries: -3.48, -3.21, -3.77, -2.95, -2.50.

Hints and suggestions

1. Start by entering -4, -3.9, etc. in the first column, and 0.00, 0.01, ..., 0.09 in the first row. It is enough to enter two numbers, then you can simply select the two cells and drag them to cover the range you need.

2. Look up the Excel functions you would need to use (we looked over NORMSDIST and NORMDIST -- are those sufficient for the purpose?). See help on those funcions.

3. Think of the formula you'll be entering. It is enough to type it only once, and then drag it to fill the entire table -- think of how you can do that, and make sure you are adding or subtracting things in a proper manner as long as you are dealing with the negative numbers. As you can see in the textbook table, the numbers in the table must be increasing when you move down the table, and decreasing when you move from left to right when you are in the domain of negative numbers (they do increase from left to right on the next page with positive numbers -- do you know why?)

4. To get enough digits in your answer, you would need to format the cells you are putting your formula into. This is done through Format -> Cells -> Number tab -> Number -> Decimal Places. (How many decimals do you need to specify in this option to have at least three digits in the answer shown to you? It is OK to have more than three accurate digits in your answer, so the same formatting can be applied to all cells -- some will then have four or more digits.) I often use the keyboard shortcut Ctrl+1 to call Format Cells dialog window. Formatting is applied to all selected cells (if nothing is selected, then it will apply to the current cell). You might want to check that table you've produced matches the Table A from the textbook when you specify the accuracy of 4 digits.

5. You can specify "Fit to one page" option when printing your document (File -> Page Setup -> Page tab -> Scaling).

If the exercise is taking more than an hour -- come to the open tutorial session to discuss it with the session leader and fellow students. Have your laptop and the problem with you.