SOCI208 Module 2 - Data Patterns

1.  Quantitative Data - Simple Displays

1.  Arrays

An array (in this context) is a list of observations ordered (aka sorted) by the value of a variable.
Exhibit: High school graduation rates of U.S. states sorted in descending order (GRAD data) [m2012.htm]
An ordered list is a useful "primitive" mode of analysis.  When external knowledge is available about the elements, it can suggest causal relations.
Q - What characteristics of U.S. states might be associated with the rate of high school graduation?
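As a minimal sketch of how such an array can be built, the following Python fragment sorts a handful of observations in descending order. The state labels and rates are invented for illustration and are not taken from the GRAD data; the function name make_array is likewise hypothetical.

```python
# Hypothetical (label, graduation rate) observations; NOT the GRAD data.
rates = {"State A": 88.2, "State B": 74.5, "State C": 91.0, "State D": 80.3}


def make_array(values, descending=True):
    """Return (label, value) pairs ordered by the value of the variable."""
    return sorted(values.items(), key=lambda item: item[1], reverse=descending)


array = make_array(rates)
for label, value in array:
    print(f"{label:10s} {value:5.1f}")
```

Reading down the printed list, one can then ask what the elements near the top (or bottom) have in common.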

2.  Stem & Leaf Display

The stem & leaf display was invented by John Tukey as a tool of exploratory data analysis (Tukey 1977).  It is a quick method of looking at the distribution of a variable, and it can be constructed by hand for a small data set.
Exhibit: Construction of the stem & leaf display
The stem & leaf display gives information on the shape of the distribution, including
Exhibit: Stem & leaf display of percent Hispanics (GRAD data) [m2013.htm]
Exhibit: Stem & leaf display of high school graduation rate (GRAD data) [m2014.htm]
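The construction can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the values are non-negative whole numbers, the stem is everything but the last digit, and the leaf is the last digit. The data values and the function names are invented for the example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens and above) to its sorted list of leaves (units digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)
        stems[stem].append(leaf)
    return dict(stems)


def format_display(stems):
    """One line per stem, including empty stems, so gaps in the data stay visible."""
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:3d} | {leaves}")
    return "\n".join(lines)


# Hypothetical percentages, not the GRAD data.
data = [62, 68, 71, 73, 73, 75, 80, 84, 95]
stems = stem_and_leaf(data)
print(format_display(stems))
```

Because every stem between the smallest and the largest is printed, the length of each row of leaves plays the role of a bar in a histogram turned on its side.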

3.  Dot Plots

The dot plot is an alternative to the stem & leaf display that provides similar information.
Exhibit: Dot plot of percent white (GRAD data) [m2015.jpg]
Exhibit: Dot plot of income per capita (GRAD data) [m2016.jpg]

4.  Cumulative Distribution

The cumulative distribution of a variable is constructed as the plot of the rank of the observation (on the vertical axis) against the value of the observation (on the horizontal axis).  Rank can be assigned in two ways.
  1. observations are arranged in ascending order so that rank 1 is assigned to the smallest observation, 2 to the next smallest, etc.; this yields the less-than cumulative distribution (the most common)
  2. observations are arranged in descending order so that rank 1 is assigned to the largest observation, 2 to the next largest, etc.; this yields the more-than cumulative distribution
The less-than (respectively, more-than) cumulative distribution displays the number of observations in the data array that are equal to or lower than (respectively, equal to or greater than) a given value of the variable.
Exhibit: Cumulative distribution of high school graduation rate (GRAD data) [m2028.jpg]
The (less-than) cumulative percent distribution is obtained by plotting on the vertical axis, instead of the rank, 100*(rank/n) where n is the number of observations in the data set.  The cumulative percent distribution displays the percentage of observations in the data array that are equal to or lower than a given value of the variable.  The shape of the cumulative percent distribution is the same as the shape of the cumulative distribution.  (Alternatively, proportions are used in lieu of percentages.)
Exhibit: Cumulative percent distribution of high school graduation rate (GRAD data) [m2017.jpg]
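The ranking rules and the 100*(rank/n) conversion described above can be sketched as follows. The data are invented, the function name is hypothetical, and tied values simply receive consecutive ranks in this sketch (the count at a tied value is then the highest rank among the ties).

```python
def cumulative_distribution(values, more_than=False):
    """Return (value, rank, cumulative percent) triples.

    less-than form: ascending order, rank 1 = smallest observation.
    more-than form: descending order, rank 1 = largest observation.
    """
    ordered = sorted(values, reverse=more_than)
    n = len(ordered)
    return [(v, rank, 100 * rank / n) for rank, v in enumerate(ordered, start=1)]


# Hypothetical observations, not the GRAD data.
data = [70, 85, 60, 90]
less_than = cumulative_distribution(data)
more_than = cumulative_distribution(data, more_than=True)

for value, rank, percent in less_than:
    print(f"value={value:3d}  rank={rank}  cumulative %={percent:5.1f}")
```

Plotting rank (or cumulative percent) on the vertical axis against value on the horizontal axis gives the displays shown in the exhibits.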

2.  Quantitative Data - Frequency Distributions

1.  Construction of Frequency Distributions

A frequency distribution is the classification of the elements of a data set by a quantitative variable.  The frequency distribution is constructed by establishing mutually exclusive and exhaustive categories (class intervals) covering the range of values in the data set and counting the number of observations within each category.
Exhibit: Frequency distribution with equal class intervals; age of farm operators (NWW Figure 2.5 p. 39) [m2007.gif]
Exhibit: Frequency distribution with unequal intervals; income of taxpayers (NWW Figure 2.6 p. 40) [m2008.gif]
The cumulative form of a frequency distribution is called a cumulative frequency distribution.
Exhibit: Cumulative frequency distribution; age of farm operators (NWW Figure 2.8 p. 42) [m2009.gif]
In practice, to construct a frequency distribution one must make decisions about several parameters of the classification.  These issues are discussed in NWW pp. 35-37.  Statistical packages have algorithms that attempt to make optimal choices of these parameters.
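A minimal sketch of the counting step, assuming half-open class intervals of the form [lower, upper) chosen by the analyst. The ages and class limits below are invented and are not the NWW farm operator data; the function name is hypothetical.

```python
from itertools import accumulate


def frequency_distribution(values, edges):
    """Count the observations falling in each interval [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
        else:
            # Classes must be exhaustive: every observation needs a class.
            raise ValueError(f"{v} falls outside the class intervals")
    return counts


# Hypothetical ages and class limits.
ages = [23, 31, 35, 38, 42, 47, 47, 51, 58, 64]
edges = [20, 30, 40, 50, 60, 70]

counts = frequency_distribution(ages, edges)
cumulative = list(accumulate(counts))  # cumulative frequency distribution

for lo, hi, c, cum in zip(edges, edges[1:], counts, cumulative):
    print(f"{lo:2d} to under {hi:2d}: frequency={c}  cumulative={cum}")
```

Using half-open intervals makes the classes mutually exclusive, and the error raised for an out-of-range value enforces exhaustiveness.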

2.  Graphic Representations of Frequency Distributions

1.  Histogram and Frequency Polygon
Traditional graphic representations of a frequency distribution are the histogram and the frequency polygon. When class intervals are equal the height of the rectangle (histogram) or the polygon point (frequency polygon) is proportional to the frequency of the class.
When class intervals are unequal the height of the rectangle or polygon point must be adjusted to make the area proportional to the frequency (or percent frequency) of the class.  See NWW p. 38 ff.
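One common way to make the adjustment for unequal intervals is to plot frequency divided by class width, so that the area of each rectangle equals the class frequency. The sketch below assumes that convention; the class limits and counts are invented and the function name is hypothetical.

```python
def histogram_heights(counts, edges):
    """Rectangle heights chosen so that height * width equals the class frequency."""
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    return [c / w for c, w in zip(counts, widths)]


# Hypothetical unequal class intervals and frequencies.
edges = [0, 10, 20, 50, 100]
counts = [40, 30, 30, 20]

heights = histogram_heights(counts, edges)
for lo, hi, c, h in zip(edges, edges[1:], counts, heights):
    print(f"{lo:3d} to under {hi:3d}: frequency={c:2d}  height={h:.2f}")
```

In this example the second and third classes have the same frequency, but the third is three times as wide and is therefore drawn one third as tall.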
2.  Comparison of Frequency Distributions
Frequency distributions can be compared if
  1. they have the same class intervals
  2. they are both expressed in percentage form (or they have the same total frequency)
Comparisons of frequency distributions in the same graph are better carried out with frequency polygons or with kernel-estimated densities (discussed later) than with histograms.
Exhibit: Comparison of age distributions of farm operators and nonfarm civilian labor force (NWW Figure 2.7 p. 41) [m2010.gif]
Exhibit: Distribution of income of men and women (survey2 data) [m2018.jpg]
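The two conditions for comparability can be illustrated with a short sketch: both groups are classified with the same class intervals, and each set of counts is converted to percentages so that different group sizes do not distort the comparison. The counts are invented and are not the survey2 data; the function name is hypothetical.

```python
def percent_distribution(counts):
    """Convert class frequencies to percent frequencies (summing to 100)."""
    total = sum(counts)
    return [100 * c / total for c in counts]


# Hypothetical counts over the SAME four class intervals; group sizes differ.
group_a = [20, 60, 80, 40]   # 200 observations
group_b = [30, 60, 45, 15]   # 150 observations

pct_a = percent_distribution(group_a)
pct_b = percent_distribution(group_b)

for i, (a, b) in enumerate(zip(pct_a, pct_b), start=1):
    print(f"class {i}: group A {a:5.1f}%   group B {b:5.1f}%")
```

The raw counts in the second class are equal (60 and 60), yet the percentages (30 versus 40) show that this class is relatively more common in the smaller group.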
Histograms can be compared too, but then it is better to use two different panels or to present them back-to-back.
Exhibit: Distribution of lifetime reproductive success (# of children) for men and women in !Kung San (Daly and Wilson 1983, Figure 12-2 p. 325) [m2002.gif]
Exhibit:  Back-to-back histograms for the distribution of income for men and women (survey2 data) [m2040.jpg]
 

3.  Modern Developments - Density Estimators

Statisticians have developed continuous alternatives to frequency distributions based on fixed classes (such as histograms), called density estimators.  The general idea is that a given value of the variable corresponds to a certain "density" of observations that varies continuously over the range of the variable.  Some modern statistical packages provide commands to generate graphic representations of these density estimators.  The following exhibits compare various frequency displays.
Exhibit: Distribution of educational expenditures per pupil - histogram (GRAD data) [m2021.jpg]
Exhibit: Distribution of educational expenditures per pupil - cumulative frequency polygon (GRAD data) [m2022.jpg]
Exhibit: Distribution of educational expenditures per pupil - striped density display (GRAD data) [m2023.jpg]
Exhibit: Distribution of educational expenditures per pupil - dot plot (GRAD data) [m2020.jpg]
Exhibit: Distribution of educational expenditures per pupil - kernel density estimator (GRAD data) [m2019.jpg]
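To make the idea of a continuously varying "density" concrete, here is a minimal sketch of one common density estimator, a kernel estimator with a Gaussian kernel. The choice of kernel, the bandwidth value, and the data are all assumptions made for illustration; the notes do not say which estimator or settings produced the exhibit.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return f(x) = 1/(n*h) * sum of K((x - x_i)/h), with K the standard normal density."""
    n = len(data)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        return total / norm

    return density


# Hypothetical observations (e.g. thousands of dollars); bandwidth chosen arbitrarily.
data = [4.0, 5.0, 5.5, 7.0]
density = gaussian_kde(data, bandwidth=0.5)

# Evaluate on a fine grid; the area under the curve should be close to 1.
step = 0.01
area = sum(density(i * step) * step for i in range(0, 1201))
print(f"density at 5.0: {density(5.0):.3f}")
print(f"approximate area under curve: {area:.3f}")
```

Each observation contributes a small bell-shaped bump, and the estimate is the sum of the bumps. A larger bandwidth gives a smoother curve, a smaller one a bumpier curve.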

3.  Displays of Qualitative Data

A qualitative distribution is the classification of the elements of a data set by a qualitative variable.
A qualitative distribution is represented as a table or graphically as a simple bar chart.
In a bar chart
Exhibit: Univariate tabular presentation - Distribution of beliefs about High Gods in human societies (Ethnographic Atlas) [m2037.htm]
Exhibit: Simple bar chart - Distribution of beliefs about High Gods in human societies (Ethnographic Atlas) [m2036.jpg]
Exhibit: Combined bar chart - Distribution of beliefs about High Gods in herding and non-herding societies (Ethnographic Atlas) [m2035.jpg] (This example may also be viewed as a bivariate display.)
Qualitative distributions are often represented in the popular press using pie-charts.  However, pie-charts are less often used in the scientific literature for reasons discussed in section 6.

4.  Displays of Bivariate Data

1.  Quantitative Bivariate Data

1.  Scatter Plots
The scatterplot is a workhorse of the statistical analysis of quantitative data.
Exhibit: Scatterplot - high school graduation rate by income per capita (GRAD data) [m2024.jpg]
Exhibit: Scatterplot - high school graduation rate by percent black (GRAD data) [m2025.jpg]
Sometimes people get carried away and produce elaborate scatterplots.
Exhibit: Birth and death rate by economic development (aka the demographic transition) (Nielsen 1994, Figure 4 p. 663) [m2003.gif]
The Importance of Being Square
Research has shown that the existence and strength of a relationship are perceived most accurately when the trend line has a slope of approximately 45° (Cleveland 1994:<>).  Using a square frame for the scatterplot helps perception by ensuring that the slope (if an association exists) is approximately 45°.  This is why some statistical programs such as SYSTAT produce square plots by default.  Other programs such as SAS have been notorious for producing scatterplots that are far from square and can thus hide the existence of a relationship.
2.  Time Series Plots
Exhibit: Early time series plot by Playfair (drawn 1785, from Tufte 1983 p. <>) [m2001.gif]
When the variable on the horizontal axis is time (or more generally when there is only one data point for each value on the horizontal axis) a scatterplot becomes a time series plot.
Exhibit: Divorce rate (divorces per 1,000 married females) - U.S. 1920-2000 (divorce data) [m2026.jpg]
Cleveland (1994:<>) argues that perception of trends in the series is optimized when
Exhibit: Divorce rate 1920-2000 with a different aspect ratio [m2027.jpg]
In any case time series plots are not reserved for economists!
Exhibit: Decline of the average age of menarche (Daly and Wilson 1983, Figure 12-7 p. 339) [m2004.gif]
3.  Bivariate Frequency Distribution
A bivariate frequency distribution may be useful to reveal the relationship between two variables when a data set is large.  The bivariate frequency distribution is constructed by dividing the range of each variable into classes and counting the numbers of elements belonging simultaneously to each combination of classes.  This is called a cross-classification or cross-tabulation.  The cross-classified distribution is called a bivariate frequency distribution.
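The cross-classification step can be sketched as follows, assuming half-open class intervals for both variables. The (x, y) pairs and the class limits are invented and are not the GRAD data; the function names are hypothetical.

```python
from collections import Counter


def class_index(value, edges):
    """Index of the interval [edges[i], edges[i+1]) that contains value."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    raise ValueError(f"{value} falls outside the class intervals")


def cross_tabulate(pairs, x_edges, y_edges):
    """Rows are classes of x, columns are classes of y, cells are joint counts."""
    cells = Counter(
        (class_index(x, x_edges), class_index(y, y_edges)) for x, y in pairs
    )
    return [
        [cells[(i, j)] for j in range(len(y_edges) - 1)]
        for i in range(len(x_edges) - 1)
    ]


# Hypothetical (x, y) observations and class limits.
pairs = [(62, 55), (68, 70), (75, 85), (78, 90), (83, 92), (88, 95)]
x_edges = [60, 70, 80, 90]
y_edges = [50, 75, 100]

table = cross_tabulate(pairs, x_edges, y_edges)
for row in table:
    print(row)
```

Each cell of the resulting table is the height of one column in the 3-D histogram.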
A bivariate frequency distribution can be represented graphically as a 3-dimensional (3-D) histogram viewed in perspective.  Care must be taken to order the categories and/or adjust the point of view so that tall columns in the foreground do not obscure shorter columns behind.
Exhibit: Bivariate frequency distribution of graduation rate and percent whites - 3-D histogram (GRAD data) [m2030.jpg]
Exhibit: Bivariate frequency distribution of graduation rate and percent whites - 3-D kernel density estimator (GRAD data) [m2029.jpg]

2.  Quantitative and Qualitative Bivariate Data

In a bivariate data set, when one variable is qualitative and the other quantitative, the general strategy is to compare the distributions of the quantitative variable across categories of the qualitative variable.  The comparison can be done with a table showing the frequencies within classes of the quantitative variable, given categories of the qualitative variable.  When the categories of the qualitative variable are not too numerous, the comparison can also be done graphically, with one frequency polygon representing the distribution of the quantitative variable for each category of the qualitative variable.  For example, in the next exhibit the distribution of income (a quantitative variable) is compared by sex (male versus female, a qualitative variable).
Exhibit (repeat): Distribution of income for men and women (survey2.syd data) [m2018.jpg]

3.  Qualitative Bivariate Data

When both variables in a bivariate data set are qualitative their relationship can be displayed in a contingency table.  A contingency table displays the bivariate qualitative distribution as the numbers of elements in each cell of a cross-classification of the values of each qualitative variable.

The comparison of distributions can be seen more clearly by

  1. deciding which variable should be viewed as the "consequence" (dependent variable), which one as the "cause" (independent variable)
  2. constructing a table showing the percent distribution of the dependent variable (adding up to 100) within each category of the independent variable; these percent distributions are called conditional distributions of the dependent variable given a value of the independent variable
Exhibit: Contingency table and percent distribution - Beliefs about High Gods in herding and non-herding societies (Ethnographic Atlas) [m2038.htm]
It is more traditional to present conditional distributions with percentages adding to 100 across each row.  (More on this later.)
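The percentaging step can be sketched as follows. The table is stored with one entry per category of the independent variable, and the percentages are computed within each of those categories so that each conditional distribution adds to 100. The category labels and counts are invented and are not the Ethnographic Atlas figures; the function name is hypothetical.

```python
def conditional_percentages(table):
    """table[independent][dependent] = count.

    Returns the percent distribution of the dependent variable
    within each category of the independent variable.
    """
    result = {}
    for cause, row in table.items():
        total = sum(row.values())
        result[cause] = {effect: 100 * count / total for effect, count in row.items()}
    return result


# Hypothetical contingency table: independent variable X, dependent variable Y.
table = {
    "X = a": {"Y = no": 10, "Y = yes": 30},
    "X = b": {"Y = no": 60, "Y = yes": 40},
}

conditional = conditional_percentages(table)
for cause, dist in conditional.items():
    cells = "  ".join(f"{effect}: {pct:5.1f}%" for effect, pct in dist.items())
    print(f"{cause}  ->  {cells}")
```

Comparing the two conditional distributions (75 percent "yes" in one category against 40 percent in the other) shows the association far more clearly than the raw counts do.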
Percent distributions can be shown graphically as a component-part bar chart (aka divided bar chart).  The following exhibit shows an elaborate divided bar chart.
Exhibit: Divided bar chart - Votes for Mondale, Hart, and Jackson (Cleveland 1994, Figure 4.21 p. 266) [m2005.gif]

5.  Displays of Multivariate Data

When the number of variables is large it is necessary to use more powerful methods of analysis, such as multiple regression.  When there are only three or four variables it is possible to examine them simultaneously with simple bivariate displays, such as a scatterplot in which the graphical elements reflect the value of a third variable.
When the third variable is qualitative, its values can be indicated by different plotting symbols; this is called a symbolic scatterplot.
Exhibit: Symbolic scatterplot using color - brain weight by body weight for 4 taxa (Cleveland 1994, Figure I, in front of text) [m2031.gif]
When the third variable is quantitative, its value can be indicated by the size of the plotting symbol, as in the bubble plot.
Exhibit: Bubble plot - High school graduation rate by urbanization with symbol size proportional to percent black (GRAD data) [m2039.jpg]

6.  Ethical, Psychological and Aesthetic Aspects of Data Displays

1.  Misleading Graphic Presentations

Exhibit: Different representations of the same time series (NWW Figure 2.19 p. 56) [m2011.gif]

2.  Effectiveness of Data Displays

There is a whole research literature on the effectiveness of graphical displays (see Wilkinson 1990; Cleveland 1994).  Two examples are discussed.
1.  Problems With the Pie-Chart
The following exhibit illustrates Cleveland's critique of the pie-chart.
Exhibit: Comparison of the pie-chart and the dot plot (Cleveland 1994, Figures 4.19 p. 262 and 4.20 p. 263) [m2006.gif]
2.  Multiway Dot Plot Versus Divided Bar Chart
Cleveland argues that the multiway dot plot reveals more information than the divided bar chart, because the segments representing the proportions in a given category can now be compared by position along a common scale.
Exhibit (repeat): Divided bar chart - Votes for Mondale, Hart, and Jackson (Cleveland 1994, Figure 4.21 p. 266) [m2005.gif]
Exhibit: Multiway dot plot - Votes for Mondale, Hart, and Jackson (Cleveland 1994, Figure 4.22 p. 266) [m2034.gif]

3.  Aesthetic Considerations

Aesthetic aspects of graphical design are discussed by authors such as Tufte (1983).  For example, Tufte compares two early time-series plots by William Playfair and argues that the second one, drawn a few months later, shows the design improvement achieved by reducing the number of extraneous elements (such as the detailed grid lines), allowing for a more beautiful as well as more effective graph.
Exhibit: Early time-series plot (1) by Playfair, drawn 1785 (Tufte 1983, p. 91) [m2032.gif]
Exhibit: Early time-series plot (2) by Playfair, drawn 1786 (Tufte 1983, p. 92) [m2033.jpg]




Last modified 26 Aug 2002