SOCI208 Module 2 - Data Patterns
1. Quantitative Data - Simple Displays
1. Arrays
An array (in this context) is a list of observations ordered (aka
sorted) by the value of a variable.
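As a minimal sketch of the definition above, an array can be produced by sorting observations on the value of the variable. The unit names and values below are invented for illustration (they are not the GRAD data), and `make_array` is a hypothetical helper name.

```python
# Hypothetical (unit, value) observations; the numbers are invented for
# illustration and are NOT taken from the GRAD data.
observations = [("A", 78.5), ("B", 91.2), ("C", 84.0), ("D", 69.9)]


def make_array(obs, descending=True):
    """Return the observations sorted by the value of the variable."""
    return sorted(obs, key=lambda pair: pair[1], reverse=descending)


# Print the array in descending order, one observation per line.
for name, value in make_array(observations):
    print(f"{name:<3}{value:5.1f}")
```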
Exhibit: High school graduation rates of
U.S. states sorted in descending order (GRAD data) [m2012.htm]
An ordered list is a useful "primitive" mode of analysis. When external
knowledge is available about the elements, it can suggest causal relations.
Q - What characteristics of U.S. states might be associated
with the rate of high school graduation?
2. Stem & Leaf Display
The stem & leaf display was invented by John Tukey as a tool of exploratory
data analysis (Tukey 1977). It is a quick method of looking at the
distribution of a variable, and for a small data set it can be constructed
by hand.
Exhibit: Construction of the stem & leaf display
The stem & leaf display gives information on the shape of the distribution,
including
- the existence of clusters of observations with similar values
- the existence of outlying (extreme) observations
- the symmetry or asymmetry of the distribution
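The construction can be sketched in a few lines of code. This is an illustrative sketch only, assuming non-negative values, with the tens digit (and above) as the stem and the units digit as the leaf; the function names, the `leaf_unit` parameter, and the sample values are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return {stem: [leaves]}; the leaf is the last digit in leaf units.

    Assumes non-negative values (an assumption of this sketch).
    """
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(round(v / leaf_unit))
        stem, leaf = divmod(scaled, 10)
        rows[stem].append(leaf)
    return dict(rows)


def format_display(rows):
    """One line per stem, including empty stems so gaps stay visible."""
    lines = []
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return "\n".join(lines)


# Invented sample values, not the GRAD data.
sample = [62, 67, 71, 73, 73, 78, 80, 84, 85, 99]
print(format_display(stem_and_leaf(sample)))
```

Keeping the empty stems in the output matters: a gap between stems is how an outlying observation shows up in the display.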
Exhibit: Stem & leaf display of percent
Hispanics (GRAD data) [m2013.htm]
Exhibit: Stem & leaf display of high school
graduation rate (GRAD data) [m2014.htm]
3. Dot Plots
The dot plot is an alternative to the stem & leaf display that provides
similar information.
Exhibit: Dot plot of percent white (GRAD
data) [m2015.jpg]
Exhibit: Dot plot of income per capita (GRAD data)
[m2016.jpg]
4. Cumulative Distribution
The cumulative distribution of a variable is constructed as the plot of
the rank of the observation (on the vertical axis) against the value of
the observation (on the horizontal axis). Rank can be assigned in
two ways.
- observations are arranged in ascending order so that rank 1 is assigned
  to the smallest observation, 2 to the next smallest, etc.; this yields
  the less-than cumulative distribution (the most common)
- observations are arranged in descending order so that rank 1 is assigned
  to the largest observation, 2 to the next largest, etc.; this yields the
  more-than cumulative distribution
The less-than (respectively, more-than) cumulative distribution displays
the number of observations in the data array that are equal to or lower
than (respectively, equal to or greater than) a given value of the variable.
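The two ways of assigning ranks can be sketched as follows. This is an illustration under assumptions: the function names and the sample values are hypothetical, and tied observations simply receive consecutive ranks.

```python
def cumulative_distribution(values, more_than=False):
    """Return (value, rank) pairs; rank 1 goes to the smallest value
    (less-than form) or to the largest value (more-than form)."""
    ordered = sorted(values, reverse=more_than)
    return [(value, rank) for rank, value in enumerate(ordered, start=1)]


def count_at(values, x, more_than=False):
    """Number of observations <= x (less-than) or >= x (more-than)."""
    if more_than:
        return sum(1 for v in values if v >= x)
    return sum(1 for v in values if v <= x)


# Invented sample values, not the GRAD data.
rates = [70, 75, 75, 82, 90]
print(cumulative_distribution(rates))
print(cumulative_distribution(rates, more_than=True))
```

With ties, the count at a value equals the highest rank among the tied observations, which is what the plotted curve reaches at that value.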
Exhibit: Cumulative distribution of high
school graduation rate (GRAD data) [m2028.jpg]
The (less-than) cumulative percent distribution is obtained by plotting
on the vertical axis, instead of the rank, 100*(rank/n) where n is the
number of observations in the data set. The cumulative percent distribution
displays the percentage of observations in the data array that are equal
to or lower than a given value of the variable. The shape of the
cumulative percent distribution is the same as the shape of the cumulative
distribution. (Alternatively, proportions are used in lieu of percentages.)
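A short sketch of the rescaling from rank to 100*(rank/n) follows; `cumulative_percent` is a hypothetical name and the values are invented. Because every rank is multiplied by the same constant, the shape of the curve is unchanged.

```python
def cumulative_percent(values):
    """Return (value, percent) pairs for the less-than cumulative
    percent distribution, with percent = 100 * rank / n."""
    ordered = sorted(values)
    n = len(ordered)
    return [(value, 100 * rank / n)
            for rank, value in enumerate(ordered, start=1)]


# Invented sample values, not the GRAD data.
print(cumulative_percent([70, 75, 75, 82, 90]))
```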
Exhibit: Cumulative percent distribution
of high school graduation rate (GRAD data) [m2017.jpg]
2. Quantitative Data - Frequency Distributions
1. Construction of Frequency Distributions
A frequency distribution is the classification of the elements of a data
set by a quantitative variable. The frequency distribution is constructed
by establishing mutually exclusive and exhaustive categories ( = intervals)
covering the range of values in the data set and counting the number of
observations within each category.
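The counting step can be sketched as below, assuming equal-width classes that include their lower limit and exclude their upper limit. That convention, the function name, and the sample ages are assumptions made for this illustration.

```python
def frequency_distribution(values, lower, width, n_classes):
    """Count observations in n_classes intervals [lo, lo + width).

    Raises ValueError if a value falls outside every class, i.e. if the
    classes are not exhaustive for the data.
    """
    counts = [0] * n_classes
    for v in values:
        index = int((v - lower) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} falls outside the classes")
        counts[index] += 1
    return [((lower + i * width, lower + (i + 1) * width), count)
            for i, count in enumerate(counts)]


# Invented ages, not the farm operator data from NWW.
ages = [23, 35, 37, 41, 44, 48, 52, 59, 63, 67]
for (lo, hi), count in frequency_distribution(ages, 20, 10, 5):
    print(f"{lo}-under {hi}: {count}")
```

Half-open intervals make the classes mutually exclusive: a value on a class boundary belongs to exactly one class.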
Exhibit: Frequency distribution with equal
class intervals; age of farm operators (NWW Figure 2.5 p. 39) [m2007.gif]
Exhibit: Frequency distribution with unequal intervals;
income of taxpayers (NWW Figure 2.6 p. 40) [m2008.gif]
The cumulative form of a frequency distribution is called a cumulative
frequency distribution.
Exhibit: Cumulative frequency distribution;
age of farm operators (NWW Figure 2.8 p. 42) [m2009.gif]
In practice, to construct a frequency distribution one must make decisions
about
- the width of the classes
- the number of classes (which is related to the width)
- the precise definition of class limits
These issues are discussed in NWW pp. 35-37. Statistical packages
have algorithms that attempt to make optimal choices of these parameters.
2. Graphic Representations of Frequency Distributions
1. Histogram and Frequency Polygon
Traditional graphic representations of a frequency distribution are the
histogram and the frequency polygon.
- A histogram is a rectangular graph of a frequency distribution
- A frequency polygon is a line graph of a frequency distribution
When class intervals are equal the height of the rectangle (histogram)
or the polygon point (frequency polygon) is proportional to the frequency
of the class.
When class intervals are unequal the height of the rectangle or polygon
point must be adjusted to make the area proportional to the frequency (or
percent frequency) of the class. See NWW p. 38 ff.
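The adjustment for unequal intervals amounts to dividing each frequency by the width of its class, so that height times width recovers the frequency. The sketch below illustrates this with invented class limits and frequencies; `histogram_heights` is a hypothetical name.

```python
def histogram_heights(class_limits, frequencies):
    """Height of each rectangle = frequency / class width, so that the
    AREA of the rectangle is proportional to the frequency."""
    return [f / (upper - lower)
            for (lower, upper), f in zip(class_limits, frequencies)]


# Invented classes of unequal width and their frequencies.
limits = [(0, 10), (10, 20), (20, 50)]
freqs = [20, 30, 30]
heights = histogram_heights(limits, freqs)
print(heights)
```

Here the last two classes have the same frequency, but the wider one is drawn one third as tall, so the two rectangles have equal area.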
2. Comparison of Frequency Distributions
Frequency distributions can be compared if
- they have the same class intervals
- they are both expressed in percentage form (or they have the same total
  frequency)
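The second condition can be illustrated with a small sketch that converts class frequencies to percent frequencies. The two groups and their counts are invented, and `to_percent` is a hypothetical name.

```python
def to_percent(frequencies):
    """Convert class frequencies to percent frequencies (summing to 100)."""
    total = sum(frequencies)
    return [100 * f / total for f in frequencies]


# Invented frequencies for two groups of very different size, using the
# same three class intervals.
group_1 = [10, 30, 10]    # total 50
group_2 = [40, 120, 40]   # total 200
print(to_percent(group_1))
print(to_percent(group_2))
```

The raw counts differ by a factor of four, yet the percent distributions are identical, which is the comparison of interest.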
Comparisons of frequency distributions in the same graph are better carried
out with frequency polygons or with kernel-estimated densities (discussed
later) than with histograms.
Exhibit: Comparison of age distributions
of farm operators and nonfarm civilian labor force (NWW Figure 2.7 p. 41)
[m2010.gif]
Exhibit: Distribution of income of men and women
(survey2 data) [m2018.jpg]
Histograms can be compared too, but then it is better to use two different
panels or to present them back-to-back.
Exhibit: Distribution of lifetime reproductive
success (# of children) for men and women in !Kung San (Daly and Wilson
1983, Figure 12-2 p. 325) [m2002.gif]
Exhibit: Back-to-back histograms for the
distribution of income for men and women (survey2 data) [m2040.jpg]
3. Modern Developments - Density Estimators
Statisticians have developed continuous alternatives to frequency distributions
based on fixed classes (such as histograms), called density estimators.
The general idea is that a given value of the variable corresponds to a
certain "density" of observations that varies continuously over the range
of the variable. Some modern statistical packages provide commands
to generate graphic representations of these density estimators.
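One common density estimator is the kernel density estimator with a Gaussian kernel, which can be sketched as below. This is a sketch under stated assumptions rather than the method of any particular package: the Gaussian kernel, the fixed `bandwidth` value, the function name, and the sample values are all choices made for this illustration.

```python
import math


def kernel_density(values, bandwidth):
    """Return a function x -> estimated density at x, using a Gaussian
    kernel centered on each observation."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


# Invented values (say, thousands of dollars per pupil), not GRAD data.
expenditures = [4.1, 4.8, 5.0, 5.3, 7.9]
density = kernel_density(expenditures, bandwidth=0.5)
for x in (4.0, 5.0, 6.5, 8.0):
    print(f"{x:4.1f}  {density(x):.3f}")
```

The bandwidth plays a role similar to the class width of a histogram: a larger value gives a smoother curve, a smaller value a bumpier one.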
The following exhibits compare various frequency displays.
Exhibit: Distribution of educational expenditures
per pupil - histogram (GRAD data) [m2021.jpg]
Exhibit: Distribution of educational expenditures
per pupil - cumulative frequency polygon (GRAD data) [m2022.jpg]
Exhibit: Distribution of educational expenditures
per pupil - striped density display (GRAD data) [m2023.jpg]
Exhibit: Distribution of educational expenditures
per pupil - dot plot (GRAD data) [m2020.jpg]
Exhibit: Distribution of educational expenditures
per pupil - kernel density estimator (GRAD data) [m2019.jpg]
3. Displays of Qualitative Data
A qualitative distribution is the classification of the elements
of a data set by a qualitative variable.
A qualitative distribution is represented as a table or graphically
as a simple bar chart.
In a bar chart
- each bar corresponds to a class, with the length of the bar denoting the
  number (or percentage) of elements in the class
- the bars differ only in length, not in width
- unlike in a histogram, a space is left between each bar; this is done to
  help label the classes and to emphasize that classes do not represent intervals
  of a continuous range (as in a histogram)
- the classes may be ranked according to the length of the bars to facilitate
  comparisons of frequencies; this works best when classes have no substantive
  ordering, otherwise it is better to maintain the original order of the
  classes (see exhibits of religious beliefs below)
- the bars may be vertical (see below) or horizontal (see NWW p. 45)
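As an illustration of a qualitative distribution and a horizontal bar chart with classes ranked by frequency, consider the sketch below. The class labels are placeholders invented for the example, and the function names are hypothetical.

```python
from collections import Counter

# Placeholder class labels, invented for illustration.
responses = ["B", "A", "B", "C", "B", "A"]


def qualitative_distribution(labels):
    """Return (class, count) pairs ranked from most to least frequent."""
    return Counter(labels).most_common()


def text_bar_chart(distribution):
    """Horizontal bars of equal width whose length encodes the count."""
    return "\n".join(
        f"{label:<4}{'#' * count} {count}" for label, count in distribution
    )


print(text_bar_chart(qualitative_distribution(responses)))
```

When the classes have a substantive order, the ranking step would be skipped and the classes listed in their original order instead.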
Exhibit: Univariate tabular presentation
- Distribution of beliefs about High Gods in human societies (Ethnographic
Atlas) [m2037.htm]
Exhibit: Simple bar chart - Distribution of beliefs
about High Gods in human societies (Ethnographic Atlas) [m2036.jpg]
Exhibit: Combined bar chart - Distribution of beliefs
about High Gods in herding and non-herding societies (Ethnographic Atlas)
[m2035.jpg] (This example may also be viewed as a bivariate display.)
Qualitative distributions are often represented in the popular press using
pie-charts. However, pie-charts are less often used in the scientific
literature for reasons discussed in section 6.
4. Displays of Bivariate Data
1. Quantitative Bivariate Data
1. Scatter Plots
The scatterplot is a workhorse of the statistical analysis of quantitative
data.
Exhibit: Scatterplot - high school graduation
rate by income per capita (GRAD data) [m2024.jpg]
Exhibit: Scatterplot - high school graduation rate
by percent black (GRAD data) [m2025.jpg]
Sometimes people get carried away and produce elaborate scatterplots.
Exhibit: Birth and death rate by economic
development (aka the demographic transition) (Nielsen 1994, Figure 4 p.
663) [m2003.gif]
The Importance of Being Square
Research has shown that the existence and strength of a relationship are
perceived most accurately when trend lines have a slope of approximately
45° (Cleveland 1994:<>). Using a square frame for the
scatterplot helps perception by ensuring that the slope (if an association
exists) is approximately 45°. This is why some statistical
programs such as SYSTAT produce square plots by default. Other
programs such as SAS have been notorious for producing scatterplots that
are far from square and can thus hide the existence of a relationship.
2. Time Series Plots
Exhibit: Early time series plot by Playfair
(drawn 1785, from Tufte 1983 p. <>) [m2001.gif]
When the variable on the horizontal axis is time (or more generally when
there is only one data point for each value on the horizontal axis) a scatterplot
becomes a time series plot.
Exhibit: Divorce rate (divorces per 1,000 married females) - U.S.
1920-2000 (divorce data) [m2026.jpg]
Cleveland (1994:<>) argues that perception of trends in the series is
optimized when
- successive points are joined by straight lines
- a symbol (such as a dot) is used for each data point
- the aspect ratio of the frame is adjusted so that the average slope is
  about 45°
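One simple way to pick such an aspect ratio is sketched below. It is a heuristic chosen for this illustration (set the median absolute slope of the line segments to 1, i.e. 45 degrees, after rescaling both axes to the frame), not necessarily the exact procedure Cleveland describes; the function name and sample series are hypothetical, and the sketch assumes the series is not flat.

```python
from statistics import median


def banked_aspect_ratio(x, y):
    """Height/width ratio of the frame at which the median absolute
    slope of the line segments equals 1 (45 degrees).

    Assumes x and y are lists of equal length and that the series is
    not flat (otherwise a division by zero occurs).
    """
    x_range = max(x) - min(x)
    y_range = max(y) - min(y)
    slopes = []
    for (x0, y0), (x1, y1) in zip(zip(x, y), zip(x[1:], y[1:])):
        if x1 != x0:
            # Slope of the segment once both axes are rescaled to length 1.
            slopes.append(abs((y1 - y0) / y_range) / abs((x1 - x0) / x_range))
    # Drawn in a frame of height h and width w, each slope is multiplied
    # by h / w, so h / w = 1 / median makes the median slope equal to 1.
    return 1.0 / median(slopes)


print(banked_aspect_ratio([0, 1, 2, 3], [0, 1, 2, 3]))  # steady trend
print(banked_aspect_ratio([0, 1, 2], [0, 2, 0]))        # sharp up and down
```

A series with steep swings calls for a frame that is wider than it is tall, which is why a time series plot often looks better in a flat frame than in a square one.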
Exhibit: Divorce rate 1920-2000 with a
different aspect ratio [m2027.jpg]
In any case time series plots are not reserved for economists!
Exhibit: Decline of the average age of
menarche (Daly and Wilson 1983, Figure 12-7 p. 339) [m2004.gif]
3. Bivariate Frequency Distribution
A bivariate frequency distribution may be useful to reveal the relationship
between two variables when a data set is large. The bivariate frequency
distribution is constructed by dividing the range of each variable into
classes and counting the numbers of elements belonging simultaneously to
each combination of classes. This is called a cross-classification
or cross-tabulation. The cross-classified distribution
is called a bivariate frequency distribution.
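The cross-classification can be sketched by assigning each observation to a pair of class indices and counting the pairs. The class limits, widths, function names, and data values below are invented for the illustration.

```python
from collections import Counter


def class_index(value, lower, width):
    """Index of the class [lower + i*width, lower + (i+1)*width)."""
    return int((value - lower) // width)


def bivariate_frequency(pairs, x_lower, x_width, y_lower, y_width):
    """Count the observations falling in each combination of classes."""
    return Counter(
        (class_index(x, x_lower, x_width), class_index(y, y_lower, y_width))
        for x, y in pairs
    )


# Invented (x, y) observations, not the GRAD data.
pairs = [(62, 71), (68, 74), (75, 83), (78, 88), (91, 86), (95, 92)]
table = bivariate_frequency(pairs, x_lower=60, x_width=20,
                            y_lower=70, y_width=10)
for cell in sorted(table):
    print(cell, table[cell])
```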
A bivariate frequency distribution can be represented graphically as
a 3-dimensional (3-D) histogram viewed in perspective. Care must
be taken to order the categories and/or adjust the point of view so
that tall columns in the foreground do not obscure shorter columns behind.
Exhibit: Bivariate frequency distribution
of graduation rate and percent whites - 3-D histogram (GRAD data) [m2030.jpg]
Exhibit: Bivariate frequency distribution of graduation
rate and percent whites - 3-D kernel density estimator (GRAD data) [m2029.jpg]
2. Quantitative and Qualitative Bivariate Data
In a bivariate data set when one variable is qualitative and the other
quantitative the general strategy is to compare the distributions of the
quantitative variable across categories of the qualitative variable.
The comparison of distributions can be done with a table showing the frequencies
within classes of the quantitative variable, given categories of the qualitative
variable. When the categories of the qualitative variable are not too
numerous, the comparison can also be done graphically, using one frequency
polygon for the distribution of the quantitative variable within each
category of the qualitative variable. For example, in the next exhibit
the distribution of income (a quantitative variable) is compared by sex
(male versus female - a qualitative variable).
Exhibit (repeat): Distribution of income
for men and women (survey2.syd data) [m2018.jpg]
3. Qualitative Bivariate Data
When both variables in a bivariate data set are qualitative their relationship
can be displayed in a contingency table. A contingency table displays
the bivariate qualitative distribution as the numbers of elements in each
cell of a cross-classification of the values of each qualitative variable.
The comparison of distributions can be seen more clearly by
- deciding which variable should be viewed as the "consequence" (dependent
  variable), which one as the "cause" (independent variable)
- constructing a table showing the percent distribution of the dependent
  variable (adding up to 100) within each category of the independent variable;
  these percent distributions are called conditional distributions
  of the dependent variable given a value of the independent variable
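The second step can be sketched as follows: within each category of the independent variable, the counts are divided by that category's total. The category names and counts are invented for the illustration and `conditional_percent` is a hypothetical name.

```python
def conditional_percent(table):
    """table[independent][dependent] = count.

    Return the percent distribution of the dependent variable within
    each category of the independent variable (each adds up to 100).
    """
    result = {}
    for indep, cells in table.items():
        total = sum(cells.values())
        result[indep] = {dep: 100 * count / total
                         for dep, count in cells.items()}
    return result


# Invented contingency table, not the Ethnographic Atlas data.
counts = {
    "group 1": {"yes": 30, "no": 10},
    "group 2": {"yes": 20, "no": 60},
}
for group, dist in conditional_percent(counts).items():
    print(group, dist)
```

Comparing the conditional distributions (75 percent "yes" in one group against 25 percent in the other) shows the association far more directly than the raw counts do.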
Exhibit: Contingency table and percent
distribution - Beliefs about High Gods in herding and non-herding societies
(Ethnographic Atlas) [m2038.htm]
It is more traditional to present conditional distributions with percentages
adding to 100 across each row. (More on this later.)
Percent distributions can be shown graphically as a component-part
bar chart (aka divided bar chart). The following
exhibit shows an elaborate divided bar chart.
Exhibit: Divided bar chart - Votes for
Mondale, Hart, and Jackson (Cleveland 1994, Figure 4.21 p. 266) [m2005.gif]
5. Displays of Multivariate Data
When the number of variables is large it is necessary to use more powerful
methods of analysis, such as multiple regression. When there are only
three or four variables it is possible to examine them simultaneously with
simple bivariate displays, such as a scatterplot in which the graphical
elements reflect the value of a third variable.
When the third variable is qualitative, its values can be indicated
by different plotting symbols; this is called a symbolic scatterplot.
Exhibit: Symbolic scatterplot using color
- brain weight by body weight for 4 taxa (Cleveland 1994, Figure I, in
front of text) [m2031.gif]
When the third variable is quantitative, its value can be indicated by
the size of the plotting symbol, as in the bubble plot.
Exhibit: Bubble plot - High school graduation
rate by urbanization with symbol size proportional to percent black (GRAD
data) [m2039.jpg]
6. Ethical, Psychological and Aesthetic Aspects of Data Displays
1. Misleading Graphic Presentations
Exhibit: Different representations of the
same time series (NWW Figure 2.19 p. 56) [m2011.gif]
2. Effectiveness of Data Displays
There is a whole research literature on the effectiveness of graphical
displays (see Wilkinson 1990; Cleveland 1994). Two examples are discussed.
1. Problems With the Pie-Chart
The following exhibit illustrates Cleveland's critique of the pie-chart.
Exhibit: Comparison of the pie-chart and
the dot plot (Cleveland 1994, Figures 4.19 p. 262 and 4.20 p. 263) [m2006.gif]
2. Multiway Dot Plot Versus Divided Bar Chart
Cleveland argues that the multiway dot plot reveals more information
than the divided bar chart, because the proportions in a given category,
shown as segments in the divided bar chart, can be compared by position
along a common scale.
Exhibit (repeat): Divided bar chart - Votes
for Mondale, Hart, and Jackson (Cleveland 1994, Figure 4.21 p. 266) [m2005.gif]
Exhibit: Multiway dot plot - Votes for Mondale,
Hart, and Jackson (Cleveland 1994, Figure 4.22 p. 266) [m2034.gif]
3. Aesthetic Considerations
Aesthetic aspects of graphical design are discussed by authors such as
Tufte (1983). For example, Tufte compares two early time-series plots
by William Playfair and argues that the second one, drawn a few months
later, shows the design improvement achieved by reducing the number of
extraneous elements (such as the detailed grid lines), allowing for a more
beautiful as well as more effective graph.
Exhibit: Early time-series plot (1) by
Playfair, drawn 1785 (Tufte 1983, p. 91) [m2032.gif]
Exhibit: Early time-series plot (2) by Playfair,
drawn 1786 (Tufte 1983, p. 92) [m2033.jpg]
Last modified 26 Aug 2002