news people and virtually all journalism students today have some familiarity
with computers. Their experience usually starts with word processing, either
on a mainframe editing system or on a personal computer. Many learn some
other application, such as a spreadsheet or a database. Your mental image
of a computer depends very much on the specific things you have done with
one. This chapter is designed to invite your attention to a very wide range
of possibilities for journalistic applications. As background for that
broad spectrum, we shall now indulge in a little bit of nostalgia.
Counting and sorting
Bob Kotzbauer was the Akron Beacon Journal's legislative reporter, and I was its Washington correspondent. In the fall of 1962, Ben Maidenburg, the executive editor, assigned us the task of driving around Ohio for two weeks, knocking on doors and asking people how they would vote in the coming election for governor. Because I had studied political science at Chapel Hill, I felt sure that I knew how to do this chore. We devised a paper form to record voter choices and certain other facts about each voter: party affiliation, previous voting record, age, and occupation. The forms were color coded: green for male voters, pink for females. We met many interesting people and filed daily stories full of qualitative impressions of the mood of the voters and descriptions of county fairs and autumn leaves. After two weeks, we had accumulated enough of the pink and green forms to do the quantitative part. What happened next is a little hazy in my mind after all these years, but it was something like this:
Back in Akron, we dumped the forms onto a table in the library and sorted them into three stacks: previous Republican voters, Democratic voters, and non-voters. That helped us gauge the validity of our sample. Then we divided each of the three stacks into three more: voters for Mike DiSalle, the incumbent Democrat, voters for James Rhodes, the Republican challenger, and undecided. Nine stacks, now. We sorted each into two more piles, separating the pink and green pieces of paper to break down the vote by sex. Eighteen stacks. Sorting into four categories of age required dividing each of those eighteen piles into four more, which would have made seventy-two. I don't remember exactly how far we got before we gave up, exhausted and squinty-eyed. Our final story said the voters were inscrutable, and the race was too close to call.
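The arithmetic of those multiplying stacks is easy to check by machine. The short Python sketch below is only an illustration, not anything we had in 1962. The category names come from the story, but the four age brackets and the three sample forms are invented for the example.

```python
from collections import Counter
from itertools import product

# Categories from the story; the age brackets are hypothetical labels.
PAST_VOTE = ["Republican", "Democratic", "non-voter"]
CHOICE = ["DiSalle", "Rhodes", "undecided"]
SEX = ["male", "female"]
AGE = ["bracket 1", "bracket 2", "bracket 3", "bracket 4"]


def count_stacks(*dimensions):
    """Number of stacks needed to sort by every dimension at once."""
    return len(list(product(*dimensions)))


def tabulate(forms, keys):
    """Count the forms in each combination of keys, as hand sorting would."""
    return Counter(tuple(form[key] for key in keys) for form in forms)


# Three invented interview forms standing in for the pink and green sheets.
forms = [
    {"past": "Republican", "choice": "Rhodes", "sex": "male"},
    {"past": "Democratic", "choice": "DiSalle", "sex": "female"},
    {"past": "Republican", "choice": "Rhodes", "sex": "female"},
]

print(count_stacks(PAST_VOTE, CHOICE))            # 9
print(count_stacks(PAST_VOTE, CHOICE, SEX))       # 18
print(count_stacks(PAST_VOTE, CHOICE, SEX, AGE))  # 72
print(tabulate(forms, ["past", "choice"]))
```

Every added dimension multiplies the number of stacks, which is why hand sorting collapses so quickly while a tally keyed on the combination does not.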
The moral of this story is that before you embark on any complicated project involving data analysis, you should look around first and see what technology is available. There were no personal computers in 1962. Mainframe computing was expensive and difficult, not at all accessible to newspaper reporters. But there was in the Beacon Journal business office a machine that would have saved us if we had known about it. The basic concept for it had been developed nearly eighty years before by Dr. Herman Hollerith, the father of modern computing.
Hollerith was an assistant director of the United States Census at a time when the census was in trouble. It took seven and a half years to tabulate the census of 1880, and the country was growing so fast that it appeared that the 1890 census would not be finished when it was time for the census of 1900 to be well under way. Herman Hollerith saved the day by inventing the punched card.
It was a simple three-by-five inch index card divided into quarter-inch squares. Each square stood for one bit of binary information: a hole in the square meant "yes" and no hole meant "no." All of the categories being tabulated could fit on the card. One group of squares, for example, stood for age category in five-year segments. If you were 21 years old on April 1, 1890, there would be a card for you, and the card would have a hole punched in the 20-24 square.
Under Hollerith's direction, a machine was built that could read 40 holes at a time. The operator would slap a card down on its bed, and pull a lid down over it. Tiny spikes would stop when they encountered a solid portion of the card and pass through where they encountered holes. Below each spike was a cup of mercury. When the spike touched the mercury, an electrical contact was completed causing a counter on the vertical face of the machine to advance one notch. This machine was called the Tabulator.
There was more. Hollerith invented a companion machine, called the Sorter, which was wired into the same circuit. It had compartments corresponding to the dials on the Tabulator, each with its own little door. The same electrical contact that advanced a dial on the Tabulator caused a door on the Sorter to fly open so that the operator could drop the tallied card into it. A clerk could take the cards for a whole census tract, sort them by age in this manner, and then sort each stack by gender to create a table of age by sex distribution for the tract. Hollerith was so pleased with his inventions that he left the Bureau and founded his own company to bid on the tabulation contract for the 1890 census. His bid was successful, and he did the job in two years, even though the population had increased by 25 percent since 1880.
Improvements on the system began almost immediately. Hollerith won the contract for the 1900 census, but then the Bureau assigned one of its employees, James Powers, to develop its own version of the punched-card machine. Like Hollerith, Powers eventually left to start his own company. The two men squabbled over patents and eventually each sold out. Powers's firm was absorbed by a component of what would eventually become Sperry Univac, and Hollerith's was folded into what finally became IBM. By 1962, when Kotzbauer and I were sweating over those five hundred scraps of paper, the Beacon Journal had, unknown to us, an IBM counter-sorter which was the great grandchild of those early machines. It used wire brushes touching a copper roller instead of spikes and mercury, and it sorted 650 cards per minute, and it was obsolete before we found out about it.
By that time, the Hollerith card, as it was still called, had smaller holes arranged in 80 columns and 12 rows. That 80-column format is still found in many computer applications, simply because data archivists got in the habit of using 80 columns and never found a reason to change even after computers permitted much longer records. I can understand that. The punched card had a certain concreteness about it, and, to this day, when trying to understand a complicated record layout in a magnetic storage medium I find that it helps if I visualize those Hollerith cards with the little holes in them.
Computer historians have been at a loss to figure out where Hollerith got the punched-card idea. One story holds that it came to him when he watched a railway conductor punching tickets. Other historians note that the application of the concept goes back at least to the Jacquard loom, built in France in the early 1800s. Wire hooks passed through holes in punched cards to pick up threads to form the pattern. The player piano, patented in 1876, used the same principle. A hole in a given place in the roll means hit a particular key at a particular time and for a particular duration; no hole means don't hit it. Any piano composition can be reduced to those binary signals.1
From counting and sorting, the next step is performing mathematical calculations in a series of steps on encoded data. These steps require the basic pieces of modern computer hardware: a device to store data and instructions, machinery for doing the arithmetic, and something to manage the traffic as raw information goes in and processed data come out. J. H. Muller, a German, designed such a machine in 1786, but lacked the technology to build it. British mathematician Charles Babbage tried to build one starting in 1812. He, too, was ahead of the available technology. In 1936, when Howard Aiken started planning the Mark I computer at Harvard, he found that Babbage had anticipated many of his ideas. Babbage, for example, foresaw the need to provide "a store" in which raw data and results are kept and "a mill" where the computations take place.2 Babbage's store and mill are today called "memory" and "central processing unit" or CPU. The machine Babbage envisioned would have been driven by steam. Although the Mark I used electrical relays, it was basically a mechanical device. Electricity turned the switches on and off, and the on-off condition held the binary information. It generated much heat and noise. Pieces of it were still on display at the Harvard Computation Center when I was last there in 1968.
Mark I and Aiken served in the Navy toward the end of World War II, working on ballistics problems. This was the project that got Grace Murray Hopper started in the computer business. Then a young naval officer, she rose to the rank of admiral and contributed some key concepts to the development of computers along the way.
Parallel work was going on under sponsorship of the Army, which also needed complicated ballistics problems worked out. A machine called ENIAC, which used vacuum tubes, resistors, and capacitors instead of mechanical relays, was begun for the Army at the University of Pennsylvania, based in part on ideas used in a simpler device built earlier at Iowa State University by John Vincent Atanasoff and his graduate assistant, Clifford E. Berry. The land-grant college computer builders did not bother to patent their work; it was put aside during World War II, and the machine was cannibalized for parts. The Ivy League inventors were content to take the credit until the Atanasoff-Berry Computer, or ABC machine, as it came to be known, was rediscovered in a 1973 patent suit between two corporate giants. Sperry Rand Corp., then owner of the ENIAC patent, was challenged by Honeywell, Inc., which objected to paying royalties to Sperry Rand. The Honeywell people tracked down the Atanasoff-Berry story, and a federal district judge ruled that the ENIAC was derived from Atanasoff's work and was therefore not patentable. That's how Atanasoff, a theoretical physicist who only wanted a speedy way to solve simultaneous equations, became recognized as the father of the modern computer. The key ideas were the use of electronic rather than mechanical switches, the use of binary numbers, and the use of logic circuits rather than direct counting to manipulate those binary numbers. These ideas came to the professor while having a drink in an Iowa roadhouse in the winter of 1937, and he built his machine for $6,000.3
ENIAC, on the other hand, cost $487,000. It was not completed in time to aid the war effort, but once turned on in February 1946, it lasted for nearly ten years, demonstrating the reliability of electronic computing, and paved the way for the postwar developments. Its imposing appearance, banks and banks of wires, dials, and switches, still influences cartoon views of computers.
Once the basic principles had been established in the 1940s, the problems became those of refining the machinery (the hardware) and developing the programming (the software) to control it. By the 1990s, a look backward saw three distinct phases in computing machinery, based on the primary electronic device that did the work: vacuum tubes in the first generation, transistors in the second, and integrated circuits in the third.
Transistors are better than tubes because they are cheaper, more reliable, smaller, faster, and generate less heat. Integrated circuits are built on tiny solid-state chips that combine many transistors in a very small space. How small? Well, all of the computing power of the IBM 7090, which filled a good-sized room when I was introduced to it at Harvard in 1966, is now packed into a chip the size of my fingernail. How do they make such complicated things so small? By way of a photo-engraving process. The circuits are designed on paper, photographed so that a lens reduces the image -- just the way your camera reduces the image of your house to fit on a frame of 35 mm. film -- and etched on layers of silicon.
As computers got better, they got cheaper, but one more thing had to happen before their use could extend to the everyday life of such nonspecialists as journalists. They had to be made easy to use. That is where Admiral Grace Murray Hopper earned her place in computer history. (One of her contributions was being the first person to debug a computer: when the Mark I broke down one day in 1945, she traced the problem to a dead moth caught in a relay switch.) She became the first person to build an entire career on computer programming. Perhaps her most important contribution, in 1952, was her development of the first assembly language.
To appreciate the importance of that development, think about a computer doing all its work in binary arithmetic. Binary arithmetic represents all numbers with combinations of zeros and ones. To do its work, the computer has to receive its instructions in binary form. This fact of life limited the use of computers to people who had the patience, brain power, and attention span to think in binary. Hopper quickly realized that computers were not going to be useful to large numbers of people so long as that was the case, and so she wrote an assembly language. An assembly language assembles groups of binary machine language statements into the most frequently used operations and lets the user invoke them by working in a simpler language that uses mnemonic codes to make the instructions easy to remember. The user writes the program in the assembly language and the software converts each assembler statement into the corresponding machine language statements -- all "transparently" or out of sight of the user -- and the computer does what it is told just as if it had been given the orders in its own machine language. That was such a good idea that it soon led to yet another layer of computer languages called compilers. The assembly languages were machine-specific; the compilers were written so that once you learned one you could use it on different machines. The compilers were designed for specialized applications. FORTRAN (for formula translator) was designed for scientists, and more than thirty years and many technological changes later is still a standard. COBOL (for common business oriented language) was produced, under the prodding of Admiral Hopper, and is today the world standard for business applications. BASIC (for beginner's all-purpose symbolic instruction code) was created at Dartmouth College to provide an easy language for students to begin on. It is now standard for personal computers.
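To make the idea of an assembler concrete, here is a toy version sketched in Python. Everything about it is hypothetical: the four mnemonics, their binary codes, and the eight-bit instruction word were invented for this illustration and do not describe the Mark I or any real machine.

```python
# A made-up instruction set: a 4-bit operation code followed by a 4-bit operand.
OPCODES = {
    "LOAD": 0b0001,
    "ADD": 0b0010,
    "STORE": 0b0011,
    "HALT": 0b1111,
}


def assemble(source):
    """Translate mnemonic statements into 8-bit binary machine words."""
    words = []
    for line in source.strip().splitlines():
        parts = line.split()
        mnemonic = parts[0]
        operand = int(parts[1]) if len(parts) > 1 else 0
        if mnemonic not in OPCODES:
            raise ValueError(f"unknown mnemonic: {mnemonic}")
        if not 0 <= operand <= 15:
            raise ValueError(f"operand does not fit in 4 bits: {operand}")
        # Pack the operation code into the high bits, the operand into the low bits.
        word = (OPCODES[mnemonic] << 4) | operand
        words.append(format(word, "08b"))
    return words


program = """
LOAD 5
ADD 6
STORE 7
HALT
"""

for word in assemble(program):
    print(word)
```

The person writing the program deals only with words like LOAD and ADD. The strings of zeros and ones that the machine needs are produced out of sight.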
To these three layers -- machine language, assembler, and compiler -- has been added yet a fourth layer. Higher-level special purpose languages are easy to use and highly specialized. They group compiler programs and let the user invoke them in a way that is almost like talking to the computer in plain English. For statistical applications, the two world leaders are SPSS (Statistical Package for the Social Sciences) and SAS (Statistical Analysis System). If you are going to do extensive analysis of computer databases, sooner or later you will probably want to learn one or both of these two higher-level languages. Here is an example that will show you why:
You have a database that lists every honorarium reported by every member of Congress for a given year. The first thing you want to know is the central tendency, so you write a program to give you the mean, the variance, and the standard deviation. A FORTRAN program would require 22 steps. In SAS, once the data have been described to the computer, there are just three lines of code. In SPSS there is only one:
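For readers without either package at hand, the same three statistics can be sketched in Python, a general-purpose language that is not part of the comparison above. The honorarium amounts are invented, and the choice of the sample (n - 1) form of the variance is an assumption about which version is wanted.

```python
from statistics import mean, stdev, variance

# Invented honorarium amounts in dollars; real data would be read from a file.
honoraria = [2000, 1000, 500, 2000, 1500]


def describe(values):
    """Return the mean, sample variance, and sample standard deviation."""
    return {
        "mean": mean(values),
        "variance": variance(values),
        "std_dev": stdev(values),
    }


for name, value in describe(honoraria).items():
    print(f"{name}: {value:,.2f}")
```

As with the statistical packages, most of the work is hidden inside routines that someone else has already written and tested.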
For a comparative evaluation of SAS and SPSS, keep reading. But first there is one other kind of software you need to know about. Every computer needs a system for controlling its activity, directing instructions to the proper resources. Starting with the first of the third-generation IBM mainframe computers, the language enabling the user to control the operating system was called JCL for Job Control Language. Now "job control language" has become a generic term to mean the language used to run any operating system. (On second-generation mainframes, which could only work on one job at a time, we filled out a pencil-and-paper form telling the computer operator what tapes to mount on what drives and what switches to hit.) The operating systems also include some utility programs that let you do useful things with data like sorting, copying, protecting, and merging files.
Another kind of software is needed for batch computing. If you are going
to send the computer a list of instructions, you need a system for entering
and editing those instructions. Throughout the 1960s and part of the 1970s,
instructions were entered on punched cards. You typed the instructions
at a card-punching machine and edited them by throwing away the cards with
mistakes and substituting good ones. Today the instructions are entered
directly into computer memory and edited there. Older editing systems still
in use are TSO (for time-sharing option) and WYLBUR (named to make it seem
human). XEDIT is a powerful and more recent IBM editor. If you do mainframe
computing, you will have to learn one of the editor systems available for
that particular mainframe. Personal computer programs that allow batch
processing have their own built-in editors, and you can learn them at the
same time you learn the underlying program. You can also use the word-processing
program with which you are most familiar to write and edit computer programs.
The first decision to make when approaching a task that needs a computer is whether to do the job on a mainframe or on a personal computer. The second is what software to use. Software can generally be classified into two kinds: that which operates interactively, generally by presenting you with choices from a menu and responding to your choices, and that which operates in batch mode, where you present a complete list of instructions and get back a complete job. Some statistical packages offer aspects of both.
The threshold of size and complexity at which you need a mainframe keeps getting pushed back. As recently as the early 1980s, a mainframe would routinely be used to analyze a simple public opinion survey with, say, 50 questions and 1,500 respondents. By the late 1980s, personal computers powerful enough to do that job more conveniently were commonplace in both homes and offices. By 1989, USA Today had begun to work with very powerful personal computers to read and analyze large federal government computer archives in its own special projects office. Mainframes were still needed for the larger and more complex databases, but it seems likely that mainframes could become irrelevant for most journalistic work at some point during the shelf life of this book.
After word processing, the most common personal computer applications are spreadsheets and database programs. The best way to get to know a spreadsheet (examples: Lotus, SuperCalc, PC-Calc) is to use one as your personal check register. As a journalist or potential journalist, you are probably more comfortable with words than numbers and don't get your checkbook to balance very often. A spreadsheet will make it possible and may even encourage you to seek out more complicated applications. For example, when Tom Moore was in the Knight-Ridder Washington Bureau, he created a spreadsheet model for a hypothetical federal tax return. Then when Congress debated changes in the tax law, he could quickly show how each proposal would affect his hypothetical taxpayer.
To understand what a database program (examples: dBase, Paradox, PC-File, Q & A) is good for, imagine a project requiring data stored on index cards. The school insurance investigation described in chapter 2 is a good example. A database program will sort things for you and search for specific things or specific relationships. One thing it is especially good for is maintaining the respondent list for a mail survey, keeping track of who has answered, and directing follow-up messages to those who have not. A database system is better at information retrieval than it is at systematic analysis of the information, but many reporters have used such systems for fairly sophisticated analysis.
Those who design computer software and those who decide what software to use have difficult choices to make. Life is a tradeoff. The easier software is to learn and use, the less flexible it is. The only way to gain flexibility is to work harder at learning it in the first place. It is not the function of this book to teach you computer programming, but to give you a general idea of how things work. To do that, this next section is going to walk you through a simple example using SPSS Studentware, a program that is cheap and reliable and achieves a nice balance between flexibility and ease of use.
To ensure that the example stays simple, we'll use only ten cases. But the
data are real enough, and they include both continuous and categorical
variables. What we have here is a list of the ten largest newspapers according
to the September 1988 Audit Bureau of Circulation figures and four data
fields for each: 1988 circulation, 1983 circulation, whether or not it
is a national newspaper (I define it as a national newspaper if it is published
outside North Carolina, and I can buy it on a newsstand in Chapel Hill)
and whether or not it is located in the northeast. On the last two questions,
a 1 is entered if it meets the criterion and a 2 is entered if it does
not. Here is what the complete database looks like:
Before we do anything with it, let's visualize a couple of concepts. In dealing with any set of data, the first thing you need to know is what the unit of analysis is. In this case, the unit is the individual newspaper. Each line in the data is a unit of analysis. Another word for it is observation, which is the term used in SAS manuals. Yet another is case, a term preferred by the writers of SPSS instructions. Each case or observation in the example above is one line or record, to use a common data-processing term. In a larger data set, you might have more than one record per case. When data were entered on punched cards, the standard record length was 80 characters, which was the width of the standard Hollerith card. Now your data entry medium is more likely to be a magnetic tape or disk, and there is less restriction on record length and therefore less need to have more than one record per case. However, 80 characters is still a good length if you are likely to want to look at your data on a computer screen. The typical word processor shows an 80-character screen, and if you have to edit the data, the word processor with which you are most familiar can be the best way to do it. Another practical length is 132 characters, the number that will fit on a wide-carriage printer.
If you have trouble picturing the concepts of "record" and "unit of analysis," imagine that your data are entered on three-by-five index cards. Each card is a record. What does each card stand for? Is it a person, as in a public opinion poll? A political contribution? A piece of real estate? Whatever it is, that is your unit of analysis (or "case" if you are using SPSS, "observation" if you are dealing with SAS).
Here are some other things worth noticing about the simple data set in our example. The identity of each case comes first, and the newspaper names have been compressed to six-character mnemonics. It would be perfectly okay to list the name in full. However, that might take some extra programming because many analysis programs set limits on the length of non-numeric fields that they can handle. Six or eight characters is usually safe. In this data set, we have five fields. The first is alphanumeric, and the other four are numeric. Computers are better at manipulating numeric data and, where we have a choice, we usually prefer all numbers. An identification field is not used for manipulation, as a rule, and so we don't mind not having numbers there.
Another thing to note about this data set is that it is in fixed format. In other words, each field of data lines up (with right justification) vertically. If we think of the character fields as vertical columns, the identification always occupies columns 1 through 6, circulation size is in 8 through 14, prior circulation is in locations 16 through 22, and so forth. Some analysis systems, including both SAS and SPSS, are so forgiving that they don't require this much attention to "a place for everything and everything in its place." They can be made to recognize variables just by the order in which they appear, provided they are delimited. The data in our example are delimited by spaces, meaning there is a space to tell the computer where one field stops and another begins. In some situations, it is better to use commas or other characters as delimiters. In the old punched-card days, delimiters were not used as much because of the limited space. We liked to cram the fields together, cheek to cheek. With delimiters, your data are easier for humans to read, even if the computer doesn't care.
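The difference between reading by column position and reading by delimiter can be sketched in a few lines of Python. The first three column ranges are the ones given above. The positions of the last two fields, the circulation variable names, and both records are assumptions made up for this illustration.

```python
# Column positions are 1-based and inclusive, as in the text. ID, NAT, and
# NOREAST are names used in the chapter; CIRC88, CIRC83, and the last two
# column positions are hypothetical.
LAYOUT = {
    "ID": (1, 6),
    "CIRC88": (8, 14),
    "CIRC83": (16, 22),
    "NAT": (24, 24),
    "NOREAST": (26, 26),
}

# Two invented records, right-justified within their fields.
RECORDS = [
    "PAPERA 1000000  900000 1 1",
    "PAPERB  500000  520000 2 2",
]


def parse_fixed(line):
    """Read each field from its column positions (fixed format)."""
    record = {}
    for name, (start, end) in LAYOUT.items():
        text = line[start - 1:end].strip()
        record[name] = text if name == "ID" else int(text)
    return record


def parse_delimited(line):
    """Read fields by their order, using spaces as delimiters."""
    record = {}
    for name, text in zip(LAYOUT, line.split()):
        record[name] = text if name == "ID" else int(text)
    return record


for line in RECORDS:
    print(parse_fixed(line))
    print(parse_delimited(line))
```

Because these records are both aligned and space-delimited, the two readings agree. Cram the fields together cheek to cheek and only the fixed-format reading would still work.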
Now think for a moment about what we might want to do with this data set. One obvious thing is to calculate a mean and a standard deviation for each circulation year. That way, we can see if the ten largest papers as a whole have been declining or growing in circulation. (Eyeball inspection of the list shows there are examples of both.) We would also be interested in knowing the growth or decline rate for each paper over the five-year period. Here is the entire SPSS program for doing all of that. The program would be the same whether we were dealing with ten newspapers or 10,000.
Only four statements. No more. Here's what they do:
1. DATA LIST. This is a format statement. It tells SPSS to look in its own directory for the file named "PAPER.DOC." How did the file get there? I put it there with my word processor. It then tells SPSS that the first variable is named ID, that it is found in positions 1 through 6 and that it is alphanumeric rather than numeric (the default). Then each of the other variables is named and its location given.
2. COMPUTE. This is a powerful SPSS command for making new variables out of old ones. This particular statement says that, for each case, subtract the old circulation from the new and divide the result by the old. That, of course, yields the proportional change from 1983 to 1988 (multiply by 100 to read it as a percent). The command further tells SPSS to assign the resulting value to a new variable named GROWTH.
3. FREQUENCIES. This simple command tells SPSS to report the frequency of each occurrence of each value for each variable in three ways: absolute terms, simple percent, and cumulative percent. The STATISTICS option further orders the mean, median, standard deviation, and range for each variable.
4. LIST is asking for a simple report showing the five-year circulation shift for each newspaper.
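For readers who want to see the arithmetic behind COMPUTE and FREQUENCIES spelled out, here is a rough equivalent sketched in Python. It is not the SPSS program and makes no attempt to imitate SPSS output. The three cases are invented, and the function names are hypothetical.

```python
from collections import Counter
from statistics import mean, median, stdev

# Invented cases: (ID, circulation now, circulation five years earlier, NAT).
cases = [
    ("PAPERA", 1100000, 1000000, 1),
    ("PAPERB", 900000, 1000000, 2),
    ("PAPERC", 600000, 500000, 2),
]


def growth(new, old):
    """Proportional change: subtract the old from the new, divide by the old."""
    return (new - old) / old


def frequencies(values):
    """Rows of (value, count, percent, cumulative percent), in value order."""
    counts = Counter(values)
    total = len(values)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        percent = 100 * counts[value] / total
        cumulative += percent
        rows.append((value, counts[value], percent, cumulative))
    return rows


def summary(values):
    """The four statistics named in the text."""
    return {
        "mean": mean(values),
        "median": median(values),
        "std_dev": stdev(values),
        "range": max(values) - min(values),
    }


for paper_id, new, old, _ in cases:
    print(paper_id, round(growth(new, old), 3))

for row in frequencies([nat for *_, nat in cases]):
    print(row)

print(summary([new for _, new, _, _ in cases]))
```

The same functions would serve for ten cases or ten thousand. Only the list of cases changes.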
The total output from these four simple commands is three pages long. The important part can be summarized quite succinctly. The mean circulation for these ten papers rose from 932,165 to 1,022,786 over the five-year period. To see which grew and which shrank, here is a direct quote from the SPSS output:
Note that the papers appear in the order of their size in 1988. You would rather see the list sorted by growth? No problem. SPSS will sort it, or you can copy the output to your word processor and sort it there. Either way, the result will look like this:
Now to do some real analysis. Which papers have experienced the most growth, national or local, northeast or elsewhere? There are at least two convenient ways to test this. One is to get the mean growth for each category. Another is to reduce the GROWTH variable itself to categorical status and run a crosstab. To do that, we'll take the top five on the list and call them the high-growth papers, and the remainder the low or no-growth papers. Here is the SPSS code for comparing the means:
MEANS GROWTH BY NAT NOREAST.
That produces output that tells you that the mean growth for each of the subgroups was:
Of course, these means are severely impacted by USA Today, a national paper published in the northeast with a 98 percent growth rate across these five years. To minimize its influence, we can cut back to the categorical level of measurement and just classify all the papers as high growth or low growth. The SPSS code to do that (and introduce some labels on the output to make the tables easy to read) is as follows:
The first line computes a new variable called GROWCAT (for growth category) and sets it to an initial value of 1 for each case.
The second line tells the computer to evaluate each case and if its value for GROWTH is greater than .04, to change GROWCAT from a 1 to a 2. That leaves each case classified as a 1 or a 2 depending on whether its value was high or low. The next line gives labels to those values so you won't forget them. Another VALUE LABELS command labels the national and northeast variables for the same reason.
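The logic of that recoding is simple enough to sketch in Python. The .04 cutoff and the variable names GROWTH and GROWCAT come from the description above. The wording of the labels and the sample values are assumptions made for illustration.

```python
# Label wording is assumed; only the codes 1 and 2 come from the text.
GROWCAT_LABELS = {1: "low or no growth", 2: "high growth"}


def grow_cat(growth, cutoff=0.04):
    """Start every case at 1, then change it to 2 if GROWTH exceeds the cutoff."""
    category = 1
    if growth > cutoff:
        category = 2
    return category


# Invented GROWTH values for three cases.
for value in (0.98, 0.04, -0.10):
    code = grow_cat(value)
    print(value, code, GROWCAT_LABELS[code])
```

Note that a case sitting exactly at the cutoff stays in category 1, because the test is "greater than" and not "greater than or equal to."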
This table was copied directly from the SPSS output to my word processor. What you see here is what you get from SPSS.
National papers and those in the northeast still have an edge. Seventy-five percent of the national papers but only half of the local ones were in the high circulation gain group. Two-thirds of the northeast papers and only half of the others enjoyed high growth.
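A crosstab with row percentages is nothing more than counting cases in each cell and dividing by the row total. The Python sketch below shows the idea. The ten cases are invented, chosen only so that the row percentages echo the national-versus-local comparison quoted above, and they should not be read as the actual newspaper data.

```python
from collections import Counter

# Ten invented cases as (NAT, GROWCAT): NAT 1 = national, 2 = local;
# GROWCAT 1 = low growth, 2 = high growth.
cases = [
    (1, 2), (1, 2), (1, 2), (1, 1),
    (2, 2), (2, 2), (2, 2), (2, 1), (2, 1), (2, 1),
]


def crosstab(pairs):
    """Count the cases falling in each (row, column) cell."""
    return Counter(pairs)


def row_percents(pairs):
    """Express each cell as a percentage of its row total."""
    cells = crosstab(pairs)
    row_totals = Counter(row for row, _ in pairs)
    return {
        (row, col): 100 * count / row_totals[row]
        for (row, col), count in cells.items()
    }


for cell, percent in sorted(row_percents(cases).items()):
    print(cell, f"{percent:.0f}%")
```

With only ten cases a single newspaper moves a row percentage by many points, so a table like this is a description, not proof of anything.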
Of course, statistical tests, such as chi-square, are easily added.
SPSS and SAS compared
Both SAS and SPSS have been around for a long time. My first encounter with such user-oriented, higher-level languages was at Harvard in 1966, where faculty members in the department of social relations had written a language called DATA-TEXT for Harvard's IBM 7090.4 They worked on a government grant and gave the product away to anyone for the cost of a blank tape, then about $10. It never really caught on because, to make it fast and efficient, they wrote it mostly in the 7090's assembler language. That meant that when the third generation of computers came along it could not be quickly adapted. By the time the Harvard folks got around to it, SPSS, written in FORTRAN and therefore readily transportable, had passed them in popularity. Today, SPSS is a booming business, based in Chicago, the academic home base of Norman Nie, its chief founder. SAS, meanwhile, based in Cary, North Carolina, became the chief rival to SPSS.
Both systems are constantly being improved and expanded, and so any comparison between them risks becoming quickly outdated. However, as late as 1990, there were fundamental differences in approach traceable to the respective corporate cultures of SAS and SPSS, which did not seem likely to change over time.
SAS was more of a programmer's system; SPSS was better suited to the nonprogrammer. In the tradeoff between flexibility and ease of use, SAS leaned a little more toward flexibility. If you were going to analyze data often, that is, more than two or three times a year, it could be worth the trouble to master SAS. With SPSS you did not have to think like a programmer. Some steps that SAS kept visible in order to force you to understand what was happening in the computer were made transparent by SPSS. This was particularly true where crosstabs were concerned. Labeling and setting up tables was much easier in SPSS.
SAS justly gained fame for its file management capabilities. If you had large and complicated bodies of data to work with on a mainframe, SAS was great at letting you reshape them and get them into workable form. Both SAS and SPSS were, by the late 1980s, capable of reading complicated formats, some of which will be discussed shortly.
The weakest point for SAS was its manuals. Those produced in the 1980s were written by programmers for programmers, and, until you learned to think like a computer programmer, they were hard to read. The SAS folks cranked them out so fast that they sometimes did not get them organized well. An early introduction to SAS-PC, for example, told you clearly, with four-color illustrations, how to save a program file, but it never mentioned how to retrieve it once it was saved. SPSS manuals were more readable. Best of all, SPSS had Marija Norusis, the clearest writer on computing and statistical method I have ever encountered. Norusis has produced a series of books for SPSS which integrate the explanation of computer technique and statistical method, which is the logical way to learn this stuff.5 It lets you mix learning and doing in a way that constantly rewards your efforts.
In their PC versions, SAS and SPSS designed very different editors, the systems that let you prepare your batch instructions. The SPSS editor was combined with a menu system. This is good news for personal computer users who are likely to be accustomed to menus, but SPSS, like SAS, is meant for batch mode. Most newcomers to SPSS used the menus like training wheels and bypassed them for direct entry as soon as possible. The SAS editor, called "Display Manager," was more logical and easier to learn than the SPSS version. It gave you three screens: one for your batch program, one for the output, and one for the log that recorded the good or bad things that happened to your program, including the error messages. One key let you toggle between the three screens, and another key let you zoom any of them up to a full screen so you could concentrate on its contents. Not content with that, and perhaps not wanting to be outdone by SPSS, SAS in 1989 offered a menu-driven version to appeal to potential users who felt the need for training wheels.
Both SAS and SPSS had only minor differences in their mainframe and PC languages. After learning one, you could easily switch to the other. Starting in 1988, I stopped introducing students to the mainframe and let them learn first on the PC because the feedback is faster and the student has a greater sense of control. Both SAS and SPSS had systems for exporting their system files -- files with the format and label instructions already carried out -- between mainframes and PCs. And at the mainframe level, SPSS and SAS could read each other's system files, a clever move designed to encourage users of one to switch to the other without worrying about losing the value of their existing data libraries.
The SAS v. SPSS story is a fine example of the power of competition in a free
market system. Each keeps trying to outdo the other. Many users, afraid
of being left behind in some new development, maintain a bilingual capability.
A database that follows the model of the ten largest newspapers used earlier in this chapter is straightforward and easy to work with no matter how large it gets. If we had 2,000 newspapers and 2,000 variables (4 million pieces of information), the logic and the programming would be exactly the same as we used with ten papers and five variables. Such a database is called rectangular. Every case has the same number of records and the same number of variables.
There are two fairly common types of nonrectangular files:
In the first of these two cases, a file can be treated as if it were rectangular with the variables that would have been in the missing records defined as "missing." Both SAS and SPSS provide for automatic treatment of missing values. When calculating percentages, for example, they use the number of nonmissing values as the base. For example, if you had a file describing the 83 residents of a dormitory, and if 40 were classified as males, 40 as females, and the gender of three was unknown, either system would report 50 percent males and 50 percent females unless you specified missing as a separate category.
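The dormitory arithmetic can be checked with a few lines of Python. This is only a sketch of the idea, not a picture of how SAS or SPSS works internally: the function name, the `include_missing` switch, and the use of `None` to stand for a missing value are all assumptions made for illustration.

```python
from collections import Counter

def percentages(values, include_missing=False):
    """Return {category: percent}.

    By default the base is the number of nonmissing values. None stands
    for a missing value (an assumption of this sketch).
    """
    if include_missing:
        counted = ["missing" if v is None else v for v in values]
    else:
        counted = [v for v in values if v is not None]
    base = len(counted)
    counts = Counter(counted)
    return {category: 100.0 * n / base for category, n in counts.items()}

# The dormitory from the text: 40 males, 40 females, 3 of unknown gender.
residents = ["male"] * 40 + ["female"] * 40 + [None] * 3

default = percentages(residents)
with_missing = percentages(residents, include_missing=True)

print(default)
print({k: round(v, 1) for k, v in with_missing.items()})
```

With missing values left out of the base, the split is 50-50. Counting "missing" as its own category changes the base to 83, and each gender drops to about 48 percent.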
But sometimes an unequal number of records does not denote missing values. It may just mean different quantities of whatever is being measured. A hierarchical file is one way of dealing with this situation. Suppose the government created a computer file based on reports filed by manufacturing companies on their disposition of toxic waste. The unit of analysis (or case or observation) would be a single plant. Then there might be one record for each toxic chemical emitted, with each record showing how much of the chemical was discharged in each of several sectors of the environment, e.g., land, water, air, or recycling facility. A plant that dumped a lot of different chemicals would have more records per case than a plant that dumped a few. Both SAS and SPSS are equipped to handle this situation.
Let's complicate this example a little bit more. Suppose there is one report for each corporate entity that emits toxic waste. The first record in each case would have information about the corporation: its size, location of headquarters, industrial classification, and so on. Call this Record Type 1.
For each of these corporate records, there is a set of plant records, one for each plant. This would be Record Type 2, and it would contain information about the plant, including geographic location, size, product line, etc.
For each plant record, there would be yet another set of records (Type 3), one for each toxic chemical discharged. Each of these records would give the generic name for the chemical, any trade names, the amount, and an indication of its form (gas, liquid, or solid).
Finally, for each chemical record, envision one more set (Type 4), one for each method of disposal used for that particular chemical, i.e., ground, water, air, recycling. Each of these records could give details on the time, place, and manner of each emission.
If all of that sounds complicated, it is because it is. The good news is that a flexible analysis package like SAS or SPSS can deal with this kind of file, and, even better, it can let you choose the unit of analysis.
Such files are created by people who don't have the slightest idea what the
analyst will eventually be interested in, and so the files are designed
to leave all possibilities open. The advantage is that you can set your
unit of analysis at any level in the hierarchy. Suppose, for example, you
want the individual plant to be the unit of analysis. The computer can
spread the corporate data across all of the plant cases so that you can
use the corporate variables in comparing the characteristics of different
plants. Or if you want the individual chemical emission to be the unit
of analysis, you can tell the computer to spread the corporate and plant
data to cover each emission. You do that by creating a rectangular file
first. After that, the rest of the analysis is straightforward.
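The idea of spreading higher-level data across lower-level cases can be sketched in Python. Everything here is hypothetical: the record layout, the field names, and the sample corporations are invented for illustration, and the sketch does not reproduce the SAS or SPSS syntax for the job. It assumes the records arrive in file order, each one tagged with its record type.

```python
def flatten(records, unit_level):
    """Build rectangular rows, one per record at unit_level.

    records is a sequence of (record_type, fields) pairs in file order.
    Fields from the most recent higher-level records are copied onto
    each row, which is what "spreading" the data means.
    """
    current = {}   # latest record seen at each level above the unit
    rows = []
    for record_type, fields in records:
        if record_type < unit_level:
            current[record_type] = fields
            # A new parent makes any deeper stored parents obsolete.
            for deeper in [t for t in current if t > record_type]:
                del current[deeper]
        elif record_type == unit_level:
            row = {}
            for level in sorted(current):
                row.update(current[level])
            row.update(fields)
            rows.append(row)
        # Records below the unit of analysis are ignored in this sketch.
    return rows

# Made-up data: Type 1 = corporation, Type 2 = plant, Type 3 = chemical.
records = [
    (1, {"corp": "Acme", "hq": "Akron"}),
    (2, {"plant": "A1", "plant_state": "OH"}),
    (3, {"chemical": "toluene", "pounds": 500}),
    (3, {"chemical": "benzene", "pounds": 120}),
    (2, {"plant": "A2", "plant_state": "PA"}),
    (3, {"chemical": "toluene", "pounds": 75}),
    (1, {"corp": "Bolt", "hq": "Toledo"}),
    (2, {"plant": "B1", "plant_state": "OH"}),
    (3, {"chemical": "xylene", "pounds": 300}),
]

plants = flatten(records, unit_level=2)      # one row per plant
emissions = flatten(records, unit_level=3)   # one row per chemical

for row in emissions:
    print(row)
```

Choosing `unit_level=2` yields three rows, one per plant, each carrying its corporation's variables. Choosing `unit_level=3` yields four rows, one per chemical, each carrying both the plant and the corporate variables.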
Communication among computers
Twenty years ago, the first law of computers seemed to be "Everything is incompatible." Today, compatibility is usually close at hand.
While computers use binary formats to hold and process information, there are a number of possible ways to do it. The smallest unit of information is the binary "bit," meaning one piece of on-off, yes-no, open-closed information. By stringing several bits together, one can encode more complicated pieces of information, and the standard convention is to string them together in groups of eight. Each group of eight is called a "byte." When a computer manufacturer tells you that a machine has 512K of random access memory, it means 512 kilobytes, or roughly 512,000 bytes (strictly, 512 times 1,024, or 524,288). A byte is also the equivalent of a letter, number, or special character on the keyboard. For example, in the Extended Binary-Coded Decimal Interchange Code (EBCDIC), standard on IBM mainframes, the eight-bit expression 11010111 stands for the letter "P." Another coding system, the American Standard Code for Information Interchange (ASCII), is used on most personal computers. Right there you can sense some problems if you try to move data from one kind of computer to another, but they have been mostly anticipated by the designers of the communication equipment. If you move data between a mainframe and a personal computer, the communication software takes care of the ASCII-EBCDIC conversion, and you seldom have to be aware of the difference.
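The difference between the two codes is easy to see in Python, whose standard library includes several EBCDIC code pages. The sketch below uses `cp037`, one of them; the choice of that particular code page and the helper name `ebcdic_to_ascii` are assumptions for illustration, since a real mainframe file may use a different EBCDIC variant.

```python
# The eight-bit pattern from the text, read as an EBCDIC byte.
ebcdic_byte = bytes([0b11010111])
letter = ebcdic_byte.decode("cp037")

# The same letter in each code, shown as eight bits.
ebcdic_bits = format("P".encode("cp037")[0], "08b")
ascii_bits = format(ord("P"), "08b")

def ebcdic_to_ascii(data: bytes) -> bytes:
    """Convert EBCDIC (cp037) bytes to ASCII bytes.

    This is the kind of translation communication software does for you.
    """
    return data.decode("cp037").encode("ascii")

print(letter)        # P
print(ebcdic_bits)   # 11010111
print(ascii_bits)    # 01010000
print(ebcdic_to_ascii(b"\xd7\xc1\xd7\xc5\xd9"))   # b'PAPER'
```

The same letter is a different pattern of bits in each code, which is why a file moved between machines without translation looks like garbage.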
How do you get data from one place to another? The telephone is convenient for data sets that are not too large. Modem speeds are measured in baud, a unit named for Émile Baudot, the inventor of the Baudot code. Low-priced modems in 1989 operated at 1,200 baud, which is about 120 characters per second or 1,200 words per minute--more than most people's comfortable reading speed. More expensive equipment was capable of 9,600 baud. Even so, moving very large data sets across phone lines took a considerable amount of time, and there were still situations where it was more convenient to move data from one place to another by physically carrying it in a magnetic or optical medium. A development that helped some news organizations was the availability of a desktop tape drive that could transfer the contents of a tape written on a mainframe to a personal computer. USA Today installed one in 1989. A local area network, or LAN, provides a means of moving data among computers over short distances.
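The speed figures above follow from simple arithmetic, sketched here in Python. Two assumptions are built in and labeled in the code: that each character costs ten bits on the line (eight data bits plus a start and a stop bit), and that a word averages six characters including the space. The one-million-character file is a hypothetical size chosen only to show the scale.

```python
BITS_PER_CHARACTER = 10   # assumption: 8 data bits plus start and stop bits
CHARS_PER_WORD = 6        # assumption: five letters plus a space

def chars_per_second(bits_per_second):
    return bits_per_second / BITS_PER_CHARACTER

def words_per_minute(bits_per_second):
    return chars_per_second(bits_per_second) * 60 / CHARS_PER_WORD

def transfer_minutes(size_in_characters, bits_per_second):
    """Minutes needed to send a file, ignoring errors and retries."""
    return size_in_characters / chars_per_second(bits_per_second) / 60

print(chars_per_second(1200))    # 120.0
print(words_per_minute(1200))    # 1200.0

# A hypothetical file of one million characters at the two speeds.
slow = transfer_minutes(1_000_000, 1200)
fast = transfer_minutes(1_000_000, 9600)
print(round(slow), "minutes at 1,200;", round(fast), "minutes at 9,600")
```

Under these assumptions a million characters would tie up the phone line for more than two hours at the lower speed, which is why carrying a tape across town could still be the faster route.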
Dealing with large government databases usually means having to work with tapes unless you can talk the governmental unit into copying the material onto personal computer disks. Tapes are more complicated than disks only because there are more ways to store data on them.
When you tackle a tape data set for the first time, you will probably be working
with a mainframe, and you can expect help from the computer professionals
at the mainframe installation. If you have a good description of the tape,
that person can put together the job control language (JCL) to read it
and boot SAS or SPSS. You don't have to learn to do everything yourself,
at least not all at once. But you will find it easier to communicate with
the pros if you know the following facts about how tape data sets are constructed.
The data are stored on tracks which run the length of the tape. Nine-track is the standard IBM format, but some systems still use seven tracks. Before a tape can be written on for the first time, it has to be initialized, which usually means giving it an internal, machine-readable label and specifying a density range. The most common density levels are 1,600 and 6,250 bits per inch (BPI), measured along each track, which on a nine-track tape works out to the same number of bytes per inch. The key variables for the data layout are logical record length (LRECL in job control language) and block size (BLKSIZE). Because a tape drive reads the data sequentially, spooling through the tape from the beginning to find what it is told to look for, it pays to pack the records in cheek-to-cheek to reduce the distance the tape has to travel, and so the records are "blocked." If a tape has an LRECL of 80 and BLKSIZE of 80, each logical record is its own block, and the data are said to be in "card image," because the physical records are analogous to a deck of old-fashioned Hollerith cards. You will also need to specify the record format (RECFM), which is usually FB, meaning that the records are of fixed length (i.e., each record is the same length) and are arranged in blocks. These characteristics are all specified on a JCL statement that describes the Data Control Block (DCB). Many different data sets can be kept on one tape. You might store a dozen public opinion polls on one tape, for example. To keep track of them, the computer leaves an end-of-file (EOF) marker at the end of each data set that it writes. To get back to that same data set, you just specify its sequence number in the JCL statement. Two EOF marks together constitute the end-of-tape signal. It helps to keep the tape from running off the reel and flopping foolishly around.
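The relationship between logical records and blocks can be made concrete with a small Python sketch. It is not a tape driver. It assumes the physical blocks have already been read into memory as byte strings, and the function name `deblock`, the block size of 240, and the five sample records are all invented for illustration.

```python
def deblock(blocks, lrecl):
    """Yield fixed-length logical records from a sequence of physical blocks."""
    for block in blocks:
        if len(block) % lrecl != 0:
            raise ValueError("block length is not a multiple of LRECL")
        for start in range(0, len(block), lrecl):
            yield block[start:start + lrecl]

LRECL = 80
BLKSIZE = 240   # hypothetical: three 80-byte records per block

# Five made-up records, each padded with blanks to 80 bytes.
logical = [f"record {n}".encode("ascii").ljust(LRECL) for n in range(1, 6)]

# Pack them as a tape would hold them: full blocks, then a short final block.
per_block = BLKSIZE // LRECL
blocks = [b"".join(logical[i:i + per_block])
          for i in range(0, len(logical), per_block)]

records = list(deblock(blocks, LRECL))
print(len(blocks), "blocks,", len(records), "records")
print(records[3].rstrip().decode("ascii"))
```

Here five 80-byte records travel in two physical blocks instead of five. With LRECL and BLKSIZE both set to 80, every block would hold exactly one record, which is the card-image case.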
Good news for IBM users: when you use an IBM standard label tape, you can often ignore most of the DCB business because it is contained in the tape's own internal label and your computer software will read it and adjust things for you. Here is an example of a JCL data definition statement for reading a standard label tape:
The two slashes tell the computer that it is reading job control language. INPUT DD means that it is about to receive the data definition for an incoming file. The four strings of characters separated by periods are the tape's internal, machine-read label. LABEL=(1,SL) means that this is an IBM standard label tape, and the machine is to read the first file on the tape. UDL393 is the external label on the tape. A human being has to locate it by that label, pick it off a shelf, and mount it on a tape drive. If this were not a standard label tape, or if you were uncertain, you could bypass the label processing and spell out the DCB characteristics in that same statement.
Sometimes very large data sets, especially those prepared by or for the federal government, use some special coding systems designed to save space. They use hexadecimal or zoned decimal or packed decimal notation. Not to worry. Both SAS and SPSS have provisions to allow you to warn the computer in the input statement to watch out for that stuff. As long as you tell the computer what to expect, there is no problem, and the output will show the conventional numbers you want.
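To take some of the mystery out of "that stuff," here is a Python sketch of how packed decimal and zoned decimal fields are commonly laid out on IBM systems. It is an illustration only, with hypothetical function names, and it is not how SAS or SPSS do the job internally. In packed decimal, each byte holds two digits and the final half-byte holds the sign. In zoned decimal, each byte holds one digit and the sign rides in the upper half of the last byte.

```python
NEGATIVE_SIGNS = (0x0B, 0x0D)   # sign half-bytes that mean "minus"

def unpack_packed_decimal(data: bytes, decimals: int = 0):
    """Decode packed decimal: two digits per byte, sign in the last half-byte."""
    digits = []
    for byte in data[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(data[-1] >> 4)
    sign = data[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    if sign in NEGATIVE_SIGNS:
        value = -value
    # decimals is the number of implied decimal places, if any.
    return value / 10 ** decimals if decimals else value

def unpack_zoned_decimal(data: bytes):
    """Decode zoned decimal: one digit per byte, sign in the last byte's zone."""
    value = int("".join(str(byte & 0x0F) for byte in data))
    if data[-1] >> 4 in NEGATIVE_SIGNS:
        value = -value
    return value

print(unpack_packed_decimal(bytes([0x12, 0x34, 0x5C])))              # 12345
print(unpack_packed_decimal(bytes([0x12, 0x34, 0x5D])))              # -12345
print(unpack_packed_decimal(bytes([0x12, 0x34, 0x5C]), decimals=2))  # 123.45
print(unpack_zoned_decimal(bytes([0xF1, 0xF2, 0xD3])))               # -123
```

The space saving is visible in the first example: five digits and a sign fit in three bytes, where plain character storage would need six.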
Data formats are more standardized in a personal computer, and you seldom have
to worry about the details of how information is laid out on a disk. But
you will want to keep an operating system manual handy, for its utilities
if for nothing else. The companies that write operating system software
tend to issue manuals that are compulsive in their completeness. This makes
them hard to read. Browse in the computer department of a good bookstore
until you find a manual by an independent author that is pitched at your
level. Microsoft DOS (for Disk Operating System) was the standard for IBM
and compatible computers throughout the 1980s. A newer system, OS/2, was
designed to allow more efficient use of resources by permitting a personal
computer to work on more than one task at once.
How do data get onto the tape or disk medium in the first place? Someone types
them in. When you have data that you generated yourself, through a survey,
field experiment, or coding from public records, you can type it in yourself,
using your favorite word processor, especially if
you have a word processor that keeps track of the columns for you so that
you can be sure that each entry in a fixed format is going to the right
place. Save it in ASCII code, unformatted, and read it directly on a personal
computer or upload it through a modem to a mainframe. For any but small-scale
projects, however, it is better to send the data to a professional data
entry house. The pros can do it faster and with fewer errors than you can.
Normally, data entry suppliers verify each entry by having it done twice,
with a computer checking to make certain that each operator read the material
the same way. A variety of optical character readers is also available
to machine-read printed or typed materials or special pencil-and-paper
The nerd factor
Computers are so fascinating in and of themselves that it is easy to get so absorbed in the minutiae of their operation that you forget what you started to use the computer for in the first place. The seductive thing about the computer is that it presents many interesting puzzles for which there is always an answer. And if you work with it long enough and hard enough, it will always reward you.
Most of life is not that way. Rewards are uncertain; you never have complete control. And so it becomes tempting to concentrate on the area where you do have control, the computer and its contents, to the exclusion of everything else. Neither academics nor journalists can afford to become that narrow. The computer needs to be kept in its place: as a tool to help you toward a goal, not as the goal itself.
You can't learn everything there is to know about computers, but you can learn
what you need to know to get the story. You will find that concepts and
procedures that you do not use more than once are quickly forgotten, and
that you will build two kinds of knowledge: things you need to know and
do yourself, and things for which you can find ready help when you need
it. Be a journalist first, and don't use the computer to shut out the world.
1. Many of these historical details come from Robert S. Tannenbaum, Computing in the Humanities and Social Sciences (Rockville, Md.: Computer Science Press, 1988).
2. G. Harry Stine, The Untold Story of the Computer Revolution (New York: Arbor House, 1985), p. 22.
3. Allan R. Mackintosh, "Dr. Atanasoff’s Computer," Scientific American, August 1988, pp. 90-96. See also the biography by a veteran journalist, Clark R. Mollenhoff, Atanasoff: Forgotten Father of the Computer (Ames: Iowa State University Press, 1988).
4. The Data-Text System: A Computer Language for Social Science Research, Preliminary Manual (Cambridge: Department of Social Relations, Harvard University, 1967). The leader of the Data-Text team was Arthur S. Couch. Some members later worked on the creation of SPSS.
5. For example, try Marija J. Norusis, The SPSS Guide to Data Analysis (Chicago: SPSS, Inc., 1986).