Linguists for the responsible use of internet data
Please feel free to join in this discussion and to contribute ideas, arguments,
and suggested methods.
Traditionally linguists have used corpora and databases as empirical sources for
research. The internet now houses billions of electronically searchable words
representing dozens of languages. All of this vast digital resource has the potential
to be used as linguistic data. We as linguists need to develop standards and best
practices for the use of this unprecedented resource to ensure that our use of
internet data will comply fully with the goals and expectations of our profession.
This page will serve as a discussion forum by presenting the advantages
of internet data, debating possible disadvantages of
internet data, and suggesting standards for the
use of internet data.
Advantages of internet data:
- The internet contains billions of electronically searchable
words, more than any corpus.
- The use of language on the internet is relatively
unconstrained, representing both formal and informal styles, and is thus more
"democratic" than most corpora, capturing spontaneous language use that most corpora
do not represent.
- New data is constantly being added, which means that data
includes current usage (unlike corpora, which quickly age and become outdated).
- Advanced options of search engines facilitate searches for phrases and constructions.
- Some search engines can lemmatize words, facilitating searches that include
all the morphological forms of a word.
- Search engines can provide statistical
data about the number of "hits" for each search, making it possible to compare
the relative frequency of linguistic forms.
- Some search engines enable domain
restrictions so that queries can be limited to a controlled subset of sources.
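The comparison of hit counts described above can be sketched in a few lines of Python. This is only an illustration: the function name `relative_shares` and the hit counts are hypothetical, and the counts are assumed to have been copied by hand from a search-results page rather than fetched from any search-engine API.

```python
def relative_shares(hit_counts):
    """Return each form's share of the combined hit count."""
    total = sum(hit_counts.values())
    if total == 0:
        raise ValueError("no hits recorded")
    return {form: count / total for form, count in hit_counts.items()}


# Hypothetical hit counts for two competing forms, entered by hand.
hits = {"form A": 750, "form B": 250}
shares = relative_shares(hits)

for form, share in shares.items():
    print(f"{form}: {share:.1%}")
```

The shares say nothing about absolute frequency; they only show how the competing forms compare with one another within the same search engine on the same day.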
Disadvantages of internet data:
- There is no editorial oversight, no gatekeeper, which means that we have no
control over what kinds of texts are included or whether they contain typographical
errors, etc. Suggested remedies: Conduct a manual
spot-check of a random representative sample of the data to screen for editorial
problems. Report the "error rate" and any other problems along with data. If deemed
necessary, use domain restrictions to reduce these problems.
- Some internet
pages contain material not written by native speakers of the language and might
not represent authentic usage. Suggested remedies:
Conduct a manual spot-check of a random sample of the data to screen for non-native
use (cross-check with a native speaker if necessary). Report the rate of flawed/non-native
use. Domain restrictions might help if there is a serious problem. In our experience,
the rate of problematic non-native language is extremely low in comparison with
- Chatrooms and similar informal internet "spaces" involve very
stylized uses of language that are sometimes very different from the standard
language. Suggested remedies: Manual checking
of a sample of data, reporting of rates of such examples, and possible use of
domain restrictions. In some instances, it may, however, be beneficial to study
this particular type of language use.
- Since it is difficult to determine the
total number of words queried in an internet search, it may be impossible to determine
the absolute frequency of a linguistic phenomenon. Suggested
remedies: Supplement internet research with data from corpora in order
to determine absolute frequency, if needed. In most cases, relative frequency
(not absolute frequency) is the more relevant piece of information.
- Sometimes a single item is repeated multiple times because it has been copied at various sites
on the internet. For example, a sentence containing a rare construction might
appear copied verbatim on five or six sites, thus artificially exaggerating its
frequency. Suggested remedies: Manually check
a random sample of the data and note the rate of repetitions and report this along
with your data. It may be possible to develop a coefficient for adjusting the
data to take this into account. However, repetition probably affects the
frequency of all data on the internet more or less uniformly, so its effect on
relative frequencies is likely to cancel out.
- The data on the internet is unstable, and
sites that are here today are often gone tomorrow. Suggested
remedies: If collecting individual pieces of data, collect all the
contextual data that you need along with the data and build files containing: your
contextualized data, the URL, and the date on which the URL was accessed. If collecting
statistical data, record the URL of the search result and the date on which it was accessed.
- Some of the text on the internet is copyright protected. Suggested
remedies: As linguists, we are not collecting content, just examples
of usage. Normally we will not need enough context to involve any realistic copyright
violation. Still, all URLs and the dates that they were accessed should be recorded
and cited as appropriate.
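The remedy suggested above for repeated items (note the rate of repetitions and possibly adjust for it) might be sketched as follows. Everything here is an assumption made for illustration: the helper names, the sample sentences, and especially the adjustment in `adjusted_count`, which is one crude way to turn a repetition rate into a coefficient and not an established method.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so verbatim copies compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def repetition_rate(examples):
    """Share of collected examples that repeat an earlier example."""
    if not examples:
        raise ValueError("no examples collected")
    distinct = Counter(normalize(e) for e in examples)
    return (len(examples) - len(distinct)) / len(examples)


def adjusted_count(raw_hits, rate):
    """Scale a raw hit count down by the observed repetition rate (an assumption)."""
    return raw_hits * (1 - rate)


# Hypothetical sample: one sentence appears three times with trivial differences.
sample = [
    "The cat sat.",
    "the  cat sat.",
    "A dog ran.",
    "The cat sat. ",
    "Birds fly.",
]
rate = repetition_rate(sample)
print(f"repetition rate: {rate:.0%}")
print(f"adjusted count for 200 raw hits: {adjusted_count(200, rate):.0f}")
```

Exact matching after normalization only catches verbatim copies; near-duplicates with small edits would still need the manual check.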
Suggested standards for use of internet data:
- Record information detailing the goals and
methodology of the search, including the names of the search engines used, and
examples of how queries were constructed.
- Copy contextualized data into files,
along with the URLs and the date accessed.
- If large amounts of data (several
hundred or more) are collected, perform a manual scan of a random sample of the
data (usually 30-50 items) to check for items that contain typographical errors
or "flawed" language and repetitions of identical items. Report these rates along
with your data. When dealing with rare phenomena and small numbers of data, pay
close attention to replications of identical items.
- Collect only enough context
as needed for your study. Beware of any possible copyright violations.
- If needed, compare internet data with data from a corpus, or use a corpus to establish
- Before undertaking large projects, researchers might wish
to contact the relevant search-engine companies for advice and to make sure that
they won't overburden the system.
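As a rough sketch of how the record-keeping and spot-check standards above could be put into practice, the following Python example stores each item with its URL, access date, search engine, and query, then draws a random sample for manual review. The field names, the `Record` class, the placeholder URLs, and the sample size of 40 (chosen from the 30-50 range mentioned above) are all assumptions for illustration; the manual judgements are entered by the researcher, not computed.

```python
import csv
import io
import random
from dataclasses import asdict, dataclass, fields
from datetime import date


@dataclass
class Record:
    text: str      # the example with only as much context as the study needs
    url: str       # where the example was found
    accessed: str  # ISO date on which the URL was accessed
    engine: str    # search engine used
    query: str     # query exactly as it was entered


def write_records(records, handle):
    """Write records as CSV with one header row."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(Record)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


def draw_sample(records, size=40, seed=0):
    """Draw a reproducible random sample for manual checking."""
    rng = random.Random(seed)
    return rng.sample(records, min(size, len(records)))


def rate(flags):
    """Share of manually checked items flagged as problematic."""
    if not flags:
        raise ValueError("nothing was checked")
    return sum(flags) / len(flags)


# Hypothetical data set of 300 collected items.
today = date.today().isoformat()
records = [
    Record(f"placeholder example {i}", f"https://example.org/page{i}",
           today, "hypothetical engine", '"placeholder phrase"')
    for i in range(300)
]

buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_records(records, buffer)

sample = draw_sample(records)
# Manual judgements (True = typographical error or "flawed" language).
flags = [False] * len(sample)
flags[0] = flags[1] = True
print(f"checked {len(sample)} items, flaw rate {rate(flags):.1%}")
```

Fixing the random seed makes the sample reproducible, so that a second reader can check the same items and the reported rate can be verified.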
If you want to contribute ideas to
this discussion, please feel free to use our blog. Go to this
site, log in as "uncslav" with the password "uncslav", and click on "Just
who have contributed to this discussion: Hyug Ahn, Ashley Batten, Biljana
Belamaric-Wilsey, Jamie Bishop, Sung-ho Choi, Dagmar Divjak, Sean Flanagan, Laura
A. Janda, Anne S. Keown, Patrick Murphy, James Phillips, Jenne Powers
This site last updated:
October 28, 2003