Text Mining Toolkit: User's Manual
Introduction
Overview
The Text Mining Toolkit (TMT) is a software tool to aid in the automatic discovery of topics in a corpus of text or HTML (Hypertext Markup Language) documents. This toolkit includes components to parse a body of documents, apply data mining algorithms to those documents, and analyze the results of those algorithms. The toolkit provides a Java™ Application Programming Interface (API) for software developers to write programs utilizing the parsing and analysis facilities. It also provides a simple, configuration-file-driven, command-line interface to automate the parsing and analysis of a large collection of documents. This manual will only cover the TMT command-line tool, not the TMT API.
The basic steps for using the command-line tool are outlined below. All these steps are explained in further detail throughout this manual.
- Mirror the web site you want to analyze with wget.
- Index the mirrored collection of web documents, which involves parsing the documents and extracting term-counts from specified parts of the documents.
- Apply a clustering algorithm to the indexed collection.
- Manually evaluate the output of the clustering, labeling the clusters or adjusting the clustering parameters as necessary.
License
The Text Mining Toolkit is released under the GNU General Public License. A copy of the license should have been distributed with the software, and can also be downloaded from the GNU General Public License website.
Configuration
The TMT command line interface is highly configurable via an XML-based configuration file (sample file). This configuration file has a simple, yet powerful, format which is described in detail throughout this document. The organization of the configuration file corresponds to the high-level organization of the toolkit:
<config>
<index>
... indexing configuration details ...
</index>
<cluster>
... clustering configuration details ...
</cluster>
<analysis>
... analysis configuration details ...
</analysis>
</config>
Software Dependencies
The TMT has several software dependencies:
All of the necessary Java libraries for the above tools are included with the Text Mining Toolkit. The toolkit also uses the output from a web-crawl made with a freely available Unix/Linux tool, wget.
Running the TMT Command-line Tool
To run the Text Mining Toolkit command-line interface, one must complete the following:
- Java™ must be in the user's path.
- The TMT jar file and all the above jar files must be in the user's CLASSPATH environment variable.
- A log file must have been produced from a web crawl using wget.
- A TMT configuration file must be present.
The Text Mining Toolkit comes with a simple script to invoke the command-line interface. This script is configured to set up the CLASSPATH variable correctly, but you may need to modify the script to suit your installation environment. This script should run on most UNIX/Linux installations.
A Note on using wget
Use of the freely-available tool, wget, is required for running the command-line TMT. This tool comes installed standard with most Linux distributions, but can also be downloaded from the GNU wget web site.
wget is used to mirror and create a local copy of the web site you are interested in clustering. The log-file generated by wget and the downloaded documents are used as input to the TMT command line tool. In order to produce a properly formatted wget log file, the following options must be used:
wget -nv -o log-file [other options]
where log-file is the log file to be created and other options are the other arguments required by wget to mirror a web site. See the GNU wget web site for detailed information on the available options. The following example command downloads the first three levels of a web site and produces the proper output:
wget -nv -o wget.log -r -l 3 -A html,htm -E -np http://www.example.com
Toolkit Components
There are three major components to the TMT: indexing, clustering, and analysis. The indexing component is responsible for reading files off disk, converting those files to structured data, and applying transformations to that data. The clustering component is responsible for selecting a clustering algorithm, configuring that algorithm, applying more transformations to the data (if necessary), and applying the clustering algorithm to the data. Finally, the analysis component is responsible for applying the trained clusterer and converting it into useful, understandable output.
Component: Indexing
The indexing component of the toolkit is responsible for parsing a mirrored web site and converting HTML documents into structured data which can be used by the clustering algorithms. This structured data is referred to as an indexed collection. For each HTML document in the collection, a series of numbers is produced. Each number corresponds to how many times a word occurs in the document. This series of numbers for each document is called a document representation. These document representations together make up the document-term matrix, which can be thought of as a grid, where the rows are documents, the columns are terms, and each cell of the grid is the number of times a term occurs in the corresponding document.
It is important to note that most words do not occur in most documents. This type of dataset is known as a sparse dataset. Whenever writing an indexed collection to disk, it is most efficient to make sure the data written to disk is in this sparse format.
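The document-term matrix and its sparse form can be illustrated with a short sketch. This is not the TMT's own code or file format; the function name build_sparse_matrix and the dictionary-per-row layout are assumptions chosen for illustration. Each document row stores only the columns whose count is non-zero.

```python
from collections import Counter


def build_sparse_matrix(docs):
    """Return (terms, rows): a sorted term list and one {column: count} dict per document."""
    terms = sorted({word for doc in docs for word in doc.split()})
    column = {term: i for i, term in enumerate(terms)}
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # Only non-zero cells are stored, which is what makes the format sparse.
        rows.append({column[t]: n for t, n in counts.items()})
    return terms, rows


docs = ["census survey data", "survey survey income", "state population"]
terms, rows = build_sparse_matrix(docs)
stored = sum(len(row) for row in rows)
total = len(docs) * len(terms)
print(terms)
print(rows)
print(f"{stored} of {total} cells stored")
```

Even in this three-document example, fewer than half of the cells need to be stored; on a real web site with thousands of terms the saving is far larger.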
A sample section of the configuration that corresponds to the indexing component is given below. An explanation of each element follows.
<index collection-name="name"
input-file="input-file"
input-dir="input-dir"
save-to-file="true|false">
<parse-filters>
<filter class-name="idl.tmt.documentparsing.filters.WordFilter"/>
<filter class-name="idl.tmt.documentparsing.filters.LowerCaseFilter"/>
<filter class-name="idl.tmt.documentparsing.filters.StopWordFilter">
<param name="stopWordFile" value="stopList.txt"/>
</filter>
</parse-filters>
<representation>
<builder class-name="idl.tmt.representation.TitleTextRepresentationBuilder"
weight="1.0" share-termlist="true"/>
<builder class-name="idl.tmt.representation.MetaTextRepresentationBuilder"
weight="1.0" share-termlist="true"/>
<builder class-name="idl.tmt.representation.LinkTextRepresentationBuilder"
weight="1.0" share-termlist="true" binarize="true"/>
</representation>
<transformations>
<transform class-name="idl.tmt.representation.transformations.TermOccurrenceFilter">
<param name="minOccurrences" value="5"/>
</transform>
</transformations>
</index>
The index element takes several attributes (all required):
collection-name identifies this indexed collection for use with clustering, and is used when saving the collection
input-file is the wget log file generated from the web-crawl. This should be an absolute file path.
input-dir is the local root directory from the web crawl.
save-to-file (true/false) indicates whether or not the indexed collection should be saved to disk. In almost all cases, this should be "true" to guard against data loss if the application quits before clustering is complete. The filename that the indexed collection is saved to is <collection-name>.indexedcollection.dat.
parse-filters element
The parse-filters element controls which word-based filters are applied while the HTML documents are parsed. Typically, these filters are used to convert all characters to lower-case, restrict the vocabulary to words of at least 3 characters, or apply a "stop-list" of words to exclude from the analysis. Filters are specified through filter elements within the parse-filters element. The available filters are:
idl.tmt.documentparsing.filters.LengthFilter excludes words shorter than the specified length. This filter takes a single required parameter, minLength, which specifies the minimum word length to allow. This parameter should typically be set to 3.
idl.tmt.documentparsing.filters.LowerCaseFilter converts all characters to lower-case.
idl.tmt.documentparsing.filters.StemFilter applies the Porter Stemmer to each word. The stemmer attempts to remove suffixes from all words, so that words like "president", "presidents" and "presidential" are all treated identically. See the Porter Stemming Algorithm page for more information.
idl.tmt.documentparsing.filters.StopWordFilter removes specific words from the analysis. This filter takes a single required parameter, stopWordFile, which specifies the file to use as the stop-word list. This file should be a plain-text file which contains a single word per line.
idl.tmt.documentparsing.filters.UpperCaseFilter converts all characters to upper-case.
idl.tmt.documentparsing.filters.WordFilter removes all non-alpha-numeric characters from the words.
Typically, only the following filters are used, in this order: WordFilter, LowerCaseFilter, StopWordFilter (with a stop-word list tailored to the specific web site), and LengthFilter (with minLength set to 3).
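The effect of the typical filter order can be sketched as follows. This is an illustrative Python imitation of the filter chain, not the toolkit's Java implementation; all function names here are hypothetical, and the stop-word list is passed in directly rather than read from a stopWordFile.

```python
import re


def word_filter(word):
    # Strip every non-alphanumeric character from the word.
    return re.sub(r"[^A-Za-z0-9]", "", word)


def lower_case_filter(word):
    return word.lower()


def make_stop_word_filter(stop_words):
    stop = set(stop_words)
    return lambda word: None if word in stop else word


def make_length_filter(min_length):
    return lambda word: word if len(word) >= min_length else None


def apply_filters(words, filters):
    """Run each word through the filters in order; drop it once a filter rejects it."""
    kept = []
    for word in words:
        for f in filters:
            word = f(word)
            if not word:  # rejected (None) or reduced to an empty string
                break
        else:
            kept.append(word)
    return kept


filters = [
    word_filter,
    lower_case_filter,
    make_stop_word_filter(["the", "and"]),
    make_length_filter(3),
]
result = apply_filters(["The", "Census-Bureau", "and", "U.S.", "Survey!"], filters)
print(result)
```

The order matters: lower-casing before the stop-word check lets "The" match the stop word "the", and stripping punctuation before the length check means "U.S." is measured as the two-character "us".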
representation element
The representation element specifies which parts of the HTML documents terms are drawn from to create the document representations. Terms can be pulled out of the title of the document, the body of the document, or specific HTML elements within the document. The individual components which extract these terms are called representation builders, and they are specified through the builder element within the representation element. The available builders are:
idl.tmt.representation.BodyTextRepresentationBuilder extracts all terms from within the <body> element of an HTML document, excluding the HTML tags themselves.
idl.tmt.representation.LinkTextRepresentationBuilder extracts terms from anchor tags (<a>) of an HTML document. These terms are not added to the document representation of the document containing the anchor, but to the document that is linked-to by the anchor.
idl.tmt.representation.MetaTextRepresentationBuilder extracts terms from the <meta> tag's content attribute when the name attribute is "keywords", "subject", or "description".
idl.tmt.representation.TitleTextRepresentationBuilder extracts terms from the <title> element.
Typically, the following builders are used: LinkText, MetaText, and TitleText.
All the builder elements also support the following attributes:
weight (optional, number, defaults to 1.0) specifies how strongly the terms from this builder should be weighted compared to the rest of the builders. This should usually be set to "1.0", but can be increased or decreased if desired. The values for this attribute should not differ by more than 2 across representation builders.
binarize (optional, true/false, defaults to false) indicates whether this builder should record binary term occurrence (one or zero) rather than term counts. Most builders should have this attribute set to "false", but the LinkText builder should have it set to "true". This is because many documents can link to a single document using the same text, such as "home". A non-binarized link-text representation would assign very high values to pages that are linked to frequently, and low values to pages that are linked to less often.
share-termlist (optional, true/false, defaults to true) indicates whether this builder should share a term-list with the other builders, or use its own. This should always be set to "true".
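The interaction of the weight and binarize attributes can be illustrated with a small sketch. It assumes that the outputs of the builders are combined by summing weighted values over a shared term list; the manual does not state how the TMT combines them, so treat the combine function below as a hypothetical model rather than the toolkit's behavior.

```python
from collections import Counter


def combine(builder_outputs):
    """Sum weighted term values from several builders for a single document.

    builder_outputs is a list of (terms, weight, binarize) tuples, where terms
    is the list of terms one builder extracted for the document.
    """
    combined = Counter()
    for terms, weight, binarize in builder_outputs:
        for term, count in Counter(terms).items():
            value = 1 if binarize else count
            combined[term] += weight * value
    return dict(combined)


title_terms = ["census", "home"]
# Forty other pages link to this document using the text "home".
link_terms = ["home"] * 40 + ["census"]

binarized = combine([(title_terms, 1.0, False), (link_terms, 1.0, True)])
raw = combine([(title_terms, 1.0, False), (link_terms, 1.0, False)])
print(binarized)
print(raw)
```

Without binarization the uninformative link text "home" dominates the representation; with binarization it carries the same value as any other term that occurs.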
transformations element
The transformations element provides the ability to apply global transformations to the data after parsing has completed. These transformations can include removing uncommon terms, re-weighting terms, or re-weighting documents. The available transformations include:
idl.tmt.representation.transformations.MatrixColumnCenterer centers the term-occurrences (columns of the document-term matrix) around their means.
idl.tmt.representation.transformations.MatrixRowNormalizer normalizes the rows of the document-term matrix so that each document has a length of 1.
idl.tmt.representation.transformations.TermOccurrenceFilter removes terms which occur in fewer than the specified number of documents. This transformation takes a required parameter, minOccurrences, which specifies the minimum number of documents a term must occur in to be retained.
idl.tmt.representation.transformations.TfidfWeighter re-weights all the term-counts with the Term Frequency-Inverse Document Frequency weighting. See Tf Idf Ranking for more information.
Typically, only the TermOccurrenceFilter is used, with minOccurrences set from 5 to 20.
It is important to note that the MatrixColumnCenterer and TfidfWeighter transformations convert sparse representations to dense representations, where almost all of the values in the document-term matrix are non-zero. The memory and disk-space requirements for working with dense matrices are vastly greater than working with sparse matrices. In most cases, it is inadvisable to apply those transformations at the indexing stage.
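A small sketch can show both a document-frequency filter and why column centering destroys sparsity. The functions below are hypothetical Python stand-ins written for this illustration, not the TMT classes; rows are modeled as dictionaries from term to count.

```python
def term_occurrence_filter(rows, min_occurrences):
    """Drop terms that appear in fewer than min_occurrences documents."""
    doc_freq = {}
    for row in rows:
        for term in row:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    keep = {t for t, df in doc_freq.items() if df >= min_occurrences}
    return [{t: n for t, n in row.items() if t in keep} for row in rows]


def center_columns(rows):
    """Subtract each term's mean count from every document, including zeros."""
    terms = sorted({t for row in rows for t in row})
    means = {t: sum(row.get(t, 0) for row in rows) / len(rows) for t in terms}
    return [{t: row.get(t, 0) - means[t] for t in terms} for row in rows]


rows = [
    {"survey": 2, "income": 1},
    {"survey": 3, "typo": 1},
    {"state": 3, "survey": 1},
    {"income": 4},
]
filtered = term_occurrence_filter(rows, 2)
centered = center_columns(filtered)

nonzero_before = sum(len(row) for row in filtered)
nonzero_after = sum(1 for row in centered for v in row.values() if v != 0)
print(filtered)
print(nonzero_before, "non-zero cells before centering,", nonzero_after, "after")
```

After centering, a cell that was zero becomes the negative of the column mean, so nearly every cell must be stored explicitly.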
Component: Clustering
The clustering component is responsible for processing the data produced from the indexing step. The steps of the clustering task are: (1) retrieve the indexed collection, (2) transform the collection (optional), (3) select a portion of the collection as a training-set, (4) configure and train the clusterer, and (5) save the clusterer for future analysis.
A sample section of the configuration that corresponds to the clustering component follows:
<cluster name="name" use-collection="coll-name" save-to-file="true|false">
<clusterer class-name="idl.tmt.clusterers.EnhancedEM">
<param name="initializerName" value="idl.tmt.clusterers.RandomInstancesEMInitializer"/>
<param name="debug" value="false"/>
<param name="maxClusterersToBuild" value="10"/>
<param name="seed" value="50"/>
<param name="minStdDev" value="0.02"/>
<param name="numClusters" value="12"/>
</clusterer>
<training-sets>
<set class-name="idl.tmt.training.RandomSelector">
<param name="instanceCount" value="1500"/>
<param name="seed" value="100"/>
</set>
</training-sets>
</cluster>
The cluster element takes several attributes (all required):
name identifies this clusterer for use with analysis, and is used when saving the clusterer
use-collection corresponds to an indexed collection (the value of the collection-name attribute). This identifies which indexed collection will be used when training the clusterer. First, an indexed collection will be looked for in memory. A collection could be in memory if the indexing step was performed in the same invocation of the tool as the clustering. Otherwise, if no collections exist of this name, the collection will be looked for on disk.
save-to-file (true/false) indicates whether or not the clusterer should be saved to disk. In almost all cases, this should be "true" to guard against data loss if the application dies before analysis is complete. The filename that the clusterer is saved to is <name>.clusterer.dat.
The clusterer element specifies and configures the clustering algorithm to use. This element has one required attribute, class-name, which should be set to idl.tmt.clusterers.EnhancedEM. This is an enhancement of Weka's EM algorithm that has been tailored for use with the Text Mining Toolkit. Several algorithm-specific parameters can be set through the param element:
initializerName specifies how the EM algorithm should be initialized. This value should be set to idl.tmt.clusterers.RandomInstancesEMInitializer.
numClusters specifies the number of clusters to build. Depending on the application, this could be any value above 1.
minStdDev roughly corresponds to the "fuzziness" of the clusters. This should be set to a value between 0.01 and 0.1, where the larger values correspond to "fuzzier" clusters. Typically, this should be set to 0.02.
maxClusterersToBuild specifies how many times the clusterer should be run. Each run develops a statistical model, and different statistical models can be compared based on how well they fit the data. When several statistical models are built, the best one is chosen for the final clustering. This should be set to a value between 1 and 20. Typically, 10 clusterers are built.
seed (optional) specifies the random seed to use. The clustering initialization is based on a random process, and by changing the seed you can force different initial configurations.
debug (true/false) specifies whether or not debugging information should be printed to standard-output during the training of the clusterer.
The training-sets element specifies how to extract a training set from the indexed collection with which to train the clusterer. A training set is usually a smaller subset than the entire indexed collection. When building a training set, you should ensure that you have about 100 documents per cluster. For example, if you are building 12 clusters, make sure that there are at least 1200 documents selected from the indexed collection.
Training set selectors can be specified through the set element within the training-sets element. The available training set selectors are:
idl.tmt.training.FullCollectionSelector selects the entire indexed collection to use as a training set. This should only be used when the indexed collection is relatively small.
idl.tmt.training.RandomSelector randomly selects documents from the indexed collection to use as a training set. This selector takes two parameters: instanceCount (required) specifies how many documents to select, and seed (optional) specifies the random seed to use.
idl.tmt.training.RegexSelector selects documents by matching a regular expression against the document's full path name. This selector takes one required parameter, regex, which specifies the regular expression. If you know of a section of the web site that would make a good training set, that section can be specified through this selector.
Note that several sets can be specified in one training-sets element. In this way, you could specify a specific part of the website, and supplement that with additional randomly selected documents from the entire web site.
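Combining a regular-expression selection with a random supplement can be sketched as below. This is an assumption-laden illustration: the manual does not say whether the TMT matches the expression anywhere in the path or against the whole path, nor how duplicates across sets are handled. The sketch searches anywhere in the path and removes duplicates; the function names and file paths are hypothetical.

```python
import random
import re


def regex_select(paths, pattern):
    """Select every path in which the pattern is found."""
    rx = re.compile(pattern)
    return [p for p in paths if rx.search(p)]


def random_select(paths, instance_count, seed=None):
    """Select up to instance_count paths at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(paths, min(instance_count, len(paths)))


def build_training_set(paths, pattern, instance_count, seed):
    # Union of both selections, without duplicates, in the original order.
    chosen = set(regex_select(paths, pattern))
    chosen |= set(random_select(paths, instance_count, seed))
    return [p for p in paths if p in chosen]


paths = [
    f"/mirror/www.example.com/{section}/page{i}.html"
    for section in ("surveys", "news", "about")
    for i in range(50)
]
training = build_training_set(paths, r"/surveys/", 30, seed=100)

num_clusters = 12
print(len(training), "documents selected; the rule of thumb asks for about", 100 * num_clusters)
```

Comparing the selected count against 100 documents per cluster is a quick way to check whether the training set is large enough before starting a long clustering run.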
The clusterer element also supports a transformations element (not shown in the configuration excerpt above). The parameters of this element are identical to the transformations element above.
Component: Analysis
There are currently two modes of analysis for the text mining toolkit: (1) generating HTML pages which list the documents belonging to a cluster and the top terms for that cluster, and (2) generating a spreadsheet-like file which contains all the document URLs and which cluster they belong to. A sample analysis configuration which shows the parameterization for the HTML analysis follows:
<analysis type="HTMLAnalysis"
name="analysis-name"
use-collection="collection-name"
use-clusterer="clusterer-name"/>
The analysis element does not take any nested elements, but takes four required attributes:
type identifies the type of analysis to be performed with the given clusterer. The possible values for this attribute are HTMLAnalysis, which generates HTML pages, and Table, which generates a spreadsheet-like table of the documents and cluster memberships. More details on these output formats are given below.
name provides a name for this analysis component.
use-collection identifies an indexed collection to use for this analysis. This attribute corresponds to the value of the collection-name attribute of an index element.
use-clusterer identifies a clusterer to use for the analysis. This attribute corresponds to the value of the name attribute of a cluster element.
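Because the three sections refer to each other by name, a mistyped name is an easy error to make. The following sketch, which is not part of the TMT, reads a configuration with Python's standard xml.etree.ElementTree module and reports names that are not defined in the same file. The configuration contents and the check_references function are hypothetical; since the toolkit can also load saved collections from disk, a reported name is not necessarily an error.

```python
import xml.etree.ElementTree as ET

CONFIG = """
<config>
  <index collection-name="census" input-file="/data/wget.log"
         input-dir="/data/mirror" save-to-file="true"/>
  <cluster name="census-em" use-collection="census" save-to-file="true"/>
  <analysis type="HTMLAnalysis" name="census-html"
            use-collection="census" use-clusterer="census-emm"/>
</config>
"""


def check_references(config_xml):
    """Return messages for use-collection / use-clusterer names not defined in this file."""
    root = ET.fromstring(config_xml)
    collections = {e.get("collection-name") for e in root.iter("index")}
    clusterers = {e.get("name") for e in root.iter("cluster")}
    problems = []
    for element in list(root.iter("cluster")) + list(root.iter("analysis")):
        label = f"{element.tag} '{element.get('name')}'"
        collection = element.get("use-collection")
        if collection not in collections:
            problems.append(f"{label}: collection '{collection}' is not defined in this file")
        if element.tag == "analysis":
            clusterer = element.get("use-clusterer")
            if clusterer not in clusterers:
                problems.append(f"{label}: clusterer '{clusterer}' is not defined in this file")
    return problems


problems = check_references(CONFIG)
for message in problems:
    print(message)
```

Here the analysis element refers to "census-emm" while the cluster element is named "census-em", so one message is reported.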
HTML analysis
The HTMLAnalysis type attribute produces a set of Hypertext Markup Language (HTML) documents that can be viewed through a web browser. One document is created for each cluster, and an index document is created which shows global information about the clustering as a whole. The index page displays the size of each cluster, provides a link to the individual cluster pages, and provides the top 10 terms associated with each cluster. An abbreviated example of the index page follows:
Cluster Output for Collection: collection-name
Basic Stats
Number of Docs: 9429
Number of Terms: 1256
Number of Clusters: 12
Detailed Synopses
Overview: Highest log-odds terms for each cluster
cluster 0
| Term | LOR |
| dedicated | 3.5718 |
| recently | 3.2716 |
| hot | 3.2687 |
| webcast | 3.2415 |
| Term | Frequency |
| survey | 195 |
| population | 161 |
| state | 132 |
| income | 125 |
Note that for each cluster, two lists of terms are given. The first list is the top terms by Log-Odds-Ratio (LOR) and the second list is the most frequent terms for that cluster. The LOR is a statistical measure of how strongly a term is associated with a cluster. The LOR favors terms that occur often in this cluster while occurring rarely in other clusters. The frequent terms are the terms that occur in the most documents in this cluster. The term frequency does not take into account how many times the term occurs in other clusters.
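The difference between the two rankings can be made concrete with a sketch of a log-odds ratio. The manual does not give the exact formula the TMT uses, so the version below is an assumption: the natural logarithm of the ratio between the odds of a document containing the term inside the cluster and the odds outside it, with 0.5 added to each count to avoid division by zero. The document counts are invented for illustration.

```python
import math


def log_odds_ratio(in_with, in_total, out_with, out_total, smoothing=0.5):
    """Log of (odds of the term inside the cluster) / (odds outside the cluster)."""
    a = in_with + smoothing                # cluster documents containing the term
    b = in_total - in_with + smoothing     # cluster documents without the term
    c = out_with + smoothing               # other documents containing the term
    d = out_total - out_with + smoothing   # other documents without the term
    return math.log((a * d) / (b * c))


# A frequent term that is also common in the other clusters.
common = log_odds_ratio(in_with=195, in_total=400, out_with=4000, out_total=9000)
# A less frequent term that is almost confined to this cluster.
distinctive = log_odds_ratio(in_with=40, in_total=400, out_with=30, out_total=9000)

print(round(common, 4), round(distinctive, 4))
```

Under this assumed definition the first term would rank high by frequency but low by LOR, while the second would rank the other way around, which is why the two lists on the index page usually differ.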
The cluster pages show more specific data about each cluster. This information includes the top 15 terms, all the documents which belong to that cluster, and the probability that the document belongs to that cluster. An abbreviated example of a cluster page follows:
index
Results for Cluster 2
Terms with highest log-odds
| Term | LOR |
| natinal | 12.2291 |
| substandard | 12.2291 |
| collects | 5.5709 |
| Term | Frequency |
| survey | 99 |
| areas | 74 |
| american | 74 |
Documents ordered by probability on this cluster
The top terms are displayed at the top of the cluster pages. Below these terms is a list of the document URLs belonging to this cluster. Below each document URL is a unique number for that document and a list of the indexing terms used for it. Remember that these terms come from the specific locations in the document defined in the index element above. To the right of the document URL, a table shows the probability of membership for this document in each cluster. Note that you will frequently see a probability of 1.0 for one cluster and 0.0 for all the other clusters. There may be documents at the bottom of the list which have a lower probability of membership in this cluster.
Table analysis
The Table type attribute produces a flat-text file containing the cluster memberships. The format of this file is as follows:
URL1 [tab] 1
URL2 [tab] 2
URL3 [tab] 1,2
where each line starts with the URL of that document, followed by a TAB, followed by the cluster number that the document belongs to. It is possible for a document to belong to more than one cluster if the probability of cluster membership for more than one cluster is greater than 0.20. If this is the case, a list of cluster numbers separated by commas will be in the second column, instead of a single number. The file created will be named <name>.rbdata where name corresponds to the name attribute of this element.
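The table format is easy to post-process. The sketch below, which is not part of the TMT, applies the 0.20 membership threshold to a list of probabilities and groups the URLs of a table file by cluster. The function names are hypothetical, and numbering clusters from zero is an assumption based on the "cluster 0" heading in the HTML output.

```python
from collections import defaultdict


def assign_clusters(probabilities, threshold=0.20):
    """Return the cluster numbers whose membership probability exceeds the threshold."""
    return [i for i, p in enumerate(probabilities) if p > threshold]


def parse_table(text):
    """Group document URLs by cluster number from tab-separated table output."""
    by_cluster = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        url, clusters = line.split("\t")
        for cluster in clusters.split(","):
            by_cluster[int(cluster)].append(url)
    return dict(by_cluster)


memberships = assign_clusters([0.05, 0.55, 0.40])
sample = "URL1\t1\nURL2\t2\nURL3\t1,2\n"
groups = parse_table(sample)
print(memberships)
print(groups)
```

A document listed under several clusters, such as URL3 here, appears in each of the corresponding groups.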