Text Mining Toolkit: User's Manual

Introduction

Overview

The Text Mining Toolkit (TMT) is a software tool to aid in the automatic discovery of topics in a corpus of text or HTML (Hypertext Markup Language) documents. This toolkit includes components to parse a body of documents, apply data mining algorithms to those documents, and analyze the results of those algorithms. The toolkit provides a Java™ Application Programming Interface (API) for software developers to write programs utilizing the parsing and analysis facilities. It also provides a simple, configuration-file-driven, command-line interface to automate the parsing and analysis of a large collection of documents. This manual will only cover the TMT command-line tool, not the TMT API.

The basic steps for using the command-line tool are outlined below. Each step is explained in further detail later in this manual.
  1. Mirror the web site you want to analyze with wget.
  2. Index the mirrored collection of web documents, which involves parsing the documents and extracting term-counts from specified parts of the documents.
  3. Apply a clustering algorithm to the indexed collection.
  4. Manually evaluate the output of the clustering, labeling the clusters or adjusting the clustering parameters as necessary.

License

The Text Mining Toolkit is released under the GNU General Public License. A copy of the license should have been distributed with the software, and can also be downloaded from the GNU General Public License website.

Configuration

The TMT command-line interface is highly configurable via an XML-based configuration file (sample file). This configuration file has a simple yet powerful format, which is described in detail throughout this document. The organization of the configuration file corresponds to the high-level organization of the toolkit:

    <config>
        <index>
            ... indexing configuration details ...
        </index>
        
        <cluster>
            ... clustering configuration details ...
        </cluster>
        
        <analysis>
            ... analysis configuration details ...
        </analysis>
    </config>
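
As a quick sanity check of this layout, the short Python sketch below parses a configuration skeleton and reports which of the three top-level sections are present. It is not part of the TMT; the function name top_level_sections and the empty section bodies are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Skeleton mirroring the layout above; the section bodies are left empty here.
CONFIG_TEXT = """
<config>
    <index/>
    <cluster/>
    <analysis/>
</config>
"""

def top_level_sections(config_text):
    """Return the tag names of the elements directly under <config>."""
    root = ET.fromstring(config_text.strip())
    if root.tag != "config":
        raise ValueError("expected <config> as the root element")
    return [child.tag for child in root]

sections = top_level_sections(CONFIG_TEXT)
missing = {"index", "cluster", "analysis"} - set(sections)
print("sections:", sections, "missing:", sorted(missing))
```

The same function could be pointed at the text of a real configuration file before a long indexing run, to catch a misspelled section name early.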

Software Dependencies

The TMT has several software dependencies. All of the necessary Java libraries are included with the Text Mining Toolkit. The toolkit also uses the output of a web crawl made with wget, a freely available Unix/Linux tool.

Running the TMT Command-line Tool

To run the Text Mining Toolkit command-line interface, the following requirements must be met:
  1. Java™ must be in the user's path.
  2. The TMT jar file and the jar files of its dependencies must be in the user's CLASSPATH environment variable.
  3. A log file must have been produced from a web crawl using wget.
  4. A TMT configuration file must be present.
The Text Mining Toolkit comes with a simple script to invoke the command-line interface. This script is configured to set up the CLASSPATH variable correctly, but you may need to modify the script to suit your installation environment. This script should run on most UNIX/Linux installations.

A Note on using wget

The freely available tool wget is required for running the command-line TMT. This tool comes preinstalled with most Linux distributions, and can also be downloaded from the GNU wget web site.

wget is used to mirror and create a local copy of the web site you are interested in clustering. The log-file generated by wget and the downloaded documents are used as input to the TMT command line tool. In order to produce a properly formatted wget log file, the following options must be used:

    wget -nv -o log-file [other options]
where log-file is the log file to be created and other options are the other arguments required by wget to mirror a web site. See the GNU wget web site for detailed information on the available options. The following example command downloads the first three levels of a web site and produces the proper output:

    wget -nv -o wget.log -r -l 3 -A html,htm -E -np http://www.example.com

Toolkit Components

There are three major components to the TMT: indexing, clustering, and analysis. The indexing component is responsible for reading files off disk, converting those files to structured data, and applying transformations to that data. The clustering component is responsible for selecting a clustering algorithm, configuring that algorithm, applying more transformations to the data (if necessary), and applying the clustering algorithm to the data. Finally, the analysis component is responsible for applying the trained clusterer and converting its results into useful, understandable output.

Component: Indexing

The indexing component of the toolkit is responsible for parsing a mirrored web site and converting HTML documents into structured data which can be used by the clustering algorithms. This structured data is referred to as an indexed collection. For each HTML document in the collection, a series of numbers is produced. Each number corresponds to how many times a word occurs in the document. This series of numbers for each document is called a document representation. These document representations together make up the document-term matrix, which can be thought of as a grid, where the rows are documents, the columns are terms, and each cell of the grid is the number of times a term occurs in the corresponding document.

It is important to note that most words do not occur in most documents, so most cells of the document-term matrix are zero. This type of dataset is known as a sparse dataset. When writing an indexed collection to disk, it is most efficient to store the data in a sparse format, which records only the non-zero counts.
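
The difference between the two storage forms can be seen in a small Python sketch. The documents and terms below are hypothetical, and the dictionaries only illustrate the idea; they are not the on-disk format used by the TMT.

```python
from collections import Counter

# Hypothetical, already-tokenized documents (not real TMT output).
docs = {
    "doc0": ["survey", "income", "survey"],
    "doc1": ["population", "state"],
}

# Column order of the document-term matrix: one column per distinct term.
terms = sorted({t for words in docs.values() for t in words})

# Sparse form: per document, store only the terms that actually occur.
sparse_rows = {name: dict(Counter(words)) for name, words in docs.items()}

# Dense form: every cell is stored, including the zeros.
dense_rows = {name: [sparse_rows[name].get(t, 0) for t in terms] for name in docs}

stored_sparse = sum(len(row) for row in sparse_rows.values())
stored_dense = sum(len(row) for row in dense_rows.values())
print(terms)
print(dense_rows)
print(stored_sparse, "cells stored sparsely versus", stored_dense, "densely")
```

With thousands of documents and over a thousand terms, the gap between the two counts grows quickly, which is why the sparse form is preferred.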

A sample section of the configuration that corresponds to the indexing component is given below. An explanation of each element follows.

    <index collection-name="name"
           input-file="input-file"
           input-dir="input-dir"
           save-to-file="true|false">

        <parse-filters>
            <filter class-name="idl.tmt.documentparsing.filters.WordFilter"/>
            <filter class-name="idl.tmt.documentparsing.filters.LowerCaseFilter"/>
            <filter class-name="idl.tmt.documentparsing.filters.StopWordFilter">
                <param name="stopWordFile" value="stopList.txt"/>
            </filter>
        </parse-filters>

        <representation>
            <builder class-name="idl.tmt.representation.TitleTextRepresentationBuilder"
                weight="1.0" share-termlist="true"/>
            <builder class-name="idl.tmt.representation.MetaTextRepresentationBuilder"
                weight="1.0" share-termlist="true"/>
            <builder class-name="idl.tmt.representation.LinkTextRepresentationBuilder"
                weight="1.0" share-termlist="true" binarize="true"/>
        </representation>

        <transformations>
            <transform class-name="idl.tmt.representation.transformations.TermOccurrenceFilter">
                <param name="minOccurrences" value="5"/>
            </transform>
        </transformations>

    </index>
The index element takes several attributes (all required): collection-name, input-file, input-dir, and save-to-file.

parse-filters element

The parse-filters element controls which word-based filters are applied while the HTML documents are parsed. Typically, these filters are used to convert all characters to lower case, restrict the terms to words of at least 3 characters, or apply a "stop-list" of words to exclude from the analysis. Filters are specified through filter elements within the parse-filters element. Typically, only the following filters are used, in this order: WordFilter, LowerCaseFilter, StopWordFilter (with a stop-word list tailored to the specific web site), and LengthFilter (with minLength set to 3).
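
The effect of chaining filters in this order can be sketched in Python. The functions below are hypothetical stand-ins named after the TMT filters; their exact behaviour (for example, that the word filter keeps only alphabetic tokens) is an assumption, not a description of the Java classes.

```python
import re

def word_filter(tokens):
    # Assumed behaviour: keep only tokens made up entirely of letters.
    return [t for t in tokens if re.fullmatch(r"[A-Za-z]+", t)]

def lower_case_filter(tokens):
    return [t.lower() for t in tokens]

def stop_word_filter(tokens, stop_words):
    return [t for t in tokens if t not in stop_words]

def length_filter(tokens, min_length=3):
    return [t for t in tokens if len(t) >= min_length]

def apply_filters(text, stop_words):
    """Apply the four filters in the order recommended above."""
    tokens = text.split()
    tokens = word_filter(tokens)
    tokens = lower_case_filter(tokens)
    tokens = stop_word_filter(tokens, stop_words)
    return length_filter(tokens, min_length=3)

filtered = apply_filters("The Survey of 2020 is on Income", {"the", "and"})
print(filtered)
```

The order matters in this sketch: lower-casing runs before the stop-word filter so that "The" matches the stop-list entry "the".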

representation element

The representation element specifies which parts of the HTML documents terms are drawn from to create the document representations. Terms can be pulled from the title of the document, the body of the document, or specific HTML elements within the document. The individual components which extract these terms are called representation builders, and they are specified through builder elements within the representation element. Typically, the following builders are used: LinkText, MetaText, and TitleText.

All the builder elements also support the following attributes, as used in the sample above: weight, share-termlist, and binarize.

transformations element

The transformations element provides the ability to apply global transformations to the data after parsing has completed. These transformations can include removing uncommon terms, re-weighting terms, or re-weighting documents. The available transformations include TermOccurrenceFilter, MatrixColumnCenterer, and TfidfWeighter. Typically, only the TermOccurrenceFilter is used, with minOccurrences set between 5 and 20.
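
A minimal Python sketch of removing uncommon terms follows. It assumes the threshold applies to a term's total count across the whole collection; the real TermOccurrenceFilter may instead count the number of documents containing the term. The function name and data are hypothetical.

```python
from collections import Counter

def filter_rare_terms(rows, min_occurrences):
    """Drop terms whose total count across all documents is below the threshold.

    Assumption: the threshold is applied to total occurrences; the real
    TermOccurrenceFilter may count documents instead.
    """
    totals = Counter()
    for row in rows:
        totals.update(row)  # adds each term's count to the running total
    keep = {term for term, count in totals.items() if count >= min_occurrences}
    return [{t: c for t, c in row.items() if t in keep} for row in rows]

# Sparse rows: one dictionary of term counts per document.
rows = [{"survey": 4, "webcast": 1}, {"survey": 3, "income": 5}]
filtered_rows = filter_rare_terms(rows, min_occurrences=5)
print(filtered_rows)
```

Raising the threshold shrinks the term list, which reduces the number of columns the clusterer has to process.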

It is important to note that the MatrixColumnCenterer and TfidfWeighter transformations convert sparse representations to dense representations, where almost all of the values in the document-term matrix are non-zero. The memory and disk-space requirements for working with dense matrices are vastly greater than those for sparse matrices. In most cases, it is inadvisable to apply those transformations at the indexing stage.
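
To see why centering destroys sparsity, the Python sketch below subtracts each column's mean from a small, mostly zero matrix and counts the non-zero cells before and after. This is a generic illustration of column centering, not the MatrixColumnCenterer implementation, and the counts are made up.

```python
def center_columns(matrix):
    """Subtract each column's mean from every entry in that column."""
    n_rows = len(matrix)
    means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]

def count_nonzero(matrix):
    return sum(1 for row in matrix for value in row if value != 0)

# Hypothetical term counts: 4 documents (rows) by 4 terms (columns).
counts = [
    [2, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 4],
    [0, 0, 0, 0],
]
centered = center_columns(counts)
print(count_nonzero(counts), "non-zero cells before centering")
print(count_nonzero(centered), "non-zero cells after centering")
```

Every column that contains at least one non-zero count has a non-zero mean, so after centering the zeros in that column become non-zero as well.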

Component: Clustering

The clustering component is responsible for processing the data produced from the indexing step. The steps of the clustering task are: (1) retrieve the indexed collection, (2) transform the collection (optional), (3) select a portion of the collection as a training-set, (4) configure and train the clusterer, and (5) save the clusterer for future analysis.

A sample section of the configuration that corresponds to the clustering component follows:

    <cluster name="name" use-collection="coll-name" save-to-file="true|false">

        <clusterer class-name="idl.tmt.clusterers.EnhancedEM">
            <param name="initializerName" value="idl.tmt.clusterers.RandomInstancesEMInitializer"/>
            <param name="debug" value="false"/>
            <param name="maxClusterersToBuild" value="10"/>
            <param name="seed" value="50"/>
            <param name="minStdDev" value="0.02"/>
            <param name="numClusters" value="12"/>
        </clusterer>

        <training-sets>
            <set class-name="idl.tmt.training.RandomSelector">
                <param name="instanceCount" value="1500"/>
                <param name="seed" value="100"/>
            </set>
        </training-sets>

    </cluster>
The cluster element takes several attributes (all required): name, use-collection, and save-to-file.

clusterer element

The clusterer element specifies and configures the clustering algorithm to use. This element has one required attribute, class-name, which should be set to idl.tmt.clusterers.EnhancedEM. This is an enhancement of Weka's EM algorithm that has been tailored for use with the Text Mining Toolkit. Several algorithm-specific parameters can be set through the param element; the sample above sets initializerName, debug, maxClusterersToBuild, seed, minStdDev, and numClusters.

training-sets element

The training-sets element specifies how to extract a training set from the indexed collection with which to train the clusterer. A training set is usually a subset of the entire indexed collection. When building a training set, you should ensure that you have about 100 documents per cluster. For example, if you are building 12 clusters, make sure that at least 1200 documents are selected from the indexed collection.
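
The sizing rule and a seeded random selection can be sketched in Python as follows. The function names and the collection size are hypothetical; the sketch only mimics what a random selector with instanceCount and seed parameters might do and is not the RandomSelector implementation.

```python
import random

def minimum_training_size(num_clusters, docs_per_cluster=100):
    """Rule of thumb from the text: about 100 documents per cluster."""
    return num_clusters * docs_per_cluster

def select_training_set(doc_ids, instance_count, seed):
    """Pick instance_count documents at random, reproducibly for a given seed."""
    if instance_count > len(doc_ids):
        raise ValueError("instance_count exceeds the collection size")
    return random.Random(seed).sample(doc_ids, instance_count)

doc_ids = list(range(9000))          # hypothetical collection of 9000 documents
needed = minimum_training_size(12)   # 12 clusters
training = select_training_set(doc_ids, instance_count=1500, seed=100)
print(needed, len(training), len(training) >= needed)
```

Using a fixed seed makes the selection repeatable, so two runs with the same configuration train on the same documents.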

Training set selectors are specified through set elements within the training-sets element; the sample above uses idl.tmt.training.RandomSelector. Note that several sets can be specified in one training-sets element. In this way, you could select a specific part of the web site and supplement it with additional randomly selected documents from the entire web site.

transformations element

The clusterer element also supports a transformations element (not shown in the configuration excerpt above). The parameters of this element are identical to those of the transformations element described for the indexing component.

Component: Analysis

There are currently two modes of analysis for the text mining toolkit: (1) generating HTML pages which list the documents belonging to a cluster and the top terms for that cluster, and (2) generating a spreadsheet-like file which contains all the document URLs and which cluster they belong to. A sample analysis configuration which shows the parameterization for the HTML analysis follows:

    <analysis type="HTMLAnalysis" 
            name="analysis-name" 
            use-collection="collection-name" 
            use-clusterer="clusterer-name"/>
The analysis element does not take any nested elements, but takes four required attributes: type, name, use-collection, and use-clusterer.

HTML analysis

The HTMLAnalysis type produces a set of Hypertext Markup Language (HTML) documents that can be viewed in a web browser. One document is created for each cluster, and an index document is created which shows global information about the clustering as a whole. The index page displays the size of each cluster, provides a link to each individual cluster page, and lists the top 10 terms associated with each cluster. An abbreviated example of the index page follows:

Cluster Output for Collection: collection-name

Basic Stats

Number of Docs: 9429
Number of Terms: 1256
Number of Clusters: 12

Detailed Synopses

Overview: Highest log-odds terms for each cluster


cluster 0

Term LOR
dedicated 3.5718
recently 3.2716
hot 3.2687
webcast 3.2415
Term Frequency
survey 195
population 161
state 132
income 125

Note that for each cluster, two lists of terms are given. The first list is the top terms by Log-Odds-Ratio (LOR) and the second list is the most frequent terms for that cluster. The LOR is a statistical measure of how strongly a term is associated with a cluster. The LOR favors terms that occur often in this cluster while occurring rarely in other clusters. The frequent terms are the terms that occur in the most documents in this cluster. The term frequency does not take into account how many times the term occurs in other clusters.
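
The manual does not give the exact formula the TMT uses for the LOR. As a rough illustration only, the Python sketch below uses one common definition: the log of the odds of a term appearing in a document of the cluster, divided by the odds of it appearing in a document outside the cluster. The smoothing constant and all counts are assumptions.

```python
import math

def log_odds_ratio(in_with, in_total, out_with, out_total, smoothing=0.5):
    """Log-odds ratio of a term for one cluster versus all other clusters.

    in_with / in_total:   documents containing the term / all documents, inside the cluster
    out_with / out_total: the same two counts over every other cluster
    The smoothing constant (an assumption) avoids division by zero.
    """
    odds_in = (in_with + smoothing) / (in_total - in_with + smoothing)
    odds_out = (out_with + smoothing) / (out_total - out_with + smoothing)
    return math.log(odds_in / odds_out)

# Hypothetical counts: a frequent but widespread term versus a rarer, cluster-specific one.
widespread = log_odds_ratio(in_with=195, in_total=800, out_with=2000, out_total=8000)
specific = log_odds_ratio(in_with=40, in_total=800, out_with=10, out_total=8000)
print(round(widespread, 3), round(specific, 3))
```

Under this definition a term that is frequent everywhere scores near zero even though it tops the frequency list, while a less frequent term concentrated in one cluster scores high. This is consistent with the two lists on the index page containing different terms.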

The cluster pages show more specific data about each cluster. This information includes the top 15 terms, all the documents which belong to that cluster, and the probability that the document belongs to that cluster. An abbreviated example of a cluster page follows:

index

Results for Cluster 2

Terms with highest log-odds

Term LOR
natinal 12.2291
substandard 12.2291
collects 5.5709
Term Frequency
survey 99
areas 74
american 74

Documents ordered by probability on this cluster

Document                                            C  C0 C1 C2 C3
http://www.example.org/                             2  0  0  1  0
    0:[example, terms, for, example, site]
http://www.example.org/special/                     2  0  0  1  0
    1:[special, example, terms, for, example, site]

The top terms are displayed at the top of the cluster pages. Below these terms is a list of the document URLs belonging to this cluster. Below each document URL is a unique number for the document and a list of the indexing terms used for it. Remember that these terms come from the specific locations in the document defined in the index element above. To the right of the document URL, a table shows the probability of membership for this document in each cluster. Note that you will frequently see a probability of 1.0 for one cluster and 0.0 for all the other clusters. There may be documents at the bottom of the list which have a lower probability of membership in this cluster.

Table analysis

The Table type attribute produces a flat-text file containing the cluster memberships. The format of this file is as follows:
    URL1 [tab] 1
    URL2 [tab] 2
    URL3 [tab] 1,2
where each line starts with the URL of that document, followed by a TAB, followed by the cluster number that the document belongs to. It is possible for a document to belong to more than one cluster if the probability of cluster membership for more than one cluster is greater than 0.20. If this is the case, a list of cluster numbers separated by commas will be in the second column, instead of a single number. The file created will be named <name>.rbdata where name corresponds to the name attribute of this element.
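
The following Python sketch shows how a line in this format could be produced from a document's membership probabilities and read back again. The function names are hypothetical, and the sketch assumes that cluster numbers are the zero-based positions in the probability list; the file naming and any other details of the real output are not modeled.

```python
MEMBERSHIP_THRESHOLD = 0.20  # threshold stated in the text above

def clusters_for(probabilities, threshold=MEMBERSHIP_THRESHOLD):
    """Return every cluster whose membership probability exceeds the threshold."""
    return [i for i, p in enumerate(probabilities) if p > threshold]

def format_line(url, clusters):
    """Build one line: URL, a TAB, then comma-separated cluster numbers."""
    return url + "\t" + ",".join(str(c) for c in clusters)

def parse_line(line):
    """Split one line back into the URL and its list of cluster numbers."""
    url, cluster_field = line.rstrip("\n").split("\t")
    return url, [int(c) for c in cluster_field.split(",")]

line = format_line("http://www.example.org/", clusters_for([0.05, 0.55, 0.40]))
parsed = parse_line(line)
print(repr(line), parsed)
```

A reader built this way can load the whole file into a dictionary keyed by URL, which is convenient for joining the cluster memberships with other data about the web site.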