NAME

doc2mat - Converting documents into the vector-space format used by CLUTO

SYNOPSIS

doc2mat [options] doc-file mat-file

ARGUMENTS

doc2mat takes as input two arguments. The first argument is the name of the file that stores the documents to be converted into the vector-space format used by CLUTO and the second argument is the name of the file that stores the resulting document-term matrix.

doc-file: This is the name of the file that stores the documents, one document per line.
mat-file: This is the name of the file that will store the generated CLUTO-compatible mat file. It is also used as the file stem for the .clabel file and, where applicable, the .rlabel file.

OPTIONS

-nostem: Disable word stemming. By default all words are stemmed.
-nostop: Disable the elimination of stop words using the internal list of stop words. By default stop words are eliminated.
-mystoplist=file: Specifies a user-supplied file containing local stop words. If the -nostop option has also been specified, the words in this file effectively replace the internal stop-word list.
-skipnumeric: Specifies that any words that contain numeric digits are to be eliminated. By default, a token that contains numeric digits is retained.
-minwlen=int: Specifies the length of the smallest token to be kept prior to stemming. The default value is three.
-nlskip=int: Indicates the number of leading tokens to be ignored during text processing. This parameter is useful for ignoring any document identifier information that may be in the beginning of each document line. The default value is zero.
-tokfile: Writes the token representation of each document, after tokenization and any stemming and stop-word elimination, to the file mat-file.tokens.
-help: Displays this information.
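
As an illustration of how the options above combine on the command line, the following Python sketch assembles an argument list for doc2mat. The helper name build_doc2mat_command and the assumption that the program is reachable as "doc2mat" on the search path are hypothetical; only the option spellings come from the list above.

```python
def build_doc2mat_command(doc_file, mat_file, *, nostem=False, nostop=False,
                          mystoplist=None, skipnumeric=False, minwlen=None,
                          nlskip=None, tokfile=False, executable="doc2mat"):
    """Return an argv list following: doc2mat [options] doc-file mat-file.

    Hypothetical helper; 'executable' is an assumed name or path.
    """
    cmd = [executable]
    if nostem:
        cmd.append("-nostem")
    if nostop:
        cmd.append("-nostop")
    if mystoplist is not None:
        cmd.append(f"-mystoplist={mystoplist}")
    if skipnumeric:
        cmd.append("-skipnumeric")
    if minwlen is not None:
        cmd.append(f"-minwlen={minwlen}")
    if nlskip is not None:
        cmd.append(f"-nlskip={nlskip}")
    if tokfile:
        cmd.append("-tokfile")
    # The two positional arguments always come last.
    cmd += [doc_file, mat_file]
    return cmd
```

The resulting list can be passed to subprocess.run; options left at None or False are omitted so that doc2mat's own defaults apply.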

DESCRIPTION

doc2mat converts a set of documents into a vector-space format and stores the resulting document-term matrix in a mat-file that is compatible with CLUTO's clustering algorithms.

The documents are supplied in the file doc-file, and each document must be stored on a single line in that file. As a result, the total number of documents in the resulting document-term matrix will be equal to the number of lines in the file doc-file.

doc2mat supports both word stemming (using Porter's stemming algorithm) and stop-word elimination. It contains a default list of stop words that can be either ignored or augmented by providing a file containing a list of additional words to be eliminated. This user-supplied stop-list file is supplied using the -mystoplist option and should contain a white-space separated list of words. These words can all be on the same line or spread over multiple lines. Note that stop-word elimination occurs before stemming, so the user-supplied stop words should not be stemmed.
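
The stop-list file format described above (white-space separated words on one or more lines) can be read with a few lines of Python. This is a minimal sketch, not doc2mat's own code; the function names parse_stoplist and load_stoplist are hypothetical, and UTF-8 encoding is an assumption.

```python
def parse_stoplist(text):
    """Split stop-list text on any white space (spaces, tabs, newlines)."""
    return set(text.split())


def load_stoplist(path):
    """Read a stop-list file; the UTF-8 encoding is an assumption."""
    with open(path, encoding="utf-8") as handle:
        return parse_stoplist(handle.read())
```

Because the words are matched before stemming, the file should list surface forms such as "running" rather than stems such as "run".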

The tokenization performed by doc2mat is straightforward. It starts by replacing all non-alphanumeric characters with spaces, and then uses the white-space characters to break up the line into tokens. Each of these tokens is then checked against the stop list, and if it is not there it is stemmed. By using the -skipnumeric option you can force doc2mat to eliminate any tokens that contain numeric digits. Also, if the -tokfile option is specified, doc2mat will create a file called mat-file.tokens, in which each line stores the tokenized form of the corresponding document.
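
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not doc2mat's actual implementation: the function name tokenize is hypothetical, "alphanumeric" is taken to mean ASCII letters and digits, the order of the length, numeric, and stop-list checks relative to each other is assumed, case is left unchanged because the text does not describe case handling, and the stemmer is a pluggable placeholder (identity by default) rather than a real Porter stemmer.

```python
import re


def tokenize(line, stop_words=frozenset(), min_len=3,
             skip_numeric=False, stem=lambda token: token):
    """Tokenize one document line roughly as the description outlines."""
    # Replace every non-alphanumeric character with a space
    # (ASCII-only matching is an assumption).
    cleaned = re.sub(r"[^A-Za-z0-9]", " ", line)
    tokens = []
    for token in cleaned.split():
        # Mirrors -minwlen: drop short tokens before stemming.
        if len(token) < min_len:
            continue
        # Mirrors -skipnumeric: drop tokens containing any digit.
        if skip_numeric and any(ch.isdigit() for ch in token):
            continue
        # Stop-word check happens before stemming.
        if token in stop_words:
            continue
        tokens.append(stem(token))
    return tokens
```

For example, tokenize("the quick-brown fox, v2 runs!", stop_words={"the"}) yields ["quick", "brown", "fox", "runs"]: the hyphen and punctuation become spaces, "the" is a stop word, and "v2" is shorter than the default minimum length of three.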

Some of the leading fields of each line may store document-specific information (e.g., document identifier, class label, etc.), and they can be ignored by using the -nlskip option. When -nlskip is greater than zero, the skipped leading tokens are treated as the label of each row and are written to the file called mat-file.rlabel.
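
The effect of -nlskip on a single line can be illustrated with the short Python sketch below. It assumes that the leading tokens are white-space separated fields of the raw line and that the label is the skipped fields joined by single spaces; neither detail is stated above, and the function name split_label is hypothetical.

```python
def split_label(line, nlskip=0):
    """Separate the first nlskip fields (the row label) from the document text."""
    fields = line.split()
    label = " ".join(fields[:nlskip])   # would go to mat-file.rlabel
    body = " ".join(fields[nlskip:])    # remaining text to be tokenized
    return label, body
```

With nlskip=2, the line "doc42 sports the game was close" is split into the label "doc42 sports" and the text "the game was close".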

AUTHOR

George Karypis <karypis@cs.umn.edu>