doc2mat - Converting documents into the vector-space format used by CLUTO
doc2mat [options] doc-file mat-file
doc2mat takes as input two arguments. The first argument is the name of the file that stores the documents to be converted into the vector-space format used by CLUTO and the second argument is the name of the file that stores the resulting document-term matrix.
doc2mat converts a set of documents into a vector-space format and stores the resulting document-term matrix in a mat-file that is compatible with CLUTO's clustering algorithms.
The documents are supplied in the file doc-file, and each document must be stored on a single line of that file. As a result, the number of documents (rows) in the resulting document-term matrix is equal to the number of lines in doc-file.
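Because each document must occupy exactly one line, documents that contain line breaks have to be flattened before they are written to doc-file. The following Python sketch shows one way to do that. It is not part of doc2mat, and the helper names to_single_line and build_doc_file are hypothetical.

```python
def to_single_line(document: str) -> str:
    """Collapse every run of white space (including newlines) into one space."""
    return " ".join(document.split())


def build_doc_file(documents: list[str]) -> str:
    """Return the contents of a doc-file: one document per line."""
    return "".join(to_single_line(doc) + "\n" for doc in documents)


docs = ["First document,\nspread over two lines.", "Second   document."]
doc_file_text = build_doc_file(docs)

# The number of lines equals the number of documents, which in turn
# becomes the number of rows of the document-term matrix.
print(len(doc_file_text.splitlines()))
```

Writing doc_file_text to a file then gives an input in which line i corresponds to row i of the matrix.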
doc2mat supports both word stemming (using Porter's stemming algorithm) and stop-word elimination. It contains a default list of stop-words, which can be either ignored or augmented by providing a file containing additional words to be eliminated. This user-supplied stop-list file is specified with the -mystoplist option and should contain a white-space separated list of words, which can all be on the same line or spread over multiple lines. Note that stop-word elimination occurs before stemming, so the user-supplied stop words should not be stemmed.
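To illustrate the stop-list file format, here is a minimal Python sketch of how a white-space separated list could be read and combined with a default list. It is an illustration only, not doc2mat's own code: DEFAULT_STOPLIST is a tiny hypothetical stand-in for the built-in list, and the function names are assumptions.

```python
# Hypothetical stand-in for the built-in stop-list; the real list is larger.
DEFAULT_STOPLIST = {"the", "a", "of"}


def parse_stoplist(text):
    """Split the file contents on any white space; line layout does not matter."""
    return set(text.split())


def effective_stoplist(user_text, use_default=True):
    """Combine the user-supplied words with the default list, or use them alone."""
    words = parse_stoplist(user_text)
    if use_default:
        words |= DEFAULT_STOPLIST
    return words


# Two words on the first line, two more on the second (tab separated).
user_file_contents = "running clusters\nmatrix\tvector\n"
stoplist = effective_stoplist(user_file_contents)
print(sorted(stoplist))
```

The example lists "running" rather than a stemmed form, because tokens are compared against the stop-list before they are stemmed.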
The tokenization performed by doc2mat is quite straightforward. It starts by replacing all non-alphanumeric characters with spaces, and then uses the white-space characters to break up the line into tokens. Each token is then checked against the stop-list, and any token not found there is stemmed. The -skipnumeric option forces doc2mat to eliminate any tokens that contain numeric digits. If the -tokfile option is specified, doc2mat also creates a file called mat-file.tokens, in which each line stores the tokenized form of the corresponding document.
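The tokenization steps described above can be sketched in Python as follows. This is an approximation under stated assumptions, not doc2mat's implementation: "alphanumeric" is taken to mean ASCII letters and digits, no case folding is applied, and placeholder_stem is a hypothetical stand-in that returns tokens unchanged, since Porter's stemmer is not in the Python standard library.

```python
import re


def placeholder_stem(token):
    """Stand-in for Porter's stemmer (assumption): returns the token as-is."""
    return token


def tokenize(line, stoplist, skip_numeric=False, stem=placeholder_stem):
    """Tokenize one document line following the steps in the text."""
    # 1. Replace every non-alphanumeric character with a space.
    cleaned = re.sub(r"[^A-Za-z0-9]", " ", line)
    tokens = []
    # 2. Break the line into tokens on white space.
    for token in cleaned.split():
        # 3. Drop tokens found in the stop-list (checked before stemming).
        if token in stoplist:
            continue
        # 4. Optionally drop tokens that contain a digit (like -skipnumeric).
        if skip_numeric and any(ch.isdigit() for ch in token):
            continue
        # 5. Stem whatever remains.
        tokens.append(stem(token))
    return tokens


example = tokenize("the state-of-the-art k-means, v2!", {"the", "of"}, skip_numeric=True)
print(" ".join(example))
```

Joining the tokens of each document with spaces, one document per line, resembles what the text describes for the mat-file.tokens file.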
Some of the leading fields of each line can store document-specific information (e.g., document identifier, class label, etc.), and they can be ignored by using the -nlskip option. When -nlskip is greater than zero, that many leading tokens are treated as the label of each row and are written to the file called mat-file.rlabel.
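As a rough illustration of the -nlskip behavior, the Python sketch below separates the leading fields of a line from the document text. It assumes that fields are delimited by white space and that the skipped fields are joined with single spaces to form the row label; neither detail is confirmed by the text above, and split_label is a hypothetical name.

```python
def split_label(line, nlskip):
    """Return (label, text): the first nlskip fields and the rest of the line."""
    if nlskip <= 0:
        return "", line
    # Split off at most nlskip leading fields; the remainder stays intact.
    fields = line.split(None, nlskip)
    label = " ".join(fields[:nlskip])
    text = fields[nlskip] if len(fields) > nlskip else ""
    return label, text


# Two leading fields: a document identifier and a class label.
label, text = split_label("doc17 sports the match ended in a draw", 2)
print(label)
print(text)
```

Here label would go to the row-label file and only text would be tokenized.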
George Karypis <karypis@cs.umn.edu>