AFGEN is a program that generates descriptor spaces for chemical compound(s). The descriptor space consists of graph fragments that can have three different types of topologies: paths (PF), acyclic subgraphs (AF), and arbitrary topology subgraphs (GF).
Once you download AFGEN, you need to uncompress and untar it using the following command:
> tar -xzf afgen-2.0.tar.gz
This will create a directory named afgen-2.0 with the following structure:

    afgen-2.0\
        builds\
            Linux-i686\
            Linux-x86_64\
        doc\
        examples\
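Given this layout, the path to the binary for the current machine can be derived programmatically. The following is a minimal sketch, assuming that the build directories are always named <system>-<machine> (as Linux-i686 and Linux-x86_64 suggest) and that the executable inside them is named afgen. The helper name afgen_binary_path is hypothetical and is not part of the AFGEN distribution.

```python
import platform
from pathlib import Path


def afgen_binary_path(root, system=None, machine=None):
    """Return the expected location of the afgen executable under `root`.

    Assumption: build directories are named "<system>-<machine>", matching
    the Linux-i686 and Linux-x86_64 directories shipped in the archive.
    """
    system = system or platform.system()      # e.g. "Linux"
    machine = machine or platform.machine()   # e.g. "x86_64"
    return Path(root) / "builds" / f"{system}-{machine}" / "afgen"


if __name__ == "__main__":
    # Prints the path for the machine this script runs on.
    print(afgen_binary_path("afgen-2.0"))
```

If the printed path does not exist on your machine, no prebuilt binary matching that naming pattern was shipped for your architecture.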
The executable is named afgen and is located in the architecture-specific subdirectory under builds. The afgen program is invoked at the command line within a shell window (e.g., xterm, Gnome terminal, etc.).

The afgen program can be used in two different modes. In the first mode, referred to as the descriptor-space generation mode (DSGM), afgen reads a library of compounds, finds all the fragment-based descriptors, and then represents each library compound as a frequency vector in that descriptor space.

In the second mode, referred to as the descriptor-space projection mode (DSPM), afgen takes as input a library of compounds and a previously generated set of afgen descriptors (using the -fragfile option) and then represents each library compound as a frequency vector in that descriptor space. Note that in this second mode, no new fragments (i.e., descriptors) are generated.
The command-line usage of afgen is the following (it can be obtained by typing afgen -help):
    Usage
        afgen [options] filename

    Required parameters
        filename
            The file containing the library of compounds. The compounds are
            specified using MDL's SD file format (http://www.mdl.com) or
            Tripos's mol2 format (http://www.tripos.com). The specific format
            of the library is determined by the exension of the file, which
            can be either .sdf or .mol2.

    Optional parameters
        -fragtype=string
            Specifies the type of fragment-based descriptors to be generated.
            Supported options are:
                gf    arbitrary fragments [default]
                af    acyclic fragments
                pf    path fragments

        -lmin=int
            Specifies the minimum length (in terms of bonds) of the generated
            fragments. The default value is 3.

        -lmax=int
            Specifies the maximum length (in terms of bonds) of the generated
            fragments. The default value is 7.

        -fmin=int
            Specifies the minimum frequency that a fragment must have before
            it becomes a descriptor. The frequency of a fragment is based on
            the number of distinct compounds that it occurs at. The default
            value is 1 (i.e., all fragments are treated as descriptors).

        -noh
            This option forces afgen to remove any hydrogen atoms from the
            compounds.

        -noatyping
            Forces afgen to ignore the atom typing specified in the input
            file (if any). If this option is used, then only the basic atom
            types are used (e.g., P, N, O, etc.). This option applies only to
            inputs files that use the mol2 format.

        -armark
            Detects aromatic rings in SDF files and relabels the bonds as
            aromatic.

        -outfile=string
            Specifies the stem of the output files (.out & .frags.[sdf, mol2]).
            If outfile is not specified then the output stem is the same as
            input stem.

        -fragfile=string
            Specifies the file containing previously generated fragments to be
            used for restricting the descriptor representation of the library.
            If this file is specified, the compounds will be represented as
            vectors in that descriptor space and any fragments contained in
            the compounds but not in that file will be ignored.

        -nooutput
            Does not produce any output files and used for testing only.

        -help
            Prints this message.
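When afgen is driven from a script, it can help to assemble the command line in one place. The sketch below builds an argument list from a subset of the options listed in the help message above, using the documented defaults. The function build_afgen_command, the lmin/lmax sanity check, and the file name reference-fragments.sdf are illustrative assumptions and are not part of AFGEN.

```python
def build_afgen_command(library, fragtype="gf", lmin=3, lmax=7, fmin=1,
                        fragfile=None, outfile=None, noh=False,
                        executable="afgen"):
    """Assemble an argv list for afgen from options in its help message."""
    if fragtype not in ("gf", "af", "pf"):
        raise ValueError(f"unsupported fragtype: {fragtype!r}")
    if lmin > lmax:
        # Assumption: a minimum length above the maximum is never intended.
        raise ValueError("lmin must not exceed lmax")
    cmd = [executable, f"-fragtype={fragtype}", f"-lmin={lmin}",
           f"-lmax={lmax}", f"-fmin={fmin}"]
    if noh:
        cmd.append("-noh")
    if outfile is not None:
        cmd.append(f"-outfile={outfile}")
    if fragfile is not None:
        # Supplying a fragment file selects the projection mode (DSPM).
        cmd.append(f"-fragfile={fragfile}")
    cmd.append(library)
    return cmd


if __name__ == "__main__":
    # Generation mode (DSGM) on the example library.
    print(" ".join(build_afgen_command("test1.sdf", lmin=2, lmax=3)))
    # Projection mode (DSPM); the fragment file name is a placeholder.
    print(" ".join(build_afgen_command("test2.sdf",
                                       fragfile="reference-fragments.sdf")))
```

The returned list can be passed directly to subprocess.run(cmd, check=True) once the afgen executable is on the PATH.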
The afgen program uses two different input files. The first is the file that stores the compound library. At this point afgen supports library files in MDL's SD file format (.sdf extension) and Tripos's Mol2 file format (.mol2 extension). These are two of the most widely used structural formats for chemical compounds. Information regarding the SDF format can be found at http://www.mdl.com and information regarding the Mol2 format can be found at http://www.tripos.com. If your library is not already in SDF or Mol2 format, you can use OpenBabel (http://openbabel.org) to convert your input files into one of these two formats.
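Because the library format is selected by the file extension, a script can check the extension before calling afgen. This is a minimal sketch. It assumes an exact, case-sensitive match on .sdf and .mol2, since how afgen treats other spellings (such as .SDF) is not documented here. The function name library_format is hypothetical.

```python
from pathlib import Path

# Extensions named in the manual, mapped to a human-readable format name.
SUPPORTED_EXTENSIONS = {".sdf": "SD file", ".mol2": "Mol2"}


def library_format(filename):
    """Return the library format implied by the file extension."""
    suffix = Path(filename).suffix
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(
            f"{filename}: expected a .sdf or .mol2 file; "
            "convert other formats first (for example with OpenBabel)"
        )
    return SUPPORTED_EXTENSIONS[suffix]


if __name__ == "__main__":
    print(library_format("test1.sdf"))
```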
The second file, supplied via the -fragfile optional parameter, is the fragment file generated from a previous run of afgen. You should not manually edit this file unless you have a good understanding of the information stored in it.
Depending on the mode in which it is run, afgen creates either one or two output files. When afgen is used to generate a set of fragment-based descriptors for a library (DSGM), it produces both a fragment file (extension .frag.sdf or .frag.mol2) containing the SDF/Mol2 representation of the discovered fragments (i.e., descriptors) and the descriptor-based representation of the library (extension .out). When afgen is used to generate the descriptor-based representation of a library with respect to a previously discovered set of graph fragments (DSPM), it produces only the descriptor-based representation of the library.
During its execution, afgen creates a temporary file with the extension .afgen-tmp. Upon successful completion, afgen deletes this file.

The following is the beginning of the fragment file generated by running afgen -lmin=2 -lmax=3 test1.sdf (the test1.sdf file is in the examples directory).
    AFGEN
    20AFGEN-GF:2-3:1
    1 0 0 0 0 0 0 0 0999 V2000
    0.0 0.0 0.0 C 0 0 0 0 0
    M END
    > <AFGEN_NFRAGS>
    19
    > <AFGEN_VLBLS>
    3 000H 001C 002O
    $$$$
    1
    20AFGEN-GF:2-3:1
    3 2 0 0 0 0 0 0 0999 V2000
    0.0 0.0 0.0 C 0 0 0 0 0
    0.0 0.0 0.0 C 0 0 0 0 0
    0.0 0.0 0.0 C 0 0 0 0 0
    1 2 1 0 0 0 0
    1 3 2 0 0 0 0
    M END
    > <AFGEN_EDGES>
    2 0:1:1:1:0 0:1:2:1:1
    $$$$
    2
    20AFGEN-GF:2-3:1
    4 3 0 0 0 0 0 0 0999 V2000
    0.0 0.0 0.0 C 0 0 0 0 0
    0.0 0.0 0.0 C 0 0 0 0 0
    0.0 0.0 0.0 C 0 0 0 0 0
    0.0 0.0 0.0 C 0 0 0 0 0
    1 2 2 0 0 0 0
    1 3 1 0 0 0 0
    2 4 1 0 0 0 0
    M END
    > <AFGEN_EDGES>
    3 0:1:1:1:1 0:1:2:1:0 1:1:3:1:0
    $$$$
The format of the fragment file follows the format of the input file. If the input library is in SDF format then the fragment file is in SDF format. If the input library is in mol2 format, then the fragment file is in mol2 format.
The above example shows the SDF version of the fragment file. The first compound corresponds to a dummy compound that contains some information related to the number of discovered fragments (AFGEN_NFRAGS) and a mapping of the chemical atoms to internal numbering (AFGEN_VLBLS). The rest of the file contains the SDF format of the generated fragments, numbered from 1. This fragment numbering is used by the descriptor-based representation file to indicate the set of fragments, along with their frequencies, that are contained in each compound.
The mol2-version of the fragment file contains similar information.
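For read-only inspection, the data fields of the dummy compound in an SDF fragment file can be pulled out with a few lines of Python. This is a rough sketch under stated assumptions: the sample record uses a guessed line layout, the first token of AFGEN_VLBLS is taken to be the label count, and each following entry is taken to be a numeric internal id immediately followed by the atom symbol. The function names are hypothetical, and nothing here writes to the fragment file.

```python
import re

# Dummy first record of a fragment file (line breaks are an assumption).
SAMPLE = """\
AFGEN
20AFGEN-GF:2-3:1
1 0 0 0 0 0 0 0 0999 V2000
0.0 0.0 0.0 C 0 0 0 0 0
M END
> <AFGEN_NFRAGS>
19
> <AFGEN_VLBLS>
3 000H 001C 002O
$$$$
"""


def read_data_fields(record):
    """Collect the whitespace-separated tokens of each '> <NAME>' field."""
    fields, current = {}, None
    for line in record.splitlines():
        header = re.match(r">\s*<([^>]+)>", line)
        if header:
            current = fields.setdefault(header.group(1), [])
        elif line.startswith("$$$$") or not line.strip():
            current = None          # end of the field or of the record
        elif current is not None:
            current.extend(line.split())
    return fields


def atom_label_map(tokens):
    """Turn ['3', '000H', '001C', '002O'] into {0: 'H', 1: 'C', 2: 'O'}."""
    count, entries = int(tokens[0]), tokens[1:]
    labels = {}
    for entry in entries:
        match = re.fullmatch(r"(\d+)(\D.*)", entry)
        if match is None:
            raise ValueError(f"unexpected label entry: {entry!r}")
        labels[int(match.group(1))] = match.group(2)
    if len(labels) != count:
        raise ValueError("label count does not match the number of entries")
    return labels


if __name__ == "__main__":
    fields = read_data_fields(SAMPLE)
    print(int(fields["AFGEN_NFRAGS"][0]), atom_label_map(fields["AFGEN_VLBLS"]))
```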
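Each fragment record also carries an additional data field (AFGEN_EDGES). This information is used by afgen when operating in DSPM mode and should not be modified.

The following is part of the descriptor-space representation file generated by running afgen -lmin=3 -lmax=4 -fmin=30 test2.sdf.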
    >318
    1:5 2:2 3:2 4:9 5:4 6:1 7:1 8:2 9:2 10:1 11:1 12:2 13:2 14:2 15:2 16:1 17:2 18:2 19:1
    3:3 4:11
    >332
    1:4 2:2 3:2 4:7 5:3 6:1 7:1 8:2 9:2 10:1 11:1 12:1 13:1 14:1 15:1 16:1 20:1 21:1
    3:1 4:10
    >432
    1:4 4:7 5:3 10:1 11:1 13:1 14:2 15:2 16:1 20:2 22:1 23:1 24:1 25:2 26:1 27:2 28:1 29:5 30:6 31:1
    3:10 4:20
    >656
    1:6 4:8 5:3 10:4 11:2 13:2 14:2 15:2 29:1 32:2 33:2 34:2 35:2 36:2 37:2 38:2 39:2 40:1 41:1 42:1 43:1
    3:5 4:12
    >836
    1:5 2:1 3:1 4:5 5:2 10:1 11:1 13:1 14:6 15:6 17:1 18:2 19:2 20:1 21:3 25:5 26:1 27:3 31:3 44:2 45:1
    3:3 4:9
    >847
    18:3 20:4 21:13 22:4 24:4 25:5 28:5 29:8 30:3 31:1 33:2 35:1 37:1 40:2 44:5 45:3 46:2 47:1 48:2 49:2 50:2 51:2
    3:77 4:167
    >849
    18:15 21:27 25:9 31:6 44:21
    3:38 4:125
    >851
    1:4 2:1 3:1 4:7 5:3 6:1 7:1 8:2 9:2 10:1 11:1 12:1 13:1 14:2 15:2 20:1 21:1 25:2 26:1 27:2 31:1 44:1 45:1
    3:1 4:3
The descriptor-space representation file contains three lines for each library compound. The first line (starting with a > character) contains the name of the compound and is obtained from the library file itself.

The second line contains the actual descriptor-space representation. It consists of a sequence of fragment-ID:frequency pairs separated by a space (sorted in increasing fragment-ID order). The fragment-ID corresponds to the name of the fragment as stored in the fragment file and the frequency counts the number of embeddings of this fragment in the compound.
The third line contains a sequence of fragment-length:frequency pairs for those descriptors that are not included in the descriptor-based representation of the compound. These numbers will be non-zero under two scenarios. First, if a value greater than 1 is specified for -fmin, some of the fragments present in the library may be pruned because they do not meet the minimum frequency cutoff. The frequencies of the pruned fragments are added up on a per-length basis and reported in this line. For example, for compound 318 in the above example, the -fmin=30 option pruned fragments of length 3 whose total frequency was 3, and fragments of length 4 whose total frequency was 11. The second scenario in which these frequencies will be non-zero is when afgen is used in DSPM mode. Since the descriptors are restricted to only those provided by the -fragfile option, some of the fragments present in each compound may not have corresponding descriptors. In such cases, the total frequency (on a per-fragment-length basis) of these ignored fragments is reported in that line.
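The three-line record structure described above is straightforward to parse. The following sketch reads records into plain dictionaries. It assumes that a compound with no descriptors or no ignored fragments still has an (empty) line in the corresponding position. The function names are hypothetical, and the sample data is copied from the example output above.

```python
SAMPLE = """\
>318
1:5 2:2 3:2 4:9 5:4 6:1 7:1 8:2 9:2 10:1 11:1 12:2 13:2 14:2 15:2 16:1 17:2 18:2 19:1
3:3 4:11
>849
18:15 21:27 25:9 31:6 44:21
3:38 4:125
"""


def parse_pairs(line):
    """Parse 'key:count key:count ...' into a dict of ints."""
    pairs = {}
    for token in line.split():
        key, count = token.split(":")
        pairs[int(key)] = int(count)
    return pairs


def parse_descriptor_file(lines):
    """Yield (name, descriptors, ignored) for each three-line record.

    descriptors maps fragment-ID -> frequency (second line);
    ignored maps fragment-length -> total frequency (third line).
    """
    it = iter(lines)
    for header in it:
        if not header.startswith(">"):
            continue                      # skip anything before a record
        descriptors = next(it, "")
        ignored = next(it, "")
        yield header[1:].strip(), parse_pairs(descriptors), parse_pairs(ignored)


if __name__ == "__main__":
    for name, descriptors, ignored in parse_descriptor_file(SAMPLE.splitlines()):
        print(name, len(descriptors), ignored)
```

To read a real file, pass an open file object instead of SAMPLE.splitlines(); iterating over it yields the same lines one at a time.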
The following shows the information that afgen outputs to the screen when executing afgen -lmin=3 -lmax=4 -fmin=30 test2.sdf.
    *******************************************************************************
     AFGEN 2.0.0 Copyright 2006-2008, Regents of the University of Minnesota
     (HEAD: 3939, Built on: Mar 21 2008, 11:22:29)
    Library Information ---------------------------------------------------------
      Library file: test2.sdf
      #Compounds: 99, #AtomTypes: 10, #EdgeTypes: 8
    Options ---------------------------------------------------------------------
      fragtype=GF, lmin=3, lmax=4, fmin=30, noh: no
      outfile: test2, fragfile: <not specified>
    Generating fragments... -----------------------------------------------------
      nfrags: 572, nslfrags: 884, lnnz: 5594, ncfrags: 0, hr: 92.94%
    Done.
    Pruning fragments... --------------------------------------------------------
      pnfrags: 53
    Done.
    Writing fragments ... -------------------------------------------------------
    Done.
    *******************************************************************************
Among other things, this output reports the number of fragments (nfrags) and the number of non-zeros in the descriptor-based representation of the library (lnnz). The number of non-zeros indicates the space that will be required to store the descriptor-based representation in memory.
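As a rough illustration of how the number of non-zeros translates into memory, the sketch below estimates the size of a compressed-sparse-row (CSR) matrix holding the representation. CSR and the 4-byte index and value sizes are assumptions chosen for the example. They say nothing about how afgen itself stores the data.

```python
def sparse_storage_bytes(nnz, n_compounds, index_bytes=4, value_bytes=4):
    """Estimate the size in bytes of a CSR matrix.

    CSR keeps one column index and one value per non-zero, plus
    n_compounds + 1 row offsets.
    """
    per_nonzero = index_bytes + value_bytes
    return nnz * per_nonzero + (n_compounds + 1) * index_bytes


if __name__ == "__main__":
    # Values taken from the screen output above: lnnz=5594, #Compounds=99.
    print(sparse_storage_bytes(nnz=5594, n_compounds=99))
```

The estimate grows linearly with lnnz, so the reported value is the quantity to watch when lowering -fmin or raising -lmax on a large library.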
Among the other displayed information, ncfrags is the number of fragments read from the file specified via -fragfile and pnfrags is the number of fragments after frequency-based pruning.
The current version of afgen has the following limits:

Property | Limit |
---|---|
Number of atoms in a compound | 511 |
Number of bonds in a compound | 511 |
Number of atoms in a fragment | 32 |
Number of bonds in a fragment | 32 |
Degree of an atom in a fragment | 5 |
Number of atom types | 256 |
Number of bond types | 8 |
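Before running afgen on a large library, the limits in the table can be checked in a preprocessing script. The sketch below encodes the table as a dictionary and reports any property that exceeds its limit. It assumes the listed values are inclusive maxima. The key names and the function limit_violations are hypothetical.

```python
# Limits copied from the table above; key names are illustrative.
AFGEN_LIMITS = {
    "atoms_per_compound": 511,
    "bonds_per_compound": 511,
    "atoms_per_fragment": 32,
    "bonds_per_fragment": 32,
    "atom_degree_in_fragment": 5,
    "atom_types": 256,
    "bond_types": 8,
}


def limit_violations(**observed):
    """Return {property: (observed, limit)} for every value above its limit."""
    unknown = set(observed) - set(AFGEN_LIMITS)
    if unknown:
        raise KeyError(f"unknown properties: {sorted(unknown)}")
    return {
        name: (value, AFGEN_LIMITS[name])
        for name, value in observed.items()
        if value > AFGEN_LIMITS[name]   # assumption: limits are inclusive
    }


if __name__ == "__main__":
    print(limit_violations(atoms_per_compound=600, bonds_per_compound=100))
```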
If you encounter any problems or have any suggestions, please contact George Karypis via email at karypis@cs.umn.edu.
Copyright and License Notice
----------------------------

The AFGEN package is copyrighted by the Regents of the University of Minnesota. It can be freely used for educational and research purposes by non-profit institutions and US government agencies only. Other organizations are allowed to use AFGEN only for evaluation purposes, and any further uses will require prior approval. The software may not be sold or redistributed without prior approval. One may make copies of the software for their use provided that the copies, are not sold or distributed, are used under the same terms and conditions.

As unestablished research software, this code is provided on an ``as is'' basis without warranty of any kind, either expressed or implied. The downloading, or executing any part of this software constitutes an implicit agreement to these terms. These terms and conditions are subject to change at any time without prior notice.