
The catDB utility program allows you to build small databases (containing fewer than ten thousand compounds) for the use of individual chemists and their project teams, and large databases (containing hundreds of thousands of compounds) suitable for use by an entire corporation.
The data in a spreadsheet is a local copy of the database's data. If you make changes in the spreadsheet data, these changes are made only to the local spreadsheet copy. To change data permanently in the database itself, you must apply the Commit 1D Changes To Database command; this updates your database by copying in the altered local spreadsheet data.
Catalyst is designed as a Client-Server application which permits distribution of computation for database searches over multiple computers. A typical search is a multi-step process consisting of the following steps:
Steps 1, 2, and 4 are performed on the catDisk server. Step 3 is performed in the relational database engine (Oracle). Step 4 can be executed in parallel. To control the execution of Step 4, set the Search.server.hosts variable in your ~/.Catalyst file.
Example 1:
Search.server.hosts = "DefaultServer"
Placing the above line in your ~/.Catalyst file sets the search to use the default server on the computer that holds the .0bdb file.
Example 2:
Search.server.hosts = "InClient"
Specifying "InClient" means that Step 4 (isomorphism) is performed on the client on which the Catalyst or catSearch application is running. This configuration emulates the behavior of Catalyst version 2.3.
Example 3:
Search.server.hosts = "HostA:HostA:HostB"
The variable can be set to a string of host names separated by colons. The above line sets up a three-process parallel search on hosts named "HostA" and "HostB". Here, HostA must be a multiprocessor machine, and both hosts must run the catDisk server. For help on how to configure your client-server setup most effectively, consult MSI Scientific Support.
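The three examples above can be summarized in a small sketch. The helper name parse_search_hosts and its return layout are hypothetical and are not part of Catalyst; the sketch only illustrates how a colon-separated host string maps to a number of search processes per host.

```python
from collections import Counter


def parse_search_hosts(value):
    """Interpret a Search.server.hosts value (hypothetical helper, not a Catalyst API)."""
    # The two special values described in Examples 1 and 2.
    if value in ("DefaultServer", "InClient"):
        return {"mode": value, "processes": {}}
    # Otherwise: a colon-separated host list, one search process per entry.
    hosts = [host for host in value.split(":") if host]
    return {"mode": "Parallel", "processes": dict(Counter(hosts))}


print(parse_search_hosts("HostA:HostA:HostB"))
```

A host that appears twice in the string, such as HostA here, is counted as two processes, which is why the text requires it to be a multiprocessor machine.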
Indexes are another way of dealing with large volumes of data. Functionally, Catalyst indexes are like the indentations for each letter of the alphabet cut into the front edge of a dictionary. The thumb index made up of these indentations allows you to jump quickly to the d's when looking up the word data. Catalyst builds indexes automatically, and they are exploited automatically during database searching. Nevertheless, if you are building a large database, you must be aware of them and make certain specifications related to indexing.
Catalyst databases support integrated 1D-2D-3D searching. That is, you can search for compounds that match a hypothesis composed of 1D components (constraints on data fields), 2D components (constraints on atom or bond types), and 3D components (constraints on geometric relationships). For example, your hypothesis can contain a 1D component such as MolecularWeight < 800, a 2D component that specifies a nitrogen that is not in an amide bond, and a 3D component that restricts the nitrogen to 5 to 7 angstroms from the center of a phenyl ring.
The data model defines which data fields are associated with each molecule in a database. In a Catalyst database each compound is uniquely identified by its name and by an internal compound reference number called a cref (pronounced see-ref). Each entry in the database can be retrieved by its molecular structure, and this structure can exist more than once in the database. Thus, you can build a Catalyst database with one entry named Captopril and another named (S)-1-(3-mercapto-2-methyl-1-oxopropyl)-L-proline. As a user, you have the responsibility for ensuring the uniqueness of chemical structures in the database, if that is what you want.
For those familiar with relational database terminology, while Catalyst's database engine supports full relational database functionality, only a single-table data model is accessible to the user. Molecular topology (for which duplicates can exist) acts as a secondary key, molecule name (for which duplicates are not allowed) acts as a secondary key, and the single primary key is the cref. The major consequence of the single-table data model is that you cannot store multiple values for a property in a Catalyst database. Thus, you cannot use a Catalyst database to store an indeterminate and unlimited number of IC50s for a compound. With Catalyst, you can store only a single value (for example, usually the average or median).
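Because only one value per property can be stored, repeated measurements have to be collapsed before they are loaded. The following is a minimal sketch of that preprocessing step, assuming the measurements are available as a plain list of numbers; the function name collapse_measurements is hypothetical, and the choice of the median rather than the average is only one of the options mentioned above.

```python
from statistics import median


def collapse_measurements(values):
    """Reduce repeated measurements of one property to a single storable value."""
    if not values:
        raise ValueError("no measurements to collapse")
    return median(values)


# Three hypothetical IC50 determinations for one compound.
print(collapse_measurements([12.0, 8.5, 10.0]))
```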
You can specify the molecular topology and the molecule name for each database entry. The system provides the crefs, which are invisible to the Catalyst user. The construction of very large databases requires that the person building the database be aware of how the crefs are assigned.
Each database can optionally have an associated property dictionary (.bpd) file, which the database builder provides. For each property, the following information must be specified:
A Catalyst database named Fred has the following files associated with it:
Warning: You should not modify the configuration file by other means as doing so can cause serious problems.
Furthermore, tables are stored for Fred in the 1D-data server's database. Each table contains the database ID encoded in the table's name (6458 in this example). These database tables are observable only to your system administrator.
Database operations in Catalyst can require significant disk space, depending upon the size of the databases being manipulated. Guidelines for developing a rough estimate of the disk space needed for your database follow:
For each compound: 1 byte per character in a compound name + approximately 4 bytes per heavy (nonhydrogen) atom + 20 bytes overhead.
For each conformer: Approximately 8 bytes per heavy (nonhydrogen) atom.
Note: If compounds are quite large (more than 200 heavy atoms), double the estimates listed above. The guidelines for the .0bdb-file space requirements are approximate. Disk usage depends on many factors including the size of the compound, connectivity, and the number and positions of functional points (hydrophobes, hydrogen bond acceptors and donors, etc.) on each conformer.
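The guidelines above can be turned into a rough calculator. This is a sketch of the arithmetic only: the function name compound_bytes is hypothetical, and the result inherits all the approximations stated in the note.

```python
def compound_bytes(name, heavy_atoms, n_conformers):
    """Rough per-compound disk estimate in bytes, following the guidelines above."""
    # Per compound: 1 byte per name character + ~4 bytes per heavy atom + 20 bytes overhead.
    per_compound = len(name) + 4 * heavy_atoms + 20
    # Per conformer: ~8 bytes per heavy atom.
    per_conformers = 8 * heavy_atoms * n_conformers
    total = per_compound + per_conformers
    # Quite large compounds (more than 200 heavy atoms): double the estimate.
    if heavy_atoms > 200:
        total *= 2
    return total


# Example: a 14-heavy-atom compound named "Captopril" stored with 100 conformers.
print(compound_bytes("Captopril", 14, 100))
```

Summing this value over all compounds gives a first approximation of the space needed; actual usage also depends on connectivity and functional points, as noted above.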
Note: If you want to install your database as a corporate database for others to use, see "Moving Databases" in the MSI document, Installing and Maintaining Catalyst.
To familiarize yourself with some UNIX and catDB program commands that are often useful when constructing databases, you can work through the following exercise.
.../database/platform/name
in which the database's name appears in place of name and the platform name is substituted for platform.
more /path/database/platform/Sample/Sample.bdb
and pressing Enter. Remember to substitute the names of the directories you obtained in step 1 for name and platform. For example,
more /biocad/r3.0/database/irix5r3/Sample/Sample.bdb
is the appropriate entry for a system on which databases are stored in /biocad/r3.0 and the type of platform is irix5r3.
ls -l /path/platform/Sample
and press Enter. The size in bytes of each file is listed in the column to the left of the date each was last saved.
To construct a database, the Catalyst 2D/3D server, catDisk, must be running on the host on which you are building the database. To determine if the server program is running:
ps -ef | grep catDisk | grep -v grep
and press the Enter key. If the command results in output such as
biocad 623 511 0 12:31:36 ? 0:00 /biocad/r3.0/software/qa/iris/bin/catDisk ca3.0
catDisk is running and you can go to step 3. If you get no output, catDisk is not running, and you should go to the next step.
Note: To start catDisk you need one catDisk token and this is sufficient to run single process searches. Parallel searches require additional catDisk tokens, one for each additional process.
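The same check can be scripted. The sketch below mirrors the ps -ef | grep catDisk | grep -v grep pipeline shown above; the function names are hypothetical, and the live variant assumes a UNIX host on which ps -ef is available.

```python
import subprocess


def catdisk_lines(ps_output):
    """Return the ps lines that mention catDisk, excluding the grep process itself."""
    return [
        line
        for line in ps_output.splitlines()
        if "catDisk" in line and "grep" not in line
    ]


def catdisk_running_here():
    """Run `ps -ef` on this host and report whether a catDisk process is listed."""
    result = subprocess.run(["ps", "-ef"], capture_output=True, text=True, check=True)
    return bool(catdisk_lines(result.stdout))


sample = (
    "biocad 623 511 0 12:31:36 ? 0:00 /biocad/r3.0/software/qa/iris/bin/catDisk ca3.0\n"
    "arlene 701 650 0 12:40:02 pts/1 0:00 grep catDisk\n"
)
print(catdisk_lines(sample))
```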
When you create a database using the Catalyst Create Database... command, the files of which it is composed are stored according to the specifications in the default configuration file set up by the catDB program. Also by default, the database is constructed with conformational analysis for each compound and with indexes to make searching for 2D and 3D structures more efficient. If you want more control over catDB commands and options, for example, to suppress conformational analysis or to defer index construction, see "Building a Custom Database from Your StockroomDB with catDB". Follow these steps to convert your private StockroomDB into a Catalyst database you can share with others:
Input Spreadsheet. The name of the currently selected spreadsheet appears in this text box. To specify another spreadsheet, click in the text box, use the Backspace or Delete key to remove unwanted characters, and type in the spreadsheet's name.
Output Database. Click in this text box and type in a name for your database. If you fail to specify a name for your database, Catalyst displays an alert message reminding you to do so when you try to create the database.
Existing Conformers. Select Discard to generate a new conformational model for each compound or, if you are using Fast generation, you can select Use to retain conformers for the compounds in your spreadsheet. Discard is the default. See "ExistingConfs=" for detailed information on these options.
Maximum Number of Conformers. Click in the text box and type a number for the upper limit on the number of conformers to be generated during conformational analysis. The default value is 100. See "MaxConfs=" for detailed information on this option.
Remote Host. Click on a machine name from the scroll list to specify a remote computer on which to construct the database, and Catalyst enters the machine name in this text box. Alternatively, you can click in this text box, use the Backspace or Delete key to remove unwanted characters, and then type in the name of a remote machine. Note that to construct a database, catDisk (the 2D/3D server program) must be running on the remote host. See "Before You Start Building Databases" for information on how to determine if catDisk is running.
Local Directory. Catalyst provides a unique default directory name of the form processnameDir. If the database is constructed locally, this is where it will be placed. Should you want to specify a different directory name, you may edit this field.
Remote Directory. Catalyst provides a default directory name of the form
/usr/tmp/processnameDir in this text box. You should specify a directory
for your database that is on a partition that is local (not NFS mounted) to the machine on which catDisk is running. To find out if the partition is local, in a UNIX shell window
on the host in the directory from which you intend to create a database,
type
df .
(make certain there is a space between df and the .) and press Enter. If efs (extended file system) appears under Type, the partition is local. If nfs (network file system) appears under Type, the partition is NFS mounted.
Cancel. Select Cancel to close the Job Options dialog box without changing any settings.
Help. Select Help to display Catalyst's Help window.
OK. Select OK to make your specifications effective, close the Job Options dialog box, and return to the Create Database dialog box.
Create. Starts the database construction process according to your specifications in the Create Database and Job Options dialog boxes.
Cancel. Closes the Create Database dialog box without changing any settings.
Help. Displays Catalyst's Help window.
Follow these steps to use the catDB program to convert your private StockroomDB into a Catalyst database you can share with others:
df .
(make certain there is a space between df and the .) and press Enter. If efs (extended file system) appears under Type, the partition is local. If nfs (network file system) appears under Type, the partition is NFS mounted.
cd /home/arlene/Training
catDB CREATE extendedspreadsheetname.esp
The catDB program first displays the contents of the default configuration (.bdb) file for the database you are creating as shown below. The output in italics will be different for you because it depends on your specific hardware installation.
catDB version 2.2
Default configuration:
! Copyright © 1991-1999
! All Rights Reserved
! Biocad Database Configuration file
Database Name = extendedspreadsheetname
catDB version = 2.2
Database ID = 29878 Unique ID chosen by catDB. Yours will be different.
Conformational Models: Specifications for the .0bdb file.
host = ravel Host computer for .0bdb file.
path = /home/arlene/Training/ Directory for .0bdb file.
!
1D Data: Specifications for the 1D property data file.
host = blackhole Host computer for 1D property data file.
server = cat1-2.2 Name for server program for 1D property data file.
!
2D Index: Specifications for the .2bdb file (2D searching indexes).
host = ravel
path = /home/arlene/Training/
!
3D Index: Specifications for the .3bdb file (3D searching indexes).
host = ravel
path = /home/arlene/Training/
!
Feature Dictionary: Specifications for the .chm file.
host = ravel
path = /home/arlene/Training
catDB INFO databasename.bdb
(substituting the name of the database for databasename) and press the Enter key. Executing the catDB INFO command prints out the number of compounds in the database, and a description of its property dictionary.
Another way of building a Catalyst database is to construct one from an existing database. If someone in your organization has already built a Catalyst version of your corporate database and you want to build a database containing a subset of that (for example, all molecules synthesized by project 452 or all the D2 antagonists), you can follow a sequence of steps similar to the ones for creating a database from compounds in your Stockroom.
The following exercise is provided to illustrate how to construct a single-conformer database and how to overcome certain common problems including
For this exercise /installdir/cattrain/ex9.sd is the input file. For installdir substitute the name of your Catalyst directory in which your training materials were installed. The input file contains a small number of errors of the types listed above. The purpose of the exercise is to introduce you to the techniques for resolving typical problems that might be encountered when constructing a large database. Dealing with such problems generally requires a degree of UNIX and Catalyst sophistication. The exercise assumes that you have some UNIX expertise and a familiarity with the UNIX nawk (new awk) utility. The non-Catalyst input formats that catDB can read are SMILES, MOL, and SD files. All specify molecular topology; SD files specify 1D data as well.
The first questions to ask about an SD file are
The following one-line UNIX statement entered on the command line counts the number of termination records (occurrences of $$$$) in the file, thus determining the number of compounds by tabulating the number of delimiters that separate one compound's definition from another:
grep '$$$$' filename | wc -l
Substitute the path and file name of your SD file for filename.
Note: Using ex9.sd, the result of executing the statement is 200.
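For readers who prefer a script to a one-liner, the same count can be sketched as follows. The function name count_sd_records is hypothetical. Unlike the grep pattern, this sketch counts only lines that consist of exactly $$$$, which is how the delimiter appears in a well-formed SD file.

```python
def count_sd_records(lines):
    """Count SD-file termination records (lines consisting of exactly '$$$$')."""
    return sum(1 for line in lines if line.rstrip("\r\n") == "$$$$")


def count_sd_file(path):
    """Count the compounds in the SD file at the given path."""
    with open(path, "r", errors="replace") as handle:
        return count_sd_records(handle)


print(count_sd_records(["mol1\n", "M  END\n", "$$$$\n", "mol2\n", "M  END\n", "$$$$\n"]))
```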
The UNIX statement
grep ">  <" filename | sort -u
locates all occurrences of the >  < (greater than, space, space, less than) character sequence that precedes a property-data specification in the SD file, and pipes them to the sort utility with the unique option to display a list of the file's property names.
Note: You must include two spaces between the greater than and less than characters. If you provide only one space, your command statement will fail to locate the delimiter for 1D data specifications in SD files. Using ex9.sd, the result of executing the statement is
>  <CAS_number>
>  <IC50>
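A scripted equivalent of the property listing is sketched below. The function name sd_property_names is hypothetical. As a deliberate difference from the grep statement, the regular expression accepts one or more whitespace characters between > and <, so it also tolerates files written with a single space.

```python
import re

# A data header line starts with '>' followed by whitespace and '<name>'.
PROPERTY_HEADER = re.compile(r"^>\s+<([^>]+)>")


def sd_property_names(lines):
    """Return the sorted, unique property names found in SD data header lines."""
    names = set()
    for line in lines:
        match = PROPERTY_HEADER.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


example = [">  <IC50>", "12.5", "", ">  <CAS_number>", "62571-86-2", "", ">  <IC50>", "3.0"]
print(sd_property_names(example))
```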
To estimate the average molecular weight, execute
cat filename | nawk '/$$/ { for (i=1;i<=4;i++) getline;atomsum+=substr($0,1,4);ct++;};
END {print (14.7*atomsum)/ct;}'
Enter this command statement as one long line, pressing the Enter key only after you have finished typing in both of the lines shown above.
Note: Using ex9.sd, the result of executing the statement is 230.937.
To calculate the total number of chiral centers explicitly marked as unknown, execute
cat filename | nawk 'BEGIN{ct=0;}; NF==10 {if ($7==3) ct++;};
END {print ct;}'
Again, enter this command statement as one long line before pressing the Enter key.
Note: Using ex9.sd, the result of executing the statement is 0.
If the average molecular weight is more than 400 or if the number of unknown stereocenters is large (more than 0.5 per molecule on average), you can anticipate that the conformer generation step in database building will be exceptionally time-consuming. (The catDB program automatically populates both chiralities if the center is listed as unknown.)
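The two checks and the rule of thumb above can be combined in one sketch. It is not a transcription of the nawk statements: it assumes V2000 MOL blocks (atom count in columns 1-3 of the fourth line of each record, stereo parity as the seventh whitespace-separated field of each atom line) and reuses the factor 14.7 per atom from the statement above. All function names are hypothetical.

```python
def split_sd_records(text):
    """Split SD-file text into records, each a list of lines, on '$$$$' delimiters."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def sd_statistics(text):
    """Estimate average molecular weight and unknown stereocenters per molecule."""
    records = split_sd_records(text)
    atom_total = 0
    unknown_centers = 0
    for record in records:
        n_atoms = int(record[3][0:3])  # counts line: atom count in columns 1-3
        atom_total += n_atoms
        for atom_line in record[4:4 + n_atoms]:
            fields = atom_line.split()
            # Seventh field is the stereo parity; 3 marks an unknown center.
            if len(fields) >= 7 and fields[6] == "3":
                unknown_centers += 1
    count = len(records)
    return {
        "compounds": count,
        "est_avg_mw": 14.7 * atom_total / count,
        "unknown_per_molecule": unknown_centers / count,
    }


def expect_slow_build(stats):
    """Apply the rule of thumb: MW above 400 or more than 0.5 unknown centers per molecule."""
    return stats["est_avg_mw"] > 400 or stats["unknown_per_molecule"] > 0.5
```

Calling sd_statistics on the text of an SD file and passing the result to expect_slow_build gives an early warning before a long conformer-generation run is started.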
grep '$$$$' /installdir/cattrain/ex9.sd | wc -l
(substituting your Catalyst directory name for installdir) and pressing the Enter key.
grep ">  <" /installdir/cattrain/ex9.sd | sort -u
and pressing the Enter key. The system displays the following result:
>  <CAS_number>
>  <IC50>
cat /installdir/cattrain/ex9.sd | nawk '/$$/ { for (i=1;i<=4;i++) getline; atomsum+=substr($0,1,4);ct++;}; END {print (14.7*atomsum)/ct;}'
as one long line and then pressing the Enter key. The system displays 230.937 as the result.
cat /installdir/cattrain/ex9.sd | nawk 'BEGIN{ct=0;};
NF==10 {if ($7==3) ct++;};
END {print ct;}'
as one long line and then pressing the Enter key.
Before building the configuration (.bdb) file, you must make some decisions
on where the database files will reside, which in turn is determined by
the computers that will function as data servers. Available disk space and
server computers are usually the critical factors in determining where to
put database files. A full multiconformer 3D database requires roughly 1
megabyte per 1,000 compounds. To determine which computers are servers,
consult your system administrator. Use the UNIX df command to report the free disk space on each server. The disks that can
be "served" by the server must be local to that machine, and not NFS mounted.
You will usually use the 1D server common to all users. The
rlogin machinename
(substituting the name of the computer for machinename) and pressing the Enter key.
catDB CONFIG exercise9
and press the Enter key. The catDB CONFIG command returns a listing for a default configuration file in the same manner that the catDB CREATE command does.
Do you want to use the default configuration shown above? [y] :
catDB RECONFIG exercise9.bdb
There are two methods for building a property dictionary file. The first is to create it directly in an editor, using $CATALYST_CONF/Corporate.bpd as a template. That is, you can copy this file into an editor, edit it to specify the properties you want, and save it with your database name and a .bpd extension. Remember that the Special field should always be NULL for your definitions. (Non-NULL values are reserved for internal Catalyst use only.) Properties that are nearly always present for each molecule should be given a UNIVERSAL Schema specification; properties that occur for less than fifty percent of the compounds should be given a SPECIFIC specification. For properties you expect to search on with any constraint except an approximately-equals substring search (e.g., RGDIC_50 < 10.0), specify QUICK_REF for Reference. Otherwise, you can save some disk space on the 1D server by choosing SLOW_REF for Reference.
The other method for building a property dictionary file is to define all the properties for your new database interactively in Catalyst by editing the property dictionary for your StockroomDB using the Edit Property Dictionary command in the Stockroom Databases menu. (See "Edit Property Dictionary..." for details on using the Edit Property Dictionary... command.) After editing your StockroomDB properties, perform a Save StockroomDB command. You can then use the UNIX cp command to make a copy of your Stockroom property dictionary for your exercise9 database as follows:
cp catdata/StockroomDB.bpd exercise9.bpd
Use a text editor to remove any unwanted property definitions. For this exercise, use a text editor to append the following definitions for IC50 and CAS_number to the property dictionary file:
IC50 FLOAT UNIVERSAL QUICK_REF NULL IC50 activity value
CAS_number STRING UNIVERSAL QUICK_REF NULL CAS registry number
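Before a long build, it can be useful to sanity-check the definitions you have appended. The sketch below assumes that each definition is a single whitespace-separated line in the order name, type, Schema, Reference, Special, description. That order is inferred from the two lines above and is not a documented file grammar; the names PropertyDef, parse_property_line, and check_property are hypothetical.

```python
from typing import NamedTuple


class PropertyDef(NamedTuple):
    name: str
    type: str
    schema: str
    reference: str
    special: str
    description: str


def parse_property_line(line):
    """Split one definition line into fields (field order is an assumption)."""
    name, type_, schema, reference, special, *description = line.split()
    return PropertyDef(name, type_, schema, reference, special, " ".join(description))


def check_property(prop):
    """Return a list of problems, based on the rules stated in the text."""
    problems = []
    if prop.special != "NULL":
        problems.append("Special should be NULL for user definitions")
    if prop.schema not in ("UNIVERSAL", "SPECIFIC"):
        problems.append("Schema should be UNIVERSAL or SPECIFIC")
    if prop.reference not in ("QUICK_REF", "SLOW_REF"):
        problems.append("Reference should be QUICK_REF or SLOW_REF")
    return problems


ic50 = parse_property_line("IC50 FLOAT UNIVERSAL QUICK_REF NULL IC50 activity value")
print(ic50, check_property(ic50))
```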
When you have the necessary input files (.sd, .bdb, and .bpd), you're ready to build the database.
Note: If, in the .sd file, there is a property whose value represents the name of the compounds (e.g. RegCmpdName), you should specify that property name in your ~/.Catalyst file before carrying out the catDB SD command. For example, adding the line:
importMOL.realCompoundNameProperty=RegCmpdName
to your ~/.Catalyst file specifies that the property called RegCmpdName will be used (if encountered) instead of the normal compound name field in each MOL header in building the database. The reason for using an alternate property to hold the compound name is that the .sd file format limits the number of characters in a compound name in the MOL header to 80, and many compounds have names longer than 80 characters.
The catDB SD command for building a database has many options, but for the purposes of this exercise, type the following as one long line and then press Enter:
nohup catDB SD /installdir/cattrain/ex9.sd exercise9 PropDict=exercise9 MaxConfs=1 errData=exercise9.err.sd >& exercise9.log &
The MaxConfs=1 option specifies building only one conformer for each molecule in the input ex9.sd file. The file you name with the PropDict= option specifies the property dictionary the catDB program uses when constructing the database; catDB automatically appends the .bpd extension to the file name you type. The errData= option puts all unprocessed data in the file you specify to the right of the = (equals) sign. The >& directs standard output and standard error to a file named exercise9.log. Following the command statement with the & operator runs the process in the background. The nohup command ensures that the catDB process will not be killed if the current window is killed.
Building the exercise9 database should take about three minutes on an R3000 Indigo. After processing is complete, verify the number of compounds which were successfully converted into a Catalyst database using the command
catDB INFO exercise9.bdb
Since the INFO command reports 197 compounds in the database, 3 compounds did not convert. In the exercise9.log file, for each molecule successfully processed, you'll see lines like the following:
56145029 : Processing...
56145029 : CC(=O)N[C@?H]1N[C@?H](Cl)[C@?H](N)[C@?H](Cl)N1
70924833 : Processing...
70924833 : CC(=O)NC[C@?H]1CC[C@?H](CC1)CNC(=O)C
The log file also records which molecules had problems. These three compounds have been written into exercise9.err.sd because the errData= option was specified. In a real-life situation you would need to examine the data for the three molecules and, using knowledge of the MOL format, determine how they must be fixed. For example, a tetravalent, neutral nitrogen is a common problem; the charge must be specified in either the atom line or at the end in a CHG field. You may need to go back to your original documentation to retrieve an original drawing of the structure, which you can then draw using Catalyst. Exporting this new MOL file and then replacing the original MOL data should correct the problem.
Verify that you can install your newly built single-conformer database in Catalyst, and that you can see the 1D data for CAS_number and IC50. Also verify that you can perform a 3D search. Change one of the IC50 values and commit that change to the database, dispose of the workbench, and do the search again to verify that your updated value has been recorded.
The strategy for building a large, multiconformer database is to 1) build the conformational models in parallel, 2) merge the database segments into a single database, 3) add the 1D data, and 4) build the 2D and 3D searching indexes. Any problems with nonimported structures are best resolved by iteratively appending the structures to the database after it is built.
The first task is the generation of conformational models for each molecule in the database. We recommend that you specify up to 100 FAST conformers per molecule; the catDB program will build a conformational model composed of fewer if the conformational space can be adequately covered by less than 100. At this level of conformational analysis, building a multiconformer database should require roughly one R4400 CPU-day per 8,000 molecules, although the actual time is dependent on the size and the flexibility of the compounds.
Note: The default conformational model energy range is 20 kcal/mol. For FAST conformational analysis you can specify a different energy range by adding the user parameter
confAnalysis.catDB.maxEnergySpread = n
to your .Catalyst file. Substitute a value in Joules for n.
The procedure outlined below assumes constructing a 250,000-compound database with input data provided in three SD files (file1.sd, file2.sd, and file3.sd) containing 100,000, 100,000, and 50,000 compounds, respectively. Each of the SD files contains a collection of property data for the compounds in them. Approximately one gigabyte of disk space will be required to store such a 1D/2D/3D database.
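The build-time guideline of roughly one R4400 CPU-day per 8,000 molecules can be applied to this example as a quick calculation. The sketch assumes the work divides evenly across processes, which is an idealization; the function name build_days is hypothetical.

```python
def build_days(n_compounds, n_processes=1, compounds_per_cpu_day=8000):
    """Estimate elapsed days for conformer generation under ideal parallel scaling."""
    return n_compounds / compounds_per_cpu_day / n_processes


# The 250,000-compound example on one processor and on four processors.
print(build_days(250000))
print(build_days(250000, n_processes=4))
```

Actual times depend on the size and flexibility of the compounds, so the figures are planning estimates only.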
csh
set count = 1
while ($count < 26)
? catDB CONFIG part${count}.bdb
? @ count++
? end
The question marks are prompts and are not something you are supposed to type. It is also important to type a space before and after the < and the @ characters.
File run1:
nohup catDB SD file1.sd part1.bdb maxconfs=100 startafter='$0' \
stopafter='$10000' >& part1.out &
nohup catDB SD file1.sd part2.bdb maxconfs=100 startafter='$10001' \
stopafter='$20000' >& part2.out &
nohup catDB SD file1.sd part3.bdb maxconfs=100 startafter='$20001' \
stopafter='$30000' >& part3.out &
nohup catDB SD file1.sd part4.bdb maxconfs=100 startafter='$30001' \
stopafter='$40000' >& part4.out &
#nohup catDB SD file1.sd part5.bdb maxconfs=100 startafter='$40001' \
# stopafter='$50000' >& part5.out &
#nohup catDB SD file1.sd part6.bdb maxconfs=100 startafter='$50001' \
# stopafter='$60000' >& part6.out &
#nohup catDB SD file1.sd part7.bdb maxconfs=100 startafter='$60001' \
# stopafter='$70000' >& part7.out &
#nohup catDB SD file1.sd part8.bdb maxconfs=100 startafter='$70001' \
# stopafter='$80000' >& part8.out &
#nohup catDB SD file1.sd part9.bdb maxconfs=100 startafter='$80001' \
# stopafter='$90000' >& part9.out &
#nohup catDB SD file1.sd part10.bdb maxconfs=100 startafter='$90001' \
# stopafter='$100000' >& part10.out &
Note: The StartAfter= and StopAfter= options control the portion of the .sd file that is processed. Only the first four commands will be executed in this script file's current form as only four processors are being used in the parallel build. If more commands are uncommented (by removing the leading # character), a larger number of simultaneous processes will be started.
File run2:
#nohup catDB SD file2.sd part11.bdb maxconfs=100 startafter='$0' \
# stopafter='$10000' startcref=100001 >& part11.out &
#nohup catDB SD file2.sd part12.bdb maxconfs=100 startafter='$10001' \
# stopafter='$20000' startcref=100001 >& part12.out &
#nohup catDB SD file2.sd part13.bdb maxconfs=100 startafter='$20001' \
# stopafter='$30000' startcref=100001 >& part13.out &
#nohup catDB SD file2.sd part14.bdb maxconfs=100 startafter='$30001' \
# stopafter='$40000' startcref=100001 >& part14.out &
#nohup catDB SD file2.sd part15.bdb maxconfs=100 startafter='$40001' \
# stopafter='$50000' startcref=100001 >& part15.out &
#nohup catDB SD file2.sd part16.bdb maxconfs=100 startafter='$50001' \
# stopafter='$60000' startcref=100001 >& part16.out &
#nohup catDB SD file2.sd part17.bdb maxconfs=100 startafter='$60001' \
# stopafter='$70000' startcref=100001 >& part17.out &
#nohup catDB SD file2.sd part18.bdb maxconfs=100 startafter='$70001' \
# stopafter='$80000' startcref=100001 >& part18.out &
#nohup catDB SD file2.sd part19.bdb maxconfs=100 startafter='$80001' \
# stopafter='$90000' startcref=100001 >& part19.out &
#nohup catDB SD file2.sd part20.bdb maxconfs=100 startafter='$90001' \
# stopafter='$100000' startcref=100001 >& part20.out &
Note: The StartCref= option ensures that this portion of the database does not conflict with the pieces built by the run1 script. The number used (100001) is one greater than the total number of compounds in file1.sd. For a given .sd file referenced in a script, the value of the StartCref= option should always be the same, e.g., 100001 in the example above.
File run3:
#nohup catDB SD file3.sd part21.bdb maxconfs=100 startafter='$0' \
# stopafter='$10000' startcref=200001 >& part21.out &
#nohup catDB SD file3.sd part22.bdb maxconfs=100 startafter='$10001' \
# stopafter='$20000' startcref=200001 >& part22.out &
#nohup catDB SD file3.sd part23.bdb maxconfs=100 startafter='$20001' \
# stopafter='$30000' startcref=200001 >& part23.out &
#nohup catDB SD file3.sd part24.bdb maxconfs=100 startafter='$30001' \
# stopafter='$40000' startcref=200001 >& part24.out &
#nohup catDB SD file3.sd part25.bdb maxconfs=100 startafter='$40001' \
# stopafter='$50000' startcref=200001 >& part25.out &
Note: The StartCref= option ensures that this portion of our database does not conflict with the pieces built by scripts run1 and run2. The number used (200001) is one greater than the total number of compounds in file1.sd and file2.sd.
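Typing 25 nearly identical commands by hand is error-prone, so the run files can also be generated. The sketch below reproduces the numbering pattern used in run1, run2, and run3 (10,000-compound slices, part numbers running across files, and a StartCref equal to one more than the number of compounds in the preceding files). The function name build_commands is hypothetical, and the generated text should be reviewed before it is executed.

```python
def build_commands(sd_files, chunk=10000, maxconfs=100):
    """Generate catDB SD command lines for a list of (filename, n_compounds) pairs."""
    commands = []
    part = 1
    start_cref = 1
    for filename, n_compounds in sd_files:
        for first in range(0, n_compounds, chunk):
            # Mirror the scripts above: '$0' for the first slice, then '$10001', '$20001', ...
            start_after = first if first == 0 else first + 1
            stop_after = min(first + chunk, n_compounds)
            # The first file uses the default cref numbering, so no option is emitted.
            cref_option = "" if start_cref == 1 else f" startcref={start_cref}"
            commands.append(
                f"nohup catDB SD {filename} part{part}.bdb maxconfs={maxconfs} "
                f"startafter='${start_after}' stopafter='${stop_after}'{cref_option} "
                f">& part{part}.out &"
            )
            part += 1
        start_cref += n_compounds
    return commands


commands = build_commands([("file1.sd", 100000), ("file2.sd", 100000), ("file3.sd", 50000)])
print(len(commands))
print(commands[10])
```

Each command is emitted on one line rather than split with a backslash; how many of them to run at once still depends on the number of processors available.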
nohup catDB SD file1.sd part5.bdb maxconfs=100 startafter='$40001' \
stopafter='$50000' >& part5.out &
Note: Invariably one or more of your database construction jobs will be interrupted by normal system maintenance or an unexpected system failure. This is covered in "Stopping and Restarting Database Construction".
catDB MERGE db1 \
dblist=part1,part2,part3,part4,part5,part6,part7,part8,part9,part10 no1D
catDB MERGE db2 \
dblist=part11,part12,part13,part14,part15,part16,part17,part18,part19,part20 no1D
catDB MERGE db3 dblist=part21,part22,part23,part24,part25 no1D
The No1D option suppresses the generation of 1D property data for the resulting database because they will be created in the next step.
Note: To conserve disk space, the conformational model binary data files named part#.#.0bdb should be backed up onto tape after each merge has been completed. Then use the following command to purge the database files from disk.
catDB DELETE_DB part#
grep ">  <" file1.sd file2.sd file3.sd | sort -u
where the string being searched for is >  < (the greater than, space, space, and less than characters). See "Defining the Property Dictionary (.bpd) File" for a discussion of the construction of a property dictionary file. It is best to strictly limit the number of UNIVERSAL properties (preferably fewer than ten) that are indexed (QUICK_REF) as the indexes can require significant disk space, and they also tend to slow the 1D data creation process significantly. Once the property dictionary file has been constructed with the name DB.bpd, create the default 1D data with the following commands:
catDB CREATE_1D db1 propdict=DB.bpd
catDB CREATE_1D db2 propdict=DB.bpd
catDB CREATE_1D db3 propdict=DB.bpd
catDB SD_UPDATE file1.sd db1 errData=file1.errors.sd
catDB SD_UPDATE file2.sd db2 errData=file2.errors.sd
catDB SD_UPDATE file3.sd db3 errData=file3.errors.sd
Note: To conserve disk space, the three .sd files can be archived and deleted after the SD_UPDATE procedures have completed.
catDB CONFIG CorpDB
Use the default configuration that is provided. Merge the three pieces of the database using
catDB MERGE CorpDB DBlist=db1,db2,db3 No2Dindex No3Dindex
Note: To conserve disk space, back up the conformational model binary data files named db#.#.0bdb onto tape after merging the component databases. Then use the following commands to purge the database files from disk:
catDB DELETE_DB db1
catDB DELETE_DB db2
catDB DELETE_DB db3
catDB RECALC CorpDB
At this point the database is ready for general use by the scientists within your organization.
It is reasonable to expect interruptions at some point during the construction of a large database. These interruptions could be orderly (for example, routine preventive maintenance), or they could be unexpected in nature (for example, a power outage or system failure).
If you need to halt catDB processes for any reason, you can do so on a per-process basis. Each running database construction process is associated with a database configuration file that has a .bdb extension. To stop a database building process, change your current working directory to the directory that contains the database configuration file and create a new file whose name is the database configuration file name with .stop as an additional extension. For example, if the database configuration file is named part1.bdb and this file resides in the directory /home4/DB, the following commands stop the ongoing construction process:
cd /home4/DB
touch part1.bdb.stop
Database building stops once the next compound is processed. The only outward sign that a stop instruction has been received is the removal of the newly created file with the .stop extension. Once a stop instruction has been received it cannot be reversed.
Important Note: Do not employ the procedure described above if your catDB process is using the AllowNFS option, because you will not be able to restart the process as described below. For additional information, see "AllowNFS".
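The stop procedure above can be wrapped in a small helper, sketched here under the assumption that a POSIX shell is in use. The name request_stop is hypothetical and is not a catDB command. The helper does not check for the AllowNFS option, so the restriction in the note above still applies.

```shell
# Create the stop-request file next to a database configuration file.
# request_stop is a hypothetical helper name, not part of catDB.
request_stop() {
    config=$1
    if [ ! -f "$config" ]; then
        echo "no such configuration file: $config" >&2
        return 1
    fi
    # Work in the directory that holds the .bdb file, as the text requires.
    ( cd "$(dirname "$config")" && touch "$(basename "$config").stop" )
}

# Example with the names used in the text:
# request_stop /home4/DB/part1.bdb
```

Refusing to act when the configuration file is missing guards against a mistyped path, which would otherwise leave a stray .stop file that no process ever removes.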
Restarting database construction is independent of the way in which it was stopped or interrupted. The first task is to determine the last compound that was written to the database. This cannot reliably be obtained from any of the output and/or error files produced by the catDB program. You must use the command procedure that follows. For this example, we will restart the construction of part1.bdb.
catDB INFO part1.bdb Detail No1D |& grep Last
This command reports the name of the last compound that is saved in the database, for example:
Last compound in the database is 'Methylacetate'.
The next step is to modify the original command that started the process to indicate where database construction should resume. The modifications to the original command, the StartAfter= value and the APPEND option, appear in the example below and should be made in the run1 script file being used to track the parallel building processes.
nohup catDB SD file1.sd part1.bdb maxconfs=100 startafter="Methylacetate" \
stopafter='$10000' APPEND >& part1.out &
The argument change for the StartAfter= option provides the name of the last compound in double quotes, in contrast to the single quotes that were used to indicate a specific starting record number. The APPEND option is required to inform the process that you are intentionally adding to an existing database file.
Important Note: If your catDB process is using the AllowNFS option, you will not be able to restart it as described above. For more information, see "AllowNFS".
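When restarting, the compound name has to be copied from the report line into the StartAfter= option. The sketch below shows one way to extract it in a POSIX shell. It assumes the report line has exactly the form shown in the example above. The helper name last_compound_name is hypothetical.

```shell
# Pull the compound name out of the "Last compound" report line.
# last_compound_name is a hypothetical helper; it reads the report on stdin.
last_compound_name() {
    sed -n "s/^Last compound in the database is '\(.*\)'\.\$/\1/p"
}

# Sample line taken from the text; in real use the input would come from
# the catDB INFO command (in a POSIX shell, "2>&1 |" replaces the csh "|&").
name=$(echo "Last compound in the database is 'Methylacetate'." | last_compound_name)
echo "startafter=\"$name\""
```

The printed argument uses double quotes around the name, matching the quoting that the restart command expects for compound names.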
While the previous sections outline reliable ways to construct databases, there are a variety of problems that might be encountered.
Databases to be merged must not have overlapping crefs (internal compound reference numbers). Conflicts among crefs most commonly occur when multiple input files are used in the database construction process, and the StartCref= option is misused or not used at all. A simple but wasteful resolution of cref conflicts is to rebuild the affected portions of your database. A better solution is provided by the REPAIR_DB command, which has the following syntax:
catDB REPAIR_DB DBname.bdb StartCref=n
An appropriate starting cref number can be obtained with the catDB INFO command and its Detail option. See "INFO" for a detailed description of the command and its usage.
Note that the REPAIR_DB command has many restrictions. A database with crefs to be shifted must not have 1D property data, 2D indexes, or 3D indexes; each of these components will need to be remade after the cref shift has been completed. The REPAIR_DB command also works entirely in memory, so its use might be limited by your computer system's resources. See "REPAIR_DB" for details.
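Before merging, it can help to compare the cref ranges reported for each component database. The sketch below is a minimal overlap test for two inclusive ranges in a POSIX shell. The function name cref_ranges_overlap and the second range in the example are illustrative assumptions; the first range reuses the sample numbers from this document.

```shell
# Succeed (exit status 0) when two inclusive cref ranges share a value.
# Arguments: low1 high1 low2 high2
cref_ranges_overlap() {
    low1=$1 high1=$2 low2=$3 high2=$4
    [ "$low1" -le "$high2" ] && [ "$low2" -le "$high1" ]
}

# 13002-13145 against a made-up second range 13100-13200.
if cref_ranges_overlap 13002 13145 13100 13200; then
    echo "conflict: shift one database with REPAIR_DB or rebuild it"
fi
```

Two ranges are disjoint only when one ends before the other begins, so checking both lower bounds against the opposite upper bounds covers every case, including one range nested inside the other.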
A less serious problem arises when you construct a large database using CPU resources that are distributed across a network and need to access disk partitions that are mounted using NFS. In general it is better not to use NFS partitions because of problems with file locking and, more broadly, with the network traffic that results, but an override has been provided for the careful user. To permit the catDB program to use an NFS disk partition, use the AllowNFS option on the catDB command line. See "AllowNFS" for details.
If you made a mistake in a 1D property definition (such as specifying a DATE as a STRING), the most robust solution is to remove the 1D portion of the database, correct the property definition in the .bpd file, and reload the data using one of the UPDATE commands supported by the catDB program.
Database maintenance falls into the following main categories:
Updating 1D data is easy in Catalyst, and is a primary reason for building a database with 1D data. To change or remove values in a database's 1D component that you created:
You can also perform a mass update by making your changes in a spreadsheet and exporting it as a spreadsheet (.spst) file. Then use the catDB SPST_UPDATE command with the database's configuration file to update the values in the database with the ones in the .spst file. The syntax of the command is
catDB SPST_UPDATE spreadsheetname.spst outputDBname.bdb
Substitute the name of the database's configuration file for outputDBname.bdb and the name of the spreadsheet file for spreadsheetname.spst.
catDB ADD_PROPERTY outputDBname.bdb propertydictionaryname.bpd
catDB DELETE_PROPERTY outputDBname.bdb propertydictionaryname.bpd
To delete a set of compounds, first install the database in Catalyst and build a spreadsheet containing the names of all of the molecules to be removed. If a whole class of molecules is to be removed, the easiest way to do this is to perform a 1D search. Alternatively, you can use the Find command to locate molecules individually and save them in the Stockroom. In the latter case, browse the Stockroom, clear all rows except those with the names of compounds to be deleted, and save the resulting spreadsheet. In either case, export the spreadsheet as an .spst file. You can then use the spreadsheet file to remove the compounds in it from your database by issuing the command
catDB SPST_DELETE spreadsheetname.spst outputDBname.bdb
in which you substitute the name of your spreadsheet file for spreadsheetname.spst and the name of your database's configuration file for outputDBname.bdb.
The catDB program also lets you remove compounds from a database with extended spreadsheet files. See "DELETE" for details.
Adding new compounds to an existing database will be necessary to maintain an up-to-date archive of information for the scientists in your organization. This is most easily accomplished with the sequence of commands below. This example details an addition to a database named CorpDB that is described by a database configuration file called CorpDB.bdb.
catDB INFO CorpDB Detail No1D |& grep Cref
The command returns a single line of data such as
Cref low, high = 13002, 13145
You will need to add 1 to the higher number reported and use that value to specify the StartCref= option in a subsequent database building step. For this example, the correct specification is StartCref=13146 for the database building step described below.
catDB CREATE UpdateCorpDB.esp StartCref=13146
catDB MERGE CorpDB DBlist=UpdateCorpDB,CorpDB
The compounds in UpdateCorpDB will replace those in CorpDB if duplicate names are present. This procedure can be used to update existing compounds in a database. Note, however, that if compounds are replaced using this technique rather than using the Append/Replace Database Compounds... command in Catalyst, all existing 1D data for the replaced compounds will be lost. See "Append/Replace Database Compounds" for information on replacing 2D and 3D topologies while preserving 1D data.
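The StartCref arithmetic in this procedure (take the higher reported cref and add 1) is easy to get wrong by hand. The sketch below derives the value from the report line in a POSIX shell. It assumes the line has exactly the form shown in the example above, with no trailing characters. The function name next_start_cref is hypothetical.

```shell
# Compute the StartCref value from a "Cref low, high = L, H" report line.
# next_start_cref is a hypothetical helper; it takes the line as its argument.
next_start_cref() {
    high=$(printf '%s\n' "$1" |
        sed -n 's/^Cref low, high = [0-9]*, \([0-9][0-9]*\)$/\1/p')
    # Fail rather than guess when the line does not match the expected form.
    [ -n "$high" ] || return 1
    echo $((high + 1))
}

next_start_cref "Cref low, high = 13002, 13145"
```

For the sample line from the text this prints 13146, the value used in the StartCref= option above.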
As your system expands over time, it might be necessary to change the physical location of one or more of the database servers and the database files that store the various data components in the database. Such database configuration problems are handled with the catDB RECONFIG command. See "RECONFIG" for details on altering the locations of conformational models, the 2D indexes, the 3D indexes, and the feature dictionary used to construct the 3D indexes. Consult MSI Scientific Support for instructions on how to alter the location of the 1D data server for a database.
Copyright © 1999, Molecular Simulations Inc. All rights reserved.