Control File Specification
An XML text file that describes how to read a data file and load its contents into the database. Features include:
- Define and use keywords that are expanded at run time
- Keywords can contain other keywords
- Keys are defined once, then can be used in multiple Keyspaces / Mappings (no fear that typos will make the two different)
- Ability to remove extraneous data from keys at processing time
- Define new keyspaces, and/or add to existing ones
- Define new mappings, and/or add to existing ones
- Add links between items
- Robust handling of extraneous data, with the ability to define beginning of line comment characters, an end of data marker, and even the ability to ignore individual rows based on column contents
- User can specify the field / column separator character (defaults to <tab>)
- One data file per control file, but possibly many keyspaces and mappings per control file
- Planned, not yet implemented: download data files from remote hosts, then update their contents in the DB if anything has changed
- XML Schema description of the format, so you can build a control file using an XML parser (and thus have your file validated ahead of time)
File Format
The lexical definition of the control file format is given here, with explanations below. Unless otherwise mentioned, elements must be given in the order they are listed here. The actual tag name follows each element's title.
User Keywords UserKeywords
(Optional)
Zero or more Keyword elements, each consisting of:
- Key {<text>}: Any text (other than '{' or '}') surrounded by {} (e.g. <Key>{foo {bar}}</Key> is not a valid keyword). Case and whitespace are both significant.
- Definition <text>: Any text you want, including other keywords (e.g. <Definition>Today's date is {Date}</Definition>) or the Grinder's pre-defined keywords.
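As an illustration of how keywords that contain other keywords might be resolved at run time, here is a minimal Python sketch. It is not the Grinder's actual code: the function name `expand`, the sample definitions, and the depth limit are all assumptions made for the example.

```python
import re

# A key is a pair of braces around text that itself contains no braces.
KEYWORD = re.compile(r"\{[^{}]*\}")

def expand(text, definitions, max_depth=20):
    """Replace known {keywords} repeatedly until the text stops changing.

    Unknown keywords are left untouched. max_depth is an assumed guard
    against definitions that refer to each other in a circle.
    """
    for _ in range(max_depth):
        expanded = KEYWORD.sub(
            lambda m: definitions.get(m.group(0), m.group(0)), text
        )
        if expanded == text:
            return expanded
        text = expanded
    raise ValueError("keyword definitions appear to be circular")

# Hypothetical user keywords; {Label} is defined in terms of {Source}.
definitions = {
    "{Source}": "HGNC",
    "{Label}": "{Source} gene symbols",
}

print(expand("Loaded {Label}", definitions))  # Loaded HGNC gene symbols
print(expand("Loaded {label}", definitions))  # Loaded {label} (case is significant)
```

Because the lookup uses the exact key text, case and whitespace stay significant, as the specification requires.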
Data File DataFile
The full path to the file this control file is connected to. The file must be on the Servlet's file system (which is to say it must be accessible from Disco).
FTP: FTP
(Optional)
Complete FTP URL for downloading the data file. Planned, not yet implemented: if given, a new copy of the data file will always be downloaded before any processing is done.
Separator Separator
(Optional)
A Java string defining the field separator character (e.g. \t for a tab). Default is a tab.
Ignored text (Optional)
There are four ways to tell the Grinder to ignore text:
- Ignore <text>: A data file can begin with any number of comment lines. Comment lines must begin with a character, or string of characters, defined by an Ignore tag (you can have multiple Ignore tags). The comment block lasts only until the first non-blank, non-comment line.
- IgnoreLines <count>: <count> is an integer giving the number of non-comment lines to skip after the comment lines are finished. Typically 1, if you have a header line before the data starts.
- IgnoreIfColumn: There are three ways to do this. All three take a Column specifier, which tells the Grinder which column to look at.
  - IgnoreIfContains <text>: Ignore the row if the column contains this text
  - IgnoreIfIs <text>: Ignore the row if the column exactly matches this text
  - IgnoreIfIsNot <text>: Ignore the row if the column does not exactly match this text
- DoneIfBeginsWith <text>: Marks the end of the data in the data file. Nothing more is read after reaching a line that starts with this text. Case sensitive.
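The four ignore mechanisms can be pictured as a single pass over the data file. The Python sketch below is only one plausible reading of the rules above, not the Grinder's implementation: the names `read_rows` and `_matches`, the tuple form of the column rules, and the sample data are all invented for illustration. It also assumes that the end-of-data marker is checked on every line, including lines in the comment block.

```python
def _matches(fields, rule):
    """Return True if the row should be ignored under one column rule."""
    column, kind, text = rule  # column is 1-based
    value = fields[column - 1] if column <= len(fields) else ""
    if kind == "contains":   # IgnoreIfContains
        return text in value
    if kind == "is":         # IgnoreIfIs
        return value == text
    if kind == "is_not":     # IgnoreIfIsNot
        return value != text
    raise ValueError(f"unknown rule kind: {kind}")

def read_rows(lines, separator="\t", ignore_prefixes=(), ignore_lines=0,
              column_rules=(), done_prefix=None):
    """Yield the fields of every row that survives the ignore rules."""
    in_comment_block = True
    to_skip = ignore_lines
    for raw in lines:
        line = raw.rstrip("\n")
        if done_prefix is not None and line.startswith(done_prefix):
            break  # DoneIfBeginsWith: nothing more is read
        if in_comment_block:
            is_comment = any(line.startswith(p) for p in ignore_prefixes)
            if not line.strip() or is_comment:
                continue  # still inside the leading comment block
            in_comment_block = False
        if to_skip > 0:
            to_skip -= 1  # IgnoreLines: skip e.g. a header line
            continue
        fields = line.split(separator)
        if any(_matches(fields, rule) for rule in column_rules):
            continue
        yield fields

# Invented sample data: one comment, a blank line, a header, then rows.
sample = [
    "# sample file\n",
    "\n",
    "ID\tName\tStatus\n",
    "1\talpha\tApproved\n",
    "2\tbeta\tWithdrawn\n",
    "3\tgamma\tApproved\n",
    "END\n",
    "4\tdelta\tApproved\n",
]
rows = list(read_rows(sample, ignore_prefixes=["#"], ignore_lines=1,
                      column_rules=[(3, "is_not", "Approved")],
                      done_prefix="END"))
print(rows)  # [['1', 'alpha', 'Approved'], ['3', 'gamma', 'Approved']]
```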
Key Definitions KeyDefinitions
All keys used in either the Keyspaces or the Mappings must be defined here, and each Keyspace or Mapping must reference that definition.
KeyDefinitions contains one or more Key elements, each consisting of:
- Name {<text>}: Any text other than '{' or '}'. Case and whitespace are both significant.
- Definition <text>: Any text you want, so long as it includes at least one keyword. In practice it should include at least one {Col n} keyword; otherwise you will get only one key from the entire data file, although the XML Schema doesn't enforce that. Note that you can merge multiple columns together for one key, and / or filter out part of the text with <subtract text> in the column keyword(s) (see below).
Keyspace Definitions Keyspace
(Optional)
Definition of a Keyspace that may already exist in the database. If it does, any new keys will be added, and any key whose name matches an existing key will be ignored.
The elements of a Keyspace (which can be given in any order) are:
- Name: Name of this keyspace. Whitespace is significant. Static text, keywords, or both. This is the Keyspace's "handle". The Keyspace's name in the DB will be the same, except that all whitespace will be replaced with '_'. Required
- Species: Species this keyspace belongs to. Static text, keywords, or both. Required
- Description: Description of the keyspace. Static text, keywords, or both. Can be empty. Required
- URL: A URL for the people who created the data used in the keyspace. Optional
- FTP: FTP location of the data file that defines this keyspace. If not given, and the data file FTP is given, that will be used. Optional
- KeyInfo: Information describing how to get key information from the data file. Required
  - KeyName: Must be a KeyDefinitions name. Required
  - KeyDescription: Can be defined just like a name (e.g. "HGNC_{Col 1}" or "{Col 2} {Col 3} From {FileName}"), except that descriptions don't have to be unique, so they (unlike names) can be all static text. Required
  - ID: Highly optional, and not recommended. It must generate a unique integer id. A row with a non-unique id generates an error, rather than just being ignored. The normal situation is to not define this, and to let the database assign each record a row id.
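Two of the Keyspace rules above lend themselves to a short illustration: the handle-to-DB-name conversion, and the "add new keys, ignore existing ones" behavior. The Python sketch below is a guess at the semantics, not the actual implementation. The function names and the dictionary standing in for the database are hypothetical, and it assumes each individual whitespace character becomes one '_'.

```python
import re

def db_keyspace_name(handle):
    """Replace every whitespace character in the handle with '_'."""
    return re.sub(r"\s", "_", handle)

def add_keys(existing, new_keys):
    """Add keys whose names are not yet present; silently ignore the rest.

    existing maps key name -> description and stands in for the database.
    Returns the list of names that were actually added.
    """
    added = []
    for name, description in new_keys:
        if name in existing:
            continue  # name matches an existing key: ignored, not an error
        existing[name] = description
        added.append(name)
    return added

print(db_keyspace_name("HGNC Gene Symbols"))  # HGNC_Gene_Symbols

keyspace = {"HGNC_1": "first key"}
print(add_keys(keyspace, [("HGNC_1", "duplicate"), ("HGNC_2", "second key")]))
# ['HGNC_2']
```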
Mapping Definitions Mapping
(Optional)
Definitions of mappings between two or more Keyspaces.
The elements of a Mapping (which can be given in any order) are:
- Name: Name of this mapping. Whitespace is significant. Static text, keywords, or both. Required
- Description: Description of the mapping. Static text, keywords, or both. Required
- URL: A URL for the people who created the mapping. Optional
- FTP: FTP location of the data file that defines this mapping. Optional
- AddNew: An empty tag. If specified, then if the file has a mapping that uses key(s) that don't exist in their keyspace, the key is added to the keyspace and then the mapping is added. The default behavior is to generate warnings about unknown keys, not to add them. Optional
- Download Date: Timestamp for when this data was generated. Defaults to {FileDate}. Optional
- Quality: Integer giving the quality of the mapping. 1 is best, and is the default value. Optional
- MapData: The actual mapping information. Comes as one of two types. Required
  - MapPair: A mapping between two keyspaces. Can be bidirectional (A -> B implies B -> A) or one way. Child elements can be given in any order.
    - SourceKeyspace: Name of the keyspace the source keys belong to. Can be static, or variable, but an invalid keyspace name here will generate an error. While the Keyspace doesn't have to be one declared in the control file, normally it will be so declared. Required
    - SourceKeys: The source keys we are mapping from. Must be a KeyDefinitions name. Required
    - TargetKeyspace: Name of the keyspace the target keys belong to. Can be static, or variable, but an invalid keyspace name here will generate an error. While the Keyspace doesn't have to be one declared in the control file, normally it will be so declared. Required
    - TargetKeys: The target keys we are mapping to. Must be a KeyDefinitions name. Required
    - Bidirectional: An empty tag. If specified, then mappings go both ways, so both source::target and target::source will be added. Optional
    - Multiplicity: Gives the multiplicity of the source::target mapping. Options are [1-1|1-M|M-1|M-M]. Default is 1-1. Optional
    - SourceDescription: Static text (including keywords that resolve to something other than the contents of a column) that will be used as the description for any keys added to SourceKeyspace. Only used if AddNew is set. Optional
    - TargetDescription: Static text (including keywords that resolve to something other than the contents of a column) that will be used as the description for any keys added to TargetKeyspace. Only used if AddNew is set. Optional
  - MapFamily: Used to generate all bidirectional mappings between n keyspaces. Each MapFamily defines one of the n keyspaces involved. Child elements can be given in any order.
    - KeyspaceName: Name of the keyspace to use in the mappings. Can be static, or variable, but an invalid keyspace name here will generate an error. While the Keyspace doesn't have to be one declared in the control file, normally it will be so declared. Required
    - KeyName: The keys we are mapping from and to. Must be a KeyDefinitions name. Required
    - ExtraKeyDescription: Static text (including keywords that resolve to something other than the contents of a column) that will be used as the description for any keys added to the Keyspace. Only used if AddNew is set. Optional
    - MultiMap: An empty tag. If specified, then individual items in this keyspace map to multiple items in other keyspaces. Optional
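To make the difference between the two MapData types concrete, the following Python sketch shows which directed mappings a single data row might produce under each. It is an interpretation of the descriptions above rather than the Grinder's code: `map_pair`, `map_family`, and the (keyspace, key) tuples are hypothetical, and Multiplicity and MultiMap are not modeled.

```python
from itertools import permutations

def map_pair(source, target, bidirectional=False):
    """Directed mappings for one MapPair row.

    source and target are (keyspace_name, key) tuples.
    """
    pairs = [(source, target)]
    if bidirectional:
        pairs.append((target, source))  # A -> B implies B -> A
    return pairs

def map_family(members):
    """Every directed mapping between the n members of a MapFamily row.

    members holds one (keyspace_name, key) tuple per MapFamily element,
    so n members yield n * (n - 1) directed mappings.
    """
    return list(permutations(members, 2))

a, b, c = ("KeyspaceA", "a1"), ("KeyspaceB", "b1"), ("KeyspaceC", "c1")

print(len(map_pair(a, b)))                      # 1
print(len(map_pair(a, b, bidirectional=True)))  # 2
print(len(map_family([a, b, c])))               # 6
```

With three keyspaces, a MapFamily is therefore shorthand for three bidirectional MapPair definitions.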
Pre-defined Keywords
The following keywords are defined by the parser:
- {Col n -<subtract text>}: The text from the nth column (first column is 1), minus <subtract text> (if any). For example, if column 1 of a certain row contains "MGI: 5", then {Col 1-MGI: } would produce '5' for that row, and {Col 1-MGI:} would produce ' 5'.
- {Date}: The current date
- {Time}: The current time
- {Time Stamp}: The current date and time
- {File Date}: The "last modified" date and time for the data file. All four of the above are defined by the current locale
- {Data File}: The data file managed by this control file
- {File Name}: The name of the data file managed by this control file
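The {Col n -<subtract text>} keyword carries most of the per-row work, so here is a small Python sketch of how it might be expanded against one row. The regular expression and the function `expand_columns` are assumptions made for illustration. The sketch also assumes that every occurrence of the subtract text is removed, and that optional whitespace may separate the column number from the dash.

```python
import re

# {Col n} or {Col n-<subtract text>}; the subtract text may not contain braces.
COL = re.compile(r"\{Col (\d+)\s*(?:-([^{}]*))?\}")

def expand_columns(template, row):
    """Replace each {Col n ...} keyword with text taken from the row."""
    def replace(match):
        value = row[int(match.group(1)) - 1]  # columns are numbered from 1
        subtract = match.group(2)
        return value.replace(subtract, "") if subtract else value
    return COL.sub(replace, template)

row = ["MGI: 5", "left", "right"]
print(repr(expand_columns("{Col 1-MGI: }", row)))     # '5'
print(repr(expand_columns("{Col 1-MGI:}", row)))      # ' 5'
print(repr(expand_columns("{Col 2} {Col 3}", row)))   # 'left right'
```

The first two calls reproduce the example from the keyword list: trailing whitespace inside the subtract text is significant.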
Examples
- HGNC Example
- GEO GPL96 Example
-- Main.gregd - 09 Jul 2007