CurrentDesign | Systems Biology Group

Main Table
Mapping Table
Paths Table
Key Space Table
Link Table

Main Table

Master list of key spaces. Will be used to tell the user what we have available.

id: 32 bit unsigned integer, unique id within the table. How it's referenced from links
name: Name of Key space, which is also the name of the table that holds items from that key space, so must be unique. 30 chars max
species: The species of the key space
description: Description of the key space, 255 char max (if decide need more, that's not a problem).
url: URL of key space, or null if it doesn't have one.
ftp: URL for where we got the data for this key space, or null if it didn't get for someplace with a URL.
lastDownloaded: When was the data generated. Defaults to when the entry was added to this table, which is probably a bad idea.
flags: Currently set to 0. 64 bit flags for later use.

CREATE TABLE KeySpaces (
   id INT UNSIGNED NOT NULL AUTO_INCREMENT DEFAULT NULL,   # PRIMARY KEY field
   name VARCHAR(30) NOT NULL UNIQUE,          # Also indexed
   species VARCHAR(30) NOT NULL,             # Also indexed
   description VARCHAR(255) NOT NULL DEFAULT '',
   url VARCHAR(100) DEFAULT NULL,
   ftp VARCHAR(100) DEFAULT NULL,
   lastDownloaded TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
   flags BIT (64) NOT NULL DEFAULT b'0',
   PRIMARY KEY (id),
   INDEX (species))
      ENGINE = MyISAM;

Mapping Table

Master list of mappings between key spaces. Will be used to tell the user where the mapping came from, and will be used to decide which link tables to use (when there's a conflict).

id: 32 bit unsigned integer, unique id within the table. How it's referenced from links
name: Name of the mapping, must be unique. 30 chars max
description: Description of the mapping. Who did it / where did it come from, anything else the user might want to know when deciding what to trust (i.e. all provenence information should be here). 255 char max (if decide need more, that's not a problem).
url: URL of the mapping orgainization / individual, or null if it doesn't have one.
ftp: URL for where we got the data for this mapping, or null if it didn't get for someplace with a URL.
lastDownloaded: When was the data generated. Defaults to when the entry was added to this table, which is probably a bad idea.
quality: Integer, currently defaults to 1 (best). Number to help determine when, and if, to use this mapping.
flags: Currently set to 0. 64 bit flags for later use.

CREATE TABLE Mappings (
   id INT UNSIGNED NOT NULL AUTO_INCREMENT DEFAULT NULL,   # PRIMARY KEY field
   name VARCHAR(30) NOT NULL UNIQUE,          # Also indexed
   description VARCHAR(255) NOT NULL DEFAULT '',       # aka provenance.  Who did this mapping?
   url VARCHAR(100) DEFAULT NULL,
   ftp VARCHAR(100) DEFAULT NULL,
   lastDownloaded TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
   quality INT UNSIGNED NOT NULL DEFAULT 1,
   flags BIT (64) NOT NULL DEFAULT b'0',
   PRIMARY KEY (id))
      ENGINE = MyISAM;

Paths Table

Master list of best paths to follow when mapping from one keyspace to another. This is what makes the Grinder. Unanswered question: How many possible paths per keyspace pair?

sourceKS: 32 bit unsigned integer, ID of the starting keyspace
targetKS: 32 bit unsigned integer, ID of the ending keyspace
pathRank: 32 bit unsigned integer, The best path (first one to check) will have rank 1, second best rank 2, and so on.
mapOrder: 32 bit unsigned integer, Where this intermediate step is in the path
ksID: 32 bit unsigned integer, ID of this intermediate keyspace
Example of use: If you have a Path A - B - C - D, then A is the starting keyspace, D is the ending keyspace, and there will be two entries in this table for this pat. B will have a mapOrder of 1, and C will have a mapOrder of 2.

CREATE TABLE BestLinks (
   sourceKS INT UNSIGNED NOT NULL,      # ID of source keyspace table
   targetKS INT UNSIGNED NOT NULL,      # ID of target keyspace table
   pathRank INT UNSIGNED NOT NULL,      # 1 for first path to check, 2 for second, and so on
   mapOrder INT UNSIGNED NOT NULL,      # Go from source to 1, 1 to 2 ... N to target
   ksID INT UNSIGNED NOT NULL,      # Id of the keyspace to go through at this point
   INDEX kSes (sourceKS, targetKS))   # Will always search on source, target, named so can tell DB to use it
      ENGINE = MyISAM;

Key Space Table

One for each key space we use. Everything revolves around these tables.

id: 32 bit unsigned integer, unique id within the table. How it's referenced from links
name: 30 char max key space name: what the user will give us, or get. Unique, indexed.
description: Description of the item, may get rid of, do we need it? 255 char max (if decide need more, that's not a problem).

CREATE TABLE <KeySpace> (
   id INT UNSIGNED NOT NULL AUTO_INCREMENT,   # PRIMARY KEY field
   name VARCHAR(30) NOT NULL UNIQUE,       # Also indexed
   description VARCHAR(255) NOT NULL DEFAULT '',
   PRIMARY KEY (id))
      ENGINE = MyISAM;

Link Table

One for each key space we use, containing all outgoing links. It's the workhorse of the Grinder.
An index built on geneID + targetName would give us exactly the rows we wanted for any given search. However, an index build on just geneID, while not giving unique hits, results in a much smaller index (4 bytes per entry rather than 34) that will return a small number of rows to be searched for a match. In the case of all 1-1 mappings it will return a max of the # key spaces - 1.
While I may revisit this decision later, for now this seems the quicker route to follow.

sourceID: ID of item in the key space's table. Our starting point. Not unique, since there will be one of these for each table that has a direct link for this item.
targetID: ID of item in the target key space's table. Our ending point. Not necessarily unique, since we're mapping to and from row ids.
targetKS: ID of the key space that this link item maps to.
linker: ID of the mapping used to make this link.
mutiplicity: Enumerated type describing the link (1-1, 1-M(any), M-1, M-M). Will be used to evaluate the "quality" of the link, and link path.
biDirectional: Does the link go both ways. FALSE if this item is not the best Blast hit for target.
sourceID Index: Index we'll use to find links.
targetKS Index: Index of target key space IDs. There because it might be useful to be able to figure out how many links key space A has to key space B.
linker Index: Index of mapping IDs. There because it might be useful to be able to figure out all the links done by one mapping.

CREATE TABLE <KeySpace Links> (
   sourceID INT UNSIGNED NOT NULL,      # Starting gene
   targetID INT UNSIGNED NOT NULL,      # Ending gene
   targetKS INT UNSIGNED NOT NULL,      # ID of target keyspace table
   linker INT UNSIGNED NOT NULL,      # ID from Mappings Table.  Who did this mapping
   mutiplicity ENUM('1-1', '1-M', 'M-1', 'M-M') NOT NULL DEFAULT '1-1',
   biDirectional BOOLEAN NOT NULL DEFAULT TRUE,   # Does link go both ways?
   INDEX (sourceID),            # Not unique, but should be useful
   INDEX (targetKS),            # Not sure will be useful, since will be thousands of each
   INDEX (linker))            # Not sure will be useful, since will be thousands of each
      ENGINE = MyISAM;

-- Main.gregd - 21 Jun 2007

Grinder_ER_Diagram.pdf: Entity Relationship Diagram for the Grinder DB

Attachment	Size
Grinder_ER_Diagram.pdf	43.42 KB