IDB database system and its use in the human genome project: HUGEMAP

Eric Viara, Stuart Pook, Bruno Lacroix, Michel Tissot, Laurent Atlan, Annick Cohen-Akenine, Guy Vaysseix & Emmanuel Barillot
Généthon, CEPH

1) The IDB database system
We designed an object-oriented database management system, IDB (Integrated DataBase system). It is implemented on top of an efficient in-house storage manager (SE) and offers:
- reliability: two million objects are currently stored in our working database.
- efficiency: simpler and faster than all the other systems we tested.
- flexibility: C code is automatically generated from a schema. Modifying the schema (adding new classes, for example) does not require recreating the associated databases.
- traditional functionalities: generalized trigger rules, complex query expressions (with regular expressions), type polymorphism, indexes, complex keys, methods in the database.
- multi-database management: object cross-references between IDB databases are managed by the system.
- protections: they exist at the level of the object. Each object has a protection; each user has a list of protections that limits the objects he or she can read or write.
- an interface to an interpreted, syntax-complete language: Tcl. All functions and structures of the API and the client programs, as well as C functions, have been imported into Tcl. This permits complex interactive queries and interpreted Tcl scripts.
- a generic (meta-description independent) browser: navigation through any IDB database.
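The object-level protection scheme listed above can be illustrated with a minimal sketch. It is written in Python for brevity and is not IDB code: the class names, the separate read and write flags, and the example protections are all assumptions made for illustration. The only thing taken from the description is that each object carries one protection and each user holds a list of protections.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Grant:
    """One entry in a user's protection list (hypothetical structure)."""
    protection: str
    can_read: bool = True
    can_write: bool = False


@dataclass
class StoredObject:
    """A database object; it carries exactly one protection."""
    oid: int
    protection: str


@dataclass
class User:
    """A user with a list of protections limiting what can be read or written."""
    name: str
    grants: list = field(default_factory=list)

    def _grant_for(self, obj):
        # Find the entry of the user's list that matches the object's protection.
        for grant in self.grants:
            if grant.protection == obj.protection:
                return grant
        return None

    def can_read(self, obj):
        grant = self._grant_for(obj)
        return grant is not None and grant.can_read

    def can_write(self, obj):
        grant = self._grant_for(obj)
        return grant is not None and grant.can_write


# Hypothetical example data.
clone = StoredObject(oid=1, protection="physical_map")
draft = StoredObject(oid=2, protection="unpublished")
guest = User("guest", [Grant("physical_map")])
curator = User("curator", [Grant("physical_map", can_write=True),
                           Grant("unpublished", can_write=True)])
```

In this sketch the guest can read the clone but cannot write it and cannot see the draft at all, while the curator can read and write both objects.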
We are now planning to implement a new release of SE/IDB with version management and a client-server architecture.

2) The HUGEMAP database: contents
The SE/IDB system was designed to store and facilitate access to the human genome data produced at Généthon and at the Human Polymorphism Study Center (CEPH). Généthon's large-scale approach to physical mapping, genetic mapping and cDNA sequencing required an effective database system, and no existing system was judged satisfactory. An integrated database of the human genome, HUGEMAP, was therefore created using IDB. It includes all of Généthon's and CEPH's physical mapping data (clone sizes and fingerprints, Alu-PCR mediated hybridization results, STS screenings, ...), an integrated map of the human genome, part of Généthon's genetic mapping data and a cytogenetic description of the human genome (ISCN 850). The scale of the human genome project has required enlarging this database. We are currently integrating external physical mapping data and extending the meta-schema to include genetic data, in particular Genbase (the CEPH database containing the data of the worldwide collaborative research on the genetic map). We are also investigating the integration of cDNA production and screening results, cytogenetic translocation data, and sequence data: we plan to write a translator that will generate an IDB meta-schema from a description in ASN.1, allowing us to import the GenBank data into an IDB database using the NCBI software development toolkit.

3) The HUGEMAP database: client programs
Several clients of HUGEMAP have been written to assist us in building physical maps and exploiting the physical and genetic maps:
- clone, STS or chromosome-oriented queries: present all the available information on the specified clone, STS or chromosome (size, fingerprints, STS screening results, Alu-PCR mediated clone-to-clone hybridization, chromosomal assignment, FISH results, genetic map, ...).
- clone overlap likelihood computations: calculate the most likely overlap of two clones from their restriction fingerprints.
- contig assembly: uses STS content, overlap likelihoods and Alu-PCR mediated hybridizations to look for the connected parts of the map.
- clone ordering: a first program finds the shortest clone paths between two starting points in the genome (basically two adjacent STSs from the genetic map) by performing a breadth-first search in the graph of clones (where clones are linked based on their STS content, overlap likelihood or mutual hybridizations). A second program uses a genetic algorithm to optimize the map construction (clone and STS positioning), taking into account all available information (genetic and physical mapping data).
- a map viewer: provides a graphical representation of the integrated physical and genetic maps. Using the viewer, you can select objects with the mouse and apply other programs to this selection. We are implementing a front-end with a workbench, containing database objects and directories of objects, on which different filters and programs can be run. This will offer a unified view of the HUGEMAP facilities.
- data servers: a mail server and a WWW server can be used to query the HUGEMAP database through some of its clients.
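The contig assembly and clone ordering steps described above both operate on a graph of clones. The following Python sketch illustrates the idea under simplifying assumptions and is not the HUGEMAP implementation: clones are linked only when they share an STS (overlap likelihoods and hybridizations are ignored), the search runs between two clones rather than between two STSs, and all function names and data are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_clone_graph(sts_content):
    """Link two clones when they share at least one STS.

    sts_content maps a clone name to the set of STSs it contains.
    """
    clones_by_sts = defaultdict(set)
    for clone, stss in sts_content.items():
        for sts in stss:
            clones_by_sts[sts].add(clone)
    graph = {clone: set() for clone in sts_content}
    for clones in clones_by_sts.values():
        for a, b in combinations(sorted(clones), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def shortest_clone_path(graph, start, goal):
    """Breadth-first search: return one shortest clone path, or None."""
    if start not in graph or goal not in graph:
        return None
    parent = {start: None}
    queue = deque([start])
    while queue:
        clone = queue.popleft()
        if clone == goal:
            # Walk back through the parents to rebuild the path.
            path = []
            while clone is not None:
                path.append(clone)
                clone = parent[clone]
            return path[::-1]
        for neighbour in sorted(graph[clone]):
            if neighbour not in parent:
                parent[neighbour] = clone
                queue.append(neighbour)
    return None


def contigs(graph):
    """Return the connected components of the clone graph as sets."""
    seen = set()
    components = []
    for clone in sorted(graph):
        if clone in seen:
            continue
        component = set()
        queue = deque([clone])
        seen.add(clone)
        while queue:
            current = queue.popleft()
            component.add(current)
            for neighbour in graph[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Hypothetical STS content for five clones.
STS_CONTENT = {
    "cloneA": {"sts1"},
    "cloneB": {"sts1", "sts2"},
    "cloneC": {"sts2", "sts3"},
    "cloneD": {"sts3"},
    "cloneE": {"sts9"},
}
GRAPH = build_clone_graph(STS_CONTENT)
```

With this data, cloneA to cloneD are chained through shared STSs and form one connected part of the map, while cloneE stands alone. Breadth-first search visits clones in order of distance from the start, so the first path that reaches the goal uses the fewest clones.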
We are in the process of making data import easier, so that users can add their own data to the existing database. We think that the HUGEMAP database, together with its clients and its triple interface (C function API, interface to an interpreted language, graphical interface), is a useful tool for the human genome project. It can also serve as a basis for collaborative research and as a starting point for a new family of molecular biology databases.