| Structural and functional analysis of human zinc finger gene clusters |
Human Genome Computing

Recently, in the framework of the human genome sequencing effort, it has been clearly shown that analytical and design tasks in modern molecular biology cannot be accomplished without appropriate hardware and software resources and special expertise in biomedical computing (e.g. the generation of ESTs [Adams et al.]). More generally, the development of expert systems for human genome computing requires a wide range of methods and intelligent tools, such as:
Access to public databases available via INTERNET

Work on the detection of structure-function relationships will be based on, and should exploit, the rapidly growing contents of the major protein and nucleotide databases (EMBL, SWISS-PROT, TRANSFAC, TIGR, PROSITE etc.) relevant to the research topics addressed in Projects I and II. Fortunately, the IGD project [Ritter et al.] provides a common view of the most prominent databases. It is therefore the appropriate tool for extracting selected facts on the topic at hand. Moreover, IGD is based on a common (meta-)database structure developed for ACEDB (A C. elegans Database). One advantage of ACEDB is that public data and a site's own experimentally obtained data can be combined in a common database. The database structure of ACEDB, built on concept classes and generic attributes, seems sufficiently general to be used by human genome projects.
Visual presentation

Visualisation of sequences, maps, binding sites and 3-D conformations is a powerful method for eliciting knowledge within the research process. A large body of hardware and software supports this task (INSIGHT, RASMOL, MOLSCRIPT etc.). In particular, the visualisation of DNA binding could be very helpful in our project.
Basic methods of human genome computing

Basic methods of human genome computing support searches in databases, sequence comparisons, analysis of physical mapping data, assembly of DNA sequences, detection of functional DNA target sites and the prediction of gene functions. Research on molecular biological topics needs software to search for similar protein or nucleotide sequences. A tremendous amount of public-domain or commercially available software exists which could be used in our project [Bishop]. Computer programs that detect regions of specific biological function, e.g. coding or non-coding regions, promoter or enhancer regions, usually rely heavily on statistical dependencies [Kondrakhin et al.], [Buchner]. The most advanced computer-based methods for modelling protein-DNA binding aim to determine a pattern that reliably predicts the binding activity of different binding sites. For instance, [Stromo] uses a matrix of patterns derived from statistics instead of a consensus sequence.
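The difference between a statistical matrix and a single consensus sequence can be illustrated with a small position weight matrix. The following Python sketch is not taken from the cited work or from any of the programs mentioned above; the function names, the uniform background frequency of 0.25, the pseudocount and the four example sites are all assumptions chosen for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned, equal-length sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + len(BASES) * pseudocount
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        # Pseudocounts keep unseen bases from producing log(0).
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in BASES
        })
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds scores of one candidate site."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    )

# Invented example sites, for illustration only.
example_sites = ["GCGTGGGCG", "GCGGGGGCG", "GCGTGGGCG", "GCGTAGGCG"]
example_pwm = build_pwm(example_sites)
top_score, top_start = best_hit(example_pwm, "TTAGCGTGGGCGAT")
print(top_start, round(top_score, 2))
```

Unlike a consensus string, the matrix grades every candidate window, so sites that deviate from the most frequent base at one position still receive a score that can be ranked.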
Intelligent systems and expert systems

Expert systems are based on a computer technology that is used in many domains of industry and service, where it plays a key role in the enhancement of production and service processes. Expert systems are characterized by the accumulation and codification of knowledge to provide high-level expertise for end-users [Waterman]. In biotechnology, expert systems are poorly established, although some research systems have been designed and implemented. For instance, ARIADNE [Lathrop et al.] and the system in [Brugge, Buchanan] deduce protein conformations from primary sequences. Our group developed an expert system for the prediction of protein membrane binding [Müller et al.]. From the work conducted on expert systems in biotechnology so far, it seems likely that this technology could play a similar key role in human genome computing as in other domains.

Besides this symbolic method, sub-symbolic intelligent systems have been developed. Recently, artificial neural networks have been applied in human genome computing. In GRAIL, several statistical methods ([Fickett] etc.) are combined via a neural network to detect coding regions. Despite the impressive success of this architecture, some severe shortcomings remain. A remarkable amount of adaptation is necessary before data can be used by a neural network. Moreover, neural networks in general provide no incremental learning: all examples have to be at hand, otherwise the performance is poor. Neural networks provide methods for classification tasks; they do not perform well on design tasks, which are one of our aims.

A more promising approach seems to be inductive machine learning in molecular biology. These methods try to automate the building of biological knowledge (e.g. consensus sequences, coding regions) from positive and negative examples of a biological compound (e.g. transcription factors). They use biological background knowledge to guide the knowledge generation process.
Most of the work done with these methods in the last ten years concerns protein folding and molecular design tasks [Schulze-Kremer, King], [Bolis et al.], [King], [Hayes-Roth et al.], [Friedland, Kedes]. All of them are research systems. Most inductive machine learning methods, however, require a clear statement of whether an example, e.g. a zinc finger protein, interacts with specific nucleic acid sequences or, vice versa, does not specifically recognize particular sequences (see Project II). This example reflects the experimental observation that DNA-binding proteins bind to nucleic acids in general. DNA binding sites that display high affinities in interaction with transcription factors qualify as potent target sites; DNA binding sites with low affinities obviously do not. But how do DNA binding sites that display medium affinities qualify? This is exactly the problem to be solved. Moreover, most inductive learning methods are not incremental; rather, they need all training examples at the beginning of the training process that generates the knowledge base. The usual research process, however, is characterized by a detection-analysis-knowledge-forming cycle that accumulates biological facts incrementally.
Application of case-based reasoning in human genome computing within the European Union CASTING effort

Case-based reasoning (CBR) is a problem-solving technique that offers the ability to develop expert systems more cost-effectively and on a much shorter development time scale than the methods currently in use by European industry, service and research. This technology has proved itself throughout the United States and at many major universities in recent years, but is relatively new to European companies developing expert systems. Expert systems are one of the success stories of artificial intelligence research. CBR technology moves the frontiers of this research even further forward, enabling developers to create accurate decision support systems and to automate problem-solving processes based on the analysis of previous cases and examples. A system for a protein engineering problem showed the applicability of CBR methods in biocomputing [Napoli, Lieber].

Other examples of CBR methods in biomedical expert systems are our previous work on systems supporting immunological and genetic problems (e.g. [Gierl et al.], [Gierl], [Gierl, Stengel-Rutkowski], [Schmidt et al.], [Swoboda]).

Molecular biology is distinguished from other knowledge domains by the professional documentation of results (e.g. sequences) produced during research. Numerous data collections have been accumulated, but the intrinsic biological experience held in these databases is rarely used in knowledge-based systems. Now a suitable technique, case-based reasoning, which is a methodology for reasoning and learning, has reached a state of maturity. The rapidly growing interest of the artificial intelligence community in case-based reasoning provides an increasing set of methods. Case-based reasoning means using old experiences to understand and solve new problems.
In case-based reasoning, a reasoner remembers a previous situation similar to the current one and uses it to solve the new problem [Kolodner]. A case, in the context of the work proposed here, is a set of essential features that characterizes one or several specific solution(s) for a transcription factor binding site and therefore forms the boundaries of this class of treatments. These features can be expressed as categorical, ordinal or numerical attributes. On the other hand, CBR depends on the comparison of cases in terms of the similarity of their features. In this context, the features connected with a case are only putative entities which can be used in determining the similarity between two or more cases. "This similarity can be derived from sharing of many different features or properties - not all of which need be necessary for category membership. We then have a picture of the (biological) world being divided conceptually into clusters of similar items, each cluster having a well-defined centre, while the border between one cluster and the next may be relatively poorly defined." [Scutliffe, p. 68] The vague border between prototypes is formed by a set of single cases connected to a particular prototype. Prototypes and cases form a hierarchy of well-defined centres and vague borders. Prototypes are constructed when several similar cases reach a defined frequency. The most popular approach to the similarity problem, i.e. comparing the known cases with the new one, is to use a measure such as Tversky's contrast model of features [Tversky] or the Rosch model of category resemblance. Moreover, knowledge acquisition can be simplified and improved, because CBR systems collect knowledge of a specific biological environment incrementally and automatically. CBR systems therefore use site-specific and time-dependent biological knowledge.
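Feature-based retrieval of this kind can be sketched with a contrast-model measure of the form S(a, b) = theta * f(A ∩ B) - alpha * f(A - B) - beta * f(B - A), where A and B are the feature sets of two cases. The Python sketch below is an illustration only: it assumes that f is simple set cardinality, that features are (attribute, value) pairs, and that the case names, attributes and weights are hypothetical.

```python
def tversky_contrast(a, b, theta=1.0, alpha=0.5, beta=0.5):
    """Contrast-model similarity with f taken as set cardinality (an assumption)."""
    a, b = set(a), set(b)
    common = len(a & b)        # features shared by both cases
    only_a = len(a - b)        # features distinctive of the query
    only_b = len(b - a)        # features distinctive of the stored case
    return theta * common - alpha * only_a - beta * only_b

def retrieve(query, case_base, k=1):
    """Return the k stored cases most similar to the query, best first."""
    ranked = sorted(
        case_base.items(),
        key=lambda item: tversky_contrast(query, item[1]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical cases described by (attribute, value) features.
case_base = {
    "site_A": {("core", "GGGCG"), ("affinity", "high"), ("fingers", 3)},
    "site_B": {("core", "GGGCG"), ("affinity", "low"), ("fingers", 3)},
    "site_C": {("core", "TATAA"), ("affinity", "high"), ("fingers", 2)},
}
query = {("core", "GGGCG"), ("affinity", "high"), ("fingers", 3)}

for name, features in retrieve(query, case_base, k=2):
    print(name, tversky_contrast(query, features))
```

Setting alpha and beta to different values makes the measure asymmetric, which is one way to weight the distinctive features of a new case differently from those of a stored case.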
Most CBR expert systems use specific knowledge representations. It seems likely that the CASUEL syntax, a case-based description language defined in the ESPRIT project INRECA, will emerge as a CBR standard [CASUEL]. It provides modelling of taxonomies, inheritance and adaptation knowledge. It is general enough to design the knowledge representation for a wide variety of classes of molecular biology knowledge. Formalized knowledge on transcription factors (zinc fingers) [Suzuki et al.] or general knowledge on biological entities like DNA [Schroeder, Blattner] could be modelled using CASUEL. Different questions about DNA-binding sites require different views of the case-based knowledge base. One possibility for coping with this problem is goal-based retrieval of prototypes and cases [Seifert]. The aim here is to formulate goals explicitly for particular retrieval contexts. The importance of CBR is underlined by the CASTING program recently launched by the CEC in the framework of ESPRIT III. The aim is to support the transfer of CBR technology into European industry and to raise European awareness of CBR in general. The CEC will encourage European industry to use CBR technology in producing CBR tools and to apply these tools in as many domains as possible.
Cognitively appropriate researcher/expert system interface

In an early work [Teach, Shortliffe], the requirements for an expert system in biomedicine were investigated empirically. The results show that one of the most important requirements is that such systems provide explanations of what they are doing in the process of automatic learning (automatic generation of the knowledge base). Providing such explanations is an advantage of symbolic learning systems and especially of CBR.

Up to now, however, there is no definition of a standard set of functions required for cognitively appropriate support in biomedical research that provides rapid, concise and guided interaction. We have suggested the notion of cognitive open expert systems, which provide cognitively appropriate functions to support the interaction between man and expert system [Gierl].
Communication methods

Two prerequisites are required to integrate the expert systems into the INTERNET. Access to public databases is usually accomplished via the World Wide Web, FTP servers and e-mail servers. Since public databases are extended practically every hour, it is necessary to implement a transaction-oriented communication system [Gierl et al.] that automatically updates local databases by searching public databases for interesting new facts.
Service for a public knowledge base

Providing a knowledge base of transcription factors that contributes to worldwide communication within the community of human genome researchers requires a service that integrates this knowledge base into the INTERNET and maintains it. Since the technical resources are available and access to the INTERNET (for instance via the World Wide Web) is ubiquitous, this is primarily an organisational problem.
References
Expert system for the analysis of human zinc finger transcription factors

Our research initiative is linked to the HGF-Concept by the aim of developing a powerful expert system for the structural and functional analysis of eventually several hundred zinc finger proteins (see HGF-Concept, p. 11). This expert system, dedicated to handling information obtained from the analysis of human zinc finger proteins, might lead to an integrated intelligent system that could serve as a nucleus for structuring sequencing information and for describing functional networks of gene regulation. We would like to develop intelligent tools for handling the huge amount of information already available on zinc finger genes and their products. In particular, an expert system will be established to determine DNA binding preferences for Krüppel-type zinc finger proteins. Furthermore, TF-EXPERT will include the current knowledge on zinc finger protein functions, supplemented by incoming results from Projects I and II. It will, moreover, serve as a general research tool for scientists working on topics concerning transcriptional gene regulation. In particular, the design and engineering of synthetic zinc fingers might be modelled with the help of TF-EXPERT. The local integration of experimental work with intelligent systems of human genome computing might lead to novel concepts and models essential for understanding regulatory circuits, exemplified by the regulation of gene expression in the human organism.
Recently, in the framework of the human genome sequencing effort, it has been clearly shown that analytical and design tasks in modern molecular biology cannot be accomplished without appropriate hardware and software resources and special expertise in biomedical computing (e.g. the generation of ESTs [Adams et al.]). Despite the putative simplicity of zinc finger protein binding, there remains a large number of information processing problems related to the complexity and the amount of biological data, e.g. the problem of determining the DNA target site specificities of zinc finger proteins harboring more than 3 or 4 zinc fingers. As [Suzuki et al.] state, "That the communication between DNA and protein can be described with specifity, from chemical, to the stereochemical, to the spacing, to the superspacing levels." Therefore, in supporting Projects I and II, our aim is (see figure Overview of TF-EXPERT):
Moreover, we will provide a service (TFkb, Transcription Factors Knowledge Base) for the genome research community that offers decision support in the detection of novel transcription-factor-related knowledge, e.g. target genes.
Figure: Overview of TF-EXPERT
Intelligent analysis of human transcription factors

In particular, an expert system for the analysis of human transcription factors (figure: TF-EXPERT - Analysis of zinc fingers) will be established and made available to the genome research community as the Transcription Factors Knowledge Base (TFkb). This service will be maintained with resources of the human genome initiatives and will place strong emphasis on the characterization of human transcription factors, in particular of zinc finger gene families. Structures and functions of human zinc finger gene clusters, such as regulatory sequences, intronic structures and cis-acting elements, will be determined. In addition, knowledge of DNA-protein interactions will be implemented in this system. Furthermore, by comparing the sequences of zinc finger genes from separate but related clusters, evolutionary trees might be derived. This knowledge will be integrated into TFkb automatically and incrementally in a process of abstracting knowledge on transcription factors. We will use parts of our expert system ICONS to implement these functions in Common Lisp.
Figure: TF-EXPERT - Analysis of zinc fingers
Target site prediction as matching the knowledge base

Our contribution aims at the development of an essential bioinformatics technology for the functional analysis of transcription factors and their target genes. In particular, in the case of human zinc finger proteins, our expert system on DNA-protein interactions might be used for predicting target sites (figure: TF-EXPERT - Analysis of zinc fingers). Instead of using solely consensus sequences or matrix patterns, we will integrate all available facts on target sites in the TFkb knowledge base, forming an abstracted prototype/case tree (see knowledge representation below). Prediction in TF-EXPERT means matching a new sequence, and the binding features related to this sequence, against the prototype/case tree. The goal is to find the one or more most similar known target sites, or abstract prototypes of target sites, and present them to the user.

Then the new sequence and its binding features are integrated into TFkb. The more sequences are presented to TF-EXPERT, the more precise further predictions will be. This is one of the main advantages of our CBR methods.
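The retrieve-then-integrate cycle described in this section can be sketched in a few lines. The sketch is written in Python for illustration, not in the Common Lisp named for the project, and it simplifies heavily: the knowledge base is a flat list rather than a prototype/case tree, and the similarity measure (the mean of positional sequence identity and the share of agreeing feature values) as well as all class, function and case names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    name: str
    sequence: str
    features: dict = field(default_factory=dict)

def similarity(a, b):
    """Mean of positional sequence identity and share of agreeing feature values."""
    longest = max(len(a.sequence), len(b.sequence))
    identity = (
        sum(x == y for x, y in zip(a.sequence, b.sequence)) / longest
        if longest else 0.0
    )
    keys = set(a.features) | set(b.features)
    if not keys:
        return identity
    agreement = sum(a.features.get(k) == b.features.get(k) for k in keys) / len(keys)
    return (identity + agreement) / 2

class KnowledgeBase:
    """Flat stand-in for the prototype/case tree (a simplifying assumption)."""

    def __init__(self):
        self.cases = []

    def predict(self, new_case, k=1):
        # Retrieve: rank the stored cases by similarity to the new one.
        ranked = sorted(self.cases, key=lambda c: similarity(new_case, c), reverse=True)
        # Retain: integrate the new case so that later predictions can use it.
        self.cases.append(new_case)
        return ranked[:k]

kb = KnowledgeBase()
kb.cases.append(Case("known_1", "GCGTGGGCG", {"affinity": "high"}))
kb.cases.append(Case("known_2", "TTTAAACCC", {"affinity": "low"}))

first = kb.predict(Case("new_1", "GCGTGGGCA", {"affinity": "high"}))
second = kb.predict(Case("new_2", "GCGTGGGCA", {"affinity": "high"}))
print(first[0].name, second[0].name)
```

The second query is answered with the case stored during the first one, which is the incremental behaviour the text describes: every presented sequence becomes available for later predictions.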
Design of synthetic zinc fingers

Even more exciting is the interactive design of synthetic transcription factors (figure TF-EXPERT - Design of zinc fingers) with predicted DNA binding specificities. In terms of in vivo functions, we might be able to determine a regulatory program for the function of a human zinc finger protein. In particular, once DNA fragments have been selected by zinc finger proteins, the expert system might be useful in identifying individual contact residues at the nucleic acid level. The required functions of a protein will be matched against the TFkb knowledge base. The most similar zinc finger gene cluster, or a more abstract zinc finger gene cluster prototype, will be presented to the researcher as a first zinc finger proposal. In interaction with the researcher, TF-EXPERT proposes modifications of zinc finger gene clusters using CBR adaptation methods. Thus the initial zinc finger gene cluster is modified under the constraints of the background knowledge on transcription factors.
Figure: TF-EXPERT - Design of zinc fingers
Detection of zinc finger binding site preferences

We will establish a method for automatically detecting zinc finger binding site preferences (target genes) as one important example of general transcriptional protein binding sites (figure TF-EXPERT - Analysis of zinc fingers). To this end, we will adopt methods developed in our previous expert systems, particularly in ICONS.
Knowledge representation in case-based structure

The knowledge representation is a case-based structure of prototypes of classes of zinc fingers and single consensus sequences (figure Prototype/case knowledge base) adopted from ICONS. The advantage of this approach is that consensus binding sequences in the IUPAC code can be generalized in an abstraction tree of prototypes and simple consensus sequences. The sequences come from the literature, from public databases and from our own work (see Projects I and II above). They form a local, but public, base of transcription factors.

A second source of knowledge is a base of background knowledge. While the case-based part of the knowledge base is highly variable, this part comprises stable knowledge on proteins and DNA as well as specific knowledge from the literature, for instance on chemical and stereochemical rules of zinc finger DNA binding [Suzuki et al. 1994]. Regulatory functions are connected with the consensus binding sequences. Where any are known, they will be extracted from the TIGR and PROSITE databases. The PROSITE database (A. Bairoch) contains amino acid consensus motifs, including rules like "At least one Pro or Gly from -7 to -2 and from +1 to +7 or at least two or three Asp, Ser or Asn from -7 to +7". These rules will be parsed and automatically integrated into TFkb. Further rules will be added from our work described in Projects I and II. 3-D databases like PDB play an important role in detecting binding sites and in designing synthetic zinc fingers. Transcription-factor-specific facts from these databases will be integrated into the knowledge base. The prototype/case knowledge base of zinc finger proteins and DNA sites includes knowledge on the following attributes
The knowledge base on background knowledge will cover
Figure: Prototype/case knowledge base
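The generalization of IUPAC consensus sequences into more abstract prototypes, as described in this section, can be sketched as follows. The IUPAC nucleotide codes are standard; everything else in this Python sketch (the function names, the position-wise union as the abstraction step, and the example sequences) is an assumption made for illustration and is not taken from ICONS.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of bases -> IUPAC code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}

def generalize(seq_a, seq_b):
    """Return the most specific IUPAC consensus covering both input sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return "".join(
        CODE_FOR[frozenset(IUPAC[x]) | frozenset(IUPAC[y])]
        for x, y in zip(seq_a, seq_b)
    )

def matches(consensus, sequence):
    """Check whether a plain A/C/G/T sequence is covered by an IUPAC consensus."""
    return len(consensus) == len(sequence) and all(
        base in IUPAC[code] for code, base in zip(consensus, sequence)
    )

# Invented example: two cases are abstracted into a prototype, which is then
# generalized further when a third case is added.
prototype = generalize("GCGTGGGCG", "GCGGGGGCG")
broader = generalize(prototype, "GCGTAGGCG")
print(prototype, broader)
print(matches(broader, "GCGTAGGCG"), matches(broader, "GCGAGGGCG"))
```

Each generalization step can only widen the set of sequences a prototype covers, so repeated application yields a chain from single cases to increasingly abstract prototypes.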
User oriented tasks of the TF-EXPERT system

The expert system should support researchers working on tasks like

A. Prediction of optimal zinc finger DNA binding site preferences and especially
Evaluation

TF-EXPERT will be evaluated using known zinc finger structure/function relationships. The user interface will be tested for acceptability.

The TF-EXPERT system should display the following features:
References
| Last modified: 30.10.2005 |