Community Access to Data and Resources Roadmap
Objective: Provide community access to data, models, simulations, and protocols for GTL. Allow users to query and visualize data, use models, run simulations, update and annotate community data, and combine community data and models with their local databases and models.
Making data and software from one research project accessible and useful to others is a considerable challenge, especially given the many kinds of information produced by GTL, the variety of computational packages, and the rapidly evolving representations of our understanding of living systems. The community not only spans a wide variety of interests and expertise but also extends beyond GTL research itself: many GTL researchers draw upon and contribute to other research activities in the life sciences. The usefulness and acceptance of GTL data and resources therefore depend, in part, on how well they integrate with similar activities in the larger life sciences community. Community engagement and support must be provided at all stages of this infrastructure development.
Transparent and facile community use of GTL computational resources (specifically the GTL Knowledgebase) for analysis, visualization, modeling, and simulation will require access at several levels. The interfaces must be friendly to both users and applications and must enable comprehensive integration of GTL databases. Scientists must be able to integrate problem-solving and knowledge-discovery capabilities with custom applications and with other distributed community resources; however, they will not use these capabilities unless they understand them and are confident that they will remain available and reliable far into the future. Therefore, comprehensive training must be readily accessible to potential users, and software tools and interfaces must be well maintained and supported.
Capabilities Needed
Achieving this sixth objective requires a range of technical capabilities, listed below along with some associated research and development challenges:
- Community resources for multiple types of data (machines, interactions, process models, expression, genome annotation, metabolism, and regulation)
  - Multiple levels of data: raw data, processed results, dynamic models
  - Data from other community sources
  - Protocols and methods
- Multiple interfaces to the GTL Knowledgebase to enable many kinds of queries
  - Query and update from web portals
  - Interface via web services and database languages
  - Adapters and translators to and from external community databases
  - Integration with community workflow tools
  - Integration with grid services
  - Posting of data directly into computations
- Technologies and tools for access to an integrated view of biology
  - Ability to cross-annotate genome, proteome, and image databases with other information (e.g., genomes with expression data, images with molecular analyses)
  - Support for automated and on-demand updates of models built on parameters from the evolving GTL Knowledgebase
- Broad control over data propagation and collaboration
  - Creation of a local copy of all or part of a data set and ability to reintegrate changes later
  - Publishing of data to a limited set of colleagues and private sharing of notes with them
  - Creation and import of dictionaries and restricted naming rules
  - Propagation of data-analysis code to peers and continuous update of algorithms
- Complete documentation, training, and support services
  - Online documentation of database schema, interfaces, and access protocols with worked examples
  - Documented open-source applications for analysis, modeling, and simulation, with files for common systems and sample input and output
  - Periodic tutorials on database and application use at several levels
  - Help-desk support for problems and queries
  - Disaster-recovery plans for major databases
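One of the capabilities listed above, creating a local copy of part of a data set and reintegrating changes later, can be illustrated with a small three-way merge. The following Python sketch is only an illustration under stated assumptions: the record layout, the gene identifiers, the function names `checkout` and `reintegrate`, and the policy of reporting conflicts rather than resolving them are all hypothetical and are not drawn from any GTL specification.

```python
import copy


def checkout(central, keys):
    """Return a deep copy of the selected records to use as a local working copy."""
    return {k: copy.deepcopy(central[k]) for k in keys if k in central}


def reintegrate(base, local, central):
    """Three-way merge of local edits into the central store.

    base    -- snapshot taken at checkout time
    local   -- the edited local copy
    central -- the current central store, which may have changed since checkout
    Returns (merged, conflicts); the central store itself is not modified.
    """
    merged = copy.deepcopy(central)
    conflicts = []
    for key in set(base) | set(local):
        b, l, c = base.get(key), local.get(key), central.get(key)
        if l == b:
            continue  # no local change, so keep whatever central holds now
        if c == b or c == l:
            # Central is unchanged since checkout (or made the same change):
            # accept the local edit, including a local deletion.
            if l is None:
                merged.pop(key, None)
            else:
                merged[key] = copy.deepcopy(l)
        else:
            conflicts.append(key)  # both sides changed the record differently
    return merged, sorted(conflicts)


# Hypothetical records, for illustration only.
central = {"geneA": {"note": "kinase"}, "geneB": {"note": "unknown"}}
base = checkout(central, ["geneA", "geneB"])
local = copy.deepcopy(base)

local["geneB"]["note"] = "transporter"        # annotation made locally
central["geneA"]["note"] = "ser/thr kinase"   # concurrent central update

merged, conflicts = reintegrate(base, local, central)
```

Here the local annotation and the concurrent central update touch different records, so both survive the merge and no conflicts are reported. Had both sides edited the same record differently, its key would be returned in `conflicts` for a curator to resolve.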
Some R&D Challenges
- Efficient management of queries that span many widely distributed databases, perhaps having varying internal organizations
- Reliable propagation of updates to replica databases and databases with information derived from central sources
- Intuitive user interfaces for browsing, querying, visualizing, and running analyses or simulations
- Design and integration of major databases, accommodating huge data volumes, large transaction rates, great schema complexity, and continually evolving content (e.g., new types of database hardware and software)
- Data standards and representation for very complex objects (e.g., object-definition languages)
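The first challenge above, querying across distributed databases with differing internal organizations, is often approached by placing a thin adapter in front of each source so that every source answers the same query in a common record format. The Python sketch below illustrates that idea only. The two in-memory sources, their field names, the placeholder identifiers and values, and the function names are all hypothetical, and a real deployment would replace the in-memory lookups with network calls to the actual databases.

```python
from concurrent.futures import ThreadPoolExecutor

# Two hypothetical sources that store the same kind of information differently.
SOURCE_A = [  # a list of row dictionaries
    {"gene_id": "g1", "organism": "org1", "expression": 2.5},
    {"gene_id": "g2", "organism": "org2", "expression": 0.7},
]
SOURCE_B = {  # a mapping from gene identifier to (organism, expression)
    "g3": ("org1", 1.1),
}


def query_source_a(organism):
    """Adapter for source A: filter rows and emit records in the common form."""
    return [
        {"gene": row["gene_id"], "organism": row["organism"],
         "expression": row["expression"], "source": "A"}
        for row in SOURCE_A if row["organism"] == organism
    ]


def query_source_b(organism):
    """Adapter for source B: unpack tuples and emit records in the common form."""
    return [
        {"gene": gene, "organism": org, "expression": value, "source": "B"}
        for gene, (org, value) in SOURCE_B.items() if org == organism
    ]


def federated_query(organism, adapters):
    """Send the same query to every adapter concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=len(adapters)) as pool:
        partial_results = pool.map(lambda adapter: adapter(organism), adapters)
        merged = [record for part in partial_results for record in part]
    return sorted(merged, key=lambda record: record["gene"])


results = federated_query("org1", [query_source_a, query_source_b])
```

Each adapter hides its source's internal layout, so supporting a new database means writing one more adapter rather than changing the query logic. This sketch does not address the harder parts of the challenge, such as query planning, partial failures, or differing semantics between sources.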




