Welcome to the Google Summer of Code page for the Genome Commons. Thanks for your interest in our exciting project.

As you’ve likely read already, the Genome Commons is a new project at UC Berkeley that will develop open source and open access tools for interpreting human genomic data. Each of the GSoC ideas proposed below will advance our project in some way. These ideas are seeds that we hope the GSoC student will develop into a project that provides an opportunity for learning and makes a meaningful contribution to the open source community.

Ideas

Idea 1: Develop a clinician tool for assessing the impact of a genomic variant using public data. Tools to analyze even simple variants in a clinical setting are lacking. A great deal could be done with existing data and coding. We imagine a web application, perhaps built on AppEngine, that would accept a small number of genome variants and return a summary page with links to external data, enabling a clinician to make a professional assessment of the entered variants.

Idea 2: Build a KNIME-based pipeline for variant annotation. While superficially similar to idea 1, the focus of this idea is to use KNIME to build an extensible system for analyzing variation. The input to this system would be raw genomic “short” reads (billions of them) and analysis parameters; the output would be a set of variants. Implementing this project would require writing adapters for a wide range of existing file formats and command line tools.

Idea 3: Build a system for community annotation of proteins. Protein annotation is currently undertaken by isolated groups which focus on a few annotation types. This creates silos of data that are difficult to integrate. The goal of this idea is to build a framework for the distributed annotation of proteins by automated and manual means. For example, specialists in transmembrane prediction might register a service to make predictions; the framework would invoke the service and cache results. Or, scientists might manually annotate a protein with free-text or structured information. The provenance of all annotations must be recorded.
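
To make the shape of Idea 3 a little more concrete, here is a minimal Python sketch of a registry that invokes registered prediction services, caches their results, and records provenance for both automated and manual annotations. Everything in it is an assumption made for illustration: the names Annotation and AnnotationRegistry, the choice of fields, and the in-memory cache are hypothetical, not a proposed design.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Annotation:
    """One annotation plus the provenance needed to trace where it came from."""
    protein_id: str
    kind: str        # annotation type, e.g. "transmembrane"
    value: str       # free text or a serialized structured value
    source: str      # provenance: service name or curator identity
    method: str      # "automated" or "manual"
    created: datetime


class AnnotationRegistry:
    """Hypothetical in-memory registry; a real system would persist its data."""

    def __init__(self) -> None:
        self._services: Dict[str, Tuple[str, Callable[[str], str]]] = {}
        self._cache: Dict[Tuple[str, str], Annotation] = {}
        self._manual: List[Annotation] = []

    def register_service(self, name: str, kind: str,
                         predict: Callable[[str], str]) -> None:
        """Register a prediction service under a unique name."""
        self._services[name] = (kind, predict)

    def annotate(self, name: str, protein_id: str, sequence: str) -> Annotation:
        """Invoke a service once per protein and reuse the cached result."""
        key = (name, protein_id)
        if key not in self._cache:
            kind, predict = self._services[name]
            self._cache[key] = Annotation(
                protein_id, kind, predict(sequence),
                source=name, method="automated",
                created=datetime.now(timezone.utc),
            )
        return self._cache[key]

    def add_manual(self, protein_id: str, kind: str, value: str,
                   curator: str) -> Annotation:
        """Record a manual annotation together with the curator who made it."""
        annotation = Annotation(
            protein_id, kind, value,
            source=curator, method="manual",
            created=datetime.now(timezone.utc),
        )
        self._manual.append(annotation)
        return annotation

    def annotations_for(self, protein_id: str) -> List[Annotation]:
        """Return all cached and manual annotations for one protein."""
        automated = [a for a in self._cache.values() if a.protein_id == protein_id]
        manual = [a for a in self._manual if a.protein_id == protein_id]
        return automated + manual
```

A transmembrane group would call register_service once with its predictor; later calls to annotate for the same protein return the cached Annotation rather than invoking the service again. A distributed version would replace the in-process callable with a remote call, which is where most of the real design work lies.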

Idea 4: BioFUSE, a FUSE layer that enables transparent access to biology-related web services. For example, BioFUSE might expose the heavily used NCBI system to enable access to a “RefSeq” record via a local path such as biofuse/refseq/v37/AC/NP_012345/seq (or NP_012345.xml for the XML record, or .asn1 for the ASN.1 record, etc.).
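
As a rough illustration of the path scheme in Idea 4, the Python sketch below parses a BioFUSE-style path into the pieces a FUSE handler would need before calling a web service. It assumes one reading of the example: that "AC" marks lookup by accession, that the sequence lives at .../AC/NP_012345/seq, and that the XML and ASN.1 records sit beside it as .../AC/NP_012345.xml and .../AC/NP_012345.asn1. The names RecordRequest and parse_path are hypothetical, and no FUSE binding or NCBI call is shown.

```python
from typing import NamedTuple


class RecordRequest(NamedTuple):
    """What a path asks for, independent of how it will be fetched."""
    database: str        # e.g. "refseq"
    version: str         # e.g. "v37"
    accession: str       # e.g. "NP_012345"
    representation: str  # "seq", "xml", or "asn1"


# File extensions named in the idea text; others could be added later.
KNOWN_FORMATS = {"xml", "asn1"}


def parse_path(path: str) -> RecordRequest:
    """Parse a path relative to (or including) the biofuse mount point."""
    parts = [p for p in path.strip("/").split("/") if p]
    if parts and parts[0] == "biofuse":
        parts = parts[1:]  # tolerate the mount point as a leading component

    # Form 1: refseq/v37/AC/NP_012345/seq
    if len(parts) == 5 and parts[2] == "AC" and parts[4] == "seq":
        return RecordRequest(parts[0], parts[1], parts[3], "seq")

    # Form 2: refseq/v37/AC/NP_012345.xml (or .asn1)
    if len(parts) == 4 and parts[2] == "AC" and "." in parts[3]:
        accession, ext = parts[3].rsplit(".", 1)
        if accession and ext in KNOWN_FORMATS:
            return RecordRequest(parts[0], parts[1], accession, ext)

    raise ValueError(f"unrecognized BioFUSE path: {path!r}")
```

In a FUSE layer, a read handler would call something like parse_path on the requested path and translate the resulting RecordRequest into a service query; a ValueError would map to a "no such file" error.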

Idea 5: PgFUSE, a PostgreSQL FUSE layer. Although several people have written FUSE layers for table access, I seek a deeper implementation that would expose limited DDL functionality. For instance, I imagine being able to use a local file path to read and write a database function. With sufficient exposure via FUSE, it might be possible to implement source code management on individual database objects.
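
One way to picture the read side of Idea 5 is a function that turns a file path into a catalog query for a function's source. The Python sketch below assumes a hypothetical layout of <schema>/functions/<name>.sql and psycopg-style %s placeholders; neither comes from the idea text. It relies on PostgreSQL's pg_get_functiondef(), which returns a complete CREATE OR REPLACE FUNCTION statement for a given function OID. The sketch only builds the query and does not connect to a database.

```python
from typing import Tuple

# Catalog query that returns the full definition of the named function(s).
# An overloaded name yields one row per overload; a real layer would need
# a naming convention to tell overloads apart.
FUNCTION_DEF_SQL = (
    "SELECT pg_get_functiondef(p.oid) "
    "FROM pg_proc p "
    "JOIN pg_namespace n ON n.oid = p.pronamespace "
    "WHERE n.nspname = %s AND p.proname = %s"
)


def read_query_for(path: str) -> Tuple[str, Tuple[str, str]]:
    """Map a path like 'public/functions/add_one.sql' to (sql, params).

    The path layout is a hypothetical convention chosen for this sketch.
    """
    parts = path.strip("/").split("/")
    if len(parts) != 3 or parts[1] != "functions" or not parts[2].endswith(".sql"):
        raise ValueError(f"unrecognized PgFUSE path: {path!r}")
    schema = parts[0]
    name = parts[2][: -len(".sql")]
    if not schema or not name:
        raise ValueError(f"unrecognized PgFUSE path: {path!r}")
    # Schema and function name travel as bound parameters, never as SQL text.
    return FUNCTION_DEF_SQL, (schema, name)
```

Because the text returned by pg_get_functiondef() is itself a CREATE OR REPLACE FUNCTION statement, a write handler could in principle execute the saved file contents to update the function, which is what would make file-based source code management of database objects plausible. Permissions, transactions, and overloads are left open here.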

Project X: Blue Sky. We like good ideas. If you’ve got one that you think fits with our overall mission, please contact us early to propose the idea briefly. If we like it, we’ll ask you to elaborate.

Contact

If you’d like to discuss the ideas above or any of your own, please contact: Reece Hart <reece@berkeley.edu>. I am happy to work with prospective candidates to refine ideas and craft the best proposal possible. If you’re local, it would be nice to meet in person to get acquainted.

Applying

In addition to the standard GSoC application, please include the following information in your application:

  • What is your goal for the project? Imagine that it’s the end of the summer. Please briefly describe what a successful GSoC project with the Genome Commons looks like to you. What did you get done? What did you learn? What’s your next move?
  • Depending on the idea pursued, experience with SQL, Python, Perl, Java, Unix, AppEngine, and web programming will likely be required. What’s your experience with these? What other relevant skills do you have?
  • How familiar are you with biological data types, such as genes, genomes, proteins, transcripts, variants, kinds of variants?
Last updated 26 January 2010