Infrastructure for Data Collection and Integration for Biomedical Knowledgebases – GlyGen as a Case Study Open Access
The accelerating use of omics technologies is generating petabytes of data, which has led to the development of several knowledgebases and tools. Although these resources hold a vast amount of knowledge, much of it is redundant, heterogeneous, and scattered. Despite the availability of a considerable number of resources, answering a specific scientific question still requires manual literature searches and manual collection of data from multiple resources. Hence, there is a great need for the collection and integration of such biomedical data. Progress in biomedical data integration has been slow because the data in the knowledgebases are in different file formats, use multiple identifiers for the same entity, lack machine-readable schemas, are exposed through ill-defined Application Programming Interfaces (APIs), and are subject to data licensing issues. The biomedical community is making substantial progress by implementing new infrastructure technologies and standards, comprising Semantic Web technologies, common formats, and global linked identifiers, for data collection, integration, and retrieval. The field of glycomics is generating data at a fast pace from high-throughput projects, and as a result, many tools and databases have been developed for the glycoinformatics community. Even within the glycobiology domain, the relevant data are scattered across various databases and knowledgebases, creating a need for a global and comprehensive glycobiology knowledgebase. GlyGen, a glycoinformatics knowledgebase that is a free, extendable, and multidisciplinary resource for glycobiology, aims to address these needs. GlyGen includes data and knowledge related to glycobiology, comprising glycans, genes, proteins, diseases, expression, and mutation.
For GlyGen, data are being collected and integrated from various data resources according to a pre-defined data model derived from use cases developed with input from more than 50 scientists. In the initial phase of the GlyGen project, we collected and integrated mouse and human data from resources such as UniProt, PDB, PubChem, GlyTouCan, and UniCarbKB, as well as from other individual data generators, using a workflow that incorporated Semantic Web technologies and standards. We created 74 datasets and categorized them as protein-centric, proteoform-centric, and glycan-centric based on their content. A detailed readme for each dataset was created based on the BioCompute Object specification document. A dataset collection viewer page was developed to view the dataset collection, and dataset networks were created using Cytoscape to show the relationships and links among data in the dataset categories. The datasets in CSV (comma-separated values) format were later rdfized using an RDF (Resource Description Framework) model based on existing RDF models. The rdfized data were stored as triples in the GlyGen triplestore, which was made accessible through web service APIs and SPARQL queries. To collect, integrate, and retrieve data, a high-availability cluster configuration comprising three servers with preinstalled software was used. For GlyGen data, we chose the Creative Commons Attribution 4.0 International (CC BY 4.0) license, and for source code, we chose the GNU General Public License v3. A private GitHub repository was created for sharing source code with the public. Because it is challenging to retrieve a list of glycoside hydrolases (GHs) from a single resource, we developed a workflow that retrieved the GHs from UniProtKB and validated the entries through QuickGO, the Carbohydrate-Active Enzymes database (CAZy), and Pfam. The workflow retrieved a list of 83 GHs classified by GH family.
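As an illustration of the rdfization step described above, the following is a minimal sketch of how rows of a CSV dataset might be turned into RDF triples in N-Triples syntax. It is not the GlyGen implementation: the namespace `http://example.org/glygen/`, the `hasGlycan` predicate, the column names `protein_ac` and `glycan_ac`, and the sample accessions are all hypothetical placeholders chosen for this example, and the actual GlyGen RDF model is not specified here.

```python
import csv
import io

# Hypothetical namespace and predicate; the real GlyGen RDF model is not given here.
BASE = "http://example.org/glygen/"
HAS_GLYCAN = BASE + "vocab/hasGlycan"


def csv_to_ntriples(csv_text):
    """Turn each CSV row into one N-Triples statement linking a protein to a glycan."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Column names are assumptions made for this sketch.
        subject = f"<{BASE}protein/{row['protein_ac']}>"
        obj = f"<{BASE}glycan/{row['glycan_ac']}>"
        triples.append(f"{subject} <{HAS_GLYCAN}> {obj} .")
    return triples


# Placeholder accessions, not real database entries.
SAMPLE_CSV = "protein_ac,glycan_ac\nP00001,G00001AA\nP00002,G00002BB\n"

if __name__ == "__main__":
    for line in csv_to_ntriples(SAMPLE_CSV):
        print(line)
```

Each output line is one subject-predicate-object statement, which is the form in which a triplestore holds data and against which SPARQL queries are evaluated.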
When GlyGen is fully developed and functional, we will make it available to the public at www.glygen.org. Data integration is a challenging task that requires meticulous planning, creative approaches to tackling issues, and concentrated effort in implementing solutions. We firmly believe that the rest of the biomedical community will take inspiration from GlyGen when collecting and integrating data for a biomedical domain from diverse resources.