Ontology-based information extraction (OBIE) has recently emerged as a sub-field of information extraction (IE). The general idea is to use ontologies to guide the information extraction process. Ontologies provide formal and explicit specifications of shared conceptualizations.
Our work on ontology-based information extraction comprises the following efforts.
- Reviewing the literature of the field and providing a new structure to the field: OBIE is less than 10 years old and as such there have been very few attempts to provide a literature review for the field. In order to address this situation, we wrote a review paper for the field, which appears to be one of the first such papers, if not the first one. Instead of simply listing the details of different OBIE systems, we tried to identify key characteristics and differences between these systems, providing a definition for an OBIE system and identifying a common architecture among such systems in the process. This paper was published by the Journal of Information Science in June 2010. It can be found here.
- Multiple-ontology-based information extraction (MOBIE): Most OBIE systems use a single ontology, although multiple ontologies exist for most domains. Using such multiple ontologies can actually benefit the information extraction process. This also fits nicely with the vision of Aimlab, where we investigate how to handle and make use of multiple ontologies. Based on this insight, we have investigated the theoretical basis behind the use of multiple ontologies in IE and have conducted two case studies on it. We published the findings of this work at the CIKM 2009 conference (as a full paper). The datasets and source code related to this paper can be found here.
- OBCIE - A component-based approach for information extraction: In our studies on the use of multiple ontologies in information extraction, we encountered the possibility of developing a component-based approach for information extraction. The key ideas behind this approach are (1) identifying the components of information extraction systems that make extractions with respect to particular components of an ontology (which we call information extractors), and (2) separating the domain-specific and corpus-specific information from information extraction systems, resulting in generic platforms for information extraction. Based on these ideas we developed a comprehensive component-based approach named OBCIE (Ontology-based components for information extraction). Since the lack of effective reuse mechanisms appears to be one major reason holding back the widespread usage and commercialization of information extraction, we believe that this approach has the potential to make a significant contribution towards the development of the field. A paper based on this work has been accepted for publication by the CIKM 2010 conference (as a full paper). The datasets and source code related to this paper can be found here.
- Using OBIE to provide grades and feedback for student summaries: Advances in Natural Language Processing (NLP) have led to automatic grading of summaries and essays. Although statistical NLP-based systems produce quite accurate grades, they cannot provide feedback about the completeness or correctness of the summaries, especially about what errors the students have made. Since OBIE extracts information from text based on a formal representation of the domain, it seems possible that OBIE can provide insight into a student's summary. Correct statements are identified by extraction rules created from the concepts and relationships of the ontology. To extract incorrect statements, we use ontology constraints to define logically inconsistent axioms. The results of our study can be found in the conference paper Providing Grades and Feedback for Student Summaries by Ontology-based Information Extraction (CIKM 2012) (short paper).
- Hybrid Information Extractors for OBCIE: In OBIE, information extractors perform the extraction from the text based on an ontological element. In our study, we have revised the definition of an information extractor to improve its performance and extend its functionality. The performance improvement comes from including information extractors with different implementations in the same OBIE system. Different implementations can be incorporated into a system by selecting the implementation that obtains the highest performance for each ontological element, or by integrating multiple implementations of the same ontological element under an ensemble approach. The functionality extension concerns error detection. By generating domain-inconsistent statements, it is possible to define information extractors that identify incorrect sentences. Because we incorporate this extended definition of an information extractor into the OBCIE architecture, which is highly modular, the same system can contain information extractors that extract different ontological elements, with different functionalities and different implementations. The results of our study can be found in the journal paper A Hybrid Ontology-based Information Extraction System.
- Discovering Inconsistencies in PubMed Abstracts through Ontology-Based Information Extraction: Searching for a cure for cancer is one of the most vital pursuits in modern medicine, and microRNA research plays a key role in it. Keeping track of the shifts and changes in established knowledge in the microRNA domain is therefore very important. In this paper, we introduce an ontology-based information extraction method to detect inconsistencies in the abstracts of microRNA research papers. We first use the Ontology for MIcroRNA Targets (OMIT) to extract triples from the abstracts. We then introduce a new algorithm to calculate the oppositeness of these candidate relationships. Finally, we present the discovered inconsistencies in an easy-to-read manner for use by medical professionals. To the best of our knowledge, this study is the first ontology-based information extraction model introduced to find shifts in the established knowledge in the medical domain using research paper abstracts. We downloaded 36877 abstracts from the PubMed database and found 102 inconsistencies relevant to the microRNA domain among them. The results of our study can be found in the conference paper Discovering Inconsistencies in PubMed Abstracts through Ontology-Based Information Extraction.
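The two ways of combining implementations described under Hybrid Information Extractors (picking the best-performing implementation for an ontological element, or merging several implementations in an ensemble) can be illustrated with a minimal Python sketch. This is not the OBCIE code: the extractors, the toy sentences, the use of F1 as the performance measure, the majority-vote ensemble, and every name below are assumptions made purely for illustration.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: an extractor decides whether a sentence expresses
# one given ontological element (here, an imagined "eats" relationship).
Extractor = Callable[[str], bool]
Labelled = List[Tuple[str, bool]]


def f1_score(extractor: Extractor, labelled: Labelled) -> float:
    """F1 of an extractor on labelled sentences (assumed performance measure)."""
    tp = fp = fn = 0
    for sentence, gold in labelled:
        predicted = extractor(sentence)
        if predicted and gold:
            tp += 1
        elif predicted and not gold:
            fp += 1
        elif gold and not predicted:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best(candidates: Dict[str, Extractor], labelled: Labelled) -> str:
    """Strategy 1: keep the implementation with the highest score."""
    return max(candidates, key=lambda name: f1_score(candidates[name], labelled))


def majority_vote(extractors: List[Extractor]) -> Extractor:
    """Strategy 2: combine implementations with a simple majority vote."""
    def combined(sentence: str) -> bool:
        votes = sum(1 for extractor in extractors if extractor(sentence))
        return votes * 2 > len(extractors)
    return combined


# Three toy implementations for the same ontological element.
candidates: Dict[str, Extractor] = {
    "keyword": lambda s: "eats" in s.lower(),
    "regex": lambda s: re.search(r"\b(eats|feeds on)\b", s.lower()) is not None,
    "length": lambda s: len(s.split()) > 3,  # deliberately weak baseline
}

# Made-up labelled sentences, not data from any of the papers.
labelled: Labelled = [
    ("The owl eats mice.", True),
    ("Owls feed at night.", False),
    ("The fox feeds on rabbits.", True),
    ("Rabbits live in burrows.", False),
]

best_name = select_best(candidates, labelled)
ensemble = majority_vote(list(candidates.values()))
ensemble_labels = [ensemble(sentence) for sentence, _ in labelled]
```

In a system with many ontological elements, the same selection or voting step would be repeated per element, so that each element can end up with a different implementation or a different ensemble.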