Miyao Group at the University of Tokyo


Other works are also available at our GitHub repository.

Syntactic Parsing


Jigg is a framework that makes it easy to combine natural language processing tools for tasks such as chunking, POS tagging, dependency parsing, and semantic parsing. The tools become available once you download the JAR archives.

Jigg also includes a Japanese syntactic parser based on Combinatory Categorial Grammar (CCG). This parser is used in ccg2lambda, a system for textual entailment.


Corbit is an integrated text analyzer for Chinese that performs word segmentation, part-of-speech (POS) tagging, and dependency parsing with state-of-the-art performance. It is built on incremental, transition-based parsing algorithms, which make it possible to run each of these tasks individually, or any combination of them with joint decoding, very efficiently. Joint decoding usually yields higher accuracy, but it slows processing as its complexity grows.
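To give a feel for what "transition-based" means, here is a minimal sketch of an arc-standard style dependency parser in Python. It is not Corbit's actual transition system or code: the function name `arc_standard`, the three action names, and the English example sentence are assumptions made for illustration, and the action sequence is hard-coded, whereas a real parser would choose each action with a trained model.

```python
from typing import List, Tuple


def arc_standard(words: List[str], actions: List[str]) -> List[Tuple[int, int]]:
    """Apply SHIFT / LEFT / RIGHT actions and return (head, dependent) arcs.

    Tokens are numbered from 1; index 0 is an artificial ROOT node.
    """
    stack = [0]                                # starts with ROOT only
    buffer = list(range(1, len(words) + 1))    # tokens not yet read
    arcs = []
    for action in actions:
        if action == "SHIFT":
            # Move the next unread token onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT":
            # The second item from the top becomes a dependent of the top item.
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT":
            # The top item becomes a dependent of the item below it.
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown action: {action}")
    return arcs


words = ["She", "reads", "books"]
# Hard-coded for illustration; a trained classifier would predict these.
actions = ["SHIFT", "SHIFT", "LEFT", "SHIFT", "RIGHT", "RIGHT"]
arcs = arc_standard(words, actions)
print(arcs)
```

Each token is read once and every action takes constant time, which is why parsers of this kind are fast. Joint decoding, as described above, enlarges the set of actions (for example with segmentation and tagging decisions), and that is where the extra cost comes from.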


The Enju Project develops parsing technologies for analyzing natural language text. The Enju parser for English analyzes not only phrase structures and dependency structures but also detailed syntactic and semantic structures (predicate-argument structures), with high speed and accuracy. It is useful for advanced natural language processing applications such as information extraction, automatic summarization, and question answering. Syntactic parsers for Japanese and Chinese are under development within the same framework.

Kaede Treebank

Kaede Treebank is a constituency treebank for Japanese. It provides phrase-structure annotations for part of the Kyoto University text corpus. The data can be used to train and study constituency parsers, and it has also been used to develop the CCG parser module for Jigg.

Universal Dependencies

This is a project to develop multilingual treebanks in a universal format. We are engaged in the development of the Japanese data.

Semantic Parsing


ccg2lambda is a system for textual entailment that performs inference in higher-order logic on top of CCG syntactic parsing. It uses the C&C parser and the EasyCCG parser for English, and Jigg for Japanese. Entailment is decided by an inference engine with rule-based methods, and the system has achieved good results on various evaluation datasets.


TIFMO is a system for recognizing entailment relations in natural language texts. It accurately recognizes advanced logical inferences involving universal quantifiers and negation, as well as the wide variety of paraphrases found in real-world text. TIFMO analyzes sentence meaning using Dependency-based Compositional Semantics and can combine that meaning with various kinds of linguistic and world knowledge in fast logical inference.


We organize the NTCIR RITE tasks, which address the recognition of inference relations in text, at the evaluation-style workshop NTCIR. This technology automatically recognizes whether two different texts are equivalent or different in meaning. We create evaluation data from texts extracted from Wikipedia and university entrance exams and provide it to participating teams.


Automatic Video Description Generation

This software automatically generates natural language descriptions of the visual content of video data, a task known as automatic video description generation. The model builds on the sequence-to-sequence model used in machine translation and assigns weights to the frames it attends to in the video. It has achieved high accuracy on more than one dataset.
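The frame weighting described above can be sketched as a simple attention step. This is only an illustration under stated assumptions, not the released model: the dot-product scoring, the function names `softmax` and `attend`, and the toy two-dimensional frame features are all hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frames: List[List[float]], query: List[float]) -> Tuple[List[float], List[float]]:
    """Weight frame feature vectors by their relevance to a decoder state.

    Relevance is a dot product here (an assumption made for this sketch).
    Returns the per-frame weights and the weighted average of the frames.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(dim)
    ]
    return weights, context


# Three toy frame feature vectors and one decoder state.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = attend(frames, query)
print(weights, context)
```

In a sequence-to-sequence decoder, a step like this would be repeated for every generated word, so that different words can draw on different frames.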

Knowledge Discovery from Academic Papers


RANIS is an annotated corpus of semantic relations in academic papers. Terms in the papers are annotated, and the semantic relations among them, such as “method”, “purpose”, and “result”, are labeled. Data is available for English abstracts of ACM and ACL papers and for Japanese abstracts of papers in the IPSJ Journal. We have also released the annotation guidelines.

Question Answering

Artificial Intelligence Project

NII promotes the Artificial Intelligence Project, which aims to develop an integrated artificial intelligence clever enough to pass university entrance exams. Entrance exam questions are posed and answered in natural language, which makes them prime examples of natural language processing problems. When we analyze the process of understanding and answering the questions (that is, thinking), however, we find that it requires a range of artificial intelligence technologies: not only natural language processing but also the understanding and manipulation of mathematical formulas, domain knowledge, logical inference, and the unified comprehension of verbal and nonverbal information (such as graphs and pictures). By developing an integrated system that solves university entrance exam problems, we intend to shed light on what can and cannot be done by orchestrating frontier AI technologies, and on what role natural language processing can play. Along the way, we expect to identify the problems that future artificial intelligence research should tackle.


NIILC-QA is a dataset of questions and answers over Wikipedia, annotated with various additional information. Its aim is to support the development of technologies that let a system explain the process by which it finds the answer to a question. For this purpose, we have manually added information such as keywords and queries.

Dialogue System


We organize the NTCIR STC tasks on Short Text Conversation (STC), which address the generation of short conversational responses, at the evaluation-style workshop NTCIR. We have constructed a dataset with human judgments of the appropriateness of question-answer pairs.

Infrastructure Software


Amis is software for learning maximum entropy models based on the feature forest model. It is used in the Enju Project to train the models that resolve parsing ambiguity.
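As a rough picture of how a maximum entropy model resolves ambiguity, the sketch below scores two candidate analyses and normalizes the scores into probabilities. It is not Amis code and does not reproduce the feature forest model: the candidates are listed explicitly, and the function name `maxent_probs`, the feature names, and the weights are all made up for illustration.

```python
import math
from typing import Dict, List


def maxent_probs(
    features: Dict[str, List[str]], weights: Dict[str, float]
) -> Dict[str, float]:
    """Return p(y) proportional to exp(sum of the weights of y's active features)."""
    scores = {
        y: sum(weights.get(f, 0.0) for f in active)
        for y, active in features.items()
    }
    m = max(scores.values())  # subtract the maximum for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())    # normalization constant
    return {y: e / z for y, e in exps.items()}


# Two hypothetical analyses of one ambiguous sentence, each with one active feature.
features = {
    "parse_a": ["pp_attaches_to_verb"],
    "parse_b": ["pp_attaches_to_noun"],
}
# Hypothetical weights; in practice they are estimated from training data.
weights = {"pp_attaches_to_verb": 1.0, "pp_attaches_to_noun": 0.0}

probs = maxent_probs(features, weights)
print(probs)
```

Disambiguation then amounts to choosing the candidate with the highest probability.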


LiLFeS is a logic programming language with typed feature structures. Its feature structure processing can be called from C++, so it can also be used as a library. It is used in the implementation of Enju.
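The core operation on feature structures is unification, which merges two structures when they are compatible and fails when they clash. The following Python sketch shows the idea on plain nested dictionaries. It is not LiLFeS syntax or its C++ interface, and it leaves out types and structure sharing; the function name `unify` and the example features are assumptions made for illustration.

```python
from typing import Any, Optional


def unify(a: Any, b: Any) -> Optional[Any]:
    """Unify two untyped feature structures.

    Structures are nested dicts with atomic values at the leaves.
    Returns the merged structure, or None when the two clash.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy so the inputs are not modified
        for key, value in b.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is None:
                    return None  # a clash anywhere fails the whole unification
                result[key] = merged
            else:
                result[key] = value
        return result
    # Atomic values (or an atom against a dict) unify only when equal.
    return a if a == b else None


noun = {"cat": "np", "agr": {"num": "sg"}}
constraint = {"agr": {"num": "sg", "per": "3"}}

merged = unify(noun, constraint)       # compatible: information is combined
clash = unify(noun, {"agr": {"num": "pl"}})  # incompatible: "sg" vs "pl"
print(merged, clash)
```

A typed system additionally checks that the types of the two structures are compatible in the type hierarchy before merging their features.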