We are developing various language resources for natural language processing (NLP)
or artificial intelligence (AI).
Below you will find brief explanations and links to the detailed pages.
The distributed version may not be the latest.
If you want the latest version or want to contribute to the data, please do not
hesitate to contact us.
We are constructing a publicly available dependency corpus in Japanese.
The annotation unit is the word, as in many languages.
We currently have about 35,000 annotated sentences taken from various sources such as blogs.
The annotated data are as follows.
- word segmentation
We are also distributing a parser EDA trained on
We have proposed to represent the meaning of procedural texts as flow graphs (directed
As a representative kind of procedural text we have adopted recipes, which are
available on many websites and in many books.
We defined eight types of important terms and represent their relationships by a
directed acyclic graph (DAG).
We have annotated recipes with the following information and made them public.
- Important terms (8 classes; foods, tools, actions by chef, ...)
- Flow graph
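As a rough illustration of the flow graph idea, the following sketch stores annotated terms as nodes and their relationships as labeled directed edges, and checks that the result is acyclic. It is not the corpus format or the authors' tool: the class `FlowGraph`, the edge labels ("target", "instrument"), and the example recipe are hypothetical, and only the three term classes named above are included.

```python
from collections import defaultdict, deque

# Only three of the eight term classes are named in the text;
# these identifiers are illustrative, not the corpus tag set.
TERM_CLASSES = {"food", "tool", "action_by_chef"}


class FlowGraph:
    """A minimal labeled directed graph over annotated terms."""

    def __init__(self):
        self.terms = {}                  # term id -> (surface form, term class)
        self.edges = defaultdict(list)   # source id -> [(target id, label)]

    def add_term(self, term_id, surface, term_class):
        if term_class not in TERM_CLASSES:
            raise ValueError(f"unknown term class: {term_class}")
        self.terms[term_id] = (surface, term_class)

    def add_edge(self, src, dst, label):
        if src not in self.terms or dst not in self.terms:
            raise KeyError("both endpoints must be registered terms")
        self.edges[src].append((dst, label))

    def topological_order(self):
        """Return term ids in dependency order (Kahn's algorithm).

        Raises ValueError if the graph has a cycle, i.e. is not a DAG.
        """
        indegree = {t: 0 for t in self.terms}
        for outs in self.edges.values():
            for dst, _ in outs:
                indegree[dst] += 1
        queue = deque(sorted(t for t, d in indegree.items() if d == 0))
        order = []
        while queue:
            node = queue.popleft()
            order.append(node)
            for dst, _ in self.edges.get(node, []):
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    queue.append(dst)
        if len(order) != len(self.terms):
            raise ValueError("flow graph contains a cycle")
        return order


def build_example():
    """Hypothetical recipe fragment: chop the onion with a knife, then fry it."""
    g = FlowGraph()
    g.add_term(1, "onion", "food")
    g.add_term(2, "knife", "tool")
    g.add_term(3, "chop", "action_by_chef")
    g.add_term(4, "fry", "action_by_chef")
    g.add_edge(1, 3, "target")       # the onion is what gets chopped
    g.add_edge(2, 3, "instrument")   # the knife is used for chopping
    g.add_edge(3, 4, "target")       # the chopped result is what gets fried
    return g
```

Calling `build_example().topological_order()` yields an order in which every term appears before the actions that depend on it, which is one simple way to read a recipe's flow from such a graph.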
The framework can cover general procedural texts just by replacing "food" with
Please take a look at the details if you are interested.
We are also distributing a named entity recognizer
PWNER trained on the data.
We are also developing a flow graph constructor.
- To improve KyTea, a Japanese text processing tool
We are also distributing KyTea, a word segmenter, POS tagger, and pronunciation
estimator, with a model that includes this data.
This is a corpus consisting of pairs of a game state and commentary sentences on it.
The game is shogi (Japanese chess).
Words in each sentence are identified and annotated with game term tags.
We defined 21 game term types.
- Game state (piece distribution, piece to drop)
- Word sequence
- Game terms
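To make the three annotation layers concrete, here is a minimal sketch of how one corpus entry might be held in memory and how game terms could be read off the tagged word sequence. Everything specific in it is an assumption for illustration: the class `CommentaryExample`, the use of an SFEN string for the game state, the BIO-style tag encoding, and the type names ("Turn", "Action", "Piece") are hypothetical and are not the corpus's actual format or its 21 game term types.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CommentaryExample:
    game_state: str    # assumed here: a serialized position (e.g. an SFEN string)
    words: List[str]   # segmented commentary sentence
    tags: List[str]    # one tag per word, assumed BIO-style: B-X, I-X, or O


def extract_terms(example: CommentaryExample) -> List[Tuple[str, str]]:
    """Collect (term text, term type) pairs from BIO-style tags."""
    if len(example.words) != len(example.tags):
        raise ValueError("words and tags must have the same length")
    terms = []
    current_words, current_type = [], None
    for word, tag in zip(example.words, example.tags):
        if tag.startswith("B-"):
            # A new term starts; flush the one in progress, if any.
            if current_words:
                terms.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the current term.
            current_words.append(word)
        else:
            # "O" or an I- tag with no matching open term: close any open term.
            if current_words:
                terms.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        terms.append((" ".join(current_words), current_type))
    return terms


def make_example() -> CommentaryExample:
    """Hypothetical entry; the commentary is written in English for readability."""
    return CommentaryExample(
        # Standard SFEN for the initial shogi position, used only as a placeholder.
        game_state="lnsgkgsnl/1r5b1/ppppppppp/9/9/9/PPPPPPPPP/1B5R1/LNSGKGSNL b - 1",
        words=["Black", "drops", "a", "silver", "general", "."],
        tags=["B-Turn", "B-Action", "O", "B-Piece", "I-Piece", "O"],
    )
```

Keeping the game state next to the tagged sentence in one record reflects the pairing described above, so that a model can relate each recognized term to the position it comments on.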
A word segmenter and part-of-speech tagger
trained on this corpus is available.
A term recognizer, PWNER, is also available.
This corpus allows us to develop a tool for connecting expressions in a text to world knowledge.
The annotated texts contain sentences from BCCWJ as well as from Twitter.
This corpus is useful for wikification of various texts.
- Any topic, not just named entities.
- Exhaustive annotation, not just "important" concepts.
- No NIL detection (all entities have corresponding Wikipedia articles).
We are devising a wikification tool based on it.
Collaboration with Prof. Yugo Murawaki of the
Graduate School of Informatics, Kyoto University
Last Change: 2016/04/13 by Shinsuke MORI