Spoken Chinese corpus

This is from a Language Log story in which Mark Liberman wondered whether
the transcript was counted in words or characters. I am almost certain it
was counted in characters; otherwise there would have been a fourth piece
to the corpus: a word-parsing dictionary, which apparently is still to be
built from this corpus.

CASS corpus

According to a recent wire story from Xinhua:

Chinese linguists are going to complete China’s largest database of
spoken Chinese, on the basis of which they will compile the country’s
first modern spoken Chinese dictionary and grammar book.

Shen Jiaxuan, director of the Chinese Academy of Social Sciences (CASS)
Institute of Linguistics, said the database include three sub bases
such as a live Chinese conversation base whose data were collected in
Beijing, a base consisting of six dialects of Shanghai, Xi’an,
Guangzhou, Beijing, Chongqing and Xiamen, and a base of phonetic
symbols of modern spoken Chinese.

The live conversation base now has 650 hours of live conversations
recorded in Beijing, which were transferred to 8.9 million words in
transcript.

The English-language web page for the CASS Institute of Linguistics is here.
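
As a rough aid to the words-versus-characters question, the two figures
quoted above (650 hours and 8.9 million "words") can be put on a
per-minute and per-second scale. This is only a sketch of the
arithmetic: the function name is made up for illustration, and the
calculation assumes the 650 hours are total recording time, so pauses
and overlapping speech are not accounted for.

```python
def transcript_rate(total_units, hours):
    """Return transcript units per hour, per minute, and per second.

    'Units' may be words or characters; the wire story does not say which.
    """
    per_hour = total_units / hours
    per_minute = per_hour / 60
    per_second = per_minute / 60
    return per_hour, per_minute, per_second


# Figures as quoted in the Xinhua story.
per_hour, per_minute, per_second = transcript_rate(8_900_000, 650)
print(f"{per_hour:.0f} per hour, {per_minute:.0f} per minute, "
      f"{per_second:.1f} per second")
# prints "13692 per hour, 228 per minute, 3.8 per second"
```

The result by itself does not settle which unit was counted; it only
shows what rate either reading would have to be consistent with.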

