Spoken Chinese corpus
transcript was counted in words or characters. I am almost certain it
was in characters, otherwise there would have been a fourth piece to
the corpus — a word-parsing dictionary, which apparently is to be
built from this corpus.
CASS corpus
According to a recent wire story from Xinhua:
Chinese linguists are going to
complete China’s largest database of spoken Chinese, on the basis of
which they will compile the country’s first modern spoken Chinese
dictionary and grammar book.

Shen Jiaxuan, director of the Chinese Academy of Social Sciences (CASS)
Institute of Linguistics, said the database include three sub bases
such as a live Chinese conversation base whose data were collected in
Beijing, a base consisting of six dialects of Shanghai, Xi’an,
Guangzhou, Beijing, Chongqing and Xiamen, and a base of phonetic
symbols of modern spoken Chinese.

The live conversation base now has 650 hours of live conversations recorded
in Beijing, which were transferred to 8.9 million words in transcript.
The English-language web page for the CASS Institute of Linguistics is here.