Ask Language Log: Comparing the vocabularies of different languages


Mark Liberman offers additional readings for a freshman who has a big ambition 

 

Language Log: Ask Language Log: Comparing the vocabularies of different languages

 

 If you’re interested in taking this further, here are a few inadequate suggestions:


    In the post he also touched on the effect of orthographic conventions — which I called paraorthographics in this paper — on counting words. 

    Here are similar type-token plots for 50 million words of newswire text in Arabic, Spanish, and English:

    Does this indicate that Spanish has a much richer vocabulary than English, and that Arabic is lexically even richer yet? No, it mainly tells us that Spanish has more morphological inflection than English, and Arabic still more inflection yet.
    These curves also reflect some arbitrary orthographic conventions. Thus Arabic writes many word sequences "solid" that Spanish and English would separate by spaces. In particular, prepositions and determiners are grouped with following words (thus this might be aphrase ofenglish inthearabic style). Just splitting (obvious) prepositions and articles moves the Arabic curve a noticeable amount downward:

    Arabic text has some other orthographic characteristics that raise its type-token curve by at least as much, such as variation in the treatment of hamza. And in large corpora in any language, the rate of typographical errors and variant spellings becomes a very significant contributor to the type-token curve. 
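To make the quoted argument concrete, here is a minimal Python sketch of how a type-token curve can be computed, and of how re-splitting words written "solid" pulls it down. It is only an illustration: the toy tokens reuse the post's "aphrase ofenglish inthearabic style" example rather than real Arabic, and the names `type_token_curve`, `split_clitics`, `resegment`, `PREFIXES`, and `LEXICON` are hypothetical, not taken from the post. Real Arabic segmentation would need a proper morphological analyzer, not a prefix list checked against a tiny lexicon.

```python
def type_token_curve(tokens, step=1):
    """Return (tokens_seen, distinct_types_seen) pairs, sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


def split_clitics(token, prefixes, lexicon):
    """Split leading prefixes off `token`, but only if a known word remains.

    Returns a list of pieces, or None when no valid split exists.
    """
    if token in lexicon:
        return [token]
    for p in prefixes:
        if token.startswith(p) and len(token) > len(p):
            rest = split_clitics(token[len(p):], prefixes, lexicon)
            if rest is not None:
                return [p] + rest
    return None


def resegment(tokens, prefixes, lexicon):
    """Apply split_clitics to every token, leaving unsplittable tokens unchanged."""
    out = []
    for tok in tokens:
        parts = split_clitics(tok, prefixes, lexicon)
        out.extend(parts if parts is not None else [tok])
    return out


# Toy assumptions: an invented prefix list and lexicon, English written "solid".
PREFIXES = ("a", "of", "in", "the")
LEXICON = {"phrase", "english", "arabic", "style"}

solid = (
    "aphrase ofenglish inthearabic style thephrase inenglish "
    "ofarabic astyle thearabic inarabic ofstyle theenglish"
).split()
split = resegment(solid, PREFIXES, LEXICON)

# Compare the two curves over the same number of tokens.
print(type_token_curve(solid, step=4))
# [(4, 4), (8, 8), (12, 12)]
print(type_token_curve(split[: len(solid)], step=4))
# [(4, 4), (8, 8), (12, 8)]
```

In the solid version every token is a new type, so the curve climbs as fast as it can. After splitting, the same material yields 24 tokens but only 8 types, so the curve flattens. That is the downward shift the post describes, without any change in the underlying vocabulary.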

    The question raised by Michael Honeycutt reflects English speakers’ fascination with, or fixation on, words. Words are not much more or less than whatever is flanked by spaces and punctuation marks. I have argued that spaces and punctuation were historically invented less for linguistic analysis than for oculomotor efficiency. They happen to work very well for English; "happen to", because they were invented and propagated by Irish monks.
