Why 100% pinyin accuracy matters when reading

Andrew Wrigley

February 26, 2026

Most Chinese learners will be familiar with seeing automatic pinyin like this:

他(tā)长(cháng)得(de)跟(gēn)你(nǐ)很(hěn)像(xiàng)，我(wǒ)拿(ná)着(zháo)他(tā)的(de)外(wài)套(tào)走(zǒu)了(le)，
还(hái)得(de)把(bǎ)外(wài)套(tào)还(hái)给(gěi)他(tā)

This is often what happens when simple algorithms process Chinese text and determine the pinyin pronunciation for each character. It should read:

他(tā)长(zhǎng)得(de)跟(gēn)你(nǐ)很(hěn)像(xiàng)，我(wǒ)拿(ná)着(zhe)他(tā)的(de)外(wài)套(tào)走(zǒu)了(le)，
还(hái)得(děi)把(bǎ)外(wài)套(tào)还(huán)给(gěi)他(tā)

Why do so many tools get this wrong?

In the general case, getting 100% accurate pinyin turns out to be a lot harder than it looks.

Why care about 100% accuracy?

In our experience, these small errors are surprisingly detrimental to learning. "Mostly right" is very different from "guaranteed to be 100% right", because you never know for sure when it's wrong. Your brain can't fully relax and trust what it's taking in.

The mistakes are not always obvious. Every time you read "一" as yī when it should be yí or yì, you're subconsciously wiring mistakes into your brain that have to be unlearned later.
Pinyin errors are not randomly distributed - the same mistakes keep popping up, meaning you reinforce them over and over, and they become hard to unlearn.
If you care a lot about this (like us), you may avoid studying "real" content using automatic pinyin, in favour of artificial textbook passages, because you know the pinyin is human-reviewed and correct - but this means you're not as engaged with the content, and learning suffers.

Is there one right answer?

As a caveat, there are often several equally valid pronunciation variants (regional differences, neutral tones, etc.). For example:

这(zhè)垃(lā)圾(jī)我(wǒ)实(shí)在(zài)受(shòu)不(bù)了(liǎo)

("I really can't stand this garbage") can be read as:

Zhè lājī wǒ shízài shòu bùliǎo. (Mainland China)
Zhè lèsè wǒ shízài shòu bùliǎo. (Taiwan)

but never as, for example: Zhè lājī wǒ shízài shòu bùle (which we have seen).

There is not always one right answer, but there are definitely wrong answers.

Attempt 1: Character by character

The most naive version of a Chinese-to-pinyin algorithm would simply look up each character:

✓ 我喜欢吃苹果 → wǒ xǐhuān chī píngguǒ

✗ 我要去银行 → wǒ yào qù yín xíng

✗ 我喜欢听音乐 → wǒ xǐhuān tīng yīn lè

Plenty of real tools out there literally do this! Obviously, this is insufficient - with zero context, any character with multiple pronunciations will be wrong some portion of the time.
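As a minimal sketch of what such a lookup amounts to (the table below is a made-up toy with one reading per character, not a real dictionary):

```python
# Hypothetical one-reading-per-character table, for illustration only.
CHAR_PINYIN = {
    "我": "wǒ", "要": "yào", "去": "qù",
    "银": "yín", "行": "xíng",              # 行 is also háng, as in 银行
    "听": "tīng", "音": "yīn", "乐": "lè",  # 乐 is also yuè, as in 音乐
}

def naive_pinyin(text: str) -> str:
    """Look up each character on its own, ignoring all context."""
    # Characters missing from the table are passed through unchanged.
    return " ".join(CHAR_PINYIN.get(ch, ch) for ch in text)

print(naive_pinyin("我要去银行"))  # wǒ yào qù yín xíng
```

Because the table can only store one reading per character, 行 comes out as xíng no matter what surrounds it.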

Attempt 2: Split into words and look up whole words

If we segment the string of characters into individual words, and then look each word up in a dictionary, we get a much better result:

✓ 我 / 要 / 去 / 银行 → wǒ yào qù yínháng

✓ 我 / 喜欢 / 听 / 音乐 → wǒ xǐhuān tīng yīnyuè
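A rough sketch of the word-level lookup, assuming the text has already been segmented correctly (both tables are illustrative toys, and the per-character fallback is our own assumption about how unknown words might be handled):

```python
# Hypothetical word-level and character-level tables, for illustration only.
WORD_PINYIN = {"银行": "yínháng", "音乐": "yīnyuè", "喜欢": "xǐhuān"}
CHAR_PINYIN = {"我": "wǒ", "要": "yào", "去": "qù", "听": "tīng",
               "银": "yín", "行": "xíng", "音": "yīn", "乐": "lè"}

def word_pinyin(words):
    """Prefer the whole-word reading; fall back to per-character lookup."""
    out = []
    for word in words:
        if word in WORD_PINYIN:
            out.append(WORD_PINYIN[word])
        else:
            # Unknown word: join the single-character readings.
            out.append("".join(CHAR_PINYIN.get(ch, ch) for ch in word))
    return " ".join(out)

print(word_pinyin(["我", "要", "去", "银行"]))  # wǒ yào qù yínháng
```

The result is only as good as the segmentation passed in and the coverage of the word table.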

There are still several issues, however.

Problem 1: Segmentation

Splitting an arbitrary Chinese string into words is a surprisingly hard problem. Methods range from relatively simple dictionary-and-statistics tools like jieba to large neural networks (which power most modern Chinese NLP tools).

Unfortunately, none of these get it right 100% of the time (and there's not even always one obvious correct segmentation).
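To see why this is fragile, here is a sketch of forward maximum matching, a classic greedy dictionary-based technique. It is not how jieba or any neural segmenter works internally, and the vocabularies are made-up toys; the point is only that the output swings on which words happen to be known.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy segmentation: at each position take the longest known word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

sentence = "前面有十来个人排成长队"
vocab = {"前面", "十来", "个人", "成长", "长队"}  # hypothetical toy vocabulary

print("/".join(forward_max_match(sentence, vocab)))
# 前面/有/十来/个人/排/成长/队

print("/".join(forward_max_match(sentence, vocab | {"排成"})))
# 前面/有/十来/个人/排成/长队
```

Adding a single word to the vocabulary changes whether 长 ends up inside 成长 or 长队, and therefore which reading a downstream word lookup would pick for it.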

Problem 2: It still fails

Even if we could agree what correct segmentation means, and then get it right 100% of the time, and all words existed in the dictionary, it still isn't enough:

他(tā)/把(bǎ)/我(wǒ)/当(dāng)/朋(péng)友(yǒu)

当 is dàng (to regard as) here, not dāng (to be)

我(wǒ)/想(xiǎng)/买(mǎi)/一(yī)/个(gè)/苹(píng)果(guǒ)

一 is yī in isolation, but yí before a fourth tone (一个 yí gè)

To properly determine the pinyin for these examples, we need to take further linguistic context into account.

Attempt 3: Part-of-speech (POS) tagging

Algorithms that do segmentation often provide so-called "part-of-speech" (POS) tags as well, for example:

我PRON 想VERB 买VERB 一NUM 个CLF 苹果NOUN

This helps when a character's pronunciation is strongly correlated with its grammatical role. For example, it can fix the "一" case: once we know 一 is a numeral followed by a classifier, the tone rules covered under Attempt 4 give 一个 yí gè.

But POS tagging does not fix the "当" example:

他PRON 把ADP 我PRON 当VERB 朋友NOUN

Since dāng ("to serve as / be") and dàng ("to treat/regard as") are both verbs, the tag alone can't disambiguate them. Many other characters behave the same way, for example:

他(tā)倒(dǎo)在(zài)地(dì)上(shàng)了(le) "fall over"

他(tā)倒(dào)了(le)一(yì)杯(bēi)水(shuǐ) "pour"

In both sentences, 倒 is still just tagged as a verb, so POS alone can't choose between dǎo and dào.
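One way to picture this limitation is a lookup keyed on (character, POS tag). The table and tag names below are illustrative assumptions, not the output of any particular tagger:

```python
# Hypothetical (character, POS) -> candidate readings table.
POS_READINGS = {
    ("长", "ADJ"): {"cháng"},          # long
    ("长", "VERB"): {"zhǎng"},         # to grow
    ("倒", "VERB"): {"dǎo", "dào"},    # to fall over / to pour
    ("当", "VERB"): {"dāng", "dàng"},  # to serve as / to regard as
}

def readings_for(char, pos):
    """Return the readings still possible once the POS tag is known."""
    return POS_READINGS.get((char, pos), set())

def is_resolved(char, pos):
    """True only when the POS tag narrows the choice to a single reading."""
    return len(readings_for(char, pos)) == 1

print(is_resolved("长", "VERB"))  # True
print(is_resolved("倒", "VERB"))  # False
```

The tag settles 长, but 倒 and 当 still have two candidates each, so something beyond POS has to make the final choice.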

Attempt 4: Apply fixed linguistic rules

Chinese has a few predictable pronunciation rules that can be applied reliably (so-called tone sandhi). Namely:

不 (bù) becomes:

bú when the next character is 4th tone, e.g. 不是 búshì
bù in all other cases, e.g. 不好 bùhǎo

一 (yī) becomes:

yí when the next character is 4th tone, e.g. 一定 yídìng
yì when the next character is 1st, 2nd, or 3rd tone, e.g. 一天 yìtiān, 一年 yìnián, 一起 yìqǐ
yī when in numbers, ordinals, and certain other cases, e.g. 一月 yīyuè, 第一 dìyī

These changes are not always written explicitly (often you'll still see 不是 bù shì in textbooks), but they are always applied in speech. (Third-tone sandhi, like nǐhǎo → níhǎo, is almost never written.)

These rules are very important because they are so common, but they still don't solve the pinyin problem in a general way.
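The 不 and 一 rules above can be sketched as follows. This assumes each syllable arrives as a (character, toneless pinyin, dictionary tone) tuple, with tone numbers used instead of diacritics to keep the logic simple. The exception handling for "numbers, ordinals, and certain other cases" is a deliberately tiny, incomplete stand-in.

```python
# Illustrative, far-from-complete hint that a following character keeps 一 as yī.
CITATION_NEXT = {"月"}

def apply_sandhi(syllables):
    """syllables: list of (char, base, tone), tone 1-4 (5 = neutral)."""
    result = []
    for i, (char, base, tone) in enumerate(syllables):
        prev_char = syllables[i - 1][0] if i > 0 else None
        next_char, next_tone = None, None
        if i + 1 < len(syllables):
            next_char, _, next_tone = syllables[i + 1]

        if char == "不":
            # bú before a 4th tone, bù otherwise.
            tone = 2 if next_tone == 4 else 4
        elif char == "一":
            if prev_char == "第" or next_char in CITATION_NEXT or next_tone is None:
                tone = 1  # ordinal, listed exception, or end of input
            elif next_tone == 4:
                tone = 2  # yí before a 4th tone
            elif next_tone in (1, 2, 3):
                tone = 4  # yì before 1st, 2nd, or 3rd tone
            # Before a neutral tone, the input tone is left unchanged.
        result.append(f"{base}{tone}")
    return result

print(apply_sandhi([("不", "bu", 4), ("是", "shi", 4)]))   # ['bu2', 'shi4']
print(apply_sandhi([("一", "yi", 1), ("天", "tian", 1)]))  # ['yi4', 'tian1']
print(apply_sandhi([("一", "yi", 1), ("月", "yue", 4)]))   # ['yi1', 'yue4']
```

The mechanical part is easy; the hard part is knowing when 一 is being used as a number or ordinal, which is exactly the contextual judgement the fixed rules cannot supply.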

Machine learning and neural nets

More sophisticated machine learning approaches have been in use since about 2017.

Early systems used a lot of crafted context features (nearby characters/words, segmentation, POS tags); modern systems use neural networks like Transformer/BERT-style encoders to model sentence meaning.

This typically eliminates most of the "obvious" 长/着/得/还-type errors that we have discussed. But it is far from perfect. Rare words, highly contextual pronunciation variants, brand names, modern slang, etc. all still have high error rates.

For example, Google's Cloud Natural Language API, as of February 2026, produces the following segmentation:

前面有十来个人排成长队

↓

前面/有/十来/个人/排/成长/队

Any pinyin pipeline downstream of this is going to have a very hard time making sense of that! The correct segmentation is shown below.

Large Language Models (LLMs)

In our Chinese reading app, we augment several of the above methods with queries to various LLMs. This is the result of our complete annotation pipeline applied to a few of the examples above:

This is pretty good! All characters have the correct pinyin, and words are segmented in a reasonable way.

Unfortunately, LLMs are not a cure-all by themselves. They are very good at recognising long-tail words that traditional approaches get wrong, but their responses can be inconsistent and need extensive downstream processing. LLMs can also be slower and costlier at scale.
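As one hypothetical illustration of what downstream processing could look like (this is not a description of our actual pipeline), the sketch below takes several candidate readings for a character, discards any that are not known readings of that character, and takes a majority vote over the rest. The table, function name, and fallback behaviour are all assumptions.

```python
from collections import Counter

# Hypothetical table of attested readings per character.
KNOWN_READINGS = {
    "还": {"hái", "huán"},
    "得": {"de", "dé", "děi"},
}

def pick_reading(char, candidates, fallback):
    """Filter out impossible readings, then majority-vote the remainder."""
    allowed = KNOWN_READINGS.get(char, set())
    valid = [c for c in candidates if c in allowed]
    if not valid:
        # Nothing usable came back: keep the non-LLM answer.
        return fallback
    return Counter(valid).most_common(1)[0][0]

# Three simulated responses for 还 in 还给他 (no real API is called here).
print(pick_reading("还", ["huán", "huán", "hái"], fallback="hái"))  # huán
print(pick_reading("还", ["huan2", "hài"], fallback="hái"))         # hái
```

A check like this only catches malformed or impossible answers; it cannot tell which of two valid readings is right in context.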

If you submit an article to our interactive reading tool, you'll notice that the article displays, and then after a few seconds some of the words are updated. This is the result of our pinyin annotation pipeline updating pronunciation in real-time.
