Chinese ASR engine selection baseline

Chinese ASR engine selection baseline (2026 perspective)

TL;DR

  1. aistack currently routes lang=zh* to SenseVoiceSmall by default, but this choice has never been measurement-validated — we have not run a Chinese bench, and bench/audio/ contains only English audio
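  2. Public benchmark data (average over AISHELL-1/2 + WenetSpeech net/meeting, 2024–2025) ranks the engines as: FireRedASR2S-LLM (2.89%) > FireRedASR-LLM (3.05%) > FireRedASR-AED (3.18%) ≫ SenseVoice-L (4.47%) ≈ Paraformer-large (4.56%). SenseVoice-L sits at the bottom of this comparison, level with Paraformer-large, and SenseVoiceSmall (the variant aistack actually routes to) does not appear on it at all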
  3. But SenseVoice is not a wrong choice — its design point is “lightweight + multilingual + emotion/event tagging”, not “Chinese quality ceiling”. “Best Chinese quality” is a different question
  4. On 8 GB VRAM consumer cards, FireRedASR-AED (1.1B) is currently the highest-value Chinese ASR candidate — Apache-2.0 license + acceptable integration cost
  5. Foundational work that must come first: stand up a Chinese bench (real audio + ground truth + CER evaluation script). Without it, any “swap engine” decision is guesswork

aistack’s auto router (aistack/api/asr.py:_select_for_auto) already routes lang=zh* to SenseVoice, but this choice rests only on “it supports Chinese + the FunASR runtime is already installed” — with no quality data backing it. Specifically:

  • bench/audio/ is all English audio (perf-12min.mp3 ~ perf-97min.mp3); the *_zh.srt files are Chinese translations of English audio, not ground truth for Chinese ASR
  • bench/asr_eval.py only runs LibriSpeech; it is an English WER bench with no Chinese path
  • The backlog item “Mandarin ASR dataset (Common Voice zh-CN or AISHELL dev)” has been ready for a long time but never actioned

A research-shaped aistack should be quantifying this. This note is the desk research that precedes that work: it consolidates the existing public knowledge before any measurement, to avoid detours.

2. Currently viable Chinese ASR engines (2024–2026)

Sorted by “fit with 8 GB VRAM consumer hardware + aistack’s research-shaped form”:

| Engine | Source | Params | Public avg CER¹ | 8GB | License | Design intent |
|---|---|---|---|---|---|---|
| SenseVoiceSmall | Alibaba / FunASR | ~234 M | not on mainstream benchmark² | ✅ trivial | MIT (FunASR) | Lightweight multilingual + emotion/event tagging |
| SenseVoice-Large | Alibaba / FunASR | ~1.6 B | 4.47% | ✅ easy | MIT | Multilingual general-purpose large variant |
| Paraformer-large | Alibaba / FunASR | ~220 M | 4.56% | ✅ trivial | MIT | Alibaba’s long-standing Chinese-specialist model |
| Paraformer-large-vad-punc | Alibaba / FunASR | ~220 M + VAD/punc | same + bundled punctuation | ✅ trivial | MIT | Paraformer + punctuation restoration + VAD, plug-and-play |
| FireRedASR-AED | XiaoHongShu / 2025-01 | 1.1 B | 3.18% | ✅ tight but workable³ | Apache-2.0 | Current value champion: 30% lower CER than Paraformer-large at 5× the parameters |
| FireRedASR-LLM | XiaoHongShu / 2025-01 | 8.3 B | 3.05% |  | Apache-2.0 | Quality ceiling, but does not fit on 8 GB |
| FireRedASR2S-AED | XiaoHongShu / 2025 | undisclosed | undisclosed | ✅ presumed | Apache-2.0 | v2 update with VAD/LID/Punc |
| FireRedASR2S-LLM | XiaoHongShu / 2025 | undisclosed | 2.89% |  | Apache-2.0 | Current public SOTA, but too large |
| Fun-ASR-Nano | Alibaba / Tongyi 2025-12 | undisclosed⁴ | undisclosed | presumed ✅ | Mostly commercial SaaS⁵ | Multi-dialect (7 dialects + 26 accents), low hallucination (78.5%→10.7%) |
| Whisper-large-v3 | OpenAI | ~1.5 B | clearly behind specialists on Chinese⁶ |  | MIT | Multilingual general baseline |
| Belle-whisper-large-v3-zh | BELLE / 2024 | ~1.5 B | 24–65% improvement over Whisper-v3 |  | Whisper-derived | Chinese-fine-tuned, community-built |
| Qwen2-Audio | Alibaba | 7 B+ |  | ⚠ tight | Commercial license | Multimodal LLM, strong on “understanding” not pure transcription efficiency |

¹ avg CER = average across AISHELL-1 + AISHELL-2-ios + WenetSpeech-net + WenetSpeech-meeting. Source: FireRedASR paper
² SenseVoiceSmall does not appear in the FunASR paper’s mainstream four-set comparison; only the SenseVoice paper itself reports Chinese CER, under different settings
³ FireRedASR-AED 1.1B in fp16 is ~2.2 GB of weights; with activations and KV cache the total is ~4–5 GB, which fits on 8 GB but is tighter than Paraformer
⁴ Fun-ASR is Alibaba’s new-generation large speech recognition model, launched 2025-09; the Nano variant, launched 2025-12, emphasizes real-time and multi-dialect use
⁵ Fun-ASR primarily ships through Tongyi DashScope’s commercial API; the GitHub repo is open, but the training data scale makes it product-direction rather than research-direction
⁶ Whisper-large-v3 has no official AISHELL/WenetSpeech CER for Chinese, but the community consensus is “clearly behind specialist models”; Belle’s “24–65% relative improvement” is indirect evidence of this

3. Each engine’s design intent (avoid misreading the pitch)

SenseVoice

  • Real selling point: a 234M model handling ZH/EN/JP/KO + emotion recognition + acoustic event detection — one model for “speech understanding”, not just transcription
  • Why it is not the ceiling: Alibaba’s own “pure Chinese SOTA” track is Paraformer and Fun-ASR; SenseVoice is on a different product axis
  • Good for: scenarios needing language switching + emotion tagging + real-time captions (live streams, meetings)
  • Bad for: high-quality Chinese transcription. SenseVoice-L sits near the bottom of the mainstream comparison (4.47%, level with Paraformer-large), and the Small variant has no published number on it

Paraformer-large

  • Design point: non-autoregressive (NAR) end-to-end, no Whisper-style 30s padding limit, single-step parallel decoding hence fast
  • Quality position: long considered a strong baseline for open-source Chinese ASR SOTA; was the first choice before FireRedASR appeared
  • Variants: speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch (raw) + speech_paraformer-large-vad-punc_asr_* (with VAD and punctuation). The VAD-punc variant is more production-ready out of the box
  • Integration cost: FunASR is already installed in aistack (same runtime SenseVoice uses); adding Paraformer is the smallest possible increment

FireRedASR (AED / LLM)

  • Design point: XiaoHongShu’s industrial-grade in-house model, AED and LLM dual track. AED is “the practical version”, LLM is “the quality ceiling”
  • Public SOTA: in the paper, FireRedASR-LLM cuts Paraformer-large’s 5.80% to 3.48% (about a 40% relative reduction by these figures). FireRedASR2S-LLM goes further to 2.89%
  • AED’s practical value: 1.1B parameters, public CER 3.18%, nearly 30% lower than Paraformer-large at only 5× the parameter cost. This is the engine “that should be integrated immediately”
  • Risks: ① not the FunASR runtime, requires a separate dependency install (its own inference package) ② 40–60s input hard limit, long audio requires application-layer segmentation ③ smaller community than FunASR, “issue response time” uncertain
  • Code license: Apache-2.0, clean

Fun-ASR

  • Design point: Alibaba Tongyi Lab’s 2025 new-generation ASR large model, emphasizing multi-dialect, low hallucination, real-time
  • Key number: in noisy scenarios cuts hallucination rate from 78.5% to 10.7% (one of the most noteworthy improvement directions since the Whisper era)
  • Risks: primarily ships via Alibaba DashScope commercial API; the open version (GitHub FunAudioLLM/Fun-ASR) is usable, but training data scale, weights completeness, and long-term maintenance commitment are all less certain than the traditional Paraformer line
  • When to consider: wait until the Nano variant (end-to-end real-time ASR) has a clearer open-source story, then evaluate

Whisper-large-v3

  • Significance: as a multilingual general model, its performance on Chinese is the “reference frame”, not a product option
  • Fine-tune path: BELLE-2’s Belle-whisper-large-v3-zh improves Chinese CER 24–65% — much better than vanilla Whisper but still behind FireRedASR
  • When to use: ① when needing simultaneous ZH/EN/JP/KO support and model storage is tight ② when running cross-language general baseline tests

Noted but not measured

  • FireRedASR-LLM (8.3B) / FireRedASR2S-LLM / Qwen2-Audio (7B+): these either do not fit on 8 GB or are tight to the point of being impractical; aistack’s research-shaped form notes that they exist but does not measure them
  • Conformer/Branchformer academic variants: no public production-grade weights; the papers can be cited, but no measurement effort is planned

4. Public benchmark data (prior conclusions)

CER on the four standard test sets (lower is better), sourced from FireRedASR paper Table 1, FunASR paper, Alibaba ModelScope model cards:

| Model | AISHELL-1 | AISHELL-2-ios | WenetSpeech-net | WenetSpeech-meeting | avg |
|---|---|---|---|---|---|
| FireRedASR-LLM (8.3B) | 0.76 | 2.15 | 4.60 | 4.67 | 3.05 |
| FireRedASR-AED (1.1B) | 0.55 | 2.52 | 4.88 | 4.76 | 3.18 |
| FireRedASR2S-LLM (?) |  |  |  |  | 2.89 (Mandarin avg) |
| SenseVoice-L (~1.6B) |  |  |  |  | 4.47 |
| Paraformer-large (~220M) | 1.95 | 2.85 |  | 6.97 | 4.56 (FireRedASR paper, same setup)¹ |
| Whisper-large-v3 (~1.5B) |  |  |  |  | far above specialist models² |

¹ The gap between the FireRedASR paper’s Paraformer-large figure (5.80%) and FunASR’s self-reported 1.95% reflects different Paraformer variants across papers (vad-punc or not), different normalization strategies, and different WenetSpeech subset groupings; single numbers must be compared with setup awareness
² Whisper’s Chinese CER is typically reported as “clearly behind” without a standard number. Belle-whisper-large-v3-zh’s 24–65% relative improvement implies vanilla Whisper Chinese CER is in the 10%+ range (specifics await our own measurement)

Key observations:

  • AISHELL-1 (read speech, quiet, professional broadcast voice) has been pushed below 1% CER by the strongest models. It is a very poor proxy for dosmoon’s actual use cases (news / podcasts / commentary) and does not reflect real-world difficulty
  • WenetSpeech is the better real-world proxy: net is web audio (Bilibili, podcasts, video bloggers) and meeting is meeting recordings. CER in the 4–7% range there reflects the true difficulty
  • Real model gaps show up on WenetSpeech; on AISHELL-1 the FireRedASR models are below 1% and Paraformer-large is below 2%, so the set discriminates poorly
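
The averages and the “30% lower CER” figure quoted in this note can be re-derived from the per-set numbers in the table above. The following is a small arithmetic check only; the dictionary layout and function names are illustrative and not part of aistack.

```python
# Per-set CER (%) copied from the table above, in the order:
# AISHELL-1, AISHELL-2-ios, WenetSpeech-net, WenetSpeech-meeting.
per_set_cer = {
    "FireRedASR-LLM": [0.76, 2.15, 4.60, 4.67],
    "FireRedASR-AED": [0.55, 2.52, 4.88, 4.76],
}


def average(values):
    """Unweighted mean across the four test sets."""
    return sum(values) / len(values)


def relative_reduction(baseline, candidate):
    """Fraction by which `candidate` lowers CER relative to `baseline`."""
    return (baseline - candidate) / baseline


avg = {name: average(values) for name, values in per_set_cer.items()}

# Paraformer-large's 4.56% average is taken from the same table.
aed_vs_paraformer = relative_reduction(4.56, avg["FireRedASR-AED"])

for name, value in avg.items():
    print(f"{name}: avg CER {value:.3f}%")
print(f"FireRedASR-AED vs Paraformer-large: {aed_vs_paraformer:.1%} lower")
```

The unweighted means come out at 3.045 and 3.1775, which match the published 3.05% and 3.18% after rounding.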

Chinese ASR evaluation must use CER (Character Error Rate), not WER. Reasons and considerations:

  • Chinese has no natural word boundaries (“我去北京” is sliced as 4 / 2 / 3 words depending on the segmenter)
  • Different segmentation tools (jieba / pkuseg / hanlp) produce different word counts and boundaries — unstable as a metric unit
  • The character is the smallest unambiguous unit of measure and carries no segmenter dependency
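
To make the segmenter dependency concrete, here is a minimal sketch that scores the same one-character error at character level and under two word segmentations. The segmentations are written by hand as stand-ins for real segmenter output, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance over reference length: CER for strings, WER for token lists."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "我去北京"
hypothesis = "我去南京"  # exactly one wrong character

# Character level: 1 substitution over 4 characters.
cer = error_rate(reference, hypothesis)

# Word level: the same error, scored under two hand-written segmentations.
wer_fine = error_rate(["我", "去", "北京"], ["我", "去", "南京"])
wer_coarse = error_rate(["我", "去北京"], ["我", "去南京"])

print(cer, wer_fine, wer_coarse)
```

The character-level score stays at 0.25 regardless of tooling, while the word-level score moves between 1/3 and 1/2 purely because of where the segmenter draws boundaries.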

Both ASR output and ground truth must be normalized to a consistent representation before measurement, or numbers will be incomparable:

| Dimension | Issue | Treatment |
|---|---|---|
| Numbers | “5000” vs “五千” | Unify to Chinese numerals (or all to Arabic) |
| Punctuation | Some models emit punctuation, others do not; “。” vs “.” | When measuring text only, strip all punctuation; for punctuated versions, unify punctuation form |
| Full-width / half-width | “ＡＢＣ” vs “ABC” | Convert all to half-width |
| English case | “OK” vs “ok” | Unify to lower or upper |
| Spaces | Space placement in mixed CJK + Latin text | Strip all spaces before computing CER (or use consistent tokenization) |
| Traditional / simplified | “雲端” vs “云端” | Unify to simplified (or keep original if matching ground truth) |
| Homophones | “在” vs “再” is a real recognition error | Do not normalize; count it as an error |

Tools:

  • jiwer (Python lib): most common, supports custom transformations. Compute CER directly with jiwer.cer(reference, hypothesis)
  • ESPnet sclite wrapper: common in academia, most consistent with paper-reported numbers
  • zhconv: traditional/simplified conversion
  • Recommended: jiwer + zhconv + a custom number-normalization function (see FunASR’s funasr/utils/postprocess_utils.py for an off-the-shelf implementation)

Pitfalls:

  • CER numbers in papers do not necessarily share normalization; cross-paper comparisons must verify setup
  • The strongest models on AISHELL-1 are already below 1%, so using it alone for ranking is essentially useless; WenetSpeech must be included
  • If an engine’s output has punctuation but the ground truth does not, computing CER directly inflates the number through punctuation mismatch

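
Part of the normalization described above can be sketched with the standard library alone. This is an illustrative subset, not aistack’s implementation: the function name is hypothetical, and traditional/simplified unification (zhconv) and numeral normalization are deliberately left out because they need extra dependencies or custom rules.

```python
import re
import unicodedata

# Anything that is neither a word character nor whitespace, plus underscore.
# In Python's Unicode-aware `re`, CJK ideographs count as word characters,
# so this removes ASCII and CJK punctuation but keeps hanzi, letters, digits.
_PUNCTUATION = re.compile(r"[^\w\s]|_")
_WHITESPACE = re.compile(r"\s+")


def normalize_for_cer(text: str) -> str:
    """Apply the width, case, punctuation and space rules before scoring."""
    text = unicodedata.normalize("NFKC", text)  # full-width -> half-width
    text = text.lower()                          # unify English case
    text = _PUNCTUATION.sub("", text)            # strip punctuation
    text = _WHITESPACE.sub("", text)             # strip all spaces
    return text


engine_output = "ＯＫ，我 去 北京。"
ground_truth = "ok我去北京"

print(normalize_for_cer(engine_output) == normalize_for_cer(ground_truth))
```

Homophones are untouched by design: “在” and “再” remain different strings after normalization, so they are still counted as errors. Note that NFKC also folds some compatibility characters beyond width variants, which is worth checking against the ground truth before adopting it.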
Candidate test sets:

| Set | Duration | Content | Difficulty | dosmoon scenario representativeness |
|---|---|---|---|---|
| AISHELL-1 test | 5h | Professional broadcaster reading | Very low | Very poor |
| AISHELL-2 test-ios | 5h | iOS recording, home environment | Medium | Medium |
| WenetSpeech test-net | 24h | Bilibili / podcast / video blogger | Medium-high | High |
| WenetSpeech test-meeting | 15h | Meeting recordings | High | Medium (dosmoon doesn’t focus on meetings) |
| SpeechIO test sets | Various | Real multi-scene (YouTube/TV/podcast) | Medium-high | High |
| Common Voice zh-CN | Various | Crowdsourced reading | Low | Low |

Recommended dosmoon Chinese bench composition

Given dosmoon’s actual content shape (news, politics, economics, podcast commentary), three layers are recommended:

L1: Standard reference baseline (a few minutes, quick to run)

  • 5–10 clips from AISHELL-1 dev (each ≤ 30s) — “sanity check” matching paper numbers
  • 5–10 clips from WenetSpeech test-net — real-audio representative

L2: Long audio (reflects real use)

  • One 17–25 min segment of Chinese news or interview audio, self-collected or from a public source, with manual or semi-automatic ground truth
  • This layer is the proxy for dosmoon’s real use; cross-engine stability differences on long audio surface here

L3: Hard samples (catch the failure modes)

  • Podcast clips with background music
  • Multi-speaker dialogue clips
  • Political/economic commentary dense with proper nouns, foreign words, numbers
  • This layer is not for scoring, but for “see which engine breaks on the hard cases”

First version does L1+L2 only, L3 is later research.

  • ❌ Evaluate using only AISHELL-1 — low discrimination, conclusions meaningless
  • ❌ Use synthetic or TTS audio as ground truth — disconnected from real scenarios
  • ❌ Run the full 6+ hour test sets — measurement cost is high; research-shaped form does not need paper-grade reproducibility

dosmoon’s actual content (17–50 min news, podcasts) far exceeds the single-input limit of most engines:

| Engine | Single-input limit | Recommended long-audio strategy |
|---|---|---|
| Whisper / faster-whisper | 30s internal window (auto-segmented) | Engine handles segmentation, but use VAD filter |
| Paraformer | No hard limit (memory-bound) | Feed directly, or VAD into 30s slices |
| SenseVoice | ~30s recommended per call | Pre-VAD (FunASR pipeline includes fsmn-vad) |
| FireRedASR-AED | 60s hard limit | Application-layer VAD segmentation + post-processing concatenation |
| FireRedASR-LLM | 40s hard limit | Same, tighter |
| FireRedASR2S | Same as v1 | Same |

aistack already implements application-layer 12-min chunking + word-LCS stitching on the Parakeet path (see the addendum in parakeet-on-consumer-gpu). This mechanism can be reused for FireRedASR-AED, but segmentation granularity must shrink from 12min to ~50s (close to the 60s ceiling with margin), and stitching strategy may need to switch from word-LCS to character-LCS (Chinese has no word boundaries).
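
The segmentation-plus-stitching idea can be sketched as follows. This is not aistack’s Parakeet code: it assumes fixed windows with a 5 s overlap (the table above recommends VAD-based cuts instead), and it uses `difflib.SequenceMatcher.find_longest_match`, which finds the longest contiguous character match, as a simpler stand-in for a true character-level LCS. All names and default values are hypothetical.

```python
from difflib import SequenceMatcher


def plan_windows(total_s, window_s=50.0, overlap_s=5.0):
    """Cut [0, total_s] into windows of at most window_s that overlap by overlap_s."""
    if overlap_s >= window_s:
        raise ValueError("overlap must be shorter than the window")
    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + window_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start = end - overlap_s
    return windows


def stitch(left, right, max_overlap=40, min_match=2):
    """Join two adjacent chunk transcripts on their longest shared character run."""
    tail = left[-max_overlap:]
    head = right[:max_overlap]
    matcher = SequenceMatcher(None, tail, head, autojunk=False)
    match = matcher.find_longest_match(0, len(tail), 0, len(head))
    if match.size < min_match:
        return left + right  # no reliable anchor: plain concatenation
    # Keep `left` through the end of the match, resume `right` just after it.
    cut_left = len(left) - len(tail) + match.a + match.size
    cut_right = match.b + match.size
    return left[:cut_left] + right[cut_right:]


print(plan_windows(120.0))
print(stitch("今天的新闻主要关注经济增长", "关注经济增长和就业数据"))
```

Working on characters rather than words avoids any segmenter dependency in the stitching step, at the cost of possible false anchors on very short repeated phrases; `min_match` is the knob for that trade-off.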

8. Runtime considerations (aistack 8 GB card view)

| Engine | Already installed deps | New deps | Model size | Effort |
|---|---|---|---|---|
| Paraformer-large(-vad-punc) | ✅ FunASR + funasr installed | 0 | ~880 MB download | ~2 hours (write paraformer.py copying sensevoice.py) |
| SenseVoiceSmall | ✅ installed + integrated | 0 | already downloaded | 0 (already in production path) |
| FireRedASR-AED |  | FireRedASR’s inference package + PyTorch | ~4.4 GB download | ~half a day (new aistack/asr/fireredasr.py + write chunked-mode adapter for the 60s limit) |
| FireRedASR2S-AED |  | Same but newer | unknown | Half day to 1 day, wait for community to settle a bit before doing |
| Whisper-large-v3 | ✅ faster-whisper installed | 0 | ~3 GB download | 0 (API-layer model=large-v3) |
| Fun-ASR (open version) | partial | TBD | TBD | Wait for weights/inference to stabilize |

When multiple models are resident on an 8 GB card:

  • Paraformer + SenseVoice + Whisper-small coexisting: under 4 GB total, comfortable
  • FireRedASR-AED resident alone takes ~5 GB; it cannot coexist with a large Ollama model and must rely on the _model_cache evict_category mechanism for hot swap
  • aistack’s existing hot-swap mechanism (mutually exclusive residence within asr-main) already handles this; FireRedASR simply joins category="asr-main"
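
The weight sizes quoted in this note follow from parameter count times bytes per parameter. The sketch below reproduces that arithmetic only; it says nothing about activations, KV cache, or framework buffers, which is exactly what has to be measured. The function name and dictionary are illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

# Parameter counts as listed in this note.
PARAMS = {
    "SenseVoiceSmall": 234e6,
    "Paraformer-large": 220e6,
    "FireRedASR-AED": 1.1e9,
}


def weight_gb(params: float, dtype: str = "fp16") -> float:
    """Size of the raw weights in GB (10^9 bytes); runtime overhead not included."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in PARAMS.items():
    print(f"{name}: fp32 {weight_gb(params, 'fp32'):.2f} GB, "
          f"fp16 {weight_gb(params, 'fp16'):.2f} GB")

# Two FunASR models resident together, weights only, in fp32.
funasr_pair_fp32 = (weight_gb(PARAMS["SenseVoiceSmall"], "fp32")
                    + weight_gb(PARAMS["Paraformer-large"], "fp32"))
print(f"SenseVoiceSmall + Paraformer-large (fp32 weights): {funasr_pair_fp32:.2f} GB")
```

This gives 2.2 GB (fp16) and 4.4 GB (fp32) for FireRedASR-AED and about 0.88 GB (fp32) for Paraformer-large, consistent with the figures above; the gap between these weight-only numbers and the ~5 GB resident estimate is the runtime overhead.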

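Licensing: the FunASR, FireRedASR, and Whisper families are all permissive, with no copyleft contamination; Fun-ASR and Qwen2-Audio still need their terms checked: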

| Engine | License | Commercial | Integrate into aistack / product form |
|---|---|---|---|
| Paraformer / SenseVoice (FunASR) | MIT |  |  |
| FireRedASR / FireRedASR2S | Apache-2.0 |  |  |
| Whisper / faster-whisper / CTranslate2 | MIT |  |  |
| Belle-whisper-large-v3-zh | Inherits Whisper upstream MIT |  |  |
| Fun-ASR | TBD (per GitHub repo terms) | TBD | Hold |
| Qwen2-Audio | Commercial license (Alibaba) | ⚠ read terms | Not currently evaluated |

9. aistack’s Chinese ASR measurement plan (phased when implementing)

This document is desk research; this section only lists the plan, not the action. Real implementation follows backlog scheduling.

Phase 0 (stand up the Chinese bench)

  • Add L1+L2 Chinese audio to bench/audio/ (AISHELL sample + WenetSpeech sample + 1 long-audio piece)
  • Write bench/zh_eval.py: jiwer + zhconv + number normalization, output CER
  • Keep existing bench/asr_eval.py English WER eval; parallel, no conflict

Phase 1 (integrate Paraformer-large)

  • Add aistack/asr/paraformer.py (mirror sensevoice.py structure, reuse FunASR runtime)
  • Add a Paraformer-priority branch for zh in _select_for_auto
  • Run SenseVoice vs Paraformer-large CER comparison on L1+L2, write a daily log

Phase 2 (integrate FireRedASR-AED)

  • Add aistack/asr/fireredasr.py, integrate FireRedASR’s inference package
  • Write 60s segmentation + concatenation logic, reference Parakeet chunked mode
  • Compare Paraformer / SenseVoice / FireRedASR-AED on L1+L2
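
One design choice the planned bench/zh_eval.py will face is how to aggregate CER across clips of very different length (30 s L1 clips next to a 20 min L2 recording). The sketch below contrasts the two common options on hypothetical data; it is an assumption about how the script could work, not a description of existing code, and it expects text that has already been normalized.

```python
def char_edits(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def corpus_cer(pairs):
    """Micro-average: total edits over total reference characters."""
    total_edits = sum(char_edits(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    return total_edits / total_chars


def mean_clip_cer(pairs):
    """Macro-average: mean of per-clip CER, every clip weighted equally."""
    return sum(char_edits(ref, hyp) / len(ref) for ref, hyp in pairs) / len(pairs)


# Hypothetical (reference, hypothesis) pairs with non-empty references.
pairs = [
    ("我去北京", "我去南京"),                              # 1 error in 4 characters
    ("今天天气很好我们出去走走", "今天天气很好我们出去走走"),  # 12 characters, no errors
]

micro = corpus_cer(pairs)
macro = mean_clip_cer(pairs)
print(f"corpus CER {micro:.4f}, mean per-clip CER {macro:.4f}")
```

The corpus-level figure weights each character equally, so long recordings dominate; the per-clip mean weights each clip equally, so a single bad short clip moves it more. Reporting both per layer (L1, L2) keeps cross-engine comparisons honest.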

Phase 3 (decide the final zh routing choice)

  • Look at Phase 2 data, decide whom _select_for_auto defaults to for zh
  • Write the decision rationale into the daily log and the aistack product-path design document (internal)
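
Keeping the zh choice as data rather than as a hard-coded branch would make the Phase 3 decision a one-line change. The following is a hypothetical sketch of such a router; the real _select_for_auto signature, engine identifiers, and fallback are not shown in this note, so every name here is an assumption.

```python
# Hypothetical engine identifiers; the default mirrors today's behaviour
# (zh routed to SenseVoice) and would be edited once Phase 2 data is in.
DEFAULT_ZH_PRIORITY = ("sensevoice-small",)


def select_for_auto(lang, available, zh_priority=DEFAULT_ZH_PRIORITY, fallback="whisper"):
    """Pick the first available engine from the zh priority list for lang=zh*."""
    if lang and lang.lower().startswith("zh"):
        for engine in zh_priority:
            if engine in available:
                return engine
    return fallback


installed = {"sensevoice-small", "paraformer-large", "whisper"}

# Current behaviour.
print(select_for_auto("zh-CN", installed))

# Phase 1 style experiment: try Paraformer first, keep SenseVoice as backup.
print(select_for_auto("zh", installed, zh_priority=("paraformer-large", "sensevoice-small")))
```

With this shape, the measured ranking from the bench maps directly onto the order of `zh_priority`, and an engine that is not installed is skipped rather than causing a failure.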

Out of scope

  • FireRedASR-LLM (8.3B+) integration: too large for aistack’s research hardware configuration
  • Qwen2-Audio integration: commercial license + deployment complexity, beyond aistack’s boundaries
  • Fun-ASR DashScope API integration: that is a cloud service, not a local engine, and conflicts with aistack’s positioning

Open questions (to validate by measurement)

  1. Real Chinese CER of SenseVoiceSmall: public benchmark tables only list SenseVoice-L’s 4.47%; the Small variant has no standard number. We need to measure one ourselves
  2. FireRedASR-AED’s 60s segmentation stability on dosmoon long audio: does CER drift with the segmentation strategy? Does word-LCS work on Chinese (no word boundaries), or must stitching switch to character-level?
  3. WenetSpeech-net subset’s representativeness for dosmoon real scenarios: how close is Bilibili/podcast content to the international news commentary dosmoon cares about? May need self-collected small samples
  4. Real improvement of Whisper-large-v3 + Chinese fine-tune (Belle family): the reported 24–65% relative improvement, how much is retained on long audio?
  5. speech_paraformer-large-vad-punc_* bundled VAD vs aistack’s existing fsmn-vad: do they conflict? Could the FunASR pipeline configuration layer have path collisions?
  6. Real VRAM usage of multiple FunASR models resident simultaneously (SenseVoice + Paraformer coexisting): the model cards’ “234 M” and “220 M” are parameter counts and do not add up to ~500 MB of VRAM; how do weights plus runtime buffers behave in practice?