Detect and redact PII in Chinese text with native language understanding. We support Simplified Chinese (简体中文), Traditional Chinese (繁體中文), and Chinese-specific identifier formats across Mainland China, Taiwan, Hong Kong, Macau, and Singapore.
Native Chinese NLP
Detect Chinese names in both character sets with proper surname/given name recognition.
Recognize China Resident ID (身份证), Taiwan ID, Hong Kong ID, and Singapore NRIC.
Detect mobile and landline numbers with regional formatting variations.
Parse complex Chinese address formats with province, city, district hierarchy.
Full support for Simplified and Traditional Chinese with automatic detection.
Handle regional differences across Mainland, Taiwan, Hong Kong, and Macau.
Simple integration, powerful results
Send your documents, text, or files through our secure API endpoint or web interface.
Our AI analyzes content to identify all sensitive information types with 99.7% accuracy.
Sensitive data is automatically redacted based on your configured compliance rules.
Receive your redacted content with full audit trail and compliance documentation.
Get started with just a few lines of code
Python:
import requests

api_key = "your_api_key"
url = "https://api.redactionapi.net/v1/redact"

data = {
    "text": "John Smith's SSN is 123-45-6789",
    "redaction_types": ["ssn", "person_name"],
    "output_format": "redacted",
}

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {api_key}"},
    json=data,
    timeout=30,  # avoid hanging indefinitely on network problems
)
response.raise_for_status()  # raise on 4xx/5xx instead of printing an error body
print(response.json())
# Output: {"redacted_text": "[PERSON_NAME]'s SSN is [SSN_REDACTED]"}
JavaScript (Node.js):
const axios = require('axios');

const apiKey = 'your_api_key';
const url = 'https://api.redactionapi.net/v1/redact';

const data = {
  text: "John Smith's SSN is 123-45-6789",
  redaction_types: ['ssn', 'person_name'],
  output_format: 'redacted'
};

axios.post(url, data, {
  headers: { Authorization: `Bearer ${apiKey}` }
})
  .then(response => {
    console.log(response.data);
    // Output: {"redacted_text": "[PERSON_NAME]'s SSN is [SSN_REDACTED]"}
  })
  .catch(error => {
    // Surface HTTP and network failures instead of failing silently
    console.error(error.message);
  });
cURL:
# The apostrophe in "Smith's" is written as '\'' so it does not end the
# single-quoted JSON body early.
curl -X POST https://api.redactionapi.net/v1/redact \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "John Smith'\''s SSN is 123-45-6789",
    "redaction_types": ["ssn", "person_name"],
    "output_format": "redacted"
  }'
# Response:
# {"redacted_text": "[PERSON_NAME]'s SSN is [SSN_REDACTED]"}
Chinese presents unique challenges for automated PII detection. Unlike alphabetic languages, Chinese uses logographic characters without word boundaries, so text must be segmented before entity recognition can occur. Names follow different structural patterns than Western names: a surname, usually one character and occasionally two, precedes a one- or two-character given name. Address formats use a hierarchical structure from large to small geographical units. Phone numbers and identification documents vary significantly across Mainland China, Taiwan, Hong Kong, and other Chinese-speaking regions.
Our Chinese language processing addresses these challenges with native NLP models trained specifically for Chinese text. We support both Simplified Chinese (简体中文) used in Mainland China and Singapore, and Traditional Chinese (繁體中文) used in Taiwan, Hong Kong, and Macau. The system handles mixed-script documents, code-switching between Chinese and English, and regional variations in identifier formats and terminology.
Chinese names require specialized detection approaches:
Name Structure: Chinese names typically consist of a one- or two-character surname (姓) followed by a one- or two-character given name (名). Unlike Western names, the surname comes first, so most full names are two or three characters long.
Surname Recognition: We maintain comprehensive surname dictionaries covering common and rare surnames, including compound (two-character) surnames such as 欧阳 and 司马.
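Dictionary-based surname matching can be sketched as a longest-match lookup. The snippet below is a minimal illustration, not the service's actual implementation: the two tiny surname sets and the function name `split_name` are hypothetical, and a real dictionary would contain hundreds of entries.

```python
from typing import Optional, Tuple

# Hypothetical mini-dictionaries for illustration only.
COMPOUND_SURNAMES = {"欧阳", "司马"}
SINGLE_SURNAMES = {"王", "李", "张", "刘", "陈"}


def split_name(name: str) -> Optional[Tuple[str, str]]:
    """Split a full name into (surname, given name), or return None."""
    # Try the two-character surname first so that 欧阳 is not misread
    # as the single-character surname 欧 followed by a given name.
    for size in (2, 1):
        surname, given = name[:size], name[size:]
        known = COMPOUND_SURNAMES if size == 2 else SINGLE_SURNAMES
        if surname in known and 1 <= len(given) <= 2:
            return surname, given
    return None


print(split_name("欧阳修"))  # → ('欧阳', '修')
print(split_name("李明华"))  # → ('李', '明华')
```

A lookup like this also accepts ordinary words that happen to start with a surname character (for example 王冠, "crown"), which is why dictionary matching has to be combined with context analysis.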
Disambiguation: Many Chinese characters serve as both surnames and common words. Context analysis distinguishes names from regular text:
// Name vs. common word disambiguation
"王先生来了" → 王 detected as surname (Mr. Wang came)
"国王的王冠" → 王 not a name (The king's crown)
Different Chinese-speaking regions use distinct ID formats:
China Resident Identity Card (居民身份证号码):
18-digit number encoding region, birthdate, sequence, and checksum:
Format: RRRRRRYYYYMMDDSSSC
- RRRRRR: 6-digit region code (province, city, district)
- YYYYMMDD: 8-digit birthdate
- SSS: 3-digit sequence number (odd=male, even=female)
- C: Check digit (0-9 or X)
Example: 110105199003071239
- 110105: Beijing, Chaoyang District
- 19900307: Born March 7, 1990
- 123: Sequence number (odd, so male)
- 9: Check digit
Validation: Weighted sum modulo 11 checksum
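The weighted modulo-11 check can be sketched in a few lines of Python. This is a minimal illustration rather than the service's actual validator: the weights and check-character table follow the national standard GB 11643-1999, the function names `china_id_check_char` and `is_valid_china_id` are hypothetical, and the region code is not verified against the official list.

```python
import re
from datetime import date

# Position weights for the first 17 digits and the check-character table,
# indexed by (weighted sum mod 11).
_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
_CHECK_CHARS = "10X98765432"
_ID_RE = re.compile(r"[0-9]{17}[0-9Xx]")


def china_id_check_char(first17: str) -> str:
    """Compute the expected check character for the first 17 digits."""
    total = sum(int(d) * w for d, w in zip(first17, _WEIGHTS))
    return _CHECK_CHARS[total % 11]


def is_valid_china_id(candidate: str) -> bool:
    """Check shape, birthdate plausibility, and checksum of an 18-character ID."""
    if not _ID_RE.fullmatch(candidate):
        return False
    try:
        # Reject impossible dates such as month 13 or February 30.
        date(int(candidate[6:10]), int(candidate[10:12]), int(candidate[12:14]))
    except ValueError:
        return False
    return china_id_check_char(candidate[:17]) == candidate[17].upper()


# 11010519491231002X is a commonly cited sample number.
print(is_valid_china_id("11010519491231002X"))  # → True
print(is_valid_china_id("110105194912310021"))  # → False
```

Checksum validation is useful mainly for cutting false positives: an arbitrary 18-digit string passes the check only about one time in eleven.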
Taiwan National ID (國民身分證統一編號):
Format: LSNNNNNNNC (10 characters)
- L: Letter indicating registration location
- S: Gender digit (1=male, 2=female)
- NNNNNNN: 7-digit serial number
- C: Check digit
Example: A123456789
Validation: Weighted checksum algorithm
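As a rough sketch of the weighted checksum, the commonly documented algorithm converts the leading letter to a two-digit code, multiplies the resulting 11 digits by fixed weights, and requires the sum to be divisible by 10. The function name `is_valid_taiwan_id` is hypothetical, and the snippet only covers the classic national ID shape (gender digit 1 or 2), not newer resident certificate formats.

```python
import re

# Letters listed in the order that maps them to codes 10..35
# (A=10, B=11, ..., W=32, Z=33, I=34, O=35).
_LETTERS = "ABCDEFGHJKLMNPQRSTUVXYWZIO"
_WEIGHTS = (1, 9, 8, 7, 6, 5, 4, 3, 2, 1, 1)
_TW_ID_RE = re.compile(r"[A-Z][12][0-9]{8}")


def is_valid_taiwan_id(candidate: str) -> bool:
    """Validate the shape and checksum of a Taiwan national ID number."""
    candidate = candidate.upper()
    if not _TW_ID_RE.fullmatch(candidate):
        return False
    code = _LETTERS.index(candidate[0]) + 10
    # Two digits from the letter code plus the nine digits that follow it.
    digits = [code // 10, code % 10] + [int(c) for c in candidate[1:]]
    return sum(d * w for d, w in zip(digits, _WEIGHTS)) % 10 == 0


print(is_valid_taiwan_id("A123456789"))  # → True
print(is_valid_taiwan_id("A123456788"))  # → False
```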
Hong Kong Identity Card (香港身份證):
Format: L(L)NNNNNN(C)
- L: 1-2 letter prefix
- NNNNNN: 6-digit number
- C: Check digit in parentheses (0-9 or A)
Example: A123456(3)
Validation: Modulo 11 checksum
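A minimal sketch of the modulo-11 check follows, based on the commonly documented HKID algorithm: letters count as 10 to 35, a one-letter prefix is padded with a space valued 36, the eight positions are weighted 9 down to 2, and the check character brings the total to a multiple of 11 (with 10 written as A). The function names are hypothetical and this is not the service's own validator.

```python
import re

_HKID_RE = re.compile(r"([A-Z]{1,2})([0-9]{6})\(?([0-9A])\)?")


def _char_value(ch: str) -> int:
    # The padding space counts as 36; letters count as A=10 ... Z=35.
    return 36 if ch == " " else ord(ch) - ord("A") + 10


def hkid_check_char(prefix: str, digits: str) -> str:
    """Compute the expected check character for a prefix and six digits."""
    padded = prefix.rjust(2)  # left-pad one-letter prefixes with a space
    total = _char_value(padded[0]) * 9 + _char_value(padded[1]) * 8
    total += sum(int(d) * w for d, w in zip(digits, range(7, 1, -1)))
    check = (11 - total % 11) % 11
    return "A" if check == 10 else str(check)


def is_valid_hkid(candidate: str) -> bool:
    """Validate an HKID written with or without parentheses or spaces."""
    match = _HKID_RE.fullmatch(candidate.upper().replace(" ", ""))
    if match is None:
        return False
    prefix, digits, check = match.groups()
    return hkid_check_char(prefix, digits) == check


print(is_valid_hkid("A123456(3)"))  # → True
print(is_valid_hkid("A123456(4)"))  # → False
```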
Macau BIR Number (澳門居民身份證):
Format: NNNNNNN(C)
- First digit: ID type (1, 5, or 7)
- Next 6 digits: serial number
- C: Check digit in parentheses
Example: 1234567(8)
Phone number formats vary by region:
Mainland China:
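Mobile: 1XX-XXXX-XXXX (11 digits, starting with 1)
- 13X, 14X, 15X, 16X, 17X, 18X, 19X prefixes
- Prefix ranges are allocated by carrier: China Mobile (移动), China Unicom (联通), China Telecom (电信)
Landline: (0XX) XXXX-XXXX or (0XXX) XXXX-XXXX
- Area code (3 or 4 digits including the leading 0) in parentheses or with hyphen, followed by a 7- or 8-digit local number
- Beijing: 010, Shanghai: 021, etc.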
Examples:
13812345678 or 138-1234-5678
(010) 8765-4321 or 010-87654321
Taiwan:
Mobile: 09XX-XXX-XXX (10 digits)
Landline: (0X) XXXX-XXXX
Examples:
0912-345-678
(02) 2345-6789
Hong Kong:
Mobile/Landline: XXXX XXXX (8 digits)
- Mobile: 5, 6, 7, 9 prefixes
- Landline: 2, 3 prefixes
Examples:
9123 4567
2345 6789
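The regional formats above translate fairly directly into regular expressions. The sketch below is illustrative only: the `PHONE_PATTERNS` table and `find_phones` function are hypothetical names, the patterns cover just the formats listed in this section, and a region hint is required because an 8-digit Hong Kong number is otherwise indistinguishable from part of a longer number.

```python
import re
from typing import List, Tuple

# Lookarounds (?<!\d) and (?!\d) keep a pattern from matching inside a
# longer digit run such as an ID number.
PHONE_PATTERNS = {
    "CN": [
        ("mobile", re.compile(r"(?<!\d)1[3-9]\d[-\s]?\d{4}[-\s]?\d{4}(?!\d)")),
        ("landline", re.compile(r"(?<!\d)\(?0\d{2,3}\)?[-\s]?\d{3,4}[-\s]?\d{4}(?!\d)")),
    ],
    "TW": [
        ("mobile", re.compile(r"(?<!\d)09\d{2}[-\s]?\d{3}[-\s]?\d{3}(?!\d)")),
        ("landline", re.compile(r"(?<!\d)\(?0[2-8]\)?[-\s]?\d{4}[-\s]?\d{4}(?!\d)")),
    ],
    "HK": [
        ("mobile", re.compile(r"(?<!\d)[5679]\d{3}\s?\d{4}(?!\d)")),
        ("landline", re.compile(r"(?<!\d)[23]\d{3}\s?\d{4}(?!\d)")),
    ],
}


def find_phones(text: str, region: str) -> List[Tuple[str, str]]:
    """Return (kind, matched text) pairs for the given region's formats."""
    hits = []
    for kind, pattern in PHONE_PATTERNS[region]:
        hits.extend((kind, m.group()) for m in pattern.finditer(text))
    return hits


print(find_phones("0912-345-678 / (02) 2345-6789", "TW"))
# → [('mobile', '0912-345-678'), ('landline', '(02) 2345-6789')]
```

Pattern matching of this kind only proposes candidates; surrounding context (labels such as 电话 or 手机) is still needed to separate phone numbers from other digit strings.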
Chinese addresses follow a distinctive hierarchical pattern:
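Address Structure (Large to Small):
省/自治区 → 市/地区 → 区/县 → 街道/镇 → 路/街 → 号/弄 → 室/单元
(province/autonomous region → city/prefecture → district/county → subdistrict/town → road/street → number/lane → room/unit)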
Example:
北京市朝阳区建国路93号万达广场A座1502室
- 北京市: Beijing City
- 朝阳区: Chaoyang District
- 建国路93号: 93 Jianguo Road
- 万达广场A座: Wanda Plaza Building A
- 1502室: Room 1502
Address Variations:
Common Address Components:
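Administrative: 省 (province), 市 (city), 区 (district), 县 (county), 镇 (town), 乡 (township), 村 (village)
Street types: 路 (road), 街 (street), 道 (avenue), 巷 (alley), 弄 (lane), 里 (neighborhood), 胡同 (hutong)
Building types: 大厦 (tower), 广场 (plaza), 中心 (center), 花园 (garden estate), 小区 (residential compound)
Unit indicators: 栋/幢 (building), 座 (block), 号 (number), 室 (room), 单元 (unit), 楼 (floor or building)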
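Because each level of the hierarchy ends in a characteristic suffix, a well-formed address can be split with a single regular expression. The following is a simplified sketch under that assumption; `ADDRESS_RE` and `parse_address` are hypothetical names, only a handful of the suffixes listed above are covered, and conversational or abbreviated addresses would need a more flexible approach.

```python
import re
from typing import Dict

# Each named group captures one level, ending at its suffix character.
ADDRESS_RE = re.compile(
    r"(?P<province>[^省]+?(?:省|自治区))?"   # optional: municipalities have none
    r"(?P<city>[^市]+?市)"
    r"(?P<district>[^区县]+?[区县])"
    r"(?P<street>.+?[路街道巷])"
    r"(?P<number>\d+[号弄])"
    r"(?P<building>.+?[座栋幢楼])?"
    r"(?P<room>\d+室)?"
)


def parse_address(address: str) -> Dict[str, str]:
    """Return the address components that were found, keyed by level."""
    match = ADDRESS_RE.match(address)
    if match is None:
        return {}
    return {level: text for level, text in match.groupdict().items() if text}


print(parse_address("北京市朝阳区建国路93号万达广场A座1502室"))
# → {'city': '北京市', 'district': '朝阳区', 'street': '建国路',
#    'number': '93号', 'building': '万达广场A座', 'room': '1502室'}
```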
Our system handles both character sets with automatic detection:
Character Set Detection:
Simplified indicators: 国, 银, 发, 对, 业, 学
Traditional equivalents: 國, 銀, 發, 對, 業, 學
The system analyzes character frequency to determine:
- Purely Simplified text
- Purely Traditional text
- Mixed text (common in some contexts)
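The indicator-character idea can be sketched as a simple counter. This is an illustration only: the two sets contain just the six character pairs listed above, the function name `detect_script` is hypothetical, and a practical detector would rely on a full Simplified/Traditional mapping table and frequency thresholds.

```python
# Characters that exist only in one script, taken from the pairs above.
SIMPLIFIED_ONLY = set("国银发对业学")
TRADITIONAL_ONLY = set("國銀發對業學")


def detect_script(text: str) -> str:
    """Classify text as simplified, traditional, mixed, or undetermined."""
    simplified = sum(ch in SIMPLIFIED_ONLY for ch in text)
    traditional = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simplified and traditional:
        return "mixed"
    if simplified:
        return "simplified"
    if traditional:
        return "traditional"
    # Many characters (北, 京, 人, ...) are shared by both scripts.
    return "undetermined"


print(detect_script("中国银行"))  # → simplified
print(detect_script("中國銀行"))  # → traditional
print(detect_script("北京"))      # → undetermined
```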
Cross-Reference Processing: PII patterns learned in one character set apply to the other. A name pattern detected in Simplified text will also be recognized in Traditional form.
Regional Terminology: Beyond character differences, terminology varies:
Unlike alphabetic languages, Chinese lacks word boundaries:
Segmentation Challenges:
Input: 北京市长江大桥
Segmentation options:
- 北京市 / 长江大桥 (Beijing city / Yangtze River Bridge)
- 北京 / 市长 / 江大桥 (Beijing / mayor / Jiang Daqiao - a name)
Context determines correct segmentation
Our Approach: We use neural segmentation models trained on large Chinese corpora, combined with domain-specific dictionaries for PII-related vocabulary. This ensures accurate segmentation particularly around names, addresses, and identifiers.
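To make the ambiguity concrete, the sketch below enumerates every way to split the example string using a small word list. It is a dictionary-based illustration of the problem, not the neural segmentation described above; the vocabulary and the function name `segmentations` are hypothetical.

```python
from typing import List, Set


def segmentations(text: str, vocab: Set[str]) -> List[List[str]]:
    """Return every split of text whose pieces all appear in vocab."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocab:
            # Keep this word and segment whatever follows it.
            for rest in segmentations(text[end:], vocab):
                results.append([word] + rest)
    return results


VOCAB = {"北京", "北京市", "市长", "长江大桥", "江大桥"}
for option in segmentations("北京市长江大桥", VOCAB):
    print(" / ".join(option))
# → 北京 / 市长 / 江大桥
# → 北京市 / 长江大桥
```

Both splits are valid against the dictionary, so a segmenter needs context or learned statistics to prefer the second reading.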
Chinese financial documents contain specific identifiers:
Bank Account Numbers:
Mainland China: 16-19 digits
- Major banks have specific formats
- ICBC, CCB, ABC, BOC, etc.
Taiwan: Various formats by bank
Hong Kong: 9-12 digits typically
Tax Identification:
Mainland China:
- Individual: Uses ID card number
- Business: 统一社会信用代码 (Unified Social Credit Code, 18 characters)
Taiwan:
- Individual: 身分證統一編號 (national ID number)
- Business: 統一編號 (Unified Business Number, 8 digits)
Chinese-speaking regions have distinct privacy frameworks:
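Mainland China: Personal Information Protection Law (PIPL), Cybersecurity Law, and Data Security Law
Taiwan: Personal Data Protection Act (PDPA)
Hong Kong: Personal Data (Privacy) Ordinance (PDPO)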
Business documents often mix Chinese and English:
Example mixed text:
"请联系John Smith先生,电话:+86 138-1234-5678,
邮箱:john.smith@example.com,
地址:北京市朝阳区CBD核心区Building A, Suite 1502"
Detected PII:
- English name: John Smith
- Chinese phone: +86 138-1234-5678
- Email: john.smith@example.com
- Mixed address: 北京市朝阳区CBD核心区Building A, Suite 1502
Our system seamlessly processes mixed content, applying appropriate detection rules for each language while maintaining context across the document.
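As a rough illustration of applying per-language rules to one mixed string, the sketch below runs three independent patterns and merges the hits in document order. The detector table, the `detect_mixed` function, and the sample text with its example.com address are all hypothetical, and the name rule only fires when a Chinese honorific follows a capitalized Latin name.

```python
import re
from typing import List, Tuple

DETECTORS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Mainland mobile number with an optional +86 country code.
    "cn_phone": re.compile(r"(?:\+86[-\s]?)?1[3-9]\d[-\s]?\d{4}[-\s]?\d{4}"),
    # Capitalized Latin words directly followed by a Chinese honorific.
    "latin_name": re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+(?=先生|女士|小姐)"),
}


def detect_mixed(text: str) -> List[Tuple[str, str]]:
    """Return (type, value) pairs ordered by position in the text."""
    hits = []
    for kind, pattern in DETECTORS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), kind, match.group()))
    return [(kind, value) for _, kind, value in sorted(hits)]


sample = "请联系John Smith先生,电话:+86 138-1234-5678,邮箱:john.smith@example.com"
print(detect_mixed(sample))
# → [('latin_name', 'John Smith'), ('cn_phone', '+86 138-1234-5678'),
#    ('email', 'john.smith@example.com')]
```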
Specify Chinese language processing in API calls:
POST /v1/redact
{
  "text": "客户姓名:李明华,身份证号:110105199003071239",
  "language": "zh",
  "region": "CN",
  "redaction_types": ["name", "national_id", "phone", "address"]
}
Response:
{
  "redacted_text": "客户姓名:[NAME],身份证号:[NATIONAL_ID]",
  "detections": [
    {
      "type": "name",
      "value": "李明华",
      "script": "simplified",
      "confidence": 0.96
    },
    {
      "type": "national_id",
      "value": "110105199003071239",
      "format": "china_resident_id",
      "valid_checksum": true,
      "confidence": 0.99
    }
  ]
}
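On the client side, a response of this shape can be post-processed without writing raw PII to logs. The helper below is a sketch that assumes only the `detections`, `type`, `value`, and `confidence` fields shown in the example response; the function name `summarize_detections`, the 0.9 threshold, and the low-confidence sample entry are hypothetical.

```python
from typing import Dict, List


def summarize_detections(response: Dict, min_confidence: float = 0.9) -> List[Dict]:
    """Keep confident detections and report their length, not their value."""
    summary = []
    for detection in response.get("detections", []):
        if detection.get("confidence", 0.0) < min_confidence:
            continue
        summary.append({
            "type": detection["type"],
            "chars": len(detection.get("value", "")),
            "confidence": detection["confidence"],
        })
    return summary


sample = {
    "redacted_text": "客户姓名:[NAME]",
    "detections": [
        {"type": "name", "value": "李明华", "script": "simplified", "confidence": 0.96},
        # Hypothetical low-confidence hit, dropped by the threshold.
        {"type": "address", "value": "朝阳区", "confidence": 0.42},
    ],
}
print(summarize_detections(sample))
# → [{'type': 'name', 'chars': 3, 'confidence': 0.96}]
```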
RedactionAPI has transformed our document processing workflow. We've reduced manual redaction time by 95% while achieving better accuracy than our previous manual process.
The API integration was seamless. Within a week, we had automated redaction running across all our customer support channels, ensuring GDPR compliance effortlessly.
We process over 50,000 legal documents monthly. RedactionAPI handles it all with incredible accuracy and speed. It's become an essential part of our legal tech stack.
The multi-language support is outstanding. We operate in 30 countries and RedactionAPI handles all our documents regardless of language with consistent accuracy.
Trusted by 500+ enterprises worldwide





Yes, we fully support both Simplified Chinese (简体中文) used in Mainland China and Singapore, and Traditional Chinese (繁體中文) used in Taiwan, Hong Kong, and Macau. The system automatically detects which variant is being used and applies appropriate processing.
Chinese name detection uses a combination of surname dictionaries (covering common and rare surnames), given name pattern recognition, and contextual analysis. We handle two-character and three-character names, compound surnames (like 欧阳, 司马), and distinguish names from common words.
We detect China Resident Identity Card numbers (18-digit with region codes and checksum), Taiwan National ID numbers, Hong Kong Identity Card numbers, Macau BIR numbers, and Singapore NRIC/FIN for Chinese Singaporeans. Each format has specific validation rules.
Chinese addresses follow a hierarchical structure (省/市/区/街道/门牌号). Our parser recognizes this hierarchy, handling variations in formatting and abbreviations. We detect addresses written in standard format or conversational style.
Yes, we handle mixed language documents common in business contexts. English names, addresses, and identifiers within Chinese text are detected alongside Chinese PII. Code-switching between languages is handled seamlessly.
Our Chinese redaction supports compliance with China's Personal Information Protection Law (PIPL), Cybersecurity Law, and Data Security Law. We detect the PII categories defined in these regulations and can be configured for specific compliance requirements.