Body-only wrapper unwrapping made more conservative

hxy91819

Apr 9, 2026 · #63808 · docs-i18n: avoid ambiguous body-only wrapper unwrap

The docs-i18n translation pipeline now handles ambiguous body-only payloads more gracefully by deferring edge cases to validation instead of silently truncating translated content.

The edge case arises when the pipeline unwraps a body-only payload, that is, translated content wrapped in HTML body tags. Source documents that contain literal <body> or </body> tokens as content can confuse the unwrapping logic. Previously, the system stripped anything that looked like a body wrapper, even when the source itself documented body tokens, which could truncate legitimate translated content.

The unwrapping function now checks whether the source document contains body tag tokens before deciding whether to unwrap. When it detects body tokens in the source, it returns an ambiguous result and leaves the payload to the validation and retry logic instead of attempting aggressive stripping.
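The check can be sketched in Go, the language of the docs-i18n scripts. This is a minimal illustration under stated assumptions, not the code from the PR: the names unwrapBodyOnly and hasBodyToken, the exact-match token comparison, and the (string, bool) return shape are all hypothetical.

```go
package main

import "strings"

const (
	bodyOpen  = "<body>"
	bodyClose = "</body>"
)

// hasBodyToken reports whether s contains a literal <body> or </body> token.
// Hypothetical helper; the real detection rule may differ.
func hasBodyToken(s string) bool {
	return strings.Contains(s, bodyOpen) || strings.Contains(s, bodyClose)
}

// unwrapBodyOnly is a hypothetical sketch of conservative unwrapping.
// It returns the payload to use and whether the result is ambiguous.
//   - No wrapper present: payload returned unchanged, not ambiguous.
//   - Wrapper present and source documents body tokens: payload returned
//     unchanged and flagged ambiguous, so nothing is stripped.
//   - Wrapper present and source has no body tokens: wrapper stripped.
func unwrapBodyOnly(source, translated string) (string, bool) {
	trimmed := strings.TrimSpace(translated)
	wrapped := strings.HasPrefix(trimmed, bodyOpen) && strings.HasSuffix(trimmed, bodyClose)
	if !wrapped {
		return translated, false
	}
	if hasBodyToken(source) {
		// The outer tags might be real translated content; do not guess.
		return translated, true
	}
	inner := trimmed[len(bodyOpen) : len(trimmed)-len(bodyClose)]
	return strings.TrimSpace(inner), false
}
```

In this sketch, a source that never mentions body tags is unwrapped as before, while a source such as "Use <body> to wrap the page." causes the payload to be returned untouched with the ambiguous flag set.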

This change makes the translation pipeline more conservative in edge cases. Rather than guessing incorrectly and losing content, the system preserves ambiguous payloads for the retry mechanism to resolve. The behavior is particularly important for technical documentation that discusses HTML structure itself.

The fix sits in the docs-i18n scripts and applies to the chunked raw translation workflow, which processes documentation in segments and reconstructs complete translated output.

Original GitHub Description

Summary

- keep body-only wrapper unwrapping conservative when the source documents <body>/</body> tokens
- let validation/retry handle ambiguous body-only payloads instead of stripping and truncating content
- add regression coverage for ambiguous and non-ambiguous body-only wrapper cleanup

Testing

cd scripts/docs-i18n && go test ./...