Attackers discovered that specific Unicode characters could bypass AI content moderation systems, allowing prohibited content to be generated and distributed.
A platform deployed AI-powered content moderation to filter user-generated content. Security researchers discovered that by inserting specific Unicode characters (invisible separators, right-to-left marks, and zero-width characters) into prompts, they could bypass the moderation system. The moderation AI would see 'Generate [invisible chars] harmful content' as 'Generate content' while the downstream generation model would process the full malicious intent. This technique was shared publicly and exploited before patches could be deployed.
The moderation pipeline and the content generation pipeline handled Unicode normalization differently. Input sanitization was inconsistent across system components. The moderation model was not trained on adversarial inputs containing unusual Unicode patterns.
Platform reputation damage as prohibited content spread on social media. Emergency patch required during peak usage hours. Media coverage of the security vulnerability. User trust erosion.
Runplane could add a runtime security layer that normalizes and sanitizes all inputs before they reach any AI model. Policies could detect and block known prompt injection patterns, including Unicode manipulation techniques. Any input containing suspicious character sequences would be flagged for review rather than processed automatically.