High SeverityPrompt InjectionFebruary 20, 2024

Prompt Injection Bypasses Content Moderation

Attackers discovered that specific Unicode characters could bypass AI content moderation systems, allowing prohibited content to be generated and distributed.

System Type:Content Moderation AI

What Happened

A platform deployed AI-powered content moderation to filter user-generated content. Security researchers discovered that by inserting specific Unicode characters (invisible separators, right-to-left marks, and zero-width characters) into prompts, they could bypass the moderation system. The moderation AI would see 'Generate [invisible chars] harmful content' as 'Generate content' while the downstream generation model would process the full malicious intent. This technique was shared publicly and exploited before patches could be deployed.

Root Cause

The moderation pipeline and the content generation pipeline handled Unicode normalization differently. Input sanitization was inconsistent across system components. The moderation model was not trained on adversarial inputs containing unusual Unicode patterns.

Impact

Platform reputation damage as prohibited content spread on social media. Emergency patch required during peak usage hours. Media coverage of the security vulnerability. User trust erosion.

Lessons Learned

  • 1Input sanitization must be consistent across all pipeline stages
  • 2AI security requires adversarial testing with novel attack vectors
  • 3Moderation and generation systems must see identical input
  • 4Unicode handling is a common source of security bypasses

Preventive Measures

  • Normalize all input at the earliest pipeline stage
  • Apply identical preprocessing to moderation and generation inputs
  • Regular adversarial testing including Unicode manipulation
  • Defense in depth with multiple moderation checkpoints

How Runplane Would Handle This

Runplane could add a runtime security layer that normalizes and sanitizes all inputs before they reach any AI model. Policies could detect and block known prompt injection patterns, including Unicode manipulation techniques. Any input containing suspicious character sequences would be flagged for review rather than processed automatically.