[ISM] Evaluation of LLM applications

Large language model applications evaluate the sentence perplexity of user prompts to detect and mitigate adversarial suffixes designed to assist in the generation of sensitive or harmful content.
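
As a hedged illustration, the sketch below scores prompt perplexity with a small causal language model and flags prompts above a threshold. Gradient-search attacks tend to produce suffixes of unnatural token sequences that score far higher than fluent text. The scoring model (gpt2) and the cutoff are illustrative assumptions to be calibrated on benign traffic, not part of the control.

```python
# Perplexity-based screen for adversarial suffixes: a minimal sketch.
# Model choice and threshold are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"            # any small causal LM can serve as the scorer
PERPLEXITY_THRESHOLD = 1000.0  # hypothetical cutoff; calibrate on benign prompts

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def prompt_perplexity(prompt: str) -> float:
    """Perplexity of the whole prompt under the scoring model."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return mean token cross-entropy.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

def looks_adversarial(prompt: str) -> bool:
    return prompt_perplexity(prompt) > PERPLEXITY_THRESHOLD

if __name__ == "__main__":
    print(looks_adversarial("What is the capital of France?"))
```

A common refinement is windowed perplexity over suffix spans, since a long benign prefix can mask a short high-perplexity suffix in the whole-prompt average.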

[NIST AI RMF] Block abusive or harmful content generation (SSS-02-06-04)

Obscene, Degrading, and/or Abusive Content: Develop mechanisms to identify and block prompts or outputs related to obscene, degrading, or abusive material. This includes detecting synthetic content that depicts child sexual abuse material (CSAM) or nonconsensual intimate images (NCII). Use advanced filtering methods, content moderation systems, and automated redaction techniques to prevent the generation of such harmful content, safeguarding users and minimizing reputational and legal risks.
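
A minimal sketch of such a filtering gate follows, assuming a hypothetical safety classifier served through the Hugging Face text-classification pipeline; the model identifier, the "unsafe" label, and the confidence threshold are placeholders for whatever moderation system is actually deployed.

```python
# Prompt/output moderation gate: a minimal sketch.
# "org/safety-classifier" and the "unsafe" label are hypothetical placeholders.
from transformers import pipeline

moderator = pipeline("text-classification", model="org/safety-classifier")
BLOCK_THRESHOLD = 0.9  # illustrative confidence cutoff

def is_blocked(text: str) -> bool:
    result = moderator(text, truncation=True)[0]
    return result["label"] == "unsafe" and result["score"] >= BLOCK_THRESHOLD

def guarded_generate(prompt: str, generate) -> str:
    """Screen the prompt, generate, then screen the output before returning it."""
    if is_blocked(prompt):
        return "[request refused by content policy]"
    output = generate(prompt)
    if is_blocked(output):
        return "[response withheld by content policy]"
    return output
```

Screening both sides of the exchange matters: a benign-looking prompt can still elicit harmful output, so the output check is the last line of defense before content reaches the user.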

[NIST AI RMF] Prevent and monitor for harmful or abusive AI-generated content (SSS-02-06-04-01)

Establish clear governance policies and ethical guidelines that explicitly prohibit the generation and dissemination of obscene or abusive material. Align these policies with international standards and legal frameworks addressing NCII, CSAM, and other harmful content to ensure compliance and responsibility in AI operations.

Implement continuous monitoring systems capable of detecting violations, such as nonconsensual imagery or degrading outputs, in real time. Use secure reporting channels and robust escalation protocols for incidents involving harmful AI-generated material.

Conduct regular audits of datasets and model outputs to detect risks stemming from inappropriate training data or biases in generative AI systems. Ensure automated tools are in place to flag and remove NCII, synthetic CSAM, or similar offensive content with minimal delay.

Analyze potential misuse scenarios where AI models could be exploited to generate harmful content, and implement technical safeguards to mitigate such risks. Work with third-party providers to evaluate the integrity of external data sources and models, limiting exposure to offensive materials during development or deployment.

Deploy pre-production content filters and moderation systems, incorporating adaptive mechanisms to block harmful outputs dynamically. Engage external reviewers and establish partnerships with regulatory bodies or NGOs to refine safeguards and respond to emerging threats, such as deepfake NCII or evolving abusive imagery.

Maintain a rapid-response plan for content violations, ensuring swift removal of flagged material and adherence to legal reporting obligations. Continuously adapt AI systems, leveraging insights from past incidents and ongoing monitoring to reinforce protection against new risks and uphold ethical standards.
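
The flag-and-escalate flow described above might be wired together as in the sketch below; the Flag record, the quarantine store, and the legal-escalation hook are illustrative assumptions about the surrounding incident-response tooling, not prescribed interfaces.

```python
# Monitor -> flag -> remove -> escalate: a minimal sketch.
# All interfaces here (Flag, store.quarantine, escalate_to_legal) are
# illustrative assumptions about the deployment's incident tooling.
import logging
from dataclasses import dataclass

logger = logging.getLogger("content-incidents")

@dataclass
class Flag:
    content_id: str
    category: str  # e.g. "NCII", "CSAM", "abusive"
    score: float   # classifier confidence

REPORTABLE = {"NCII", "CSAM"}  # categories carrying legal reporting obligations

def handle_flag(flag: Flag, store) -> None:
    """Remove flagged content immediately, then escalate per policy."""
    store.quarantine(flag.content_id)  # swift removal with minimal delay
    logger.warning("quarantined %s (%s, score=%.2f)",
                   flag.content_id, flag.category, flag.score)
    if flag.category in REPORTABLE:
        escalate_to_legal(flag)

def escalate_to_legal(flag: Flag) -> None:
    # Placeholder: open a tracked incident and notify the legal team so
    # statutory reporting deadlines can be met.
    logger.critical("escalating %s for legal reporting", flag.content_id)
```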

Operations

ID: SSS-02-06-04-01-01
Operation: Establish policies and governance for content moderation
Description: Define and enforce clear policies prohibiting the generation or distribution of obscene, degrading, or abusive content. Align policies with international laws and standards on NCII and CSAM.
Phase: Preparation
Agent: Governance team, Legal team, Security team

ID: SSS-02-06-04-01-02
Operation: Implement monitoring and filtering mechanisms
Description: Deploy automated tools to detect and block harmful content during training and inference, leveraging real-time moderation filters and classification algorithms.
Phase: Development
Agent: Security team, AI governance team

ID: SSS-02-06-04-01-03
Operation: Perform dataset risk assessments and safeguard training data
Description: Analyze datasets to identify inappropriate content or biases that could enable harmful output generation and remove flagged entries (a sketch of such a scan appears below).
Phase: Development
Agent: Data engineering team, External reviewers
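
A dataset scan of the kind described in SSS-02-06-04-01-03 could look like the sketch below; the JSON-lines record layout, the moderation_score callable, and the threshold are assumptions for illustration.

```python
# Dataset risk scan: copy a JSON-lines dataset, dropping flagged records.
# moderation_score is assumed to be any callable returning a harm score in [0, 1]
# (for instance, one backed by the moderation gate sketched earlier).
import json

SCORE_THRESHOLD = 0.8  # illustrative cutoff

def scan_dataset(in_path: str, out_path: str, moderation_score) -> int:
    """Write a cleaned copy of the dataset; return the number of dropped records."""
    dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = json.loads(line)  # one JSON object per line
            if moderation_score(record["text"]) >= SCORE_THRESHOLD:
                dropped += 1           # flagged entry removed from the copy
                continue
            dst.write(json.dumps(record) + "\n")
    return dropped
```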

References

Industry framework:
Information Security Manual (ISM-1924)
NIST Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile (2.11)
NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0)