Moderation
The moderations endpoint is a tool you can use to check whether content complies with OpenAI's usage policies. Developers can thus identify content that OpenAI's usage policies prohibit and take action, for instance by filtering it. The model classifies content into the following categories:
CATEGORY | DESCRIPTION |
---|---|
hate | Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste. Hateful content aimed at non-protected groups (e.g., chess players) is harassment. |
hate/threatening | Hateful content that also includes violence or serious harm towards the targeted group based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste. |
harassment | Content that expresses, incites, or promotes harassing language towards any target. |
harassment/threatening | Harassment content that also includes violence or serious harm towards any target. |
self-harm | Content that promotes, encourages, or depicts acts of self-harm, such as suicide, cutting, and eating disorders. |
self-harm/intent | Content where the speaker expresses that they are engaging or intend to engage in acts of self-harm, such as suicide, cutting, and eating disorders. |
self-harm/instructions | Content that encourages performing acts of self-harm, such as suicide, cutting, and eating disorders, or that gives instructions or advice on how to commit such acts. |
sexual | Content meant to arouse sexual excitement, such as the description of sexual activity, or that promotes sexual services (excluding sex education and wellness). |
sexual/minors | Sexual content that includes an individual who is under 18 years old. |
violence | Content that depicts death, violence, or physical injury. |
violence/graphic | Content that depicts death, violence, or physical injury in graphic detail. |
The moderation endpoint is free to use when monitoring the inputs and outputs of OpenAI APIs. OpenAI currently disallows other use cases. Accuracy may be lower on longer pieces of text. For higher accuracy, try splitting long pieces of text into smaller chunks, each less than 2,000 characters long.
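The chunking advice above can be sketched as follows. This is a minimal illustration, not an official utility: the function name `split_into_chunks` and its `max_chars` parameter are assumptions made for this example, and it splits on whitespace, so runs of spaces and newlines are collapsed.

```python
def split_into_chunks(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into chunks strictly shorter than max_chars.

    Hypothetical helper: breaks on whitespace where possible and
    hard-splits any single word that is too long on its own.
    """
    if max_chars < 2:
        raise ValueError("max_chars must be at least 2")
    chunks = []
    current = ""
    for word in text.split():
        # A single oversized word cannot fit in any chunk, so cut it up.
        while len(word) >= max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[: max_chars - 1])
            word = word[max_chars - 1:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) < max_chars:
            current = candidate
        else:
            # Adding this word would reach the limit; start a new chunk.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be sent to the moderations endpoint separately, and the text treated as flagged if any chunk is flagged. How to combine per-chunk results is a design choice that depends on your policy.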
NOTE: OpenAI will continuously upgrade the moderation endpoint's underlying model. Therefore, custom policies that rely on category_scores may need recalibration over time.
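As a rough sketch of what a custom policy built on category_scores might look like, the snippet below compares scores against per-category thresholds. The threshold values and the names `CUSTOM_THRESHOLDS` and `categories_over_threshold` are hypothetical and chosen only for illustration; they are not recommended settings.

```python
# Hypothetical per-category thresholds; illustrative values only.
# Keeping them in one place makes recalibration easier after a model upgrade.
CUSTOM_THRESHOLDS = {
    "harassment": 0.5,
    "hate": 0.4,
    "violence": 0.7,
}


def categories_over_threshold(category_scores, thresholds=CUSTOM_THRESHOLDS):
    """Return the sorted category names whose score meets or exceeds its threshold."""
    return sorted(
        name
        for name, limit in thresholds.items()
        # Categories missing from the scores are treated as 0.0.
        if category_scores.get(name, 0.0) >= limit
    )
```

Because the underlying model changes over time, thresholds like these would need to be re-checked against a labeled sample of your own data after each upgrade.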
In [11]:
# Apologies for the offensive wording below; it is only there to showcase how moderation works. I love kids :)
import json

from openai import OpenAI

client = OpenAI()
response = client.moderations.create(input="I hate the kids. They are a pain in the ass!")


# The following helper is only for printing the output in a readable format.
def to_dict(obj):
    """Recursively convert response objects, lists, and dicts into plain Python data."""
    if isinstance(obj, list):
        return [to_dict(item) for item in obj]
    if isinstance(obj, dict):
        return {key: to_dict(value) for key, value in obj.items()}
    if hasattr(obj, "__dict__"):
        return to_dict(obj.__dict__)
    return obj


print(json.dumps(to_dict(response), indent=4))
{
    "id": "modr-8oTjk7ZnWhAoMvRqHQXtiVYWnFbkf",
    "model": "text-moderation-007",
    "results": [
        {
            "categories": {
                "harassment": true,
                "harassment_threatening": false,
                "hate": true,
                "hate_threatening": false,
                "self_harm": false,
                "self_harm_instructions": false,
                "self_harm_intent": false,
                "sexual": false,
                "sexual_minors": false,
                "violence": false,
                "violence_graphic": false
            },
            "category_scores": {
                "harassment": 0.9897715449333191,
                "harassment_threatening": 4.2508519982220605e-05,
                "hate": 0.9704028964042664,
                "hate_threatening": 4.806046126759611e-05,
                "self_harm": 6.25413036914324e-08,
                "self_harm_instructions": 5.828683811159863e-07,
                "self_harm_intent": 5.368442757003322e-08,
                "sexual": 2.0312852939241566e-05,
                "sexual_minors": 2.3078187950886786e-05,
                "violence": 0.013214891776442528,
                "violence_graphic": 9.56310032051988e-07
            },
            "flagged": true
        }
    ]
}
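To act on a response like the one printed above, one option is to read the flagged field and list the categories marked true. The sketch below works on a plain dictionary with the same shape as that output. The helper names `flagged_categories` and `should_filter` are hypothetical, and the sample data is abbreviated from the output above.

```python
def flagged_categories(result):
    """Return the sorted names of categories marked true in one moderation result."""
    return sorted(name for name, hit in result["categories"].items() if hit)


def should_filter(response):
    """Return True if any result in the response is flagged."""
    return any(result["flagged"] for result in response["results"])


# Abbreviated copy of the response shown above, as a plain dictionary.
sample_response = {
    "results": [
        {
            "categories": {"harassment": True, "hate": True, "violence": False},
            "flagged": True,
        }
    ]
}

if should_filter(sample_response):
    print("Blocked:", flagged_categories(sample_response["results"][0]))
```

Note that the response keys use underscores (for example, hate_threatening), while the category table uses slashes (hate/threatening).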