Content Moderation Dataset FAQs
Guidelines for custom labels
The AI labels we provide are made possible by the solid research of the giants upon whose shoulders we stand:
- The undesired content labels are defined and annotated by OpenAI, as documented in their paper A Holistic Approach to Undesired Content Detection in the Real World.
- The human-ChatGPT comparison Q&A corpus was started only 10 days after the launch of ChatGPT, with the aim of facilitating research and creating open-source models for detecting ChatGPT-generated content.
- Cardiff NLP made a comprehensive contribution through SuperTweetEval, which is itself backed by multiple SemEval datasets.
- SMS Spam Collection Dataset.
Our initial aim is to conveniently package the labels that cover the largest number of standard use-cases. With that said, if the current list of labels leaves you wanting, know that we are working towards supporting you in building your own labels and training models whose predictions cater to your specific needs.
One recurrent question is: how many instances do I need to label before being able to train a model with decent accuracy?
- The answer depends on the complexity of the task, the quality and diversity of the instances, and the quality of the ground-truth labels. Furthermore, building the dataset does not have to be independent from training: you can iteratively train new versions of your models as the number of labeled instances grows. Sometimes this enables a positive loop where the current version of the model makes predictions for you to correct instead of labeling from scratch (see the sketch that follows the figures below). The downside is that early bias might induce further bias.
- Here are a few figures that might help you make a rough estimate of the quantity required:
- Approaching emotion detection as a multi-label task, CardiffNLP achieved an average macro-F1 score of 55.27 by fine-tuning and evaluating a RoBERTa-base model on the following dataset splits:
- Training Instances: 6,838
- Evaluation Instances: 886
- Testing Instances: 3,259
- An earlier attempt to simplify emotion detection into a multi-class task resulted in higher average macro-F1 scores for the same RoBERTa-base model trained on roughly the same number of instances:
- 73.1±1.7 (74.9) on the evaluation set.
- 76.1±0.5 (76.6) on the test set.
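To make the iterative loop mentioned above concrete, here is a minimal sketch assuming a scikit-learn text classifier. The seed data, the batch size, and the review_predictions() correction step are placeholders for your own data and review tooling, not a prescribed setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def review_predictions(texts, suggested_labels):
    """Placeholder: show each suggestion to a human reviewer and return the corrected labels."""
    return list(suggested_labels)  # replace with your actual review tooling

# Seed set: instances you have already labeled by hand.
labeled_texts = ["great product, thanks", "you are an idiot", "meh, nothing special"]
labeled_targets = ["ok", "insult", "ok"]

# Pool of instances still waiting for labels, processed in small batches.
unlabeled_pool = ["what a lovely day", "nobody asked you, moron", "see you tomorrow"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

BATCH_SIZE = 2
while unlabeled_pool:
    # Retrain on everything labeled so far.
    model.fit(labeled_texts, labeled_targets)

    # Pre-annotate the next batch so reviewers correct suggestions
    # instead of labeling from scratch.
    batch, unlabeled_pool = unlabeled_pool[:BATCH_SIZE], unlabeled_pool[BATCH_SIZE:]
    corrected = review_predictions(batch, model.predict(batch))

    # Careful review matters here: accepting suggestions blindly lets
    # early model bias propagate into later model versions.
    labeled_texts.extend(batch)
    labeled_targets.extend(corrected)
```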
Another recurrent question is: how do you ensure ground-truth quality for the inherently subjective task of interpreting social posts?
- CardiffNLP's approach was to include in their labeling instructions the following rules:
- The same person may only label a given instance once; redundant submissions are discarded.
- For someone's entries to be accepted, they must correctly label all 3 of the randomly sampled hidden tests mixed into the instances they are presented with.
- While empty submissions are not allowed and, for multi-label tasks, each instance must be annotated with at least one label, anyone participating in the labeling is also encouraged to share feedback about any instance they are uncertain about.
- The Six Attributes of Unhealthy Conversations work by Google Research, University of Oxford, University of South Carolina, and others, established more intricate guidelines to ensure the quality of their labels:
- With regard to the subjective nature of their labels, they address it as follows:
The inherent subtlety, subjectivity, and frequent ambiguity of the attributes covered in this dataset make crowdsourcing quality attribute labels an unavoidably difficult process. Typically the goal in an annotation task would simply be to maximise agreement between the multiple annotators of each comment. However, when the annotation task is inherently subjective and meaningful difference of opinion is itself valuable data, the goal becomes instead to maximise common understanding of the task across annotators. This entails tailoring the phrasing of the questions put to annotators, so as to create as common an understanding as possible of what each question is really asking. This way, disagreement between annotators reflected in the dataset will represent different reasonable readings of the same comment which are themselves important to capture.
- With regard to incentivizing correctness from the annotators, they define an approach that relies on establishing "trustworthiness scores" (see the sketch that follows the quote):
Our primary quality control mechanism was to collate a set of ‘test comments’, for which we had manually established the correct answers. Annotators encountered one test comment per batch of seven comments they reviewed, without knowing which of the seven was the test comment, and their running accuracy on these test comments was defined as their ‘trustworthiness score’. The task required that annotators maintain a trustworthiness score of more than 78%. If an annotator dropped below this level, they were removed from the annotator pool for this task, and all of their prior annotations were discarded. The removed ‘bad’ annotator judgements were replaced by newly collected trusted judgements as necessary.
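Here is a minimal sketch of the trustworthiness-score bookkeeping the quote describes. The data structures, function names, and the exact point at which the 78% threshold is checked are assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict

TRUST_THRESHOLD = 0.78  # from the quoted requirement of more than 78%

# annotator_id -> outcomes (True/False) on the hidden test comments they encountered
test_outcomes = defaultdict(list)
# annotator_id -> the (comment_id, labels) judgements collected from them so far
judgements = defaultdict(list)

def record_test_outcome(annotator_id, correct):
    """Update an annotator's running accuracy on hidden test comments."""
    test_outcomes[annotator_id].append(bool(correct))

def trustworthiness(annotator_id):
    outcomes = test_outcomes[annotator_id]
    return sum(outcomes) / len(outcomes) if outcomes else 1.0

def prune_untrusted():
    """Remove annotators whose score dropped below the threshold and discard
    their prior judgements, which must then be re-collected from trusted annotators."""
    to_recollect = []
    for annotator_id in list(judgements):
        if trustworthiness(annotator_id) < TRUST_THRESHOLD:
            to_recollect.extend(judgements.pop(annotator_id))
            test_outcomes.pop(annotator_id, None)
    return to_recollect  # re-queue these comments for new trusted judgements
```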
- Be prepared to work with limitations and constraints for which no workaround seems plausible:
- Social posts contain transliterations, abbreviations, interjections, language for which no tokenizer has the right vocabulary, small talk that provides little to no context from which to infer remotely accurate predictions, terms and expressions that rise in popularity as quickly as they are forgotten, and, overall, human written expression in its most deliberate, raw, and unrestricted form.
- Aim to reflect the size range of your target inputs (a quick way to compare length distributions is sketched after this list):
- 19.6% of instances in the emotion dataset used by CardiffNLP fall between 118 and 134 characters. This is the largest bucket, followed by 13.8% of instances ranging between 134 and 150 characters. In contrast, the largest proportion of posts on Farcaster is less than 20 characters long.
- The Human ChatGPT Comparison Corpus (HC3) from Hello-SimpleAI strays the furthest in terms of character count. Clearly, neither the human answers nor the ChatGPT ones were prompted for brevity. In fact, the sustained verbose eloquence is a strong feature indicative of LLM-generated content, which adds complexity to detecting it within often short social posts.
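As referenced above, here is a minimal sketch for checking how well a candidate training set reflects the size range of the posts you actually need to moderate, by comparing character-length percentiles. The two text lists are placeholders for your own data sources.

```python
import numpy as np

# Placeholders: swap in your candidate training set and a sample of the posts
# from your own platform.
training_texts = ["an example labeled training instance that runs fairly long ..."]
target_texts = ["gm", "short cast", "a slightly longer post from your own platform"]

def length_profile(texts, percentiles=(10, 25, 50, 75, 90)):
    """Character-length percentiles of a collection of texts."""
    lengths = [len(t) for t in texts]
    return dict(zip(percentiles, np.percentile(lengths, percentiles)))

print("training set:", length_profile(training_texts))
print("target posts:", length_profile(target_texts))
# A large gap between the two profiles (e.g. mostly 118-134 character instances
# versus mostly sub-20 character casts) suggests the training data may not
# transfer well to your inputs.
```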