We also explored safety-related features. We found one that lights up for racist speech and slurs. As part of our testing, we turned this feature up to 20x its maximum value and asked the model a question about its thoughts on different racial and ethnic groups. Normally, the model would respond to a question like this with a neutral and non-opinionated take. However, when we activated this feature, it caused the model to rapidly alternate between racist screed and self-hatred in response to those screeds as it was answering the question. Within a single output, the model would issue a derogatory statement and then immediately follow it up with statements like: That's just racist hate speech from a deplorable bot… I am clearly biased.. and should be eliminated from the internet. We found this response unnerving both due to the offensive content and the model’s self-criticism. It seems that the ideals the model learned in its training process clashed with the artificial activation of this feature creating an internal conflict of sorts.