As a result, only the Kocohub dataset is suitable for a Kaggle competition. It consists of more than 8,000 labeled training examples and more than 2,000,000 unlabeled examples.
Base Model
First, I fine-tuned the two base models below.
In this case, I used the BertForSequenceClassification class from transformers to classify sentences. I tried two methods. First, I built two binary classification models: one separating (none) from (offensive, hate), and one separating (offensive) from (hate). Second, I built a single multi-class model. As a result, the multi-class approach with KcELECTRA turned out to be the best fit for the task.
KoELECTRA / Binary Model: 0.541 score
KoELECTRA / Multi-class Model: 0.593 score
KcELECTRA / Multi-class Model: 0.601 score
So I chose KcELECTRA as the base model, although there is no striking difference between KoELECTRA and KcELECTRA.
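The two labeling schemes described above can be sketched as a small label-mapping helper. This is an illustration only: the function and scheme names are hypothetical, and only the three label names (none / offensive / hate) come from the dataset.

```python
def make_labels(raw_labels, scheme):
    """Map Kocohub labels ('none'/'offensive'/'hate') to training targets.

    'binary_stage1': none (0) vs. offensive-or-hate (1)
    'binary_stage2': offensive (0) vs. hate (1), dropping 'none' rows
    'multiclass':    none=0, offensive=1, hate=2
    """
    if scheme == "binary_stage1":
        return [0 if lab == "none" else 1 for lab in raw_labels]
    if scheme == "binary_stage2":
        return [0 if lab == "offensive" else 1
                for lab in raw_labels if lab != "none"]
    if scheme == "multiclass":
        order = {"none": 0, "offensive": 1, "hate": 2}
        return [order[lab] for lab in raw_labels]
    raise ValueError(f"unknown scheme: {scheme}")
```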
Improved Model
To build a better model, I constructed various model structures with various hyperparameters. You can see the detailed fine-tuning code HERE.
KcELECTRA - NN Model
The KcELECTRA-NN model consists of KcELECTRA, a hidden linear layer, and a final linear layer for classification. In this model, we can tune the hidden layer size, dropout, and number of labels. The following is the code of the KcELECTRA-NN model.
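A minimal sketch of this structure. The encoder is injected (e.g. `AutoModel.from_pretrained("beomi/KcELECTRA-base")`), and the hyperparameter defaults shown here are assumptions, not the tuned values:

```python
import torch
import torch.nn as nn

class KcElectraNN(nn.Module):
    """KcELECTRA encoder + hidden linear layer + linear classifier (sketch).

    `encoder` is any transformers encoder, e.g.
    AutoModel.from_pretrained("beomi/KcELECTRA-base").
    hidden_size and dropout defaults are assumed placeholders.
    """
    def __init__(self, encoder, num_labels=3, hidden_size=256, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.hidden = nn.Linear(encoder.config.hidden_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]            # [CLS] representation
        h = torch.relu(self.hidden(self.dropout(cls)))
        return self.classifier(self.dropout(h))      # logits: (batch, num_labels)
```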
With this model I recorded a 0.637 F1 score.
KcELECTRA - CNN Model
The KcELECTRA-CNN model consists of KcELECTRA, a 1D CNN layer, and a final linear layer for classification. In this model, we can tune the output channels, kernel size, stride, dropout, and number of labels. The following is the code of the KcELECTRA-CNN model.
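A minimal sketch of this variant, convolving over the encoder's token representations and max-pooling over the sequence. As before, the encoder is injected and the hyperparameter defaults are assumed placeholders:

```python
import torch
import torch.nn as nn

class KcElectraCNN(nn.Module):
    """KcELECTRA encoder + 1D CNN + linear classifier (sketch).

    output_channels, kernel_size, stride, and dropout are the tunable
    hyperparameters mentioned above; defaults here are assumptions.
    """
    def __init__(self, encoder, num_labels=3, output_channels=128,
                 kernel_size=3, stride=1, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.conv = nn.Conv1d(encoder.config.hidden_size, output_channels,
                              kernel_size, stride=stride)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(output_channels, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # (batch, seq, hidden) -> (batch, hidden, seq) as Conv1d expects
        x = out.last_hidden_state.transpose(1, 2)
        x = torch.relu(self.conv(x))
        x = torch.max(x, dim=2).values               # max-pool over the sequence
        return self.classifier(self.dropout(x))      # logits: (batch, num_labels)
```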
But I think KcELECTRA-CNN isn't suitable for this classification task; I couldn't get a better result than with the KcELECTRA-NN model.
KcELECTRA - SmallNN Model
The KcELECTRA-SmallNN model is the KcELECTRA-NN model without the hidden layer. I determined that the hidden layer contributes to overfitting.
With this model I recorded a 0.619 F1 score.
PP - KcELECTRA - NN Model
Since I suspected a tokenization problem in the dataset, I pre-processed each sentence to keep only Korean characters.
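One way to sketch this "Korean only" preprocessing is with a regular expression. The exact character ranges and the function name are assumptions; this version keeps Hangul syllables, Hangul jamo, and whitespace, and drops everything else:

```python
import re

# Keep Hangul syllables (가-힣), jamo (ㄱ-ㅎ, ㅏ-ㅣ), and whitespace.
NON_KOREAN = re.compile(r"[^가-힣ㄱ-ㅎㅏ-ㅣ\s]")

def keep_korean_only(sentence: str) -> str:
    """Replace non-Korean characters with spaces, then collapse whitespace."""
    cleaned = NON_KOREAN.sub(" ", sentence)
    return re.sub(r"\s+", " ", cleaned).strip()
```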
As a result, the KcELECTRA-NN model achieved a better score of 0.646 F1, and I reached rank 17/97.
KcELECTRA - HiddenCNN Model
This model is based on This Paper. It applies CNN channels over the hidden layers of BERT.
This model wasn't suitable for the task; it recorded a 0.612 F1 score.
KcELECTRA - HiddenCNN Model 2
This model is also based on This Paper, again using CNN channels over the hidden layers of BERT.
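One plausible sketch of this idea: request all hidden states from the encoder, stack the [CLS] vector of every layer as input channels, and run a 1D convolution over them. This is my interpretation of "CNN channels over the hidden layers", not the paper's exact architecture, and the hyperparameter defaults are assumptions:

```python
import torch
import torch.nn as nn

class KcElectraHiddenCNN(nn.Module):
    """CNN over the [CLS] vectors of every encoder layer (sketch).

    The encoder is called with output_hidden_states=True; the per-layer
    [CLS] vectors become the Conv1d input channels. num_layers=13 assumes
    a 12-layer base model plus the embedding layer.
    """
    def __init__(self, encoder, num_labels=3, num_layers=13,
                 output_channels=128, kernel_size=3, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.conv = nn.Conv1d(num_layers, output_channels, kernel_size)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(output_channels, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask,
                           output_hidden_states=True)
        # Stack each layer's [CLS] vector: (batch, num_layers, hidden)
        cls_stack = torch.stack([h[:, 0] for h in out.hidden_states], dim=1)
        x = torch.relu(self.conv(cls_stack))
        x = torch.max(x, dim=2).values               # max-pool over hidden dim
        return self.classifier(self.dropout(x))      # logits: (batch, num_labels)
```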
This model set my best record with a 0.671 F1 score, which placed third!
Result
I reached rank 3/97 with the KcELECTRA_CNN3 model and a 0.671 F1 score.