Setting a Record for Kaggle Hate Speech Detection
Date: 2023.07.11 ~ 2023.07.26
Writer: 9tailwolf
Source Code : HERE
Reference
Introduction
My first plan for making a Hate Speech Detection Model was setting a record in the Korean Hate Speech Detection Kaggle Competition.
The bigger the data, the better, so I collected various hate expression data. Below is a list of Korean hate speech datasets.
In the end, only the Kocohub dataset was suitable for the Kaggle Competition. It consists of more than 8,000 labeled training examples and more than 2,000,000 unlabeled examples.
Base Model
First, I fine-tuned two base models, KoELECTRA and KcELECTRA.
In both cases, I used BertForSequenceClassification from the transformers library to classify sentences. I tried two methods. First, I built a binary classification model that separates (none / offensive, hate) and then (offensive / hate). Second, I built a multi-class classification model. As a result, the multi-class model with KcELECTRA was the most suitable for the task.
- KoELECTRA / Binary Model : 0.541 Score
- KoELECTRA / Multi-class Model : 0.593 Score
- KcELECTRA / Multi-class Model : 0.601 Score
So I chose KcELECTRA as the base model, although I think there is no striking difference between KoELECTRA and KcELECTRA.
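The two-stage binary method described above can be sketched as follows. This is only an illustration of how two binary predictions could be merged into one of the three labels; the function name `combine_two_stage` and the 0/1 encoding of each stage are assumptions, not taken from the original source code.

```python
from typing import List


def combine_two_stage(stage1: List[int], stage2: List[int]) -> List[str]:
    """Merge two binary predictions into one of three labels.

    Assumed encoding (hypothetical):
      stage1[i]: 0 = none, 1 = offensive or hate
      stage2[i]: 0 = offensive, 1 = hate
    """
    if len(stage1) != len(stage2):
        raise ValueError("both stages must predict the same number of sentences")
    labels = []
    for s1, s2 in zip(stage1, stage2):
        if s1 == 0:
            # The first model already decided the sentence is clean,
            # so the second model's prediction is ignored.
            labels.append("none")
        else:
            labels.append("hate" if s2 == 1 else "offensive")
    return labels


print(combine_two_stage([0, 1, 1], [1, 0, 1]))
```

One consequence of this design is that an error in the first stage cannot be corrected by the second stage, which may be one reason the single multi-class model scored higher here.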
Improved Model
To make a better model, I built several model structures and tried various hyperparameters. You can see the detailed fine-tuning code HERE.
KcELECTRA - NN Model
The KcELECTRA-NN model consists of KcELECTRA, a hidden linear layer, and a linear layer for classification. In this model, we can tune the hidden layer size (linear), the dropout rate, and the number of labels. The following is the code of the KcELECTRA-NN model.
# Imports shared by the model definitions in this post
import torch
import torch.nn as nn
from transformers import AutoModel

class KcELECTRA_NN(nn.Module):
    def __init__(self, linear, dropout, labels):
        super(KcELECTRA_NN, self).__init__()
        self.linear, self.dropout, self.labels = linear, dropout, labels
        self.KcELECTRA = AutoModel.from_pretrained("beomi/KcELECTRA-base")
        self.Linear1 = nn.Linear(768, linear)  # hidden layer on top of the [CLS] vector
        self.Relu = nn.ReLU()
        if dropout != 0:
            self.Dropout = nn.Dropout(dropout)
        self.Linear2 = nn.Linear(linear, labels)  # classification layer

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        # [0] is the last hidden state; [:, 0] selects the first ([CLS]) token
        output = self.KcELECTRA(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[0][:, 0]
        l1 = self.Linear1(output)
        act = self.Relu(l1)
        if self.dropout != 0:
            act = self.Dropout(act)
        l2 = self.Linear2(act)
        # Return raw logits, one score per label
        return l2
As a result, I set a record with a 0.637 F1 score.
KcELECTRA - CNN Model
The KcELECTRA-CNN model consists of KcELECTRA, a 1D CNN layer, and a linear layer for classification. In this model, we can tune output_channel, kernal, stride, dropout, and labels. The following is the code of the KcELECTRA-CNN model.
class KcELECTRA_CNN(nn.Module):
    def __init__(self, output_channel, kernal, stride, dropout, labels):
        super(KcELECTRA_CNN, self).__init__()
        self.dropout = dropout != 0  # apply dropout only when a non-zero rate is given
        self.KcELECTRA = AutoModel.from_pretrained("beomi/KcELECTRA-base")
        self.Conv = nn.Conv1d(in_channels=1, out_channels=output_channel, kernel_size=kernal, stride=stride)
        # Length of the convolution output over the 768-dimensional [CLS] vector
        CNNlayer = (768 - kernal) // stride + 1
        self.Relu = nn.ReLU()
        self.Pooling = nn.MaxPool1d(kernel_size=CNNlayer)  # max over the whole length, per channel
        if dropout != 0:
            self.Dropout = nn.Dropout(dropout)
        self.Linear = nn.Linear(output_channel, labels)
        self.Softmax = nn.Softmax(dim=1)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        # [CLS] vector reshaped to (batch, 1, 768) so that Conv1d sees one input channel
        output = self.KcELECTRA(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[0][:, 0].unsqueeze(1)
        conv = self.Conv(output)
        act = self.Relu(conv)
        pool = self.Pooling(act).squeeze(2)  # (batch, output_channel)
        if self.dropout:
            pool = self.Dropout(pool)
        l = self.Linear(pool)
        return self.Softmax(l)
But I think KcELECTRA-CNN isn't suitable for this classification task. I couldn't get a better result than with the KcELECTRA-NN model.
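When tuning kernal and stride for this model, it helps to know how long the convolution output over the 768-dimensional vector will be, because the pooling layer has to cover that whole length. Below is a small sketch of the standard output-length formula for a 1D convolution without dilation; the helper name `conv1d_out_len` and the example kernel/stride pairs are only for illustration.

```python
def conv1d_out_len(length: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a 1D convolution with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


# Hypothetical (kernal, stride) settings applied to a 768-dimensional input
for kernel, stride in [(3, 1), (4, 2), (5, 3)]:
    print(kernel, stride, conv1d_out_len(768, kernel, stride))
```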
KcELECTRA - SmallNN Model
The KcELECTRA-SmallNN model is the KcELECTRA-NN model without the hidden layer. I judged that the hidden layer contributes to overfitting.
class KcELECTRA_SmallNN(nn.Module):
    def __init__(self, dropout, labels):
        super(KcELECTRA_SmallNN, self).__init__()
        self.dropout, self.labels = dropout, labels
        self.KcELECTRA = AutoModel.from_pretrained("beomi/KcELECTRA-base")
        if dropout != 0:
            self.Dropout = nn.Dropout(dropout)
        self.Linear = nn.Linear(768, labels)
        self.Softmax = nn.Softmax(dim=1)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        output = self.KcELECTRA(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[0][:, 0]
        if self.dropout != 0:
            output = self.Dropout(output)
        l = self.Linear(output)
        return self.Softmax(l)
As a result, I set a record with a 0.619 F1 score.
PP - KcELECTRA - NN Model
I suspected a tokenization problem in the dataset, so I pre-processed each sentence to leave only Korean.
As a result, the KcELECTRA-NN model gave a better result, a 0.646 F1 score, and I reached a ranking of 17/97.
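A minimal sketch of this kind of pre-processing is shown below. The exact rule used in the original code is not given here, so the character ranges are an assumption: this version keeps Hangul syllables, standalone jamo such as ㅋㅋ, and spaces, and drops everything else. The function name `keep_korean` is hypothetical.

```python
import re

# Assumption: "Korean only" means Hangul syllables, compatibility jamo, and whitespace.
NON_KOREAN = re.compile(r"[^가-힣ㄱ-ㅎㅏ-ㅣ\s]")
MULTI_SPACE = re.compile(r"\s+")


def keep_korean(sentence: str) -> str:
    """Remove every character that is not Korean, then tidy the whitespace."""
    korean_only = NON_KOREAN.sub("", sentence)
    return MULTI_SPACE.sub(" ", korean_only).strip()


print(keep_korean("안녕 hello!! ㅋㅋ 123"))
```

Whether to keep standalone jamo is a design choice: they often carry tone in comments, but dropping them would make the input closer to standard written Korean.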
KcELECTRA - HiddenCNN Model
This model is based on this paper. It uses the hidden-layer outputs of BERT as the input channels of a CNN.
class KcELECTRA_CNN2(nn.Module):
    def __init__(self, len_size, kernal, stride, dropout, labels):
        super(KcELECTRA_CNN2, self).__init__()
        self.KcELECTRA = AutoModel.from_pretrained("beomi/KcELECTRA-base", output_hidden_states=True)
        self.Dropout_default = nn.Dropout(0.1)
        # The 13 hidden states (embedding output + 12 layers) are the input channels
        self.Conv = nn.Conv2d(in_channels=13, out_channels=13, kernel_size=(kernal, 768), padding=kernal // 2)
        self.Relu = nn.ReLU()
        self.Pooling = nn.MaxPool2d(kernel_size=3, stride=stride, padding=kernal // 2)
        self.Flat = nn.Flatten()
        self.Linear_dropout = nn.Dropout(dropout)
        # Assumes the pooled map flattens to (len_size // stride + 1) * 13 features;
        # whether this holds depends on kernal, stride, and len_size
        self.Linear = nn.Linear((len_size // stride + 1) * 13, labels)
        self.Softmax = nn.Softmax(dim=1)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        # [1] is the tuple of hidden states, one tensor per layer
        output = self.KcELECTRA(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[1]
        output = torch.stack(output, dim=1)  # batch * 13(layer) * word len * 768
        conv = self.Conv(self.Dropout_default(output))
        act = self.Relu(self.Dropout_default(conv))
        pool = self.Pooling(self.Dropout_default(act))
        flat = self.Flat(self.Dropout_default(pool))
        l = self.Linear(self.Linear_dropout(flat))
        return self.Softmax(l)
The above model isn't suitable for the task. It scored only a 0.612 F1 score.
KcELECTRA - HiddenCNN Model 2
This model is also based on this paper. It uses the last three hidden layers of BERT as CNN channels and applies convolutions with several kernel sizes.
class KcELECTRA_CNN3(nn.Module):
    def __init__(self, len_size, kernal, filter_size, stride, labels):
        super(KcELECTRA_CNN3, self).__init__()
        self.KcELECTRA = AutoModel.from_pretrained("beomi/KcELECTRA-base", output_hidden_states=True)
        self.Dropout_default = nn.Dropout(0.1)
        # One convolution per kernel height in the list `kernal`; each spans the full 768 width
        self.Convs = nn.ModuleList([nn.Conv2d(in_channels=3, out_channels=filter_size, kernel_size=(i, 768), padding=((i - 1) // 2, 0)) for i in kernal])
        self.Relu = nn.ReLU()
        self.Pooling = nn.MaxPool1d(kernel_size=len_size // stride, stride=stride, padding=(len_size // stride - 1) // 2)
        self.Flat = nn.Flatten()
        self.Linear = nn.Linear(len_size // stride * len(kernal) * filter_size, labels)
        self.Softmax = nn.LogSoftmax(dim=1)  # returns log-probabilities

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        # [1] is the tuple of hidden states; keep only the last three layers
        output = self.KcELECTRA(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[1][-3:]
        output = torch.stack(output, dim=1)  # batch * 3 * encoder * 768
        x = [self.Relu(self.Dropout_default(Conv(self.Dropout_default(output)).squeeze(3))) for Conv in self.Convs]
        pool = torch.cat([self.Pooling(self.Dropout_default(i)) for i in x], 1)
        flat = self.Flat(self.Dropout_default(pool))
        l = self.Linear(self.Dropout_default(flat))
        return self.Softmax(l)
I set my BEST record with a 0.671 F1 score. That is third place!
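The input size of the final linear layer in this model depends on len_size, kernal, filter_size, and stride, so it is worth checking the shape arithmetic before training. The sketch below recomputes the flattened feature count from the convolution and pooling settings used in the class above, using the standard output-length formulas. The helper names and the example values (len_size=64, kernels 3/5/7, 32 filters, stride 2) are hypothetical and are not the settings used for the recorded score.

```python
from typing import List


def pool1d_out_len(length: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a 1D max pooling layer with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


def cnn3_flat_features(len_size: int, kernels: List[int], filter_size: int, stride: int) -> int:
    """Number of features reaching the linear layer after conv, pool, and flatten."""
    pool_kernel = len_size // stride
    pool_padding = (pool_kernel - 1) // 2
    total = 0
    for k in kernels:
        # Conv2d with kernel (k, 768) and padding ((k - 1) // 2, 0) along the token axis
        conv_len = len_size + 2 * ((k - 1) // 2) - k + 1
        total += filter_size * pool1d_out_len(conv_len, pool_kernel, stride, pool_padding)
    return total


# Hypothetical settings: the result should equal len_size // stride * len(kernels) * filter_size
print(cnn3_flat_features(64, [3, 5, 7], 32, 2), 64 // 2 * 3 * 32)
```

With odd kernel sizes the convolution keeps the token length unchanged, which is what lets the outputs of the different kernels be concatenated; an even kernel size shortens the sequence by one and can change the pooled length.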
Result
I reached a ranking of 3/97 with the KcELECTRA_CNN3 model and a 0.671 F1 score.