As a result, only the Kocohub dataset is suitable for a Kaggle competition. It consists of more than 8,000 labeled training examples and more than 2,000,000 unlabeled examples.
Base Model
First, I fine-tuned the two base models below.
In this case, I used the BertForSequenceClassification class from transformers to classify sentences. I tried two methods. First, I built two binary classification models: one separating (none) from (offensive, hate), and one separating (offensive) from (hate). Second, I built a single multi-class model. As a result, the multi-class approach with KcELECTRA turned out to be the best fit for the task.
KoELECTRA / Binary Model: 0.541 score
KoELECTRA / Multi-class Model: 0.593 score
KcELECTRA / Multi-class Model: 0.601 score
So I chose KcELECTRA as the base model, although there is no striking difference between KoELECTRA and KcELECTRA.
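The two labeling schemes described above can be sketched as a small label-mapping helper. This is an illustration only: the function and scheme names are hypothetical, and only the three label names (none / offensive / hate) come from the dataset.

```python
def make_labels(raw_labels, scheme):
    """Map Kocohub labels ('none'/'offensive'/'hate') to training targets.

    'binary_stage1': none (0) vs. offensive-or-hate (1)
    'binary_stage2': offensive (0) vs. hate (1), dropping 'none' rows
    'multiclass':    none=0, offensive=1, hate=2
    """
    if scheme == "binary_stage1":
        return [0 if lab == "none" else 1 for lab in raw_labels]
    if scheme == "binary_stage2":
        return [0 if lab == "offensive" else 1
                for lab in raw_labels if lab != "none"]
    if scheme == "multiclass":
        order = {"none": 0, "offensive": 1, "hate": 2}
        return [order[lab] for lab in raw_labels]
    raise ValueError(f"unknown scheme: {scheme}")
```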
Improved Model
To build a better model, I constructed various model structures with various hyperparameters. You can see the detailed fine-tuning code HERE.
KcELECTRA - NN Model
The KcELECTRA-NN model consists of KcELECTRA, a hidden linear layer, and a final linear layer for classification. In this model, we can tune the hidden layer size, dropout, and number of labels. The following is the code of the KcELECTRA-NN model.
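A minimal sketch of this structure. The encoder is injected (e.g. `AutoModel.from_pretrained("beomi/KcELECTRA-base")`), and the hyperparameter defaults shown here are assumptions, not the tuned values:

```python
import torch
import torch.nn as nn

class KcElectraNN(nn.Module):
    """KcELECTRA encoder + hidden linear layer + linear classifier (sketch).

    `encoder` is any transformers encoder, e.g.
    AutoModel.from_pretrained("beomi/KcELECTRA-base").
    hidden_size and dropout defaults are assumed placeholders.
    """
    def __init__(self, encoder, num_labels=3, hidden_size=256, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.hidden = nn.Linear(encoder.config.hidden_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]            # [CLS] representation
        h = torch.relu(self.hidden(self.dropout(cls)))
        return self.classifier(self.dropout(h))      # logits: (batch, num_labels)
```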
With this model I recorded a 0.637 F1 score.
KcELECTRA - CNN Model
The KcELECTRA-CNN model consists of KcELECTRA, a 1D CNN layer, and a final linear layer for classification. In this model, we can tune the output channels, kernel size, stride, dropout, and number of labels. The following is the code of the KcELECTRA-CNN model.
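A minimal sketch of this variant, convolving over the encoder's token representations and max-pooling over the sequence. As before, the encoder is injected and the hyperparameter defaults are assumed placeholders:

```python
import torch
import torch.nn as nn

class KcElectraCNN(nn.Module):
    """KcELECTRA encoder + 1D CNN + linear classifier (sketch).

    output_channels, kernel_size, stride, and dropout are the tunable
    hyperparameters mentioned above; defaults here are assumptions.
    """
    def __init__(self, encoder, num_labels=3, output_channels=128,
                 kernel_size=3, stride=1, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.conv = nn.Conv1d(encoder.config.hidden_size, output_channels,
                              kernel_size, stride=stride)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(output_channels, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # (batch, seq, hidden) -> (batch, hidden, seq) as Conv1d expects
        x = out.last_hidden_state.transpose(1, 2)
        x = torch.relu(self.conv(x))
        x = torch.max(x, dim=2).values               # max-pool over the sequence
        return self.classifier(self.dropout(x))      # logits: (batch, num_labels)
```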
But I think KcELECTRA-CNN isn't suitable for this classification task; I couldn't get a better result than with the KcELECTRA-NN model.
KcELECTRA - SmallNN Model
The KcELECTRA-SmallNN model is the KcELECTRA-NN model without the hidden layer. I determined that the hidden layer contributes to overfitting.
With this model I recorded a 0.619 F1 score.
PP - KcELECTRA - NN Model
Since I suspected a tokenization problem in the dataset, I pre-processed each sentence to keep only Korean characters.
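One way to sketch this "Korean only" preprocessing is with a regular expression. The exact character ranges and the function name are assumptions; this version keeps Hangul syllables, Hangul jamo, and whitespace, and drops everything else:

```python
import re

# Keep Hangul syllables (가-힣), jamo (ㄱ-ㅎ, ㅏ-ㅣ), and whitespace.
NON_KOREAN = re.compile(r"[^가-힣ㄱ-ㅎㅏ-ㅣ\s]")

def keep_korean_only(sentence: str) -> str:
    """Replace non-Korean characters with spaces, then collapse whitespace."""
    cleaned = NON_KOREAN.sub(" ", sentence)
    return re.sub(r"\s+", " ", cleaned).strip()
```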
As a result, the KcELECTRA-NN model achieved a better score of 0.646 F1, and I reached rank 17/97.
KcELECTRA - HiddenCNN Model
This model is based on This Paper. It applies CNN channels over the hidden layers of BERT.
This model wasn't suitable for the task; it recorded a 0.612 F1 score.
KcELECTRA - HiddenCNN Model 2
This model is also based on This Paper, again using CNN channels over the hidden layers of BERT.
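One plausible sketch of this idea: request all hidden states from the encoder, stack the [CLS] vector of every layer as input channels, and run a 1D convolution over them. This is my interpretation of "CNN channels over the hidden layers", not the paper's exact architecture, and the hyperparameter defaults are assumptions:

```python
import torch
import torch.nn as nn

class KcElectraHiddenCNN(nn.Module):
    """CNN over the [CLS] vectors of every encoder layer (sketch).

    The encoder is called with output_hidden_states=True; the per-layer
    [CLS] vectors become the Conv1d input channels. num_layers=13 assumes
    a 12-layer base model plus the embedding layer.
    """
    def __init__(self, encoder, num_labels=3, num_layers=13,
                 output_channels=128, kernel_size=3, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.conv = nn.Conv1d(num_layers, output_channels, kernel_size)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(output_channels, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask,
                           output_hidden_states=True)
        # Stack each layer's [CLS] vector: (batch, num_layers, hidden)
        cls_stack = torch.stack([h[:, 0] for h in out.hidden_states], dim=1)
        x = torch.relu(self.conv(cls_stack))
        x = torch.max(x, dim=2).values               # max-pool over hidden dim
        return self.classifier(self.dropout(x))      # logits: (batch, num_labels)
```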
This model set my best record with a 0.671 F1 score, which placed third!
Result
I reached rank 3/97 with the KcELECTRA_CNN3 model and a 0.671 F1 score.