Data Collection by Web Crawling

Date: 2023.07.27 ~ 2023.07.30

Writer: 9tailwolf

Introduction

The goal of my research is to determine what kinds of hate speech occur in each age and gender group. To do this, I need sentence data from online communities that can represent each group.

How to Choose Communities

I used search-trend data to identify communities that represent the characteristics of each age group well. Blackkiwi is a very useful site for looking up Korean search trends.

The target age/gender groups are men and women in their 20s to 30s, 30s to 40s, and 40s to 50s (written here as 2030s, 3040s, and 4050s), which gives six groups in total.

As a result, I selected six communities:

FMkorea: 81.8% in the 2030s age group, 88.1% male

Ruliweb: 78.1% in the 3040s age group, 93.5% male

Clien: 69.0% in the 4050s age group, 65.9% male

Theqoo: 73.8% in the 2030s age group, 95.0% female

Yeobgi-Hogeun-Jinsil: 77.2% in the 3040s age group, 79.5% female

82cook: 88.1% in the 4050s age group, 95.2% female
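
The selection above can be kept as a small lookup table, so that each collected sentence can later be labelled with the dominant age/gender group of its community. The following is only a minimal sketch: the `Community` class, its field names, and the `BY_GROUP` dictionary are hypothetical names introduced for illustration, while the figures are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Community:
    """One selected community and its dominant demographic (illustrative structure)."""
    name: str
    age_group: str       # "2030s", "3040s" or "4050s"
    gender: str          # "male" or "female"
    age_share: float     # percentage in the dominant age group
    gender_share: float  # percentage of the dominant gender


# Figures copied from the list above.
COMMUNITIES = [
    Community("FMkorea", "2030s", "male", 81.8, 88.1),
    Community("Ruliweb", "3040s", "male", 78.1, 93.5),
    Community("Clien", "4050s", "male", 69.0, 65.9),
    Community("Theqoo", "2030s", "female", 73.8, 95.0),
    Community("Yeobgi-Hogeun-Jinsil", "3040s", "female", 77.2, 79.5),
    Community("82cook", "4050s", "female", 88.1, 95.2),
]

# Look up the community chosen for a given (age group, gender) pair.
BY_GROUP = {(c.age_group, c.gender): c for c in COMMUNITIES}

print(BY_GROUP[("3040s", "female")].name)  # Yeobgi-Hogeun-Jinsil
```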

Web Crawling to Collect Data

BeautifulSoup4 is a Python library for parsing HTML and XML documents. It turns a document into a tree that can be searched and iterated over in Python. Selenium is a library for browser automation, which lets a script open and navigate a website automatically.
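
As a small illustration of the tree search described above, the sketch below parses a hand-written HTML fragment and extracts the text of each list item. The HTML and its class names (`board`, `post`, `title`) are made up for this example and are not taken from any of the target sites; only `BeautifulSoup`, `find`, `find_all`, and `get_text` are used.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, made up for this example.
html = """
<div class="board">
  <ul>
    <li class="post"><span class="title">First post</span></li>
    <li class="post"><span class="title">Second post</span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # build the parse tree

# find() returns the first matching tag; find_all() returns every match below it.
board = soup.find("div", class_="board")
titles = [
    li.find("span", class_="title").get_text()
    for li in board.find_all("li", class_="post")
]

print(titles)  # ['First post', 'Second post']
```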

Below is the source code for setting up BeautifulSoup4 and Selenium.

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # Open Chrome
driver.implicitly_wait(10)  # Wait up to 10 seconds when looking up elements

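If you want to use Chrome for crawling, ChromeDriver is essential. Selenium 4.6 and later can download it automatically through Selenium Manager.

After that, the crawling code depends on the target website and on the data you need.

For example, to get data from Yeobgi-Hogeun-Jinsil, I ran the code below.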

data = []
i = 3  # Index of the next page link to click in the paging bar
while len(data) < 5000:  # Collect 5,000 sentences
    html = driver.page_source  # HTML of the currently loaded page
    soup = BeautifulSoup(html, 'html.parser')  # Build the parse tree

    # Find the post list and extract the text of each post
    posts = soup.find('div', class_='cafe_group cafe_all').find_all('li', class_='thumbnail_on cmt_on')
    for post in posts[1:]:  # Skip the first item
        data.append(post.find('span', class_='txt_detail').get_text())

    # Turn the page
    if i != 7:
        driver.find_element('xpath', f'//*[@id="pagingNav"]/span[{i}]').click()
    else:
        driver.find_element('xpath', '//*[@id="mArticle"]/div[2]/a[2]/span').click()
        i = 2
    i += 1

In this way, I collected 5,000 sentences from each of the six selected communities.
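
The page-turning part of the loop above is easy to misread, so the sketch below isolates it as a generator that yields the XPath clicked after each scraped page. The names `PAGE_LINK`, `NEXT_BLOCK`, and `click_targets` are hypothetical; the XPaths are copied from the code above, and the second one is presumably the link to the next block of pages.

```python
from itertools import islice

PAGE_LINK = '//*[@id="pagingNav"]/span[{}]'          # numbered page link
NEXT_BLOCK = '//*[@id="mArticle"]/div[2]/a[2]/span'  # presumably the "next" link


def click_targets():
    """Yield, in order, the XPath that the crawling loop clicks after each page."""
    i = 3
    while True:
        if i != 7:
            yield PAGE_LINK.format(i)
        else:
            yield NEXT_BLOCK
            i = 2  # becomes 3 after the increment below
        i += 1


# Show the first six click targets.
for xpath in islice(click_targets(), 6):
    print(xpath)
```

The first six targets are `span[3]` through `span[6]`, then the second XPath, then `span[3]` again. In the crawler, each yielded XPath would be passed to `driver.find_element('xpath', xpath).click()`.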