Web Scraping with Python Selenium

Photo by Dai Yoshinaga on Unsplash

There is a huge amount of data on the internet, and it keeps growing as we speak. Sometimes we want to analyze it but are not sure what the options are and how to go about it. We may also want to automate the process so we don’t have to run it manually on our local machine. This post walks through an example of how to get the list of publicly listed companies from the Indonesia Stock Exchange (IDX). Selenium was chosen because we need to interact with the page in order to get what we want, and this tool supports exactly that. Let’s get into the details.

Introduction

Nowadays, people can search for almost anything on the internet, and sometimes the data they need sits on a webpage like the one used in this example. If the web owner simply published the data as a static page, far too many people (and bots) could harvest it, especially when there is a lot of interest in it, as with company financial data. Web owners therefore put technology in place to stop bots from getting their data too easily, which means visitors have to do something before they get what they want: fill in a form with a captcha, click a few buttons, or scroll to the bottom of the page and tick a box on the terms and conditions. While we can use requests to fetch a static page, that is not enough for a dynamic page whose content is rendered by JavaScript; Selenium, which drives a real browser, handles exactly that. It is not perfect, since popular websites can still recognize and block automated activity, but we will use it here for educational purposes.
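To see why a plain HTTP request is not enough here, you can try fetching the page without a browser. This is only an illustrative check, assuming the requests package and the same URL used later in this post; the company table is filled in by JavaScript, so its rows will typically not appear in the raw HTML.

import requests

# Fetch the raw HTML of the listing page; no JavaScript is executed here.
resp = requests.get(
    'https://www.idx.co.id/perusahaan-tercatat/profil-perusahaan-tercatat/',
    headers={'User-Agent': 'Mozilla/5.0'},
)
print(resp.status_code)  # the request itself may already be blocked
# A well-known listed company will usually not be found in the static response,
# because the table rows are rendered by JavaScript after the page loads.
print('Astra Agro Lestari' in resp.text)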

Setup

There are many possible configurations for Selenium. In this example, I will use Google Chrome with chromedriver to perform the task. We write the tasks in code, and chromedriver translates them into interactions with Google Chrome. Please note that the chromedriver version must match your Chrome version for the code to work. Here is the first part of the code, the setup.

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import ElementClickInterceptedException
from time import sleep
import pandas as pd  # used later to collect the scraped table

opts = Options()
opts.add_argument('--headless')
opts.add_argument('user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
opts.add_argument('--window-size=1920x1080')

opts.add_argument('--no-sandbox')
opts.add_argument('--disable-gpu')
opts.add_argument('--disable-dev-shm-usage')
opts.add_argument('disable-infobars')

# path to your chromedriver
browser = Chrome('/usr/local/bin/chromedriver', options=opts)
browser.implicitly_wait(1)

The first few imports are the Selenium sub-packages we need in this example; if you want to scrape another page, the required imports may differ. We import sleep to put pauses in our activity so it looks more like normal human behaviour to the site, and pandas to collect the scraped table later.

The options part here is also important. Since we eventually want to run this script in the cloud, the --headless option matters: with it, no Chrome window opens at all. When running locally, you may prefer to leave it out so you can watch the Chrome UI and see how the code performs; in that case Chrome opens with an info bar stating that it is being controlled by automated software.

The user-agent string can vary as well. We set it so the website identifies the requests as coming from an actual web browser. Without it, the page may block us, assuming the requests are not coming from a real user.

Lastly, the window size needs to be set. This prevents us from failing to find the buttons or fields we need, because by default the headless browser opens at a fairly small size. The other arguments in that snippet are optional.

Next, we initiate the browser by pointing Selenium to the chromedriver executable, so we need to supply its location path; the path will differ if we use Windows. Adding an implicit wait tells Selenium to wait up to 1 second for elements to appear before giving up. With the browser object ready, we can move on to the actual actions.
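As a side note, the exact initialization depends on your platform and Selenium version. Here is a minimal sketch of the same setup on Selenium 4 or later, where the driver path goes through a Service object; the Windows path shown is only an example.

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

opts = Options()
opts.add_argument('--headless')

# Linux/macOS path, same as above
browser = Chrome(service=Service('/usr/local/bin/chromedriver'), options=opts)
# On Windows, point to the .exe instead (example path):
# browser = Chrome(service=Service('C:/tools/chromedriver/chromedriver.exe'), options=opts)
browser.implicitly_wait(1)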

Scrape the Data

First, we need to open the web page we want to scrape:

browser.get('https://www.idx.co.id/perusahaan-tercatat/profil-perusahaan-tercatat/')

The command above will open this page:

Publicly Listed Company. Source: link [as per 9 May 2021]

By default, the table shows 10 records per page. Let’s see what the options are by using the command below. We can get the element name by right-clicking on the box and selecting Inspect Element.

select = Select(browser.find_element_by_name('companyTable_length'))
Display per Page Option

Since we want to capture the data more efficiently, we will pick 100 records per page. We can do so with this command:

number_per_page = int(select.options[3].text) # the fourth option gives 100 records per page
select.select_by_value('100')
sleep(1)

After selecting 100, the page displays 100 records per page. Adding a sleep pauses the script to give the page time to reload.
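If you prefer not to rely on a fixed sleep, an explicit wait can block until the larger page size has actually loaded. Here is a small sketch; the CSS selector is an assumption based on the companyTable element id used further below.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

# wait up to 10 seconds until the table shows more than the default 10 rows
WebDriverWait(browser, 10).until(
    lambda drv: len(drv.find_elements(By.CSS_SELECTOR, '#companyTable tbody tr')) > 10
)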

After the page is displayed, we need to know how many pages we have to scrape. To do that, we scroll down to the bottom of the table.
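When running with the UI, a bit of JavaScript scrolls the pagination controls into view; this is just a convenience sketch.

# scroll to the bottom of the page so the pagination buttons are visible
browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
sleep(1)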

Total Page to Scrape

From the screenshot above, we know the full list spans 8 pages. Therefore, we need to click the Next button 7 times, since the first page is already displayed. While doing so, we capture the data inside the table by reading the relevant elements. Lastly, we close the browser after getting all the data we need, to free up memory. The loop is shown in the code below. The try-except part is optional; it handles the rating popup that the page sometimes shows, which would otherwise intercept our clicks.

company_df = pd.DataFrame() # initiate an empty DataFrame as holder

page_element = browser.find_elements_by_class_name('paginate_button')
page_element_length = len(page_element) + 1 # to capture last page

for i in range(1, page_element_length):
    try:
        # print('Retrieve data in page {0} ...'.format(i))
        # read the visible table as text, one row per line (first line is the header)
        company_table = browser.find_element_by_id('companyTable').text.split('\n')
        company_raw = [row.split() for row in company_table[1:]]
        company_code = [row[1] for row in company_raw]
        company_name = [' '.join(row[2:-3]) for row in company_raw]
        date_public = [' '.join(row[-3:]) for row in company_raw]
        company_df_add = pd.DataFrame({'Kode': company_code, 'Nama': company_name, 'Tanggal Pencatatan': date_public})

        # move to the next page of the table
        browser.find_elements_by_xpath('/html/body/main/div[2]/div/div[2]/div/div[4]/a[2]')[0].click()
        sleep(2)

        # append this page's company information
        # (DataFrame.append was removed in pandas 2.0; use pd.concat there)
        company_df = company_df.append(company_df_add, ignore_index=True)

    except ElementClickInterceptedException:
        # a rating popup sometimes intercepts the click on Next;
        # locate the button and detach it from the DOM with JavaScript
        element = browser.find_element_by_css_selector('.paginate_button.next')
        browser.execute_script("""
        var element = arguments[0];
        element.parentNode.removeChild(element);
        """, element)
        print(element)

browser.close()

Once you run the code above, you can check the variable company_df, which holds the full list of publicly listed companies in Indonesia. You can either save it to CSV or follow up with other actions; in my next post I will show how to get the financial metrics for these companies.
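Saving the result, for instance, is a one-liner (the file name here is just an example):

# persist the scraped list for later use
company_df.to_csv('idx_listed_companies.csv', index=False)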

Closing

There you have it: the list of public companies in Indonesia. If you want to automate this, I suggest checking out PythonAnywhere, which provides infrastructure on top of Amazon Web Services (AWS) at a decent price and saves us the headache of maintaining the infra part. They have a free tier, although it can only scrape websites on their whitelist. The paid plan starts at just USD 5 per month and, in my experience, is totally worth the price.

Hope you find this post useful and see you in my next post!
