Company Data Web Scraping with Python Selenium

Hi readers, this post is a continuation of using Python Selenium to get the list of public companies in Indonesia. I suggest reading that piece first before you continue. You can read the previous post here.

After we get the company list, the next logical question is: how are they performing as businesses? Sure, each of them publishes a financial report every quarter on its website. In addition, there are many free apps we can use to create a watchlist. However, selecting good companies is tricky: not only do we need to select those with strong historical performance, but we also need to ensure they have growth potential for the next few years. I say the next few years and not forever because anything can happen. Besides, the world has never been the same since Covid-19 struck in early 2020. I hope this post can give you a glimpse of how to select fundamentally strong companies.

Why Scraping

For obvious reasons, leveraging technology for routine tasks is the way we live nowadays. Thankfully, the barrier to learning a programming language is much lower for casual users now. I first learned to program during my first year in college using Pascal. Then in my second year I was exposed to Java, where even printing ‘Hello, World!’ was painful, and only in my third year did I learn about C#. I am no expert in software engineering, so the term casual users here refers to those who want to use technology to automate things they want to do. I am not a professional developer who has to think about many aspects because the program will be accessed by millions of people at the same time. Think of how many people ask Google something every second, or how many people watch YouTube videos simultaneously.

Scraping is a mechanism for getting data from websites automatically. With many pages on the same site, it is not scalable to rely on a human to open every page and retrieve the data we need. Therefore, we want to automate the process so we can focus on analyzing the data once we have it.

Why Reuters

There are many web pages providing company financial data. They use APIs (Application Programming Interfaces), read financial reports, and work together with various media channels to get their hands on the data. After comparing multiple websites, I think Reuters provides not only a snapshot of a company's financial condition but also metrics showing its growth over the last few years. This is really helpful since we don't need to do any extra calculations; all we need is to retrieve the data and play around with the criteria we care about. For example, if we want to check on Apple, we can go to this link to get the overview. The juice, however, is in the Key Metrics tab, where we can get many financial metrics for a single company. We can group them into a few sections as follows:

  1. Price and Volume
  2. Per Share Data
  3. Valuation
  4. Financial Strength
  5. Margins
  6. Management Effectiveness
  7. Growth
  8. Income Statement

How to Get

Now that we know where to get the data, let's put it into code to automate the process. I will use the Indonesian public company list as the baseline, but you can change this as long as you can find the combination of a company's stock code and how Reuters uses it in its URL.

What we need to do is define the base URL after importing the necessary packages:

import pandas as pd
import requests

from time import sleep
from bs4 import BeautifulSoup

# {0} will be filled with the stock code; .JK is the suffix for the Jakarta exchange
base_url = 'https://www.reuters.com/companies/{0}.JK/key-metrics'
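
To see how the template works, here is a quick sanity check (BBCA is just an illustrative stock code; any code from the company list works the same way):

print(base_url.format('BBCA'))
# https://www.reuters.com/companies/BBCA.JK/key-metrics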

Then, to get the values, we need to list every property associated with those sections. I have copied them here so you lazy b*stards can start using them immediately (just kidding):

column_list = [
    'Company Code',
    'Price closing or last bid',
    '52 Week High',
    '52 Week Low',
    '10 Day Average Trading Volume',
    'Market Capitalization',
    '3 Month Average Trading Volume',
    'Beta',
    '1 Day Price Change',
    '13 Week Price Return (Daily)',
    '26 Week Price Return (Daily)',
    '5 Day Price Return (Daily)',
    '52 Week Price Return (Daily)',
    'Year To Date Price Return (Daily)',
    'Month to Date Price Return (Daily)',
    'Price Relative to S&P500 (4 Week)',
    'Price Relative to S&P500 (13 Week)',
    'Price Relative to S&P500 (26 Week)',
    'Price Relative to S&P500 (52 Week)',
    'Price Relative to S&P500 (YTD)',
    'EPS excl. Extra Items (Annual)',
    'EPS excl. Extra Items (TTM)',
    'EPS Normalized (Annual)',
    'Revenue per Share (Annual)',
    'Revenue per Share (TTM)',
    'Book Value (Per Share Annual)',
    'Book Value (Per Share Quarterly)',
    'Tangible Book Value (Per Share Annual)',
    'Tangible Book Value (Per Share Quarterly)',
    'Cash Per Share (Per Share Annual)',
    'Cash Per Share (Per Share Quarterly)',
    'Cash Flow (Per Share Annual)',
    'Cash Flow (Per Share TTM)',
    'Dividend (Per Share Annual)',
    'Dividends (Per Share TTM)',
    'EBITD (Per Share TTM)',
    'EPS Basic excl. Extra Items (Annual)',
    'EPS Basic excl. Extra Items (TTM)',
    'EPS incl. Extra Items (Annual)',
    'EPS incl. Extra Items (TTM)',
    'Free Cash Flow (Per Share TTM)',
    'Dividend (Per Share 5Y)',
    'P/E excl. Extra Items (Annual)',
    'P/E excl. Extra Items (TTM)',
    'P/E Normalized (Annual)',
    'Price to sales (Annual)',
    'Price to sales (TTM)',
    'Price to Tangible Book (Annual)',
    'Price to Tangible Book (Quarterly)',
    'Price to Free Cash Flow (Per Share Annual)',
    'Price to Cash Flow (Per Share TTM)',
    'Price to Free Cash Flow (Per Share TTM)',
    'Price to Book (Annual)',
    'Price to Book (Quarterly)',
    'P/E Basic excl. Extra Items (TTM)',
    'P/E excl. Extra Items High (TTM)',
    'P/E excl. Extra Items Low (TTM)',
    'P/E incl. Extra Items (TTM)',
    'Net Debt (Interim)',
    'Net Debt (Annual)',
    'Dividend Yield (5Y)',
    'Dividend Yield',
    'Current Dividend Yield (TTM)',
    'Free Cash Flow (Annual)',
    'Current Ratio (Annual)',
    'Net Interest coverage (Annual)',
    'Long Term Debt/Equity (Annual)',
    'Payout Ratio (Annual)',
    'Quick Ratio (Annual)',
    'Total Debt/Total Equity (Annual)',
    'Current EV/Free Cash Flow (Annual)',
    'Current EV/Free Cash Flow (TTM)',
    'Current Ratio (Quarterly)',
    'Long Term Debt/Equity (Quarterly)',
    'Quick Ratio (Quarterly)',
    'Total Debt/Total Equity (Quarterly)',
    'Free Cash Flow (TTM)',
    'Net Interest Coverage (TTM)',
    'Payout Ratio (TTM)',
    'Gross Margin (Annual)',
    'Gross Margin (TTM)',
    'Net Profit Margin % (Annual)',
    'Net Profit Margin (TTM)',
    'Operating Margin (Annual)',
    'Operating Margin (TTM)',
    'Pretax Margin (TTM)',
    'Pretax Margin (Annual)',
    'Operating Margin (5Y)',
    'Pretax Margin (5Y)',
    'Free Operating Cash Flow/Revenue (5Y)',
    'Free Operating Cash Flow/Revenue (TTM)',
    'Gross Margin (5Y)',
    'Net Profit Margin (5Y)',
    'Return on Assets (Annual)',
    'Return on Equity (TTM)',
    'Return on Average Equity (Annual)',
    'Return on Average equity (TTM)',
    'Return on Investment (Annual)',
    'Return on Investment (TTM)',
    'Return on Average Assets (5Y)',
    'Return on Average Equity (5Y)',
    'Return on Investment (5Y)',
    'Asset Turnover (Annual)',
    'Asset Turnover (TTM)',
    'Inventory Turnover (Annual)',
    'Inventory Turnover (TTM)',
    'Net Income/Employee (Annual)',
    'Net Income/Employee (TTM)',
    'Receivables Turnover (Annual)',
    'Receivables Turnover (TTM)',
    'Revenue/Employee (Annual)',
    'Revenue/Employee (TTM)',
    'Revenue Growth (Quarterly YoY)',
    'Revenue Growth Rate (5Y)',
    'EPS Growth (Quarterly YoY)',
    'EPS Growth (TTM YoY)',
    'EPS Growth Rate (5Y)',
    'Dividend Growth Rate (3Y)',
    'Revenue Growth (TTM YoY)',
    'Revenue Growth (Per Share 5Y)',
    'Revenue Growth Rate (3Y)',
    'EPS Growth Rate (3Y)',
    'Book Value Growth Rate (Per Share 5Y)',
    'Tangible Book Value Total Equity CAGR (5Y)',
    'Capital Spending growth rate 5 year',
    'EBITDA CAGR (5Y)',
    'EBITDA Interim CAGR (5Y)',
    'Free Operating Cash Flow CAGR (5Y)',
    'Total Debt CAGR (5Y)',
    'Net Profit Margin Growth Rate (5Y)',
    'Revenue (Annual)',
    'Revenue (TTM)',
    'EBITD (Annual)',
    'EBITD (TTM)',
    'Earnings Before Taxes (Annual)',
    'Earnings Before Taxes (TTM)',
    'Net Income to Common (Annual)',
    'Net Income to Common (TTM)',
    'Earnings Before Taxes Normalized (Annual)',
    'Net Income Available to Common Normalized (Annual)',
    'Diluted Normalized EPS excl. Extra Items (TTM)'
]

Once we have the list, we can initialize an empty table as the holder for the data we want to get. I am using the df variable for this.

df = pd.DataFrame(columns=column_list)
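
The loop below iterates over company_df, the table of company codes we built in the previous post. If you saved that list to a file back then, you can load it again first; the file name here is only an assumption, so use whatever you saved it as:

# assuming the previous post's result was saved as a CSV with a 'Company Code' column
company_df = pd.read_csv('company_list.csv')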

Then we can start automating the process with a loop:

i = 0
for k in company_df['Company Code']:
    try:
        # request the key-metrics page for this company code
        req = requests.get(base_url.format(k))
        soup = BeautifulSoup(req.content, 'html.parser')
        # Reuters shows a 404 header element when the page has no data
        status = soup.find('h2', class_='TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg ErrorPage-status-2Tfzh')
        if status is not None and status.text == '404':
            pass
        else:
            # every metric value on the page sits in a span with these classes
            data = soup.find_all('span', class_='TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__regular___2X0ym digits MarketsTable-value-FP5ul')
            value = [d.get_text() for d in data]
            df.loc[i] = [k] + value
            i += 1
            sleep(1)  # pause between requests
    except Exception as e:
        # print(e)  # uncomment to debug companies that fail
        pass

The snippet above requests the Reuters page for every company in the list. The req variable stores the response of each call. The soup variable parses the page as HTML. Then the status variable checks whether the site can actually provide the data. Sometimes new companies (or problematic ones) don't have their latest data yet, so we need to skip them.

If the data is available, then congratulations! We collect the metric elements in the data variable, capture their text, and add it as a row in our table. To make the script look less like an automated bot, we use the sleep function to create a pause before calling the next company.
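
One caveat: the df.loc[i] = [k] + value assignment only works when Reuters returns exactly as many values as we have metric columns, that is len(column_list) - 1, since the first column holds the company code. Otherwise pandas raises an error that our except block silently swallows. A minimal guard you could put inside the loop, sketched here:

# only keep rows where every expected metric came back
if len(value) == len(column_list) - 1:
    df.loc[i] = [k] + value
    i += 1
else:
    print('{0}: expected {1} values, got {2}'.format(k, len(column_list) - 1, len(value)))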

Closing

Now if you open the df variable, you will find all the data there. You may want to start analyzing it right away, but the data is not in the correct format yet. If you call the df.info() function, you will notice that everything is stored as strings rather than floats. This means we need to convert it before we can do anything useful with it. I will explain how in the next post, so stay tuned.

I hope you enjoyed this post, and see you in the next one!
