Web Scraping with Automa
This note grew out of my interest in the performance of investment funds available to individual investors in Vietnam. I wanted to crawl the historical NAV data of several investment funds domiciled in Vietnam, and that led me to web browser automation. The note covers my experience with browser automation using Automa.
What is Automa?
Automa is a browser extension (available for Chromium-based browsers and Firefox) that automates browser operations, from auto-filling forms and performing repetitive tasks to taking screenshots and scraping data from websites.
The extension has a UX similar to n8n or Zapier: it visualizes the workflow as blocks connected by arrows.
A workflow can be imported or exported locally as a .json file, or it can be hosted publicly in the Marketplace.
The data to be crawled
I need to download all the published .xls files of the Net Asset Value Daily report from the website of the fund DCBC.
This is the manual workflow that I want to automate.
- Click on the row "2022 - DCBC - NAV REPORT", then a list of 10 "NAV daily report" will be displayed
- Click on the 1st child element of this unordered list (e.g. "DCBC Net Asset Value as of Day 12/12/2022")
- Switch active tab to the recently opened tab, and click on the link to download the .xls file
- Delay 500ms, and close this tab
- Then repeat the task for the other 9 child elements in the unordered list.
What did I achieve using Automa?
dcbc v5:
- The manual workflow described above can be automated with Automa, as in this photo.
- Click here to download the .json file of the Automa workflow dcbc v5.
dcbc v6:
- TIL
  - a Loop Elements block doesn't work without a Loop Breakpoint block
  - use Attribute Value to retrieve the href value, which is the URL of the .xls file, and save it to Table
  - use Get Text to retrieve the text content (here, the filename), and save it to Table
  - use Export Data to export the Table as a .csv file to download later
- Click here to download the .json file of the Automa workflow dcbc v6.
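The v6 steps (read an attribute, read the text, push both into a table, export as CSV) can be mimicked outside the browser. The following is only a minimal sketch using the Python standard library; the sample markup, the example.com URLs, and the column names "filename" and "url" are assumptions for illustration, not the real DCBC page or the Automa workflow itself.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup standing in for the report list; the real page differs.
SAMPLE_HTML = """
<ul>
  <li><a href="https://example.com/files/nav-2022-12-12.xls">DCBC NAV 12/12/2022</a></li>
  <li><a href="https://example.com/files/nav-2022-12-09.xls">DCBC NAV 09/12/2022</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect one table row (link text + href) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.rows = []      # plays the role of the Table
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")  # like "Attribute Value"
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)               # like "Get Text"

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            self.rows.append({"filename": name, "url": self._href})
            self._href = None


def export_csv(rows):
    """Serialize the collected rows to CSV text (like "Export Data")."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["filename", "url"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
csv_text = export_csv(collector.rows)
print(csv_text)
```

Keeping the URL and the filename in the same row is what makes the later download step possible: each CSV line carries everything needed to fetch and name one file.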
dcbc v7:
- In the previous version, my workflow could download only the 10 files on the 1st page. Now I want to retrieve all published daily reports. At the moment, the website of fund DCBC has 24 pages of daily NAV reports, dated from 2022-01-03 to 2022-12-12. This means Loop Elements needs to iterate ~240 times (10 files x 24 pages), which can be achieved with the "Load more elements" feature of Loop Elements.
- Retrieving the URLs of all 240 files for 2022 took 16m23s.
- Click here to download the .json file of the Automa workflow dcbc v7.
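The idea behind "Load more elements" (keep iterating, and fetch the next page whenever the current one is exhausted) can be sketched as a generator. This is a rough model of the iteration only, not how Automa implements the block; fake_load_page is a hypothetical stand-in for paging through the site, and the 10 x 24 figures come from the note above.

```python
from typing import Callable, Iterator, List

PAGE_SIZE = 10   # reports listed per page
PAGE_COUNT = 24  # pages available when the note was written


def fake_load_page(page: int) -> List[str]:
    """Hypothetical stand-in for loading one page of report links."""
    if page > PAGE_COUNT:
        return []
    start = (page - 1) * PAGE_SIZE
    return [f"report-{i + 1}" for i in range(start, start + PAGE_SIZE)]


def loop_elements(load_page: Callable[[int], List[str]]) -> Iterator[str]:
    """Yield every element, loading the next page when one is exhausted."""
    page = 1
    while True:
        elements = load_page(page)
        if not elements:  # nothing more to load, so the loop ends
            return
        yield from elements
        page += 1


reports = list(loop_elements(fake_load_page))
print(len(reports))  # 240
```

The loop body never needs to know about pages: it sees one flat stream of 240 elements.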
dcbc v8:
- In v7, the workflow looped through the elements and opened each daily NAV page in a new tab to retrieve the download URL. My friend suggested retrieving all the page URLs first and then looping through them.
- I expected the runtime of this version to be much shorter than v7's. Surprisingly, v8 is slower, at 23m29s.
- Click here to download the .json file of the Automa workflow dcbc v8.
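To put the two runtimes side by side, here is a small back-of-the-envelope calculation. It assumes that v8 processed the same ~240 files as v7, which the note does not state explicitly, so the per-file figures are only indicative.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a runtime written as XmYs into seconds."""
    return minutes * 60 + seconds


FILES = 240                    # 10 files x 24 pages (assumed equal for v7 and v8)

v7_total = to_seconds(16, 23)  # runtime reported for v7
v8_total = to_seconds(23, 29)  # runtime reported for v8

v7_per_file = v7_total / FILES
v8_per_file = v8_total / FILES
slowdown = v8_total / v7_total

print(f"v7: {v7_total}s total, {v7_per_file:.1f}s per file")
print(f"v8: {v8_total}s total, {v8_per_file:.1f}s per file")
print(f"v8 took {(slowdown - 1) * 100:.0f}% longer than v7")
```

Under that assumption, v7 spends roughly 4.1 seconds per file and v8 roughly 5.9 seconds, so collecting the URLs first cost about 43% more time overall.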
dcbc v11:
- TIL
  - to write concise info into the "Description" field of each block, since Automa uses the content of this field as the name of each executed step in its workflow logs
  - to use a variable from a previous node
    - Example:
      - a New Tab block with {{loopData.fileInfo.page_url}} in the New tab URL field
      - It will open a new tab with the URL from the "page_url" column of the table created in the previous Loop Data block with Loop ID "fileInfo"
  - to use the Slice Variable block to make explicit which data is passed to the next step, instead of burying that choice inside the parameters of the Loop Data block as in my previous workflow
  - the Fallback parameter in the Element Exists block solved the problem of wrong output in v9. Reason: in the source of the DCBC website, some daily NAV reports don't have a file to download
  - In Automa, table and variables have the same functionality. They both store information or values from the page [1]
    - Whenever a value is inserted into the table, it is pushed to the end row of the selected column
    - But in variables, the variable value is overwritten with the new value
  - In v11, this workflow has 3 loops
    - loop yearReport: goes through all years from 2013 to 2022
      - loop pageUrl: a nested loop which retrieves the URLs of all daily report web pages and dumps them into a Table
    - loop fileInfo: goes through all the records in the table created in the previous step
- Click here to download the .json file of the Automa workflow dcbc v11.
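The v11 notes combine three ideas: a table appends while a variable overwrites, a placeholder such as {{loopData.fileInfo.page_url}} is filled from the current loop item, and the workflow collects page URLs in nested loops before walking the table. The toy model below sketches these in plain Python. The Workflow class, the render helper, the current_year variable, and the example.com URLs are all hypothetical; Automa's real expression engine and storage are richer than this.

```python
import re
from collections import defaultdict


class Workflow:
    """Toy model of the table and variable storage described above."""

    def __init__(self):
        self.table = defaultdict(list)  # column name -> list of row values
        self.variables = {}             # variable name -> single value

    def insert_into_table(self, column, value):
        self.table[column].append(value)  # pushed to the end of the column

    def set_variable(self, name, value):
        self.variables[name] = value      # overwrites any earlier value


def render(template, loop_data):
    """Fill {{loopData.<loop id>.<key>}} placeholders from loop_data."""
    def lookup(match):
        loop_id, key = match.group(1), match.group(2)
        return str(loop_data[loop_id][key])
    pattern = r"\{\{\s*loopData\.(\w+)\.(\w+)\s*\}\}"
    return re.sub(pattern, lookup, template)


# Hypothetical listing: year -> report page URLs (the real site is not queried).
LISTING = {
    2021: ["https://example.com/nav/2021-12-30",
           "https://example.com/nav/2021-12-31"],
    2022: ["https://example.com/nav/2022-12-12"],
}

wf = Workflow()
for year in sorted(LISTING):           # loop yearReport
    wf.set_variable("current_year", year)
    for url in LISTING[year]:          # nested loop pageUrl
        wf.insert_into_table("page_url", url)

opened = []
for url in wf.table["page_url"]:       # loop fileInfo
    loop_data = {"fileInfo": {"page_url": url}}
    opened.append(render("{{loopData.fileInfo.page_url}}", loop_data))

print(len(wf.table["page_url"]), wf.variables["current_year"])  # 3 2022
```

After the run, the table holds all three URLs in insertion order, while the variable holds only the last year written to it.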