In the Web Page Scraping with jsoup article I described how to extract data from a web page using the open-source jsoup Java library. As an HTML parser, jsoup only sees the raw page source and is completely unaware of any content that is added to the DOM via JavaScript after the initial page load. For that, you need to employ some kind of embedded browser engine, such as Oracle WebView.
Another approach is to use a headless browser, that is, a web browser without a graphical user interface. There are several good ones to choose from, including Chrome, Firefox, PhantomJS, Zombie JS, HtmlUnit, and Splash. I today’s article, we’ll be automating the Chrome headless browser from a Python script to fetch a web page and read the dynamically generated contents of an element.
Project Setup
Python is an ideal language for web page scraping because it’s more light-weight that full-fledged languages like Java. There is also a Selenium WebDriver for python. (Actually, there is one for Java as well!) The driver is basically the engine that runs the browser, much like a database driver.
I’ll be developing the script within MS Visual Studio Code. It just makes everything much easier. Here‘s a great tutorial on getting started with python in Visual Studio Code.
Once you’ve got the python extension set up, you’re ready to create the project.
- Select File > Add Folder to Workspace… from the main menu.
- Browse the root folder of your project. Mine is simply called “demo”.
- Now we’ll need to download the ChromeDriver – WebDriver for Chrome. Place the chromedriver executable in your project root.
- Next, we’ll install Selenium. In the Integrated Terminal (select View < Integrated Terminal from the main menu if it is not already open), type “pip install selenium” at the prompt and hit the Enter key.
C:\Users\blackjacques>pip install selenium Collecting selenium Downloading https://files.pythonhosted.org/packages/41/c6/78a9a0d0150dbf43095c6f422fdf6f948e18453c5ebbf92384175b372ca2/selenium-3.13.0-py2.py3-none-any.whl (946kB) 100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 952kB 201kB/s Installing collected packages: selenium Successfully installed selenium-3.13.0
The Demo Page
We’ll be testing our script on a very simple web page. It has one paragraph element whose text which will be updated via JavaScript. You can either host the page on a server or, to keep things really simple, just save it locally to your project folder!
<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <title>JavaScript scraping test</title> </head> <body> <p id='dynamic-text'>No javascript support</p> <script> document.getElementById('dynamic-text').innerHTML = 'Javascript rendered content.'; </script> </body> </html>
The Script
Our script will test the above page by loading it into the headless chrome browser, fetching the “#dynamic-text” element, and printing its innerHTML to the console. If it says, “JavaScript rendered content,” then we’ve got the JS-rendered text. Otherwise, it might be time to revisit this whole solution!
- Create a new file named “page_scraping_demo.py” in your project root. Visual Studio Code will immediately recognize it as a python script.
- Add the following code to the file and save your changes.
from selenium import webdriver url = 'file:///I:/My%20Documents/articles/htmlgoodies/Web%20Scraping/demo/web_scraping_demo.html' driver = webdriver.Chrome() driver.get(url) dynamicText = driver.find_element_by_css_selector("#dynamic-text") message = dynamicText.get_attribute('innerHTML') print("Dynamic text element says: '" + message + "'." )
Running the Script
It’s time to put our code to work.
Right-click anywhere in the script editor and select “Run Python File in Terminal” from the popup menu. A browser window will open and load the page:
The script’s output will be displayed in the terminal:
Selenium Registry Issue in Windows
There is a known selenium issue in Windows that causes the following error to appear in the terminal:
[14596:10928:0531/174034.867:ERROR:install_util.cc(589)] Unable to create registry key HKLM\SOFTWARE\Policies\Google\Chrome for reading result=2
To fix it, you’ll have to go into regedit.exe and add required keys as per the instructions provided by Jari Mäkeläinen (aka jarmake):
- Open the registry with regedit (just click on Windows start menu and start typing regedit, it should come up)
- From the registry explorer, expand HKEY_LOCAL_MACHINE, and from there expand SOFTWARE
- Expand Policies. (I was missing everything from this point.)
- So what I had to do was to select the Policies by left clicking on it, and then right click and from the context menu select New > Key and name it Google.
- Once that is created, select that and right click and again select New > Key and name that Chrome.
- Select that folder, and right click. Choose New > String. Name it “MachineLevelUserCloudPolicyEnrollmentToken” and leave the value empty. (I set mine to “2”)
If you already have Google and Chrome under Policies, I think only adding the key as in step 6 should work. I’ve attached a picture to show how it should look like when it’s done.
And just to clarify, this is under SOFTWARE/Policies, not directly under SOFTWARE.
You can download the files that we worked on today from GitHub.
Going Forward…
Now that we’ve gotten our feet wet with python, selenium, and the chrome headless browser, we’ll tackle a more complex example next time that illustrates how to gather data from a dynamically generated page.