Another approach is to use a headless browser, that is, a web browser without a graphical user interface. There are several good ones to choose from, including Chrome, Firefox, PhantomJS, Zombie JS, HtmlUnit, and Splash. I today's article, we'll be automating the Chrome headless browser from a Python script to fetch a web page and read the dynamically generated contents of an element.
Python is an ideal language for web page scraping because it's more light-weight that full-fledged languages like Java. There is also a Selenium WebDriver for python. (Actually, there is one for Java as well!) The driver is basically the engine that runs the browser, much like a database driver.
Once you've got the python extension set up, you're ready to create the project.
- Select File > Add Folder to Workspace… from the main menu.
- Browse the root folder of your project. Mine is simply called "demo".
- Now we'll need to download the ChromeDriver - WebDriver for Chrome. Place the chromedriver executable in your project root.
- Next, we'll install Selenium. In the Integrated Terminal (select View < Integrated Terminal from the main menu if it is not already open), type "pip install selenium" at the prompt and hit the Enter key.
C:\Users\blackjacques>pip install selenium Collecting selenium Downloading https://files.pythonhosted.org/packages/41/c6/78a9a0d0150dbf43095c6f422fdf6f948e18453c5ebbf92384175b372ca2/selenium-3.13.0-py2.py3-none-any.whl (946kB) 100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 952kB 201kB/s Installing collected packages: selenium Successfully installed selenium-3.13.0
The Demo Page
- Create a new file named "page_scraping_demo.py" in your project root. Visual Studio Code will immediately recognize it as a python script.
- Add the following code to the file and save your changes.
from selenium import webdriver url = 'file:///I:/My%20Documents/articles/htmlgoodies/Web%20Scraping/demo/web_scraping_demo.html' driver = webdriver.Chrome() driver.get(url) dynamicText = driver.find_element_by_css_selector("#dynamic-text") message = dynamicText.get_attribute('innerHTML') print("Dynamic text element says: '" + message + "'." )
Running the Script
It's time to put our code to work.
Right-click anywhere in the script editor and select "Run Python File in Terminal" from the popup menu. A browser window will open and load the page:
The script's output will be displayed in the terminal:
Selenium Registry Issue in Windows
There is a known selenium issue in Windows that causes the following error to appear in the terminal:
[14596:10928:0531/174034.867:ERROR:install_util.cc(589)] Unable to create registry key HKLM\SOFTWARE\Policies\Google\Chrome for reading result=2
To fix it, you'll have to go into regedit.exe and add required keys as per the instructions provided by Jari Mäkeläinen (aka jarmake):
- Open the registry with regedit (just click on Windows start menu and start typing regedit, it should come up)
- From the registry explorer, expand HKEY_LOCAL_MACHINE, and from there expand SOFTWARE
- Expand Policies. (I was missing everything from this point.)
- So what I had to do was to select the Policies by left clicking on it, and then right click and from the context menu select New > Key and name it Google.
- Once that is created, select that and right click and again select New > Key and name that Chrome.
- Select that folder, and right click. Choose New > String. Name it "MachineLevelUserCloudPolicyEnrollmentToken" and leave the value empty. (I set mine to "2")
If you already have Google and Chrome under Policies, I think only adding the key as in step 6 should work. I've attached a picture to show how it should look like when it's done.
And just to clarify, this is under SOFTWARE/Policies, not directly under SOFTWARE.
You can download the files that we worked on today from GitHub.
Now that we've gotten our feet wet with python, selenium, and the chrome headless browser, we'll tackle a more complex example next time that illustrates how to gather data from a dynamically generated page.
Rob's alter-ego, "Blackjacques", is an accomplished guitar player, who has released several CDs and cover songs. His band, Ivory Knight, was rated as one of Canada's top hard rock and metal groups by Brave Words magazine (issue #92).