1.3 Web Scraping

  • The goal is to extract data from websites
  • Many ML datasets are obtained by web scraping
  • Web crawling vs. web scraping

    • Crawling: indexing whole pages on the Internet
    • Scraping: extracting particular data from the web pages of a website

1.3.1 Web scraping tools

  • “curl” often doesn’t work (many pages are rendered with JavaScript or block simple clients)
  • Use a headless browser instead
  • You need a lot of new IPs, which are easy to get through public clouds
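A minimal sketch of the headless-browser approach: render the page with Selenium-driven headless Chrome, then parse the resulting HTML with BeautifulSoup. This assumes `selenium`, `beautifulsoup4`, and a local Chrome/chromedriver are installed; the URL and CSS class below are illustrative.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Return the HTML of `url` after JavaScript has run, via headless Chrome."""
    # Imported lazily: requires the selenium package and a local browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source  # fully rendered HTML
    finally:
        driver.quit()


def extract_links(html, css_class):
    """Collect the href of every <a> tag carrying `css_class`."""
    page = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in page.find_all("a", css_class)]


if __name__ == "__main__":
    # Hypothetical usage; the exact URL pattern depends on the target site.
    html = fetch_rendered_html("https://www.zillow.com/stanford-ca/sold/")
    print(extract_links(html, "list-card-link"))
```

Fetching through the driver (rather than `curl`) is what makes JavaScript-rendered listings visible to the parser.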

1.3.2 Case study: house price prediction

  • Query houses sold near Stanford (replace the city and state in the URL for other places)
  • Get the house IDs from the index pages
from bs4 import BeautifulSoup

page = BeautifulSoup(open(html_path, 'r'), 'html.parser')
# Each result card links to a detail page whose URL ends in ".../<id>_zpid/"
links = [a['href'] for a in page.find_all('a', 'list-card-link')]
ids = [l.split('/')[-2].split('_')[0] for l in links]

An example of the HTML being matched:

<a href="https://www.zillow.com/homedetails/3602-Evergreen-Glade-Dr-Kingwood-TX-77339/28362351_zpid/" class="list-card-link list-card-link-top-margin list-card-img" tabindex="-1" aria-hidden="false"><img class="" src="https://photos.zillowstatic.com/fp/4a358e4b9b5b1ebafa12ce120b1944f0-p_e.jpg" alt="3602 Evergreen Glade Dr, Kingwood, TX 77339" aria-hidden="false"></a>
  • Identify the HTML elements with the browser’s inspect tool
sold_items = [a.text for a in page.find('div', 'ds-home-details-chip').find('p').find_all('span')]
result = {}
for item in sold_items:
    if 'Sold:' in item:
        result['Sold Price'] = item.split(' ')[1]
    if 'Sold on' in item:
        result['Sold On'] = item.split(' ')[-1]
  • Repeat the previous process to extract the other fields
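The per-page extraction can be wrapped in a function and applied to every detail page. A sketch, assuming `beautifulsoup4` is installed and mirroring the span-matching logic above (the class names come from inspecting Zillow pages and may change):

```python
from bs4 import BeautifulSoup


def parse_detail_page(html):
    """Extract 'Sold Price' and 'Sold On' from a house detail page's HTML."""
    page = BeautifulSoup(html, "html.parser")
    result = {}
    chip = page.find("div", "ds-home-details-chip")
    if chip is None:  # page layout changed or listing not sold
        return result
    for item in (span.text for span in chip.find("p").find_all("span")):
        if "Sold:" in item:
            result["Sold Price"] = item.split(" ")[1]
        if "Sold on" in item:
            result["Sold On"] = item.split(" ")[-1]
    return result


# Hypothetical usage: collect one row per house ID into a list of dicts,
# assuming each page's HTML has already been fetched and saved locally.
# rows = [parse_detail_page(open(f"{i}.html").read()) for i in ids]
```

Returning a dict per page makes it easy to assemble the rows into a table (e.g. a pandas DataFrame) later.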

1.3.3 Crawl images

  • Get all image URLs
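Image URLs can be collected the same way as the links above: find the `<img>` tags and read their `src` attributes. A minimal sketch, assuming `beautifulsoup4` is installed and that listing photos are plain `<img>` tags (as in the sample HTML earlier):

```python
from bs4 import BeautifulSoup


def image_urls(html):
    """Return the src attribute of every <img> tag that has one."""
    page = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in page.find_all("img") if img.has_attr("src")]
```

The resulting URLs can then be downloaded one by one (e.g. with `urllib.request.urlretrieve`), ideally with a delay between requests to avoid hammering the site.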

1.3.4 Summary

  • Web scraping is a powerful way to collect data at scale when the website doesn’t offer a data API
  • Low cost if using public clouds
  • Use browser’s inspection tools to locate the information in HTML
  • Be careful to use it legally and responsibly