1.3 Web Scraping

  • The goal is to extract data from websites
  • Many ML datasets are obtained by web scraping
  • Web crawling vs. web scraping

    • Crawling: indexing whole pages on the Internet
    • Scraping: extracting particular data from the web pages of a website

1.3.1 Web scraping tools

  • “curl” often doesn’t work (many pages are rendered with JavaScript or block simple clients)
  • Use a headless browser instead
  • You need a lot of new IPs, which are easy to get through public clouds
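A minimal sketch of the headless-browser approach: render the page with Selenium-driven headless Chrome, then parse the resulting HTML with BeautifulSoup. This assumes `selenium`, `beautifulsoup4`, and a local Chrome/chromedriver are installed; the URL and CSS class below are illustrative.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Return the HTML of `url` after JavaScript has run, via headless Chrome."""
    # Imported lazily: requires the selenium package and a local browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source  # fully rendered HTML
    finally:
        driver.quit()


def extract_links(html, css_class):
    """Collect the href of every <a> tag carrying `css_class`."""
    page = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in page.find_all("a", css_class)]


if __name__ == "__main__":
    # Hypothetical usage; the exact URL pattern depends on the target site.
    html = fetch_rendered_html("https://www.zillow.com/stanford-ca/sold/")
    print(extract_links(html, "list-card-link"))
```

Fetching through the driver (rather than `curl`) is what makes JavaScript-rendered listings visible to the parser.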

1.3.2 Case study: house price prediction

  • Query houses sold near Stanford (replace the city and state in the URL for other places)
  • Get the house IDs from the index pages
from bs4 import BeautifulSoup

page = BeautifulSoup(open(html_path, 'r'), 'html.parser')
# Each result card links to a detail page whose URL ends in ".../<id>_zpid/"
links = [a['href'] for a in page.find_all('a', 'list-card-link')]
ids = [l.split('/')[-2].split('_')[0] for l in links]

An example of the HTML being matched:

<a href="https://www.zillow.com/homedetails/3602-Evergreen-Glade-Dr-Kingwood-TX-77339/28362351_zpid/" class="list-card-link list-card-link-top-margin list-card-img" tabindex="-1" aria-hidden="false"><img class="" src="https://photos.zillowstatic.com/fp/4a358e4b9b5b1ebafa12ce120b1944f0-p_e.jpg" alt="3602 Evergreen Glade Dr, Kingwood, TX 77339" aria-hidden="false"></a>
  • Identify the HTML elements with the browser’s inspect tool
sold_items = [a.text for a in page.find('div', 'ds-home-details-chip').find('p').find_all('span')]
result = {}
for item in sold_items:
    if 'Sold:' in item:
        result['Sold Price'] = item.split(' ')[1]
    if 'Sold on' in item:
        result['Sold On'] = item.split(' ')[-1]
  • Repeat the previous process to extract the other fields
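The per-page extraction can be wrapped in a function and applied to every detail page. A sketch, assuming `beautifulsoup4` is installed and mirroring the span-matching logic above (the class names come from inspecting Zillow pages and may change):

```python
from bs4 import BeautifulSoup


def parse_detail_page(html):
    """Extract 'Sold Price' and 'Sold On' from a house detail page's HTML."""
    page = BeautifulSoup(html, "html.parser")
    result = {}
    chip = page.find("div", "ds-home-details-chip")
    if chip is None:  # page layout changed or listing not sold
        return result
    for item in (span.text for span in chip.find("p").find_all("span")):
        if "Sold:" in item:
            result["Sold Price"] = item.split(" ")[1]
        if "Sold on" in item:
            result["Sold On"] = item.split(" ")[-1]
    return result


# Hypothetical usage: collect one row per house ID into a list of dicts,
# assuming each page's HTML has already been fetched and saved locally.
# rows = [parse_detail_page(open(f"{i}.html").read()) for i in ids]
```

Returning a dict per page makes it easy to assemble the rows into a table (e.g. a pandas DataFrame) later.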

1.3.3 Crawl images

  • Get all image URLs
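Image URLs can be collected the same way as the links above: find the `<img>` tags and read their `src` attributes. A minimal sketch, assuming `beautifulsoup4` is installed and that listing photos are plain `<img>` tags (as in the sample HTML earlier):

```python
from bs4 import BeautifulSoup


def image_urls(html):
    """Return the src attribute of every <img> tag that has one."""
    page = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in page.find_all("img") if img.has_attr("src")]
```

The resulting URLs can then be downloaded one by one (e.g. with `urllib.request.urlretrieve`), ideally with a delay between requests to avoid hammering the site.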

1.3.4 Summary

  • Web scraping is a powerful way to collect data at scale when the website doesn’t offer a data API
  • Low cost if using public clouds
  • Use browser’s inspection tools to locate the information in HTML
  • Be careful to use it legally and responsibly