1.3 Web Scraping
- The goal is to extract data from websites
- Many ML datasets are obtained by web scraping
- Web crawling vs. scraping:
- Crawling: indexing whole pages on the Internet
- Scraping: extracting particular data from the web pages of a website
1.3.1 Web scraping tools
- “curl” often doesn’t work, since many sites detect and block simple scripted requests
- Use a headless browser (a browser without a GUI)
- You need a lot of new IPs, which are easy to get through public clouds
1.3.2 Case study: house price prediction
- Query houses sold near Stanford (replace the city and state in the URL for other places)
- Get the house IDs from the index pages
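from bs4 import BeautifulSoup
# html_path: path to a saved index page
with open(html_path, 'r') as f:
    page = BeautifulSoup(f, 'html.parser')
# Detail-page link of every listing card
links = [a['href'] for a in page.find_all('a', 'list-card-link')]
# '.../28362351_zpid/' -> '28362351'
ids = [l.split('/')[-2].split('_')[0] for l in links]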
Example HTML of one listing link on an index page:
<a href="https://www.zillow.com/homedetails/3602-Evergreen-Glade-Dr-Kingwood-TX-77339/28362351_zpid/" class="list-card-link list-card-link-top-margin list-card-img" tabindex="-1" aria-hidden="false"><img class="" src="https://photos.zillowstatic.com/fp/4a358e4b9b5b1ebafa12ce120b1944f0-p_e.jpg" alt="3602 Evergreen Glade Dr, Kingwood, TX 77339" aria-hidden="false"></a>
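The ID extraction can be checked against the sample link above without installing anything. The following is a minimal sketch that swaps BeautifulSoup for the standard-library `html.parser`; the names `CardLinkParser`, `extract_house_ids`, and `sample_link_html` are hypothetical, and the sample string is a shortened copy of the anchor tag shown above (the `<img>` child is omitted).

```python
from html.parser import HTMLParser


class CardLinkParser(HTMLParser):
    """Collect href values of <a> tags whose class list contains css_class."""

    def __init__(self, css_class="list-card-link"):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # The class attribute may hold several space-separated class names.
        classes = (attr.get("class") or "").split()
        if self.css_class in classes and attr.get("href"):
            self.links.append(attr["href"])


def extract_house_ids(html):
    """Return the ID part of every matching link, e.g. '28362351'."""
    parser = CardLinkParser()
    parser.feed(html)
    # The second-to-last URL segment looks like '28362351_zpid'.
    return [link.split("/")[-2].split("_")[0] for link in parser.links]


# Shortened copy of the sample anchor tag (assumption: same structure).
sample_link_html = (
    '<a href="https://www.zillow.com/homedetails/'
    '3602-Evergreen-Glade-Dr-Kingwood-TX-77339/28362351_zpid/" '
    'class="list-card-link list-card-link-top-margin list-card-img" '
    'tabindex="-1" aria-hidden="false"></a>'
)

print(extract_house_ids(sample_link_html))  # ['28362351']
```

Matching on the split class list rather than the raw attribute string keeps the check from depending on the order of the class names.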
- Identify the HTML elements that hold the data using the browser's Inspect tool
result = {}  # fields collected for one house
# Text of every <span> in the first <p> of the details chip
sold_items = [a.text for a in page.find('div', 'ds-home-details-chip').find('p').find_all('span')]
for item in sold_items:
    if 'Sold:' in item:
        result['Sold Price'] = item.split(' ')[1]
    if 'Sold on' in item:
        result['Sold On'] = item.split(' ')[-1]
- Repeat the previous process to extract the other fields.
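The field extraction above can be tried on plain strings. This is a small sketch under stated assumptions: the span texts in `sample_items` are made up to match the patterns the loop looks for ('Sold: <price>' and 'Sold on <date>'), not copied from a real page, and the helpers `parse_sold_items` and `price_to_int` are hypothetical names.

```python
def parse_sold_items(items):
    """Pick the sold price and sold date out of a list of span texts."""
    result = {}
    for item in items:
        text = item.strip()
        if 'Sold:' in text:
            # 'Sold: $1,000,000' -> '$1,000,000'
            result['Sold Price'] = text.split(' ')[1]
        if 'Sold on' in text:
            # 'Sold on 01/02/21' -> '01/02/21'
            result['Sold On'] = text.split(' ')[-1]
    return result


def price_to_int(price):
    """Convert a price string such as '$1,000,000' to an integer."""
    return int(price.replace('$', '').replace(',', ''))


# Made-up span texts (assumption: the real page uses this wording).
sample_items = ['Sold: $1,000,000', 'Sold on 01/02/21']
fields = parse_sold_items(sample_items)
print(fields)  # {'Sold Price': '$1,000,000', 'Sold On': '01/02/21'}
print(price_to_int(fields['Sold Price']))  # 1000000
```

Converting the price to a number is a likely next step for a price-prediction dataset, since the scraped value is still a formatted string.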
1.3.3 Crawl images
- Get all image URLs
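One possible way to collect the image URLs is to read the `src` attribute of every `<img>` tag. The sketch below uses the standard-library `html.parser`; `ImageURLParser`, `extract_image_urls`, and `sample_img_html` are hypothetical names, and the sample string reuses the `<img>` tag from the listing HTML in the case study. Downloading the files is left out.

```python
from html.parser import HTMLParser


class ImageURLParser(HTMLParser):
    """Collect the src of every <img> tag, without duplicates, in page order."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src and src not in self.urls:
                self.urls.append(src)


def extract_image_urls(html):
    parser = ImageURLParser()
    parser.feed(html)
    return parser.urls


# The <img> tag from the sample listing link, repeated once to show de-duplication.
img_tag = (
    '<img class="" src="https://photos.zillowstatic.com/fp/'
    '4a358e4b9b5b1ebafa12ce120b1944f0-p_e.jpg" '
    'alt="3602 Evergreen Glade Dr, Kingwood, TX 77339">'
)
sample_img_html = '<div>' + img_tag + img_tag + '</div>'

print(extract_image_urls(sample_img_html))
# ['https://photos.zillowstatic.com/fp/4a358e4b9b5b1ebafa12ce120b1944f0-p_e.jpg']
```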
1.3.4 Summary
- Web scraping is a powerful way to collect data at scale when the website doesn’t offer a data API
- Low cost if using public clouds
- Use the browser’s inspection tools to locate the information in the HTML
- Take care to use it properly