TDM 20200: Project 02 — 2024

Motivation: Web scraping is the process of extracting content from the internet. Typically this goes hand-in-hand with parsing or processing the data. Scraping data from websites has always been a popular topic in The Data Mine. We will use the website https://books.toscrape.com to practice scraping skills.

Context: In the previous project we gently introduced XML and XPath to parse an XML document. In this project, we introduce web scraping. We will learn some basic web scraping skills using BeautifulSoup.

Scope: Python, web scraping, BeautifulSoup

Learning Objectives
  • Understand webpage structures

  • Use BeautifulSoup to scrape data from web pages

Readings and Resources

  • Make sure to read about, and use, the project template, and review the important information about project submissions.

  • This link will provide you more information about the Python requests library

  • This link will provide you more information about BeautifulSoup

Dr. Ward created 9 videos to help with this project. You might want to take a look at these videos.

Questions

Question 1 (2 points)

  1. Please use BeautifulSoup to get and display the HTML source code of the website https://books.toscrape.com

  2. Review the website’s HTML source code. What is the title for that webpage?

You may refer to the following code to import the libraries; modify it to fit your own code.

import requests
from bs4 import BeautifulSoup

# define the url to scrape
url = "https://books.toscrape.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
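
The following is a minimal sketch, assuming the soup object from the snippet above: prettify renders the full HTML source, and the page title can be read from the title tag.

# display the full HTML source code (prettify adds indentation for readability)
print(soup.prettify())

# the page title lives in the <title> tag inside <head>
print(soup.title.string)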

Question 2 (2 points)

  1. Please use the BeautifulSoup library to get and display all of the category names from the homepage of the website.

  • Review the page source code to find the categories, which are located in the sidebar under a div tag with class nav-list. The BeautifulSoup select method is useful for getting these categories, like this:

soup.select('.nav-list li a')
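
The following is a minimal sketch, assuming the soup object from Question 1 and that each category name is the text of an anchor returned by the selector above:

# loop over every category link in the sidebar and print its (stripped) text
for category in soup.select('.nav-list li a'):
    print(category.get_text(strip=True))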

Question 3 (2 points)

  1. Now, instead of only getting the names of the categories, get all of the category links from the homepage as well.

    • Review the homepage source code to explore where the category links are located. You may use find to get the relevant div tag.

    soup.find('div', class_='side_categories')
    • Within that div tag, the links can be found in "a" tags. You may use find_all to get all of the category links.

      find_all('a')
    • You may refer to the following code to exclude the top-level "Books" link from the category list, since it is not itself a category (as shown in the sketch after this question).

    • Assume link holds the link information for a category.

    link.get('href').startswith("catalogue/category/books/")
    • The format of the output for this question should look like:

    Category: Travel,link: catalogue/category/books/travel_2/index.html
    Category: Mystery,link: catalogue/category/books/mystery_3/index.html
    ....
  2. Update the code from question 3a to get (only) the link for the "Romance" category.

  • The output will look like:

romance_url is https://books.toscrape.com/catalogue/category/books/romance_8/index.html
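
The following is a minimal sketch that stitches the hints above together; it assumes the soup object from Question 1, and the variable names side_div and romance_url are illustrative only:

# Question 3a: find the sidebar div that holds the category list
side_div = soup.find('div', class_='side_categories')

romance_url = None
for link in side_div.find_all('a'):
    href = link.get('href')
    # keep only the sub-category links; the top-level "Books" entry does not match
    if href.startswith("catalogue/category/books/"):
        name = link.get_text(strip=True)
        print(f"Category: {name},link: {href}")
        # Question 3b: build the full link for the Romance category
        if name == "Romance":
            romance_url = "https://books.toscrape.com/" + href

print(f"romance_url is {romance_url}")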

Question 4 (2 points)

  1. Use the "Romance" link https://books.toscrape.com/catalogue/category/books/romance_8/index.html from Question 3b to get the source code of the Romance category web page.

  2. Display all of the book titles on the first page of the Romance category.
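
The following is a minimal sketch for Question 4, assuming each book on the category page sits in an article tag with class product_pod and that the full title is stored in the title attribute of the anchor inside its h3 tag (check the page source to confirm):

import requests
from bs4 import BeautifulSoup

romance_url = "https://books.toscrape.com/catalogue/category/books/romance_8/index.html"
romance_soup = BeautifulSoup(requests.get(romance_url).content, 'html.parser')

# each book is an <article class="product_pod">; its title is on the h3 > a tag
for book in romance_soup.select('article.product_pod h3 a'):
    print(book.get('title'))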

Question 5 (2 points)

If you look at this page: http://books.toscrape.com/catalogue/category/books/romance_8/index.html

you can see, in the lower-right-hand corner, that the link to the second page is http://books.toscrape.com/catalogue/category/books/romance_8/page-2.html.

Now temporarily forget that you know this fact! We want you to try to find this page-2 link in the Romance book page.

  1. Starting with http://books.toscrape.com/catalogue/category/books/romance_8/index.html, please find the page-2 link from the Romance category web page using BeautifulSoup.

    The following is some sample code, for your reference.

    # need to remove the last part from the base url
    url = "http://books.toscrape.com/catalogue/category/books/romance_8/index.html"
    url = url.rsplit('/', 1)[0]
    # assume next_hyperlink holds the href of the next page, e.g. "category/page-2.html";
    # you only need to keep the last part of that href
    next_link = next_hyperlink.split("/")[-1]
    # combine the base url and the last part into the full page-2 url
    next_url = url + "/" + next_link
  2. List the titles of all of the books from the second page of the "Romance" category.
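
The following is a minimal sketch covering both parts of Question 5, assuming the pager's next button is an li tag with class next (verify this in the page source) and the same product_pod structure as in Question 4:

import requests
from bs4 import BeautifulSoup

url = "http://books.toscrape.com/catalogue/category/books/romance_8/index.html"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# the "next" button in the pager holds the relative href of page 2
next_hyperlink = soup.find('li', class_='next').find('a').get('href')
next_url = url.rsplit('/', 1)[0] + "/" + next_hyperlink.split("/")[-1]
print(next_url)

# fetch page 2 and list every book title on it
page2_soup = BeautifulSoup(requests.get(next_url).content, 'html.parser')
for book in page2_soup.select('article.product_pod h3 a'):
    print(book.get('title'))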

Project 02 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and output for the assignment

    • firstname-lastname-project02.ipynb

  • Submit files through Gradescope

Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.