Web Scraping with Python

Background

Web scraping simply means extracting data from websites, documents, and files online. The extracted data can be used for personal projects, machine-learning datasets, marketing analysis, and more.
Many automated tools are available for this, but if you want to learn how to do it with Python, read on.

We’ll use Beautiful Soup, one of Python’s easiest web-scraping libraries.

What Do You Need To Scrape the Web?

  1. Requests library: Requests is the de facto standard for making HTTP requests in Python (a third-party package, not part of the standard library). With it, we can make HTTP requests to fetch the HTML of the website we want to scrape.

  2. Beautiful Soup: Beautiful Soup helps us parse HTML and XML documents. It creates a parse tree for parsed pages that we can use to extract data from targeted websites. We’ll use Beautiful Soup 4 (the bs4 package); you don’t need the older Beautiful Soup 3.

  3. Before we start, make sure you have Python 3.x installed on your computer. If so, let’s start the tutorial by setting up Beautiful Soup!

Requests gets the data from a web server, and Beautiful Soup is like a bowl that holds all of that data in a parsed, organized form, making data extraction much easier.


Setting up the libraries

  • Install the Requests library: Run the following command in your command prompt or terminal:

      pip install requests
    
  • Install Beautiful Soup: Run the following command in your command prompt or terminal:

      pip install beautifulsoup4
    
  • Install a parser: We need a parser to extract data from HTML documents. In this guide, we’ll use the lxml parser. To install this parser, run the following command:

      pip install lxml
    
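To confirm the setup worked, you can run a quick check. A minimal sketch (the exact version numbers will vary on your machine):

      import requests
      import bs4
      from lxml import etree

      # Print the installed versions to confirm each library is importable
      print(requests.__version__)
      print(bs4.__version__)
      print(etree.__version__)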

Section 1: Scraping a Single Page

I’ll guide you through each line of the code we need to write to build our first scraper. You can find the full code at the end of this article. Let’s get started!

Importing libraries

The following are the libraries required to scrape with Beautiful Soup:

      from bs4 import BeautifulSoup
      import requests
    

Get the HTML of the website

  • We’re going to scrape a website that contains thousands of pages of movie lists with their transcripts. We’ll start by scraping one page, and then we’ll see how to scrape multiple pages.

    First, we define the link. In this case, I’m choosing the transcript of the movie Titanic, but you can choose any movie you want.

  • Request the content from the website using Requests. requests.get() sends a GET request and returns a response object, which we save in the result variable. We don’t need the response code itself; we need the content, i.e., the text inside the response, so we use the .text attribute to obtain the HTML of the website. Finally, we use the lxml parser to obtain the soup: the object that contains all the data in a nested structure, which we’re going to reuse later.

      website = 'https://subslikescript.com/movie/Titanic-120338'
      result = requests.get(website)  # send a GET request to the page
      content = result.text           # the HTML content of the response
      soup = BeautifulSoup(content, 'lxml')
      # BeautifulSoup parses the HTML so we can pull data out of it;
      # soup is the object storing the parsed HTML content.
      print(soup.prettify())
    

Once we have the soup object, we can easily obtain the HTML in a readable format by using .prettify(). Although we could search the printed HTML in a text editor to locate elements, it’s much better to go straight to the HTML code of the specific element we want. We’ll do this in the next step.


Analyzing the website and HTML code

An important step before we move on to writing code is analyzing the website we want to scrape and the HTML code obtained in order to find the best approach to scrape the website. Below, you can find a screenshot of one transcript. The elements to scrape are the movie title and the transcript.

Movie transcript

To obtain the HTML code of a specific element, follow these steps:

  1. Go to the website of the transcript you want.

  2. Hover over either the movie title or the transcript and then right-click. A menu will be displayed; choose “Inspect” to open up the source code of the page.

Choosing “Inspect”

Below, you can find a smaller version of the HTML code obtained after clicking inspect. We’re going to use this HTML code as a reference for locating elements in the next step.

Locating an element with Beautiful Soup

Locating an element in Beautiful Soup is simple. You just have to apply the .find() method to the soup created before.

As an example, let’s locate the box that contains the movie title, description, and transcript. It’s inside an article tag and has a class named “main-article”. We locate that box with the following code:

box = soup.find('article', class_='main-article')

Now let’s locate the movie title and transcript. The movie’s name is inside an h1 tag and doesn’t have a class name. After we locate it, we use the .text attribute to obtain the text inside the node and store it in a title variable:

title = ' '.join(box.h1.text.split()[:-3])
# box.h1.text returns the whole heading, e.g. "Titanic (1997) - full transcript".
# split() breaks it into words, [:-3] drops the trailing "- full transcript",
# and ' '.join() rebuilds the movie title as a single string.

The transcript is inside a div tag and has a class named "full-script". To obtain the text in this case, we’re going to modify the default parameters of the .get_text() method. First, we set strip=True to remove leading and trailing spaces. Then we add a blank space as the separator (separator=' ') so words have a blank space after every newline (\n).

transcript = box.find('div', class_='full-script')
transcript = transcript.get_text(strip=True, separator=' ')

So far, we have successfully scraped the data. Print the title and transcript variables to make sure everything is working fine until now.
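A quick sanity check might look like this (previewing only part of the transcript, since it’s long):

print(title)
print(transcript[:500])  # the transcript is long, so preview the first 500 characters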

Exporting the data to a .txt file

You can export the data scraped to reuse it later if you want. You can store data in CSV, JSON, and more formats. For this example, I’ll store the data extracted in a .txt file.

To do so, we need to use the with keyword as shown in the code below:

with open(f'{title}.txt', 'w') as file:
    file.write(transcript)

Keep in mind that I’m using an f-string to set the file name to the movie title. After running the code, a .txt file should appear in your working directory.
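One caveat: some movie titles contain characters that aren’t valid in file names (such as / or :). A minimal sketch of one way to guard against this, using a hypothetical sanitize helper built on Python’s re module:

import re

def sanitize(filename):
    # Replace characters that are invalid in file names with an underscore
    return re.sub(r'[\\/*?:"<>|]', '_', filename)

with open(f'{sanitize(title)}.txt', 'w') as file:
    file.write(transcript)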

Now that we’ve successfully scraped data from one page, we’re ready to scrape transcripts from multiple pages!


Section 2: Scraping Multiple Pages and Movie Names

Below, you can find the screenshot of the first page of the website with movie transcripts. The website has 1,234 pages with around 30 movie transcripts per page.

List of movie transcripts

In this second part, I’m going to show you how to scrape multiple links by getting the href attribute of each link. First, we have to change the website we’re scraping. The link to the website shown above is subslikescript.com/movies.

Our new website variable will be the following:

root = 'https://subslikescript.com'
website = f'{root}/movies'

I also defined a root variable that will help us scrape multiple pages later.

Getting the href attribute

First, let’s get the href attribute of each of the 30 movies listed on one page. To do so, inspect any movie title inside the “List of Movie Transcripts” box shown in the screenshot above.

After this, you should see the HTML code, with an a tag highlighted in blue. Each a tag corresponds to a movie title.

Getting href attributes

As you can see, the links inside the href attributes don’t contain the root subslikescript.com. This is why I defined the root variable earlier: to concatenate it with each link later.

Let’s locate all the a elements on the page.
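Before locating them, we have to request this new listing page and build a fresh soup and box, just as we did in Section 1. A minimal sketch, assuming the “List of Movie Transcripts” box on this site also sits in an article tag with the main-article class:

result = requests.get(website)
content = result.text
soup = BeautifulSoup(content, 'lxml')
# Assumption: the listing box uses the same 'main-article' class as transcript pages
box = soup.find('article', class_='main-article')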

Locating multiple elements with Beautiful Soup

To locate multiple elements in Beautiful Soup, we have to use the .find_all() method. We need to add the parameter href=True to be able to extract the link that corresponds to each movie transcript.

box.find_all('a', href=True)

To extract the links from the href, we have to add ['href'] to the expression above. However, the .find_all() method will return a list, so we have to loop through it and get the hrefs one by one inside the loop.

for link in box.find_all('a', href=True):
    print(link['href'])  # print each movie's href one by one

To store the links, we can use list comprehension, as shown below:

links = [link['href'] for link in box.find_all('a', href=True)]
print(links)

If you print the links list, you will see the links we want to scrape. We’ll scrape each page in the next step.

To scrape the transcript of each link, we’ll follow the same steps we did before for one transcript. This time, we’ll include those steps inside the following for loop.

for link in links:
    result = requests.get(f'{root}/{link}')
    content = result.text
    soup = BeautifulSoup(content, 'lxml')

As you might remember, the links we stored previously don’t contain the root subslikescript.com, so we have to concatenate it with the expression f’{root}/{link}’.

The rest of the code is the same as we wrote for the first part of this guide. The full code for this project is the following:
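Below is a sketch assembling all the steps above into a single script (it scrapes the first listing page only and writes one .txt file per transcript):

from bs4 import BeautifulSoup
import requests

root = 'https://subslikescript.com'
website = f'{root}/movies'

# Get the listing page and locate the box that holds the movie links
result = requests.get(website)
content = result.text
soup = BeautifulSoup(content, 'lxml')
box = soup.find('article', class_='main-article')

# Collect the href of every movie listed on the page
links = [link['href'] for link in box.find_all('a', href=True)]

for link in links:
    # Request each transcript page (the hrefs don't include the root)
    result = requests.get(f'{root}/{link}')
    content = result.text
    soup = BeautifulSoup(content, 'lxml')

    # Locate the main box, then extract the title and the transcript
    box = soup.find('article', class_='main-article')
    title = ' '.join(box.h1.text.split()[:-3])
    transcript = box.find('div', class_='full-script')
    transcript = transcript.get_text(strip=True, separator=' ')

    # Export each transcript to a .txt file named after the movie
    with open(f'{title}.txt', 'w') as file:
        file.write(transcript)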

In case you want to navigate through the pages listed on the website, you have two options:

  • Option 1: Inspect any of the page numbers displayed on the website (e.g., 1, 2, 3, … 1234). You should obtain an a tag that contains an href attribute with the link for each page. Once you have the links, concatenate them with the root and follow the steps shown in Section 2.

  • Option 2: Go to page 2 and copy the link obtained. It should look like this: subslikescript.com/movies?page=2. As you can see, there’s a pattern the website follows for each page: f'{website}?page={i}'. You can reuse the website variable and loop over the numbers 1 to 10 if you want to navigate the first ten pages, as shown in the sketch below.
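A minimal sketch of Option 2, assuming the ?page= pattern holds for every page (looping over the first ten pages only):

for i in range(1, 11):
    # Request each listing page following the ?page= pattern
    result = requests.get(f'{website}?page={i}')
    content = result.text
    soup = BeautifulSoup(content, 'lxml')

    # From here, collect the links and scrape each transcript as shown above
    box = soup.find('article', class_='main-article')
    links = [link['href'] for link in box.find_all('a', href=True)]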