Channel: Planet Python

Real Python: A Practical Introduction to Web Scraping in Python

Web scraping is the process of collecting and parsing raw data from the Web, and the Python community has come up with some pretty powerful web scraping tools.

The Internet hosts perhaps the greatest source of information on the planet. Many disciplines, such as data science, business intelligence, and investigative reporting, can benefit enormously from collecting and analyzing data from websites.

In this tutorial, you’ll learn how to:

  • Parse website data using string methods and regular expressions
  • Parse website data using an HTML parser
  • Interact with forms and other website components

Note: This tutorial is adapted from the chapter “Interacting With the Web” in Python Basics: A Practical Introduction to Python 3.

The book uses Python’s built-in IDLE editor to create and edit Python files and interact with the Python shell, so you’ll see occasional references to IDLE throughout this tutorial. However, you should have no problems running the example code from the editor and environment of your choice.

Source Code: Click here to download the free source code that you’ll use to collect and parse data from the Web.

Scrape and Parse Text From Websites

Collecting data from websites using an automated process is known as web scraping. Some websites explicitly forbid users from scraping their data with automated tools like the ones that you’ll create in this tutorial. Websites do this for two possible reasons:

  1. The site has a good reason to protect its data. For instance, Google Maps doesn’t let you request too many results too quickly.
  2. Making many repeated requests to a website’s server may use up bandwidth, slowing down the website for other users and potentially overloading the server such that the website stops responding entirely.
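One way to reduce the load described in the second point is to space your requests out. The following is a minimal sketch, not something from the original tutorial: the Throttle class and its min_interval parameter are hypothetical names chosen for illustration, and a suitable delay depends on the site you're working with.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self):
        """Sleep if the previous call was too recent; return seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept
```

You would call throttle.wait() immediately before each request, so that the first request goes out right away and later ones are delayed only when they come too soon.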

Before using your Python skills for web scraping, you should always check your target website’s acceptable use policy to see if accessing the website with automated tools is a violation of its terms of use. Legally, web scraping against the wishes of a website is very much a gray area.
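Many sites also publish a robots.txt file that tells automated clients which paths they may request. It's not a substitute for reading the terms of use, but you can check it from code. Here's a rough sketch using the standard library's urllib.robotparser module. The is_allowed() helper and the EXAMPLE_ROBOTS text are made up for illustration and don't come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the example rules above, a URL whose path starts with /private/ is reported as disallowed, while any other path is reported as allowed.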

Important: Please be aware that the following techniques may be illegal when used on websites that prohibit web scraping.

For this tutorial, you’ll use a page that’s hosted on Real Python’s server. The page that you’ll access has been set up for use with this tutorial.

Now that you’ve read the disclaimer, you can get to the fun stuff. In the next section, you’ll start grabbing all the HTML code from a single web page.

Build Your First Web Scraper

One useful package for web scraping that you can find in Python’s standard library is urllib, which contains tools for working with URLs. In particular, the urllib.request module contains a function called urlopen() that you can use to open a URL within a program.

In IDLE’s interactive window, type the following to import urlopen():

>>> from urllib.request import urlopen

The web page that you’ll open is at the following URL:

>>> url = "http://olympus.realpython.org/profiles/aphrodite"

To open the web page, pass url to urlopen():

>>> page = urlopen(url)

urlopen() returns an HTTPResponse object:

>>> page
<http.client.HTTPResponse object at 0x105fef820>

To extract the HTML from the page, first use the HTTPResponse object’s .read() method, which returns a sequence of bytes. Then use .decode() to decode the bytes to a string using UTF-8:

>>> html_bytes = page.read()
>>> html = html_bytes.decode("utf-8")
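If you'd rather run these steps from a script than from the interactive window, you could wrap them in a small function. This is only a sketch: fetch_html() and its encoding parameter are names chosen here, not part of urllib, and the timeout value is an arbitrary assumption.

```python
from urllib.request import urlopen


def fetch_html(url, encoding="utf-8", timeout=10):
    """Open url, read the raw bytes, and decode them to a string."""
    # The with statement closes the response once the body has been read.
    with urlopen(url, timeout=timeout) as page:
        html_bytes = page.read()
    return html_bytes.decode(encoding)
```

Calling fetch_html("http://olympus.realpython.org/profiles/aphrodite") should give you the same string that the interactive steps store in html, provided the page is reachable and encoded as UTF-8.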

Read the full article at https://realpython.com/python-web-scraping-practical-introduction/ »

