How to Download HTML and Assets from a URL with Python

Step-by-Step Guide to Download HTML and Website Assets with Python

When scraping the web or analyzing a website offline, you often need more than a page's HTML: you also need the assets it references, such as CSS files, JavaScript, images, and fonts. In this guide, we'll use requests to make HTTP requests and BeautifulSoup to parse the HTML, then download each asset and save it to an appropriate directory.

Here’s a step-by-step guide to help you download HTML and associated assets from a URL using Python.

Prerequisites

Ensure you have the required libraries installed. You can install them using pip if you haven't already:

pip install requests beautifulsoup4

Step-by-Step Script

  1. Import Required Libraries

    We'll need requests to fetch the webpage and its assets, BeautifulSoup to parse the HTML, and some standard libraries to handle files and directories.

     import requests
     from bs4 import BeautifulSoup
     import os
     from urllib.parse import urljoin
    
  2. Define the URL and Fetch HTML Content

    Define the URL from which you want to download the HTML and assets. Use requests to fetch the HTML content.

     # Define the URL
     url = 'http://example.com'
    
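     # Fetch the HTML content; the timeout stops the script hanging on a dead server
     response = requests.get(url, timeout=10)
     response.raise_for_status()  # Stop early on 4xx/5xx responses
     html_content = response.text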
    
  3. Save the HTML Content

    Write the fetched HTML content to a file named index.html.

     # Save HTML content
     with open('index.html', 'w', encoding='utf-8') as file:
         file.write(html_content)
    
  4. Parse HTML Content

    Use BeautifulSoup to parse the HTML content for extracting asset URLs.

     # Parse HTML content
     soup = BeautifulSoup(html_content, 'html.parser')
    
  5. Create Directories for Assets

    Create directories to store CSS, JavaScript, images, and other assets.

     # Create directories to save CSS, JS, images, and other assets
     os.makedirs('css', exist_ok=True)
     os.makedirs('js', exist_ok=True)
     os.makedirs('images', exist_ok=True)
     os.makedirs('assets', exist_ok=True)  # For other assets
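
Once the directories exist, it can be handy to pick the target directory from an asset's file extension instead of hard-coding it at each call site. The following is a minimal sketch of that idea; the DIRECTORY_BY_EXTENSION table and the directory_for helper are hypothetical names introduced here for illustration, and the extension list is an assumption you would adapt to your site.

```python
import os
from urllib.parse import urlsplit

# Hypothetical extension-to-directory table; extend it to suit the site.
DIRECTORY_BY_EXTENSION = {
    '.css': 'css',
    '.js': 'js',
    '.mjs': 'js',
    '.png': 'images',
    '.jpg': 'images',
    '.jpeg': 'images',
    '.gif': 'images',
    '.svg': 'images',
    '.webp': 'images',
}


def directory_for(asset_url):
    """Return the directory name for an asset URL, based on its extension."""
    # Use only the path component so a query string such as ?v=2 is ignored.
    path = urlsplit(asset_url).path
    extension = os.path.splitext(path)[1].lower()
    # Anything not in the table (fonts, videos, manifests) goes to 'assets'.
    return DIRECTORY_BY_EXTENSION.get(extension, 'assets')
```

With this helper, any asset URL can be routed with directory_for(asset_url) rather than a literal directory name.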
    
  6. Define a Function to Download Files

    This function will download files from a URL and save them to the specified directory.

     # Function to download a file and save it in the given directory
     def download_file(file_url, directory):
         try:
             response = requests.get(file_url, timeout=10)
             response.raise_for_status()  # Raise on 4xx/5xx responses
             # Drop any fragment or query string before taking the file name
             clean_url = file_url.split('#', 1)[0].split('?', 1)[0]
             file_name = os.path.basename(clean_url) or 'index.html'
             file_path = os.path.join(directory, file_name)
             with open(file_path, 'wb') as file:
                 file.write(response.content)
             print(f"Downloaded: {file_url}")
         except (requests.RequestException, OSError) as e:
             print(f"Error downloading {file_url}: {e}")
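
Saving every file under its base name means two assets with the same name in different folders (for example /a/logo.png and /b/logo.png) overwrite each other. One possible way around this is to mirror the URL's path under the target directory. The sketch below assumes that approach; local_path_for is a hypothetical helper, not part of requests or BeautifulSoup, and it uses only the standard library.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit


def local_path_for(file_url, directory):
    """Map an asset URL to a path under `directory` that mirrors the URL path."""
    # urlsplit separates the path from the query string and fragment;
    # unquote turns escapes such as %20 back into plain characters.
    url_path = unquote(urlsplit(file_url).path)
    # Normalise the path and drop empty, '.' and '..' segments so the
    # result cannot point outside `directory`.
    segments = [
        segment
        for segment in posixpath.normpath(url_path).split('/')
        if segment not in ('', '.', '..')
    ]
    # URLs that name a folder (or nothing at all) get a default file name.
    if not segments or url_path.endswith('/'):
        segments.append('index.html')
    return os.path.join(directory, *segments)
```

Before writing to the returned path, create its parent folders with os.makedirs(os.path.dirname(path), exist_ok=True).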
    
  7. Find and Download CSS Files

    Locate CSS files and download them.

     # Find and download CSS files
     for link in soup.find_all('link', href=True):
         if 'stylesheet' in link.get('rel', []):
             css_url = urljoin(url, link['href'])
             download_file(css_url, 'css')
    
  8. Find and Download JavaScript Files

    Locate JavaScript files and download them.

     # Find and download JavaScript bundles
     for script in soup.find_all('script', src=True):
         js_url = urljoin(url, script['src'])
         download_file(js_url, 'js')
    
  9. Find and Download Images

    Locate image files and download them.

     # Find and download images
     for img in soup.find_all('img', src=True):
         img_url = urljoin(url, img['src'])
         download_file(img_url, 'images')
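
Responsive pages often list extra image variants in the srcset attribute, which the loop above does not read. The sketch below shows one way to turn a srcset value into absolute URLs. The srcset_urls helper is a hypothetical name, and splitting on commas is a simplifying assumption: it will mis-handle URLs that themselves contain commas.

```python
from urllib.parse import urljoin


def srcset_urls(srcset, base_url):
    """Return absolute URLs for every candidate in a srcset attribute value."""
    urls = []
    # Each candidate looks like "path/to/image.png 2x" or "image.png 480w".
    for candidate in srcset.split(','):
        parts = candidate.strip().split()
        if parts:
            # The first token is the URL; the rest is the width or density.
            urls.append(urljoin(base_url, parts[0]))
    return urls
```

To use it with the script, loop over soup.find_all('img', srcset=True) and pass each URL from srcset_urls(img['srcset'], url) to download_file.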
    
  10. Find and Download Other Assets

    Handle other assets like fonts, videos, and icons.

    # Find and download icons and web app manifests
    for link in soup.find_all('link', href=True):
        if 'icon' in link.get('rel', []) or 'manifest' in link.get('rel', []):
            asset_url = urljoin(url, link['href'])
            download_file(asset_url, 'assets')
    
    # Example for handling video files, including <source> children that carry the src
    for media in soup.find_all(['video', 'source'], src=True):
        video_url = urljoin(url, media['src'])
        download_file(video_url, 'assets')
    
    # Example for handling font files referenced by <link> tags (e.g., rel="preload")
    for link in soup.find_all('link', href=True):
        if link['href'].endswith(('.woff', '.woff2', '.ttf', '.otf')):
            font_url = urljoin(url, link['href'])
            download_file(font_url, 'assets')
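
Fonts and background images are usually declared inside stylesheets through url(...) references rather than in the HTML itself, so the loops above will miss them. As a rough sketch, the downloaded CSS text can be scanned with a regular expression. The css_asset_urls helper and its pattern are assumptions made for illustration; a regex is not a full CSS parser and can miss unusual syntax.

```python
import re
from urllib.parse import urljoin

# Matches url(...), with or without quotes around the reference.
CSS_URL_RE = re.compile(r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)""", re.IGNORECASE)


def css_asset_urls(css_text, css_url):
    """Return absolute URLs referenced via url(...) in a stylesheet."""
    found = []
    for match in CSS_URL_RE.finditer(css_text):
        reference = match.group(1).strip()
        if reference.startswith('data:'):
            continue  # Inline data, nothing to download
        # Relative references resolve against the stylesheet's own URL,
        # not against the page URL.
        found.append(urljoin(css_url, reference))
    return found
```

Each returned URL can then be passed to download_file, with the stylesheet's URL (css_url in step 7) supplied as the base.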
    
    💡 Feel free to tweak and extend this script to handle more complex scenarios or additional types of assets.

Conclusion

This script is a simple starting point for downloading a page's HTML and the assets referenced in it. With requests and BeautifulSoup, you can fetch and save web content for offline analysis or replication. Adjust the directories and asset handling to fit the requirements of your project. Happy scraping!