How to Download HTML and Assets from a URL with Python
Step-by-Step Guide to Download HTML and Website Assets with Python
When scraping the web or analyzing a site offline, you may need to download not only a page's HTML but also its associated assets, such as CSS files, JavaScript, images, and fonts. Python's library ecosystem makes this straightforward. In this guide, we'll use requests for making HTTP requests and BeautifulSoup for parsing HTML. We'll also download the assets and save them to the appropriate directories.
Here’s a step-by-step guide to help you download HTML and associated assets from a URL using Python.
Prerequisites
Ensure you have the required libraries installed. You can install them using pip if you haven't already:
pip install requests beautifulsoup4
Step-by-Step Script
Import Required Libraries
We'll need requests to fetch the webpage and its assets, BeautifulSoup to parse the HTML, and some standard libraries to handle files and directories.

    import os
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup
Define the URL and Fetch HTML Content
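Define the URL from which you want to download the HTML and assets, then use requests to fetch the HTML content. A timeout keeps the script from hanging on an unresponsive server, and raise_for_status() stops it early if the server returns an error status.

    # Define the URL
    url = 'http://example.com'

    # Fetch the HTML content
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Stop on 4xx/5xx responses
    html_content = response.text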
Save the HTML Content
Write the fetched HTML content to a file named index.html.

    # Save HTML content
    with open('index.html', 'w', encoding='utf-8') as file:
        file.write(html_content)
Parse HTML Content
Use BeautifulSoup to parse the HTML content so that the asset URLs can be extracted.

    # Parse HTML content
    soup = BeautifulSoup(html_content, 'html.parser')
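To make the later filtering steps easier to follow, here is a small sketch of what BeautifulSoup returns for typical asset tags. The sample HTML is invented for illustration and is not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration
sample_html = """
<html>
  <head>
    <link rel="stylesheet" href="/css/site.css">
    <link rel="shortcut icon" href="favicon.ico">
    <link rel="preconnect">
    <script src="js/app.js"></script>
    <script>console.log("inline script without src");</script>
  </head>
  <body></body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# href=True / src=True keep only the tags that actually carry that attribute
links = soup.find_all("link", href=True)
scripts = soup.find_all("script", src=True)

# rel is a multi-valued attribute, so BeautifulSoup returns it as a list
rel_values = [link["rel"] for link in links]

print(len(links), len(scripts))  # 2 1
print(rel_values)                # [['stylesheet'], ['shortcut', 'icon']]
```

Because rel comes back as a list, a membership test such as 'stylesheet' in link.get('rel', []) also matches tags that declare several rel values.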
Create Directories for Assets
Create directories to store CSS, JavaScript, images, and other assets.

    # Create directories to save CSS, JS, images, and other assets
    os.makedirs('css', exist_ok=True)
    os.makedirs('js', exist_ok=True)
    os.makedirs('images', exist_ok=True)
    os.makedirs('assets', exist_ok=True)  # For other assets
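The script in this guide picks a directory based on the HTML tag an asset came from. As an optional alternative, the directory could be chosen from the file extension in the URL. The sketch below assumes a hypothetical helper named directory_for and an illustrative extension table; neither is part of the original script.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to the directories created above
EXTENSION_DIRS = {
    ".css": "css",
    ".js": "js",
    ".png": "images",
    ".jpg": "images",
    ".jpeg": "images",
    ".gif": "images",
    ".svg": "images",
    ".webp": "images",
}


def directory_for(file_url, default="assets"):
    """Pick a target directory from the extension in the URL path."""
    # Only the path is inspected, so query strings such as ?v=1 are ignored
    extension = posixpath.splitext(urlparse(file_url).path)[1].lower()
    return EXTENSION_DIRS.get(extension, default)


print(directory_for("http://example.com/static/site.CSS?v=1"))  # css
print(directory_for("http://example.com/fonts/demo.woff2"))     # assets
```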
Define a Function to Download Files
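This function downloads a file from a URL and saves it to the specified directory. The file name is taken from the end of the URL, ignoring any query string or fragment, and request or file errors are reported instead of stopping the script.

    # Function to download files
    def download_file(file_url, directory):
        try:
            response = requests.get(file_url, timeout=10)
            response.raise_for_status()  # Check for request errors
            # Drop any fragment or query string before taking the file name
            clean_url = file_url.split('#', 1)[0].split('?', 1)[0]
            file_name = os.path.basename(clean_url) or 'index'
            file_path = os.path.join(directory, file_name)
            with open(file_path, 'wb') as file:
                file.write(response.content)
            print(f"Downloaded: {file_url}")
        except (requests.RequestException, OSError) as e:
            print(f"Error downloading {file_url}: {e}")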
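One thing the download function does not handle is name collisions: two assets with the same file name but different paths (for example /a/logo.png and /b/logo.png) would be written to the same local file. The sketch below shows one possible way to derive a file name from a URL and keep names unique. The helpers filename_from_url and unique_name are hypothetical and not part of the original script.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(file_url, default="index"):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = unquote(urlparse(file_url).path)  # decode %20 and similar escapes
    name = posixpath.basename(path)
    # Fall back to a default for URLs such as "http://example.com/fonts/"
    return name if name not in ("", ".", "..") else default


def unique_name(name, used_names):
    """Return name, or name with a numeric suffix if it was already used."""
    stem, extension = posixpath.splitext(name)
    candidate = name
    counter = 1
    while candidate in used_names:
        candidate = f"{stem}-{counter}{extension}"
        counter += 1
    used_names.add(candidate)
    return candidate


used = set()
for asset_url in ("http://example.com/a/logo.png?v=2",
                  "http://example.com/b/logo.png"):
    print(unique_name(filename_from_url(asset_url), used))
# prints logo.png, then logo-1.png
```

The set of used names would need to be kept per directory if the same name is allowed in, say, both images and assets.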
Find and Download CSS Files
Locate the stylesheet links and download each CSS file.

    # Find and download CSS files
    for link in soup.find_all('link', href=True):
        if 'stylesheet' in link.get('rel', []):
            css_url = urljoin(url, link['href'])
            download_file(css_url, 'css')
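The urljoin call matters because href and src values are often relative. The following sketch shows how urljoin resolves several common forms against a page URL; the URLs are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL used only for illustration
page_url = "http://example.com/blog/post.html"

references = [
    "css/style.css",                        # relative to the page's directory
    "../img/banner.png",                    # parent directory
    "/js/app.js",                           # relative to the site root
    "//cdn.example.net/lib.js",             # protocol-relative
    "https://other.example.org/logo.png",   # already absolute
]

resolved = [urljoin(page_url, ref) for ref in references]
for ref, absolute in zip(references, resolved):
    print(ref, "->", absolute)
# css/style.css -> http://example.com/blog/css/style.css
# ../img/banner.png -> http://example.com/img/banner.png
# /js/app.js -> http://example.com/js/app.js
# //cdn.example.net/lib.js -> http://cdn.example.net/lib.js
# https://other.example.org/logo.png -> https://other.example.org/logo.png
```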
Find and Download JavaScript Files
Locate external JavaScript files and download them.

    # Find and download JavaScript files
    for script in soup.find_all('script', src=True):
        js_url = urljoin(url, script['src'])
        download_file(js_url, 'js')
Find and Download Images
Locate image files and download them.

    # Find and download images
    for img in soup.find_all('img', src=True):
        img_url = urljoin(url, img['src'])
        download_file(img_url, 'images')
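Image src values are not always fetchable over HTTP: pages sometimes embed images as data: URLs, and the same image may appear several times. A possible pre-filter, sketched below with a hypothetical helper named downloadable_urls, keeps only http and https URLs and drops duplicates before anything is downloaded.

```python
from urllib.parse import urlparse


def downloadable_urls(urls):
    """Keep http(s) URLs only, dropping duplicates while preserving order."""
    seen = set()
    result = []
    for candidate in urls:
        if urlparse(candidate).scheme not in ("http", "https"):
            continue  # skip data:, blob:, and other non-HTTP schemes
        if candidate in seen:
            continue  # skip URLs that were already collected
        seen.add(candidate)
        result.append(candidate)
    return result


# Hypothetical list of already-resolved image URLs
image_urls = [
    "http://example.com/images/logo.png",
    "data:image/png;base64,AAAA",
    "http://example.com/images/logo.png",
    "https://example.com/images/photo.jpg",
]
print(downloadable_urls(image_urls))
# ['http://example.com/images/logo.png', 'https://example.com/images/photo.jpg']
```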
Find and Download Other Assets
Handle other assets like fonts, videos, and icons.

    # Find and download other assets (e.g., fonts, videos)
    for link in soup.find_all('link', href=True):
        if 'icon' in link.get('rel', []) or 'manifest' in link.get('rel', []):
            asset_url = urljoin(url, link['href'])
            download_file(asset_url, 'assets')

    # Example for handling video files
    for video in soup.find_all('video', src=True):
        video_url = urljoin(url, video['src'])
        download_file(video_url, 'assets')

    # Example for handling font files
    for link in soup.find_all('link', href=True):
        if link['href'].endswith(('.woff', '.woff2', '.ttf', '.otf')):
            font_url = urljoin(url, link['href'])
            download_file(font_url, 'assets')
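Fonts and background images are frequently referenced from inside stylesheets through url(...) rather than from link tags, so the loops above may not find them. As a rough sketch of one way to extend the script, the hypothetical helper extract_css_urls below pulls url(...) references out of CSS text with a regular expression and resolves them against the stylesheet's own URL. A regular expression is a simplification and will not cover every valid CSS construct.

```python
import re
from urllib.parse import urljoin

# Matches url(...), with or without quotes around the reference
CSS_URL_PATTERN = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def extract_css_urls(css_text, css_url):
    """Return absolute URLs referenced via url(...) in a stylesheet."""
    found = []
    for match in CSS_URL_PATTERN.finditer(css_text):
        reference = match.group(2).strip()
        if not reference or reference.startswith("data:"):
            continue  # nothing to download for empty or embedded data
        # Relative references resolve against the stylesheet, not the page
        absolute = urljoin(css_url, reference)
        if absolute not in found:
            found.append(absolute)
    return found


# Hypothetical stylesheet content, used only for illustration
sample_css = """
@font-face {
  font-family: "Demo";
  src: url("../fonts/demo.woff2") format("woff2"),
       url('../fonts/demo.woff') format('woff');
}
body { background: url(img/bg.png); }
.icon { background: url(data:image/png;base64,AAAA); }
"""

css_assets = extract_css_urls(sample_css, "http://example.com/css/site.css")
for asset in css_assets:
    print(asset)
# http://example.com/fonts/demo.woff2
# http://example.com/fonts/demo.woff
# http://example.com/css/img/bg.png
```

Each returned URL could then be passed to download_file with the 'assets' directory.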
💡 Feel free to tweak and extend this script to handle more complex scenarios or additional types of assets.
Conclusion
With requests and BeautifulSoup, you can efficiently fetch and save web content for offline analysis or replication. Adjust the directories and asset handling as needed based on the specific requirements of your project. Happy scraping!