In this tutorial, we will show you how to use a Python script to obtain metadata from images and PDFs.
This script can be useful in scenarios where you want to investigate files. Here are some possibilities:
- Obtain the geolocation of a photo. This can help an investigation establish where the photographer was when the photo was taken.
- Discover the authors of a document and its original creation date. This can be useful in forensic investigations.
- Discover the software used for editing or creation, the operating system, camera settings, and other useful information.
Installing the necessary libraries for Python
We will start by installing the libraries our Python script needs. We are using a Windows machine as an example, but the process is similar on a Linux machine.
To do this, open a terminal on a machine that already has Python installed. If you need help installing Python, see this post: Install Python on Windows.
Now let’s start installing the libraries with the commands below.
pip install exifread
pip install pymediainfo
pip install PyPDF2
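Before moving on, it can help to confirm that the three packages are importable. The snippet below is a minimal sketch that uses only the standard library. The helper name missing_packages is hypothetical and not part of the script in this tutorial; the module names are the ones the script imports.

```python
import importlib.util


def missing_packages(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names used by the script: exifread, pymediainfo, PyPDF2.
    missing = missing_packages(["exifread", "pymediainfo", "PyPDF2"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required libraries are installed.")
```

If any name is reported as not installed, run the matching pip command again in the same Python environment.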
Running the code for image and PDF metadata
Now, let’s copy the code below and paste it into a file with the “.py” extension. In this case, we are naming our file “metadata.py”.
import os
import sys
import time
import exifread
from pymediainfo import MediaInfo
from PyPDF2 import PdfReader

def print_media_metadata(file_path):
    # Print every field that MediaInfo reports for each track in the file.
    try:
        media_info = MediaInfo.parse(file_path)
        for track in media_info.tracks:
            for key, value in track.to_data().items():
                print(f"{key}: {value}")
    except Exception as e:
        print(f"Error: {e}")

def print_exif_metadata(file_path):
    def get_if_exist(data, key):
        # Return the tag value, or None when the tag is missing.
        return data[key] if key in data else None

    def convert_to_degrees(value):
        # EXIF stores GPS coordinates as three ratios: degrees, minutes, seconds.
        d = float(value.values[0].num) / float(value.values[0].den)
        m = float(value.values[1].num) / float(value.values[1].den)
        s = float(value.values[2].num) / float(value.values[2].den)
        return d + (m / 60.0) + (s / 3600.0)

    try:
        with open(file_path, 'rb') as f:
            tags = exifread.process_file(f)
        for tag in tags.keys():
            print(f"EXIF TAG {tag}: {tags[tag]}")
        lat_ref = get_if_exist(tags, 'GPS GPSLatitudeRef')
        lat = get_if_exist(tags, 'GPS GPSLatitude')
        lon_ref = get_if_exist(tags, 'GPS GPSLongitudeRef')
        lon = get_if_exist(tags, 'GPS GPSLongitude')
        if lat and lon and lat_ref and lon_ref:
            # Southern latitudes and western longitudes are negative.
            lat = convert_to_degrees(lat)
            if lat_ref.values[0] != 'N':
                lat = -lat
            lon = convert_to_degrees(lon)
            if lon_ref.values[0] != 'E':
                lon = -lon
            print(f"=====Geolocation: Latitude: {lat}, Longitude: {lon}")
    except Exception as e:
        print(f"Error: {e}")

def print_pdf_metadata(file_path):
    # Print the raw PDF info dictionary, then the main fields with readable labels.
    try:
        reader = PdfReader(file_path)
        info = reader.metadata
        if info is None:
            # reader.metadata is None when the PDF has no info dictionary.
            print("No PDF metadata found.")
            return
        for key, value in info.items():
            print(f"{key}: {value}")
        if info.title:
            print(f"=====Title: {info.title}")
        if info.subject:
            print(f"=====Subject: {info.subject}")
        # Read the raw /Keywords entry, since some PyPDF2 releases may not expose a keywords attribute.
        keywords = info.get("/Keywords")
        if keywords:
            print(f"=====Keywords: {keywords}")
        if info.producer:
            print(f"=====Producer: {info.producer}")
        if info.creation_date:
            print(f"=====Creation date: {info.creation_date}")
        if info.modification_date:
            print(f"=====Modification date: {info.modification_date}")
        if info.author:
            print(f"=====Author: {info.author}")
        if info.creator:
            print(f"=====Creator: {info.creator}")
    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python metadata.py <file>")
        sys.exit(1)
    file_path = sys.argv[1]
    # General media metadata is printed for every file type.
    print_media_metadata(file_path)
    # Each extension needs its leading dot so that only real suffixes match.
    if file_path.lower().endswith(('.jpg', '.jpeg', '.png', '.webp', '.avif')):
        print_exif_metadata(file_path)
    elif file_path.lower().endswith('.pdf'):
        print_pdf_metadata(file_path)
If you are on Windows, you can create the file by typing the command below in the terminal.
notepad metadata.py
A dialog will appear asking whether to create the file. Confirm, then paste the code you copied above.
Now let’s save the file with the code.
Using a simple image
Now, let’s run our “metadata.py” script and pass a “jpg” image as a parameter. For this, we will use the command below.
The image file we are passing is “ssh_server_Windows_7.jpg”; you can use any image you have copied to the folder that contains your Python script.
python metadata.py ssh_server_Windows_7.jpg
Next, we will see that we have some metadata information. In this case, the information is very limited because we do not have EXIF information in the image we used.
Using an image with EXIF information
Now let’s perform the same procedure described above with an image that contains EXIF data. In this case, we are using an image that has information about the device used to take the photo, as well as GPS, date, and camera settings.
Items 1 to 4 in the output correspond to the information described below.
- Item 1: Information about the device used to take the photo.
- Item 2: Geolocation information such as latitude and longitude. This allows investigating the location where the photo was taken.
- Item 3: Information about the original date of the image and its digitization.
- Item 4: Information about the camera settings of the device, such as whether the flash fired and the focal length.
Obtaining metadata from a PDF file
Now let’s pass a PDF file as a parameter to our code. After running our script, we will see something similar to the figure below.
In this case, we can see the PDF document’s creation date and the software used to create the document. We also have information about the operating system used, in this case, Windows.
Explaining the code for image and PDF metadata
Next, we will explain the code used in the script for image and PDF metadata.
Python libraries for image and PDF metadata
First, let’s look at the Python libraries we will import. Below, we describe the role of each of the 6 imports used in this script. Note that “os” and “time” are imported but never called in the code shown above, so they can be removed without changing the script’s behavior.
- “import os”: Imports the module to interact with the operating system, allowing file and directory manipulation.
- “import sys”: Imports the module to access variables and functions that interact closely with the Python interpreter, such as command-line arguments.
- “import time”: Imports the module for time manipulation, especially useful for converting timestamps.
- “import exifread”: Imports the library for reading EXIF metadata from image files.
- “from pymediainfo import MediaInfo”: Imports the MediaInfo class from the pymediainfo module to read media file metadata.
- “from PyPDF2 import PdfReader”: Imports the PdfReader class from the PyPDF2 module to read PDF file metadata.
pymediainfo for obtaining media file metadata
Next, we will use pymediainfo to read and print metadata from media files. Below, we will detail parts of the code.
- “MediaInfo.parse(file_path)”: Parses the file and obtains metadata.
- “track.to_data().items()”: Iterates over key-value pairs of the metadata.
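The script prints every track in one flat stream, which can be hard to read. As a minimal sketch, the dictionaries returned by “track.to_data()” could be grouped by their “track_type” entry before printing. The helper name group_tracks and the sample values are hypothetical and only stand in for real pymediainfo output.

```python
def group_tracks(track_dicts):
    """Group track dictionaries by their "track_type" value."""
    grouped = {}
    for data in track_dicts:
        kind = data.get("track_type", "Unknown")
        # Keep every field except the one already used as the group name.
        fields = {key: value for key, value in data.items() if key != "track_type"}
        grouped.setdefault(kind, []).append(fields)
    return grouped


if __name__ == "__main__":
    # Hand-written sample standing in for
    # [track.to_data() for track in media_info.tracks].
    sample = [
        {"track_type": "General", "format": "JPEG", "file_size": 20480},
        {"track_type": "Image", "format": "JPEG", "width": 800, "height": 600},
    ]
    for kind, tracks in group_tracks(sample).items():
        print(f"[{kind}]")
        for fields in tracks:
            for key, value in fields.items():
                print(f"  {key}: {value}")
```

With a real file, the sample list would be replaced by the dictionaries obtained from “MediaInfo.parse(file_path)”.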
Python: EXIF metadata from image files
Now let’s describe the important parts of the “print_exif_metadata(file_path)” function, which obtains the EXIF metadata from the image files we analyze and also tries to extract geolocation information.
- “get_if_exist(data, key)”: Helper function that returns the value of a key if it exists, or None otherwise.
- “convert_to_degrees(value)”: Helper function that converts GPS values from degrees, minutes, and seconds to decimal degrees.
- “exifread.process_file(f)”: Processes the file and obtains EXIF metadata.
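To make the GPS conversion concrete, here is a small standalone sketch of the same arithmetic that works without an image file. It assumes each coordinate arrives as three (numerator, denominator) pairs, which mirrors the “num” and “den” attributes the script reads. The function name dms_to_decimal and the sample coordinate are illustrative only.

```python
from fractions import Fraction


def dms_to_decimal(dms, ref):
    """Convert ((deg_num, deg_den), (min_num, min_den), (sec_num, sec_den)) to decimal degrees.

    `ref` is the hemisphere letter: "N", "S", "E", or "W".
    """
    degrees, minutes, seconds = (Fraction(num, den) for num, den in dms)
    decimal = float(degrees + minutes / 60 + seconds / 3600)
    # Southern latitudes and western longitudes are negative.
    return -decimal if ref in ("S", "W") else decimal


if __name__ == "__main__":
    # 40 degrees, 26 minutes, 46 seconds north (illustrative values).
    latitude = dms_to_decimal(((40, 1), (26, 1), (46, 1)), "N")
    print(round(latitude, 6))
```

The script itself negates the value whenever the reference is not “N” or “E”, which gives the same result for valid EXIF data.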
Python PDF metadata
Now we will use PdfReader to obtain metadata information from PDF files. In this case, we are collecting information such as title, subject, keywords, producer, creation date, modification date, author, and creator.
def print_pdf_metadata(file_path):
    # Print the raw PDF info dictionary, then the main fields with readable labels.
    try:
        reader = PdfReader(file_path)
        info = reader.metadata
        if info is None:
            # reader.metadata is None when the PDF has no info dictionary.
            print("No PDF metadata found.")
            return
        for key, value in info.items():
            print(f"{key}: {value}")
        if info.title:
            print(f"=====Title: {info.title}")
        if info.subject:
            print(f"=====Subject: {info.subject}")
        # Read the raw /Keywords entry, since some PyPDF2 releases may not expose a keywords attribute.
        keywords = info.get("/Keywords")
        if keywords:
            print(f"=====Keywords: {keywords}")
        if info.producer:
            print(f"=====Producer: {info.producer}")
        if info.creation_date:
            print(f"=====Creation date: {info.creation_date}")
        if info.modification_date:
            print(f"=====Modification date: {info.modification_date}")
        if info.author:
            print(f"=====Author: {info.author}")
        if info.creator:
            print(f"=====Creator: {info.creator}")
    except Exception as e:
        print(f"Error: {e}")
Now let’s detail the main calls in the code snippet above.
- PdfReader(file_path): Creates a PDF reader object for the given file.
- reader.metadata: Obtains the document metadata, or None if the PDF has no information dictionary.
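The raw “/CreationDate” and “/ModDate” values printed by the loop over “info.items()” are strings in the PDF date format, for example D:20240115103000-03'00'. The “creation_date” and “modification_date” properties already convert them, but the sketch below shows how such a string maps to a Python datetime using only the standard library. The function name parse_pdf_date and the sample string are illustrative, and the parser covers only the common layout of this format.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional Z, or +HH'mm' / -HH'mm' offset.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(raw):
    """Return a datetime for a PDF date string, or None if it does not parse."""
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    tz = None
    if sign in ("Z", "z"):
        tz = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(-offset if sign == "-" else offset)
    try:
        # Missing parts default to the first month/day and to midnight.
        return datetime(int(year), int(month or 1), int(day or 1),
                        int(hour or 0), int(minute or 0), int(second or 0),
                        tzinfo=tz)
    except ValueError:
        # Out-of-range values such as month 13.
        return None


if __name__ == "__main__":
    print(parse_pdf_date("D:20240115103000-03'00'"))
```

Returning None instead of raising keeps the helper safe to call on whatever string a PDF producer happened to write.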
The main part of our Python script
Now let’s look at the “__main__” part of our Python script, which decides which function to call based on the extension of the file passed as a parameter.
Let’s detail the main statements in this part.
- “if __name__ == "__main__"”: Checks if the script is being run directly.
- “if len(sys.argv) < 2”: Checks if a file argument was passed.
- “file_path = sys.argv[1]”: Gets the file path from the command-line arguments.
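The extension check can also be written with “os.path.splitext”, which isolates the real suffix of the file name. The sketch below is one possible way to express that decision as a small function that is easy to test. The names pick_reader and IMAGE_EXTENSIONS are hypothetical and do not appear in the script.

```python
import os

# Extensions handled by the EXIF reader; each one includes the leading dot.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".avif"}


def pick_reader(file_path):
    """Return "exif", "pdf", or None depending on the file extension."""
    extension = os.path.splitext(file_path)[1].lower()
    if extension in IMAGE_EXTENSIONS:
        return "exif"
    if extension == ".pdf":
        return "pdf"
    return None


if __name__ == "__main__":
    for name in ("photo.JPG", "report.pdf", "notes_webp", "archive.zip"):
        print(name, "->", pick_reader(name))
```

Comparing the exact suffix means that a file such as “notes_webp”, which has no extension, is not mistaken for an image.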
Juliana Mascarenhas
Data Scientist and Master in Computer Modeling by LNCC.
Computer Engineer