Python: get metadata from images and PDFs

In this tutorial, we will show you how to use a Python script to obtain metadata from images and PDFs.

This script can be useful in scenarios where you want to investigate files. Here are some possibilities:

Obtain the geolocation of a photo. This helps establish where the photographer was when the photo was taken.
Discover the authors of a document and its original creation date, which can be useful in forensic investigations.
Discover the software used to create or edit a file, the operating system, camera settings, and other useful information.

Table Of Contents

Installing the necessary libraries for Python
Running the code for image and PDF metadata
Explaining the code for image and PDF metadata

Installing the necessary libraries for Python

We will start by installing the libraries the script needs. The examples use a Windows machine, but the process is similar on Linux.

To do this, we will open a terminal on a machine that already has Python installed. If you have any questions on how to install Python, you can see this post: Install Python on Windows.

Now let’s start installing the libraries with the commands below.

pip install exifread

pip install pymediainfo

pip install PyPDF2
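
Before moving on, it can help to confirm that the three packages can be found by Python. The following is a minimal sketch that uses only the standard library; the helper name "missing_modules" is hypothetical and is not part of the script presented in this tutorial.

```python
import importlib.util

# Import names of the three packages installed above.
REQUIRED = ["exifread", "pymediainfo", "PyPDF2"]

def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All libraries are installed.")
```

If a name is reported as missing, running the matching "pip install" command again in the same terminal is the usual first step.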

Running the code for image and PDF metadata

Now, let’s copy the code below and paste it into a file with the “.py” extension. In this case, we are naming our file “metadata.py“.

import os
import sys
import time
import exifread
from pymediainfo import MediaInfo
from PyPDF2 import PdfReader

def print_media_metadata(file_path):
    try:
        media_info = MediaInfo.parse(file_path)
        for track in media_info.tracks:
            for key, value in track.to_data().items():
                print(f"{key}: {value}")
    except Exception as e:
        print(f"Error: {e}")

def print_exif_metadata(file_path):
    def get_if_exist(data, key):
        return data[key] if key in data else None

    def convert_to_degrees(value):
        d = float(value.values[0].num) / float(value.values[0].den)
        m = float(value.values[1].num) / float(value.values[1].den)
        s = float(value.values[2].num) / float(value.values[2].den)
        return d + (m / 60.0) + (s / 3600.0)

    try:
        with open(file_path, 'rb') as f:
            tags = exifread.process_file(f)
            for tag in tags.keys():
                print(f"EXIF TAG {tag}: {tags[tag]}")            

            lat_ref = get_if_exist(tags, 'GPS GPSLatitudeRef')
            lat = get_if_exist(tags, 'GPS GPSLatitude')
            lon_ref = get_if_exist(tags, 'GPS GPSLongitudeRef')
            lon = get_if_exist(tags, 'GPS GPSLongitude')
            if lat and lon and lat_ref and lon_ref:
                lat = convert_to_degrees(lat)
                if lat_ref.values[0] != 'N':
                    lat = -lat
                lon = convert_to_degrees(lon)
                if lon_ref.values[0] != 'E':
                    lon = -lon
                print(f"=====Geolocation: Latitude: {lat}, Longitude: {lon}")
    except Exception as e:
        print(f"Error: {e}")

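def print_pdf_metadata(file_path):
    try:
        reader = PdfReader(file_path)
        info = reader.metadata
        if info is None:
            # Some PDFs carry no document information at all.
            print("No PDF metadata found.")
            return

        # Print every raw key/value pair first.
        for key, value in info.items():
            print(f"{key}: {value}")

        # Then highlight the most useful fields when they are present.
        if info.title:
            print(f"=====Title: {info.title}")
        if info.subject:
            print(f"=====Subject: {info.subject}")
        # Read the raw /Keywords entry, which does not depend on a
        # keywords attribute being available on the metadata object.
        keywords = info.get("/Keywords")
        if keywords:
            print(f"=====Keywords: {keywords}")
        if info.producer:
            print(f"=====Producer: {info.producer}")
        if info.creation_date:
            print(f"=====Creation date: {info.creation_date}")
        if info.modification_date:
            print(f"=====Modification date: {info.modification_date}")
        if info.author:
            print(f"=====Author: {info.author}")
        if info.creator:
            print(f"=====Creator: {info.creator}")
    except Exception as e:
        print(f"Error: {e}")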


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python metadata.py <file>")
        sys.exit(1)

    file_path = sys.argv[1]
    print_media_metadata(file_path)
    if file_path.lower().endswith(('.jpg', '.jpeg', '.png', '.webp', '.avif')):
        print_exif_metadata(file_path)
    elif file_path.lower().endswith('.pdf'):
        print_pdf_metadata(file_path)

If you are on Windows, you can type the command below in the terminal.

Notepad metadata.py

Notepad will ask whether you want to create the file. Confirm, then paste the code you copied above.

Now let’s save the file with the code.

Using a simple image

Now, let’s run our “metadata.py” script and pass a “jpg” image as a parameter. For this, we will use the command below.

The image file we are passing is “ssh_server_Windows_7.jpg”; you can use any image that you have copied to the folder containing the Python script.

python metadata.py ssh_server_Windows_7.jpg

The output shows some metadata, but it is very limited because the image we used contains no EXIF information.

Using an image with EXIF information

Now let’s repeat the procedure with an image that contains EXIF data. In this case, we are using an image that has information about the device used to take the photo, as well as GPS, date, and camera settings.

The output contains four groups of information, marked as items 1 to 4 and described below.

Item 1: Information about the device used to take the photo.
Item 2: Geolocation information such as latitude and longitude. This allows investigating the location where the photo was taken.
Item 3: Information about the original date of the image and its digitization.
Item 4: Information about the camera settings of the device such as flash firing, focal length, among others.
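
Once the script prints the latitude and longitude (item 2), a quick way to inspect the location is to open it on a map. The sketch below builds an OpenStreetMap link from decimal coordinates. The function name "osm_url", the default zoom level, and the sample coordinates are illustrative assumptions; they are not taken from the image used in this tutorial.

```python
def osm_url(lat, lon, zoom=16):
    """Build an OpenStreetMap link with a marker at the given decimal coordinates."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("coordinates out of range")
    return (
        f"https://www.openstreetmap.org/?mlat={lat:.6f}&mlon={lon:.6f}"
        f"#map={zoom}/{lat:.6f}/{lon:.6f}"
    )

# Hypothetical coordinates, used only to show the call.
print(osm_url(-22.9068, -43.1729))
```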

Obtaining metadata from a PDF file

Now let’s pass a PDF file as a parameter to our code. After running our script, we will see something similar to the figure below.

In this case, we can see the PDF document’s creation date and the software used to create it. The output also reveals the operating system used, in this case Windows.
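
In the raw key/value output, dates such as “/CreationDate” appear as PDF date strings, typically in the form D:YYYYMMDDHHmmSS followed by an optional time zone offset. As a rough sketch of how such a string can be read with the standard library alone, the hypothetical helper "parse_pdf_date" below handles only this full form; shorter variants allowed by the PDF format are not covered, and the sample value is made up.

```python
import re
from datetime import datetime, timedelta, timezone

# Full form only: D:YYYYMMDDHHmmSS plus an optional offset such as
# -03'00', +01'00' or Z. Shorter date forms are not handled here.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)(?:00'?00'?)?|([+-])(\d{2})'?(\d{2})'?)?$"
)

def parse_pdf_date(raw):
    """Convert a raw PDF date string to a datetime, or return None if it does not match."""
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        return None
    zulu, sign, off_h, off_m = match.groups()[6:]
    try:
        tz = None
        if zulu:
            tz = timezone.utc
        elif sign:
            offset = timedelta(hours=int(off_h), minutes=int(off_m))
            tz = timezone(-offset if sign == "-" else offset)
        parts = [int(g) for g in match.groups()[:6]]
        return datetime(*parts, tzinfo=tz)
    except ValueError:
        # Out-of-range components such as month 13 or an offset of 99 hours.
        return None

# Hypothetical value in the style of a /CreationDate entry.
print(parse_pdf_date("D:20240115103000-03'00'"))
```

The “creation_date” property used by the script already returns a parsed date, so a helper like this is only relevant when working with the raw strings.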

Explaining the code for image and PDF metadata

Next, we will explain the code used in the script for image and PDF metadata.

Python libraries for image and PDF metadata

First, let’s look at the Python libraries we import. Below, we describe the purpose of the six modules imported by this script.

“import os”: Imports the module for interacting with the operating system, such as file and directory manipulation. The script above imports it but does not call it.
“import sys”: Imports the module that gives access to variables and functions tied to the Python interpreter, such as the command-line arguments in “sys.argv”.
“import time”: Imports the module for working with time values, such as converting timestamps. The script above imports it but does not call it.
“import exifread”: Imports the library for reading EXIF metadata from image files.
“from pymediainfo import MediaInfo”: Imports the MediaInfo class from the pymediainfo module to read media file metadata.
“from PyPDF2 import PdfReader”: Imports the PdfReader class from the PyPDF2 module to read PDF file metadata.

pymediainfo for obtaining media file metadata

The “print_media_metadata(file_path)” function uses pymediainfo to read and print metadata from media files. Its main parts are described below.

“MediaInfo.parse(file_path)”: Parses the file and obtains its metadata, organized in tracks.
“track.to_data().items()”: Returns the metadata of a track as a dictionary and iterates over its key-value pairs.
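
The script prints every key of every track, which can be noisy. As an illustration of how that output could be tidied, the sketch below formats one track dictionary, skipping empty values and using the track type as a header. It assumes that the dictionary returned by “to_data()” contains a “track_type” key; the helper name "format_track" and the sample dictionary are hypothetical stand-ins, so the block runs without pymediainfo.

```python
def format_track(data):
    """Return printable 'key: value' lines for one track, skipping empty values."""
    lines = [f"[{data.get('track_type', 'Unknown')}]"]
    for key, value in sorted(data.items()):
        if key == "track_type" or value in (None, "", []):
            continue
        lines.append(f"  {key}: {value}")
    return lines

# Hand-written stand-in for the dictionary returned by track.to_data();
# real output contains many more keys.
sample = {"track_type": "Image", "format": "JPEG", "width": 800, "height": 600, "title": None}
print("\n".join(format_track(sample)))
```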

Python: EXIF metadata from image files

Now let’s describe the “print_exif_metadata(file_path)” function, which obtains the EXIF metadata from the image files we analyze and also tries to extract geolocation information. Its important parts are listed below.

“get_if_exist(data, key)”: Helper function that returns the value for a key if it exists, or None otherwise.
“convert_to_degrees(value)”: Helper function that converts GPS values stored as degrees, minutes, and seconds to decimal degrees.
“exifread.process_file(f)”: Processes the file and obtains EXIF metadata.
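
To make the conversion concrete, here is a small standalone sketch of the same arithmetic: decimal degrees = degrees + minutes/60 + seconds/3600, negated for south latitudes and west longitudes. It uses “fractions.Fraction” as a stand-in for the rational values stored in EXIF; the function name "dms_to_decimal" and the sample coordinates are hypothetical.

```python
from fractions import Fraction

def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    value = float(degrees) + float(minutes) / 60.0 + float(seconds) / 3600.0
    # South latitudes and west longitudes are negative.
    return -value if ref in ("S", "W") else value

# Hypothetical GPS values: 22 degrees, 54 minutes, 24.48 seconds South.
lat = dms_to_decimal(Fraction(22), Fraction(54), Fraction(2448, 100), "S")
print(round(lat, 6))
```

The script reaches the same sign by negating whenever the reference is not “N” or “E”, which is equivalent for valid reference values.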

Python PDF metadata

Now we will use PdfReader to obtain metadata information from PDF files. In this case, we are collecting information such as title, subject, keywords, producer, creation date, modification date, author, and creator.

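def print_pdf_metadata(file_path):
    try:
        reader = PdfReader(file_path)
        info = reader.metadata
        if info is None:
            # Some PDFs carry no document information at all.
            print("No PDF metadata found.")
            return
        # Print every raw key/value pair first.
        for key, value in info.items():
            print(f"{key}: {value}")
        # Then highlight the most useful fields when they are present.
        if info.title:
            print(f"=====Title: {info.title}")
        if info.subject:
            print(f"=====Subject: {info.subject}")
        # Read the raw /Keywords entry, which does not depend on a
        # keywords attribute being available on the metadata object.
        keywords = info.get("/Keywords")
        if keywords:
            print(f"=====Keywords: {keywords}")
        if info.producer:
            print(f"=====Producer: {info.producer}")
        if info.creation_date:
            print(f"=====Creation date: {info.creation_date}")
        if info.modification_date:
            print(f"=====Modification date: {info.modification_date}")
        if info.author:
            print(f"=====Author: {info.author}")
        if info.creator:
            print(f"=====Creator: {info.creator}")
    except Exception as e:
        print(f"Error: {e}")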

Now let’s detail the main functions of the code snippet above.

“PdfReader(file_path)”: Creates a PDF reader object for the given file.
“reader.metadata”: Obtains the document metadata.
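
The chain of “if” checks in the function can also be written as a loop over (label, attribute) pairs. The sketch below shows that idea under stated assumptions: "summarize_metadata" is a hypothetical helper, and a “SimpleNamespace” with made-up values stands in for the object returned by “reader.metadata”, so the block runs without PyPDF2 or a PDF file.

```python
from types import SimpleNamespace

# (label, attribute name) pairs matching the fields printed by the script.
FIELDS = [
    ("Title", "title"),
    ("Subject", "subject"),
    ("Keywords", "keywords"),
    ("Producer", "producer"),
    ("Creation date", "creation_date"),
    ("Modification date", "modification_date"),
    ("Author", "author"),
    ("Creator", "creator"),
]

def summarize_metadata(info):
    """Return 'Label: value' lines for the fields that are present on info."""
    if info is None:  # a PDF without document information
        return []
    lines = []
    for label, attr in FIELDS:
        # getattr with a default tolerates attributes that do not exist.
        value = getattr(info, attr, None)
        if value:
            lines.append(f"{label}: {value}")
    return lines

# Stand-in object with made-up values; in the script this would be reader.metadata.
fake_info = SimpleNamespace(title="Report", author="Jane Doe", producer=None)
print("\n".join(summarize_metadata(fake_info)))
```

Keeping the field list in one place makes it easier to add or rename a label without touching the loop.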

The main part of our Python script

Finally, let’s look at the “__main__” part of our Python script. It always runs the pymediainfo function first and then chooses the EXIF or PDF function according to the extension of the file passed as a parameter.

“if __name__ == "__main__":”: Checks if the script is being run directly.
“if len(sys.argv) < 2”: Checks if a file argument was passed.
“file_path = sys.argv[1]”: Gets the file path from the command-line arguments.
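
The extension check can also be isolated in a small function, which makes it easy to test without any files. The following is a minimal sketch using “os.path.splitext”; the name "pick_extractor" and the returned labels are hypothetical and only mirror the branches of the script.

```python
import os

# Extensions the script sends to the EXIF function.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".avif"}

def pick_extractor(file_path):
    """Return 'exif', 'pdf' or None depending on the file extension."""
    extension = os.path.splitext(file_path)[1].lower()
    if extension in IMAGE_EXTENSIONS:
        return "exif"
    if extension == ".pdf":
        return "pdf"
    return None

for name in ("photo.JPG", "report.pdf", "notes.txt"):
    print(name, "->", pick_extractor(name))
```

Comparing the extracted extension against a set avoids accidental matches on names that merely end with the same letters, such as a file called “archivewebp”.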

Juliana Mascarenhas

Data Scientist and Master in Computer Modeling by LNCC.
Computer Engineer
