Python: get metadata from images and PDFs


In this tutorial, we will show you how to use a Python script to obtain metadata from images and PDFs.

This script can be useful in scenarios where you want to investigate files. Here are some possibilities:

  1. Obtain the geolocation of a photo. This can be useful in an investigation about the location where the photographer was at the time of the photo.
  2. Discover the authors of a document and its original creation date. This can be useful in forensic investigations.
  3. Discover the software used for editing or creation, the operating system, camera settings, and other useful information.
Table Of Contents
  1. Installing the necessary libraries for Python
  2. Running the code for image and PDF metadata
    • Using a simple image
    • Using an image with EXIF information
    • Obtaining metadata from a PDF file
  3. Explaining the code for image and PDF metadata
    • Python libraries for image and PDF metadata
    • pymediainfo for obtaining media file metadata
    • Python: EXIF metadata from image files
    • Python PDF metadata
    • The main part of our Python script

Installing the necessary libraries for Python

We will start by installing the libraries that our Python script needs in order to work. We are using a Windows machine as an example, but the process is similar on a Linux machine.

To do this, we will open a terminal on a machine that already has Python installed. If you have any questions on how to install Python, you can see this post: Install Python on Windows.

Now let’s start installing the libraries with the commands below.

pip install exifread
pip install pymediainfo
pip install PyPDF2
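
Before running the script, it can help to confirm that the three libraries are visible to the Python interpreter you are using. The sketch below is not part of the original script; the helper name find_missing is hypothetical, and it only checks that each module can be located, not that it works.

```python
import importlib.util

# Import names of the three third-party packages installed above.
REQUIRED_MODULES = ["exifread", "pymediainfo", "PyPDF2"]


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All libraries are installed.")
```

If a name is reported as missing, run the matching pip command again with the same Python installation.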

Running the code for image and PDF metadata

Now, let’s copy the code below and paste it into a file with the “.py” extension. In this case, we are naming our file “metadata.py”.

import os
import sys
import time
import exifread
from pymediainfo import MediaInfo
from PyPDF2 import PdfReader

def print_media_metadata(file_path):
    try:
        media_info = MediaInfo.parse(file_path)
        for track in media_info.tracks:
            for key, value in track.to_data().items():
                print(f"{key}: {value}")
    except Exception as e:
        print(f"Error: {e}")

def print_exif_metadata(file_path):
    def get_if_exist(data, key):
        # Return the tag if it is present, otherwise None.
        return data[key] if key in data else None

    def convert_to_degrees(value):
        # GPS coordinates are stored as three rationals: degrees, minutes, seconds.
        d = float(value.values[0].num) / float(value.values[0].den)
        m = float(value.values[1].num) / float(value.values[1].den)
        s = float(value.values[2].num) / float(value.values[2].den)
        return d + (m / 60.0) + (s / 3600.0)

    try:
        with open(file_path, 'rb') as f:
            tags = exifread.process_file(f)
            # Print every EXIF tag found in the file.
            for tag in tags.keys():
                print(f"EXIF TAG {tag}: {tags[tag]}")

            lat_ref = get_if_exist(tags, 'GPS GPSLatitudeRef')
            lat = get_if_exist(tags, 'GPS GPSLatitude')
            lon_ref = get_if_exist(tags, 'GPS GPSLongitudeRef')
            lon = get_if_exist(tags, 'GPS GPSLongitude')
            if lat and lon and lat_ref and lon_ref:
                # Southern latitudes and western longitudes become negative.
                lat = convert_to_degrees(lat)
                if lat_ref.values[0] != 'N':
                    lat = -lat
                lon = convert_to_degrees(lon)
                if lon_ref.values[0] != 'E':
                    lon = -lon
                print(f"=====Geolocation: Latitude: {lat}, Longitude: {lon}")
    except Exception as e:
        print(f"Error: {e}")

def print_pdf_metadata(file_path):
    try:
        reader = PdfReader(file_path)
        info = reader.metadata
        # reader.metadata is None when the PDF has no document information.
        if info is None:
            print("No PDF metadata found.")
            return
        # Print every raw entry, such as /Author or /CreationDate.
        for key, value in info.items():
            print(f"{key}: {value}")

        # Print the most common fields with readable labels.
        if info.title:
            print(f"=====Title: {info.title}")
        if info.subject:
            print(f"=====Subject: {info.subject}")
        if info.keywords:
            print(f"=====Keywords: {info.keywords}")
        if info.producer:
            print(f"=====Producer: {info.producer}")
        if info.creation_date:
            print(f"=====Creation date: {info.creation_date}")
        if info.modification_date:
            print(f"=====Modification date: {info.modification_date}")
        if info.author:
            print(f"=====Author: {info.author}")
        if info.creator:
            print(f"=====Creator: {info.creator}")
    except Exception as e:
        print(f"Error: {e}")


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python metadata.py <file>")
        sys.exit(1)

    file_path = sys.argv[1]
    # General media metadata is printed for every file type.
    print_media_metadata(file_path)
    if file_path.lower().endswith(('.jpg', '.jpeg', '.png', '.webp', '.avif')):
        print_exif_metadata(file_path)
    elif file_path.lower().endswith('.pdf'):
        print_pdf_metadata(file_path)

If you are on Windows, you can create the file by typing the command below in the terminal.

notepad metadata.py

A dialog will appear asking for confirmation to create the file. Confirm, paste the code you copied above, and then save the file.

Using a simple image

Now, let’s run our “metadata.py” script and pass a “jpg” image as a parameter, using the command below.

The image file we are passing is “ssh_server_Windows_7.jpg”. You can use any image file that you copied to the folder containing the script.

python metadata.py ssh_server_Windows_7.jpg

Next, we will see that we have some metadata information. In this case, the information is very limited because we do not have EXIF information in the image we used.

Using an image with EXIF information

Now let’s repeat the procedure above with an image that contains EXIF data. In this case, we are using an image that has information about the device used to take the photo, as well as GPS, date, and camera settings.

In the script output, items 1 to 4 correspond to the information described below.

  • Item 1: Information about the device used to take the photo.
  • Item 2: Geolocation information such as latitude and longitude. This allows investigating the location where the photo was taken.
  • Item 3: Information about the original date of the image and its digitization.
  • Item 4: Information about the camera settings of the device such as flash firing, focal length, among others.

Obtaining metadata from a PDF file

Now let’s pass a PDF file as a parameter to our code. After running our script, the PDF metadata is printed to the terminal.

In this case, we can see information about the PDF document’s creation date and the software used to create the document. Additionally, we also have information about the operating system used, in this case, Windows.

Explaining the code for image and PDF metadata

Next, we will explain the code used in the script for image and PDF metadata.

Python libraries for image and PDF metadata

First, let’s look at the Python modules the script imports. Below, we describe the purpose of each of the 6 imports.

  1. “import os”: Imports the module to interact with the operating system, allowing file and directory manipulation. This version of the script does not call it directly.
  2. “import sys”: Imports the module to access variables and functions that interact closely with the Python interpreter, such as command-line arguments.
  3. “import time”: Imports the module for time manipulation, useful for converting timestamps. This version of the script does not call it directly.
  4. “import exifread”: Imports the library for reading EXIF metadata from image files.
  5. “from pymediainfo import MediaInfo”: Imports the MediaInfo class from the pymediainfo module to read media file metadata.
  6. “from PyPDF2 import PdfReader”: Imports the PdfReader class from the PyPDF2 module to read PDF file metadata.

pymediainfo for obtaining media file metadata

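The “print_media_metadata(file_path)” function uses pymediainfo to read and print metadata from media files. Below, we detail the main parts of the code.

  • “MediaInfo.parse(file_path)”: Parses the file and returns an object whose “tracks” attribute holds the metadata, one track per section (for example, general and image).
  • “track.to_data().items()”: Converts a track to a dictionary and iterates over its key-value pairs.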
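
Printing every key of every track can produce a long listing. As a minimal sketch (not part of the original script), the loop above can be adapted to keep only a few fields. The helper name collect_track_fields is hypothetical, and the key names in wanted are assumptions about what a typical image track contains; check the output of the script for the keys present in your files. A stand-in object replaces a real pymediainfo track so the example runs without a media file.

```python
from types import SimpleNamespace


def collect_track_fields(tracks, wanted=("track_type", "format", "width", "height")):
    """Keep only the wanted keys from each track's to_data() dictionary."""
    rows = []
    for track in tracks:
        data = track.to_data()
        rows.append({key: data[key] for key in wanted if key in data})
    return rows


# Stand-in for a pymediainfo track: any object with a to_data() method works.
fake_track = SimpleNamespace(
    to_data=lambda: {
        "track_type": "Image",
        "format": "JPEG",
        "width": 800,
        "height": 600,
        "stream_size": 12345,
    }
)

print(collect_track_fields([fake_track]))
```

With a real file, the same function could be called as collect_track_fields(MediaInfo.parse(file_path).tracks).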

Python: EXIF metadata from image files

Now let’s describe the “print_exif_metadata(file_path)” function, which obtains the EXIF metadata from the image files we analyze and also tries to extract geolocation information.

  • “get_if_exist(data, key)”: Helper function that returns the value for a key if it exists, or None otherwise.
  • “convert_to_degrees(value)”: Helper function that converts GPS values stored as degrees, minutes, and seconds into decimal degrees.
  • “exifread.process_file(f)”: Processes the file and obtains EXIF metadata.
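
The conversion done by convert_to_degrees is plain arithmetic: decimal degrees = degrees + minutes / 60 + seconds / 3600, with a negative sign for the southern and western hemispheres. The sketch below reproduces that calculation on plain numbers instead of exifread tag objects. The function name dms_to_decimal and the sample coordinates are made up for illustration.

```python
from fractions import Fraction


def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees, minutes, seconds plus a hemisphere letter to decimal degrees."""
    value = Fraction(degrees) + Fraction(minutes) / 60 + Fraction(seconds) / 3600
    # South latitudes and west longitudes are negative.
    if ref in ("S", "W"):
        value = -value
    return float(value)


# Hypothetical coordinate: 22 degrees 54 minutes 30 seconds South.
print(dms_to_decimal(22, 54, 30, "S"))
# Hypothetical coordinate: 43 degrees 12 minutes 0 seconds West.
print(dms_to_decimal(43, 12, 0, "W"))
```

The script tests the reference with != 'N' and != 'E' instead, which gives the same result for valid EXIF references.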

Python PDF metadata

Now we will use PdfReader to obtain metadata information from PDF files. In this case, we are collecting information such as title, subject, keywords, producer, creation date, modification date, author, and creator.

def print_pdf_metadata(file_path):
    try:
        reader = PdfReader(file_path)
        info = reader.metadata
        # reader.metadata is None when the PDF has no document information.
        if info is None:
            print("No PDF metadata found.")
            return
        # Print every raw entry, such as /Author or /CreationDate.
        for key, value in info.items():
            print(f"{key}: {value}")
        # Print the most common fields with readable labels.
        if info.title:
            print(f"=====Title: {info.title}")
        if info.subject:
            print(f"=====Subject: {info.subject}")
        if info.keywords:
            print(f"=====Keywords: {info.keywords}")
        if info.producer:
            print(f"=====Producer: {info.producer}")
        if info.creation_date:
            print(f"=====Creation date: {info.creation_date}")
        if info.modification_date:
            print(f"=====Modification date: {info.modification_date}")
        if info.author:
            print(f"=====Author: {info.author}")
        if info.creator:
            print(f"=====Creator: {info.creator}")
    except Exception as e:
        print(f"Error: {e}")

Now let’s detail the main calls in the code snippet above.

  • “PdfReader(file_path)”: Creates a PDF reader object for the file.
  • “reader.metadata”: Obtains the document metadata, exposing fields such as title, author, and creation date as attributes.
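
The loop over info.items() prints dates as raw PDF strings, which usually look like D:20240115103000-03'00'. PyPDF2 already decodes them through info.creation_date and info.modification_date, but if you only have the raw string, the sketch below shows one way to decode it with the standard library. The function name parse_pdf_date and the sample date are made up for illustration, and the pattern covers the common layout rather than every variant that PDF producers write.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional offset such as -03'00' or Z.
PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(text):
    """Return a datetime for a raw PDF date string, or None if it does not match."""
    match = PDF_DATE.match(text.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    tzinfo = None
    if sign == "Z":
        tzinfo = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
        tzinfo = timezone(offset if sign == "+" else -offset)
    # Missing parts default to the first month/day and to midnight.
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=tzinfo,
    )


print(parse_pdf_date("D:20240115103000-03'00'"))
```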

The main part of our Python script

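Now let’s look at the “__main__” part of our Python script.

This part reads the file path from the command line and chooses which metadata functions to call based on the file extension.

  • “if __name__ == "__main__"”: Checks if the script is being run directly.
  • “if len(sys.argv) < 2”: Checks if a file argument was passed. If not, the script prints a usage message and exits.
  • “file_path = sys.argv[1]”: Gets the file path from the command-line arguments.
  • “print_media_metadata(file_path)”: Runs for every file, regardless of its extension.
  • “file_path.lower().endswith(...)”: Compares the lowercased file name against the image extensions or “.pdf” to decide whether to call print_exif_metadata or print_pdf_metadata.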
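
The extension check can also be written with pathlib, which extracts the suffix of the file name before comparing it. This is an alternative sketch, not the code used in the script. The names IMAGE_EXTENSIONS and choose_handler are hypothetical, and the function returns a label instead of calling the metadata functions so that it can run on its own.

```python
from pathlib import Path

# Image extensions that the script sends to the EXIF reader.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".avif"}


def choose_handler(file_path):
    """Return which extra metadata step applies to the file: exif, pdf, or media-only."""
    suffix = Path(file_path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "exif"
    if suffix == ".pdf":
        return "pdf"
    # Any other file only goes through the general pymediainfo step.
    return "media-only"


for name in ("photo.JPG", "report.pdf", "clip.mp4"):
    print(name, "->", choose_handler(name))
```

Comparing the parsed suffix rather than the end of the whole string means that a name such as notes.xwebp is not treated as an image.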


Juliana Mascarenhas

Data Scientist and Master in Computer Modeling by LNCC.
Computer Engineer
