Automate Your Data Reports with Python and PostgreSQL

A Step-by-Step Guide to Automating Data Exports and Email Reports with Python

How to Automate Data Exports and Email Reports with Python – a Step-by-Step Guide

In today’s data-driven world, automation is the key to streamlining tasks and saving time. Don’t you just wish there was a magic button that could handle all those tedious tasks for you? Well, you’re in luck! With the power of Python and PostgreSQL, you can automate the process of extracting data from a database and sending it as an email attachment. This beginner-friendly tutorial will walk you through the process step by step, so even if you’re new to programming, you’ll be able to follow along.

Why Automation?

Imagine you’re working in an organization where your managers expect a weekly report filled with valuable insights. But creating this report is far from a straightforward task. To get the information you need, you have to manually run ten different database queries, gather the results, and then meticulously compile them into an Excel spreadsheet. It’s a time-consuming and error-prone process that can leave you exhausted.

In this scenario, wouldn’t it be a game-changer if Python could take the reins and handle this entire process for you? Picture this: Every week, without any manual intervention, Python seamlessly extracts the required data, compiles it into a neat Excel sheet, and even sends it off to your managers like clockwork. Sounds like a dream come true, right? Well, in this tutorial, we’ll make that dream a reality.

Getting Started

Before we dive in, let’s make sure you have everything you need to follow along. Here are the prerequisites:

  1. Python installed on your computer. You can download Python from Python.org.
  2. Basic knowledge of the Python programming language.
  3. Familiarity with sending emails in Python.
  4. PostgreSQL installed on your computer. You can download PostgreSQL from here.

Setting Up Your Virtual Environment

To ensure a clean and isolated environment for our project, we’ll create a virtual environment using the venv tool, which is included with Python. Here’s how you can set it up:

  1. Open your terminal or command prompt.
  2. Create a project directory and navigate to it. Run the following commands: mkdir report-automation cd report-automation
  3. Create a virtual environment named env using the following command: python -m venv env (Note: On Windows, use source env/Scripts/activate to activate the environment.)
  4. Activate the virtual environment with the following command: source env/bin/activate (Note: On Windows, use source env/Scripts/activate to activate the environment.)
  5. You should see (env) in your terminal prompt, indicating that the virtual environment has been activated.

Great! Now that you’re in the virtual environment, let’s install the required libraries.

Installing the Required Libraries

To work with PostgreSQL and manipulate data in Python, we’ll need a few libraries. Run the following command to install them:

pip install psycopg2 pandas xlsxwriter

Here’s what each library does:

  • psycopg2: Python adapter for PostgreSQL, enabling Python applications to interact with PostgreSQL databases.
  • pandas: A versatile data manipulation and analysis library for Python, ideal for working with structured data.
  • xlsxwriter: Python module for creating and formatting Excel (XLSX) files, useful for generating reports and spreadsheets.

Once the installation is complete, you’re all set to move on to the next step.

Setting Up Your Sample Database

In this tutorial, we’ll be working with a demo database named “airlines”. This database includes three tables: bookings, flights, and airports_data. To set up the database, follow these steps:

  1. Download the SQL script file “airlines_db.sql” from here.
  2. Open your terminal or command prompt.
  3. Use the following command to install the database, replacing postgres with your PostgreSQL username if it’s different: psql -f airlines_db.sql -U postgres This command will execute the SQL script and create the “airlines” database with the necessary tables.

The main schema in the database is bookings, which stores crucial information about bookings made for flights. There are two other tables, flights and airports_data, which capture details about flights and airports, respectively. The relationships between these tables are established through foreign keys.

Setting Up Logging and Environment Variables

To provide informative messages and handle errors throughout the code, we’ll configure logging using Python’s built-in logging module. We’ll also set up environment variables to securely store sensitive information and configuration parameters. Environment variables are used to store data like passwords and email credentials separately from the code, reducing the risk of accidental exposure or unauthorized access.

In this tutorial, we’ll store the email credentials, SMTP server information, and database connection details in environment variables. Here’s what you need to do:

  1. Create a .env file in the project directory.
  2. Add the following lines to the .env file, replacing the values with your own: [email protected] PASSWORD=your-email-password EMAIL_PORT=587 SMTP_SERVER=smtp.example.com DB_HOSTNAME=localhost DB_NAME=airlines DB_PORT=5432 DB_USERNAME=postgres DB_PASSWORD=postgres
  3. Save the .env file.
  4. Run source .env in your terminal or command prompt to load the environment variables.

By using environment variables, you can ensure that sensitive data like passwords and email credentials are kept separate from the code. This improves security while maintaining flexibility in configuration.

You’re all set! Now let’s move on to extracting the data from the database.

Extracting the Data From the Database

Before we can automate the reporting process, we need to connect to our database and fetch the data we need. To accomplish this, we’ll use the psycopg2 library, which allows Python applications to interact with PostgreSQL databases.

Let’s start by setting up the database configurations. Add the following code to your Python file:

import os
import psycopg2
import logging

logging.basicConfig(format="%(asctime)s | %(levelname)s : %(message)s", level=logging.INFO)

DB_CONFIG = {
    "host": os.environ.get("DB_HOSTNAME"),
    "database": os.environ.get("DB_NAME"),
    "user": os.environ.get("DB_USERNAME"),
    "password": os.environ.get("DB_PASSWORD"),
}

Here, we define a dictionary called DB_CONFIG, which stores the configuration parameters for connecting to the PostgreSQL database. These parameters include the hostname, database name, username, and password. We retrieve these values from the environment variables we set up earlier.

Next, let’s create a method to connect to the database and fetch the data. Add the following code:

...

class DataExporter:
    ...
    def __connect_to_database(self) -> None:
        """Establish a connection to the PostgreSQL database."""
        try:
            self.conn = psycopg2.connect(**self.db_config)
            self.cursor = self.conn.cursor()
            logging.info("Connected to the database")
        except Exception as e:
            logging.error("Failed to connect to the database with error: %s", e)
            raise

    def __fetch_from_database(self, start_timestamp, end_timestamp) -> list:
        """Fetch booking data from the database for a given time range."""
        self.__connect_to_database()
        query = f"""
        SELECT COUNT(*) AS num_bookings, SUM(total_amount) AS total_amount
        FROM bookings
        WHERE book_date >= {int(start_timestamp.timestamp()) * 1000}
        AND book_date <= {int(end_timestamp.timestamp()) * 1000}
        """
        logging.info(
            "Extracting bookings data from database for start timestamp=%s and end_timestamp=%s",
            start_timestamp,
            end_timestamp,
        )
        result = None
        try:
            self.cursor.execute(query)
            result = list(self.cursor.fetchone())
            result.append(
                f'{start_timestamp.strftime("%d %b, %Y")} - {end_timestamp.strftime("%d %b, %Y")}'
            )
            logging.info(
                "Successfully extracted bookings data from database for start timestamp=%s and end_timestamp=%s",
                start_timestamp,
                end_timestamp,
            )
        except Exception as e:
            logging.error(
                "Error occurred while extracting bookings data from database: %s", e
            )
        return result

...

This code defines two private methods within the DataExporter class: __connect_to_database and __fetch_from_database.

The __connect_to_database method establishes a connection to the PostgreSQL database using the psycopg2 library. It reads the database configuration values from the DB_CONFIG dictionary we defined earlier. If the connection is successful, it sets up a cursor for executing SQL queries.

The __fetch_from_database method fetches the booking data from the database for a given time range. It constructs a SQL query that selects the count of bookings and the total booking amount based on the provided start and end timestamps. The query limits the results to bookings within the specified time range.

After executing the query, the method retrieves the result using cursor.fetchone() and appends the timestamp range to the result list. The final result is then returned.

With these methods in place, you can easily extract booking data for different time ranges and use it in your reports.

Structuring the Booking Data with the BookingInfo Class

To organize and represent booking data retrieved from the database, we’ll define a BookingInfo class in a separate file called booking_info.py. This class will encapsulate booking-related information, making it easier to work with and present the data.

In booking_info.py, add the following code:

from decimal import Decimal

class BookingInfo:
    def __init__(self, data_list):
        """Initialize BookingInfo with data from the database."""
        self.__total_bookings, self.__total_amount, self.__timestamp = data_list
        self.__total_amount = Decimal(self.__total_amount) if self.__total_amount else Decimal(0)

    def __str__(self) -> str:
        """Return a string representation of BookingInfo."""
        return f"Total Bookings: {self.__total_bookings}, Total Amount: ${self.__total_amount}"

    def get_total_bookings(self) -> int:
        """Get the total number of bookings."""
        return self.__total_bookings

    def get_total_amount(self) -> Decimal:
        """Get the total booking amount as a Decimal."""
        return self.__total_amount

    def get_timestamp(self) -> str:
        """Get the timestamp associated with the booking data."""
        return self.__timestamp

The BookingInfo class receives a data_list as input, which is assumed to be a list containing the following elements:

  • total_bookings: An integer representing the total number of bookings.
  • total_amount: A floating-point value representing the total booking amount.
  • timestamp: A timestamp associated with the booking data.

The __init__ method initializes private instance variables (__total_bookings, __total_amount, and __timestamp) with the values from the data_list. It also converts the __total_amount to a Decimal type for precise handling of monetary values.

The __str__ method is implemented to provide a string representation of the BookingInfo object. It returns a string in the format “Total Bookings: X, Total Amount: $Y”, where X is the total number of bookings and Y is the total booking amount formatted as dollars.

To access the encapsulated data, the class provides three getter methods:

  • get_total_bookings(): Returns the total number of bookings as an integer.
  • get_total_amount(): Returns the total booking amount as a Decimal type.
  • get_timestamp(): Returns the timestamp associated with the booking data as a string.

By encapsulating the booking data within the BookingInfo class, the code becomes more organized, readable, and reusable. This structured approach simplifies the handling of booking information throughout the application.

Converting the Data into an Excel Sheet

Now that we can retrieve data from the database for a specific time range, let’s convert the extracted data into an Excel sheet. To do this, we’ll define another private method within the DataExporter class called __convert_to_excelsheet.

Add the following code to your Python file:

import pandas as pd
from booking_info import BookingInfo

...

class DataExporter:
    ...
    def __convert_to_excelsheet(self, data, sheet_name):
        """Convert the fetched data into an Excel sheet."""
        try:
            booking_info = BookingInfo(data)
            data = {
                "": ["Total Bookings", "Total Amount ($)"],
                booking_info.get_timestamp(): [
                    booking_info.get_total_bookings(),
                    booking_info.get_total_amount(),
                ],
            }
            logging.info("Converting the data into a pandas DataFrame")
            df = pd.DataFrame(data)
            logging.info("Inserting the data into the Excel sheet")
            with pd.ExcelWriter(sheet_name, engine="xlsxwriter") as writer:
                df.to_excel(writer, sheet_name="Sheet1", index=False)
            logging.info("Successfully inserted data into the Excel sheet")
        except ValueError as e:
            logging.error("Error converting data into Excel: %s", e)

...

This code defines the __convert_to_excelsheet method, which takes two parameters: data (a list containing the booking data) and sheet_name (the desired name of the Excel sheet).

Within this method, we create an instance of the BookingInfo class, passing in the data list as input. This allows us to structure and format the data for insertion into the Excel sheet.

Next, we structure the data dictionary to prepare it for conversion into a pandas DataFrame. The dictionary has two key-value pairs: an empty string as the key and a list with header values (“Total Bookings” and “Total Amount ($)”), and the timestamp as the key with a list of values containing the total number of bookings and the total booking amount.

We then convert the data dictionary into a pandas DataFrame (df), which is a commonly used data structure for handling tabular data in Python.

To create the Excel sheet, we use the pd.ExcelWriter context manager with the "xlsxwriter" engine. This ensures that the Excel file is prepared for data insertion. The sheet_name parameter specifies the name of the sheet within the Excel file.

Finally, the data within the DataFrame is written to the Excel sheet using the to_excel method of the DataFrame. We set the index parameter to False to exclude the default row numbers in the Excel sheet.

Now that we have the ability to convert data into an Excel sheet, let’s combine all the functionalities and generate reports automatically.

Combining the Functionalities

To generate reports automatically, we need to create a public method that users can call to extract the data from the database, convert it into an Excel sheet, and send it as an email attachment.

Add the following code to your Python file:

...

class DataExporter:
    ...
    def generate_excelsheet(
        self, start_timestamp: datetime, end_timestamp: datetime, sheet_name="Bookings Data.xlsx"
    ) -> bool:
        """Generate an Excel sheet with booking data for a specified time range."""
        data = self.__fetch_from_database(start_timestamp, end_timestamp)
        if data is not None:
            self.__convert_to_excelsheet(data, sheet_name)
            return True
        else:
            logging.error("No data to generate an Excel sheet")
            return False

...

This method, generate_excelsheet, accepts several parameters:

  • start_timestamp: The start of the time range for data extraction.
  • end_timestamp: The end of the time range for data extraction.
  • sheet_name (optional): The name of the Excel sheet to be created. Defaults to “Bookings Data.xlsx”.

The method calls the private __fetch_from_database method to retrieve the booking data for the specified time range. If data is available, it calls the private __convert_to_excelsheet method to convert the extracted data into an Excel sheet with the specified name.

If the generation of the Excel sheet is successful, the method returns True. Otherwise, it logs an error message and returns False.

Now that we can generate Excel sheets with the desired data, let’s move on to emailing the reports.

Sending an Email with the Bookings Data Report

To send an email with the Excel sheet as an attachment, we’ll use Python’s built-in smtplib library. This library allows us to communicate with an SMTP server and send emails programmatically.

In a separate file called mailer.py, add the following code:

import logging
import os
import smtplib
import ssl
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

logging.basicConfig(format="%(asctime)s | %(levelname)s : %(message)s", level=logging.INFO)

SMTP_SERVER = os.environ.get("SMTP_SERVER")
PORT = os.environ.get("EMAIL_PORT")
EMAIL = os.environ.get("EMAIL")
PASSWORD = os.environ.get("PASSWORD")


def send_email(to_email: str, subject: str, attachment_name: str):
    """Send an email with an attachment to the specified recipient."""
    message = MIMEMultipart()
    message["From"] = EMAIL
    message["To"] = to_email
    message["Subject"] = subject
    body = "Hi there,\n\nPlease find attached your report.\n\nThanks"
    message.attach(MIMEText(body, "plain"))

    with open(attachment_name, "rb") as file:
        part = MIMEBase(
            "application",
            "vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        )
        part.set_payload(file.read())

    encoders.encode_base64(part)
    part.add_header(
        "Content-Disposition",
        f"attachment; filename= {attachment_name}",
    )
    logging.info(f"Attaching {attachment_name} to the email")
    message.attach(part)

    text = message.as_string()
    context = ssl.create_default_context()

    with smtplib.SMTP(SMTP_SERVER, PORT) as server:
        logging.info(f"Sending email to {to_email}")
        server.starttls(context=context)
        server.login(EMAIL, PASSWORD)
        server.sendmail(EMAIL, to_email, text)
        logging.info(f"Successfully sent the email to {to_email}")

This code defines the send_email function, which takes three parameters: to_email (the recipient’s email address), subject (the subject of the email), and attachment_name (the filename of the attachment).

Within the function, we construct an email message using the MIMEMultipart class. We set the sender’s email address, recipient’s email address, and subject. We also include a simple plain text body to provide a brief message for the recipient.

The script allows attaching the bookings data report as an attachment. It reads the attachment file, encodes it, and adds it to the email message using the MIMEBase class. This ensures that the recipient can easily access and download the data report from the email.

Finally, the script connects to the SMTP server, logs in using the sender’s email address and password, sends the email, and logs a success message upon successful transmission.

With the ability to send emails with attachments in place, we’re ready to test the entire flow of the application.

Testing the Flow

Now comes the exciting part—testing the flow of our application. In this section, we’ll automate the generation and delivery of reports. Here’s an example of how you can do it:

from exporter import DataExporter
from datetime import datetime

# Example of how you can automate monthly reports
start_timestamp = datetime(2023, 5, 1, 0, 0, 0)  # May 1, 2023 00:00:00
end_timestamp = datetime(2023, 5, 31, 23, 59, 59)  # May 31, 2023 23:59:59

exporter = DataExporter()
if exporter.generate_excelsheet(
    start_timestamp, end_timestamp, sheet_name="Monthly Report.xlsx"
):
    send_email("[email protected]", "Your Monthly Report", "Monthly Report.xlsx")

In the code above, we set up the start and end timestamps to cover the entire month of May 2023. Next, we create an instance of the DataExporter class, which handles the data export and Excel sheet generation. We then call the generate_excelsheet method with the specified timestamps and sheet name. If the Excel sheet generation is successful, we use the send_email function from mailer.py to send the email containing the generated report.

You can also automate the generation of weekly reports using a similar approach. Simply change the start and end timestamps to cover the desired week.

Scheduling the Application

To truly automate the process, we need to schedule the execution of the script at the desired intervals. For example, we might want to generate and send reports every Monday for the previous week’s data and on the 1st day of every month for the previous month’s information. To achieve this, we can utilize the schedule library.

Here’s an example of how you can schedule the execution of the script:

import schedule
from exporter import DataExporter
from datetime import datetime, timedelta
from mailer import send_email


def main():
    today = datetime.now()
    if today.weekday() == 0:  # Check if it's Monday (0 means Monday)
        # It's Monday, fetch data for the previous week (Monday to Sunday)
        start_timestamp = (today - timedelta(days=7)).replace(
            hour=0, minute=0, second=0, microsecond=0
        )
        end_timestamp = (today - timedelta(days=1)).replace(
            hour=23, minute=59, second=59, microsecond=0
        )
        sheet_name = "Weekly Report.xlsx"
    elif today.day == 1:
        # It's the 1st day of the month, fetch data for the last month
        start_timestamp = (today.replace(day=1) - timedelta(days=1)).replace(
            day=1, hour=0, minute=0, second=0, microsecond=0
        )
        end_timestamp = (today.replace(day=1) - timedelta(days=1)).replace(
            hour=23, minute=59, second=59, microsecond=0
        )
        sheet_name = "Monthly Report.xlsx"

    exporter = DataExporter()
    exporter.generate_excelsheet(start_timestamp, end_timestamp, sheet_name)
    send_email("[email protected]", "Your Report", sheet_name)


schedule.every().day.at("00:00").do(main)

while True:
    schedule.run_pending()

In the code above, we use the schedule library to run the main function daily at midnight. The main function calculates the timestamps for data extraction and Excel sheet generation. After generating the Excel sheet, the script sends it via email to the specified recipient.

If the script runs on a Monday, it generates a weekly report. The start and end timestamps are calculated to cover the previous week from Monday to Sunday. The Excel sheet is named “Weekly Report.xlsx”.

On the 1st day of the month, the script generates a monthly report. The start and end timestamps encompass the entire previous month. The Excel sheet is named “Monthly Report.xlsx”.

By scheduling the script using the schedule library, you can automate the generation and delivery of reports according to your desired intervals.

Wrapping Up

In this tutorial, you learned how to automate the generation and delivery of reports using Python, PostgreSQL, and emails. By leveraging the power of Python, you can streamline and simplify repetitive tasks, saving valuable time and reducing the risk of errors.

Automation doesn’t have to be complicated or intimidating. With the right tools and a well-structured approach, you can transform tedious tasks into efficient workflows.

Remember, the possibilities are endless. You can build on this foundation and explore additional features, such as storing email recipients in a database or using Cron Jobs to schedule the execution at specific times.

Feel free to experiment, adapt, and expand on these concepts to fit your unique needs. With your newfound automation superpowers, you’ll find yourself accomplishing more in less time.

Thank you for joining me on this journey. I hope you found this tutorial helpful, and I wish you all the best as you automate your own tasks. Happy coding!


Leave a Reply

Your email address will not be published. Required fields are marked *