Automate Your Data Reports with Python and PostgreSQL
A Step-by-Step Guide to Automating Data Exports and Email Reports with Python
In today’s data-driven world, automation is the key to streamlining tasks and saving time. Don’t you just wish there was a magic button that could handle all those tedious tasks for you? Well, you’re in luck! With the power of Python and PostgreSQL, you can automate the process of extracting data from a database and sending it as an email attachment. This beginner-friendly tutorial will walk you through the process step by step, so even if you’re new to programming, you’ll be able to follow along.
Why Automation?
Imagine you’re working in an organization where your managers expect a weekly report filled with valuable insights. But creating this report is far from a straightforward task. To get the information you need, you have to manually run ten different database queries, gather the results, and then meticulously compile them into an Excel spreadsheet. It’s a time-consuming and error-prone process that can leave you exhausted.
In this scenario, wouldn’t it be a game-changer if Python could take the reins and handle this entire process for you? Picture this: Every week, without any manual intervention, Python seamlessly extracts the required data, compiles it into a neat Excel sheet, and even sends it off to your managers like clockwork. Sounds like a dream come true, right? Well, in this tutorial, we’ll make that dream a reality.
Getting Started
Before we dive in, let’s make sure you have everything you need to follow along. Here are the prerequisites:
- Python installed on your computer. You can download Python from Python.org.
- Basic knowledge of the Python programming language.
- Familiarity with sending emails in Python.
- PostgreSQL installed on your computer. You can download PostgreSQL from here.
Setting Up Your Virtual Environment
To ensure a clean and isolated environment for our project, we’ll create a virtual environment using the venv tool, which is included with Python. Here’s how you can set it up:
- Open your terminal or command prompt.
- Create a project directory and navigate to it. Run the following commands:
  mkdir report-automation
  cd report-automation
- Create a virtual environment named env using the following command:
  python -m venv env
- Activate the virtual environment with the following command:
  source env/bin/activate
  (Note: On Windows, use source env/Scripts/activate to activate the environment.)
- You should see (env) in your terminal prompt, indicating that the virtual environment has been activated.
Great! Now that you’re in the virtual environment, let’s install the required libraries.
Installing the Required Libraries
To work with PostgreSQL and manipulate data in Python, we’ll need a few libraries. Run the following command to install them:
pip install psycopg2 pandas xlsxwriter
Here’s what each library does:
- psycopg2: A Python adapter for PostgreSQL, enabling Python applications to interact with PostgreSQL databases.
- pandas: A versatile data manipulation and analysis library for Python, ideal for working with structured data.
- xlsxwriter: A Python module for creating and formatting Excel (XLSX) files, useful for generating reports and spreadsheets.
Once the installation is complete, you’re all set to move on to the next step.
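To confirm the installation, you can optionally print each library’s version from a quick Python session; this is just a sanity check, and nothing later depends on it:
# Optional sanity check: the three libraries import cleanly and report their versions.
import psycopg2
import pandas as pd
import xlsxwriter

print(psycopg2.__version__)
print(pd.__version__)
print(xlsxwriter.__version__)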
Setting Up Your Sample Database
In this tutorial, we’ll be working with a demo database named “airlines”. This database includes three tables: bookings, flights, and airports_data. To set up the database, follow these steps:
- Download the SQL script file “airlines_db.sql” from here.
- Open your terminal or command prompt.
- Use the following command to install the database, replacing postgres with your PostgreSQL username if it’s different:
  psql -f airlines_db.sql -U postgres
This command will execute the SQL script and create the “airlines” database with the necessary tables.
The main table is bookings, which stores crucial information about bookings made for flights. The other two tables, flights and airports_data, capture details about flights and airports, respectively. The relationships between these tables are established through foreign keys.
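As an optional sanity check, a short psycopg2 sketch can list the tables the script created; it assumes a local PostgreSQL instance with the postgres/postgres credentials used later in this tutorial, so adjust as needed:
# Optional sanity check: list the user tables created by airlines_db.sql.
# Assumes a local server with user "postgres" / password "postgres",
# matching the connection details used later in this tutorial.
import psycopg2

conn = psycopg2.connect(
    host="localhost", dbname="airlines", user="postgres", password="postgres"
)
with conn.cursor() as cur:
    cur.execute(
        "SELECT table_schema, table_name FROM information_schema.tables "
        "WHERE table_type = 'BASE TABLE' "
        "AND table_schema NOT IN ('pg_catalog', 'information_schema') "
        "ORDER BY table_schema, table_name"
    )
    for schema, table in cur.fetchall():
        print(f"{schema}.{table}")
conn.close()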
Setting Up Logging and Environment Variables
To provide informative messages and handle errors throughout the code, we’ll configure logging using Python’s built-in logging module. We’ll also set up environment variables to securely store sensitive information and configuration parameters. Environment variables are used to store data like passwords and email credentials separately from the code, reducing the risk of accidental exposure or unauthorized access.
In this tutorial, we’ll store the email credentials, SMTP server information, and database connection details in environment variables. Here’s what you need to do:
- Create a .env file in the project directory.
- Add the following lines to the .env file, replacing the values with your own:
  EMAIL=your-email@example.com
  PASSWORD=your-email-password
  EMAIL_PORT=587
  SMTP_SERVER=smtp.example.com
  DB_HOSTNAME=localhost
  DB_NAME=airlines
  DB_PORT=5432
  DB_USERNAME=postgres
  DB_PASSWORD=postgres
- Save the .env file.
- Load the variables into your shell with set -a && source .env && set +a. The set -a ensures the variables are exported to child processes such as the Python interpreter; a plain source .env would only set them for the current shell.
By using environment variables, you can ensure that sensitive data like passwords and email credentials are kept separate from the code. This improves security while maintaining flexibility in configuration.
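If you’d rather not source the file manually in every new shell session, one optional alternative (not used elsewhere in this tutorial) is the python-dotenv package, which loads the .env file from inside Python:
# Optional alternative to "source .env": load the file from Python itself.
# Requires an extra dependency: pip install python-dotenv
import os

from dotenv import load_dotenv

load_dotenv()  # reads KEY=value pairs from .env into os.environ
print(os.environ.get("DB_NAME"))  # should print "airlines"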
You’re all set! Now let’s move on to extracting the data from the database.
Extracting the Data From the Database
Before we can automate the reporting process, we need to connect to our database and fetch the data we need. To accomplish this, we’ll use the psycopg2 library, which allows Python applications to interact with PostgreSQL databases.
Let’s start by setting up the database configurations. Add the following code to your Python file:
import logging
import os

import psycopg2

logging.basicConfig(
    format="%(asctime)s | %(levelname)s : %(message)s", level=logging.INFO
)

DB_CONFIG = {
    "host": os.environ.get("DB_HOSTNAME"),
    "database": os.environ.get("DB_NAME"),
    "user": os.environ.get("DB_USERNAME"),
    "password": os.environ.get("DB_PASSWORD"),
}
Here, we define a dictionary called DB_CONFIG, which stores the configuration parameters for connecting to the PostgreSQL database. These parameters include the hostname, database name, username, and password. We retrieve these values from the environment variables we set up earlier.
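Before wiring this into a class, you can verify the configuration with a quick throwaway connection test. This is a minimal sketch that assumes the environment variables from the .env file are already loaded:
# Throwaway connection test using the DB_CONFIG dictionary and logging setup above.
try:
    conn = psycopg2.connect(**DB_CONFIG)
    with conn.cursor() as cursor:
        cursor.execute("SELECT version()")
        logging.info("Connected: %s", cursor.fetchone()[0])
    conn.close()
except Exception as e:
    logging.error("Connection test failed: %s", e)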
Next, let’s create a method to connect to the database and fetch the data. Add the following code:
...

class DataExporter:
    ...

    def __connect_to_database(self) -> None:
        """Establish a connection to the PostgreSQL database."""
        try:
            self.conn = psycopg2.connect(**self.db_config)
            self.cursor = self.conn.cursor()
            logging.info("Connected to the database")
        except Exception as e:
            logging.error("Failed to connect to the database with error: %s", e)
            raise

    def __fetch_from_database(self, start_timestamp, end_timestamp) -> list:
        """Fetch booking data from the database for a given time range."""
        self.__connect_to_database()
        query = f"""
            SELECT COUNT(*) AS num_bookings, SUM(total_amount) AS total_amount
            FROM bookings
            WHERE book_date >= {int(start_timestamp.timestamp()) * 1000}
            AND book_date <= {int(end_timestamp.timestamp()) * 1000}
        """
        logging.info(
            "Extracting bookings data from database for start timestamp=%s and end_timestamp=%s",
            start_timestamp,
            end_timestamp,
        )
        result = None
        try:
            self.cursor.execute(query)
            result = list(self.cursor.fetchone())
            result.append(
                f'{start_timestamp.strftime("%d %b, %Y")} - {end_timestamp.strftime("%d %b, %Y")}'
            )
            logging.info(
                "Successfully extracted bookings data from database for start timestamp=%s and end_timestamp=%s",
                start_timestamp,
                end_timestamp,
            )
        except Exception as e:
            logging.error(
                "Error occurred while extracting bookings data from database: %s", e
            )
        return result

...
This code defines two private methods within the DataExporter class: __connect_to_database and __fetch_from_database.
The __connect_to_database method establishes a connection to the PostgreSQL database using the psycopg2 library. It reads the database configuration values from the DB_CONFIG dictionary we defined earlier. If the connection is successful, it sets up a cursor for executing SQL queries.
The __fetch_from_database method fetches the booking data from the database for a given time range. It constructs a SQL query that selects the count of bookings and the total booking amount based on the provided start and end timestamps, limiting the results to bookings within that range.
After executing the query, the method retrieves the result using cursor.fetchone() and appends the timestamp range to the result list. The final result is then returned.
With these methods in place, you can easily extract booking data for different time ranges and use it in your reports.
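For reference, the list handed back by __fetch_from_database has the shape sketched below; the numbers are invented purely for illustration:
# Illustrative only: the shape of the list returned by __fetch_from_database,
# i.e. [num_bookings, total_amount, "start - end" label].
from decimal import Decimal

example_result = [1250, Decimal("3905000.00"), "01 May, 2023 - 31 May, 2023"]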
Structuring the Booking Data with the BookingInfo Class
To organize and represent booking data retrieved from the database, we’ll define a BookingInfo class in a separate file called booking_info.py. This class will encapsulate booking-related information, making it easier to work with and present the data.
In booking_info.py, add the following code:
from decimal import Decimal


class BookingInfo:
    def __init__(self, data_list):
        """Initialize BookingInfo with data from the database."""
        self.__total_bookings, self.__total_amount, self.__timestamp = data_list
        self.__total_amount = Decimal(self.__total_amount) if self.__total_amount else Decimal(0)

    def __str__(self) -> str:
        """Return a string representation of BookingInfo."""
        return f"Total Bookings: {self.__total_bookings}, Total Amount: ${self.__total_amount}"

    def get_total_bookings(self) -> int:
        """Get the total number of bookings."""
        return self.__total_bookings

    def get_total_amount(self) -> Decimal:
        """Get the total booking amount as a Decimal."""
        return self.__total_amount

    def get_timestamp(self) -> str:
        """Get the timestamp associated with the booking data."""
        return self.__timestamp
The BookingInfo class receives a data_list as input, which is assumed to be a list containing the following elements:
- total_bookings: An integer representing the total number of bookings.
- total_amount: A floating-point value representing the total booking amount.
- timestamp: A timestamp associated with the booking data.
The __init__ method initializes private instance variables (__total_bookings, __total_amount, and __timestamp) with the values from the data_list. It also converts the __total_amount to a Decimal type for precise handling of monetary values.
The __str__ method is implemented to provide a string representation of the BookingInfo object. It returns a string in the format “Total Bookings: X, Total Amount: $Y”, where X is the total number of bookings and Y is the total booking amount formatted as dollars.
To access the encapsulated data, the class provides three getter methods:
- get_total_bookings(): Returns the total number of bookings as an integer.
- get_total_amount(): Returns the total booking amount as a Decimal type.
- get_timestamp(): Returns the timestamp associated with the booking data as a string.
By encapsulating the booking data within the BookingInfo class, the code becomes more organized, readable, and reusable. This structured approach simplifies the handling of booking information throughout the application.
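As a quick illustration, here is how BookingInfo behaves with a hand-made data list; the values are invented:
from decimal import Decimal

from booking_info import BookingInfo

# Invented example data: [total_bookings, total_amount, timestamp label]
info = BookingInfo([1250, Decimal("3905000.00"), "01 May, 2023 - 31 May, 2023"])
print(info)                  # Total Bookings: 1250, Total Amount: $3905000.00
print(info.get_timestamp())  # 01 May, 2023 - 31 May, 2023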
Converting the Data into an Excel Sheet
Now that we can retrieve data from the database for a specific time range, let’s convert the extracted data into an Excel sheet. To do this, we’ll define another private method within the DataExporter class called __convert_to_excelsheet.
Add the following code to your Python file:
import pandas as pd

from booking_info import BookingInfo

...

class DataExporter:
    ...

    def __convert_to_excelsheet(self, data, sheet_name):
        """Convert the fetched data into an Excel sheet."""
        try:
            booking_info = BookingInfo(data)
            data = {
                "": ["Total Bookings", "Total Amount ($)"],
                booking_info.get_timestamp(): [
                    booking_info.get_total_bookings(),
                    booking_info.get_total_amount(),
                ],
            }
            logging.info("Converting the data into a pandas DataFrame")
            df = pd.DataFrame(data)
            logging.info("Inserting the data into the Excel sheet")
            with pd.ExcelWriter(sheet_name, engine="xlsxwriter") as writer:
                df.to_excel(writer, sheet_name="Sheet1", index=False)
            logging.info("Successfully inserted data into the Excel sheet")
        except ValueError as e:
            logging.error("Error converting data into Excel: %s", e)

...
This code defines the __convert_to_excelsheet method, which takes two parameters: data (a list containing the booking data) and sheet_name (the desired name of the Excel sheet).
Within this method, we create an instance of the BookingInfo class, passing in the data list as input. This allows us to structure and format the data for insertion into the Excel sheet.
Next, we build the data dictionary to prepare it for conversion into a pandas DataFrame. The dictionary has two key-value pairs: an empty string as the key for a list of header labels (“Total Bookings” and “Total Amount ($)”), and the timestamp range as the key for a list containing the total number of bookings and the total booking amount.
We then convert the data dictionary into a pandas DataFrame (df), a commonly used data structure for handling tabular data in Python.
To create the Excel file, we use the pd.ExcelWriter context manager with the “xlsxwriter” engine. The method’s sheet_name parameter is used as the output file name, while the sheet inside the workbook is simply named “Sheet1”.
Finally, the data within the DataFrame is written to the Excel sheet using the DataFrame’s to_excel method. We set the index parameter to False to exclude the default row numbers from the Excel sheet.
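If you want to double-check the output, you can read the generated workbook back with pandas. Note that reading .xlsx files requires the openpyxl package, which isn’t part of the installs above (pip install openpyxl):
import pandas as pd

# Verification sketch: read a generated report back into pandas.
# Assumes "Bookings Data.xlsx" already exists and openpyxl is installed.
df = pd.read_excel("Bookings Data.xlsx", sheet_name="Sheet1")
print(df)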
Now that we have the ability to convert data into an Excel sheet, let’s combine all the functionalities and generate reports automatically.
Combining the Functionalities
To generate reports automatically, we need to create a public method that users can call to extract the data from the database, convert it into an Excel sheet, and send it as an email attachment.
Add the following code to your Python file:
from datetime import datetime  # needed for the type hints below

...

class DataExporter:
    ...

    def generate_excelsheet(
        self, start_timestamp: datetime, end_timestamp: datetime, sheet_name="Bookings Data.xlsx"
    ) -> bool:
        """Generate an Excel sheet with booking data for a specified time range."""
        data = self.__fetch_from_database(start_timestamp, end_timestamp)
        if data is not None:
            self.__convert_to_excelsheet(data, sheet_name)
            return True
        else:
            logging.error("No data to generate an Excel sheet")
            return False

...
This method, generate_excelsheet, accepts several parameters:
- start_timestamp: The start of the time range for data extraction.
- end_timestamp: The end of the time range for data extraction.
- sheet_name (optional): The name of the Excel sheet to be created. Defaults to “Bookings Data.xlsx”.
The method calls the private __fetch_from_database method to retrieve the booking data for the specified time range. If data is available, it calls the private __convert_to_excelsheet method to convert the extracted data into an Excel sheet with the specified name.
If the generation of the Excel sheet is successful, the method returns True. Otherwise, it logs an error message and returns False.
Now that we can generate Excel sheets with the desired data, let’s move on to emailing the reports.
Sending an Email with the Bookings Data Report
To send an email with the Excel sheet as an attachment, we’ll use Python’s built-in smtplib library. This library allows us to communicate with an SMTP server and send emails programmatically.
In a separate file called mailer.py, add the following code:
import logging
import os
import smtplib
import ssl
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

logging.basicConfig(
    format="%(asctime)s | %(levelname)s : %(message)s", level=logging.INFO
)

SMTP_SERVER = os.environ.get("SMTP_SERVER")
PORT = os.environ.get("EMAIL_PORT")
EMAIL = os.environ.get("EMAIL")
PASSWORD = os.environ.get("PASSWORD")


def send_email(to_email: str, subject: str, attachment_name: str):
    """Send an email with an attachment to the specified recipient."""
    message = MIMEMultipart()
    message["From"] = EMAIL
    message["To"] = to_email
    message["Subject"] = subject
    body = "Hi there,\n\nPlease find attached your report.\n\nThanks"
    message.attach(MIMEText(body, "plain"))

    with open(attachment_name, "rb") as file:
        part = MIMEBase(
            "application",
            "vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        )
        part.set_payload(file.read())

    encoders.encode_base64(part)
    part.add_header(
        "Content-Disposition",
        f"attachment; filename= {attachment_name}",
    )
    logging.info(f"Attaching {attachment_name} to the email")
    message.attach(part)

    text = message.as_string()
    context = ssl.create_default_context()

    with smtplib.SMTP(SMTP_SERVER, PORT) as server:
        logging.info(f"Sending email to {to_email}")
        server.starttls(context=context)
        server.login(EMAIL, PASSWORD)
        server.sendmail(EMAIL, to_email, text)
        logging.info(f"Successfully sent the email to {to_email}")
This code defines the send_email function, which takes three parameters: to_email (the recipient’s email address), subject (the subject of the email), and attachment_name (the filename of the attachment).
Within the function, we construct an email message using the MIMEMultipart class. We set the sender’s email address, recipient’s email address, and subject, and include a simple plain-text body with a brief message for the recipient.
The function then attaches the bookings data report: it reads the attachment file, base64-encodes it, and adds it to the email message using the MIMEBase class, so the recipient can easily download the report from the email.
Finally, the script connects to the SMTP server, logs in using the sender’s email address and password, sends the email, and logs a success message upon successful transmission.
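Here is a minimal usage sketch, assuming the environment variables are loaded and a report file already exists in the working directory; the address is just a placeholder:
from mailer import send_email

# Placeholder recipient address; replace it with a real one.
send_email("recipient@example.com", "Bookings report", "Bookings Data.xlsx")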
With the ability to send emails with attachments in place, we’re ready to test the entire flow of the application.
Testing the Flow
Now comes the exciting part—testing the flow of our application. In this section, we’ll automate the generation and delivery of reports. Here’s an example of how you can do it:
from datetime import datetime

from exporter import DataExporter
from mailer import send_email

# Example of how you can automate monthly reports
start_timestamp = datetime(2023, 5, 1, 0, 0, 0)    # May 1, 2023 00:00:00
end_timestamp = datetime(2023, 5, 31, 23, 59, 59)  # May 31, 2023 23:59:59

exporter = DataExporter()
if exporter.generate_excelsheet(
    start_timestamp, end_timestamp, sheet_name="Monthly Report.xlsx"
):
    # Placeholder recipient address; replace it with a real one.
    send_email("manager@example.com", "Your Monthly Report", "Monthly Report.xlsx")
In the code above, we set up the start and end timestamps to cover the entire month of May 2023. Next, we create an instance of the DataExporter class, which handles the data export and Excel sheet generation. We then call the generate_excelsheet method with the specified timestamps and sheet name. If the Excel sheet generation is successful, we use the send_email function from mailer.py to send the email containing the generated report.
You can also automate the generation of weekly reports using a similar approach. Simply change the start and end timestamps to cover the desired week.
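For instance, a weekly variant could compute the most recent full Monday-to-Sunday range along the same lines as the monthly example above; this is just a sketch:
from datetime import datetime, timedelta

from exporter import DataExporter
from mailer import send_email

# Most recent full week (Monday 00:00:00 through Sunday 23:59:59), relative to today.
today = datetime.now()
last_monday = (today - timedelta(days=today.weekday() + 7)).replace(
    hour=0, minute=0, second=0, microsecond=0
)
last_sunday = (last_monday + timedelta(days=6)).replace(
    hour=23, minute=59, second=59, microsecond=0
)

exporter = DataExporter()
if exporter.generate_excelsheet(last_monday, last_sunday, sheet_name="Weekly Report.xlsx"):
    send_email("manager@example.com", "Your Weekly Report", "Weekly Report.xlsx")  # placeholder address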
Scheduling the Application
To truly automate the process, we need to schedule the execution of the script at the desired intervals. For example, we might want to generate and send reports every Monday for the previous week’s data and on the 1st day of every month for the previous month’s data. To achieve this, we can use the schedule library (install it into your virtual environment with pip install schedule).
Here’s an example of how you can schedule the execution of the script:
import time
from datetime import datetime, timedelta

import schedule

from exporter import DataExporter
from mailer import send_email


def main():
    today = datetime.now()
    if today.weekday() == 0:  # Check if it's Monday (0 means Monday)
        # It's Monday, fetch data for the previous week (Monday to Sunday)
        start_timestamp = (today - timedelta(days=7)).replace(
            hour=0, minute=0, second=0, microsecond=0
        )
        end_timestamp = (today - timedelta(days=1)).replace(
            hour=23, minute=59, second=59, microsecond=0
        )
        sheet_name = "Weekly Report.xlsx"
    elif today.day == 1:
        # It's the 1st day of the month, fetch data for the last month
        start_timestamp = (today.replace(day=1) - timedelta(days=1)).replace(
            day=1, hour=0, minute=0, second=0, microsecond=0
        )
        end_timestamp = (today.replace(day=1) - timedelta(days=1)).replace(
            hour=23, minute=59, second=59, microsecond=0
        )
        sheet_name = "Monthly Report.xlsx"
    else:
        # Neither Monday nor the 1st of the month: nothing to report today
        return

    exporter = DataExporter()
    exporter.generate_excelsheet(start_timestamp, end_timestamp, sheet_name)
    send_email("manager@example.com", "Your Report", sheet_name)  # placeholder address


schedule.every().day.at("00:00").do(main)

while True:
    schedule.run_pending()
    time.sleep(60)  # avoid a busy loop; checking once a minute is plenty for a daily job
In the code above, we use the schedule library to run the main function daily at midnight. The main function calculates the timestamps for data extraction and Excel sheet generation. After generating the Excel sheet, the script sends it via email to the specified recipient.
If the script runs on a Monday, it generates a weekly report. The start and end timestamps are calculated to cover the previous week from Monday to Sunday. The Excel sheet is named “Weekly Report.xlsx”.
On the 1st day of the month, the script generates a monthly report. The start and end timestamps encompass the entire previous month. The Excel sheet is named “Monthly Report.xlsx”.
By scheduling the script using the schedule library, you can automate the generation and delivery of reports according to your desired intervals.
Wrapping Up
In this tutorial, you learned how to automate the generation and delivery of reports using Python, PostgreSQL, and emails. By leveraging the power of Python, you can streamline and simplify repetitive tasks, saving valuable time and reducing the risk of errors.
Automation doesn’t have to be complicated or intimidating. With the right tools and a well-structured approach, you can transform tedious tasks into efficient workflows.
Remember, the possibilities are endless. You can build on this foundation and explore additional features, such as storing email recipients in a database or using Cron Jobs to schedule the execution at specific times.
Feel free to experiment, adapt, and expand on these concepts to fit your unique needs. With your newfound automation superpowers, you’ll find yourself accomplishing more in less time.
Thank you for joining me on this journey. I hope you found this tutorial helpful, and I wish you all the best as you automate your own tasks. Happy coding!