How to Calculate the Time Difference Between Two Dates in Pandas

 

Extracting the time difference between two dates in Pandas can provide valuable insights for your analysis, make the data easier to understand, and improve the precision of a machine learning model’s predictions.

 

Python is exceptionally adept at time series data analysis, effectively handling dates in various formats. Utilizing Pandas, you can add or subtract days from dates, creating new features or enabling more insightful data analyses.

 

This tutorial demonstrates how to compute the time difference between two dates in Pandas and express it in seconds, minutes, hours, days, weeks, or months. The process is quick and straightforward, making it an excellent entry point for working with date data.

 

Data Load

To begin, open your Jupyter notebook or Google Colab file and import the Pandas library, then use the Pandas read_csv() function to import the dataset. Let’s go ahead and import a logistics dataset containing columns for a delivery start date and end date.

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/ngorovitch/free-datasets/main/Logistics/Truck%20Delivery%20Data.csv',  encoding = "ISO-8859-1")
df = df[['trip_start_date', 'trip_end_date']].dropna()

 

For the purpose of this exercise, we keep only the two date columns and drop the rows with missing values. We will then use this dataframe to showcase the various Pandas functionalities for calculating with dates.


Inspect Data Format 

Now, let’s look at the data format by running df.info() to retrieve comprehensive information about the columns and their respective data types within the dataframe. Frequently, datasets of this nature have dates that come in the form of object data types or strings. Before being able to perform any date calculations, we will need to convert those columns to datetime format.

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6684 entries, 45 to 6728
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   trip_start_date  6684 non-null   object
 1   trip_end_date    6684 non-null   object
dtypes: object(2)
memory usage: 156.7+ KB

 

 

Transform Dates into Datetime Format 

Pandas offers several valuable functions for converting date values. In our case, we will use the to_datetime() function to transform the dates into datetime objects. With the errors='coerce' argument, any unconvertible dates are set to NaT (or “not a time”). Let’s go ahead and apply to_datetime() to our date columns to perform the conversion.

df['trip_start_date'] = pd.to_datetime(df['trip_start_date'], errors='coerce')
df['trip_end_date'] = pd.to_datetime(df['trip_end_date'], errors='coerce')
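
To illustrate what errors='coerce' does, here is a minimal sketch on a small made-up Series (the values are hypothetical and not taken from the dataset). The string that cannot be parsed becomes NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical values: one valid date and one string that cannot be parsed
raw = pd.Series(['2020-08-26 16:16:00', 'not a date'])

# errors='coerce' turns the unparseable entry into NaT instead of raising
parsed = pd.to_datetime(raw, errors='coerce')
print(parsed)

# Count how many values failed to convert
print(parsed.isna().sum())
```

Counting the NaT values after the conversion is a quick way to check how many rows could not be converted.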

Running a quick df.info(), we can see that the column type has changed from object to datetime64. The columns are now ready to be used for calculations.

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6684 entries, 45 to 6728
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   trip_start_date  6684 non-null   datetime64[ns]
 1   trip_end_date    6684 non-null   datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 156.7 KB
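
As an alternative, and assuming the dates are stored in a format Pandas recognizes, read_csv() can convert the columns while loading through its parse_dates parameter. The sketch below uses a small in-memory CSV with made-up rows rather than the real dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the delivery dataset
csv_text = """trip_start_date,trip_end_date
2020-08-26 16:16:00,2020-08-28 12:15:00
2020-08-28 07:58:00,2020-08-28 11:14:00
"""

# parse_dates asks read_csv to convert the listed columns to datetime
sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=['trip_start_date', 'trip_end_date'],
)
print(sample.dtypes)
```

Unlike to_datetime() with errors='coerce', this route does not turn unparseable values into NaT, so the explicit conversion shown above is the safer choice for messy data.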

 

 

Compute Time Duration Between Two Dates as Timedelta 

Python, in conjunction with Pandas, offers various methods to compute time intervals between two dates. The initial approach involves subtracting one date from the other, which yields a timedelta result, for instance, 0 days 05:00:00. This timedelta indicates the precise count of days, hours, minutes, and seconds separating the two dates. While this method is convenient for basic temporal representations, you might require a more detailed breakdown of days, hours, minutes, and seconds for a more intricate analysis.

df['delivery_time'] = df['trip_end_date'] - df['trip_start_date']
df.head()
	trip_start_date	trip_end_date	delivery_time
45	2020-08-26 16:16:00	2020-08-28 12:15:00	1 days 19:59:00
46	2020-08-26 11:58:00	2020-08-28 11:25:00	1 days 23:27:00
47	2020-08-26 15:35:00	2020-08-28 11:21:00	1 days 19:46:00
48	2020-08-28 07:58:00	2020-08-28 11:14:00	0 days 03:16:00
49	2020-08-27 17:21:00	2020-08-28 10:42:00	0 days 17:21:00
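
If you need the breakdown into days, hours, minutes, and seconds as separate numbers, the dt.components accessor is one option. The sketch below rebuilds the first interval of the table above by hand, so it runs without the dataset.

```python
import pandas as pd

# The interval from the first row of the table above, rebuilt by hand
delta = pd.Series([
    pd.Timestamp('2020-08-28 12:15:00') - pd.Timestamp('2020-08-26 16:16:00')
])

# dt.components returns a DataFrame with one column per unit
parts = delta.dt.components
print(parts[['days', 'hours', 'minutes', 'seconds']])
```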

Calculate the time difference between two dates in seconds

One way to express the difference between two dates as a single number is to use the dt.total_seconds() method. By wrapping our subtraction in parentheses and then appending .dt.total_seconds() to it, we obtain the number of seconds between the two dates and can assign it to a new column.

df['delivery_time_seconds'] = (df['trip_end_date'] - df['trip_start_date']).dt.total_seconds()
df.delivery_time_seconds.head()
45    158340.0
46    170820.0
47    157560.0
48     11760.0
49     62460.0
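
A common pitfall is to confuse dt.total_seconds() with the dt.seconds attribute. The short sketch below, on a hand-built interval equal to the first row above, shows the difference: dt.seconds ignores whole days, while dt.total_seconds() covers the entire interval.

```python
import pandas as pd

# Same length as the first delivery above: 1 day, 19 hours, 59 minutes
delta = pd.Series([pd.Timedelta(days=1, hours=19, minutes=59)])

# dt.seconds: only the seconds left over after removing whole days
leftover_seconds = delta.dt.seconds.iloc[0]

# dt.total_seconds(): the whole interval expressed in seconds
all_seconds = delta.dt.total_seconds().iloc[0]

print(leftover_seconds, all_seconds)
```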

 

Calculate the time difference between two dates in minutes

Since there are 60 seconds in every minute, we can simply divide the number of seconds between the two dates by 60 to obtain the number of minutes between the two dates. Again, we can assign the result to a new column.

df['delivery_time_minutes'] = (df['trip_end_date'] - df['trip_start_date']).dt.total_seconds() / 60
df['delivery_time_minutes'].head()
45    2639.0
46    2847.0
47    2626.0
48     196.0
49    1041.0

Calculate the time difference between two dates in hours

To calculate the time difference between the two dates in hours, we can divide the total_seconds() value by 60 to obtain the minutes, then divide by 60 again to obtain the hours. We can then assign the time difference in hours to a new column in the dataframe.

df['delivery_time_hours'] = (df['trip_end_date'] - df['trip_start_date']).dt.total_seconds() / 60 / 60
df['delivery_time_hours'].head()
45    43.983333
46    47.450000
47    43.766667
48     3.266667
49    17.350000

Calculate the time difference between two dates in days

Finally, we’ll calculate the time difference between the two dates in days by modifying the step above to also divide by 24 hours, which gives a fractional number of days. The second line uses dt.days instead, which returns only the whole days of each difference. This is not the only way to achieve this in Pandas, but it’s one of the easiest to use, and it runs quickly, even on large time series datasets.

df['delivery_time_days'] = (df['trip_end_date'] - df['trip_start_date']).dt.total_seconds() / 60 / 60 / 24
df['delivery_time_days_2'] = (df['trip_end_date'] - df['trip_start_date']).dt.days
df.delivery_time_days.head()
45    1.832639
46    1.977083
47    1.823611
48    0.136111
49    0.722917
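
The two day calculations do not return the same values. As a minimal sketch on two hand-built intervals (matching the first and fourth deliveries above), dt.days keeps only the whole days, while dividing by a one-day Timedelta keeps the fraction.

```python
import pandas as pd

delta = pd.Series([
    pd.Timedelta(days=1, hours=19, minutes=59),
    pd.Timedelta(hours=3, minutes=16),
])

# Integer day component only
whole_days = delta.dt.days

# Float number of days, partial days included
fractional_days = delta / pd.Timedelta(days=1)

print(pd.DataFrame({'whole_days': whole_days, 'fractional_days': fractional_days}))
```

Pick dt.days when you want a count of completed days, and the fractional version when short deliveries should not collapse to zero.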

Calculate the time difference between two dates in weeks

Now let’s calculate the difference between two dates in weeks. We start by subtracting the trip start date from the trip end date. Then we use the dt.days accessor to get the whole days of that difference and divide the result by 7, which gives the value in weeks.

Another variation uses the delivery_time column and applies a lambda function that divides each row by a one-week timedelta. Because this version does not drop partial days, its results can differ slightly from the first one.

 

# Calculate delivery time in weeks from the whole days of each difference
df['delivery_time_weeks'] = (df['trip_end_date'] - df['trip_start_date']).dt.days / 7

# Alternative: divide each timedelta by a one-week timedelta (keeps partial days)
df['delivery_time_weeks_2'] = df['delivery_time'].apply(lambda x: x / pd.Timedelta(weeks=1))
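
The two week calculations above are not interchangeable. The sketch below, on two made-up intervals, compares them side by side: the dt.days version discards partial days before dividing, so it returns smaller values.

```python
import pandas as pd

# Hypothetical delivery times
delivery_time = pd.Series([
    pd.Timedelta(days=1, hours=19, minutes=59),
    pd.Timedelta(days=10, hours=12),
])

# Whole days only, then converted to weeks
weeks_from_days = delivery_time.dt.days / 7

# Full interval divided by one week
weeks_exact = delivery_time / pd.Timedelta(weeks=1)

print(pd.DataFrame({'weeks_from_days': weeks_from_days, 'weeks_exact': weeks_exact}))
```

Dividing the whole column by pd.Timedelta(weeks=1) also avoids the row-by-row apply().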

Calculate the time difference between two dates in months

To calculate the time difference in calendar months, we take the difference between the years with dt.year and multiply it by 12, then add the difference between the months with dt.month. Another version uses a lambda to divide each time difference by the average month length of 30.44 days.

df['delivery_time_months'] = (df['trip_end_date'].dt.year - df['trip_start_date'].dt.year)\
 * 12 + (df['trip_end_date'].dt.month - df['trip_start_date'].dt.month)
df['delivery_time_months_2'] = df['trip_end_date'].sub(df['trip_start_date'])\
.apply(lambda x: x / pd.Timedelta(days=30.44))  # Average month length
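
The two month calculations answer different questions, which the following sketch on made-up dates tries to show. The calendar version counts how many month boundaries were crossed, while the average-length version measures elapsed time.

```python
import pandas as pd

# Hypothetical dates: the first pair is one day apart but spans a month change
start = pd.Series(pd.to_datetime(['2020-01-31', '2020-01-01']))
end = pd.Series(pd.to_datetime(['2020-02-01', '2020-03-01']))

# Calendar months: year difference times 12 plus month difference
calendar_months = (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)

# Elapsed time divided by an average month of 30.44 days
average_months = (end - start) / pd.Timedelta(days=30.44)

print(pd.DataFrame({'calendar': calendar_months, 'average': average_months}))
```

For the first pair the calendar method reports one month even though only a day has passed, so choose the method that matches what you want to measure.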

 

 

Calculate the time difference between two dates with timezones.

Handling time zones can be a bit tricky, but Pandas supports them directly, relying on a time zone library such as pytz behind the scenes. The tz_localize() method sets the time zone of the datetime columns, and the tz_convert() method converts them to another time zone if needed.

Remember to replace ‘America/New_York’ with your desired target time zone.

As you can see in the result, each timestamp carries its UTC offset at the end: +00:00 for the original start date and -04:00 for the US version. Both columns describe the same instants, so the time difference between two dates does not change with the conversion.

import pytz

# Assuming the data is in a specific time zone (e.g., 'UTC')
data_time_zone = 'UTC'
df['trip_start_date'] = df['trip_start_date'].dt.tz_localize(data_time_zone)
df['trip_end_date'] = df['trip_end_date'].dt.tz_localize(data_time_zone)

# Calculate delivery time in weeks
df['delivery_time_weeks'] = (df['trip_end_date'] - df['trip_start_date']).dt.days / 7

# Convert time zones if needed (e.g., from 'UTC' to another time zone)
target_time_zone = 'America/New_York'
df['trip_start_date_us'] = df['trip_start_date'].dt.tz_convert(target_time_zone)
df['trip_end_date_us'] = df['trip_end_date'].dt.tz_convert(target_time_zone)

df[["trip_start_date","trip_start_date_us"]].head()
trip_start_date	trip_start_date_us
45	2020-08-26 16:16:00+00:00	2020-08-26 12:16:00-04:00
46	2020-08-26 11:58:00+00:00	2020-08-26 07:58:00-04:00
47	2020-08-26 15:35:00+00:00	2020-08-26 11:35:00-04:00
48	2020-08-28 07:58:00+00:00	2020-08-28 03:58:00-04:00
49	2020-08-27 17:21:00+00:00	2020-08-27 13:21:00-04:00
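
To check that converting time zones leaves the duration untouched, here is a small sketch that rebuilds the first delivery by hand, localizes it to UTC (the same assumption as above), and compares the difference before and after tz_convert().

```python
import pandas as pd

# First delivery from the table, localized to UTC as assumed above
start = pd.Series(pd.to_datetime(['2020-08-26 16:16:00'])).dt.tz_localize('UTC')
end = pd.Series(pd.to_datetime(['2020-08-28 12:15:00'])).dt.tz_localize('UTC')

# The same instants expressed in another time zone
start_ny = start.dt.tz_convert('America/New_York')
end_ny = end.dt.tz_convert('America/New_York')

# Subtraction compares absolute instants, so both results match
delta_utc = end - start
delta_ny = end_ny - start_ny
print(delta_utc.iloc[0], delta_ny.iloc[0])
```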

Calculate the time difference between two dates considering time intervals.

Working with specific time intervals, such as business hours or working days, requires additional logic to calculate the delivery time accurately.

 

Let’s set up the calculate_working_time function, which calculates the working time (in hours) between the trip_start_date and trip_end_date, counting only the hours that fall within working hours. Adjust the working_hours_start and working_hours_end variables to match your specific working hours.

# Convert dates to datetime format
df['trip_start_date'] = pd.to_datetime(df['trip_start_date'], errors='coerce')
df['trip_end_date'] = pd.to_datetime(df['trip_end_date'], errors='coerce')

# Set up working hours (adjust as needed)
working_hours_start = 8
working_hours_end = 17

# Define function to calculate the time spent within working hours
def calculate_working_time(start_date, end_date):
    total_hours = 0
    current_datetime = start_date
    
    while current_datetime < end_date:
        # Working window for the current day
        start_of_day = current_datetime.replace(hour=working_hours_start, minute=0, second=0, microsecond=0)
        end_of_day = current_datetime.replace(hour=working_hours_end, minute=0, second=0, microsecond=0)
        
        # Clip the window to the trip: it cannot begin before the trip starts
        # or finish after the trip ends
        start_of_day = max(start_date, start_of_day)
        end_of_day = min(end_date, end_of_day)
        
        working_time = (end_of_day - start_of_day).total_seconds() / 3600
        total_hours += max(0, working_time)
        
        # Move to midnight of the next day so the final day is not skipped
        current_datetime = current_datetime.normalize() + pd.DateOffset(days=1)
    
    return total_hours

# Calculate working time for each row
df['working_time'] = df.apply(lambda row: calculate_working_time(row['trip_start_date'], row['trip_end_date']), axis=1)

# Print the relevant columns of the modified DataFrame
print(df[['trip_start_date', 'trip_end_date', 'working_time']].head())
       trip_start_date       trip_end_date  working_time
45 2020-08-26 16:16:00 2020-08-28 12:15:00     13.983333
46 2020-08-26 11:58:00 2020-08-28 11:25:00     17.450000
47 2020-08-26 15:35:00 2020-08-28 11:21:00     13.766667
48 2020-08-28 07:58:00 2020-08-28 11:14:00      3.233333
49 2020-08-27 17:21:00 2020-08-28 10:42:00      2.700000
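
The core of this calculation is clipping a trip to one day's working window. As a minimal standalone sketch, the hypothetical helper overlap_hours below (not part of Pandas) computes the hours shared by a trip and a window, using made-up timestamps.

```python
import pandas as pd

def overlap_hours(start, end, window_start, window_end):
    """Hours shared by [start, end] and [window_start, window_end] (hypothetical helper)."""
    latest_start = max(start, window_start)
    earliest_end = min(end, window_end)
    # A negative value means the two intervals do not overlap at all
    return max(0.0, (earliest_end - latest_start).total_seconds() / 3600)

# Working window from 08:00 to 17:00 on a made-up day
day = pd.Timestamp('2020-08-24')
window = (day.replace(hour=8), day.replace(hour=17))

# Trip starts before opening and ends after closing: the full window counts
full_day = overlap_hours(pd.Timestamp('2020-08-24 07:30'), pd.Timestamp('2020-08-24 18:00'), *window)

# Trip entirely outside working hours: nothing counts
outside = overlap_hours(pd.Timestamp('2020-08-24 18:00'), pd.Timestamp('2020-08-24 20:00'), *window)

print(full_day, outside)
```

Summing this overlap over every day of a trip gives the total working time.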

 

Calculate the time difference between two dates considering irregular time intervals.

To account for irregular intervals, such as weekends and holidays, we need additional logic to identify the non-working days and adjust the calculations accordingly. Let’s modify our calculate_working_time function by adding a non_working_days variable and a holidays variable, which hold the days to skip. You can fill those variables manually or use a Python calendar library to do so.

 

# Convert dates to datetime format
df['trip_start_date'] = pd.to_datetime(df['trip_start_date'], errors='coerce')
df['trip_end_date'] = pd.to_datetime(df['trip_end_date'], errors='coerce')

# Set up working hours (adjust as needed)
working_hours_start = 8
working_hours_end = 17

# Define non-working days (holidays and weekends)
non_working_days = [5, 6]  # Saturday (5) and Sunday (6)
# Example holidays, stored as plain dates so they can be compared with current_date
holidays = [pd.Timestamp('2023-01-01').date(), pd.Timestamp('2023-12-25').date()]

# Define function to calculate working time with excluded non-working days
def calculate_working_time(start_date, end_date):
    total_hours = 0
    current_datetime = start_date
    
    while current_datetime < end_date:
        current_date = current_datetime.date()
        current_day_of_week = current_datetime.weekday()
        
        if current_day_of_week not in non_working_days and current_date not in holidays:
            # Working window for the current day
            start_of_day = current_datetime.replace(hour=working_hours_start, minute=0, second=0, microsecond=0)
            end_of_day = current_datetime.replace(hour=working_hours_end, minute=0, second=0, microsecond=0)

            # Clip the window to the trip: it cannot begin before the trip starts
            # or finish after the trip ends
            start_of_day = max(start_date, start_of_day)
            end_of_day = min(end_date, end_of_day)

            working_time = (end_of_day - start_of_day).total_seconds() / 3600
            total_hours += max(0, working_time)
        
        # Move to midnight of the next day so the final day is not skipped
        current_datetime = current_datetime.normalize() + pd.DateOffset(days=1)
    
    return total_hours

# Calculate working time for each row
df['working_time'] = df.apply(lambda row: calculate_working_time(row['trip_start_date'], row['trip_end_date']), axis=1)

# Print the relevant columns of the modified DataFrame
print(df[['trip_start_date', 'trip_end_date', 'working_time']].head())
       trip_start_date       trip_end_date  working_time
45 2020-08-26 16:16:00 2020-08-28 12:15:00     13.983333
46 2020-08-26 11:58:00 2020-08-28 11:25:00     17.450000
47 2020-08-26 15:35:00 2020-08-28 11:21:00     13.766667
48 2020-08-28 07:58:00 2020-08-28 11:14:00      3.233333
49 2020-08-27 17:21:00 2020-08-28 10:42:00      2.700000
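
Instead of typing the holidays by hand, one possible approach is to generate them from a calendar. The sketch below assumes that US federal holidays are the relevant calendar, which may not match your data, and uses the USFederalHolidayCalendar class that ships with Pandas together with NumPy's busday_count() to count working days between two made-up dates.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Assumption: US federal holidays are the calendar that applies to the data
holidays = USFederalHolidayCalendar().holidays(start='2023-01-01', end='2023-12-31')

# busday_count expects day-precision dates
holiday_days = holidays.values.astype('datetime64[D]')

# Count weekdays from the first date up to, but not including, the second,
# skipping the holidays listed above
working_days = np.busday_count('2023-12-22', '2023-12-27', holidays=holiday_days)
print(working_days)
```

This counts whole working days only. For working hours, the generated dates can be converted with .date() and used as the holidays list of the function above.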

 

Visualize the difference between two dates with Seaborn

Visualizing time differences can provide valuable insights into the distribution of your data. The sns.histplot() function creates a histogram of the working time distribution, while the sns.boxplot() function generates a box plot to show the spread and outliers in the data.

import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='working_time', bins=20, kde=True)
plt.title('Distribution of Working Time')
plt.xlabel('Working Time (hours)')
plt.ylabel('Frequency')
plt.show()

# Visualize time differences using a box plot
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, y='working_time')
plt.title('Box Plot of Working Time')
plt.ylabel('Working Time (hours)')
plt.show()

[Figure: histogram of the working time distribution]

[Figure: box plot of the working time]

If you made it this far in the article, thank you very much.

 

I hope this information was of use to you. 

 

Feel free to use any information from this page. I’d appreciate it if you can simply link to this article as the source. If you have any additional questions, you can reach out to malick@malicksarr.com  or message me on Twitter. If you want more content like this, join my email list to receive the latest articles. I promise I do not spam. 

 
