Calculating the time difference between two dates in Pandas can give you useful features for your analysis, make your data easier to understand, and improve the accuracy of a machine learning model’s predictions.
Python is well suited to time series analysis and handles dates in many formats. With Pandas, you can add or subtract dates to create new features or support more insightful analyses.
This tutorial demonstrates how to compute the time difference between two dates in Pandas and express it in seconds, minutes, hours, days, weeks, or months. The process is quick and straightforward, which makes it a good entry point for working with date data.
Data Loading
To begin, open your Jupyter notebook or Google Colab file and import the Pandas library, then use the read_csv() function to import the dataset. Let’s import a logistics dataset containing columns for a delivery start date and end date.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/ngorovitch/free-datasets/main/Logistics/Truck%20Delivery%20Data.csv', encoding = "ISO-8859-1")
df = df[['trip_start_date', 'trip_end_date']].dropna()
For the purpose of this exercise, we keep only the two date columns and drop any rows with missing values. We will then use them to showcase the various Pandas functionalities for calculating with dates.
Inspect Data Format
Now, let’s look at the data format by running df.info() to retrieve comprehensive information about the columns and their respective data types within the dataframe. Frequently, datasets of this nature have dates that come in the form of object data types or strings. Before being able to perform any date calculations, we will need to convert those columns to datetime format.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6684 entries, 45 to 6728
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 trip_start_date 6684 non-null object
1 trip_end_date 6684 non-null object
dtypes: object(2)
memory usage: 156.7+ KB
Transform Dates into Datetime Format
Pandas offers several valuable functions for reshaping date values. In our case, we will use the to_datetime() function to transform the dates into datetime objects. By passing the errors='coerce' argument, any unconvertible dates will be marked as NaT (“not a time”). Let’s apply to_datetime() to our date columns to perform the conversion.
df['trip_start_date'] = pd.to_datetime(df['trip_start_date'], errors='coerce')
df['trip_end_date'] = pd.to_datetime(df['trip_end_date'], errors='coerce')
Running a quick df.info(), we can see that the column type has changed from object to datetime64. The columns are now ready to be used for calculations.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6684 entries, 45 to 6728
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 trip_start_date 6684 non-null datetime64[ns]
1 trip_end_date 6684 non-null datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 156.7 KB
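To see what errors='coerce' does in practice, here is a minimal sketch on a small hand-made Series. The values below are hypothetical and are not taken from the delivery dataset; they only illustrate how an unparseable entry becomes NaT and how you might count such entries before going further.

```python
import pandas as pd

# Hypothetical raw values: two valid dates and one entry that cannot be parsed
raw = pd.Series(['2020-08-26 16:16', 'not a date', '2020-08-28 12:15'])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors='coerce')

# Count the NaT entries so they can be inspected or dropped
n_invalid = parsed.isna().sum()
print(parsed)
print(f"Unparseable values: {n_invalid}")
```

Checking this count after the conversion is a cheap way to notice that some dates were silently discarded.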
Compute Time Duration Between Two Dates as Timedelta
Python, in conjunction with Pandas, offers various methods to compute time intervals between two dates. The initial approach involves subtracting one date from the other, which yields a timedelta result, for instance, 0 days 05:00:00. This timedelta indicates the precise count of days, hours, minutes, and seconds separating the two dates. While this method is convenient for basic temporal representations, you might require a more detailed breakdown of days, hours, minutes, and seconds for a more intricate analysis.
df['delivery_time'] = df['trip_end_date'] - df['trip_start_date']
df.head()
trip_start_date trip_end_date delivery_time
45 2020-08-26 16:16:00 2020-08-28 12:15:00 1 days 19:59:00
46 2020-08-26 11:58:00 2020-08-28 11:25:00 1 days 23:27:00
47 2020-08-26 15:35:00 2020-08-28 11:21:00 1 days 19:46:00
48 2020-08-28 07:58:00 2020-08-28 11:14:00 0 days 03:16:00
49 2020-08-27 17:21:00 2020-08-28 10:42:00 0 days 17:21:00
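If you want the timedelta split into its individual parts, one option is the dt.components accessor. The sketch below rebuilds two of the rows shown above by hand (the variable names are only for illustration) and extracts days, hours, minutes, and seconds as separate columns.

```python
import pandas as pd

# Two trips copied by hand from the output above
start = pd.Series(pd.to_datetime(['2020-08-26 16:16:00', '2020-08-28 07:58:00']))
end = pd.Series(pd.to_datetime(['2020-08-28 12:15:00', '2020-08-28 11:14:00']))

delta = end - start

# dt.components returns a DataFrame with one column per time unit
parts = delta.dt.components[['days', 'hours', 'minutes', 'seconds']]
print(parts)
```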
Calculate the time difference between two dates in seconds
One way to express the difference between two dates as a single number is to use the dt.total_seconds() method. By wrapping the subtraction in parentheses and appending .dt.total_seconds(), we obtain the number of seconds between the two dates and can assign it to a new column.
df['delivery_time_seconds'] = (df['trip_end_date'] - df['trip_start_date']).dt.total_seconds()
df.delivery_time_seconds.head()
45 158340.0
46 170820.0
47 157560.0
48 11760.0
49 62460.0
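A common source of confusion is the similarly named dt.seconds attribute, which returns only the seconds part of each timedelta and ignores the days. The short sketch below, using one hand-built value, shows the difference between the two.

```python
import pandas as pd

# One timedelta equal to the first row above: 1 day, 19 hours, 59 minutes
delta = pd.Series([pd.Timedelta('1 days 19:59:00')])

# Full duration expressed in seconds (days included)
total = delta.dt.total_seconds()

# Only the sub-day remainder in seconds (the 1 day is left out)
remainder = delta.dt.seconds

print(total.iloc[0], remainder.iloc[0])
```

For durations longer than a day, dt.total_seconds() is the one you want.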
Calculate the time difference between two dates in minutes
Since there are 60 seconds in every minute, we can simply divide the number of seconds between the two dates by 60 to obtain the number of minutes between the two dates. Again, we can assign the result to a new column.
df['delivery_time_minutes'] = (df['trip_end_date'] - df['trip_start_date']).dt.total_seconds() / 60
df['delivery_time_minutes'].head()
45 2639.0
46 2847.0
47 2626.0
48 196.0
49 1041.0
Calculate the time difference between two dates in hours
To calculate the time difference between the two dates in hours, we can divide the total_seconds() value by 60 to obtain the minutes, then divide by 60 again to obtain the hours. We can then assign the time difference in hours to a new column in the dataframe.
df['delivery_time_hours'] = (df['trip_end_date'] - df['trip_start_date']).dt.total_seconds() / 60 / 60
df['delivery_time_hours'].head()
45 43.983333
46 47.450000
47 43.766667
48 3.266667
49 17.350000
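As an alternative to chaining divisions by 60, a timedelta Series can be divided directly by a pd.Timedelta of one unit. This is a small sketch on two hand-built durations, not a replacement for the approach above; both give the same numbers.

```python
import pandas as pd

# Two durations copied by hand from the delivery_time column
delta = pd.Series(pd.to_timedelta(['1 days 19:59:00', '0 days 03:16:00']))

# Dividing by a one-unit Timedelta returns a float in that unit
hours = delta / pd.Timedelta(hours=1)
minutes = delta / pd.Timedelta(minutes=1)

print(hours)
print(minutes)
```

Naming the unit explicitly can make the intent easier to read than a chain of / 60 / 60.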
Calculate the time difference between two dates in days
Finally, we’ll calculate the time difference between the two dates in days by dividing the hours from the previous step by 24, which gives fractional days. The second line below uses .dt.days instead, which returns only the whole-day part of each timedelta (1 rather than 1.83 for the first row). This is not the only way to achieve this in Pandas, but it is easy to use and runs quickly, even on large time series datasets.
df['delivery_time_days'] = (df['trip_end_date'] - df['trip_start_date']).dt.total_seconds() / 60 / 60 / 24
df['delivery_time_days_2'] = (df['trip_end_date'] - df['trip_start_date']).dt.days
df.delivery_time_days.head()
45 1.832639
46 1.977083
47 1.823611
48 0.136111
49 0.722917
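The gap between whole days and fractional days is easiest to see side by side. The sketch below uses two hand-built durations, one of them negative (an end date before the start date), which is an assumption added here for illustration and does not come from the dataset.

```python
import pandas as pd

# One positive duration from the data and one hypothetical negative duration
delta = pd.Series([pd.Timedelta('1 days 19:59:00'), pd.Timedelta(hours=-12)])

# Whole-day component only; note that it rounds toward negative infinity
whole_days = delta.dt.days

# Fractional days, keeping the partial day
fractional_days = delta.dt.total_seconds() / 86400

print(whole_days.tolist())
print(fractional_days.tolist())
```

If partial days matter for your analysis, the fractional version is the safer choice.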
Calculate the time difference between two dates in weeks
Now let’s calculate the difference between two dates in weeks. We subtract the trip start date from the trip end date, use the .dt.days accessor to get the number of whole days, and divide it by 7 to get the value in weeks. Because .dt.days drops any partial day, this result is slightly truncated.
Another variation starts from the delivery_time column and applies a lambda function that divides each value by a one-week Timedelta, which keeps the fractional part.
# Calculate delivery time in weeks from the number of whole days
df['delivery_time_weeks'] = (df['trip_end_date'] - df['trip_start_date']).dt.days / 7
# Calculate delivery time in weeks by dividing by a one-week Timedelta
df['delivery_time_weeks_2'] = df['delivery_time'].apply(lambda x: x / pd.Timedelta(weeks=1))
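The lambda is not strictly required, since a whole timedelta column can be divided by pd.Timedelta(weeks=1) in one vectorized step. Here is a minimal sketch on two hypothetical durations that also shows how the two approaches differ when a partial day is involved.

```python
import pandas as pd

# Hypothetical durations: exactly two weeks, and three and a half days
delta = pd.Series(pd.to_timedelta(['14 days', '3 days 12:00:00']))

# Vectorized division keeps the partial day
weeks_exact = delta / pd.Timedelta(weeks=1)

# Whole days divided by 7 drops the extra 12 hours
weeks_from_whole_days = delta.dt.days / 7

print(weeks_exact.tolist())
print(weeks_from_whole_days.tolist())
```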
Calculate the time difference between two dates in months
To calculate the time difference in months, we first take the difference in years using dt.year and multiply it by 12, then add the difference in month numbers using dt.month. This counts the calendar-month boundaries crossed, so the result is always a whole number. Another version uses a lambda to divide each timedelta by an average month length of 30.44 days, which gives a fractional result.
df['delivery_time_months'] = (df['trip_end_date'].dt.year - df['trip_start_date'].dt.year)\
* 12 + (df['trip_end_date'].dt.month - df['trip_start_date'].dt.month)
df['delivery_time_months_2'] = df['trip_end_date'].sub(df['trip_start_date'])\
.apply(lambda x: x / pd.Timedelta(days=30.44)) # Average month length
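The two month calculations can disagree noticeably, so it is worth checking which one matches your needs. The sketch below uses two hypothetical date pairs chosen to make the difference visible; the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical trips: one day across a month boundary, and two full months
start = pd.Series(pd.to_datetime(['2020-01-31', '2020-01-01']))
end = pd.Series(pd.to_datetime(['2020-02-01', '2020-03-01']))

# Calendar-month difference: counts month boundaries crossed
calendar_months = (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)

# Elapsed time divided by an average month length of 30.44 days
average_months = (end - start) / pd.Timedelta(days=30.44)

print(calendar_months.tolist())
print(average_months.round(3).tolist())
```

A one-day trip from January 31 to February 1 counts as a full month with the calendar approach but as roughly 0.03 months with the average-length approach.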
Calculate the time difference between two dates with time zones
Handling time zones can be a bit tricky. Pandas provides two methods for this: tz_localize() sets the time zone of a naive datetime column, and tz_convert() converts an already localized column to another time zone. Both accept time zone names as strings, so the pytz import below is only needed if you want to build time zone objects yourself.
Remember to replace ‘America/New_York’ with your desired target time zone.
As you can see in the result, each timestamp now ends with its UTC offset: +00:00 for the original start date in UTC and -04:00 for the New York version.
import pytz
# Assuming the data is in a specific time zone (e.g., 'UTC')
data_time_zone = 'UTC'
df['trip_start_date'] = df['trip_start_date'].dt.tz_localize(data_time_zone)
df['trip_end_date'] = df['trip_end_date'].dt.tz_localize(data_time_zone)
# Calculate delivery time in weeks
df['delivery_time_weeks'] = (df['trip_end_date'] - df['trip_start_date']).dt.days / 7
# Convert time zones if needed (e.g., from 'UTC' to another time zone)
target_time_zone = 'America/New_York'
df['trip_start_date_us'] = df['trip_start_date'].dt.tz_convert(target_time_zone)
df['trip_end_date_us'] = df['trip_end_date'].dt.tz_convert(target_time_zone)
df[["trip_start_date","trip_start_date_us"]].head()
trip_start_date trip_start_date_us
45 2020-08-26 16:16:00+00:00 2020-08-26 12:16:00-04:00
46 2020-08-26 11:58:00+00:00 2020-08-26 07:58:00-04:00
47 2020-08-26 15:35:00+00:00 2020-08-26 11:35:00-04:00
48 2020-08-28 07:58:00+00:00 2020-08-28 03:58:00-04:00
49 2020-08-27 17:21:00+00:00 2020-08-27 13:21:00-04:00
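Converting both columns to another time zone changes how the timestamps are displayed, but not the elapsed time between them. The following sketch checks this on a single hand-built trip, assuming (as above) that the original data is in UTC.

```python
import pandas as pd

# One trip copied by hand from the data, localized to UTC (an assumption)
start_utc = pd.Series(pd.to_datetime(['2020-08-26 16:16:00'])).dt.tz_localize('UTC')
end_utc = pd.Series(pd.to_datetime(['2020-08-28 12:15:00'])).dt.tz_localize('UTC')

# Same instants, displayed in New York local time
start_ny = start_utc.dt.tz_convert('America/New_York')
end_ny = end_utc.dt.tz_convert('America/New_York')

# The duration is identical in both time zones
duration_utc = end_utc - start_utc
duration_ny = end_ny - start_ny

print(duration_utc.iloc[0], duration_ny.iloc[0])
```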
Calculate the time difference between two dates considering time intervals
Working with specific time intervals, such as business hours or working days, requires additional logic to calculate the delivery time accurately.
Let’s set up the calculate_working_time function, which calculates the working time (in hours) between the trip_start_date and trip_end_date by counting only the hours that fall inside the working day. This first version does not yet skip weekends; that is handled in the next section. Adjust the working_hours_start and working_hours_end variables to match your specific working hours.
# Convert dates to datetime format
df['trip_start_date'] = pd.to_datetime(df['trip_start_date'], errors='coerce')
df['trip_end_date'] = pd.to_datetime(df['trip_end_date'], errors='coerce')
# Set up working hours (adjust as needed)
working_hours_start = 8
working_hours_end = 17
# Define function to calculate working time (weekends are not excluded here)
def calculate_working_time(start_date, end_date):
    # Missing dates cannot be measured
    if pd.isna(start_date) or pd.isna(end_date):
        return float('nan')
    total_hours = 0
    # Walk through every calendar day touched by the trip, starting at midnight
    current_day = start_date.normalize()
    while current_day <= end_date:
        # Working window for the current day
        start_of_day = current_day.replace(hour=working_hours_start)
        end_of_day = current_day.replace(hour=working_hours_end)
        # Clip the working window to the trip itself
        window_start = max(start_date, start_of_day)
        window_end = min(end_date, end_of_day)
        working_time = (window_end - window_start).total_seconds() / 3600
        total_hours += max(0, working_time)
        # Move to the next day
        current_day += pd.DateOffset(days=1)
    return total_hours
# Calculate working time for each row
df['working_time'] = df.apply(lambda row: calculate_working_time(row['trip_start_date'], row['trip_end_date']), axis=1)
# Print the modified DataFrame
print(df.head())
trip_start_date trip_end_date working_time
45 2020-08-26 16:16:00 2020-08-28 12:15:00 13.983333
46 2020-08-26 11:58:00 2020-08-28 11:25:00 17.450000
47 2020-08-26 15:35:00 2020-08-28 11:21:00 13.766667
48 2020-08-28 07:58:00 2020-08-28 11:14:00 3.233333
49 2020-08-27 17:21:00 2020-08-28 10:42:00 2.700000
Calculate the time difference between two dates considering irregular time intervals
To account for irregular intervals, such as weekends and holidays, we need additional logic that identifies the non-working days and adjusts the calculation accordingly. Let’s modify our calculate_working_time function by adding a non_working_days variable and a holidays variable, which hold the days to skip. You can fill those variables manually or use a Python calendar library to do so.
# Convert dates to datetime format
df['trip_start_date'] = pd.to_datetime(df['trip_start_date'], errors='coerce')
df['trip_end_date'] = pd.to_datetime(df['trip_end_date'], errors='coerce')
# Set up working hours (adjust as needed)
working_hours_start = 8
working_hours_end = 17
# Define non-working days (holidays and weekends)
non_working_days = [5, 6] # Saturday (5) and Sunday (6)
# Store holidays as plain dates so they compare correctly with .date()
holidays = [pd.Timestamp('2023-01-01').date(), pd.Timestamp('2023-12-25').date()] # Example holidays
# Define function to calculate working time with excluded non-working days
def calculate_working_time(start_date, end_date):
    # Missing dates cannot be measured
    if pd.isna(start_date) or pd.isna(end_date):
        return float('nan')
    total_hours = 0
    # Walk through every calendar day touched by the trip, starting at midnight
    current_day = start_date.normalize()
    while current_day <= end_date:
        is_working_day = (current_day.weekday() not in non_working_days
                          and current_day.date() not in holidays)
        if is_working_day:
            # Working window for the current day
            start_of_day = current_day.replace(hour=working_hours_start)
            end_of_day = current_day.replace(hour=working_hours_end)
            # Clip the working window to the trip itself
            window_start = max(start_date, start_of_day)
            window_end = min(end_date, end_of_day)
            working_time = (window_end - window_start).total_seconds() / 3600
            total_hours += max(0, working_time)
        # Move to the next day
        current_day += pd.DateOffset(days=1)
    return total_hours
# Calculate working time for each row
df['working_time'] = df.apply(lambda row: calculate_working_time(row['trip_start_date'], row['trip_end_date']), axis=1)
# Print the modified DataFrame
print(df.head())
trip_start_date trip_end_date working_time
45 2020-08-26 16:16:00 2020-08-28 12:15:00 13.983333
46 2020-08-26 11:58:00 2020-08-28 11:25:00 17.450000
47 2020-08-26 15:35:00 2020-08-28 11:21:00 13.766667
48 2020-08-28 07:58:00 2020-08-28 11:14:00 3.233333
49 2020-08-27 17:21:00 2020-08-28 10:42:00 2.700000
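If you only need the number of working days rather than working hours, NumPy’s busday_count() function may be a simpler option. The sketch below is an illustration under a few assumptions: the dates and the holiday are hypothetical, the week mask treats Monday to Friday as working days, and the end date is not included in the count.

```python
import numpy as np
import pandas as pd

# Hypothetical trips: Wednesday to Monday, and Friday to Tuesday
start = pd.to_datetime(['2020-08-26', '2020-08-28'])
end = pd.to_datetime(['2020-08-31', '2020-09-01'])

# Hypothetical holiday falling on a Thursday
holidays = np.array(['2020-08-27'], dtype='datetime64[D]')

# busday_count expects day-precision dates; the end date is excluded
working_days = np.busday_count(
    start.values.astype('datetime64[D]'),
    end.values.astype('datetime64[D]'),
    weekmask='1111100',  # Monday to Friday are working days
    holidays=holidays,
)

print(working_days)
```

Because this works on whole arrays, it avoids the row-by-row apply() call, at the cost of ignoring the time of day.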
Visualize the difference between two dates with Seaborn
Visualizing time differences can provide valuable insights into the distribution of your data. The sns.histplot() function creates a histogram of the working time distribution, while the sns.boxplot() function generates a box plot to show the spread and outliers in the data.
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='working_time', bins=20, kde=True)
plt.title('Distribution of Working Time')
plt.xlabel('Working Time (hours)')
plt.ylabel('Frequency')
plt.show()
# Visualize time differences using a box plot
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, y='working_time')
plt.title('Box Plot of Working Time')
plt.ylabel('Working Time (hours)')
plt.show()
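To put numbers on what the box plot shows, you can compute the quartiles and flag outliers with the common 1.5 × IQR rule, which matches Seaborn’s default whisker length. The values below are hypothetical working times chosen for illustration, not results from the dataset.

```python
import pandas as pd

# Hypothetical working times in hours, with one unusually long trip
working_time = pd.Series([2.7, 3.2, 13.8, 14.0, 17.5, 60.0])

# Quartiles and interquartile range
q1, q3 = working_time.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences beyond which points are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = working_time[(working_time < lower_fence) | (working_time > upper_fence)]
print(outliers.tolist())
```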
If you made it this far in the article, thank you very much.
I hope this information was of use to you.
Feel free to use any information from this page. I’d appreciate it if you can simply link to this article as the source. If you have any additional questions, you can reach out to malick@malicksarr.com or message me on Twitter. If you want more content like this, join my email list to receive the latest articles. I promise I do not spam.