This project analyzes the distribution of trips of the bike-sharing system Ford Go Bike, in the San Francisco Bay Area (USA), taken from June through December of 2017 (total of 443,121 observations).
The objective is to perform an Exploratory Data Analysis in order to identify interesting insights, looking for patterns, relationships and correlations between all features. The analysis is presented in the following structure:
Ford Go Bike is the bike-share system operating in the San Francisco Bay Area (USA). With around 6,400 bikes at more than 360 stations across San Francisco, East Bay and San Jose, there are basically two ways the system can be used:
The dataset is open to the public and contains station and user data for each trip. At the time of this project, it was possible to download historical data from June 2017 through April 2019.
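The plots in this analysis read derived columns from `df` (`subscriber`, `gender`, `weekday`, `weekend`, `hour`, `duration_min`, `user_age`), but the preparation step is not shown. The following is only a minimal sketch of how such columns could be derived. The raw column names (`duration_sec`, `start_time`, `user_type`, `member_gender`, `member_birth_year`), the `'Subscriber'` label, the age calculation and the 0/1 encodings are assumptions chosen to match the tick labels used in the plots, and the sample rows are synthetic.

```python
import pandas as pd

# Synthetic sample rows; the raw column names are assumptions about the
# public trip file, not taken from this analysis.
raw = pd.DataFrame({
    "duration_sec": [420, 1500, 660],
    "start_time": ["2017-12-29 08:05:00", "2017-12-30 15:30:00", "2017-12-31 17:45:00"],
    "user_type": ["Subscriber", "Customer", "Subscriber"],
    "member_gender": ["Male", "Female", "Female"],
    "member_birth_year": [1985, 1990, 1978],
})

start = pd.to_datetime(raw["start_time"])

df = pd.DataFrame({
    # Assumed encoding: 1 = Subscriber, 0 = Sporadic
    "subscriber": (raw["user_type"] == "Subscriber").astype(int),
    # Assumed encoding: 0 = Female, 1 = Male; any other value becomes NaN
    "gender": raw["member_gender"].map({"Female": 0, "Male": 1}),
    # pandas convention: 0 = Monday ... 6 = Sunday
    "weekday": start.dt.weekday,
    "weekend": (start.dt.weekday >= 5).astype(int),
    "hour": start.dt.hour,
    "duration_min": raw["duration_sec"] / 60,
    # Assumed definition of age: years elapsed up to 2017
    "user_age": 2017 - raw["member_birth_year"],
})

print(df)
```

With this encoding, `df.query('weekday in (5,6)')` selects Saturday and Sunday trips, which is how the weekday and weekend panels are split later on.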
Brief description of the distribution of trips taken from June through December of 2017:
g = df.subscriber.value_counts(normalize=True)
sns.barplot(x=g.keys(), y=g.values*100)
plt.ylim(0,100)
plt.title('Proportion of Trips by User Type')
plt.xticks([0,1], ['Sporadic', 'Subscriber'])
plt.ylabel('Proportion [%]');
Almost 90% of the trips were taken by actual subscribers of the Ford Go Bike program.
g = df.gender.value_counts(normalize=True)
sns.barplot(x=g.keys(), y=g.values*100)
plt.ylim(0,100)
plt.title('Proportion of Trips by Gender')
plt.xticks([0,1], ['Female', 'Male'])
plt.ylabel('Proportion [%]');
The vast majority of trips were taken by men (77%).
Since we do not have access to the actual distribution of distinct subscribers, it is not possible to tell whether men are more likely to use the bike-rental program or whether this higher frequency of trips is simply due to there being more male subscribers.
fig, axes = plt.subplots(1,2, gridspec_kw={'width_ratios': [1, 1.8]}, figsize=(12,5))
g = df.weekend.value_counts(normalize=True)
sns.barplot(x=g.keys(), y=g.values*100, ax=axes[0])
axes[0].set_ylim(0,100)
axes[0].set_title('Proportion of Trips by Part of Week')
axes[0].set_xticklabels(['Weekday', 'Weekend'])
axes[0].set_ylabel('Proportion [%]');
g = df.weekday.value_counts(normalize=True)
sns.barplot(x=g.keys(), y=g.values*100, palette=blue_scale, ax=axes[1])
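# Style the right-hand panel explicitly instead of relying on the implicit current axes
axes[1].set_ylim(0, 100)
axes[1].set_title('Proportion of Trips by Day of Week')
axes[1].set_xticks(range(7))
axes[1].set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
axes[1].set_ylabel('Proportion [%]');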
Interestingly, weekends account for significantly fewer trips (about 15% in total), whereas each weekday carries a roughly equal share of about 17%.
This pattern suggests the majority of trips are actually completed for commuting purposes.
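One way to make this comparison concrete is to set each day's share against the uniform baseline of 100/7 (about 14.3%), which is what every day would get if trips were spread evenly over the week. The sketch below uses synthetic weekday codes built only to mirror the rounded shares quoted above; it is an illustration, not output from the real dataset.

```python
import pandas as pd

# Synthetic weekday codes (0 = Monday ... 6 = Sunday), 100 trips in total,
# constructed to mirror the rounded shares quoted in the text.
weekday = pd.Series([0] * 17 + [1] * 17 + [2] * 17 + [3] * 17 + [4] * 17 + [5] * 8 + [6] * 7)

# Share of trips per day, in percent, ordered Monday to Sunday
share = weekday.value_counts(normalize=True).sort_index() * 100

# Baseline: every day gets the same share of trips
uniform = 100 / 7

# Combined share of Saturday and Sunday
weekend_share = share.loc[[5, 6]].sum()

# Ratio above 1 means the day is busier than the uniform baseline
ratio = share / uniform

print(share.round(1))
print(f"uniform baseline: {uniform:.1f}%")
print(f"weekend share: {weekend_share:.1f}%")
print(ratio.round(2))
```

In this synthetic sample every weekday sits above the baseline and both weekend days sit well below it, which is the shape the commuting interpretation relies on.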
Let's further investigate the distribution of trips along the hours of the day:
fig, axes = plt.subplots(1,2, sharey=True, figsize=(12,5))
weekday = df.query('weekday not in (5,6)')
weekday_hours = weekday.hour.value_counts(normalize=True)
sns.barplot(x= weekday_hours.keys(), y= weekday_hours.values*100, color=sns.color_palette()[0], ax=axes[0])
axes[0].set_title('Proportion of Trips per Hour: Weekday')
axes[0].set_ylabel('Proportion [%]')
axes[0].set_xlabel('Hours')
weekend = df.query('weekday in (5,6)')
weekend_hours = weekend.hour.value_counts(normalize=True)
sns.barplot(x= weekend_hours.keys(), y= weekend_hours.values*100, color=sns.color_palette()[1], ax=axes[1])
axes[1].set_title('Proportion of Trips per Hour: Weekend')
axes[1].set_ylabel('Proportion [%]')
axes[1].set_xlabel('Hours')
plt.tight_layout()
Separating the trips between Weekday and Weekend, it is possible to see a clear commuting pattern for trips occurring during weekdays.
There are two peaks of usage around 8h and 17h, depicting the typical pendular pattern of home-work and work-home trips.
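The peaks can also be read off numerically instead of visually. The following is a minimal sketch on synthetic data; the helper `peak_hours` is a hypothetical name introduced here for illustration and is not part of the original analysis.

```python
import pandas as pd

# Synthetic trips: weekday code (0 = Monday ... 6 = Sunday) and start hour
trips = pd.DataFrame({
    "weekday": [0, 0, 0, 1, 1, 2, 2, 3, 5, 5, 6, 6],
    "hour":    [8, 8, 17, 8, 17, 17, 12, 8, 13, 14, 13, 11],
})


def peak_hours(hours, n=2):
    """Return the n hours with the largest share of trips, in clock order."""
    share = hours.value_counts(normalize=True)
    return sorted(share.nlargest(n).index.tolist())


is_weekend = trips["weekday"].isin([5, 6])

weekday_peaks = peak_hours(trips.loc[~is_weekend, "hour"])
weekend_peaks = peak_hours(trips.loc[is_weekend, "hour"], n=1)

print("weekday peaks:", weekday_peaks)
print("weekend peak:", weekend_peaks)
```

Note that `nlargest` only ranks individual hours, so on real data a broad peak spanning adjacent hours (for example 8h and 9h) may occupy both top slots; the bar plots remain the better tool for judging the overall shape.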
plt.figure(figsize=(8,5))
sns.heatmap(df3.pivot_table(index=pd.cut(df3.duration_min, bins=range(0,61)),
columns=df3.hour, aggfunc='count',
fill_value=0).gender, vmin=0,cmap='viridis');
plt.title('Heatmap of Trip Distribution by Duration and Hour of Day');
As seen before, most trips occur in two peaks (around 8h and 17h). This heatmap additionally shows that they usually last between 5 and 10 minutes, so the typical trip is short.
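A note on how the count matrix behind the heatmap is built: `pivot_table(..., aggfunc='count')` followed by `.gender` counts the non-null `gender` values in each cell, so trips with a missing gender would not be counted. If the intent is to count every trip, a cross-tabulation of the two keys is one possible alternative. The sketch below uses synthetic data and integer bin labels (via `labels=False`) to keep the example simple; it is not the author's method.

```python
import pandas as pd

# Synthetic trips; one row has a missing gender on purpose
trips = pd.DataFrame({
    "duration_min": [6.5, 7.2, 9.9, 12.0, 6.1, 30.5],
    "hour": [8, 8, 8, 17, 17, 13],
    "gender": [1.0, 0.0, float("nan"), 1.0, 1.0, 0.0],
})

# One-minute bins with right-closed edges; labels=False returns the integer
# position of the bin, so label 6 stands for the interval (6, 7] minutes.
minute_bin = pd.cut(trips["duration_min"], bins=range(0, 61), labels=False)

# Count rows per (duration bin, hour) pair, independent of any other column
counts = pd.crosstab(minute_bin, trips["hour"])

# Expand to the full 60 x 24 grid so empty bins and hours show up as zeros
grid = counts.reindex(index=range(60), columns=range(24), fill_value=0)

print(counts)
print("trips counted:", int(grid.values.sum()))
print("non-null gender values:", int(trips["gender"].notna().sum()))
```

The resulting `grid` can be passed to `sns.heatmap` in the same way as the pivot table. Whether the two approaches differ on the real data depends on whether `gender` contains missing values after cleaning.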
# User Age vs Hour of day
plt.figure(figsize=(8,5))
sns.heatmap(df3.pivot_table(index=df3.user_age, columns=df3.hour, aggfunc='count', fill_value=0).gender,
vmin=0,cmap='viridis');
plt.title('Heatmap of Trip Distribution by User Age and Hour of Day');
There is no clear correlation between user age and the time the trip occurs, suggesting users of all ages are using the bike-share to commute.
Now let's stratify the analysis by the type of user:
# Trip Duration vs Subscriber vs Gender
plt.figure(figsize=(8,5))
sns.violinplot(data=df3, y='duration_min', x='subscriber', hue='gender', split=True, orient='v', inner='quartile',
hue_order=[1,0]);
plt.title('Distribution of Trip Duration by Subscriber and Gender')
plt.ylabel('Duration [min]')
plt.xlabel('')
L = plt.legend()
L.get_texts()[0].set_text('Male')
L.get_texts()[1].set_text('Female')
plt.xticks([0,1], ['Sporadic','Subscriber']);
Even though subscribers have access to unlimited 45-minute trips, their trips are typically shorter than those of non-subscribers.
This could be explained by the fact that subscribers usually take the bike for mobility purposes (the trip as a means of transportation), whereas occasional users could be tourists using the bike for leisure and to get to know the city.
There is no significant difference in trip duration based on gender, especially for subscribers.
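The visual comparison from the violin plot can be backed by a small numeric summary. The sketch below groups synthetic trips by user type and gender and reports the trip count and median duration per group; the 0/1 encodings follow the tick labels used in the plots, and the numbers are made up for illustration.

```python
import pandas as pd

# Synthetic trips: subscriber (1 = Subscriber, 0 = Sporadic),
# gender (1 = Male, 0 = Female), duration in minutes
trips = pd.DataFrame({
    "subscriber": [1, 1, 1, 1, 0, 0, 0, 0],
    "gender": [1, 0, 1, 0, 1, 0, 1, 0],
    "duration_min": [6.0, 7.0, 8.0, 9.0, 14.0, 16.0, 18.0, 20.0],
})

# Count and median duration for every (user type, gender) combination
summary = trips.groupby(["subscriber", "gender"])["duration_min"].agg(["count", "median"])

# Median duration by user type only, ignoring gender
median_by_type = trips.groupby("subscriber")["duration_min"].median()

print(summary)
print(median_by_type)
```

The median is used here because trip durations are usually right-skewed, so a few very long trips would pull the mean upward.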
# Trip duration vs Hour Day - count
plt.figure(figsize=(15,5))
plt.subplot(121)
trip_duration_bins = pd.cut(subscriber1.duration_min, bins=range(0,61))
sns.heatmap(subscriber1.pivot_table(index=trip_duration_bins, columns=subscriber1.hour, aggfunc='count', fill_value=0).gender, vmin=0,
cmap='viridis');
plt.title('Heatmap of Distribution of Trip Duration by Hour of Day | Subscriber');
plt.subplot(122)
trip_duration_bins = pd.cut(subscriber0.duration_min, bins=range(0,61))
sns.heatmap(subscriber0.pivot_table(index=trip_duration_bins, columns=subscriber0.hour, aggfunc='count', fill_value=0).gender, vmin=0,
cmap='viridis');
plt.title('Heatmap of Distribution of Trip Duration by Hour of Day | Sporadic');
plt.tight_layout()
There is a clear difference in the bike usage pattern between subscribers and sporadic users along the hours of the day:
Subscribers: there are two clear peaks of usage, around 8h and 17h (supporting the commuting pattern). Both peaks concentrate trips lasting between 5 and 10 minutes.
Sporadic: although usage is more spread out, there is one clear peak around 17h, with significant usage starting at 8h and ending by 19h. Most trips last from 7 up to 15 minutes.
# Trip duration vs User age - count vs Subscriber
plt.figure(figsize=(15,5))
plt.subplot(121)
trip_duration_bins = pd.cut(subscriber1.duration_min, bins=range(0,61))
sns.heatmap(subscriber1.pivot_table(index=trip_duration_bins, columns=subscriber1.user_age, aggfunc='count',
fill_value=0).gender, vmin=0,cmap='viridis');
plt.title('Heatmap of Distribution of Trip Duration by user Age | Subscriber');
plt.subplot(122)
trip_duration_bins = pd.cut(subscriber0.duration_min, bins=range(0,61))
sns.heatmap(subscriber0.pivot_table(index=trip_duration_bins, columns=subscriber0.user_age, aggfunc='count',
fill_value=0).gender, vmin=0,cmap='viridis');
plt.title('Heatmap of Distribution of Trip Duration by user Age | Sporadic');
plt.tight_layout()
There is a clear difference in the bike usage pattern between subscribers and sporadic users:
Subscribers: the vast majority of trips are taken by users between 25 and 35 years old, riding from 5 up to 10 minutes.
Sporadic: the vast majority of trips are taken by users between 28 and 30 years old (much more concentrated), riding from 7 up to 20 minutes (much more spread out).
Commuting Pattern: most bike-share systems in the world are widely used by tourists as a cheap yet good way to travel around and get to know the city they are visiting. In this case, however, it is possible to identify a clear commuting pattern, with trips concentrated around 8am and 5pm.
Typical behavior for each type of user: one would think that, since subscribers have access to unlimited 45-minute trips, there would be many trips of longer duration. In this case, however, subscribers actually tend to use the system for commuting purposes, with the vast majority of trips taking place during weekdays, around 8am and 5pm.
Few young adults (under 20 years old) use the system. It would be worth investigating why and targeting this audience to increase the number of users.
The vast majority of trips seem to be taken for commuting purposes. Targeted advertising and special pricing for tourists could increase the number of sporadic users.