Data processing that was performed on each raw dataset is listed below.
The following data processing was performed on raw bike share ridership data:

- When loading raw bike share ridership data, parse `datetime`s using the `datetime` format `%m/%d/%Y %H:%M`, since the ridership data timestamps did not contain seconds. Additionally, the analysis in this project does not require information about the second at which a bike share trip was started or ended. For these reasons, seconds were not parsed from the raw `datetime`s.
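  A minimal sketch of this parsing step, assuming pandas and a hypothetical file path and raw column names, could look like the following:

  ```python
  import pandas as pd

  # Hypothetical path and raw column names; the actual names vary by year
  # (eg. trip_start_time / trip_stop_time vs. start_time / end_time)
  trips = pd.read_csv("data/raw/ridership_2021_01.csv")

  # Timestamps contain no seconds, so parse them with the %m/%d/%Y %H:%M format
  for col in ["start_time", "end_time"]:
      trips[col] = pd.to_datetime(trips[col], format="%m/%d/%Y %H:%M")
  ```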
- (clean) Clean (standardize) column names, eg. with words separated by `_`
- (clean) Rename columns to ensure the same names are used in all processed ridership data files (a minimal renaming sketch follows the list of mappings below)
  - `to_station_name` → `end_station_name`
  - `to_station_id` → `end_station_id`
  - `from_station_name` → `start_station_name`
  - `from_station_id` → `start_station_id`
  - `trip_start_time` → `started_at`
  - `trip_stop_time` → `ended_at`
  - `trip_duration_seconds` → `trip_duration`
  - `start_time` → `started_at`
  - `end_time` → `ended_at`
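  A minimal sketch of this renaming step, assuming a plain pandas `DataFrame.rename`, is shown below; the mapping dictionary mirrors the list above.

  ```python
  import pandas as pd

  # Raw column names (which vary by year) mapped to the standardized names
  COLUMN_RENAMES = {
      "to_station_name": "end_station_name",
      "to_station_id": "end_station_id",
      "from_station_name": "start_station_name",
      "from_station_id": "start_station_id",
      "trip_start_time": "started_at",
      "trip_stop_time": "ended_at",
      "trip_duration_seconds": "trip_duration",
      "start_time": "started_at",
      "end_time": "ended_at",
  }

  def standardize_columns(trips: pd.DataFrame) -> pd.DataFrame:
      """Rename whichever raw columns are present to the standardized names."""
      present = {old: new for old, new in COLUMN_RENAMES.items() if old in trips.columns}
      return trips.rename(columns=present)
  ```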
- (filter) Drop trips with missing values in either of the following columns
  - `start_station_id`
  - `end_station_id`
  Due to the large variation in bike share ridership by period of the year (eg. month) and by station, charts show the average ridership per station per period. In order to calculate this average, the number of stations in operation during a given period is needed. The `start_station_id` and `end_station_id` station identifier columns can be used to count the number of stations across the network within a specific period (eg. monthly). This count can then be used to calculate the required average ridership per period (eg. average monthly ridership), which is used in the charts. For this reason, neither of these columns can have missing values, so any bike share trips with a missing value in either column were dropped.
- (clean) Clean the start and end station names
- Extract `datetime` attributes from the start and end `datetime`s of each trip
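  The exact attributes are not listed here; purely as an illustration, a sketch that extracts a few assumed calendar attributes from the standardized `started_at` and `ended_at` columns might look like:

  ```python
  # Assumes `trips` is the DataFrame from the steps above. The specific
  # attributes (year, month, hour) are assumptions for illustration; the
  # project may extract a different set of datetime attributes.
  for col in ["started_at", "ended_at"]:
      trips[f"{col}_year"] = trips[col].dt.year
      trips[f"{col}_month"] = trips[col].dt.month
      trips[f"{col}_hour"] = trips[col].dt.hour
  ```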
- Export the cleaned data to disk in a `.parquet` file with the file naming format `processed__trips_YY_<PP>__YYmmdd_HHMMSS.parquet.gzip` (a minimal export sketch follows this item), where
  - `<PP>` is `MM` for monthly files (2020, 2021, 2022, 2023) and `QQ` for quarterly files (2018, 2019)
  - `YYmmdd_HHMMSS` is the timestamp (retrieved using Python) at which the file with processed data was exported to disk
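  A minimal sketch of the export step, assuming pandas' `to_parquet` with gzip compression and a hypothetical output directory and helper name, could be:

  ```python
  from datetime import datetime
  from pathlib import Path

  import pandas as pd

  def export_processed_trips(trips: pd.DataFrame, year: int, period: str) -> Path:
      """Write processed trips using the naming format described above.

      `period` is the zero-padded month (MM) for 2020-2023 files or the
      zero-padded quarter (QQ) for 2018-2019 files.
      """
      out_dir = Path("data/processed")  # hypothetical output directory
      out_dir.mkdir(parents=True, exist_ok=True)
      timestamp = datetime.now().strftime("%y%m%d_%H%M%S")
      path = out_dir / f"processed__trips_{year % 100:02d}_{period}__{timestamp}.parquet.gzip"
      trips.to_parquet(path, compression="gzip")
      return path
  ```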
The following data processing was performed on raw census tract boundary data:

- (filter) Keep census tract boundaries for the city of Toronto using the following filters (a minimal filtering sketch follows below)
  - `PRUID = 35`, in order to capture the province of Ontario, per Statistics Canada
  - `CTUID` starts with `535`
  - `CTNAME` starts with `01`, `02`, `03`, `04` or `05`
  The `CTUID` and `CTNAME` filters were obtained using [Census Mapper](https://censusmapper.ca/#11/43.7245/-79.4861). It was necessary to zoom in or out on the map displayed on the Census Mapper site until a selected boundary displayed the name of a census tract, eg. `0001.00` (CT), and not another administrative or statistical boundary such as a census division (CD), dissemination area (DA), etc.
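  A minimal sketch of these filters, assuming the boundaries are loaded with `geopandas` from a hypothetical file path, could be:

  ```python
  import geopandas as gpd

  # Hypothetical path to the Statistics Canada census tract boundary file
  tracts = gpd.read_file("data/raw/census_tract_boundaries.shp")

  # Province of Ontario (PRUID 35), Toronto census tracts (CTUID prefix 535)
  # and census tract names starting with 01, 02, 03, 04 or 05
  tracts = tracts[
      (tracts["PRUID"].astype(str) == "35")
      & tracts["CTUID"].astype(str).str.startswith("535")
      & tracts["CTNAME"].astype(str).str[:2].isin(["01", "02", "03", "04", "05"])
  ]
  ```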
None.
The following data processing was performed on raw bike share station data from the GBFS `/station_information` endpoint:

- (clean) Drop the following columns, which contained a `struct` datatype (a minimal sketch follows this list)
  - `rental_methods`
  - `is_virtual_station`
  - `rental_uris`
  - `groups`
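  A minimal sketch of dropping these columns, assuming the GBFS response has been flattened into a pandas `DataFrame` (the endpoint URL below is an assumption), could be:

  ```python
  import pandas as pd
  import requests

  # Assumed GBFS endpoint URL for the Toronto bike share network
  url = "https://tor.publicbikesystem.net/ube/gbfs/v1/en/station_information"
  stations = pd.json_normalize(requests.get(url).json()["data"]["stations"])

  # Drop the struct/list-valued columns that are not needed for the analysis
  stations = stations.drop(
      columns=["rental_methods", "is_virtual_station", "rental_uris", "groups"],
      errors="ignore",
  )
  ```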
- Perform a `geopandas` spatial join (`.sjoin`) with the `contains` predicate (a minimal sketch follows this list)
- (clean) Rename the `CTUID` column to `census_tract_id`
- (clean) Rename the `AREA_NAME` column to `Neighbourhood`
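  A minimal, self-contained sketch of the spatial join and renaming steps with toy geometries is shown below; the `CTUID` and `AREA_NAME` columns presumably come from separate boundary layers (census tracts and city neighbourhoods), so combining them in one toy layer here is purely illustrative.

  ```python
  import geopandas as gpd
  from shapely.geometry import Point, Polygon

  # Toy boundary polygon and two station points (coordinates are illustrative)
  boundaries = gpd.GeoDataFrame(
      {"CTUID": ["5350001.00"], "AREA_NAME": ["Example Neighbourhood"]},
      geometry=[Polygon([(-79.40, 43.64), (-79.36, 43.64), (-79.36, 43.68), (-79.40, 43.68)])],
      crs="EPSG:4326",
  )
  stations = gpd.GeoDataFrame(
      {"station_id": [7000, 7001]},
      geometry=[Point(-79.38, 43.66), Point(-79.50, 43.70)],
      crs="EPSG:4326",
  )

  # Keep boundary rows whose polygon contains a station point
  joined = boundaries.sjoin(stations, predicate="contains")

  # Standardize the boundary column names
  joined = joined.rename(columns={"CTUID": "census_tract_id", "AREA_NAME": "Neighbourhood"})
  ```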