Data processing that was performed on each raw dataset is listed below.
The following data processing was performed on raw bike share ridership data:
When loading raw bike share ridership data, parse datetimes using the datetime format %m/%d/%Y %H:%M, since the ridership timestamps do not contain seconds. In addition, the analysis in this project does not need the second at which a bike share trip started or ended. For these reasons, seconds were not parsed from the raw datetimes.
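The parsing step can be illustrated with a minimal sketch. It uses only the Python standard library, which is an assumption, since the text does not say which parser was used. The helper name `parse_trip_datetime` and the sample timestamp are hypothetical.

```python
from datetime import datetime

# Format stated in the text: month/day/year hour:minute, with no seconds field
RIDERSHIP_DATETIME_FORMAT = "%m/%d/%Y %H:%M"


def parse_trip_datetime(raw: str) -> datetime:
    """Parse one raw ridership timestamp (hypothetical helper)."""
    return datetime.strptime(raw, RIDERSHIP_DATETIME_FORMAT)


# Hypothetical sample value, not taken from the real ridership files
started_at = parse_trip_datetime("01/15/2019 08:05")
print(started_at.isoformat())
```

Because the format has no seconds field, the seconds of every parsed value default to zero.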
Clean (Standardize) column names
Rename columns to ensure the same names are used in all processed ridership data files
- to_station_name → end_station_id
- to_station_id → end_station_id
- from_station_name → start_station_id
- from_station_id → start_station_id
- trip_start_time → started_at
- trip_stop_time → ended_at
- trip_duration_seconds → trip_duration
- start_time → started_at
- end_time → ended_at

Remove trips that are
(filter) Drop trips with missing values in the following columns
- start_station_id
- end_station_id

Due to the large variation in bike share ridership by period of the year (e.g. month) and by station, the charts show the average ridership per station per period. Calculating this average requires the number of stations during a given period. The start_station_id and end_station_id columns are station identifiers that can be used to count the number of stations across the network within a specific period (e.g. monthly). This count is then used to calculate the required average ridership per period (e.g. average monthly ridership), which is shown in the charts. For this reason, neither column can have missing values, so any bike share trip with a missing value in either column was dropped.
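The renaming, the missing-value filter and the per-station average can be sketched together with pandas. This is a rough illustration under stated assumptions, not the project's actual code. The sample trips are hypothetical, only a subset of the column renames is shown, and counting a station as active when it appears as the start or end of at least one trip in the month is an assumed definition.

```python
import pandas as pd

# Subset of the column renames listed above (raw name -> standardized name)
COLUMN_RENAMES = {
    "trip_start_time": "started_at",
    "from_station_id": "start_station_id",
    "to_station_id": "end_station_id",
}

# Hypothetical raw trips; None marks a missing station identifier
raw_trips = pd.DataFrame(
    {
        "trip_start_time": pd.to_datetime(
            [
                "2021-01-05 08:00",
                "2021-01-06 09:30",
                "2021-01-20 17:15",
                "2021-02-02 07:45",
                "2021-02-03 12:00",
            ]
        ),
        "from_station_id": [7000, 7001, None, 7000, 7000],
        "to_station_id": [7001, 7000, 7001, 7002, None],
    }
)

trips = (
    raw_trips.rename(columns=COLUMN_RENAMES)
    # Drop trips with a missing value in either station identifier column
    .dropna(subset=["start_station_id", "end_station_id"])
    # Label each trip with the month in which it started, e.g. "2021-01"
    .assign(month=lambda df: df["started_at"].dt.strftime("%Y-%m"))
)

# Assumption: a station is counted in a month if it appears as the start
# or end of at least one trip that month
station_ids = pd.concat(
    [
        trips[["month", "start_station_id"]].rename(
            columns={"start_station_id": "station_id"}
        ),
        trips[["month", "end_station_id"]].rename(
            columns={"end_station_id": "station_id"}
        ),
    ]
)
stations_per_month = station_ids.groupby("month")["station_id"].nunique()
trips_per_month = trips.groupby("month").size()

# Average ridership per station per month
avg_trips_per_station = trips_per_month / stations_per_month
print(avg_trips_per_station)
```

A missing station identifier would otherwise leave that trip in the trip count without a station to count it against, which is why the rows are dropped before either count is taken.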
(clean) Clean start and end station names
Extract the following datetime attributes from the start and end datetime of the trip
Export cleaned data to disk in a .parquet file with the file naming format processed__trips_YY_<PP>__YYmmdd_HHMMSS.parquet.gzip, where
- <PP> is MM for monthly files (2020, 2021, 2022, 2023) and QQ for quarterly files (2018, 2019)
- YYmmdd_HHMMSS is the timestamp (retrieved using Python) at which the file with processed data was exported to disk

(filter) Keep census tract boundaries for the city of Toronto using the following filters
- PRUID = 35 in order to capture the province of Ontario, per Statistics Canada
- CTUID starts with 535
- CTNAME starts with 01, 02, 03, 04 or 05

where the CTUID and CTNAME filters were obtained using [Census Mapper](https://censusmapper.ca/#11/43.7245/-79.4861). It was necessary to zoom in or out on the map displayed on the Census Mapper site until a selected boundary displayed the name of a census tract (e.g. 0001.00 (CT)) rather than another administrative or statistical boundary such as a census division (CD), dissemination area (DA), etc.
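The three census tract filters can be sketched with pandas as follows. This is only an illustration: the sample rows are hypothetical, and the sketch assumes that PRUID, CTUID and CTNAME are stored as strings. With a GeoDataFrame loaded from the boundary file, the same boolean mask should apply unchanged.

```python
import pandas as pd

# Hypothetical census tract attributes (geometry omitted for brevity)
tracts = pd.DataFrame(
    {
        "CTUID": ["5350101.00", "5350300.00", "5350600.01", "5370101.00", "4620101.00"],
        "CTNAME": ["0101.00", "0300.00", "0600.01", "0101.00", "0101.00"],
        "PRUID": ["35", "35", "35", "35", "24"],
    }
)

# Prefixes of CTNAME listed in the text
CTNAME_PREFIXES = ["01", "02", "03", "04", "05"]

is_ontario = tracts["PRUID"] == "35"
ctuid_matches = tracts["CTUID"].str.startswith("535")
ctname_matches = tracts["CTNAME"].str[:2].isin(CTNAME_PREFIXES)

# Keep only tracts that pass all three filters
toronto_tracts = tracts[is_ontario & ctuid_matches & ctname_matches]
print(toronto_tracts)
```

Only the first two sample rows pass all three filters. The third fails the CTNAME prefix check, the fourth fails the CTUID prefix check and the fifth fails the PRUID check.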
None.
/station_information GBFS specification which contained a struct datatype
- rental_methods
- is_virtual_station
- rental_uris
- groups

geopandas spatial join (.sjoin) with the contains predicate

- CTUID column to census_tract_id
- AREA_NAME column to Neighbourhood