Data processing that was performed on each raw dataset is listed below.
The following data processing was performed on raw bike share ridership data:

- When loading raw bike share ridership data, parse `datetime`s using the `datetime` format `%m/%d/%Y %H:%M`, since the ridership data timestamps did not contain seconds. Additionally, the analysis in this project does not require information about the second at which a bike share trip was started or ended. For these reasons, seconds were not parsed from the raw `datetime`s.
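  A minimal sketch of this parsing step, assuming pandas and a hypothetical file path and raw column names, could look like the following:

  ```python
  import pandas as pd

  # Hypothetical path and raw column names; the actual names vary by year
  # (eg. trip_start_time / trip_stop_time vs. start_time / end_time)
  trips = pd.read_csv("data/raw/ridership_2021_01.csv")

  # Timestamps contain no seconds, so parse them with the %m/%d/%Y %H:%M format
  for col in ["start_time", "end_time"]:
      trips[col] = pd.to_datetime(trips[col], format="%m/%d/%Y %H:%M")
  ```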
- (clean) Clean (standardize) column names, eg. with words separated by `_`
- (clean) Rename columns to ensure the same names are used in all processed ridership data files (a minimal renaming sketch follows the list of mappings below)
  - `to_station_name` → `end_station_name`
  - `to_station_id` → `end_station_id`
  - `from_station_name` → `start_station_name`
  - `from_station_id` → `start_station_id`
  - `trip_start_time` → `started_at`
  - `trip_stop_time` → `ended_at`
  - `trip_duration_seconds` → `trip_duration`
  - `start_time` → `started_at`
  - `end_time` → `ended_at`
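  A minimal sketch of this renaming step, assuming a plain pandas `DataFrame.rename`, is shown below; the mapping dictionary mirrors the list above.

  ```python
  import pandas as pd

  # Raw column names (which vary by year) mapped to the standardized names
  COLUMN_RENAMES = {
      "to_station_name": "end_station_name",
      "to_station_id": "end_station_id",
      "from_station_name": "start_station_name",
      "from_station_id": "start_station_id",
      "trip_start_time": "started_at",
      "trip_stop_time": "ended_at",
      "trip_duration_seconds": "trip_duration",
      "start_time": "started_at",
      "end_time": "ended_at",
  }

  def standardize_columns(trips: pd.DataFrame) -> pd.DataFrame:
      """Rename whichever raw columns are present to the standardized names."""
      present = {old: new for old, new in COLUMN_RENAMES.items() if old in trips.columns}
      return trips.rename(columns=present)
  ```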
- (filter) Drop trips with missing values in either of the following columns
  - `start_station_id`
  - `end_station_id`
  Due to the large variation in bike share ridership by period of the year (eg. month) and by station, charts show the average ridership per station per period. In order to calculate this average, the number of stations in operation during a given period is needed. The `start_station_id` and `end_station_id` station identifier columns can be used to count the number of stations across the network within a specific period (eg. monthly). This count can then be used to calculate the required average ridership per period (eg. average monthly ridership), which is used in the charts. For this reason, neither of these columns can have missing values, so any bike share trips with a missing value in either column were dropped.
- (clean) Clean the start and end station names
- Extract `datetime` attributes from the start and end `datetime`s of each trip
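  The exact attributes are not listed here; purely as an illustration, a sketch that extracts a few assumed calendar attributes from the standardized `started_at` and `ended_at` columns might look like:

  ```python
  # Assumes `trips` is the DataFrame from the steps above. The specific
  # attributes (year, month, hour) are assumptions for illustration; the
  # project may extract a different set of datetime attributes.
  for col in ["started_at", "ended_at"]:
      trips[f"{col}_year"] = trips[col].dt.year
      trips[f"{col}_month"] = trips[col].dt.month
      trips[f"{col}_hour"] = trips[col].dt.hour
  ```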
- Export the cleaned data to disk in a `.parquet` file with the file naming format `processed__trips_YY_<PP>__YYmmdd_HHMMSS.parquet.gzip` (a minimal export sketch follows this item), where
  - `<PP>` is `MM` for monthly files (2020, 2021, 2022, 2023) and `QQ` for quarterly files (2018, 2019)
  - `YYmmdd_HHMMSS` is the timestamp (retrieved using Python) at which the file with processed data was exported to disk
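  A minimal sketch of the export step, assuming pandas' `to_parquet` with gzip compression and a hypothetical output directory and helper name, could be:

  ```python
  from datetime import datetime
  from pathlib import Path

  import pandas as pd

  def export_processed_trips(trips: pd.DataFrame, year: int, period: str) -> Path:
      """Write processed trips using the naming format described above.

      `period` is the zero-padded month (MM) for 2020-2023 files or the
      zero-padded quarter (QQ) for 2018-2019 files.
      """
      out_dir = Path("data/processed")  # hypothetical output directory
      out_dir.mkdir(parents=True, exist_ok=True)
      timestamp = datetime.now().strftime("%y%m%d_%H%M%S")
      path = out_dir / f"processed__trips_{year % 100:02d}_{period}__{timestamp}.parquet.gzip"
      trips.to_parquet(path, compression="gzip")
      return path
  ```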
The following data processing was performed on raw census tract boundary data:

- (filter) Keep census tract boundaries for the city of Toronto using the following filters (a minimal filtering sketch follows below)
  - `PRUID = 35`, in order to capture the province of Ontario, per Statistics Canada
  - `CTUID` starts with `535`
  - `CTNAME` starts with `01`, `02`, `03`, `04` or `05`
  The `CTUID` and `CTNAME` filters were obtained using [Census Mapper](https://censusmapper.ca/#11/43.7245/-79.4861). It was necessary to zoom in or out on the map displayed on the Census Mapper site until a selected boundary displayed the name of a census tract, eg. `0001.00` (CT), and not another administrative or statistical boundary such as a census division (CD), dissemination area (DA), etc.
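  A minimal sketch of these filters, assuming the boundaries are loaded with `geopandas` from a hypothetical file path, could be:

  ```python
  import geopandas as gpd

  # Hypothetical path to the Statistics Canada census tract boundary file
  tracts = gpd.read_file("data/raw/census_tract_boundaries.shp")

  # Province of Ontario (PRUID 35), Toronto census tracts (CTUID prefix 535)
  # and census tract names starting with 01, 02, 03, 04 or 05
  tracts = tracts[
      (tracts["PRUID"].astype(str) == "35")
      & tracts["CTUID"].astype(str).str.startswith("535")
      & tracts["CTNAME"].astype(str).str[:2].isin(["01", "02", "03", "04", "05"])
  ]
  ```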
None.
The following data processing was performed on raw bike share station data from the GBFS `/station_information` endpoint:

- (clean) Drop the following columns, which contained a `struct` datatype (a minimal sketch follows this list)
  - `rental_methods`
  - `is_virtual_station`
  - `rental_uris`
  - `groups`
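  A minimal sketch of dropping these columns, assuming the GBFS response has been flattened into a pandas `DataFrame` (the endpoint URL below is an assumption), could be:

  ```python
  import pandas as pd
  import requests

  # Assumed GBFS endpoint URL for the Toronto bike share network
  url = "https://tor.publicbikesystem.net/ube/gbfs/v1/en/station_information"
  stations = pd.json_normalize(requests.get(url).json()["data"]["stations"])

  # Drop the struct/list-valued columns that are not needed for the analysis
  stations = stations.drop(
      columns=["rental_methods", "is_virtual_station", "rental_uris", "groups"],
      errors="ignore",
  )
  ```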
- Perform a `geopandas` spatial join (`.sjoin`) with the `contains` predicate (a minimal sketch follows this list)
- (clean) Rename the `CTUID` column to `census_tract_id`
- (clean) Rename the `AREA_NAME` column to `Neighbourhood`
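  A minimal, self-contained sketch of the spatial join and renaming steps with toy geometries is shown below; the `CTUID` and `AREA_NAME` columns presumably come from separate boundary layers (census tracts and city neighbourhoods), so combining them in one toy layer here is purely illustrative.

  ```python
  import geopandas as gpd
  from shapely.geometry import Point, Polygon

  # Toy boundary polygon and two station points (coordinates are illustrative)
  boundaries = gpd.GeoDataFrame(
      {"CTUID": ["5350001.00"], "AREA_NAME": ["Example Neighbourhood"]},
      geometry=[Polygon([(-79.40, 43.64), (-79.36, 43.64), (-79.36, 43.68), (-79.40, 43.68)])],
      crs="EPSG:4326",
  )
  stations = gpd.GeoDataFrame(
      {"station_id": [7000, 7001]},
      geometry=[Point(-79.38, 43.66), Point(-79.50, 43.70)],
      crs="EPSG:4326",
  )

  # Keep boundary rows whose polygon contains a station point
  joined = boundaries.sjoin(stations, predicate="contains")

  # Standardize the boundary column names
  joined = joined.rename(columns={"CTUID": "census_tract_id", "AREA_NAME": "Neighbourhood"})
  ```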