Tutorial: Exploratory analysis using Xorbits over NYC taxi dataset

Explore one of the most classic datasets with Xorbits and gain a better understanding of its capabilities, its ease of use, and how it can be integrated with other Python libraries to streamline your data analysis workflow.

Jon Wang
  14 February 2023

Xorbits is a powerful tool for exploring and analyzing large datasets. One of the classic datasets for demonstrating the capabilities of Xorbits is the NYC taxi dataset, which contains records of taxi rides in the city from 2009 to 2022. In this blog, we will show you how to use Xorbits to do some initial exploration of the NYC taxi dataset and get a sense of the ease-of-use of Xorbits.

Please find the links below to run this example on your favorite platform:

  • Google Colab
  • Kaggle

And for more examples, please visit Xorbits Examples.

Software versions

  • Xorbits>=0.1.2
  • plotly==5.11.0
# Install dependencies
%pip install "xorbits>=0.1.2" plotly==5.11.0 pyarrow

Datasets

Download the taxi zone lookup table and the taxi zone GeoJSON:

%%bash
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv
wget https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc\?method\=export\&format\=GeoJSON -O taxi_zones.geojson

Download the taxi trip records of 2021:

%%bash
for i in {1..12}
do
    wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-$(printf "%02d" $i).parquet
done
%%bash
mkdir yellow_tripdata_2021
mv yellow_tripdata_2021-*.parquet yellow_tripdata_2021
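If wget is unavailable (for example on Windows), the twelve monthly URLs can just as easily be built in Python; the sketch below only constructs the list, following the file-name pattern from the loop above — pair it with urllib.request.urlretrieve (or requests) to perform the actual downloads.

```python
# Build the monthly trip-data URLs in Python instead of bash.
# The base URL and file-name pattern match the wget loop above.
BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"

urls = [f"{BASE}/yellow_tripdata_2021-{month:02d}.parquet" for month in range(1, 13)]

for url in urls:
    print(url)
```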

Initialization

The first step is to initialize Xorbits:

import xorbits

# Initialize Xorbits in the local environment.
xorbits.init()

Data loading

The second step is to load the data into an Xorbits DataFrame. This can be done using the read_parquet() function, which allows us to specify the location of the Parquet files (here, a directory containing one file per month) and any additional options we want to use while reading the data.

In the case of the NYC taxi dataset, here is an example of how we could do this using Xorbits:

import datetime
import xorbits.pandas as pd

trips = pd.read_parquet(
    'yellow_tripdata_2021',
    columns=[
        'tpep_pickup_datetime',
        'tpep_dropoff_datetime',
        'trip_distance',
        'PULocationID',
        'DOLocationID',
    ])
# Remove outliers: keep only trips whose pickup time falls within 2021,
# since the raw data contains rows with out-of-range timestamps.
trips = trips[
    (trips['tpep_pickup_datetime'] >= datetime.datetime(2021, 1, 1))
    & (trips['tpep_pickup_datetime'] <= datetime.datetime(2021, 12, 31))
]
trips
tpep_pickup_datetime tpep_dropoff_datetime trip_distance PULocationID DOLocationID
0 2021-01-01 00:30:10 2021-01-01 00:36:12 2.10 142 43
1 2021-01-01 00:51:20 2021-01-01 00:52:19 0.20 238 151
2 2021-01-01 00:43:30 2021-01-01 01:11:06 14.70 132 165
3 2021-01-01 00:15:48 2021-01-01 00:31:01 10.60 138 132
4 2021-01-01 00:31:49 2021-01-01 00:48:21 4.94 68 33
... ... ... ... ... ...
3212643 2021-12-30 23:04:37 2021-12-30 23:22:59 7.01 42 70
3212644 2021-12-30 23:31:58 2021-12-30 23:37:58 1.30 164 90
3212645 2021-12-30 23:53:54 2021-12-31 00:03:17 2.30 263 151
3212646 2021-12-30 23:28:00 2021-12-30 23:43:00 4.46 246 87
3212672 2021-12-31 00:00:00 2021-12-31 00:08:00 1.02 163 229

30833535 rows × 5 columns
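The date filter above is a plain boolean mask, and the same pattern extends to other sanity checks one might want before analysis. A minimal sketch using plain pandas on made-up rows (the 100-mile cutoff is an arbitrary illustration, not part of the original analysis):

```python
import pandas as pd

# Synthetic rows: one plausible trip, one negative distance, one absurd distance.
sample = pd.DataFrame({
    "trip_distance": [2.1, -0.5, 9999.0],
    "PULocationID": [142, 238, 132],
})

# Boolean-mask filter, same shape as the date filter above.
clean = sample[(sample["trip_distance"] > 0) & (sample["trip_distance"] < 100)]
print(clean)  # only the plausible row survives
```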

taxi_zones = pd.read_csv('taxi+_zone_lookup.csv', usecols=['LocationID', 'Zone'])
taxi_zones.set_index(['LocationID'], inplace=True)
taxi_zones
Zone
LocationID
1 Newark Airport
2 Jamaica Bay
3 Allerton/Pelham Gardens
4 Alphabet City
5 Arden Heights
... ...
261 World Trade Center
262 Yorkville East
263 Yorkville West
264 NV
265 NaN

265 rows × 1 columns
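Setting LocationID as the index turns zone lookups into direct .loc accesses, which is what the merges later in this tutorial rely on. A tiny plain-pandas sketch with a two-row stand-in for the real lookup table:

```python
import pandas as pd

# A two-row stand-in for the real taxi zone lookup table.
zones = pd.DataFrame({
    "LocationID": [1, 2],
    "Zone": ["Newark Airport", "Jamaica Bay"],
})
zones.set_index(["LocationID"], inplace=True)

# Direct lookup by LocationID.
print(zones.loc[1, "Zone"])  # Newark Airport
```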

import json
with open('taxi_zones.geojson') as fd:
    geojson = json.load(fd)

Once we have the data loaded into a DataFrame, we might want to get a sense of the overall structure of the data by looking at the number of rows and columns, the data types of each column, and the first few rows of the data. We can do this using the shape and dtypes attributes and the head() method, respectively:

trips.shape
(30833535, 5)
trips.dtypes
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
trip_distance                   float64
PULocationID                      int64
DOLocationID                      int64
dtype: object
trips.head()
tpep_pickup_datetime tpep_dropoff_datetime trip_distance PULocationID DOLocationID
0 2021-01-01 00:30:10 2021-01-01 00:36:12 2.10 142 43
1 2021-01-01 00:51:20 2021-01-01 00:52:19 0.20 238 151
2 2021-01-01 00:43:30 2021-01-01 01:11:06 14.70 132 165
3 2021-01-01 00:15:48 2021-01-01 00:31:01 10.60 138 132
4 2021-01-01 00:31:49 2021-01-01 00:48:21 4.94 68 33

Temporal Analysis

One way to analyze the NYC taxi dataset is to look at how the number of rides varies over time. A particularly intriguing analysis is the number of rides per hour for each day of the week. We can create two additional columns in the DataFrame representing the pickup hour and day of the week, then group the data by these columns using the groupby method and count the rides in each group.

trips['PU_dayofweek'] = trips['tpep_pickup_datetime'].dt.dayofweek
trips['PU_hour'] = trips['tpep_pickup_datetime'].dt.hour
gb_time = trips.groupby(
    by=['PU_dayofweek', 'PU_hour'], as_index=False
).agg(count=('PU_dayofweek', 'count'))
gb_time
PU_dayofweek PU_hour count
0 0 0 55630
1 0 1 28733
2 0 2 16630
3 0 3 10682
4 0 4 13039
... ... ... ...
163 6 19 192822
164 6 20 165670
165 6 21 147248
166 6 22 123082
167 6 23 90842

168 rows × 3 columns
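The groupby above uses pandas-style named aggregation (count=('PU_dayofweek', 'count')), which Xorbits mirrors. A tiny plain-pandas sketch of the same pattern on made-up rows:

```python
import pandas as pd

# Four synthetic rides: three on Monday (dayofweek 0), one on Tuesday.
rides = pd.DataFrame({
    "PU_dayofweek": [0, 0, 0, 1],
    "PU_hour":      [9, 9, 10, 9],
})

# Named aggregation: the output column is called "count".
gb = rides.groupby(by=["PU_dayofweek", "PU_hour"], as_index=False).agg(
    count=("PU_dayofweek", "count")
)
print(gb)  # three (dayofweek, hour) groups
```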

We can then use a library like plotly to visualize the time series data. The graph below displays the number of rides per hour. From the graph, it is apparent that people are more inclined to travel in the afternoon. Additionally, on weekends, people generally tend to stay out late.

import plotly.express as px

b = px.bar(
    gb_time.to_pandas(),
    x='PU_hour',
    y='count',
    color='PU_dayofweek',
    color_continuous_scale='sunset_r',
)
b.show()

The graph below displays the number of rides for each day of the week.

b = px.bar(
    gb_time.to_pandas(),
    x='PU_dayofweek',
    y='count',
    color='PU_hour',
    color_continuous_scale='sunset_r',
)
b.show()

Spatial Analysis

Another way to analyze the NYC taxi dataset is to look at patterns in the spatial distribution of rides. We can use the groupby method to group the data by pick-up location ID (and, later, by drop-off location ID) and compute the count of rides for each group:

gb_pu_location = trips.groupby(['PULocationID'], as_index=False).agg(count=('PULocationID', 'count'))
gb_pu_location = gb_pu_location.to_pandas()
gb_pu_location
PULocationID count
0 1 3006
1 2 13
2 3 1705
3 4 38057
4 5 115
... ... ...
258 261 126511
259 262 446737
260 263 734649
261 264 229925
262 265 139525

263 rows × 2 columns

Then visualize the spatial distribution of pick-up locations:

import plotly.graph_objects as go

fig = go.Figure(
    go.Choroplethmapbox(
        geojson=geojson,
        featureidkey='properties.location_id',
        locations=gb_pu_location['PULocationID'],
        z=gb_pu_location['count'],
        colorscale="Viridis",
        marker_opacity=0.7,
        marker_line_width=0.1
    )
)
fig.update_layout(
    mapbox_style="carto-positron",
    mapbox_zoom=9,
    mapbox_center = {"lat": 40.7158, "lon": -73.9805},
    height=600,
)
fig.show()

We can also group the data by drop-off location ID:

gb_do_location = trips.groupby(['DOLocationID'], as_index=False).agg(count=('DOLocationID', 'count'))
gb_do_location = gb_do_location.to_pandas()
gb_do_location
DOLocationID count
0 1 55937
1 2 54
2 3 4158
3 4 134896
4 5 342
... ... ...
256 261 116452
257 262 495512
258 263 698930
259 264 196514
260 265 92557

261 rows × 2 columns

Then visualize the spatial distribution of drop-off locations:

fig = go.Figure(
    go.Choroplethmapbox(
        geojson=geojson,
        featureidkey='properties.location_id',
        locations=gb_do_location['DOLocationID'],
        z=gb_do_location['count'],
        colorscale="Viridis",
        marker_opacity=0.7,
        marker_line_width=0.1
    )
)
fig.update_layout(
    mapbox_style="carto-positron",
    mapbox_zoom=9,
    mapbox_center = {"lat": 40.7158, "lon": -73.9805},
    height=600,
)
fig.show()

Another area that we can explore is the traffic between taxi zones:

gb_pu_do_location = trips.groupby(
    ['PULocationID', 'DOLocationID'], as_index=False
).agg(count=('PULocationID', 'count'))

# Add zone names.
gb_pu_do_location = gb_pu_do_location.merge(taxi_zones, left_on='PULocationID', right_index=True)
gb_pu_do_location.rename(columns={'Zone': 'PUZone'}, inplace=True)
gb_pu_do_location = gb_pu_do_location.merge(taxi_zones, left_on='DOLocationID', right_index=True)
gb_pu_do_location.rename(columns={'Zone': 'DOZone'}, inplace=True)

gb_pu_do_location.sort_values(['count'], inplace=True, ascending=False)
gb_pu_do_location
PULocationID DOLocationID count PUZone DOZone
44108 237 236 225931 Upper East Side South Upper East Side North
43855 236 237 194655 Upper East Side North Upper East Side South
44109 237 237 152438 Upper East Side South Upper East Side South
43854 236 236 149797 Upper East Side North Upper East Side North
49493 264 264 112711 NV NV
... ... ... ... ... ...
1014 10 29 1 Baisley Park Brighton Beach
21361 121 174 1 Hillcrest/Pomonok Norwood
21909 124 174 1 Howard Beach Norwood
868 9 29 1 Auburndale Brighton Beach
49337 264 105 1 NV Governor's Island/Ellis Island/Liberty Island

49751 rows × 5 columns
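The two merges above attach zone names by joining a column of the grouped counts against the taxi_zones index (left_on=..., right_index=True). A small plain-pandas sketch of the same join, using two rows taken from the table above:

```python
import pandas as pd

# Stand-in for taxi_zones: Zone names indexed by LocationID.
zones = pd.DataFrame(
    {"Zone": ["Upper East Side South", "Upper East Side North"]},
    index=pd.Index([237, 236], name="LocationID"),
)

# Stand-in for the grouped trip counts.
flows = pd.DataFrame({"PULocationID": [237, 236], "count": [225931, 194655]})

# Join the PULocationID column against the zones index, then rename.
named = flows.merge(zones, left_on="PULocationID", right_index=True)
named = named.rename(columns={"Zone": "PUZone"})
print(named)
```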

Conclusion

In conclusion, Xorbits is an incredibly powerful tool for exploring and analyzing large datasets, as demonstrated by its use with the NYC taxi dataset. By following the steps outlined in this blog, you can gain a better understanding of the capabilities of Xorbits, its ease-of-use, and how it can be integrated with other Python libraries to streamline your data analysis workflow.


© 2022-2023 Xprobe Inc. All Rights Reserved.