Explore one of the most classic datasets with Xorbits to gain a better understanding of its capabilities, its ease of use, and how it can be integrated with other Python libraries to streamline your data analysis workflow.

Xorbits is a powerful tool for exploring and analyzing large datasets. One of the classic datasets for demonstrating its capabilities is the NYC taxi dataset, which contains records of taxi rides in New York City from 2009 to 2022. In this blog, we will show you how to use Xorbits to do some initial exploration of the dataset and get a sense of its ease of use.
Please find the links below to run this example on your favorite platform:
- Google Colab
- Kaggle

And for more examples, please visit Xorbits Examples.
Software versions
- Xorbits>=0.1.2
- plotly==5.11.0
# Install dependencies
%pip install "xorbits>=0.1.2" plotly==5.11.0 pyarrow
Datasets
Download the taxi zone lookup table and the taxi zone GeoJSON:
%%bash
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv
wget https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc\?method\=export\&format\=GeoJSON -O taxi_zones.geojson
Download the taxi trip records of 2021:
%%bash
for i in {1..12}
do
    wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-$(printf "%02d" $i).parquet
done
%%bash
mkdir yellow_tripdata_2021
mv yellow_tripdata_2021-*.parquet yellow_tripdata_2021
Initialization
The first step is to initialize Xorbits:
import xorbits
# Initialize Xorbits in the local environment.
xorbits.init()
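By default, init() launches Xorbits in the local environment. If you already have a Xorbits cluster running, you could connect to it instead by passing the supervisor endpoint (a minimal sketch; the address below is a placeholder for your own deployment):
# Connect to an existing Xorbits cluster instead of starting a local one.
# '<supervisor_ip>' and '<supervisor_web_port>' are placeholders.
xorbits.init('http://<supervisor_ip>:<supervisor_web_port>')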
Data loading
The second step is to load the data into an Xorbits DataFrame. This can be done using the read_parquet() function, which allows us to specify the location of the parquet files and any additional options we want to use while reading the data. In the case of the NYC taxi dataset, here is an example of how we could do this using Xorbits:
import datetime
import xorbits.pandas as pd
trips = pd.read_parquet(
    'yellow_tripdata_2021',
    columns=[
        'tpep_pickup_datetime',
        'tpep_dropoff_datetime',
        'trip_distance',
        'PULocationID',
        'DOLocationID',
    ])
# Keep only records whose pickup time falls within 2021.
trips = trips[
    (trips['tpep_pickup_datetime'] >= datetime.datetime(2021, 1, 1))
    & (trips['tpep_pickup_datetime'] <= datetime.datetime(2021, 12, 31))
]
trips
 | tpep_pickup_datetime | tpep_dropoff_datetime | trip_distance | PULocationID | DOLocationID |
---|---|---|---|---|---|
0 | 2021-01-01 00:30:10 | 2021-01-01 00:36:12 | 2.10 | 142 | 43 |
1 | 2021-01-01 00:51:20 | 2021-01-01 00:52:19 | 0.20 | 238 | 151 |
2 | 2021-01-01 00:43:30 | 2021-01-01 01:11:06 | 14.70 | 132 | 165 |
3 | 2021-01-01 00:15:48 | 2021-01-01 00:31:01 | 10.60 | 138 | 132 |
4 | 2021-01-01 00:31:49 | 2021-01-01 00:48:21 | 4.94 | 68 | 33 |
... | ... | ... | ... | ... | ... |
3212643 | 2021-12-30 23:04:37 | 2021-12-30 23:22:59 | 7.01 | 42 | 70 |
3212644 | 2021-12-30 23:31:58 | 2021-12-30 23:37:58 | 1.30 | 164 | 90 |
3212645 | 2021-12-30 23:53:54 | 2021-12-31 00:03:17 | 2.30 | 263 | 151 |
3212646 | 2021-12-30 23:28:00 | 2021-12-30 23:43:00 | 4.46 | 246 | 87 |
3212672 | 2021-12-31 00:00:00 | 2021-12-31 00:08:00 | 1.02 | 163 | 229 |
30833535 rows × 5 columns
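As a quick sanity check on the filter above, we can confirm that the remaining pickup timestamps fall within 2021 (a small illustrative snippet; min() and max() mirror their pandas counterparts):
# Earliest and latest pickup times after filtering.
print(trips['tpep_pickup_datetime'].min())
print(trips['tpep_pickup_datetime'].max())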
taxi_zones = pd.read_csv('taxi+_zone_lookup.csv', usecols=['LocationID', 'Zone'])
taxi_zones.set_index(['LocationID'], inplace=True)
taxi_zones
LocationID | Zone |
---|---|
1 | Newark Airport |
2 | Jamaica Bay |
3 | Allerton/Pelham Gardens |
4 | Alphabet City |
5 | Arden Heights |
... | ... |
261 | World Trade Center |
262 | Yorkville East |
263 | Yorkville West |
264 | NV |
265 | NaN |
265 rows × 1 columns
import json
with open('taxi_zones.geojson') as fd:
    geojson = json.load(fd)
Once we have the data loaded into a DataFrame, we might want to get a sense of the overall structure of the data by looking at the number of rows and columns, the data types of each column, and the first few rows. We can do this using the shape and dtypes attributes and the head() method, respectively:
trips.shape
(30833535, 5)
trips.dtypes
tpep_pickup_datetime datetime64[ns]
tpep_dropoff_datetime datetime64[ns]
trip_distance float64
PULocationID int64
DOLocationID int64
dtype: object
trips.head()
 | tpep_pickup_datetime | tpep_dropoff_datetime | trip_distance | PULocationID | DOLocationID |
---|---|---|---|---|---|
0 | 2021-01-01 00:30:10 | 2021-01-01 00:36:12 | 2.10 | 142 | 43 |
1 | 2021-01-01 00:51:20 | 2021-01-01 00:52:19 | 0.20 | 238 | 151 |
2 | 2021-01-01 00:43:30 | 2021-01-01 01:11:06 | 14.70 | 132 | 165 |
3 | 2021-01-01 00:15:48 | 2021-01-01 00:31:01 | 10.60 | 138 | 132 |
4 | 2021-01-01 00:31:49 | 2021-01-01 00:48:21 | 4.94 | 68 | 33 |
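As an aside, since the location IDs only range from 1 to 265, we could downcast them to a smaller integer type to reduce memory usage (a hedged sketch, assuming astype mirrors its pandas counterpart; the rest of this post keeps the original int64 columns):
# Location IDs fit comfortably in int16; this optimization is optional
# and not applied in the rest of this post.
trips_small = trips.astype({'PULocationID': 'int16', 'DOLocationID': 'int16'})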
Temporal Analysis
One way to analyze the NYC taxi dataset is to look at how the number of rides varies over time. A particularly interesting question is how many rides occur during each hour of each day of the week. We can create two additional columns in the DataFrame representing the pickup hour and day of the week, then group the data by day of the week and hour using the groupby method and calculate the number of rides for each group.
trips['PU_dayofweek'] = trips['tpep_pickup_datetime'].dt.dayofweek
trips['PU_hour'] = trips['tpep_pickup_datetime'].dt.hour
gb_time = trips.groupby(by=['PU_dayofweek', 'PU_hour'], as_index=False).agg(count=('PU_dayofweek', 'count'))
gb_time
 | PU_dayofweek | PU_hour | count |
---|---|---|---|
0 | 0 | 0 | 55630 |
1 | 0 | 1 | 28733 |
2 | 0 | 2 | 16630 |
3 | 0 | 3 | 10682 |
4 | 0 | 4 | 13039 |
... | ... | ... | ... |
163 | 6 | 19 | 192822 |
164 | 6 | 20 | 165670 |
165 | 6 | 21 | 147248 |
166 | 6 | 22 | 123082 |
167 | 6 | 23 | 90842 |
168 rows × 3 columns
We can then use a library like plotly to visualize the time series data. The graph below displays the number of rides per hour. From the graph, it is apparent that people are more inclined to travel in the afternoon. Additionally, on weekends, people generally tend to stay out late.
import plotly.express as px
b = px.bar(
    gb_time.to_pandas(),
    x='PU_hour',
    y='count',
    color='PU_dayofweek',
    color_continuous_scale='sunset_r',
)
b.show()
The graph below displays the number of rides for each day of the week.
b = px.bar(
    gb_time.to_pandas(),
    x='PU_dayofweek',
    y='count',
    color='PU_hour',
    color_continuous_scale='sunset_r',
)
b.show()
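Note that dt.dayofweek encodes Monday as 0 and Sunday as 6. If we wanted more readable legends, we could map the numbers to day names on the small aggregated result before passing it to plotly (an optional tweak; gb_time_named is a hypothetical name):
# Map the numeric day of week (0 = Monday) to a short name.
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
gb_time_named = gb_time.to_pandas()
gb_time_named['PU_dayofweek'] = gb_time_named['PU_dayofweek'].map(dict(enumerate(days)))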
Spatial Analysis
Another way to analyze the NYC taxi dataset is to look at patterns in the spatial distribution of rides. We can use the groupby method to group the data by pick-up and drop-off location ID and compute the count of rides for each group:
gb_pu_location = trips.groupby(['PULocationID'], as_index=False).agg(count=('PULocationID', 'count'))
gb_pu_location = gb_pu_location.to_pandas()
gb_pu_location
 | PULocationID | count |
---|---|---|
0 | 1 | 3006 |
1 | 2 | 13 |
2 | 3 | 1705 |
3 | 4 | 38057 |
4 | 5 | 115 |
... | ... | ... |
258 | 261 | 126511 |
259 | 262 | 446737 |
260 | 263 | 734649 |
261 | 264 | 229925 |
262 | 265 | 139525 |
263 rows × 2 columns
Then visualize the spatial distribution of pick-up locations:
import plotly.graph_objects as go
fig = go.Figure(
    go.Choroplethmapbox(
        geojson=geojson,
        featureidkey='properties.location_id',
        locations=gb_pu_location['PULocationID'],
        z=gb_pu_location['count'],
        colorscale="Viridis",
        marker_opacity=0.7,
        marker_line_width=0.1,
    )
)
fig.update_layout(
    mapbox_style="carto-positron",
    mapbox_zoom=9,
    mapbox_center={"lat": 40.7158, "lon": -73.9805},
    height=600,
)
fig.show()
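To put names on the hotspots, we could join the aggregated counts with the zone lookup table loaded earlier and list the busiest pick-up zones (a hedged sketch; top_pu is a hypothetical name):
# Ten busiest pick-up zones by name; taxi_zones is converted to plain
# pandas because gb_pu_location already is.
top_pu = gb_pu_location.merge(
    taxi_zones.to_pandas(), left_on='PULocationID', right_index=True
).nlargest(10, 'count')
top_pu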
We can also group the data by drop-off location ID:
gb_do_location = trips.groupby(['DOLocationID'], as_index=False).agg(count=('DOLocationID', 'count'))
gb_do_location = gb_do_location.to_pandas()
gb_do_location
 | DOLocationID | count |
---|---|---|
0 | 1 | 55937 |
1 | 2 | 54 |
2 | 3 | 4158 |
3 | 4 | 134896 |
4 | 5 | 342 |
... | ... | ... |
256 | 261 | 116452 |
257 | 262 | 495512 |
258 | 263 | 698930 |
259 | 264 | 196514 |
260 | 265 | 92557 |
261 rows × 2 columns
Then visualize the spatial distribution of drop-off locations:
fig = go.Figure(
    go.Choroplethmapbox(
        geojson=geojson,
        featureidkey='properties.location_id',
        locations=gb_do_location['DOLocationID'],
        z=gb_do_location['count'],
        colorscale="Viridis",
        marker_opacity=0.7,
        marker_line_width=0.1,
    )
)
fig.update_layout(
    mapbox_style="carto-positron",
    mapbox_zoom=9,
    mapbox_center={"lat": 40.7158, "lon": -73.9805},
    height=600,
)
fig.show()
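With both aggregates in hand, a natural follow-up is the net flow per zone, i.e. drop-offs minus pick-ups (a hypothetical extension; both inputs are already plain pandas objects at this point):
# Positive values mean a zone receives more rides than it originates.
net_flow = (
    gb_do_location.set_index('DOLocationID')['count']
    .sub(gb_pu_location.set_index('PULocationID')['count'], fill_value=0)
    .sort_values(ascending=False)
)
net_flow.head()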
Another area that we can explore is the traffic between taxi zones:
gb_pu_do_location = trips.groupby(['PULocationID', 'DOLocationID'], as_index=False).agg(count=('PULocationID', 'count'))
# Add zone names.
gb_pu_do_location = gb_pu_do_location.merge(taxi_zones, left_on='PULocationID', right_index=True)
gb_pu_do_location.rename(columns={'Zone': 'PUZone'}, inplace=True)
gb_pu_do_location = gb_pu_do_location.merge(taxi_zones, left_on='DOLocationID', right_index=True)
gb_pu_do_location.rename(columns={'Zone': 'DOZone'}, inplace=True)
gb_pu_do_location.sort_values(['count'], inplace=True, ascending=False)
gb_pu_do_location
 | PULocationID | DOLocationID | count | PUZone | DOZone |
---|---|---|---|---|---|
44108 | 237 | 236 | 225931 | Upper East Side South | Upper East Side North |
43855 | 236 | 237 | 194655 | Upper East Side North | Upper East Side South |
44109 | 237 | 237 | 152438 | Upper East Side South | Upper East Side South |
43854 | 236 | 236 | 149797 | Upper East Side North | Upper East Side North |
49493 | 264 | 264 | 112711 | NV | NV |
... | ... | ... | ... | ... | ... |
1014 | 10 | 29 | 1 | Baisley Park | Brighton Beach |
21361 | 121 | 174 | 1 | Hillcrest/Pomonok | Norwood |
21909 | 124 | 174 | 1 | Howard Beach | Norwood |
868 | 9 | 29 | 1 | Auburndale | Brighton Beach |
49337 | 264 | 105 | 1 | NV | Governor's Island/Ellis Island/Liberty Island |
49751 rows × 5 columns
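To gauge how concentrated this traffic is, we could also express each pair as a share of all rides (a brief sketch, assuming these operations mirror their pandas counterparts):
# Fraction of all 2021 rides accounted for by each pick-up/drop-off pair.
gb_pu_do_location['share'] = (
    gb_pu_do_location['count'] / gb_pu_do_location['count'].sum()
)
gb_pu_do_location.head()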
Conclusion
In conclusion, Xorbits is a powerful tool for exploring and analyzing large datasets, as demonstrated with the NYC taxi dataset. By following the steps outlined in this blog, you can gain a better understanding of the capabilities of Xorbits, its ease of use, and how it can be integrated with other Python libraries to streamline your data analysis workflow.