Lego Data Analysis with GPT4

Jin Young Kim
11 min read · Jun 2, 2023


Surveying the capability of GPT4 as of May 2023

In a previous post I wrote about how generative AI will change the nature of data science and how it's done. To put my hypotheses to the test, I've used the latest LLM technology (GPT4) to analyze one of my favorite datasets.

As for the setup: while I was initially excited by the Code Interpreter plugin, it wasn't available on my OpenAI account, so I used the Notable plug-in for ChatGPT instead. The Notable plug-in works as follows:

  1. GPT4 interprets the user's command and sends a request to the Notable server
  2. The request can load files and install libraries as well as run Python commands
  3. The Notable server sends the results back to ChatGPT
  4. Outputs include tables, plots, and analytical summaries
  5. Users can check the code and output (tables and plots) in the Notable notebook as needed

In the end, both the ChatGPT sessions and the notebooks are preserved for future use, which turns out to be a better setup for actual work: I could review and work with the Jupyter notebook side-by-side instead of relying solely on ChatGPT prompts (which seems to be the case with the Code Interpreter plugin, judging from YouTube videos).

Lego Sets & Parts Dataset

Now onto the fun part. The data I used includes the following information on Lego sets, themes, and parts. For those not familiar with the Lego world: the Star Wars theme, for instance, includes the X-wing Starfighter set, which in turn contains a bunch of Lego parts (bricks, plates, and so on). Full schema below:

Now, while Lego is a toy and the dataset is not huge, this is by no means a toy example (pun intended), given its size (the combined dataset contains about 1M rows) and the complexity of the schema (just getting from sets to parts takes a 3-way join). Thanks to Rebrickable for making the dataset accessible.

Now onto the analysis (after I uploaded the gzipped files to a folder on the Notable server). In what follows, note that I've interleaved the GPT prompts (shown as blockquotes for multi-line prompts) with my impressions and interpretations. I've used screenshots for GPT-generated code (white background, captured from the Notable notebook) and GPT output (dark grey background), or side-by-side views of both.

Step 1: Data Preparation

We'll analyze Lego dataset (of sets, parts, themes). Let's first load data and summarize the schema from the gzipped csv files under LegoData folder of the project. (for each file output table of field name / data type / sample values)

That GPT can generate code for reading and summarizing the dataset is pretty impressive in and of itself, but let's move on since we have plenty of surprises coming.
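Since the GPT-generated code appears only as screenshots here, below is a rough sketch of what such loading-and-summarizing code might look like. The file layout (a LegoData folder of gzipped Rebrickable CSVs) follows the setup described above; everything else is an illustrative assumption, not the code GPT actually wrote.

```python
import glob
import os
import pandas as pd

# Load every gzipped CSV under the LegoData folder into a dict of DataFrames
data = {}
for path in glob.glob("LegoData/*.csv.gz"):
    name = os.path.basename(path).replace(".csv.gz", "")
    data[name] = pd.read_csv(path)

# For each table, print a small schema summary: field name / data type / sample values
for name, df in data.items():
    schema = pd.DataFrame({
        "field": df.columns,
        "dtype": df.dtypes.astype(str).values,
        "sample_values": [", ".join(map(str, df[c].dropna().unique()[:3])) for c in df.columns],
    })
    print(f"=== {name} ({len(df):,} rows) ===")
    print(schema.to_string(index=False))
```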

Okay, let's build a master table of themes (name) / sets (name, year) / parts (number / name / color). Use inventories / inventory_parts table to join sets and parts table. Use column names to reflect their human names.

While GPT struggled a bit to figure out the join keys, note how it beautifully recovers and completes the 7-way join of parts, sets, and themes. Getting to this stage manually would have taken me at least half an hour between web search, typing, and verification. Looks like we can finally hope to lower the share of time spent on data preparation to well below 80%.
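GPT's actual merge isn't reproduced here, but assuming the standard Rebrickable schema (set_num, theme_id, inventory_id, part_num, color_id, part_cat_id) and the `data` dict from the loading sketch above, the chain of joins looks roughly like this:

```python
# A rough sketch of the multi-table merge (not GPT's actual code); column names
# follow the public Rebrickable schema and the `data` dict loaded above
sets = data["sets"][["set_num", "name", "year", "theme_id"]].rename(columns={"name": "Set Name", "year": "Year"})
themes = data["themes"][["id", "name"]].rename(columns={"id": "theme_id", "name": "Theme Name"})
inventories = data["inventories"][["id", "set_num"]].rename(columns={"id": "inventory_id"})
inv_parts = data["inventory_parts"][["inventory_id", "part_num", "color_id", "quantity"]]
parts = data["parts"][["part_num", "name", "part_cat_id"]].rename(columns={"name": "Part Name"})
part_cats = data["part_categories"][["id", "name"]].rename(columns={"id": "part_cat_id", "name": "Part Category"})
colors = data["colors"][["id", "name"]].rename(columns={"id": "color_id", "name": "Part Color"})

master = (
    sets.merge(themes, on="theme_id")
        .merge(inventories, on="set_num")
        .merge(inv_parts, on="inventory_id")
        .merge(parts, on="part_num")
        .merge(part_cats, on="part_cat_id")
        .merge(colors, on="color_id")
        .rename(columns={"part_num": "Part Number", "quantity": "Quantity"})
)
```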

The resulting master table contains about 1M records and is ready for EDA (GPT's actual merging code is not shown here for brevity). I decided to make a checkpoint here and move on to the next phase (and the next notebook) using the prompt below, which GPT+Notable followed faithfully.

Now let's save the master table in a gzipped file at the same place as other datasets, and then load it in new notebook for exploratory analysis.
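The checkpoint itself boils down to a couple of lines (the file name here is an assumption):

```python
import pandas as pd

# Save the master table as a gzipped CSV next to the raw data...
master.to_csv("LegoData/master_table.csv.gz", index=False, compression="gzip")

# ...then, in the new notebook, load it back for exploratory analysis
master = pd.read_csv("LegoData/master_table.csv.gz")
```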

Step 2: Exploratory Analysis

Now onto the EDA. I first kicked off the following prompt to get basic validation questions answered.

Let’s first validate the table
Any obvious duplicates?
Any out-of-bounds values?
Any other concerns?

Also print a table of column catalog containing:
Number of unique values
% of null values
sample values

For non-numeric columns (including year) with large # of unique values, can you suggest suitable categories for visualization?

To which I got the following summary. Note that, again, GPT recovered from its initial failed attempt to attach sample values to the summary table by figuring out how to apply string concatenation. It also suggested ways to group column values for visualization, with which I mostly agree.
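For reference, the validation checks and column catalog can be sketched in a few lines of pandas (an illustration under the column names assumed earlier, not GPT's exact code):

```python
import pandas as pd

# Basic validation checks
print("Duplicate rows:", master.duplicated().sum())
print("Years outside a sane range:", (~master["Year"].between(1949, 2023)).sum())  # illustrative bounds

# Column catalog: unique counts, % null, and a few concatenated sample values per column
catalog = pd.DataFrame({
    "unique_values": master.nunique(),
    "pct_null": (master.isna().mean() * 100).round(2),
    "sample_values": pd.Series(
        [", ".join(map(str, master[col].dropna().unique()[:3])) for col in master.columns],
        index=master.columns,
    ),
})
print(catalog)
```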

Now we can apply the lessons above with the simple prompt below:

Let’s take following data prep steps
Remove duplicate rows (assuming they’re identical across all columns)
Remove rows with year prior to 1980 (print % of rows removed)
Add suggested categories for theme and color (with appropriate naming)

And voilà, our master table is modified to reflect the suggestions above.
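A sketch of those three steps, assuming the column names used above (the category mappings here are purely illustrative stand-ins for whatever groupings GPT actually suggested):

```python
n_before = len(master)

# 1. Remove exact duplicate rows
master = master.drop_duplicates()

# 2. Remove rows prior to 1980 and report the fraction removed
master = master[master["Year"] >= 1980]
print(f"Removed {100 * (1 - len(master) / n_before):.1f}% of rows")

# 3. Add coarse categories for theme and color (illustrative mappings only)
theme_groups = {"Star Wars": "Licensed", "Technic": "Technic", "Duplo": "Duplo", "Classic": "Classic"}
color_groups = {"Black": "Dark", "Dark Bluish Gray": "Dark", "White": "Light", "Red": "Bright"}
master["Theme Category"] = master["Theme Name"].map(theme_groups).fillna("Other")
master["Color Category"] = master["Part Color"].map(color_groups).fillna("Other")
```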

At this point, I decided to build another table at the set level with aggregate columns of various kinds (a rough sketch of the aggregation follows the prompt):

Create a separate set master table and perform EDA as above
Theme Name
Theme Categories
Set Name
Year
Total # Parts
Total # of Unique Parts
Dominant Color Category
% of Technic Parts

Now let’s visualize the data to ensure the aggregation is done correctly:

Visualize the per-column stats
(use appropriate visualization; e.g. word cloud for strings)
Theme Categories
Year
Total # Parts
Total # of Unique Parts
Dominant Color Category
% of Technic Parts

Interestingly, GPT noticed that the wordcloud package was not available and installed it -- another example of recovery. Below are the word cloud and the distributions across various sets.
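A sketch of the kind of per-column visualization involved (assuming the wordcloud package and the set-level columns from the aggregation sketch above):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Word cloud of theme categories, weighted by how many sets fall in each
freqs = set_master["Theme Category"].value_counts().to_dict()
wc = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(freqs)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

# Histograms for the numeric columns
set_master[["Year", "total_parts", "unique_parts"]].hist(bins=30, figsize=(12, 4))
plt.tight_layout()
plt.show()
```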

I found that the % of Technic Parts column had no values, and asked GPT to figure out why and fix it. Amazingly, GPT backtracked through the join process and found that no part category is named exactly "Technic", but that it could use a string match to find parts whose category name contains the term "Technic". Below is the distribution before/after the fix. Now I'm starting to get scared, wondering what GPT can't accomplish.
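The fix GPT arrived at amounts to a substring match on the part category name, roughly like this (column names as assumed in the earlier sketches):

```python
# Flag Technic parts via a substring match rather than an exact category name
master["Is Technic Part"] = master["Part Category"].str.contains("Technic", case=False, na=False)

# Recompute the per-set share of Technic parts and attach it to the set master table
technic_share = (
    master.groupby("Set Name")
          .apply(lambda g: (g["Quantity"] * g["Is Technic Part"]).sum() / g["Quantity"].sum())
          .rename("pct_technic_parts")
          .reset_index()
)
set_master = set_master.merge(technic_share, on="Set Name", how="left")
```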

Now that we have a proper set master table, I asked GPT to provide the following analysis (the prompt below went through several iterations to refine the plots; maybe multi-modal GPT4 will figure this out by itself):

Visualize the inter-column relationship
Use horizontal box plot to avoid overlapping labels

Dist of column values by theme category (using box plot)
- Year
- Total # Parts
- Total # of Unique Parts
- Avg. Part Count (Total # Parts / Total # of Unique Parts)
- Dominant Color Category
- % of Technic Parts
Add a few observations to each plot, and summarize in the end

I won’t include the full details of the code and the GPT-generated analysis, but the resulting plots gave a good glimpse into the different Lego sets:

  • Duplo has both the fewest parts and the fewest unique parts (of course!)
  • Technic has the highest average per-part quantity, since it’s composed of a smaller number of basic elements (axles, gears, pins, etc.)
  • Other than Technic itself, Star Wars has the largest fraction of Technic parts (possibly due to the complexity of its models)

Finally, the distribution of dominant color (I don’t even know what definition GPT used to calculate this) produces a plot that matches intuition, with Technic and Star Wars dominated by black and gray, whereas Classic and Duplo have more diverse color schemes. (I know it’s not a perfect categorization, but I’m running out of time here. :)
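For reference, each of the horizontal box plots requested above takes only a few lines with seaborn (a sketch using the column names assumed earlier):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Horizontal box plot of total part count by theme category (log scale keeps outliers readable)
plt.figure(figsize=(8, 5))
sns.boxplot(data=set_master, x="total_parts", y="Theme Category", orient="h")
plt.xscale("log")
plt.xlabel("Total # Parts")
plt.tight_layout()
plt.show()
```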

Step 3: Correlation Analysis

The next day, although I was already quite convinced of GPT4's analytic capabilities, I tried to push the limits further and asked it to generate hypotheses. Since I had lost my previous session overnight, I decided to start over and fired up the following prompt.

We’ll analyze Lego dataset (of sets, parts, themes).

Use project: NOTABLE_PROJECT_URL
Create notebook named: Lego Analysis Part 2
Load the master table and set master table from LegoData folder.
They’re in gzipped csv / How many rows in each table?

Then print column-level schema for each table (in tabular format)
… (omitted) …

Can you guess what info each table contains
Any oddities at column level?

Generate a few interesting hypotheses to explore
Along with suitable analytics plan including data processing and visualization

And below is what I got for the last part of the prompt. Yes, it did catch the column-level oddity and then suggested basic yet pretty reasonable hypotheses.

I decided to pursue the correlation analysis with the prompt below, which was followed by a pretty reasonable response. Not sure if the last line made any difference, but I've seen other people do it, and I guess it doesn't hurt to motivate the AI to do better?

Let’s pursue the first hypothesis.
Can you build a step-by-step plan first?
(including data preparation / validation / visualization / interpretation)
Assume you’re a world-class data scientist working with a client.

Okay, now onto the actual work. Here I decided to up the ante by asking for formatting and code comments within the notebook.

The plan looks good. Can you carry out the plan within the current notebook

Include markdown cells for analysis step & description
Include helpful comments for each code cell
Check if the assumptions are met for linear regression

And yes, it started to populate the notebook with decent headers and summaries. And yes, the EDA, statistical analysis, and visualization all hit the right notes. Finally, I can just share the notebook written by GPT4 with anyone interested!
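The exact hypothesis isn't reproduced here, but judging from the follow-up it concerned the relationship between a set's year and its part count. A minimal sketch of that kind of check (Pearson correlation plus an OLS fit with basic assumption diagnostics, using statsmodels) might look like this:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

df = set_master[["Year", "total_parts"]].dropna()
print("Pearson correlation:", df["Year"].corr(df["total_parts"]))

# Ordinary least squares fit with an intercept
model = sm.OLS(df["total_parts"], sm.add_constant(df["Year"])).fit()
print(model.summary())

# Assumption checks: residuals vs. fitted (linearity / equal variance) and a QQ plot (normality)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(model.fittedvalues, model.resid, s=5)
axes[0].set(xlabel="Fitted values", ylabel="Residuals")
sm.qqplot(model.resid, line="45", fit=True, ax=axes[1])
plt.tight_layout()
plt.show()
```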

While the correlation is there, it’s still pretty weak, so I decided to try a few tricks below as a deeper dive. First, GPT identified theme categories that lasted more than 10 years. The breakdown by theme category then showed Technic as having the highest correlation, while also revealing interesting discontinuities for a few other themes. Since the previous guideline worked so well, I decided to add style guidelines to all subsequent analytics prompts.

Interesting. Let’s find ways to strengthen the correlation

Check if removing older period (say pre-1960) helps

Check if removing outliers

Check if stronger relationship is found within theme category
- Try theme categories that lasted at least 10 years
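A sketch of the per-theme-category breakdown requested above (again under the Year vs. total part count assumption and the column names used earlier):

```python
# Keep theme categories whose sets span at least 10 years, then compute the correlation within each
span = set_master.groupby("Theme Category")["Year"].agg(lambda s: s.max() - s.min())
long_lived = span[span >= 10].index

corr_by_category = (
    set_master[set_master["Theme Category"].isin(long_lived)]
    .groupby("Theme Category")
    .apply(lambda g: g["Year"].corr(g["total_parts"]))
    .sort_values(ascending=False)
)
print(corr_by_category)
```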

I decided to dive deeper into the Technic theme (my favorite) to see which sets are contributing to the part-count inflation. And yes, the recent series of supercars were the main contributors. (Lambo and Ferrari, are you listening?)

Now let’s delve into Technic theme category which shows highest correlation.

Create a technic master table to include only relevant rows (using Theme Category column)

Add subcategory based on main entity (e.g. plane, car, etc) within set name

Display an interactive scatterplot showing the correlation

Display set name upon hover

Use subcategory above for coloring each dot

Use plot size 10x10 inch for readability

Repeat correlation analysis for some of the 5 main subcategories
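The interactive scatterplot portion can be reproduced with plotly express along these lines (a sketch; the keyword-based subcategory extraction is a simplistic stand-in for whatever GPT derived from the set names):

```python
import plotly.express as px

technic = set_master[set_master["Theme Category"] == "Technic"].copy()

# Crude subcategory based on the main entity mentioned in the set name (illustrative only)
def subcategory(name):
    for keyword in ["car", "truck", "plane", "helicopter", "crane"]:
        if keyword in name.lower():
            return keyword
    return "other"

technic["Subcategory"] = technic["Set Name"].apply(subcategory)

# Interactive scatter: set name on hover, colored by subcategory (plotly sizes are in pixels, not inches)
fig = px.scatter(
    technic, x="Year", y="total_parts",
    color="Subcategory", hover_name="Set Name",
    width=700, height=700,
)
fig.show()
```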

Okay, I could go on forever, but I think I accomplished the goal of understanding what GPT4 (plus the Notable plugin) can do. What I still don't know is what it can't do, since it answered all of the basic data processing and analytics questions with ease, often recovering from errors and figuring things out on its own. Maybe I'll look into that with a different dataset!

Conclusions: The future is here, and it’s just beginning

When ChatGPT was first announced, I thought we data scientists might have a few years to adjust at most. Now I feel that the time to act is now. With a very rudimentary add-on (the Notable plug-in), GPT4 provides pretty compelling automation for the everyday analytics workflow. Yes, the setup is still unpolished, with both ChatGPT and Notable losing session context frequently, which necessitates starting over from scratch.

With that said, I'd still suggest that any data scientist serious about their profession act sooner rather than later to experience the new analytical workflow generative AI affords, and think about how it will change the nature of their job. For me, it's liberating to see how we can perform basic tasks with natural language commands, and I can foresee an oncoming surge of data science prompts being shared the way code snippets were in the previous era. At the same time, it's hard to estimate the impact on the broader industry, although I suppose most of my earlier predictions would still hold.

No matter where you are, now is a good time to ask:
what does it mean that the no-code era of data analytics has finally — well, almost — arrived?

Appendix: Practical Tips

If you’d like to try GPT4-powered data analysis right away, here are a few tips.

  • GPT4 (within the ChatGPT interface) has a hard limit of 25 messages per 3 hours, so you'll want to bundle commands to the extent possible. Bundling also helps with multitasking, since GPT4 responses are quite slow.
  • Both ChatGPT and Notable seem to lose session context frequently (and it's unclear how far back the task context is used to generate future responses). This necessitates saving prompts, intermediate data, and notebooks so that no information is lost.
  • To be clear, these are saved by ChatGPT and Notable, but I'd keep a local copy, since ChatGPT removes subsequent messages if any previous message is revised, and Notable notebooks can be modified by both the user and the Notable plug-in.
  • Continuing on the previous point, as with any DS project, I'd break the whole project down into a few sessions to keep the ChatGPT sessions and notebooks manageable. For example, I separated 1) data preparation and 2) exploratory analysis into separate sessions.
  • Make sure to save and refine prompts over time. Here’s the collection of prompts used for this post:
  • Lego Data Analysis with GPT4 (Prompts)

p.s. Are you interested in extending the state of the art in data science applied to the domain of search and recommendation? We're hiring in Data & Analytics (in Seattle, WA; Seoul, Korea; or remote). Please shoot me an email at jin dot y dot kim at navercorp dot com. Our team blog (in Korean) is also available.
