fix(data-management): update canonical hugging face dataset paths and configs by GovIndLok · Pull Request #180 · rohitg00/ai-engineering-from-scratch

GovIndLok · 2026-05-25T16:32:35Z

What this PR does

Fixes #179: corrects the Hugging Face dataset path and removes an unused import in data_utils.py.
Additionally, this PR expands into a full audit of prompt-data-helper.md, updating every dataset identifier to its canonical Hugging Face source name and structure (including the required configurations for datasets such as GLUE and WMT16) so that learners can copy and paste working code.

Kind of change

Checklist

Code runs without errors with the listed dependencies
No comments in code files (docs explain, code is self-explanatory)
Built from scratch first, then shown with a framework (for new lessons)
Lesson folder matches LESSON_TEMPLATE.md structure
ROADMAP.md row for the lesson is a markdown link ([Name](phases/...)), not bare text
One lesson per commit (atomic per-lesson rule)
Tested locally / code output matches what docs/en.md claims

Phase / lesson

Phase 0 · 09-data-management

Notes for reviewer

Two small fixes:

The dataset path was `"rotten_tomatoes"` but should be `"cornell-movie-review-data/rotten_tomatoes"` (the correct HF namespace).
Removed the unused `import os` at the top of data_utils.py.

Both changes verified to work correctly. Attached screenshot showing dataset loading successfully.
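
The path change described above can be sketched as follows. This is a minimal illustration, not the code in data_utils.py: `load_rotten_tomatoes` is a hypothetical helper introduced here, and only the two identifier strings come from the notes. It assumes the `datasets` package is installed when the function is actually called.

```python
# Identifiers quoted in the reviewer notes above.
OLD_ID = "rotten_tomatoes"
NEW_ID = "cornell-movie-review-data/rotten_tomatoes"


def load_rotten_tomatoes(split="train", streaming=False):
    """Load one split with the namespaced ID (hypothetical helper, not from the repo)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # The namespaced form is "<owner>/<dataset>"; the bare OLD_ID has no owner part.
    return load_dataset(NEW_ID, split=split, streaming=streaming)
```

Calling `load_rotten_tomatoes()` downloads the train split; passing `streaming=True` returns an iterable dataset instead of materializing it on disk.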

coderabbitai · 2026-05-25T16:32:47Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 31aa635f-ec37-4ac2-a430-5fb1b98875d1

📥 Commits

Reviewing files that changed from the base of the PR and between 01e177c and 961aa8b.

📒 Files selected for processing (1)

phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md

📝 Walkthrough

Walkthrough

Fixes the data utilities demo by removing an unused import and correcting the Rotten Tomatoes dataset identifier; also updates the common task-to-dataset mapping table in prompt-data-helper.md.

Changes

Data management fixes

| Layer / File(s) | Summary |
|---|---|
| Import cleanup `phases/00-setup-and-tooling/09-data-management/code/data_utils.py` | Removes the unused `os` import from the module. |
| Dataset identifier correction `phases/00-setup-and-tooling/09-data-management/code/data_utils.py` | Changes the hardcoded dataset identifier in the `__main__` demo from `rotten_tomatoes` to `cornell-movie-review-data/rotten_tomatoes` for both `load_and_inspect()` and `stream_dataset()` calls. |
| Prompt-data mapping updates `phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md` | Replaces several Hugging Face dataset ID entries in the "Common task-to-dataset mapping" table (text classification, sentiment, QA, translation, language modeling, image classification rows). |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main changes: updating Hugging Face dataset paths and configurations in the data management module. |
| Description check | ✅ Passed | The description clearly explains the fixes (incorrect dataset path and unused import) and expanded audit scope with working code verification. |
| Linked Issues check | ✅ Passed | The PR successfully addresses issue `#179` by updating the dataset path from 'rotten_tomatoes' to 'cornell-movie-review-data/rotten_tomatoes' and removing unused imports, with expanded audit of canonical HF dataset identifiers across the documentation. |
| Out of Scope Changes check | ✅ Passed | All changes are in scope: fixing the reported dataset path issue, removing unused imports, and auditing dataset identifiers in documentation for consistency with canonical Hugging Face sources. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coderabbitai

🧹 Nitpick comments (1)

phases/00-setup-and-tooling/09-data-management/code/data_utils.py (1)
161-161: ⚡ Quick win

Keep dataset ID guidance consistent with the updated demo.

Nice fix here. As a follow-up, please also update `phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md` (lines 32-40), which still maps Rotten Tomatoes to `rotten_tomatoes`; otherwise learners can still copy an identifier that fails with the current `datasets` library.

Also applies to: 164-164
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@phases/00-setup-and-tooling/09-data-management/code/data_utils.py` at line
161, The prompt-data-helper mapping still uses the old failing dataset id
"rotten_tomatoes"; update the mapping to the working dataset identifier used by
load_and_inspect, i.e., "cornell-movie-review-data/rotten_tomatoes", and change
any other occurrences (the second load_and_inspect use noted) so learners copy
the correct ID; specifically, align the Rotten Tomatoes entry in the
prompt-data-helper text (the block around lines 32-40) with the load_and_inspect
calls to avoid the mismatched identifier.
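
One way to catch this kind of drift between code and docs is to scan the markdown for bare legacy IDs. The sketch below is an assumption-laden illustration rather than anything in the repository: `find_stale_ids` and `LEGACY_TO_CANONICAL` are hypothetical names, and the mapping contains only the one identifier named in this comment.

```python
import re

# Legacy bare ID named in the review comment, mapped to its namespaced replacement.
LEGACY_TO_CANONICAL = {
    "rotten_tomatoes": "cornell-movie-review-data/rotten_tomatoes",
}


def find_stale_ids(text, mapping=LEGACY_TO_CANONICAL):
    """Return (line_number, legacy_id, canonical_id) for each bare legacy ID in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for legacy, canonical in mapping.items():
            # Skip occurrences already preceded by "owner/" or embedded in a longer name.
            pattern = r"(?<![\w/-])" + re.escape(legacy) + r"(?![\w-])"
            if re.search(pattern, line):
                hits.append((lineno, legacy, canonical))
    return hits
```

Running it over the contents of prompt-data-helper.md would list each line that still carries the old identifier, while leaving already-namespaced entries alone.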

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fd389feb-cff6-4607-b3ab-06eb6f744911

📥 Commits

Reviewing files that changed from the base of the PR and between 664862f and 26ffdb5.

📒 Files selected for processing (1)

phases/00-setup-and-tooling/09-data-management/code/data_utils.py

GovIndLok · 2026-05-25T21:55:48Z

@coderabbitai Good catch. Expanding this PR to audit and update the entire mapping table in prompt-data-helper.md. Fixing rotten_tomatoes to cornell-movie-review-data/rotten_tomatoes and updating other identifiers to their canonical HF source names (e.g., cnn_dailymail, stanfordnlp/imdb, rajpurkar/squad) so learners copy-paste working paths. Will push shortly.

coderabbitai · 2026-05-25T21:57:01Z

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find and display the prompt-data-helper.md file
fd -i "prompt-data-helper.md" --exec cat {}

Length of output: 2324

🌐 Web query:

Hugging Face Hub canonical dataset ID for "cnn_dailymail" - is it "abisee/cnn_dailymail" or another namespace?

Length of output: 8000

🌐 Web query:

Hugging Face Hub canonical dataset ID for "wmt16" dataset - correct namespace owner

Length of output: 7573

🌐 Web query:

Hugging Face Hub canonical dataset ID for "wikitext" dataset - correct namespace owner

Length of output: 13742

🌐 Web query:

Hugging Face Hub canonical dataset ID for "conll2003" dataset - correct namespace owner

Length of output: 14565

🌐 Web query:

Hugging Face Hub canonical dataset IDs for "mnist" and "cifar10" datasets - correct namespace owners

Length of output: 8194

@GovIndLok Great plan! I verified all the entries in the mapping table against the HF Hub. Here's a full audit to help you make the update complete in one pass:

| Task | Current ID | Canonical ID | Source |
|---|---|---|---|
| Text classification | `rotten_tomatoes` | `cornell-movie-review-data/rotten_tomatoes` | HF Hub |
| Sentiment analysis | `imdb` | `stanfordnlp/imdb` | HF Hub |
| NLI | `glue/mnli` | `nyu-mll/glue` (config `"mnli"`) | HF Hub |
| Question answering | `squad` | `rajpurkar/squad` | HF Hub |
| Summarization | `cnn_dailymail` | `abisee/cnn_dailymail` | HF Hub |
| Translation | `wmt16` | `wmt/wmt16` | HF Hub |
| Language modeling | `wikitext` | `Salesforce/wikitext` | HF Hub |
| Token classification | `conll2003` | `eriktks/conll2003` | HF Hub |
| Image classification | `mnist` | `ylecun/mnist` | HF Hub |
| Image classification | `cifar10` | `uoft-cs/cifar10` | HF Hub |
| Object detection | `detection-datasets/coco` | ✅ already correct | - |

Two additional things to catch beyond the table:

- Step 2 inline examples: the prompt text currently shows `imdb` and `squad` as example IDs; these should be updated to `stanfordnlp/imdb` and `rajpurkar/squad`.
- GLUE/MNLI syntax: `glue/mnli` is not valid `load_dataset()` syntax. The correct call is:
```
load_dataset("nyu-mll/glue", "mnli")
```
The table's HF ID column should read `nyu-mll/glue` (config: `mnli`) to avoid confusion.
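
The audit above can be captured as a small lookup that separates the repository path from the config name. This is a sketch under stated assumptions: `CANONICAL`, `load_args`, and `load` are hypothetical names, the entries are copied from the audit table, and `None` only means that no config was named in this thread (some of these datasets may still require one).

```python
# Legacy ID -> (namespaced path, config named in the audit or None).
CANONICAL = {
    "rotten_tomatoes": ("cornell-movie-review-data/rotten_tomatoes", None),
    "imdb": ("stanfordnlp/imdb", None),
    "glue/mnli": ("nyu-mll/glue", "mnli"),
    "squad": ("rajpurkar/squad", None),
    "cnn_dailymail": ("abisee/cnn_dailymail", None),
    "wmt16": ("wmt/wmt16", None),
    "wikitext": ("Salesforce/wikitext", None),
    "conll2003": ("eriktks/conll2003", None),
    "mnist": ("ylecun/mnist", None),
    "cifar10": ("uoft-cs/cifar10", None),
}


def load_args(legacy_id):
    """Return the positional arguments to pass to load_dataset for a legacy ID."""
    path, config = CANONICAL[legacy_id]
    # The config is the second positional argument, never part of the path string.
    return (path,) if config is None else (path, config)


def load(legacy_id, **kwargs):
    """Hypothetical wrapper; needs the `datasets` package and network access."""
    from datasets import load_dataset

    return load_dataset(*load_args(legacy_id), **kwargs)
```

For example, `load_args("glue/mnli")` yields the two arguments shown in the corrected call above, whereas `load_args("imdb")` yields a single path.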

🐇

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md`:
- Line 41: Update the table row that currently shows "Translation | WMT |
`wmt/wmt16`" to specify a required config/subset or add an explicit instruction
to pass one; for example change the cell to include the config like
"`wmt/wmt16`, `cs-en`" or append a note "pass config (e.g., `cs-en`) to
datasets.load_dataset", so callers of datasets.load_dataset("wmt/wmt16") know to
provide a config such as `cs-en`, `de-en`, etc.
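
The config requirement raised in this comment can be made explicit with a guard before the load call. The following is only a sketch: `wmt16_args` and `list_configs` are hypothetical helpers, and the two language pairs in the error message are the ones named in the comment. `get_dataset_config_names` is part of the `datasets` library and needs network access when called.

```python
WMT16_PATH = "wmt/wmt16"


def wmt16_args(config=None):
    """Build load_dataset arguments for WMT16, refusing a missing config."""
    if config is None:
        # The review comment names `cs-en` and `de-en` as example configs.
        raise ValueError(
            f"{WMT16_PATH} needs a language-pair config, e.g. 'cs-en' or 'de-en'"
        )
    return (WMT16_PATH, config)


def list_configs(path=WMT16_PATH):
    """Ask the Hub which configs exist (requires `datasets` and network access)."""
    from datasets import get_dataset_config_names

    return get_dataset_config_names(path)
```

A caller would then write `load_dataset(*wmt16_args("de-en"))`, and a missing config fails early with a readable message instead of deep inside the loader.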

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 08c74dd6-f9bd-45df-affa-9236ce384e82

📥 Commits

Reviewing files that changed from the base of the PR and between 26ffdb5 and 01e177c.

📒 Files selected for processing (1)

phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md

GovIndLok added 2 commits May 25, 2026 20:46

fix: update dataset path for Rotten Tomatoes in load_and_inspect and stream_dataset functions

b0b1423

fix: improve formatting of dataset split print statements

26ffdb5

coderabbitai Bot reviewed May 25, 2026

fix: update Hugging Face IDs for dataset recommendations in prompt-data-helper

01e177c

coderabbitai Bot reviewed May 25, 2026

Comment thread phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md Outdated

Merge branch 'main' into fix/rotten_tomato

d942a60

GovIndLok marked this pull request as draft May 26, 2026 14:11

fix: update Hugging Face IDs and configurations in dataset recommendations

961aa8b

GovIndLok changed the title ~~fix(data-management): update rotten_tomatoes dataset path and clean up imports~~ fix(data-management): update canonical hugging face dataset paths and configs May 26, 2026

GovIndLok marked this pull request as ready for review May 26, 2026 14:39

fix(data-management): update canonical hugging face dataset paths and configs #180
GovIndLok wants to merge 5 commits into rohitg00:main from GovIndLok:fix/rotten_tomato
