fix(data-management): update canonical hugging face dataset paths and configs #180

Open
GovIndLok wants to merge 5 commits into rohitg00:main from GovIndLok:fix/rotten_tomato

GovIndLok commented May 25, 2026

What this PR does

Fixes #179: corrects the Hugging Face dataset path and removes an unused import in data_utils.py.
Additionally, expands into a full audit of prompt-data-helper.md to update all dataset identifiers to their canonical Hugging Face source names and structures (including required configurations for datasets like GLUE and WMT16) to ensure learners can copy-paste working code.

Kind of change

  • New lesson
  • Fix to an existing lesson
  • Translation
  • New output (prompt, skill, agent, MCP server)
  • Docs / website / tooling

Checklist

  • Code runs without errors with the listed dependencies
  • No comments in code files (docs explain, code is self-explanatory)
  • Built from scratch first, then shown with a framework (for new lessons)
  • Lesson folder matches LESSON_TEMPLATE.md structure
  • ROADMAP.md row for the lesson is a markdown link ([Name](phases/...)), not bare text
  • One lesson per commit (atomic per-lesson rule)
  • Tested locally / code output matches what docs/en.md claims

Phase / lesson

Phase 0 · 09-data-management

Notes for reviewer

Two small fixes:

  1. Dataset path was "rotten_tomatoes" but should be "cornell-movie-review-data/rotten_tomatoes" (correct HF namespace)
  2. Removed unused import os at the top

Both changes verified to work correctly. Attached screenshot showing dataset loading successfully.
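
For reference, a minimal sketch of how the corrected identifier could be passed to the `datasets` library for both a regular and a streaming load. The helper name `load_rotten_tomatoes` and its `loader` parameter are hypothetical and are not taken from data_utils.py; only the dataset ID comes from this PR.

```python
# Namespaced ID from this PR; the bare "rotten_tomatoes" is what issue #179 reports as failing.
ROTTEN_TOMATOES_ID = "cornell-movie-review-data/rotten_tomatoes"


def load_rotten_tomatoes(split="train", streaming=False, loader=None):
    """Load Rotten Tomatoes with the namespaced ID (hypothetical helper).

    `loader` exists only so the call can be exercised without network
    access; by default it is `datasets.load_dataset`.
    """
    if loader is None:
        # Imported lazily so this module imports without `datasets` installed.
        from datasets import load_dataset
        loader = load_dataset
    return loader(ROTTEN_TOMATOES_ID, split=split, streaming=streaming)
```

With `datasets` installed and network access available, `load_rotten_tomatoes()` would download the train split, and `load_rotten_tomatoes(streaming=True)` would return an iterable dataset instead of downloading it up front.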


coderabbitai Bot commented May 25, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 31aa635f-ec37-4ac2-a430-5fb1b98875d1

📥 Commits

Reviewing files that changed from the base of the PR and between 01e177c and 961aa8b.

📒 Files selected for processing (1)
  • phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md

📝 Walkthrough


Fixes the data utilities demo by removing an unused import and correcting the Rotten Tomatoes dataset identifier; also updates the common task-to-dataset mapping table in prompt-data-helper.md.

Changes

Data management fixes

| Layer / File(s) | Summary |
| --- | --- |
| Import cleanup (phases/00-setup-and-tooling/09-data-management/code/data_utils.py) | Removes the unused os import from the module. |
| Dataset identifier correction (phases/00-setup-and-tooling/09-data-management/code/data_utils.py) | Changes the hardcoded dataset identifier in the __main__ demo from rotten_tomatoes to cornell-movie-review-data/rotten_tomatoes for both load_and_inspect() and stream_dataset() calls. |
| Prompt-data mapping updates (phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md) | Replaces several Hugging Face dataset ID entries in the "Common task-to-dataset mapping" table (text classification, sentiment, QA, translation, language modeling, image classification rows). |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: updating Hugging Face dataset paths and configurations in the data management module. |
| Description check | ✅ Passed | The description clearly explains the fixes (incorrect dataset path and unused import) and expanded audit scope with working code verification. |
| Linked Issues check | ✅ Passed | The PR successfully addresses issue #179 by updating the dataset path from 'rotten_tomatoes' to 'cornell-movie-review-data/rotten_tomatoes' and removing unused imports, with expanded audit of canonical HF dataset identifiers across the documentation. |
| Out of Scope Changes check | ✅ Passed | All changes are in scope: fixing the reported dataset path issue, removing unused imports, and auditing dataset identifiers in documentation for consistency with canonical Hugging Face sources. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
phases/00-setup-and-tooling/09-data-management/code/data_utils.py (1)

161-161: ⚡ Quick win

Keep dataset ID guidance consistent with the updated demo.

Nice fix here. As a follow-up, please also update phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md (lines 32-40), which still maps Rotten Tomatoes to rotten_tomatoes; otherwise learners can still copy an identifier that fails with current versions of the datasets library.

Also applies to: 164-164
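
One way to catch this kind of drift is to scan the Markdown for stale identifiers. The sketch below is only illustrative: `find_stale_ids` is a hypothetical helper, and it assumes that dataset IDs in prompt-data-helper.md are wrapped in backticks, which this thread does not confirm.

```python
import re


def find_stale_ids(markdown_text, stale_ids):
    """Return (line_number, dataset_id) pairs for backticked stale IDs.

    Hypothetical helper: assumes IDs appear in the document as `like_this`.
    """
    hits = []
    for lineno, line in enumerate(markdown_text.splitlines(), start=1):
        for dataset_id in stale_ids:
            # Match only when the backticks enclose the ID exactly, so
            # `owner/rotten_tomatoes` is not reported as a bare `rotten_tomatoes`.
            if re.search("`" + re.escape(dataset_id) + "`", line):
                hits.append((lineno, dataset_id))
    return hits
```

Running it over the file contents with `["rotten_tomatoes"]` as the stale list would point at any row that still uses the bare ID.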

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@phases/00-setup-and-tooling/09-data-management/code/data_utils.py` at line
161, The prompt-data-helper mapping still uses the old failing dataset id
"rotten_tomatoes"; update the mapping to the working dataset identifier used by
load_and_inspect, i.e., "cornell-movie-review-data/rotten_tomatoes", and change
any other occurrences (the second load_and_inspect use noted) so learners copy
the correct ID; specifically, align the Rotten Tomatoes entry in the
prompt-data-helper text (the block around lines 32-40) with the load_and_inspect
calls to avoid the mismatched identifier.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fd389feb-cff6-4607-b3ab-06eb6f744911

📥 Commits

Reviewing files that changed from the base of the PR and between 664862f and 26ffdb5.

📒 Files selected for processing (1)
  • phases/00-setup-and-tooling/09-data-management/code/data_utils.py

@GovIndLok
Author

@coderabbitai Good catch. Expanding this PR to audit and update the entire mapping table in prompt-data-helper.md. Fixing rotten_tomatoes to cornell-movie-review-data/rotten_tomatoes and updating other identifiers to their canonical HF source names (e.g., cnn_dailymail, stanfordnlp/imdb, rajpurkar/squad) so learners copy-paste working paths. Will push shortly.


coderabbitai Bot commented May 25, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find and display the prompt-data-helper.md file
fd -i "prompt-data-helper.md" --exec cat {}

Length of output: 2324


🌐 Web query:

Hugging Face Hub canonical dataset ID for "cnn_dailymail" - is it "abisee/cnn_dailymail" or another namespace?

Length of output: 8000


🌐 Web query:

Hugging Face Hub canonical dataset ID for "wmt16" dataset - correct namespace owner

Length of output: 7573


🌐 Web query:

Hugging Face Hub canonical dataset ID for "wikitext" dataset - correct namespace owner

Length of output: 13742


🌐 Web query:

Hugging Face Hub canonical dataset ID for "conll2003" dataset - correct namespace owner

Length of output: 14565


🌐 Web query:

Hugging Face Hub canonical dataset IDs for "mnist" and "cifar10" datasets - correct namespace owners

Length of output: 8194



@GovIndLok Great plan! I verified all the entries in the mapping table against the HF Hub. Here's a full audit to help you make the update complete in one pass:

| Task | Current ID | Canonical ID | Source |
| --- | --- | --- | --- |
| Text classification | rotten_tomatoes | cornell-movie-review-data/rotten_tomatoes | HF Hub |
| Sentiment analysis | imdb | stanfordnlp/imdb | HF Hub |
| NLI | glue/mnli | nyu-mll/glue (config "mnli") | HF Hub |
| Question answering | squad | rajpurkar/squad | HF Hub |
| Summarization | cnn_dailymail | abisee/cnn_dailymail | HF Hub |
| Translation | wmt16 | wmt/wmt16 | HF Hub |
| Language modeling | wikitext | Salesforce/wikitext | HF Hub |
| Token classification | conll2003 | eriktks/conll2003 | HF Hub |
| Image classification | mnist | ylecun/mnist | HF Hub |
| Image classification | cifar10 | uoft-cs/cifar10 | HF Hub |
| Object detection | detection-datasets/coco | ✅ already correct | |
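
The audit above can also be written down as a lookup, which may be easier to diff against the mapping table. This is a sketch only: the names `CANONICAL_IDS` and `canonical_id` are hypothetical, and `glue/mnli` is left out on purpose because it is a path plus a config rather than a plain rename.

```python
# Legacy ID -> canonical ID, copied from the audit table in this comment.
# "glue/mnli" is omitted: it maps to a (path, config) pair, not a single ID.
CANONICAL_IDS = {
    "rotten_tomatoes": "cornell-movie-review-data/rotten_tomatoes",
    "imdb": "stanfordnlp/imdb",
    "squad": "rajpurkar/squad",
    "cnn_dailymail": "abisee/cnn_dailymail",
    "wmt16": "wmt/wmt16",
    "wikitext": "Salesforce/wikitext",
    "conll2003": "eriktks/conll2003",
    "mnist": "ylecun/mnist",
    "cifar10": "uoft-cs/cifar10",
}


def canonical_id(dataset_id):
    """Return the canonical ID for a legacy one; unknown IDs pass through unchanged."""
    return CANONICAL_IDS.get(dataset_id, dataset_id)
```

For example, `canonical_id("imdb")` gives `"stanfordnlp/imdb"`, while an ID that is already namespaced, such as `detection-datasets/coco`, is returned as-is.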

Two additional things to catch beyond the table:

  1. Step 2 inline examples — the prompt text currently shows `imdb` and `squad` as example IDs and should be updated to stanfordnlp/imdb and rajpurkar/squad.

  2. GLUE/MNLI syntax: glue/mnli is not valid load_dataset() syntax. The correct call is:

    load_dataset("nyu-mll/glue", "mnli")

    The table's HF ID column should reflect nyu-mll/glue (config: mnli) to avoid confusion.
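
If the table adopts the `nyu-mll/glue (config: mnli)` notation, a cell can be split back into the two arguments that `load_dataset()` expects. The parser below is a hypothetical sketch of that idea; the function name and the exact cell format it accepts are assumptions, not something defined in prompt-data-helper.md.

```python
import re

# Accepts "owner/name" or "owner/name (config: subset)"; assumed cell format.
_CELL = re.compile(
    r"^\s*(?P<path>[^\s()]+)\s*(?:\(config:\s*(?P<config>[^)\s]+)\s*\))?\s*$"
)


def parse_id_cell(cell):
    """Split a table cell into (path, config); config is None when absent."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognised dataset cell: {cell!r}")
    return match.group("path"), match.group("config")
```

The resulting pair can be unpacked straight into the call, for example `load_dataset(*parse_id_cell("nyu-mll/glue (config: mnli)"))`, since the second positional argument of `load_dataset` is the config name and `None` is its default.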

🐇


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md`:
- Line 41: Update the table row that currently shows "Translation | WMT |
`wmt/wmt16`" to specify a required config/subset or add an explicit instruction
to pass one; for example change the cell to include the config like
"`wmt/wmt16`, `cs-en`" or append a note "pass config (e.g., `cs-en`) to
datasets.load_dataset", so callers of datasets.load_dataset("wmt/wmt16") know to
provide a config such as `cs-en`, `de-en`, etc.
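
To illustrate the point about required configs, here is a small guard that refuses to build a `load_dataset` call for a multi-config dataset when no config is given. It is a sketch under stated assumptions: `REQUIRES_CONFIG` and `build_load_args` are hypothetical names, and the set lists only the two datasets this thread identifies as needing a config.

```python
# Dataset paths that this thread says need an explicit config.
REQUIRES_CONFIG = {"nyu-mll/glue", "wmt/wmt16"}


def build_load_args(path, config=None):
    """Return the positional arguments for load_dataset (hypothetical helper).

    Raises ValueError early instead of letting the missing config surface
    as an error at load time.
    """
    if path in REQUIRES_CONFIG and config is None:
        raise ValueError(f"{path!r} needs a config name as the second argument")
    return (path,) if config is None else (path, config)
```

A caller would then write `load_dataset(*build_load_args("wmt/wmt16", "cs-en"))`, while `build_load_args("wmt/wmt16")` fails immediately with a message naming the dataset.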

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 08c74dd6-f9bd-45df-affa-9236ce384e82

📥 Commits

Reviewing files that changed from the base of the PR and between 26ffdb5 and 01e177c.

📒 Files selected for processing (1)
  • phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md

Comment thread phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md Outdated
GovIndLok marked this pull request as draft May 26, 2026 14:11
GovIndLok changed the title from "fix(data-management): update rotten_tomatoes dataset path and clean up imports" to "fix(data-management): update canonical hugging face dataset paths and configs" May 26, 2026
GovIndLok marked this pull request as ready for review May 26, 2026 14:39
Development

Successfully merging this pull request may close these issues.

[bug] dataset path: rotten_tomatoes → cornell-movie-review-data/rotten_tomatoes in data_utils.py
