fix(data-management): update canonical hugging face dataset paths and configs #180
GovIndLok wants to merge 5 commits into
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
📝 Walkthrough
Fixes the data utilities demo by removing an unused import and correcting the Rotten Tomatoes dataset identifier; also updates the common task-to-dataset mapping table in prompt-data-helper.md.
Changes
Data management fixes
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
🧹 Nitpick comments (1)
phases/00-setup-and-tooling/09-data-management/code/data_utils.py (1)
161-161: ⚡ Quick win
Keep dataset ID guidance consistent with the updated demo.
Nice fix here. As a follow-up, please also update phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md (Line 32-40), which still maps Rotten Tomatoes to `rotten_tomatoes`; otherwise learners can still copy an identifier that fails in current `datasets`.
Also applies to: 164-164
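A stale identifier like the one flagged here could be caught with a small check over the markdown text. The following is only a sketch under stated assumptions, not part of this PR: `LEGACY_IDS` and `find_legacy_ids` are hypothetical names, and the sample rows merely imitate a mapping table rather than quoting the real file.

```python
import re

# Hypothetical set of bare identifiers to flag; only `rotten_tomatoes`
# is taken from the review comment, the set itself is an assumption.
LEGACY_IDS = {"rotten_tomatoes"}

# Matches any backticked token, e.g. `rotten_tomatoes` or `org/name`.
BACKTICKED = re.compile(r"`([^`]+)`")


def find_legacy_ids(markdown_text):
    """Return (line_number, identifier) pairs for backticked legacy IDs."""
    hits = []
    for lineno, line in enumerate(markdown_text.splitlines(), start=1):
        for token in BACKTICKED.findall(line):
            if token in LEGACY_IDS:
                hits.append((lineno, token))
    return hits


# Sample rows imitating a mapping table (illustrative, not the real file).
sample = (
    "| Task | Dataset | ID |\n"
    "| Sentiment | Rotten Tomatoes | `rotten_tomatoes` |\n"
    "| Sentiment | Rotten Tomatoes | `cornell-movie-review-data/rotten_tomatoes` |\n"
)
print(find_legacy_ids(sample))
```

Because the check compares whole backticked tokens, the namespaced identifier on the third sample row is not flagged even though it contains the legacy name as a suffix.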
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@phases/00-setup-and-tooling/09-data-management/code/data_utils.py` at line 161, The prompt-data-helper mapping still uses the old failing dataset id "rotten_tomatoes"; update the mapping to the working dataset identifier used by load_and_inspect, i.e., "cornell-movie-review-data/rotten_tomatoes", and change any other occurrences (the second load_and_inspect use noted) so learners copy the correct ID; specifically, align the Rotten Tomatoes entry in the prompt-data-helper text (the block around lines 32-40) with the load_and_inspect calls to avoid the mismatched identifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@phases/00-setup-and-tooling/09-data-management/code/data_utils.py`:
- Line 161: The prompt-data-helper mapping still uses the old failing dataset id
"rotten_tomatoes"; update the mapping to the working dataset identifier used by
load_and_inspect, i.e., "cornell-movie-review-data/rotten_tomatoes", and change
any other occurrences (the second load_and_inspect use noted) so learners copy
the correct ID; specifically, align the Rotten Tomatoes entry in the
prompt-data-helper text (the block around lines 32-40) with the load_and_inspect
calls to avoid the mismatched identifier.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fd389feb-cff6-4607-b3ab-06eb6f744911
📒 Files selected for processing (1)
phases/00-setup-and-tooling/09-data-management/code/data_utils.py
@coderabbitai Good catch. Expanding this PR to audit and update the entire mapping table in prompt-data-helper.md. Fixing rotten_tomatoes to cornell-movie-review-data/rotten_tomatoes and updating other identifiers to their canonical HF source names (e.g., cnn_dailymail, stanfordnlp/imdb, rajpurkar/squad) so learners copy-paste working paths. Will push shortly.
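The renaming described in this comment can be pictured as a lookup from bare names to namespaced identifiers. This is a minimal sketch, assuming the legacy keys `imdb` and `squad` (the comment only gives the namespaced forms); `CANONICAL_IDS`, `canonical_id`, and `load_canonical` are hypothetical helpers, not code from the PR.

```python
# Bare names mapped to the namespaced IDs mentioned in this thread.
# The `imdb` and `squad` keys are assumed legacy spellings.
CANONICAL_IDS = {
    "rotten_tomatoes": "cornell-movie-review-data/rotten_tomatoes",
    "imdb": "stanfordnlp/imdb",
    "squad": "rajpurkar/squad",
}


def canonical_id(dataset_id):
    """Return the namespaced ID for a known bare name, else the input unchanged."""
    return CANONICAL_IDS.get(dataset_id, dataset_id)


def load_canonical(dataset_id, **kwargs):
    """Load a dataset by its canonical ID (needs `datasets` and network access)."""
    # Imported lazily so the mapping itself can be used and tested offline.
    from datasets import load_dataset

    return load_dataset(canonical_id(dataset_id), **kwargs)


print(canonical_id("rotten_tomatoes"))
```

Identifiers that are already namespaced pass through untouched, so the helper is safe to apply to a table that has been partly updated.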
Tip: For best results, initiate chat on the files or code changes.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find and display the prompt-data-helper.md file
fd -i "prompt-data-helper.md" --exec cat {}
Length of output: 2324
🌐 Web query:
Length of output: 8000
🌐 Web query:
Length of output: 7573
🌐 Web query:
Length of output: 13742
🌐 Web query:
Length of output: 14565
🌐 Web query:
Length of output: 8194
Two additional things to catch beyond the table:
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md`:
- Line 41: Update the table row that currently shows "Translation | WMT |
`wmt/wmt16`" to specify a required config/subset or add an explicit instruction
to pass one; for example change the cell to include the config like
"`wmt/wmt16`, `cs-en`" or append a note "pass config (e.g., `cs-en`) to
datasets.load_dataset", so callers of datasets.load_dataset("wmt/wmt16") know to
provide a config such as `cs-en`, `de-en`, etc.
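The config requirement raised in this inline comment can be made explicit with a thin guard in front of the load call. A hedged sketch follows: `REQUIRED_CONFIG_EXAMPLES`, `build_load_args`, and `load_with_config` are hypothetical names, and only the `wmt/wmt16` entry with `cs-en` comes from the comment itself.

```python
# Datasets that need an explicit config, with one example value each.
# Only the wmt/wmt16 entry is taken from the review comment.
REQUIRED_CONFIG_EXAMPLES = {"wmt/wmt16": "cs-en"}


def build_load_args(path, config=None):
    """Return positional args for load_dataset, rejecting a missing config."""
    if path in REQUIRED_CONFIG_EXAMPLES and config is None:
        example = REQUIRED_CONFIG_EXAMPLES[path]
        raise ValueError(
            f"{path} needs a config, e.g. load_dataset({path!r}, {example!r})"
        )
    return (path,) if config is None else (path, config)


def load_with_config(path, config=None, **kwargs):
    """Load a dataset, failing early when a required config is missing."""
    # Imported lazily; calling this needs `datasets` and network access.
    from datasets import load_dataset

    return load_dataset(*build_load_args(path, config), **kwargs)


print(build_load_args("wmt/wmt16", "cs-en"))
```

Failing before the network call gives learners an error message that already contains a working invocation, which is the same intent as adding the config to the table row.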
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 08c74dd6-f9bd-45df-affa-9236ce384e82
📒 Files selected for processing (1)
phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md
What this PR does
Fixes #179: corrects the incorrect Hugging Face dataset path and removes an unused import in data_utils.py.
Additionally, expands into a full audit of prompt-data-helper.md to update all dataset identifiers to their canonical Hugging Face source names and structures (including required configurations for datasets like GLUE and WMT16), so that learners can copy-paste working code.
Kind of change
Checklist
LESSON_TEMPLATE.md structure
[Name](phases/...)), not bare text
docs/en.md claims
Phase / lesson
Phase 0 · 09-data-management
Notes for reviewer
Two small fixes:
import os at the top
Both changes verified to work correctly. Attached screenshot showing dataset loading successfully.
