Evaluation on Tech News Feed

https://news.dhphong.com/tags/evaluation/
Recent content in Evaluation on Tech News Feed

[Hacker News] SWE-bench Verified no longer measures frontier coding capabilities
https://news.dhphong.com/posts/2026-04-26-swe-bench-verified-no-longer-measures-frontier-coding/
Mon, 27 Apr 2026 00:06:02 +0700

Source: Hacker News

Summary

OpenAI has announced that it will stop using SWE-bench Verified, the most widely used benchmark for measuring AI coding ability, over concerns about data contamination. The evidence: when prompted with GitHub issue text from the benchmark, models reproduced the exact file diffs in the dataset, which points to memorization rather than genuine reasoning. Specifically, 15% of the examples in SWE-bench Verified were "memorized" by o3, and 4% by o4-mini.
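
The contamination check described in the summary (prompt a model with the issue text, then see whether it reproduces the dataset's diff) can be illustrated with a minimal sketch. This is not OpenAI's actual procedure, which the summary does not detail. The names `normalize_patch`, `memorization_rate`, and `generate` are hypothetical, and treating "memorized" as a whitespace-normalized exact match is an assumption made only for illustration.

```python
from typing import Callable, Iterable, Tuple


def normalize_patch(patch: str) -> str:
    """Drop trailing whitespace and blank lines so formatting noise does not hide a match."""
    lines = [line.rstrip() for line in patch.strip().splitlines()]
    return "\n".join(line for line in lines if line)


def memorization_rate(
    examples: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
) -> float:
    """Fraction of (issue_text, gold_patch) pairs whose gold patch is reproduced verbatim.

    `generate` is a hypothetical stand-in for a model call: it takes only the
    issue text (no repository context) and returns the model's proposed diff.
    """
    total = 0
    hits = 0
    for issue_text, gold_patch in examples:
        total += 1
        # Assumption: an exact match after normalization counts as memorization.
        if normalize_patch(generate(issue_text)) == normalize_patch(gold_patch):
            hits += 1
    return hits / total if total else 0.0


# Toy data and a stub "model" that has memorized only the first example.
examples = [
    ("Fix off-by-one in pager", "-    end = start + n + 1\n+    end = start + n\n"),
    ("Handle empty config", "-    cfg = load()\n+    cfg = load() or {}\n"),
]
memorized = {"Fix off-by-one in pager": "-    end = start + n + 1\n+    end = start + n"}


def stub_generate(issue_text: str) -> str:
    return memorized.get(issue_text, "")


rate = memorization_rate(examples, stub_generate)
print(f"memorized: {rate:.0%}")
```

An exact-match criterion is deliberately strict: a model that solves the task by reasoning will rarely produce a character-identical diff from the issue text alone, so a high match rate is hard to explain without exposure to the benchmark during training.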