<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Evaluation on Tech News Feed</title>
    <link>https://news.dhphong.com/tags/evaluation/</link>
    <description>Recent content in Evaluation on Tech News Feed</description>
    <generator>Hugo -- 0.131.0</generator>
    <language>vi</language>
    <lastBuildDate>Mon, 27 Apr 2026 00:06:02 +0700</lastBuildDate>
    <atom:link href="https://news.dhphong.com/tags/evaluation/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>[Hacker News] SWE-bench Verified no longer measures frontier coding capabilities</title>
      <link>https://news.dhphong.com/posts/2026-04-26-swe-bench-verified-no-longer-measures-frontier-coding/</link>
      <pubDate>Mon, 27 Apr 2026 00:06:02 +0700</pubDate>
      <guid>https://news.dhphong.com/posts/2026-04-26-swe-bench-verified-no-longer-measures-frontier-coding/</guid>
      <description>Source: Hacker News
Summary: OpenAI has announced that it will stop using SWE-bench Verified, the most widely used benchmark for measuring AI coding ability, over data contamination concerns. The evidence: when prompted with GitHub issue text from the benchmark, models reproduced the exact file diffs in the dataset, which points to memorization rather than genuine reasoning.
Specifically, o3 &amp;ldquo;memorized&amp;rdquo; 15% of the examples in SWE-bench Verified, and o4-mini 4%.</description>
    </item>
  </channel>
</rss>
