HACK

Related Articles from SNS

HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models

Reward models are central to large language model (LLM) alignment, but they remain vulnerable to reward hacking. To evaluate reward-model robustness, we introduce RewardHackBench, containing 13 reward-hacking patterns covering real-life high-stakes domains and general settings, and we find severe failures on specific subcategories across eight reward models. To mitigate these failures, we propose HARVE, a training-free reward-head editing method for scalar reward...

arXiv CS 7d ago

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate.

arXiv CS 6d ago

From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents

arXiv:2606.06223v1. Language-model agents act through repeated cycles of observation, reasoning, and action selection, making safety monitoring depend on both internal model state and environment context. We study reward-hacking monitors in ReAct-style agents acting in Gameable ALFWorld and WebShop. Agents are instrumented with activation-based reward-hack scores, token-level entropy, and decision-context features.

arXiv CS 5d ago

Meta confirms 1000s of Instagram accounts were hacked by abusing its AI chatbot

Meta is notifying thousands of people whose Instagram accounts were hijacked during the months-long abuse of the company's AI chatbot, which hackers repeatedly tricked into taking control of a person's account. In a new data breach notification letter, seen by this week in security, Meta has revealed for the first time how many people had their accounts hijacked as part of the long-running hacking campaign,...

Hacker News 4d ago

Reform MP refuses to say whether Farage should produce evidence for Russian hack claim

A senior Reform UK figure has declined to pressure Nigel Farage into providing evidence to security services regarding his claim of being hacked by Russian agents. This refusal comes as Farage faces increasing pressure to substantiate his assertion that a state-sponsored Russian hack was responsible for the Guardian's reporting on a £5 million gift he received. Both Labour and the Conservatives have highlighted the national security risks associated with Russian state activity.

The Guardian World 16d ago

Life-changing medicine or beauty hack? How Ozempic came to be seen as both, and why that's risky

The same drug that is helping patients manage diabetes and reduce their risk of serious complications from chronic conditions is also being discussed as a beauty hack by people hoping to lose a few kilograms. Experts say more education and awareness are needed.

Channel News Asia 5d ago

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

arXiv:2606.09711v1. Reward hacking is usually studied after it becomes visible, once a model earns high proxy reward while failing the intended task. We instead study what proxy RL teaches before that failure appears. We introduce Proxy Reward Internalization and Mechanistic Exploitation (PRIME), a learned capability to assess task correctness, predict proxy acceptance, and reason about exploitable proxy-gold gaps.

arXiv CS 1d ago

Anthropic invites EU to access Mythos hacking tech

Anthropic has extended an invitation to the European Commission granting the EU’s cyber agency access to its powerful AI hacking tool Mythos, according to a Commission official familiar with the process. The AI firm made the formal invitation after a meeting with the Commission in San Francisco last Thursday, the official said, adding the EU now has to put in place a mechanism to access the model with proper security safeguards. Bloomberg reported on Monday that the EU’s Athens-based...

Politico EU 9d ago

MP staffer’s account sent almost 2,000 phishing emails after suspected hack

LONDON — Nearly 2,000 people were targeted with a phishing email after the suspected hack of a staffer of senior Labour MP Florence Eshalomi. The email contained a malicious file — identified by the Parliamentary Digital Service as a phishing attack — that tried to secure the credentials of other accounts, according to an email seen by POLITICO, which was sent by Eshalomi to those targeted in the days following last week’s breach. Westminster journalists and public affairs...

Politico EU 1d ago

How Turkey Hacked the Hair Transplant Industry

The astounding growth of the hair-transplant industry in Turkey is not just a medical tourism success story; it’s also a tale of “hacked” medical equipment and algorithmic craftsmanship. From a biological and evolutionary perspective, human hair is often viewed as an unremarkable mass of keratin that still plays some important functions—protecting our scalps from the sun’s harmful ultraviolet rays and regulating our body temperatures—but, for the most part, is no longer essential to our...

Wired 10d ago