Netzer Epstein

I work on Microsoft's Excel Copilot team and research AI safety: how language models behave under evaluation, and how to make evaluation itself trustworthy and verifiable. I build the experiments, benchmarks, and open datasets that make claims about model reliability testable.

Now: Microsoft · Excel Copilot
Research: SPAR & Heron AI Security fellow
Based in: Tel Aviv, Israel
Netzer Epstein, AI Safety Researcher · Research Engineer

About

I'm an AI safety researcher and research engineer focused on LLM evaluation, model behavior, and verifiable machine learning. My work is empirical and evals-first: I design protocols, run them across frontier models, analyze the results with proper statistics, and ship reproducible datasets.

At Microsoft I work on the Excel Copilot team, where my day-to-day is effectively research engineering: designing and optimizing prompt-engineering strategies and building the evaluation benchmarks that measure model accuracy, efficiency, and relevance across native and web clients.

Through the SPAR and Heron AI Security fellowships I run empirical safety research: causally attributing evaluation awareness in LLMs to specific surface cues (with a 22.7k-transcript public dataset), and benchmarking how zero-knowledge ML systems diverge from the protocols they claim to implement. I care about rigor, measurable results, and shipping work others can build on.

Focus areas

Evaluation awareness · LLM evals & benchmarking · Deception & sandbagging · Multi-agent interaction · Cooperation & coordination · Verifiable / zero-knowledge ML

Research

Two parallel fellowship projects on different sides of trustworthy AI: evaluation behavior and cryptographic verifiability.

SPAR Spring 2026 · 2026

What tips models off? Causal attribution of LLM evaluation awareness

Research Fellow · Advised by Qiyao Wei

Under review

LLMs can tell when they're being evaluated. I built a causal protocol that identifies which surface cues tip them off, transplanting evaluation 'tells' between transcripts under LLM-as-judge quality control, then released a 22.7k-transcript dataset to study it.

flip rate from injected eval scaffolding metadata
significant effect across all 5 tell templates
cross-model effect correlation (gpt-5-mini vs gemini-3-flash)
causal transcripts published on Hugging Face
Evaluation awareness · LLM judges · Causal analysis · Benchmarks
Submitted to EMNLP 2026 · eval-awareness-tells dataset · Code

Heron AI Security Fellowship (first cohort) · 2026

Auditing zero-knowledge ML: benchmarking theory-to-implementation gaps

Research Fellow · Advised by Daniel Kang

Under review

ZKML papers report large speedups for verifiable inference, but some gains come from omitting cryptographic operations the proofs depend on. I built a 56-artifact audit benchmark and a multi-agent inspector that quantify and automatically catch these protocol-to-implementation gaps.

expert-authored vulnerability artifacts, 6 crypto categories
hidden zkLLM prover-time inflation (≈911× proof size)
inspector recall at 71.1% precision (F1 +5pp vs baseline)
ZKML · Verifiable inference · Multi-agent · Cryptography
Contributes to a NeurIPS 2026 submission · zkml-inspector · zkml-audit-benchmark · Dataset on Hugging Face

Projects

Open-source research tooling and experiments, mostly LLM evaluation, deception/sandbagging studies, and zero-knowledge ML auditing.

Experience

Five years at Microsoft on Excel, most recently on Excel Copilot, alongside research fellowships in AI safety.

Industry

  1. 2024–Present · Tel Aviv, Israel

    Research Engineer, Excel Copilot

    Microsoft, Israel R&D Center

    Building LLM-powered features for Excel Copilot: formula suggestions, prompt-engineering strategies, and the evaluation benchmarks that measure model accuracy, efficiency, and user relevance across native and web clients.

    TypeScript · C++ · OpenAI APIs · Azure DevOps · LLM evals
  2. 2023–2024 · Tel Aviv, Israel

    Software Engineer, Excel Online

    Microsoft, Israel R&D Center

    Led smart-suggestions work in Excel Online and shipped full-stack features across Excel Desktop and Excel Online.

    TypeScript · Node.js · React · React Native · C++ · KQL
  3. 2021–2023 · Tel Aviv, Israel

    Software Engineer (Student position), Excel Online

    Microsoft, Israel R&D Center

    Owned slices of Excel Online infrastructure, telemetry, and complex build/bundling pipelines.

    C# · Node.js · Webpack

Fellowships & training

  1. 2026

    SPAR Spring Fellow, Evaluation Awareness

    SPAR (Supervised Program for Alignment Research)

    Researching evaluation awareness in LLMs: how models may alter their behavior when they detect they are being benchmarked. Advised by Qiyao Wei.

    LLM evals · Mechanistic probes · Causal interventions
  2. 2025–2026

    Heron AI Security Research Fellow (first cohort)

    Heron AI Security Initiative

    Built zkml-inspector and zkml-audit-benchmark: a multi-agent auditor and 56-artifact dataset for catching soundness gaps in zero-knowledge ML implementations. Advised by Daniel Kang.

    ZKML · Multi-agent systems · Python
  3. 2025

    BlueDot Impact, Technical AI Safety

    BlueDot Impact

    Completed a technical curriculum covering mechanistic interpretability, RLHF, and threat modeling for transformative AI.

  4. June 2026

    ARBOx4 Fellow, Oxford AI Safety Initiative

    OAISI (University of Oxford)

    Alignment research bootcamp covering core technical AI safety methods and hands-on research practice.

Toolbox

The stack I reach for across research engineering, evaluation, and safety work. When a project needs a tool I don't yet know, I learn it, as I have done repeatedly.

Languages
Python · TypeScript · C++ · CUDA · C# · KQL · JavaScript
AI / ML & evals
LLM evaluation & benchmark design · Inspect AI · Prompt engineering · OpenAI / Anthropic / Gemini APIs · Azure OpenAI · Multi-agent systems · Statistical analysis (scipy) · Hugging Face datasets
AI safety
Evaluation awareness · Deception & sandbagging · Evals & red-teaming · AI control · Verifiable / zero-knowledge ML · Adversarial robustness · Interpretability
Engineering
React / React Native · Node.js · Full-stack · Azure DevOps · Webpack · Telemetry

Writing

I write occasionally about LLM internals, evaluation, and AI safety, working through ideas in public. Posts here sync automatically from my Substack.

Education

2019–2022

BSc in Computer Science

The Hebrew University of Jerusalem (HUJI)

Internet & Society Excellence Program (MATAR)

MATAR is a selective excellence track that pairs a full computer science degree with the interdisciplinary study of how computing and the internet reshape society. Alongside core CS, I studied the legal, ethical, and societal dimensions of technology, and that grounding in the stakes of deploying powerful systems is where my interest in AI safety first took root.

GPA 91 / 100

Off the clock

A few of the things that occupy me away from a screen, and the occasional conversation starter.

Games & stories

Tabletop RPGs, narrative games, and designing my own. I'm interested in games and gamification as a tool for studying how humans and AI behave under different incentives.

TTRPGs · Game design · Storytelling games

Reading & watching

Sci-fi and fantasy, plus animated series (not anime). Recently in rotation:

Dungeon Crawler Carl · Between Two Fires · Malazan · There Is No Antimemetics Division · Powder Mage · Invincible · The Legend of Vox Machina · Gravity Falls

Outdoors & art

Hiking, jogging, and watercolor painting: the things that pull me away from a screen.

Hiking · Jogging · Watercolor

Get in touch

Always happy to discuss research, especially evaluations, model behavior, and verifiable ML, or to trade notes and feedback on ideas. Reach out anytime.