Blake Bullwinkel

I'm a Researcher on the AI Red Team at Microsoft, where I study safety and security vulnerabilities in generative AI systems. In doing so, I hope to both mitigate AI risks in the short term and build more robust systems in the long term. I became interested in safety and alignment while working on my master's thesis, which focused on incorporating physics-based constraints into neural networks.

In my free time, I enjoy running, biking, and being outside. I love learning and keep a (neglected) page of math notes where I explain important topics in my own words. You can find my resume here, and feel free to reach out!

Papers

A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks
B Bullwinkel, M Russinovich, A Salem, S Zanella-Beguelin, D Jones, G Severi, E Kim, K Hines, A Minnich, Y Zunger, R Shankar Siva Kumar
ICML Workshop on Data in Generative Models (DIG-BUGS), 2025
Arxiv

Steering Language Model Refusal with Sparse Autoencoders
K O'Brien, D Majercak, X Fernandes, R Edgar, B Bullwinkel, J Chen, H Nori, D Carignan, E Horvitz, F Poursabzi-Sangdeh
ICML Workshop on Actionable Interpretability (AIW), 2025
Arxiv Poster

A Systemization of Security Vulnerabilities in Computer Use Agents
D Jones, G Severi, M Pouliot, G Lopez, J de Gruyter, S Zanella-Beguelin, J Song, B Bullwinkel, P Cortez, A Minnich
ICML Workshop on Computer Use Agents, 2025
Arxiv

Lessons From Red Teaming 100 Generative AI Products
B Bullwinkel et al.
Microsoft BlueHat, 2024
NeurIPS Workshop on Red Teaming GenAI, 2024
Blog Arxiv eBook Talk

Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle
B Bullwinkel et al.
Arxiv, 2024
Arxiv

PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI Systems
B Bullwinkel et al.
Conference on Applied Machine Learning in Information Security (CAMLIS), 2024
Arxiv

Using Large Language Models for Humanitarian Frontline Negotiation: Opportunities and Considerations
Z Ma, S Su, N Zhao, L Bieske, B Bullwinkel, Y Zhang, S Yang, Z Luo, S Li, G Liao, B Wang, J Gao, Z Wen, C Bruderlein, W Pan
ICML Workshop on the Next Generation of AI Safety (NextGenAISafety), 2024
Arxiv Poster

Transfer Learning with Physics-Informed Neural Networks for Efficient Simulation of Branched Flows
R Pellegrin, B Bullwinkel, M Mattheakis, P Protopapas
NeurIPS Workshop on Machine Learning and the Physical Sciences, 2022
Arxiv Poster

DEQGAN: Learning the Loss Function for PINNs with Generative Adversarial Networks
B Bullwinkel, D Randle, P Protopapas, D Sondak
ICML Workshop on AI for Science (AI4Science), 2022
Arxiv Poster

Evaluating the Fairness Impact of Differentially Private Synthetic Data
B Bullwinkel, K Grabarz, L Ke, S Gong, C Tanner, J Allen
ICML Workshop on Theory and Practice of Differential Privacy (TPDP), 2022
Arxiv Poster


Projects

Azure/PyRIT
Active contributor to PyRIT, an open-source framework that empowers security professionals and machine learning engineers to proactively find risks in their generative AI systems.
Repo

DEQGAN
Co-created a Python package that implements DEQGAN, an unsupervised generative adversarial network method for solving ordinary and partial differential equations.
Repo

Marble Groceries
Developed an iOS app (now defunct) that helped users understand the environmental impact of their grocery purchases by scanning product barcodes.
Article

Classifying the Sounds of NYC
Trained and tuned a variety of models to classify audio clips from the UrbanSound8K dataset, recorded around New York City, into ten classes.
Report Notebook

Modeling ASA Section Membership
Constructed binary response generalized linear models to predict whether members of the American Statistical Association belonged to at least one section.
Report Code

Woof Woof! Computer Vision & NLP App for Austin Pets Alive
Built a web app that lets users "chat" with dogs in the Austin Pets Alive animal shelter and search for visually similar ones, using NLP and computer vision models.
Code

Wildfire Risk Prediction & Response Optimization
Trained tree-based classifiers on weather data to predict wildfire risk in California counties and used mixed-integer programming to determine the optimal assignment of firefighters across the state.
Report Code

DreamDiff Python Package
Worked in a team of three to develop a Python package that implements forward-mode automatic differentiation (AD), root-finding, optimization, and quadratic spline interpolation.
Repo PyPI

Analysis of Wildfires, Air Quality, and Public Health
Conducted time series analysis in R to link spikes in PM2.5 concentration to specific wildfire events in California and used major axis regression to explore correlations between air quality and public health outcomes.
Slides Code

Modeling Electricity Consumption in the US
Built linear regression models to predict household electricity consumption in the US from various residential characteristics.
Report Code

Early Epidemiological Model Parameters for COVID-19
Modeled early-stage COVID-19 case data in mainland China using systems of ordinary differential equations.
Slides Code