I gave an AI agent full control over a clinical research data science task in a development environment. After accepting all edits for about an hour, it produced a model with AUC 0.67. When I asked why it chose certain features over others, it hallucinated evidence.

As AI agents become more autonomous (via tools like Weco, Cursor, or GitHub Copilot in "agent mode"), we are drifting into a space of untraceable decision-making. An AI agent's ability to reason, iterate, and act is, in many ways, also what makes it opaque. I am excited that coding barriers (Python, R, FHIR, etc.) are being broken down with the help of LLMs so that more clinicians can be involved in developing the technology that improves care, but I am also concerned that decision-making is being offloaded to AI during development.

In healthcare, who is responsible when the developed model fails? The user? The AI? The AI provider?

The danger isn't that AI agents are bad at reasoning. It's that they are too good at simulating it, merely reflecting the rational thinking seen in their training data.

We need a new standard: decision logs.

Just as we require clinical trials to document every step, every hypothesis, and every deviation, we must demand the same from AI agents. Every choice, from feature selection to model architecture to hyperparameter tuning, should be logged with its rationale, the alternatives considered, and the evidence behind it.
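A decision log entry does not need to be elaborate; it just needs to capture the choice, the reasoning, the paths not chosen, and the evidence. Here is a minimal sketch in Python. All names here (`Decision`, `DecisionLog`, `record`) are hypothetical illustrations of the idea, not an existing library:

```python
# Sketch of a decision log for an AI agent's development choices.
# Every name and field below is illustrative, not a real API.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json


@dataclass
class Decision:
    step: str                # e.g. "feature selection"
    choice: str              # what the agent decided
    rationale: str           # why, in the agent's own words
    alternatives: list[str]  # paths not chosen
    evidence: list[str]      # metrics, reports, or data checks cited
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class DecisionLog:
    def __init__(self) -> None:
        self.entries: list[Decision] = []

    def record(self, decision: Decision) -> None:
        self.entries.append(decision)

    def to_json(self) -> str:
        # Serialize for audit: every choice stays reviewable after the fact.
        return json.dumps([asdict(d) for d in self.entries], indent=2)


# Example entry for a hypothetical feature-selection decision.
log = DecisionLog()
log.record(Decision(
    step="feature selection",
    choice="dropped 'age_at_admission' due to high missingness",
    rationale="imputation would bias toward the majority cohort",
    alternatives=["MICE imputation", "add a missingness indicator"],
    evidence=["missingness report, 2024-05-01 data pull"],
))
print(log.to_json())
```

The point is that the agent (or the tooling around it) writes this record at the moment the choice is made, so that "why did you pick these features?" is answered from a log, not from a post-hoc hallucination.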

This isn't about slowing down AI. It's about making it responsible.

If you develop a model that impacts a patient, you should be able to show that patient how it works, the steps taken to eliminate bias and harm, and the paths not chosen, with clearly defined evidence.