Generative AI has become a key piece of infrastructure in many industries, and healthcare is no exception. Yet, as organizations like GSK push the boundaries of what generative AI can achieve, they face significant challenges — particularly when it comes to reliability. Hallucinations, in which AI models generate incorrect or fabricated information, are a persistent problem in high-stakes applications like drug discovery and healthcare. For GSK, tackling these challenges requires leveraging test-time compute scaling to improve its gen AI systems. Here’s how the company is doing it.
The hallucination problem in gen AI for healthcare
Healthcare applications demand an exceptionally high level of accuracy and reliability. Errors are not merely inconvenient; they can have life-altering consequences. This makes hallucinations in large language models (LLMs) a critical issue for companies like GSK, where gen AI is applied to tasks such as scientific literature review, genomic analysis and drug discovery.
To mitigate hallucinations, GSK employs advanced inference-time compute strategies, including self-reflection mechanisms, multi-model sampling and iterative output evaluation. According to Kim Branson, SVP of AI and machine learning (ML) at GSK, these techniques help ensure that agents are “robust and reliable,” while enabling scientists to generate actionable insights more quickly.
Leveraging test-time compute scaling
Test-time compute scaling refers to the ability to increase computational resources during the inference phase of AI systems. This allows for more complex operations, such as iterative output refinement or multi-model aggregation, which are critical for reducing hallucinations and improving model performance.
Branson emphasized the transformative role of scaling in GSK’s AI efforts, noting that “we’re all about increasing the iteration cycles at GSK — how we think faster.” By using strategies like self-reflection and ensemble modeling, GSK can leverage these additional compute cycles to produce results that are both accurate and reliable.
Branson also touched on the broader industry trend, saying, “You’re seeing this war happening with how much I can serve, my cost per token and time per token. That allows people to bring these different algorithmic strategies which were before not technically feasible, and that also will drive the kind of deployment and adoption of agents.”
Strategies for reducing hallucinations
GSK has identified hallucinations as a critical challenge in gen AI for healthcare. The company employs two main strategies, both of which require additional computational resources during inference: applying more thorough processing steps so that each answer is examined for accuracy and consistency before it is delivered in clinical or research settings, where reliability is paramount.
Self-reflection and iterative output review
One core technique is self-reflection, where LLMs critique or edit their own responses to improve quality. The model “thinks step by step,” analyzing its initial output, pinpointing weaknesses and revising answers as needed. GSK’s literature search tool exemplifies this: It collects data from internal repositories and an LLM’s memory, then re-evaluates its findings through self-criticism to uncover inconsistencies.
This iterative process results in clearer, more detailed final answers. Branson underscored the value of self-criticism, saying: “If you can only afford to do one thing, do that.” Refining its own logic before delivering results allows the system to produce insights that align with healthcare’s strict standards.
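To make the pattern concrete, here is a minimal sketch of a self-reflection loop in Python. It assumes only a placeholder complete(prompt) function that sends a prompt to an LLM and returns the generated text, and it illustrates the general technique rather than GSK’s actual implementation: the model drafts an answer, critiques it, and revises until the critique comes back clean or the iteration budget runs out.

```python
from typing import Callable

def self_reflective_answer(question: str,
                           complete: Callable[[str], str],
                           max_rounds: int = 2) -> str:
    """Draft an answer, then have the model critique and revise its own work.

    'complete' is a placeholder for any function that sends a prompt to an
    LLM and returns the generated text; it is an assumption of this sketch,
    not part of GSK's system.
    """
    answer = complete(f"Answer the following question:\n{question}")
    for _ in range(max_rounds):
        critique = complete(
            "Review the answer below for factual errors, unsupported claims "
            "and inconsistencies. List any problems, or reply 'OK' if none.\n"
            f"Question: {question}\nAnswer: {answer}"
        )
        if critique.strip().upper() == "OK":
            break  # The model found nothing to fix; stop iterating.
        answer = complete(
            "Revise the answer so it addresses every point in the critique.\n"
            f"Question: {question}\nAnswer: {answer}\nCritique: {critique}"
        )
    return answer
```

Each extra round consumes more inference-time compute, which is exactly the trade-off Branson describes: more cycles per query in exchange for an output that has been checked before anyone acts on it.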
Multi-model sampling
GSK’s second strategy relies on multiple LLMs or different configurations of a single model to cross-verify outputs. In practice, the system might run the same query at various temperature settings to generate diverse answers, employ fine-tuned versions of the same model that specialize in particular domains or call on entirely separate models trained on distinct datasets.
Comparing and contrasting these outputs helps confirm the most consistent or convergent conclusions. “You can get that effect of having different orthogonal ways to come to the same conclusion,” said Branson. Although this approach requires more computational power, it reduces hallucinations and boosts confidence in the final answer — an essential benefit in high-stakes healthcare environments.
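As a rough illustration of this cross-verification idea (again, a sketch rather than GSK’s pipeline), the following Python shows a simple majority vote across several samplers, where each sampler is assumed to wrap a different model, or the same model at a different temperature, behind a sampler(question) -> answer interface.

```python
from collections import Counter
from typing import Callable, Sequence

def normalize(text: str) -> str:
    """Crude normalization so superficially different phrasings can match."""
    return " ".join(text.lower().split())

def most_consistent_answer(question: str,
                           samplers: Sequence[Callable[[str], str]]) -> str:
    """Ask several models (or configurations) the same question and keep
    the answer that the largest number of them converge on."""
    answers = [normalize(sample(question)) for sample in samplers]
    winner, votes = Counter(answers).most_common(1)[0]
    if votes <= len(answers) // 2:
        # Weak agreement is a signal to route the query to a human
        # reviewer rather than trust any single output.
        print(f"warning: only {votes} of {len(answers)} samples agree")
    return winner
```

In a production system, exact-match voting would likely give way to semantic comparison, for example an LLM judging whether two answers agree, but the principle is the same: conclusions reached by orthogonal routes earn more trust.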
The inference wars
GSK’s strategies depend on infrastructure that can handle significantly heavier computational loads. In what Branson calls “inference wars,” AI infrastructure companies — such as Cerebras, Groq and SambaNova — compete to deliver hardware breakthroughs that enhance token throughput, lower latency and reduce costs per token.
Specialized chips and architectures enable complex inferencing routines, including multi-model sampling and iterative self-reflection, at scale. Cerebras’ technology, for example, processes thousands of tokens per second, allowing advanced techniques to work in real-world scenarios. “You’re seeing the results of these innovations directly impacting how we can deploy generative models effectively in healthcare,” Branson noted.
When hardware keeps pace with these software demands, compute-intensive inference strategies like these become practical to run at scale without sacrificing accuracy or responsiveness.
Challenges remain
Even with these advancements, scaling compute resources presents obstacles. Longer inference times can slow workflows, especially if clinicians or researchers need prompt results. Higher compute usage also drives up costs, requiring careful resource management. Nonetheless, GSK considers these trade-offs necessary for stronger reliability and richer functionality.
“As we enable more tools in the agent ecosystem, the system becomes more useful for people, and you end up with increased compute usage,” Branson noted. Balancing performance, costs and system capabilities allows GSK to maintain a practical yet forward-looking strategy.
What’s next?
GSK plans to keep refining its AI-driven healthcare solutions with test-time compute scaling as a top priority. The combination of self-reflection, multi-model sampling and robust infrastructure helps to ensure that generative models meet the rigorous demands of clinical environments.
This approach also serves as a road map for other organizations, illustrating how to reconcile accuracy, efficiency and scalability. Maintaining a leading edge in compute innovations and sophisticated inference techniques not only addresses current challenges, but also lays the groundwork for breakthroughs in drug discovery, patient care and beyond.