How we’re using AI to boost productivity for chemistry researchers
6 February 2023 | 10 min read
By Eleonora Echegaray
A data enrichment expert takes you behind the scenes of Elsevier’s award-winning Reaxys Content Catalyst team
Caption: The Elsevier team is presented with the Data Science Excellence Award for the Reaxys Content Catalyst (left to right): Mark Sheehan (VP, Data Science, Life Sciences, Elsevier), Anitha Golla, PhD (Senior Data Enrichment Expert, Elsevier), Chetan Bhagat (award presenter, Indian author), and Abhinav Agnihotry (Data Scientist, Elsevier)
Chemistry researchers worldwide use Elsevier’s expert-curated chemical information platform, Reaxys, to find the information and compounds they need in a broad range of fields, from pharmaceutical drug discovery and chemical R&D to academic research and education. Recently, the team behind the Reaxys Content Catalyst was awarded a Data Science Excellence Award for innovation in analytics, data science and artificial intelligence.
I sat down with Dr Anitha Golla, a Senior Data Enrichment Expert at Elsevier, to talk about her team’s work and what they’re doing to continually expand and update the content available in Reaxys.
It quickly became obvious that her work is her own reward. But she was still thrilled her team won this award alongside heavyweights like Axis Bank Limited, IBM, Schneider Electric and Wells Fargo.
“These days everybody is doing something with AI and data science — there’s just so much work going on,” Anitha said. “So it’s fantastic to get this sort of validation from the greater AI community.”
100 million documents and counting
The award capped Cypher22, India’s biggest AI conference, where Analytics India Magazine hosted the fourth edition of the awards in September. The prize recognized the team’s work on Reaxys Content Catalyst (RCC), an AI-powered content enrichment production pipeline that radically expands the content available in Reaxys, which in turn boosts R&D productivity for chemistry researchers.
The prize also coincided with the pipeline passing a key benchmark: processing over 100 million documents.
“Both of these achievements are really just a testimony of the power of cross-functional teams,” Anitha said.
Diversity of thought: collaborating across functions
Anitha developed a taste for working on a multidisciplinary team while working on her PhD in bioorganic chemistry at the Karlsruhe Institute of Technology (KIT) in Germany:
“My supervisor had a small startup, and his aim was to provide biologists with as many peptides as possible for their research. These needed to be both cheap and of high quality. And to help make this happen, I got to work with all these amazing people: physicists, biologists, engineers.”
“Previously, I was largely a lone researcher. But this experience helped me understand if you work with all these different people, amazing things can happen. And they can happen better and faster than if you did it alone.”
A high-impact niche
The complexity of her current work certainly requires a cross-functional team.
“There are millions of documents published in the scientific community that have the capacity to change the world on every level,” she says. “It could be about a life-saving drug or about changing the way we make decisions or approach a certain challenge. Our job is to make sure that this content is up to date so people can take it from there in the fastest and smartest way possible.”
While passionate about the relevance of her work, Anitha was still pleasantly surprised by the award. “We’re actually quite niche,” she said. “We’re collecting the chemical facts — from both texts and images — and giving them to the scientific community in a way to help drive their decisions and actually help them do their extraordinary work.”
“Our customers literally told us what they wanted …”
“Our project also stands out for being entirely born out of customer needs,” Anitha added. “Our customers literally told us what they wanted: to be able to find certain things — substances, biological targets — very quickly in patents published in the last 20-odd years. They wanted a sense of the competitive landscape so they could work within this landscape and not against it.
“Traditionally, there’s only been one way to get this sort of information: hire an army of chemists to read each of those millions of documents line by line. But of course, this is much too slow and costly. So we sought to automate the process — after all, Elsevier was already applying data science to almost everything else.”
No average day
The project involves a team of 40+ people, depending on what work needs to be done.
“On any given day, I work with people from three or four different domains — hardcore chemists, data scientists, data engineers, data architects, software people, etcetera,” Anitha explained. “I have to switch from thinking like a chemist checking to see if a structure is correct, or looking at it like a statistician for precision. So that keeps it exciting.”
It also keeps things challenging, she said: “You might come up with something that makes sense to chemists. But then when the people on the software side look at it, they say it’s too costly in terms of computational power or time. And later, while something might work on a small scale, it’s a whole different story when it’s productionized and applied to millions of documents. But the fantastic thing is that everyone wants to find that right balance where everyone’s happy.”
Onward and upward
The project was ambitious from its inception.
“It was never just about a pipeline that could process patents quickly and accurately,” Anitha explained. “It also needed to be updated and upgraded every time something new arrived — be it more documents or new technologies, approaches or products. It needed to be a fully modular pipeline — like plug-and-play — that could easily be adopted and just keep on running. So that involved a lot of planning.”
Now, as the pipeline has been extended to data from journals, all this planning is paying off. Further iterative development of the infrastructure is planned for 2023, including an extension to Elsevier’s biomedical literature database Embase.
And the ambitions continue to grow.
“At one point down the road, I see a pipeline where anything can go through, and it just branches out to different products,” Anitha said. “It will be able to classify everything on its own, thanks to Elsevier’s massive taxonomies.
“Once you realize there are so many things you can do from the data perspective in terms of getting actionable insights, the sky becomes the limit — not only for chemists and other life sciences [researchers] but beyond.”