A new model predicts how molecules will dissolve in different solvents

Aug 19, 2025

Using machine learning, MIT chemical engineers have created a computational model that can predict how well any given molecule will dissolve in an organic solvent — a key step in the synthesis of nearly any pharmaceutical. This type of prediction could make it much easier to develop new ways to produce drugs and other useful molecules.

The new model, which predicts how much of a solute will dissolve in a particular solvent, should help chemists to choose the right solvent for any given reaction in their synthesis, the researchers say. Common organic solvents include ethanol and acetone, and there are hundreds of others that can also be used in chemical reactions.

“Predicting solubility really is a rate-limiting step in synthetic planning and manufacturing of chemicals, especially drugs, so there’s been a longstanding interest in being able to make better predictions of solubility,” says Lucas Attia, an MIT graduate student and one of the lead authors of the new study.

The researchers have made their model freely available, and many companies and labs have already started using it. The model could be particularly useful for identifying solvents that are less hazardous than some of the most commonly used industrial solvents, the researchers say.

“There are some solvents which are known to dissolve most things. They’re really useful, but they’re damaging to the environment, and they’re damaging to people, so many companies require that you have to minimize the amount of those solvents that you use,” says Jackson Burns, an MIT graduate student who is also a lead author of the paper. “Our model is extremely useful in being able to identify the next-best solvent, which is hopefully much less damaging to the environment.”

William Green, the Hoyt Hottel Professor of Chemical Engineering and director of the MIT Energy Initiative, is the senior author of the study, which appears today in Nature Communications. Patrick Doyle, the Robert T. Haslam Professor of Chemical Engineering, is also an author of the paper.

Solving solubility

The new model grew out of a project that Attia and Burns worked on together in an MIT course on applying machine learning to chemical engineering problems. Traditionally, chemists have predicted solubility with a tool known as the Abraham Solvation Model, which can be used to estimate a molecule’s overall solubility by adding up the contributions of chemical structures within the molecule. While these predictions are useful, their accuracy is limited.

In the past few years, researchers have begun using machine learning to try to make more accurate solubility predictions. Before Burns and Attia began working on their new model, the state of the art in solubility prediction was a model developed in Green’s lab in 2022.

That model, known as SolProp, works by predicting a set of related properties and combining them, using thermodynamics, to ultimately predict the solubility. However, the model has difficulty predicting solubility for solutes that it hasn’t seen before.

“For drug and chemical discovery pipelines where you’re developing a new molecule, you want to be able to predict ahead of time what its solubility looks like,” Attia says.

Part of the reason that existing solubility models haven’t worked well is that there wasn’t a comprehensive dataset to train them on. However, in 2023 a new dataset called BigSolDB was released, which compiled data from nearly 800 published papers, including information on solubility for about 800 molecules dissolved in more than 100 organic solvents that are commonly used in synthetic chemistry.

Attia and Burns decided to try training two different types of models on this data. Both of these models represent the chemical structures of molecules using numerical representations known as embeddings, which incorporate information such as the number of atoms in a molecule and which atoms are bound to which other atoms. Models can then use these representations to predict a variety of chemical properties.
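As a rough illustration of what such a numerical representation can look like, the sketch below turns a molecule, written as a list of atoms and a list of bonds, into a fixed-length vector of counts. It is a toy example under stated assumptions: the function name `static_embedding` and the choice of features are hypothetical, and this is not the descriptor set used by FastProp or ChemProp, which are far richer.

```python
from collections import Counter


def static_embedding(atoms, bonds, elements=("C", "H", "N", "O")):
    """Map a molecular graph to a fixed-length numeric vector.

    atoms: list of element symbols, one per atom
    bonds: list of (i, j) index pairs saying which atoms are bonded
    NOTE: hypothetical toy features, not any published descriptor set.
    """
    counts = Counter(atoms)

    # Count how many bonds each atom takes part in.
    degree = [0] * len(atoms)
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1

    # Whole-molecule features first, then one count per tracked element.
    features = [len(atoms), len(bonds), max(degree, default=0)]
    features += [counts.get(element, 0) for element in elements]
    return features


# Ethanol (CH3-CH2-OH) with hydrogens listed explicitly.
ethanol_atoms = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
ethanol_bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)]

print(static_embedding(ethanol_atoms, ethanol_bonds))
# [9, 8, 4, 2, 6, 0, 1]
```

Because the vector is computed by a fixed rule, the same molecule always maps to the same numbers, regardless of what the model is later trained to predict.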

One of the models used in this study, known as FastProp and developed by Burns and others in Green’s lab, incorporates “static embeddings.” This means that the model already knows the embedding for each molecule before it starts doing any kind of analysis.

The other model, ChemProp, learns an embedding for each molecule during the training, at the same time that it learns to associate the features of the embedding with a trait such as solubility. This model, developed across multiple MIT labs, has already been used for tasks such as antibiotic discovery, lipid nanoparticle design, and predicting chemical reaction rates.

The researchers trained both types of models on over 40,000 data points from BigSolDB, including information on the effects of temperature, which plays a significant role in solubility. Then, they tested the models on about 1,000 solutes that had been withheld from the training data. They found that the models’ predictions were two to three times more accurate than those of SolProp, the previous best model, and the new models were especially accurate at predicting variations in solubility due to temperature.
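Withholding whole solutes, rather than individual measurements, is what makes this a test of how a model handles molecules it has never seen. A minimal sketch of that kind of split follows. The record layout, the function name `split_by_solute`, and the placeholder solute labels are assumptions made for illustration, and this is not the authors' actual data pipeline.

```python
import random


def split_by_solute(records, test_fraction=0.2, seed=0):
    """Split measurements so that no solute appears in both sets.

    records: list of dicts, each with at least a "solute" key
    NOTE: hypothetical helper for illustration only.
    """
    solutes = sorted({record["solute"] for record in records})
    rng = random.Random(seed)
    rng.shuffle(solutes)

    # Hold out a fraction of the solutes, not a fraction of the rows.
    n_test = max(1, round(len(solutes) * test_fraction))
    test_solutes = set(solutes[:n_test])

    train = [r for r in records if r["solute"] not in test_solutes]
    test = [r for r in records if r["solute"] in test_solutes]
    return train, test


# Placeholder records: 5 solutes x 2 solvents x 2 temperatures.
records = [
    {"solute": solute, "solvent": solvent, "temperature_K": temperature}
    for solute in ("A", "B", "C", "D", "E")
    for solvent in ("ethanol", "acetone")
    for temperature in (298.15, 313.15)
]

train, test = split_by_solute(records, test_fraction=0.2)
train_solutes = {r["solute"] for r in train}
test_solutes = {r["solute"] for r in test}
print(len(train), len(test), train_solutes & test_solutes)
```

A purely random split over rows would let the same solute appear on both sides at different temperatures or in different solvents, which would tend to overstate accuracy on new molecules.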

“Being able to accurately reproduce those small variations in solubility due to temperature, even when the overarching experimental noise is very large, was a really positive sign that the network had correctly learned an underlying solubility prediction function,” Burns says.

Accurate predictions

The researchers had expected that the model based on ChemProp, which is able to learn new representations as it goes along, would be able to make more accurate predictions. However, to their surprise, they found that the two models performed essentially the same. That suggests that the main limitation on their performance is the quality of the data, and that the models are performing as well as theoretically possible based on the data that they’re using, the researchers say.

“ChemProp should always outperform any static embedding when you have sufficient data,” Burns says. “We were blown away to see that the static and learned embeddings were statistically indistinguishable in performance across all the different subsets, which indicates to us that the data limitations that are present in this space dominated the model performance.”

The models could become more accurate, the researchers say, if better training and testing data were available — ideally, data obtained by one person or a group of people all trained to perform the experiments the same way.

“One of the big limitations of using these kinds of compiled datasets is that different labs use different methods and experimental conditions when they perform solubility tests. That contributes to this variability between different datasets,” Attia says.

Because the model based on FastProp makes its predictions faster and has code that is easier for other users to adapt, the researchers decided to make that one, known as FastSolv, available to the public. Multiple pharmaceutical companies have already begun using it.

“There are applications throughout the drug discovery pipeline,” Burns says. “We’re also excited to see, outside of formulation and drug discovery, where people may use this model.”

The research was funded, in part, by the U.S. Department of Energy.
