Source: Nature.com, Jul 2019
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods.
By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases [1,2], which encompass only a small fraction of the knowledge present in the research literature.
Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors.
To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing [3,4,5,6,7,8,9,10], which requires large hand-labelled datasets for training.
Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings [11,12,13] (vector representations of words) without human labelling or supervision.
Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery.
This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
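As a concrete illustration of the analogy arithmetic such embeddings support, below is a minimal sketch using the gensim library (an assumption; the abstract does not prescribe a toolkit, though the work builds on Word2vec). The model file and chemical tokens are illustrative, not the study's.

```python
# Minimal sketch of analogy arithmetic over word embeddings, assuming a
# gensim (>= 4.0) Word2Vec model already trained on tokenized
# materials-science abstracts. Model path and tokens are hypothetical.
from gensim.models import Word2Vec

model = Word2Vec.load("abstracts_word2vec.model")  # hypothetical file

# Probe a periodic-table relationship: which token relates to "K" the
# way "NaCl" relates to "Na"? A well-trained model tends to rank "KCl"
# highly for vector("NaCl") - vector("Na") + vector("K").
for word, score in model.wv.most_similar(
    positive=["NaCl", "K"], negative=["Na"], topn=3
):
    print(f"{word}\t{score:.3f}")
```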
Related Resource: ZeroHedge, Jul 2019
The algorithm produced predictions of potential thermoelectric materials, which convert heat into electricity and are used in various heating and cooling applications.
“It can read any paper on material science, so can make connections that no scientists could,” said researcher Anubhav Jain. “Sometimes it does what a researcher would do; other times it makes these cross-discipline associations.”
The algorithm was designed to assess the language in 3.3 million abstracts from materials science, and built a vocabulary of around half a million words. Word2vec then used machine learning to analyze the relationships between those words.
“The way that this Word2vec algorithm works is that you train a neural network model to remove each word and predict what the words next to it will be,” said Jain, adding that “by training a neural network on a word, you get representations of words that can actually confer knowledge.”
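Jain's description matches the skip-gram training objective: slide a window over each abstract and train the network to predict a word's neighbours. A minimal sketch of that setup, assuming gensim (>= 4.0) as the implementation; the corpus file and hyperparameters are illustrative, not the study's exact settings:

```python
# Sketch of skip-gram training on tokenized abstracts (one abstract per
# line in a hypothetical "abstracts.txt"). Hyperparameters are
# illustrative defaults, not the study's settings.
from gensim.models import Word2Vec

with open("abstracts.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

model = Word2Vec(
    sentences,
    vector_size=200,  # embedding dimensionality
    window=8,         # how many neighbours are predicted around each word
    min_count=5,      # drop tokens rarer than this
    sg=1,             # skip-gram: predict context words from the target
    workers=4,
    epochs=10,
)
model.save("abstracts_word2vec.model")
```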
Using just the words found in scientific abstracts, the algorithm was able to grasp concepts such as the periodic table and the chemical structure of molecules. The algorithm linked words that appeared in similar contexts, producing vectors in which related words cluster together and help define concepts. In some cases, words were linked to thermoelectric concepts despite never having been described as thermoelectric in any abstract the team surveyed. This kind of gap is hard to catch with the human eye, but easy for an algorithm to spot.
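One way to make "linked to thermoelectric concepts but never written about as thermoelectric" concrete is to rank candidate material names by cosine similarity to the word "thermoelectric". A hedged sketch, reusing the hypothetical model above; the candidate list here is illustrative, whereas the study extracted formulas from the abstracts themselves:

```python
# Rank candidate material formulas by cosine similarity to the token
# "thermoelectric". Candidates and model file are illustrative.
from gensim.models import Word2Vec

model = Word2Vec.load("abstracts_word2vec.model")  # hypothetical file

candidates = ["Bi2Te3", "CuGaTe2", "PbTe", "SnSe"]  # example formulas
scores = {
    c: model.wv.similarity(c, "thermoelectric")
    for c in candidates
    if c in model.wv  # skip tokens absent from the vocabulary
}
for formula, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{formula}\t{score:.3f}")
```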