Automation for Pubmed XML data retrieval using E-utility
Data
Data is colelcted from the Pubmed Medline, Arxiv and Bioarxiv journals for neuroscience papers published for the past 5 years. The fields collected are:
PMID
Title
Abstract
List of Authors
Affiliation
source
(Medlinearxiv bioarxiv) Publish year
The data needs to be parsed in a JSON format as shown below:
The PubMed data is in XML or text format. The XML format is chosen for parsing so that there is durability and efficiency while parsing the required fields from the original XML data into JSON data.
Since, we only need to find computational neuroscience or neuroscience related research papers, The PubMed E-utility is used to retrieve the data through automation. This blog walks over the process of retrieving the more than 10,000 research papers at once using the PubMed E-utility automation for ‘computational+neuroscience’ term.
Running Data Loader
We have created a Python library for retrieving scientific data from arXiv, bioRxiv, and MEDLINE using their respective APIs, and converting to a it format usable by the automatic reviewer recommendation algorithm.
Link: https://github.com/nbdt-journal/automatic-reviewer-assignment/tree/parser_module/data_loader
Usage To run the script, simply run:
1
python3 main.py
To change the parameters of the function, you can simply change the values in config.yaml.The usage is self-described within the config file. If you wish to not use a certain data source for the parsing, set the parse value to false in the respective parser section.