Connecting firm's web scraped textual content to body of science: Utilizing microsoft academic graph hierarchical topic modeling

Arash Hajikhani; Arho Suominen; Sajad Ashouri; Lukas Pukelis; Torben Schubert; Ad Notten; Scott Cunningham

doi:10.1016/j.mex.2022.101650

Arash Hajikhani*, Arho Suominen, Sajad Ashouri, Lukas Pukelis, Torben Schubert, Ad Notten, Scott Cunningham

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

This paper demonstrates a method to transform and link textual information scraped from companies' websites to the scientific body of knowledge. The method illustrates the benefit of Natural Language Processing (NLP) in creating links between established economic classification systems and the novel, agile constructs that new data sources enable. We tested the method on the European classification of economic activities (known as NACE) at the sectoral and company levels, and established a connection with Microsoft Academic Graph hierarchical topic modeling based on companies' website content. Central to the operationalization of our method are a web scraping process, NLP, and a data transformation/linkage procedure.

Original language	English
Article number	101650
Number of pages	10
Journal	MethodsX
Volume	9
DOIs	https://doi.org/10.1016/j.mex.2022.101650
Publication status	Published - 27 Feb 2022

JEL classifications

O32 - Management of Technological Innovation and R&D
O31 - Innovation and Invention: Processes and Incentives
O34 - Intellectual Property Rights

Keywords

Natural language processing
Economic classification scheme
Knowledge transformation
Web scraping

Access to Document

10.1016/j.mex.2022.101650
Licence: CC BY

Datasets

Jupyter Notebook to accompany the BIGPROD Data Sample
Ashouri, S. (Creator), DataverseNL, 11 Oct 2021
DOI: 10.34894/2st1an, https://dataverse.nl/citation?persistentId=doi:10.34894/2ST1AN
Dataset/Software: Dataset

Cite this

@article{dded6a30af5e47949fb86983836fa1f9,
  title = "Connecting firm's web scraped textual content to body of science: Utilizing microsoft academic graph hierarchical topic modeling",
  abstract = "This paper demonstrates a method to transform and link textual information scraped from companies' websites to the scientific body of knowledge. The method illustrates the benefit of Natural Language Processing (NLP) in creating links between established economic classification systems with novel and agile constructs that new data sources enable. Therefore, we experimented on the European classification of economic activities (known as NACE) on sectoral and company levels. We established a connection with Microsoft Academic Graph hierarchical topic modeling based on companies' website content. Central to the operationalization of our method are a web scraping process, NLP and a data transformation/linkage procedure.",
  keywords = "Natural language processing, Economic classification scheme, Knowledge transformation, Web scraping",
  author = "Arash Hajikhani and Arho Suominen and Sajad Ashouri and Lukas Pukelis and Torben Schubert and Ad Notten and Scott Cunningham",
  note = "data source:",
  year = "2022",
  month = feb,
  day = "27",
  doi = "10.1016/j.mex.2022.101650",
  language = "English",
  volume = "9",
  journal = "MethodsX",
  issn = "2215-0161",
  publisher = "Elsevier",
}

TY  - JOUR
T1  - Connecting firm's web scraped textual content to body of science
T2  - Utilizing microsoft academic graph hierarchical topic modeling
AU  - Hajikhani, Arash
AU  - Suominen, Arho
AU  - Ashouri, Sajad
AU  - Pukelis, Lukas
AU  - Schubert, Torben
AU  - Notten, Ad
AU  - Cunningham, Scott
N1  - data source:
PY  - 2022/2/27
Y1  - 2022/2/27
N2  - This paper demonstrates a method to transform and link textual information scraped from companies' websites to the scientific body of knowledge. The method illustrates the benefit of Natural Language Processing (NLP) in creating links between established economic classification systems with novel and agile constructs that new data sources enable. Therefore, we experimented on the European classification of economic activities (known as NACE) on sectoral and company levels. We established a connection with Microsoft Academic Graph hierarchical topic modeling based on companies' website content. Central to the operationalization of our method are a web scraping process, NLP and a data transformation/linkage procedure.
AB  - This paper demonstrates a method to transform and link textual information scraped from companies' websites to the scientific body of knowledge. The method illustrates the benefit of Natural Language Processing (NLP) in creating links between established economic classification systems with novel and agile constructs that new data sources enable. Therefore, we experimented on the European classification of economic activities (known as NACE) on sectoral and company levels. We established a connection with Microsoft Academic Graph hierarchical topic modeling based on companies' website content. Central to the operationalization of our method are a web scraping process, NLP and a data transformation/linkage procedure.
KW  - Natural language processing
KW  - Economic classification scheme
KW  - Knowledge transformation
KW  - Web scraping
U2  - 10.1016/j.mex.2022.101650
DO  - 10.1016/j.mex.2022.101650
M3  - Article
C2  - 35284247
SN  - 2215-0161
VL  - 9
JO  - MethodsX
JF  - MethodsX
M1  - 101650
ER  -
