Robeco, The Investments Engineers

21-03-2023 · Insight

Quant chart: how NLP can anticipate GICS changes

The recent changes to the Global Industry Classification Standard (GICS) illustrate its rigid and sluggish nature. This article argues that natural language processing (NLP) techniques can offer additional insights in today’s fast-changing market environment.


  • Matthias Hanauer - Researcher



  • Rob Huisman - Researcher



GICS is the classic framework for classifying similar firms into sectors, industry groups, industries and sub-industries. But the methodology is rigid: revisions are infrequent and take years to implement, as they involve extensive consultations with market participants. As a result, alternative classification methods have been suggested, based on customer-supplier data, textual similarities in companies’ 10-K business descriptions, comparable technologies inferred from patent data, or shared analyst coverage.

One of the major changes in the recent GICS revision is the creation of a new sub-industry, Transaction and Payment Processing Services, under the Financials sector. It includes companies such as Visa, Mastercard and PayPal, which were previously classified in the Data Processing & Outsourced Services sub-industry, under the Software & Services industry group of the Information Technology sector.

The change reflects both the increasing role these companies play in facilitating payments across various platforms and markets, and the fact that these activities are closely aligned with the business activities covered by the Financial Services industry group. However, the change only took effect on 17 March 2023, two years after the first consultation on the subject started.[1]

Text-based stock clustering (TBSC) is an interesting alternative to GICS. It uses NLP techniques to analyze textual data from various sources, such as 10-K reports. TBSC has several advantages over GICS:

  • TBSC can be more adaptive and flexible because it can update its classifications more frequently based on new information.

  • TBSC can be more granular and accurate because it can capture the similarities and differences among companies within or across sectors based on their specific products or services.

  • TBSC can be more informative and insightful because it provides explanations for its classifications based on textual evidence.

As technology advances, so do the opportunities for quantitative investors. By incorporating more data and leveraging advanced modelling techniques, we can develop deeper insights and enhance decision-making.

To illustrate these advantages, Figure 1 shows a 2D projection of company-specific vector embeddings derived from 10-K filings using the bidirectional encoder representations from transformers (BERT) model. We use 10-K reports for the fiscal year 2021 as input for the model to test whether the NLP technique could already anticipate the current GICS revisions.
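The projection step can be sketched in a few lines. The article does not say which projection method was used, so the example below assumes a plain principal component analysis (PCA), computed by power iteration with the standard library only. The firm names, the four-dimensional vectors and the helper functions are all hypothetical stand-ins for real BERT embeddings, which typically have several hundred dimensions.

```python
import math
import random

# Hypothetical toy vectors standing in for firm-level 10-K embeddings.
toy_embeddings = {
    "bank_a":     [0.9, 0.1, 0.2, 0.0],
    "bank_b":     [0.8, 0.2, 0.1, 0.1],
    "payments_a": [0.7, 0.4, 0.2, 0.1],
    "software_a": [0.1, 0.9, 0.3, 0.2],
    "software_b": [0.2, 0.8, 0.4, 0.1],
    "chips_a":    [0.1, 0.3, 0.9, 0.7],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract the column means so that every dimension has mean zero."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [[r[j] - means[j] for j in range(dim)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, dim = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(dim)]
            for i in range(dim)]


def top_eigenvector(matrix, iters=500, seed=0):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in matrix]
    for _ in range(iters):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def deflate(matrix, v):
    """Remove the component along unit vector v so the next one can be found."""
    lam = dot(v, [dot(row, v) for row in matrix])
    size = len(matrix)
    return [[matrix[i][j] - lam * v[i] * v[j] for j in range(size)]
            for i in range(size)]


def project_2d(embeddings):
    """Map each named vector to its coordinates on the first two components."""
    names = list(embeddings)
    centered = center([embeddings[name] for name in names])
    cov = covariance(centered)
    pc1 = top_eigenvector(cov)
    pc2 = top_eigenvector(deflate(cov, pc1), seed=1)
    return {name: (dot(x, pc1), dot(x, pc2)) for name, x in zip(names, centered)}


points = project_2d(toy_embeddings)
for name, (x, y) in points.items():
    print(f"{name:12s} x={x:+.3f} y={y:+.3f}")
```

In practice a numerical library would replace the hand-written linear algebra, and a non-linear method such as t-SNE or UMAP may give a different picture than PCA. The signs of the axes are arbitrary, so only relative positions carry meaning.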

The results show that the transaction and payment processing services companies, such as Visa, Mastercard and PayPal (light blue), are indeed closer to their new industry group, Financial Services (green), than to their previous industry group, Software & Services (brown). This finding suggests that TBSC can anticipate changes in GICS before they are officially implemented. However, we also find that the Financial Services industry group is rather heterogeneous compared to other industry groups such as Banks, Insurance, or Semiconductors & Semiconductor Equipment.
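One way to turn the visual notion of "closer" into a number is to compare group centroids. The sketch below assumes cosine similarity between centroids in the full embedding space; the article itself only reports the visual result in the 2D projection, so this metric is an illustrative choice, and all vectors and function names are hypothetical.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_group(target_vectors, groups):
    """Return the group whose centroid is most similar to the target centroid,
    together with the similarity to every group."""
    target = centroid(target_vectors)
    sims = {name: cosine_similarity(target, centroid(vectors))
            for name, vectors in groups.items()}
    return max(sims, key=sims.get), sims


# Hypothetical toy vectors, not real 10-K embeddings.
payment_firms = [[0.7, 0.4, 0.2, 0.1], [0.6, 0.5, 0.2, 0.1]]
industry_groups = {
    "Financial Services": [[0.9, 0.1, 0.2, 0.0], [0.8, 0.2, 0.1, 0.1]],
    "Software & Services": [[0.1, 0.9, 0.3, 0.2], [0.2, 0.8, 0.4, 0.1]],
}

best, sims = closest_group(payment_firms, industry_groups)
print(best)
for name, value in sims.items():
    print(f"{name}: {value:.3f}")
```

A centroid summarizes a group well only when the group is compact. For a heterogeneous group, comparing each firm with its nearest neighbors may be more informative than a single centroid.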

Figure 1 | 2D projection of word embeddings based on 10-K filings for the fiscal year 2021.


Source: SEC, Refinitiv, Robeco. The figure shows a 2D projection of numerical embeddings derived from BERT based on firms’ 10-K filings for the fiscal year 2021. The analysis is restricted to MSCI USA Index constituents augmented with large and liquid constituents of the FTSE World Developed and S&P Broad Market Index. The different colors indicate different GICS industry groups within the Information Technology (Software & Services, Technology Hardware & Equipment, and Semiconductors & Semiconductor Equipment) and Financials (Banks, Financial Services, and Insurance) sectors. Furthermore, the stocks from the newly created Transaction and Payment Processing Services sub-industry under the Financial Services industry group are highlighted. Previously, these stocks were included in the Software & Services industry group.

In conclusion, TBSC might be a better and more timely alternative to standard sector or industry classifications, such as GICS. By using NLP techniques to analyze textual data from various sources, TBSC can provide more adaptive, granular, accurate, informative and insightful classifications for stock analysis.


[1] The consultation on potential changes started in 2021. The changes were announced in March 2022 but only became effective in March 2023.

