As traditional quantitative strategies lose steam in an increasingly competitive environment driven by rapid adoption of data science and systematic trading, the world of alternative data becomes the new Eldorado for financial boffins.
However, one of the challenges of working with alternative data is that it comes in all shapes and sizes from highly heterogeneous sources, which makes it difficult to integrate into a common data schema for running models. In this brief article, I will show how ChinaScope is solving this particular issue.
Over the past 10 years, ChinaScope has built a comprehensive product-level data schema ("SAM") that integrates with broadly accepted industry classification systems. SAM currently covers more than 4,050 product categories and is expected to expand to a five-digit figure by year end.
Figure 1: Example of hierarchy of products linking to the Information Technology sector vertical, based on a slightly altered version of GICS to suit the Chinese economy.
Each product is then linked to other products based on the industrial supply chain relationships between them (more than 40,000 upstream and downstream relationships). The purpose of establishing this schema is to map out the fundamental inputs of the real economy of China, and eventually the world.
Figure 2: Example of Industrial supply chain relationships of Applied Electronic Instruments and Equipment.
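To make the idea of upstream and downstream links concrete, here is a minimal sketch of how such relationships could be held as a directed graph. It is not ChinaScope's implementation: the `ProductGraph` class, its method names, and the product names are all hypothetical placeholders, not actual SAM categories.

```python
from collections import defaultdict, deque


class ProductGraph:
    """Directed graph where an edge supplier -> consumer means the
    supplier product is an upstream input of the consumer product.
    (Hypothetical structure for illustration only.)"""

    def __init__(self):
        self.downstream = defaultdict(set)  # product -> products it feeds into
        self.upstream = defaultdict(set)    # product -> products it consumes

    def add_relationship(self, supplier, consumer):
        self.downstream[supplier].add(consumer)
        self.upstream[consumer].add(supplier)

    def all_upstream(self, product):
        """Return every direct or indirect input of `product` (breadth-first)."""
        seen, queue = set(), deque([product])
        while queue:
            current = queue.popleft()
            for supplier in self.upstream[current]:
                if supplier not in seen:
                    seen.add(supplier)
                    queue.append(supplier)
        return seen


# Placeholder products: product_a feeds product_b, which feeds product_c.
graph = ProductGraph()
graph.add_relationship("product_a", "product_b")
graph.add_relationship("product_b", "product_c")
upstream_of_c = graph.all_upstream("product_c")
```

Keeping both directions indexed means a question such as "which inputs ultimately feed this product" or "which products sit downstream of this input" can be answered with a single traversal.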
This data schema is highly versatile when it comes to mapping different types of data. First and foremost is the mapping to companies. Currently, ChinaScope has more than 23,000 public equity and debt companies, as well as more than 20 million private companies, mapped to this data schema.
Figure 3: Selection of China A-share producers of Applied Electronic Instruments and Equipment and their gross profit margins from this product category.
Figure 4: Geographic distribution of private companies in the production or sale of Applied Electronic Instruments and Equipment.
The next step is to map various alternative data to this schema, including but not limited to data issued by the National Bureau of Statistics (3,700+ data items already mapped), industry data released by industry associations, customs data (currently covering the US$200 bn worth of commercial goods subject to the Trump tariffs), company operating data (currently covering all the operating data disclosed by A-share companies), shipping data, and news. The mapping is constantly expanding, and the network of information is further enhanced by other dimensions of mapping such as inter-company business relationships.
I want to share an example that combines the SAM schema with news parsed by ChinaScope's NLP system ("SmarTag") to extract alpha through a simple trading strategy.
Figure 5: Example of tagging by SmarTag NLP system on Chinese language news.
A key part of SmarTag is identifying the industries and products that map to the SAM data schema. At each level of SAM, we can then extract the relevant companies based on the revenue and gross profit contribution of that particular business segment to their total business results. From this, we constructed a trading strategy based on the following steps:
- Compile the total mentions of level 3 SAM industry tags in news for Week T=1, express each tag's mentions as a percentage of the total mentions of all industry tags, compare these percentages with those of Week T=0, and rank the tags by the inter-week change in absolute percentage terms;
- Select companies whose products aggregate up to level 3 industry tags, and choose the constituent companies for each industry tag based on a predetermined set of rules: a company is included if more than 50% of its revenues for the past 3 consecutive reporting periods come from this segment, or if between 30% and 50% of its revenues for the same periods come from this segment but more than 50% of its gross profits do;
- In Week T=2, go long the companies whose industry segment has the highest percentage change in mentions and short those whose segment has the lowest, with a holding period of 5 trading days.
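The three steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind the backtest: the function names, the tag names, the sample counts, and the number of tags traded per side (`n`) are all hypothetical, and the inclusion rule is read here as requiring each of the last three reporting periods to meet the threshold.

```python
def mention_share(counts):
    """Convert raw tag counts into each tag's share of all tag mentions."""
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


def rank_by_share_change(prev_counts, curr_counts):
    """Rank tags by week-over-week change in mention share, largest rise first."""
    prev, curr = mention_share(prev_counts), mention_share(curr_counts)
    tags = set(prev) | set(curr)
    change = {t: curr.get(t, 0.0) - prev.get(t, 0.0) for t in tags}
    return sorted(change.items(), key=lambda kv: kv[1], reverse=True)


def is_constituent(revenue_shares, gross_profit_shares):
    """Inclusion rule for one company and one segment.

    Inputs are per-period shares (0..1), oldest first. Assumption: the
    thresholds must hold in each of the last 3 reporting periods.
    """
    if len(revenue_shares) < 3 or len(gross_profit_shares) < 3:
        return False
    rev, gp = revenue_shares[-3:], gross_profit_shares[-3:]
    if all(r > 0.5 for r in rev):
        return True
    return all(0.3 <= r <= 0.5 for r in rev) and all(g > 0.5 for g in gp)


def long_short_tags(ranked, n=1):
    """Pick the n top-ranked tags to go long and the n bottom-ranked to short.
    (n is an assumption; the article does not state how many tags are traded.)"""
    longs = [tag for tag, _ in ranked[:n]]
    shorts = [tag for tag, _ in ranked[-n:]]
    return longs, shorts


# Made-up weekly mention counts for three placeholder level 3 tags.
week0 = {"tag_a": 50, "tag_b": 30, "tag_c": 20}
week1 = {"tag_a": 40, "tag_b": 30, "tag_c": 30}

ranked = rank_by_share_change(week0, week1)
longs, shorts = long_short_tags(ranked, n=1)
```

In this toy data, `tag_c` gains mention share and `tag_a` loses it, so companies passing `is_constituent` for `tag_c` would be held long and those for `tag_a` short during Week T=2.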
For the three years from Jan 2016 to Dec 2018, we find an estimated annualized alpha of 4.3% (before trading costs). The signal appears to be relatively consistent over the backtesting period.
Figure 6: Strategy results combining level 3 SAM with SmarTag NLP extraction.
This strategy can be adjusted by moving down the SAM hierarchy tree to more granular product levels to test whether the signal holds at finer levels of information. Since 90 percent of companies report their segment results between SAM levels 5 and 8, more granularity can be achieved without sacrificing too much liquidity through a shrinking pool of constituent stocks. Further information can be extracted by integrating the companies identified by SmarTag, matching their SAM breakdown to the SAM categories mentioned directly in the news, and then adding sentiment scores to the mix.
For more information on this, please contact me at firstname.lastname@example.org.
By ChinaScope Limited (www.chinascope.com)