Insights
The shopping landscape for both consumers and businesses has changed substantially in recent years, with a growing share of purchases being made online. Global retail e-commerce sales were projected to reach $5.55 trillion in 2022, accounting for 21.0% of total retail spending, with further gradual growth expected. To meet this rising demand, businesses must either transition to e-commerce or enhance their existing online product offerings. A critical aspect of these strategies is the online availability of product information: customers need to be able to find the products they are interested in and be persuaded by the product details to complete their purchases. Typically, this data is housed in product information management systems that feed e-commerce and catalogue platforms. In practice, however, much of the essential data is either missing or available only after extensive manual effort. Consequently, many organizations need a dedicated team of data stewards who spend substantial time collecting the appropriate product information from various sources, which is both costly and labor-intensive.
From a broader perspective, the internet is the most extensive product database in existence. Numerous online retailers showcase a vast array of products, each with its own product detail page (PDP) displaying relevant product information. This includes, but is not limited to, product images, specifications, pricing, shipping details, and descriptions. This information helps consumers make informed decisions and is therefore publicly available. Consequently, businesses entering e-commerce often find that a wealth of product information is already accessible: many of the products they wish to offer are likely already listed online by other vendors who have the necessary product details. The primary challenge is to automatically locate, extract, and store this product information for the enrichment process. This requires tackling the scale, diversity, and largely unstructured nature of the web pages holding this data. In essence, we need solutions to the challenges the internet poses for large-scale automated e-commerce product data enrichment. We identify two main hurdles: the variability in structure among product detail pages, and the linguistic inconsistencies in descriptions of identical product specifications.
The first challenge is the variability in PDP structures across different online retailers. As previously stated, PDPs contain the product data necessary for the enrichment process; however, how this data is formatted and presented varies widely among stores. Retailers are not mandated to follow a specific format when showcasing their product information, leading to inconsistencies in the structure and positioning of product details on their PDPs. The second challenge revolves around the linguistic inconsistencies in descriptions of identical product specifications. Different stores often utilize their own terminology and standard values to articulate product details, resulting in multiple attributes that are similar yet labeled differently, with variations caused by synonyms, abbreviations, typos, and so forth. For effective large-scale data enrichment, we must identify duplicates and standardize product specifications to ensure data quality.
In this article, we will illustrate how transformer-based language models can be employed to navigate both of these challenges. We will begin with a brief overview of transformer-based language models, then explain how these models can detect product specifications on PDPs irrespective of their structure, and finally, we will detail how they can be utilized to match product specifications.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. (2018), is one of the earliest and most influential transformer-based language models and has considerably transformed the NLP field. Unlike earlier contextual language models such as ELMo, BERT considers all words in a sentence simultaneously when computing their embeddings, allowing for a deeper comprehension of language. Since BERT’s release, numerous fine-tuned and derivative models have emerged that surpass the performance of the original. These language models have been applied to a myriad of NLP tasks, including semantic textual similarity, which we will leverage to address our challenges.
We noted above the variability in structure among PDPs from different online retailers. Product specifications can be positioned anywhere on a PDP, but they are most commonly found in HTML tables. These specifications describe product attributes and are linked to both the product and its category. Other HTML tables on PDPs might contain unrelated information, such as store contact details or opening hours. In other words, the product specification tables are typically the only tables on a PDP that relate to the product itself. We can therefore frame the detection problem as a semantic search problem: the goal is to identify and extract only the tables containing product specifications. Semantic search is closely related to sentence similarity and semantic textual similarity, where sentences are compared based on their meaning. High semantic similarity suggests that two sentences convey similar meanings or are related.
Each HTML table on a PDP can be transformed into a single sentence by concatenating all the words within the table. This yields multiple sentences that can be compared against a designated input sentence (a code sketch of this conversion follows the examples below). For example, the following product specification table is turned into a single sentence:
| Category | Value |
| --- | --- |
| Printing Technology | Inkjet |
| Brand | HP |
| Connectivity Technology | Wi-Fi;Cloud Printing |
| Model Name | J9V92A#B1H |
| Compatible Devices | Smartphones, PC, Tablets |
| Recommended Uses For Product | Office, Home |
| Sheet Size | 3 x 5 to 8.5 x 14, Letter, Legal, Envelope |
| Color | Seagrass |
| Printer Output | Color |
| Item Weight | 5.13 pounds |
| Product Line | HP DeskJet |
“Printing Technology Inkjet Brand HP Connectivity Technology Wi-Fi;Cloud Printing Model Name J9V92A#B1H Compatible Devices Smartphones, PC, Tablets Recommended Uses For Product Office, Home Sheet Size 3 x 5 to 8.5 x 14, Letter, Legal, Envelope Color Seagrass Printer Output Color Item Weight 5.13 pounds Product Line HP DeskJet”
Conversely, a non-specification HTML table, such as this one from Amazon’s footer, can also be converted into a single sentence:
“Amazon Music Stream millions of songs Amazon Advertising Find, attract, and engage customers Amazon Drive Cloud storage from Amazon 6pm Score deals on fashion brands AbeBooks Books, art & collectibles Sell on Amazon Start a Selling Account Amazon Business Everything For Your Business AmazonGlobal Ship Orders Internationally Home…”
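As a minimal sketch, assuming the PDP HTML is available as a string, this conversion can be done with BeautifulSoup (the function name and structure are illustrative, not a reference implementation):

```python
from bs4 import BeautifulSoup


def tables_to_sentences(html: str) -> list[str]:
    """Convert every HTML table on a page into one whitespace-joined sentence."""
    soup = BeautifulSoup(html, "html.parser")
    sentences = []
    for table in soup.find_all("table"):
        # get_text with a separator flattens all cell contents into words
        words = table.get_text(separator=" ", strip=True).split()
        sentences.append(" ".join(words))
    return sentences
```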
We can perform this conversion for each HTML table on the PDP, yielding *n* sentences, where *n* is the number of HTML tables present. The next step is to identify the sentence(s) that correspond to product specifications. To achieve this, we must formulate an input sentence that is semantically related to the sentence containing the product specifications. Product specifications are typically domain-specific, but they should relate to the product and its category. Using our earlier product specification table from Amazon as an example, we note that it contains attributes and values relevant to the printer, including ‘printing technology’, ‘cloud printing’, and ‘sheet size’. These attributes are clearly pertinent to the product category ‘printer’. Consequently, a sentence featuring the product category should be semantically related to the product specifications and can be used for automated semantic search. The product category is typically found in the ‘breadcrumbs’ on PDPs, also known as the navigation chain. We can use these breadcrumbs to locate the product specifications. For the aforementioned Amazon printer, the breadcrumbs can likewise be converted into a sentence:
“Office Products › Office Electronics › Printers & Accessories › Printers”
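Breadcrumb markup varies across stores, so there is no universal selector; a common heuristic is to look for navigation elements whose class or aria-label mentions “breadcrumb”. A minimal sketch under that assumption:

```python
from bs4 import BeautifulSoup


def breadcrumbs_to_sentence(html: str) -> str:
    """Heuristically extract breadcrumb text from a PDP and join it into one sentence."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumed heuristic: <nav aria-label="breadcrumb"> or any element whose
    # class name contains "breadcrumb"; real stores may use other markup.
    nav = soup.find(attrs={"aria-label": "breadcrumb"}) or soup.find(
        class_=lambda c: c is not None and "breadcrumb" in c.lower()
    )
    if nav is None:
        return ""
    parts = [el.get_text(strip=True) for el in nav.find_all(["a", "span"])]
    return " › ".join(p for p in parts if p)
```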
With both the breadcrumbs and the HTML tables transformed into distinct sentences, we can now perform the semantic search. The breadcrumb sentence acts as the input sentence (query) and is compared with the sentences derived from the HTML tables. To achieve this, we first vectorize all the sentences using a transformer-based language model. Given that we are comparing sentences, the Sentence-BERT (SBERT) architecture is ideal, as it is designed to yield semantically meaningful sentence embeddings. Many fine-tuned BERT models based on this architecture are available; for this example, we will use one of the most downloaded models on HuggingFace: multi-qa-MiniLM-L6-cos-v1, a fine-tuned version of BERT tailored for semantic search. The SBERT architecture allows for seamless comparison of sentence pairs: we vectorize two sentences and compute their semantic similarity using cosine similarity. A value nearing 1 indicates high semantic similarity, while a value approaching 0 indicates low similarity. The table below shows the similarity calculations for the sentences we generated throughout the article (a code sketch of this computation follows the results):
| Sentence 1 | Sentence 2 | Semantic similarity |
| --- | --- | --- |
| “Printing Technology Inkjet Brand HP Connectivity Technology Wi-Fi;Cloud Printing Model Name J9V92A#B1H Compatible Devices Smartphones, PC, Tablets Recommended Uses For Product Office, Home Sheet Size 3 x 5 to 8.5 x 14, Letter, Legal, Envelope Color Seagrass Printer Output Color Item Weight 5.13 pounds Product Line HP DeskJet” | “Office Products › Office Electronics › Printers & Accessories › Printers” | 0.60 (multi-qa-MiniLM-L6-cos-v1 + cosine similarity) |
| “Amazon Music Stream millions of songs Amazon Advertising Find, attract, and engage customers Amazon Drive Cloud storage from Amazon 6pm Score deals on fashion brands AbeBooks Books, art & collectibles Sell on Amazon Start a Selling Account Amazon Business Everything For Your Business AmazonGlobal Ship Orders Internationally Home …” | “Office Products › Office Electronics › Printers & Accessories › Printers” | 0.22 (multi-qa-MiniLM-L6-cos-v1 + cosine similarity) |
The similarity scores reveal marked differences! The product specification table achieves a cosine similarity of 0.60, while the non-specification table only registers a score of 0.22 with the breadcrumbs. These scores can be utilized for direct retrieval through a designated threshold or integrated as features into a classification model, similar to the approach by Petrovski and Bizer (2017). This process can be replicated for each PDP to gather product specifications effectively. Since our method relies solely on HTML tables and breadcrumbs, it remains unaffected by website structure variations, making it an ideal solution for large-scale automated e-commerce product data enrichment.
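For concreteness, here is a minimal sketch of this retrieval step using the sentence-transformers library; the table sentences are shortened for readability, and the 0.5 threshold is an assumption for illustration, not a tuned value:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

# The breadcrumb sentence acts as the query
breadcrumbs = "Office Products › Office Electronics › Printers & Accessories › Printers"

# One sentence per HTML table on the PDP (abbreviated here for readability)
table_sentences = [
    "Printing Technology Inkjet Brand HP Connectivity Technology Wi-Fi;Cloud Printing ...",
    "Amazon Music Stream millions of songs Amazon Advertising Find, attract, and engage customers ...",
]

query_embedding = model.encode(breadcrumbs, convert_to_tensor=True)
table_embeddings = model.encode(table_sentences, convert_to_tensor=True)

# Cosine similarity between the query and every table sentence
scores = util.cos_sim(query_embedding, table_embeddings)[0]

# Retrieve tables above an (assumed, untuned) threshold
THRESHOLD = 0.5
spec_tables = [s for s, score in zip(table_sentences, scores) if float(score) > THRESHOLD]
```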
Once product information has been extracted from multiple sources, it is essential to consolidate it into a uniform list of specifications ready for use. We want to avoid presenting duplicate information merely because various sources describe the same specifications differently. This requires matching product specifications, and we will demonstrate how the same transformer-based language model can facilitate this process.
Above, we showcased product specifications from Amazon and eBay for the same printer discussed previously. Both platforms describe some identical specifications differently. Our goal is to link each specification from one source with its analogous specification from the other source, provided it exists. Matching product specifications can be viewed as a sentence similarity task and is closely related to semantic search. Our approach to the matching task therefore mirrors that of the detection task: we convert each specification (attribute plus value) into a distinct sentence for comparison. For instance, we compare “Printing Technology Inkjet” from Amazon against eBay specifications such as “Brand HP”, “Model J9V92A#B1H”, “Memory 64 MB”, and so forth. Using the multi-qa-MiniLM-L6-cos-v1 model and cosine similarity, we can assess the semantic similarity of these specification pairs. The illustration above highlights the specifications to be matched and their respective similarity scores calculated with the transformer-based language model. Similar specifications yield high similarity scores, while dissimilar ones score noticeably lower; for example, “Printing Technology Inkjet” and “Brand HP” score only 0.32. Thus, transformer-based language models can effectively be used to match product specifications.
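A minimal sketch of this matching step, again with sentence-transformers; the abbreviated specification lists and the 0.5 acceptance threshold are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

# Each specification is an "attribute value" sentence (abbreviated lists for illustration)
amazon_specs = ["Printing Technology Inkjet", "Brand HP", "Item Weight 5.13 pounds"]
ebay_specs = ["Brand HP", "Model J9V92A#B1H", "Memory 64 MB"]

amazon_embs = model.encode(amazon_specs, convert_to_tensor=True)
ebay_embs = model.encode(ebay_specs, convert_to_tensor=True)

# Pairwise cosine similarities: rows are Amazon specs, columns are eBay specs
scores = util.cos_sim(amazon_embs, ebay_embs)

# For each Amazon spec, accept the best eBay candidate above an (assumed) threshold
MATCH_THRESHOLD = 0.5
for i, spec in enumerate(amazon_specs):
    best = int(scores[i].argmax())
    if float(scores[i][best]) > MATCH_THRESHOLD:
        print(f"{spec!r}  <->  {ebay_specs[best]!r} ({float(scores[i][best]):.2f})")
```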
This article illustrated the capabilities of transformer-based language models in large-scale automated e-commerce product data enrichment. We presented a solution for detecting product specifications that does not depend on website structure: by treating detection as a semantic search task, we extracted specifications by evaluating the similarity of HTML tables to breadcrumbs. Furthermore, we showed that transformer-based language models can match identical product specifications despite syntactic variations. Overall, transformer-based language models are exceptionally powerful and applicable to a broad range of NLP tasks, including, as shown here, semantic search (or sentence similarity) for large-scale automated e-commerce product data enrichment.