Introduction
The term Big Data emerged around 2005; at that time security and privacy issues received little attention, but today they are among the most challenging and actively discussed issues in the field. In every organization, data is the most vital asset, not only for computer-based businesses but also for sectors such as government, healthcare, education, engineering and technology, manufacturing, and retail. Data is essential for business management and for making the best decisions on the basis of the information extracted from it. Humans are social beings, living in a society and interacting with one another every day. Technological advances and applications such as smart mobile devices, sensors, the Internet of Things, cyber-physical systems, social networks, YouTube, and chat software generate a tremendous amount of data daily. Today's data is structured, unstructured, and semi-structured, and conventional data management systems cannot handle it, so a new term, Big Data, has come into use for processing these kinds of data at scale. Big data analytics is used daily by industries such as the stock market and retail to optimize business operations. New technology comes with new issues. Big data has been defined as "a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and analysis." Big data is traditionally characterized by the 3 V's (volume, velocity, and variety), but these alone do not fully capture its complexity. Additional dimensions such as veracity, validity, value, variability, venue, vocabulary, and vagueness offer a more comprehensive understanding of big data. However, the challenges associated with big data extend beyond the V's and include critical issues such as data quality, privacy, and security. In recent years, numerous methods have been developed to address these concerns and enhance data protection and confidentiality.

Figure: Big data life cycle
These components can be categorized according to the phases of the big data life cycle: data generation, storage, and processing. In the data generation phase, access restriction and data distortion (falsification) techniques are used as data protection strategies. In the data storage phase, encryption is used to preserve privacy and protect sensitive data. Encryption approaches can be divided into three categories: Identity-Based Encryption (IBE), Attribute-Based Encryption (ABE), and storage-path encryption. In the data processing phase, privacy is preserved by combining Privacy-Preserving Data Publishing (PPDP) with knowledge extraction from the data. In PPDP, generalization and suppression are used to protect the privacy of data. These techniques can be further divided into clustering-, classification-, and association-rule-mining-based approaches: clustering and classification partition the input data into different related groups, while association rule mining finds useful relationships and trends in the input data. The Cloud Security Alliance (CSA), a working group on big data security, divides big data security into four aspects: infrastructure security, data privacy, data management, and integrity and reactive security. These four categories are further divided into ten specific security challenges. Big data security must also preserve data confidentiality, integrity, and availability.
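To make the PPDP idea concrete, the following is a minimal sketch of generalization and suppression for k-anonymity. The column names, generalization rules, and the value of k are illustrative assumptions for this example, not details taken from the text.

```python
from collections import Counter

# Toy records: (age, zip_code, diagnosis). Columns and rules are
# illustrative assumptions, not taken from the paper.
records = [
    (34, "47677", "flu"), (36, "47602", "flu"),
    (33, "47678", "cold"), (52, "47905", "cold"),
]

def generalize(age, zip_code):
    """Generalization: coarsen age into a 10-year band and mask the
    last two digits of the ZIP code."""
    low = (age // 10) * 10
    return (f"{low}-{low + 9}", zip_code[:3] + "**")

def k_anonymize(rows, k=2):
    """Suppression: drop any record whose generalized quasi-identifier
    group has fewer than k members."""
    gen = [(generalize(a, z), d) for a, z, d in rows]
    counts = Counter(qi for qi, _ in gen)
    return [(qi, d) for qi, d in gen if counts[qi] >= k]

print(k_anonymize(records))
# The lone (52, "479**") record falls in a group of size 1 and is suppressed.
```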
This paper covers an introduction to Big Data and its associated tools (such as Hadoop and MapReduce), along with an exploration of the security challenges in Big Data. We address concerns surrounding privacy and security, the specific privacy requirements during the Big Data generation, processing, and storage phases, and the traditional mechanisms used for protection. Additionally, Big Data analytics itself presents opportunities to tackle privacy and security issues, and we discuss various solutions proposed by researchers to address these challenges effectively.
Big data is used with Hadoop to process, manage, and analyze large amounts of different types of data (structured, unstructured, semi-structured). Social data can be structured, while NoSQL data is unstructured or semi-structured. There are many big data platforms and products (IBM Big Data Analytics, HP Big Data, SAP Big Data Analytics, Microsoft Big Data, etc.) for processing and analyzing large data sets, but Hadoop is generally preferred. Hadoop was developed by Doug Cutting and Michael J. Cafarella and is now an open-source system maintained by Apache. It is written in Java and supports a distributed computing environment [1]. Because it is not possible to store such large amounts of data in existing file systems, an entirely new technology, Hadoop, was developed. Hadoop consists of two main parts: HDFS (Hadoop Distributed File System) and MapReduce. MapReduce was developed by Google but is now part of Apache Hadoop. MapReduce is a programming framework model that decomposes an application into different parts. It is designed to scale from a single server to thousands of machines, each providing local computation and storage [2]. The current Apache Hadoop stack consists of various elements such as the Hadoop kernel, MapReduce, HDFS, and Apache Hive, HBase, ZooKeeper, etc.
An inode contains information about attributes such as permissions, modification and access times, and disk space allocation. A single NameNode manages multiple DataNodes, and each block is replicated across multiple DataNodes. DataNodes are slave nodes of the NameNode and are responsible for storing the actual data in HDFS. Each data block is represented by two files in the host's native file system: one contains the data and the other contains metadata. The file system is assigned a namespace ID when it is formatted. During communication, a DataNode checks this namespace ID against the NameNode, and if it does not match, the DataNode shuts down. When a new DataNode without a namespace ID joins a cluster, it is assigned the cluster's namespace ID. Each DataNode also has a storage ID that helps identify it after it is restarted on a different IP address or port. HDFS client nodes act as the interface between the Hadoop cluster and the external network; they are also called edge nodes or gateway nodes and are used to run client applications and cluster management tools.

MapReduce, originally developed by Google and now part of the Apache project, is a programming model for processing and generating large amounts of structured and unstructured data stored in HDFS. It follows a divide-apply-combine strategy for data analysis and is divided into two parts: the Map function and the Reduce function. The Map function divides the data into several small blocks, and each block is assigned a map task that can process its portion of the data set in parallel. Each map task reads its input as key-value pairs and produces intermediate key-value pairs as output. The Reduce function groups all intermediate values associated with the same intermediate key and produces one or no output value per Reduce task. MapReduce processes all the data in the cluster in batches.

Pig and Hive are higher-level tools built on top of MapReduce: Pig, originally built at Yahoo, is a dataflow language for analyzing large data sets and manipulating data in Hadoop, while Hive is a SQL-like interface whose query language, HiveQL, supports a data definition language and a data manipulation language. Other tools in the Hadoop ecosystem include YARN, Ambari, Mahout, Avro, Spark, HBase, and ZooKeeper. YARN (Yet Another Resource Negotiator) is a cluster management technology that handles resource allocation and job scheduling for Hadoop. Ambari is a web-based tool for provisioning, managing, and monitoring Hadoop clusters.
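To illustrate the map-shuffle-reduce flow described above, here is a minimal, self-contained Python sketch of word count, the canonical MapReduce example. It only simulates the programming model in-process; it does not use the actual Hadoop API, and all function names are our own.

```python
from collections import defaultdict
from itertools import chain

def map_fn(_, line):
    # Map: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: combine all intermediate values that share the same key.
    yield (word, sum(counts))

def mapreduce(records, mapper, reducer):
    # Shuffle: group intermediate pairs by key before the reduce phase.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(k, v) for k, v in records):
        groups[key].append(value)
    return dict(chain.from_iterable(reducer(k, vs) for k, vs in groups.items()))

lines = [(0, "big data needs new tools"), (1, "big data is big")]
print(mapreduce(lines, map_fn, reduce_fn))
# {'big': 3, 'data': 2, 'needs': 1, 'new': 1, 'tools': 1, 'is': 1}
```

In a real cluster, the map tasks run in parallel on the DataNodes holding each block, and the shuffle moves intermediate pairs over the network; the sketch collapses both phases into local function calls.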
Related Work
New technology always comes with new challenges. When Big Data was initially developed, privacy and security were not a primary focus for its developers. However, as Big Data has evolved, privacy and security have become some of the most critical and pressing challenges in the field, and addressing them is now essential given the increasing volume and sensitivity of the data being processed. Most people and companies are unable to deal with the complexity of security and privacy in big data. Conventional security and privacy mechanisms, such as OS-level security, encryption, authentication (Kerberos), authorization (HDFS permissions and Access Control Lists), firewalls, and OSI/TCP-IP security, cannot cope with big data security and privacy. We need new security and privacy technologies and methods to secure, monitor, and audit Big Data processes. Big data contains heterogeneous and homogeneous data, and this heterogeneity is a source of security and privacy problems. Big data is characterized in terms of the 4 V's, namely volume, velocity, variety, and value; other dimensions, such as veracity and validity, are also included. To obtain the benefits of Big Data, government agencies, researchers, businesses, and the healthcare industry contribute resources, money, and personal information. Personal information is collected from heterogeneous sources such as the web, Yahoo, Facebook, Paltalk, YouTube, Skype, AOL, and telecommunications services. If the security of Big Data is not handled properly, user privacy is harmed. The CSA highlights the following big data security and privacy challenges:

- Security Issues in a Distributed Environment: The MapReduce framework is built from multiple mappers working in parallel. Protecting mappers, and protecting data from untrusted mappers, is a major issue in a secure distributed computing framework.
- Endpoint Input Validation/Filtering: Big Data collects data from heterogeneous sources, but identifying and validating the source of input data remains a challenge, as there are no tools to verify whether an input source is trustworthy or malicious. Once a malicious input source is detected, however, it can be removed from the database immediately to prevent further damage.
- Non-relational Database (NoSQL) Data Protection: Non-relational (NoSQL) databases are typically protected by middleware, as the database itself does not provide explicit security support. NoSQL data collected from heterogeneous sources therefore requires additional security and privacy measures.
- Secure Data Storage and Transaction Logs: Big data is growing at a very high rate. To ensure data availability and scalability, automated tiering of data is required; however, auto-tiering does not track where data is stored. New technologies and additional effort are therefore required to protect data storage.
- Real-time Security Monitoring: In real-time security monitoring, the system generates numerous security alerts. These alerts may be false or true positives, and because humans cannot process such huge volumes of alerts, they tend to be ignored or "clicked away." Big data technologies need better support for processing and analyzing these different types of data.
In general, the medical industry benefits from big data technology, but in real time there is a greater possibility that personal information is improperly acquired and exploited.

- Scalable and Composable Privacy-Preserving Data Mining and Analytics: Big data can infringe on privacy, enable invasive marketing, restrict civil liberties, and increase government and corporate control. Companies leverage big data analytics for marketing purposes, but users' privacy is not protected. AOL released anonymized search logs for academic purposes, yet users were easily identified from their searches.
- Fine-grained Auditing: At any given point in time, we want to be aware of new and missed attacks and to minimize the latter, and for this we need audit information. Audit data is needed to understand what went wrong, and when, for compliance, regulatory, and forensic reasons; with big data this involves far more distributed data objects. For example, financial companies need detailed auditing to meet compliance requirements.
- Fine-grained Access Control: Access control mechanisms prevent people from accessing personal data they are not entitled to see. Fine-grained access control allows data managers to share data as widely as possible without compromising security and privacy. Big Data deals with heterogeneous data sets, and this heterogeneity increases the diversity of security requirements.
- Information Security: Today, data is the most valuable asset, and the security of the data itself is a major concern. Information security, sometimes called IT security or InfoSec, protects data from unauthorized access, modification, use, manipulation, and destruction, and ensures data confidentiality, availability, and integrity. Managing the security of big data is a very challenging task. Information security is needed in all sectors, including personal and commercial websites, financial services, networking, healthcare, social networking sites, anomaly detection, banking systems, and educational institutions.
- Metadata Provenance: Provenance metadata describes the origin of data and everything the data depends on. Big data complexity is increasing due to the large provenance graphs generated by big data applications. Higher computational speed and more powerful algorithms are required to analyze such large provenance graphs and discover metadata dependencies for security and confidentiality applications.
B. Recommendation Systems
The evolution of recommendation systems has seen a shift from basic collaborative filtering and cold-start solution strategies [20], [21], [22], [23] to utilizing complex semantic patterns in text data for fine-grained recommendations [5], [8], [9], [24]. BERT4Rec and similar architectures have shown promise in capturing dynamic user preferences through the analysis of historical behavior. These models, informed by the Transformer's strengths, outperform traditional sequential neural networks [8]. The incorporation of semi-automatic annotation for sentiment analysis [25] and adapted feature selection algorithms [26] further improves such systems' ability to provide personalized, accurate recommendations. HybridBERT4Rec uses BERT to extract the characteristics of user interactions with purchased items and provides recommendations based on them; the model sequences users' historical interactions and is designed to better reflect users' changing interests [24]. However, most of these studies treated the entire product name as a single unit without tokenization. Although there have been attempts to learn product names as tokens [27], they suffer from the complexity of a dual structure that combines a Transformer with Word2Vec. In this paper, we propose a new approach to recommendation systems that fully leverages the strengths of the Transformer architecture: we explore a method for learning significant characteristics and patterns of product names using an encoder and generating new product names at the token level using a decoder. This approach aims to learn the semantic information of product names related to users' past purchase patterns and, based on this, to predict the product names that users are likely to purchase in the future, which is expected to be greatly beneficial.
C. Service Generation
Service generation is the process of designing or innovating new services. Although this approach can vary depending on the industry, field, or company, the overarching goal is to develop services that meet or exceed user expectations [28], [29]. In relation to this, NLP and recommendation systems have been established as key elements in enhancing user experience and innovating service delivery. Specifically, text data generated by users across various digital platforms contain valuable information regarding user preferences, behaviors, and expectations. Companies that deeply understand the significance of these data are exploring strategies for offering personalized services by integrating NLP and recommendation systems. This approach goes beyond merely analyzing user text data to provide tailored services, playing a crucial role in enhancing service quality, efficiency, and diversity.
Services leveraging NLP have contributed significantly to the provision of more precise and personalized services by utilizing users' text data. Various studies have explored the possibilities and effects of service generation, showcasing the potential of this domain [30], [31], [32], [33]. Traditional research has focused on analyzing text data such as user reviews, feedback, and interactions to understand user preferences and behavior patterns, primarily to provide personalized recommendations. In addition, active research is being conducted on measuring user similarity and making content-based recommendations using user text data. However, service providers and platforms must customize their models for each specific task. For efficient model development, a strategy for generating diverse services using a single model is required. For instance, large language models like GPT provide various services, including sentence summarization, translation, and document generation [19], [34].
Prior research has highlighted several strategies for service generation using recommendation systems. First, personalized marketing campaigns using recommendation systems delve deeply into users' past purchase histories and search patterns to provide personalized marketing messages or discount coupons [35]. Target marketing is a form of targeted product recommendation aimed at users who are expected to purchase certain products [36]. Second, there has been significant interest in product bundle suggestions based on users' purchase histories. For example, after a user purchases a digital camera, a recommendation system can suggest related products, such as memory cards or camera cases [37], [38]. Moreover, recommendation systems can suggest bundles of products that frequently appear together in the results [39]. Our study extends the insights of these earlier studies and introduces innovative service generation methods centered on NLP, aiming to explore new service generation strategies by maximizing the advantages of NLP-based recommendation systems.
Proposed Method and Experiments
In this section, we propose an NLP-based recommendation system that utilizes the Transformer to learn product names at the token level, and we validate its performance using datasets in which product names can be verified.
A. Framework
A distinctive feature of this study is the tokenization of product names into individual tokens for the recommendation system, and the framework for the NLP-based recommendation is depicted in Figure 2. Initially, when a list of product names purchased by a user is input, the names are tokenized based on the space within them. For instance, a purchased product name ‘Chocolate Milk’ is tokenized into ‘Chocolate’ and ‘Milk’ based on the space. These tokenized product names are then trained as token sequences using the Transformer, which subsequently predicts the product names to purchase in tokens. As an example, when product names like ‘Chocolate’, ‘Milk’, ‘Fruit’, ‘Snack’, ‘Chocolate’, ‘Chip’, ‘Cookie’ are input, the Transformer’s training results in predictions such as ‘Strawberry’ and ‘Milk’. These tokenized predictions are then detokenized using spaces, converting ‘Strawberry’ and ‘Milk’ back into ‘Strawberry Milk’. During the training and prediction of product names in tokens, it is possible to generate product names that are not actually sold. As shown in Figure 3, using the Transformer could lead to the generation of a product name like ‘Coffee Chocolate Milk’, which is not available in actual stores. Because this impedes the recommendation system’s ability to suggest actual products, we check against a list of product names to verify the existence of the product and replace non-existent product names using similarity to find the most similar existing product name. To ensure the efficacy of our recommendation system in suggesting actual products, we chose the Jaccard similarity metric for its simplicity and computational efficiency. This contrasts with vector space models that require extensive computation; Jaccard similarity directly compares sets by quantifying the overlap between tokenized product names, which is particularly suitable for our use case. The complexity of product names varies greatly, and our focus is on the presence or absence of shared tokens rather than their frequency or order. The Jaccard similarity, calculated as the proportion of shared tokens to the total unique tokens in both product names, allows for accurate suggestions of existing product names when non-existent ones are generated.
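As a concrete illustration of this verification step, the following is a minimal Python sketch of replacing a generated, non-existent product name with the most similar existing one using Jaccard similarity over space-separated tokens. The catalog contents and function names are illustrative, not taken from the paper.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between two product names, computed over
    their sets of space-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def verify_or_replace(generated: str, catalog: list[str]) -> str:
    """Keep the generated name if it exists in the catalog; otherwise
    return the most similar existing product name."""
    if generated in catalog:
        return generated
    return max(catalog, key=lambda name: jaccard(generated, name))

catalog = ["Chocolate Milk", "Strawberry Milk", "Chocolate Chip Cookies"]
print(verify_or_replace("Coffee Chocolate Milk", catalog))  # Chocolate Milk
```

Because only token overlap matters, not token order or frequency, the comparison stays cheap even for large catalogs, which is the design rationale given above.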
The output from the Transformer selects the token with the highest softmax probability as the first in the sequence and then generates the following tokens in an autoregressive manner to form the product name. For instance, if the Transformer's decoder first predicts 'Chocolate', it will then predict the 'Milk' token to complete the product name. The next prediction, excluding the highest-probability 'Chocolate', selects the next most probable first token, 'Vanilla', followed by 'Almond' and 'Breeze', thus forming a product name from the tokens 'Vanilla', 'Almond', and 'Breeze'; the top-k product names are predicted in this manner. Through these steps, the final recommended product names, trained and predicted as tokens, are derived.
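The decoding loop just described can be sketched as follows. Here `model_step` is a hypothetical stand-in for the Transformer decoder's next-token distribution, and expanding the top-k first tokens before continuing each greedily is our reading of the procedure, not the authors' exact code.

```python
import numpy as np

VOCAB = ["Chocolate", "Milk", "Vanilla", "Almond", "Breeze", "<eos>"]

def model_step(prefix: list[str]) -> np.ndarray:
    """Hypothetical stand-in for the trained decoder: returns a softmax
    distribution over the next token given the tokens so far."""
    rng = np.random.default_rng(abs(hash(tuple(prefix))) % (2**32))
    logits = rng.normal(size=len(VOCAB))
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def generate_top_k_names(history: list[str], k: int = 3, max_len: int = 4):
    """Branch on the k most probable first tokens, then extend each
    autoregressively (greedy) until <eos> or max_len tokens."""
    first_probs = model_step(history)
    names = []
    for tok_id in np.argsort(first_probs)[::-1][:k]:
        prefix = [VOCAB[int(tok_id)]]
        while len(prefix) < max_len:
            nxt = VOCAB[int(np.argmax(model_step(history + prefix)))]
            if nxt == "<eos>":
                break
            prefix.append(nxt)
        names.append(" ".join(prefix))
    return names

print(generate_top_k_names(["Chocolate", "Milk", "Fruit", "Snack"]))
```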
B. Datasets
For our NLP-based recommendation experiments, we employed two datasets with distinguishable product names:
- UK e-Commerce1: Sourced from a UK-based online retail platform. This dataset aggregates product names and purchase dates documented in user invoices. The platform predominantly features food items, daily essentials, and electronic appliances.
- Instacart2: This dataset contains the transactional records of grocery orders, noting the specific week, time, and individual products ordered in the US.
A detailed breakdown of these raw datasets is presented in Table 1.
After examining the raw data, pre-processing was performed. First, we removed errors, such as missing values, from the dataset and chronologically listed the product names purchased by each user. Then, as shown in Figure 4, we grouped the product names into sets of five to form a row, using four product names for training and one product name as the label for each row. To address the issue of repeated product names in our expansive datasets, we utilized a data cleansing technique based on a previous work [9]. To obtain a more comprehensive dataset, we curated an additional collection of products by transferring two pairs of products from every five transactions.
Owing to computational constraints and to maintain coherence in data volume when juxtaposed with the UK e-commerce dataset, we opted to use only 1% of the Instacart dataset. The descriptions of the two datasets processed at the row level, as previously mentioned, are provided in Table 2. This study employed NLP-based recommendation, tokenizing product names for use. Four product names were allocated to Train and one to Label, and their average tokens are listed in the table. In our experiment, we allocated 80% of the rows for training, 10% for validation, and the remaining 10% for testing.
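A minimal sketch of the row construction and split described above (grouping the chronological purchase history into sets of five, with four names as input and the fifth as the label) follows; the non-overlapping windowing is our interpretation of the description.

```python
def build_rows(purchases: list[str], group: int = 5):
    """Group a user's chronologically ordered product names into rows:
    the first group-1 names are the training input, the last is the label."""
    rows = []
    for i in range(0, len(purchases) - group + 1, group):
        window = purchases[i:i + group]
        rows.append({"train": window[:-1], "label": window[-1]})
    return rows

def split_rows(rows):
    """80/10/10 train/validation/test split, as used in the experiments."""
    n = len(rows)
    return rows[: int(0.8 * n)], rows[int(0.8 * n): int(0.9 * n)], rows[int(0.9 * n):]

history = ["Chocolate Milk", "Fruit Snack", "Chocolate Chip Cookie",
           "Strawberry Milk", "Banana", "Organic Strawberries",
           "Whole Milk", "Bag of Organic Bananas", "Banana", "Chocolate Milk"]
print(build_rows(history))  # two rows, each with 4 train names and 1 label
```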
C. Evaluation Metrics
To evaluate the proposed model, we drew inspiration from established studies on recommendation system evaluation metrics [40], [41], [42]. These studies validated our choice of metrics and highlighted their significance in the current context. The following two metrics were employed to assess performance:
- Hit-Rate: This metric is prevalent in recommendation systems. It gauges whether the top-K product names suggested to each user are aligned with the product name of their most recent purchase. A match within the top-K recommendations is considered a hit. The Hit-Rate is the ratio of users with hits to the total number of users. The corresponding formula is as follows:
$$\mathrm{Hit\text{-}Rate} = \frac{\#\,\mathrm{Hit\ Users}}{\#\,\mathrm{Users}} \tag{2}$$
- Mean Reciprocal Rank (MRR): MRR quantifies the ranking of the last purchased product name within the top-K recommended list. A superior MRR indicates the success of the system in recommending relevant product names at the top positions. The formula for MRR is as follows:
$$\mathrm{MRR} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\mathrm{rank}_u} \tag{3}$$

where $U$ is the set of users and $\mathrm{rank}_u$ is the position of user $u$'s last purchased product name in the top-K list. We report both metrics at several values of K to assess the robustness of the model across various recommendation list lengths. Higher metric values signify enhanced performance.
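A minimal sketch of both metrics, assuming each test case pairs a ranked top-K recommendation list with the single ground-truth product name; the example data is illustrative.

```python
def hit_rate(recommendations: list[list[str]], truths: list[str]) -> float:
    """Fraction of users whose true next purchase appears in their top-K list."""
    hits = sum(1 for recs, t in zip(recommendations, truths) if t in recs)
    return hits / len(truths)

def mrr(recommendations: list[list[str]], truths: list[str]) -> float:
    """Mean reciprocal rank of the true next purchase; 0 when it is missed."""
    total = 0.0
    for recs, t in zip(recommendations, truths):
        if t in recs:
            total += 1.0 / (recs.index(t) + 1)
    return total / len(truths)

recs = [["Chocolate Milk", "Banana"], ["Banana", "Whole Milk"]]
truths = ["Banana", "Organic Strawberries"]
print(hit_rate(recs, truths), mrr(recs, truths))  # 0.5 0.25
```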
D. Comparison Model and Parameter Settings

To verify the superiority of NLP-based recommendation using product names at the token level, we set up the following comparison models:
- Random: This approach recommends products randomly selected from all products. Measurements were performed 100 times and then averaged.
- NLP-based Recommendation with Tokenization: This model is the primary focus of our study, where product names are tokenized into individual tokens for analysis. The tokenization process involves breaking down complex product names into simpler, more manageable components, thereby allowing the system to learn and predict user preferences more accurately. The model utilizes the same hyperparameters as the non-tokenized version for a controlled comparison.
- NLP-based Recommendation without Tokenization: As a contrast to our primary model, this version of the recommendation system processes product names as whole entities, without breaking them down into tokens. This model serves as a direct comparison to evaluate the added value of tokenization in improving recommendation accuracy.
- NLP-based Recommendation using N-grams: In addition to the above models, we introduced a variant that utilizes n-gram tokenization. This model was tested with unigrams, bigrams, and trigrams to assess how different levels of token granularity impact the system's performance. Similar to the other models, this approach uses the same set of hyperparameters to ensure consistency in the comparative analysis.
The selection of these specific models was driven by the need to comprehensively evaluate the effectiveness of our tokenization approach against varying baselines. The Random model provides a baseline to contrast the predictive power of our NLP models against random chance. The NLP-based Recommendation without Tokenization model serves to directly assess the incremental benefit provided by tokenization. The inclusion of N-grams further allows us to explore the depth of tokenization granularity and its impact on recommendation accuracy. These comparative models were chosen to provide a holistic understanding of our system’s performance across different levels of text processing complexity.
Below, we elaborate on the parameter settings utilized in our experiments to ensure transparency and reproducibility. The selection and tuning of these parameters were guided by preliminary tests and literature benchmarks to optimize the performance of our NLP-based recommendation system.
- Transformer Model Configuration: Our model utilized a Transformer architecture configured with a single layer (num_layers = 1), which was chosen to maintain a balance between complexity and computational efficiency. This layer count was found to be sufficient for capturing the nuances of our dataset while ensuring manageable training times.
- Input and Output Space Dimensionality: We set the dimensionality of the input and output space to 128 (d_model = 128). This dimension was selected as it provides a good trade-off between model expressiveness and overfitting risk, considering the size and complexity of our datasets.
- Attention Mechanism: The number of attention heads in the multi-head attention mechanism was set to 4 (num_heads = 4). This number allows the model to focus on different parts of the input sequence, improving its ability to learn from various patterns within the data.
- Inner-Layer Dimensionality: The inner-layer dimensionality was set to 256 (units = 256), which determines the capacity of the feed-forward networks within the Transformer. This setting was chosen to provide sufficient model complexity for learning intricate relationships in the data.
- Dropout Rate: To mitigate the risk of overfitting, we employed a dropout rate of 0.2 (dropout = 0.2) during training. This rate was optimized through cross-validation to ensure the model generalizes well to unseen data.
- Training Epochs: The model was trained across 100 epochs (epochs = 100), with early stopping applied to cease training upon convergence. This approach ensures that the model is adequately trained without overfitting to the training data.
These parameters were meticulously selected and adjusted to align with the specific needs and characteristics of our datasets, ensuring that our model delivers robust and reliable recommendations.
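The reported hyperparameters can be instantiated, for example, with PyTorch's built-in Transformer. This is a sketch of the configuration only; the paper does not state which framework was used, and the vocabulary size and embedding/output layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000  # illustrative; the real value depends on the token set

class ProductNameTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, 128)  # d_model = 128
        self.transformer = nn.Transformer(
            d_model=128,            # input/output dimensionality
            nhead=4,                # num_heads = 4 attention heads
            num_encoder_layers=1,   # num_layers = 1
            num_decoder_layers=1,
            dim_feedforward=256,    # inner-layer dimensionality (units = 256)
            dropout=0.2,            # dropout = 0.2
            batch_first=True,
        )
        self.out = nn.Linear(128, VOCAB_SIZE)  # next-token logits

    def forward(self, src_ids, tgt_ids):
        src = self.embed(src_ids)
        tgt = self.embed(tgt_ids)
        hidden = self.transformer(src, tgt)
        return self.out(hidden)

model = ProductNameTransformer()
src = torch.randint(0, VOCAB_SIZE, (2, 4))  # batch of 4-token purchase histories
tgt = torch.randint(0, VOCAB_SIZE, (2, 3))  # partial product-name token sequences
print(model(src, tgt).shape)  # torch.Size([2, 3, 10000])
```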
E. Experimental Results
1) Comparison Between With and Without Tokenization
Our first focus was the impact of token-level learning of product names on the overall efficiency of the recommendation system. To ensure that our results were accurate and reliable, we utilized two key metrics: Hit-Rate and MRR. From Table 3, it is evident that adding tokenization to our recommendation system provided a notable improvement. Taking the UK e-Commerce dataset as an example, the tokenized version exhibited a notable surge in performance over its non-tokenized counterpart, with an enhancement rate peaking at 65.9%. Similarly, experiments using the Instacart dataset yielded positive results: by employing tokenized product names, we observed gains in recommendation efficiency as high as 23.3%. To offer a holistic view of our results, we performed an in-depth examination of the Hit-Rate metric, placing special emphasis on the UK e-Commerce and Instacart datasets; our insights are visualized in Figure 5 for clarity.
Moreover, MRR emerged as another pivotal metric in our study, serving as a litmus test for recommendation systems by evaluating performance based on the placement of the first relevant recommendation. Specifically, as portrayed in Table 4, models trained on tokenized product names outperformed their peers on the UK e-Commerce dataset, showing an improvement rate of up to 30.0%. On the Instacart dataset, the performance enhancement of the tokenized model was more tempered, albeit still notable, showing an improvement of up to 5.8%. For a side-by-side comparison, both datasets and their respective performances are illustrated in Figure 6. Considering both the Hit-Rate and MRR metrics, with a magnified focus on the UK e-Commerce and Instacart datasets, it was evident that while harnessing tokenized product names had an unmistakable edge in the UK e-Commerce setting, it was less effective on the Instacart dataset. The latter's more reserved performance is possibly attributable to its inherently vast diversity in both users and products, which introduces a myriad of complexities into the learning curve of the recommendation system. Consolidating our findings, NLP-based recommendation systems enhanced by product name tokenization provide compelling capability and efficiency: the strategic use of tokenized product names yields significant improvements in system performance, and this favorable outcome remained consistent regardless of the dataset or evaluation metric applied.
2) Comparison of Tokenization Across N-Grams
The second focal point of our experimentation was an in-depth analysis of the impact that various n-grams have on the performance of NLP-based recommendation systems. We scrutinized the efficiency of unigrams, bigrams, and trigrams in tokenizing product names to ascertain the most effective n-gram level for our tokenization strategy. Consistent with our methodology, we employed Hit-Rate and MRR metrics to measure accuracy. Initially, accuracy as depicted by the Hit-Rate in Table 5 demonstrated that both the UK e-Commerce and Instacart datasets achieved superior performance with unigram tokenization. The UK e-Commerce dataset showed a more noticeable variance in performance among the different n-grams, whereas the Instacart dataset exhibited less variation but nonetheless confirmed the superior performance of unigrams. This pattern is graphically represented in Figure 7, where unigrams take the lead, followed by bigrams and trigrams, in that order. The MRR metric mirrored these findings, with unigrams again showing the highest performance for both the UK e-Commerce and Instacart datasets, as corroborated by Table 6 and Figure 8. Across all our tests, unigrams consistently outperformed bigrams and trigrams. This phenomenon is elucidated in Figure 9, which illustrates the breadth of interrelationships formed through unigram tokenization. Building on the premise that tokenization enables more detailed learning of product names, our n-grams experiment further solidified the finding that finer-grained tokenization facilitates more precise learning. Unigrams, by splitting product names into single-word units, created an extensive network of interrelationships that could be leveraged for a more comprehensive learning of patterns. Thus, the approach of tokenizing product names has been robustly validated through the n-grams experiment.
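The n-gram variants compared above can be illustrated with a small helper; the joining convention (underscores between the words of a bigram or trigram token) is an assumption for illustration.

```python
def ngram_tokens(product_name: str, n: int = 1) -> list[str]:
    """Tokenize a product name into n-gram tokens over its words."""
    words = product_name.split()
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]

name = "Chocolate Chip Cookie"
print(ngram_tokens(name, 1))  # ['Chocolate', 'Chip', 'Cookie']
print(ngram_tokens(name, 2))  # ['Chocolate_Chip', 'Chip_Cookie']
print(ngram_tokens(name, 3))  # ['Chocolate_Chip_Cookie']
```

The example makes the granularity trade-off visible: unigrams yield the most tokens and thus the densest network of shared tokens across product names, which is consistent with the superior unigram results reported above.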
Service Generation
The previously introduced NLP-based recommendation system can derive product name tokens by learning them at the token level. As shown in Figure 10, this feature allows us to propose two services: New Product Brainstorming utilizing non-existent product names and Keyword Trend Forecasting using frequency analysis of product name tokens.
A. New Product Brainstorming
In initiating our discussion on the New Product Brainstorming service, we wish to underscore the exploratory and illustrative purpose of this application. The service is presented not as an endpoint, but as a demonstration of the NLP-based system's potential for broad service generation. It is a proof of concept that invites stakeholders to visualize the possibilities of NLP in product innovation. This aligns with our wider objective of highlighting the system's versatility and serves as a catalyst for further creative exploration and development. While product names that do not exist in reality might be seen as mere mistakes or errors, they can be utilized to generate novel product name ideas through "serendipity," which refers to valuable discoveries or inventions made unintentionally through coincidence [43]. Particularly in the field of scientific research, there are many instances in which significant findings arise from experimental failure. A classic example is the discovery of penicillin by Fleming [44]: when blue mold accidentally contaminated a culture experiment, Fleming discovered an antimicrobial substance effective against infections. Other examples include the discovery of the microwave oven from melted chocolate and the birth of Post-it notes from a failed strong adhesive [45]. Based on these scientific cases, we aim to propose product names that do not exist in reality but are derived through the NLP-based recommendation system, as a foundation for brainstorming new product ideas.
1) UK E-Commerce Data
From a test set of 7,121 rows, we generated the top-20 product names, resulting in a total of 142,420 product names. By comparing the product names in the dataset with the generated names, we found that 202 (0.15%) product names were not present in the dataset. Examples of non-existing product names are displayed on the left side of Figure 11. On the right-hand side, the most similar product names derived using Jaccard similarity are listed. Some product names had inaccurately generated tokens related to size or color, resulting in outputs like 'Red Spotty Luggage Tag' instead of 'Pink Spotty Luggage Tag' and 'Green Owl Soft Toy' instead of 'Pink Owl Soft Toy'. Although 'Red Spotty Luggage Tag' and 'Green Owl Soft Toy' are not present in the dataset, they are colors that customers might desire. Additionally, there were cases where product name tokens were incorrectly generated or entirely new product names emerged, such as 'Tube Red Spotty Paper Plates', 'Black Greeting Card Holder', and 'Tutti Frutti Notebook Box'. Even if similar products are available in other stores, they can be considered novel product ideas based on the current store's data. Specifically, to better understand unfamiliar names like 'Tube Red Spotty Paper Plates', we utilized DALL·E 3 to generate images of the aforementioned product names, as depicted in Figure 12. In the future, by expanding the "new product brainstorming" service, we aim to offer new product development ideas by presenting both product names and images together.
2) Instacart Dataset
From a test set of 16,644 rows, we generated the top-20 product names, resulting in a total of 332,880 product names. By comparing the product names in the dataset with the generated names, we found that 1,020 (0.31%) product names were not present in the dataset. Examples of non-existing product names are displayed on the left side of Figure 13. On the right-hand side, the most similar product names, derived using Jaccard similarity, are listed. Some product names had inaccurately generated tokens related to flavor or ingredients, resulting in outputs like 'Coffee Chocolate Milk' instead of 'Chocolate Milk' and 'Coconut Chocolate Chip Cookies' instead of 'Chocolate Chip Cookies'. Although 'Coffee Chocolate Milk' and 'Coconut Chocolate Chip Cookies' are not present in the dataset, they might represent flavors or variations that are appealing to customers. There were also instances in which product name tokens were generated in a different manner or entirely unique product names surfaced, such as 'Vegetable Beef Franks'. Even if such products exist in other markets or stores, they can be viewed as fresh product concepts based on the current store's data. For a clearer visualization of these potential products, we used DALL·E 3 to generate images corresponding to the product names, as shown in Figure 14. Product names that do not exist in reality and are derived at the token level can be used for new product brainstorming, serving as a support service for product development departments and companies.
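As an illustration of how such images could be generated programmatically, here is a sketch using OpenAI's Python client. The paper does not describe its exact generation setup, so the model identifier, image size, and prompt wording are assumptions based on the public API.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def render_product_idea(product_name: str) -> str:
    """Ask an image model to visualize a generated, non-existent product
    name for brainstorming; returns a URL to the rendered image."""
    result = client.images.generate(
        model="dall-e-3",  # assumed model identifier
        prompt=f"Product photo of '{product_name}' on a store shelf",
        size="1024x1024",
        n=1,
    )
    return result.data[0].url

print(render_product_idea("Tube Red Spotty Paper Plates"))
```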
B. Keyword Trend Forecasting
Several studies have analyzed trends through keyword analysis, either by analyzing the frequency of words in a text to identify research trends or by predicting product sales and consumer purchase trends using product frequency analysis [46], [47]. The results of a recommendation system can be viewed as data that predict what users will purchase in the future based on their past purchases. In particular, the proposed NLP-based recommendation method outputs product names in token units; thus, these tokens can be considered as keywords of product elements. Therefore, we suggest a keyword trend forecasting service based on a frequency analysis of product keywords.
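The frequency analysis underlying this service can be sketched as follows, counting tokens across the generated top-K product names; the example names are illustrative.

```python
from collections import Counter

generated = ["White Hanging Heart T-light Holder", "Red Spotty Luggage Tag",
             "Heart Of Wicker Small", "Red Heart Lantern"]

# Tokenize every generated product name and count keyword frequencies.
keyword_counts = Counter(
    token.lower() for name in generated for token in name.split()
)
print(keyword_counts.most_common(3))  # e.g. [('heart', 3), ('red', 2), ...]
```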
1) UK E-Commerce Data
Based on UK e-commerce data, an experiment was conducted using 7,121 test data entries to generate the top-20 product names, yielding a total of 142,420 product names. These product names were tokenized, and a frequency analysis was conducted on the tokens. The results are presented in Table 7, where top keywords such as 'Heart', 'Red', and 'Set' were identified. Each row in this test set consists of four learning data points representing a user's purchase history and one label indicating the product the user is expected to purchase next. By analyzing the frequency of all products corresponding to this label, we can predict which products will be popular in the future. The results are shown in Table 8; notably, the 'White Hanging Heart T-light Holder' product appeared with the highest frequency. Combining the results in Tables 7 and 8, we observed that products containing specific keywords have high sales volumes. For instance, products with the keywords 'Heart', 'White', and 'Hanging' show significant sales. This suggests that keyword-based trend forecasting can significantly influence product sales strategies. In Figure 15, the top-50 keywords are visualized as a word cloud.
2) Instacart Data
Based on e-commerce data from Instacart, a test was conducted using a total of 16,644 rows of data to generate the top 20 product names. This resulted in 332,880 product names being generated. The produced product names were tokenized, and a frequency analysis was conducted based on these tokens. The results are presented in Table 9, where the top product names, such as ‘Banana’, ‘Bag of Organic Bananas’, and ‘Organic Strawberries’, were identified. Each row of this test set consists of four learning data points and one label. The four learning data points represent a user’s purchase history, while the single label indicates the product the user is expected to purchase next. Analyzing the frequency of all products corresponding to this label allows us to predict which products may be popular in the future. This result is shown in Table 10, where notably, the keyword ‘organic’ appeared with the highest frequency. Combining the results in Tables 9 and 10, we observed that products containing specific keywords have high sales. For instance, products containing the keywords ‘organic’, ‘milk’, and ‘whole’ showed significant sales volumes. This suggests that keyword-based trend forecasting can be instrumental in product sales strategies. In Figure 16, the top-50 keywords are visualized as a word cloud.
Conclusion
The main finding of this study is the demonstrated efficacy of token-level analysis in NLP-based recommendation systems, leading to significant performance improvements in the context of e-commerce. Additionally, our exploration into n-grams, particularly the superior performance of unigram tokenization, further reinforces the effectiveness of fine-grained token analysis in enhancing recommendation system accuracy. The n-grams experiment, especially the comparative analysis of unigrams, bigrams, and trigrams, provided crucial insights into how different granularities of tokenization affect the predictive accuracy and utility of our NLP-based recommendation system, thus contributing to a deeper understanding of the nuances in natural language processing for e-commerce applications. In the modern e-commerce landscape, understanding consumer behavior requires the analysis of vast amounts of data, with particular emphasis on natural language data. In this study, we adopted a unique approach by utilizing product names at the token level, thereby experimentally validating the efficacy of this methodology. Using the UK e-Commerce and Instacart datasets, we measured the performance of an NLP-based recommendation algorithm that employs product names at the token level. The results confirmed its high efficacy. This underscores the potential of NLP technologies to move beyond mere data analysis and emphasize deep data insights and the provision of high-quality services. Furthermore, this study highlighted the significance of creating NLP-based services with a focus on offering services without the use of personal data. Our natural language learning approach, which analyzes product names at the token level, has demonstrated its potential to offer effective services while preserving user privacy. By focusing solely on product name tokens, rather than personal user data, we enhance the system’s privacy and mitigate concerns related to personal data use. The innovative services we introduced, specifically product brainstorming and keyword trend forecasting, stem directly from our findings that token-level analysis of product names can uncover latent consumer preferences and market trends. These services hold the potential to revolutionize business strategies by providing insights into untapped product opportunities and emerging market demands. Future research should explore the scalability of token-level analysis across larger and more diverse datasets, and examine the integration of evolving NLP technologies to maintain the relevance and effectiveness of recommendation systems.
From a theoretical perspective, this study makes two significant contributions. First, it substantially improves the performance of recommendation algorithms. By advancing existing NLP-based recommendation methods and incorporating tokenization of product names, our approach demonstrated superior performance, as evidenced by improved metrics such as Hit-Rate and Mean Reciprocal Rank in comparison to non-tokenized models. Second, notable progress was made in the extensibility of the recommendation model. NLP-based recommendation models can be flexibly applied across diverse languages and domains, thereby enabling services to be offered globally, unrestricted by specific languages or regions. This study also has two important practical implications. First, processing product names in natural language eliminates the need for personal information, thereby significantly enhancing its potential for industrial applications. This method can be integral in industries such as e-commerce and digital marketing, where it can enhance user experience by providing personalized recommendations without compromising individual privacy, thereby addressing challenges related to the protection of personal information. Second, this study explored the potential of diverse services. The development of services such as brainstorming new product ideas and predicting keyword trends offers businesses a sustainable path for utilization. Thus, companies are better equipped to respond swiftly to market shifts and evolving consumer demand.
Despite offering significant insights, this study has certain limitations that merit consideration. First, it was confined to the UK e-Commerce and Instacart datasets. Future research should consider applying our token-level analysis approach to a wider range of datasets, such as social media trends, customer reviews, and global marketplaces, to validate the applicability and reliability of our findings across various consumer contexts and cultural backgrounds. Second, the NLP technologies employed reflect the current state-of-the-art. As technology evolves, newer techniques may emerge, potentially affecting the performance and outcomes of recommendation algorithms. Furthermore, a notable advantage of this study lies in its approach to using product names directly and analyzing them at the token level. However, this methodology is predominantly suitable for products with intuitive and straightforward names. For more abstract product names, the application could become challenging. For instance, for a product name like “Taste of Magic,” it is not immediately clear which food it represents. Thus, NLP methodologies may struggle to provide accurate recommendations based solely on such abstract names. Future research should focus on expanding the datasets tested, incorporating newer NLP techniques, and devising strategies to effectively handle products with abstract or non-descriptive names.