Data mining in e-commerce - Building anime recommendation system

This research aims to build an anime movie recommendation system based on the nearest neighbor method, one of the recommendation methods based on users with similar behavior, to provide us

INTRODUCTION

Urgency

The rapid evolution of the Internet has led to an abundance of information, resulting in a growing demand for recommendation systems that enhance information filtering and streamline the search for products and services. These systems leverage user data to provide tailored suggestions, such as popular songs, highly purchased products with top reviews, and trending movies based on viewership.

Anime refers to animated films derived from Japanese manga comics and is widely recognized as a distinctive aspect of Japanese culture. Its global popularity highlights the vast fanbase for Japanese animated films. However, there remains a significant gap in specialized systems that offer tailored movie recommendations based on individual viewer interests.

Our team conducted this research to explore the operating principles of such a system and to understand how to apply similarity measures and Streamlit effectively in building it.

Objectives of the study

Data mining in e-commerce - Building an anime movie recommendation system.

- Provide an anime movie recommendation system for users.

- Use the Python programming language to build the system.

- Use the nearest neighbor based recommendation method.

- Use Streamlit to retrieve data.

- Provide a recommendation system that meets the user's request to find similar anime movies.

Research Methods

Data collection method: data on film inventory, content, and audience reviews for anime movies.

Design analysis method: using the nearest neighbor-based recommendation method, Neighborhood-based Collaborative Filtering (NBCF), and using Streamlit, pandas, and NumPy to read data to serve system construction requirements.

Use programming languages to build algorithms.

Experimental method: based on collected and synthesized data, standardize the data and build a recommendation system for anime movies that audiences should watch.

Expected results

This article aims to enhance readers' understanding of recommendation systems (RS), which utilize movie similarity to generate personalized movie suggestions based on the titles users are currently viewing.

A recommendation system enhances user experience by accurately predicting and presenting a curated list of anime genres tailored to individual preferences. Additionally, it introduces users to other potential genres and productions they may not be aware of, enriching their overall discovery experience.

Topic structure

Chapter 1: Introduction, covering the reason for choosing the topic, research objectives, and research methods.

Chapter 3: Building an anime recommendation system

Chapter 4: Results achieved, limitations of the project as well as future development directions of the project.

THEORETICAL BASIS

Some general theories

At the start of the last decade, the rapid growth of the internet led to the emergence of essential web applications focused on e-commerce, social interaction, and real-time information sharing, creating a vast source of data online. However, sifting through this information to find valuable insights remains challenging, making recommendation systems essential for users. These systems enhance the user experience by quickly suggesting relevant products and services, enabling e-commerce platforms to increase sales and introduce customers to new offerings. Originating in the mid-1990s, recommendation systems analyze user data, collected both explicitly through surveys and implicitly through user interactions such as clicks and comments, to provide tailored suggestions. Prominent examples include Amazon's book recommendations and YouTube's video suggestions, which show how effectively these systems connect users with content that meets their needs.

The recommendation system is based on data collected from the history of movies that viewers have viewed (Collaborative Filtering Recommendation System - CFRS).

KBRS recommendation system - “Knowledge-Based Recommendation System”.

The system collects data from viewers in two forms: hidden (implicit) information and visible (explicit) information.

Hidden information: viewing time, actions before watching, access source.

Visible information: reviews from viewers after the movie (scores, comments, ...). This information is more useful for retrieval and for making recommendations.

The recommended results are general predictions, for example predicting that 80% of viewers will enjoy watching these movies.

2.1.2 Overview of recommendation systems models

In the world, there have been many studies on recommendation systems and anime recommendation systems such as:

Pramod Ramu (2022) explores various machine learning algorithms for recommendation systems, including k-nearest neighbor, task filtering, and content-based filtering; however, these models often struggle with data sparsity. To address this issue, he proposes a high-quality recommendation system in his research project, "Deep Learning-Based Anime and Movie Recommendation System" (2022), which utilizes collaborative filtering based on deep learning to tailor recommendations for movies and anime according to user preferences. The project processes data from movie and anime datasets obtained from the open-source platform Kaggle.

Prateek Sappadla, Yash Saldwani, and Pranit Arora from New York University developed various algorithms to create a recommendation system, evaluating their effectiveness using the MovieLens dataset. Their findings indicated that for smaller datasets, user-based collaborative filtering achieved the lowest mean squared error.

Tarun Soni from Jaypee University of Information Technology, Waknaghat, has developed a movie recommendation system utilizing the cosine similarity algorithm. This system aims to deliver personalized movie suggestions tailored to individual user preferences. The project's objective is to enhance user experience through a web-based interface, enabling easy access and interaction while providing accurate recommendations based on the completed algorithm.

Researchers A. S. Girsang, B. Al Faruq, H. R. Herlianto, and S. Simbolon from Bina Nusantara University in Jakarta proposed a collaborative filtering (CF) recommendation system for anime. This widely used technique calculates similarities, predictions, and recommendations to enhance user experience. By analyzing data from Kaggle, which includes 73,516 users and 12,294 anime titles, the system measures the similarity between user preferences and anime shows. Using the alternating least squares (ALS) method, the user's viewing history is matched with others to generate tailored recommendations. The authors aim to assist millions of users in discovering their preferred anime series through this approach.

Researchers B. V. N. Sravya, R. Venkata Tagore Reddy, D. Nithin Sri Sarveswar, and Suda Naveen from the School of Computer Science & Engineering at Vellore Institute of Technology have demonstrated that a hybrid implementation utilizing language processing methods can deliver personalized movie recommendations. Their proposed approach outperforms existing methods in reliability and efficiency, achieving accurate personalization when tested on the MovieLens dataset. The study emphasizes a recommendation engine that utilizes various algorithms to filter data and suggest content tailored to user preferences, enabling recommendations through both text and voice search.

A study by Charu C. Aggarwal from IBM T. J. Watson Research Center reveals that neighborhood-based methods are effective for extrapolating unknown ratings, though they may yield less diverse recommendations compared to item-based methods. To enhance the speed of these neighborhood approaches, clustering techniques are often employed. Additionally, these methods can be integrated with optimization models like matrix factorization for improved prediction accuracy. However, challenges such as data sparsity arise, as users typically provide only a limited number of ratings. This issue can be mitigated through dimensionality reduction and graph-based models. While dimensionality reduction is frequently utilized independently in collaborative filtering, its combination with neighborhood-based methods can significantly enhance filtering effectiveness and efficiency. Various graph types, including user-item and item-item graphs, can be generated from rating data, often utilizing random walk or shortest path algorithms for analysis.

Function

Improve user experience: predicting and recommending similar anime movies that customers will like will contribute to increasing customer satisfaction with the system.

Increase operational efficiency with automation: previously, recommendation was not very effective and was even limited in performance because traditional product recommendations were often done manually.

Turn potential customers into real customers: the system will recommend potential products to customers that even customers have not thought of but still match their preferences.

Overview of data mining

The project aims to explore the fundamental theory behind recommender systems, develop algorithm source code based on this theory, and assess its performance using real-world data. Additionally, it includes an analysis of source code from a scientific article to enhance understanding of the practical applications of recommendation systems.

The primary approaches to developing a recommendation system include content-based filtering, collaborative filtering, hybrid filtering, and non-personalization algorithms. Each of these methods will be explored in detail in the subsequent chapters.

The model's effectiveness will be assessed using two real-world datasets, MovieLens and Globo.com, through both theoretical metrics such as RMSE and MAE, as well as practical methods including Hit Rate and MRR.
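The four metrics named above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's evaluation code: the function names, the sample ratings, and the "held-out item per user" protocol for Hit Rate and MRR are assumptions made for the example.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over paired ratings."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error over paired ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def hit_rate(recommended_lists, held_out_items):
    """Fraction of users whose held-out item appears in their top-N list."""
    hits = sum(1 for recs, item in zip(recommended_lists, held_out_items) if item in recs)
    return hits / len(held_out_items)

def mrr(recommended_lists, held_out_items):
    """Mean reciprocal rank of the held-out item (0 when it is absent)."""
    total = 0.0
    for recs, item in zip(recommended_lists, held_out_items):
        if item in recs:
            total += 1.0 / (recs.index(item) + 1)
    return total / len(held_out_items)

# Illustrative data only.
actual = [4.0, 3.0, 5.0]
predicted = [3.5, 3.0, 4.0]
recs = [["a", "b", "c"], ["d", "e", "f"]]
held_out = ["b", "x"]

print(rmse(actual, predicted), mae(actual, predicted))
print(hit_rate(recs, held_out), mrr(recs, held_out))
```

RMSE and MAE compare predicted ratings with known ones, while Hit Rate and MRR look only at whether, and how high, a relevant item appears in the recommended list.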

Machine learning algorithms

In the realm of recommendation systems (RS), various algorithms have been proposed, which can be categorized into four main groups: content-based filtering algorithms, collaborative filtering algorithms, hybrid filtering algorithms, and non-personalization algorithms.

Content-based filtering leverages detailed information about anime, including elements like genre, director, and actors, to identify similarities between different titles. By employing text classification techniques, this method evaluates these similarities and generates personalized recommendations for users seeking similar anime experiences.
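As a rough illustration of content-based similarity, the sketch below compares titles by their genre tags only. The catalogue, the title names, and the choice of cosine similarity over binary genre vectors are assumptions for the example; a real system would likely use richer text features.

```python
import math

# Hypothetical catalogue: title -> set of genre tags (illustrative data only).
catalogue = {
    "Title A": {"action", "adventure", "fantasy"},
    "Title B": {"action", "fantasy"},
    "Title C": {"romance", "drama"},
}

def genre_cosine(a, b):
    """Cosine similarity between two binary genre vectors represented as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def most_similar(title, top_n=2):
    """Rank the other titles by genre similarity to `title`."""
    scores = [(other, genre_cosine(catalogue[title], genres))
              for other, genres in catalogue.items() if other != title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

print(most_similar("Title A"))
```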

Deep learning techniques, including Neural Networks and Recurrent Neural Networks (RNNs), play a crucial role in anime recommendation systems. These algorithms enable the learning of intricate data representations and facilitate the automatic extraction of essential features from various forms of content, such as images, audio, and text related to anime.

- Contextual Filtering: For context-based anime recommendation, machine learning algorithms such as Decision Trees, Bayesian Networks, and Markov Decision Processes (MDPs) can be used to determine the correlation between context and anime and make appropriate recommendations.

Recommendation system

Leading news platforms like Google News, Yahoo! News, The New York Times, and The Washington Post have gained significant traction among online audiences. Researchers have explored various online news recommendation systems over the years, employing techniques such as content-based filtering, collaborative filtering, and hybrid methods to enhance user experience.

Some challenges for news recommendation systems include:

User profiles on the platform are largely underdeveloped, with most readers remaining anonymous and engaging with only a handful of stories from the extensive archive. This behavior results in significant sparsity within the user-post matrix, as users generally do not retain information about their previous interactions.

The rapid increase in the number of articles, with hundreds added daily to news portals like The New York Times, intensifies the cold-start problem for new posts due to the lack of past interactions for effective recommendations. This surge in content can also lead to scalability challenges for news aggregators, as the overwhelming volume of articles can strain web resources in a short timeframe.

The lifespan of an article is limited, particularly in the news sector, where users prioritize fresh information. As a result, news articles often have a brief shelf life. User interests fluctuate frequently, making news topics less stable compared to entertainment content. While some interests evolve over time, others remain consistent. Additionally, a user's interests during a session can be shaped by contextual factors, such as their location and the timing of their visit, as well as broader circumstances like breaking news or significant events.

Neighborhood-based Collaborative Filtering - NBCF: recommendation method based on nearest neighbors.

2.6.1 Overview of the nearest neighbor based recommendation method:

Neighborhood-based Collaborative Filtering is a method that recommends items by analyzing the similarity in relationships between users and/or items. This approach suggests relevant options to a user by identifying and leveraging the behaviors of users with similar preferences.

When individuals share similar preferences, such as a girl and a boy both liking chocolate and strawberries, it indicates a strong likelihood that others in their social circle may enjoy these items as well. Consequently, the system is likely to recommend chocolate and strawberries to the boy. Similarly, in a movie recommendation system, if User 1 has watched a film that User 2 hasn't seen, the system may suggest that movie to User 2, as they might also be interested in it. This illustrates how shared tastes can drive personalized recommendations.

The above method has two main directions: User-User Collaborative Filtering (uuCF) and Item-Item Collaborative Filtering (iiCF).

- uuCF: The idea of this approach is to identify groups of similar users and then estimate a user's preference for an item from the preferences of the users in the same group.

This approach is implemented as follows:

• Represent each user with an attribute vector built from the user's previous feedback on items. From there, the similarity between users can be calculated.

• To determine user u's preference for item i, identify the k users who have rated item i and are most similar to user u. The final preference score is computed from the ratings these k similar users gave to item i.

• Finally, recommend to u the items with the highest predicted preference.

- iiCF: Item-item collaborative filtering identifies groups of similar items by analyzing user preferences. This method predicts a user's preference for a specific item based on their preferences for other items within the same group.

1 Basic Machine Learning, Neighborhood-based Collaborative Filtering - NBCF (May 24, 2017). Accessed June 16, 2023 at: https://machinelearningcoban.com/2017/05/24/collaborativefiltering/

This approach is implemented as follows:

• Represent each item with an attribute vector. From there, calculate the similarity between items.

• To determine user u's preference for item i, identify the k items that u has rated and that are most similar to item i. The final preference score is then computed from u's feedback on these k items.

• Finally, recommend to u the items with the highest predicted preference.

Because the number of items is typically far smaller than the number of users, the iiCF similarity computation is significantly faster than the uuCF one.

The only data we use for this method is the Utility matrix:

Figure 2: Standardization of the Utility Matrix

      u0   u1   u2   u3   u4   u5   u6
i0     5    5    2    0    1    ?    ?

To utilize this matrix in calculations, it is essential to replace the symbol "?" with a specific value. The simplest option is to substitute it with "0", though any other value can also be used.

A safer value is 2.5 stars, the midpoint between 0 and 5, but this choice has limitations for users who are either easy-going or demanding. Easy-going users give 5 stars to items they like, while items they dislike still receive 2 or 3 stars.

For such users, filling in 2.5 makes the unrated items look too negative. Conversely, demanding users may give only 3 stars to items they genuinely like, so 3 stars is not a neutral score for them. Each user's own average rating therefore represents their feedback more accurately. Instead of directly substituting these average values for the "?" entries of each user, the following approach is recommended:

- Subtract each user's average rating from each of that user's ratings, and replace "?" with the value 0. The purpose of this processing is to:

• Classify reviews into two types: negative values (the user did not like the item) and positive values (the user liked the item). A value of 0 corresponds to items that were not evaluated.

• Reduce storage. The utility matrix typically has a large dimension, but the quantity of known ratings is relatively small. By substituting the unknown values, represented by "?", with "0", one can utilize a sparse array that only retains non-zero values and their corresponding locations, resulting in more efficient storage.

Hai Ha's article on Neighborhood-based Collaborative Filtering discusses a recommendation method that utilizes the nearest neighbors approach This technique focuses on analyzing user preferences within a community to enhance personalized suggestions By leveraging similarities among users, the method aims to improve the accuracy and relevance of recommendations The article, published on December 30, 2019, provides insights into the effectiveness of this collaborative filtering strategy, emphasizing its application in various domains For more details, the article can be accessed at VIBLO.

The matrix after normalization is called the Normalized Utility Matrix:

        u0     u1     u2     u3     u4     u5     u6
i0    1.75   2.25   -0.5  -1.33   -1.5      0      0
i1    0.75      0      0  -1.33      0    0.5      0

Figure 3: The Normalized Utility Matrix
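The normalization step can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: only the first row of the matrix comes from Figure 2, and the remaining rows are illustrative values chosen so that the first two normalized rows agree with Figure 3. `None` stands for an unknown rating ("?"), and the function names are hypothetical.

```python
# Utility matrix: rows are items i0..i4, columns are users u0..u6.
# None marks an unknown rating ("?"). Only row i0 comes from Figure 2;
# the remaining rows are illustrative assumptions.
Y = [
    [5, 5, 2, 0, 1, None, None],
    [4, None, None, 0, None, 2, None],
    [None, 4, 1, None, None, 1, 1],
    [2, 2, 3, 4, 4, None, 4],
    [2, 0, 4, None, None, None, 5],
]

def user_means(matrix):
    """Average of the known ratings in each column (user)."""
    means = []
    for u in range(len(matrix[0])):
        known = [row[u] for row in matrix if row[u] is not None]
        means.append(sum(known) / len(known) if known else 0.0)
    return means

def normalize(matrix):
    """Subtract each user's mean from their ratings; unknown ratings become 0."""
    means = user_means(matrix)
    return [[0.0 if r is None else r - means[u] for u, r in enumerate(row)]
            for row in matrix]

Y_bar = normalize(Y)
print([round(v, 2) for v in Y_bar[0]])
print([round(v, 2) for v in Y_bar[1]])
```

With real data, NumPy or a SciPy sparse matrix would be the more natural container; plain lists are used here only to keep the sketch dependency-free.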

After normalizing the Utility matrix, it is necessary to calculate the similarity between users. In this example there are 7 users (u0, u1, u2, u3, u4, u5, u6) and 5 items (i0, i1, i2, i3, i4).

It can be observed that u0 and u1 both like i0 and do not like i3 and i4 very much, while the other users show the opposite pattern.


Figure 4: Conditions for having a good similarity function

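Some commonly used similarity functions are Cosine Similarity and Pearson Correlation:

$$\text{cosine\_similarity}(\mathbf{u}_i, \mathbf{u}_j) = \cos(\mathbf{u}_i, \mathbf{u}_j) = \frac{\mathbf{u}_i^T \mathbf{u}_j}{\|\mathbf{u}_i\|_2 \, \|\mathbf{u}_j\|_2} \quad (2.6)$$

$$\text{pearson}(\mathbf{u}_i, \mathbf{u}_j) = \frac{E[\mathbf{u}_i \mathbf{u}_j] - E[\mathbf{u}_i]\,E[\mathbf{u}_j]}{\sqrt{E[\mathbf{u}_i^2] - E[\mathbf{u}_i]^2}\,\sqrt{E[\mathbf{u}_j^2] - E[\mathbf{u}_j]^2}}$$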

Figure 5: Function Cosine Similarity and Pearson Correlation

Applying the similarity function to calculate the similarity between users, we will obtain the Similarity Matrix.

Figure 6: Cosine similarity distance function

        u0     u1     u2     u3     u4     u5     u6
u0       1   0.83  -0.58  -0.79  -0.82    0.2  -0.38
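A minimal sketch of how such a similarity matrix could be computed is shown below. The three user vectors are assumed normalized ratings over items i0..i4 (illustrative values, with unrated items set to 0), and `cosine_similarity` here is a hand-written helper, not a library call.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Normalized rating vectors (one per user, entries for items i0..i4),
# taken from an assumed example matrix; unrated items are 0.
users = {
    "u0": [1.75, 0.75, 0.0, -1.25, -1.25],
    "u1": [2.25, 0.0, 1.25, -0.75, -2.75],
    "u2": [-0.5, 0.0, -1.5, 0.5, 1.5],
}

names = list(users)
similarity = [[cosine_similarity(users[a], users[b]) for b in names] for a in names]
for name, row in zip(names, similarity):
    print(name, [round(s, 2) for s in row])
```

Because the vectors are mean-centered, a positive value suggests two users rate in the same direction and a negative value suggests opposite tastes.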

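Predict a user's ratings for each item based on the k nearest users (neighbor users), similar to the K-nearest neighbors (KNN) method.

The popular formula often used to predict u's rating for i is:

$$\hat{y}_{i,u} = \frac{\sum_{u_j \in \mathcal{N}(u,i)} \bar{y}_{i,u_j}\,\text{sim}(u, u_j)}{\sum_{u_j \in \mathcal{N}(u,i)} |\text{sim}(u, u_j)|}$$

where $\mathcal{N}(u,i)$ is the set of the k users most similar to u who have rated i, and $\bar{y}_{i,u_j}$ is the normalized rating of $u_j$ for i.

Figure 7: Formula to predict u's rating for i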

Predict the missing ratings to obtain the normalized ratings matrix.

Finally, add each user's average rating (from the normalization step) back to the predicted values, column by column, to obtain the complete matrix.

Figure 8: Normalized ratings matrix
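The prediction step can be sketched as a similarity-weighted average over the k nearest neighbours, followed by adding the user's mean back. The inputs below (normalized ratings for a single item, similarities to u1, and u1's mean) are assumed, illustrative values, and the dictionary-based layout is one possible sparse representation rather than the report's implementation.

```python
def predict_normalized(user, item, ratings, similarity, k=2):
    """Predict the normalized rating of `user` for `item` from the k most
    similar users who have rated that item."""
    # Candidates: (similarity to `user`, normalized rating of `item`).
    candidates = [(similarity[user][other], r[item])
                  for other, r in ratings.items()
                  if other != user and item in r]
    neighbours = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    denominator = sum(abs(s) for s, _ in neighbours)
    if denominator == 0:
        return 0.0
    return sum(s * r for s, r in neighbours) / denominator

# Assumed inputs: normalized ratings that users gave to item "i1", and the
# similarities between u1 and those users (illustrative values).
ratings = {"u0": {"i1": 0.75}, "u3": {"i1": -1.33}, "u5": {"i1": 0.5}, "u1": {}}
similarity = {"u1": {"u0": 0.83, "u3": -0.40, "u5": -0.23}}
user_mean = {"u1": 2.75}

normalized = predict_normalized("u1", "i1", ratings, similarity)
# Adding the user's mean converts the normalized prediction back to the rating scale.
print(round(normalized, 2), round(normalized + user_mean["u1"], 2))
```

The absolute value in the denominator keeps the weighting well defined when some neighbours have negative similarity.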

In iiCF, because the number of items is typically smaller than the number of users, the Similarity matrix is significantly smaller, which makes calculation and storage more efficient.

Typically, each item receives ratings from numerous users, resulting in a higher volume of known values in the item vector compared to the user vector. When additional rating data becomes available, the average rating of each item changes less than the average rating of each user, so the Similarity Matrix needs to be updated less often.

Computationally, iiCF can reuse the uuCF procedure by transposing the Utility matrix, treating the items as if they were rating the users. After calculating the result, transposing again gives the final result.

In this case the Utility matrix is normalized by subtracting the average rating of each item rather than of each user. This approach allows for a more balanced evaluation of item performance, leading to improved recommendations and insights.

Figure 9: Normalizing the Utility matrix

Figure 10: Calculate Similarity matrix using similarity function

Use similarity matrix and normalized utility matrix to predict users’ ratings for each item, similar to uuCF 3.

3 Hai Ha, Item-Item Collaborative Filtering (December 30, 2019). Accessed June 26, 2023 at: https://viblo.asia/p/neighborhood-based-collaborative-filtering-phuong-phap-goi-y-dua-tren-lang-gieng-gan-nhat-p1-4dbZNpvn5YM

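Calculate the predicted rating according to the formula:

$$\hat{y}_{i,u} = \frac{\sum_{i_j \in \mathcal{N}(i,u)} \bar{y}_{i_j,u}\,\text{sim}(i, i_j)}{\sum_{i_j \in \mathcal{N}(i,u)} |\text{sim}(i, i_j)|}$$

where $\mathcal{N}(i,u)$ is the set of the k items most similar to i that u has rated, and $\bar{y}_{i_j,u}$ is the normalized rating of u for $i_j$.

A complete Utility matrix is obtained when all unknown ratings have been predicted.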
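The transpose trick for iiCF can be sketched end to end as follows. This is a compact, dependency-free illustration under assumptions: the helper names are hypothetical, the tiny ratings matrix is made up, and the user-based routine is a simplified version of the procedure described above (mean-centering, cosine similarity, k nearest neighbours).

```python
import math

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def fill_user_based(Y, k=2):
    """uuCF on a matrix whose columns are users; None marks unknown ratings."""
    n_items, n_users = len(Y), len(Y[0])
    means = []
    for u in range(n_users):
        known = [Y[i][u] for i in range(n_items) if Y[i][u] is not None]
        means.append(sum(known) / len(known) if known else 0.0)
    # Mean-centered column vectors, with unknown ratings stored as 0.
    cols = [[0.0 if Y[i][u] is None else Y[i][u] - means[u] for i in range(n_items)]
            for u in range(n_users)]

    def sim(a, b):
        dot = sum(x * y for x, y in zip(cols[a], cols[b]))
        norm = math.sqrt(sum(x * x for x in cols[a]) * sum(y * y for y in cols[b]))
        return dot / norm if norm else 0.0

    filled = [row[:] for row in Y]
    for i in range(n_items):
        for u in range(n_users):
            if Y[i][u] is not None:
                continue
            raters = [(sim(u, v), cols[v][i]) for v in range(n_users)
                      if v != u and Y[i][v] is not None]
            top = sorted(raters, key=lambda p: p[0], reverse=True)[:k]
            denom = sum(abs(s) for s, _ in top)
            offset = sum(s * r for s, r in top) / denom if denom else 0.0
            filled[i][u] = means[u] + offset
    return filled

def fill_item_based(Y, k=2):
    """iiCF: transpose, run the user-based routine, transpose back."""
    return transpose(fill_user_based(transpose(Y), k))

Y = [[5, 5, 2], [4, None, 1], [1, 2, None]]  # illustrative ratings
print(fill_item_based(Y))
```

After the first transpose the columns are items, so the same code centers by item averages and compares items, which is exactly the iiCF variant.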

Streamlit Library

2.7.1 Overview of the Streamlit library

Streamlit enables users to quickly and visually create interactive web applications using popular Python libraries like Pandas, Matplotlib, and Scikit-learn. It offers a variety of widgets for building user interfaces, such as sliders, input cells, tables, and charts, allowing developers to concentrate on app functionality rather than UI design. Additionally, Streamlit provides tools for data analysis and visualization, simplifying the processes of data handling and result presentation. Its versatility and user-friendly nature have garnered significant interest among developers, making it a popular choice within the Python community.

To create a web app chatbot that automatically writes code in Python, follow a structured approach that includes defining the chatbot's purpose, selecting the right frameworks, and implementing natural language processing capabilities. Start by designing the user interface and integrating it with backend services to handle user queries effectively. Utilize libraries like Flask or Django for the web framework and leverage machine learning models for enhanced code generation. Finally, ensure the chatbot is tested thoroughly for functionality and user experience, optimizing for search engines to improve visibility and reach.

- with st.sidebar: use the with syntax to declare the side margin area of the web page, represented by the st.sidebar object.

The st.write() function is essential for enhancing web applications, allowing users to incorporate a wide range of elements such as formatted strings, charts from libraries like matplotlib and Altair, drawings, data frames, Keras models, and various other visual representations.

- st.title("Anonyviet") uses the title function to display the title for the application, in this case "Anonyviet".

- st.header(): Used to set the title of a section.

- st.markdown(): Used to render Markdown-formatted text.

- st.subheader(): Used to set the sub-title of a section.

- st.caption(): Used to write captions.

- st.code(): Used to display a block of code.

- st.latex(): This function is used to display mathematical expressions formatted as LaTeX.

2.7.2.2 Display image, video or audio files using Streamlit

- st.image("logo.jpg") uses the Streamlit library's image function to display an image that represents the application, such as a chatbot.

- st.audio(): This function is used to display an audio player.

- st.video(): Used for displaying video.

Figure 14: Displaying image, video or audio files using Streamlit

Widgets are the most important user interface elements. Streamlit has various widgets that allow you to bring interactivity directly into your app with buttons, sliders, text inputs, etc.

The function st.caption(f"{view}", unsafe_allow_html=False) is utilized to present a description for the application, showcasing the content held in the view variable. By setting the unsafe_allow_html parameter to False, it ensures that no HTML code is inserted, maintaining the integrity of the displayed content.

- st.checkbox(): This function returns a Boolean value. When this box is checked, it returns True, otherwise it returns False.

- st.button(): Used to display a button widget.

- st.radio(): Used to display the radio button widget.

- st.selectbox(): Used to display a select (drop-down) widget.

- st.multiselect(): Used to display a multi-select widget.

- st.select_slider(): Used to display a select slider widget.

- st.slider(): Used to display a slider widget.
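The widgets listed above might be combined as in the sketch below. It is an illustrative assumption rather than the report's application: `filter_titles` and `render_widgets` are hypothetical names, and Streamlit is imported inside the rendering function so that the pure helper can be used without it.

```python
def filter_titles(titles, query):
    """Case-insensitive substring filter used by the text input below."""
    return [t for t in titles if query.lower() in t.lower()]

def render_widgets(titles):
    import streamlit as st  # imported here so the helper above has no dependencies

    query = st.text_input("Search title")
    matches = filter_titles(titles, query)
    if not matches:
        st.write("No matching titles.")
        return
    choice = st.selectbox("Pick an anime", matches)
    top_n = st.slider("Number of recommendations", min_value=1, max_value=10, value=5)
    show_scores = st.checkbox("Show similarity scores")
    if st.button("Recommend"):
        st.write(f"Would recommend {top_n} titles similar to {choice}.")
        if show_scores:
            st.caption("Scores would be listed here.")
```

To try it, one could add a final line such as `render_widgets(["Title A", "Title B"])` to a script and launch it with `streamlit run`.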

Figure 17: Showing progress with Streamlit

- st.spinner(): Used to display a temporary wait message during execution.

- st.success(): Used to display a success message.

- st.error(): Used to display error messages.

- st.warning(): Used to display warning messages.

- st.info(): Used to display an informational message.

- st.exception(): Used to display exception messages.

Figure 18: Displaying status with Streamlit (the example calls st.success, st.warning, st.info, and st.exception with a RuntimeError)

Organizing apps through sidebars or containers significantly enhances user experience by improving the hierarchy and arrangement of pages. This structured content organization allows visitors to easily navigate the site, helping them find what they need quickly and increasing the chances of their return in the future.

Passing an element to st.sidebar causes the element to be pinned to the left, allowing the user to focus on the content in the app.

However, st.spinner() and st.echo() are not supported through the st.sidebar object notation; they can be used inside a with st.sidebar: block instead.

It is possible to create a sidebar in the app interface and place components within it to make the app more organized and easier to understand.

In Streamlit, each interaction with a widget triggers a complete rerun of the app. However, there are cases where users may prefer to fill in several widgets before submitting them together, so that the application reruns only once.

- st.form bundles input widgets together and, together with st.form_submit_button, sends the state of these widgets with the click of a single button.

- st.container() is used to create an invisible container where elements can be placed to create useful arrangements and hierarchies.
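The sidebar, form, and status elements described above can be sketched together as follows. The function names, labels, and the rating-entry scenario are assumptions for illustration; nothing is actually stored.

```python
def validate_rating(text):
    """Parse a rating typed by the user; return None when it is not in 0..5."""
    try:
        value = float(text)
    except ValueError:
        return None
    return value if 0 <= value <= 5 else None

def render_rating_form():
    import streamlit as st  # imported lazily so validate_rating works without it

    with st.sidebar:
        st.header("Settings")
        k = st.slider("Neighbours (k)", 1, 20, 5)

    # Widgets inside the form do not trigger a rerun until the submit button is pressed.
    with st.form("rating_form"):
        title = st.text_input("Anime title")
        rating_text = st.text_input("Your rating (0-5)")
        submitted = st.form_submit_button("Save")

    if submitted:
        rating = validate_rating(rating_text)
        if rating is None:
            st.error("Please enter a number between 0 and 5.")
        else:
            with st.spinner("Saving..."):
                pass  # a real app would store the rating here
            st.success(f"Saved {rating} for {title} (k={k}).")
```

Keeping validation in a plain function separates the input rules from the user interface, which makes them easy to check on their own.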

Streamlit's newer features streamline data management and enhance user experience. With st.connection, users can connect to external databases and APIs with very little code. In addition, st.dataframe displays data as an interactive table with sorting and, through column configuration, image display, while the related st.data_editor supports editing, making data handling more efficient and user-friendly.

5 Streamlit, documentation, Streamlit library, API reference, connections and databases, st.connection. Accessed July 16, 2023 at: https://docs.streamlit.io/library/api-reference/connections/st.connection

Data visualization plays a crucial role in storytelling by transforming complex data into an understandable format that highlights trends and outliers. An effective visualization not only conveys a compelling narrative but also eliminates unnecessary noise, focusing on valuable insights. It's essential to recognize that creating impactful visualizations goes beyond merely enhancing charts or adding information to infographics; it requires a careful balance between aesthetics and functionality. While overly simplistic charts may lack engagement, overly intricate designs can obscure the intended message. Therefore, the successful integration of data and visuals is an art that combines thorough analysis with effective storytelling.

Providing a million data points in a table or database file can be overwhelming, making it challenging to draw meaningful insights. This is where data visualization comes into play, as it transforms raw data into visual formats like maps or graphs, enhancing comprehension. Streamlit visualization harnesses this power, allowing users to easily interpret complex datasets through engaging visual contexts.

- st.pyplot(): This function is used to display a matplotlib.pyplot figure.

- st.line_chart(): Used to display a line chart.

Figure 20: Rendering a line chart with Streamlit

- st.bar_chart(): Used to display bar charts.

Figure 21: Bar graph display with Streamlit

- st.area_chart(): Used to display an area chart.

Figure 22: Displaying an area chart with Streamlit

- st.altair_chart(): Used to display Altair charts.
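A minimal sketch of the built-in chart functions is given below, using random data with two columns named x and y. The helper names and the use of Python's `random` module to generate the samples are assumptions; Streamlit and pandas are imported only inside the rendering function.

```python
import random

def random_table(rows=10, columns=("x", "y"), seed=0):
    """Build a column-oriented table of Gaussian samples."""
    rng = random.Random(seed)
    return {name: [rng.gauss(0, 1) for _ in range(rows)] for name in columns}

def render_charts():
    import pandas as pd
    import streamlit as st

    df = pd.DataFrame(random_table())
    st.line_chart(df)   # one line per column
    st.bar_chart(df)    # grouped bars per row index
    st.area_chart(df)   # filled areas per column
```

Calling `render_charts()` at the end of a script run with `streamlit run` would draw the three charts one below the other.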

Title	Data mining in e-commerce - Building anime recommendation system
University	UEH University
Major	Information Technology
Type	Report
Year of publication	2024
City	Ho Chi Minh City
