The system employs content-based filtering as its primary model, leveraging this method to analyze and match data from job descriptions and applicant profiles based on specific skills and experience.
INTRODUCTION
Problem Statement
In April, U.S. employers reported 64,789 job cuts, a significant 28% decrease from March's 90,309 cuts and a 3.3% drop from April 2023's 66,995 cuts, according to Challenger, Gray & Christmas, Inc. This year, a total of 322,043 job cuts have been announced, reflecting a 4.6% decrease compared to 337,411 cuts at the same time last year. Despite this decline in job cuts, the job market remains challenging, especially for recent graduates facing heightened competition and fewer job openings.
In today's competitive job market, innovative solutions are crucial for job seekers, particularly for new graduates who often face challenges due to limited experience and professional networks. My research focuses on "Using Machine Learning and Deep Learning Techniques to Provide Job Advice to New Graduates," aiming to harness advanced technologies for personalized job recommendations. This study seeks to empower new graduates to navigate the job market more effectively and secure suitable employment opportunities in a challenging landscape.
The matrix-based content filtering algorithm effectively analyzes extensive data from job postings, resumes, and industry trends to uncover patterns that match candidates with suitable positions based on their skills, interests, and career aspirations. This personalized approach enhances the job search experience, making it more efficient and tailored to the individual needs of graduates. By leveraging these technologies, the study aims to minimize the time and effort required for new graduates to secure employment, thereby increasing their chances of success in a competitive job market. The research findings could provide valuable insights and tools for educators, career counselors, and job placement services, ultimately supporting new graduates in their career pursuits.
Figure 1.1: Flowchart of the project
The process diagram features directional arrows that outline the sequential steps involved in the data processing of CVs and job listings, ultimately leading to job recommendations tailored to specific CVs based on compatibility. A comprehensive analysis of each step in this process follows.
The initial phase involves gathering data from applicants' CVs and corresponding job descriptions, which is essential for ensuring the completeness and accuracy of the input data. This data collection must encompass comprehensive details, including the candidate's skills, experience, education, and the specific requirements outlined in the job descriptions.
To optimize the matching process between CVs and job descriptions, raw data is processed using Natural Language Processing (NLP) techniques, including Part-of-Speech (POS) tagging, Named Entity Recognition (NER), tokenization, keyword extraction, and topic modeling. This involves cleaning the data, standardizing formats, and identifying key phrases, which significantly improves the efficiency and accuracy of the matching algorithms.
The core function of the system involves matching applicants' CVs with job descriptions using predefined criteria, employing content-based filtering to assess compatibility. The effectiveness of this matching process relies heavily on the quality of the data and the sophistication of the algorithms used.
Recommend five jobs for a CV based on matching:
The final step in the recruitment process involves generating job recommendations for applicants based on the matching results, presenting a clear and concise overview of the position, company, and seniority level to facilitate informed decision-making.
Aims, Objectives and Contributions of the thesis
This thesis aims to develop a comprehensive job recommendation system tailored for recent graduates in the United States. Utilizing LinkedIn data, the system will analyze profiles of US residents and job descriptions from actively hiring companies. By employing content-based filtering, it will evaluate textual similarities between candidate profiles and job descriptions to deliver highly relevant job matches.
In addition, this thesis will employ collaborative filtering to analyze user behavior and preferences, enabling personalized job recommendations. The ultimate goal is to develop a comprehensive career guidance system that connects job seekers with appropriate opportunities while offering valuable insights into industry trends, essential skills, and career development pathways.
The primary contributions of this thesis are highlighted through the following points:
Utilize extractive techniques to process raw data supplied by the CiaoLINK company.
Develop a matching algorithm to offer the most appropriate job recommendations for job seekers.
Significance and Motivation of the thesis
The significance and motivation of this thesis lie in addressing the challenges faced by job seekers, particularly recent graduates, in navigating the competitive job market in the United States. The job search process can be both dynamic and overwhelming, highlighting the essential need for a system that effectively streamlines and improves the matching of candidates with appropriate job opportunities.
This thesis utilizes LinkedIn data to create a data-driven approach for personalized job recommendations, enhancing accuracy through content-based and collaborative filtering. It aims to empower job seekers by helping them make informed career decisions, grasp industry trends, and identify key skills for career growth. Ultimately, this research aspires to develop a valuable tool that promotes the professional success and satisfaction of job seekers, contributing to a more efficient employment ecosystem.
THEORETICAL BACKGROUND
Natural Language Processing (Text Processing)
Natural language processing (NLP) is a vital branch of artificial intelligence that utilizes machine learning to help computers understand and interact with human language. By combining computational linguistics with statistical modeling and deep learning, NLP empowers devices to recognize, interpret, and produce text and speech. This technology has significantly advanced generative AI, improving the communication capabilities of large language models and enabling image generation systems to process commands effectively. Today, NLP is integrated into everyday applications such as search engines, chatbots, voice-operated GPS, and digital assistants, as well as enterprise solutions that enhance business operations and productivity.
2.1.1 Term Frequency - Inverse Document Frequency (TF-IDF)
TF-IDF, short for Term Frequency-Inverse Document Frequency, measures the importance of a word within a text by utilizing statistical analysis. This metric highlights the significance of a term not only in the specific document but also in relation to a broader collection of texts.
Languages often feature certain terms that frequently co-occur with other words, highlighting a subset of vocabulary that is used more regularly To address this phenomenon, it is essential to employ a method that balances the significance of these terms and smooths the frequency distribution.
Term Frequency (TF) quantifies how often a specific term appears within a document. To ensure fairness and avoid skewing results due to longer documents, the raw count of a term is normalized by the total number of terms in the document, allowing for a more accurate assessment of a term's significance. The term frequency of term t in document d is expressed as

$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

where $f_{t,d}$ is the raw count of term t in document d and the denominator is the total number of terms in that document.
Inverse Document Frequency (IDF) quantifies the significance of a term within a corpus by calculating the logarithm of the total number of documents divided by the number of documents that include the term. This calculation often incorporates smoothing techniques to prevent division by zero, ensuring accurate results:

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

where N is the total number of documents in the corpus D, and $|\{d \in D : t \in d\}|$ is the number of documents containing the term t. The overall TF-IDF weight is the product $\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$.
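The following is a minimal sketch of these formulas in practice using scikit-learn's TfidfVectorizer; the three example documents are illustrative, not drawn from the thesis datasets.

```python
# Illustrative TF-IDF computation with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "software engineer with python experience",
    "data analyst experienced in python and sql",
    "registered nurse seeking hospital position",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)  # shape: (3 documents, V terms)

# Terms appearing in several documents (e.g. "python") receive a lower
# IDF weight than terms unique to a single document (e.g. "nurse").
for term, idx in sorted(vectorizer.vocabulary_.items()):
    print(f"{term}: idf = {vectorizer.idf_[idx]:.3f}")
```

Note that scikit-learn applies smoothing by default (smooth_idf=True), corresponding to the smoothed variant of the formula above.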
2.1.2 Part-of-Speech tagging (POS tagging)
Part-of-Speech (PoS) tagging is one of the fundamental tasks in Natural Language Processing (NLP). It is the process of categorizing each word in a text according to its grammatical role, such as noun, verb, adjective, or adverb. This classification enriches the text with additional syntactic and semantic information, which aids in comprehending the structure and meaning of sentences and allows machines to analyze and interpret human language with greater accuracy.
PoS tagging is essential in NLP applications, enhancing tasks such as machine translation, named entity recognition, and information extraction. This technique effectively addresses ambiguities in words with multiple meanings and reveals the grammatical structure of sentences.
Default tagging is a basic step in part-of-speech (PoS) tagging, implemented through the DefaultTagger class. This class requires a single argument, 'tag', to define the PoS tag to be applied, such as 'NN' for singular nouns. Because the default tagger assigns the same tag to every token, the most frequently occurring PoS tag, typically the noun tag, is a popular choice.
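As a brief illustration (assuming NLTK is installed), the DefaultTagger below tags every token as a singular noun:

```python
# NLTK's DefaultTagger assigns one fixed PoS tag to every token.
from nltk.tag import DefaultTagger

tagger = DefaultTagger('NN')
print(tagger.tag(['The', 'quick', 'brown', 'fox']))
# [('The', 'NN'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN')]
```

In practice, a default tagger is often used as a fallback ("backoff") behind more accurate taggers.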
2.1.3 Named Entity Recognition (NER)
Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that identifies and classifies phrases in text into specific categories, including names of people, organizations, locations, dates, product types, and brands. The insights gained from NER facilitate the development of advanced applications such as chatbots, question-answering systems, and search engines.
For example, given the text "John Doe is a software engineer at Google with expertise in programming and data analysis, residing in New York," NER categorizes the key elements as follows: organization (ORG) - Google; person (PERSON) - John Doe; geographic location (GPE) - New York; and interests (INTEREST) - programming and data analysis.
The rule-based approach to Named Entity Recognition (NER) utilizes a collection of predefined or automatically generated rules to identify entities in text. Each token is characterized by a set of features, which are then compared to these rules. When a rule matches, an extraction action is executed. Typically, a rule comprises a pattern, often represented as a regular expression, and a corresponding action that is activated upon a match.
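A minimal sketch of this rule-based approach, using spaCy's EntityRuler, is shown below; the patterns are illustrative examples, not rules taken from the thesis.

```python
# Rule-based NER: each pattern pairs a label with a token or phrase pattern.
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Google"},
    {"label": "GPE", "pattern": "New York"},
    {"label": "INTEREST", "pattern": [{"LOWER": "data"}, {"LOWER": "analysis"}]},
])

doc = nlp("Google is hiring in New York for data analysis roles.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Google', 'ORG'), ('New York', 'GPE'), ('data analysis', 'INTEREST')]
```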
Content-based Filtering
Content-Based Filtering is a recommendation technique that proposes items to users by evaluating the attributes of those items alongside the user's preferences. In job recommendations, this method entails examining the characteristics of job descriptions and aligning them with the user's profile features, including skills, education, and work experience.
Content-Based Filtering focuses on recommending items that are similar to those a user has previously engaged with. This method involves creating a user profile based on their interactions and comparing new items to this established profile to generate relevant suggestions.
Cosine similarity is a mathematical measure that quantifies the cosine of the angle between two vectors, specifically the vectors representing a job description and a user profile. The cosine similarity between vectors A and B is calculated as

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
In this context, A represents the feature vector of the job description, while B denotes the feature vector of the user profile. A greater cosine similarity between these vectors signifies a stronger alignment between the job description and the user's preferences.
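As a small, self-contained sketch (the two texts are invented for illustration), the formula above can be computed over TF-IDF vectors as follows:

```python
# Cosine similarity between a job description and a user profile.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

job_description = "python developer with machine learning experience"
user_profile = "recent graduate skilled in python and machine learning"

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([job_description, user_profile])

# A score near 1 indicates strong alignment; near 0, little overlap.
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"cosine similarity: {score:.3f}")
```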
Related Works
The article "A Machine Learning Approach to Career Path Choice for Information Technology Graduates" by H Al-Dossari et al presents CareerRec, a machine learning system designed to recommend career paths for IT graduates in Saudi Arabia based on their skills The study analyzes data from 2,255 IT professionals and evaluates five different machine learning models Key strengths of the research include its focus on a significant issue, the utilization of extensive real-world data, and the comparison of multiple models However, the findings are limited by a geographical focus that restricts generalizability, and the dataset may not reflect global diversity Additionally, the study lacks a thorough discussion on potential data biases and includes minor issues such as unclear feature selection and grammatical errors.
The "linked-eed" GitHub repository, created by Pooja-Bhojwani, focuses on building an advanced job recommender system This innovative tool extracts user skills from LinkedIn and job requirements from Indeed, utilizing Jaccard Similarity to effectively match users with the most suitable job opportunities.
The project focuses on extracting and processing data from LinkedIn and Indeed to create a user-friendly web interface that helps individuals find jobs tailored to their skills. It also recommends enhancements, including the exploration of additional LinkedIn sections to improve job recommendations.
The "NLP-Job-Recommendation" GitHub repository by hariharan1412 showcases an innovative job recommendation system that utilizes natural language processing (NLP) This system assesses user profiles through a trained NLP model, providing job suggestions that align with the user's skills and the specific requirements of available positions Additionally, job listings are gathered in real-time from reputable websites using Selenium and are organized in a database for efficient access.
The GitHub repository "Job-Recommendation-PJFNN" by doslim features implementations of baseline models and the Person-Job Fit Neural Network (PJFNN) using the Kaggle Job Recommendation Challenge dataset It includes Jupyter notebooks for dataset construction, evaluation of baseline models, and implementation of the PJFNN model, utilizing libraries such as PyTorch, NLTK, and Gensim The repository presents comprehensive experimental results for two key tasks: person-job fit and job recommendation, demonstrating that the PJFNN model significantly outperforms the baseline models in terms of performance metrics.
DATASETS & EXPLORATORY DATA ANALYSIS
Dataset
During my internship at CiaoLINK, I was fortunate to gain access to workforce data, enabling me to analyze two key datasets: the Curriculum Vitae (CVs) of active LinkedIn users and job listings from companies recruiting on LinkedIn. These datasets, sourced from LinkedIn profiles and postings in the United States, provided a valuable foundation for my research on job recommendation systems.
Table 2.1: About the cv.csv dataset
This table outlines the structure of the LinkedIn profile dataset, featuring key columns such as public_id, which stores the unique IDs of users, and full_name, capturing the names of individuals. The headline column reflects the job titles listed on their profiles, while location specifies the city, state, and country of residence. Additionally, the summary column presents the main ideas from each user's CV, complemented by a label_summary that categorizes the summaries. The salary column indicates the desired salary of each user, and extracted_criteria details their job search preferences.
Table 2.2: About the job.csv dataset
This table outlines key data points related to job postings on LinkedIn, including the title of the position, company name, location, salary offered, job function, seniority level, employment type, and the industries in which the company operates. It also highlights the job description and the specific recruitment criteria set by each company. These elements are essential for candidates seeking to understand the opportunities available and the requirements for potential employment.
Exploratory Data Analysis
2.2.1 About the cv.csv dataset
First, let us take an overview of the dataset, which showcases individuals in the labor force who have uploaded their resumes to LinkedIn in pursuit of job opportunities.
Figure 2.2: Top 20 Most Common Job Titles
The bar chart illustrates the 20 most common job titles among resume uploads, with "Software Engineer" leading as the most frequently mentioned title, highlighting the prominence of software engineering professionals. Following closely are "Project Manager" and "Software Developer," indicating a significant presence of roles in project management and software development.
The prominence of "Student" and "Registered Nurse" among the leading job titles underscores the diverse professional backgrounds of job seekers The term "Student" reflects the significant number of individuals in educational programs who are eager to explore employment opportunities.
"Registered Nurse" reflects the significant number of healthcare professionals engaging in job searches Other notable job titles include "Business Development Manager,"
"Graphic Designer," "Chartered Accountant," and "Lawyer," indicating a broad spectrum of industries and functions represented in the dataset
The presence of titles such as "Account Manager," "General Manager," and "Operations Manager" highlights the significant number of managerial and leadership roles among LinkedIn users. Additionally, designations like "Retired" and "Entrepreneur" reflect the diverse career stages and entrepreneurial endeavors of job seekers. This data emphasizes the varied professional landscape and the extensive range of expertise and experience levels represented by individuals uploading their resumes to LinkedIn.
Figure 2.3: Top 20 Locations of Individuals
The bar chart illustrates the most common locations where individuals reside and work, based on uploaded resumes. The United Kingdom leads the list, followed by the United States, indicating a significant concentration of professionals pursuing job opportunities in these nations. This trend may be linked to their large populations and dynamic job markets.
Indian cities like Mumbai, Pune, and Bengaluru are key players in the job market, showcasing a substantial number of job seekers. This trend underscores the swift growth of India's employment opportunities, particularly in major urban centers recognized for their vibrant tech and business sectors. The presence of various Indian cities emphasizes the nation's diverse and evolving employment landscape.
New York City and its metropolitan area are recognized for their vibrant and competitive job markets. Key African cities such as Johannesburg, Lagos, and Nairobi also demonstrate significant job-seeking activity. In addition, major U.S. cities like Boston, Washington, DC, and Los Angeles, along with London in the UK, highlight the international scope of job searches. This data illustrates the geographical distribution of job seekers, showcasing the global character of the labor market.
Recent data indicates that the United Kingdom and the United States rank as the leading countries with the largest populations of working professionals. The following highlights the most prevalent job titles found in each nation.
Figure 2.4: Top 10 Jobs in the United Kingdom
The pie chart illustrates the top 10 job titles of individuals in the United Kingdom who have uploaded their CVs, showcasing a diverse array of professions. Each title reflects current or past roles, emphasizing the variety of career paths and industries within the UK job market.
Among the top headlines, roles such as "Research Fellow at Open Philanthropy," "Director of Business Development at UK Parking Patrol," and "Associate Client Services Manager at Alpha Group" highlight a significant presence of professionals skilled in research, business development, and client services. Other key positions, such as "Economics Student at the University of Bristol," "Learning and Development Assistant," and "Consulting and Valuation Analyst," reflect a strong interest in educational and consultancy-related professions.
The United Kingdom serves as both the residence and workplace for many individuals, highlighting the concentration of specific job roles within the country. This geographical context reveals the diverse and balanced professional landscape, as evidenced by the equal distribution of opportunities among the top 10 job roles, each representing 10% of the total.
Figure 2.5: Top 10 Jobs in the United States
The pie chart illustrates the top 10 job titles of U.S. individuals who have uploaded their CVs, showcasing a range of professions that reflect the diverse career paths and industries in the American job market. Each title signifies a specific role, whether currently held or previously occupied, emphasizing the variety within the workforce.
The role of "Board Certified Behavior Analyst" is prominent, comprising 18.2% of the total job headlines, highlighting the strong demand for expertise in behavioral analysis Additionally, positions such as "Legal Assistant at The King Firm Recent Tulane Graduate," "Talent Acquisition Specialist," and "School Counselor" each represent 9.1%, showcasing a diverse range of job functions across legal, recruitment, and educational counseling sectors.
Additional titles such as "Sales specialist at Lowes," "Maintenance at En Mi Casita," and roles at the Department of Veterans Affairs highlight various sectors, including retail, maintenance, and government services. With students and recent graduates, such as those from Mississippi State University and Michigan State University pursuing degrees in Human Resources and Labor Relations, there is a clear indication of active job searching among individuals at various stages of their careers.
The United States serves as both the residence and workplace for many individuals, offering insights into the concentration of specific job roles within the country. This geographical context reflects the diverse professional landscape of the U.S., where the equal distribution among the top 10 job roles, each representing 9.1% of the total, highlights the balanced nature of job opportunities available nationwide.
2.2.2 About the job.csv dataset
RESEARCH METHODOLOGY
Using NLP to handle long text
Latent Dirichlet Allocation (LDA) is a widely used statistical model in text data mining and natural language processing (NLP), developed by David Blei, Andrew Ng, and Michael Jordan in 2003. It is one of the most popular techniques for topic modeling in various text sources.
Latent Dirichlet Allocation (LDA) is a generative model designed to uncover hidden themes within a set of documents. This method posits that each document is composed of multiple topics, with each topic being represented by a collection of words. The primary goals of LDA are to identify these underlying themes, analyze their distribution across texts, and examine the distribution of words within each topic.
In the Latent Dirichlet Allocation (LDA) model, the "Topic-Word Distribution" plays a crucial role, as it posits that each topic is associated with a probability distribution across all words in a given vocabulary. This implies that certain words have a higher likelihood of being associated with specific topics, while others have a lower probability. For a vocabulary of V words, each topic k is represented by a probability distribution $\phi_k$, a vector of length V. Each element $\phi_{k,w}$ indicates the probability of word w occurring in topic k, with these probabilities summing to 1: $\sum_{w=1}^{V} \phi_{k,w} = 1$.
After using TfidfVectorizer to convert the cleaned text in the 'summary' column into a matrix of TF-IDF vectors, I use LDA to extract topics from the TF-IDF matrix.
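A condensed sketch of this step is shown below; the two-row DataFrame stands in for the real cv_df, and parameters such as n_components are illustrative rather than the thesis's exact settings. Note that scikit-learn's LatentDirichletAllocation is typically fit on raw counts, but it also accepts the TF-IDF matrix used here, following the text.

```python
# Topic extraction from the 'summary' column: TF-IDF followed by LDA.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

cv_df = pd.DataFrame({"summary": [
    "software engineer experienced in python and data analysis",
    "registered nurse with a background in patient care",
]})

tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(cv_df["summary"])

lda = LatentDirichletAllocation(n_components=2, random_state=42)
topic_distribution = lda.fit_transform(tfidf_matrix)  # one row per CV

# Print the top keywords that characterize each discovered topic.
terms = tfidf.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {k}: {', '.join(top)}")
```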
4.1.2 Using POS Tagging and NER from spaCy
spaCy is a powerful and modern Natural Language Processing (NLP) library, developed to process text in English and many other languages. Designed with high performance and exceptional accuracy, spaCy provides advanced tools for performing NLP tasks such as Tokenization, Part-of-Speech (POS) Tagging, Named Entity Recognition (NER), Dependency Parsing, and Pattern Matching. This library is widely used in various real-world applications such as text analysis, information retrieval, and data extraction from text.
POS tagging identifies tokens as nouns (NOUN) or proper nouns (PROPN) that serve as objects of prepositions (pobj). These tokens are subsequently categorized as "INTEREST" to effectively capture the skills and interests highlighted in the text.
The function effectively identifies and extracts key areas of interest and relevant skills from the text, ensuring these elements are acknowledged and documented. This process enhances the understanding of the subject's professional and personal interests, offering a more comprehensive insight into their profile.
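The extraction rule just described might look like the following sketch; this is a reconstruction of the logic from the text, not the author's exact code, and it assumes the en_core_web_sm model is installed.

```python
# Collect noun/proper-noun tokens that act as objects of prepositions
# (dependency label 'pobj') and tag them as interests.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_interests(text):
    doc = nlp(text)
    return [
        (token.text, "INTEREST")
        for token in doc
        if token.pos_ in ("NOUN", "PROPN") and token.dep_ == "pobj"
    ]

print(extract_interests("Experienced in programming and passionate about data analysis."))
```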
Named Entity Recognition (NER) from the spaCy library is essential for identifying and classifying named entities in text, including organizations (ORG), geopolitical entities (GPE), persons (PERSON), and works of art (WORK_OF_ART). This process facilitates the extraction of specific criteria based on recognized entities, ensuring that organizations, geopolitical entities, or persons are added to a criteria list with their corresponding labels. Additionally, if a work of art is identified and contains the term "intern," it is classified as "INTERNSHIP" and included in the criteria list. This thorough entity recognition enhances the extraction of relevant information, improving the overall comprehension of content in the 'summary' column of the DataFrame cv_df.
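A hedged reconstruction of that criteria-extraction step follows, reusing the nlp pipeline and cv_df from the sketches above:

```python
# Build a criteria list from recognized entities: keep ORG/GPE/PERSON with
# their labels, and relabel WORK_OF_ART entities mentioning "intern".
def extract_criteria(text):
    doc = nlp(text)
    criteria = []
    for ent in doc.ents:
        if ent.label_ in ("ORG", "GPE", "PERSON"):
            criteria.append((ent.text, ent.label_))
        elif ent.label_ == "WORK_OF_ART" and "intern" in ent.text.lower():
            criteria.append((ent.text, "INTERNSHIP"))
    return criteria

cv_df["extracted_criteria"] = cv_df["summary"].apply(extract_criteria)
```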
Using Content-based filtering to build matching model
The process involves identifying and quantifying the significance of words in documents by analyzing their frequency within individual documents and their rarity across the entire document set. Text data from CVs, including 'headline' and 'label', along with job descriptions ('title'), are merged and converted into TF-IDF vectors. This methodology is similarly applied to the 'salary', 'location', and 'extracted_criteria' fields to enhance data analysis.
Cosine similarity is a metric that measures the directional closeness between two vectors, in this case between a CV and a job description, regardless of their magnitude. This index ranges from -1 to 1, where a value of 1 indicates identical direction, -1 indicates completely opposite directions, and 0 signifies no relationship. It is calculated for various field pairs between CVs and job descriptions, including 'headline' + 'label' vs. 'title', 'salary' vs. 'salary', 'location' vs. 'location', and 'extracted_criteria' vs. 'extracted_criteria'.
DataFrame Creation for Similarity Scores
Cosine similarity scores are organized in distinct DataFrames for each field, indexed by candidate names and job indices. This structure facilitates straightforward access and manipulation of similarity data, allowing for efficient sorting and filtering of jobs according to their relevance to individual candidates.
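One way such a per-field DataFrame might be built is sketched below; the field_similarity helper is hypothetical, while the column names ('headline', 'label_summary', 'title') follow the text.

```python
# Build a candidates-by-jobs similarity DataFrame for one field pair.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def field_similarity(cv_texts, job_texts, cv_index, job_index):
    vectorizer = TfidfVectorizer()
    # Fit on both corpora so CV and job vectors share one vocabulary.
    matrix = vectorizer.fit_transform(list(cv_texts) + list(job_texts))
    cv_vecs, job_vecs = matrix[:len(cv_texts)], matrix[len(cv_texts):]
    return pd.DataFrame(cosine_similarity(cv_vecs, job_vecs),
                        index=cv_index, columns=job_index)

# e.g. headline + label vs. title, indexed by candidate names:
# sim_title = field_similarity(cv_df["headline"] + " " + cv_df["label_summary"],
#                              job_df["title"], cv_df["full_name"], job_df.index)
```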
The recommend_jobs function is used to recommend jobs for a specific candidate based on the calculated similarities This function performs the following steps:
Extract Similarity Scores: The function retrieves similarity scores for the specified candidate from the similarity DataFrames
Combine Similarity Scores: The function combines the similarity scores from different fields to create an overall similarity score (Overall_Cosine_Score)
Sort and Select Top Jobs: The jobs are sorted by the overall similarity score, and the top n jobs are returned as recommendations
The recommend_jobs function delivers tailored job suggestions by analyzing the specific content of each CV, ensuring that candidates receive the most relevant job opportunities aligned with their unique features and requirements.
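The following sketch mirrors the three steps listed above; the equal-weight sum used for Overall_Cosine_Score is an assumption, since the text does not specify the weighting.

```python
# recommend_jobs: combine per-field similarity scores and return top-n jobs.
def recommend_jobs(candidate, similarity_dfs, job_df, n=5):
    # 1. Extract this candidate's scores from each per-field DataFrame.
    scores = [df.loc[candidate] for df in similarity_dfs]
    # 2. Combine them into one overall score per job (equal weights assumed).
    overall = sum(scores)
    # 3. Sort by the overall score and return the top-n jobs.
    top = overall.sort_values(ascending=False).head(n)
    result = job_df.loc[top.index].copy()
    result["Overall_Cosine_Score"] = top.values
    return result
```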
Data Collection
CiaoLINK Data is an innovative project in development by CiaoLINK, where I had the chance to work during my internship. I utilized data from CiaoLINK's American labor database, which is sourced from LinkedIn, a popular platform for job searching and professional networking in the United States.
The Full World Wide dataset features an impressive 700 million profiles, including both individual and corporate information. This extensive database is updated monthly, providing users with the latest and most accurate data. For my job recommendation project, I extracted 25,000 CV samples and 25,000 job listings from companies actively seeking to hire new employees.
Data Pre-processing
Figure 5.1: Data Pre-processing flowchart
The chart illustrates the process of textual data processing, starting with raw, unprocessed text collected from various sources. Initially, it is crucial to identify and rectify issues such as missing information and formatting errors. The data cleaning phase follows, which involves removing special characters, correcting spelling mistakes, and formatting the data appropriately for further use. Next, the cleaned data is tokenized into smaller units known as tokens, with each token receiving a part-of-speech (POS) tag to clarify its grammatical role. Named Entity Recognition (NER) is also employed to identify and categorize entities like names, locations, and organizations. Ultimately, the refined data provides a solid foundation for advanced tasks, including natural language analysis, information extraction, and predictive modeling.
Each step will be explained in detail below:
In my research, I utilized the original data from LinkedIn, concentrating on extracting the essential information. The primary objective of data cleaning involved removing all NULL values and special characters. I specifically focused on enhancing the "Location" and "Summary" columns, as these are vital for training an NLP model and improving matching outcomes.
Figure 5.3: CV Data after translate
An essential step in my Exploratory Data Analysis was translation, as LinkedIn users in the United States come from diverse countries, leading to CVs in multiple languages. To handle this dataset effectively, I utilized the googletrans library. Notably, of the first five CVs analyzed, one was written in Portuguese.
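A minimal sketch of this translation step is shown below; the Portuguese sample sentence is invented, and error handling and batching are omitted.

```python
# Translate non-English CV text to English with googletrans.
from googletrans import Translator

translator = Translator()

def to_english(text):
    return translator.translate(text, dest="en").text

print(to_english("Engenheiro de software com cinco anos de experiência"))
# -> "Software engineer with five years of experience"
```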
Figure 5.4: CV Data after using unidecode
After ensuring all data was standardized to English, I used the Unidecode library to address any discrepancies in the "Location" column, ensuring more accurate information retrieval.
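For illustration, Unidecode folds accented characters into plain ASCII, which keeps location strings consistent:

```python
# Normalize accented location strings to plain ASCII.
from unidecode import unidecode

print(unidecode("São Paulo, Brazil"))    # Sao Paulo, Brazil
print(unidecode("Zürich, Switzerland"))  # Zurich, Switzerland
```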
The "summary" column of the CV file, which includes brief paragraphs summarizing each user's LinkedIn profile, requires the application of NLP techniques to categorize the information into smaller groups focused on job roles, experiences, and other relevant details.
Figure 5.5: Using NLTK for Summary data
Then I used NLTK's stopword-removal feature. Stop words are common words like "is", "the", and "and" that usually do not add significant meaning to a sentence and are often removed during text processing to reduce noise.
Figure 5.6: Using LDA and TfidfVectorizer for Summary data
I utilize Latent Dirichlet Allocation (LDA) to uncover the hidden thematic structure within a collection of documents. This involves fitting the LDA model to a count matrix that reflects the frequency of words across the documents, enabling the model to learn the distribution of topics and words. Following the model fitting, I extract and display the top keywords for each identified topic. Additionally, I employ the Term Frequency-Inverse Document Frequency (TF-IDF) method to assess the significance of words, combining a word's frequency in a document (TF) with its uniqueness across the entire corpus (IDF). Ultimately, this process yields a results table showcasing 10 distinct topics.
Figure 5.7: Named all the topics
Utilizing NLTK's stopwords, scikit-learn's TfidfVectorizer, and LDA, the data summaries have been categorized into 10 unique clusters, with each cluster defined by 10 key terms. Below are the titles and descriptions for each cluster.
Real Estate & Media: This group focuses on real estate and media sectors. Members may have experience in property management, housekeeping, veterinary care, or local media such as radio stations.
Professional Experience & Management: This group relates to professional experience and management skills. Members likely have demonstrated expertise across various industries, with strong management and professional skills.
Technical Skills & Training: This group emphasizes technical skills and training. Members may specialize in technologies like Laravel, PHP, and other technical training areas such as regenerative sciences.
Business & Team Experience: This group is centered around business acumen and teamwork. Members bring experience in business management, team collaboration, and strategic marketing.
Creative Arts & Design: This group focuses on creative arts and design. Members may work in fields such as film, music, graphic design, architecture, visual arts, video production, and photography.
Education & Specialized Fields: This group focuses on education and specialized fields, encompassing individuals who are either pursuing or have attained university degrees in areas such as science, law, and various other specialized disciplines.
Academic Pursuits: This group centers on academic research and pursuits. Members may engage in projects such as BIM analysis, pathology studies, printing technologies, freelance work, and academic pursuits.
Construction & Project Management: This group relates to construction and project management. Members may work in construction, safety management, project control, equipment handling, water management, and mining operations.
Engineering & Technology: This group focuses on engineering and technology. Members specialize in engineering disciplines, software development, computer sciences, data analysis, software development using Python, and technology innovations.
Legal & Law: This group is involved in legal professions and law. Members may practice law, engage in litigation, court proceedings, property law, criminal law, dispute resolution, and arbitration processes.
Figure 5.8: Apply POS and NER for CV
Figure 5.9: Apply POS and NER for JOB
To efficiently extract data from job descriptions, spaCy employs various NLP techniques such as Tokenization, POS Tagging, Named Entity Recognition (NER), Dependency Parsing, and Pattern Matching. When text is inputted into the NLP variable, spaCy automatically tokenizes it and assigns part-of-speech tags to each token while identifying entities like organizations, locations, and names. Furthermore, it analyzes word dependencies to understand relationships between them. By utilizing spaCy's Matcher, I can define specific text patterns relevant to job descriptions, such as "bachelor degree" or "experience in," and apply the matcher to identify corresponding text segments. Ultimately, I extract the necessary criteria from both the matched results and the recognized entities through NER.
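As an illustration, the two phrases mentioned above can be encoded as Matcher patterns like this; the pattern details beyond those phrases are assumptions.

```python
# Pattern matching over job-description text with spaCy's Matcher.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("DEGREE", [[{"LOWER": "bachelor"}, {"LOWER": "degree"}]])
matcher.add("EXPERIENCE", [[{"LOWER": "experience"}, {"LOWER": "in"}, {"POS": "NOUN"}]])

doc = nlp("Requires a bachelor degree and experience in analytics.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)
```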
Recommendation model and Evaluation
The chart demonstrates the compatibility between data from CVs and JOB listings, aiming to identify and link similar information for effective matching. It categorizes key elements such as headline, summary, location, and salary from the CV, alongside title, location, and salary from the JOB file. Arrows illustrate the connections between corresponding fields, ensuring that the headline aligns with the title, and that location and salary match between the CV and JOB files. This automated process enhances the evaluation of applicants' suitability for job positions, leading to more accurate and efficient recruitment recommendations.
5.3.1 Recommendation based on Text String Processing
I provide tailored job recommendations by analyzing applicants' CVs alongside employer job listings through specialized matching functions. These functions assess various criteria, including the alignment of the CV location with the job location, the compatibility of the CV headline with the job title, the correspondence of the salary expectations in the CV with the job salary, and the relevance of the CV label to the job title.
The primary function, recommend_jobs, accepts a CV index and a specified number of job recommendations. It employs matching algorithms to determine if the CV aligns with any available job listings. Upon finding matches, the relevant job details are compiled into a list, from which the top five matches are selected and displayed. In cases where no jobs match the CV, the function returns a message indicating that there are no job descriptions that correspond to the CV.
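A hedged reconstruction of this string-matching variant is sketched below; the exact matching rules (substring checks, equality on salary) are assumptions based on the description above.

```python
# String-matching recommender: a job matches when location, headline/title,
# and salary line up as simple substring or equality checks.
def recommend_jobs(cv, jobs, n=5):
    matches = []
    for job in jobs:
        if (cv["location"].lower() in job["location"].lower()
                and cv["headline"].lower() in job["title"].lower()
                and cv["salary"] == job["salary"]):
            matches.append(job)
    if not matches:
        return "There are no job descriptions that correspond to the CV."
    return matches[:n]  # top five matches by default
```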
Figure 5.11: Result if CV matches Job
Figure 5.12: Result if CV does not match Job
My model excels in speed and efficiency, quickly generating job recommendations from CV data and job listings, making it ideal for situations that demand rapid feedback. It utilizes effective text matching across various columns, ensuring relevant job suggestions based on criteria such as headline, location, salary, and labels for a diverse range of users.
The model exhibits several significant weaknesses despite its strengths. It fails to recommend jobs for every CV, as evidenced by Alex Farkas Worthy's CV, which did not align with any job descriptions, highlighting its limitations in comprehensive matching. Moreover, the job recommendations provided can be repetitive and irrelevant, as seen in the suggestions for Maddy Rotter. Data quality issues, particularly the lack of salary information, further compromise the accuracy of the recommendations. Additionally, the model struggles with large datasets, resulting in performance inefficiencies, indicating a need for optimization to enhance scalability and effectiveness.
5.3.2 Recommendation based on Cosine Similarity
In this model, I suggest job opportunities by analyzing applicants' CVs alongside employer job listings, utilizing cosine similarity for accurate recommendations. To achieve this, I employ TfidfVectorizer to transform textual data into TF-IDF vectors, facilitating effective calculations.
The model generates TF-IDF vectors for pairs of columns, including the CV's headline and label compared with the job title, as well as the salary, location, and extracted criteria from both the CV and job listings. It then calculates the cosine similarity for each pair of corresponding TF-IDF vectors.
Cosine similarity matrices are generated and organized into distinct DataFrames. The recommend_jobs function utilizes the CV index and the desired number of job recommendations to retrieve the applicant's name and compute similarity scores for each column pair. The applicant's details are incorporated into the resulting DataFrame, which ultimately presents information on suitable job opportunities along with their corresponding cosine similarity scores.
In the software development sector, job opportunities exhibit cosine scores for headline labels between 0.0740 and 0.2074, while extracted criteria scores range from 0.5004 to 0.6338. Higher scores in the extracted criteria suggest a strong alignment between the job requirements and the applicant's skills.
Jobs related to sales management show cosine scores for the headline label ranging from 0.2088 to 0.6513 and for extracted criteria from 0.2408 to 0.6638. The high scores indicate good compatibility with the applicant's experience.
Engineering and management jobs in space technology show headline label scores between 0.1995 and 0.5881, indicating varying levels of compatibility. However, the extracted criteria scores, which range from 0 to 0.2392, suggest that there is significant room for improvement in these areas.
The model exhibits high accuracy in evaluating CV and job match quality by generating similarity scores that range from high to low across specific column pairs, such as job title, salary, location, and extracted criteria. This indicates that CVs with higher similarity scores are better aligned with suitable job opportunities. The variation in these scores reflects the model's ability to effectively assess and differentiate the suitability of CVs for various job criteria, recognizing that not all CVs will meet job requirements perfectly. By employing TF-IDF to convert text into vectors and calculating cosine similarity, the model enhances its capability to process and compare textual information, resulting in more precise job recommendations.
The model has notable limitations, particularly the absence of salary information, resulting in all suggested jobs receiving a cosine score of 0.0 for salary, which underscores the dataset's inadequacy and the urgent need for better data collection. While the model emphasizes high similarity scores, particularly Cosine_Score_Headline_Label and Cosine_Score_Extracted_Criteria, it may occasionally recommend jobs that do not align with the job seeker's preferences if these scores are zero, relying instead on Cosine_Score_Salary and Cosine_Score_Location, leading to less accurate suggestions. Additionally, the model's performance can be sluggish, especially when dealing with large datasets, as demonstrated by the time-consuming processing of approximately 50,000 entries. Therefore, optimizing the model for improved handling of missing data, refining comparison methods, and enhancing overall performance is essential for creating a more robust and scalable job recommendation system.
Future Work
The existing job matching system utilizes Content-based Filtering and natural language processing (NLP) to effectively assist graduates in securing appropriate employment. A significant area for future development is the adaptation of this system for the Vietnamese job market, utilizing data from major platforms like TopCV and LinkedIn, which are popular in Vietnam.
To enhance the system's relevance in Vietnam, it is crucial to integrate local data from TopCV and LinkedIn, which provide valuable datasets on job postings and candidate profiles tailored to the Vietnamese market. Analyzing this localized data allows the system to adapt to the unique dynamics of the job market, including preferred skills and industry-specific requirements.
To ensure widespread adoption among Vietnamese graduates, it is essential to create a user-friendly interface that aligns with local preferences. This involves developing a bilingual platform that accommodates both Vietnamese and English languages, facilitating ease of use for a diverse audience. Furthermore, integrating features tailored to the unique needs of Vietnamese users, such as localized job alerts and insights into industry trends, will significantly improve the overall user experience.
Expanding the job recommendation system to Vietnam could significantly enhance the job market by offering graduates personalized job suggestions, thereby improving their employment prospects and decreasing job search durations. This initiative may help lower unemployment rates among graduates and foster economic growth by effectively matching skilled individuals with suitable job opportunities. Such systems are adept at aligning candidates' skills with job requirements, which can lead to increased job satisfaction and productivity. By identifying roles that resonate with a candidate's qualifications and career aspirations, these systems facilitate meaningful career matches and introduce individuals to new career paths they may not have previously considered, promoting their professional development. This approach is especially advantageous for workers transitioning to new industries or roles that utilize their existing skill sets.
In conclusion, this thesis presents an innovative job recommendation system tailored for recent graduates in the United States. Utilizing advanced Natural Language Processing (NLP) techniques, the system analyzes job data and applicant profiles to deliver precise and relevant job recommendations. By combining content-based filtering with collaborative filtering, the system personalizes matches, greatly improving the job-seeking experience for users.
The system leverages NLP techniques to better comprehend complex job descriptions and applicant profiles, allowing for more accurate job recommendations by capturing contextual nuances. However, it faces limitations such as reliance on data quality and quantity, potential biases, and the difficulty of adapting to rapid market changes. Future enhancements should aim to provide comprehensive and current datasets, utilize real-time data feeds, incorporate continuous learning algorithms, and establish improved user feedback mechanisms to refine the model.
The developed job recommendation system marks a significant advancement in employment services, empowering recent graduates to effectively navigate the job market and discover suitable employment opportunities in the United States. This thesis highlights the critical role of machine learning and natural language processing (NLP) in creating innovative job recommendation solutions. Future efforts should aim to address identified limitations, integrate additional data sources, and refine algorithms to enhance the system's responsiveness to the evolving job market dynamics, ultimately improving the accuracy and relevance of job matches for users.
[1]: Roeloffs, M.W. (n.d.). Layoffs Hit 14-Month High In March As Federal Government Leads In Job Cuts. [online] Forbes. Available at: https://www.forbes.com/sites/maryroeloffs/2024/04/04/layoffs-hit-14-month-high-march-2024-federal-government-army-veterans-affairs-job-cuts-tech-finance/
[2]: In April 2024, U.S.-based employers announced 64,789 job cuts, marking a 28% decrease from the previous month and a 3.3% drop from April 2023. Year-to-date, job cuts total 322,043, down 4.6% compared to last year. The auto sector led with 14,373 cuts, primarily due to Tesla's workforce reduction. The education sector followed with 8,092 cuts, reflecting budgetary constraints amid rising labor costs. Healthcare companies announced 5,826 cuts, while the technology sector reported 47,436 job cuts, a significant decrease from the prior year. Cost-cutting remains the primary reason for layoffs, with artificial intelligence contributing to 800 job cuts in April alone. Hiring plans are at their lowest since 2016, with only 9,802 positions announced for April, indicating a cautious outlook for the labor market.
[3]: IBM (2023). What is Natural Language Processing? [online] IBM. Available at: https://www.ibm.com/topics/natural-language-processing
[4]: NLTK (2009). Natural Language Toolkit — NLTK 3.4.4 documentation. [online] Nltk.org. Available at: https://www.nltk.org/
[5]: www.ibm.com (2024). What is Latent Dirichlet allocation | IBM. [online] Available at: https://www.ibm.com/topics/latent-dirichlet-allocation
[6]: viblo.asia (2016). TF-IDF (term frequency – inverse document frequency). [online] Available at: https://viblo.asia/p/tf-idf-term-frequency-inverse-document-frequency-JQVkVZgKkyd [Accessed 16 Jun 2024]
[7]: GeeksforGeeks (2019). NLP | Part of Speech - Default Tagging. [online] GeeksforGeeks. Available at: https://www.geeksforgeeks.org/nlp-part-of-speech-default-tagging/
[8]: viblo.asia (2020). [Seri NLP] Nhận dạng thực thể - NER (phần 1) [NLP series: Named entity recognition - NER (part 1)]. [online] Available at: https://viblo.asia/p/seri-nlp-nhan-dang-thuc-the-ner-phan-1-Ljy5VyWzlra [Accessed 21 Jun 2024]
[9]: Content-based filtering is a recommendation method that relies on the characteristics of items to suggest similar options to users. By analyzing the content of items, this approach personalizes recommendations based on individual preferences and past interactions. It is particularly effective in enhancing user experience by providing tailored suggestions that align with their interests. Understanding content-based filtering is essential for developing efficient recommendation systems in various applications.
[10]: Nughaymish, F.A., Al-Qahtani, Z., Alkahlifah, M. and Alqahtani, A. (2020). A Machine Learning Approach to Career Path Choice for Information Technology Graduates. [online] ResearchGate. Available at: https://www.researchgate.net/publication/347778882_A_Machine_Learning_Approach_to_Career_Path_Choice_for_Information_Technology_Graduates [Accessed 17 June 2024]
[11]: Pooja-Bhojwani (2024). Pooja-Bhojwani/linked-eed. [online] GitHub. Available at: https://github.com/Pooja-Bhojwani/linked-eed [Accessed 21 Jun 2024]
[12]: hariharan1412 (2024). hariharan1412/NLP-Job-Recommendation. [online] GitHub. Available at: https://github.com/hariharan1412/NLP-Job-Recommendation [Accessed 21 Jun 2024]
[13]: Du, S. (2024). doslim/Job-Recommendation-PJFNN. [online] GitHub. Available at: https://github.com/doslim/Job-Recommendation-PJFNN [Accessed 21 Jun 2024]
EXPLANATORY REPORT ON CHANGES/ADDITIONS BASED ON THE DECISION OF GRADUATION THESIS COMMITTEE
FOR UNDERGRADUATE PROGRAMS WITH DEGREE AWARDED BY VIETNAM NATIONAL UNIVERSITY, HANOI
Student’s full name: Hoàng Mạnh Linh
Graduation thesis topic: Using matrix and natural language processing techniques to provide job advice
According to VNU-IS decision no. …… QĐ/TQT, dated … / … / ……., a Graduation Thesis Committee has been established for Bachelor programs at Vietnam National University, Hanoi. The thesis was successfully defended and subsequently revised in the specified sections.
No. | Change/Addition Suggestions by the Committee | Detailed Changes/Additions | Page
1 | Change the thesis name: Using matrix and natural language processing techniques to provide job advice
2 | The thesis may need to be revised to follow the correct formatting guidelines
3 | Split the Experiments and Evaluation in Chapter 4 into a separate Chapter 5