A course recommendation system built with Apache Nutch, Apache Solr, and NLP. The system uses Apache Nutch to crawl data and indexes it in Apache Solr. The metadata from Apache Solr is used as the dataset, and NLP is applied to generate the recommendations.
1 Introduction: Online Course Search Engine
2 Crawling by Apache Nutch
3 Indexing in Apache Solr
1 Introduction
Project domain: Online courses
URL: https://www.coursera.org/
Concept map:
1 Introduction: Scenario flow chart
(Flow charts: the normal query scenario, and the scenario where the query words were not trained in the model.)
1 Introduction: Recommendation architecture
(Architecture diagram: the user's query goes through query processing (correct, split), e.g. "6 python computer vision", and the query parser; the crawler feeds the document index; topic modeling produces topics; semantic similarity scores between the query and documents are calculated and used for ranking.)
Query processing (correct, split)
Input: keywords. For example: Human balance estimation, journal, publisher, affiliation
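The "correct, split" step can be sketched in plain Python. This is an illustrative stand-in, not the project's actual code: the function names and the small vocabulary are assumptions, and `difflib` plays the role of whatever spell-correction the project uses.

```python
import difflib

# Toy vocabulary of trained words; the real project uses the model's vocabulary.
VOCAB = ["human", "balance", "estimation", "journal", "publisher", "affiliation"]

def correct_word(word, vocab=VOCAB):
    """Snap a possibly misspelled word to the closest known vocabulary word."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=0.6)
    return matches[0] if matches else None

def process_query(query):
    """Split the raw query into words and correct each one against the vocabulary.

    Returns (corrected_words, unknown_words); unknown words could not be
    matched to anything the model was trained on.
    """
    corrected, unknown = [], []
    for word in query.split():
        fixed = correct_word(word)
        (corrected if fixed else unknown).append(fixed or word)
    return corrected, unknown
```

For example, `process_query("humann balance estimatin")` corrects both misspellings and reports no unknown words.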
2 Crawling by Apache Nutch
• The seed.txt file:
https://www.coursera.org
• The regex-urlfilter.txt file includes:
+^https://www.coursera.org/
+^https://in.coursera.org/
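In Nutch's regex-urlfilter, a line starting with `+` accepts URLs matching the regex and a line starting with `-` rejects them. A minimal Python sketch of how the two accept rules above admit or reject URLs (simplified: real Nutch applies the rules in file order and also supports `-` exclusion rules; the dots are escaped here, which is slightly stricter than the original patterns):

```python
import re

# The two accept rules from regex-urlfilter.txt above (leading "+" means accept).
ACCEPT_PATTERNS = [
    re.compile(r"^https://www\.coursera\.org/"),
    re.compile(r"^https://in\.coursera\.org/"),
]

def url_passes_filter(url):
    """Return True if any accept pattern matches the start of the URL."""
    return any(p.match(url) for p in ACCEPT_PATTERNS)
```

So the crawl stays inside the Coursera domains and ignores everything else it discovers.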
2 Crawling by Apache Nutch
+ Open a Cygwin terminal and go to {nutch_home}:
Crawl: bin/crawl -i -s urls crawl 3
Dump to file:
• bin/nutch readdb crawl/crawldb -stats > stats.txt
• bin/nutch readdb crawl/crawldb/ -dump db
• bin/nutch readlinkdb crawl/linkdb/ -dump link
• bin/nutch readseg -dump crawl/segments/{segment_folder} crawl/segments/{segment_folder}_dump -nocontent -nofetch -noparse -noparsedata -noparsetext
1) CrawlDb - contains all the links parsed by Nutch
2) LinkDb - contains, for each URL, the outgoing and incoming URLs
3) Segments - contain the lists of URLs to be crawled or being crawled
3 Indexing in Solr
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/{segment_folder}/ -filter -normalize -deleteGone
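Once the documents are indexed, their metadata can be fetched back from Solr over HTTP for the recommendation step. A hypothetical sketch of building a Solr select URL in Python (the core name `nutch`, the host, and the `content` field are assumptions about the local setup):

```python
from urllib.parse import urlencode

# Assumed local Solr core; adjust to the actual core created for Nutch.
SOLR_BASE = "http://localhost:8983/solr/nutch/select"

def build_solr_query(keywords, rows=10):
    """Build a Solr select URL searching the indexed page content."""
    params = {"q": f"content:({keywords})", "rows": rows, "wt": "json"}
    return SOLR_BASE + "?" + urlencode(params)
```

The resulting URL can then be fetched with any HTTP client to retrieve the stored metadata as JSON.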
4 Dataset
(Screenshots of crawled course pages: most fields can be crawled; some data is missing, but it can be ignored.)
5 Functions
Topic modelling: Top2Vec, which leverages joint document and word semantic embedding to find topic vectors.
• Top2Vec(documents, speed, workers)
documents: the input corpus; should be a list of strings.
speed: determines how long the model takes to train. The 'fast-learn' option is the fastest and will generate the lowest quality vectors. The 'learn' option will learn better quality vectors but take a longer time to train. The 'deep-learn' option will learn the best quality vectors but will take significant time to train.
workers: the number of worker threads to be used in training the model. A larger number will lead to faster training.
5 Functions
• search_documents_by_keywords(keywords, num_docs, keywords_neg=None, return_documents=True, use_index=False, ef=None)
Semantic search of documents using keywords. The documents most semantically similar to the combination of the keywords will be returned. If negative keywords are provided, the documents will be semantically dissimilar to those words. Too many keywords or certain combinations of words may give strange results. This method finds an average vector (negative keywords are subtracted) of all the keyword vectors and returns the documents closest to the resulting vector.
Parameters
• keywords (List of str) – list of positive keywords used to search for semantically similar documents
• keywords_neg (List of str, optional) – list of negative keywords used to search for semantically dissimilar documents
• num_docs (int) – number of documents to return
• return_documents (bool, optional, default True) – determines whether the documents will be returned. If they were not saved in the model, they will also not be returned.
Returns: documents, doc_scores, doc_ids
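The averaging-and-ranking idea behind search_documents_by_keywords can be sketched in plain Python. This is a toy illustration with 2-D vectors and cosine similarity, not Top2Vec's implementation; the real method uses the model's learned word and document embeddings. Summing the keyword vectors is used instead of averaging, which gives the same cosine ranking.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def search_by_keywords(word_vectors, doc_vectors, keywords, num_docs, keywords_neg=()):
    """Combine keyword vectors (negatives subtracted) and rank documents by cosine."""
    dims = range(len(next(iter(word_vectors.values()))))
    query = [
        sum(word_vectors[w][d] for w in keywords)
        - sum(word_vectors[w][d] for w in keywords_neg)
        for d in dims
    ]
    scored = sorted(
        ((cosine(query, vec), doc_id) for doc_id, vec in doc_vectors.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:num_docs]]
```

With a "python"-like query vector, a python-heavy document ranks above a sport-heavy one, which is the behavior the slide describes.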
5 Functions
• get_recommendation(query)
-> Split the query
-> Correct the query
-> Calculate the score between the query and the documents
-> Display the result: "number" courses (title, url, score and description)
query: consists of [number (str or int)] + keywords
For example: 7 pyt learning -> 7 python learning
• Input: the number can be str or int (one or 1, two or 2)
• If a keyword was not trained in the model, the result cannot be displayed
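The parsing part of the get_recommendation flow above can be sketched as follows. The number-word mapping, vocabulary, and correction table are illustrative assumptions; the real project corrects and scores queries against the trained Top2Vec model.

```python
# Illustrative stand-ins for the model's vocabulary and spell correction.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "seven": 7}
VOCAB = {"python", "learning", "machine"}
CORRECTIONS = {"pyt": "python"}  # toy spell-correction table

def parse_query(query):
    """Split '7 pyt learning' into (7, corrected keywords, untrained words).

    The leading token may be a digit ('7') or a number word ('seven');
    words not in the vocabulary are reported as unknown, since untrained
    words cannot produce a result.
    """
    first, *rest = query.split()
    number = int(first) if first.isdigit() else NUMBER_WORDS[first.lower()]
    keywords, unknown = [], []
    for word in rest:
        word = CORRECTIONS.get(word.lower(), word.lower())
        (keywords if word in VOCAB else unknown).append(word)
    return number, keywords, unknown
```

This reproduces the slide's example: "7 pyt learning" becomes 7 results for "python learning", while an untrained word like "sport" is flagged so the UI can ask the user to remove it.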
6 UI design
(Search page mock-up: a "Course Search" box with the prompt "Please enter the query (number + keywords)".)
Showing "number" of results
Course Search: 3 machine learning python
1. Machine learning, Stanford University (4.5, Advanced)
Description: Build ML models with NumPy & scikit-learn, build & train supervised models for prediction & binary classification tasks (linear, logistic regression)…
2. IBM Machine learning, IBM (4.4, Intermediate)
Description: Build ML models with NumPy & scikit-learn, build & train supervised models for prediction & binary classification tasks (linear, logistic regression)…
3. Introduction to Machine learning, Stanford University
Description: Build ML models with NumPy & scikit-learn, build & train supervised models for prediction & binary classification tasks (linear, logistic regression)…
Showing "0" results
Course Search: 3 machine learning python sport pyth game
"Sport" cannot be found. Please remove it or use another word!
7 Results
(Screenshots of result pages.)
8 Discussion
What did we learn from the project?
+ Crawling data with Apache Nutch
+ Indexing in Apache Solr
+ Basic knowledge of semantic search and NLP
+ The process of completing a project
What are the challenges?
Because of the time limitation:
+ The connection between the front end and the back end is not complete
+ Filter features were not added: filtering courses by rating and level
+ More data needs to be collected to extend the topics
THANK YOU FOR LISTENING!
THANK YOU, PROFESSOR!