
Course recommendation: Building a course recommendation system


DOCUMENT INFORMATION

Basic information

Title: Online Course Search Engine Content
Advisor: Professor Kim Kyoung-Yun
School: Unknown University
Major: Online Course Search Engine
Type: Graduate project
Format:
Number of pages: 31
File size: 3.6 MB


Content

A course recommendation system built with Apache Nutch, Apache Solr, and NLP. The system uses Apache Nutch to crawl data and indexes it in Apache Solr. Metadata from Apache Solr serves as the dataset, and NLP is applied to produce recommendations.

Page 2

1 Introduction: Online course Search Engine

2 Crawling by Apache Nutch

3 Indexing in Apache Solr

Page 3

1 Introduction

Project domain: Online courses

URLs: https://www.coursera.org/

Concept map: [figure]

Page 4

1 Introduction: Scenario flow chart

Page 5

1 Introduction: Scenario flow chart

Scenario: the query words were not seen when the model was trained

Page 6

1 Introduction: Recommendation architecture

[Architecture diagram. Components: User, Query, Query parser, Query processing (correct, split), Crawler, Document index, Topic modeling, Topics, Calculating semantic similarity score, Ranking. Example query: 6 python computer vision]

Page 7

Query processing (correct, split)

Input: keywords. For example: Human balance estimation

journal publisher affiliation

Page 8

2 Crawling by Apache Nutch

• The seed.txt file

https://www.coursera.org

• The regex-urlfilter.txt file includes:

+^https://www.coursera.org/

+^https://in.coursera.org/
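To illustrate how the two `+` rules above restrict the crawl, here is a minimal Python sketch of first-match-wins filtering, as described in the comments of Nutch's default regex-urlfilter.txt. The function name `url_allowed` and the sample URLs are hypothetical; the sketch only mimics the rule semantics and is not Nutch code.

```python
import re

# Rules copied from the slide: a leading "+" accepts, a leading "-" would reject.
RULES = [
    "+^https://www.coursera.org/",
    "+^https://in.coursera.org/",
]

def url_allowed(url, rules=RULES):
    """Return True if the first rule that matches the URL is an accept rule.

    Assumption: the first matching pattern decides, and a URL that matches
    no pattern at all is ignored.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

# Hypothetical sample URLs, for illustration only.
print(url_allowed("https://www.coursera.org/learn/machine-learning"))
print(url_allowed("https://www.example.com/"))
```

With these rules, only pages under the two Coursera hosts pass the filter; everything else the crawler discovers is dropped.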

Page 9

2 Crawling by Apache Nutch

+ Open Cygwin terminal and go to {nutch_home}:

Crawl: bin/crawl -i -s urls crawl 3

Dump to file:

• bin/nutch readdb crawl/crawldb -stats >stats.txt

• bin/nutch readdb crawl/crawldb/ -dump db

• bin/nutch readlinkdb crawl/linkdb/ -dump link

• bin/nutch readseg -dump crawl/segments/{segment_folder} crawl/segments/{segment_folder}_dump -nocontent -nofetch -noparse -noparsedata -noparsetext

1) CrawlDb: contains all the links known to Nutch, with their fetch status

2) LinkDb: contains, for each URL, the known incoming links and their anchor text

3) Segment: contains the list of URLs to be crawled or being crawled, together with their fetched and parsed data

Page 10

3 Indexing in Solr

bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/{segments_folder}/ -filter -normalize -deleteGone

Page 12

4 Dataset

Figure annotations: “Can crawl” (twice); “Data is missing, but this can be ignored”

Page 13

4 Dataset

Can crawl

Page 14

4 Dataset

Can crawl

Page 15

4 Dataset

Page 16

5 Functions

Topic modelling: Top2Vec, which leverages joint document and word semantic embedding to find topic vectors

• Top2Vec(documents, speed, workers)

documents: Input corpus; should be a list of strings.

speed: Determines how fast the model trains. The ‘fast-learn’ option is the fastest and generates the lowest quality vectors. The ‘learn’ option learns better quality vectors but takes longer to train. The ‘deep-learn’ option learns the best quality vectors but takes significant time to train.

workers: The number of worker threads used to train the model. A larger number leads to faster training.
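As a rough illustration of the three parameters above, the following sketch wraps the Top2Vec constructor in a small helper that checks its arguments first. The helper name `train_topic_model` and its default values are hypothetical; only `Top2Vec(documents, speed=..., workers=...)` comes from the library, and the `top2vec` package is assumed to be installed.

```python
# The three speed options named on the slide.
VALID_SPEEDS = ("fast-learn", "learn", "deep-learn")

def train_topic_model(documents, speed="learn", workers=4):
    """Train a Top2Vec model on a list of strings (hypothetical helper)."""
    if not isinstance(documents, list) or not all(isinstance(d, str) for d in documents):
        raise TypeError("documents should be a list of strings")
    if speed not in VALID_SPEEDS:
        raise ValueError(f"speed must be one of {VALID_SPEEDS}")
    # Imported lazily so the argument checks above run even when the
    # top2vec package is not available.
    from top2vec import Top2Vec
    return Top2Vec(documents, speed=speed, workers=workers)
```

A call might look like `model = train_topic_model(course_descriptions, speed="learn", workers=8)`, where `course_descriptions` is an assumed list of crawled course texts. Training typically needs a reasonably large corpus to find meaningful topics.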

Page 17

5 Functions

• search_documents_by_keywords(keywords, num_docs, keywords_neg=None, return_documents=True, use_index=False, ef=None)

Semantic search of documents using keywords. The documents most semantically similar to the combination of the keywords will be returned. If negative keywords are provided, the documents will be semantically dissimilar to those words. Too many keywords or certain combinations of words may give strange results. This method finds an average vector (negative keywords are subtracted) of all the keyword vectors and returns the documents closest to the resulting vector.

Parameters

• keywords (List of str) – List of positive keywords used to search for semantically similar documents

• keywords_neg (List of str (Optional)) – List of negative keywords used to search for semantically dissimilar documents

• num_docs (int) – Number of documents to return

• return_documents (bool (Optional, default True)) – Determines whether the documents will be returned. If they were not saved in the model, they will not be returned either.

Return: documents, doc_scores, doc_ids
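The averaging step described above can be sketched in plain Python. This is a toy illustration of "average the keyword vectors, subtract the negatives, return the closest documents", not Top2Vec's internal code; the function names, the 3-dimensional vectors, and the document ids are all made up for the example.

```python
import math

def combine_keywords(word_vectors, positive, negative=()):
    """Average the keyword vectors; negative keywords are subtracted."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in positive:
        total = [t + v for t, v in zip(total, word_vectors[word])]
    for word in negative:
        total = [t - v for t, v in zip(total, word_vectors[word])]
    count = len(positive) + len(negative)
    return [t / count for t in total]

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(doc_vectors, query_vector, num_docs):
    """Return the num_docs closest documents as (doc_id, score) pairs."""
    scored = [(cosine(vec, query_vector), doc_id) for doc_id, vec in doc_vectors.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:num_docs]]

# Toy embeddings, invented for illustration only.
word_vectors = {"python": [1.0, 0.0, 0.0], "learning": [0.0, 1.0, 0.0], "game": [0.0, 0.0, 1.0]}
doc_vectors = {"ml-course": [0.7, 0.7, 0.0], "statistics": [0.1, 0.9, 0.1], "game-dev": [0.6, 0.0, 0.8]}

query = combine_keywords(word_vectors, ["python", "learning"], negative=["game"])
for doc_id, score in search(doc_vectors, query, num_docs=3):
    print(doc_id, round(score, 3))
```

In this toy setup the negative keyword pushes the game-related document to the bottom of the ranking, which is the behaviour the description attributes to `keywords_neg`.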

Page 19

5 Functions

• get_recommendation (query)

-> Splitting query

-> Correcting query

-> Calculating score of query and documents

-> Display result: “number” courses (title, url, score and description)

query: [number (str or int)] + keywords

Page 20

For example: 7 pyt learning -> 7 python learning

• Input: the number can be a str or an int (one or 1, two or 2)

• If the keywords were not seen when the model was trained, no result can be displayed
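The splitting and correcting steps above could look roughly like the sketch below. The slides do not say how the correction is implemented, so the use of `difflib`, the 0.6 similarity cutoff, the function names, the number-word table, and the sample vocabulary are all assumptions made for illustration.

```python
import difflib

# Assumption: only a few number words are supported; extend as needed.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def split_query(query):
    """Split a query into (number, keywords); the number may be a digit or a word."""
    tokens = query.lower().split()
    if not tokens:
        raise ValueError("empty query")
    first, keywords = tokens[0], tokens[1:]
    if first.isdigit():
        return int(first), keywords
    if first in NUMBER_WORDS:
        return NUMBER_WORDS[first], keywords
    raise ValueError("query must start with a number")

def correct_keywords(keywords, vocabulary):
    """Map each keyword to the closest vocabulary word; collect words with no match."""
    corrected, unknown = [], []
    for word in keywords:
        if word in vocabulary:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.6)
        if matches:
            corrected.append(matches[0])
        else:
            unknown.append(word)
    return corrected, unknown

def prepare_query(query, vocabulary):
    """Run the split and correct steps; scoring against documents is not covered here."""
    number, keywords = split_query(query)
    corrected, unknown = correct_keywords(keywords, vocabulary)
    return number, corrected, unknown

# Hypothetical vocabulary standing in for the words the model was trained on.
vocabulary = ["python", "learning", "machine", "computer", "vision", "game"]
print(prepare_query("7 pyt learning", vocabulary))
```

Words that cannot be matched to the vocabulary are returned separately, so the caller can report them to the user instead of scoring the query.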

Page 21

6 UI design

Page 22

Please enter the query (number + keywords)

Course Search

Page 23

Showing “number” of results

Course Search: 3 machine learning python

Machine learning

Stanford University

Description: Build ML models with NumPy & scikit-learn, build & train supervised models for prediction & binary classification tasks (linear, logistic regression)…

(4.5)

Advanced

IBM Machine learning

IBM

Description: Build ML models with NumPy & scikit-learn, build & train supervised models for prediction & binary classification tasks (linear, logistic regression)…

(4.4)

Intermediate

Introduction to Machine learning

Stanford University

Description: Build ML models with NumPy & scikit-learn, build & train supervised models for prediction & binary classification tasks (linear, logistic regression)…

Page 24

Showing “0” results

Course Search: 3 machine learning python sport pyth game

“sport” could not be found. Please remove it or use another word!

Page 25

7 Results

Page 28

7 Results

Page 29

7 Results

Page 30

8 Discussion

What did we learn from the project?

+ Crawling data with Apache Nutch

+ Indexing in Apache Solr

+ Basic knowledge of semantic search and NLP

+ The process of completing a project

What are the challenges?

Because of time limitations:

+ The connection between the front end and the back end was not completed

+ Filter features were not added: filtering courses by rating and level

+ More data still needs to be collected to extend the topics

Page 31

THANK YOU

FOR LISTENING!

THANK YOU, PROFESSOR!

Date posted: 15/03/2023, 12:47
