A course recommendation system built with Apache Nutch, Apache Solr, and NLP. The system uses Apache Nutch to crawl data and indexes it in Apache Solr. The metadata from Apache Solr is used as the dataset, and NLP is applied to generate the recommendations.
1 Introduction: Online Course Search Engine
2 Crawling by Apache Nutch
3 Indexing in Apache Solr
1 Introduction
Project domain: Online courses
URL: https://www.coursera.org/
Concept map:
1 Introduction: Scenario flow chart
(Flow charts: the normal query scenario, and the scenario where the query words were not trained in the model.)
1 Introduction: Recommendation architecture
(Architecture diagram: the user's query goes through query processing (correct, split), e.g. "6 python computer vision", and the query parser; the crawler feeds the document index; topic modeling produces topics; semantic similarity scores between the query and documents are calculated and used for ranking.)
Query processing (correct, split)
Input: keywords. For example: Human balance estimation, journal, publisher, affiliation
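The "correct, split" step can be sketched in plain Python. This is an illustrative stand-in, not the project's actual code: the function names and the small vocabulary are assumptions, and `difflib` plays the role of whatever spell-correction the project uses.

```python
import difflib

# Toy vocabulary of trained words; the real project uses the model's vocabulary.
VOCAB = ["human", "balance", "estimation", "journal", "publisher", "affiliation"]

def correct_word(word, vocab=VOCAB):
    """Snap a possibly misspelled word to the closest known vocabulary word."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=0.6)
    return matches[0] if matches else None

def process_query(query):
    """Split the raw query into words and correct each one against the vocabulary.

    Returns (corrected_words, unknown_words); unknown words could not be
    matched to anything the model was trained on.
    """
    corrected, unknown = [], []
    for word in query.split():
        fixed = correct_word(word)
        (corrected if fixed else unknown).append(fixed or word)
    return corrected, unknown
```

For example, `process_query("humann balance estimatin")` corrects both misspellings and reports no unknown words.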
2 Crawling by Apache Nutch
• The seed.txt file:
https://www.coursera.org
• The regex-urlfilter.txt file includes:
+^https://www.coursera.org/
+^https://in.coursera.org/
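In Nutch's regex-urlfilter, a line starting with `+` accepts URLs matching the regex and a line starting with `-` rejects them. A minimal Python sketch of how the two accept rules above admit or reject URLs (simplified: real Nutch applies the rules in file order and also supports `-` exclusion rules; the dots are escaped here, which is slightly stricter than the original patterns):

```python
import re

# The two accept rules from regex-urlfilter.txt above (leading "+" means accept).
ACCEPT_PATTERNS = [
    re.compile(r"^https://www\.coursera\.org/"),
    re.compile(r"^https://in\.coursera\.org/"),
]

def url_passes_filter(url):
    """Return True if any accept pattern matches the start of the URL."""
    return any(p.match(url) for p in ACCEPT_PATTERNS)
```

So the crawl stays inside the Coursera domains and ignores everything else it discovers.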
2 Crawling by Apache Nutch
+ Open a Cygwin terminal and go to {nutch_home}:
Crawl: bin/crawl -i -s urls crawl 3
Dump to file:
• bin/nutch readdb crawl/crawldb -stats > stats.txt
• bin/nutch readdb crawl/crawldb/ -dump db
• bin/nutch readlinkdb crawl/linkdb/ -dump link
• bin/nutch readseg -dump crawl/segments/{segment_folder} crawl/segments/{segment_folder}_dump -nocontent -nofetch -noparse -noparsedata -noparsetext
1) CrawlDb - contains all the links parsed by Nutch
2) LinkDb - contains, for each URL, the outgoing and incoming URLs
3) Segments - contain the lists of URLs to be crawled or being crawled
3 Indexing in Solr
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/{segment_folder}/ -filter -normalize -deleteGone
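Once the documents are indexed, their metadata can be fetched back from Solr over HTTP for the recommendation step. A hypothetical sketch of building a Solr select URL in Python (the core name `nutch`, the host, and the `content` field are assumptions about the local setup):

```python
from urllib.parse import urlencode

# Assumed local Solr core; adjust to the actual core created for Nutch.
SOLR_BASE = "http://localhost:8983/solr/nutch/select"

def build_solr_query(keywords, rows=10):
    """Build a Solr select URL searching the indexed page content."""
    params = {"q": f"content:({keywords})", "rows": rows, "wt": "json"}
    return SOLR_BASE + "?" + urlencode(params)
```

The resulting URL can then be fetched with any HTTP client to retrieve the stored metadata as JSON.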
4 Dataset
(Screenshots of crawled course pages: most fields can be crawled; some data is missing, but it can be ignored.)
5 Functions
Topic modelling: Top2Vec, which leverages joint document and word semantic embedding to find topic vectors.
• Top2Vec(documents, speed, workers)
documents: the input corpus; should be a list of strings.
speed: determines how long the model takes to train. The 'fast-learn' option is the fastest and will generate the lowest quality vectors. The 'learn' option will learn better quality vectors but take a longer time to train. The 'deep-learn' option will learn the best quality vectors but will take significant time to train.
workers: the number of worker threads to be used in training the model. A larger number will lead to faster training.
5 Functions
• search_documents_by_keywords(keywords, num_docs, keywords_neg=None, return_documents=True, use_index=False, ef=None)
Semantic search of documents using keywords. The documents most semantically similar to the combination of the keywords will be returned. If negative keywords are provided, the documents will be semantically dissimilar to those words. Too many keywords or certain combinations of words may give strange results. This method finds an average vector (negative keywords are subtracted) of all the keyword vectors and returns the documents closest to the resulting vector.
Parameters
• keywords (List of str) – list of positive keywords used to search for semantically similar documents
• keywords_neg (List of str, optional) – list of negative keywords used to search for semantically dissimilar documents
• num_docs (int) – number of documents to return
• return_documents (bool, optional, default True) – determines whether the documents will be returned. If they were not saved in the model, they will also not be returned.
Returns: documents, doc_scores, doc_ids
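The averaging-and-ranking idea behind search_documents_by_keywords can be sketched in plain Python. This is a toy illustration with 2-D vectors and cosine similarity, not Top2Vec's implementation; the real method uses the model's learned word and document embeddings. Summing the keyword vectors is used instead of averaging, which gives the same cosine ranking.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def search_by_keywords(word_vectors, doc_vectors, keywords, num_docs, keywords_neg=()):
    """Combine keyword vectors (negatives subtracted) and rank documents by cosine."""
    dims = range(len(next(iter(word_vectors.values()))))
    query = [
        sum(word_vectors[w][d] for w in keywords)
        - sum(word_vectors[w][d] for w in keywords_neg)
        for d in dims
    ]
    scored = sorted(
        ((cosine(query, vec), doc_id) for doc_id, vec in doc_vectors.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:num_docs]]
```

With a "python"-like query vector, a python-heavy document ranks above a sport-heavy one, which is the behavior the slide describes.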
5 Functions
• get_recommendation(query)
-> Split the query
-> Correct the query
-> Calculate the score between the query and the documents
-> Display the result: "number" courses (title, url, score and description)
query: consists of [number (str or int)] + keywords
For example: 7 pyt learning -> 7 python learning
• Input: the number can be str or int (one or 1, two or 2)
• If a keyword was not trained in the model, the result cannot be displayed
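The parsing part of the get_recommendation flow above can be sketched as follows. The number-word mapping, vocabulary, and correction table are illustrative assumptions; the real project corrects and scores queries against the trained Top2Vec model.

```python
# Illustrative stand-ins for the model's vocabulary and spell correction.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "seven": 7}
VOCAB = {"python", "learning", "machine"}
CORRECTIONS = {"pyt": "python"}  # toy spell-correction table

def parse_query(query):
    """Split '7 pyt learning' into (7, corrected keywords, untrained words).

    The leading token may be a digit ('7') or a number word ('seven');
    words not in the vocabulary are reported as unknown, since untrained
    words cannot produce a result.
    """
    first, *rest = query.split()
    number = int(first) if first.isdigit() else NUMBER_WORDS[first.lower()]
    keywords, unknown = [], []
    for word in rest:
        word = CORRECTIONS.get(word.lower(), word.lower())
        (keywords if word in VOCAB else unknown).append(word)
    return number, keywords, unknown
```

This reproduces the slide's example: "7 pyt learning" becomes 7 results for "python learning", while an untrained word like "sport" is flagged so the UI can ask the user to remove it.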
6 UI design
(Search page mock-up: a "Course Search" box with the prompt "Please enter the query (number + keywords)".)
Showing "number" of results
Course Search: 3 machine learning python
1. Machine learning, Stanford University (4.5, Advanced)
Description: Build ML models with NumPy & scikit-learn, build & train supervised models for prediction & binary classification tasks (linear, logistic regression)…
2. IBM Machine learning, IBM (4.4, Intermediate)
Description: Build ML models with NumPy & scikit-learn, build & train supervised models for prediction & binary classification tasks (linear, logistic regression)…
3. Introduction to Machine learning, Stanford University
Description: Build ML models with NumPy & scikit-learn, build & train supervised models for prediction & binary classification tasks (linear, logistic regression)…
Showing "0" results
Course Search: 3 machine learning python sport pyth game
"Sport" cannot be found. Please remove it or use another word!
7 Results
(Screenshots of result pages.)
8 Discussion
What did we learn from the project?
+ Crawling data with Apache Nutch
+ Indexing in Apache Solr
+ Basic knowledge of semantic search and NLP
+ The process of completing a project
What are the challenges?
Because of the time limitation:
+ The connection between the front end and the back end is not complete
+ Filter features were not added: filtering courses by rating and level
+ More data needs to be collected to extend the topics
THANK YOU FOR LISTENING!
THANK YOU, PROFESSOR!