Creating a Powerful Course Search and Recommendation Using Elasticsearch I


Introduction

In today’s digital age, users expect fast and accurate search results when browsing for online courses. Whether they’re looking for programming tutorials, data science classes, or personal development workshops, the ability to quickly find the right course is crucial for enhancing user satisfaction and engagement.

In this tutorial, we’ll walk you through building a robust course search and recommendation engine using Elasticsearch and Django:

  • Explore various search techniques such as fuzzy search, autocomplete, semantic search, and more.
  • Discuss their suitability for course search applications.
  • Implement a step-by-step example to integrate Elasticsearch with a backend framework like Django.
  • Demonstrate how to optimize search results and implement a recommendation system.

Key Techniques for Course Search

To build an effective course search engine, we need to consider various techniques that enhance the search experience. Here are some of the most suitable approaches:

  1. Fuzzy Search:
    • Fuzzy search enables users to find relevant results even when their queries contain typos, spelling errors, or slight variations. This is particularly valuable in educational platforms, where users might not remember exact course titles or spellings.
    • For example, a search for “pyton” should still return courses on “Python”.
  2. Full-Text Search:
    • Full-text search allows users to discover courses by searching through descriptions, titles, and other textual content. This approach enables broad discovery, allowing users to locate courses that match their interests even if they don’t have specific keywords in mind.
  3. Boolean Queries:
    • By combining multiple queries with logical operators (AND, OR, NOT), we can refine search results based on criteria like course level (beginner, intermediate) or category (programming, design).
  4. Proximity Searches:
    • Proximity search identifies phrases where specific words appear near each other within the content.
    • This is helpful when users search for specific topics, like “data analysis tools,” ensuring results include courses where these words appear together in a meaningful way.
  5. Vector Search:
    • Vector search uses machine learning models to understand the semantic meaning of queries and content.
    • Instead of relying solely on keyword matching, this technique uses embeddings (numerical representations of text) to find courses that are contextually and conceptually similar to the query.
    • For instance, a query for “web development basics” could match courses covering “HTML and CSS fundamentals.”
  6. Recommendation Algorithms:
    • We can incorporate recommendation algorithms like:
      • Collaborative Filtering: This suggests courses based on patterns of user behavior (e.g., ‘People who enrolled in this course also took…’).
      • Content-Based Filtering: This recommends courses similar to those the user has previously interacted with, based on attributes like category or difficulty.
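Several of the techniques above correspond to clauses in Elasticsearch's Query DSL. The sketch below builds those query bodies as plain Python dictionaries. The function names are hypothetical, and the field names (name, description, level, is_valid) are assumed to match the course index built later in this tutorial. The more_like_this query is shown only as one possible way to approximate content-based filtering, not as the approach this series settles on.

```python
"""Query DSL bodies for the search techniques listed above (illustrative sketch)."""


def fuzzy_name_query(text):
    # match + fuzziness tolerates small typos, e.g. "pyton" for "Python".
    return {"match": {"name": {"query": text, "fuzziness": "AUTO"}}}


def full_text_query(text):
    # multi_match searches several text fields; "name^2" boosts title matches.
    return {"multi_match": {"query": text, "fields": ["name^2", "description"]}}


def boolean_query(text, level):
    # bool combines clauses: must ~ AND, filter ~ AND without scoring,
    # must_not ~ NOT. The term filter assumes level is mapped as keyword.
    return {
        "bool": {
            "must": [full_text_query(text)],
            "filter": [{"term": {"level": level}}],
            "must_not": [{"term": {"is_valid": False}}],
        }
    }


def proximity_query(phrase, slop=2):
    # match_phrase with slop lets the words sit a few positions apart
    # instead of requiring the exact phrase.
    return {"match_phrase": {"description": {"query": phrase, "slop": slop}}}


def similar_courses_query(course_id, index="courses"):
    # more_like_this finds documents whose text resembles an indexed course.
    return {
        "more_like_this": {
            "fields": ["name", "description"],
            "like": [{"_index": index, "_id": str(course_id)}],
            "min_term_freq": 1,
            "min_doc_freq": 1,
        }
    }
```

Each function returns the value of the "query" key in a search request, so the bodies can be sent with curl or with a Python client once the index exists.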

Step-by-Step Setup of Elasticsearch Using Docker

To get started, we’ll need an Elasticsearch instance. Docker simplifies this by letting us spin up Elasticsearch quickly on our local machine.

Prerequisites

Before we begin, ensure that Docker is installed and running on our system.

Step 1: Pull the Elasticsearch Docker Image

Open the terminal or command prompt and run the following command to pull the official Elasticsearch image from the Elastic Docker registry:

docker pull docker.elastic.co/elasticsearch/elasticsearch-wolfi:9.0.0

Step 2: Create a Docker Network (Optional)

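While this step is optional, creating a dedicated network for our Elasticsearch container makes it easier for multiple containers to communicate if we plan to use them together (e.g., with Kibana). Note that the docker run commands in the following steps pass --network elastic, so either create the network now or remove that flag from those commands.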

docker network create elastic

Step 3: Configure and Run the Elasticsearch Container

We can run Elasticsearch directly with a single command. Use the following command to start the container:

docker run -d --name my-elasticsearch-container \
  --network elastic \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch-wolfi:9.0.0

Explanation of Command Options:

  • -d: Runs the container in detached mode.
  • --name my-elasticsearch-container: Assigns a name to the container for easier management.
  • --network elastic: Connects the container to the specified Docker network.
  • -p 9200:9200: Maps port 9200 of the container to port 9200 on your host machine (used for HTTP requests).
  • -p 9300:9300: Maps port 9300 for internal communication between nodes (not essential for single-node setups).
  • -e "discovery.type=single-node": Configures Elasticsearch to run in single-node mode.
  • -e "xpack.security.enabled=false": Disables security features for local development (not recommended for production).

Step 4: Accessing Elasticsearch

Once the container is running, we can access our Elasticsearch instance by opening a web browser and navigating to:

http://localhost:9200

We should see a JSON response indicating that Elasticsearch is up and running.
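The same check can be scripted. Below is a minimal sketch that uses only the Python standard library; the function names are hypothetical, and it assumes the container from Step 3 is listening on http://localhost:9200 with security disabled.

```python
import json
from urllib.request import urlopen


def summarize_root_response(raw):
    """Pick the cluster name and version out of the root endpoint's JSON body."""
    info = json.loads(raw)
    return {
        "cluster_name": info.get("cluster_name"),
        "version": info.get("version", {}).get("number"),
    }


def check_elasticsearch(url="http://localhost:9200", timeout=5):
    """Fetch the root endpoint; raises URLError if the node is not reachable."""
    with urlopen(url, timeout=timeout) as response:
        return summarize_root_response(response.read())
```

Calling check_elasticsearch() while the container is running returns a small dictionary with the cluster name and version number. A connection error usually means the node is still starting up.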

Step 5: Add Kibana for Visualisation (Optional)

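If we want to use Kibana for monitoring and querying Elasticsearch, pull the Kibana image whose version matches the Elasticsearch image:

docker pull docker.elastic.co/kibana/kibana:9.0.0

Start Kibana, pointing it at the Elasticsearch container by name:

docker run -d --name kibana \
  --network elastic \
  -p 5601:5601 \
  -e "ELASTICSEARCH_HOSTS=http://my-elasticsearch-container:9200" \
  docker.elastic.co/kibana/kibana:9.0.0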

Open Kibana in the browser: http://localhost:5601

Step 6: Stopping and Removing the Container

When we’re done with our development session, we can stop and remove the container using the following commands:

docker stop my-elasticsearch-container
docker rm my-elasticsearch-container

Steps for Database Setup

Before configuring the Elasticsearch index, we need a database to act as the source of truth for course data. This allows us to:

  1. Centralise course data: Store details like course names, descriptions, categories, and other metadata in one place.
  2. Synchronise with Elasticsearch: Import data from the database to Elasticsearch for indexing and querying.

Step 1: Choose a Database

For a course search engine, relational databases like MySQL or PostgreSQL are suitable choices due to their structured query capabilities. NoSQL databases like MongoDB could also work if the data structure is highly flexible.

For simplicity, let’s proceed with MySQL in this guide.

Step 2: Install MySQL

If you don’t already have MySQL installed, you can set it up using Docker:

docker run --name mysql-course-db \
  -e MYSQL_ROOT_PASSWORD=root \
  -e MYSQL_DATABASE=courses_db \
  -e MYSQL_USER=user \
  -e MYSQL_PASSWORD=password \
  -p 3306:3306 \
  -d mysql:latest

This creates a courses_db database with credentials:

  • Username: user
  • Password: password

For production environments, ensure you use strong passwords and consider enabling additional security features.

Step 3: Create a Table for Courses

Log in to MySQL to define the course schema:

docker exec -it mysql-course-db mysql -u root -p

Next, create a table to store course information. Here’s an example SQL statement to create a courses table:

USE courses_db;

CREATE TABLE courses (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    description TEXT,
    category_id INT,
    sub_category_id INT,
    language VARCHAR(50),
    source VARCHAR(50),
    level VARCHAR(50),
    instructor VARCHAR(150),
    is_valid BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    modified_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

Step 4: Populate the Table with Sample Data

Insert some sample data into the courses table. Here are a few example SQL statements:

INSERT INTO courses (name, description, category_id, sub_category_id, language, source, instructor, level, is_valid) VALUES
('Introduction to Python', 'Learn the basics of Python programming.', 1, 101, 'EN', 'GO', 'John Doe', 'Beginner', TRUE),
('Introduction to Machine Learning', 'Learn the fundamentals of machine learning with practical examples.', 5, 501, 'EN', 'GO', 'Jane Smith', 'Beginner', TRUE),
('Advanced Java', 'Deep dive into Java concepts.', 2, 201, 'EN', 'LINKEDIN', 'Alice Johnson', 'Advanced', TRUE),
('Web Development Bootcamp', 'Become a full-stack web developer in this comprehensive bootcamp.', 1, 101, 'EN', 'GO', 'Bob Brown', 'Beginner', TRUE),
('AWS Cloud Basics', 'Understand the fundamentals of AWS Cloud.', 3, 301, 'EN', 'AWS', 'Carol White', 'Beginner', TRUE),
('Graphic Design 101', 'Basics of graphic design.', 4, 401, 'FR', 'STUDIO', 'David Green', 'Intermediate', FALSE);

Step 5: Grant privileges to your user for the database

From the MySQL shell, logged in as root or another admin user, run:

GRANT ALL PRIVILEGES ON courses_db.* TO 'user'@'%';
FLUSH PRIVILEGES;

This grants all privileges on the courses_db database to the account named user when it connects from any host (%).

What is an Index in Elasticsearch?

In Elasticsearch, an index is a data structure used to store, retrieve, and search documents. It’s similar to a table in a relational database. Each index contains a collection of documents, and every document represents a unit of searchable data, often in JSON format.

Key Concepts:

  1. Index: Like a table in SQL, it groups documents with similar characteristics (e.g., all “course” documents).
  2. Document: A single record in an index, typically representing one entity (e.g., one course).
  3. Field: Analogous to a column in SQL, it’s a key-value pair inside a document.
  4. Mapping: The schema definition for an index. Specifies the field types (e.g., text, keyword, integer, date) and behaviours (e.g., analysers for text fields).
  5. Shards: A shard is the smallest unit of storage and allows Elasticsearch to scale horizontally by distributing data across multiple nodes. By default, an index has 1 primary shard and 1 replica shard (can be customised).
  6. Replicas: Duplicate copies of shards used for high availability and fault tolerance. For example, if your index has 1 primary shard and 1 replica, there will be a total of 2 shards.
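The shard arithmetic in the last two points can be written down directly. This is a small sketch with hypothetical helper names; number_of_shards and number_of_replicas are the index settings that control the two values.

```python
def total_shards(primaries, replicas):
    """Each primary shard has `replicas` copies, so total = primaries * (1 + replicas)."""
    return primaries * (1 + replicas)


def index_settings(primaries=1, replicas=1):
    """Index settings in the shape the create-index API accepts."""
    return {"number_of_shards": primaries, "number_of_replicas": replicas}
```

With the defaults, total_shards(1, 1) gives the 2 shards mentioned above. On a single-node setup like ours the replica has no second node to live on and stays unassigned, so index_settings(1, 0) is a common choice for local development.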

Choosing What to Index

When deciding what to index in Elasticsearch, consider the following factors:

  1. Data Relevance:
    • Identify the key entities and attributes that users will search for. For a course search application, relevant fields might include course title, description, category, level, language, and source.
  2. Query Patterns:
    • Analyse how users will query the data. This will help us determine which fields should be indexed for full-text search versus those that may only require exact matches (e.g., categories).
  3. Field Types:
    • Choose appropriate data types for each field based on how we plan to use them in searches. For example:
      • Use text for fields that require full-text search (like course descriptions).
      • Use keyword for fields that require exact matches or aggregations (like categories). Fields of type keyword are case-sensitive by default. If you search for "EN" in a keyword field, it will not match "en" unless the case matches exactly.
      • Use integer, boolean for structured fields.
  4. Performance Considerations:
    • Keep in mind that indexing large amounts of unnecessary data can impact performance. Focus on indexing only the fields that are essential for your application’s functionality.
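To make the difference between text and keyword fields concrete, here is a toy model in plain Python. It is only an approximation for illustration: analyze_text is a rough stand-in for Elasticsearch's standard analyzer (lowercase, split into words), and all three function names are hypothetical.

```python
import re


def analyze_text(value):
    """Rough stand-in for the standard analyzer: lowercase, then split into words."""
    return re.findall(r"\w+", value.lower())


def text_field_matches(stored, query):
    """A text field matches when any analyzed query term is among the stored terms."""
    stored_terms = set(analyze_text(stored))
    return any(term in stored_terms for term in analyze_text(query))


def keyword_field_matches(stored, query):
    """A keyword field is kept as one unmodified term, so only an exact match hits."""
    return stored == query
```

In this model, keyword_field_matches("EN", "en") is False, while text_field_matches("Learn the basics of Python programming.", "PYTHON") is True, which mirrors the case-sensitivity note above.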

What to Index for a Course Search?

For a course search, we should index the most relevant fields that help users find courses easily. Below is a typical structure of what to index:

Suggested Fields for a Course Index:

Field Name       Data Type  Purpose
id               Keyword    Unique identifier for each course.
name             Text       Course title, used for keyword searches.
description      Text       Detailed course description, used for search.
category_id      Integer    Category of the course, used for filtering.
sub_category_id  Integer    Sub-category for deeper classification.
language         Keyword    Language of the course (e.g., EN, FR).
source           Keyword    Source of the course (e.g., LinkedIn, AWS).
level            Keyword    Course level, used for filtering.
is_valid         Boolean    Whether the course is active/published.
modified_at      Date       Last-modified timestamp, used for sorting.

Setting Up a Django Application

Django provides a robust framework for building web applications, making it an excellent choice for our course search and recommendation engine. Below, I’ll outline how to set up the Django application to connect to the database, fetch course data, and index it into Elasticsearch.

Step 1: Set Up Your Django Project

If you haven’t already, install Django using pip:

pip install django

Create a new Django project named course_search:

django-admin startproject course_search
cd course_search

Inside our project, create a new app called courses:

python manage.py startapp courses

Open settings.py in your course_search directory and add the courses app to the INSTALLED_APPS list:

INSTALLED_APPS = [
    ...
    'courses',
    ...
]

Step 2: Configure Database Settings

In settings.py, configure your database settings. For example, if we are using MySQL, our configuration might look like this:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'courses_db',
        'USER': 'user',
        'PASSWORD': 'password',
        'HOST': '127.0.0.1',  # forces TCP; on Unix, 'localhost' makes the MySQL client look for a local socket
        'PORT': '3306',
    }
}

Make sure to install the MySQL client for Python:

pip install mysqlclient
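Hard-coded credentials are fine for a local experiment, but a common alternative is to read them from environment variables. The sketch below shows one way to do that. The COURSES_DB_* variable names and the database_settings helper are made up for this example, and the defaults assume the MySQL container from the database setup section.

```python
import os


def database_settings(env=os.environ):
    """Build Django's DATABASES dict, letting environment variables override defaults."""
    return {
        "default": {
            "ENGINE": "django.db.backends.mysql",
            "NAME": env.get("COURSES_DB_NAME", "courses_db"),
            "USER": env.get("COURSES_DB_USER", "user"),
            "PASSWORD": env.get("COURSES_DB_PASSWORD", "password"),
            # 127.0.0.1 assumes the container's port 3306 is published on the host.
            "HOST": env.get("COURSES_DB_HOST", "127.0.0.1"),
            "PORT": env.get("COURSES_DB_PORT", "3306"),
        }
    }


# In settings.py this would take the place of the literal dictionary.
DATABASES = database_settings()
```

With this in place, a different password can be supplied at launch time, for example by setting COURSES_DB_PASSWORD in the shell before running manage.py.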

Step 3: Define the Course Model

In models.py of the courses app, define the Course model that represents the structure of our course data:

from django.db import models

class Course(models.Model):
    name = models.CharField(max_length=255)
    description = models.TextField()
    category_id = models.IntegerField()
    sub_category_id = models.IntegerField()
    language = models.CharField(max_length=50)
    source = models.CharField(max_length=50)
    instructor = models.CharField(max_length=150)
    level = models.CharField(max_length=50, blank=True, null=True)
    is_valid = models.BooleanField(default=True)
    modified_at = models.DateTimeField(auto_now=True)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        db_table = 'courses'

    def __str__(self):
        return self.name

Register the Course model with the Django admin in courses/admin.py:

from django.contrib import admin
from .models import Course

admin.site.register(Course)

Step 4: Create and Apply Migrations

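Run the following commands to create and apply migrations for our model. Because the courses table was already created by hand in MySQL, pass --fake-initial so that Django records the initial migration for that table instead of trying to create it again:

python manage.py makemigrations courses
python manage.py migrate --fake-initial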

Step 5: Populate the Database

Add sample courses using the Django admin or a script. To create a superuser:

python manage.py createsuperuser

Run the development server:

python manage.py runserver

Access the Django admin http://127.0.0.1:8000/admin to add sample courses.

Step 6: Set Up Elasticsearch Integration

We will need an Elasticsearch client library for Python. Install it using pip:

pip install elasticsearch

Django allows us to create custom management commands. Let’s create a command to index the course data. Create a management/commands directory structure in your courses app:

mkdir -p courses/management/commands

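Next, let’s create a script named elasticsearch_client.py in courses/management/commands/:

# ApiError and TransportError are the exception base classes in the 8.x and 9.x clients.
from elasticsearch import ApiError, Elasticsearch, TransportError

# Security is disabled on the local container, so plain HTTP without credentials works.
es = Elasticsearch('http://localhost:9200')

def index_course(course):
    # Build the document from the fields we want to search, filter, or sort on.
    doc = {
        'name': course.name,
        'description': course.description,
        'category_id': course.category_id,
        'sub_category_id': course.sub_category_id,
        'language': course.language,
        'source': course.source,
        'level': course.level,
        'is_valid': course.is_valid,
        'modified_at': course.modified_at.isoformat(),
    }
    try:
        # The primary key doubles as the document _id, so re-running the
        # command overwrites existing documents instead of duplicating them.
        es.index(index='courses', id=course.id, document=doc)
    except (ApiError, TransportError) as e:
        print(f"Failed to index course {course.id}: {e}")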

Create a new file named index_courses.py in courses/management/commands/:

from django.core.management.base import BaseCommand
from courses.models import Course
from .elasticsearch_client import es, index_course

class Command(BaseCommand):
    help = 'Index all courses into Elasticsearch'

    def handle(self, *args, **kwargs):
        index_name = 'courses'

        # Create the index on the first run; later runs reuse it.
        if not es.indices.exists(index=index_name):
            self.stdout.write(f"Creating index: {index_name}")
            es.indices.create(index=index_name)

        courses = Course.objects.all()
        for course in courses:
            index_course(course)
            self.stdout.write(self.style.SUCCESS(f'Successfully indexed course: {course.name}'))
    
        self.stdout.write(self.style.SUCCESS('Successfully indexed all courses'))

Let’s run this command to index all courses:

python manage.py index_courses
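The command above sends one HTTP request per course, which is fine for a handful of rows. For larger tables, the Python client's bulk helper can batch the documents instead. The following is a sketch under that assumption; course_actions and bulk_index_courses are hypothetical names, and the document fields mirror the ones used in index_course.

```python
def course_actions(courses, index_name="courses"):
    """Yield one bulk action per course, in the format the bulk helper expects."""
    for course in courses:
        yield {
            "_index": index_name,
            "_id": course.id,
            "_source": {
                "name": course.name,
                "description": course.description,
                "category_id": course.category_id,
                "sub_category_id": course.sub_category_id,
                "language": course.language,
                "source": course.source,
                "level": course.level,
                "is_valid": course.is_valid,
                "modified_at": course.modified_at.isoformat(),
            },
        }


def bulk_index_courses(es, courses, index_name="courses"):
    """Send all actions through elasticsearch.helpers.bulk."""
    # Imported here so that course_actions can be used without the client installed.
    from elasticsearch import helpers

    # Returns a (success_count, errors) tuple.
    return helpers.bulk(es, course_actions(courses, index_name))
```

Inside the management command, this could be called as bulk_index_courses(es, Course.objects.iterator()) in place of the per-course loop.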

Once data is indexed, we can query Elasticsearch to view the mappings of the courses index using the following HTTP request:

curl -X GET "http://localhost:9200/courses/_mapping?pretty"

This will return the field mappings of the courses index in a human-readable format.

{
  "courses" : {
    "mappings" : {
      "properties" : {
        "category_id" : {
          "type" : "long"
        },
        "description" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "is_valid" : {
          "type" : "boolean"
        },
        "language" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "level" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "modified_at" : {
          "type" : "date"
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "source" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "sub_category_id" : {
          "type" : "long"
        }
      }
    }
  }
}
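The mapping returned above was inferred dynamically, which is why language, level, and source came back as text with a keyword sub-field rather than the keyword type suggested earlier. One way to get the suggested types is to create the index with an explicit mapping. This is a sketch only: COURSE_MAPPINGS and create_courses_index are hypothetical names, es is assumed to be an 8.x or 9.x Python client, and the id field is left out because the indexing code stores it as the document _id.

```python
COURSE_MAPPINGS = {
    "properties": {
        "name": {"type": "text"},
        "description": {"type": "text"},
        "category_id": {"type": "integer"},
        "sub_category_id": {"type": "integer"},
        "language": {"type": "keyword"},
        "source": {"type": "keyword"},
        "level": {"type": "keyword"},
        "is_valid": {"type": "boolean"},
        "modified_at": {"type": "date"},
    }
}


def create_courses_index(es, index_name="courses"):
    """Create the index with explicit mappings; return False if it already exists."""
    if es.indices.exists(index=index_name):
        return False
    es.indices.create(index=index_name, mappings=COURSE_MAPPINGS)
    return True
```

An existing index keeps the field types it already has, so the dynamically mapped courses index would need to be deleted and re-populated for these types to take effect.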

To query the courses index, use the _search endpoint. For example:

curl -X GET "http://localhost:9200/courses/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}'

Searching for a course by name:

curl -X GET "http://localhost:9200/courses/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "name": "Python"
    }
  }
}'
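The same name search can be issued from Python, which is closer to how a Django view would call it. This is a minimal sketch: search_courses_by_name is a hypothetical helper, and es is assumed to be an 8.x or 9.x client such as the one defined in elasticsearch_client.py.

```python
def search_courses_by_name(es, text, index_name="courses", size=10):
    """Run a match query on the name field and return (id, name, score) tuples."""
    response = es.search(
        index=index_name,
        query={"match": {"name": text}},
        size=size,
    )
    return [
        (hit["_id"], hit["_source"]["name"], hit["_score"])
        for hit in response["hits"]["hits"]
    ]
```

Each tuple carries the document _id, which equals the course's primary key in MySQL, so the matching rows can be loaded from the database when more detail is needed.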

Conclusions

With the database, the Django model, and the indexing command in place, and the course data searchable in Elasticsearch, we’re now ready to continue building our course search and recommendation engine.

In the next tutorial, we’ll implement search functionalities using Elasticsearch and create views to display search results on a web interface. Stay tuned as we continue developing this application!

For the full source code, please visit the GitHub repository.
