Creating crawlers - Lightnovel Crawler

A source crawler is a small Python file that tells Lightnovel Crawler how to read a specific novel website. It answers two questions:

What’s on the novel page? — title, author, cover, list of chapters.
What’s inside each chapter? — the story text, cleaned of ads and navigation.

Once you add a crawler, users can download novels from that site through the CLI or the web UI.

Prerequisites

Python 3.9+
The project set up locally (see Development setup)
Basic familiarity with HTML and CSS selectors — you’ll use selectors like div.chapter-content to target elements

Architecture overview

Crawlers live in the sources/ directory, organized by language. Each crawler is a single .py file containing one class that inherits from a template base class. The template handles HTTP requests, concurrency, and output building — you only implement the methods that extract data from the page HTML. The crawler registry auto-discovers all Python files in sources/ on startup. A crawler is matched to a URL when the URL starts with one of the values in the class’s base_url list.
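The split of responsibilities described above resembles the template method pattern. Here is a minimal sketch using only the standard library. TinyTemplate and TinySiteCrawler are hypothetical names, the dictionary of canned pages stands in for real HTTP requests, and none of this is Lightnovel Crawler's actual code.

```python
from abc import ABC, abstractmethod
from typing import Dict


class TinyTemplate(ABC):
    """Hypothetical stand-in for a template base class (not lncrawl code)."""

    def __init__(self, pages: Dict[str, str]) -> None:
        # Canned pages replace real HTTP requests in this sketch.
        self.pages = pages

    def fetch(self, url: str) -> str:
        return self.pages[url]

    def read_novel(self, url: str) -> Dict[str, str]:
        # The template owns the workflow: fetch first, then delegate extraction.
        html = self.fetch(url)
        return {"url": url, "title": self.parse_title(html)}

    @abstractmethod
    def parse_title(self, html: str) -> str:
        """Site-specific extraction, supplied by the subclass."""


class TinySiteCrawler(TinyTemplate):
    def parse_title(self, html: str) -> str:
        # Naive string slicing keeps the sketch dependency-free.
        start = html.index("<h1>") + len("<h1>")
        return html[start:html.index("</h1>")].strip()


crawler = TinySiteCrawler({"https://mysite.com/novel/1": "<h1> My Novel </h1>"})
info = crawler.read_novel("https://mysite.com/novel/1")
```

The subclass never calls fetch itself. It only answers the extraction question, which is the same contract the real templates ask of a crawler.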

Choose a template

Pick the template that matches your site’s structure. Start with GeneralSoupTemplate for most sites and only move to a more specific template if you need its extra features.

GeneralSoupTemplate (recommended)
SearchableSoupTemplate
With volumes
Browser-based
Base Crawler

File: sources/_examples/_01_general_soup.py

The default choice for most sites. Requires four methods and handles everything else.

# -*- coding: utf-8 -*-
import logging
from typing import Generator, Union

from lncrawl.core import PageSoup
from lncrawl.models import Chapter, Volume
from lncrawl.templates.soup.general import GeneralSoupTemplate

logger = logging.getLogger(__name__)


class MySiteCrawler(GeneralSoupTemplate):
    base_url = ["https://mysite.com/", "https://www.mysite.com/"]

    def parse_title(self, soup: PageSoup) -> str:
        raise NotImplementedError()

    def parse_cover(self, soup: PageSoup) -> str:
        return ""

    def parse_chapter_list(
        self, soup: PageSoup
    ) -> Generator[Union[Chapter, Volume], None, None]:
        yield from []

    def select_chapter_body(self, soup: PageSoup) -> PageSoup:
        raise NotImplementedError()

Use this template when the site renders HTML server-side and has a straightforward chapter list.

File: sources/_examples/_02_searchable_soup.py

Extends GeneralSoupTemplate to support searching by title. Add this when the site has a working search box.

Two extra required methods:

from typing import Generator
from urllib.parse import quote_plus

from lncrawl.core import PageSoup
from lncrawl.models import SearchResult
from lncrawl.templates.soup.searchable import SearchableSoupTemplate


class MySiteCrawler(SearchableSoupTemplate):
    base_url = ["https://mysite.com/"]

    def select_search_items(
        self, query: str
    ) -> Generator[PageSoup, None, None]:
        # Fetch the search results page and yield anchor tags.
        # URL-encode the query so spaces and symbols survive the request.
        soup = self.get_soup(f"{self.home_url}search?q={quote_plus(query)}")
        yield from soup.select(".search-results a")

    def parse_search_item(self, tag: PageSoup) -> SearchResult:
        return SearchResult(
            title=tag.get_text(strip=True),
            url=self.absolute_url(tag["href"]),
        )

    # ...plus the four GeneralSoupTemplate methods

File: sources/_examples/_05_with_volume_soup.py

Use ChapterWithVolumeSoupTemplate when the novel page displays explicit volume blocks, each containing its own chapter list.

from typing import Generator

from lncrawl.core import PageSoup
from lncrawl.models import Chapter, Volume
from lncrawl.templates.soup.with_volume import ChapterWithVolumeSoupTemplate


class MySiteCrawler(ChapterWithVolumeSoupTemplate):
    base_url = ["https://mysite.com/"]

    def select_volume_tags(
        self, soup: PageSoup
    ) -> Generator[PageSoup, None, None]:
        yield from soup.select("#toc .vol-item")

    def parse_volume_item(self, tag: PageSoup, id: int) -> Volume:
        return Volume(id=id, title=tag.get_text(strip=True))

    def select_chapter_tags(
        self, tag: PageSoup, vol: Volume, soup: PageSoup
    ) -> Generator[PageSoup, None, None]:
        yield from tag.select(".chapter-item")

    def parse_chapter_item(
        self, tag: PageSoup, id: int, vol: Volume
    ) -> Chapter:
        return Chapter(
            id=id,
            volume=vol.id,
            title=tag.get_text(strip=True),
            url=self.absolute_url(tag["href"]),
        )

    def select_chapter_body(self, soup: PageSoup) -> PageSoup:
        return soup.select_one(".chapter-content")

Also see _07_optional_volume_soup.py for sites where volumes are sometimes present.

Files: sources/_examples/_09_basic_browser.py through _17_searchable_optional_volume_browser.py

Use browser templates when the site loads content via JavaScript. The browser template renders the page with a headless browser before parsing.

from lncrawl.models.chapter import Chapter
from lncrawl.templates.browser.basic import BasicBrowserTemplate


class MySiteCrawler(BasicBrowserTemplate):
    base_url = ["https://mysite.com/"]

    def read_novel_info_in_soup(self) -> None:
        # Try with plain HTTP first; raise ScraperNotSupported() to fall back
        pass

    def read_novel_info_in_browser(self) -> None:
        # Uses self.browser (headless browser)
        # Access parsed HTML via self.browser.soup
        pass

    def download_chapter_body_in_browser(self, chapter: Chapter) -> str:
        self.visit(chapter["url"])
        soup = self.browser.soup
        tag = soup.select_one(".chapter-content")
        return self.cleaner.extract_contents(tag)

Browser templates are slower than soup-based ones. Only use them when the site actively requires JavaScript execution.
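The soup-first, browser-second flow hinted at in the comments above can be sketched as a plain try/except. This is only an illustration of the control flow: the ScraperNotSupported class below is a local stand-in for the exception named in the template comment, and read_with_fallback, soup_reader, and browser_reader are hypothetical names.

```python
from typing import Callable


class ScraperNotSupported(Exception):
    """Local stand-in for the exception named in the template comment."""


def read_with_fallback(
    in_soup: Callable[[], str], in_browser: Callable[[], str]
) -> str:
    try:
        # Cheap path: plain HTTP request plus HTML parsing
        return in_soup()
    except ScraperNotSupported:
        # Expensive path: render the page in a headless browser
        return in_browser()


def soup_reader() -> str:
    # Simulates a site whose content only appears after JavaScript runs
    raise ScraperNotSupported("content is rendered by JavaScript")


def browser_reader() -> str:
    return "title from rendered page"


result = read_with_fallback(soup_reader, browser_reader)
```

Because the browser path only runs when the soup path gives up, sites that serve static HTML never pay the cost of starting a browser.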

File: sources/_examples/_00_basic.py

Inherit directly from Crawler when you need full control. Requires two methods and no template scaffolding.

from lncrawl.core import Crawler
from lncrawl.models import Chapter


class MySiteCrawler(Crawler):
    base_url = ["https://mysite.com/"]

    def read_novel_info(self) -> None:
        soup = self.get_soup(self.novel_url)
        self.novel_title = soup.select_one("h1").get_text(strip=True)
        self.novel_cover = self.absolute_url(
            soup.select_one("img.cover")["src"]
        )
        # Populate self.volumes and self.chapters

    def download_chapter_body(self, chapter: Chapter) -> str:
        soup = self.get_soup(chapter["url"])
        tag = soup.select_one(".chapter-content")
        return self.cleaner.extract_contents(tag)

This style is fully supported but the GeneralSoupTemplate style is recommended for new crawlers — smaller methods, less boilerplate, same result.

File placement

Crawlers are grouped by the site’s language. Place your file in the matching directory:

Site language	Folder	Example path
English	`sources/en/` then by first letter	`sources/en/m/mysite.py`
Chinese	`sources/zh/`	`sources/zh/mysite.py`
Japanese	`sources/ja/`	`sources/ja/mysite.py`
Multiple languages	`sources/multi/`	`sources/multi/mysite.py`

For English sites, use a letter subfolder based on the site’s domain name (e.g. “mynovelsite.com” → sources/en/m/). Name the file after the site, such as mynovelsite.py. Avoid generic names like crawler.py.
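The placement rules above can be expressed as a small helper. This is a sketch only: crawler_path is a hypothetical function, not part of the project, and replacing dots and hyphens with underscores in the file name is an assumption made here so the result is a valid Python module name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def crawler_path(site_url: str, language: str) -> PurePosixPath:
    """Suggest a file path for a new crawler (hypothetical helper)."""
    host = urlparse(site_url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    # Drop the final suffix (".com") and make the rest module-friendly.
    # The underscore replacement is an assumption, not a project rule.
    name = host.rsplit(".", 1)[0].replace(".", "_").replace("-", "_")
    if not name:
        raise ValueError(f"no host name found in {site_url!r}")
    folder = PurePosixPath("sources") / language
    if language == "en":
        # English sources get an extra first-letter subfolder
        folder = folder / name[0]
    return folder / f"{name}.py"


english = crawler_path("https://www.mynovelsite.com/", "en")
chinese = crawler_path("https://mysite.com/", "zh")
```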

Step-by-step: build a crawler

Copy the example file

Copy the appropriate example into the right sources/{lang}/ folder:

cp sources/_examples/_01_general_soup.py sources/en/m/mysite.py

For English sites, use the letter subfolder matching your domain’s first letter.

Set base_url and rename the class

Open your new file and update the class name and base_url:

class MySiteCrawler(GeneralSoupTemplate):
    base_url = ["https://mysite.com/", "https://www.mysite.com/"]

base_url is a list of URL prefixes. When a user pastes a novel URL, the app finds your crawler by checking if the URL starts with one of these values.
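A minimal sketch of that prefix check, assuming plain str.startswith matching as described. REGISTRY and find_crawler are hypothetical names for illustration, and the real registry may normalize URLs in ways this sketch does not.

```python
from typing import Dict, List, Optional

# Hypothetical registry: crawler name -> its base_url list
REGISTRY: Dict[str, List[str]] = {
    "MySiteCrawler": ["https://mysite.com/", "https://www.mysite.com/"],
    "OtherCrawler": ["https://other.example/"],
}


def find_crawler(novel_url: str) -> Optional[str]:
    """Return the first crawler with a base_url entry that prefixes novel_url."""
    for name, prefixes in REGISTRY.items():
        if any(novel_url.startswith(prefix) for prefix in prefixes):
            return name
    return None


matched = find_crawler("https://www.mysite.com/novel/example")
unmatched = find_crawler("http://mysite.com/novel/example")
```

Under this assumption the http:// URL finds no crawler, because no listed prefix starts with that scheme. That is why it helps to list every host variant the site actually serves.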

Implement parse_title

Open the novel page in your browser, right-click the title, and select Inspect. Find the CSS selector that uniquely identifies the title element.

def parse_title(self, soup: PageSoup) -> str:
    tag = soup.select_one("h1.novel-title")  # adjust selector to match site
    return tag.get_text(strip=True) if tag else ""

The soup parameter is the parsed HTML of the novel’s detail page.

Implement parse_cover

Find the cover image element and return its URL. Use self.absolute_url() to handle relative paths.

def parse_cover(self, soup: PageSoup) -> str:
    img = soup.select_one("img.cover")  # adjust selector
    if img and img.get("src"):
        return self.absolute_url(img["src"])
    return ""

Return an empty string if the site has no cover image, matching the str return type declared in the template.

Implement parse_chapter_list

Yield Volume and Chapter objects in order. The template appends them to self.volumes and self.chapters.

Flat chapter list (no explicit volumes):

from lncrawl.models import Chapter, Volume

def parse_chapter_list(
    self, soup: PageSoup
) -> Generator[Union[Chapter, Volume], None, None]:
    yield Volume(id=1, title="Volume 1")

    for idx, a in enumerate(soup.select("ul.chapters a"), 1):
        yield Chapter(
            id=idx,
            title=a.get_text(strip=True),
            url=self.absolute_url(a["href"]),
            volume=1,
        )

With volume headings:

def parse_chapter_list(
    self, soup: PageSoup
) -> Generator[Union[Chapter, Volume], None, None]:
    chap_id = 1  # chapter ids keep counting across volumes
    for vol_id, vol_tag in enumerate(soup.select(".volume-block"), 1):
        title_tag = vol_tag.select_one(".vol-title")
        # Fall back to a generated title when the heading is missing
        vol_title = title_tag.get_text(strip=True) if title_tag else f"Volume {vol_id}"
        yield Volume(id=vol_id, title=vol_title)
        for a in vol_tag.select("a.chapter-link"):
            yield Chapter(
                id=chap_id,
                title=a.get_text(strip=True),
                url=self.absolute_url(a["href"]),
                volume=vol_id,
            )
            chap_id += 1

Always use self.absolute_url() on chapter URLs — some sites use relative paths.
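To see why this matters, here is how relative links resolve against a page URL. The sketch uses the standard library's urljoin as a stand-in. Whether self.absolute_url() resolves every case the same way is an assumption, and the URLs are made up for illustration.

```python
from urllib.parse import urljoin

page_url = "https://mysite.com/novel/example/"

# Root-relative path: replaces the whole path of the page URL
root_relative = urljoin(page_url, "/chapter/1")

# Document-relative path: resolved against the page's own folder
doc_relative = urljoin(page_url, "chapter-1.html")

# Already absolute: returned unchanged
absolute = urljoin(page_url, "https://cdn.mysite.com/cover.jpg")
```

A site can mix all three forms on one page, so resolving every href is safer than checking which form a link uses.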

Implement select_chapter_body

The template fetches each chapter page and calls this method with its parsed HTML. Return the single tag that wraps the story text.

def select_chapter_body(self, soup: PageSoup) -> PageSoup:
    return soup.select_one("div.chapter-content")  # adjust selector

The template extracts and cleans the content automatically. Return None if not found.

Test with a real novel URL

Run a quick download test from the project root. --first 3 downloads only the first 3 chapters; -f json outputs as JSON:

uv run python -m lncrawl -s "https://mysite.com/novel/example" --first 3 -f json

If you see errors about selectors or “element not found”, your CSS selectors don’t match the site — use the browser’s Inspect tool to find the correct class names.

Verify the crawler is registered

Confirm your file appears in the source list:

uv run python -m lncrawl sources list | grep mysite

If nothing appears, check that the file is in the right sources/ folder and that the class inherits from Crawler or a template.

Required method signatures

These are the four methods every GeneralSoupTemplate crawler must implement:

def parse_title(self, soup: PageSoup) -> str:
    """Return the novel title from the novel detail page."""
    ...

def parse_cover(self, soup: PageSoup) -> str:
    """Return the cover image URL, or '' if none.
    Use self.absolute_url() for relative paths.
    """
    ...

def parse_chapter_list(
    self, soup: PageSoup
) -> Generator[Union[Chapter, Volume], None, None]:
    """Yield Volume and Chapter objects in order.
    The template appends them to self.volumes and self.chapters.
    """
    ...

def select_chapter_body(self, soup: PageSoup) -> PageSoup:
    """Return the Tag containing the chapter text.
    The template cleans and extracts it.
    Return None if not found.
    """
    ...

Optional methods

Override these in your class when needed. The template provides working defaults for all of them.

Method	Purpose
`get_novel_soup(self)`	Return the parsed soup for the novel page. Override if you need a different URL or a POST request.
`parse_authors(self, soup)`	Yield author name strings. Default: yields nothing.
`parse_genres(self, soup)`	Yield genre or tag strings. Default: yields nothing.
`parse_summary(self, soup)`	Return the novel synopsis string. Default: `""`.
`initialize(self)`	One-time setup — configure cleaner rules, set custom headers.
`login(self, username_or_email, password_or_token)`	Log in before scraping, for sites that require authentication.

Example — parse_authors:

def parse_authors(self, soup: PageSoup) -> Generator[str, None, None]:
    tag = soup.find("strong", string="Author:")
    if tag and tag.next_sibling:
        yield tag.next_sibling.get_text(strip=True)
    # For multiple authors:
    # for a in soup.select(".author a"):
    #     yield a.get_text(strip=True)

Example — initialize with custom cleaner rules:

def initialize(self) -> None:
    self.cleaner.bad_css.update(["div.advertisement", "div.social-share"])
    self.cleaner.bad_tags.update(["script", "style"])

Available helper methods

These are inherited from the base template and available inside any method:

HTTP and parsing

Method	Description
`self.get_soup(url)`	GET request, returns `PageSoup`
`self.post_soup(url, data)`	POST request, returns `PageSoup`
`self.get_json(url)`	GET request, returns parsed JSON
`self.post_json(url, data)`	POST request, returns parsed JSON
`self.submit_form(url, data)`	Submit form data

URLs

Method	Description
`self.absolute_url(path)`	Convert a relative path like `/chapter/1` to a full URL
`self.novel_url`	The novel page URL the user provided
`self.home_url`	The first value in `base_url`

Content cleaning

# Remove specific CSS selectors before extraction
self.cleaner.bad_css.update(["div.ads", "span.watermark"])

# Remove specific tags
self.cleaner.bad_tags.update(["script", "style"])

# Extract clean HTML from a tag
html = self.cleaner.extract_contents(soup_element)
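To make the effect of a bad_tags rule concrete, the sketch below removes unwanted tags and everything inside them using only the standard library. It illustrates the idea and is not how self.cleaner is implemented. TagStripper and strip_tags are hypothetical names, CSS-selector rules such as bad_css are not covered, and attributes on kept tags are dropped for brevity.

```python
from html import escape
from html.parser import HTMLParser
from typing import List, Set


class TagStripper(HTMLParser):
    """Drop unwanted tags and everything inside them (illustration only)."""

    def __init__(self, bad_tags: Set[str]) -> None:
        super().__init__(convert_charrefs=True)
        self.bad_tags = bad_tags
        self.depth = 0  # > 0 while inside an unwanted tag
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.bad_tags:
            self.depth += 1
        elif not self.depth:
            self.parts.append(f"<{tag}>")  # attributes are dropped

    def handle_endtag(self, tag):
        if tag in self.bad_tags:
            self.depth = max(0, self.depth - 1)
        elif not self.depth:
            self.parts.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a region
        if not self.depth and tag not in self.bad_tags:
            self.parts.append(f"<{tag}/>")

    def handle_data(self, data):
        if not self.depth:
            # Re-escape text because the parser decoded entities
            self.parts.append(escape(data, quote=False))


def strip_tags(html: str, bad_tags: Set[str]) -> str:
    parser = TagStripper(bad_tags)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


cleaned = strip_tags(
    "<p>Hello</p><script>track()</script><p>World</p>", {"script", "style"}
)
```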

Using ChatGPT to generate a crawler

The CLI includes a command that uses ChatGPT to generate a crawler from a novel URL:

lncrawl sources create

This is a good starting point, but the generated code will need review and testing before it is ready to submit.

Known engine templates

If the site runs on a widely-used novel platform, you may be able to inherit from an existing engine template and only override base_url:

lncrawl.templates.madara — Madara WordPress theme
lncrawl.templates.novelfull — NovelFull-style sites
lncrawl.templates.novelpub — NovelPub-style sites

Check existing crawlers in sources/ that use these templates for reference.

Best practices

Handle missing elements — not every novel has a cover or author. Always use if tag: before accessing attributes.
Log useful info — logger.info("Found %d chapters", len(self.chapters)) makes debugging much easier.
Use self.absolute_url() — for all chapter URLs and the cover image, to ensure they resolve correctly from any context.
Test edge cases — try a novel with many chapters, one with special characters in the title, and one with no cover.
Clean the chapter content — configure self.cleaner to strip ads, navigation links, and scripts so the exported ebook looks clean.
Respect the site — don’t send too many requests at once; the base app already limits concurrency.
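The first practice above, guarding against missing elements, can be wrapped in a small helper so the check is not repeated in every method. This is a sketch: text_or is a hypothetical helper, and FakeTag is a stand-in for a parsed element so that the example runs without the library installed.

```python
from typing import Optional, Protocol


class HasText(Protocol):
    """Anything with a get_text(strip=...) method, like a parsed tag."""

    def get_text(self, strip: bool = False) -> str: ...


def text_or(tag: Optional[HasText], default: str = "") -> str:
    """Return the tag's stripped text, or default when the tag is missing."""
    return tag.get_text(strip=True) if tag else default


class FakeTag:
    """Minimal stand-in for a parsed element, used only for this demo."""

    def __init__(self, text: str) -> None:
        self.text = text

    def get_text(self, strip: bool = False) -> str:
        return self.text.strip() if strip else self.text


author = text_or(FakeTag("  Jane Doe "))
missing = text_or(None, default="Unknown")
```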

Common mistakes

Wrong selectors: the site’s HTML may use different class names from what you expect. Always inspect the live page in your browser.
Relative URLs: always use self.absolute_url(link["href"]) for chapter URLs and cover images.
Unimplemented required methods: parse_title, parse_cover, parse_chapter_list, and select_chapter_body must all be implemented. The stubs in the example file raise NotImplementedError or yield nothing, so a crawler left in that state downloads nothing. Of the four, only parse_cover may legitimately return an empty string.

Complete example

A full working crawler using GeneralSoupTemplate:

# -*- coding: utf-8 -*-
import logging
from typing import Generator, Union

from lncrawl.core import PageSoup
from lncrawl.models import Chapter, Volume
from lncrawl.templates.soup.general import GeneralSoupTemplate

logger = logging.getLogger(__name__)


class ExampleCrawler(GeneralSoupTemplate):
    base_url = ["https://example-novel-site.com/"]

    def initialize(self) -> None:
        self.cleaner.bad_css.update(["div.advertisement", "div.social-share"])

    def parse_title(self, soup: PageSoup) -> str:
        tag = soup.select_one("h1.novel-title")
        return tag.get_text(strip=True) if tag else ""

    def parse_cover(self, soup: PageSoup) -> str:
        img = soup.select_one("img.novel-cover")
        if img and img.get("src"):
            return self.absolute_url(img["src"])
        return ""

    def parse_authors(self, soup: PageSoup) -> Generator[str, None, None]:
        author = soup.select_one("span.author-name")
        if author:
            yield author.get_text(strip=True)

    def parse_chapter_list(
        self, soup: PageSoup
    ) -> Generator[Union[Chapter, Volume], None, None]:
        yield Volume(id=1, title="Volume 1")
        links = soup.select("ul.chapter-list a")
        for idx, a in enumerate(links, 1):
            yield Chapter(
                id=idx,
                title=a.get_text(strip=True),
                url=self.absolute_url(a["href"]),
                volume=1,
            )
        logger.info("Found %d chapters", len(links))

    def select_chapter_body(self, soup: PageSoup) -> PageSoup:
        return soup.select_one("div.chapter-content")

Once everything works, open a pull request to the main repository.
