Crawler

A multi-threaded web crawler that recursively traverses the web, using synchronization primitives for safe concurrency and resource management.

March 13, 2026


Development Blog Posts

March 13, 2026 - Part 1

I'm currently taking an Operating Systems course this semester. Professor DZ has made it interactive and interesting so far, and it's given me a much deeper understanding of how computers work under the hood. Threads, scheduling, synchronization, memory management: from the surface they look like exam topics, but when you really dive in, you see how they shape the way we build software and systems.

We've been given a semester-long project by the Prof called "psirver". It's essentially a user-space C++ server on a POSIX system that executes and monitors Python scripts over HTTP. It has forced me to think about process control, signal handling, and fault isolation in a way that most assignments never did.

I enjoyed building psirver, but I wanted a second project where I could make my own architectural tradeoffs from day one. So I thought of this simple web crawler. It's small enough to finish quickly, but deep enough to explore concurrency and resource management.

I gave myself a 24-hour time constraint to keep momentum high and decisions practical. The goal is not to build the biggest crawler on day one, but to design something correct, measurable, and easy to extend without rewriting everything later.

— Montasir

Architecture

March 13, 2026 - Part 2

I chose C++ because I want explicit control over memory, threading, and performance behavior, just like with psirver. The stack is intentionally more minimal than in my other projects: CMake for builds, libcurl for HTTP fetches, and STL synchronization primitives for safe shared-state coordination.

At a high level, the crawler runs on a producer-consumer model. A thread-safe URL queue feeds worker threads: each worker fetches a page, extracts candidate links, normalizes and deduplicates them, and pushes new URLs back into the queue until limits are reached.
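The heart of that model is the shared queue. A minimal sketch of what it might look like with STL primitives (the class name `UrlQueue` is illustrative, not from the actual source):

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>

// Minimal thread-safe URL queue: workers block in pop() until a URL
// arrives, or the queue is closed and drained (then pop() returns nullopt).
class UrlQueue {
public:
    void push(std::string url) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            items_.push(std::move(url));
        }
        cv_.notify_one();  // wake one waiting worker
    }

    std::optional<std::string> pop() {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [&] { return !items_.empty() || closed_; });
        if (items_.empty()) return std::nullopt;  // closed and drained
        std::string url = std::move(items_.front());
        items_.pop();
        return url;
    }

    void close() {  // signal shutdown so blocked workers can exit
        {
            std::lock_guard<std::mutex> lock(mtx_);
            closed_ = true;
        }
        cv_.notify_all();
    }

private:
    std::queue<std::string> items_;
    std::mutex mtx_;
    std::condition_variable cv_;
    bool closed_ = false;
};
```

The `close()` path matters as much as `push()`/`pop()`: without it, idle workers would block forever on an empty queue at shutdown.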

Key constraints are politeness and control. I cap crawl depth, guard against duplicate visits, and plan per-domain rate limiting. Fast code is good, but predictable behavior under load is the real engineering target.
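The per-domain rate limiting could be as simple as remembering the last fetch time per domain and sleeping until a minimum interval has passed. A single-threaded sketch (the `DomainThrottle` name is mine; the real version would need a mutex around the map):

```cpp
#include <chrono>
#include <string>
#include <thread>
#include <unordered_map>

// Politeness sketch: before fetching, wait until at least `min_interval`
// has passed since the last request to the same domain.
class DomainThrottle {
public:
    explicit DomainThrottle(std::chrono::milliseconds min_interval)
        : min_interval_(min_interval) {}

    void wait_for(const std::string& domain) {
        using clock = std::chrono::steady_clock;
        auto it = last_fetch_.find(domain);
        if (it != last_fetch_.end()) {
            auto next_allowed = it->second + min_interval_;
            auto now = clock::now();
            if (now < next_allowed)
                std::this_thread::sleep_for(next_allowed - now);  // be polite
        }
        last_fetch_[domain] = clock::now();
    }

private:
    std::chrono::milliseconds min_interval_;
    std::unordered_map<std::string,
                       std::chrono::steady_clock::time_point> last_fetch_;
};
```

Different domains never block each other here, which is the behavior you want: the throttle limits load per host, not overall throughput.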

— Montasir

Implementation

March 13, 2026 - Part 3

I started the implementation by keeping things extremely small. Instead of worrying about threads or link extraction immediately, I just wanted the crawler to fetch a single page and print something useful. The first working version simply accepted a seed URL, fetched the HTML using libcurl, and logged that the request completed successfully.

Once that worked, I added the basic pipeline: a queue of URLs to visit and a set that tracks which URLs have already been crawled. Even though the crawler is simple right now, separating those responsibilities early made the code easier to reason about. The queue handles scheduling, while the visited set prevents the crawler from getting stuck revisiting the same pages.
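The scheduling side of that pipeline can be sketched as a single-threaded loop; `crawl` takes the fetch-and-extract step as an injected function so the loop can be exercised without touching the network (names here are illustrative, not from the actual source):

```cpp
#include <functional>
#include <queue>
#include <string>
#include <unordered_set>
#include <vector>

// Single-threaded sketch of the scheduling loop: the queue decides what
// to crawl next, the visited set prevents revisits. `fetch_links` stands
// in for "fetch page + extract links".
std::vector<std::string> crawl(
    const std::string& seed, size_t max_pages,
    const std::function<std::vector<std::string>(const std::string&)>&
        fetch_links) {
    std::queue<std::string> frontier;
    std::unordered_set<std::string> visited;
    std::vector<std::string> order;  // URLs in the order they were crawled

    frontier.push(seed);
    visited.insert(seed);
    while (!frontier.empty() && order.size() < max_pages) {
        std::string url = frontier.front();
        frontier.pop();
        order.push_back(url);
        for (const auto& link : fetch_links(url)) {
            if (visited.insert(link).second)  // true only for new URLs
                frontier.push(link);
        }
    }
    return order;
}
```

Marking URLs visited at enqueue time (rather than at fetch time) is the detail that keeps the same page from being queued twice by two different parents.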

After the basic flow worked, I began wiring in link extraction so the crawler could actually discover new pages. At this stage I wasn't aiming for perfect HTML parsing yet. The actual goal was just to prove that the crawler could fetch a page, extract links, and continue crawling. Once that feedback loop worked, the rest of the architecture started to make more sense.
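The deliberately naive version of link extraction can be a regex scan for `href` attributes; a real HTML parser would replace this later (the `extract_links` name is mine):

```cpp
#include <regex>
#include <string>
#include <vector>

// Naive link extraction: scan for href="..." attributes with a regex.
// Good enough to close the fetch -> extract -> enqueue feedback loop.
std::vector<std::string> extract_links(const std::string& html) {
    static const std::regex href_re(R"(href\s*=\s*"([^"]+)\")",
                                    std::regex::icase);
    std::vector<std::string> links;
    for (auto it = std::sregex_iterator(html.begin(), html.end(), href_re);
         it != std::sregex_iterator(); ++it)
        links.push_back((*it)[1].str());  // capture group 1 = the URL
    return links;
}
```

This misses single-quoted and unquoted attributes and will happily pull links out of comments, but that's fine for proving the loop works.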

— Montasir

Testing

March 13, 2026 - Part 4

Testing this project was mostly about observing behavior rather than writing formal unit tests. Since the crawler interacts with the real web, the easiest way to verify things was to run it on small, predictable sites and watch what happened.

I started by crawling very small sites with low page limits to confirm that URLs were being discovered and scheduled correctly. Logging helped a lot here: printing each crawled URL made it obvious when the crawler was revisiting pages or failing to extract links properly.

I also tested edge cases like invalid URLs and pages that returned errors. In those cases the crawler shouldn't crash or stall; it should simply skip the page and continue working through the queue. That kind of resilience matters for crawlers because the web is messy and unpredictable.
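The skip-and-continue policy amounts to mapping every failure mode to the same "no result" value and moving on. A sketch under those assumptions (`try_fetch` and `drain` are hypothetical names; the real version would translate libcurl and HTTP errors into the nullopt path):

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Stand-in for the real fetch: anything that fails (invalid URL, HTTP
// error, timeout) maps to std::nullopt instead of throwing or crashing.
std::optional<std::string> try_fetch(const std::string& url) {
    bool valid = url.rfind("http://", 0) == 0 ||
                 url.rfind("https://", 0) == 0;
    if (!valid) return std::nullopt;   // e.g. "htp://typo" is skipped
    return "<html>ok</html>";          // placeholder "success" body
}

// Work through a batch of URLs, counting how many actually fetched.
size_t drain(const std::vector<std::string>& urls) {
    size_t fetched = 0;
    for (const auto& url : urls) {
        auto body = try_fetch(url);
        if (!body) {
            std::cerr << "skipping " << url << "\n";  // log and keep going
            continue;
        }
        ++fetched;
    }
    return fetched;
}
```

One bad URL in the queue should cost exactly one log line, never the whole run.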

— Montasir

Reflection

March 13, 2026 - Part 5

One thing I liked about this project is that it connects directly to concepts I'm learning in my OS class. Queues, synchronization, worker threads: you start seeing why these topics are on OS exams when you build something like this.

Giving myself a 24-hour constraint also changed how I approached the design. Instead of trying to make everything perfect, I focused on building something that worked first and could be improved later, and that mindset made progress much faster.

The crawler is still pretty small, but it already feels like a solid base to experiment with. There are a lot of directions it could go next: better HTML parsing, smarter scheduling, or even distributed crawling. For now though, I'm happy that the core loop works and that I got to apply some systems concepts in a practical way.

This was a great way to wrap up Spring Break. I'll definitely be revisiting this project later to add more features and optimizations, but it was a fun challenge to build something like this in a short time and see how far I could get.

— Montasir

Montasir Moyen - Full-Stack Software Developer & Engineer in Boston