Business Client needs Web Development
Contact person: Business Client
Location: Berlin, Germany
Budget: Recommended by industry experts
Time to start: As soon as possible
Project description:
"**Objective:**
Develop a highly efficient and robust web crawler system. The primary goal is to take a list of domain names (e.g., `[login to view URL]`) as input and output a comprehensive database of all internal URLs found on that domain, with a boolean flag indicating the presence of a primary video player on each page.
**Key Requirements:**
1. **Comprehensive Crawling:**
* The crawler must systematically explore the target domain, starting from the root and following all internal links to a configurable depth.
* The system should employ intelligent crawling strategies to maximize coverage while maintaining efficiency.
* The output must include all discoverable URLs within the domain scope.
2. **Intelligent Video Detection:**
* The system must analyze each crawled page to determine if it contains a **primary video player**. The focus is on the main content video, not ancillary or advertisement videos.
* The solution should implement a hybrid approach for optimal accuracy and performance:
    * **Pattern-Based Heuristics:** Utilize comprehensive rules to identify video players through HTML5 `<video>` tags, embedded player iframes, script patterns, and video-related attributes.
    * **Machine Learning Enhancement:** Implement optional ML-based analysis for ambiguous cases to improve detection confidence where necessary.
3. **Anti-Blocking & Robustness:**
* The system must be designed to operate effectively against modern web defenses, including:
    * Advanced session, cookie, and local storage management
    * CAPTCHA handling through integrated solving services
    * Human-like behavioral patterns with randomized delays and user-agent rotation
    * Bypassing common anti-bot protections and security walls
* The crawler must maintain operation despite encountering isolated errors or blocking attempts.
4. **Output & Performance:**
* The final deliverable must be a structured database (SQLite, PostgreSQL) or CSV file with columns:
    * `url` (The fully qualified page URL)
    * `has_video` (Boolean: TRUE/FALSE)
    * `last_crawled` (Timestamp)
* Performance optimization is critical, utilizing asynchronous processing, connection pooling, and efficient resource management.
**Deliverables:**
1. Complete source code for the crawler and detection system
2. Comprehensive setup and configuration documentation
3. Final output database/CSV for provided domains
**Technical Freedom:**
The technology stack (Python/Scrapy, Node.js, Go) and implementation details are at the freelancer's discretion. Please justify your technical choices based on performance and effectiveness requirements.
**Note:**
Specific details regarding target website characteristics and content types will be discussed in private chat to ensure optimal solution design. Please acknowledge this requirement in your proposal." (client-provided description)
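
As an illustration of requirements 1 and 2 in the description above, the following is a minimal Python sketch (standard library only) of how one fetched page might be scanned for internal links and for pattern-based video signals. It is not part of the client's brief. The names `PageScanner`, `scan_page`, and `PLAYER_HOSTS` are hypothetical, the list of embed hosts is an assumed starting point, and two rules are assumptions made for the example: a link is "internal" only if its host matches the page's host exactly, and a `<video>` that is both `autoplay` and `muted` is treated as ancillary rather than primary.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

# Assumed embed-URL fragments for common players; extend per target site.
PLAYER_HOSTS = ("youtube.com/embed", "player.vimeo.com", "dailymotion.com/embed")


class PageScanner(HTMLParser):
    """Collects internal links and simple video-player signals from one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc.lower()
        self.links = set()
        self.has_video = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # boolean attributes appear with value None
        if tag == "a" and attrs.get("href"):
            # Resolve relative links and drop #fragments before comparing hosts.
            url, _fragment = urldefrag(urljoin(self.page_url, attrs["href"]))
            parts = urlparse(url)
            # Assumption: exact host match defines the domain scope.
            if parts.scheme in ("http", "https") and parts.netloc.lower() == self.host:
                self.links.add(url)
        elif tag == "video":
            # Assumption: muted autoplay clips are background/ad videos.
            if not ("autoplay" in attrs and "muted" in attrs):
                self.has_video = True
        elif tag == "iframe":
            src = (attrs.get("src") or "").lower()
            if any(fragment in src for fragment in PLAYER_HOSTS):
                self.has_video = True


def scan_page(page_url, html):
    """Return (sorted internal links, has_video flag) for one page's HTML."""
    scanner = PageScanner(page_url)
    scanner.feed(html)
    return sorted(scanner.links), scanner.has_video
```

A crawler could call `scan_page(url, html)` for each fetched page, queue the returned links until the configured depth is reached, and write `url`, `has_video`, and a timestamp to the output table. Pages whose players are injected by JavaScript would not be caught by this static scan and would need a rendered-DOM pass or the optional ML step mentioned in the brief.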
Matched companies (3)

Kiantechwise Pvt. Ltd.

TG Coders
