Business Client needs Web Development

Contact person: Business Client

Location: Berlin, Germany

Budget: Recommended by industry experts

Time to start: As soon as possible

Project description:
"**Objective:**
Develop a highly efficient and robust web crawler system. The primary goal is to take a list of domain names (e.g., `[login to view URL]`) as input and output a comprehensive database of all internal URLs found on that domain, with a boolean flag indicating the presence of a primary video player on each page.

**Key Requirements:**

1. **Comprehensive Crawling:**
* The crawler must systematically explore the target domain, starting from the root and following all internal links to a configurable depth.
* The system should employ intelligent crawling strategies to maximize coverage while maintaining efficiency.
* The output must include all discoverable URLs within the domain scope.
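As a minimal sketch of the crawling requirement above, the following breadth-first traversal follows internal links up to a configurable depth. It assumes that "internal" means an exact hostname match (subdomains are treated as external) and that page retrieval is supplied by the caller; `crawl`, `internal_links`, `fetch`, and `max_depth` are illustrative names, not part of the brief.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html, domain):
    """Return absolute, fragment-free http(s) links whose hostname equals `domain`."""
    parser = LinkParser()
    parser.feed(html)
    found = set()
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.hostname == domain:
            found.add(absolute)
    return found


def crawl(root_url, fetch, max_depth=2):
    """Breadth-first crawl from root_url.

    `fetch(url)` is a caller-supplied function (an assumption of this sketch)
    that returns the page HTML, or None if the page could not be retrieved.
    Pages at max_depth are recorded but their links are not followed.
    """
    domain = urlparse(root_url).hostname
    seen = {root_url}
    queue = deque([(root_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        html = fetch(url)
        if html is None:
            continue
        for link in sorted(internal_links(url, html, domain) - seen):
            seen.add(link)
            queue.append((link, depth + 1))
    return seen
```

Passing `fetch` in as a parameter keeps the traversal logic independent of the HTTP client, so the same loop could sit on top of an asynchronous or headless-browser fetcher.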

2. **Intelligent Video Detection:**
* The system must analyze each crawled page to determine if it contains a **primary video player**. The focus is on the main content video, not ancillary or advertisement videos.
* The solution should implement a hybrid approach for optimal accuracy and performance:
    * **Pattern-Based Heuristics:** Utilize comprehensive rules to identify video players through HTML5 `<video>` tags, embedded player iframes, script patterns, and video-related attributes.
    * **Machine Learning Enhancement:** Implement optional ML-based analysis for ambiguous cases to improve detection confidence where necessary.
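The pattern-based half of this requirement could look roughly like the sketch below. The scoring weights, the threshold, the short list of embed hosts, and the ad-label pattern are all hypothetical placeholders chosen for illustration; a production rule set would be tuned against the actual target sites, and the ML step for ambiguous cases is not shown.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately short list of embed hosts.
PLAYER_IFRAME = re.compile(
    r"youtube\.com/embed|youtube-nocookie\.com/embed|player\.vimeo\.com|dailymotion\.com/embed",
    re.I,
)
# Hypothetical hint that an element is an advertisement rather than main content.
AD_HINT = re.compile(r"(^|[\s_-])(ad|ads|advert|sponsored|promo)([\s_-]|$)", re.I)


class VideoSignals(HTMLParser):
    """Accumulates a score from video-related markup."""

    def __init__(self):
        super().__init__()
        self.score = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        label = " ".join(filter(None, (attrs.get("class"), attrs.get("id"))))
        if AD_HINT.search(label):
            return  # ad-labelled elements are treated as ancillary
        if tag == "video":
            # A muted autoplay video is often decorative, so it scores lower.
            decorative = "autoplay" in attrs and "muted" in attrs
            self.score += 1 if decorative else 2
        elif tag == "iframe" and PLAYER_IFRAME.search(attrs.get("src") or ""):
            self.score += 2
        elif tag == "meta" and attrs.get("property") == "og:video":
            self.score += 1


def has_primary_video(html, threshold=2):
    """Return True when the accumulated score reaches the (assumed) threshold."""
    parser = VideoSignals()
    parser.feed(html)
    return parser.score >= threshold
```

A score rather than a single yes/no rule leaves room for the optional ML step: pages that land just under the threshold are natural candidates for the more expensive analysis.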

3. **Anti-Blocking & Robustness:**
* The system must be designed to operate effectively against modern web defenses, including:
    * Advanced session, cookie, and local storage management
    * CAPTCHA handling through integrated solving services
    * Human-like behavioral patterns with randomized delays and user-agent rotation
    * Bypassing common anti-bot protections and security walls
* The crawler must maintain operation despite encountering isolated errors or blocking attempts.
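Two of the robustness points above, randomized delays and continuing past isolated errors, can be illustrated with a small retry wrapper. This is a sketch only: it does not address session management, CAPTCHA handling, or user-agent rotation, and `fetch_with_retries`, `max_attempts`, and `base_delay` are illustrative names and defaults rather than anything specified in the brief.

```python
import random
import time


def fetch_with_retries(url, fetch, max_attempts=3, base_delay=1.0,
                       sleep=time.sleep, rng=random):
    """Call fetch(url), retrying on any exception.

    Between attempts, wait base_delay * 2**attempt seconds plus a random
    jitter in [0, base_delay]. After max_attempts failures, return None so
    that one unreachable URL does not stop the rest of the crawl.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                return None
            sleep(base_delay * (2 ** attempt) + rng.uniform(0, base_delay))
    return None
```

The `sleep` and `rng` parameters exist so the delay behavior can be replaced in tests; in normal use the defaults apply.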

4. **Output & Performance:**
* The final deliverable must be a structured database (SQLite, PostgreSQL) or CSV file with columns:
    * `url` (The fully qualified page URL)
    * `has_video` (Boolean: TRUE/FALSE)
    * `last_crawled` (Timestamp)
* Performance optimization is critical, utilizing asynchronous processing, connection pooling, and efficient resource management.
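For the SQLite option, the three columns listed above might map to a table like the one in this sketch. The table name `pages`, the helper names, and the choice to store the timestamp as an ISO 8601 UTC string are assumptions; SQLite has no native boolean type, so `has_video` is stored as 0 or 1.

```python
import sqlite3
from datetime import datetime, timezone


def open_db(path=":memory:"):
    """Open (or create) the output database with the three required columns."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url          TEXT PRIMARY KEY,
               has_video    INTEGER NOT NULL CHECK (has_video IN (0, 1)),
               last_crawled TEXT NOT NULL
           )"""
    )
    return conn


def record_page(conn, url, has_video, crawled_at=None):
    """Insert a page, or overwrite the existing row if the URL was crawled before."""
    crawled_at = crawled_at or datetime.now(timezone.utc)
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, has_video, last_crawled) VALUES (?, ?, ?)",
        (url, int(has_video), crawled_at.isoformat()),
    )
    conn.commit()
```

Making `url` the primary key means a re-crawl updates the existing row instead of adding a duplicate, which keeps `last_crawled` meaningful.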

**Deliverables:**

1. Complete source code for the crawler and detection system
2. Comprehensive setup and configuration documentation
3. Final output database/CSV for provided domains

**Technical Freedom:**
The technology stack (Python/Scrapy, Node.js, Go) and implementation details are at the freelancer's discretion. Please justify your technical choices based on performance and effectiveness requirements.

**Note:**
Specific details regarding target website characteristics and content types will be discussed in private chat to ensure optimal solution design. Please acknowledge this requirement in your proposal." (client-provided description)


Matched companies (3)

Kiantechwise Pvt. Ltd.

Kiantechwise is a creative tech company delivering innovative web design, software solutions, branding, and digital marketing. With expertise and vis…

TG Coders

We create custom apps for businesses and startups. TG Coders is a technology partner specializing in creating custom mobile and web applications for …

Appsdiary Technologies

AppsDiary is a software house that designs and develops mobile applications, websites, and custom software solutions. They work with businesses to c…