Python Text Mining and Scraper for SEDAR+/TSX: AI Software Development Needed

Contact person: Python Text Mining and Scraper for SEDAR+/TSX

Location: Alexandria, Egypt

Budget: Recommended by industry experts

Time to start: As soon as possible

Project description:
"Python Scraper + Text Mining (SEDAR+/TSX, 2013–2025)
Title
Canadian Corporate Digitalization Dataset (2013–2025): Scraping SEDAR+/TSX, Text Extraction, Digitalization Index & Topic Modeling

Project Overview
I need a freelancer to build a dataset of corporate digitalization disclosure for all Canadian listed companies (approx. 3,476 issuers) over the period 2013–2025.
The work requires:
1. Scraping MD&A, Annual Reports, and AIF from SEDAR+ / TSX.
2. Extracting & cleaning text from reports.
3. Measuring a Digitalization Index (dictionary-based, using keywords from prior academic literature).
4. Conducting Topic Modeling (LDA/STM) to identify digitalization themes.
5. Delivering structured firm–year CSV files and reproducible Python code.

Tasks & Deliverables
1. Scraping (2013–2025)
• Collect issuer list (CSV provided, ~3,476 firms).
• For each issuer × year, download available:
o MD&A (Management's Discussion & Analysis)
o Annual Report
o Annual Information Form (AIF)
• Save PDFs under: data/reports_raw/{FirmName}/{Year}/[login to view URL]
• Provide a manifest (CSV) with: firm, ticker, year, document type, source URL, download date, file path, checksum.
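
A minimal sketch of how one manifest record with a checksum could be produced for a PDF that has already been downloaded. The snake_case column names, the choice of SHA-256, and the helper names (sha256_of, manifest_row, append_to_manifest) are assumptions for illustration; the download step itself is not shown because it depends on how SEDAR+ is accessed.

```python
import csv
import hashlib
from datetime import date
from pathlib import Path

# Column order follows the manifest fields listed above; the snake_case
# names are an assumption, not part of the client's specification.
MANIFEST_FIELDS = ["firm", "ticker", "year", "document_type",
                   "source_url", "download_date", "file_path", "checksum"]


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_row(firm, ticker, year, doc_type, source_url, pdf_path: Path) -> dict:
    """Build one manifest record for an already-downloaded PDF."""
    return {
        "firm": firm,
        "ticker": ticker,
        "year": year,
        "document_type": doc_type,
        "source_url": source_url,
        "download_date": date.today().isoformat(),
        "file_path": pdf_path.as_posix(),
        "checksum": sha256_of(pdf_path),
    }


def append_to_manifest(manifest_path: Path, row: dict) -> None:
    """Append a row, writing the header only when the file is new."""
    is_new = not manifest_path.exists()
    with manifest_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=MANIFEST_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Appending one row per download keeps the manifest usable even if a long scraping run is interrupted, and the checksum lets a rerun detect files that changed or were truncated.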

2. Text Extraction & Cleaning
• Convert PDFs → text ([login to view URL], PyPDF2, OCR fallback).
• Clean text: remove headers, tables, footers, page numbers.
• Save under: data/reports_txt/{FirmName}/{Year}/[login to view URL]
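
The cleaning step could look roughly like the sketch below, which assumes the PDF has already been converted to a list of per-page strings by whichever extractor is used. It drops bare page-number lines and lines that repeat on most pages (a common heuristic for running headers and footers). The function name clean_pages and the 60% repeat threshold are assumptions; table removal is not covered here and would need layout-aware extraction.

```python
import re
from collections import Counter

# Matches lines such as "12", "Page 12" or "Page 12 of 80".
# Heuristic: a line holding only a year (e.g. "2019") is dropped as well.
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def clean_pages(pages, repeat_share=0.6):
    """Join page texts after dropping page numbers and repeated header/footer lines.

    A line counts as a running header or footer when it appears on more
    than `repeat_share` of the pages (only applied to documents with 3+ pages).
    """
    # Count on how many distinct pages each non-empty line occurs.
    line_pages = Counter()
    for page in pages:
        for line in {ln.strip() for ln in page.splitlines() if ln.strip()}:
            line_pages[line] += 1

    threshold = repeat_share * len(pages)
    repeated = {ln for ln, n in line_pages.items()
                if len(pages) > 2 and n > threshold}

    kept = []
    for page in pages:
        for line in page.splitlines():
            stripped = line.strip()
            if not stripped or PAGE_NUMBER.match(stripped) or stripped in repeated:
                continue
            kept.append(stripped)
    return "\n".join(kept)
```

For example, three pages that each start with the same report title and end with a page number would be reduced to their body lines only.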

3. Digitalization Index (Mandatory)
Use a dictionary-based approach with the following keywords compiled from prior academic literature:
Core Digitalization
• digitalization, digitization, digital transformation, digital economy, information technology, information systems
(Bharadwaj et al., 2013; Li et al., 2021)
Technologies
• artificial intelligence, AI, machine learning, ML, deep learning, DL, natural language processing, NLP, computer vision
• robotics, robotic process automation, RPA
• cloud computing, SaaS, PaaS, IaaS, cloud
• blockchain, distributed ledger, DLT
• fintech
• internet of things, IoT, industrial internet
• big data, data analytics
• edge computing
• digital twin
(Verhoef et al., 2021; Chen et al., 2022)
Business Models & Finance
• digital platform, e-commerce, online marketplace
• open banking, mobile banking, mobile payments
• digital banking, neobanking
• application programming interface, API
• microservices
• fintech innovation
(Gomber et al., 2018; Vial, 2019)
Organizational Processes
• digital strategy, IT capability, IT infrastructure
• enterprise resource planning, ERP
• customer relationship management, CRM
• business process automation
• data warehouse, data lake
• omnichannel, multichannel
(Matt et al., 2015; Susanti et al., 2023)
Scoring rules
• Count frequency of these keywords per document.
• Normalize by total word count (per 10,000 words).
• For each firm–year, calculate:
o dict_raw_count (sum of matches)
o dict_score_per_10k (normalized index).
• Save in firm_year_summary.csv.
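
The scoring rules above can be sketched as follows. This is an illustration under stated assumptions, not the required implementation: only a subset of the dictionary is shown, acronyms are matched case-sensitively, plural forms such as "APIs" are not counted, and overlapping terms are counted once with the longest match winning (so "cloud computing" is one hit, not two). The names PHRASES, ACRONYMS and score_document are hypothetical.

```python
import re

# Illustrative subset of the keyword list above (assumption: the full
# dictionary would be loaded from a config file).
PHRASES = ["digital transformation", "machine learning", "cloud computing",
           "cloud", "fintech", "big data", "e-commerce"]
ACRONYMS = ["AI", "RPA", "SaaS", "IoT", "API", "ERP"]


def _alternation(terms):
    """Build a regex alternation, longest term first, tolerant of line breaks."""
    ordered = sorted(terms, key=len, reverse=True)
    return "|".join(r"\s+".join(re.escape(part) for part in term.split())
                    for term in ordered)


# (?<!\w) and (?!\w) act as word boundaries that also work for hyphenated terms.
PHRASE_RE = re.compile(rf"(?<!\w)(?:{_alternation(PHRASES)})(?!\w)", re.IGNORECASE)
ACRONYM_RE = re.compile(rf"(?<!\w)(?:{_alternation(ACRONYMS)})(?!\w)")
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def score_document(text):
    """Return word count, raw keyword hits, and hits per 10,000 words."""
    total_words = len(WORD_RE.findall(text))
    raw = len(PHRASE_RE.findall(text)) + len(ACRONYM_RE.findall(text))
    per_10k = 10_000 * raw / total_words if total_words else 0.0
    return {
        "total_word_count": total_words,
        "dict_raw_count": raw,
        "dict_score_per_10k": round(per_10k, 1),
    }
```

For a firm–year with several documents, one option is to sum the raw counts and word counts across documents before normalizing. The arithmetic matches the client's example figures: 10,000 × 245 / 82,134 ≈ 29.8.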

4. Topic Modeling (Mandatory)
• Apply Latent Dirichlet Allocation (LDA) or a Structural Topic Model (STM) across the corpus.
• Identify digitalization-related latent topics.
• Report:
o Topic distributions for each firm–year.
o Top 10 words per topic with coherence scores.
• Deliver:
o [login to view URL] (firm, year, topic_1_share, …).
o [login to view URL] (topic_id, top_words, coherence_score).
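
Because a firm–year can contain several documents (MD&A, Annual Report, AIF), document-level topic distributions need to be combined into one row per firm–year. The sketch below assumes the topic model (fitted with any LDA library) has already produced one topic distribution per document, and it weights each document by its word count; that weighting, the function name firm_year_topic_shares, and the dominant_topic column are assumptions for illustration.

```python
from collections import defaultdict


def firm_year_topic_shares(doc_rows):
    """Average document-level topic shares into one row per firm-year.

    doc_rows: iterable of (firm, year, word_count, shares), where `shares`
    is the document's topic distribution (one float per topic).
    """
    weighted = {}                  # (firm, year) -> word-count-weighted share sums
    weights = defaultdict(float)   # (firm, year) -> total word count
    for firm, year, word_count, shares in doc_rows:
        key = (firm, year)
        acc = weighted.setdefault(key, [0.0] * len(shares))
        for k, share in enumerate(shares):
            acc[k] += word_count * share
        weights[key] += word_count

    rows = []
    for (firm, year), acc in sorted(weighted.items()):
        total = weights[(firm, year)]
        if total == 0:
            continue  # no usable text for this firm-year
        avg = [s / total for s in acc]
        row = {"firm": firm, "year": year}
        for k, share in enumerate(avg, start=1):
            row[f"topic_{k}_share"] = round(share, 4)
        # Topic ids are 1-based to match the topic_1_share naming.
        row["dominant_topic"] = max(range(len(avg)), key=avg.__getitem__) + 1
        rows.append(row)
    return rows
```

The resulting dictionaries can be written directly with csv.DictWriter to produce the firm–year topic share file.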

5. Final Dataset
Deliver three structured CSV files:
1. [login to view URL] → metadata for every document.
2. [login to view URL] → aggregated per firm–year with Digitalization Index.
3. [login to view URL] → topic shares per firm–year.

6. Code & Documentation
• All scripts in /src.
• [login to view URL] for dependencies.
• [login to view URL] with instructions to rerun pipeline.
• Config file ([login to view URL]) for paths, years, scoring settings.

Example Output
[login to view URL]
firm_name            ticker  year  total_word_count  dict_raw_count  dict_score_per_10k  dominant_topic  digitalization_topic_share
Bank of Nova Scotia  BNS     2019  82,134            245             29.8                4 (FinTech)     0.32
Shopify Inc.         SHOP    2021  61,255            432             70.5                2 (Cloud)       0.55
[login to view URL]
topic_id  top_words                           coherence_score
1         risk, credit, impairment, exposure  0.46
2         cloud, platform, saas, software     0.51
4         fintech, digital, payment, ai       0.54

Application Instructions
Please include in your proposal:
1. Your experience scraping large regulatory datasets (e.g., SEDAR+, EDGAR).
2. Python/NLP experience (dictionary scoring, TF-IDF, topic modeling).
3. How you will handle scanned PDFs.
4. Links to GitHub/portfolio if available.
5. Confirmation you will deliver both Digitalization Index and Topic Modeling outputs as described." (client-provided description)

