Python Text Mining and Scraper for SEDAR+/TSX: AI Software Development Needed
Contact person: Python Text Mining and Scraper for SEDAR+/TSX
Location: Alexandria, Egypt
Budget: Recommended by industry experts
Time to start: As soon as possible
Project description:
"Python Scraper + Text Mining (SEDAR+/TSX, 2013–2025)
Title
Canadian Corporate Digitalization Dataset (2013–2025): Scraping SEDAR+/TSX, Text Extraction, Digitalization Index & Topic Modeling
Project Overview
I need a freelancer to build a dataset of corporate digitalization disclosures for all listed Canadian companies (approx. 3,476 issuers) covering 2013–2025.
The work requires:
1. Scraping MD&A, Annual Reports, and AIF from SEDAR+ / TSX.
2. Extracting & cleaning text from reports.
3. Measuring a Digitalization Index (dictionary-based, using keywords from prior academic literature).
4. Conducting Topic Modeling (LDA/STM) to identify digitalization themes.
5. Delivering structured firm–year CSV files and reproducible Python code.
Tasks & Deliverables
1. Scraping (2013–2025)
• Collect issuer list (CSV provided, ~3,476 firms).
• For each issuer × year, download available:
o MD&A (Management's Discussion & Analysis)
o Annual Report
o Annual Information Form (AIF)
• Save PDFs under:
• data/reports_raw/{FirmName}/{Year}/[login to view URL]
• Provide a manifest (CSV) with: firm, ticker, year, document type, source URL, download date, file path, checksum.
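The manifest step above can be sketched roughly as follows. This is only an illustration, not a required implementation: the column names mirror the fields listed in the manifest bullet, while the use of SHA-256 for the checksum, the ISO date format, and the helper names (sha256_of_file, manifest_row, write_manifest) are assumptions.

```python
import csv
import hashlib
from datetime import date

# Column order follows the manifest fields named in the posting.
MANIFEST_FIELDS = ["firm", "ticker", "year", "document_type",
                   "source_url", "download_date", "file_path", "checksum"]


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks (assumed checksum type)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_row(firm, ticker, year, document_type, source_url, file_path,
                 download_date=None):
    """Build one manifest record for an already-downloaded PDF."""
    return {
        "firm": firm,
        "ticker": ticker,
        "year": year,
        "document_type": document_type,
        "source_url": source_url,
        # Default to today's date in ISO format if the caller gives none.
        "download_date": download_date or date.today().isoformat(),
        "file_path": str(file_path),
        "checksum": sha256_of_file(file_path),
    }


def write_manifest(rows, out_path):
    """Write manifest records to a CSV file with a header row."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=MANIFEST_FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Recording the checksum at download time makes it possible to detect duplicate filings and to verify later that a stored PDF has not changed.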
2. Text Extraction & Cleaning
• Convert PDFs → text ([login to view URL], PyPDF2, OCR fallback).
• Clean text: remove headers, tables, footers, page numbers.
• Save under:
• data/reports_txt/{FirmName}/{Year}/[login to view URL]
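As a rough illustration of the cleaning step, the sketch below assumes the PDF has already been converted to one text string per page. It drops lines that look like page numbers and strips a first or last line that repeats across most pages (a likely running header or footer). The function name clean_pages, the 60% repeat threshold, and the page-number pattern are assumptions; table removal is not covered here.

```python
import re
from collections import Counter

# Matches lines such as "3", "Page 3", "Page 3 of 40", or "3/40".
PAGE_NUMBER_RE = re.compile(
    r"^\s*(?:page\s+)?\d+(?:\s*(?:of|/)\s*\d+)?\s*$", re.IGNORECASE
)


def clean_pages(pages, min_repeat_share=0.6):
    """Join page texts after removing page numbers and repeated edge lines.

    pages: list of str, one entry per PDF page.
    """
    # Split into non-empty lines and drop page-number lines first, so a
    # footer sitting just above the page number becomes the last line.
    split_pages = []
    for page in pages:
        lines = [ln.strip() for ln in page.splitlines() if ln.strip()]
        split_pages.append([ln for ln in lines if not PAGE_NUMBER_RE.match(ln)])

    # Count on how many pages each line appears as the first or last line.
    edge_counts = Counter()
    for lines in split_pages:
        if lines:
            edge_counts.update({lines[0], lines[-1]})

    # A line is treated as header/footer if it sits on the edge of at
    # least two pages and at least min_repeat_share of all pages.
    threshold = max(2, int(min_repeat_share * len(split_pages)))
    repeated = {ln for ln, n in edge_counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        if lines and lines[0] in repeated:
            lines = lines[1:]
        if lines and lines[-1] in repeated:
            lines = lines[:-1]
        if lines:
            cleaned.append("\n".join(lines))
    return "\n".join(cleaned)
```

One known limitation of this heuristic: a body line consisting only of a number (for example a year on its own line) is also removed.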
3. Digitalization Index (Mandatory)
Use a dictionary-based approach with the following keywords compiled from prior academic literature:
Core Digitalization
• digitalization, digitization, digital transformation, digital economy, information technology, information systems
(Bharadwaj et al., 2013; Li et al., 2021)
Technologies
• artificial intelligence, AI, machine learning, ML, deep learning, DL, natural language processing, NLP, computer vision
• robotics, robotic process automation, RPA
• cloud computing, SaaS, PaaS, IaaS, cloud
• blockchain, distributed ledger, DLT
• fintech
• internet of things, IoT, industrial internet
• big data, data analytics
• edge computing
• digital twin
(Verhoef et al., 2021; Chen et al., 2022)
Business Models & Finance
• digital platform, e-commerce, online marketplace
• open banking, mobile banking, mobile payments
• digital banking, neobanking
• application programming interface, API
• microservices
• fintech innovation
(Gomber et al., 2018; Vial, 2019)
Organizational Processes
• digital strategy, IT capability, IT infrastructure
• enterprise resource planning, ERP
• customer relationship management, CRM
• business process automation
• data warehouse, data lake
• omnichannel, multichannel
(Matt et al., 2015; Susanti et al., 2023)
Scoring rules
• Count frequency of these keywords per document.
• Normalize by total word count (per 10,000 words).
• For each firm–year, calculate:
o dict_raw_count (sum of matches)
o dict_score_per_10k (normalized index).
• Save in firm_year_summary.csv.
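The scoring rules above could be sketched as follows. The keyword list here is only a small subset of the dictionary for illustration. Several choices are assumptions rather than requirements of this posting: matching is case-insensitive on whole words, longer phrases take precedence over their sub-terms (so "cloud computing" counts once rather than also counting "cloud"), words are tokenized with a simple regex, and a firm–year with several documents is scored by summing counts and word totals before normalizing.

```python
import re

# Illustrative subset of the dictionary; the full list is in the posting.
KEYWORDS = ["digital transformation", "artificial intelligence", "AI",
            "machine learning", "cloud computing", "cloud", "blockchain", "IoT"]

WORD_RE = re.compile(r"\b\w+\b")


def build_pattern(keywords):
    """Compile one whole-word, case-insensitive pattern, longest phrase first."""
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def per_10k(raw_count, total_words):
    """Normalize a raw match count to matches per 10,000 words."""
    if total_words == 0:
        return 0.0
    return round(raw_count / total_words * 10_000, 1)


def score_document(text, pattern):
    """Return word count, raw keyword matches, and the normalized score."""
    total_words = len(WORD_RE.findall(text))
    raw_count = sum(1 for _ in pattern.finditer(text))
    return {
        "total_word_count": total_words,
        "dict_raw_count": raw_count,
        "dict_score_per_10k": per_10k(raw_count, total_words),
    }


def score_firm_year(texts, pattern):
    """Aggregate several documents of one firm-year (assumed: pooled counts)."""
    docs = [score_document(t, pattern) for t in texts]
    total_words = sum(d["total_word_count"] for d in docs)
    raw_count = sum(d["dict_raw_count"] for d in docs)
    return {
        "total_word_count": total_words,
        "dict_raw_count": raw_count,
        "dict_score_per_10k": per_10k(raw_count, total_words),
    }
```

With the example figures given in this posting, 245 matches in 82,134 words normalize to 29.8 per 10,000 words, and 432 matches in 61,255 words to 70.5.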
4. Topic Modeling (Mandatory)
• Apply Latent Dirichlet Allocation (LDA) or a Structural Topic Model (STM) across the corpus.
• Identify digitalization-related latent topics.
• Report:
o Topic distributions for each firm–year.
o Top 10 words per topic with coherence scores.
• Deliver:
o [login to view URL] (firm, year, topic_1_share, …).
o [login to view URL] (topic_id, top_words, coherence_score).
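To illustrate the shape of the firm–year topic output, the sketch below turns one topic distribution into a row with topic_1_share, topic_2_share, and so on. It does not fit a model: the distribution is assumed to come from an LDA or STM fit done elsewhere. The function name topic_share_row, the 1-based topic numbering, the rounding to four decimals, and the extra dominant_topic field are assumptions.

```python
def topic_share_row(firm, year, topic_dist):
    """Build one output row from a per-topic weight list for a firm-year.

    topic_dist: non-negative weights, one per topic, from a fitted model.
    """
    total = sum(topic_dist)
    if total <= 0:
        raise ValueError("topic distribution must have positive total weight")

    # Renormalize so the shares sum to 1 even if the input was unnormalized.
    shares = [weight / total for weight in topic_dist]

    row = {"firm": firm, "year": year}
    for k, share in enumerate(shares, start=1):  # topics numbered from 1
        row[f"topic_{k}_share"] = round(share, 4)

    # Topic with the largest share; ties resolve to the lowest topic number.
    row["dominant_topic"] = max(range(len(shares)), key=shares.__getitem__) + 1
    return row
```

Rows built this way can be written with csv.DictWriter, using the keys of the first row as the header.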
5. Final Dataset
Deliver three structured CSV files:
1. [login to view URL] → metadata for every document.
2. [login to view URL] → aggregated per firm–year with Digitalization Index.
3. [login to view URL] → topic shares per firm–year.
6. Code & Documentation
• All scripts in /src.
• [login to view URL] for dependencies.
• [login to view URL] with instructions to rerun pipeline.
• Config file ([login to view URL]) for paths, years, scoring settings.
Example Output
[login to view URL]
firm_name | ticker | year | total_word_count | dict_raw_count | dict_score_per_10k | dominant_topic | digitalization_topic_share
Bank of Nova Scotia | BNS | 2019 | 82,134 | 245 | 29.8 | 4 (FinTech) | 0.32
Shopify Inc. | SHOP | 2021 | 61,255 | 432 | 70.5 | 2 (Cloud) | 0.55
[login to view URL]
topic_id | top_words | coherence_score
1 | risk, credit, impairment, exposure | 0.46
2 | cloud, platform, saas, software | 0.51
4 | fintech, digital, payment, ai | 0.54
Application Instructions
Please include in your proposal:
1. Your experience scraping large regulatory datasets (e.g., SEDAR+, EDGAR).
2. Python/NLP experience (dictionary scoring, TF-IDF, topic modeling).
3. How you will handle scanned PDFs.
4. Links to GitHub/portfolio if available.
5. Confirmation you will deliver both Digitalization Index and Topic Modeling outputs as described." (client-provided description)
Matched companies (2)

TG Coders
