Business Client needs Software Development
Contact person: Business Client
Location: Vaikuntam Near AECS Layout, India
Budget: Recommended by industry experts
Time to start: As soon as possible
Project description:
"I need an application that can take batches of mixed PDFs—some purely text-based, others scanned images—and turn each file into a well-structured XML document that validates against XSD files I will supply. The core of the workflow should combine reliable OCR for scanned pages with a large-language-model stage that recognises headings, paragraphs, tables, figures and other logical components before writing them out in the schema-compliant order.
Key points to build into the solution
• One-click ingestion of individual files or whole folders of PDFs
• Automatic detection of whether a page needs OCR (Tesseract/Adobe/Google Vision or similar)
• LLM-driven structural analysis that maps the recognised content to the element hierarchy defined in my XSDs
• Real-time validation: the app must flag any nodes that fail schema checks before final export
• Clear logging so I can trace how each page was processed and why any element was mapped a certain way
• Simple configuration pane where I can add a different XSD without touching the code
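One possible reading of the "automatic detection of whether a page needs OCR" point is a text-layer heuristic: if a page yields almost no selectable text but carries at least one image, send it to OCR. The sketch below assumes an upstream PDF library has already produced the text and image count for each page; `PageInfo`, `needs_ocr`, `route_pages` and the 25-character threshold are hypothetical names and values, not part of the brief.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class PageInfo:
    """Per-page facts assumed to come from an upstream PDF library."""
    number: int
    extracted_text: str  # text layer of the page, possibly empty
    image_count: int     # number of embedded images on the page


def needs_ocr(page: PageInfo, min_chars: int = 25) -> bool:
    """Return True when the text layer is too thin and there is an image to read."""
    visible = "".join(page.extracted_text.split())  # drop all whitespace
    if len(visible) >= min_chars:
        return False  # enough selectable text: treat as a text page
    return page.image_count > 0  # thin text layer plus an image: likely a scan


def route_pages(pages: Iterable[PageInfo]) -> Tuple[List[int], List[int]]:
    """Split page numbers into (text_pages, ocr_pages)."""
    text_pages: List[int] = []
    ocr_pages: List[int] = []
    for page in pages:
        (ocr_pages if needs_ocr(page) else text_pages).append(page.number)
    return text_pages, ocr_pages
```

The threshold would need tuning on real documents, since scanned PDFs sometimes carry a low-quality hidden text layer that this check would accept as text.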
Deliverables
1. Source code with readable comments (Python preferred, but I’m open to other stacks)
2. A command-line interface plus a minimal GUI/Streamlit panel for non-technical use
3. Unit tests and a small sample set showing successful conversion and XSD validation
4. Setup guide covering prerequisites, model keys, and deployment on Windows/Linux
Acceptance criteria
– All sample PDFs (both text and scanned) convert without manual edits and pass xml --schema using my XSDs
– Average page-level accuracy ≥ 95 % on a blind test set I’ll supply at the end
– Runtime under 60 s for a 30-page mixed document on a standard laptop
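The accuracy and runtime criteria above could be checked automatically along the following lines. The brief does not define "page-level accuracy", so this sketch assumes a character-similarity ratio between the expected and generated XML for each page (via `difflib`); the function names are hypothetical, and the two thresholds are the ones stated in the criteria.

```python
from difflib import SequenceMatcher
from statistics import mean
from typing import Sequence, Tuple

ACCURACY_THRESHOLD = 0.95  # from the brief: average page-level accuracy >= 95 %
RUNTIME_LIMIT_S = 60.0     # from the brief: under 60 s for a 30-page document


def page_accuracy(expected: str, actual: str) -> float:
    """Similarity in [0, 1]; an assumed stand-in for the client's own metric."""
    return SequenceMatcher(None, expected, actual).ratio()


def meets_acceptance(
    expected_pages: Sequence[str],
    actual_pages: Sequence[str],
    runtime_s: float,
) -> Tuple[bool, float]:
    """Return (passed, average_accuracy) for one converted document."""
    if not expected_pages or len(expected_pages) != len(actual_pages):
        raise ValueError("expected and actual must be non-empty and equal in length")
    average = mean(
        page_accuracy(e, a) for e, a in zip(expected_pages, actual_pages)
    )
    passed = average >= ACCURACY_THRESHOLD and runtime_s < RUNTIME_LIMIT_S
    return passed, average
```

Whatever metric is finally agreed with the client would replace `page_accuracy`; the surrounding check stays the same.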
If you have prior experience blending OCR, NLP/LLMs (OpenAI, Claude, Llama-2, etc.) and schema-driven XML generation, this will be a straightforward project. Looking forward to seeing how you would architect, train and test the pipeline so that the output is rock-solid and maintainable." (client-provided description)
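The requirement to add a different XSD without touching the code could be met by keeping schema locations in a small config file that the app reads at startup. This is a minimal sketch, assuming a JSON file with a `schemas` mapping of name to XSD path; the file layout and the `load_schema_registry` helper are hypothetical, not something the client specified.

```python
import json
from pathlib import Path
from typing import Dict, Union


def load_schema_registry(config_path: Union[str, Path]) -> Dict[str, Path]:
    """Read {"schemas": {"<name>": "<path to .xsd>"}} and return name -> Path.

    Relative XSD paths are resolved against the config file's folder, so the
    config and the schemas can be moved together.
    """
    config_path = Path(config_path)
    data = json.loads(config_path.read_text(encoding="utf-8"))
    registry: Dict[str, Path] = {}
    for name, raw_path in data.get("schemas", {}).items():
        xsd = Path(raw_path)
        if not xsd.is_absolute():
            xsd = config_path.parent / xsd
        if not xsd.is_file():
            # Fail early so a bad entry is caught before any PDF is processed.
            raise FileNotFoundError(f"XSD for schema '{name}' not found: {xsd}")
        registry[name] = xsd
    return registry
```

A GUI or Streamlit configuration pane would then only need to edit this file; the conversion code looks schemas up by name.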
Matched companies (6)

TechGigs LLP

HJP Media

Codetreasure Co

Haven Futures

B2Bcert ISO consultants in Bangalore
