I want to build a tool that scans a website for all of its URLs: not the URLs linked within a page, but the paths that exist on the site itself. I don't know how to approach this. Could anyone give me an example of how to start? For instance, paths like these:
/upload /login /impress
Not every page has to be linked from another page on that domain, so scanning only the HTML would be useless. As another example of what I'm after, I want to generate a sitemap.xml.
What are you really trying to accomplish?
You're simply not going to be able to do this via HTTP. Absent vulnerabilities in the HTTP server, you only get what the content provider publishes, unless you already know the direct paths. Over HTTP, the only option is a content crawler, and a crawler only finds pages that are linked from somewhere.
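To make the crawler option concrete, here is a minimal sketch in Python using only the standard library. The names LinkParser, fetch_html and crawl are made up for this example, and it assumes plain <a href> links on a single host. It ignores robots.txt, redirects, JavaScript-generated links and rate limiting, so treat it as a starting point rather than a finished tool.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Default fetcher: download a page and decode it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=100):
    """Breadth-first crawl that stays on the host of start_url.

    max_pages caps how many pages are fetched (an arbitrary safety limit).
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            html = fetch(url)
        except Exception:
            continue  # unreachable page: skip it, keep the URL in the result
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return sorted(seen)
```

Calling crawl("http://example.test/") (a placeholder address) returns the sorted list of same-host URLs that were reachable by following links. A page like /upload that nothing links to will not show up, which is exactly the limitation described above.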
Given that, your other option is to index the site at the file system level. You will have to do a lot of work analyzing the files, since a significant number of them most likely won't translate to a URL on the server.
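If you do have access to the server's files, a rough sketch of the file system approach could look like the following. It assumes a static document root where each file maps directly onto a URL path, that index files stand for their directory, and that only the suffixes in PAGE_SUFFIXES are pages. All three are assumptions for this example; a site with rewrite rules or a routing framework will not map this simply. The function names are made up as well.

```python
from pathlib import Path
from urllib.parse import quote
import xml.etree.ElementTree as ET

# Assumption: only files with these suffixes are served as pages.
PAGE_SUFFIXES = {".html", ".htm", ".php"}


def urls_from_docroot(docroot, base_url):
    """Map page files under docroot to URLs below base_url."""
    root = Path(docroot)
    urls = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in PAGE_SUFFIXES:
            continue  # skip directories, stylesheets, images, and so on
        relative = path.relative_to(root).as_posix()
        if path.stem == "index":
            # Assumption: an index file is served at its directory URL.
            relative = relative[: -len(path.name)]
        urls.append(base_url.rstrip("/") + "/" + quote(relative))
    return urls


def build_sitemap(urls):
    """Return a sitemap.xml document (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

For example, build_sitemap(urls_from_docroot("/var/www/html", "http://example.test")) would produce the XML for a sitemap, where both the path and the address are placeholders. The suffix filter is the crude part: deciding which files really correspond to a reachable URL is the analysis work mentioned above.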