# Information Gathering

***

## Dorking

* [**Google Dork Queries**](https://www.exploit-db.com/google-hacking-database)
* [**BishopFox dorking tool**](https://resources.bishopfox.com/resources/tools/google-hacking-diggity/attack-tools/)

## Fingerprinting

**Apache server order of headers**

* Date
* Server
* Last-Modified
* ETag
* Accept-Ranges
* Content-Length
* Connection
* Content-Type

**Nginx order of headers**

* Server
* Date
* Content-Type
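
The orderings above can be compared against a captured response. This is a minimal sketch, assuming the two lists above are representative; the function names are hypothetical, and real servers, versions, and proxies may order headers differently.

```python
# Reference orderings copied from the lists above; real deployments may differ.
KNOWN_ORDERS = {
    "Apache": ["date", "server", "last-modified", "etag", "accept-ranges",
               "content-length", "connection", "content-type"],
    "Nginx": ["server", "date", "content-type"],
}


def header_names(raw_response: str) -> list:
    """Return lower-cased header names in the order they appear."""
    head = raw_response.replace("\r\n", "\n").split("\n\n", 1)[0]
    names = []
    for line in head.split("\n")[1:]:  # skip the status line
        if ":" in line:
            names.append(line.split(":", 1)[0].strip().lower())
    return names


def guess_server_by_order(names: list) -> list:
    """Return the known servers whose header ordering is consistent with `names`."""
    matches = []
    for server, order in KNOWN_ORDERS.items():
        # Keep only the headers both sides share, then compare their order.
        observed = [n for n in names if n in order]
        expected = [n for n in order if n in names]
        if len(observed) >= 2 and observed == expected:
            matches.append(server)
    return matches
```

A match is only a hint; it should be confirmed with the other techniques in this section.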

#### Sending malformed requests

**Apache Server Error**

```html
GET / Abuqasem/1.1

HTTP/1.1 400 Bad Request
Date: Fri, 06 Sep 2019 19:21:01 GMT
Server: Apache/2.4.41 (Unix)
Content-Length: 226
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
</p>
</body></html>
```

**Nginx Server Error**

```html
GET / Abuqasem/1.1

<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.17.3</center>
</body>
</html>
```

**lighttpd Server Error**

```html
GET / Abuqasem/1.1

HTTP/1.0 400 Bad Request
Content-Type: text/html
Content-Length: 345
Connection: close
Date: Sun, 08 Sep 2019 21:56:17 GMT
Server: lighttpd/1.4.54

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>400 Bad Request</title>
 </head>
 <body>
  <h1>400 Bad Request</h1>
 </body>
</html>
```
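
The malformed request used in the samples above could be sent and summarized with a short script. This is a sketch under stated assumptions: `send_raw` and `summarize` are hypothetical helper names, the target speaks plain HTTP (no TLS), and the body banner pattern only covers the nginx sample shown above.

```python
import re
import socket

MALFORMED_REQUEST = b"GET / Abuqasem/1.1\r\n\r\n"  # invalid protocol token, as above


def send_raw(host: str, port: int = 80, timeout: float = 5.0) -> str:
    """Send the malformed request over plain TCP and return the raw reply."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(MALFORMED_REQUEST)
        try:
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        except socket.timeout:
            pass  # keep whatever arrived before the timeout
    return b"".join(chunks).decode("iso-8859-1")


def summarize(raw_response: str) -> dict:
    """Pull out the fields that differ between the sample error pages."""
    text = raw_response.replace("\r\n", "\n")
    first_line = text.split("\n", 1)[0]
    server = re.search(r"^Server:[ \t]*(.+)$", text, re.MULTILINE | re.IGNORECASE)
    banner = re.search(r"<center>(nginx/[^<]+)</center>", text)
    return {
        "status_line": first_line if first_line.startswith("HTTP/") else None,
        "server_header": server.group(1).strip() if server else None,
        "body_banner": banner.group(1) if banner else None,
    }
```

Usage would look like `summarize(send_raw("127.0.0.1"))` against a host you are authorized to test.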

**Using Automated Tools**

* Nmap
* Netcraft
* Nikto
* etc.

## Review Webserver Metafiles for Information Leakage

#### Robots.txt

```html
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
...
```

The [**User-Agent**](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) directive refers to the specific web spider/robot/crawler. For example, the `User-Agent: Googlebot` refers to the spider from Google while `User-Agent: bingbot` refers to a crawler from Microsoft. `User-Agent: *` in the example above applies to all [**web spiders/robots/crawlers**](https://support.google.com/webmasters/answer/6062608?visit_id=637173940975499736-3548411022\&rd=1).
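
Disallowed paths are often the interesting ones for a tester. The following is a rough sketch of grouping `Allow`/`Disallow` rules by user-agent; `parse_robots` is a hypothetical helper, and it deliberately ignores other directives such as `Sitemap` or `Crawl-delay`.

```python
def parse_robots(text: str) -> dict:
    """Group Allow/Disallow paths by the User-agent they apply to."""
    rules = {}
    agents = []       # user-agents of the group currently being read
    in_rules = False  # True once the current group has at least one rule
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
            rules.setdefault(value, {"allow": [], "disallow": []})
        elif field in ("allow", "disallow") and agents:
            in_rules = True
            if value:  # an empty Disallow means "nothing is disallowed"
                for agent in agents:
                    rules[agent][field].append(value)
    return rules
```

For the sample file above, `parse_robots(text)["*"]["disallow"]` would list `/search` and `/sdch`.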

#### META Tags

**1- Robots META Tag**

If there is no `<META NAME="ROBOTS" ... >` entry, the “Robots Exclusion Protocol” defaults to `INDEX,FOLLOW`. The other two valid entries defined by the “Robots Exclusion Protocol” are therefore prefixed with `NO...`, i.e. `NOINDEX` and `NOFOLLOW`.

**2- Miscellaneous META Information Tags**

Organizations often embed informational META tags in web content to support various technologies such as screen readers, social networking previews, search engine indexing, etc.

```html
...
<meta property="og:locale" content="en_US" />
<meta property="og:type" content="website" />
<meta property="og:title" content="The White House" />
<meta property="og:description" content="We, the citizens of America, are now joined in a great national effort to rebuild our country and to restore its promise for all. – President Donald Trump." />
...
```
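
Both kinds of META tag can be collected from a page with the standard library. This is a minimal sketch; `MetaCollector` and `robots_directives` are hypothetical names, and the `index,follow` fallback simply mirrors the default described above.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags keyed by their name or property attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        if key:
            self.meta[key.lower()] = attributes.get("content") or ""


def robots_directives(html: str) -> list:
    """Return the robots directives, falling back to the default."""
    collector = MetaCollector()
    collector.feed(html)
    content = collector.meta.get("robots", "index,follow")
    return [item.strip().lower() for item in content.split(",") if item.strip()]
```

After calling `feed()`, the `meta` dictionary also holds tags such as `og:title`, which can be reviewed for leaked details.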

#### Site Maps

A sitemap is a file in which a developer or organization provides information about the pages, videos, and other files offered by the site or application, and the relationships between them. It is usually served as `sitemap.xml`.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
  <sitemap>
    <loc>https://www.google.com/gmail/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.google.com/forms/sitemaps.xml</loc>
  </sitemap>
...
```
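
Every `<loc>` entry in a sitemap or sitemap index is a candidate path for discovery. A small sketch of extracting them, assuming the file is well-formed XML; `sitemap_locations` is a hypothetical helper, and it ignores the namespace so that it works with different sitemap schema versions.

```python
import xml.etree.ElementTree as ET


def sitemap_locations(xml_text: str) -> list:
    """Return the text of every <loc> element, whatever its namespace."""
    root = ET.fromstring(xml_text)
    locations = []
    for element in root.iter():
        # Namespaced tags look like "{namespace}loc"; compare the local name only.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            locations.append(element.text.strip())
    return locations
```

When the result points at further sitemaps, as in the index above, the same function can be applied to each of them in turn.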

#### Security TXT

`security.txt` is a [**proposed standard**](https://securitytxt.org/), since published as RFC 9116, which allows websites to define security policies and contact details. There are multiple reasons this might be of interest in testing scenarios, including but not limited to:

* Identifying further paths or resources to include in discovery/analysis.
* Open Source intelligence gathering.
* Finding information on Bug Bounties, etc.
* Social Engineering.

The file may be present either in the root of the webserver or in the `.well-known/` directory. Ex:

* `https://example.com/security.txt`
* `https://example.com/.well-known/security.txt`
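
The two locations above can be derived from any URL on the target, and the file itself is a list of `Field: value` lines. The sketch below assumes that simple format; both function names are hypothetical, and fetching the file is left to whatever HTTP client is in use.

```python
from urllib.parse import urljoin


def security_txt_urls(base_url: str) -> list:
    """Return the two candidate locations for security.txt."""
    return [
        urljoin(base_url, "/security.txt"),
        urljoin(base_url, "/.well-known/security.txt"),
    ]


def parse_security_txt(text: str) -> dict:
    """Map each lower-cased field name to the list of values it was given."""
    fields = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blanks, comments, and non-field lines
        name, value = line.split(":", 1)
        fields.setdefault(name.strip().lower(), []).append(value.strip())
    return fields
```

Values such as contact addresses and policy URLs can then be added to the list of resources to review.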

#### Humans TXT

`humans.txt` is an initiative for knowing the people behind a website. It takes the form of a text file that contains information about the different people who have contributed to building the website. See [humanstxt](http://humanstxt.org/) for more info. This file often (though not always) contains information for career or job sites/paths.

#### Other .well-known Information Sources

There are other RFCs and Internet drafts which suggest standardized uses of files within the `.well-known/` directory. Lists of these can be found [here](https://en.wikipedia.org/wiki/List_of_/.well-known/_services_offered_by_webservers) or [here](https://www.iana.org/assignments/well-known-uris/well-known-uris.xhtml).

## Review Webpage content for information leakage

#### Identifying JavaScript Code and Gathering JavaScript Files

Check JavaScript code for any sensitive information leaks which could be used by attackers to further abuse or manipulate the system. Look for values such as: API keys, internal IP addresses, sensitive routes, or credentials. For example:

```js
const myS3Credentials = {
  accessKeyId: config('AWSS3AccessKeyID'),
  secretAccessKey: config('AWSS3SecretAccessKey'),
};
```

```js
var conString = "tcp://postgres:1234@localhost/postgres";
```

When an API key is found, testers can check whether its restrictions are set per service or by IP, HTTP referrer, application, SDK, etc. For example, if testers find a Google Maps API key, they can check whether it is restricted by IP or only restricted to the Google Maps APIs. If **the Google API key is restricted only to the Google Maps APIs, attackers can still use it to query unrestricted Google Maps APIs, and the application owner must pay for that usage. (IMPACT)**

```html
<script type="application/json">
...
{"GOOGLE_MAP_API_KEY":"AIzaSyDUEBnKgwiqMNpDplT6ozE4Z0XxuAbqDi4", "RECAPTCHA_KEY":"6LcPscEUiAAAAHOwwM3fGvIx9rsPYUq62uRhGjJ0"}
...
</script>
```

In some cases, testers may find sensitive routes from JavaScript code, such as links to internal or hidden admin pages:

```html
<script type="application/json">
...
"runtimeConfig":{"BASE_URL_VOUCHER_API":"https://staging-voucher.victim.net/api", "BASE_BACKOFFICE_API":"https://10.10.10.2/api", "ADMIN_PAGE":"/hidden_administrator"}
...
</script>
```
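
A quick pattern scan over downloaded JavaScript can surface the kinds of values shown above. This is a rough sketch, not a complete secret scanner: the three patterns (Google-style API keys, private IPv4 addresses, URLs with embedded credentials) and the name `scan_javascript` are assumptions chosen to match the examples in this section, and every hit still needs manual review.

```python
import ipaddress
import re

# Commonly used pattern for Google API keys: "AIza" followed by 35 characters.
GOOGLE_KEY = re.compile(r"AIza[0-9A-Za-z_\-]{35}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# scheme://user:password@host
CREDENTIAL_URL = re.compile(r"\b[a-z][a-z0-9+.\-]*://[\w.\-]+:[^\s@/]+@[\w.\-]+")


def scan_javascript(text: str) -> dict:
    """Return candidate secrets and internal addresses found in `text`."""
    private_ips = []
    for candidate in IPV4.findall(text):
        try:
            if ipaddress.ip_address(candidate).is_private:
                private_ips.append(candidate)
        except ValueError:
            continue  # looked like an IP but was not valid (e.g. 999.1.1.1)
    return {
        "google_api_keys": GOOGLE_KEY.findall(text),
        "private_ips": list(dict.fromkeys(private_ips)),  # de-duplicate, keep order
        "urls_with_credentials": CREDENTIAL_URL.findall(text),
    }
```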

#### Identifying Source Map Files

Source map files are usually loaded when the browser's DevTools are opened. Testers can also find source map files by appending the `.map` extension to the path of each external JavaScript file. For example, if a tester sees a `/static/js/main.chunk.js` file, they can check for its source map by visiting `/static/js/main.chunk.js.map`.
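
Deriving the candidate source map URL from a script URL is mechanical. A minimal sketch, assuming any query string or fragment should be dropped before `.map` is appended; `source_map_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def source_map_url(js_url: str) -> str:
    """Append .map to the path of a script URL, dropping query and fragment."""
    parts = urlsplit(js_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path + ".map", "", ""))
```

Applying this to every script URL collected from a page gives a list of candidates to request.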

#### Black-Box Testing

Check source map files for any sensitive information that can help the attacker gain more insight about the application. For example:

```json
{
  "version": 3,
  "file": "static/js/main.chunk.js",
  "sources": [
    "/home/sysadmin/cashsystem/src/actions/index.js",
    "/home/sysadmin/cashsystem/src/actions/reportAction.js",
    "/home/sysadmin/cashsystem/src/actions/cashoutAction.js",
    "/home/sysadmin/cashsystem/src/actions/userAction.js",
    "..."
  ],
  "..."
}
```
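
Once a source map has been retrieved, its `sources` array can be summarized to reveal build paths such as the home directory above. This is a sketch that assumes the file is valid JSON with POSIX-style paths; `source_map_summary` is a hypothetical helper.

```python
import json
import posixpath


def source_map_summary(map_text: str) -> dict:
    """Summarize the original file paths disclosed by a source map."""
    data = json.loads(map_text)
    sources = [s for s in data.get("sources", []) if isinstance(s, str)]
    absolute = [s for s in sources if s.startswith("/")]
    return {
        "file": data.get("file"),
        "source_count": len(sources),
        # Longest directory prefix shared by all absolute source paths.
        "common_root": posixpath.commonpath(absolute) if absolute else None,
    }
```

The common root often discloses a username or project name from the build machine.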

## Identify Application Entry Points

### How to Test

Before any testing begins, the tester should always get a good understanding of the application and how the user and browser communicate with it. As the tester walks through the application, they should pay attention to all HTTP requests as well as every parameter and form field that is passed to the application. They should pay special attention to when GET requests are used and when POST requests are used to pass parameters to the application. In addition, they also need to pay attention to when other methods for RESTful services are used. As the tester walks through the application, they should take note of any interesting **parameters in the URL, custom headers, or body of the requests/responses**, and save them in a spreadsheet.

### Requests

* Identify where GETs are used and where POSTs are used.
* Pay attention to hidden fields in POST requests.
* Pay attention to any additional or custom headers not typically seen (such as `debug: false`).
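
The notes on requests can be kept in a spreadsheet-friendly form. The sketch below turns a request into one CSV row per parameter; the column names and both function names are hypothetical, and it assumes bodies are URL-encoded forms.

```python
import csv
import io
from urllib.parse import parse_qsl, urlsplit


def entry_point_rows(method: str, url: str, body: str = "") -> list:
    """Return one row per query-string or form-body parameter."""
    parts = urlsplit(url)
    rows = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        rows.append([method, parts.path, "query", name, value])
    for name, value in parse_qsl(body, keep_blank_values=True):
        rows.append([method, parts.path, "body", name, value])
    return rows


def to_csv(rows: list) -> str:
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["method", "path", "location", "parameter", "value"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Blank values are kept on purpose, since empty hidden fields are still entry points.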

### Responses

* Identify where new cookies are set (`Set-Cookie` header), modified, or added to.
* Identify where there are any redirects (3xx HTTP status codes), 4xx status codes (in particular 403 Forbidden), and 500 internal server errors during normal responses (i.e., unmodified requests).
* Also note where any interesting headers are used. For example, `Server: BIG-IP` indicates that the site is load balanced. Thus, **if a site is load balanced and one server is incorrectly configured, then the tester might have to make multiple requests to access the vulnerable server, depending on the type of load balancing used**.
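
The response checks listed above can be applied to a captured raw response. A minimal sketch, assuming the response starts with a standard status line; `note_response` is a hypothetical helper, and it uses the standard library's email parser only because HTTP headers share the same `Name: value` layout.

```python
from email import message_from_string


def note_response(raw_response: str) -> dict:
    """Record the status class, cookies set, and notable headers."""
    text = raw_response.replace("\r\n", "\n")
    status_line, _, rest = text.partition("\n")
    status = int(status_line.split()[1])
    headers = message_from_string(rest)  # header lookups are case-insensitive
    cookies = [c.split("=", 1)[0].strip() for c in headers.get_all("Set-Cookie", [])]
    return {
        "status": status,
        "redirect": 300 <= status < 400,
        "client_error": 400 <= status < 500,
        "server_error": 500 <= status < 600,
        "cookies_set": cookies,
        "location": headers.get("Location"),
        "server": headers.get("Server"),
    }
```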

### Gray-Box Testing

**Using OWASP Attack Surface Detector (useful for code review)**

The Attack Surface Detector (ASD) tool investigates the source code and uncovers the endpoints of a web application, the parameters these endpoints accept, and the data types of those parameters. This includes unlinked endpoints that a spider will not be able to find, and optional parameters that are entirely unused in client-side code. It can also calculate the changes in attack surface between two versions of an application. It is available for Burp Suite and ZAP, and as a CLI: <https://github.com/secdec/attack-surface-detector-cli/releases>

## Fingerprint Web Application Framework

### Black-Box Testing

* HTTP headers
* Cookies
* HTML source code
* Specific files and folders
* File extensions
* Error messages

#### HTTP Headers

The most basic way to identify a web framework is to look at the `X-Powered-By` field in the HTTP response headers. Many tools can be used to fingerprint a target; the simplest one is netcat.

```bash
$ nc 127.0.0.1 80
HEAD / HTTP/1.0

HTTP/1.1 200 OK
Server: nginx/1.0.14
[...]
X-Powered-By: Mono
```

```bash
HTTP/1.1 200 OK
Server: nginx/1.4.1
Date: Sat, 07 Sep 2013 09:22:52 GMT
Content-Type: text/html
Connection: keep-alive
Vary: Accept-Encoding
X-Powered-By: PHP/5.4.16-1~dotdeb.1
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
X-Generator: Swiftlet
```
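
The headers of interest in the two responses above can be pulled out of a saved response automatically. This is a sketch only: the set of headers checked (`Server`, `X-Powered-By`, `X-Generator`) is taken from the samples above, `framework_hints` is a hypothetical helper, and such headers can be removed or forged by the server.

```python
HINT_HEADERS = ("server", "x-powered-by", "x-generator")


def framework_hints(raw_response: str) -> dict:
    """Return product/version pairs from headers that tend to reveal the stack."""
    head = raw_response.replace("\r\n", "\n").split("\n\n", 1)[0]
    hints = {}
    for line in head.split("\n"):
        name, separator, value = line.partition(":")
        tokens = value.split()
        if not separator or not tokens or name.strip().lower() not in HINT_HEADERS:
            continue
        # "PHP/5.4.16-1~dotdeb.1" -> product "PHP", version "5.4.16-1~dotdeb.1"
        product, _, version = tokens[0].partition("/")
        hints[name.strip().lower()] = {"product": product, "version": version or None}
    return hints
```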

#### Cookies

Example of a CakePHP cookie: ![Cakephp\_cookie](/files/5az7c5AQHeSl0AtqJ1J8)

#### HTML Source Code

Example of the ZK framework in HTML source: ![Zk\_html\_source](/files/OHwTjT5zaKedzEt76gm5)

#### Specific Files and Folders

* wp-admin
* Dirbusting plugins for changelog files
* robots.txt

#### File Extensions

* `.php` – PHP
* `.aspx` – Microsoft ASP.NET
* `.jsp` – Java Server Pages
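
The mapping above can be applied to a list of crawled URLs. A small sketch, assuming only the three extensions listed here; `technology_for` is a hypothetical helper, and URL rewriting on the server can hide or fake extensions.

```python
import posixpath
from urllib.parse import urlsplit

# Only the extensions listed above; extend as needed.
EXTENSIONS = {
    ".php": "PHP",
    ".aspx": "Microsoft ASP.NET",
    ".jsp": "Java Server Pages",
}


def technology_for(url: str):
    """Return the technology suggested by the URL's file extension, if any."""
    path = urlsplit(url).path  # ignore query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return EXTENSIONS.get(extension)
```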

#### Common Identifiers

<https://owasp.org/www-project-web-security-testing-guide/stable/4-Web_Application_Security_Testing/01-Information_Gathering/08-Fingerprint_Web_Application_Framework> (scroll down)

#### Tool

A useful tool for analyzing and identifying attack vectors: <https://github.com/fuzzdb-project/fuzzdb>

## Map Application Architecture

The application architecture needs to be mapped through testing to determine which components are used to build the web application. In more complex setups, such as an online banking system, multiple servers might be involved. These may include a reverse proxy, a front-end web server, an application server, and a database server or LDAP server. Each of these servers is used for a different purpose and might even be segregated into different networks with firewalls between them. This creates different network zones, so that access to the web server does not necessarily grant access to the other components.

Getting knowledge of the application architecture can be easy **if this information is provided to the testing team by the application developers in document form or through interviews**, but can also prove very difficult in a blind penetration test.

* ICMP filtering may indicate a firewall.

#### Detecting a reverse proxy

* Analysis of the web server banner, which might directly disclose the existence of a reverse proxy.
* Weird responses from the server.
* Getting a different status code after a simple attack such as SQLi.
* Getting a different error message after an attack.
* In some cases, even the protection system gives itself away. Here’s an example of mod\_security self-identifying: ![10\_mod\_security](/files/FMpfCHtKMbf1ZHXf8MFW)
* Detecting load balancers may be done by examining multiple requests and comparing results to determine whether the requests are going to the same or different web servers, for example based on the `Date` header if the server clocks are not synchronized. In some cases, the network load balancing process might inject new information into the headers that makes it stand out distinctly, like the `BIGipServer`-prefixed cookie introduced by F5 BIG-IP load balancers.
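
The load balancer checks in the last bullet can be sketched as a comparison across several responses collected in the order the requests were sent. The heuristics and the name `load_balancer_hints` are assumptions for illustration: differing `Server` banners, a `Date` header that moves backwards between consecutive responses, and cookies whose name starts with `BIGipServer`.

```python
from email.utils import parsedate_to_datetime


def load_balancer_hints(responses: list) -> dict:
    """`responses` is a list of header lists [(name, value), ...], one per request."""
    servers, bigip_cookies, dates = set(), [], []
    for headers in responses:
        for name, value in headers:
            key = name.lower()
            if key == "server":
                servers.add(value)
            elif key == "set-cookie" and value.startswith("BIGipServer"):
                bigip_cookies.append(value.split("=", 1)[0])
            elif key == "date":
                try:
                    dates.append(parsedate_to_datetime(value))
                except (TypeError, ValueError):
                    pass  # ignore dates that cannot be parsed
    # A later response carrying an earlier Date suggests unsynchronized clocks.
    went_backwards = any(later < earlier for earlier, later in zip(dates, dates[1:]))
    return {
        "distinct_server_banners": sorted(servers),
        "bigip_cookies": bigip_cookies,
        "date_went_backwards": went_backwards,
    }
```

None of these signals is conclusive on its own; they only indicate that more than one backend may be answering.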

