Parsing NGINX log in Splunk

Configure regex in field extractor to create relevant fields

  1. Web request
    1. Regex
    2. Event
    3. Fields
  2. Proxy request
    1. Regex
    2. Event
    3. Fields
  3. Upstream error response
    1. Regex
    2. Event
    3. Fields
  4. Upstream epoll error
    1. Regex
    2. Event
    3. Fields
  5. Upstream epoll error with referrer
    1. Regex
    2. Event
    3. Fields

For web server’s access log, Splunk has built-in support for Apache only. Splunk has a feature called field extractor. It is powered by delimiter and regex, and enables user to add new fields to be used in a search query. This post will only covers the regex patterns to parse nginx log, for instruction on field extractor, I recommend perusing the official documentation.

To illustrate, say we have a log format like this:

{id} "{http.request.host}" "{http.request.header.user-agent}"

An example log is:

123 "example.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0"

While you could search for a specific keyword, e.g. attempts of Log4shell exploit, since there are no fields, you cannot run any statistics like table or stats on the search results.

Splunk is able to understand Apache log format because its field extractor already includes the necessary regex patterns to parse the relevant fields of each line in a log. Choosing a source type is equivalent of choosing a log format. If a format is not listed in the default list, we can either use an add-on or create new fields using field extractor. There is a Splunk add-on for nginx and I suggest to try it before resorting to field extractor.

I create five patterns which cover most of the nginx events I encountered during my work. Refer to the documentation for supported syntax.

A field is extracted through “capturing group”.

(?<field_name>capture pattern)

For example, (?<month>\w+) searches for one or more (+) alphanumeric characters (\w) and names the field as month. I opted for lazier matching, mostly using unbounded quantifier + instead of a stricter range of occurrences {M,N} despite knowing the exact pattern of a field. I found some fields may stray off slightly from the expected pattern, so a lazier matching tends match more events without matching unwanted’s.

Web request §

Regex §

(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<remote_ip>[\d\.]+)(?:\s\d+\s\S+\s\S+\s)\[(?<time_local>\S+)\s(?<timezone>\+\d{4})\]\s"(?<http_method>\w+)\s(?<http_path>.+)\s(?<http_version>HTTP/\d\.\d)"\s(?<http_status>\d{3})\s(?:\d+)\s"(?<request_url>.[^"]*)"\s"(?<http_user_agent>.[^"]*)"\s(?<server_ip>[\d\.]+)\:(?<server_port>\d+)(?:\s\d+\s\d+\s)(?<ssl_version>\S+)\s(?<ssl_cipher>\S+)\s(?<http_cookie>\S+)

Event §

Dec 24 01:23:45 192.168.0.2 nginx: 1.2.3.4 55763 - - [24/Dec/2021:01:23:45 +0000] "GET /page.html HTTP/2.0" 200 494 "https://www.example.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0" 192.168.1.2:8080 123 4 TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 abcdef .

Fields §

FieldValueRegexExplanation
monthDec(?<month>\w+)One or more alphanumeric
day24(?<day>\d+)One or more digit
time01:23:45(?<time>[\d\:]+)One or more digit or semicolon
proxy_ip192.168.0.2(?<proxy_ip>[\d\.]+)One or more digit or dot
remote_ip1.2.3.4(?<remote_ip>[\d\.]+)
time_local24/Dec/2021:01:23:45(?<time_local>\S+)One or more non-whitespace characters
timezone+0000(?<timezone>[\+\-]\d{4})Four digits with plus or minus prefix
http_methodGET(?<http_method>\w+)
http_path/page.html(?<http_path>.+)One or more of any character
http_versionHTTP/2.0(?<http_version>HTTP/\d\.\d)“HTTP”, a digit, dot and digit
http_status200(?<http_status>\d{3})Three digits
request_urlhttps://www.example.com(?<request_url>.[^"]*)Zero or more of any character except double quote
http_user_agentMozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0(?<http_user_agent>.[^"]*)
server_ip192.168.1.2(?<server_ip>[\d\.]+)
server_port8080(?<server_port>\d+)
ssl_versionTLSv1.2(?<ssl_version>\S+)
ssl_cipherECDHE-RSA-AES128-GCM-SHA256(?<ssl_cipher>\S+)
http_cookieabcdef(?<http_cookie>\S+)

nginx is configured as a reverse proxy, proxy_ip is its ip whereas server_ip is the upstream’s.

Proxy request §

Regex §

(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<year>\d{4})\/(?<nmonth>\d{2})(?:\/\d{2}\s[\d\:]+\s)\[(?<log_level>\w+)\](?:\s\d+#\d+\:\s\*\d+\sclient\s)(?<remote_ip>[\d\.]+)\:(?<remote_port>\d+)(?:\sconnected\sto\s)(?<server_ip>[\d\.]+)\:(?<server_port>\d+)

Event §

Dec 24 01:23:45 192.168.0.2 nginx: 2021/12/24 01:23:45 [info] 1776#1776:*114333142 client 1.2.3.4:19802 connected to 192.168.1.2:8080

Fields §

FieldValueRegexExplanation
monthDec(?<month>\w+)
day24(?<day>\d+)
time01:23:45(?<time>[\d\:]+)
proxy_ip192.168.0.2(?<proxy_ip>[\d\.]+)
year2021(?<year>\d{4})
nmonth12(?<nmonth>\d{2})
log_levelinfo(?<log_level>\w+)
remote_ip1.2.3.4(?<remote_ip>[\d\.]+)
remote_port19802(?<remote_port>\d+)
server_ip192.168.1.2(?<server_ip>[\d\.]+)
server_port8080(?<server_port>\d+)

Upstream error response §

Regex §

(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<year>\d{4})\/(?<nmonth>\d{2})(?:\/\d{2}\s[\d\:]+\s)\[(?<log_level>\w+)\](?:\s\d+#\d+\:\s\*\d+\s)(?<upstream_error>.[^,]*)(?:,\sclient\:\s)(?<remote_ip>[\d\.]+)(?:,\sserver\:\s)(?<server_host>.[^,]*)(?:,\srequest\:\s")(?<http_method>\w+)\s(?<http_path>\S+)\s(?<http_version>HTTP/\d\.\d)(?:",\supstream\:\s")(?<upstream_url>.[^"]*)",\shost\:\s"(?<upstream_host>.[^"]*)

Event §

Dec 24 01:23:45 192.168.0.2 nginx: 2021/12/24 01:23:45 [error] 1776#1776:*71197740 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 1.2.3.4, server: example.com, request: "POST /api/path HTTP/2.0",upstream: "http://192.168.1.2:8080/api/path", host:"example.com"

Fields §

FieldValueRegexExplanation
monthDec(?<month>\w+)
day24(?<day>\d+)
time01:23:45(?<time>[\d\:]+)
proxy_ip192.168.0.2(?<remote_ip>[\d\.]+)
year2021(?<year>\d{4})
nmonth12(?<nmonth>\d{2})
log_levelerror(?<log_level>\w+)
upstream_errorupstream timed out (110: Connection timed out) while reading response header from upstream(?<upstream_error>.[^,]*)Zero or more of any character except comma
remote_ip1.2.3.4(?<remote_ip>[\d\.]+)
server_hostexample.com(?<server_host>.[^,]*)
http_methodPOST(?<http_method>\w+)
http_path/api/path(?<http_path>\S+)
http_versionHTTP/2.0(?<http_version>HTTP/\d\.\d)
upstream_urlhttp://192.168.1.2:8080/api/path(?<upstream_url>.[^"]*)
upstream_hostexample.com(?<upstream_host>.[^"]*)

Upstream epoll error §

Regex §

(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<year>\d{4})\/(?<nmonth>\d{2})(?:\/\d{2}\s[\d\:]+\s)\[(?<log_level>\w+)\](?:\s\d+#\d+\:\s\*\d+\s)(?<upstream_error>[^,]*,[^,]*)(?:,\sclient\:\s)(?<remote_ip>[\d\.]+)(?:,\sserver\:\s)(?<server_host>.[^,]*)(?:,\srequest\:\s")(?<http_method>\w+)\s(?<http_path>\S+)\s(?<http_version>HTTP/\d\.\d)(?:",\supstream\:\s")(?<upstream_url>.[^"]*)(?:",\shost\:\s")(?<upstream_host>.[^"]*)

Event §

Dec 24 01:23:45 192.168.0.2 nginx: 2021/12/24 01:23:45 [info] 13199#13199: *81574833 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while connecting to upstream, client: 1.2.3.4, server: example.com, request: "GET /page.html HTTP/1.1", upstream:"http://192.168.1.2/page.html", host: "example.com"

Fields §

FieldValueRegexExplanation
monthDec(?<month>\w+)
day24(?<day>\d+)
time01:23:45(?<time>[\d\:]+)
proxy_ip192.168.0.2(?<remote_ip>[\d\.]+)
year2021(?<year>\d{4})
nmonth12(?<nmonth>\d{2})
log_levelinfo(?<log_level>\w+)
upstream_errorepoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while connecting to upstream(?<upstream_error>.[^,]*)
remote_ip1.2.3.4(?<remote_ip>[\d\.]+)
server_hostexample.com(?<server_host>.[^,]*)
http_methodGET(?<http_method>\w+)
http_path/page.html(?<http_path>\S+)
http_versionHTTP/1.1(?<http_version>HTTP/\d\.\d)
upstream_urlhttp://192.168.1.2/page.html(?<upstream_url>.[^"]*)
upstream_hostexample.com(?<upstream_host>.[^"]*)

Upstream epoll error with referrer §

Regex §

(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<year>\d{4})\/(?<nmonth>\d{2})(?:\/\d{2}\s[\d\:]+\s)\[(?<log_level>\w+)\](?:\s\d+#\d+\:\s\*\d+\s)(?<upstream_error>[^,]*,[^,]*)(?:,\sclient\:\s)(?<remote_ip>[\d\.]+)(?:,\sserver\:\s)(?<server_host>.[^,]*)(?:,\srequest\:\s")(?<http_method>\w+)\s(?<http_path>\S+)\s(?<http_version>HTTP/\d\.\d)(?:",\supstream\:\s")(?<upstream_url>.[^"]*)(?:",\shost\:\s")(?<upstream_host>.[^"]*)(?:",\sreferrer\:\s")(?<referrer>.[^"]*)

Event §

Dec 24 01:23:45 192.168.0.2 nginx: 2021/12/24 01:23:45 [info] 1776#1776:*71220252 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream, client: 1.2.3.4, server: example.com, request: "GET /page.html HTTP/1.1", upstream: "http://192.168.1.2:8080/page.html", host: "example.com", referrer: "https://example.com"

Fields §

FieldValueRegexExplanation
monthDec(?<month>\w+)
day24(?<day>\d+)
time01:23:45(?<time>[\d\:]+)
proxy_ip192.168.0.2(?<remote_ip>[\d\.]+)
year2021(?<year>\d{4})
nmonth12(?<nmonth>\d{2})
log_levelinfo(?<log_level>\w+)
upstream_errorepoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream(?<upstream_error>.[^,]*)
remote_ip1.2.3.4(?<remote_ip>[\d\.]+)
server_hostexample.com(?<server_host>.[^,]*)
http_methodGET(?<http_method>\w+)
http_path/page.html(?<http_path>\S+)
http_versionHTTP/1.1(?<http_version>HTTP/\d\.\d)
upstream_urlhttp://192.168.1.2:8080/page.html(?<upstream_url>.[^"]*)
upstream_hostexample.com(?<upstream_host>.[^"]*)
referrerhttps://example.com(?<referrer>.[^"]*)