ip访问记录查询top3

#前言

昨晚 4399 笔试：

有一个文件 ip.txt，每行一条 ip 记录，共若干行，如何统计出现次数最多的前 3 个 ip 及其次数？

被问懵了，记录下，学习之。

#生成测试数据

首先用 Python 来生成一下测试数据：

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import random

for i in range(10000):
	n = 4
	if i % 1000 == 0:
		n = 20
	a = random.randint(1, 255)
	b = random.randint(0, 255)
	c = random.randint(0, 255)
	d = random.randint(0, 255)
	for j in range(random.randint(1, n)):
		print(str(a)+"."+str(b)+"."+str(c)+"."+str(d))

然后把数据保存进ip.txt文件：

1	python generate_ip.py >> ip.txt

#查看基本信息

获取前五行的信息：

$ head ip.txt -n 5
34.150.120.118
34.150.120.118
34.150.120.118
34.150.120.118
34.150.120.118

获取最后五行的信息：

$ tail ip.txt -n 5
145.250.127.90
145.250.127.90
145.250.127.90
47.14.104.50
10.100.11.223

一共有多少行：

1 2	$ wc -l ip.txt 25282 ip.txt

#ip 记录 top3

#排序

统计方法uniq只能对连续的相同行生效，一般使用之前都需要先排序，用sort命令：

1	sort ip.txt

会全都打印到标准输出流里，看看最后 10 行就好啦：

$ sort ip.txt | tail -n 10
99.91.60.161
99.91.60.161
99.91.60.161
99.91.60.161
99.93.105.24
99.96.209.20
99.96.209.20
99.96.209.20
99.96.209.20
99.98.96.57

排序没问题。

#统计个数

把排序的输出作为uniq的输入，uniq的用法如下：

$ uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).

With no options, matching lines are merged to the first occurrence.

Mandatory arguments to long options are mandatory for short options too.
  -c, --count           prefix lines by the number of occurrences
  -d, --repeated        only print duplicate lines, one for each group
  -D, --all-repeated[=METHOD]  print all duplicate lines
                          groups can be delimited with an empty line
                          METHOD={none(default),prepend,separate}
  -f, --skip-fields=N   avoid comparing the first N fields
      --group[=METHOD]  show all items, separating groups with an empty line
                          METHOD={separate(default),prepend,append,both}
  -i, --ignore-case     ignore differences in case when comparing
  -s, --skip-chars=N    avoid comparing the first N characters
  -u, --unique          only print unique lines
  -z, --zero-terminated  end lines with 0 byte, not newline
  -w, --check-chars=N   compare no more than N characters in lines
      --help     display this help and exit
      --version  output version information and exit

A field is a run of blanks (usually spaces and/or TABs), then non-blank
characters.  Fields are skipped before chars.

Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, comparisons honor the rules specified by 'LC_COLLATE'.

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'uniq invocation'

我们用-c参数即可：

1	sort ip.txt \| uniq -c

然后我们就得到了每个 ip 及其出现次数。

#再排序

这次需要按照次数排序而不是字典序排序了，先看sort的参数：

$ sort --help
Usage: sort [OPTION]... [FILE]...
  or:  sort [OPTION]... --files0-from=F
Write sorted concatenation of all FILE(s) to standard output.

Mandatory arguments to long options are mandatory for short options too.
Ordering options:

  -b, --ignore-leading-blanks  ignore leading blanks
  -d, --dictionary-order      consider only blanks and alphanumeric characters
  -f, --ignore-case           fold lower case to upper case characters
  -g, --general-numeric-sort  compare according to general numerical value
  -i, --ignore-nonprinting    consider only printable characters
  -M, --month-sort            compare (unknown) < 'JAN' < ... < 'DEC'
  -h, --human-numeric-sort    compare human readable numbers (e.g., 2K 1G)
  -n, --numeric-sort          compare according to string numerical value
  -R, --random-sort           sort by random hash of keys
      --random-source=FILE    get random bytes from FILE
  -r, --reverse               reverse the result of comparisons
      --sort=WORD             sort according to WORD:
                                general-numeric -g, human-numeric -h, month -M,
                                numeric -n, random -R, version -V
  -V, --version-sort          natural sort of (version) numbers within text

Other options:

      --batch-size=NMERGE   merge at most NMERGE inputs at once;
                            for more use temp files
  -c, --check, --check=diagnose-first  check for sorted input; do not sort
  -C, --check=quiet, --check=silent  like -c, but do not report first bad line
      --compress-program=PROG  compress temporaries with PROG;
                              decompress them with PROG -d
      --debug               annotate the part of the line used to sort,
                              and warn about questionable usage to stderr
      --files0-from=F       read input from the files specified by
                            NUL-terminated names in file F;
                            If F is - then read names from standard input
  -k, --key=KEYDEF          sort via a key; KEYDEF gives location and type
  -m, --merge               merge already sorted files; do not sort
  -o, --output=FILE         write result to FILE instead of standard output
  -s, --stable              stabilize sort by disabling last-resort comparison
  -S, --buffer-size=SIZE    use SIZE for main memory buffer
  -t, --field-separator=SEP  use SEP instead of non-blank to blank transition
  -T, --temporary-directory=DIR  use DIR for temporaries, not $TMPDIR or /tmp;
                              multiple options specify multiple directories
      --parallel=N          change the number of sorts run concurrently to N
  -u, --unique              with -c, check for strict ordering;
                              without -c, output only the first of an equal run
  -z, --zero-terminated     end lines with 0 byte, not newline
      --help     display this help and exit
      --version  output version information and exit

KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a
field number and C a character position in the field; both are origin 1, and
the stop position defaults to the line's end.  If neither -t nor -b is in
effect, characters in a field are counted from the beginning of the preceding
whitespace.  OPTS is one or more single-letter ordering options [bdfgiMhnRrV],
which override global ordering options for that key.  If no key is given, use
the entire line as the key.

SIZE may be followed by the following multiplicative suffixes:
% 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.

With no FILE, or when FILE is -, read standard input.

*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'sort invocation'

我们需要用到的是-n参数，根据数字大小排序：

1	sort ip.txt \| uniq -c \| sort -n

最后我们需要的是最大的三个，先降序排：

1	sort ip.txt \| uniq -c \| sort -rn

然后取top3：

$ sort ip.txt | uniq -c | sort -rn | head -n 3
     20 79.137.86.84
     18 178.40.3.117
     17 224.64.206.181

收工！

#参考链接

有一个文件 ip.txt，每行一条 ip 记录，共若干行，下面哪个命令可以实现“统计出现次数最多的前 3 个 ip 及其次数”？

Contents

#前言