#前言

昨晚 4399 笔试:

有一个文件 ip.txt,每行一条 ip 记录,共若干行,如何统计出现次数最多的前 3 个 ip 及其次数?

被问懵了,记录下,学习之。

#生成测试数据

首先用 Python 来生成一下测试数据:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import random

for i in range(10000):
n = 4
if i % 1000 == 0:
n = 20
a = random.randint(1, 255)
b = random.randint(0, 255)
c = random.randint(0, 255)
d = random.randint(0, 255)
for j in range(random.randint(1, n)):
print(str(a)+"."+str(b)+"."+str(c)+"."+str(d))

然后把数据保存进ip.txt文件:

1
python generate_ip.py >> ip.txt

#查看基本信息

获取前五行的信息:

1
2
3
4
5
6
$ head ip.txt -n 5
34.150.120.118
34.150.120.118
34.150.120.118
34.150.120.118
34.150.120.118

获取最后五行的信息:

1
2
3
4
5
6
$ tail ip.txt -n 5
145.250.127.90
145.250.127.90
145.250.127.90
47.14.104.50
10.100.11.223

一共有多少行:

1
2
$ wc -l ip.txt
25282 ip.txt

#ip 记录 top3

#排序

统计方法uniq只能对连续的相同行生效,一般使用之前都需要先排序,用sort命令:

1
sort ip.txt

会全都打印到标准输出流里,看看最后 10 行就好啦:

1
2
3
4
5
6
7
8
9
10
11
$ sort ip.txt | tail -n 10
99.91.60.161
99.91.60.161
99.91.60.161
99.91.60.161
99.93.105.24
99.96.209.20
99.96.209.20
99.96.209.20
99.96.209.20
99.98.96.57

排序没问题。

#统计个数

把排序的输出作为uniq的输入,uniq的用法如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
$ uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).

With no options, matching lines are merged to the first occurrence.

Mandatory arguments to long options are mandatory for short options too.
-c, --count prefix lines by the number of occurrences
-d, --repeated only print duplicate lines, one for each group
-D, --all-repeated[=METHOD] print all duplicate lines
groups can be delimited with an empty line
METHOD={none(default),prepend,separate}
-f, --skip-fields=N avoid comparing the first N fields
--group[=METHOD] show all items, separating groups with an empty line
METHOD={separate(default),prepend,append,both}
-i, --ignore-case ignore differences in case when comparing
-s, --skip-chars=N avoid comparing the first N characters
-u, --unique only print unique lines
-z, --zero-terminated end lines with 0 byte, not newline
-w, --check-chars=N compare no more than N characters in lines
--help display this help and exit
--version output version information and exit

A field is a run of blanks (usually spaces and/or TABs), then non-blank
characters. Fields are skipped before chars.

Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, comparisons honor the rules specified by 'LC_COLLATE'.

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'uniq invocation'

我们用-c参数即可:

1
sort ip.txt | uniq -c

然后我们就得到了每个 ip 及其出现次数。

#再排序

这次需要按照次数排序而不是字典序排序了,先看sort的参数:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
$ sort --help
Usage: sort [OPTION]... [FILE]...
or: sort [OPTION]... --files0-from=F
Write sorted concatenation of all FILE(s) to standard output.

Mandatory arguments to long options are mandatory for short options too.
Ordering options:

-b, --ignore-leading-blanks ignore leading blanks
-d, --dictionary-order consider only blanks and alphanumeric characters
-f, --ignore-case fold lower case to upper case characters
-g, --general-numeric-sort compare according to general numerical value
-i, --ignore-nonprinting consider only printable characters
-M, --month-sort compare (unknown) < 'JAN' < ... < 'DEC'
-h, --human-numeric-sort compare human readable numbers (e.g., 2K 1G)
-n, --numeric-sort compare according to string numerical value
-R, --random-sort sort by random hash of keys
--random-source=FILE get random bytes from FILE
-r, --reverse reverse the result of comparisons
--sort=WORD sort according to WORD:
general-numeric -g, human-numeric -h, month -M,
numeric -n, random -R, version -V
-V, --version-sort natural sort of (version) numbers within text

Other options:

--batch-size=NMERGE merge at most NMERGE inputs at once;
for more use temp files
-c, --check, --check=diagnose-first check for sorted input; do not sort
-C, --check=quiet, --check=silent like -c, but do not report first bad line
--compress-program=PROG compress temporaries with PROG;
decompress them with PROG -d
--debug annotate the part of the line used to sort,
and warn about questionable usage to stderr
--files0-from=F read input from the files specified by
NUL-terminated names in file F;
If F is - then read names from standard input
-k, --key=KEYDEF sort via a key; KEYDEF gives location and type
-m, --merge merge already sorted files; do not sort
-o, --output=FILE write result to FILE instead of standard output
-s, --stable stabilize sort by disabling last-resort comparison
-S, --buffer-size=SIZE use SIZE for main memory buffer
-t, --field-separator=SEP use SEP instead of non-blank to blank transition
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
--parallel=N change the number of sorts run concurrently to N
-u, --unique with -c, check for strict ordering;
without -c, output only the first of an equal run
-z, --zero-terminated end lines with 0 byte, not newline
--help display this help and exit
--version output version information and exit

KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a
field number and C a character position in the field; both are origin 1, and
the stop position defaults to the line's end. If neither -t nor -b is in
effect, characters in a field are counted from the beginning of the preceding
whitespace. OPTS is one or more single-letter ordering options [bdfgiMhnRrV],
which override global ordering options for that key. If no key is given, use
the entire line as the key.

SIZE may be followed by the following multiplicative suffixes:
% 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.

With no FILE, or when FILE is -, read standard input.

*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'sort invocation'

我们需要用到的是-n参数,根据数字大小排序:

1
sort ip.txt | uniq -c | sort -n

最后我们需要的是最大的三个,先降序排:

1
sort ip.txt | uniq -c | sort -rn

然后取top3

1
2
3
4
$ sort ip.txt | uniq -c | sort -rn | head -n 3
20 79.137.86.84
18 178.40.3.117
17 224.64.206.181

收工!

#参考链接