Part One
输入一段字段串,判断此字符串是否可以由已知的几种片段拼接而成。比如r, wr, b, g, bwu, rb, gb, br等片段,可以组成brwrr=b+r+wr+r
regexp_split_to_array
可以使用regexp_split_to_array函数,将这些片段当作是分隔符,如果split之后都是空字符串,则证明字符串肯定可以通过这些片段拼接而成。
postgres=# select regexp_split_to_array('bwurrg', 'r|wr|b|g|bwu|rb|gb|br');
regexp_split_to_array
-----------------------
{"","","","",""}
(1 row)
完整的SQL如下
with recursive rows as (
SELECT row_number() over () as _row, line
FROM lance_input
), towels as (
SELECT replace(line, ', ', '|') as line
FROM rows
where _row = 1
), design as (
SELECT _row, line
FROM rows
WHERE _row > 2
)
SELECT string_agg(distinct result::text, '')
FROM (
SELECT a._row, unnest(regexp_split_to_array(a.line, b.line)) as result
FROM design a, towels b
) t
GROUP BY _row;
通过结果可以看出,有两条记录无法达成
string_agg
------------
u
w
(8 rows)
但是在实际测试的时候发现,得出的记录数明显不足,应该是有一些误判了,比如下面这条
postgres=# select regexp_split_to_array('wgwrwurgrbwgwwuwrgwwbubwbubwbubggwbwgbwwgr', 'rrgwru|gwubwg|wgbbw|...');
regexp_split_to_array
--------------------------------------------
{"","","","","","","","","","","","","",r}
(1 row)
但是这条记录其实是可以拼接的,因此split的方案是不可靠的。
逐个片段trim
可以通过starts_with来判断是否存在某个towel匹配,存在的话可以left trim掉,持续trim直到空字符串。
with recursive rows as (
SELECT row_number() over () as _row, line
FROM lance_input
), towels as (
SELECT regexp_split_to_table(line, ', ') as line
FROM rows
where _row = 1
), design as (
SELECT _row, line
FROM rows
WHERE _row > 2
), trim_design as (
SELECT 1 as step, _row, line
FROM design
UNION ALL
SELECT step + 1 as step, _row, substring(a.line, length(b.line) + 1) as line
FROM trim_design a, towels b
WHERE starts_with(a.line, b.line)
AND length(a.line) > 0
)
SELECT * FROM trim_design;
不出意料,速度非常慢。因此每次匹配的记录,多次笛卡尔积之后,记录数呈指数级增加,膨胀的非常厉害。
step | count
------+---------
1 | 1
2 | 3
3 | 7
4 | 19
5 | 52
6 | 141
7 | 378
8 | 1073
9 | 3065
10 | 8511
11 | 22944
12 | 60899
13 | 161945
14 | 431528
15 | 1138059
递归函数
其实可以从递归的角度来思考,这些字符串中应该存在着大量相同的子字符串。如果记录下每个字符串的匹配结果,那么再遇到相同的字符串时,就可以快速返回,而不用做重复的匹配了。
CREATE OR REPLACE FUNCTION is_valid(design varchar, towels text[])
RETURNS BOOLEAN AS $$
DECLARE
check_result BOOLEAN;
BEGIN
-- empy string
IF design = '' THEN
RETURN TRUE;
END IF;
-- read from table
check_result := (SELECT _result FROM results WHERE _design = design);
IF check_result is not null THEN
return check_result;
END IF;
FOR i IN 1..array_length(towels, 1) LOOP
CONTINUE WHEN NOT starts_with(design, towels[i]);
check_result := is_valid(substring(design, length(towels[i]) + 1), towels);
INSERT INTO results VALUES(substring(design, length(towels[i]) + 1), check_result) on conflict(_design) do nothing;
CONTINUE WHEN NOT check_result;
RETURN TRUE;
END LOOP;
-- insert false result
INSERT INTO results VALUES(design, FALSE) on conflict(_design) do nothing;
RETURN FALSE;
END;
$$ LANGUAGE plpgsql;
通过上述的函数,来记录下每次匹配的结果,并插入到results表中。运行结束之后,其表中的记录如下:
postgres=# select * from results limit 10;
_design | _result
------------+---------
r | f
ur | f
wur | f
gwur | f
ggwur | f
rwggwur | f
wggwur | f
brwggwur | f
gbrwggwur | f
wgbrwggwur | f
(10 rows)
Part Two
这次需要记录下拼接的最终方式有多少种。比如gbbr=gb+br,也可以gbbr=g+b+br等
同样是类似的思路,只不过这次需要将所有可能的方式相加,最终将汇总后的结果插入到表中。
CREATE OR REPLACE FUNCTION is_valid_cnt(design varchar, towels text[])
RETURNS BIGINT AS $$
DECLARE
check_result BIGINT;
find_result BIGINT;
BEGIN
-- empy string
IF design = '' THEN
RETURN 1;
END IF;
-- read from table
check_result := (SELECT _result FROM results WHERE _design = design);
IF check_result is not null THEN
return check_result;
END IF;
-- init result
check_result := 0;
FOR i IN 1..array_length(towels, 1) LOOP
CONTINUE WHEN NOT starts_with(design, towels[i]);
find_result := is_valid_cnt(substring(design, length(towels[i]) + 1), towels);
check_result := check_result + find_result;
END LOOP;
-- insert result
INSERT INTO results VALUES(design, check_result) on conflict(_design) do nothing;
RETURN check_result;
END;
$$ LANGUAGE plpgsql;
最终的结果是比较大的数字
postgres=# select * From results order by _result desc limit 10;
_design | _result
------------------------------------------------------------+----------------
uwbubuuwbgbuwbguwuwwgbggugwwubrbubgbuwrrbuwbuwrwbubgbrgru | 73233763858272
bwrbbuguugbgurbruwgbbruuwubwbggbrbrbugggubbbwwwbbguggwwgrg | 73040032149444
bbbgrugwwgwggbruwrgwwbgrbbuubuguwbggwwugbuugwbwrubgguuwwg | 51869184655788
grbguwwuwgrburgggguugubwwgwguuwbwuwuwgugwggggwurrubwgbbwrb | 47637921344256
wrbbuguugbgurbruwgbbruuwubwbggbrbrbugggubbbwwwbbguggwwgrg | 40939309967538
wbubuuwbgbuwbguwuwwgbggugwwubrbubgbuwrrbuwbuwrwbubgbrgru | 39664794248880
wwbwwuwrguubrwugrbrgggbwuwwgbrbwguurbrbwguuwwrbgwwubwwuwbu | 30222341875432
bbgrugwwgwggbruwrgwwbgrbbuubuguwbggwwugbuugwbwrubgguuwwg | 27504375836232
bubuuwbgbuwbguwuwwgbggugwwubrbubgbuwrrbuwbuwrwbubgbrgru | 22379313072928
bguwwuwgrburgggguugubwwgwguuwbwuwuwgugwggggwurrubwgbbwrb | 20522734353084
(10 rows)