LeetCode: The Find Duplicate File in System Problem

Source: Internet  Editor: 程序博客网  Date: 2024/06/05 01:57

Problem description:

Given a list of directory info including directory path, and all the files with contents in this directory, you need to find out all the groups of duplicate files in the file system in terms of their paths.

A group of duplicate files consists of at least two files that have exactly the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt with content f1_content, f2_content ... fn_content, respectively) in directory root/d1/d2/.../dm. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of group of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:

"directory_path/file_name.txt"

Note:

No order is required for the final output.
You may assume the directory name, file name and file content only has letters and digits, and the length of file content is in the range of [1,50].
The number of files given is in the range of [1,20000].
You may assume no files or directories share the same name in the same directory.
You may assume each given directory info represents a unique directory. Directory path and file info are separated by a single blank space.

Example:

Input: ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Problem source: Find Duplicate File in System (full URL: https://leetcode.com/problems/find-duplicate-file-in-system/description/)

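Approach: The problem statement is long only because it spells out the input format carefully. The task itself is simple: put files with identical content into the same group, and keep only the groups that contain at least two files. A file's content is just the text inside the parentheses, so the solution is straightforward as well. First split each directory info string on spaces into segments: the first segment is the directory path and each remaining segment describes one file. Then extract the content from each file segment and group together the files whose content matches. To maintain the mapping from content to files we need a HashMap, where the key is the file content and the value is the collection of file paths that have that content. That collection can be either a Set or a List; because the problem does not require any particular order, both work. The rest (splitting strings, taking substrings, joining the directory path and file name) is routine. The part most worth discussing is the JDK 8 solution, which processes the groups as a stream and uses filter() to drop every group that contains fewer than two files.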

Code:

HashMap + Set solution:
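A minimal sketch of the HashMap + Set approach described in the analysis, not necessarily the exact code this heading originally introduced. The class name DuplicateFinderSet is a hypothetical name chosen for illustration; the method follows the findDuplicate(String[] paths) shape used by the LeetCode problem.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical class name, used only for this sketch.
class DuplicateFinderSet {
    // Groups full file paths by file content and returns the groups of size >= 2.
    static List<List<String>> findDuplicate(String[] paths) {
        // key: file content, value: all file paths that have this content
        Map<String, Set<String>> byContent = new HashMap<>();
        for (String info : paths) {
            // parts[0] is the directory path; every later part is "name(content)"
            String[] parts = info.split(" ");
            String dir = parts[0];
            for (int i = 1; i < parts.length; i++) {
                int open = parts[i].indexOf('(');
                String name = parts[i].substring(0, open);
                // the content sits between '(' and the trailing ')'
                String content = parts[i].substring(open + 1, parts[i].length() - 1);
                Set<String> group = byContent.get(content);
                if (group == null) {
                    group = new HashSet<>();
                    byContent.put(content, group);
                }
                group.add(dir + "/" + name);
            }
        }
        // keep only the groups that really contain duplicates
        List<List<String>> result = new ArrayList<>();
        for (Set<String> group : byContent.values()) {
            if (group.size() > 1) {
                result.add(new ArrayList<>(group));
            }
        }
        return result;
    }
}
```

Because both the map and the sets are hash-based, the order of the groups and of the paths inside each group is unspecified, which the problem explicitly allows.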

Another way to produce the output:
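One possible reading of "another way to produce the output", offered as a sketch under that assumption: keep a List per content, delete the undersized groups from the map in place through an iterator, and then copy the remaining values straight into the result. The class name DuplicateFinderList is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical class name, used only for this sketch.
class DuplicateFinderList {
    static List<List<String>> findDuplicate(String[] paths) {
        // key: file content, value: all file paths that have this content
        Map<String, List<String>> byContent = new HashMap<>();
        for (String info : paths) {
            String[] parts = info.split(" ");
            for (int i = 1; i < parts.length; i++) {
                int open = parts[i].indexOf('(');
                String content = parts[i].substring(open + 1, parts[i].length() - 1);
                String path = parts[0] + "/" + parts[i].substring(0, open);
                List<String> group = byContent.get(content);
                if (group == null) {
                    group = new ArrayList<>();
                    byContent.put(content, group);
                }
                group.add(path);
            }
        }
        // Remove groups with fewer than two files directly from the map;
        // Iterator.remove() on the values view also removes the map entry.
        Iterator<List<String>> it = byContent.values().iterator();
        while (it.hasNext()) {
            if (it.next().size() < 2) {
                it.remove();
            }
        }
        // Whatever is left in the map is exactly the answer.
        return new ArrayList<>(byContent.values());
    }
}
```

The grouping step is the same as in the Set version; only the way the final list is assembled differs.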

JDK 8 solution (really just another way to produce the output):
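A minimal sketch of the stream-based idea mentioned in the analysis: build the content-to-paths map as before, then stream its values and use filter() to drop the groups with fewer than two files. The class name DuplicateFinderStream is hypothetical, and using computeIfAbsent for the grouping step is a choice made for this sketch rather than something the text specifies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical class name, used only for this sketch.
class DuplicateFinderStream {
    static List<List<String>> findDuplicate(String[] paths) {
        // key: file content, value: all file paths that have this content
        Map<String, List<String>> byContent = new HashMap<>();
        for (String info : paths) {
            String[] parts = info.split(" ");
            for (int i = 1; i < parts.length; i++) {
                int open = parts[i].indexOf('(');
                String content = parts[i].substring(open + 1, parts[i].length() - 1);
                String path = parts[0] + "/" + parts[i].substring(0, open);
                // create the group on first sight of this content, then append
                byContent.computeIfAbsent(content, k -> new ArrayList<>()).add(path);
            }
        }
        // filter() keeps only groups with at least two files
        return byContent.values().stream()
                .filter(group -> group.size() > 1)
                .collect(Collectors.toList());
    }
}
```

Compared with the explicit loop, the stream version states the "at least two files" rule in a single filter() call, which is the main attraction of this variant.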
