As part of my work on Hat, I'm wondering how many lines are duplicated in the source code. So I started thinking about the problem, in a generalized way, of course. My first idea was to consider one file as a possible substring source, meaning that its contents may contain substrings of the other file. Since part of the idea is that you want the largest substrings possible, it makes sense to generate the large substrings, then move on to the size below it, and so on.
Quick aside here, when I was down at Fun in the Afternoon, Wouter Swierstra gave an interesting talk on version control, in which he abstracted away from thinking in Characters, lines, or some higher level construct in the file. I'm borrowing this idea, and am going to talk about "blocks", and having files made up of blocks and looking for sub-blocks of files. This gives three advantages:
Quick note about block composison. A block is ethier a block in its own right, or a list of blocks. Clearly, order matters.
Having started thinking about generating blocks, it is clear the largest block is the whole file, and there is one of them. We can then generate the blocks that are 1 block smaller by removing the first and then the last blocks. So we have 1+2+3+...n-1 subblocks of a file. Which is about n-squared blocks, and assuming that testing a block against a file takes n time, this gives us a n-cubed time for testing two files. This sucks, so a better solution is needed.
Take the smallest blocks from a file. map each of them to where it came from, and feed the result into a table based on the block's contents. Repeat for all the files. When finshed, remove all the entries in the table with less than two file addresses. These are the matches, so then all you need to do is assemble the largest block sizes needed.
Construct a new table based on the addresses of bits of the file. For each pair of matches, look up the next address, if by line, then the next line in the file. if that is a match, then merge the blocks. Output the result.
Running time is based on how good your table is. Assuming a table with insertation and lookup of o(logn), then the running time is O(nlogn). assuming a table with insertion and lookup of O(1), then the running time is O(n), which is about as good as it gets for this sort of problem. However, its easy to construct a running time of O(n^2), by having all the inputs identical. There are a few ways of dealing with this, mainly by making insertion into the hash bucket O(1).
All in all, I'm happy with this, and I'm going to have to implement it for lines tomorrow. Full block implementation is going to wait until I have worked out how to define what a block is at runtime, which might need a parser with the grammar supplied at runtime.
posted at: 10:46 | path: /computing | permanent link to this entry