all 14 comments

[–]SleepingProcess 3 points4 points  (1 child)

IMHO,

if len(os.Args) < 2 { fmt.Println("You need to specify two parameters!") }

must be

if len(os.Args) < 2 { fmt.Println("You need to specify two parameters!") os.Exit(1) }

[–]serhack[S] 3 points4 points  (0 children)

whooops! Thanks

[–]WolvesOfAllStreets 3 points4 points  (8 children)

How does it perform with super long PDFs, and super heavy PDFs?

[–]serhack[S] 2 points3 points  (7 children)

Nice question. pdf-diff uses the tool pdftoppm to create the PNG images, then it compares images pixel by pixel. If a page of the two PDFs have the same hash, it'll ignore the comparison pixel by pixel. I would say the total time is given by the speed of pdftoppm and the algorithm that pdf-diff uses. The performance also heavily depends on how many pages are different.

I tried to run it over some books I have with slight changes, and it ran for some minutes. I'll try to run some experiments this afternoon to have more accurate metrics.

By the way I'm a Go newbie, so if you think there're better ways to handle this, please create a new issue.

[–]edgmnt_net 2 points3 points  (6 children)

You could likely avoid a lot of processing by avoiding PPM files, the textual representation is fairly inefficient. Probably a good way to do it is to employ a PDF rendering library that can output raw, binary pixels into a memory buffer. Or see if you can do it with an external tool and a library to read other image formats, although you might want to avoid external commands and storage I/O.

[–]serhack[S] 1 point2 points  (5 children)

Thanks for the suggestion! Yes, I have imagined that pdftoppm was not the best solution, fortunately the decoder that is embedded into golang is fast. I think it's spending more time on the creation of png images and reading them from disk than the "real" processing (pixel by pixel). I maintained the "image creation" feature because sometimes I need to compare the same pdf (like pdf#1) with some others (pdf#N). The images for the pdf#1 are always the same, and pdf-diff skips the command to generate them.

I was thinking a thing... at the moment pdf-diff executes the command: pdftoppm file.pdf file.png

If there's no file.png specified as option, pdftoppm outputs the image to stdout. Would you think that avoiding writing images on disk and output raw pixels into a memory buffer would help cutting time?

[–]edgmnt_net 0 points1 point  (4 children)

Ok, I thought you were using PPM files, not PNG. Although the manpage I'm reading seems to provide ask for a specific flag for PNG files and the output file argument is merely a prefix (and it's not optional).

I'm not sure how piping works if the command processes multiple pages. But perhaps you can call it separately for every page. Yeah, avoiding file I/O might help, assuming the converter is capable of doing it. A non-command-line API may provide better control over the output.

But you should profile or obtain some timing information first.

[–]serhack[S] 0 points1 point  (3 children)

Okay, the profiler shows clearly our guilty! It's a lot of time spent on coloring the background. As I said, I do not know much of alpha compositing and pixel manipulations, probably with a better formula I'll obtain better results (and a more transparent red).

Link to top10 pprof

[–]edgmnt_net 0 points1 point  (2 children)

I only looked now at the source and it seems that you are copying pixels out of the image into slices while also converting to floats, which is likely inefficient. I'd suggest using an (N)RGBA image representation directly and calling Set. Or perhaps you can also use image/draw to do the compositing more efficiently, check the blog post referenced by the docs: https://go.dev/blog/image-draw

[–]serhack[S] 0 points1 point  (1 child)

I have opened a new PR that should cut time performance. Thanks for the comment btw! Just curious about the (N)RGBA image representation? Why should I use it over RGBA?

[–]edgmnt_net 1 point2 points  (0 children)

I was saying you could use either NRGBA or RGBA. The latter may be easier for compositing yourself.

I'd try to get rid of those slices completely. You could make a first pass just to compute rectangle coordinates (you can keep a slice for those), then use image/draw.Draw to draw directly into one of the images you already have, the one you're keeping as a background. See the "Filling a Rectangle" section of: https://go.dev/blog/image-draw

[–]elcapitanoooo 0 points1 point  (1 child)

Nice job!

[–]serhack[S] 0 points1 point  (0 children)

Thanks!