My old downloading tool

I read the manga The Journey of Shuna by Miyazaki Hayao online and wanted to keep a copy, so I turned to my old downloading tool for help. I wrote it in Common Lisp while I was learning the language. It has an abstract interface: give it three functions that, respectively, find the image on the current page, find the location of the next page, and tell when the last page of the current manga has been reached, and it will download and save the files sequentially. In the past I used it to download one manga from one site, and that site is no more. How time flies.
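To make that interface concrete, here is a minimal sketch of such a loop in JavaScript rather than the original Common Lisp. Every name in it (downloadSeries, site, io and their methods) is made up for illustration; it only shows the shape of the idea, not the code of the actual tool.

```javascript
// Sketch of the download loop. All names are hypothetical.
// `site` supplies the three site-specific functions; `io` supplies the
// network and disk operations, so the loop itself knows nothing about either.
async function downloadSeries(startUrl, site, io) {
  const saved = [];
  let url = startUrl;
  for (let index = 1; ; index += 1) {
    const html = await io.fetchPage(url);          // fetch the current page
    const imageUrl = site.findImage(html, url);    // function 1: find the image
    await io.saveImage(imageUrl, index);           // save it under a sequential number
    saved.push(imageUrl);
    if (site.isLastPage(html, url)) break;         // function 3: know when to stop
    url = site.findNextPage(html, url);            // function 2: find the next page
  }
  return saved;
}
```

Supporting a new site would then mean writing only the three small functions in site, assuming the site is simple enough for them.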

After some time studying the HTML/JS code of the site, I realised that it had anti-scraping measures in place. My previous abstraction did not fit the task. Actually, locating the next page is trivial: increment a number in the URL. The HTML file of each page contains a total page count, so we know when to stop. The only obfuscation comes from hiding the image. If only my tool were a Firefox extension! It could just snatch the image after Firefox finishes parsing the DOM.
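The two easy parts, stepping through pages and knowing when to stop, can be sketched as below. This assumes the page number is the last run of digits in the URL and that the page count appears in the HTML as something like `var totalPages = 12;`. Both are guesses for illustration, as are the example.com address and the function names; the real site's layout is not shown in this post.

```javascript
// Assumption: the page number is the last run of digits in the URL,
// e.g. https://example.com/manga/42/p9.html (a made-up address).
function nextPageUrl(url) {
  // Match a run of digits that has no digit anywhere after it, then add 1.
  return url.replace(/\d+(?!.*\d)/, (n) => String(Number(n) + 1));
}

// Assumption: the page embeds something like `var totalPages = 12;`.
function totalPages(html) {
  const match = /totalPages\s*=\s*(\d+)/.exec(html);
  if (!match) throw new Error("page count not found");
  return Number(match[1]);
}

// List every page URL up front, starting from the first page.
function allPageUrls(firstUrl, html) {
  const count = totalPages(html);
  const urls = [firstUrl];
  for (let i = 1; i < count; i += 1) {
    urls.push(nextPageUrl(urls[urls.length - 1]));
  }
  return urls;
}
```

Note that this drops leading zeros (009 becomes 10) and would pick the wrong number if a query string with digits followed the page number.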

The HTML file contains some secret keys that are sent along in a subsequent XHR to the server, which returns an obfuscated JavaScript snippet with some data embedded. Executing that snippet produces the URL of the image. No, I don't want to figure out what the JavaScript does exactly. It would not be worth my time. I should have learnt WebExtensions. 😢

My tooling is just wrong for this task. Despite being a very powerful language that lets one do whatever one wants in whatever way, Common Lisp is not easy to use. One has to either implement the function one needs or find other people's libraries to use. Every time one uses someone else's library, one finds another language to learn. A lot of time is spent reading documentation. I don't use Common Lisp often enough, so I forget what I have learned.

Thus I just switched to Node.js, because obviously it can just run that JavaScript snippet to produce the URL of the image. BTW the URL is dynamic, as it involves query strings. Also, one should use safe-eval to eval the snippet, because Node.js has a lot of power: the site could troll me by deleting all my local files. One needs to be careful at each step to send a sensible user agent and the correct referer, which I know the site checks. I used the request-promise library to make HTTP requests and awaited the results. I still really like drakma. Then I used regexes to extract those few secret keys. Yeah, cl-ppcre is much more cumbersome to use... I needed a lot of string operations and enjoyed the template strings in JavaScript. This totally saved the time I would have spent reading the documentation of alexandria. It just felt like all the building blocks one needed were right at hand. Sadly I didn't set up the Node.js REPL well, so it was somewhat difficult to test my modules as I progressed.
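Roughly, the steps look like the sketch below. It is written under loud assumptions: the key names (cid, key), the endpoint image.ashx, and the user agent string are all invented for illustration and are not the site's real ones. Node's built-in vm module is also swapped in here for the safe-eval package mentioned above, only so the example has no dependencies. The Node.js documentation states that vm is not a security mechanism, so treat it as a way to limit accidents, not as a real sandbox.

```javascript
const vm = require("vm");

// Assumption: the keys sit in an inline script as `var cid = 123; var key = "abc";`.
function extractKeys(html) {
  const cid = /var\s+cid\s*=\s*(\d+)/.exec(html);
  const key = /var\s+key\s*=\s*"([^"]+)"/.exec(html);
  if (!cid || !key) throw new Error("secret keys not found");
  return { cid: cid[1], key: key[1] };
}

// Build the follow-up request with a template string.
// The endpoint and parameter names are made up.
function buildSnippetRequest(pageUrl, pageNumber, keys) {
  const { origin } = new URL(pageUrl);
  const query = `cid=${keys.cid}&page=${pageNumber}&key=${encodeURIComponent(keys.key)}`;
  return {
    url: `${origin}/image.ashx?${query}`,
    headers: {
      // An example browser-like user agent; the referer is the page itself.
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
      Referer: pageUrl,
    },
  };
}

// Evaluate the returned snippet in a fresh context that has no `require`
// or `process`, and give up after one second. The value of the snippet's
// last expression is returned.
function evalSnippet(snippet) {
  return vm.runInNewContext(snippet, {}, { timeout: 1000 });
}
```

The object returned by buildSnippetRequest can be handed to whichever HTTP library is in use, and the response body then goes through evalSnippet to get the image URL.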

Once I finished coding, the program was ready. It ran, downloaded the images, and then sat through another long wait until I wanted another manga.


I dusted off this tool again to download another manga, and it failed!!!

Of course I didn't have any error-handling code in place. When your first shot at downloading a manga just succeeds and you shelve the tool right away, you never get a chance to write error-handling code. My console burst into horrible red characters.

Guess what went wrong.

This time the friggin' pathname of the image contained non-ASCII characters. That was totally unexpected. Sorry for being an ASCII-centred guy who is essentially ignorant of the vast world of Unicode. I then added a function to percent-encode the pathname before feeding it to the request library.
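A function of that kind could look something like the sketch below. The name encodePathname and the sample paths are made up. It encodes each path segment with the standard encodeURIComponent, decoding first so that a path that is already encoded does not get encoded twice. It only handles the pathname; the query string would be left as the site produced it.

```javascript
// Percent-encode each path segment, leaving the "/" separators alone.
function encodePathname(pathname) {
  return pathname
    .split("/")
    .map((segment) => {
      let decoded = segment;
      try {
        // Undo any existing escapes so they are not encoded a second time.
        decoded = decodeURIComponent(segment);
      } catch (err) {
        // Malformed escape such as a lone "%": keep the segment as raw text.
      }
      return encodeURIComponent(decoded);
    })
    .join("/");
}

console.log(encodePathname("/img/café/01.jpg")); // prints /img/caf%C3%A9/01.jpg
```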