Producing dictionaries with web scraping

Have you ever tried this: take a frequency list, then copy and paste entries from a dictionary website to build a flashcard deck? And have you ever thought about making that process automatic?

There is a technique called “web scraping” that does exactly this kind of work. And the most fascinating part is that it might be illegal.

Dictionaries are creative works and are protected by copyright law, unless the author died long enough ago that the work has entered the public domain in your country. Quoting small amounts under certain conditions can count as “fair use”. Beyond copyright, automated data collection may also violate a website's terms of service, because it can generate so many requests that it degrades the server's performance; some websites explicitly prohibit such behavior.


You alone are responsible for the consequences of your own web scraping activity.


OK, disclaimer done. Main story:

Unless you are scraping for commercial purposes, most website owners won't sue you, because there is nothing to gain from a lawsuit. They will simply block your IP address or take other measures to make scraping impossible. Besides, if your request rate is low enough, the web server won't even notice.

If you want to try, there is a dictionary list: https://polyglotclub.com/wiki/Language/Multiple-languages/Culture/Internet-Dictionaries. Just pick a website that claims to release its content under some kind of public license but provides no download button, and get your feet wet. Some websites even provide an API, which makes the job much easier.
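
Wiktionary, for instance, runs on MediaWiki, so its standard Action API can hand you the raw wikitext of an entry without any HTML parsing. Here is a minimal sketch of that route; the word "serendipity" is only a placeholder, and other sites will have their own endpoints and parameters.

<syntaxhighlight lang="python">
# Minimal sketch: use a site's API instead of scraping HTML.
# Wiktionary runs MediaWiki, so the standard Action API is available.
import requests

API_URL = "https://en.wiktionary.org/w/api.php"

def fetch_wikitext(word):
    """Return the raw wikitext of a Wiktionary entry via the Action API."""
    params = {
        "action": "parse",
        "page": word,
        "prop": "wikitext",
        "format": "json",
        "formatversion": 2,
    }
    response = requests.get(API_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.json()["parse"]["wikitext"]

print(fetch_wikitext("serendipity")[:300])  # "serendipity" is just a placeholder word
</syntaxhighlight>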

Here is a video demonstrating the simple trick in Python: https://www.youtube.com/watch?v=-Yx8q6aKgtw

And he's not alone: https://www.youtube.com/watch?v=atDgcb-ImMo
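
Both videos boil down to the same pattern: download the entry page with requests and pull the piece you want out of the HTML with Beautiful Soup. Here is a minimal sketch of that pattern; the URL and the ".definition" CSS selector are made-up placeholders, so inspect the real site in your browser and adjust them.

<syntaxhighlight lang="python">
# Minimal sketch: fetch one dictionary entry page and extract the definition.
import requests
from bs4 import BeautifulSoup

word = "serendipity"                                   # placeholder word
url = f"https://example-dictionary.org/entry/{word}"   # placeholder URL pattern
headers = {"User-Agent": "my-flashcard-builder (contact: you@example.com)"}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
node = soup.select_one(".definition")                  # placeholder CSS selector
print(node.get_text(strip=True) if node else "definition not found")
</syntaxhighlight>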

You will need to modify the code so that it can automatically work through a word list. If you don't know Python, try this tutorial first: https://www.youtube.com/watch?v=_uQrJ0TkZlc
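
Putting the pieces together, here is a minimal sketch of that batch step: it reads one word per line from a word list file, scrapes a definition for each word, pauses between requests to keep the load low, and writes a tab-separated file that flashcard tools such as Anki can import. The file names, URL pattern and selector are placeholders again.

<syntaxhighlight lang="python">
# Minimal sketch: turn a word list into a tab-separated flashcard file.
import csv
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "my-flashcard-builder (contact: you@example.com)"}

def scrape_definition(word):
    """Fetch one entry page and return its definition text (placeholder site)."""
    url = f"https://example-dictionary.org/entry/{word}"   # placeholder URL pattern
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    node = soup.select_one(".definition")                  # placeholder CSS selector
    return node.get_text(strip=True) if node else ""

def build_deck(wordlist_path, output_path, delay=2.0):
    """Write a word<TAB>definition file that flashcard apps can import."""
    with open(wordlist_path, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]

    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        for word in words:
            try:
                definition = scrape_definition(word)
            except requests.RequestException as exc:
                print(f"skipping {word}: {exc}")
                continue
            writer.writerow([word, definition])
            time.sleep(delay)   # keep the request rate low so the server barely notices

build_deck("frequency_list.txt", "deck.tsv")   # placeholder file names
</syntaxhighlight>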

The frequency list is here: https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists

Have fun!