Difference between revisions of "Language/Multiple-languages/Culture/Producing-dictionaries-with-web-scraping"

From Polyglot Club WIKI



Revision as of 19:03, 14 April 2021


Have you ever taken a frequency list and copied and pasted the matching entries from a dictionary website to make a flashcard deck? And have you thought about automating the process?


There is a technique called “[https://en.wikipedia.org/wiki/Web_scraping web scraping]” that does exactly this. And the most fascinating part is that <big>it might be illegal</big>.


A dictionary is a creative work and is protected by copyright law, unless its author died long enough ago that the work has entered the public domain in your country. Quoting it on a small scale, under certain conditions, can be considered “[https://en.wikipedia.org/wiki/Fair_use fair use]”. Apart from that, if the automatic data collection generates too many requests in a short time, it can degrade the web server's performance, so some websites explicitly prohibit web scraping in their terms of service.


<b>You alone are responsible for the consequences of your own web scraping activity.</b>


OK, disclaimer done. Now the reality:


Unless you are scraping for commercial purposes, most site owners won't sue you, because there is nothing for them to gain from it. They'll just block your IP address or take other measures to make scraping impossible. Apart from that, if your request rate is low enough, the web server won't be strained. And even if the content is protected by copyright, as long as you don't distribute it, it is unlikely that anyone would sue you for copyright infringement.

How slow is slow enough? About 150 requests per day, I guess, unless you can learn more new words than that in a day.
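To get a feel for what that budget means in practice, here is a minimal sketch of the arithmetic. It assumes the requests are spread evenly over 24 hours; the function names and the 3000-word list are made up for illustration.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def seconds_between_requests(requests_per_day):
    """Pause needed between requests to stay within a daily budget."""
    if requests_per_day <= 0:
        raise ValueError("requests_per_day must be positive")
    return SECONDS_PER_DAY / requests_per_day


def days_needed(word_count, requests_per_day):
    """Whole days needed to look up every word, one request per word."""
    return math.ceil(word_count / requests_per_day)


print(seconds_between_requests(150))  # 576.0 seconds, i.e. 9.6 minutes
print(days_needed(3000, 150))         # 20
```

So at 150 lookups a day the script sleeps for almost ten minutes between requests, and a 3000-word list takes about three weeks.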

If you want to try, there is a dictionary list. Just pick a website that says its content is released under some public license but provides no download link. Some websites even provide an API, which is the better way to go.

Here is someone demonstrating the basic trick with Python:

<youtube>-Yx8q6aKgtw</youtube>


And he's not alone:

<youtube>atDgcb-ImMo</youtube>


You need to modify the code so that it can automatically process a wordlist. If you don't know Python, please try this:

<youtube>_uQrJ0TkZlc</youtube>
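For orientation, a wordlist version might look roughly like the sketch below. It is not the code from the videos. It uses only the Python standard library, and the URL pattern and the <code>definition</code> class name are placeholders I made up; every site structures its pages differently, so both have to be adapted to the site you picked.

```python
import time
import urllib.parse
import urllib.request
from html.parser import HTMLParser

# Hypothetical URL pattern: replace it with the real one for your site.
URL_TEMPLATE = "https://dictionary.example/entry/{word}"

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class DefinitionParser(HTMLParser):
    """Collects text inside elements with class="definition" (assumed name)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a definition element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "definition" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_definition(html):
    """Return the definition text with whitespace normalised."""
    parser = DefinitionParser()
    parser.feed(html)
    return " ".join("".join(parser.chunks).split())


def fetch_page(word):
    """Download the entry page for one word."""
    url = URL_TEMPLATE.format(word=urllib.parse.quote(word))
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def build_deck(words, fetch=fetch_page, delay_seconds=576, sleep=time.sleep):
    """Return (word, definition) pairs, pausing between requests.

    The default delay is 86400 s / 150 requests, i.e. about 150 lookups a day.
    """
    deck = []
    for index, word in enumerate(words):
        if index:
            sleep(delay_seconds)    # no pause before the first request
        deck.append((word, extract_definition(fetch(word))))
    return deck
```

Read your frequency list into a Python list, pass it to <code>build_deck</code>, and write the resulting pairs to a tab-separated file, which most flashcard programs can import. The <code>fetch</code> and <code>sleep</code> parameters exist only so the loop can be tried out on saved pages without touching the network.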

Have fun!