Self-programmed crawler collection refers to automatically extracting data from the Amazon platform with crawler programs; it is a technical method of data collection. This section gives only a basic introduction to the approach and does not go deeply into programming topics; interested readers can consult dedicated programming resources on their own.
Self-programmed crawlers are mainly used for large-scale, highly repetitive collection tasks, such as monitoring competitor listings or data-driven product selection.
First download the crawler program “rank”, which is an executable file (an EXE file). Double-click it to start the crawler under the appropriate operating conditions. Note that when running the EXE file on the desktop, an Excel workbook named “rank” must already exist on the desktop. The workbook is used as follows: create an Excel file named “rank”, in either xls or xlsx format, and make sure it sits in the same folder as the EXE file; for example, both can be stored on the desktop together. Enter the URLs of the product pages whose rankings need to be extracted in the first column of the workbook, then close it (the program cannot read the workbook while it is open in Excel, and a prompt asking you to close it will pop up). Launch the EXE program; each link takes about 5 to 8 seconds to process. A prompt pops up once all the links have been processed, so simply start the program and wait for the prompt, and do not open the Excel workbook in the meantime. The finished data is written to rank.xls; if the original file is rank.xlsx, a new rank.xls file is created.
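To make the workflow concrete, here is a minimal Python sketch of the same read-fetch-write loop, assuming the xlsx variant of the file, the requests and openpyxl libraries, and an illustrative regular expression for the “Best Sellers Rank” line on the product page; it is not the actual implementation of the “rank” program.

```python
import re
import time
from datetime import datetime

import requests
from openpyxl import load_workbook

HEADERS = {"User-Agent": "Mozilla/5.0"}  # a browser-like UA; bare clients are usually rejected

def fetch_rank(url: str) -> int:
    """Return the main-category Best Sellers Rank, or -1 on any failure."""
    try:
        html = requests.get(url, headers=HEADERS, timeout=15).text
    except requests.RequestException:
        return -1
    # Hypothetical pattern: the first "#1,234 in <category>" occurrence.
    match = re.search(r"#([\d,]+)\s+in\s", html)
    return int(match.group(1).replace(",", "")) if match else -1

wb = load_workbook("rank.xlsx")          # the workbook must be closed in Excel
ws = wb.active
for row in ws.iter_rows(min_col=1, max_col=1):
    cell = row[0]
    if not cell.value:
        continue
    ws.cell(row=cell.row, column=2, value=fetch_rank(cell.value))
    ws.cell(row=cell.row, column=3, value=datetime.now().isoformat())
    time.sleep(6)                        # roughly the 5 to 8 seconds per link noted above
wb.save("rank.xlsx")
```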
After downloading all the files, you can try out the crawler. Because some links are already stored in the original workbook, you can double-click the “rank” EXE file directly to start it. Note that once the crawler starts, the Excel file named “rank” must stay closed; otherwise a warning will pop up asking you to close the Excel file first.
Because 100 Amazon links are recorded in the initial Excel workbook, the crawl takes roughly 10 to 20 minutes once the program starts. When all the data has been crawled, a prompt pops up indicating that 100 records have been completed.
When all data crawling tasks are completed, open the Excel file named “rank”.
The “rank” workbook stores three kinds of data: the first column holds the Amazon product link, the second the main-category ranking for that link, and the third the time the data was crawled.
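For readers who prefer to inspect the results in code rather than in Excel, a short pandas sketch of reading the three columns back is shown below; the column names are invented for readability (the file itself has no header row), and reading the legacy .xls format requires the xlrd package.

```python
import pandas as pd

# The file has no header row, so supply column names ourselves.
df = pd.read_excel("rank.xls", header=None,
                   names=["url", "main_category_rank", "crawled_at"])
print(df.head())
# Rows where the crawl failed carry the -1 sentinel discussed below.
print(df[df["main_category_rank"] == -1])
```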
In the main-category ranking column, some entries show “-1”. A “-1” result may have any of the following causes; a small triage sketch follows the list.
1. The crawler has been run too frequently and the network IP has been temporarily banned.
2. The product has not received any orders yet, so the link has no corresponding main-category ranking.
3. The product is a non-standard product, and its page layout does not match the crawler’s parsing logic.
4. A program runtime error, possibly caused by an operating system mismatch (such as macOS or Windows XP) or a network failure.
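As promised above, here is a hedged triage sketch for “-1” results. The page markers it checks (“Robot Check” for Amazon’s captcha page, “Best Sellers Rank” for the ranking block) are illustrative assumptions about the page content, not an exhaustive detector.

```python
import requests

def classify_failure(url: str) -> str:
    """Guess which of the four causes above produced a -1 for this link."""
    try:
        html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"},
                            timeout=15).text
    except requests.RequestException:
        return "network failure (case 4)"
    if "Robot Check" in html or "captcha" in html.lower():
        return "IP temporarily blocked (case 1)"
    if "Best Sellers Rank" not in html:
        return "no main-category rank on the page (case 2 or 3)"
    return "rank present; parsing logic mismatch (case 3)"
```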
In the “rank” workbook, the third column records the time of that crawl.
If the operator needs to record the rankings of other product links, replace the links in the first column of the Excel file, then start the crawler again and wait for the crawl to finish; the links can also be swapped in programmatically, as sketched below.
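The sketch below assumes openpyxl and the .xlsx variant of the file; the URLs are placeholders, not real product links.

```python
from openpyxl import load_workbook

new_urls = [
    "https://www.amazon.com/dp/EXAMPLE001",   # placeholder links
    "https://www.amazon.com/dp/EXAMPLE002",
]

wb = load_workbook("rank.xlsx")
ws = wb.active
ws.delete_cols(1)                 # drop the old links
ws.insert_cols(1)                 # restore an empty first column
for i, url in enumerate(new_urls, start=1):
    ws.cell(row=i, column=1, value=url)
wb.save("rank.xlsx")              # close Excel before the crawler runs
```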
It should be noted that if the crawler is run many times in a short period, Amazon will temporarily block the network IP, and the crawled ranking data will be filled with “-1” values. After each run of the crawler, therefore, wait a while before starting the next crawl; one way to automate this pacing is sketched below.
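The sketch below adds a jittered per-link delay and a long cool-down once “-1” results start clustering (the ban symptom described above); all durations are illustrative guesses, since Amazon does not document its rate limits.

```python
import random
import time

def crawl_with_backoff(urls, fetch_rank):
    """Crawl urls with polite pacing; fetch_rank is a callable like the
    helper sketched earlier, returning a rank or -1 on failure."""
    results, consecutive_failures = [], 0
    for url in urls:
        rank = fetch_rank(url)
        results.append(rank)
        if rank == -1:
            consecutive_failures += 1
            if consecutive_failures >= 5:   # likely rate-limited: cool down
                time.sleep(30 * 60)         # assumed 30-minute pause
                consecutive_failures = 0
        else:
            consecutive_failures = 0
        time.sleep(6 + random.uniform(0, 4))  # jittered 6-10 s per link
    return results
```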