After downloading and installing the crawler collector, you will see its toolbar. Click “New Task” to create a custom collection task.
The first step in creating a new task is entering the URL. Open the Amazon homepage, enter the keyword “tunic tops for women”, and search. The search generates the following link: https://www.amazon.com/s?k=tunic+tops+for+women&ref=nb_sb_noss. Copy the link and paste it into the address field for the page you want to crawl.
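For reference, the search URL is simply the keyword URL-encoded into the k query parameter. A minimal Python sketch of how such a link is assembled (quote_plus replaces the spaces with “+”):

```python
from urllib.parse import quote_plus

keyword = "tunic tops for women"
# Amazon's search results page carries the keyword in the "k" parameter.
url = f"https://www.amazon.com/s?k={quote_plus(keyword)}&ref=nb_sb_noss"
print(url)  # https://www.amazon.com/s?k=tunic+tops+for+women&ref=nb_sb_noss
```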
The collector can crawl multiple links at the same time, so if you need to collect the results of several keyword searches at once, you can enter multiple links; however, this will significantly slow down the crawl.
After pasting, click the “Next” button to enter the data-crawling stage. Wait 1~2 minutes while the collector automatically tries to crawl and organize the page information into a data file. As the table shows, not all of the 7 automatically extracted columns contain data needed for digitized product selection. Click the filter button above the unneeded columns to delete them, keeping only the link in the second column and the number of reviews in the fifth column for in-depth collection.
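If you prefer to trim the exported file outside the collector, the same column filtering takes a few lines of pandas. This is only a sketch; the file name and the review_count column are assumptions, though title_link matches the column used in the deep-collection step below:

```python
import pandas as pd

# Hypothetical file name; use whatever the collector exports.
df = pd.read_csv("amazon_search_results.csv")

# Keep only the product link and the review count; drop the other five columns.
df = df[["title_link", "review_count"]]
df.to_csv("filtered_results.csv", index=False)
```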
If the automatically identified result is not the data you want, click the “List Mode” drop-down button, choose the “Select List” option from the drop-down list, and then click the target elements on the web page above to complete the data selection.
At this point the collector can only select data on the current page. If you want to crawl across multiple pages, click the paging setting in the lower left corner and select “Automatically identify paging”; the collector will then find the page-turning button and click it automatically during the actual crawl.
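Conceptually, “Automatically identify paging” finds the next-page control and follows it until it disappears. A rough Python sketch of that loop, assuming Amazon’s next-page link carries the s-pagination-next class (verify against the live page before relying on it):

```python
import time
import requests
from lxml import html

def crawl_all_pages(start_url, max_pages=10):
    """Follow the next-page link until it disappears or max_pages is reached."""
    url = start_url
    pages = []
    for _ in range(max_pages):
        resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        tree = html.fromstring(resp.content)
        pages.append(tree)
        # Assumed selector for Amazon's next-page control; adjust if the
        # page structure differs.
        nxt = tree.xpath('//a[contains(@class, "s-pagination-next")]/@href')
        if not nxt:
            break  # no next-page button: we are on the last page
        url = "https://www.amazon.com" + nxt[0]
        time.sleep(2)  # pause between pages to reduce load
    return pages

# pages = crawl_all_pages("https://www.amazon.com/s?k=tunic+tops+for+women&ref=nb_sb_noss")
```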
After completing the operation on the first-level page, select the title_link column and click to collect these links in depth. The browser will open the content page in a new tab.
Scroll down in the content-page tab to find the listing date, parent ASIN, and main category ranking. Because the crawled content is text, and these three fields sit in different positions on different pages, they can be captured with the XPath function. The specific operation: double-click to select the content to crawl, then right-click and choose the “Generate XPath→Generate from Prefix Text” command.
The corresponding XPath code is generated at this point.
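The expressions the collector produces anchor on the label text in front of each value. A hedged sketch of what such prefix-text XPaths might look like in Python with lxml; the label strings and the following-sibling layout are assumptions about the product page, so adjust them to what the generator actually emits:

```python
from lxml import html

# Hypothetical prefix-text XPaths in the style the collector generates:
# each anchors on the field's label and reads the value that follows it.
XPATHS = {
    "listing_date": '//span[starts-with(normalize-space(), "Date First Available")]/following-sibling::span/text()',
    "parent_asin": '//span[starts-with(normalize-space(), "Parent ASIN")]/following-sibling::span/text()',
    "category_rank": '//span[starts-with(normalize-space(), "Best Sellers Rank")]/following-sibling::span/text()',
}

def extract_fields(page_source):
    """Return the three second-level fields, or None where a label is absent."""
    tree = html.fromstring(page_source)
    return {name: (tree.xpath(xp) or [None])[0] for name, xp in XPATHS.items()}
```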
Copy the generated code. Click the “Add Field” button to add 3 field columns, set the XPath for each field manually by pasting the code into the manual-setting area, and click the “Next” button once the setup is complete.
Note that the ASIN contained in the link is the ASIN of a product sub-SKU (a child ASIN), which makes it unsuitable for later consistency checks across the data, so the parent ASIN must be collected. For example, a tunic top sold in five colors has five child ASINs but shares a single parent ASIN.
In the third step, simply click the “Save” button to complete the task editing; the final exported data merges the 2 columns from the first-level page with the 3 columns from the second-level page.
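The merge the collector performs is equivalent to a left join on the shared link column. A pandas sketch of the same operation, with hypothetical file names:

```python
import pandas as pd

level1 = pd.read_csv("level1.csv")  # title_link, review_count
level2 = pd.read_csv("level2.csv")  # title_link, listing_date, parent_asin, category_rank

# Join the second-level fields onto each first-level link.
merged = level1.merge(level2, on="title_link", how="left")
merged.to_csv("export.csv", index=False)
```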
Return to the homepage, select the task you just created, and click the “Start” button to begin crawling. Because the second-level pages must also be collected, each record takes about 20 seconds or more to gather. In practice it is best to schedule collection from 8 p.m. to 8 a.m. the next day; besides keeping the crawl times relatively consistent, this ensures that more than 3,000 records can be collected per day. After crawling continuously for a month and removing duplicate values, the operator can obtain more than 10,000 link records.
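De-duplicating a month of daily exports can likewise be done in pandas. A sketch, assuming the daily files sit in an exports/ folder and that parent_asin is the column collected above:

```python
import glob
import pandas as pd

# Combine the daily export files, then drop duplicate listings.
# Keying on the parent ASIN collapses the same product collected
# on different days into a single row.
frames = [pd.read_csv(path) for path in glob.glob("exports/*.csv")]
monthly = pd.concat(frames, ignore_index=True)
monthly = monthly.drop_duplicates(subset="parent_asin", keep="last")
print(f"{len(monthly)} unique products after de-duplication")
```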
Since the collector generates a large number of repeated clicks in a short period, it may trigger Amazon’s robot detection within 1 to 2 months. It is therefore best to run the collector on a computer that does not log in to the store backend, to avoid affecting daily operations.