Question

Unable to get data from Shopify via HTTP Client Origin using pagination

  • 4 December 2021
  • 6 replies
  • 200 views

Hi,

I’m trying to fetch data from Shopify using the HTTP Client origin with pagination set to “Link in HTTP Header”.

According to Shopify's documentation (https://shopify.dev/api/usage/pagination-rest#link-headers), the URLs returned in the “link” header contain a page_info parameter that must be used when requesting the next page.

With the StreamSets HTTP Client origin, pagination using “Link in HTTP Header” is not working. I suspect this is because SDC expects the parameter to be named “page”.

Has anybody run into this before, and if so, how did you get past it?

Can anyone from StreamSets comment on this?


6 replies

@anthonyg Can you comment here? This is what I was originally discussing with you in the thread below:

https://community.streamsets.com/got-a-question-7/standard-out-of-box-origins-for-saas-apps-158

Hi @swayam. We do no processing on the link provided in the header in order to paginate. We simply use the URL in the next link exactly as provided, so whether that link contains page, page_info, or any other argument is irrelevant to pagination. You are probably facing another (perhaps related) issue. If you can provide the source JSON of your pipeline and any other relevant information, we will be in a better position to help you.
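
To make the behaviour described above concrete, here is a minimal Python sketch of link-header pagination. It is not StreamSets code: the names `next_link`, `fetch_all`, and the `fetch` callback are hypothetical and exist only for illustration. The idea is that the client reads the URL marked rel="next" from the Link header and requests it verbatim, so the query parameter name (page, page_info, ...) never matters.

```python
import re

# Matches one entry of a Link header: <url>; rel="name"
_LINK_ENTRY = re.compile(r'<([^>]*)>\s*;\s*rel="?([^",;]+)"?')


def next_link(link_header):
    """Return the URL tagged rel="next" in a Link header, or None."""
    if not link_header:
        return None
    for url, rel in _LINK_ENTRY.findall(link_header):
        if rel.strip().lower() == "next":
            return url  # used as-is, query string untouched
    return None


def fetch_all(first_url, fetch):
    """Follow next links until none is returned.

    `fetch` is a hypothetical callback: fetch(url) -> (records, link_header).
    In a real client it would perform the HTTP GET.
    """
    records = []
    url = first_url
    while url:
        page, link_header = fetch(url)
        records.extend(page)
        url = next_link(link_header)  # None ends the loop
    return records
```

With a real HTTP library, `fetch` would return the parsed response body together with the value of the response's Link header.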

Hi @Dimas Cabré i Chacón 

Many thanks for looking at my message. I’ve attached my pipeline.

The issue was probably not with pagination, but with my expectations regarding the HTTP Client origin running in “Polling” mode.

As per the SDC documentation:

Note: After the polling interval passes, the origin continues processing from where it stopped. For example, let’s say that you’ve configured the origin to use the polling mode with an interval of two hours and to use page number pagination. After the origin reads 25 pages of results, the 26th page returns no results and so the origin stops reading. After the two hour interval passes, the origin polls the server again, reading the results starting with page 26.

Based on the above and on my configuration, I expected that after 2 requests (each fetching 15 records, with 30 records in total in Shopify) the pipeline would fetch nothing more, while still polling every 5 seconds to check.

Instead, I see that every 5 seconds it fetches the same data again and again.

Could you please comment on this?

Hi @Dimas Cabré i Chacón ,

Could you please have a look at my request?

Regards

Swayam

Hi @swayam. There is probably some confusion about polling behavior. Polling is meant for cases where the next pages are not available immediately. When you paginate by page number, it makes sense to keep trying for a “next” page until it becomes available. With pagination based on the header link, there is no point in retrying for a missing next page: if there is a link in the header, the next page exists and is available; if not, there is no next page and none is expected to exist in the future.

When there is no link, our offset remains at the last requested page, so you keep receiving the same page over and over (because you configured Polling Mode). So I think this pipeline should be configured with Batch Mode rather than Polling Mode; Batch Mode finishes the pipeline once there are no more pages.

How this origin should work is open to debate, but it is clearly not a bug. Repeating the same request for the last page could make sense, since a next link might appear in the header at some point. But if you want to cover this scenario, you need to decide in advance what to do with duplicate records.

In short, use Batch Mode, and decide how your scenario will handle duplicates if you plan to invoke the same initial URL from time to time, as duplicates cannot easily be avoided then.
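
One possible way to handle the duplicates mentioned above is to remember which records have already been seen and drop repeats on each subsequent run. The Python sketch below is only an illustration outside of StreamSets: the helper name `new_records` is hypothetical, and it assumes every record carries a unique `id` field. In a real pipeline the set of seen IDs would need to be persisted between runs.

```python
def new_records(records, seen_ids, key="id"):
    """Return only the records whose key has not been seen before.

    `seen_ids` is a set that is updated in place, so it can be reused
    across repeated requests to the same initial URL.
    Assumption: each record is a dict containing a unique `key` field.
    """
    fresh = []
    for record in records:
        record_id = record[key]
        if record_id not in seen_ids:
            seen_ids.add(record_id)
            fresh.append(record)
    return fresh


# Illustrative data only: the same page returned by two consecutive runs.
seen = set()
page = [{"id": 1, "name": "order-1"}, {"id": 2, "name": "order-2"}]
first_run = new_records(page, seen)   # both records are new
second_run = new_records(page, seen)  # nothing new, list is empty
```

Keeping the filter separate from the fetching logic means the same request can be repeated safely, at the cost of storing the seen IDs.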

@swayam, did @Dimas Cabré i Chacón’s suggestions help?
