I have a bunch of excel (xls) files that are hosted online (HTTPS) and periodically updated. I want to be able to store the excel file to ADLS Gen2. What is the best way to approach this?
Data Source
No authentication/authorization needed.
HTTPS
XLS format
Environment Stack
Streamsets
Azure Shop
Page 1 / 1
@sgrewal
can you please use HTTP as origin and add the header as mentioned below and check if it helps.
The output data is in binary format and not useful
@sgrewal
I see , May i know if i can access the http url.
If its accessible i can give a try to read data from it.
Thanks , I will give a try and update you if there is any luck.
@sgrewal
As per my understanding for handling xlsx files from HTTP call can be handled by jython or custom functions.
Below the code snippet is helping me to get data from the excel file .
import pandas as pd url='https://www.rd.usda.gov/sites/default/files/rd_grant_awards_fy20_final.xlsx?raw=true' df=pd.read_excel(url, sheet_name='Sheet 1') print(df)
I am planning to create a custom function to handle different api calls and return result in readable format in streamsets.
Kindly let me know if the custom function is fine with you or not.
Thanks & Regards
Bikram_
@sgrewal
As per my understanding for handling xlsx files from HTTP call can be handled by jython or custom functions.
Below the code snippet is helping me to get data from the excel file .
import pandas as pd url='https://www.rd.usda.gov/sites/default/files/rd_grant_awards_fy20_final.xlsx?raw=true' df=pd.read_excel(url, sheet_name='Sheet 1') print(df)
I am planning to create a custom function to handle different api calls and return result in readable format in streamsets.
Kindly let me know if the custom function is fine with you or not.
Thanks & Regards
Bikram_
@Bikram
HI Bikram
Above code i used in jython script as origin in my pipeline.but it will through the error.
I want to know, how to import modules in data collector step by steps instructions.
Thanks
Tamilarasu
@tamilarasup
In this case you need to install panda libraries in streamsets collector. Please try it and let me know if it works.
I am working on custom function and its ready and will provide you the details in sometime.
Thanks & Regards
Bikram_
@tamilarasup
In this case you need to install panda libraries in streamsets collector. Please try it and let me know if it works.
I am working on custom function and its ready and will provide you the details in sometime.
Thanks & Regards
Bikram_
@Bikram
I installed manually python modules or files to the data collector.
but it does gives same error. the module script is not there.
let me know, how to import modules to the data collector.
@sgrewal@tamilarasup
Heres a pipline i created with your https url and i was able to read the file and write to a local fs
Set HTTP Client → Data Format → Binary (make sure the max byte size is bigger than the file you are ingesting)
Local FA → Output Files
File Type → Text
File suffix → xlsx
Local FA → Data Format → Binary
Hope it helps.
you can use the similar configs to write to any destination like HDFS, ADLS, S3, etc
@tamilarasup
I could manage to read actual data by using below custom function .
If you manage to deploy this in your streamsets then you can get the data as a string from the URL. It can read data based on the URL supplied from user.
public class getExcelDataHttps { @ElFunction( prefix = "url", name = "getHttpExcelData", description = "Returns URL Data after the connection with given URL" ) public static String getHttpExcelData( @ElParam("sourceURL") String sourceURL,
) throws IOException{
//StringBuilder content = new StringBuilder(); StringBuilder response = new StringBuilder();
try { // create a url object URL url = new URL(sourceURL); URLConnection urlConnection = url.openConnection(); XSSFWorkbook wb = new XSSFWorkbook(urlConnection.getInputStream()); XSSFSheet sheet = wb.getSheetAt(0); //creating a Sheet object to retrieve object Iterator<org.apache.poi.ss.usermodel.Row> itr = sheet.iterator();
while (itr.hasNext()) { Row row = (Row) itr.next(); Iterator<Cell> cellIterator = row.cellIterator(); //iterating over each column while (cellIterator.hasNext()) { Cell cell = cellIterator.next(); switch (cell.getCellType()) { case Cell.CELL_TYPE_STRING: //field that represents string cell type response.append(cell.getStringCellValue() + "\t\t\t"); break; case Cell.CELL_TYPE_NUMERIC: //field that represents number cell type response.append(cell.getNumericCellValue() + "\t\t\t"); break; default: } } System.out.println(""); }
Obligation Date States Counties Reps Borrower Grant Amount Program 09/30/2020 AK Bethel (CA) Young, Don 00 Naterkaq Light Plant 316000.0 High Energy Cost Grants 09/30/2020 OH Muskingum Beatty, Joyce 03 Economic & Community Devlpment Institute 66154.0 Rural Microentrepreneur Assistance Program 09/30/2020 MN Freeborn Hagedorn, Jim 01 Trails Travel Plaza, Inc 253006.0 Higher Blends Infrastructure Incentive Program 09/30/2020 FL Brevard Posey, Bill 08 Southeast Petro Distribution Inc. 3537000.0 Higher Blends Infrastructure Incentive Program 09/30/2020 IA Polk Axne, Cindy 03 Kum & Go L.C. 1589000.0 Higher Blends Infrastructure Incentive Program 09/30/2020 WI Milwaukee Moore, Gwen 04 Roettgers Company Inc. 251000.0 Higher Blends Infrastructure Incentive Program 09/30/2020 KY Jefferson Yarmuth, John A. 03 Thorntons LLC 4268570.0 Higher Blends Infrastructure Incentive Program 09/30/2020 IA Polk Axne, Cindy 03 Casey's General Store, Inc. 4978400.0 Higher Blends Infrastructure Incentive Program 09/30/2020 PA Washington Reschenthaler, Guy 14 Authority Of The Borough Of Charleroi
Thanks & Regards
Bikram_
@sgrewal
some more code for your reference and hope it helps.