Skip to main content

I have a bunch of excel (xls) files that are hosted online (HTTPS) and periodically updated. I want to be able to store the excel file to ADLS Gen2. What is the best way to approach this?

 

Data Source

  • No authentication/authorization needed.
  • HTTPS
  • XLS format

Environment Stack

  • Streamsets
  • Azure Shop

@sgrewal 

can you please use HTTP as origin and add the header as mentioned below and check if it helps.

 

 

 


The output data is in binary format and not useful

 


@sgrewal 

I see , May i know if i can access the http url.

If its accessible i can give a try to read data from it.


https://www.rd.usda.gov/sites/default/files/rd_grant_awards_fy20_final.xlsx

https://www.rd.usda.gov/sites/default/files/RD_GrantAwards_FY19_final.xls


@sgrewal 

 

Thanks , I will give a try and update you if there is any luck.


@sgrewal 

As per my understanding for handling xlsx files from HTTP call can be handled by jython or custom functions.

Below the code snippet is helping me to get data from the excel file .

import pandas as pd
url='https://www.rd.usda.gov/sites/default/files/rd_grant_awards_fy20_final.xlsx?raw=true'
df=pd.read_excel(url, sheet_name='Sheet 1')
print(df)

I am planning to create a custom function to handle different api calls and return result in readable format in streamsets.

Kindly let me know if the custom function is fine with you or not.

 

Thanks & Regards

Bikram_


@sgrewal

As per my understanding for handling xlsx files from HTTP call can be handled by jython or custom functions.

Below the code snippet is helping me to get data from the excel file .

import pandas as pd
url='https://www.rd.usda.gov/sites/default/files/rd_grant_awards_fy20_final.xlsx?raw=true'
df=pd.read_excel(url, sheet_name='Sheet 1')
print(df)

I am planning to create a custom function to handle different api calls and return result in readable format in streamsets.

Kindly let me know if the custom function is fine with you or not.

 

Thanks & Regards

Bikram_

@Bikram 

HI Bikram

                Above code i used in jython script as origin in my pipeline.but it will through the error.

I want to know, how to import modules in data collector step by steps instructions.

 

Thanks

Tamilarasu


@tamilarasup 

In this case you need to install panda libraries in streamsets collector. Please try it and let me know if it works.

 

I am working on custom function and its ready and will provide you the details in sometime.

 

Thanks & Regards

Bikram_


@tamilarasup

In this case you need to install panda libraries in streamsets collector. Please try it and let me know if it works.

 

I am working on custom function and its ready and will provide you the details in sometime.

 

Thanks & Regards

Bikram_

 

@Bikram 

   I installed manually python modules or files to the data collector.

but it does gives same error. the module script is not there.

 

let me know, how to import modules to the data collector.


@sgrewal @tamilarasup 

 

Heres a pipline i created with your https url and i was able to read the file and write to a local fs

 

Set HTTP Client → Data Format → Binary (make sure the max byte size is bigger than the file you are ingesting)

 

Local FA → Output Files 

File Type → Text

File suffix → xlsx

 

Local FA → Data Format → Binary

 

Hope it helps. 


you can use the similar configs to write to any destination like HDFS, ADLS, S3, etc


@tamilarasup 

 

I could manage to read actual data by using below custom function .

If you manage to deploy this in your streamsets then you can get the data as a string from the URL. It can read data based on the URL supplied from user.

 

package groupid;

import com.streamsets.pipeline.api.ElDef;
import com.streamsets.pipeline.api.ElFunction;
import com.streamsets.pipeline.api.ElParam;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.net.URL;
import java.net.URLConnection;
import java.util.Iterator;

@ElDef

public class getExcelDataHttps {
    @ElFunction(
            prefix = "url",
            name = "getHttpExcelData",
            description = "Returns URL Data after the connection with given URL"
    )
    public static String getHttpExcelData(
            @ElParam("sourceURL") String sourceURL,
            

    ) throws IOException{

        //StringBuilder content = new StringBuilder();
        StringBuilder response = new StringBuilder();

       try {
        // create a url object
        URL url = new URL(sourceURL);
        URLConnection urlConnection = url.openConnection();
        XSSFWorkbook wb = new XSSFWorkbook(urlConnection.getInputStream());
        XSSFSheet sheet = wb.getSheetAt(0);     //creating a Sheet object to retrieve object
        Iterator<org.apache.poi.ss.usermodel.Row> itr = sheet.iterator();
        
           while (itr.hasNext())
              {
                Row row = (Row) itr.next();
                Iterator<Cell> cellIterator = row.cellIterator();   //iterating over each column
                while (cellIterator.hasNext())
                {
                    Cell cell = cellIterator.next();
                    switch (cell.getCellType())
                    {
                        case Cell.CELL_TYPE_STRING:    //field that represents string cell type
                            response.append(cell.getStringCellValue() + "\t\t\t");
                            break;
                        case Cell.CELL_TYPE_NUMERIC:    //field that represents number cell type
                            response.append(cell.getNumericCellValue() + "\t\t\t");
                            break;
                        default:
                    }
                }
                System.out.println("");
             }

      }

        // System.out.println(response.toString());
        return response.toString();

    }
}
 

 

Output from the above code as given below:

 

 

Obligation Date            States            Counties            Reps            Borrower            Grant Amount            Program            09/30/2020            AK            Bethel (CA)            Young, Don 00            Naterkaq Light Plant            316000.0            High Energy Cost Grants            09/30/2020            OH            Muskingum            Beatty, Joyce 03            Economic & Community Devlpment Institute            66154.0            Rural Microentrepreneur Assistance Program            09/30/2020            MN            Freeborn            Hagedorn, Jim 01            Trails Travel Plaza, Inc            253006.0            Higher Blends Infrastructure Incentive Program            09/30/2020            FL            Brevard            Posey, Bill 08            Southeast Petro Distribution Inc.            3537000.0            Higher Blends Infrastructure Incentive Program            09/30/2020            IA            Polk            Axne, Cindy 03            Kum & Go L.C.            1589000.0            Higher Blends Infrastructure Incentive Program            09/30/2020            WI            Milwaukee            Moore, Gwen 04            Roettgers Company Inc.            251000.0            Higher Blends Infrastructure Incentive Program            09/30/2020            KY            Jefferson            Yarmuth, John A. 03            Thorntons LLC            4268570.0            Higher Blends Infrastructure Incentive Program            09/30/2020            IA            Polk            Axne, Cindy 03            Casey's General Store, Inc.            4978400.0            Higher Blends Infrastructure Incentive Program            09/30/2020            PA            Washington            Reschenthaler, Guy 14            Authority Of The Borough Of Charleroi

 

Thanks & Regards

Bikram_


@sgrewal 

 

some more code for your reference and hope it helps.

 

package groupid;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.net.URL;
import java.net.URLConnection;
import java.util.Iterator;

public class excelData {
    public static void main(Stringt] args)
    {
        try
        {
            //File file = new File("https://www.rd.usda.gov/sites/default/files/rd_grant_awards_fy20_final.xlsx");   //creating a new file instance
           // FileInputStream fis = new FileInputStream(file);   //obtaining bytes from the file
//creating Workbook instance that refers to .xlsx file
          //  String urlStr = "https://www.rd.usda.gov/sites/default/files/rd_grant_awards_fy20_final.xlsx";
            URL url = new URL("https://www.rd.usda.gov/sites/default/files/rd_grant_awards_fy20_final.xlsx");
            URLConnection urlConnection = url.openConnection();
            XSSFWorkbook wb = new XSSFWorkbook(urlConnection.getInputStream());
            XSSFSheet sheet = wb.getSheetAt(0);     //creating a Sheet object to retrieve object
            Iterator<org.apache.poi.ss.usermodel.Row> itr = sheet.iterator();    //iterating over excel file
            while (itr.hasNext())
            {
                Row row = (Row) itr.next();
                Iterator<Cell> cellIterator = row.cellIterator();   //iterating over each column
                while (cellIterator.hasNext())
                {
                    Cell cell = cellIterator.next();
                    switch (cell.getCellType())
                    {
                        case Cell.CELL_TYPE_STRING:    //field that represents string cell type
                            System.out.print(cell.getStringCellValue() + "\t\t\t");
                            break;
                        case Cell.CELL_TYPE_NUMERIC:    //field that represents number cell type
                            System.out.print(cell.getNumericCellValue() + "\t\t\t");
                            break;
                        default:
                    }
                }
                System.out.println("");
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }
}
 


@sgrewal 

@tamilarasup 

Kindly mark the issue  as resolved if the above suggestion helps .

 


Reply