HTML to CSV conversion

Updated: 10/12/2016

I was having some issues with an online data service that offers some of its data in a supposed Excel file format. The files loaded into Excel OK, but Excel gave an error stating that they were not Excel files and might be unsafe. Looking into it further, it turns out the service just took its typical data output, an HTML-formatted table, and changed the extension to .XLSX. It also offers the same reports as Word files, but again these are just the same HTML document with a changed file extension. It is kind of a cheap workaround, and not something that another program can easily read or parse.

The Idea

So really, I just want to look through any file and see if there are any HTML-formatted tables in it. If any are found, the program should go through them, extract the information, and write it out as a CSV file. That means anything between <table> and </table> is part of the table; <tr> indicates a row, <td> holds the data for a "cell", and <th> would be a heading. Since the HTML I am working with doesn't use <th> at all, I have not implemented heading handling, so the program just removes the HTML and leaves all the headings in one block. This will be easy enough to fix later on, as it is not vital to me at the moment. Basically, every </td> is replaced with a comma, and a new line is started whenever there is a </tr>. Any other HTML is removed.
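The replacement scheme described above can be sketched roughly as follows. This is a minimal illustration, not the actual converter: the function name tablesToCsv is made up for this example, and it assumes lowercase tags, no nested tables, and no whitespace between tags. It also drops the one trailing comma that the final </td> of each row would otherwise leave behind.

```cpp
#include <string>

// Hypothetical helper (the name is illustrative, not from the original program).
// Converts the contents of every <table>...</table> block in `html` to CSV text.
// Assumptions: lowercase tags, no nested tables, no whitespace between tags.
std::string tablesToCsv(const std::string& html) {
    const std::string closeTable = "</table>";
    std::string csv;
    std::size_t pos = 0;

    while (true) {
        std::size_t start = html.find("<table", pos);
        if (start == std::string::npos) break;

        // An unterminated table simply runs to the end of the input.
        std::size_t end = html.find(closeTable, start);
        bool terminated = (end != std::string::npos);
        if (!terminated) end = html.size();

        std::size_t i = start;
        while (i < end) {
            if (html[i] != '<') {
                // Ordinary text: keep it, but skip line breaks from the source.
                if (html[i] != '\n' && html[i] != '\r') csv += html[i];
                ++i;
                continue;
            }
            std::size_t close = html.find('>', i);
            if (close == std::string::npos || close > end) break;  // malformed tag

            std::string tag = html.substr(i, close - i + 1);
            if (tag == "</td>") {
                csv += ',';                       // end of a cell
            } else if (tag == "</tr>") {
                if (!csv.empty() && csv.back() == ',') csv.pop_back();
                csv += '\n';                      // end of a row
            }
            // Every other tag is dropped.
            i = close + 1;
        }
        pos = terminated ? end + closeTable.size() : end;
    }
    return csv;
}
```

Feeding it a string such as "<table><tr><td>a</td><td>b</td></tr></table>" should give one CSV line, "a,b", followed by a newline. Text outside the tables is ignored.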

Some Issues

The first big issue I ran into was also one of the easiest: the presence of commas in the HTML to be converted. For the most part these were not vital, and I just removed them from the lines they were found in. Another simple problem was the existence of things such as nbsp in some of the data. This was also removed, as it was not vital to the output format. The next problem was a little trickier to solve: sometimes the columns would not line up for some lines. It turns out that there are a number of options that can be used with <td> and <tr>, such as onMouseOver, background color, and width. None of these are required for a CSV, so they are removed and ignored with the rest of the HTML, but one setting in particular is important. The column span can be set to something such as 2 or 3, meaning that a single cell will span that many cells in the row. Because my code was just throwing this away, each row that used it ended up being off by one or two cells. The fix for this is still in the works, as the value can be formatted as colspan=2, colspan= 2, or colspan = "2". For right now I am only handling the format used by the data I am working with, but I hope to later code it to read the value regardless of format and create the right spacing automatically. The current code places the data in the first cell and creates empty cells for the rest of the required span.
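As a rough sketch of where the colspan fix could go, the helpers below read the value in any of the three formats mentioned above and pad the row with empty cells. All three names (parseColspan, cleanCell, spannedCell) are hypothetical and not taken from the converter. The sketch assumes a lowercase attribute name and that the non-breaking space appears as the entity &nbsp;.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns the colspan value of an opening tag such as
// <td colspan = "2">, or 1 when the attribute is missing or unreadable.
// Assumes the attribute name is written in lowercase.
int parseColspan(const std::string& tag) {
    std::size_t i = tag.find("colspan");
    if (i == std::string::npos) return 1;
    i += 7;  // length of "colspan"

    auto skipSpaces = [&] {
        while (i < tag.size() && std::isspace(static_cast<unsigned char>(tag[i]))) ++i;
    };

    skipSpaces();
    if (i >= tag.size() || tag[i] != '=') return 1;
    ++i;
    skipSpaces();
    if (i < tag.size() && (tag[i] == '"' || tag[i] == '\'')) ++i;  // optional quote

    int value = 0;
    while (i < tag.size() && std::isdigit(static_cast<unsigned char>(tag[i]))) {
        if (value < 1000) value = value * 10 + (tag[i] - '0');  // guard against overflow
        ++i;
    }
    return value > 0 ? value : 1;
}

// Hypothetical helper: strips "&nbsp;" entities and commas from a cell's text.
std::string cleanCell(std::string text) {
    const std::string nbsp = "&nbsp;";
    for (std::size_t p = text.find(nbsp); p != std::string::npos; p = text.find(nbsp)) {
        text.erase(p, nbsp.size());
    }
    text.erase(std::remove(text.begin(), text.end(), ','), text.end());
    return text;
}

// Hypothetical helper: emits the cleaned cell text followed by one comma per
// spanned column, so the data lands in the first cell and the rest stay empty.
std::string spannedCell(const std::string& openingTag, const std::string& text) {
    std::size_t span = static_cast<std::size_t>(parseColspan(openingTag));
    return cleanCell(text) + std::string(span, ',');
}
```

With this approach a cell opened by <td colspan="3"> and containing Total would come out as Total followed by three commas, which keeps the following cells in their proper columns.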

Other Thoughts

The end goal for this program is to convert any HTML table I can throw at it into a CSV file. So far it works perfectly for the purposes I am using it for, so further development will be on hold until I get some extra time to work on it or I need to modify it to keep working with the data service. Beyond that, I may convert it to C# and make a nice GUI, and possibly integrate it into processing the data from the website, but that may also just end up being a front end that uses the C++ converter to save some work.