Parsing HTML using ofxXmlSettings

#1

Is it possible to parse a loaded HTML file to extract some data from it? I’m trying to do so using ofxXmlSettings - I assume this would work since HTML is structurally the same as XML, but I’m getting an error when I try to pushTag into the body of the html.

here’s my (pseudo) code:

ofxXmlSettings xmlData;
xmlData.loadFromBuffer(response.data); //response loaded using ofLoadURLAsync
xmlData.pushTag("html"); //this works
xmlData.pushTag("body"); //this is where I get the error: ofxXmlSettings: pushTag(): tag "body" not found

Any idea why this isn’t working, or what an alternative approach might be? Any help would be greatly appreciated, thanks!

#2

Heyho,
Since HTML is not necessarily valid XML, I expect that ofxXmlSettings might not be able to parse it. Maybe other XML parsers (e.g. ofXml) are less strict. You could also look at dedicated HTML parsers, e.g. gumbo by Google. I think I have some code somewhere which includes gumbo in an oF project. I wanted to write a little wrapper, but I never found the time for it. I can upload the repo to GitHub in case you need it.
Thomas
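
To illustrate the point about HTML not necessarily being valid XML, here is a rough sketch of the kind of check a strict XML parser effectively performs. `tagsAreBalanced` is a made-up helper for this post, not part of ofxXmlSettings or openFrameworks, and it ignores many real-world details (attribute values containing `>`, CDATA, case differences). It only shows why something like an unclosed `<br>` is enough to break an XML parse:

```cpp
#include <cctype>
#include <stack>
#include <string>

// Hypothetical helper (not an openFrameworks API): returns true only if
// every opening tag has a matching closing tag in the right order,
// which is roughly what a strict XML parser insists on.
bool tagsAreBalanced(const std::string& markup) {
    std::stack<std::string> open;
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        // Skip comments as a whole, since they may contain '<' or '>'.
        if (markup.compare(pos, 4, "<!--") == 0) {
            std::size_t close = markup.find("-->", pos + 4);
            if (close == std::string::npos) return false; // unterminated comment
            pos = close + 3;
            continue;
        }
        std::size_t end = markup.find('>', pos);
        if (end == std::string::npos) return false;       // unterminated tag
        std::string tag = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;

        if (tag.empty()) return false;
        if (tag[0] == '!' || tag[0] == '?') continue;     // doctype, processing instruction
        if (tag.back() == '/') continue;                  // self-closing, e.g. <br/>

        bool closing = (tag[0] == '/');
        std::size_t start = closing ? 1 : 0;
        std::size_t i = start;
        while (i < tag.size() && !std::isspace(static_cast<unsigned char>(tag[i]))) ++i;
        std::string name = tag.substr(start, i - start);  // tag name without attributes

        if (closing) {
            if (open.empty() || open.top() != name) return false; // mismatched close
            open.pop();
        } else {
            open.push(name);
        }
    }
    return open.empty();
}
```

With this sketch, `<html><body><p>hi</p></body></html>` passes, while `<html><body><p>one<br>two</p></body></html>` fails, because `<br>` is never closed. Browsers accept the second form, which is why real pages often trip up XML parsers.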

#3

Hey, sorry to jump in, but I just saw this post. I tried and failed to add gumbo to an oF project, so if you do put up a repo, let us know.

Cheers

Fred

#4

I put it online. The example should compile. I have not written any wrapper code so far; last time I was using gumbo directly. If I need to work with gumbo again, I might write a little wrapper. PRs and bug reports are welcome.

#5

Thanks!

#6

I think you’re right, the parser was probably getting confused by comments in the HTML, or something else invalid. I ended up finding a way to get the data I needed as a CSV, but thanks anyway!
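
In case it helps anyone who goes the CSV route, this is a minimal sketch of splitting one line into fields with plain C++. `splitCsvLine` is a name made up for this post, and it assumes simple data with no quoted fields or embedded commas:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a single CSV line on commas.
// Assumes no quoting or escaped commas; a trailing empty field
// (e.g. "a,b,") is not reported by std::getline.
std::vector<std::string> splitCsvLine(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream stream(line);
    std::string field;
    while (std::getline(stream, field, ',')) {
        fields.push_back(field);
    }
    return fields;
}
```

Inside an oF app, `ofSplitString` should do much the same job. For CSV with quoted fields, a proper CSV parser is the safer choice.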