alright boys and girls... it's inky's next challenge!!!!


inkedmn
06-18-2002, 12:21 AM
ok, here's what you're going to do...

- open up a remote html file (on an actual http site)
- read it into memory OR download it (i don't care, whatever is easiest)
- remove ALL html tags from the file, leaving only the remaining plain text.
- write this remaining plain text to a file.
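
A rough Python 3 sketch of those four steps, for reference only. The helper names (`fetch_html`, `strip_tags`, `save_text`, `html_to_text_file`) are made up for illustration, the UTF-8 fallback is an assumption, and the regex is the naive "anything between < and >" approach, so it makes no attempt to deal with comments, scripts or entities.

```python
import re
import urllib.request

# Naive tag pattern: anything between '<' and the next '>', newlines included.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def fetch_html(url):
    """Read the remote page into memory and decode it to text."""
    with urllib.request.urlopen(url) as response:
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def strip_tags(html):
    """Remove everything that looks like an HTML tag."""
    return TAG_RE.sub("", html)


def save_text(text, path):
    """Write the remaining plain text to a file."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)


def html_to_text_file(url, path):
    """The whole challenge in one call: fetch, strip, write."""
    save_text(strip_tags(fetch_html(url)), path)
```

Something like `html_to_text_file("http://example.com/", "out.txt")` would then run all four steps; the URL and file name here are placeholders.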

as usual, any language you want...

oh, and i'm going to be trying this one in java, so don't expect my python code to be posted 2 minutes after this... :)

good luck!!!

p.s. - oh, and all you linux guys, incorporating wget is not allowed :)

kmj
06-18-2002, 12:23 AM
good challenge :)

GnuVince
06-18-2002, 12:43 AM
Good challenge. This time, I get to explore regular expressions with O'Caml. And Threads with file13's challenge. Hrmmm... interesting stuff!

Strike
06-18-2002, 01:49 AM
Slight semantic note: you can't read the file into memory without downloading it in some fashion first, and it's generally read into memory when you do download it, so ... :)

inkedmn
06-18-2002, 03:06 AM
/me thinks you knew what i meant... :)

Strike
06-18-2002, 03:39 AM
Python, whee:
#!/usr/bin/env python2.2

import re, sys, urllib

if len(sys.argv) != 3:
    print "Usage: de-html.py URL output-file"
    sys.exit(1)

url = sys.argv[1]
filename = sys.argv[2]
outfile = file(filename, "w")

regex = re.compile("<.*?>", re.DOTALL)
# Grab the text
foo = urllib.urlopen(url)
html = foo.read()

# apply the regex to it and write it out
newhtml = regex.sub("", html)
outfile.write(newhtml)
# close the file so the output is flushed to disk
outfile.close()


----edited by Strike----
fixed a problem where tags with newlines in them weren't getting removed
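
To illustrate the fix mentioned in that edit note: by default `.` does not match a newline, so a tag split across two lines slips past `<.*?>`, while `re.DOTALL` lets `.` match newlines as well. A small sketch in Python 3 syntax, with a made-up sample string:

```python
import re

# A tag whose attributes continue on the next line.
html = '<a\n   href="page.html">link</a> text'

without_dotall = re.compile(r"<.*?>")
with_dotall = re.compile(r"<.*?>", re.DOTALL)

# Without DOTALL only </a> is removed; the multi-line <a ...> tag survives.
print(repr(without_dotall.sub("", html)))

# With DOTALL both tags are removed.
print(repr(with_dotall.sub("", html)))
```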

stuka
06-18-2002, 02:18 PM
OK, I decided to do this in Perl (hope nobody beat me to it!)
Usage: perl striptags.pl <URL> <output filename>
#!/usr/bin/perl

use LWP::Simple;

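$html = get($ARGV[0]);
die "Couldn't fetch $ARGV[0]!" unless defined $html;

# The /s modifier lets . match newlines, so tags that span lines are removed too.
$html =~ s/<.*?>/ /gs;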

# Use the following line to remove those pesky ^Ms if the file came from a
# MS system, and you're on a *nix box.
# $html =~ s/[\r]/ /g;

open OUTFILE, "> $ARGV[1]" or die "Couldn't open $ARGV[1]!";
print OUTFILE $html;

Dru Lee Parsec
06-18-2002, 05:20 PM
OK, I got one in Java.

Now I have to get back to work ;)


import java.io.*;

public class InkedMnsParser {
    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("Usage: java InkedMnsParser [file name]");
            System.exit(0);
        }
        String fileName = args[0];
        String testString = "";
        try {
            BufferedReader bf = new BufferedReader(new FileReader(fileName));
            String line = "";
            do {
                line = bf.readLine();
                if (line != null) {
                    testString = testString + cleanLine(line) + "\n";
                }
            }
            while (line != null);
        }
        catch (Exception e1) {
            e1.printStackTrace();
        }
        System.out.println("The output is\n\n");
        System.out.println(testString);
    }

    private static String cleanLine(String line){
        String temp = "";
        int index = -1;
        int start = 0;
        // Quick test, if there is no < in the line then just return it
        if (line.indexOf('<') == -1) {
            return line;
        }
        do {
            index = line.indexOf('<', start);
            if (index == -1) {
                temp = line;
            }
            else {
                // found a <
                temp = line.substring(start,index);
                start = index + 1;
                // look for >
                int index2 = line.indexOf('>',start);
                if (index2 == -1) {
                    // then no end tag was found. Copy the rest of the line and
                    // continue to get the next line.
                    temp = temp + line.substring(start -1);
                    index = -1;
                }
                else {
                    line = temp + line.substring(index2 +1);
                    start = 0;
                }
            }
        }
        while (index != -1);
        return temp;
    }
}

Strike
06-18-2002, 07:06 PM
Dru - I don't see anything that grabs stuff from a URL in there, it seems to me like you are just reading from an existing file...

Dru Lee Parsec
06-21-2002, 08:32 PM
Oh OK, I misunderstood. I just took an html file and removed the tags. OK, I have some more work to do.

inkedmn
06-21-2002, 08:33 PM
maybe i'll be able to catch up this evening... :)

sicarius
07-02-2002, 09:44 PM
Here is one that actually works in Java:


import java.net.*;
import java.io.*;

public class StripTarget
{


    public static void main(String args[])throws IOException
    {

        /* make sure target specified */

        if(args.length == 0)
        {

            System.out.println("Usage: java StripTarget targeturl");
            System.exit(0);

        }

        /* Declare/Initialize needed variables */

        URL target = new URL(args[0]);

        InputStreamReader inReader;
        inReader = new InputStreamReader(target.openStream());

        File outFile = new File("output.html");
        FileWriter outWriter = new FileWriter(outFile);

        int charRead = -1;
        boolean inTag = false;


        /* Write to the file */

        while((charRead = inReader.read()) != -1)
        {

            if(inTag)
            {
                //in a tag check for end

                if(charRead == '>')
                    inTag = false;

            }
            else
            {
                //not in a tag, make check, or write

                if(charRead == '<')
                    inTag = true;
                else
                {
                    //write the character

                    outWriter.write(charRead);
                    outWriter.flush();

                }

            }

        }

        outWriter.close();

        System.exit(0);

    }


}



man...the code tag likes to space stuff out alright.
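
For comparison, the same character-by-character idea as a short Python 3 sketch: a single `in_tag` flag flips on at `<` and off at `>`, and only characters seen outside a tag are kept. `strip_tags_stateful` is a made-up name, and like the Java version above it would be fooled by a literal `>` inside an attribute value or a stray `<` in ordinary text.

```python
def strip_tags_stateful(chars):
    """Return the text of `chars` with everything between '<' and '>' dropped."""
    kept = []
    in_tag = False
    for ch in chars:
        if in_tag:
            # Inside a tag: discard characters until the closing '>'.
            if ch == ">":
                in_tag = False
        elif ch == "<":
            # A '<' outside a tag starts a new tag.
            in_tag = True
        else:
            # Ordinary text outside any tag is kept.
            kept.append(ch)
    return "".join(kept)


print(repr(strip_tags_stateful("<p>one\n<b\nclass='x'>two</b></p>")))
```

Since the flag carries over from one character to the next, tags that span several lines need no special handling, unlike a line-by-line approach.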

{F}allen
07-03-2002, 03:02 AM
PHP!

<?php
if($fname != "" && $dest != "")
{
    // filesize() can't be used on a remote URL, so read in chunks until EOF
    $fp = fopen($fname,'r');
    $html = "";
    while(!feof($fp))
    {
        $html .= fread($fp, 4096);
    }
    fclose ($fp);
    //got html
    $text = eregi_replace("<.*>","",$html);
    //stripped tags
    $dp = fopen($dest,'w+');
    fwrite($dp, $text);
    fclose($dp);
    //file written
    echo $text;
} else {
?>
<html>
<head>
<title>Text-Puller</title>
</head>
<body>
<form method=post action=htmldl.php>
<font face=verdana align=center>HTML Location: <input type=text name=fname size=40 value='<?=$fname?>'>

File Location: <input type=text name=dest size=40 value='<?=$dest?>'>

<input type=submit value=Submit></font>
</form>
</body>
</html>
<?
}
?>

Strike
07-04-2002, 04:24 AM
{F}allen, does PHP not use "greedy" regexes by default? That is, won't <.*> match <a>foo</a> as a whole instead of just <a> and </a> ? I had to put the non-greedy qualifier in my regex.
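
A quick way to see the difference being asked about, sketched with Python's `re` module rather than PHP's `eregi_replace`. The sample string is made up. As far as I know, POSIX-style regexes such as the ereg family have no non-greedy qualifier at all, so the negated character class in the last line is the usual workaround there.

```python
import re

html = "<a>foo</a> and <b>bar</b>"

# Greedy: one match running from the first '<' to the last '>'.
greedy = re.sub(r"<.*>", "", html)

# Non-greedy: each tag is matched separately.
lazy = re.sub(r"<.*?>", "", html)

# Negated class: same result as non-greedy, without needing the '?' qualifier.
negated = re.sub(r"<[^>]*>", "", html)

print(repr(greedy))
print(repr(lazy))
print(repr(negated))
```

With this input the greedy pattern swallows the whole string, while the other two leave `foo and bar`.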