alright boys and girls... it's inky's next challenge!!!!
inkedmn
06-18-2002, 12:21 AM
ok, here's what you're going to do...
- open up a remote html file (on an actual http site)
- read it into memory OR download it (i don't care, whatever is easiest)
- remove ALL html tags from the file, leaving only the remaining plain text.
- write this remaining plain text to a file.
as usual, any language you want...
oh, and i'm going to be trying this one in java, so don't expect my python code to be posted 2 minutes after this... :)
good luck!!!
p.s. - oh, and all you linux guys, incorporating wget is not allowed :)
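For reference, the four steps above can be sketched in Python 3 with only the standard library. This is a minimal sketch, not anyone's submitted solution: the names TextExtractor, strip_tags and fetch_and_strip are made up for illustration, and it assumes the page decodes as UTF-8. It leans on the html.parser module instead of a regular expression, so it also decodes entities such as &amp;.

```python
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text outside of tags; the tags themselves are skipped.
        self.chunks.append(data)


def strip_tags(html):
    """Return html with every tag removed, leaving the plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


def fetch_and_strip(url, out_path):
    """Download url, strip its tags, and write the text to out_path."""
    with urllib.request.urlopen(url) as response:
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        html = response.read().decode("utf-8", errors="replace")
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(strip_tags(html))
```

Calling fetch_and_strip with a URL and an output filename of your choice would run the whole thing; strip_tags can be tried on its own with any string.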
GnuVince
06-18-2002, 12:43 AM
Good challenge. This time, I get to explore regular expressions with O'Caml. And Threads with file13's challenge. Hrmmm... interesting stuff!
Strike
06-18-2002, 01:49 AM
Slight semantic note: you can't read the file into memory without downloading it in some fashion first, and it's generally read into memory when you do download it, so ... :)
inkedmn
06-18-2002, 03:06 AM
/me thinks you knew what i meant... :)
Strike
06-18-2002, 03:39 AM
Python, whee:
#!/usr/bin/env python2.2
import re, sys, urllib
if len(sys.argv) != 3:
    print "Usage: de-html.py URL output-file"
    sys.exit(1)
url = sys.argv[1]
filename = sys.argv[2]
outfile = file(filename, "w")
regex = re.compile("<.*?>", re.DOTALL)
# Grab the text
foo = urllib.urlopen(url)
html = foo.read()
# apply the regex to it and write it out
newhtml = regex.sub("", html)
outfile.write(newhtml)
----edited by Strike----
fixed a problem where tags with newlines in them weren't getting removed
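The fix described in that edit note comes down to the re.DOTALL flag: without it, . does not match a newline, so a tag broken across two lines never matches. A small Python 3 sketch of the difference (the sample string is made up):

```python
import re

# Made-up sample: the opening <a> tag is split across two lines.
html = 'before <a\nhref="x.html">link</a> after'

# Without DOTALL the opening tag survives, only </a> is removed.
without_flag = re.sub(r"<.*?>", "", html)

# With DOTALL the dot also matches the newline, so both tags go.
with_flag = re.sub(r"<.*?>", "", html, flags=re.DOTALL)

print(repr(without_flag))
print(repr(with_flag))
```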
stuka
06-18-2002, 02:18 PM
OK, I decided to do this in Perl (hope nobody beat me to it!)
Usage: perl striptags.pl <URL> <output filename>
#!/usr/bin/perl
use LWP::Simple;
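$html = get($ARGV[0]);
die "Couldn't fetch $ARGV[0]!" unless defined $html;
# The /s modifier lets . match newlines, so tags that span lines are removed too.
$html =~ s/<.*?>/ /gs;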
# Use the following line to remove those pesky ^Ms if the file came from a
# MS system, and you're on a *nix box.
# $html =~ s/[\r]/ /g;
open OUTFILE, "> $ARGV[1]" or die "Couldn't open $ARGV[1]!";
print OUTFILE $html;
Dru Lee Parsec
06-18-2002, 05:20 PM
OK, I got one in Java.
Now I have to get back to work ;)
import java.io.*;
public class InkedMnsParser {
    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("Usage: java InkedMnsParser [file name]");
            System.exit(0);
        }
        String fileName = args[0];
        String testString = "";
        try {
            BufferedReader bf = new BufferedReader(new FileReader(fileName));
            String line = "";
            do {
                line = bf.readLine();
                if (line != null) {
                    testString = testString + cleanLine(line) + "\n";
                }
            }
            while (line != null);
        }
        catch (Exception e1) {
            e1.printStackTrace();
        }
        System.out.println("The output is\n\n");
        System.out.println(testString);
    }
    private static String cleanLine(String line){
        String temp = "";
        int index = -1;
        int start = 0;
        // Quick test, if there is no < in the line then just return it
        if (line.indexOf('<') == -1) {
            return line;
        }
        do {
            index = line.indexOf('<', start);
            if (index == -1) {
                temp = line;
            }
            else {
                // found a <
                temp = line.substring(start,index);
                start = index + 1;
                // look for >
                int index2 = line.indexOf('>',start);
                if (index2 == -1) {
                    // then no end tag was found. Copy the rest of the line and
                    // continue to get the next line.
                    temp = temp + line.substring(start -1);
                    index = -1;
                }
                else {
                    line = temp + line.substring(index2 +1);
                    start = 0;
                }
            }
        }
        while (index != -1);
        return temp;
    }
}
Strike
06-18-2002, 07:06 PM
Dru - I don't see anything that grabs stuff from a URL in there, it seems to me like you are just reading from an existing file...
Dru Lee Parsec
06-21-2002, 08:32 PM
Oh OK, I misunderstood. I just took an html file and removed the tags. OK, I have some more work to do.
inkedmn
06-21-2002, 08:33 PM
maybe i'll be able to catch up this evening... :)
sicarius
07-02-2002, 09:44 PM
Here is one that actually works in Java:
import java.net.*;
import java.io.*;
public class StripTarget
{
    public static void main(String args[])throws IOException
    {
        /* make sure target specified */
        if(args.length == 0)
        {
            System.out.println("Usage: java StripTarget targeturl");
            System.exit(0);
        }
        /* Declare/Initialize needed variables */
        URL target = new URL(args[0]);
        InputStreamReader inReader;
        inReader = new InputStreamReader(target.openStream());
        File outFile = new File("output.html");
        FileWriter outWriter = new FileWriter(outFile);
        int charRead = -1;
        boolean inTag = false;
        /* Write to the file */
        while((charRead = inReader.read()) != -1)
        {
            if(inTag)
            {
                //in a tag check for end
                if(charRead == '>')
                    inTag = false;
            }
            else
            {
                //not in a tag, make check, or write
                if(charRead == '<')
                    inTag = true;
                else
                {
                    //write the character
                    outWriter.write(charRead);
                    outWriter.flush();
                }
            }
        }
        outWriter.close();
        System.exit(0);
    }
}
man...the code tag likes to space stuff out alright.
{F}allen
07-03-2002, 03:02 AM
PHP!
<?php
if($fname != "" && $dest != "")
{
    $fp = fopen($fname,'r');
    $html = fread ($fp,filesize($fname));
    fclose ($fp);
    //got html
    $text = eregi_replace("<.*>","",$html);
    //stripped tags
    $dp = fopen($dest,'w+');
    fwrite($dp, $text);
    fclose($dp);
    //file written
    echo $text;
} else {
?>
<html>
<head>
<title>Text-Puller</title>
</head>
<body>
<form method=post action=htmldl.php>
<font face=verdana align=center>HTML Location: <input type=text name=fname size=40 value='<?php echo $fname; ?>'>
File Location: <input type=text name=dest size=40 value='<?php echo $dest; ?>'>
<input type=submit value=Submit></font>
</form>
</body>
</html>
<?
}
?>
Strike
07-04-2002, 04:24 AM
{F}allen, does PHP not use "greedy" regexes by default? That is, won't <.*> match <a>foo</a> as a whole instead of just <a> and </a> ? I had to put the non-greedy qualifier in my regex.
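The greedy versus non-greedy question can be checked quickly in Python 3. Note the assumption: this uses Python's re module, not PHP's eregi_replace, so it only illustrates the general behaviour of the two quantifiers. The third pattern, a negated character class, is a common alternative for engines that have no non-greedy qualifier.

```python
import re

sample = "<a>foo</a>"

# Greedy: .* runs to the last >, so the whole string is one match.
greedy = re.sub(r"<.*>", "", sample)

# Non-greedy: .*? stops at the first >, so <a> and </a> match separately.
lazy = re.sub(r"<.*?>", "", sample)

# Negated class: "anything but >" gives the same result without .*?
negated = re.sub(r"<[^>]*>", "", sample)

print(repr(greedy), repr(lazy), repr(negated))
```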