HTML templating and a little NLP
I’ve identified an issue whilst templating the collected feeds.
The issue stems from HTML code within the feed’s content. Fortunately this problem only involves the templating area and I don’t really consider it a major issue with the core of planet.js.
Regardless, I see presenting data as planet.js’ primary functionality so it’s important to present and produce a use case for others to see why planet.js is cool.
So why is HTML a problem?
HTML isn’t a problem. In fact I wish to leave HTML in there. It gives semantic meaning and provides visual separation between the feed articles. I’m using the Twitter Bootstrap stylesheet and that provides a good clean style for all of the feeds.
The problem stems from the feed object being displayed in the template. My philosophy is to provide a snippet of the feed and if a user wants to view more they can go ahead and click the title to the actual source.
When I provide the text_summary field instead of the full content to my template then it’s possible I leave a bunch of unclosed tags.
<p>Hello this is some feed content</p>
My text_summary field takes only a first chunk and can miss out the </p> tag.
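A minimal sketch of the problem, assuming the summary is produced by simply cutting the content at a character count. The helper name `unclosedTags` is made up for illustration and is not part of planet.js, and the regex only copes with simple markup; real feed content would need a proper HTML parser.

```javascript
// Hypothetical helper: list the tags a fragment opens but never closes.
// Regex-based, so it only handles simple markup (an assumption of this sketch).
const VOID_TAGS = new Set(['br', 'hr', 'img', 'input', 'meta', 'link']);

function unclosedTags(html) {
  const open = [];
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(\/?)>/g;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const [, closing, rawName, selfClosing] = match;
    const name = rawName.toLowerCase();
    // Void and self-closing tags never need a closing partner.
    if (VOID_TAGS.has(name) || selfClosing) continue;
    if (closing) {
      // Closing tag: drop the most recent matching opener, if there is one.
      const index = open.lastIndexOf(name);
      if (index !== -1) open.splice(index, 1);
    } else {
      open.push(name);
    }
  }
  return open;
}

const full = '<p>Hello this is some feed content</p>';
const summary = full.slice(0, 20); // '<p>Hello this is som'

console.log(unclosedTags(full));    // []
console.log(unclosedTags(summary)); // [ 'p' ]
```

Anything left in the returned list is a tag that would leak into the rest of the page if the summary were dropped into a template as-is.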
So why not just close them?
I’ve considered it, but examining the feeds more closely I’ve understood that syndicators treat their RSS content very differently; a Tumblr blog’s RSS content is very different from Gizmodo’s (sourced from FeedBurner).
Gizmodo actually treat the text_summary portion of their content as only markup rules. That means that there is no human readable data in the text summary. Tumblr on the other hand is more relaxed.
But now that I think about it… I’m also able to rightfully reproduce a lot of markup too regardless of how the content is syndicated even on tumblr.
Since templating isn’t entirely important to planet.js I’ve decided not to use the text_summary and to display the full HTML data (text_full) instead. This is an area I will most likely revisit in future, however.
Database entries overlapping (core problem)
I’ve identified a potentially major issue with the object instancing around the RSS collection process. Feeds that are collected around the same time are having their aggregation names mixed up in the database. I’ve started an issue on GitHub so I don’t forget.
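One way this kind of mix-up can arise, shown as a sketch rather than a diagnosis of planet.js: if the aggregation name lives in state shared by every collection, a second feed can overwrite it before the first feed's callback runs. All names here (`collectShared`, `collectScoped`, `pending`) are hypothetical, and the `pending` array stands in for callbacks waiting on network I/O.

```javascript
// Hypothetical collector state shared by every feed (the risky version).
let currentAggregation = null;
const pending = []; // stands in for callbacks waiting on network I/O

function collectShared(aggregation, title, store) {
  currentAggregation = aggregation;
  // The callback reads the shared variable later, not the value at call time.
  pending.push(() => store.push({ aggregation: currentAggregation, title }));
}

function collectScoped(aggregation, title, store) {
  // The callback closes over its own `aggregation` argument instead.
  pending.push(() => store.push({ aggregation, title }));
}

function flush() {
  while (pending.length > 0) pending.shift()();
}

const sharedStore = [];
collectShared('news', 'A', sharedStore);
collectShared('photos', 'B', sharedStore);
flush();

const scopedStore = [];
collectScoped('news', 'A', scopedStore);
collectScoped('photos', 'B', scopedStore);
flush();

console.log(sharedStore.map((e) => e.aggregation)); // [ 'photos', 'photos' ]
console.log(scopedStore.map((e) => e.aggregation)); // [ 'news', 'photos' ]
```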
I’ve started development on a tagging system for planet.js. This will remain in a separate branch, perhaps as a sort of plugin.
I was fortunate enough to find a node package which I believe will help me (https://github.com/fortnightlabs/pos-js).
This is an entirely different subject and I’ll cover it in my research blog.
Will continue to revise templating.
Aware of issue with database entries overlapping.
Back to development
Holidays are over and I managed to clear some other uni work out of the way for now.
A brief mention about the Router
Previously, the major push had planet.js routing requests and associating them to methods.
The major change I would like to explore for the router is making it an object instance based on the current running processes.
So currently, one can have www.domain.com/iamnotreal/ and mongo attempts to look for entries with aggregations of iamnotreal.
If the router knows what’s real or not beforehand, requests for things that don’t exist block for far less time because they skip the database lookup.
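A rough sketch of that idea, assuming the router is handed the set of aggregation names at startup. `knownAggregations`, `findEntries` and `route` are illustrative names only, and `findEntries` is a stand-in for the MongoDB query that just counts how often it is called.

```javascript
// Hypothetical: aggregation names known from the running processes/config.
const knownAggregations = new Set(['news', 'photos']);

let databaseLookups = 0;

// Stand-in for a MongoDB query; it only counts calls in this sketch.
function findEntries(aggregation) {
  databaseLookups += 1;
  return [{ aggregation, title: 'example entry' }];
}

function route(pathname) {
  // '/iamnotreal/' -> 'iamnotreal'
  const name = pathname.split('/').filter(Boolean)[0];
  if (!name || !knownAggregations.has(name)) {
    // Unknown aggregation: answer straight away, no database involved.
    return { status: 404, entries: [] };
  }
  return { status: 200, entries: findEntries(name) };
}

const missing = route('/iamnotreal/');
const found = route('/news/');
console.log(missing.status, found.status, databaseLookups); // 404 200 1
```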
Onwards to serving HTML
My next goal is to implement the serving of the HTML content, the majority of which will be dynamic.
This is definitely an area I will have to research.
Compressing data with Gzip
On the node modules page there are a bunch of interesting Gzip modules which seem easy enough to implement. However I’m only going to worry about optimization much later.
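For reference, current Node.js releases ship gzip support in the core `zlib` module, so a third-party module may not be needed. A small sketch of compressing a response body and checking the round trip; the repeated HTML string is made-up sample data.

```javascript
const zlib = require('zlib');

// Made-up sample body: repetitive HTML compresses well.
const html = '<p>Hello this is some feed content</p>'.repeat(50);

const compressed = zlib.gzipSync(html);
const restored = zlib.gunzipSync(compressed).toString('utf8');

// Original size in bytes next to the gzipped size.
console.log(Buffer.byteLength(html), compressed.length);
```

When serving this over HTTP, the compressed body would go out with a `Content-Encoding: gzip` header, and only to clients that asked for it in `Accept-Encoding`.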
Have to solve the serving of html.
Slowed dev for Late December/Early January
I thought it would be appropriate to note that there will be slowed development on planet.js for a few reasons.
- December holiday
- University work deadline for January
- Currently in the process of switching internet provider and University development environment isn’t as adequate.
What to do regardless
I’m currently working on Routing and Handler methods for planet.js.
On a more worky note I have added a new research post about Socket connections and how they relate to planet.js.
Planet.js meetup notes
I was glad to find they understood the system when I was explaining it to them and intently noted down their comments.
The discussion lasted roughly an hour and in that time I gathered some useful information and also clarified to myself that planet.js was on the right track.
All points made are things I would like to implement but of course they are of varying complexity and each have their own priority.
So quick revisions…
- The guys spotted a few significant problems with the way I had structured some of the core modules like Fetch, Parser, etc.
The problem being that more than one process would be utilising these modules at the same time and this would lead to errors due to processes overlapping each other when using the same module. The solution to this is to change these modules into singleton instances.
This same problem & solution also applies to the process modules.
- Other problems were simpler things with the way I had written parts of code. This covered stuff like consistency with variable naming and the precarious way I was dynamically requiring (importing) a process module in the Activity classes.
- I have never been entirely comfortable with the naming of “Activities” and “Processes”. Names like “runners” and “spawners” seemed more appropriate, and although it’s important, I’m not thinking too hard about this at the moment.
Stuff a bit more complicated…
- Currently, the system wastes a bit of time due to the fact that old data is still collected and not thrown away until the Store module.
This can be detected earlier somewhere between the Activity and the process by storing the last gathered data (by the process) in memory.
- I figured that a polling and subscribing activity had all the use cases for gathering data. Tom mentioned that I’m missing an activity for representing listeners (things which work with callback urls).
I’ll probably think about implementing this after my university deadline.
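The early detection of old data mentioned above could look roughly like this, assuming every entry carries a numeric `published` timestamp and that newer entries always have larger timestamps. `createFreshnessFilter` is a hypothetical name, and one filter would be created per process.

```javascript
// Hypothetical filter, created once per process. It remembers the newest
// timestamp it has already let through and drops anything not newer.
function createFreshnessFilter() {
  let newestSeen = 0;
  return function onlyNew(entries) {
    const fresh = entries.filter((entry) => entry.published > newestSeen);
    for (const entry of fresh) {
      newestSeen = Math.max(newestSeen, entry.published);
    }
    return fresh;
  };
}

const onlyNew = createFreshnessFilter();
const firstPoll = onlyNew([
  { id: 'a', published: 100 },
  { id: 'b', published: 200 },
]);
const secondPoll = onlyNew([
  { id: 'b', published: 200 }, // already seen, dropped before the Store module
  { id: 'c', published: 300 },
]);

console.log(firstPoll.map((e) => e.id));  // [ 'a', 'b' ]
console.log(secondPoll.map((e) => e.id)); // [ 'c' ]
```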
Stuff not as complicated…
- Getting round to writing unit tests! It will take some time but Simon tells me that it ends up saving me time and I do believe him. Tom did warn me that unit test libraries for node still aren’t that fully featured and even the popular Jasmine lacks a stack trace. Still this is something I would like to learn about and implement in planet.js in the future.
- Writing up planet.js as an npm package. Also not a big priority at the moment.
Additions for the future…
- Simon suggested that I could decouple the activity/process aspect of the system using an amqp messaging system like RabbitMQ.
I’m not familiar with messaging systems but the concept does sound appropriate for the planet.js system and seems to fit in the node js philosophy.
I’ve bookmarked a page about it but probably won’t be thinking of implementing anything like this soon.
- Prem again mentioned the idea that planet.js as software not only takes information from the present and future, but also looks through the past. So looking at previous archives of data from blogs/web services.
This is a cool feature which I want to integrate after my university deadline in February. This is one of the features which I think makes planet.js stand out.
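To get a feel for the decoupling Simon suggested, here is a sketch that uses Node's built-in `EventEmitter` as an in-process stand-in for a message broker. None of this is RabbitMQ's or AMQP's API; the `raw-data` event name and the message shape are assumptions for illustration.

```javascript
const { EventEmitter } = require('events');

// In-process stand-in for a message broker. A real AMQP setup would replace
// this with a broker connection; nothing here is RabbitMQ's API.
const bus = new EventEmitter();
const stored = [];

// "Process" side: consumes messages without knowing who published them.
bus.on('raw-data', (message) => {
  stored.push({ aggregation: message.aggregation, title: message.body.trim() });
});

// "Activity" side: publishes what it gathered and moves on.
function publish(aggregation, body) {
  bus.emit('raw-data', { aggregation, body });
}

publish('news', '  First headline ');
publish('photos', 'Sunset ');

console.log(stored);
```

The activity only knows the event name, not which process handles it, which is the decoupling a broker would give across separate machines or processes.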
Upcoming API revisions and structure
API for integrating custom web services
This is just a quick post but it’s worth noting that the methodology for plugging in your own module for a certain web service will likely change.
I’ve got a room kindly booked for me this Thursday.
Continuing on Planet.js’ structure
Methods, Routers and Views. But what do they all mean?
I’ve made a simple graph detailing how I think it adds up and what I think planet.js will look like in the future.
Well the Methods interface directly with the Database.
- Get me the top latest entries
- Get me entries with just pictures
- Get me entries between x and y dates
The router uses the methods but by HTTP requests with REST.
The view makes use of the methods by using the entry data to form html. Will most likely make use of a templating language.
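A compact sketch of how the three layers might relate, using an in-memory array as a stand-in for the database. The method names, URL paths and entry fields are all assumptions, not the planet.js API.

```javascript
// In-memory stand-in for the database (made-up sample entries).
const entries = [
  { title: 'Old post', published: new Date('2011-01-05'), image: null },
  { title: 'Photo post', published: new Date('2011-02-10'), image: 'cat.jpg' },
  { title: 'New post', published: new Date('2011-03-01'), image: null },
];

// Methods: the only layer that touches the data store.
const methods = {
  latest(limit) {
    return [...entries].sort((a, b) => b.published - a.published).slice(0, limit);
  },
  withPictures() {
    return entries.filter((entry) => entry.image !== null);
  },
  between(from, to) {
    return entries.filter((entry) => entry.published >= from && entry.published <= to);
  },
};

// Router: maps a REST-style path onto a method (hypothetical paths).
function route(pathname) {
  if (pathname === '/entries/latest') return methods.latest(2);
  if (pathname === '/entries/pictures') return methods.withPictures();
  return null;
}

// View: turns entry data into HTML. Titles are not escaped in this sketch.
function renderList(items) {
  return '<ul>' + items.map((entry) => `<li>${entry.title}</li>`).join('') + '</ul>';
}

console.log(renderList(route('/entries/latest')));
// <ul><li>New post</li><li>Photo post</li></ul>
```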
I am also very interested in showing how planet.js can be integrated with Express.
Extensibility by modular design over imposed design
Background on a process module
A process module is a script which plugs into planet.js and instructs it how to get data to store and how that data should be mapped to the DB schema.
Alongside a process module should be an entry in the aggregations.json config listing its needed parameters such as what URL to fetch from, API keys, if it should be polling etc.
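As an illustration only, an entry might carry fields like the ones below. The key names (`process`, `url`, `apiKey`, `polling`) and the `missingParameters` helper are guesses at a plausible shape, not the actual aggregations.json schema, and the entry is written as a JavaScript object rather than JSON so it can be checked in the same snippet.

```javascript
// Hypothetical shape for aggregations.json entries, written as a JS object.
const aggregations = {
  my_tweets: {
    process: 'planetjs.twitter_tweets.js',
    url: 'http://example.com/tweets', // placeholder URL
    apiKey: 'YOUR_API_KEY',           // placeholder, not a real key
    polling: false,
  },
};

// Report which of the parameters a process needs are absent from an entry.
function missingParameters(entry, required) {
  return required.filter((key) => !(key in entry));
}

console.log(missingParameters(aggregations.my_tweets, ['process', 'url', 'polling'])); // []
console.log(missingParameters({ process: 'x' }, ['process', 'url', 'polling']));
// [ 'url', 'polling' ]
```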
Process modules are what give versatility to planet.js.
What was done.
This week I completed work on a Twitter module and defined a structure for user processing modules to work with the Subscriber and Poller Activities.
This allows for a user to write their own process which has control over the following:
- Parser (Turn data into JSON)
- Retrieval method (GET or open a listening connection)
- Data processing (Do something cool with that data if you like)
- Data mapping to the storage schema (Make the data you would like to save conform to the database schema)
An example of a user process module can be planetjs.twitter_tweets.js (processes/).
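A skeleton showing the four responsibilities listed above in one place. The function names (`retrieve`, `parse`, `process`, `map`) and the `runProcess` driver are hypothetical and do not reflect the real interface in planetjs.twitter_tweets.js; retrieval is faked with a fixed string so the sketch runs on its own.

```javascript
// Hypothetical process module: names are illustrative, not planet.js's interface.
const exampleProcess = {
  // Retrieval method: a real module would do a GET or open a listening connection.
  retrieve() {
    return '{"user":"alice","text":"hello #nodejs"}';
  },
  // Parser: turn the raw data into JSON.
  parse(raw) {
    return JSON.parse(raw);
  },
  // Data processing: do something with the data, here pulling out hashtags.
  process(data) {
    return { ...data, tags: data.text.match(/#\w+/g) || [] };
  },
  // Data mapping: make the data conform to a (made-up) storage schema.
  map(data) {
    return { author: data.user, content: data.text, tags: data.tags };
  },
};

// Stand-in for what an Activity might do with a process module.
function runProcess(processModule) {
  const raw = processModule.retrieve();
  const parsed = processModule.parse(raw);
  return processModule.map(processModule.process(parsed));
}

console.log(runProcess(exampleProcess));
```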
Shifts towards extensibility by modular design
Initially the idea would be each process would utilize a native implementation of a planet.js fetcher, parser, data mapper etc.
So getting Flickr, Twitter (etc) data would all use a native set of tools to complete the job.
However this proved too difficult. Initially I thought the only unique thing a process should be doing is mapping its data to the storage schema.
But a lot of Web APIs do a lot of basic stuff quite differently, primarily to do with the retrieval of data.
Extending the Fetch module to take into account the different ways Web APIs do their push notifications was too tricky. Why write all the different ways of doing comet when you have a good library like twitter-node or flickrnode?
The tools planet.js has are not flexible enough, and this marked a realisation in the way I was designing for extensibility.
I assumed that to make planet.js extensible I needed to provide pre-existing functionality for all these use cases. To my knowledge, the diversity of use cases just makes this impossible.
So I moved the unique ways of fetching and parsing into each respective process module. That’s why planetjs.twitter_tweets.js uses node-twitter instead of planet.js’ Fetch which happens to know what Twitter is.
Change in design philosophy
The long task I’ll undertake now is to continue to encapsulate these unique ways of doing things, but also to try to find common ground and abstract it out into modules so other processes can make use of it too.
This blog post will be revised.
Today I managed to commit documentation for planet.js as well as some cooler code.
I held a bit from writing up the docs initially, because there wasn’t any ‘proper’ code to justify it.
It was still a bit messy and there were still clarifications I needed to sort out with myself.
Working on planet.js this week (deriving the data structure) I did clear things up.
The granularity of the entire planet.js structure is a bit fuzzy still but I keep taking steps forward. So I am content with the progress made.
So the docs aren’t perfect yet; technically and in terms of content.
I am a fan of automatic doc generation. The downside to this is that it arguably pollutes the source code a bit. However, I don’t think it has made planet.js too ugly so I’m fine with that.
JsDocs is the first one I found but I wasn’t really a fan. To be fair I didn’t look at it that long but these are my assumptions on why I didn’t choose it.
- I didn’t really fancy the Javadocs syntax
- The output documentation doesn’t really suit the scale of planet.js
dox is probably the coolest I came across but unfortunately I didn’t end up using it either.
Dox no longer generates an opinionated structure or style for your docs, it simply gives you a JSON representation, allowing you to use markdown and JSDoc-style tags. - dox readme
This bummed me out a bit. Although I understand the potency, I didn’t fancy running that JSON through some sort of template tool.
docco is the one I chose. Probably because it made things look cool without too much effort. (I did have to download a python plugin but whatever).
Instead of using Javadocs syntax, docco uses markdown. I’m fine with markdown since I don’t think it looks half bad and I’ve learnt the basic elements from being around github.
It overwrites the stylesheet every time I generate a page which slows me down a bit. I’ll need to look into building some sort of script which utilizes docco.
Additionally I need to think about how I can provide some sort of index to display all of planet.js’ modules.
I think I can host my documentation as rendered HTML on Github proper. I’ve seen quite a few libraries do it. That’s worth investigation.
Anyway, what’s next?
I derived the data structure. That made things a lot easier.
In this commit I also stuck a TODO list in there.
My priority right now is working on OAuth and twitter. This will probably cover work over a couple of modules; processes, fetcher and maybe some sort of new OAuth component.
My deadline is next Wednesday.
Dev post week 2
So last week I started work on planet.js.
There has been a shift of understanding on what objectives needed to be completed as I got underway with coding.
The goals last week involved getting data from three sources, aggregating them appropriately and storing them.
To an extent this was achieved and a significant number of more complex components were coded up.
What was done
- Config file - A file which defines aggregations.
- Server - Main HTTP listening loop, starts up polling objects based on the config file
- Poller object - In memory representation of an aggregation’s data source which knows where to fetch, how to process and then store it.
- Parser - XML to JSON for now
- Store - Component for storing the parsed and then processed data in MongoDB. (Very hackish at the moment, no proper database structure)
These are more or less the significant components I programmed over the week.
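How the Poller object ties the other components together can be sketched like this. `createPoller` and its options are hypothetical names, the fetch step returns a fixed string instead of making an HTTP request, and the parser splits lines instead of converting XML, so this only shows the flow of fetch, parse and store.

```javascript
// Hypothetical poller: knows where to fetch, how to parse and where to store.
function createPoller({ name, fetchRaw, parse, store }) {
  return {
    name,
    // One polling cycle; a server would call this on a timer (e.g. setInterval).
    poll() {
      const raw = fetchRaw();
      const items = parse(raw);
      for (const item of items) {
        store({ aggregation: name, ...item });
      }
      return items.length;
    },
  };
}

const database = []; // stand-in for the MongoDB-backed Store component

const poller = createPoller({
  name: 'news',
  fetchRaw: () => 'First post\nSecond post', // stand-in for an HTTP GET
  parse: (raw) => raw.split('\n').map((title) => ({ title })), // stand-in for XML to JSON
  store: (entry) => database.push(entry),
});

const count = poller.poll();
console.log(count, database);
```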
The code is able to retrieve data from standard GET requests; this includes prebuilt URLs from Twitter/Flickr. This is messy, however, and I would like to define OAuth authentication as well as push requests from web services as core to planet.js.
My code is opinionated, needs some trimming down and some aspects of it removed until those segments can be justified. This wasn’t really in the Agile spirit.
The code also lacked a bit of guidance. What I mean by this is when it came to the storing procedure of data, I hit a wall.
So the next step I’ll take is to derive the data structure from how users want the data to be presented on the front end.
- I’ll make a few mock ups and carry out some quick n’ dirty usability tests to understand a bit more on what sort of data should be presented. The findings can be found on http://planetjsresearch.tumblr.com
- From this I can derive an appropriate data structure and begin coding for that.
- Additionally I will revise my code written since last week and remove any coding preconceptions until I can see them being justified.
About Github commits
I’ve decided to not commit the code before the 1st November as I stated last week simply because I don’t want to give the wrong impression about how the software should work.
It’s coming soon though.
I’ve been exploring node for the past week and getting to grips with it. Now it’s about time I approach the planet.js software itself. It was decided that since the system pretty much relies on data then that would be a logical place to start.
Write code which reads data from an RSS, a Twitter and a Flickr feed. This involves:
- Writing 3 parsers which convert the related feed data to JSON. If Flickr and Twitter use OAuth then some parsers can share functionality.
- Aggregating this data together.
- Store this collated data. Currently looking at MongoDB as it’s JSON based.
Another aim is to keep the code open enough to allow the following future revisions:
- Allow data to be aggregated under different ‘streams’. E.g. RSS and Twitter could relate to each other but Flickr could be about something completely different.
- Allow for a hook for data to be processed before it is stored in the database.
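The two future revisions above could be kept open with something as small as this. The `aggregate` function, the `stream` field on each source and the `beforeStore` hook are illustrative names, and the sample items are made up.

```javascript
// Group entries from several sources into named streams, passing each entry
// through an optional hook before it would be stored.
function aggregate(sources, beforeStore = (entry) => entry) {
  const streams = {};
  for (const source of sources) {
    for (const item of source.items) {
      const entry = beforeStore({ source: source.name, ...item });
      if (!streams[source.stream]) streams[source.stream] = [];
      streams[source.stream].push(entry);
    }
  }
  return streams;
}

// RSS and Twitter relate to each other; Flickr is about something else.
const sources = [
  { name: 'rss', stream: 'project', items: [{ title: 'Release notes' }] },
  { name: 'twitter', stream: 'project', items: [{ title: 'shipping today' }] },
  { name: 'flickr', stream: 'holiday', items: [{ title: 'Beach' }] },
];

// Example hook: upper-case every title before storage.
const streams = aggregate(sources, (entry) => ({
  ...entry,
  title: entry.title.toUpperCase(),
}));

console.log(Object.keys(streams)); // [ 'project', 'holiday' ]
console.log(streams.holiday);      // [ { source: 'flickr', title: 'BEACH' } ]
```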
The aim of this is to create the core functionality of what the system is meant to do. From this one can have a better understanding of what to change and what to tackle next.
Expect to see the working code on github before the 1st November.