[how] Import Text Resources Into Python For Text Processing

 


Text resources come from various sources (local or remote) and in various formats (e.g. txt, csv/tsv, json, pkl). This post assumes that the reader is using Colab (https://colab.research.google.com/). Some of the code is based on the NLTK guide (NLTK - Processing Raw Text), and some on the sample data files that ship with Colab.

Sources:

(Source.1) Local machine

- get the path and import the content.

# list the contents of the current directory from within Python
import os
print( os.listdir('.') )
# assuming that this is a Colab project
# that starts from within the /content/ directory,
# which contains a sub-directory sample_data,
# run the %cd magic command to change
# to the sample_data directory
%cd sample_data
# open the raw file and print the first 150 characters;
# the with block closes the file automatically
with open('README.md', encoding='utf-8') as f:
    raw = f.read()
print( raw[:150] )
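
As a variation on the snippet above, the standard `pathlib` module can read a whole text file in one call. This is only a minimal sketch: the helper name `read_head` and the temporary demo file are assumptions introduced so the example runs anywhere, and are not part of the Colab sample data.

```python
import tempfile
from pathlib import Path


def read_head(path, n=150, encoding="utf-8"):
    """Return the first n characters of a text file (hypothetical helper)."""
    return Path(path).read_text(encoding=encoding)[:n]


# Demo on a throwaway file so the sketch does not depend on sample_data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.txt"
    demo_path.write_text("Hello, text processing!\n", encoding="utf-8")
    head = read_head(demo_path, n=5)

print(head)
```

In Colab, the same helper could be pointed at `/content/sample_data/README.md` instead of the temporary file.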


(Source.2) Remote machine

- import the content directly from the remote location, or download it to the local machine, get the path and import the content.

There are three commonly used modules for fetching remote files: urllib, requests and http.client (named httplib in Python 2). The Requests package is recommended for a higher-level HTTP client interface (Python Docs).
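
For comparison with Requests, here is a minimal sketch of the standard-library route using `urllib.request.urlopen`. The helper name `fetch_text` is hypothetical, and the demo reads a temporary `file://` URL (an assumption made so the example runs without network access); an `http://` or `https://` URL is passed the same way.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8-sig"):
    """Fetch a URL with the standard library and decode it (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read().decode(encoding)


# urlopen also accepts file:// URLs, so this demo needs no network.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.txt"
    # Write a UTF-8 byte-order mark followed by some sample text.
    demo_path.write_bytes("\ufeffsample text".encode("utf-8"))
    fetched = fetch_text(demo_path.as_uri())

print(fetched)
```

Decoding with `utf-8-sig` drops a leading byte-order mark if one is present, which is why the same codec is used for the downloaded text in this post.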

#open direct from remote storage

import requests

url = "http://www.gutenberg.org/files/2554/2554-0.txt"

raw = requests.get(url).content.decode('utf-8-sig')

# print only the first 150 characters; the file is a whole book
print( raw[:150] )

#download from remote storage and open locally

url = "http://www.gutenberg.org/files/2554/2554-0.txt"

file = url.split("/")[-1]

!wget {url}

# utf-8-sig strips the byte-order mark at the start of the file
with open(file, encoding='utf-8-sig') as f:
    raw = f.read()

print( raw[:150] )


(Source.3) Web Page

Use the Beautiful Soup library to extract the text content of a web page.

#open web page and scrape
#list of emotion types
#using beautiful soup

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Emotion_classification"

response = requests.get(url).content.decode('utf-8-sig')

soup = BeautifulSoup(response, 'html.parser')

listEmoType=[]

for a in soup.find_all("div", class_="div-col")[:1]:
  for b in a.find_all("a"):
    #print (b.text)
    listEmoType.append(b.text)

print(len(listEmoType))
print("\n".join(listEmoType))


Formats:

(Format.1) Text (TXT).

filepath='/content/sample_data/README.md'

# the with block closes the file automatically
with open(filepath, encoding='utf-8') as f:
    raw = f.read()

print( raw[:150] )


(Format.2) Comma Separated Value (CSV).

#import csv with default delimiter
import csv

sourcefilepath='/content/sample_data/california_housing_test.csv'

with open(sourcefilepath, newline='') as f:
    reader = csv.reader(f)
    data =list(reader)
    
print(data[:150])

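#export csv to tsv format
import csv

sourcefilepath='/content/sample_data/california_housing_test.csv'
# derive the target path from the source path
targetfilepath=sourcefilepath.replace(".csv",".tsv")

# newline='' lets the csv module control line endings;
# the with block closes both files automatically
with open(sourcefilepath, newline='') as src, \
     open(targetfilepath, 'w', newline='') as dst:
    csv.writer(dst, delimiter='\t').writerows(csv.reader(src))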

#import csv with tab delimiter
import csv

filepath='/content/sample_data/california_housing_test.tsv'

with open(filepath, newline='') as f:
    reader = csv.reader(f, delimiter="\t") #add delimiter format
    data =list(reader)

print(data[:150])
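
When the first row holds column names, `csv.DictReader` returns each row as a dictionary keyed by those names, which is often easier to work with than a list of lists. The sketch below is illustrative: the in-memory data and its column names (`word`, `count`) are hypothetical and stand in for a real file.

```python
import csv
import io

# Hypothetical in-memory TSV data standing in for a real file.
sample_tsv = "word\tcount\ngood\t10\nmorning\t4\n"

with io.StringIO(sample_tsv, newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")  # first row becomes the keys
    rows = list(reader)

# The csv module returns every value as a string; convert where needed.
total = sum(int(row["count"]) for row in rows)

print(rows[0]["word"], total)
```

With a real file, replace the `io.StringIO(...)` call with `open(filepath, newline='')` as in the examples above.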

(Format.3) JavaScript Object Notation (JSON)

JSON is a format for storing and exchanging data. Python has a built-in package called json which supports JSON data.

# open a json file
import json

filepath='/content/sample_data/anscombe.json'

with open(filepath, 'r') as f:
  data = json.load(f)

print(data)

# continue from the previous exercise
# save the json data into a file
import json

targetfilepath='/content/sample_data/newdata.json'

with open(targetfilepath, 'w') as json_file:
  json.dump(data, json_file)
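
JSON text does not have to come from a file. For text that is already in memory (for example, content fetched from a remote source), `json.loads` and `json.dumps` work on strings directly. A minimal sketch, with a hypothetical JSON string as input:

```python
import json

# Hypothetical JSON text, e.g. the body of a downloaded file.
json_text = '[{"word": "good", "count": 10}, {"word": "morning", "count": 4}]'

# Parse the string into Python lists and dictionaries.
records = json.loads(json_text)

# Serialize back to a string; indent makes it readable,
# ensure_ascii=False keeps non-ASCII characters as they are.
round_trip = json.dumps(records, indent=2, ensure_ascii=False)

print(type(records), records[0]["word"])
```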

(Format.4) Pickle (PKL)

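Pickle is a Python module that serializes Python objects for storage. The data is stored as a binary byte stream, so pickle files must be opened in binary mode ('wb' and 'rb'). Pickle is useful for saving Python objects, such as preprocessed data, and reloading them quickly in machine learning work.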

#write pickle
import pickle
  
# Create data
listData = [{'good': 10, 'morning': 4}]
print(type(listData))    

# Open a file and dump data
with open('fileData.pkl', 'wb') as fileData:      
    # A new file will be created
    pickle.dump(listData, fileData)

#read pickle
import pickle
  
# Open the file in binary mode
with open('fileData.pkl', 'rb') as fileData:
    # Open file and load data
    listData = pickle.load(fileData)

print(type(listData))    
print(listData)
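
To see the byte stream itself without touching the disk, `pickle.dumps` and `pickle.loads` do the same round trip in memory. A minimal sketch, reusing the sample data from above:

```python
import pickle

listData = [{'good': 10, 'morning': 4}]

# Serialize to a bytes object instead of a file.
payload = pickle.dumps(listData)

# Deserialize the bytes back into an equal, but separate, Python object.
restored = pickle.loads(payload)

print(type(payload))
print(restored == listData)
```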



